* [PATCH RFC 00/26] Multifd device state transfer support with VFIO consumer
@ 2024-04-16 14:42 Maciej S. Szmigiero
2024-04-16 14:42 ` [PATCH RFC 01/26] migration: Add x-channel-header pseudo-capability Maciej S. Szmigiero
` (26 more replies)
0 siblings, 27 replies; 54+ messages in thread
From: Maciej S. Szmigiero @ 2024-04-16 14:42 UTC (permalink / raw)
To: Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Avihai Horon, Joao Martins, qemu-devel
From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
VFIO device state transfer is currently done via the main migration channel.
This means that transfers from multiple VFIO devices are done sequentially
and via just a single common migration channel.
Transferring VFIO device state migration data this way reduces
performance and severely impacts the migration downtime (by ~50%) for VMs
that have multiple such devices with large state sizes - see the test
results below.
However, we already have a way to transfer migration data using multiple
connections - that's what multifd channels are.
Unfortunately, multifd channels are currently utilized for RAM transfer
only.
This patch set adds a new framework allowing their use for device state
transfer too.
The wire protocol is based on Avihai's x-channel-header patches, which
introduce a header for migration channels that allows the migration source
to explicitly indicate the migration channel type without having the
target deduce the channel type by peeking at the channel's content.
The new wire protocol can be switched on and off via the
migration.x-channel-header option for compatibility with older QEMU
versions and for testing.
Switching the new wire protocol off also disables device state transfer via
multifd channels.
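For reference, a simplified sketch of the channel header wire format, based
on the code in patches 02, 09 and 10 below (the device state channel type is
added later in the series):

/* Sketch of the on-wire channel header (see patches 02/09/10 for the
 * actual definitions); integers are big-endian on the wire. */
typedef enum {
    MIG_CHANNEL_TYPE_MAIN,
    MIG_CHANNEL_TYPE_POSTCOPY_PREEMPT,
    MIG_CHANNEL_TYPE_MULTIFD,
    /* a device state channel type is added later in this series */
} MigChannelTypes;

typedef struct QEMU_PACKED {
    uint32_t channel_type;      /* one of MigChannelTypes */
} MigChannelHeader;

/*
 * Wire layout, sent as the very first bytes of every migration channel:
 *
 *   uint64_t header_size;      (big-endian, == sizeof(MigChannelHeader))
 *   MigChannelHeader header;   (header_size bytes)
 *
 * The size prefix lets the header grow in future versions while letting
 * a receiver reject headers larger than it understands.
 */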
The device state transfer can happen either via the same multifd channels
that transfer RAM data, mixed with RAM data (when
migration.x-multifd-channels-device-state is 0), or exclusively via
dedicated device state transfer channels (when
migration.x-multifd-channels-device-state > 0).
Using dedicated device state transfer multifd channels brings further
performance benefits since these channels don't need to participate in
the RAM sync process.
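As an illustration, a possible configuration matching the test setup below
(the x-channel-header property and the x-multifd-channels-device-state
parameter are introduced by this series, so the exact syntax is an
assumption):

# The pseudo-capability is a migration property, set on both sides:
qemu-system-x86_64 ... -global migration.x-channel-header=true

(qemu) migrate_set_parameter multifd-channels 15
(qemu) migrate_set_parameter x-multifd-channels-device-state 4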
These patches introduce a few new SaveVMHandlers (sketched in code right
after this list):
* "save_live_complete_precopy_async{,wait}" handlers that allow device to
provide its own asynchronous transmission of the remaining data at the
end of a precopy phase.
The "save_live_complete_precopy_async" handler is supposed to start such
transmission (for example, by launching appropriate threads) while the
"save_live_complete_precopy_async_wait" handler is supposed to wait until
such transfer has finished (for example, until the sending threads
have exited).
* "load_state_buffer" and its caller qemu_loadvm_load_state_buffer() that
allow providing a device state buffer to an explicitly specified device via
its idstr and instance id.
* "load_finish" the allows migration code to poll whether a device-specific
asynchronous device state loading operation had finished before proceeding
further in the migration process (with associated condition variable for
notification to avoid unnecessary polling).
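A minimal sketch of the new callbacks; the signatures are assumptions based
on this description (the authoritative definitions are added to
include/migration/register.h by patches 16-18):

/* Source side: start asynchronous transmission of the remaining device
 * data at the end of precopy, e.g. by launching sending threads. */
int (*save_live_complete_precopy_async)(QEMUFile *f, void *opaque);

/* Source side: wait until the transmission started above has finished,
 * e.g. until the sending threads have exited. */
int (*save_live_complete_precopy_async_wait)(QEMUFile *f, void *opaque);

/* Destination side: hand a received device state buffer to the device
 * specified via its idstr and instance id. */
int (*load_state_buffer)(void *opaque, char *data, size_t data_size,
                         Error **errp);

/* Destination side: poll whether the device-specific asynchronous load
 * has finished (paired with a condition variable so the migration code
 * doesn't have to busy-poll). */
int (*load_finish)(void *opaque, bool *is_finished, Error **errp);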
A VFIO device migration consumer for the new multifd device state
migration framework was implemented with a reassembly process for the
received multifd data, since device state packets sent via different
multifd channels can arrive out of order.
The VFIO device data loading process happens in a separate thread in order
to avoid blocking a multifd receive thread during this fairly long process.
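A minimal sketch of that reassembly idea, with hypothetical names (the
actual VFIO implementation arrives in the receive-side patch at the end of
the series): multifd receive threads deposit buffers indexed by packet
sequence number, and the per-device loader thread consumes them in order.

#include "qemu/osdep.h"
#include "qemu/thread.h"

typedef struct {
    QemuMutex lock;
    QemuCond cond;
    GPtrArray *bufs;    /* index == device state packet sequence number */
    uint32_t next_idx;  /* next packet index the loader may consume */
} DeviceStateReassembly;

/* Called by a multifd receive thread; packets may arrive in any order. */
static void deliver_packet(DeviceStateReassembly *r, uint32_t idx, void *buf)
{
    qemu_mutex_lock(&r->lock);
    if (idx >= r->bufs->len) {
        g_ptr_array_set_size(r->bufs, idx + 1);   /* fills gap with NULLs */
    }
    g_ptr_array_index(r->bufs, idx) = buf;
    qemu_cond_signal(&r->cond);
    qemu_mutex_unlock(&r->lock);
}

/* Called by the per-device loader thread; blocks until the next
 * in-order packet has arrived. */
static void *take_next_packet(DeviceStateReassembly *r)
{
    void *buf;

    qemu_mutex_lock(&r->lock);
    while (r->next_idx >= r->bufs->len ||
           !g_ptr_array_index(r->bufs, r->next_idx)) {
        qemu_cond_wait(&r->cond, &r->lock);
    }
    buf = g_ptr_array_index(r->bufs, r->next_idx++);
    qemu_mutex_unlock(&r->lock);
    return buf;
}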
Test setup:
Source machine: 2x Xeon Gold 5218 / 192 GiB RAM,
    Mellanox ConnectX-7 with 100 GbE link, 6.9.0-rc1+ kernel
Target machine: 2x Xeon Platinum 8260 / 376 GiB RAM,
    Mellanox ConnectX-7 with 100 GbE link, 6.6.0+ kernel
VM: 12 cores x 2 threads / 15 GiB RAM / 4x Mellanox ConnectX-7 VF
Migration config: 15 multifd channels total,
    the new way had 4 channels dedicated to device state transfer,
    x-return-path=true, x-switchover-ack=true
Downtime with ~400 MiB VFIO total device state size:
                                              TLS off     TLS on
migration.x-channel-header=false (old way)   ~2100 ms    ~2300 ms
migration.x-channel-header=true  (new way)   ~1100 ms    ~1200 ms
IMPROVEMENT                                     ~50%        ~50%
This patch set is also available as a git tree:
https://github.com/maciejsszmigiero/qemu/tree/multifd-device-state-transfer-vfio
Avihai Horon (7):
migration: Add x-channel-header pseudo-capability
migration: Add migration channel header send/receive
migration: Add send/receive header for main channel
migration: Allow passing migration header in migration channel
creation
migration: Add send/receive header for postcopy preempt channel
migration: Add send/receive header for multifd channel
migration: Enable x-channel-header pseudo-capability
Maciej S. Szmigiero (19):
multifd: change multifd_new_send_channel_create() param type
migration: Add a DestroyNotify parameter to
socket_send_channel_create()
multifd: pass MFDSendChannelConnectData when connecting sending socket
migration/postcopy: pass PostcopyPChannelConnectData when connecting
sending preempt socket
migration/options: Mapped-ram is not channel header compatible
vfio/migration: Add save_{iterate,complete_precopy}_started trace
events
migration/ram: Add load start trace event
migration/multifd: Zero p->flags before starting filling a packet
migration: Add save_live_complete_precopy_async{,wait} handlers
migration: Add qemu_loadvm_load_state_buffer() and its handler
migration: Add load_finish handler and associated functions
migration: Add x-multifd-channels-device-state parameter
migration: Add MULTIFD_DEVICE_STATE migration channel type
migration/multifd: Device state transfer support - receive side
migration/multifd: Convert multifd_send_pages::next_channel to atomic
migration/multifd: Device state transfer support - send side
migration/multifd: Add migration_has_device_state_support()
vfio/migration: Multifd device state transfer support - receive side
vfio/migration: Multifd device state transfer support - send side
hw/core/machine.c | 1 +
hw/vfio/migration.c | 530 ++++++++++++++++++++++++++++++++-
hw/vfio/trace-events | 15 +-
include/hw/vfio/vfio-common.h | 25 ++
include/migration/misc.h | 5 +
include/migration/register.h | 67 +++++
migration/channel.c | 68 +++++
migration/channel.h | 17 ++
migration/file.c | 5 +-
migration/migration-hmp-cmds.c | 7 +
migration/migration.c | 110 ++++++-
migration/migration.h | 6 +
migration/multifd-zlib.c | 2 +-
migration/multifd-zstd.c | 2 +-
migration/multifd.c | 512 ++++++++++++++++++++++++++-----
migration/multifd.h | 62 +++-
migration/options.c | 66 ++++
migration/options.h | 2 +
migration/postcopy-ram.c | 81 ++++-
migration/ram.c | 1 +
migration/savevm.c | 112 +++++++
migration/savevm.h | 7 +
migration/socket.c | 6 +-
migration/socket.h | 3 +-
migration/trace-events | 3 +
qapi/migration.json | 16 +-
26 files changed, 1626 insertions(+), 105 deletions(-)
^ permalink raw reply [flat|nested] 54+ messages in thread
* [PATCH RFC 01/26] migration: Add x-channel-header pseudo-capability
2024-04-16 14:42 [PATCH RFC 00/26] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
@ 2024-04-16 14:42 ` Maciej S. Szmigiero
2024-04-16 14:42 ` [PATCH RFC 02/26] migration: Add migration channel header send/receive Maciej S. Szmigiero
` (25 subsequent siblings)
26 siblings, 0 replies; 54+ messages in thread
From: Maciej S. Szmigiero @ 2024-04-16 14:42 UTC (permalink / raw)
To: Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Avihai Horon, Joao Martins, qemu-devel
From: Avihai Horon <avihaih@nvidia.com>
Add x-channel-header pseudo-capability which indicates that a header
should be sent through migration channels.
The header is the first thing to be sent through a migration channel and
it allows the destination to differentiate between the various channels
(main, multifd and preempt).
This eliminates the need to deduce the channel type by peeking at the
channel's content, which can be done only on a best-effort basis. It
will also allow other devices to create their own channels in the
future.
This patch only adds the pseudo-capability and always reports it as false.
The following patches will add the actual functionality, after which it
will be enabled.
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
hw/core/machine.c | 1 +
migration/migration.h | 3 +++
migration/options.c | 9 +++++++++
migration/options.h | 1 +
4 files changed, 14 insertions(+)
diff --git a/hw/core/machine.c b/hw/core/machine.c
index 37ede0e7d4fd..fa28c49f55b7 100644
--- a/hw/core/machine.c
+++ b/hw/core/machine.c
@@ -37,6 +37,7 @@ GlobalProperty hw_compat_8_2[] = {
{ "migration", "zero-page-detection", "legacy"},
{ TYPE_VIRTIO_IOMMU_PCI, "granule", "4k" },
{ TYPE_VIRTIO_IOMMU_PCI, "aw-bits", "64" },
+ { "migration", "channel_header", "off" },
};
const size_t hw_compat_8_2_len = G_N_ELEMENTS(hw_compat_8_2);
diff --git a/migration/migration.h b/migration/migration.h
index 8045e39c26fa..a6114405917f 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -450,6 +450,9 @@ struct MigrationState {
*/
uint8_t clear_bitmap_shift;
+ /* Whether a header is sent in migration channels */
+ bool channel_header;
+
/*
* This save hostname when out-going migration starts
*/
diff --git a/migration/options.c b/migration/options.c
index bfd7753b69a5..8fd871cd956d 100644
--- a/migration/options.c
+++ b/migration/options.c
@@ -100,6 +100,7 @@ Property migration_properties[] = {
clear_bitmap_shift, CLEAR_BITMAP_SHIFT_DEFAULT),
DEFINE_PROP_BOOL("x-preempt-pre-7-2", MigrationState,
preempt_pre_7_2, false),
+ DEFINE_PROP_BOOL("x-channel-header", MigrationState, channel_header, true),
/* Migration parameters */
DEFINE_PROP_UINT8("x-compress-level", MigrationState,
@@ -381,6 +382,14 @@ bool migrate_zero_copy_send(void)
/* pseudo capabilities */
+bool migrate_channel_header(void)
+{
+ MigrationState *s = migrate_get_current();
+
+ return false;
+ return s->channel_header;
+}
+
bool migrate_multifd_flush_after_each_section(void)
{
MigrationState *s = migrate_get_current();
diff --git a/migration/options.h b/migration/options.h
index ab8199e20784..1144d72ec0db 100644
--- a/migration/options.h
+++ b/migration/options.h
@@ -52,6 +52,7 @@ bool migrate_zero_copy_send(void);
* check, but they are not a capability.
*/
+bool migrate_channel_header(void);
bool migrate_multifd_flush_after_each_section(void);
bool migrate_postcopy(void);
bool migrate_rdma(void);
^ permalink raw reply related [flat|nested] 54+ messages in thread
* [PATCH RFC 02/26] migration: Add migration channel header send/receive
2024-04-16 14:42 [PATCH RFC 00/26] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
2024-04-16 14:42 ` [PATCH RFC 01/26] migration: Add x-channel-header pseudo-capability Maciej S. Szmigiero
@ 2024-04-16 14:42 ` Maciej S. Szmigiero
2024-04-16 14:42 ` [PATCH RFC 03/26] migration: Add send/receive header for main channel Maciej S. Szmigiero
` (24 subsequent siblings)
26 siblings, 0 replies; 54+ messages in thread
From: Maciej S. Szmigiero @ 2024-04-16 14:42 UTC (permalink / raw)
To: Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Avihai Horon, Joao Martins, qemu-devel
From: Avihai Horon <avihaih@nvidia.com>
Add functions to send and receive migration channel header.
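As a usage illustration (this mirrors what the following patches do when
establishing and accepting channels; the surrounding error handling is
only sketched):

/* Source side: the header is the very first thing sent on a new channel. */
Error *local_err = NULL;
MigChannelHeader header = { .channel_type = MIG_CHANNEL_TYPE_MAIN };

if (migration_channel_header_send(ioc, &header, &local_err)) {
    /* channel setup failed; propagate local_err */
}

/* Destination side: read the header first, then dispatch on its type. */
MigChannelHeader hdr = {};

if (migration_channel_header_recv(ioc, &hdr, &local_err)) {
    /* channel setup failed */
}
switch (hdr.channel_type) {
case MIG_CHANNEL_TYPE_MAIN:
    /* hand the channel over to the main migration code */
    break;
default:
    /* unknown channel type */
    break;
}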
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
[MSS: Mark MigChannelHeader as packed, remove device id from it]
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
migration/channel.c | 59 ++++++++++++++++++++++++++++++++++++++++++
migration/channel.h | 14 ++++++++++
migration/trace-events | 2 ++
3 files changed, 75 insertions(+)
diff --git a/migration/channel.c b/migration/channel.c
index f9de064f3b13..a72e85f5791c 100644
--- a/migration/channel.c
+++ b/migration/channel.c
@@ -21,6 +21,7 @@
#include "io/channel-socket.h"
#include "qemu/yank.h"
#include "yank_functions.h"
+#include "options.h"
/**
* @migration_channel_process_incoming - Create new incoming migration channel
@@ -93,6 +94,64 @@ void migration_channel_connect(MigrationState *s,
error_free(error);
}
+int migration_channel_header_recv(QIOChannel *ioc, MigChannelHeader *header,
+ Error **errp)
+{
+ uint64_t header_size;
+ int ret;
+
+ ret = qio_channel_read_all_eof(ioc, (char *)&header_size,
+ sizeof(header_size), errp);
+ if (ret == 0 || ret == -1) {
+ return -1;
+ }
+
+ header_size = be64_to_cpu(header_size);
+ if (header_size > sizeof(*header)) {
+ error_setg(errp,
+ "Received header of size %lu bytes which is greater than "
+ "max header size of %lu bytes",
+ header_size, sizeof(*header));
+ return -EINVAL;
+ }
+
+ ret = qio_channel_read_all_eof(ioc, (char *)header, header_size, errp);
+ if (ret == 0 || ret == -1) {
+ return -1;
+ }
+
+ header->channel_type = be32_to_cpu(header->channel_type);
+
+ trace_migration_channel_header_recv(header->channel_type,
+ header_size);
+
+ return 0;
+}
+
+int migration_channel_header_send(QIOChannel *ioc, MigChannelHeader *header,
+ Error **errp)
+{
+ uint64_t header_size = sizeof(*header);
+ int ret;
+
+ if (!migrate_channel_header()) {
+ return 0;
+ }
+
+ trace_migration_channel_header_send(header->channel_type,
+ header_size);
+
+ header_size = cpu_to_be64(header_size);
+ ret = qio_channel_write_all(ioc, (char *)&header_size, sizeof(header_size),
+ errp);
+ if (ret) {
+ return ret;
+ }
+
+ header->channel_type = cpu_to_be32(header->channel_type);
+
+ return qio_channel_write_all(ioc, (char *)header, sizeof(*header), errp);
+}
/**
* @migration_channel_read_peek - Peek at migration channel, without
diff --git a/migration/channel.h b/migration/channel.h
index 5bdb8208a744..95d281828aaa 100644
--- a/migration/channel.h
+++ b/migration/channel.h
@@ -29,4 +29,18 @@ int migration_channel_read_peek(QIOChannel *ioc,
const char *buf,
const size_t buflen,
Error **errp);
+typedef enum {
+ MIG_CHANNEL_TYPE_MAIN,
+} MigChannelTypes;
+
+typedef struct QEMU_PACKED {
+ uint32_t channel_type;
+} MigChannelHeader;
+
+int migration_channel_header_send(QIOChannel *ioc, MigChannelHeader *header,
+ Error **errp);
+
+int migration_channel_header_recv(QIOChannel *ioc, MigChannelHeader *header,
+ Error **errp);
+
#endif
diff --git a/migration/trace-events b/migration/trace-events
index f0e1cb80c75b..e48607d5a6a2 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -198,6 +198,8 @@ migration_transferred_bytes(uint64_t qemu_file, uint64_t multifd, uint64_t rdma)
# channel.c
migration_set_incoming_channel(void *ioc, const char *ioctype) "ioc=%p ioctype=%s"
migration_set_outgoing_channel(void *ioc, const char *ioctype, const char *hostname, void *err) "ioc=%p ioctype=%s hostname=%s err=%p"
+migration_channel_header_send(uint32_t channel_type, uint64_t header_size) "Migration channel header send: channel_type=%u, header_size=%lu"
+migration_channel_header_recv(uint32_t channel_type, uint64_t header_size) "Migration channel header recv: channel_type=%u, header_size=%lu"
# global_state.c
migrate_state_too_big(void) ""
^ permalink raw reply related [flat|nested] 54+ messages in thread
* [PATCH RFC 03/26] migration: Add send/receive header for main channel
2024-04-16 14:42 [PATCH RFC 00/26] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
2024-04-16 14:42 ` [PATCH RFC 01/26] migration: Add x-channel-header pseudo-capability Maciej S. Szmigiero
2024-04-16 14:42 ` [PATCH RFC 02/26] migration: Add migration channel header send/receive Maciej S. Szmigiero
@ 2024-04-16 14:42 ` Maciej S. Szmigiero
2024-04-16 14:42 ` [PATCH RFC 04/26] multifd: change multifd_new_send_channel_create() param type Maciej S. Szmigiero
` (23 subsequent siblings)
26 siblings, 0 replies; 54+ messages in thread
From: Maciej S. Szmigiero @ 2024-04-16 14:42 UTC (permalink / raw)
To: Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Avihai Horon, Joao Martins, qemu-devel
From: Avihai Horon <avihaih@nvidia.com>
Add send and receive migration channel header for main channel.
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
[MSS: Rename main channel -> default channel where it matches the current term]
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
migration/channel.c | 9 +++++
migration/migration.c | 82 +++++++++++++++++++++++++++++++++++++++----
2 files changed, 84 insertions(+), 7 deletions(-)
diff --git a/migration/channel.c b/migration/channel.c
index a72e85f5791c..0e3f51654752 100644
--- a/migration/channel.c
+++ b/migration/channel.c
@@ -81,6 +81,13 @@ void migration_channel_connect(MigrationState *s,
return;
}
} else {
+ /* TODO: Send header after register yank? Make a QEMUFile variant? */
+ MigChannelHeader header = {};
+ header.channel_type = MIG_CHANNEL_TYPE_MAIN;
+ if (migration_channel_header_send(ioc, &header, &error)) {
+ goto out;
+ }
+
QEMUFile *f = qemu_file_new_output(ioc);
migration_ioc_register_yank(ioc);
@@ -90,6 +97,8 @@ void migration_channel_connect(MigrationState *s,
qemu_mutex_unlock(&s->qemu_file_lock);
}
}
+
+out:
migrate_fd_connect(s, error);
error_free(error);
}
diff --git a/migration/migration.c b/migration/migration.c
index 86bf76e92585..0eb5b4f4f5a1 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -869,12 +869,39 @@ void migration_fd_process_incoming(QEMUFile *f)
migration_incoming_process();
}
+static bool migration_should_start_incoming_header(bool main_channel)
+{
+ MigrationIncomingState *mis = migration_incoming_get_current();
+
+ if (!mis->from_src_file) {
+ return false;
+ }
+
+ if (migrate_multifd()) {
+ return multifd_recv_all_channels_created();
+ }
+
+ if (migrate_postcopy_preempt() && migrate_get_current()->preempt_pre_7_2) {
+ return mis->postcopy_qemufile_dst != NULL;
+ }
+
+ if (migrate_postcopy_preempt()) {
+ return main_channel;
+ }
+
+ return true;
+}
+
/*
* Returns true when we want to start a new incoming migration process,
* false otherwise.
*/
static bool migration_should_start_incoming(bool main_channel)
{
+ if (migrate_channel_header()) {
+ return migration_should_start_incoming_header(main_channel);
+ }
+
/* Multifd doesn't start unless all channels are established */
if (migrate_multifd()) {
return migration_has_all_channels();
@@ -894,7 +921,22 @@ static bool migration_should_start_incoming(bool main_channel)
return true;
}
-void migration_ioc_process_incoming(QIOChannel *ioc, Error **errp)
+static void migration_start_incoming(bool main_channel)
+{
+ if (!migration_should_start_incoming(main_channel)) {
+ return;
+ }
+
+ /* If it's a recovery, we're done */
+ if (postcopy_try_recover()) {
+ return;
+ }
+
+ migration_incoming_process();
+}
+
+static void migration_ioc_process_incoming_no_header(QIOChannel *ioc,
+ Error **errp)
{
MigrationIncomingState *mis = migration_incoming_get_current();
Error *local_err = NULL;
@@ -951,13 +993,39 @@ void migration_ioc_process_incoming(QIOChannel *ioc, Error **errp)
}
}
- if (migration_should_start_incoming(default_channel)) {
- /* If it's a recovery, we're done */
- if (postcopy_try_recover()) {
- return;
- }
- migration_incoming_process();
+ migration_start_incoming(default_channel);
+}
+
+void migration_ioc_process_incoming(QIOChannel *ioc, Error **errp)
+{
+ MigChannelHeader header = {};
+ bool default_channel = false;
+ QEMUFile *f;
+ int ret;
+
+ if (!migrate_channel_header()) {
+ migration_ioc_process_incoming_no_header(ioc, errp);
+ return;
+ }
+
+ ret = migration_channel_header_recv(ioc, &header, errp);
+ if (ret) {
+ return;
+ }
+
+ switch (header.channel_type) {
+ case MIG_CHANNEL_TYPE_MAIN:
+ f = qemu_file_new_input(ioc);
+ migration_incoming_setup(f);
+ default_channel = true;
+ break;
+ default:
+ error_setg(errp, "Received unknown migration channel type %u",
+ header.channel_type);
+ return;
}
+
+ migration_start_incoming(default_channel);
}
/**
^ permalink raw reply related [flat|nested] 54+ messages in thread
* [PATCH RFC 04/26] multifd: change multifd_new_send_channel_create() param type
2024-04-16 14:42 [PATCH RFC 00/26] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
` (2 preceding siblings ...)
2024-04-16 14:42 ` [PATCH RFC 03/26] migration: Add send/receive header for main channel Maciej S. Szmigiero
@ 2024-04-16 14:42 ` Maciej S. Szmigiero
2024-04-16 14:42 ` [PATCH RFC 05/26] migration: Add a DestroyNotify parameter to socket_send_channel_create() Maciej S. Szmigiero
` (22 subsequent siblings)
26 siblings, 0 replies; 54+ messages in thread
From: Maciej S. Szmigiero @ 2024-04-16 14:42 UTC (permalink / raw)
To: Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Avihai Horon, Joao Martins, qemu-devel
From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
This function is called only with MultiFDSendParams type param so use this
type explicitly instead of using an opaque pointer.
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
migration/multifd.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/migration/multifd.c b/migration/multifd.c
index 2802afe79d0d..039c0de40af5 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -1132,13 +1132,13 @@ out:
error_free(local_err);
}
-static bool multifd_new_send_channel_create(gpointer opaque, Error **errp)
+static bool multifd_new_send_channel_create(MultiFDSendParams *p, Error **errp)
{
if (!multifd_use_packets()) {
- return file_send_channel_create(opaque, errp);
+ return file_send_channel_create(p, errp);
}
- socket_send_channel_create(multifd_new_send_channel_async, opaque);
+ socket_send_channel_create(multifd_new_send_channel_async, p);
return true;
}
^ permalink raw reply related [flat|nested] 54+ messages in thread
* [PATCH RFC 05/26] migration: Add a DestroyNotify parameter to socket_send_channel_create()
2024-04-16 14:42 [PATCH RFC 00/26] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
` (3 preceding siblings ...)
2024-04-16 14:42 ` [PATCH RFC 04/26] multifd: change multifd_new_send_channel_create() param type Maciej S. Szmigiero
@ 2024-04-16 14:42 ` Maciej S. Szmigiero
2024-04-16 14:42 ` [PATCH RFC 06/26] multifd: pass MFDSendChannelConnectData when connecting sending socket Maciej S. Szmigiero
` (21 subsequent siblings)
26 siblings, 0 replies; 54+ messages in thread
From: Maciej S. Szmigiero @ 2024-04-16 14:42 UTC (permalink / raw)
To: Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Avihai Horon, Joao Martins, qemu-devel
From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
This makes managing the lifetime of the data passed to the connect
callback easier: the async task machinery can now free it automatically
via the destroy notify.
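A generic illustration of the resulting pattern (my_connect_callback and
data_free are hypothetical names, not code from this patch):

/* Ownership of 'data' passes to the async task machinery; 'data_free'
 * runs exactly once when the task releases the data, whether or not
 * the connection attempt succeeded. */
socket_send_channel_create(my_connect_callback, data, data_free);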
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
migration/multifd.c | 2 +-
migration/postcopy-ram.c | 2 +-
migration/socket.c | 6 ++++--
migration/socket.h | 3 ++-
4 files changed, 8 insertions(+), 5 deletions(-)
diff --git a/migration/multifd.c b/migration/multifd.c
index 039c0de40af5..4bc912d7500e 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -1138,7 +1138,7 @@ static bool multifd_new_send_channel_create(MultiFDSendParams *p, Error **errp)
return file_send_channel_create(p, errp);
}
- socket_send_channel_create(multifd_new_send_channel_async, p);
+ socket_send_channel_create(multifd_new_send_channel_async, p, NULL);
return true;
}
diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index eccff499cb20..e314e1023dc1 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -1715,7 +1715,7 @@ int postcopy_preempt_establish_channel(MigrationState *s)
void postcopy_preempt_setup(MigrationState *s)
{
/* Kick an async task to connect */
- socket_send_channel_create(postcopy_preempt_send_channel_new, s);
+ socket_send_channel_create(postcopy_preempt_send_channel_new, s, NULL);
}
static void postcopy_pause_ram_fast_load(MigrationIncomingState *mis)
diff --git a/migration/socket.c b/migration/socket.c
index 9ab89b1e089b..6639581cf18d 100644
--- a/migration/socket.c
+++ b/migration/socket.c
@@ -35,11 +35,13 @@ struct SocketOutgoingArgs {
SocketAddress *saddr;
} outgoing_args;
-void socket_send_channel_create(QIOTaskFunc f, void *data)
+void socket_send_channel_create(QIOTaskFunc f,
+ void *data, GDestroyNotify data_destroy)
{
QIOChannelSocket *sioc = qio_channel_socket_new();
+
qio_channel_socket_connect_async(sioc, outgoing_args.saddr,
- f, data, NULL, NULL);
+ f, data, data_destroy, NULL);
}
QIOChannel *socket_send_channel_create_sync(Error **errp)
diff --git a/migration/socket.h b/migration/socket.h
index 46c233ecd29e..114ab34176aa 100644
--- a/migration/socket.h
+++ b/migration/socket.h
@@ -21,7 +21,8 @@
#include "io/task.h"
#include "qemu/sockets.h"
-void socket_send_channel_create(QIOTaskFunc f, void *data);
+void socket_send_channel_create(QIOTaskFunc f,
+ void *data, GDestroyNotify data_destroy);
QIOChannel *socket_send_channel_create_sync(Error **errp);
void socket_start_incoming_migration(SocketAddress *saddr, Error **errp);
^ permalink raw reply related [flat|nested] 54+ messages in thread
* [PATCH RFC 06/26] multifd: pass MFDSendChannelConnectData when connecting sending socket
2024-04-16 14:42 [PATCH RFC 00/26] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
` (4 preceding siblings ...)
2024-04-16 14:42 ` [PATCH RFC 05/26] migration: Add a DestroyNotify parameter to socket_send_channel_create() Maciej S. Szmigiero
@ 2024-04-16 14:42 ` Maciej S. Szmigiero
2024-04-16 14:42 ` [PATCH RFC 07/26] migration/postcopy: pass PostcopyPChannelConnectData when connecting sending preempt socket Maciej S. Szmigiero
` (20 subsequent siblings)
26 siblings, 0 replies; 54+ messages in thread
From: Maciej S. Szmigiero @ 2024-04-16 14:42 UTC (permalink / raw)
To: Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Avihai Horon, Joao Martins, qemu-devel
From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
This will allow passing additional parameters there in the future. The new
MFDSendChannelConnectData structure is reference counted because, with
TLS, both the async connect task and the TLS handshake thread can hold it
at the same time.
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
migration/file.c | 5 ++-
migration/multifd.c | 95 ++++++++++++++++++++++++++++++++++-----------
migration/multifd.h | 4 +-
3 files changed, 80 insertions(+), 24 deletions(-)
diff --git a/migration/file.c b/migration/file.c
index ab18ba505a1d..34dfbc4a5a2d 100644
--- a/migration/file.c
+++ b/migration/file.c
@@ -62,7 +62,10 @@ bool file_send_channel_create(gpointer opaque, Error **errp)
goto out;
}
- multifd_channel_connect(opaque, QIO_CHANNEL(ioc));
+ ret = multifd_channel_connect(opaque, QIO_CHANNEL(ioc), errp);
+ if (!ret) {
+ object_unref(OBJECT(ioc));
+ }
out:
/*
diff --git a/migration/multifd.c b/migration/multifd.c
index 4bc912d7500e..58a18bb1e4a8 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -1010,34 +1010,76 @@ out:
return NULL;
}
-static void multifd_new_send_channel_async(QIOTask *task, gpointer opaque);
-
-typedef struct {
+struct MFDSendChannelConnectData {
+ unsigned int ref;
MultiFDSendParams *p;
QIOChannelTLS *tioc;
-} MultiFDTLSThreadArgs;
+};
+
+static MFDSendChannelConnectData *mfd_send_channel_connect_data_new(MultiFDSendParams *p)
+{
+ MFDSendChannelConnectData *data;
+
+ data = g_malloc0(sizeof(*data));
+ data->ref = 1;
+ data->p = p;
+
+ return data;
+}
+
+static void mfd_send_channel_connect_data_free(MFDSendChannelConnectData *data)
+{
+ g_free(data);
+}
+
+static MFDSendChannelConnectData *
+mfd_send_channel_connect_data_ref(MFDSendChannelConnectData *data)
+{
+ unsigned int ref_old;
+
+ ref_old = qatomic_fetch_inc(&data->ref);
+ assert(ref_old < UINT_MAX);
+
+ return data;
+}
+
+static void mfd_send_channel_connect_data_unref(gpointer opaque)
+{
+ MFDSendChannelConnectData *data = opaque;
+ unsigned int ref_old;
+
+ ref_old = qatomic_fetch_dec(&data->ref);
+ assert(ref_old > 0);
+ if (ref_old == 1) {
+ mfd_send_channel_connect_data_free(data);
+ }
+}
+
+G_DEFINE_AUTOPTR_CLEANUP_FUNC(MFDSendChannelConnectData, mfd_send_channel_connect_data_unref)
+
+static void multifd_new_send_channel_async(QIOTask *task, gpointer opaque);
static void *multifd_tls_handshake_thread(void *opaque)
{
- MultiFDTLSThreadArgs *args = opaque;
+ g_autoptr(MFDSendChannelConnectData) data = opaque;
+ QIOChannelTLS *tioc = data->tioc;
- qio_channel_tls_handshake(args->tioc,
+ qio_channel_tls_handshake(tioc,
multifd_new_send_channel_async,
- args->p,
- NULL,
+ g_steal_pointer(&data),
+ mfd_send_channel_connect_data_unref,
NULL);
- g_free(args);
return NULL;
}
-static bool multifd_tls_channel_connect(MultiFDSendParams *p,
+static bool multifd_tls_channel_connect(MFDSendChannelConnectData *data,
QIOChannel *ioc,
Error **errp)
{
+ MultiFDSendParams *p = data->p;
MigrationState *s = migrate_get_current();
const char *hostname = s->hostname;
- MultiFDTLSThreadArgs *args;
QIOChannelTLS *tioc;
tioc = migration_tls_client_create(ioc, hostname, errp);
@@ -1053,19 +1095,21 @@ static bool multifd_tls_channel_connect(MultiFDSendParams *p,
trace_multifd_tls_outgoing_handshake_start(ioc, tioc, hostname);
qio_channel_set_name(QIO_CHANNEL(tioc), "multifd-tls-outgoing");
- args = g_new0(MultiFDTLSThreadArgs, 1);
- args->tioc = tioc;
- args->p = p;
+ data->tioc = tioc;
p->tls_thread_created = true;
qemu_thread_create(&p->tls_thread, "multifd-tls-handshake-worker",
- multifd_tls_handshake_thread, args,
+ multifd_tls_handshake_thread,
+ mfd_send_channel_connect_data_ref(data),
QEMU_THREAD_JOINABLE);
return true;
}
-void multifd_channel_connect(MultiFDSendParams *p, QIOChannel *ioc)
+bool multifd_channel_connect(MFDSendChannelConnectData *data, QIOChannel *ioc,
+ Error **errp)
{
+ MultiFDSendParams *p = data->p;
+
qio_channel_set_delay(ioc, false);
migration_ioc_register_yank(ioc);
@@ -1075,6 +1119,8 @@ void multifd_channel_connect(MultiFDSendParams *p, QIOChannel *ioc)
p->thread_created = true;
qemu_thread_create(&p->thread, p->name, multifd_send_thread, p,
QEMU_THREAD_JOINABLE);
+
+ return true;
}
/*
@@ -1085,7 +1131,8 @@ void multifd_channel_connect(MultiFDSendParams *p, QIOChannel *ioc)
*/
static void multifd_new_send_channel_async(QIOTask *task, gpointer opaque)
{
- MultiFDSendParams *p = opaque;
+ MFDSendChannelConnectData *data = opaque;
+ MultiFDSendParams *p = data->p;
QIOChannel *ioc = QIO_CHANNEL(qio_task_get_source(task));
Error *local_err = NULL;
bool ret;
@@ -1101,13 +1148,12 @@ static void multifd_new_send_channel_async(QIOTask *task, gpointer opaque)
migrate_get_current()->hostname);
if (migrate_channel_requires_tls_upgrade(ioc)) {
- ret = multifd_tls_channel_connect(p, ioc, &local_err);
+ ret = multifd_tls_channel_connect(data, ioc, &local_err);
if (ret) {
return;
}
} else {
- multifd_channel_connect(p, ioc);
- ret = true;
+ ret = multifd_channel_connect(data, ioc, &local_err);
}
out:
@@ -1134,11 +1180,16 @@ out:
static bool multifd_new_send_channel_create(MultiFDSendParams *p, Error **errp)
{
+ g_autoptr(MFDSendChannelConnectData) data = NULL;
+
+ data = mfd_send_channel_connect_data_new(p);
+
if (!multifd_use_packets()) {
- return file_send_channel_create(p, errp);
+ return file_send_channel_create(data, errp);
}
- socket_send_channel_create(multifd_new_send_channel_async, p, NULL);
+ socket_send_channel_create(multifd_new_send_channel_async, g_steal_pointer(&data),
+ mfd_send_channel_connect_data_unref);
return true;
}
diff --git a/migration/multifd.h b/migration/multifd.h
index c9d9b0923953..fd0cd29104c1 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -250,6 +250,8 @@ static inline void multifd_send_prepare_header(MultiFDSendParams *p)
p->iovs_num++;
}
-void multifd_channel_connect(MultiFDSendParams *p, QIOChannel *ioc);
+struct MFDSendChannelConnectData;
+typedef struct MFDSendChannelConnectData MFDSendChannelConnectData;
+bool multifd_channel_connect(MFDSendChannelConnectData *data, QIOChannel *ioc, Error **errp);
#endif
^ permalink raw reply related [flat|nested] 54+ messages in thread
* [PATCH RFC 07/26] migration/postcopy: pass PostcopyPChannelConnectData when connecting sending preempt socket
2024-04-16 14:42 [PATCH RFC 00/26] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
` (5 preceding siblings ...)
2024-04-16 14:42 ` [PATCH RFC 06/26] multifd: pass MFDSendChannelConnectData when connecting sending socket Maciej S. Szmigiero
@ 2024-04-16 14:42 ` Maciej S. Szmigiero
2024-04-16 14:42 ` [PATCH RFC 08/26] migration: Allow passing migration header in migration channel creation Maciej S. Szmigiero
` (19 subsequent siblings)
26 siblings, 0 replies; 54+ messages in thread
From: Maciej S. Szmigiero @ 2024-04-16 14:42 UTC (permalink / raw)
To: Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Avihai Horon, Joao Martins, qemu-devel
From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
This will allow passing additional parameters there in the future.
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
migration/postcopy-ram.c | 68 +++++++++++++++++++++++++++++++++++-----
1 file changed, 61 insertions(+), 7 deletions(-)
diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index e314e1023dc1..94fe872d8251 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -1617,14 +1617,62 @@ void postcopy_preempt_new_channel(MigrationIncomingState *mis, QEMUFile *file)
trace_postcopy_preempt_new_channel();
}
+typedef struct {
+ unsigned int ref;
+ MigrationState *s;
+} PostcopyPChannelConnectData;
+
+static PostcopyPChannelConnectData *pcopy_preempt_connect_data_new(MigrationState *s)
+{
+ PostcopyPChannelConnectData *data;
+
+ data = g_malloc0(sizeof(*data));
+ data->ref = 1;
+ data->s = s;
+
+ return data;
+}
+
+static void pcopy_preempt_connect_data_free(PostcopyPChannelConnectData *data)
+{
+ g_free(data);
+}
+
+static PostcopyPChannelConnectData *
+pcopy_preempt_connect_data_ref(PostcopyPChannelConnectData *data)
+{
+ unsigned int ref_old;
+
+ ref_old = qatomic_fetch_inc(&data->ref);
+ assert(ref_old < UINT_MAX);
+
+ return data;
+}
+
+static void pcopy_preempt_connect_data_unref(gpointer opaque)
+{
+ PostcopyPChannelConnectData *data = opaque;
+ unsigned int ref_old;
+
+ ref_old = qatomic_fetch_dec(&data->ref);
+ assert(ref_old > 0);
+ if (ref_old == 1) {
+ pcopy_preempt_connect_data_free(data);
+ }
+}
+
+G_DEFINE_AUTOPTR_CLEANUP_FUNC(PostcopyPChannelConnectData, pcopy_preempt_connect_data_unref)
+
/*
* Setup the postcopy preempt channel with the IOC. If ERROR is specified,
* setup the error instead. This helper will free the ERROR if specified.
*/
static void
-postcopy_preempt_send_channel_done(MigrationState *s,
+postcopy_preempt_send_channel_done(PostcopyPChannelConnectData *data,
QIOChannel *ioc, Error *local_err)
{
+ MigrationState *s = data->s;
+
if (local_err) {
migrate_set_error(s, local_err);
error_free(local_err);
@@ -1645,18 +1693,19 @@ static void
postcopy_preempt_tls_handshake(QIOTask *task, gpointer opaque)
{
g_autoptr(QIOChannel) ioc = QIO_CHANNEL(qio_task_get_source(task));
- MigrationState *s = opaque;
+ PostcopyPChannelConnectData *data = opaque;
Error *local_err = NULL;
qio_task_propagate_error(task, &local_err);
- postcopy_preempt_send_channel_done(s, ioc, local_err);
+ postcopy_preempt_send_channel_done(data, ioc, local_err);
}
static void
postcopy_preempt_send_channel_new(QIOTask *task, gpointer opaque)
{
g_autoptr(QIOChannel) ioc = QIO_CHANNEL(qio_task_get_source(task));
- MigrationState *s = opaque;
+ PostcopyPChannelConnectData *data = opaque;
+ MigrationState *s = data->s;
QIOChannelTLS *tioc;
Error *local_err = NULL;
@@ -1672,14 +1721,15 @@ postcopy_preempt_send_channel_new(QIOTask *task, gpointer opaque)
trace_postcopy_preempt_tls_handshake();
qio_channel_set_name(QIO_CHANNEL(tioc), "migration-tls-preempt");
qio_channel_tls_handshake(tioc, postcopy_preempt_tls_handshake,
- s, NULL, NULL);
+ pcopy_preempt_connect_data_ref(data),
+ pcopy_preempt_connect_data_unref, NULL);
/* Setup the channel until TLS handshake finished */
return;
}
out:
/* This handles both good and error cases */
- postcopy_preempt_send_channel_done(s, ioc, local_err);
+ postcopy_preempt_send_channel_done(data, ioc, local_err);
}
/*
@@ -1714,8 +1764,12 @@ int postcopy_preempt_establish_channel(MigrationState *s)
void postcopy_preempt_setup(MigrationState *s)
{
+ PostcopyPChannelConnectData *data;
+
+ data = pcopy_preempt_connect_data_new(s);
/* Kick an async task to connect */
- socket_send_channel_create(postcopy_preempt_send_channel_new, s, NULL);
+ socket_send_channel_create(postcopy_preempt_send_channel_new,
+ data, pcopy_preempt_connect_data_unref);
}
static void postcopy_pause_ram_fast_load(MigrationIncomingState *mis)
^ permalink raw reply related [flat|nested] 54+ messages in thread
* [PATCH RFC 08/26] migration: Allow passing migration header in migration channel creation
2024-04-16 14:42 [PATCH RFC 00/26] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
` (6 preceding siblings ...)
2024-04-16 14:42 ` [PATCH RFC 07/26] migration/postcopy: pass PostcopyPChannelConnectData when connecting sending preempt socket Maciej S. Szmigiero
@ 2024-04-16 14:42 ` Maciej S. Szmigiero
2024-04-16 14:42 ` [PATCH RFC 09/26] migration: Add send/receive header for postcopy preempt channel Maciej S. Szmigiero
` (18 subsequent siblings)
26 siblings, 0 replies; 54+ messages in thread
From: Maciej S. Szmigiero @ 2024-04-16 14:42 UTC (permalink / raw)
To: Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Avihai Horon, Joao Martins, qemu-devel
From: Avihai Horon <avihaih@nvidia.com>
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
[MSS: Rewrite using MFDSendChannelConnectData/PostcopyPChannelConnectData]
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
migration/multifd.c | 14 ++++++++++++--
migration/postcopy-ram.c | 14 ++++++++++++--
2 files changed, 24 insertions(+), 4 deletions(-)
diff --git a/migration/multifd.c b/migration/multifd.c
index 58a18bb1e4a8..8eecda68ac0f 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -18,6 +18,7 @@
#include "exec/ramblock.h"
#include "qemu/error-report.h"
#include "qapi/error.h"
+#include "channel.h"
#include "file.h"
#include "migration.h"
#include "migration-stats.h"
@@ -1014,15 +1015,20 @@ struct MFDSendChannelConnectData {
unsigned int ref;
MultiFDSendParams *p;
QIOChannelTLS *tioc;
+ MigChannelHeader header;
};
-static MFDSendChannelConnectData *mfd_send_channel_connect_data_new(MultiFDSendParams *p)
+static MFDSendChannelConnectData *mfd_send_channel_connect_data_new(MultiFDSendParams *p,
+ MigChannelHeader *header)
{
MFDSendChannelConnectData *data;
data = g_malloc0(sizeof(*data));
data->ref = 1;
data->p = p;
+ if (header) {
+ memcpy(&data->header, header, sizeof(*header));
+ }
return data;
}
@@ -1110,6 +1116,10 @@ bool multifd_channel_connect(MFDSendChannelConnectData *data, QIOChannel *ioc,
{
MultiFDSendParams *p = data->p;
+ if (migration_channel_header_send(ioc, &data->header, errp)) {
+ return false;
+ }
+
qio_channel_set_delay(ioc, false);
migration_ioc_register_yank(ioc);
@@ -1182,7 +1192,7 @@ static bool multifd_new_send_channel_create(MultiFDSendParams *p, Error **errp)
{
g_autoptr(MFDSendChannelConnectData) data = NULL;
- data = mfd_send_channel_connect_data_new(p);
+ data = mfd_send_channel_connect_data_new(p, NULL);
if (!multifd_use_packets()) {
return file_send_channel_create(data, errp);
diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index 94fe872d8251..53c90344acce 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -19,6 +19,7 @@
#include "qemu/osdep.h"
#include "qemu/madvise.h"
#include "exec/target_page.h"
+#include "channel.h"
#include "migration.h"
#include "qemu-file.h"
#include "savevm.h"
@@ -1620,15 +1621,20 @@ void postcopy_preempt_new_channel(MigrationIncomingState *mis, QEMUFile *file)
typedef struct {
unsigned int ref;
MigrationState *s;
+ MigChannelHeader header;
} PostcopyPChannelConnectData;
-static PostcopyPChannelConnectData *pcopy_preempt_connect_data_new(MigrationState *s)
+static PostcopyPChannelConnectData *pcopy_preempt_connect_data_new(MigrationState *s,
+ MigChannelHeader *header)
{
PostcopyPChannelConnectData *data;
data = g_malloc0(sizeof(*data));
data->ref = 1;
data->s = s;
+ if (header) {
+ memcpy(&data->header, header, sizeof(*header));
+ }
return data;
}
@@ -1673,6 +1679,10 @@ postcopy_preempt_send_channel_done(PostcopyPChannelConnectData *data,
{
MigrationState *s = data->s;
+ if (!local_err) {
+ migration_channel_header_send(ioc, &data->header, &local_err);
+ }
+
if (local_err) {
migrate_set_error(s, local_err);
error_free(local_err);
@@ -1766,7 +1776,7 @@ void postcopy_preempt_setup(MigrationState *s)
{
PostcopyPChannelConnectData *data;
- data = pcopy_preempt_connect_data_new(s);
+ data = pcopy_preempt_connect_data_new(s, NULL);
/* Kick an async task to connect */
socket_send_channel_create(postcopy_preempt_send_channel_new,
data, pcopy_preempt_connect_data_unref);
^ permalink raw reply related [flat|nested] 54+ messages in thread
* [PATCH RFC 09/26] migration: Add send/receive header for postcopy preempt channel
2024-04-16 14:42 [PATCH RFC 00/26] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
` (7 preceding siblings ...)
2024-04-16 14:42 ` [PATCH RFC 08/26] migration: Allow passing migration header in migration channel creation Maciej S. Szmigiero
@ 2024-04-16 14:42 ` Maciej S. Szmigiero
2024-04-16 14:42 ` [PATCH RFC 10/26] migration: Add send/receive header for multifd channel Maciej S. Szmigiero
` (17 subsequent siblings)
26 siblings, 0 replies; 54+ messages in thread
From: Maciej S. Szmigiero @ 2024-04-16 14:42 UTC (permalink / raw)
To: Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Avihai Horon, Joao Martins, qemu-devel
From: Avihai Horon <avihaih@nvidia.com>
Add send and receive migration channel header for postcopy preempt
channel.
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
[MSS: Adapt to rewritten migration header passing commit]
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
migration/channel.h | 1 +
migration/migration.c | 5 +++++
migration/postcopy-ram.c | 5 ++++-
3 files changed, 10 insertions(+), 1 deletion(-)
diff --git a/migration/channel.h b/migration/channel.h
index 95d281828aaa..c59ccedc7b6b 100644
--- a/migration/channel.h
+++ b/migration/channel.h
@@ -31,6 +31,7 @@ int migration_channel_read_peek(QIOChannel *ioc,
Error **errp);
typedef enum {
MIG_CHANNEL_TYPE_MAIN,
+ MIG_CHANNEL_TYPE_POSTCOPY_PREEMPT,
} MigChannelTypes;
typedef struct QEMU_PACKED {
diff --git a/migration/migration.c b/migration/migration.c
index 0eb5b4f4f5a1..ac9ecf1f4f22 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1019,6 +1019,11 @@ void migration_ioc_process_incoming(QIOChannel *ioc, Error **errp)
migration_incoming_setup(f);
default_channel = true;
break;
+ case MIG_CHANNEL_TYPE_POSTCOPY_PREEMPT:
+ assert(migrate_postcopy_preempt());
+ f = qemu_file_new_input(ioc);
+ postcopy_preempt_new_channel(migration_incoming_get_current(), f);
+ break;
default:
error_setg(errp, "Received unknown migration channel type %u",
header.channel_type);
diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index 53c90344acce..c7e9f7345970 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -1775,8 +1775,11 @@ int postcopy_preempt_establish_channel(MigrationState *s)
void postcopy_preempt_setup(MigrationState *s)
{
PostcopyPChannelConnectData *data;
+ MigChannelHeader header = {};
- data = pcopy_preempt_connect_data_new(s, NULL);
+ header.channel_type = MIG_CHANNEL_TYPE_POSTCOPY_PREEMPT;
+
+ data = pcopy_preempt_connect_data_new(s, &header);
/* Kick an async task to connect */
socket_send_channel_create(postcopy_preempt_send_channel_new,
data, pcopy_preempt_connect_data_unref);
^ permalink raw reply related [flat|nested] 54+ messages in thread
* [PATCH RFC 10/26] migration: Add send/receive header for multifd channel
2024-04-16 14:42 [PATCH RFC 00/26] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
` (8 preceding siblings ...)
2024-04-16 14:42 ` [PATCH RFC 09/26] migration: Add send/receive header for postcopy preempt channel Maciej S. Szmigiero
@ 2024-04-16 14:42 ` Maciej S. Szmigiero
2024-04-16 14:42 ` [PATCH RFC 11/26] migration/options: Mapped-ram is not channel header compatible Maciej S. Szmigiero
` (16 subsequent siblings)
26 siblings, 0 replies; 54+ messages in thread
From: Maciej S. Szmigiero @ 2024-04-16 14:42 UTC (permalink / raw)
To: Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Avihai Horon, Joao Martins, qemu-devel
From: Avihai Horon <avihaih@nvidia.com>
Add send and receive migration channel header for multifd channel.
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
[MSS: Adapt to rewritten migration header passing commit]
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
migration/channel.h | 1 +
migration/migration.c | 16 ++++++++++++++++
migration/multifd.c | 4 +++-
3 files changed, 20 insertions(+), 1 deletion(-)
diff --git a/migration/channel.h b/migration/channel.h
index c59ccedc7b6b..4232ee649939 100644
--- a/migration/channel.h
+++ b/migration/channel.h
@@ -32,6 +32,7 @@ int migration_channel_read_peek(QIOChannel *ioc,
typedef enum {
MIG_CHANNEL_TYPE_MAIN,
MIG_CHANNEL_TYPE_POSTCOPY_PREEMPT,
+ MIG_CHANNEL_TYPE_MULTIFD,
} MigChannelTypes;
typedef struct QEMU_PACKED {
diff --git a/migration/migration.c b/migration/migration.c
index ac9ecf1f4f22..8fe8be71a0e3 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1024,6 +1024,22 @@ void migration_ioc_process_incoming(QIOChannel *ioc, Error **errp)
f = qemu_file_new_input(ioc);
postcopy_preempt_new_channel(migration_incoming_get_current(), f);
break;
+ case MIG_CHANNEL_TYPE_MULTIFD:
+ {
+ Error *local_err = NULL;
+
+ assert(migrate_multifd());
+ if (multifd_recv_setup(errp) != 0) {
+ return;
+ }
+
+ multifd_recv_new_channel(ioc, &local_err);
+ if (local_err) {
+ error_propagate(errp, local_err);
+ return;
+ }
+ break;
+ }
default:
error_setg(errp, "Received unknown migration channel type %u",
header.channel_type);
diff --git a/migration/multifd.c b/migration/multifd.c
index 8eecda68ac0f..c2575e3d6dbf 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -1191,8 +1191,10 @@ out:
static bool multifd_new_send_channel_create(MultiFDSendParams *p, Error **errp)
{
g_autoptr(MFDSendChannelConnectData) data = NULL;
+ MigChannelHeader header = {};
- data = mfd_send_channel_connect_data_new(p, NULL);
+ header.channel_type = MIG_CHANNEL_TYPE_MULTIFD;
+ data = mfd_send_channel_connect_data_new(p, &header);
if (!multifd_use_packets()) {
return file_send_channel_create(data, errp);
^ permalink raw reply related [flat|nested] 54+ messages in thread
* [PATCH RFC 11/26] migration/options: Mapped-ram is not channel header compatible
2024-04-16 14:42 [PATCH RFC 00/26] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
` (9 preceding siblings ...)
2024-04-16 14:42 ` [PATCH RFC 10/26] migration: Add send/receive header for multifd channel Maciej S. Szmigiero
@ 2024-04-16 14:42 ` Maciej S. Szmigiero
2024-04-16 14:42 ` [PATCH RFC 12/26] migration: Enable x-channel-header pseudo-capability Maciej S. Szmigiero
` (15 subsequent siblings)
26 siblings, 0 replies; 54+ messages in thread
From: Maciej S. Szmigiero @ 2024-04-16 14:42 UTC (permalink / raw)
To: Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Avihai Horon, Joao Martins, qemu-devel
From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
Mapped-ram is only available for multifd migration without channel
header - add an appropriate check to migration options.
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
migration/options.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/migration/options.c b/migration/options.c
index 8fd871cd956d..abb5b485badd 100644
--- a/migration/options.c
+++ b/migration/options.c
@@ -1284,6 +1284,13 @@ bool migrate_params_check(MigrationParameters *params, Error **errp)
return false;
}
+ if (migrate_mapped_ram() &&
+ params->has_multifd_channels && migrate_channel_header()) {
+ error_setg(errp,
+ "Mapped-ram only available for multifd migration without channel header");
+ return false;
+ }
+
if (params->has_x_vcpu_dirty_limit_period &&
(params->x_vcpu_dirty_limit_period < 1 ||
params->x_vcpu_dirty_limit_period > 1000)) {
^ permalink raw reply related [flat|nested] 54+ messages in thread
* [PATCH RFC 12/26] migration: Enable x-channel-header pseudo-capability
2024-04-16 14:42 [PATCH RFC 00/26] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
` (10 preceding siblings ...)
2024-04-16 14:42 ` [PATCH RFC 11/26] migration/options: Mapped-ram is not channel header compatible Maciej S. Szmigiero
@ 2024-04-16 14:42 ` Maciej S. Szmigiero
2024-04-16 14:42 ` [PATCH RFC 13/26] vfio/migration: Add save_{iterate, complete_precopy}_started trace events Maciej S. Szmigiero
` (14 subsequent siblings)
26 siblings, 0 replies; 54+ messages in thread
From: Maciej S. Szmigiero @ 2024-04-16 14:42 UTC (permalink / raw)
To: Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Avihai Horon, Joao Martins, qemu-devel
From: Avihai Horon <avihaih@nvidia.com>
Now that migration channel header has been implemented, enable it.
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
migration/options.c | 1 -
1 file changed, 1 deletion(-)
diff --git a/migration/options.c b/migration/options.c
index abb5b485badd..949d8a6c0b62 100644
--- a/migration/options.c
+++ b/migration/options.c
@@ -386,7 +386,6 @@ bool migrate_channel_header(void)
{
MigrationState *s = migrate_get_current();
- return false;
return s->channel_header;
}
^ permalink raw reply related [flat|nested] 54+ messages in thread
* [PATCH RFC 13/26] vfio/migration: Add save_{iterate, complete_precopy}_started trace events
2024-04-16 14:42 [PATCH RFC 00/26] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
` (11 preceding siblings ...)
2024-04-16 14:42 ` [PATCH RFC 12/26] migration: Enable x-channel-header pseudo-capability Maciej S. Szmigiero
@ 2024-04-16 14:42 ` Maciej S. Szmigiero
2024-04-16 14:42 ` [PATCH RFC 14/26] migration/ram: Add load start trace event Maciej S. Szmigiero
` (13 subsequent siblings)
26 siblings, 0 replies; 54+ messages in thread
From: Maciej S. Szmigiero @ 2024-04-16 14:42 UTC (permalink / raw)
To: Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Avihai Horon, Joao Martins, qemu-devel
From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
This way both the start and end points of migrating a particular VFIO
device are known.
Also add a vfio_save_iterate_empty_hit trace event so it is known when
there's no more data to send for that device.
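These trace points can be enabled like any other QEMU trace event when
investigating where migration time goes, for example (availability of a
suitable trace backend depends on the build):

qemu-system-x86_64 ... -trace 'vfio_save_*'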
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
hw/vfio/migration.c | 13 +++++++++++++
hw/vfio/trace-events | 3 +++
include/hw/vfio/vfio-common.h | 3 +++
3 files changed, 19 insertions(+)
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 1149c6b3740f..bc3aea77455c 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -394,6 +394,9 @@ static int vfio_save_setup(QEMUFile *f, void *opaque)
return -ENOMEM;
}
+ migration->save_iterate_run = false;
+ migration->save_iterate_empty_hit = false;
+
if (vfio_precopy_supported(vbasedev)) {
int ret;
@@ -515,9 +518,17 @@ static int vfio_save_iterate(QEMUFile *f, void *opaque)
VFIOMigration *migration = vbasedev->migration;
ssize_t data_size;
+ if (!migration->save_iterate_run) {
+ trace_vfio_save_iterate_started(vbasedev->name);
+ migration->save_iterate_run = true;
+ }
+
data_size = vfio_save_block(f, migration);
if (data_size < 0) {
return data_size;
+ } else if (data_size == 0 && !migration->save_iterate_empty_hit) {
+ trace_vfio_save_iterate_empty_hit(vbasedev->name);
+ migration->save_iterate_empty_hit = true;
}
vfio_update_estimated_pending_data(migration, data_size);
@@ -542,6 +553,8 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
ssize_t data_size;
int ret;
+ trace_vfio_save_complete_precopy_started(vbasedev->name);
+
/* We reach here with device state STOP or STOP_COPY only */
ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP_COPY,
VFIO_DEVICE_STATE_STOP);
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index f0474b244bf0..a72697678256 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -157,8 +157,11 @@ vfio_migration_state_notifier(const char *name, int state) " (%s) state %d"
vfio_save_block(const char *name, int data_size) " (%s) data_size %d"
vfio_save_cleanup(const char *name) " (%s)"
vfio_save_complete_precopy(const char *name, int ret) " (%s) ret %d"
+vfio_save_complete_precopy_started(const char *name) " (%s)"
vfio_save_device_config_state(const char *name) " (%s)"
vfio_save_iterate(const char *name, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64
+vfio_save_iterate_started(const char *name) " (%s)"
+vfio_save_iterate_empty_hit(const char *name) " (%s)"
vfio_save_setup(const char *name, uint64_t data_buffer_size) " (%s) data buffer size 0x%"PRIx64
vfio_state_pending_estimate(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64
vfio_state_pending_exact(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t stopcopy_size, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" stopcopy size 0x%"PRIx64" precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index b9da6c08ef41..9bb523249e73 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -71,6 +71,9 @@ typedef struct VFIOMigration {
uint64_t precopy_init_size;
uint64_t precopy_dirty_size;
bool initial_data_sent;
+
+ bool save_iterate_run;
+ bool save_iterate_empty_hit;
} VFIOMigration;
struct VFIOGroup;
^ permalink raw reply related [flat|nested] 54+ messages in thread
* [PATCH RFC 14/26] migration/ram: Add load start trace event
2024-04-16 14:42 [PATCH RFC 00/26] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
` (12 preceding siblings ...)
2024-04-16 14:42 ` [PATCH RFC 13/26] vfio/migration: Add save_{iterate, complete_precopy}_started trace events Maciej S. Szmigiero
@ 2024-04-16 14:42 ` Maciej S. Szmigiero
2024-04-16 14:42 ` [PATCH RFC 15/26] migration/multifd: Zero p->flags before starting filling a packet Maciej S. Szmigiero
` (12 subsequent siblings)
26 siblings, 0 replies; 54+ messages in thread
From: Maciej S. Szmigiero @ 2024-04-16 14:42 UTC (permalink / raw)
To: Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Avihai Horon, Joao Martins, qemu-devel
From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
There's a RAM load complete trace event but it had no start equivalent.
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
migration/ram.c | 1 +
migration/trace-events | 1 +
2 files changed, 2 insertions(+)
diff --git a/migration/ram.c b/migration/ram.c
index 8deb84984f4a..cebb06480d6f 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -4223,6 +4223,7 @@ static int ram_load_precopy(QEMUFile *f)
RAM_SAVE_FLAG_ZERO);
}
+ trace_ram_load_start();
while (!ret && !(flags & RAM_SAVE_FLAG_EOS)) {
ram_addr_t addr;
void *host = NULL, *host_bak = NULL;
diff --git a/migration/trace-events b/migration/trace-events
index e48607d5a6a2..396c0233cb8c 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -115,6 +115,7 @@ colo_flush_ram_cache_end(void) ""
save_xbzrle_page_skipping(void) ""
save_xbzrle_page_overflow(void) ""
ram_save_iterate_big_wait(uint64_t milliconds, int iterations) "big wait: %" PRIu64 " milliseconds, %d iterations"
+ram_load_start(void) ""
ram_load_complete(int ret, uint64_t seq_iter) "exit_code %d seq iteration %" PRIu64
ram_write_tracking_ramblock_start(const char *block_id, size_t page_size, void *addr, size_t length) "%s: page_size: %zu addr: %p length: %zu"
ram_write_tracking_ramblock_stop(const char *block_id, size_t page_size, void *addr, size_t length) "%s: page_size: %zu addr: %p length: %zu"
^ permalink raw reply related [flat|nested] 54+ messages in thread
* [PATCH RFC 15/26] migration/multifd: Zero p->flags before starting filling a packet
2024-04-16 14:42 [PATCH RFC 00/26] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
` (13 preceding siblings ...)
2024-04-16 14:42 ` [PATCH RFC 14/26] migration/ram: Add load start trace event Maciej S. Szmigiero
@ 2024-04-16 14:42 ` Maciej S. Szmigiero
2024-04-16 14:42 ` [PATCH RFC 16/26] migration: Add save_live_complete_precopy_async{, wait} handlers Maciej S. Szmigiero
` (11 subsequent siblings)
26 siblings, 0 replies; 54+ messages in thread
From: Maciej S. Szmigiero @ 2024-04-16 14:42 UTC (permalink / raw)
To: Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Avihai Horon, Joao Martins, qemu-devel
From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
This way no stale flags are left in p->flags when filling a new packet.
p->flags can no longer carry a SYNC flag over into the next RAM packet since
syncs are now handled separately in multifd_send_thread.
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
migration/multifd.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/migration/multifd.c b/migration/multifd.c
index c2575e3d6dbf..7118c69a4d49 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -933,6 +933,7 @@ static void *multifd_send_thread(void *opaque)
if (qatomic_load_acquire(&p->pending_job)) {
MultiFDPages_t *pages = p->pages;
+ p->flags = 0;
p->iovs_num = 0;
assert(pages->num);
@@ -986,7 +987,6 @@ static void *multifd_send_thread(void *opaque)
}
/* p->next_packet_size will always be zero for a SYNC packet */
stat64_add(&mig_stats.multifd_bytes, p->packet_len);
- p->flags = 0;
}
qatomic_set(&p->pending_sync, false);
^ permalink raw reply related [flat|nested] 54+ messages in thread
* [PATCH RFC 16/26] migration: Add save_live_complete_precopy_async{, wait} handlers
2024-04-16 14:42 [PATCH RFC 00/26] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
` (14 preceding siblings ...)
2024-04-16 14:42 ` [PATCH RFC 15/26] migration/multifd: Zero p->flags before starting filling a packet Maciej S. Szmigiero
@ 2024-04-16 14:42 ` Maciej S. Szmigiero
2024-04-16 14:42 ` [PATCH RFC 17/26] migration: Add qemu_loadvm_load_state_buffer() and its handler Maciej S. Szmigiero
` (10 subsequent siblings)
26 siblings, 0 replies; 54+ messages in thread
From: Maciej S. Szmigiero @ 2024-04-16 14:42 UTC (permalink / raw)
To: Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Avihai Horon, Joao Martins, qemu-devel
From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
These SaveVMHandlers allow a device to provide its own asynchronous
transmission of the remaining data at the end of a precopy phase.
The save_live_complete_precopy_async handler is supposed to start such
transmission (for example, by launching appropriate threads) while the
save_live_complete_precopy_async_wait handler is supposed to wait until
such transfer has finished (for example, until the sending threads
have exited).
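As a rough illustration, a device might pair the two handlers like this
(a minimal sketch; DevState, dev_state_send_thread and the send_error
field are hypothetical names, not part of this series):

typedef struct DevState {
    QemuThread send_thread;   /* streams the remaining device state */
    bool send_error;          /* set by the sending thread on failure */
} DevState;

static void *dev_state_send_thread(void *opaque); /* hypothetical */

static int dev_save_complete_precopy_async(QEMUFile *f, char *idstr,
                                           uint32_t instance_id,
                                           void *opaque)
{
    DevState *ds = opaque;

    /* Start the transmission and return without waiting for it. */
    qemu_thread_create(&ds->send_thread, "dev-state-send",
                       dev_state_send_thread, ds, QEMU_THREAD_JOINABLE);
    return 0;
}

static int dev_save_complete_precopy_async_wait(QEMUFile *f, void *opaque)
{
    DevState *ds = opaque;

    /* Block until the sending thread has pushed out all of its data. */
    qemu_thread_join(&ds->send_thread);
    return ds->send_error ? -1 : 0;
}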
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
include/migration/register.h | 31 +++++++++++++++++++++++++++++++
migration/savevm.c | 35 +++++++++++++++++++++++++++++++++++
2 files changed, 66 insertions(+)
diff --git a/include/migration/register.h b/include/migration/register.h
index d7b70a8be68c..9d36e35bd612 100644
--- a/include/migration/register.h
+++ b/include/migration/register.h
@@ -102,6 +102,37 @@ typedef struct SaveVMHandlers {
*/
int (*save_live_complete_precopy)(QEMUFile *f, void *opaque);
+ /**
+ * @save_live_complete_precopy_async
+ *
+ * Arranges for handler-specific asynchronous transmission of the
+ * remaining data at the end of a precopy phase. When postcopy is
+ * enabled, devices that support postcopy will skip this step.
+ *
+ * @f: QEMUFile where the handler can synchronously send data before returning
+ * @idstr: this device section idstr
+ * @instance_id: this device section instance_id
+ * @opaque: data pointer passed to register_savevm_live()
+ *
+ * Returns zero to indicate success and negative for error
+ */
+ int (*save_live_complete_precopy_async)(QEMUFile *f,
+ char *idstr, uint32_t instance_id,
+ void *opaque);
+ /**
+ * @save_live_complete_precopy_async_wait
+ *
+ * Waits for the asynchronous transmission started by the
+ * @save_live_complete_precopy_async handler to complete.
+ * When postcopy is enabled, devices that support postcopy will skip this step.
+ *
+ * @f: QEMUFile where the handler can synchronously send data before returning
+ * @opaque: data pointer passed to register_savevm_live()
+ *
+ * Returns zero to indicate success and negative for error
+ */
+ int (*save_live_complete_precopy_async_wait)(QEMUFile *f, void *opaque);
+
/* This runs both outside and inside the BQL. */
/**
diff --git a/migration/savevm.c b/migration/savevm.c
index 388d7af7cdd8..fa35504678bf 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -1497,6 +1497,27 @@ int qemu_savevm_state_complete_precopy_iterable(QEMUFile *f, bool in_postcopy)
SaveStateEntry *se;
int ret;
+ QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+ if (!se->ops || (in_postcopy && se->ops->has_postcopy &&
+ se->ops->has_postcopy(se->opaque)) ||
+ !se->ops->save_live_complete_precopy_async) {
+ continue;
+ }
+
+ save_section_header(f, se, QEMU_VM_SECTION_END);
+
+ ret = se->ops->save_live_complete_precopy_async(f,
+ se->idstr, se->instance_id,
+ se->opaque);
+
+ save_section_footer(f, se);
+
+ if (ret < 0) {
+ qemu_file_set_error(f, ret);
+ return -1;
+ }
+ }
+
QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
if (!se->ops ||
(in_postcopy && se->ops->has_postcopy &&
@@ -1528,6 +1549,20 @@ int qemu_savevm_state_complete_precopy_iterable(QEMUFile *f, bool in_postcopy)
end_ts_each - start_ts_each);
}
+ QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+ if (!se->ops || (in_postcopy && se->ops->has_postcopy &&
+ se->ops->has_postcopy(se->opaque)) ||
+ !se->ops->save_live_complete_precopy_async_wait) {
+ continue;
+ }
+
+ ret = se->ops->save_live_complete_precopy_async_wait(f, se->opaque);
+ if (ret < 0) {
+ qemu_file_set_error(f, ret);
+ return -1;
+ }
+ }
+
trace_vmstate_downtime_checkpoint("src-iterable-saved");
return 0;
^ permalink raw reply related [flat|nested] 54+ messages in thread
* [PATCH RFC 17/26] migration: Add qemu_loadvm_load_state_buffer() and its handler
2024-04-16 14:42 [PATCH RFC 00/26] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
` (15 preceding siblings ...)
2024-04-16 14:42 ` [PATCH RFC 16/26] migration: Add save_live_complete_precopy_async{, wait} handlers Maciej S. Szmigiero
@ 2024-04-16 14:42 ` Maciej S. Szmigiero
2024-04-16 14:42 ` [PATCH RFC 18/26] migration: Add load_finish handler and associated functions Maciej S. Szmigiero
` (9 subsequent siblings)
26 siblings, 0 replies; 54+ messages in thread
From: Maciej S. Szmigiero @ 2024-04-16 14:42 UTC (permalink / raw)
To: Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Avihai Horon, Joao Martins, qemu-devel
From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
qemu_loadvm_load_state_buffer() and its load_state_buffer
SaveVMHandler allow providing a device state buffer to an explicitly
specified device via its idstr and instance id.
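For illustration, a handler implementation could look roughly like this
(a sketch under assumed names - DevState, DEV_STATE_BUF_MAX and
dev_state_queue_buffer are hypothetical, not part of this series):

static int dev_load_state_buffer(void *opaque, char *data, size_t data_size,
                                 Error **errp)
{
    DevState *ds = opaque;

    if (data_size > DEV_STATE_BUF_MAX) {
        error_setg(errp, "device state buffer too large (%zu bytes)",
                   data_size);
        return -1;
    }

    /* Hand a copy to the device's loading thread for async processing. */
    dev_state_queue_buffer(ds, g_memdup2(data, data_size), data_size);
    return 0;
}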
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
include/migration/register.h | 15 +++++++++++++++
migration/savevm.c | 25 +++++++++++++++++++++++++
migration/savevm.h | 3 +++
3 files changed, 43 insertions(+)
diff --git a/include/migration/register.h b/include/migration/register.h
index 9d36e35bd612..7d29b7e0b559 100644
--- a/include/migration/register.h
+++ b/include/migration/register.h
@@ -257,6 +257,21 @@ typedef struct SaveVMHandlers {
*/
int (*load_state)(QEMUFile *f, void *opaque, int version_id);
+ /**
+ * @load_state_buffer
+ *
+ * Load device state buffer provided to qemu_loadvm_load_state_buffer().
+ *
+ * @opaque: data pointer passed to register_savevm_live()
+ * @data: the data buffer to load
+ * @data_size: the data length in buffer
+ * @errp: pointer to Error*, to store an error if it happens.
+ *
+ * Returns zero to indicate success and negative for error
+ */
+ int (*load_state_buffer)(void *opaque, char *data, size_t data_size,
+ Error **errp);
+
/**
* @load_setup
*
diff --git a/migration/savevm.c b/migration/savevm.c
index fa35504678bf..2e4d63faca06 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -3073,6 +3073,31 @@ int qemu_loadvm_approve_switchover(void)
return migrate_send_rp_switchover_ack(mis);
}
+int qemu_loadvm_load_state_buffer(const char *idstr, uint32_t instance_id,
+ char *buf, size_t len, Error **errp)
+{
+ SaveStateEntry *se;
+
+ se = find_se(idstr, instance_id);
+ if (!se) {
+ error_setg(errp, "Unknown idstr %s or instance id %u for load state buffer",
+ idstr, instance_id);
+ return -1;
+ }
+
+ if (!se->ops || !se->ops->load_state_buffer) {
+ error_setg(errp, "idstr %s / instance %u has no load state buffer operation",
+ idstr, instance_id);
+ return -1;
+ }
+
+ if (se->ops->load_state_buffer(se->opaque, buf, len, errp) != 0) {
+ return -1;
+ }
+
+ return 0;
+}
+
bool save_snapshot(const char *name, bool overwrite, const char *vmstate,
bool has_devices, strList *devices, Error **errp)
{
diff --git a/migration/savevm.h b/migration/savevm.h
index 74669733dd63..c879ba8c970e 100644
--- a/migration/savevm.h
+++ b/migration/savevm.h
@@ -70,4 +70,7 @@ int qemu_loadvm_approve_switchover(void);
int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
bool in_postcopy, bool inactivate_disks);
+int qemu_loadvm_load_state_buffer(const char *idstr, uint32_t instance_id,
+ char *buf, size_t len, Error **errp);
+
#endif
^ permalink raw reply related [flat|nested] 54+ messages in thread
* [PATCH RFC 18/26] migration: Add load_finish handler and associated functions
2024-04-16 14:42 [PATCH RFC 00/26] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
` (16 preceding siblings ...)
2024-04-16 14:42 ` [PATCH RFC 17/26] migration: Add qemu_loadvm_load_state_buffer() and its handler Maciej S. Szmigiero
@ 2024-04-16 14:42 ` Maciej S. Szmigiero
2024-04-16 14:42 ` [PATCH RFC 19/26] migration: Add x-multifd-channels-device-state parameter Maciej S. Szmigiero
` (8 subsequent siblings)
26 siblings, 0 replies; 54+ messages in thread
From: Maciej S. Szmigiero @ 2024-04-16 14:42 UTC (permalink / raw)
To: Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Avihai Horon, Joao Martins, qemu-devel
From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
load_finish SaveVMHandler allows migration code to poll whether
a device-specific asynchronous device state loading operation has finished.
In order to avoid calling this handler needlessly, the device is supposed
to notify the migration code of its possible readiness via a call to
qemu_loadvm_load_finish_ready_broadcast() while holding
qemu_loadvm_load_finish_ready_lock.
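The intended device-side pattern might look roughly like this (a sketch;
the DevState load_done flag is hypothetical):

/* Called by the device's loading thread once all buffers are consumed. */
static void dev_loading_thread_mark_done(DevState *ds)
{
    qemu_loadvm_load_finish_ready_lock();
    ds->load_done = true;
    qemu_loadvm_load_finish_ready_broadcast();
    qemu_loadvm_load_finish_ready_unlock();
}

/* load_finish handler; runs with qemu_loadvm_load_finish_ready_lock held. */
static int dev_load_finish(void *opaque, bool *is_finished, Error **errp)
{
    DevState *ds = opaque;

    *is_finished = ds->load_done;
    return 0;
}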
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
include/migration/register.h | 21 +++++++++++++++
migration/migration.c | 6 +++++
migration/migration.h | 3 +++
migration/savevm.c | 52 ++++++++++++++++++++++++++++++++++++
migration/savevm.h | 4 +++
5 files changed, 86 insertions(+)
diff --git a/include/migration/register.h b/include/migration/register.h
index 7d29b7e0b559..f15881fc87cd 100644
--- a/include/migration/register.h
+++ b/include/migration/register.h
@@ -272,6 +272,27 @@ typedef struct SaveVMHandlers {
int (*load_state_buffer)(void *opaque, char *data, size_t data_size,
Error **errp);
+ /**
+ * @load_finish
+ *
+ * Poll whether all asynchronous device state loading has finished.
+ * Not called on the load failure path.
+ *
+ * Called while holding the qemu_loadvm_load_finish_ready_lock.
+ *
+ * If this method signals "not ready" then it might not be called
+ * again until qemu_loadvm_load_finish_ready_broadcast() is invoked
+ * while holding qemu_loadvm_load_finish_ready_lock.
+ *
+ * @opaque: data pointer passed to register_savevm_live()
+ * @is_finished: whether the loading had finished (output parameter)
+ * @errp: pointer to Error*, to store an error if it happens.
+ *
+ * Returns zero to indicate success and negative for error
+ * It's not an error that the loading still hasn't finished.
+ */
+ int (*load_finish)(void *opaque, bool *is_finished, Error **errp);
+
/**
* @load_setup
*
diff --git a/migration/migration.c b/migration/migration.c
index 8fe8be71a0e3..e4f82695a338 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -234,6 +234,9 @@ void migration_object_init(void)
qemu_cond_init(¤t_incoming->page_request_cond);
current_incoming->page_requested = g_tree_new(page_request_addr_cmp);
+ g_mutex_init(¤t_incoming->load_finish_ready_mutex);
+ g_cond_init(¤t_incoming->load_finish_ready_cond);
+
migration_object_check(current_migration, &error_fatal);
blk_mig_init();
@@ -387,6 +390,9 @@ void migration_incoming_state_destroy(void)
mis->postcopy_qemufile_dst = NULL;
}
+ g_mutex_clear(&mis->load_finish_ready_mutex);
+ g_cond_clear(&mis->load_finish_ready_cond);
+
yank_unregister_instance(MIGRATION_YANK_INSTANCE);
}
diff --git a/migration/migration.h b/migration/migration.h
index a6114405917f..92014ef4cfcc 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -227,6 +227,9 @@ struct MigrationIncomingState {
* is needed as this field is updated serially.
*/
unsigned int switchover_ack_pending_num;
+
+ GCond load_finish_ready_cond;
+ GMutex load_finish_ready_mutex;
};
MigrationIncomingState *migration_incoming_get_current(void);
diff --git a/migration/savevm.c b/migration/savevm.c
index 2e4d63faca06..30521ad3f340 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -2994,6 +2994,37 @@ int qemu_loadvm_state(QEMUFile *f)
return ret;
}
+ qemu_loadvm_load_finish_ready_lock();
+ while (!ret) { /* Don't call load_finish() handlers on the load failure path */
+ bool all_ready = true;
+ SaveStateEntry *se = NULL;
+
+ QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+ bool this_ready;
+
+ if (!se->ops || !se->ops->load_finish) {
+ continue;
+ }
+
+ ret = se->ops->load_finish(se->opaque, &this_ready, &local_err);
+ if (ret) {
+ error_report_err(local_err);
+
+ qemu_loadvm_load_finish_ready_unlock();
+ return -EINVAL;
+ } else if (!this_ready) {
+ all_ready = false;
+ }
+ }
+
+ if (all_ready) {
+ break;
+ }
+
+ g_cond_wait(&mis->load_finish_ready_cond, &mis->load_finish_ready_mutex);
+ }
+ qemu_loadvm_load_finish_ready_unlock();
+
if (ret == 0) {
ret = qemu_file_get_error(f);
}
@@ -3098,6 +3129,27 @@ int qemu_loadvm_load_state_buffer(const char *idstr, uint32_t instance_id,
return 0;
}
+void qemu_loadvm_load_finish_ready_lock(void)
+{
+ MigrationIncomingState *mis = migration_incoming_get_current();
+
+ g_mutex_lock(&mis->load_finish_ready_mutex);
+}
+
+void qemu_loadvm_load_finish_ready_unlock(void)
+{
+ MigrationIncomingState *mis = migration_incoming_get_current();
+
+ g_mutex_unlock(&mis->load_finish_ready_mutex);
+}
+
+void qemu_loadvm_load_finish_ready_broadcast(void)
+{
+ MigrationIncomingState *mis = migration_incoming_get_current();
+
+ g_cond_broadcast(&mis->load_finish_ready_cond);
+}
+
bool save_snapshot(const char *name, bool overwrite, const char *vmstate,
bool has_devices, strList *devices, Error **errp)
{
diff --git a/migration/savevm.h b/migration/savevm.h
index c879ba8c970e..85e8b882bd37 100644
--- a/migration/savevm.h
+++ b/migration/savevm.h
@@ -73,4 +73,8 @@ int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
int qemu_loadvm_load_state_buffer(const char *idstr, uint32_t instance_id,
char *buf, size_t len, Error **errp);
+void qemu_loadvm_load_finish_ready_lock(void);
+void qemu_loadvm_load_finish_ready_unlock(void);
+void qemu_loadvm_load_finish_ready_broadcast(void);
+
#endif
^ permalink raw reply related [flat|nested] 54+ messages in thread
* [PATCH RFC 19/26] migration: Add x-multifd-channels-device-state parameter
2024-04-16 14:42 [PATCH RFC 00/26] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
` (17 preceding siblings ...)
2024-04-16 14:42 ` [PATCH RFC 18/26] migration: Add load_finish handler and associated functions Maciej S. Szmigiero
@ 2024-04-16 14:42 ` Maciej S. Szmigiero
2024-04-16 14:42 ` [PATCH RFC 20/26] migration: Add MULTIFD_DEVICE_STATE migration channel type Maciej S. Szmigiero
` (7 subsequent siblings)
26 siblings, 0 replies; 54+ messages in thread
From: Maciej S. Szmigiero @ 2024-04-16 14:42 UTC (permalink / raw)
To: Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Avihai Horon, Joao Martins, qemu-devel
From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
This parameter allows specifying how many multifd channels are dedicated
to sending device state in parallel.
It is ignored on the receive side.
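For example, dedicating two out of five multifd channels to device state
could be configured like this on the source (illustrative values; per the
parameter checks below this also requires the x-channel-header wire
protocol to be enabled and multifd compression to be off):

(qemu) migrate_set_parameter multifd-channels 5
(qemu) migrate_set_parameter x-multifd-channels-device-state 2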
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
migration/migration-hmp-cmds.c | 7 +++++
migration/options.c | 51 ++++++++++++++++++++++++++++++++++
migration/options.h | 1 +
qapi/migration.json | 16 ++++++++++-
4 files changed, 74 insertions(+), 1 deletion(-)
diff --git a/migration/migration-hmp-cmds.c b/migration/migration-hmp-cmds.c
index 7e96ae6ffdae..37d71422fdc3 100644
--- a/migration/migration-hmp-cmds.c
+++ b/migration/migration-hmp-cmds.c
@@ -341,6 +341,9 @@ void hmp_info_migrate_parameters(Monitor *mon, const QDict *qdict)
monitor_printf(mon, "%s: %u\n",
MigrationParameter_str(MIGRATION_PARAMETER_MULTIFD_CHANNELS),
params->multifd_channels);
+ monitor_printf(mon, "%s: %u\n",
+ MigrationParameter_str(MIGRATION_PARAMETER_X_MULTIFD_CHANNELS_DEVICE_STATE),
+ params->x_multifd_channels_device_state);
monitor_printf(mon, "%s: %s\n",
MigrationParameter_str(MIGRATION_PARAMETER_MULTIFD_COMPRESSION),
MultiFDCompression_str(params->multifd_compression));
@@ -626,6 +629,10 @@ void hmp_migrate_set_parameter(Monitor *mon, const QDict *qdict)
p->has_multifd_channels = true;
visit_type_uint8(v, param, &p->multifd_channels, &err);
break;
+ case MIGRATION_PARAMETER_X_MULTIFD_CHANNELS_DEVICE_STATE:
+ p->has_x_multifd_channels_device_state = true;
+ visit_type_uint8(v, param, &p->x_multifd_channels_device_state, &err);
+ break;
case MIGRATION_PARAMETER_MULTIFD_COMPRESSION:
p->has_multifd_compression = true;
visit_type_MultiFDCompression(v, param, &p->multifd_compression,
diff --git a/migration/options.c b/migration/options.c
index 949d8a6c0b62..a7f09570b04e 100644
--- a/migration/options.c
+++ b/migration/options.c
@@ -59,6 +59,7 @@
/* The delay time (in ms) between two COLO checkpoints */
#define DEFAULT_MIGRATE_X_CHECKPOINT_DELAY (200 * 100)
#define DEFAULT_MIGRATE_MULTIFD_CHANNELS 2
+#define DEFAULT_MIGRATE_MULTIFD_CHANNELS_DEVICE_STATE 0
#define DEFAULT_MIGRATE_MULTIFD_COMPRESSION MULTIFD_COMPRESSION_NONE
/* 0: means nocompress, 1: best speed, ... 9: best compress ratio */
#define DEFAULT_MIGRATE_MULTIFD_ZLIB_LEVEL 1
@@ -138,6 +139,9 @@ Property migration_properties[] = {
DEFINE_PROP_UINT8("multifd-channels", MigrationState,
parameters.multifd_channels,
DEFAULT_MIGRATE_MULTIFD_CHANNELS),
+ DEFINE_PROP_UINT8("x-multifd-channels-device-state", MigrationState,
+ parameters.x_multifd_channels_device_state,
+ DEFAULT_MIGRATE_MULTIFD_CHANNELS_DEVICE_STATE),
DEFINE_PROP_MULTIFD_COMPRESSION("multifd-compression", MigrationState,
parameters.multifd_compression,
DEFAULT_MIGRATE_MULTIFD_COMPRESSION),
@@ -885,6 +889,13 @@ int migrate_multifd_channels(void)
return s->parameters.multifd_channels;
}
+int migrate_multifd_channels_device_state(void)
+{
+ MigrationState *s = migrate_get_current();
+
+ return s->parameters.x_multifd_channels_device_state;
+}
+
MultiFDCompression migrate_multifd_compression(void)
{
MigrationState *s = migrate_get_current();
@@ -1032,6 +1043,8 @@ MigrationParameters *qmp_query_migrate_parameters(Error **errp)
params->block_incremental = s->parameters.block_incremental;
params->has_multifd_channels = true;
params->multifd_channels = s->parameters.multifd_channels;
+ params->has_x_multifd_channels_device_state = true;
+ params->x_multifd_channels_device_state = s->parameters.x_multifd_channels_device_state;
params->has_multifd_compression = true;
params->multifd_compression = s->parameters.multifd_compression;
params->has_multifd_zlib_level = true;
@@ -1091,6 +1104,7 @@ void migrate_params_init(MigrationParameters *params)
params->has_x_checkpoint_delay = true;
params->has_block_incremental = true;
params->has_multifd_channels = true;
+ params->has_x_multifd_channels_device_state = true;
params->has_multifd_compression = true;
params->has_multifd_zlib_level = true;
params->has_multifd_zstd_level = true;
@@ -1198,6 +1212,37 @@ bool migrate_params_check(MigrationParameters *params, Error **errp)
return false;
}
+ if (params->has_multifd_channels &&
+ params->has_x_multifd_channels_device_state &&
+ params->x_multifd_channels_device_state > 0 &&
+ !migrate_channel_header()) {
+ error_setg(errp, QERR_INVALID_PARAMETER_VALUE,
+ "x_multifd_channels_device_state",
+ "0 without channel header");
+ return false;
+ }
+
+ if (params->has_multifd_channels &&
+ params->has_x_multifd_channels_device_state &&
+ params->x_multifd_channels_device_state > 0 &&
+ params->has_multifd_compression &&
+ params->multifd_compression != MULTIFD_COMPRESSION_NONE) {
+ error_setg(errp, QERR_INVALID_PARAMETER_VALUE,
+ "x_multifd_channels_device_state",
+ "0 with compression");
+ return false;
+ }
+
+ /* At least one multifd channel is needed for RAM data */
+ if (params->has_multifd_channels &&
+ params->has_x_multifd_channels_device_state &&
+ params->x_multifd_channels_device_state >= params->multifd_channels) {
+ error_setg(errp, QERR_INVALID_PARAMETER_VALUE,
+ "x_multifd_channels_device_state",
+ "a value less than multifd_channels");
+ return false;
+ }
+
if (params->has_multifd_zlib_level &&
(params->multifd_zlib_level > 9)) {
error_setg(errp, QERR_INVALID_PARAMETER_VALUE, "multifd_zlib_level",
@@ -1381,6 +1426,9 @@ static void migrate_params_test_apply(MigrateSetParameters *params,
if (params->has_multifd_channels) {
dest->multifd_channels = params->multifd_channels;
}
+ if (params->has_x_multifd_channels_device_state) {
+ dest->x_multifd_channels_device_state = params->x_multifd_channels_device_state;
+ }
if (params->has_multifd_compression) {
dest->multifd_compression = params->multifd_compression;
}
@@ -1526,6 +1574,9 @@ static void migrate_params_apply(MigrateSetParameters *params, Error **errp)
if (params->has_multifd_channels) {
s->parameters.multifd_channels = params->multifd_channels;
}
+ if (params->has_x_multifd_channels_device_state) {
+ s->parameters.x_multifd_channels_device_state = params->x_multifd_channels_device_state;
+ }
if (params->has_multifd_compression) {
s->parameters.multifd_compression = params->multifd_compression;
}
diff --git a/migration/options.h b/migration/options.h
index 1144d72ec0db..453999b0d28e 100644
--- a/migration/options.h
+++ b/migration/options.h
@@ -83,6 +83,7 @@ uint64_t migrate_max_bandwidth(void);
uint64_t migrate_avail_switchover_bandwidth(void);
uint64_t migrate_max_postcopy_bandwidth(void);
int migrate_multifd_channels(void);
+int migrate_multifd_channels_device_state(void);
MultiFDCompression migrate_multifd_compression(void);
int migrate_multifd_zlib_level(void);
int migrate_multifd_zstd_level(void);
diff --git a/qapi/migration.json b/qapi/migration.json
index 8c65b9032886..0578375cfcfd 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -858,6 +858,10 @@
# parallel. This is the same number that the number of sockets
# used for migration. The default value is 2 (since 4.0)
#
+# @x-multifd-channels-device-state: Number of multifd channels dedicated
+# to sending device state in parallel (ignored on the receive side).
+# The default value is 0 (since TBD)
+#
# @xbzrle-cache-size: cache size to be used by XBZRLE migration. It
# needs to be a multiple of the target page size and a power of 2
# (Since 2.11)
@@ -940,7 +944,7 @@
'avail-switchover-bandwidth', 'downtime-limit',
{ 'name': 'x-checkpoint-delay', 'features': [ 'unstable' ] },
{ 'name': 'block-incremental', 'features': [ 'deprecated' ] },
- 'multifd-channels',
+ 'multifd-channels', 'x-multifd-channels-device-state',
'xbzrle-cache-size', 'max-postcopy-bandwidth',
'max-cpu-throttle', 'multifd-compression',
'multifd-zlib-level', 'multifd-zstd-level',
@@ -1066,6 +1070,10 @@
# parallel. This is the same number that the number of sockets
# used for migration. The default value is 2 (since 4.0)
#
+# @x-multifd-channels-device-state: Number of multifd channels dedicated
+# to sending device state in parallel (ignored on the receive side).
+# The default value is 0 (since TBD)
+#
# @xbzrle-cache-size: cache size to be used by XBZRLE migration. It
# needs to be a multiple of the target page size and a power of 2
# (Since 2.11)
@@ -1165,6 +1173,7 @@
'*block-incremental': { 'type': 'bool',
'features': [ 'deprecated' ] },
'*multifd-channels': 'uint8',
+ '*x-multifd-channels-device-state': 'uint8',
'*xbzrle-cache-size': 'size',
'*max-postcopy-bandwidth': 'size',
'*max-cpu-throttle': 'uint8',
@@ -1298,6 +1307,10 @@
# parallel. This is the same number that the number of sockets
# used for migration. The default value is 2 (since 4.0)
#
+# @x-multifd-channels-device-state: Number of multifd channels dedicated
+# to sending device state in parallel (ignored on the receive side).
+# The default value is 0 (since TBD)
+#
# @xbzrle-cache-size: cache size to be used by XBZRLE migration. It
# needs to be a multiple of the target page size and a power of 2
# (Since 2.11)
@@ -1394,6 +1407,7 @@
'*block-incremental': { 'type': 'bool',
'features': [ 'deprecated' ] },
'*multifd-channels': 'uint8',
+ '*x-multifd-channels-device-state': 'uint8',
'*xbzrle-cache-size': 'size',
'*max-postcopy-bandwidth': 'size',
'*max-cpu-throttle': 'uint8',
^ permalink raw reply related [flat|nested] 54+ messages in thread
* [PATCH RFC 20/26] migration: Add MULTIFD_DEVICE_STATE migration channel type
2024-04-16 14:42 [PATCH RFC 00/26] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
` (18 preceding siblings ...)
2024-04-16 14:42 ` [PATCH RFC 19/26] migration: Add x-multifd-channels-device-state parameter Maciej S. Szmigiero
@ 2024-04-16 14:42 ` Maciej S. Szmigiero
2024-04-16 14:43 ` [PATCH RFC 21/26] migration/multifd: Device state transfer support - receive side Maciej S. Szmigiero
` (6 subsequent siblings)
26 siblings, 0 replies; 54+ messages in thread
From: Maciej S. Szmigiero @ 2024-04-16 14:42 UTC (permalink / raw)
To: Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Avihai Horon, Joao Martins, qemu-devel
From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
migration/channel.h | 1 +
1 file changed, 1 insertion(+)
diff --git a/migration/channel.h b/migration/channel.h
index 4232ee649939..b985c952550d 100644
--- a/migration/channel.h
+++ b/migration/channel.h
@@ -33,6 +33,7 @@ typedef enum {
MIG_CHANNEL_TYPE_MAIN,
MIG_CHANNEL_TYPE_POSTCOPY_PREEMPT,
MIG_CHANNEL_TYPE_MULTIFD,
+ MIG_CHANNEL_TYPE_MULTIFD_DEVICE_STATE,
} MigChannelTypes;
typedef struct QEMU_PACKED {
^ permalink raw reply related [flat|nested] 54+ messages in thread
* [PATCH RFC 21/26] migration/multifd: Device state transfer support - receive side
2024-04-16 14:42 [PATCH RFC 00/26] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
` (19 preceding siblings ...)
2024-04-16 14:42 ` [PATCH RFC 20/26] migration: Add MULTIFD_DEVICE_STATE migration channel type Maciej S. Szmigiero
@ 2024-04-16 14:43 ` Maciej S. Szmigiero
2024-04-16 14:43 ` [PATCH RFC 22/26] migration/multifd: Convert multifd_send_pages::next_channel to atomic Maciej S. Szmigiero
` (5 subsequent siblings)
26 siblings, 0 replies; 54+ messages in thread
From: Maciej S. Szmigiero @ 2024-04-16 14:43 UTC (permalink / raw)
To: Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Avihai Horon, Joao Martins, qemu-devel
From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
Add basic support for receiving device state via multifd channels -
either dedicated ones or ones shared with RAM transfer.
To differentiate between a device state packet and a RAM packet, the
packet header is read first.
Depending on whether the MULTIFD_FLAG_DEVICE_STATE flag is present in the
packet header, either device state (MultiFDPacketDeviceState_t) or RAM
data (the existing MultiFDPacket_t) is then read.
The received device state data is passed to the
qemu_loadvm_load_state_buffer() function for processing in the
device's load_state_buffer handler.
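In condensed form, the receive path added below does roughly the following
(a sketch, not the verbatim code):

MultiFDPacketHdr_t hdr;

/* Every packet now starts with the common fixed-size header. */
qio_channel_read_all_eof(p->c, (void *)&hdr, sizeof(hdr), &local_err);

if (be32_to_cpu(hdr.flags) & MULTIFD_FLAG_DEVICE_STATE) {
    /* Read the rest of MultiFDPacketDeviceState_t, then next_packet_size
     * bytes of state handed to qemu_loadvm_load_state_buffer(). */
} else {
    /* Read the rest of MultiFDPacket_t, then the RAM pages it describes. */
}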
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
migration/migration.c | 7 +-
migration/multifd.c | 146 ++++++++++++++++++++++++++++++++++++------
migration/multifd.h | 34 +++++++++-
3 files changed, 163 insertions(+), 24 deletions(-)
diff --git a/migration/migration.c b/migration/migration.c
index e4f82695a338..ea2c8a043a77 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -987,7 +987,7 @@ static void migration_ioc_process_incoming_no_header(QIOChannel *ioc,
/* Multiple connections */
assert(migration_needs_multiple_sockets());
if (migrate_multifd()) {
- multifd_recv_new_channel(ioc, &local_err);
+ multifd_recv_new_channel(ioc, false, &local_err);
} else {
assert(migrate_postcopy_preempt());
f = qemu_file_new_input(ioc);
@@ -1031,6 +1031,7 @@ void migration_ioc_process_incoming(QIOChannel *ioc, Error **errp)
postcopy_preempt_new_channel(migration_incoming_get_current(), f);
break;
case MIG_CHANNEL_TYPE_MULTIFD:
+ case MIG_CHANNEL_TYPE_MULTIFD_DEVICE_STATE:
{
Error *local_err = NULL;
@@ -1039,7 +1040,9 @@ void migration_ioc_process_incoming(QIOChannel *ioc, Error **errp)
return;
}
- multifd_recv_new_channel(ioc, &local_err);
+ multifd_recv_new_channel(ioc,
+ header.channel_type == MIG_CHANNEL_TYPE_MULTIFD_DEVICE_STATE,
+ &local_err);
if (local_err) {
error_propagate(errp, local_err);
return;
diff --git a/migration/multifd.c b/migration/multifd.c
index 7118c69a4d49..a26418d87485 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -22,6 +22,7 @@
#include "file.h"
#include "migration.h"
#include "migration-stats.h"
+#include "savevm.h"
#include "socket.h"
#include "tls.h"
#include "qemu-file.h"
@@ -404,7 +405,7 @@ void multifd_send_fill_packet(MultiFDSendParams *p)
uint32_t zero_num = pages->num - pages->normal_num;
int i;
- packet->flags = cpu_to_be32(p->flags);
+ packet->hdr.flags = cpu_to_be32(p->flags);
packet->pages_alloc = cpu_to_be32(p->pages->allocated);
packet->normal_pages = cpu_to_be32(pages->normal_num);
packet->zero_pages = cpu_to_be32(zero_num);
@@ -432,28 +433,44 @@ void multifd_send_fill_packet(MultiFDSendParams *p)
p->flags, p->next_packet_size);
}
-static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error **errp)
+static int multifd_recv_unfill_packet_header(MultiFDRecvParams *p, MultiFDPacketHdr_t *hdr,
+ Error **errp)
{
- MultiFDPacket_t *packet = p->packet;
- int i;
-
- packet->magic = be32_to_cpu(packet->magic);
- if (packet->magic != MULTIFD_MAGIC) {
+ hdr->magic = be32_to_cpu(hdr->magic);
+ if (hdr->magic != MULTIFD_MAGIC) {
error_setg(errp, "multifd: received packet "
"magic %x and expected magic %x",
- packet->magic, MULTIFD_MAGIC);
+ hdr->magic, MULTIFD_MAGIC);
return -1;
}
- packet->version = be32_to_cpu(packet->version);
- if (packet->version != MULTIFD_VERSION) {
+ hdr->version = be32_to_cpu(hdr->version);
+ if (hdr->version != MULTIFD_VERSION) {
error_setg(errp, "multifd: received packet "
"version %u and expected version %u",
- packet->version, MULTIFD_VERSION);
+ hdr->version, MULTIFD_VERSION);
return -1;
}
- p->flags = be32_to_cpu(packet->flags);
+ p->flags = be32_to_cpu(hdr->flags);
+
+ return 0;
+}
+
+static int multifd_recv_unfill_packet_device_state(MultiFDRecvParams *p, Error **errp)
+{
+ MultiFDPacketDeviceState_t *packet = p->packet_dev_state;
+
+ packet->instance_id = be32_to_cpu(packet->instance_id);
+ p->next_packet_size = be32_to_cpu(packet->next_packet_size);
+
+ return 0;
+}
+
+static int multifd_recv_unfill_packet_ram(MultiFDRecvParams *p, Error **errp)
+{
+ MultiFDPacket_t *packet = p->packet;
+ int i;
packet->pages_alloc = be32_to_cpu(packet->pages_alloc);
/*
@@ -485,7 +502,6 @@ static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error **errp)
p->next_packet_size = be32_to_cpu(packet->next_packet_size);
p->packet_num = be64_to_cpu(packet->packet_num);
- p->packets_recved++;
p->total_normal_pages += p->normal_num;
p->total_zero_pages += p->zero_num;
@@ -533,6 +549,19 @@ static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error **errp)
return 0;
}
+static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error **errp)
+{
+ p->packets_recved++;
+
+ if (p->flags & MULTIFD_FLAG_DEVICE_STATE) {
+ return multifd_recv_unfill_packet_device_state(p, errp);
+ } else {
+ return multifd_recv_unfill_packet_ram(p, errp);
+ }
+
+ g_assert_not_reached();
+}
+
static bool multifd_send_should_exit(void)
{
return qatomic_read(&multifd_send_state->exiting);
@@ -1239,8 +1268,8 @@ bool multifd_send_setup(void)
p->packet_len = sizeof(MultiFDPacket_t)
+ sizeof(uint64_t) * page_count;
p->packet = g_malloc0(p->packet_len);
- p->packet->magic = cpu_to_be32(MULTIFD_MAGIC);
- p->packet->version = cpu_to_be32(MULTIFD_VERSION);
+ p->packet->hdr.magic = cpu_to_be32(MULTIFD_MAGIC);
+ p->packet->hdr.version = cpu_to_be32(MULTIFD_VERSION);
/* We need one extra place for the packet header */
p->iov = g_new0(struct iovec, page_count + 1);
@@ -1415,6 +1444,7 @@ static void multifd_recv_cleanup_channel(MultiFDRecvParams *p)
p->packet_len = 0;
g_free(p->packet);
p->packet = NULL;
+ g_clear_pointer(&p->packet_dev_state, g_free);
g_free(p->iov);
p->iov = NULL;
g_free(p->normal);
@@ -1474,6 +1504,8 @@ void multifd_recv_sync_main(void)
for (i = 0; i < thread_count; i++) {
MultiFDRecvParams *p = &multifd_recv_state->params[i];
+ assert(!p->is_device_state_dedicated);
+
trace_multifd_recv_sync_main_signal(p->id);
qemu_sem_post(&p->sem);
}
@@ -1489,6 +1521,12 @@ void multifd_recv_sync_main(void)
* the work (pending_job=false).
*/
for (i = 0; i < thread_count; i++) {
+ MultiFDRecvParams *p = &multifd_recv_state->params[i];
+
+ if (p->is_device_state_dedicated) {
+ continue;
+ }
+
trace_multifd_recv_sync_main_wait(i);
qemu_sem_wait(&multifd_recv_state->sem_sync);
}
@@ -1507,6 +1545,10 @@ void multifd_recv_sync_main(void)
for (i = 0; i < thread_count; i++) {
MultiFDRecvParams *p = &multifd_recv_state->params[i];
+ if (p->is_device_state_dedicated) {
+ continue;
+ }
+
WITH_QEMU_LOCK_GUARD(&p->mutex) {
if (multifd_recv_state->packet_num < p->packet_num) {
multifd_recv_state->packet_num = p->packet_num;
@@ -1529,8 +1571,13 @@ static void *multifd_recv_thread(void *opaque)
rcu_register_thread();
while (true) {
+ MultiFDPacketHdr_t hdr;
uint32_t flags = 0;
+ bool is_device_state = false;
bool has_data = false;
+ uint8_t *pkt_buf;
+ size_t pkt_len;
+
p->normal_num = 0;
if (use_packets) {
@@ -1538,8 +1585,27 @@ static void *multifd_recv_thread(void *opaque)
break;
}
- ret = qio_channel_read_all_eof(p->c, (void *)p->packet,
- p->packet_len, &local_err);
+ ret = qio_channel_read_all_eof(p->c, (void *)&hdr,
+ sizeof(hdr), &local_err);
+ if (ret == 0 || ret == -1) { /* 0: EOF -1: Error */
+ break;
+ }
+
+ ret = multifd_recv_unfill_packet_header(p, &hdr, &local_err);
+ if (ret) {
+ break;
+ }
+
+ is_device_state = p->flags & MULTIFD_FLAG_DEVICE_STATE;
+ if (is_device_state) {
+ pkt_buf = (uint8_t *)p->packet_dev_state + sizeof(hdr);
+ pkt_len = sizeof(*p->packet_dev_state) - sizeof(hdr);
+ } else {
+ pkt_buf = (uint8_t *)p->packet + sizeof(hdr);
+ pkt_len = p->packet_len - sizeof(hdr);
+ }
+
+ ret = qio_channel_read_all_eof(p->c, (char *)pkt_buf, pkt_len, &local_err);
if (ret == 0 || ret == -1) { /* 0: EOF -1: Error */
break;
}
@@ -1582,8 +1648,39 @@ static void *multifd_recv_thread(void *opaque)
has_data = !!p->data->size;
}
- if (has_data) {
- ret = multifd_recv_state->ops->recv(p, &local_err);
+ if (!is_device_state) {
+ if (p->is_device_state_dedicated) {
+ error_setg(&local_err,
+ "multifd: received non-device-state packet on device-state-dedicated thread");
+ break;
+ }
+
+ if (has_data) {
+ ret = multifd_recv_state->ops->recv(p, &local_err);
+ if (ret != 0) {
+ break;
+ }
+ }
+ } else {
+ g_autofree char *idstr = NULL;
+ g_autofree char *dev_state_buf = NULL;
+
+ assert(use_packets);
+
+ if (p->next_packet_size > 0) {
+ dev_state_buf = g_malloc(p->next_packet_size);
+
+ ret = qio_channel_read_all(p->c, dev_state_buf, p->next_packet_size, &local_err);
+ if (ret != 0) {
+ break;
+ }
+ }
+
+ idstr = g_strndup(p->packet_dev_state->idstr, sizeof(p->packet_dev_state->idstr));
+ ret = qemu_loadvm_load_state_buffer(idstr,
+ p->packet_dev_state->instance_id,
+ dev_state_buf, p->next_packet_size,
+ &local_err);
if (ret != 0) {
break;
}
@@ -1591,6 +1688,11 @@ static void *multifd_recv_thread(void *opaque)
if (use_packets) {
if (flags & MULTIFD_FLAG_SYNC) {
+ if (is_device_state) {
+ error_setg(&local_err, "multifd: received SYNC device state packet");
+ break;
+ }
+
qemu_sem_post(&multifd_recv_state->sem_sync);
qemu_sem_wait(&p->sem_sync);
}
@@ -1662,6 +1764,7 @@ int multifd_recv_setup(Error **errp)
p->packet_len = sizeof(MultiFDPacket_t)
+ sizeof(uint64_t) * page_count;
p->packet = g_malloc0(p->packet_len);
+ p->packet_dev_state = g_malloc0(sizeof(*p->packet_dev_state));
}
p->name = g_strdup_printf("multifdrecv_%d", i);
p->iov = g_new0(struct iovec, page_count);
@@ -1703,7 +1806,9 @@ bool multifd_recv_all_channels_created(void)
* Try to receive all multifd channels to get ready for the migration.
* Sets @errp when failing to receive the current channel.
*/
-void multifd_recv_new_channel(QIOChannel *ioc, Error **errp)
+void multifd_recv_new_channel(QIOChannel *ioc,
+ bool is_device_state_dedicated,
+ Error **errp)
{
MultiFDRecvParams *p;
Error *local_err = NULL;
@@ -1733,6 +1838,7 @@ void multifd_recv_new_channel(QIOChannel *ioc, Error **errp)
error_propagate(errp, local_err);
return;
}
+ p->is_device_state_dedicated = is_device_state_dedicated;
p->c = ioc;
object_ref(OBJECT(ioc));
diff --git a/migration/multifd.h b/migration/multifd.h
index fd0cd29104c1..b5fa56b791af 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -24,7 +24,7 @@ int multifd_recv_setup(Error **errp);
void multifd_recv_cleanup(void);
void multifd_recv_shutdown(void);
bool multifd_recv_all_channels_created(void);
-void multifd_recv_new_channel(QIOChannel *ioc, Error **errp);
+void multifd_recv_new_channel(QIOChannel *ioc, bool is_device_state_dedicated, Error **errp);
void multifd_recv_sync_main(void);
int multifd_send_sync_main(void);
bool multifd_queue_page(RAMBlock *block, ram_addr_t offset);
@@ -41,6 +41,12 @@ MultiFDRecvData *multifd_get_recv_data(void);
#define MULTIFD_FLAG_ZLIB (1 << 1)
#define MULTIFD_FLAG_ZSTD (2 << 1)
+/*
+ * If set it means that this packet contains device state
+ * (MultiFDPacketDeviceState_t), not RAM data (MultiFDPacket_t).
+ */
+#define MULTIFD_FLAG_DEVICE_STATE (1 << 4)
+
/* This value needs to be a multiple of qemu_target_page_size() */
#define MULTIFD_PACKET_SIZE (512 * 1024)
@@ -48,6 +54,11 @@ typedef struct {
uint32_t magic;
uint32_t version;
uint32_t flags;
+} __attribute__((packed)) MultiFDPacketHdr_t;
+
+typedef struct {
+ MultiFDPacketHdr_t hdr;
+
/* maximum number of allocated pages */
uint32_t pages_alloc;
/* non zero pages */
@@ -68,6 +79,16 @@ typedef struct {
uint64_t offset[];
} __attribute__((packed)) MultiFDPacket_t;
+typedef struct {
+ MultiFDPacketHdr_t hdr;
+
+ char idstr[256] QEMU_NONSTRING;
+ uint32_t instance_id;
+
+ /* size of the next packet that contains the actual data */
+ uint32_t next_packet_size;
+} __attribute__((packed)) MultiFDPacketDeviceState_t;
+
typedef struct {
/* number of used pages */
uint32_t num;
@@ -87,6 +108,13 @@ struct MultiFDRecvData {
off_t file_offset;
};
+typedef struct {
+ char *idstr;
+ uint32_t instance_id;
+ char *buf;
+ size_t buf_len;
+} MultiFDDeviceState_t;
+
typedef struct {
/* Fields are only written at creating/deletion time */
/* No lock required for them, they are read only */
@@ -175,6 +203,7 @@ typedef struct {
uint32_t page_size;
/* number of pages in a full packet */
uint32_t page_count;
+ bool is_device_state_dedicated;
/* syncs main thread and channels */
QemuSemaphore sem_sync;
@@ -194,8 +223,9 @@ typedef struct {
/* thread local variables. No locking required */
- /* pointer to the packet */
+ /* pointers to the possible packet types */
MultiFDPacket_t *packet;
+ MultiFDPacketDeviceState_t *packet_dev_state;
/* size of the next packet that contains pages */
uint32_t next_packet_size;
/* packets received through this channel */
^ permalink raw reply related [flat|nested] 54+ messages in thread
* [PATCH RFC 22/26] migration/multifd: Convert multifd_send_pages::next_channel to atomic
2024-04-16 14:42 [PATCH RFC 00/26] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
` (20 preceding siblings ...)
2024-04-16 14:43 ` [PATCH RFC 21/26] migration/multifd: Device state transfer support - receive side Maciej S. Szmigiero
@ 2024-04-16 14:43 ` Maciej S. Szmigiero
2024-04-16 14:43 ` [PATCH RFC 23/26] migration/multifd: Device state transfer support - send side Maciej S. Szmigiero
` (4 subsequent siblings)
26 siblings, 0 replies; 54+ messages in thread
From: Maciej S. Szmigiero @ 2024-04-16 14:43 UTC (permalink / raw)
To: Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Avihai Horon, Joao Martins, qemu-devel
From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
This is necessary so that multifd_send_pages() can be called
from multiple threads.
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
migration/multifd.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/migration/multifd.c b/migration/multifd.c
index a26418d87485..878ff7d9f9f0 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -622,8 +622,8 @@ static bool multifd_send_pages(void)
* using more channels, so ensure it doesn't overflow if the
* limit is lower now.
*/
- next_channel %= migrate_multifd_channels();
- for (i = next_channel;; i = (i + 1) % migrate_multifd_channels()) {
+ i = qatomic_load_acquire(&next_channel) % migrate_multifd_channels();
+ for (;; i = (i + 1) % migrate_multifd_channels()) {
if (multifd_send_should_exit()) {
return false;
}
@@ -633,7 +633,8 @@ static bool multifd_send_pages(void)
* sender thread can clear it.
*/
if (qatomic_read(&p->pending_job) == false) {
- next_channel = (i + 1) % migrate_multifd_channels();
+ qatomic_store_release(&next_channel,
+ (i + 1) % migrate_multifd_channels());
break;
}
}
^ permalink raw reply related [flat|nested] 54+ messages in thread
* [PATCH RFC 23/26] migration/multifd: Device state transfer support - send side
2024-04-16 14:42 [PATCH RFC 00/26] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
` (21 preceding siblings ...)
2024-04-16 14:43 ` [PATCH RFC 22/26] migration/multifd: Convert multifd_send_pages::next_channel to atomic Maciej S. Szmigiero
@ 2024-04-16 14:43 ` Maciej S. Szmigiero
2024-04-29 20:04 ` Peter Xu
2024-04-16 14:43 ` [PATCH RFC 24/26] migration/multifd: Add migration_has_device_state_support() Maciej S. Szmigiero
` (3 subsequent siblings)
26 siblings, 1 reply; 54+ messages in thread
From: Maciej S. Szmigiero @ 2024-04-16 14:43 UTC (permalink / raw)
To: Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Avihai Horon, Joao Martins, qemu-devel
From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
A new function multifd_queue_device_state() is provided for a device to queue
its state for transmission via a multifd channel.
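A device's sending thread (such as one started from
save_live_complete_precopy_async) might then use it roughly like this
(a sketch; dev_state_next_chunk and the DevState fields are hypothetical).
Note that the data is copied internally via g_memdup2(), so the caller
keeps ownership of the buffer:

static void *dev_state_send_thread(void *opaque)
{
    DevState *ds = opaque;
    char *chunk;
    size_t len;

    /* Fetch successive chunks of device state and queue each one. */
    while ((chunk = dev_state_next_chunk(ds, &len))) {
        if (multifd_queue_device_state(ds->idstr, ds->instance_id,
                                       chunk, len) != 0) {
            ds->send_error = true;
            break;
        }
    }

    return NULL;
}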
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
include/migration/misc.h | 4 +
migration/multifd-zlib.c | 2 +-
migration/multifd-zstd.c | 2 +-
migration/multifd.c | 244 ++++++++++++++++++++++++++++++++++-----
migration/multifd.h | 30 +++--
5 files changed, 244 insertions(+), 38 deletions(-)
diff --git a/include/migration/misc.h b/include/migration/misc.h
index c9e200f4eb8f..25968e31247b 100644
--- a/include/migration/misc.h
+++ b/include/migration/misc.h
@@ -117,4 +117,8 @@ bool migration_in_bg_snapshot(void);
/* migration/block-dirty-bitmap.c */
void dirty_bitmap_mig_init(void);
+/* migration/multifd.c */
+int multifd_queue_device_state(char *idstr, uint32_t instance_id,
+ char *data, size_t len);
+
#endif
diff --git a/migration/multifd-zlib.c b/migration/multifd-zlib.c
index 99821cd4d5ef..e20c1de6033d 100644
--- a/migration/multifd-zlib.c
+++ b/migration/multifd-zlib.c
@@ -177,7 +177,7 @@ static int zlib_send_prepare(MultiFDSendParams *p, Error **errp)
out:
p->flags |= MULTIFD_FLAG_ZLIB;
- multifd_send_fill_packet(p);
+ multifd_send_fill_packet_ram(p);
return 0;
}
diff --git a/migration/multifd-zstd.c b/migration/multifd-zstd.c
index 02112255adcc..37cebd006921 100644
--- a/migration/multifd-zstd.c
+++ b/migration/multifd-zstd.c
@@ -166,7 +166,7 @@ static int zstd_send_prepare(MultiFDSendParams *p, Error **errp)
out:
p->flags |= MULTIFD_FLAG_ZSTD;
- multifd_send_fill_packet(p);
+ multifd_send_fill_packet_ram(p);
return 0;
}
diff --git a/migration/multifd.c b/migration/multifd.c
index 878ff7d9f9f0..d8ce01539a05 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -12,6 +12,7 @@
#include "qemu/osdep.h"
#include "qemu/cutils.h"
+#include "qemu/iov.h"
#include "qemu/rcu.h"
#include "exec/target_page.h"
#include "sysemu/sysemu.h"
@@ -20,6 +21,7 @@
#include "qapi/error.h"
#include "channel.h"
#include "file.h"
+#include "migration/misc.h"
#include "migration.h"
#include "migration-stats.h"
#include "savevm.h"
@@ -50,9 +52,17 @@ typedef struct {
} __attribute__((packed)) MultiFDInit_t;
struct {
+ /*
+ * Are there some device state dedicated channels (true) or
+ * should device state be sent via any available channel (false)?
+ */
+ bool device_state_dedicated_channels;
+ GMutex queue_job_mutex;
+
MultiFDSendParams *params;
- /* array of pages to sent */
+ /* array of pages or device state to be sent */
MultiFDPages_t *pages;
+ MultiFDDeviceState_t *device_state;
/*
* Global number of generated multifd packets.
*
@@ -169,7 +179,7 @@ static void multifd_send_prepare_iovs(MultiFDSendParams *p)
}
/**
- * nocomp_send_prepare: prepare date to be able to send
+ * nocomp_send_prepare_ram: prepare RAM data for sending
*
* For no compression we just have to calculate the size of the
* packet.
@@ -179,7 +189,7 @@ static void multifd_send_prepare_iovs(MultiFDSendParams *p)
* @p: Params for the channel that we are using
* @errp: pointer to an error
*/
-static int nocomp_send_prepare(MultiFDSendParams *p, Error **errp)
+static int nocomp_send_prepare_ram(MultiFDSendParams *p, Error **errp)
{
bool use_zero_copy_send = migrate_zero_copy_send();
int ret;
@@ -198,13 +208,13 @@ static int nocomp_send_prepare(MultiFDSendParams *p, Error **errp)
* Only !zerocopy needs the header in IOV; zerocopy will
* send it separately.
*/
- multifd_send_prepare_header(p);
+ multifd_send_prepare_header_ram(p);
}
multifd_send_prepare_iovs(p);
p->flags |= MULTIFD_FLAG_NOCOMP;
- multifd_send_fill_packet(p);
+ multifd_send_fill_packet_ram(p);
if (use_zero_copy_send) {
/* Send header first, without zerocopy */
@@ -218,6 +228,59 @@ static int nocomp_send_prepare(MultiFDSendParams *p, Error **errp)
return 0;
}
+static void multifd_send_fill_packet_device_state(MultiFDSendParams *p)
+{
+ MultiFDPacketDeviceState_t *packet = p->packet_device_state;
+
+ packet->hdr.flags = cpu_to_be32(p->flags);
+ strncpy(packet->idstr, p->device_state->idstr, sizeof(packet->idstr));
+ packet->instance_id = cpu_to_be32(p->device_state->instance_id);
+ packet->next_packet_size = cpu_to_be32(p->next_packet_size);
+}
+
+/**
+ * nocomp_send_prepare_device_state: prepare device state data for sending
+ *
+ * Returns 0 for success or -1 for error
+ *
+ * @p: Params for the channel that we are using
+ * @errp: pointer to an error
+ */
+static int nocomp_send_prepare_device_state(MultiFDSendParams *p,
+ Error **errp)
+{
+ assert(!multifd_send_state->device_state_dedicated_channels ||
+ p->is_device_state_dedicated);
+
+ multifd_send_prepare_header_device_state(p);
+
+ assert(!(p->flags & MULTIFD_FLAG_SYNC));
+
+ p->next_packet_size = p->device_state->buf_len;
+ if (p->next_packet_size > 0) {
+ p->iov[p->iovs_num].iov_base = p->device_state->buf;
+ p->iov[p->iovs_num].iov_len = p->next_packet_size;
+ p->iovs_num++;
+ }
+
+ p->flags |= MULTIFD_FLAG_NOCOMP | MULTIFD_FLAG_DEVICE_STATE;
+
+ multifd_send_fill_packet_device_state(p);
+
+ return 0;
+}
+
+static int nocomp_send_prepare(MultiFDSendParams *p, Error **errp)
+{
+ if (p->is_device_state_job) {
+ return nocomp_send_prepare_device_state(p, errp);
+ } else {
+ return nocomp_send_prepare_ram(p, errp);
+ }
+
+ g_assert_not_reached();
+}
+
/**
* nocomp_recv_setup: setup receive side
*
@@ -397,7 +460,18 @@ static void multifd_pages_clear(MultiFDPages_t *pages)
g_free(pages);
}
-void multifd_send_fill_packet(MultiFDSendParams *p)
+static void multifd_device_state_free(MultiFDDeviceState_t *device_state)
+{
+ if (!device_state) {
+ return;
+ }
+
+ g_clear_pointer(&device_state->idstr, g_free);
+ g_clear_pointer(&device_state->buf, g_free);
+ g_free(device_state);
+}
+
+void multifd_send_fill_packet_ram(MultiFDSendParams *p)
{
MultiFDPacket_t *packet = p->packet;
MultiFDPages_t *pages = p->pages;
@@ -585,7 +659,8 @@ static void multifd_send_kick_main(MultiFDSendParams *p)
}
/*
- * How we use multifd_send_state->pages and channel->pages?
+ * How we use multifd_send_state->pages + channel->pages
+ * and multifd_send_state->device_state + channel->device_state?
*
* We create a pages for each channel, and a main one. Each time that
* we need to send a batch of pages we interchange the ones between
@@ -601,14 +676,15 @@ static void multifd_send_kick_main(MultiFDSendParams *p)
* have to had finish with its own, otherwise pending_job can't be
* false.
*
+ * 'device_state' struct has similar handling.
+ *
* Returns true if succeed, false otherwise.
*/
-static bool multifd_send_pages(void)
+static bool multifd_send_queue_job(bool is_device_state)
{
int i;
static int next_channel;
MultiFDSendParams *p = NULL; /* make happy gcc */
- MultiFDPages_t *pages = multifd_send_state->pages;
if (multifd_send_should_exit()) {
return false;
@@ -632,7 +708,9 @@ static bool multifd_send_pages(void)
* Lockless read to p->pending_job is safe, because only multifd
* sender thread can clear it.
*/
- if (qatomic_read(&p->pending_job) == false) {
+ if ((!multifd_send_state->device_state_dedicated_channels ||
+ p->is_device_state_dedicated == is_device_state) &&
+ qatomic_read(&p->pending_job) == false) {
qatomic_store_release(&next_channel,
(i + 1) % migrate_multifd_channels());
break;
@@ -644,12 +722,30 @@ static bool multifd_send_pages(void)
* qatomic_store_release() in multifd_send_thread().
*/
smp_mb_acquire();
- assert(!p->pages->num);
- multifd_send_state->pages = p->pages;
- p->pages = pages;
+
+ if (!is_device_state) {
+ assert(!p->pages->num);
+ } else {
+ assert(!p->device_state->buf);
+ }
+
+ p->is_device_state_job = is_device_state;
+
+ if (!is_device_state) {
+ MultiFDPages_t *pages = multifd_send_state->pages;
+
+ multifd_send_state->pages = p->pages;
+ p->pages = pages;
+ } else {
+ MultiFDDeviceState_t *device_state = multifd_send_state->device_state;
+
+ multifd_send_state->device_state = p->device_state;
+ p->device_state = device_state;
+ }
+
/*
- * Making sure p->pages is setup before marking pending_job=true. Pairs
- * with the qatomic_load_acquire() in multifd_send_thread().
+ * Making sure p->pages or p->device_state is set up before marking
+ * pending_job=true. Pairs with the qatomic_load_acquire() in multifd_send_thread().
*/
qatomic_store_release(&p->pending_job, true);
qemu_sem_post(&p->sem);
@@ -673,7 +769,7 @@ static inline void multifd_enqueue(MultiFDPages_t *pages, ram_addr_t offset)
}
/* Returns true if enqueue successful, false otherwise */
-bool multifd_queue_page(RAMBlock *block, ram_addr_t offset)
+static bool multifd_queue_page_locked(RAMBlock *block, ram_addr_t offset)
{
MultiFDPages_t *pages;
@@ -696,7 +792,7 @@ retry:
* After flush, always retry.
*/
if (pages->block != block || multifd_queue_full(pages)) {
- if (!multifd_send_pages()) {
+ if (!multifd_send_queue_job(false)) {
return false;
}
goto retry;
@@ -707,6 +803,45 @@ retry:
return true;
}
+bool multifd_queue_page(RAMBlock *block, ram_addr_t offset)
+{
+ g_autoptr(GMutexLocker) locker = NULL;
+
+ /*
+ * Device state submissions for shared channels can come
+ * from multiple threads and conflict with page submissions
+ * with respect to multifd_send_state access.
+ */
+ if (!multifd_send_state->device_state_dedicated_channels) {
+ locker = g_mutex_locker_new(&multifd_send_state->queue_job_mutex);
+ }
+
+ return multifd_queue_page_locked(block, offset);
+}
+
+int multifd_queue_device_state(char *idstr, uint32_t instance_id,
+ char *data, size_t len)
+{
+ /* Device state submissions can come from multiple threads */
+ g_autoptr(GMutexLocker) locker =
+ g_mutex_locker_new(&multifd_send_state->queue_job_mutex);
+ MultiFDDeviceState_t *device_state = multifd_send_state->device_state;
+
+ assert(!device_state->buf);
+ device_state->idstr = g_strdup(idstr);
+ device_state->instance_id = instance_id;
+ device_state->buf = g_memdup2(data, len);
+ device_state->buf_len = len;
+
+ if (!multifd_send_queue_job(true)) {
+ g_clear_pointer(&device_state->idstr, g_free);
+ g_clear_pointer(&device_state->buf, g_free);
+ return -1;
+ }
+
+ return 0;
+}
+
/* Multifd send side hit an error; remember it and prepare to quit */
static void multifd_send_set_error(Error *err)
{
@@ -811,10 +946,12 @@ static bool multifd_send_cleanup_channel(MultiFDSendParams *p, Error **errp)
multifd_pages_clear(p->pages);
p->pages = NULL;
p->packet_len = 0;
+ g_clear_pointer(&p->packet_device_state, g_free);
g_free(p->packet);
p->packet = NULL;
g_free(p->iov);
p->iov = NULL;
+ g_clear_pointer(&p->device_state, multifd_device_state_free);
multifd_send_state->ops->send_cleanup(p, errp);
return *errp == NULL;
@@ -829,7 +966,9 @@ static void multifd_send_cleanup_state(void)
g_free(multifd_send_state->params);
multifd_send_state->params = NULL;
multifd_pages_clear(multifd_send_state->pages);
+ g_clear_pointer(&multifd_send_state->device_state, multifd_device_state_free);
multifd_send_state->pages = NULL;
+ g_mutex_clear(&multifd_send_state->queue_job_mutex);
g_free(multifd_send_state);
multifd_send_state = NULL;
}
@@ -876,17 +1015,28 @@ static int multifd_zero_copy_flush(QIOChannel *c)
int multifd_send_sync_main(void)
{
+ g_autoptr(GMutexLocker) locker = NULL;
int i;
bool flush_zero_copy;
if (!migrate_multifd()) {
return 0;
}
+
+ /*
+ * Page SYNC can conflict with device state submissions for shared channels
+ * with respect to multifd_send_state access.
+ */
+ if (!multifd_send_state->device_state_dedicated_channels) {
+ locker = g_mutex_locker_new(&multifd_send_state->queue_job_mutex);
+ }
+
if (multifd_send_state->pages->num) {
- if (!multifd_send_pages()) {
+ if (!multifd_send_queue_job(false)) {
error_report("%s: multifd_send_pages fail", __func__);
return -1;
}
+ assert(!multifd_send_state->pages->num);
}
flush_zero_copy = migrate_zero_copy_send();
@@ -898,6 +1048,11 @@ int multifd_send_sync_main(void)
return -1;
}
+ if (p->is_device_state_dedicated) {
+ assert(multifd_send_state->device_state_dedicated_channels);
+ continue;
+ }
+
trace_multifd_send_sync_main_signal(p->id);
/*
@@ -915,6 +1070,10 @@ int multifd_send_sync_main(void)
return -1;
}
+ if (p->is_device_state_dedicated) {
+ continue;
+ }
+
qemu_sem_wait(&multifd_send_state->channels_ready);
trace_multifd_send_sync_main_wait(p->id);
qemu_sem_wait(&p->sem_sync);
@@ -962,17 +1121,22 @@ static void *multifd_send_thread(void *opaque)
*/
if (qatomic_load_acquire(&p->pending_job)) {
MultiFDPages_t *pages = p->pages;
+ bool is_device_state = p->is_device_state_job;
+ size_t total_size;
p->flags = 0;
p->iovs_num = 0;
- assert(pages->num);
+ assert(is_device_state || pages->num);
ret = multifd_send_state->ops->send_prepare(p, &local_err);
if (ret != 0) {
break;
}
+ total_size = iov_size(p->iov, p->iovs_num);
if (migrate_mapped_ram()) {
+ assert(!is_device_state);
+
ret = file_write_ramblock_iov(p->c, p->iov, p->iovs_num,
p->pages->block, &local_err);
} else {
@@ -985,12 +1149,18 @@ static void *multifd_send_thread(void *opaque)
break;
}
- stat64_add(&mig_stats.multifd_bytes,
- p->next_packet_size + p->packet_len);
- stat64_add(&mig_stats.normal_pages, pages->normal_num);
- stat64_add(&mig_stats.zero_pages, pages->num - pages->normal_num);
+ stat64_add(&mig_stats.multifd_bytes, total_size);
+ if (!is_device_state) {
+ stat64_add(&mig_stats.normal_pages, pages->normal_num);
+ stat64_add(&mig_stats.zero_pages, pages->num - pages->normal_num);
+ }
- multifd_pages_reset(p->pages);
+ if (is_device_state) {
+ g_clear_pointer(&p->device_state->idstr, g_free);
+ g_clear_pointer(&p->device_state->buf, g_free);
+ } else {
+ multifd_pages_reset(p->pages);
+ }
p->next_packet_size = 0;
/*
@@ -1009,7 +1179,7 @@ static void *multifd_send_thread(void *opaque)
if (use_packets) {
p->flags = MULTIFD_FLAG_SYNC;
- multifd_send_fill_packet(p);
+ multifd_send_fill_packet_ram(p);
ret = qio_channel_write_all(p->c, (void *)p->packet,
p->packet_len, &local_err);
if (ret != 0) {
@@ -1223,7 +1393,12 @@ static bool multifd_new_send_channel_create(MultiFDSendParams *p, Error **errp)
g_autoptr(MFDSendChannelConnectData) data = NULL;
MigChannelHeader header = {};
- header.channel_type = MIG_CHANNEL_TYPE_MULTIFD;
+ if (!p->is_device_state_dedicated) {
+ header.channel_type = MIG_CHANNEL_TYPE_MULTIFD;
+ } else {
+ header.channel_type = MIG_CHANNEL_TYPE_MULTIFD_DEVICE_STATE;
+ }
+
data = mfd_send_channel_connect_data_new(p, &header);
if (!multifd_use_packets()) {
@@ -1239,7 +1414,7 @@ bool multifd_send_setup(void)
{
MigrationState *s = migrate_get_current();
Error *local_err = NULL;
- int thread_count, ret = 0;
+ int thread_count, device_state_thread_count, ret = 0;
uint32_t page_count = MULTIFD_PACKET_SIZE / qemu_target_page_size();
bool use_packets = multifd_use_packets();
uint8_t i;
@@ -1249,10 +1424,16 @@ bool multifd_send_setup(void)
}
thread_count = migrate_multifd_channels();
+ device_state_thread_count = migrate_multifd_channels_device_state();
+ assert(device_state_thread_count < thread_count);
+
multifd_send_state = g_malloc0(sizeof(*multifd_send_state));
+ multifd_send_state->device_state_dedicated_channels = device_state_thread_count >= 1;
+ g_mutex_init(&multifd_send_state->queue_job_mutex);
multifd_send_state->params = g_new0(MultiFDSendParams, thread_count);
multifd_send_state->pages = multifd_pages_init(page_count);
qemu_sem_init(&multifd_send_state->channels_created, 0);
+ multifd_send_state->device_state = g_malloc0(sizeof(*multifd_send_state->device_state));
qemu_sem_init(&multifd_send_state->channels_ready, 0);
qatomic_set(&multifd_send_state->exiting, 0);
multifd_send_state->ops = multifd_ops[migrate_multifd_compression()];
@@ -1260,21 +1441,28 @@ bool multifd_send_setup(void)
for (i = 0; i < thread_count; i++) {
MultiFDSendParams *p = &multifd_send_state->params[i];
+ p->is_device_state_dedicated = i >= thread_count - device_state_thread_count;
qemu_sem_init(&p->sem, 0);
qemu_sem_init(&p->sem_sync, 0);
p->id = i;
p->pages = multifd_pages_init(page_count);
if (use_packets) {
+ p->device_state = g_malloc0(sizeof(*p->device_state));
+
p->packet_len = sizeof(MultiFDPacket_t)
+ sizeof(uint64_t) * page_count;
p->packet = g_malloc0(p->packet_len);
p->packet->hdr.magic = cpu_to_be32(MULTIFD_MAGIC);
p->packet->hdr.version = cpu_to_be32(MULTIFD_VERSION);
+ p->packet_device_state = g_malloc0(sizeof(*p->packet_device_state));
+ p->packet_device_state->hdr = p->packet->hdr;
/* We need one extra place for the packet header */
p->iov = g_new0(struct iovec, page_count + 1);
} else {
+ assert(!p->is_device_state_dedicated);
+
p->iov = g_new0(struct iovec, page_count);
}
p->name = g_strdup_printf("multifdsend_%d", i);
@@ -1858,7 +2046,7 @@ bool multifd_send_prepare_common(MultiFDSendParams *p)
return false;
}
- multifd_send_prepare_header(p);
+ multifd_send_prepare_header_ram(p);
return true;
}
diff --git a/migration/multifd.h b/migration/multifd.h
index b5fa56b791af..53cf80c66f98 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -138,6 +138,7 @@ typedef struct {
uint32_t page_count;
/* multifd flags for sending ram */
int write_flags;
+ bool is_device_state_dedicated;
/* sem where to wait for more work */
QemuSemaphore sem;
@@ -157,17 +158,23 @@ typedef struct {
*/
bool pending_job;
bool pending_sync;
- /* array of pages to sent.
- * The owner of 'pages' depends of 'pending_job' value:
+
+ /* Whether the pending job is pages (false) or device state (true) */
+ bool is_device_state_job;
+
+ /* Array of pages or device state to be sent (depending on the flag above).
+ * The owner of these depends on the 'pending_job' value:
* pending_job == 0 -> migration_thread can use it.
* pending_job != 0 -> multifd_channel can use it.
*/
MultiFDPages_t *pages;
+ MultiFDDeviceState_t *device_state;
/* thread local variables. No locking required */
- /* pointer to the packet */
+ /* pointers to the possible packet types */
MultiFDPacket_t *packet;
+ MultiFDPacketDeviceState_t *packet_device_state;
/* size of the next packet that contains pages */
uint32_t next_packet_size;
/* packets sent through this channel */
@@ -268,20 +275,27 @@ typedef struct {
} MultiFDMethods;
void multifd_register_ops(int method, MultiFDMethods *ops);
-void multifd_send_fill_packet(MultiFDSendParams *p);
+void multifd_send_fill_packet_ram(MultiFDSendParams *p);
bool multifd_send_prepare_common(MultiFDSendParams *p);
void multifd_send_zero_page_detect(MultiFDSendParams *p);
void multifd_recv_zero_page_process(MultiFDRecvParams *p);
-static inline void multifd_send_prepare_header(MultiFDSendParams *p)
+struct MFDSendChannelConnectData;
+typedef struct MFDSendChannelConnectData MFDSendChannelConnectData;
+bool multifd_channel_connect(MFDSendChannelConnectData *data, QIOChannel *ioc, Error **errp);
+
+static inline void multifd_send_prepare_header_ram(MultiFDSendParams *p)
{
p->iov[0].iov_len = p->packet_len;
p->iov[0].iov_base = p->packet;
p->iovs_num++;
}
-struct MFDSendChannelConnectData;
-typedef struct MFDSendChannelConnectData MFDSendChannelConnectData;
-bool multifd_channel_connect(MFDSendChannelConnectData *data, QIOChannel *ioc, Error **errp);
+static inline void multifd_send_prepare_header_device_state(MultiFDSendParams *p)
+{
+ p->iov[0].iov_len = sizeof(*p->packet_device_state);
+ p->iov[0].iov_base = p->packet_device_state;
+ p->iovs_num++;
+}
#endif
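For illustration, the send thread then picks the header variant matching
the queued job type before the payload iovecs are appended; a hypothetical
combined helper (not part of the patch) would look like this:
/* Sketch only: dispatch to the header variant for the queued job type */
static inline void multifd_send_prepare_header_any(MultiFDSendParams *p)
{
    if (p->is_device_state_job) {
        multifd_send_prepare_header_device_state(p);
    } else {
        multifd_send_prepare_header_ram(p);
    }
}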
^ permalink raw reply related [flat|nested] 54+ messages in thread
* [PATCH RFC 24/26] migration/multifd: Add migration_has_device_state_support()
2024-04-16 14:42 [PATCH RFC 00/26] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
` (22 preceding siblings ...)
2024-04-16 14:43 ` [PATCH RFC 23/26] migration/multifd: Device state transfer support - send side Maciej S. Szmigiero
@ 2024-04-16 14:43 ` Maciej S. Szmigiero
2024-04-16 14:43 ` [PATCH RFC 25/26] vfio/migration: Multifd device state transfer support - receive side Maciej S. Szmigiero
` (2 subsequent siblings)
26 siblings, 0 replies; 54+ messages in thread
From: Maciej S. Szmigiero @ 2024-04-16 14:43 UTC (permalink / raw)
To: Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Avihai Horon, Joao Martins, qemu-devel
From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
Since device state transfer via multifd channels requires multifd
channels with the migration channel header and is currently not
compatible with multifd compression, add an appropriate query function
so a device can learn whether it can actually make use of it.
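For illustration, a device-side caller could use the query like this
(a hypothetical sketch, not part of the patch; save_device_chunk is a
made-up name):
/* Sketch: pick the transfer path based on multifd device state support */
static int save_device_chunk(QEMUFile *f, char *idstr, uint32_t instance_id,
                             char *buf, size_t len)
{
    if (migration_has_device_state_support()) {
        /* parallel path: hand the buffer over to a multifd channel */
        return multifd_queue_device_state(idstr, instance_id, buf, len);
    }
    /* fallback: stream the same bytes via the main migration channel */
    qemu_put_buffer(f, (uint8_t *)buf, len);
    return qemu_file_get_error(f);
}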
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
include/migration/misc.h | 1 +
migration/multifd.c | 6 ++++++
2 files changed, 7 insertions(+)
diff --git a/include/migration/misc.h b/include/migration/misc.h
index 25968e31247b..4da4f7f85f18 100644
--- a/include/migration/misc.h
+++ b/include/migration/misc.h
@@ -118,6 +118,7 @@ bool migration_in_bg_snapshot(void);
void dirty_bitmap_mig_init(void);
/* migration/multifd.c */
+bool migration_has_device_state_support(void);
int multifd_queue_device_state(char *idstr, uint32_t instance_id,
char *data, size_t len);
diff --git a/migration/multifd.c b/migration/multifd.c
index d8ce01539a05..d24217e705a0 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -819,6 +819,12 @@ bool multifd_queue_page(RAMBlock *block, ram_addr_t offset)
return multifd_queue_page_locked(block, offset);
}
+bool migration_has_device_state_support(void)
+{
+ return migrate_multifd() && migrate_channel_header() &&
+ migrate_multifd_compression() == MULTIFD_COMPRESSION_NONE;
+}
+
int multifd_queue_device_state(char *idstr, uint32_t instance_id,
char *data, size_t len)
{
^ permalink raw reply related [flat|nested] 54+ messages in thread
* [PATCH RFC 25/26] vfio/migration: Multifd device state transfer support - receive side
2024-04-16 14:42 [PATCH RFC 00/26] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
` (23 preceding siblings ...)
2024-04-16 14:43 ` [PATCH RFC 24/26] migration/multifd: Add migration_has_device_state_support() Maciej S. Szmigiero
@ 2024-04-16 14:43 ` Maciej S. Szmigiero
2024-04-16 14:43 ` [PATCH RFC 26/26] vfio/migration: Multifd device state transfer support - send side Maciej S. Szmigiero
2024-04-17 8:36 ` [PATCH RFC 00/26] Multifd 🔀 device state transfer support with VFIO consumer Daniel P. Berrangé
26 siblings, 0 replies; 54+ messages in thread
From: Maciej S. Szmigiero @ 2024-04-16 14:43 UTC (permalink / raw)
To: Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Avihai Horon, Joao Martins, qemu-devel
From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
The data received via multifd channels needs to be reassembled since
device state packets sent via different multifd channels can arrive
out-of-order.
Therefore, each VFIO device state packet carries a header indicating
its position in the stream.
The last such VFIO device state packet should have the
VFIO_DEVICE_STATE_CONFIG_STATE flag set and carries the device config
state.
Since it's important to finish loading the device state transferred via
the main migration channel (via the save_live_iterate handler) before
starting to load the data asynchronously transferred via multifd, a new
VFIO_MIG_FLAG_DEV_DATA_STATE_COMPLETE flag is introduced to mark the
end of the main migration channel data.
The device state loading process waits until that flag is seen before
commencing loading of the multifd-transferred device state.
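The reassembly itself boils down to a growable array indexed by each
packet's stream position; a simplified sketch of that idea follows
(locking and the consumer-side wakeups that the actual code below handles
are omitted; assumes the array was created with
g_array_new(FALSE, TRUE, sizeof(Slot))):
/* Sketch: store an out-of-order packet payload at its stream position */
typedef struct {
    bool present;
    char *data;
    size_t len;
} Slot;
static void slot_store(GArray *bufs, uint32_t idx,
                       const void *data, size_t len)
{
    Slot *s;
    if (idx >= bufs->len) {
        g_array_set_size(bufs, idx + 1); /* grow; new slots come zeroed */
    }
    s = &g_array_index(bufs, Slot, idx);
    assert(!s->present); /* a duplicate idx would be a stream error */
    s->data = g_memdup2(data, len);
    s->len = len;
    s->present = true;
}
A consumer thread then walks the array in idx order, sleeping whenever the
next slot is still empty - which is what vfio_load_bufs_thread() below does.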
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
hw/vfio/migration.c | 322 +++++++++++++++++++++++++++++++++-
hw/vfio/trace-events | 9 +-
include/hw/vfio/vfio-common.h | 14 ++
3 files changed, 342 insertions(+), 3 deletions(-)
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index bc3aea77455c..3af62dea6899 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -15,6 +15,7 @@
#include <linux/vfio.h>
#include <sys/ioctl.h>
+#include "io/channel-buffer.h"
#include "sysemu/runstate.h"
#include "hw/vfio/vfio-common.h"
#include "migration/misc.h"
@@ -46,6 +47,7 @@
#define VFIO_MIG_FLAG_DEV_SETUP_STATE (0xffffffffef100003ULL)
#define VFIO_MIG_FLAG_DEV_DATA_STATE (0xffffffffef100004ULL)
#define VFIO_MIG_FLAG_DEV_INIT_DATA_SENT (0xffffffffef100005ULL)
+#define VFIO_MIG_FLAG_DEV_DATA_STATE_COMPLETE (0xffffffffef100006ULL)
/*
* This is an arbitrary size based on migration of mlx5 devices, where typically
@@ -54,6 +56,15 @@
*/
#define VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE (1 * MiB)
+#define VFIO_DEVICE_STATE_CONFIG_STATE (1)
+
+typedef struct VFIODeviceStatePacket {
+ uint32_t version;
+ uint32_t idx;
+ uint32_t flags;
+ uint8_t data[0];
+} QEMU_PACKED VFIODeviceStatePacket;
+
static int64_t bytes_transferred;
static const char *mig_state_to_str(enum vfio_device_mig_state state)
@@ -186,6 +197,175 @@ static int vfio_load_buffer(QEMUFile *f, VFIODevice *vbasedev,
return ret;
}
+typedef struct LoadedBuffer {
+ bool is_present;
+ char *data;
+ size_t len;
+} LoadedBuffer;
+
+static void loaded_buffer_clear(gpointer data)
+{
+ LoadedBuffer *lb = data;
+
+ if (!lb->is_present) {
+ return;
+ }
+
+ g_clear_pointer(&lb->data, g_free);
+ lb->is_present = false;
+}
+
+static int vfio_load_state_buffer(void *opaque, char *data, size_t data_size,
+ Error **errp)
+{
+ VFIODevice *vbasedev = opaque;
+ VFIOMigration *migration = vbasedev->migration;
+ VFIODeviceStatePacket *packet = (VFIODeviceStatePacket *)data;
+ g_autoptr(GMutexLocker) locker = g_mutex_locker_new(&migration->load_bufs_mutex);
+ LoadedBuffer *lb;
+
+ if (data_size < sizeof(*packet)) {
+ error_setg(errp, "packet too short at %zu (min is %zu)",
+ data_size, sizeof(*packet));
+ return -1;
+ }
+
+ if (packet->version != 0) {
+ error_setg(errp, "packet has unknown version %" PRIu32,
+ packet->version);
+ return -1;
+ }
+
+ if (packet->idx == UINT32_MAX) {
+ error_setg(errp, "packet has too high idx %" PRIu32,
+ packet->idx);
+ return -1;
+ }
+
+ trace_vfio_load_state_device_buffer_incoming(vbasedev->name, packet->idx);
+
+ /* config state packet should be the last one in the stream */
+ if (packet->flags & VFIO_DEVICE_STATE_CONFIG_STATE) {
+ migration->load_buf_idx_last = packet->idx;
+ }
+
+ assert(migration->load_bufs);
+ if (packet->idx >= migration->load_bufs->len) {
+ g_array_set_size(migration->load_bufs, packet->idx + 1);
+ }
+
+ lb = &g_array_index(migration->load_bufs, typeof(*lb), packet->idx);
+ if (lb->is_present) {
+ error_setg(errp, "state buffer %" PRIu32 " already filled", packet->idx);
+ return -1;
+ }
+
+ assert(packet->idx >= migration->load_buf_idx);
+
+ lb->data = g_memdup2(&packet->data, data_size - sizeof(*packet));
+ lb->len = data_size - sizeof(*packet);
+ lb->is_present = true;
+
+ g_cond_broadcast(&migration->load_bufs_buffer_ready_cond);
+
+ return 0;
+}
+
+static void *vfio_load_bufs_thread(void *opaque)
+{
+ VFIODevice *vbasedev = opaque;
+ VFIOMigration *migration = vbasedev->migration;
+ Error **errp = &migration->load_bufs_thread_errp;
+ g_autoptr(GMutexLocker) locker = g_mutex_locker_new(&migration->load_bufs_mutex);
+ LoadedBuffer *lb;
+
+ while (!migration->load_bufs_device_ready &&
+ !migration->load_bufs_thread_want_exit) {
+ g_cond_wait(&migration->load_bufs_device_ready_cond, &migration->load_bufs_mutex);
+ }
+
+ while (!migration->load_bufs_thread_want_exit) {
+ bool starved;
+ ssize_t ret;
+
+ assert(migration->load_buf_idx <= migration->load_buf_idx_last);
+
+ if (migration->load_buf_idx >= migration->load_bufs->len) {
+ assert(migration->load_buf_idx == migration->load_bufs->len);
+ starved = true;
+ } else {
+ lb = &g_array_index(migration->load_bufs, typeof(*lb), migration->load_buf_idx);
+ starved = !lb->is_present;
+ }
+
+ if (starved) {
+ trace_vfio_load_state_device_buffer_starved(vbasedev->name, migration->load_buf_idx);
+ g_cond_wait(&migration->load_bufs_buffer_ready_cond, &migration->load_bufs_mutex);
+ continue;
+ }
+
+ if (migration->load_buf_idx == migration->load_buf_idx_last) {
+ break;
+ }
+
+ if (migration->load_buf_idx == 0) {
+ trace_vfio_load_state_device_buffer_start(vbasedev->name);
+ }
+
+ if (lb->len) {
+ g_autofree char *buf = NULL;
+ size_t buf_len;
+ int errno_save;
+
+ trace_vfio_load_state_device_buffer_load_start(vbasedev->name,
+ migration->load_buf_idx);
+
+ /* lb might become re-allocated when we drop the lock */
+ buf = g_steal_pointer(&lb->data);
+ buf_len = lb->len;
+
+ /* Loading data to the device takes a while, drop the lock during this process */
+ g_clear_pointer(&locker, g_mutex_locker_free);
+ ret = write(migration->data_fd, buf, buf_len);
+ errno_save = errno;
+ locker = g_mutex_locker_new(&migration->load_bufs_mutex);
+
+ if (ret < 0) {
+ error_setg(errp, "write to state buffer %" PRIu32 " failed with %d",
+ migration->load_buf_idx, errno_save);
+ break;
+ } else if (ret < buf_len) {
+ error_setg(errp, "write to state buffer %" PRIu32 " incomplete %zd / %zu",
+ migration->load_buf_idx, ret, buf_len);
+ break;
+ }
+
+ trace_vfio_load_state_device_buffer_load_end(vbasedev->name,
+ migration->load_buf_idx);
+ }
+
+ if (migration->load_buf_idx == migration->load_buf_idx_last - 1) {
+ trace_vfio_load_state_device_buffer_end(vbasedev->name);
+ }
+
+ migration->load_buf_idx++;
+ }
+
+ if (migration->load_bufs_thread_want_exit &&
+ !*errp) {
+ error_setg(errp, "load bufs thread asked to quit");
+ }
+
+ g_clear_pointer(&locker, g_mutex_locker_free);
+
+ qemu_loadvm_load_finish_ready_lock();
+ migration->load_bufs_thread_finished = true;
+ qemu_loadvm_load_finish_ready_broadcast();
+ qemu_loadvm_load_finish_ready_unlock();
+
+ return NULL;
+}
+
static int vfio_save_device_config_state(QEMUFile *f, void *opaque)
{
VFIODevice *vbasedev = opaque;
@@ -208,6 +388,8 @@ static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
VFIODevice *vbasedev = opaque;
uint64_t data;
+ trace_vfio_load_device_config_state_start(vbasedev->name);
+
if (vbasedev->ops && vbasedev->ops->vfio_load_config) {
int ret;
@@ -226,7 +408,7 @@ static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
return -EINVAL;
}
- trace_vfio_load_device_config_state(vbasedev->name);
+ trace_vfio_load_device_config_state_end(vbasedev->name);
return qemu_file_get_error(f);
}
@@ -596,16 +778,69 @@ static void vfio_save_state(QEMUFile *f, void *opaque)
static int vfio_load_setup(QEMUFile *f, void *opaque)
{
VFIODevice *vbasedev = opaque;
+ VFIOMigration *migration = vbasedev->migration;
+ int ret;
- return vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RESUMING,
+ ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RESUMING,
vbasedev->migration->device_state);
+ if (ret) {
+ return ret;
+ }
+
+ assert(!migration->load_bufs);
+ migration->load_bufs = g_array_new(FALSE, TRUE, sizeof(LoadedBuffer));
+ g_array_set_clear_func(migration->load_bufs, loaded_buffer_clear);
+
+ g_mutex_init(&migration->load_bufs_mutex);
+
+ migration->load_bufs_device_ready = false;
+ g_cond_init(&migration->load_bufs_device_ready_cond);
+
+ migration->load_buf_idx = 0;
+ migration->load_buf_idx_last = UINT32_MAX;
+ g_cond_init(&migration->load_bufs_buffer_ready_cond);
+
+ migration->config_state_loaded_to_dev = false;
+
+ assert(!migration->load_bufs_thread_started);
+
+ migration->load_bufs_thread_finished = false;
+ migration->load_bufs_thread_want_exit = false;
+ qemu_thread_create(&migration->load_bufs_thread, "vfio-load-bufs",
+ vfio_load_bufs_thread, opaque, QEMU_THREAD_JOINABLE);
+
+ migration->load_bufs_thread_started = true;
+
+ return 0;
}
static int vfio_load_cleanup(void *opaque)
{
VFIODevice *vbasedev = opaque;
+ VFIOMigration *migration = vbasedev->migration;
+
+ if (migration->load_bufs_thread_started) {
+ g_mutex_lock(&migration->load_bufs_mutex);
+ migration->load_bufs_thread_want_exit = true;
+ g_mutex_unlock(&migration->load_bufs_mutex);
+
+ g_cond_broadcast(&migration->load_bufs_device_ready_cond);
+ g_cond_broadcast(&migration->load_bufs_buffer_ready_cond);
+
+ qemu_thread_join(&migration->load_bufs_thread);
+
+ assert(migration->load_bufs_thread_finished);
+
+ migration->load_bufs_thread_started = false;
+ }
vfio_migration_cleanup(vbasedev);
+
+ g_clear_pointer(&migration->load_bufs, g_array_unref);
+ g_cond_clear(&migration->load_bufs_buffer_ready_cond);
+ g_cond_clear(&migration->load_bufs_device_ready_cond);
+ g_mutex_clear(&migration->load_bufs_mutex);
+
trace_vfio_load_cleanup(vbasedev->name);
return 0;
@@ -614,6 +849,7 @@ static int vfio_load_cleanup(void *opaque)
static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
{
VFIODevice *vbasedev = opaque;
+ VFIOMigration *migration = vbasedev->migration;
int ret = 0;
uint64_t data;
@@ -625,6 +861,7 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
switch (data) {
case VFIO_MIG_FLAG_DEV_CONFIG_STATE:
{
+ migration->config_state_loaded_to_dev = true;
return vfio_load_device_config_state(f, opaque);
}
case VFIO_MIG_FLAG_DEV_SETUP_STATE:
@@ -651,6 +888,15 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
}
break;
}
+ case VFIO_MIG_FLAG_DEV_DATA_STATE_COMPLETE:
+ {
+ g_autoptr(GMutexLocker) locker = g_mutex_locker_new(&migration->load_bufs_mutex);
+
+ migration->load_bufs_device_ready = true;
+ g_cond_broadcast(&migration->load_bufs_device_ready_cond);
+
+ break;
+ }
case VFIO_MIG_FLAG_DEV_INIT_DATA_SENT:
{
if (!vfio_precopy_supported(vbasedev) ||
@@ -683,6 +929,76 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
return ret;
}
+static int vfio_load_finish(void *opaque, bool *is_finished, Error **errp)
+{
+ VFIODevice *vbasedev = opaque;
+ VFIOMigration *migration = vbasedev->migration;
+ g_autoptr(GMutexLocker) locker = NULL;
+ LoadedBuffer *lb;
+ g_autoptr(QIOChannelBuffer) bioc = NULL;
+ QEMUFile *f_out = NULL, *f_in = NULL;
+ uint64_t mig_header;
+ int ret;
+
+ if (migration->config_state_loaded_to_dev) {
+ *is_finished = true;
+ return 0;
+ }
+
+ if (!migration->load_bufs_thread_finished) {
+ assert(migration->load_bufs_thread_started);
+ *is_finished = false;
+ return 0;
+ }
+
+ if (migration->load_bufs_thread_errp) {
+ error_propagate(errp, g_steal_pointer(&migration->load_bufs_thread_errp));
+ return -1;
+ }
+
+ locker = g_mutex_locker_new(&migration->load_bufs_mutex);
+
+ assert(migration->load_buf_idx == migration->load_buf_idx_last);
+ lb = &g_array_index(migration->load_bufs, typeof(*lb), migration->load_buf_idx);
+ assert(lb->is_present);
+
+ bioc = qio_channel_buffer_new(lb->len);
+ qio_channel_set_name(QIO_CHANNEL(bioc), "vfio-device-config-load");
+
+ f_out = qemu_file_new_output(QIO_CHANNEL(bioc));
+ qemu_put_buffer(f_out, (uint8_t *)lb->data, lb->len);
+
+ ret = qemu_fflush(f_out);
+ if (ret) {
+ error_setg(errp, "load device config state file flush failed with %d", ret);
+ g_clear_pointer(&f_out, qemu_fclose);
+ return -1;
+ }
+
+ qio_channel_io_seek(QIO_CHANNEL(bioc), 0, 0, NULL);
+ f_in = qemu_file_new_input(QIO_CHANNEL(bioc));
+
+ mig_header = qemu_get_be64(f_in);
+ if (mig_header != VFIO_MIG_FLAG_DEV_CONFIG_STATE) {
+ error_setg(errp, "load device config state invalid header %"PRIu64, mig_header);
+ g_clear_pointer(&f_out, qemu_fclose);
+ g_clear_pointer(&f_in, qemu_fclose);
+ return -1;
+ }
+
+ ret = vfio_load_device_config_state(f_in, opaque);
+ g_clear_pointer(&f_out, qemu_fclose);
+ g_clear_pointer(&f_in, qemu_fclose);
+ if (ret < 0) {
+ error_setg(errp, "load device config state failed with %d", ret);
+ return -1;
+ }
+
+ migration->config_state_loaded_to_dev = true;
+ *is_finished = true;
+ return 0;
+}
+
static bool vfio_switchover_ack_needed(void *opaque)
{
VFIODevice *vbasedev = opaque;
@@ -703,6 +1019,8 @@ static const SaveVMHandlers savevm_vfio_handlers = {
.load_setup = vfio_load_setup,
.load_cleanup = vfio_load_cleanup,
.load_state = vfio_load_state,
+ .load_state_buffer = vfio_load_state_buffer,
+ .load_finish = vfio_load_finish,
.switchover_ack_needed = vfio_switchover_ack_needed,
};
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index a72697678256..569bb6897b66 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -148,9 +148,16 @@ vfio_display_edid_write_error(void) ""
# migration.c
vfio_load_cleanup(const char *name) " (%s)"
-vfio_load_device_config_state(const char *name) " (%s)"
+vfio_load_device_config_state_start(const char *name) " (%s)"
+vfio_load_device_config_state_end(const char *name) " (%s)"
vfio_load_state(const char *name, uint64_t data) " (%s) data 0x%"PRIx64
vfio_load_state_device_data(const char *name, uint64_t data_size, int ret) " (%s) size 0x%"PRIx64" ret %d"
+vfio_load_state_device_buffer_incoming(const char *name, uint32_t idx) " (%s) idx %"PRIu32
+vfio_load_state_device_buffer_start(const char *name) " (%s)"
+vfio_load_state_device_buffer_starved(const char *name, uint32_t idx) " (%s) idx %"PRIu32
+vfio_load_state_device_buffer_load_start(const char *name, uint32_t idx) " (%s) idx %"PRIu32
+vfio_load_state_device_buffer_load_end(const char *name, uint32_t idx) " (%s) idx %"PRIu32
+vfio_load_state_device_buffer_end(const char *name) " (%s)"
vfio_migration_realize(const char *name) " (%s)"
vfio_migration_set_state(const char *name, const char *state) " (%s) state %s"
vfio_migration_state_notifier(const char *name, int state) " (%s) state %d"
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 9bb523249e73..f861cbd13384 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -74,6 +74,20 @@ typedef struct VFIOMigration {
bool save_iterate_run;
bool save_iterate_empty_hit;
+ QemuThread load_bufs_thread;
+ Error *load_bufs_thread_errp;
+ bool load_bufs_thread_started;
+ bool load_bufs_thread_finished;
+ bool load_bufs_thread_want_exit;
+
+ GArray *load_bufs;
+ bool load_bufs_device_ready;
+ GCond load_bufs_device_ready_cond;
+ GCond load_bufs_buffer_ready_cond;
+ GMutex load_bufs_mutex;
+ uint32_t load_buf_idx;
+ uint32_t load_buf_idx_last;
+ bool config_state_loaded_to_dev;
} VFIOMigration;
struct VFIOGroup;
^ permalink raw reply related [flat|nested] 54+ messages in thread
* [PATCH RFC 26/26] vfio/migration: Multifd device state transfer support - send side
2024-04-16 14:42 [PATCH RFC 00/26] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
` (24 preceding siblings ...)
2024-04-16 14:43 ` [PATCH RFC 25/26] vfio/migration: Multifd device state transfer support - receive side Maciej S. Szmigiero
@ 2024-04-16 14:43 ` Maciej S. Szmigiero
2024-04-17 8:36 ` [PATCH RFC 00/26] Multifd device state transfer support with VFIO consumer Daniel P. Berrangé
26 siblings, 0 replies; 54+ messages in thread
From: Maciej S. Szmigiero @ 2024-04-16 14:43 UTC (permalink / raw)
To: Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Avihai Horon, Joao Martins, qemu-devel
From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
Implement the multifd device state transfer via an additional
per-device thread spawned from the save_live_complete_precopy_async
handler.
Switch between doing the data transfer in the new handler and doing it
in the old save_state handler depending on the
migration_has_device_state_support() return value.
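From the migration core's perspective the two handlers form a start/wait
pair; roughly (a hypothetical sketch of the core-side sequence - the
actual wiring lives in the earlier savevm patches of this series):
/* Sketch: start all per-device sender threads, then wait for them all,
 * so devices transmit their remaining state in parallel. */
QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
    if (se->ops && se->ops->save_live_complete_precopy_async) {
        ret = se->ops->save_live_complete_precopy_async(f, se->idstr,
                                                        se->instance_id,
                                                        se->opaque);
        if (ret) {
            return ret;
        }
    }
}
QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
    if (se->ops && se->ops->save_live_complete_precopy_async_wait) {
        ret = se->ops->save_live_complete_precopy_async_wait(f, se->opaque);
        if (ret) {
            return ret;
        }
    }
}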
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
hw/vfio/migration.c | 195 ++++++++++++++++++++++++++++++++++
hw/vfio/trace-events | 3 +
include/hw/vfio/vfio-common.h | 8 ++
3 files changed, 206 insertions(+)
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 3af62dea6899..6177431a0cd3 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -608,11 +608,15 @@ static int vfio_save_setup(QEMUFile *f, void *opaque)
return qemu_file_get_error(f);
}
+static void vfio_save_complete_precopy_async_thread_thread_terminate(VFIODevice *vbasedev);
+
static void vfio_save_cleanup(void *opaque)
{
VFIODevice *vbasedev = opaque;
VFIOMigration *migration = vbasedev->migration;
+ vfio_save_complete_precopy_async_thread_thread_terminate(vbasedev);
+
/*
* Changing device state from STOP_COPY to STOP can take time. Do it here,
* after migration has completed, so it won't increase downtime.
@@ -621,6 +625,7 @@ static void vfio_save_cleanup(void *opaque)
vfio_migration_set_state_or_reset(vbasedev, VFIO_DEVICE_STATE_STOP);
}
+ g_clear_pointer(&migration->idstr, g_free);
g_free(migration->data_buffer);
migration->data_buffer = NULL;
migration->precopy_init_size = 0;
@@ -735,6 +740,12 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
ssize_t data_size;
int ret;
+ if (migration_has_device_state_support()) {
+ /* Emit dummy NOP data */
+ qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
+ return 0;
+ }
+
trace_vfio_save_complete_precopy_started(vbasedev->name);
/* We reach here with device state STOP or STOP_COPY only */
@@ -762,11 +773,186 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
return ret;
}
+static int vfio_save_complete_precopy_async_thread_config_state(VFIODevice *vbasedev, uint32_t idx)
+{
+ VFIOMigration *migration = vbasedev->migration;
+ g_autoptr(QIOChannelBuffer) bioc = NULL;
+ QEMUFile *f = NULL;
+ int ret;
+ g_autofree VFIODeviceStatePacket *packet = NULL;
+ size_t packet_len;
+
+ bioc = qio_channel_buffer_new(0);
+ qio_channel_set_name(QIO_CHANNEL(bioc), "vfio-device-config-save");
+
+ f = qemu_file_new_output(QIO_CHANNEL(bioc));
+
+ ret = vfio_save_device_config_state(f, vbasedev);
+ if (ret) {
+ return ret;
+ }
+
+ ret = qemu_fflush(f);
+ if (ret) {
+ goto ret_close_file;
+ }
+
+ packet_len = sizeof(*packet) + bioc->usage;
+ packet = g_malloc0(packet_len);
+ packet->idx = idx;
+ packet->flags = VFIO_DEVICE_STATE_CONFIG_STATE;
+ memcpy(&packet->data, bioc->data, bioc->usage);
+
+ ret = multifd_queue_device_state(migration->idstr, migration->instance_id,
+ (char *)packet, packet_len);
+
+ bytes_transferred += packet_len;
+
+ret_close_file:
+ g_clear_pointer(&f, qemu_fclose);
+ return ret;
+}
+
+static void *vfio_save_complete_precopy_async_thread(void *opaque)
+{
+ VFIODevice *vbasedev = opaque;
+ VFIOMigration *migration = vbasedev->migration;
+ int *ret = &migration->save_complete_precopy_thread_ret;
+ g_autofree VFIODeviceStatePacket *packet = NULL;
+ uint32_t idx;
+
+ /* We reach here with device state STOP or STOP_COPY only */
+ *ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP_COPY,
+ VFIO_DEVICE_STATE_STOP);
+ if (*ret) {
+ return NULL;
+ }
+
+ packet = g_malloc0(sizeof(*packet) + migration->data_buffer_size);
+
+ for (idx = 0; ; idx++) {
+ ssize_t data_size;
+ size_t packet_size;
+
+ data_size = read(migration->data_fd, &packet->data,
+ migration->data_buffer_size);
+ if (data_size < 0) {
+ if (errno != ENOMSG) {
+ *ret = -errno;
+ return NULL;
+ }
+
+ /*
+ * Pre-copy emptied all the device state for now. For more information,
+ * please refer to the Linux kernel VFIO uAPI.
+ */
+ data_size = 0;
+ }
+
+ if (data_size == 0) {
+ break;
+ }
+
+ packet->idx = idx;
+ packet_size = sizeof(*packet) + data_size;
+
+ *ret = multifd_queue_device_state(migration->idstr, migration->instance_id,
+ (char *)packet, packet_size);
+ if (*ret) {
+ return NULL;
+ }
+
+ bytes_transferred += packet_size;
+ }
+
+ *ret = vfio_save_complete_precopy_async_thread_config_state(vbasedev, idx);
+ if (*ret) {
+ return NULL;
+ }
+
+ trace_vfio_save_complete_precopy_async_finished(vbasedev->name);
+
+ return NULL;
+}
+
+static int vfio_save_complete_precopy_async(QEMUFile *f,
+ char *idstr, uint32_t instance_id,
+ void *opaque)
+{
+ VFIODevice *vbasedev = opaque;
+ VFIOMigration *migration = vbasedev->migration;
+ int ret;
+
+ migration->save_complete_precopy_thread_ret = 0;
+
+ if (!migration_has_device_state_support()) {
+ /* Emit dummy NOP data */
+ qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
+ return 0;
+ }
+
+ qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE_COMPLETE);
+ qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
+
+ ret = qemu_fflush(f);
+ if (ret) {
+ return ret;
+ }
+
+ assert(!migration->save_complete_precopy_thread_started);
+
+ assert(!migration->idstr);
+ migration->idstr = g_strdup(idstr);
+ migration->instance_id = instance_id;
+
+ qemu_thread_create(&migration->save_complete_precopy_thread,
+ "vfio-save_complete_precopy",
+ vfio_save_complete_precopy_async_thread,
+ opaque, QEMU_THREAD_JOINABLE);
+
+ migration->save_complete_precopy_thread_started = true;
+
+ trace_vfio_save_complete_precopy_async_started(vbasedev->name, idstr, instance_id);
+
+ return 0;
+}
+
+static void vfio_save_complete_precopy_async_thread_thread_terminate(VFIODevice *vbasedev)
+{
+ VFIOMigration *migration = vbasedev->migration;
+
+ if (!migration->save_complete_precopy_thread_started) {
+ return;
+ }
+
+ qemu_thread_join(&migration->save_complete_precopy_thread);
+
+ migration->save_complete_precopy_thread_started = false;
+
+ trace_vfio_save_complete_precopy_async_joined(vbasedev->name,
+ migration->save_complete_precopy_thread_ret);
+}
+
+static int vfio_save_complete_precopy_async_wait(QEMUFile *f, void *opaque)
+{
+ VFIODevice *vbasedev = opaque;
+ VFIOMigration *migration = vbasedev->migration;
+
+ vfio_save_complete_precopy_async_thread_thread_terminate(vbasedev);
+
+ return migration->save_complete_precopy_thread_ret;
+}
+
static void vfio_save_state(QEMUFile *f, void *opaque)
{
VFIODevice *vbasedev = opaque;
int ret;
+ if (migration_has_device_state_support()) {
+ /* Emit dummy NOP data */
+ qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
+ return;
+ }
+
ret = vfio_save_device_config_state(f, opaque);
if (ret) {
error_report("%s: Failed to save device config space",
@@ -1014,6 +1200,8 @@ static const SaveVMHandlers savevm_vfio_handlers = {
.state_pending_exact = vfio_state_pending_exact,
.is_active_iterate = vfio_is_active_iterate,
.save_live_iterate = vfio_save_iterate,
+ .save_live_complete_precopy_async = vfio_save_complete_precopy_async,
+ .save_live_complete_precopy_async_wait = vfio_save_complete_precopy_async_wait,
.save_live_complete_precopy = vfio_save_complete_precopy,
.save_state = vfio_save_state,
.load_setup = vfio_load_setup,
@@ -1034,6 +1222,10 @@ static void vfio_vmstate_change_prepare(void *opaque, bool running,
enum vfio_device_mig_state new_state;
int ret;
+ if (running) {
+ vfio_save_complete_precopy_async_thread_thread_terminate(vbasedev);
+ }
+
new_state = migration->device_state == VFIO_DEVICE_STATE_PRE_COPY ?
VFIO_DEVICE_STATE_PRE_COPY_P2P :
VFIO_DEVICE_STATE_RUNNING_P2P;
@@ -1059,6 +1251,9 @@ static void vfio_vmstate_change(void *opaque, bool running, RunState state)
int ret;
if (running) {
+ /* In case "prepare" callback wasn't registered */
+ vfio_save_complete_precopy_async_thread_thread_terminate(vbasedev);
+
new_state = VFIO_DEVICE_STATE_RUNNING;
} else {
new_state =
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 569bb6897b66..44c7bb01a004 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -165,6 +165,9 @@ vfio_save_block(const char *name, int data_size) " (%s) data_size %d"
vfio_save_cleanup(const char *name) " (%s)"
vfio_save_complete_precopy(const char *name, int ret) " (%s) ret %d"
vfio_save_complete_precopy_started(const char *name) " (%s)"
+vfio_save_complete_precopy_async_started(const char *name, const char *idstr, uint32_t instance_id) " (%s) idstr %s instance %"PRIu32
+vfio_save_complete_precopy_async_finished(const char *name) " (%s)"
+vfio_save_complete_precopy_async_joined(const char *name, int ret) " (%s) ret %d"
vfio_save_device_config_state(const char *name) " (%s)"
vfio_save_iterate(const char *name, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64
vfio_save_iterate_started(const char *name) " (%s)"
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index f861cbd13384..0c51b8bf4d9a 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -74,12 +74,20 @@ typedef struct VFIOMigration {
bool save_iterate_run;
bool save_iterate_empty_hit;
+
+ QemuThread save_complete_precopy_thread;
+ int save_complete_precopy_thread_ret;
+ bool save_complete_precopy_thread_started;
+
QemuThread load_bufs_thread;
Error *load_bufs_thread_errp;
bool load_bufs_thread_started;
bool load_bufs_thread_finished;
bool load_bufs_thread_want_exit;
+ char *idstr;
+ uint32_t instance_id;
+
GArray *load_bufs;
bool load_bufs_device_ready;
GCond load_bufs_device_ready_cond;
^ permalink raw reply related [flat|nested] 54+ messages in thread
* Re: [PATCH RFC 00/26] Multifd device state transfer support with VFIO consumer
2024-04-16 14:42 [PATCH RFC 00/26] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
` (25 preceding siblings ...)
2024-04-16 14:43 ` [PATCH RFC 26/26] vfio/migration: Multifd device state transfer support - send side Maciej S. Szmigiero
@ 2024-04-17 8:36 ` Daniel P. Berrangé
2024-04-17 12:11 ` Maciej S. Szmigiero
26 siblings, 1 reply; 54+ messages in thread
From: Daniel P. Berrangé @ 2024-04-17 8:36 UTC (permalink / raw)
To: Maciej S. Szmigiero
Cc: Peter Xu, Fabiano Rosas, Alex Williamson, Cédric Le Goater,
Eric Blake, Markus Armbruster, Avihai Horon, Joao Martins,
qemu-devel
On Tue, Apr 16, 2024 at 04:42:39PM +0200, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>
> VFIO device state transfer is currently done via the main migration channel.
> This means that transfers from multiple VFIO devices are done sequentially
> and via just a single common migration channel.
>
> Such way of transferring VFIO device state migration data reduces
> performance and severally impacts the migration downtime (~50%) for VMs
> that have multiple such devices with large state size - see the test
> results below.
>
> However, we already have a way to transfer migration data using multiple
> connections - that's what multifd channels are.
>
> Unfortunately, multifd channels are currently utilized for RAM transfer
> only.
> This patch set adds a new framework allowing their use for device state
> transfer too.
>
> The wire protocol is based on Avihai's x-channel-header patches, which
> introduce a header for migration channels that allow the migration source
> to explicitly indicate the migration channel type without having the
> target deduce the channel type by peeking in the channel's content.
>
> The new wire protocol can be switch on and off via migration.x-channel-header
> option for compatibility with older QEMU versions and testing.
> Switching the new wire protocol off also disables device state transfer via
> multifd channels.
>
> The device state transfer can happen either via the same multifd channels
> as RAM data is transferred, mixed with RAM data (when
> migration.x-multifd-channels-device-state is 0) or exclusively via
> dedicated device state transfer channels (when
> migration.x-multifd-channels-device-state > 0).
>
> Using dedicated device state transfer multifd channels brings further
> performance benefits since these channels don't need to participate in
> the RAM sync process.
I'm not convinced there's any need to introduce the new "channel header"
protocol messages. The multifd channels already have an initialization
message that is extensible to allow extra semantics to be indicated.
So if we want some of the multifd channels to be reserved for device
state, we could indicate that via some data in the MultiFDInit_t
message struct.
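For example, something along these lines could work (an illustrative
sketch only - today's MultiFDInit_t keeps this space reserved):
/* Sketch: repurpose a reserved byte of the existing per-channel init
 * message to mark the channel's intended payload type. */
#define MULTIFD_INIT_FLAG_DEVICE_STATE 0x01
typedef struct {
    uint32_t magic;
    uint32_t version;
    unsigned char uuid[16]; /* QemuUUID */
    uint8_t id;
    uint8_t flags;          /* carved out of the reserved unused1[0] */
    uint8_t unused1[6];     /* Reserved for future use */
    uint64_t unused2[4];    /* Reserved for future use */
} __attribute__((packed)) MultiFDInit_t;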
That said, the idea of reserving channels specifically for VFIO doesn't
make a whole lot of sense to me either.
Once we've done the RAM transfer, and are in the switchover phase
doing device state transfer, all the multifd channels are idle.
We should just use all those channels to transfer the device state,
in parallel. Reserving channels just guarantees many idle channels
during RAM transfer, and further idle channels during vmstate
transfer.
IMHO it is more flexible to just use all available multifd channel
resources all the time. Again the 'MultiFDPacket_t' struct has
both 'flags' and unused fields, so it is extensible to indicate
that it is being used for new types of data.
With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH RFC 00/26] Multifd device state transfer support with VFIO consumer
2024-04-17 8:36 ` [PATCH RFC 00/26] Multifd device state transfer support with VFIO consumer Daniel P. Berrangé
@ 2024-04-17 12:11 ` Maciej S. Szmigiero
2024-04-17 16:35 ` Daniel P. Berrangé
0 siblings, 1 reply; 54+ messages in thread
From: Maciej S. Szmigiero @ 2024-04-17 12:11 UTC (permalink / raw)
To: Daniel P. Berrangé
Cc: Peter Xu, Fabiano Rosas, Alex Williamson, Cédric Le Goater,
Eric Blake, Markus Armbruster, Avihai Horon, Joao Martins,
qemu-devel
On 17.04.2024 10:36, Daniel P. Berrangé wrote:
> On Tue, Apr 16, 2024 at 04:42:39PM +0200, Maciej S. Szmigiero wrote:
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> VFIO device state transfer is currently done via the main migration channel.
>> This means that transfers from multiple VFIO devices are done sequentially
>> and via just a single common migration channel.
>>
>> Such way of transferring VFIO device state migration data reduces
>> performance and severally impacts the migration downtime (~50%) for VMs
>> that have multiple such devices with large state size - see the test
>> results below.
>>
>> However, we already have a way to transfer migration data using multiple
>> connections - that's what multifd channels are.
>>
>> Unfortunately, multifd channels are currently utilized for RAM transfer
>> only.
>> This patch set adds a new framework allowing their use for device state
>> transfer too.
>>
>> The wire protocol is based on Avihai's x-channel-header patches, which
>> introduce a header for migration channels that allow the migration source
>> to explicitly indicate the migration channel type without having the
>> target deduce the channel type by peeking in the channel's content.
>>
>> The new wire protocol can be switch on and off via migration.x-channel-header
>> option for compatibility with older QEMU versions and testing.
>> Switching the new wire protocol off also disables device state transfer via
>> multifd channels.
>>
>> The device state transfer can happen either via the same multifd channels
>> as RAM data is transferred, mixed with RAM data (when
>> migration.x-multifd-channels-device-state is 0) or exclusively via
>> dedicated device state transfer channels (when
>> migration.x-multifd-channels-device-state > 0).
>>
>> Using dedicated device state transfer multifd channels brings further
>> performance benefits since these channels don't need to participate in
>> the RAM sync process.
>
> I'm not convinced there's any need to introduce the new "channel header"
> protocol messages. The multifd channels already have an initialization
> message that is extensible to allow extra semantics to be indicated.
> So if we want some of the multifd channels to be reserved for device
> state, we could indicate that via some data in the MultiFDInit_t
> message struct.
The reason for introducing x-channel-header was to avoid having to deduce
the channel type by peeking in the channel's content - where any channel
that does not start with QEMU_VM_FILE_MAGIC is currently treated as a
multifd one.
But if this isn't desired then, as you say, the multifd channel type can
be indicated by using some unused field of the MultiFDInit_t message.
Of course, this would still keep the QEMU_VM_FILE_MAGIC heuristic then.
> That said, the idea of reserving channels specifically for VFIO doesn't
> make a whole lot of sense to me either.
>
> Once we've done the RAM transfer, and are in the switchover phase
> doing device state transfer, all the multifd channels are idle.
> We should just use all those channels to transfer the device state,
> in parallel. Reserving channels just guarantees many idle channels
> during RAM transfer, and further idle channels during vmstate
> transfer.
>
> IMHO it is more flexible to just use all available multifd channel
> resources all the time.
The reason for having dedicated device state channels is that they
provide lower downtime in my tests.
With either 15 or 11 mixed multifd channels (no dedicated device state
channels) I get a downtime of about 1250 msec.
Comparing that with 15 total multifd channels / 4 dedicated device
state channels, which give a downtime of about 1100 ms, using
dedicated channels brings about a 14% downtime improvement.
> Again the 'MultiFDPacket_t' struct has
> both 'flags' and unused fields, so it is extensible to indicate
> that it is being used for new types of data.
Yeah, that's what MULTIFD_FLAG_DEVICE_STATE in the packet header already
does in this patch set - it indicates that the packet contains device
state, not RAM data.
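Schematically, the receive side then branches on those flags (a sketch
only, simplified from the actual patches):
/* Sketch: a multifd recv thread dispatches on the packet header flags */
if (flags & MULTIFD_FLAG_DEVICE_STATE) {
    /* device state payload: reassemble by idx and route it to the
     * right device's load_state_buffer handler */
} else {
    /* RAM pages payload: existing multifd receive path */
}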
> With regards,
> Daniel
Best regards,
Maciej
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH RFC 00/26] Multifd device state transfer support with VFIO consumer
2024-04-17 12:11 ` Maciej S. Szmigiero
@ 2024-04-17 16:35 ` Daniel P. Berrangé
2024-04-18 9:50 ` Maciej S. Szmigiero
0 siblings, 1 reply; 54+ messages in thread
From: Daniel P. Berrangé @ 2024-04-17 16:35 UTC (permalink / raw)
To: Maciej S. Szmigiero
Cc: Peter Xu, Fabiano Rosas, Alex Williamson, Cédric Le Goater,
Eric Blake, Markus Armbruster, Avihai Horon, Joao Martins,
qemu-devel
On Wed, Apr 17, 2024 at 02:11:37PM +0200, Maciej S. Szmigiero wrote:
> On 17.04.2024 10:36, Daniel P. Berrangé wrote:
> > On Tue, Apr 16, 2024 at 04:42:39PM +0200, Maciej S. Szmigiero wrote:
> > > From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> > >
> > > VFIO device state transfer is currently done via the main migration channel.
> > > This means that transfers from multiple VFIO devices are done sequentially
> > > and via just a single common migration channel.
> > >
> > > Such way of transferring VFIO device state migration data reduces
> > > performance and severally impacts the migration downtime (~50%) for VMs
> > > that have multiple such devices with large state size - see the test
> > > results below.
> > >
> > > However, we already have a way to transfer migration data using multiple
> > > connections - that's what multifd channels are.
> > >
> > > Unfortunately, multifd channels are currently utilized for RAM transfer
> > > only.
> > > This patch set adds a new framework allowing their use for device state
> > > transfer too.
> > >
> > > The wire protocol is based on Avihai's x-channel-header patches, which
> > > introduce a header for migration channels that allow the migration source
> > > to explicitly indicate the migration channel type without having the
> > > target deduce the channel type by peeking in the channel's content.
> > >
> > > The new wire protocol can be switch on and off via migration.x-channel-header
> > > option for compatibility with older QEMU versions and testing.
> > > Switching the new wire protocol off also disables device state transfer via
> > > multifd channels.
> > >
> > > The device state transfer can happen either via the same multifd channels
> > > as RAM data is transferred, mixed with RAM data (when
> > > migration.x-multifd-channels-device-state is 0) or exclusively via
> > > dedicated device state transfer channels (when
> > > migration.x-multifd-channels-device-state > 0).
> > >
> > > Using dedicated device state transfer multifd channels brings further
> > > performance benefits since these channels don't need to participate in
> > > the RAM sync process.
> >
> > I'm not convinced there's any need to introduce the new "channel header"
> > protocol messages. The multifd channels already have an initialization
> > message that is extensible to allow extra semantics to be indicated.
> > So if we want some of the multifd channels to be reserved for device
> > state, we could indicate that via some data in the MultiFDInit_t
> > message struct.
>
> The reason for introducing x-channel-header was to avoid having to deduce
> the channel type by peeking in the channel's content - where any channel
> that does not start with QEMU_VM_FILE_MAGIC is currently treated as a
> multifd one.
>
> But if this isn't desired then, as you say, the multifd channel type can
> be indicated by using some unused field of the MultiFDInit_t message.
>
> Of course, this would still keep the QEMU_VM_FILE_MAGIC heuristic then.
I don't like the heuristics we currently have, and would like to have
a better solution. What makes me cautious is that this proposal
is a protocol change, but only addressing one very narrow problem
with the migration protocol.
I'd like migration to see a more explicit bi-directional protocol
negotiation message set, where both QEMUs can auto-negotiate amongst
themselves many of the features that currently require tedious
manual configuration by mgmt apps via migrate parameters/capabilities.
That would address the problem you describe here, and so much more.
If we add this channel header feature now, it creates yet another
thing to keep around for backwards compatibility. So if this is not
strictly required, in order to solve the VFIO VMstate problem, I'd
prefer to just solve the VMstate stuff on its own.
> > That said, the idea of reserving channels specifically for VFIO doesn't
> > make a whole lot of sense to me either.
> >
> > Once we've done the RAM transfer, and are in the switchover phase
> > doing device state transfer, all the multifd channels are idle.
> > We should just use all those channels to transfer the device state,
> > in parallel. Reserving channels just guarantees many idle channels
> > during RAM transfer, and further idle channels during vmstate
> > transfer.
> >
> > IMHO it is more flexible to just use all available multifd channel
> > resources all the time.
>
> The reason for having dedicated device state channels is that they
> provide lower downtime in my tests.
>
> With either 15 or 11 mixed multifd channels (no dedicated device state
> channels) I get a downtime of about 1250 msec.
>
> Comparing that with 15 total multifd channels / 4 dedicated device
> state channels, which give a downtime of about 1100 ms, using
> dedicated channels brings about a 14% downtime improvement.
Hmm, can you clarify. /when/ is the VFIO vmstate transfer taking
place ? Is it transferred concurrently with the RAM ? I had thought
this series still has the RAM transfer iterations running first,
and then the VFIO VMstate at the end, simply making use of multifd
channels for parallelism of the end phase. Your reply makes
me question my interpretation though.
Let me try to illustrate channel flow in various scenarios, time
flowing left to right:
1. serialized RAM, then serialized VM state (ie historical migration)
main: | Init | RAM iter 1 | RAM iter 2 | ... | RAM iter N | VM State |
2. parallel RAM, then serialized VM state (ie today's multifd)
main: | Init | | VM state |
multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
3. parallel RAM, then parallel VM state
main: | Init | | VM state |
multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
multifd4: | VFIO VM state |
multifd5: | VFIO VM state |
4. parallel RAM and VFIO VM state, then remaining VM state
main: | Init | | VM state |
multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
multifd4: | VFIO VM state |
multifd5: | VFIO VM state |
I thought this series was implementing approx (3), but are you actually
implementing (4), or something else entirely ?
With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH RFC 00/26] Multifd device state transfer support with VFIO consumer
2024-04-17 16:35 ` Daniel P. Berrangé
@ 2024-04-18 9:50 ` Maciej S. Szmigiero
2024-04-18 10:39 ` Daniel P. Berrangé
0 siblings, 1 reply; 54+ messages in thread
From: Maciej S. Szmigiero @ 2024-04-18 9:50 UTC (permalink / raw)
To: Daniel P. Berrangé
Cc: Peter Xu, Fabiano Rosas, Alex Williamson, Cédric Le Goater,
Eric Blake, Markus Armbruster, Avihai Horon, Joao Martins,
qemu-devel
On 17.04.2024 18:35, Daniel P. Berrangé wrote:
> On Wed, Apr 17, 2024 at 02:11:37PM +0200, Maciej S. Szmigiero wrote:
>> On 17.04.2024 10:36, Daniel P. Berrangé wrote:
>>> On Tue, Apr 16, 2024 at 04:42:39PM +0200, Maciej S. Szmigiero wrote:
>>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>>
>>>> VFIO device state transfer is currently done via the main migration channel.
>>>> This means that transfers from multiple VFIO devices are done sequentially
>>>> and via just a single common migration channel.
>>>>
>>>> Such way of transferring VFIO device state migration data reduces
>>>> performance and severally impacts the migration downtime (~50%) for VMs
>>>> that have multiple such devices with large state size - see the test
>>>> results below.
>>>>
>>>> However, we already have a way to transfer migration data using multiple
>>>> connections - that's what multifd channels are.
>>>>
>>>> Unfortunately, multifd channels are currently utilized for RAM transfer
>>>> only.
>>>> This patch set adds a new framework allowing their use for device state
>>>> transfer too.
>>>>
>>>> The wire protocol is based on Avihai's x-channel-header patches, which
>>>> introduce a header for migration channels that allow the migration source
>>>> to explicitly indicate the migration channel type without having the
>>>> target deduce the channel type by peeking in the channel's content.
>>>>
>>>> The new wire protocol can be switch on and off via migration.x-channel-header
>>>> option for compatibility with older QEMU versions and testing.
>>>> Switching the new wire protocol off also disables device state transfer via
>>>> multifd channels.
>>>>
>>>> The device state transfer can happen either via the same multifd channels
>>>> as RAM data is transferred, mixed with RAM data (when
>>>> migration.x-multifd-channels-device-state is 0) or exclusively via
>>>> dedicated device state transfer channels (when
>>>> migration.x-multifd-channels-device-state > 0).
>>>>
>>>> Using dedicated device state transfer multifd channels brings further
>>>> performance benefits since these channels don't need to participate in
>>>> the RAM sync process.
>>>
>>> I'm not convinced there's any need to introduce the new "channel header"
>>> protocol messages. The multifd channels already have an initialization
>>> message that is extensible to allow extra semantics to be indicated.
>>> So if we want some of the multifd channels to be reserved for device
>>> state, we could indicate that via some data in the MultiFDInit_t
>>> message struct.
>>
>> The reason for introducing x-channel-header was to avoid having to deduce
>> the channel type by peeking in the channel's content - where any channel
>> that does not start with QEMU_VM_FILE_MAGIC is currently treated as a
>> multifd one.
>>
>> But if this isn't desired then, as you say, the multifd channel type can
>> be indicated by using some unused field of the MultiFDInit_t message.
>>
>> Of course, this would still keep the QEMU_VM_FILE_MAGIC heuristic then.
>
> I don't like the heuristics we currently have, and would like to have
> a better solution. What makes me cautious is that this proposal
> is a protocol change, but only addressing one very narrow problem
> with the migration protocol.
>
> I'd like migration to see a more explicit bi-directional protocol
> negotiation message set, where both QEMUs can auto-negotiate amongst
> themselves many of the features that currently require tedious
> manual configuration by mgmt apps via migrate parameters/capabilities.
> That would address the problem you describe here, and so much more.
Isn't the capability negotiation handled automatically by libvirt
today?
I guess you'd prefer for QEMU to internally handle it instead?
> If we add this channel header feature now, it creates yet another
> thing to keep around for backwards compatibility. So if this is not
> strictly required, in order to solve the VFIO VMstate problem, I'd
> prefer to just solve the VMstate stuff on its own.
Okay, got it.
>>> That said, the idea of reserving channels specifically for VFIO doesn't
>>> make a whole lot of sense to me either.
>>>
>>> Once we've done the RAM transfer, and are in the switchover phase
>>> doing device state transfer, all the multifd channels are idle.
>>> We should just use all those channels to transfer the device state,
>>> in parallel. Reserving channels just guarantees many idle channels
>>> during RAM transfer, and further idle channels during vmstate
>>> transfer.
>>>
>>> IMHO it is more flexible to just use all available multifd channel
>>> resources all the time.
>>
>> The reason for having dedicated device state channels is that they
>> provide lower downtime in my tests.
>>
>> With either 15 or 11 mixed multifd channels (no dedicated device state
>> channels) I get a downtime of about 1250 msec.
>>
>> Comparing that with 15 total multifd channels / 4 dedicated device
>> state channels, which give a downtime of about 1100 ms, using
>> dedicated channels brings about a 14% downtime improvement.
>
> Hmm, can you clarify. /when/ is the VFIO vmstate transfer taking
> place ? Is it transferred concurrently with the RAM ? I had thought
> this series still has the RAM transfer iterations running first,
> and then the VFIO VMstate at the end, simply making use of multifd
> channels for parallelism of the end phase. Your reply makes
> me question my interpretation though.
>
> Let me try to illustrate channel flow in various scenarios, time
> flowing left to right:
>
> 1. serialized RAM, then serialized VM state (ie historical migration)
>
> main: | Init | RAM iter 1 | RAM iter 2 | ... | RAM iter N | VM State |
>
>
> 2. parallel RAM, then serialized VM state (ie today's multifd)
>
> main: | Init | | VM state |
> multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
> multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
> multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
>
>
> 3. parallel RAM, then parallel VM state
>
> main: | Init | | VM state |
> multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
> multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
> multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
> multifd4: | VFIO VM state |
> multifd5: | VFIO VM state |
>
>
> 4. parallel RAM and VFIO VM state, then remaining VM state
>
> main: | Init | | VM state |
> multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
> multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
> multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
> multifd4: | VFIO VM state |
> multifd5: | VFIO VM state |
>
>
> I thought this series was implementing approx (3), but are you actually
> implementing (4), or something else entirely ?
You are right that this series is approximately implementing
the scheme described as number 3 in your diagrams.
However, there are some additional details worth mentioning:
* There's a relatively small amount of VFIO data being
transferred from the "save_live_iterate" SaveVMHandler while the VM is
still running.
This is still happening via the main migration channel.
Parallelizing this transfer in the future might make sense too,
although obviously this doesn't impact the downtime.
* After the VM is stopped and downtime starts, the main (~400 MiB)
VFIO device state gets transferred via multifd channels.
However, these multifd channels (if they are not dedicated to device
state transfer) aren't idle during that time.
Rather, they seem to be transferring the residual RAM data.
That's most likely what causes the additional observed downtime
when dedicated device state transfer multifd channels aren't used.
>
> With regards,
> Daniel
Best regards,
Maciej
* Re: [PATCH RFC 00/26] Multifd 🔀 device state transfer support with VFIO consumer
2024-04-18 9:50 ` Maciej S. Szmigiero
@ 2024-04-18 10:39 ` Daniel P. Berrangé
2024-04-18 18:14 ` Maciej S. Szmigiero
0 siblings, 1 reply; 54+ messages in thread
From: Daniel P. Berrangé @ 2024-04-18 10:39 UTC (permalink / raw)
To: Maciej S. Szmigiero
Cc: Peter Xu, Fabiano Rosas, Alex Williamson, Cédric Le Goater,
Eric Blake, Markus Armbruster, Avihai Horon, Joao Martins,
qemu-devel
On Thu, Apr 18, 2024 at 11:50:12AM +0200, Maciej S. Szmigiero wrote:
> On 17.04.2024 18:35, Daniel P. Berrangé wrote:
> > On Wed, Apr 17, 2024 at 02:11:37PM +0200, Maciej S. Szmigiero wrote:
> > > On 17.04.2024 10:36, Daniel P. Berrangé wrote:
> > > > On Tue, Apr 16, 2024 at 04:42:39PM +0200, Maciej S. Szmigiero wrote:
> > > > > From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> > > > >
> > > > > VFIO device state transfer is currently done via the main migration channel.
> > > > > This means that transfers from multiple VFIO devices are done sequentially
> > > > > and via just a single common migration channel.
> > > > >
> > > > > Such a way of transferring VFIO device state migration data reduces
> > > > > performance and severely impacts the migration downtime (~50%) for VMs
> > > > > that have multiple such devices with large state size - see the test
> > > > > results below.
> > > > >
> > > > > However, we already have a way to transfer migration data using multiple
> > > > > connections - that's what multifd channels are.
> > > > >
> > > > > Unfortunately, multifd channels are currently utilized for RAM transfer
> > > > > only.
> > > > > This patch set adds a new framework allowing their use for device state
> > > > > transfer too.
> > > > >
> > > > > The wire protocol is based on Avihai's x-channel-header patches, which
> > > > > introduce a header for migration channels that allow the migration source
> > > > > to explicitly indicate the migration channel type without having the
> > > > > target deduce the channel type by peeking in the channel's content.
> > > > >
> > > > > The new wire protocol can be switched on and off via migration.x-channel-header
> > > > > option for compatibility with older QEMU versions and testing.
> > > > > Switching the new wire protocol off also disables device state transfer via
> > > > > multifd channels.
> > > > >
> > > > > The device state transfer can happen either via the same multifd channels
> > > > > as RAM data is transferred, mixed with RAM data (when
> > > > > migration.x-multifd-channels-device-state is 0) or exclusively via
> > > > > dedicated device state transfer channels (when
> > > > > migration.x-multifd-channels-device-state > 0).
> > > > >
> > > > > Using dedicated device state transfer multifd channels brings further
> > > > > performance benefits since these channels don't need to participate in
> > > > > the RAM sync process.
> > > >
> > > > I'm not convinced there's any need to introduce the new "channel header"
> > > > protocol messages. The multifd channels already have an initialization
> > > > message that is extensible to allow extra semantics to be indicated.
> > > > So if we want some of the multifd channels to be reserved for device
> > > > state, we could indicate that via some data in the MultiFDInit_t
> > > > message struct.
> > >
> > > The reason for introducing x-channel-header was to avoid having to deduce
> > > the channel type by peeking in the channel's content - where any channel
> > > that does not start with QEMU_VM_FILE_MAGIC is currently treated as a
> > > multifd one.
> > >
> > > But if this isn't desired then, as you say, the multifd channel type can
> > > be indicated by using some unused field of the MultiFDInit_t message.
> > >
> > > Of course, this would still keep the QEMU_VM_FILE_MAGIC heuristic then.
> >
> > I don't like the heuristics we currently have, and would like to have
> > a better solution. What makes me cautious is that this proposal
> > is a protocol change, but only addressing one very narrow problem
> > with the migration protocol.
> >
> > I'd like migration to see a more explicit bi-directional protocol
> > negotiation message set, where both QEMUs can auto-negotiate amongst
> > themselves many of the features that currently require tedious
> > manual configuration by mgmt apps via migrate parameters/capabilities.
> > That would address the problem you describe here, and so much more.
>
> Isn't the capability negotiation handled automatically by libvirt
> today?
> I guess you'd prefer for QEMU to internally handle it instead?
Yes, it would be much saner if QEMU handled it automatically as
part of its own protocol handshake. This avoids the need to change
libvirt to enable new functionality in the migration protocol in
many (but not all) cases, and thus speeds up development and deployment
of new features.
Libvirt should really only need to be changed to support runtime
performance tunables, rather than migration protocol features.
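As a purely illustrative sketch of what such an in-band handshake
could carry (nothing like this exists in QEMU today; all names below
are hypothetical), both sides might exchange a feature bitmap before
any migration data flows:

    /* Hypothetical hello message, sent by both source and destination: */
    typedef struct {
        uint32_t magic;          /* identifies the handshake message   */
        uint32_t max_version;    /* highest protocol version supported */
        uint64_t feature_bits;   /* e.g. multifd, zero-copy, device state */
    } MigrationHello;

    /* Each side would then use:
     *   version  = MIN(local.max_version, remote.max_version);
     *   features = local.feature_bits & remote.feature_bits;
     * with no management-app involvement needed for new features. */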
> > > > That said, the idea of reserving channels specifically for VFIO doesn't
> > > > make a whole lot of sense to me either.
> > > >
> > > > Once we've done the RAM transfer, and are in the switchover phase
> > > > doing device state transfer, all the multifd channels are idle.
> > > > We should just use all those channels to transfer the device state,
> > > > in parallel. Reserving channels just guarantees many idle channels
> > > > during RAM transfer, and further idle channels during vmstate
> > > > transfer.
> > > >
> > > > IMHO it is more flexible to just use all available multifd channel
> > > > resources all the time.
> > >
> > > The reason for having dedicated device state channels is that they
> > > provide lower downtime in my tests.
> > >
> > > With either 15 or 11 mixed multifd channels (no dedicated device state
> > > channels) I get a downtime of about 1250 msec.
> > >
> > > Comparing that with 15 total multifd channels / 4 dedicated device
> > > state channels that give downtime of about 1100 ms it means that using
> > > dedicated channels gets about 14% downtime improvement.
> >
> > Hmm, can you clarify: /when/ is the VFIO vmstate transfer taking
> > place? Is it transferred concurrently with the RAM? I had thought
> > this series still has the RAM transfer iterations running first,
> > and then the VFIO VMstate at the end, simply making use of multifd
> > channels for parallelism of the end phase. Your reply makes
> > me question my interpretation though.
> >
> > Let me try to illustrate channel flow in various scenarios, time
> > flowing left to right:
> >
> > 1. serialized RAM, then serialized VM state (ie historical migration)
> >
> > main: | Init | RAM iter 1 | RAM iter 2 | ... | RAM iter N | VM State |
> >
> >
> > 2. parallel RAM, then serialized VM state (ie today's multifd)
> >
> > main: | Init | | VM state |
> > multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
> > multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
> > multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
> >
> >
> > 3. parallel RAM, then parallel VM state
> >
> > main: | Init | | VM state |
> > multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
> > multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
> > multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
> > multifd4: | VFIO VM state |
> > multifd5: | VFIO VM state |
> >
> >
> > 4. parallel RAM and VFIO VM state, then remaining VM state
> >
> > main: | Init | | VM state |
> > multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
> > multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
> > multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
> > multifd4: | VFIO VM state |
> > multifd5: | VFIO VM state |
> >
> >
> > I thought this series was implementing approx (3), but are you actually
> > implementing (4), or something else entirely ?
>
> You are right that this series is approximately implementing
> the scheme described as number 3 in your diagrams.
> However, there are some additional details worth mentioning:
> * There's some but relatively small amount of VFIO data being
> transferred from the "save_live_iterate" SaveVMHandler while the VM is
> still running.
>
> This is still happening via the main migration channel.
> Parallelizing this transfer in the future might make sense too,
> although obviously this doesn't impact the downtime.
>
> * After the VM is stopped and downtime starts the main (~ 400 MiB)
> VFIO device state gets transferred via multifd channels.
>
> However, these multifd channels (if they are not dedicated to device
> state transfer) aren't idle during that time.
> Rather they seem to be transferring the residual RAM data.
>
> That's most likely what causes the additional observed downtime
> when dedicated device state transfer multifd channels aren't used.
Ahh yes, I forgot about the residual dirty RAM, that makes sense as
an explanation. Allow me to work through the scenarios though, as I
still think my suggestion to not have separate dedicated channels is
better....
Let's say hypothetically we have an existing deployment today that
uses 6 multifd channels for RAM, i.e.:
main: | Init | | VM state |
multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
multifd4: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
multifd5: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
multifd6: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
That value of 6 was chosen because it corresponds to the amount
of network & CPU utilization the admin wants to allow for this
VM to migrate. All 6 channels are fully utilized at all times.
If we now want to parallelize VFIO VM state, the peak network
and CPU utilization the admin wants to reserve for the VM should
not change. Thus the admin will still want to configure only 6
channels total.
With your proposal the admin has to reduce RAM transfer to 4 of the
channels, in order to then reserve 2 channels for VFIO VM state, so we
get a flow like:
main: | Init | | VM state |
multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
multifd4: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
multifd5: | VFIO VM state |
multifd6: | VFIO VM state |
This is bad, as it reduces performance of RAM transfer. VFIO VM
state transfer is better, but that's not a net win overall.
So let's say the admin was happy to increase the number of multifd
channels from 6 to 8.
This series proposes that they would leave RAM using 6 channels as
before, and now reserve the 2 extra ones for VFIO VM state:
main: | Init | | VM state |
multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
multifd4: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
multifd5: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
multifd6: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
multifd7: | VFIO VM state |
multifd8: | VFIO VM state |
RAM would perform as well as it did historically, and VM state would
improve due to the 2 parallel channels, and not competing with the
residual RAM transfer.
This is what your latency comparison numbers show as a benefit for
this channel reservation design.
I believe this comparison is inappropriate / unfair though, as it is
comparing a situation with 6 total channels against a situation with
8 total channels.
If the admin was happy to increase the total channels to 8, then they
should allow RAM to use all 8 channels, and then VFIO VM state +
residual RAM to also use the very same set of 8 channels:
main: | Init | | VM state |
multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
multifd4: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
multifd5: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
multifd6: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
multifd7: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
multifd8: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
This will speed up initial RAM iters still further & the final
switchover phase even more. If residual RAM is larger than VFIO VM state,
then it will dominate the switchover latency, so having VFIO VM state
compete is not a problem. If VFIO VM state is larger than residual RAM,
then allowing it access to all 8 channels instead of only 2 channels
will be a clear win.
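To put purely illustrative numbers on that last case: assume a 10 GiB/s
aggregate link split evenly per channel, 1 GiB of residual RAM and
0.8 GiB of VFIO VM state, all perfectly parallelizable. With 8 shared
channels the switchover transfer takes (1 + 0.8) / 10 = 180 ms. With a
6 + 2 split it takes max(1 / 7.5, 0.8 / 2.5) = max(133 ms, 320 ms) =
320 ms, since the two reserved VFIO channels become the bottleneck.
Under these assumptions the shared pool finishes the switchover phase
in roughly half the time.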
With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
* Re: [PATCH RFC 00/26] Multifd 🔀 device state transfer support with VFIO consumer
2024-04-18 10:39 ` Daniel P. Berrangé
@ 2024-04-18 18:14 ` Maciej S. Szmigiero
2024-04-18 20:02 ` Peter Xu
2024-04-19 10:20 ` Daniel P. Berrangé
0 siblings, 2 replies; 54+ messages in thread
From: Maciej S. Szmigiero @ 2024-04-18 18:14 UTC (permalink / raw)
To: Daniel P. Berrangé
Cc: Peter Xu, Fabiano Rosas, Alex Williamson, Cédric Le Goater,
Eric Blake, Markus Armbruster, Avihai Horon, Joao Martins,
qemu-devel
On 18.04.2024 12:39, Daniel P. Berrangé wrote:
> On Thu, Apr 18, 2024 at 11:50:12AM +0200, Maciej S. Szmigiero wrote:
>> On 17.04.2024 18:35, Daniel P. Berrangé wrote:
>>> On Wed, Apr 17, 2024 at 02:11:37PM +0200, Maciej S. Szmigiero wrote:
>>>> On 17.04.2024 10:36, Daniel P. Berrangé wrote:
>>>>> On Tue, Apr 16, 2024 at 04:42:39PM +0200, Maciej S. Szmigiero wrote:
>>>>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
(..)
>>>>> That said, the idea of reserving channels specifically for VFIO doesn't
>>>>> make a whole lot of sense to me either.
>>>>>
>>>>> Once we've done the RAM transfer, and are in the switchover phase
>>>>> doing device state transfer, all the multifd channels are idle.
>>>>> We should just use all those channels to transfer the device state,
>>>>> in parallel. Reserving channels just guarantees many idle channels
>>>>> during RAM transfer, and further idle channels during vmstate
>>>>> transfer.
>>>>>
>>>>> IMHO it is more flexible to just use all available multifd channel
>>>>> resources all the time.
>>>>
>>>> The reason for having dedicated device state channels is that they
>>>> provide lower downtime in my tests.
>>>>
>>>> With either 15 or 11 mixed multifd channels (no dedicated device state
>>>> channels) I get a downtime of about 1250 msec.
>>>>
>>>> Comparing that with 15 total multifd channels / 4 dedicated device
>>>> state channels that give downtime of about 1100 ms it means that using
>>>> dedicated channels gets about 14% downtime improvement.
>>>
>>> Hmm, can you clarify: /when/ is the VFIO vmstate transfer taking
>>> place? Is it transferred concurrently with the RAM? I had thought
>>> this series still has the RAM transfer iterations running first,
>>> and then the VFIO VMstate at the end, simply making use of multifd
>>> channels for parallelism of the end phase. Your reply makes
>>> me question my interpretation though.
>>>
>>> Let me try to illustrate channel flow in various scenarios, time
>>> flowing left to right:
>>>
>>> 1. serialized RAM, then serialized VM state (ie historical migration)
>>>
>>> main: | Init | RAM iter 1 | RAM iter 2 | ... | RAM iter N | VM State |
>>>
>>>
>>> 2. parallel RAM, then serialized VM state (ie today's multifd)
>>>
>>> main: | Init | | VM state |
>>> multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
>>> multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
>>> multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
>>>
>>>
>>> 3. parallel RAM, then parallel VM state
>>>
>>> main: | Init | | VM state |
>>> multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
>>> multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
>>> multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
>>> multifd4: | VFIO VM state |
>>> multifd5: | VFIO VM state |
>>>
>>>
>>> 4. parallel RAM and VFIO VM state, then remaining VM state
>>>
>>> main: | Init | | VM state |
>>> multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
>>> multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
>>> multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
>>> multifd4: | VFIO VM state |
>>> multifd5: | VFIO VM state |
>>>
>>>
>>> I thought this series was implementing approx (3), but are you actually
>>> implementing (4), or something else entirely ?
>>
>> You are right that this series is approximately implementing
>> the scheme described as number 3 in your diagrams.
>
>> However, there are some additional details worth mentioning:
>> * There's some but relatively small amount of VFIO data being
>> transferred from the "save_live_iterate" SaveVMHandler while the VM is
>> still running.
>>
>> This is still happening via the main migration channel.
>> Parallelizing this transfer in the future might make sense too,
>> although obviously this doesn't impact the downtime.
>>
>> * After the VM is stopped and downtime starts the main (~ 400 MiB)
>> VFIO device state gets transferred via multifd channels.
>>
>> However, these multifd channels (if they are not dedicated to device
>> state transfer) aren't idle during that time.
>> Rather they seem to be transferring the residual RAM data.
>>
>> That's most likely what causes the additional observed downtime
>> when dedicated device state transfer multifd channels aren't used.
>
> Ahh yes, I forgot about the residual dirty RAM, that makes sense as
> an explanation. Allow me to work through the scenarios though, as I
> still think my suggestion to not have separate dedicated channels is
> better....
>
>
> Let's say hypothetically we have an existing deployment today that
> uses 6 multifd channels for RAM, i.e.:
>
> main: | Init | | VM state |
> multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> multifd4: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> multifd5: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> multifd6: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>
> That value of 6 was chosen because that corresponds to the amount
> of network & CPU utilization the admin wants to allow, for this
> VM to migrate. All 6 channels are fully utilized at all times.
>
>
> If we now want to parallelize VFIO VM state, the peak network
> and CPU utilization the admin wants to reserve for the VM should
> not change. Thus the admin will still want to configure only 6
> channels total.
>
> With your proposal the admin has to reduce RAM transfer to 4 of the
> channels, in order to then reserve 2 channels for VFIO VM state, so we
> get a flow like:
>
>
> main: | Init | | VM state |
> multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> multifd4: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> multifd5: | VFIO VM state |
> multifd6: | VFIO VM state |
>
> This is bad, as it reduces performance of RAM transfer. VFIO VM
> state transfer is better, but that's not a net win overall.
>
>
>
> So let's say the admin was happy to increase the number of multifd
> channels from 6 to 8.
>
> This series proposes that they would leave RAM using 6 channels as
> before, and now reserve the 2 extra ones for VFIO VM state:
>
> main: | Init | | VM state |
> multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> multifd4: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> multifd5: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> multifd6: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> multifd7: | VFIO VM state |
> multifd8: | VFIO VM state |
>
>
> RAM would perform as well as it did historically, and VM state would
> improve due to the 2 parallel channels, and not competing with the
> residual RAM transfer.
>
> This is what your latency comparison numbers show as a benefit for
> this channel reservation design.
>
> I believe this comparison is inappropriate / unfair though, as it is
> comparing a situation with 6 total channels against a situation with
> 8 total channels.
>
> If the admin was happy to increase the total channels to 8, then they
> should allow RAM to use all 8 channels, and then VFIO VM state +
> residual RAM to also use the very same set of 8 channels:
>
> main: | Init | | VM state |
> multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
> multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
> multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
> multifd4: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
> multifd5: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
> multifd6: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
> multifd7: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
> multifd8: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
>
> This will speed up initial RAM iters still further & the final switch
> over phase even more. If residual RAM is larger than VFIO VM state,
> then it will dominate the switchover latency, so having VFIO VM state
> compete is not a problem. If VFIO VM state is larger than residual RAM,
> then allowing it access to all 8 channels instead of only 2 channels
> will be a clear win.
I re-did the measurement with an increased number of multifd channels,
first to (total count/dedicated count) 25/0, then to 100/0.
The results did not improve:
With 25/0 multifd mixed channels config I still get around 1250 msec
downtime - the same as with 15/0 or 11/0 mixed configs I measured
earlier.
But with the (pretty insane) 100/0 mixed channel config the whole setup
gets so far into the law of diminishing returns that the results actually
get worse: the downtime is now about 1450 msec.
I guess that's from all the extra overhead from switching between 100
multifd channels.
I think one of the reasons for these results is that mixed (RAM + device
state) multifd channels participate in the RAM sync process
(MULTIFD_FLAG_SYNC) whereas device state dedicated channels don't.
It is possible that there are other subtle performance interactions too,
but I am not 100% sure about that.
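For reference, a rough sketch of the distinction I mean (the dedicated
vs. mixed behavior is what this series implements; the snippet below is
only illustrative, not the actual multifd send thread code):

    /* In the multifd send thread's packet loop: */
    if (p->flags & MULTIFD_FLAG_SYNC) {
        /* RAM channels rendezvous with the migration thread here.
         * A channel dedicated to device state never gets packets
         * with MULTIFD_FLAG_SYNC queued, so it never blocks on this
         * path and never delays the sync of the other channels. */
        qemu_sem_post(&p->sem_sync);
    }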
> With regards,
> Daniel
Best regards,
Maciej
* Re: [PATCH RFC 00/26] Multifd 🔀 device state transfer support with VFIO consumer
2024-04-18 18:14 ` Maciej S. Szmigiero
@ 2024-04-18 20:02 ` Peter Xu
2024-04-19 10:07 ` Daniel P. Berrangé
2024-04-23 16:14 ` Maciej S. Szmigiero
2024-04-19 10:20 ` Daniel P. Berrangé
1 sibling, 2 replies; 54+ messages in thread
From: Peter Xu @ 2024-04-18 20:02 UTC (permalink / raw)
To: Maciej S. Szmigiero
Cc: Daniel P. Berrangé, Fabiano Rosas, Alex Williamson,
Cédric Le Goater, Eric Blake, Markus Armbruster,
Avihai Horon, Joao Martins, qemu-devel
On Thu, Apr 18, 2024 at 08:14:15PM +0200, Maciej S. Szmigiero wrote:
> On 18.04.2024 12:39, Daniel P. Berrangé wrote:
> > On Thu, Apr 18, 2024 at 11:50:12AM +0200, Maciej S. Szmigiero wrote:
> > > On 17.04.2024 18:35, Daniel P. Berrangé wrote:
> > > > On Wed, Apr 17, 2024 at 02:11:37PM +0200, Maciej S. Szmigiero wrote:
> > > > > On 17.04.2024 10:36, Daniel P. Berrangé wrote:
> > > > > > On Tue, Apr 16, 2024 at 04:42:39PM +0200, Maciej S. Szmigiero wrote:
> > > > > > > From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> (..)
> > > > > > That said, the idea of reserving channels specifically for VFIO doesn't
> > > > > > make a whole lot of sense to me either.
> > > > > >
> > > > > > Once we've done the RAM transfer, and are in the switchover phase
> > > > > > doing device state transfer, all the multifd channels are idle.
> > > > > > We should just use all those channels to transfer the device state,
> > > > > > in parallel. Reserving channels just guarantees many idle channels
> > > > > > during RAM transfer, and further idle channels during vmstate
> > > > > > transfer.
> > > > > >
> > > > > > IMHO it is more flexible to just use all available multifd channel
> > > > > > resources all the time.
> > > > >
> > > > > The reason for having dedicated device state channels is that they
> > > > > provide lower downtime in my tests.
> > > > >
> > > > > With either 15 or 11 mixed multifd channels (no dedicated device state
> > > > > channels) I get a downtime of about 1250 msec.
> > > > >
> > > > > Comparing that with 15 total multifd channels / 4 dedicated device
> > > > > state channels that give downtime of about 1100 ms it means that using
> > > > > dedicated channels gets about 14% downtime improvement.
> > > >
> > > > Hmm, can you clarify: /when/ is the VFIO vmstate transfer taking
> > > > place? Is it transferred concurrently with the RAM? I had thought
> > > > this series still has the RAM transfer iterations running first,
> > > > and then the VFIO VMstate at the end, simply making use of multifd
> > > > channels for parallelism of the end phase. Your reply makes
> > > > me question my interpretation though.
> > > >
> > > > Let me try to illustrate channel flow in various scenarios, time
> > > > flowing left to right:
> > > >
> > > > 1. serialized RAM, then serialized VM state (ie historical migration)
> > > >
> > > > main: | Init | RAM iter 1 | RAM iter 2 | ... | RAM iter N | VM State |
> > > >
> > > >
> > > > 2. parallel RAM, then serialized VM state (ie today's multifd)
> > > >
> > > > main: | Init | | VM state |
> > > > multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
> > > > multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
> > > > multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
> > > >
> > > >
> > > > 3. parallel RAM, then parallel VM state
> > > >
> > > > main: | Init | | VM state |
> > > > multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
> > > > multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
> > > > multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
> > > > multifd4: | VFIO VM state |
> > > > multifd5: | VFIO VM state |
> > > >
> > > >
> > > > 4. parallel RAM and VFIO VM state, then remaining VM state
> > > >
> > > > main: | Init | | VM state |
> > > > multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
> > > > multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
> > > > multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
> > > > multifd4: | VFIO VM state |
> > > > multifd5: | VFIO VM state |
> > > >
> > > >
> > > > I thought this series was implementing approx (3), but are you actually
> > > > implementing (4), or something else entirely ?
> > >
> > > You are right that this series is approximately implementing
> > > the scheme described as number 3 in your diagrams.
> >
> > > However, there are some additional details worth mentioning:
> > > * There's some but relatively small amount of VFIO data being
> > > transferred from the "save_live_iterate" SaveVMHandler while the VM is
> > > still running.
> > >
> > > This is still happening via the main migration channel.
> > > Parallelizing this transfer in the future might make sense too,
> > > although obviously this doesn't impact the downtime.
> > >
> > > * After the VM is stopped and downtime starts the main (~ 400 MiB)
> > > VFIO device state gets transferred via multifd channels.
> > >
> > > However, these multifd channels (if they are not dedicated to device
> > > state transfer) aren't idle during that time.
> > > Rather they seem to be transferring the residual RAM data.
> > >
> > > That's most likely what causes the additional observed downtime
> > > when dedicated device state transfer multifd channels aren't used.
> >
> > Ahh yes, I forgot about the residual dirty RAM, that makes sense as
> > an explanation. Allow me to work through the scenarios though, as I
> > still think my suggestion to not have separate dedicated channels is
> > better....
> >
> >
> > Let's say hypothetically we have an existing deployment today that
> > uses 6 multifd channels for RAM, i.e.:
> > main: | Init | | VM state |
> > multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> > multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> > multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> > multifd4: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> > multifd5: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> > multifd6: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> >
> > That value of 6 was chosen because that corresponds to the amount
> > of network & CPU utilization the admin wants to allow, for this
> > VM to migrate. All 6 channels are fully utilized at all times.
> >
> >
> > If we now want to parallelize VFIO VM state, the peak network
> > and CPU utilization the admin wants to reserve for the VM should
> > not change. Thus the admin will still want to configure only 6
> > channels total.
> >
> > With your proposal the admin has to reduce RAM transfer to 4 of the
> > channels, in order to then reserve 2 channels for VFIO VM state, so we
> > get a flow like:
> >
> > main: | Init | | VM state |
> > multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> > multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> > multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> > multifd4: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> > multifd5: | VFIO VM state |
> > multifd6: | VFIO VM state |
> >
> > This is bad, as it reduces performance of RAM transfer. VFIO VM
> > state transfer is better, but that's not a net win overall.
> >
> >
> >
> > So let's say the admin was happy to increase the number of multifd
> > channels from 6 to 8.
> >
> > This series proposes that they would leave RAM using 6 channels as
> > before, and now reserve the 2 extra ones for VFIO VM state:
> >
> > main: | Init | | VM state |
> > multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> > multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> > multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> > multifd4: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> > multifd5: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> > multifd6: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> > multifd7: | VFIO VM state |
> > multifd8: | VFIO VM state |
> >
> >
> > RAM would perform as well as it did historically, and VM state would
> > improve due to the 2 parallel channels, and not competing with the
> > residual RAM transfer.
> >
> > This is what your latency comparison numbers show as a benefit for
> > this channel reservation design.
> >
> > I believe this comparison is inappropriate / unfair though, as it is
> > comparing a situation with 6 total channels against a situation with
> > 8 total channels.
> >
> > If the admin was happy to increase the total channels to 8, then they
> > should allow RAM to use all 8 channels, and then VFIO VM state +
> > residual RAM to also use the very same set of 8 channels:
> >
> > main: | Init | | VM state |
> > multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
> > multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
> > multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
> > multifd4: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
> > multifd5: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
> > multifd6: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
> > multifd7: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
> > multifd8: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
> >
> > This will speed up initial RAM iters still further & the final switch
> > over phase even more. If residual RAM is larger than VFIO VM state,
> > then it will dominate the switchover latency, so having VFIO VM state
> > compete is not a problem. If VFIO VM state is larger than residual RAM,
> > then allowing it access to all 8 channels instead of only 2 channels
> > will be a clear win.
>
> I re-did the measurement with an increased number of multifd channels,
> first to (total count/dedicated count) 25/0, then to 100/0.
>
> The results did not improve:
> With 25/0 multifd mixed channels config I still get around 1250 msec
> downtime - the same as with 15/0 or 11/0 mixed configs I measured
> earlier.
>
> But with the (pretty insane) 100/0 mixed channel config the whole setup
> gets so far into the law of diminishing returns that the results actually
> get worse: the downtime is now about 1450 msec.
> I guess that's from all the extra overhead from switching between 100
> multifd channels.
100 threads are probably too many indeed.
However, I agree with the question Dan raised, and I'd like to second it.
So far it looks better if the multifd channels can be managed just like a
pool of workers without assignments to specific jobs. It looks like this
series is already getting there; it's a pity we lose that genericity only
because of some side effects on the RAM sync semantics.
>
> I think one of the reasons for these results is that mixed (RAM + device
> state) multifd channels participate in the RAM sync process
> (MULTIFD_FLAG_SYNC) whereas device state dedicated channels don't.
Firstly, I'm wondering whether we can have better names for these new
hooks. Currently (commenting only on the async* stuff):
- complete_precopy_async
- complete_precopy
- complete_precopy_async_wait
But perhaps better:
- complete_precopy_begin
- complete_precopy
- complete_precopy_end
?
As I don't see why the device must do something with async in such a hook.
To me it's more like you're splitting one process into multiple, so
begin/end sounds more generic.
Then, with that in mind, IIUC we can already split ram_save_complete()
into >1 phases too. For example, I would be curious whether the performance
will go back to normal if we offload multifd_send_sync_main() into
complete_precopy_end(), because we really only need one shot of that, and I
am quite surprised it already greatly affects VFIO dumping its own things.
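Roughly, as an illustrative sketch (not the actual QEMU code; the hook
name follows the proposal above and the call signature is illustrative):

    static int ram_save_complete(QEMUFile *f, void *opaque)
    {
        /* ... flush the remaining dirty pages through multifd ... */
        /* No multifd_send_sync_main() at this point anymore. */
        return 0;
    }

    static int ram_complete_precopy_end(QEMUFile *f, void *opaque)
    {
        /* One sync only, once both residual RAM and the parallel
         * device state have been queued on the channels. */
        return multifd_send_sync_main();
    }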
I would even go one step further, along the lines of what Dan asked: have you thought
about dumping VFIO states via multifd even during iterations? Would that
help even more than this series (which IIUC only helps during the blackout
phase)?
It could mean that the "async*" hooks can be done differently, and I'm not
sure whether they're needed at all, e.g. when threads are created during
save_setup and cleaned up in save_cleanup.
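E.g., a sketch of that alternative lifecycle (the helper names are made
up for illustration):

    static int vfio_save_setup(QEMUFile *f, void *opaque)
    {
        VFIODevice *vbasedev = opaque;

        /* Threads live for the whole migration, not only completion. */
        vfio_spawn_state_sender_threads(vbasedev);   /* hypothetical */
        return 0;
    }

    static void vfio_save_cleanup(void *opaque)
    {
        VFIODevice *vbasedev = opaque;

        vfio_join_state_sender_threads(vbasedev);    /* hypothetical */
    }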
Thanks,
>
> It is possible that there are other subtle performance interactions too,
> but I am not 100% sure about that.
>
> > With regards,
> > Daniel
>
> Best regards,
> Maciej
>
--
Peter Xu
* Re: [PATCH RFC 00/26] Multifd 🔀 device state transfer support with VFIO consumer
2024-04-18 20:02 ` Peter Xu
@ 2024-04-19 10:07 ` Daniel P. Berrangé
2024-04-19 15:31 ` Peter Xu
2024-04-23 16:14 ` Maciej S. Szmigiero
1 sibling, 1 reply; 54+ messages in thread
From: Daniel P. Berrangé @ 2024-04-19 10:07 UTC (permalink / raw)
To: Peter Xu
Cc: Maciej S. Szmigiero, Fabiano Rosas, Alex Williamson,
Cédric Le Goater, Eric Blake, Markus Armbruster,
Avihai Horon, Joao Martins, qemu-devel
On Thu, Apr 18, 2024 at 04:02:49PM -0400, Peter Xu wrote:
> On Thu, Apr 18, 2024 at 08:14:15PM +0200, Maciej S. Szmigiero wrote:
> > I think one of the reasons for these results is that mixed (RAM + device
> > state) multifd channels participate in the RAM sync process
> > (MULTIFD_FLAG_SYNC) whereas device state dedicated channels don't.
>
> Firstly, I'm wondering whether we can have better names for these new
> hooks. Currently (only comment on the async* stuff):
>
> - complete_precopy_async
> - complete_precopy
> - complete_precopy_async_wait
>
> But perhaps better:
>
> - complete_precopy_begin
> - complete_precopy
> - complete_precopy_end
>
> ?
>
> > As I don't see why the device must do something with async in such a hook.
> To me it's more like you're splitting one process into multiple, then
> begin/end sounds more generic.
>
> Then, if with that in mind, IIUC we can already split ram_save_complete()
> into >1 phases too. For example, I would be curious whether the performance
> > will go back to normal if we offload multifd_send_sync_main() into the
> complete_precopy_end(), because we really only need one shot of that, and I
> am quite surprised it already greatly affects VFIO dumping its own things.
>
> I would even ask one step further as what Dan was asking: have you thought
> about dumping VFIO states via multifd even during iterations? Would that
> help even more than this series (which IIUC only helps during the blackout
> phase)?
To dump during RAM iteration, the VFIO device will need to have
dirty tracking and iterate on its state, because the guest CPUs
will still be running, potentially changing VFIO state. That seems
impractical in the general case.
With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
* Re: [PATCH RFC 00/26] Multifd 🔀 device state transfer support with VFIO consumer
2024-04-18 18:14 ` Maciej S. Szmigiero
2024-04-18 20:02 ` Peter Xu
@ 2024-04-19 10:20 ` Daniel P. Berrangé
1 sibling, 0 replies; 54+ messages in thread
From: Daniel P. Berrangé @ 2024-04-19 10:20 UTC (permalink / raw)
To: Maciej S. Szmigiero
Cc: Peter Xu, Fabiano Rosas, Alex Williamson, Cédric Le Goater,
Eric Blake, Markus Armbruster, Avihai Horon, Joao Martins,
qemu-devel
On Thu, Apr 18, 2024 at 08:14:15PM +0200, Maciej S. Szmigiero wrote:
> On 18.04.2024 12:39, Daniel P. Berrangé wrote:
> > On Thu, Apr 18, 2024 at 11:50:12AM +0200, Maciej S. Szmigiero wrote:
> > > On 17.04.2024 18:35, Daniel P. Berrangé wrote:
> > > > On Wed, Apr 17, 2024 at 02:11:37PM +0200, Maciej S. Szmigiero wrote:
> > > > > On 17.04.2024 10:36, Daniel P. Berrangé wrote:
> > > > > > On Tue, Apr 16, 2024 at 04:42:39PM +0200, Maciej S. Szmigiero wrote:
> > > > > > > From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> (..)
> > > > > > That said, the idea of reserving channels specifically for VFIO doesn't
> > > > > > make a whole lot of sense to me either.
> > > > > >
> > > > > > Once we've done the RAM transfer, and are in the switchover phase
> > > > > > doing device state transfer, all the multifd channels are idle.
> > > > > > We should just use all those channels to transfer the device state,
> > > > > > in parallel. Reserving channels just guarantees many idle channels
> > > > > > during RAM transfer, and further idle channels during vmstate
> > > > > > transfer.
> > > > > >
> > > > > > IMHO it is more flexible to just use all available multifd channel
> > > > > > resources all the time.
> > > > >
> > > > > The reason for having dedicated device state channels is that they
> > > > > provide lower downtime in my tests.
> > > > >
> > > > > With either 15 or 11 mixed multifd channels (no dedicated device state
> > > > > channels) I get a downtime of about 1250 msec.
> > > > >
> > > > > Comparing that with 15 total multifd channels / 4 dedicated device
> > > > > state channels that give downtime of about 1100 ms it means that using
> > > > > dedicated channels gets about 14% downtime improvement.
> > > >
> > > > Hmm, can you clarify: /when/ is the VFIO vmstate transfer taking
> > > > place? Is it transferred concurrently with the RAM? I had thought
> > > > this series still has the RAM transfer iterations running first,
> > > > and then the VFIO VMstate at the end, simply making use of multifd
> > > > channels for parallelism of the end phase. Your reply makes
> > > > me question my interpretation though.
> > > >
> > > > Let me try to illustrate channel flow in various scenarios, time
> > > > flowing left to right:
> > > >
> > > > 1. serialized RAM, then serialized VM state (ie historical migration)
> > > >
> > > > main: | Init | RAM iter 1 | RAM iter 2 | ... | RAM iter N | VM State |
> > > >
> > > >
> > > > 2. parallel RAM, then serialized VM state (ie today's multifd)
> > > >
> > > > main: | Init | | VM state |
> > > > multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
> > > > multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
> > > > multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
> > > >
> > > >
> > > > 3. parallel RAM, then parallel VM state
> > > >
> > > > main: | Init | | VM state |
> > > > multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
> > > > multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
> > > > multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
> > > > multifd4: | VFIO VM state |
> > > > multifd5: | VFIO VM state |
> > > >
> > > >
> > > > 4. parallel RAM and VFIO VM state, then remaining VM state
> > > >
> > > > main: | Init | | VM state |
> > > > multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
> > > > multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
> > > > multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
> > > > multifd4: | VFIO VM state |
> > > > multifd5: | VFIO VM state |
> > > >
> > > >
> > > > I thought this series was implementing approx (3), but are you actually
> > > > implementing (4), or something else entirely ?
> > >
> > > You are right that this series is approximately implementing
> > > the scheme described as number 3 in your diagrams.
> >
> > > However, there are some additional details worth mentioning:
> > > * There's some but relatively small amount of VFIO data being
> > > transferred from the "save_live_iterate" SaveVMHandler while the VM is
> > > still running.
> > >
> > > This is still happening via the main migration channel.
> > > Parallelizing this transfer in the future might make sense too,
> > > although obviously this doesn't impact the downtime.
> > >
> > > * After the VM is stopped and downtime starts the main (~ 400 MiB)
> > > VFIO device state gets transferred via multifd channels.
> > >
> > > However, these multifd channels (if they are not dedicated to device
> > > state transfer) aren't idle during that time.
> > > Rather they seem to be transferring the residual RAM data.
> > >
> > > That's most likely what causes the additional observed downtime
> > > when dedicated device state transfer multifd channels aren't used.
> >
> > Ahh yes, I forgot about the residual dirty RAM, that makes sense as
> > an explanation. Allow me to work through the scenarios though, as I
> > still think my suggestion to not have separate dedicated channels is
> > better....
> >
> >
> > Let's say hypothetically we have an existing deployment today that
> > uses 6 multifd channels for RAM, i.e.:
> > main: | Init | | VM state |
> > multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> > multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> > multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> > multifd4: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> > multifd5: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> > multifd6: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> >
> > That value of 6 was chosen because that corresponds to the amount
> > of network & CPU utilization the admin wants to allow, for this
> > VM to migrate. All 6 channels are fully utilized at all times.
> >
> >
> > If we now want to parallelize VFIO VM state, the peak network
> > and CPU utilization the admin wants to reserve for the VM should
> > not change. Thus the admin will still want to configure only 6
> > channels total.
> >
> > With your proposal the admin has to reduce RAM transfer to 4 of the
> > channels, in order to then reserve 2 channels for VFIO VM state, so we
> > get a flow like:
> >
> > main: | Init | | VM state |
> > multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> > multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> > multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> > multifd4: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> > multifd5: | VFIO VM state |
> > multifd6: | VFIO VM state |
> >
> > This is bad, as it reduces performance of RAM transfer. VFIO VM
> > state transfer is better, but that's not a net win overall.
> >
> >
> >
> > So let's say the admin was happy to increase the number of multifd
> > channels from 6 to 8.
> >
> > This series proposes that they would leave RAM using 6 channels as
> > before, and now reserve the 2 extra ones for VFIO VM state:
> >
> > main: | Init | | VM state |
> > multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> > multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> > multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> > multifd4: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> > multifd5: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> > multifd6: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> > multifd7: | VFIO VM state |
> > multifd8: | VFIO VM state |
> >
> >
> > RAM would perform as well as it did historically, and VM state would
> > improve due to the 2 parallel channels, and not competing with the
> > residual RAM transfer.
> >
> > This is what your latency comparison numbers show as a benefit for
> > this channel reservation design.
> >
> > I believe this comparison is inappropriate / unfair though, as it is
> > comparing a situation with 6 total channels against a situation with
> > 8 total channels.
> >
> > If the admin was happy to increase the total channels to 8, then they
> > should allow RAM to use all 8 channels, and then VFIO VM state +
> > residual RAM to also use the very same set of 8 channels:
> >
> > main: | Init | | VM state |
> > multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
> > multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
> > multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
> > multifd4: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
> > multifd5: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
> > multifd6: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
> > multifd7: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
> > multifd8: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
> >
> > This will speed up initial RAM iters still further & the final switch
> > over phase even more. If residual RAM is larger than VFIO VM state,
> > then it will dominate the switchover latency, so having VFIO VM state
> > compete is not a problem. If VFIO VM state is larger than residual RAM,
> > then allowing it access to all 8 channels instead of only 2 channels
> > will be a clear win.
>
> I re-did the measurement with an increased number of multifd channels,
> first to (total count/dedicated count) 25/0, then to 100/0.
>
> The results did not improve:
> With 25/0 multifd mixed channels config I still get around 1250 msec
> downtime - the same as with 15/0 or 11/0 mixed configs I measured
> earlier.
>
> But with the (pretty insane) 100/0 mixed channel config the whole setup
> gets so far into the law of diminishing returns that the results actually
> get worse: the downtime is now about 1450 msec.
> I guess that's from all the extra overhead from switching between 100
> multifd channels.
>
> I think one of the reasons for these results is that mixed (RAM + device
> state) multifd channels participate in the RAM sync process
> (MULTIFD_FLAG_SYNC) whereas device state dedicated channels don't.
Hmm, I wouldn't have expected the sync packets to have a significant
overhead on the wire. Looking at the code though I guess the issue
is that we're blocking I/O in /all/ threads, until all threads have
seen the sync packet.
e.g. in multifd_recv_sync_main():

    /* First wait until every recv thread has seen the sync packet... */
    for (i = 0; i < thread_count; i++) {
        qemu_sem_wait(&multifd_recv_state->sem_sync);
    }
    /* ...then release all of the recv threads again. */
    for (i = 0; i < thread_count; i++) {
        MultiFDRecvParams *p = &multifd_recv_state->params[i];

        qemu_sem_post(&p->sem_sync);
    }
and then in each recv thread:

    /* Signal the main thread, then block until all threads caught up. */
    qemu_sem_post(&multifd_recv_state->sem_sync);
    qemu_sem_wait(&p->sem_sync);
so if any one of the recv threads is slow to receive the sync packet on
the wire, its qemu_sem_post() is delayed, and all the other recv
threads are kept idle until that sync packet arrives.
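E.g. with 8 recv threads, if 7 of them see the sync packet immediately
but the last one's sync packet is stuck behind a large in-flight packet
for 50 ms, all 8 threads sit idle for those 50 ms.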
I'm not sure how much this all matters during the final switchover
phase though. We send syncs at the end of each iteration, and then
after sending the residual RAM. I'm not sure how that orders wrt
sending of the parallel VFIO state.
With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
* Re: [PATCH RFC 00/26] Multifd 🔀 device state transfer support with VFIO consumer
2024-04-19 10:07 ` Daniel P. Berrangé
@ 2024-04-19 15:31 ` Peter Xu
2024-04-23 16:15 ` Maciej S. Szmigiero
0 siblings, 1 reply; 54+ messages in thread
From: Peter Xu @ 2024-04-19 15:31 UTC (permalink / raw)
To: Daniel P. Berrangé
Cc: Maciej S. Szmigiero, Fabiano Rosas, Alex Williamson,
Cédric Le Goater, Eric Blake, Markus Armbruster,
Avihai Horon, Joao Martins, qemu-devel
On Fri, Apr 19, 2024 at 11:07:21AM +0100, Daniel P. Berrangé wrote:
> On Thu, Apr 18, 2024 at 04:02:49PM -0400, Peter Xu wrote:
> > On Thu, Apr 18, 2024 at 08:14:15PM +0200, Maciej S. Szmigiero wrote:
> > > I think one of the reasons for these results is that mixed (RAM + device
> > > state) multifd channels participate in the RAM sync process
> > > (MULTIFD_FLAG_SYNC) whereas device state dedicated channels don't.
> >
> > Firstly, I'm wondering whether we can have better names for these new
> > hooks. Currently (only comment on the async* stuff):
> >
> > - complete_precopy_async
> > - complete_precopy
> > - complete_precopy_async_wait
> >
> > But perhaps better:
> >
> > - complete_precopy_begin
> > - complete_precopy
> > - complete_precopy_end
> >
> > ?
> >
> > As I don't see why the device must do something with async in such a hook.
> > To me it's more like you're splitting one process into multiple, then
> > begin/end sounds more generic.
> >
> > Then, if with that in mind, IIUC we can already split ram_save_complete()
> > into >1 phases too. For example, I would be curious whether the performance
> > will go back to normal if we offload multifd_send_sync_main() into the
> > complete_precopy_end(), because we really only need one shot of that, and I
> > am quite surprised it already greatly affects VFIO dumping its own things.
> >
> > I would even ask one step further as what Dan was asking: have you thought
> > about dumping VFIO states via multifd even during iterations? Would that
> > help even more than this series (which IIUC only helps during the blackout
> > phase)?
>
> To dump during RAM iteration, the VFIO device will need to have
> dirty tracking and iterate on its state, because the guest CPUs
> will still be running potentially changing VFIO state. That seems
> impractical in the general case.
We already do such iterations in vfio_save_iterate()?
My understanding is the recent VFIO work is based on the fact that the VFIO
device can track device state changes more or less (besides being able to
save/load full states). E.g. I still remember that in our QE tests some old
devices reported many more dirty pages than expected during the iterations,
back when we were looking into an issue where a huge amount of dirty pages
was reported. But newer models seem to have fixed that and report much less.
That issue was about GPUs, not NICs, though, and IIUC a major portion of such
tracking used to be for GPU vRAM. So maybe I was mixing these up, and
maybe they work differently.
Thanks,
--
Peter Xu
* Re: [PATCH RFC 00/26] Multifd 🔀 device state transfer support with VFIO consumer
2024-04-18 20:02 ` Peter Xu
2024-04-19 10:07 ` Daniel P. Berrangé
@ 2024-04-23 16:14 ` Maciej S. Szmigiero
2024-04-23 22:27 ` Peter Xu
1 sibling, 1 reply; 54+ messages in thread
From: Maciej S. Szmigiero @ 2024-04-23 16:14 UTC (permalink / raw)
To: Peter Xu
Cc: Daniel P. Berrangé, Fabiano Rosas, Alex Williamson,
Cédric Le Goater, Eric Blake, Markus Armbruster,
Avihai Horon, Joao Martins, qemu-devel
On 18.04.2024 22:02, Peter Xu wrote:
> On Thu, Apr 18, 2024 at 08:14:15PM +0200, Maciej S. Szmigiero wrote:
>> On 18.04.2024 12:39, Daniel P. Berrangé wrote:
>>> On Thu, Apr 18, 2024 at 11:50:12AM +0200, Maciej S. Szmigiero wrote:
>>>> On 17.04.2024 18:35, Daniel P. Berrangé wrote:
>>>>> On Wed, Apr 17, 2024 at 02:11:37PM +0200, Maciej S. Szmigiero wrote:
>>>>>> On 17.04.2024 10:36, Daniel P. Berrangé wrote:
>>>>>>> On Tue, Apr 16, 2024 at 04:42:39PM +0200, Maciej S. Szmigiero wrote:
>>>>>>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>> (..)
>>>>>>> That said, the idea of reserving channels specifically for VFIO doesn't
>>>>>>> make a whole lot of sense to me either.
>>>>>>>
>>>>>>> Once we've done the RAM transfer, and are in the switchover phase
>>>>>>> doing device state transfer, all the multifd channels are idle.
>>>>>>> We should just use all those channels to transfer the device state,
>>>>>>> in parallel. Reserving channels just guarantees many idle channels
>>>>>>> during RAM transfer, and further idle channels during vmstate
>>>>>>> transfer.
>>>>>>>
>>>>>>> IMHO it is more flexible to just use all available multifd channel
>>>>>>> resources all the time.
>>>>>>
>>>>>> The reason for having dedicated device state channels is that they
>>>>>> provide lower downtime in my tests.
>>>>>>
>>>>>> With either 15 or 11 mixed multifd channels (no dedicated device state
>>>>>> channels) I get a downtime of about 1250 msec.
>>>>>>
>>>>>> Comparing that with 15 total multifd channels / 4 dedicated device
>>>>>> state channels that give downtime of about 1100 ms it means that using
>>>>>> dedicated channels gets about 14% downtime improvement.
>>>>>
>>>>> Hmm, can you clarify. /when/ is the VFIO vmstate transfer taking
>>>>> place ? Is is transferred concurrently with the RAM ? I had thought
>>>>> this series still has the RAM transfer iterations running first,
>>>>> and then the VFIO VMstate at the end, simply making use of multifd
>>>>> channels for parallelism of the end phase. your reply though makes
>>>>> me question my interpretation though.
>>>>>
>>>>> Let me try to illustrate channel flow in various scenarios, time
>>>>> flowing left to right:
>>>>>
>>>>> 1. serialized RAM, then serialized VM state (ie historical migration)
>>>>>
>>>>> main: | Init | RAM iter 1 | RAM iter 2 | ... | RAM iter N | VM State |
>>>>>
>>>>>
>>>>> 2. parallel RAM, then serialized VM state (ie today's multifd)
>>>>>
>>>>> main: | Init | | VM state |
>>>>> multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
>>>>> multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
>>>>> multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
>>>>>
>>>>>
>>>>> 3. parallel RAM, then parallel VM state
>>>>>
>>>>> main: | Init | | VM state |
>>>>> multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
>>>>> multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
>>>>> multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
>>>>> multifd4: | VFIO VM state |
>>>>> multifd5: | VFIO VM state |
>>>>>
>>>>>
>>>>> 4. parallel RAM and VFIO VM state, then remaining VM state
>>>>>
>>>>> main: | Init | | VM state |
>>>>> multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
>>>>> multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
>>>>> multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
>>>>> multifd4: | VFIO VM state |
>>>>> multifd5: | VFIO VM state |
>>>>>
>>>>>
>>>>> I thought this series was implementing approx (3), but are you actually
>>>>> implementing (4), or something else entirely ?
>>>>
>>>> You are right that this series operation is approximately implementing
>>>> the schema described as numer 3 in your diagrams.
>>>
>>>> However, there are some additional details worth mentioning:
>>>> * There's some but relatively small amount of VFIO data being
>>>> transferred from the "save_live_iterate" SaveVMHandler while the VM is
>>>> still running.
>>>>
>>>> This is still happening via the main migration channel.
>>>> Parallelizing this transfer in the future might make sense too,
>>>> although obviously this doesn't impact the downtime.
>>>>
>>>> * After the VM is stopped and downtime starts the main (~ 400 MiB)
>>>> VFIO device state gets transferred via multifd channels.
>>>>
>>>> However, these multifd channels (if they are not dedicated to device
>>>> state transfer) aren't idle during that time.
>>>> Rather they seem to be transferring the residual RAM data.
>>>>
>>>> That's most likely what causes the additional observed downtime
>>>> when dedicated device state transfer multifd channels aren't used.
>>>
>>> Ahh yes, I forgot about the residual dirty RAM, that makes sense as
>>> an explanation. Allow me to work through the scenarios though, as I
>>> still think my suggestion to not have separate dedicate channels is
>>> better....
>>>
>>>
>>> Lets say hypothetically we have an existing deployment today that
>>> uses 6 multifd channels for RAM. ie:
>>> main: | Init | | VM state |
>>> multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>>> multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>>> multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>>> multifd4: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>>> multifd5: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>>> multifd6: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>>>
>>> That value of 6 was chosen because that corresponds to the amount
>>> of network & CPU utilization the admin wants to allow, for this
>>> VM to migrate. All 6 channels are fully utilized at all times.
>>>
>>>
>>> If we now want to parallelize VFIO VM state, the peak network
>>> and CPU utilization the admin wants to reserve for the VM should
>>> not change. Thus the admin will still wants to configure only 6
>>> channels total.
>>>
>>> With your proposal the admin has to reduce RAM transfer to 4 of the
>>> channels, in order to then reserve 2 channels for VFIO VM state, so we
>>> get a flow like:
>>>
>>> main: | Init | | VM state |
>>> multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>>> multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>>> multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>>> multifd4: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>>> multifd5: | VFIO VM state |
>>> multifd6: | VFIO VM state |
>>>
>>> This is bad, as it reduces performance of RAM transfer. VFIO VM
>>> state transfer is better, but that's not a net win overall.
>>>
>>>
>>>
>>> So lets say the admin was happy to increase the number of multifd
>>> channels from 6 to 8.
>>>
>>> This series proposes that they would leave RAM using 6 channels as
>>> before, and now reserve the 2 extra ones for VFIO VM state:
>>>
>>> main: | Init | | VM state |
>>> multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>>> multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>>> multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>>> multifd4: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>>> multifd5: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>>> multifd6: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>>> multifd7: | VFIO VM state |
>>> multifd8: | VFIO VM state |
>>>
>>>
>>> RAM would perform as well as it did historically, and VM state would
>>> improve due to the 2 parallel channels, and not competing with the
>>> residual RAM transfer.
>>>
>>> This is what your latency comparison numbers show as a benefit for
>>> this channel reservation design.
>>>
>>> I believe this comparison is inappropriate / unfair though, as it is
>>> comparing a situation with 6 total channels against a situation with
>>> 8 total channels.
>>>
>>> If the admin was happy to increase the total channels to 8, then they
>>> should allow RAM to use all 8 channels, and then VFIO VM state +
>>> residual RAM to also use the very same set of 8 channels:
>>>
>>> main: | Init | | VM state |
>>> multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
>>> multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
>>> multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
>>> multifd4: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
>>> multifd5: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
>>> multifd6: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
>>> multifd7: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
>>> multifd8: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
>>>
>>> This will speed up initial RAM iters still further & the final switch
>>> over phase even more. If residual RAM is larger than VFIO VM state,
>>> then it will dominate the switchover latency, so having VFIO VM state
>>> compete is not a problem. If VFIO VM state is larger than residual RAM,
>>> then allowing it acces to all 8 channels instead of only 2 channels
>>> will be a clear win.
>>
>> I re-did the measurement with an increased number of multifd channels,
>> first to (total count/dedicated count) 25/0, then to 100/0.
>>
>> The results did not improve:
>> With 25/0 multifd mixed channels config I still get around 1250 msec
>> downtime - the same as with 15/0 or 11/0 mixed configs I measured
>> earlier.
>>
>> But with the (pretty insane) 100/0 mixed channel config the whole setup
>> gets so far into the law of diminishing returns that the results actually
>> get worse: the downtime is now about 1450 msec.
>> I guess that's from all the extra overhead from switching between 100
>> multifd channels.
>
> 100 threads are probably too many indeed.
>
> However I agree with the question Dan raised, and I'd like to second it.
> It so far looks better if the multifd channels can be managed just like a
> pool of workers without assignments to specific jobs. It looks like this
> series is already getting there; it's a pity we lose that genericity only
> because of some side effects on the RAM sync semantics.
We don't lose any genericity since by default the transfer is done via
mixed RAM / device state multifd channels from a shared pool.
It's only when x-multifd-channels-device-state is set to a value > 0 that
the requested number of multifd channels gets dedicated to device state.
It could be seen as a fine-tuning option for cases where tests show that
it provides some benefits to the particular workload - just like many
other existing migration options are.
14% downtime improvement is too much to waste - I'm not sure that's only
due to avoiding RAM syncs, it's possible that there are other subtle
performance interactions too.
For even more genericity this option could be named something like
x-multifd-channels-map and contain an array of channel settings like
"ram,ram,ram,device-state,device-state".
Then possible future uses of multifd channels wouldn't even need
a new dedicated option.
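As a rough illustration, parsing such a map could look like the sketch
below (hypothetical: neither the x-multifd-channels-map option nor this
helper exists in the current series):

typedef enum {
    MULTIFD_CHANNEL_RAM,
    MULTIFD_CHANNEL_DEVICE_STATE,
} MultiFDChannelRole;

/* Hypothetical: turn "ram,ram,device-state" into a per-channel role map.
 * Assumes glib and qapi/error.h, as elsewhere in the migration code. */
static MultiFDChannelRole *multifd_parse_channel_map(const char *map,
                                                     size_t *n_channels,
                                                     Error **errp)
{
    g_auto(GStrv) parts = g_strsplit(map, ",", -1);
    size_t n = g_strv_length(parts);
    g_autofree MultiFDChannelRole *roles = NULL;
    size_t i;

    if (n == 0) {
        error_setg(errp, "empty multifd channel map");
        return NULL;
    }

    roles = g_new0(MultiFDChannelRole, n);
    for (i = 0; i < n; i++) {
        if (!strcmp(parts[i], "ram")) {
            roles[i] = MULTIFD_CHANNEL_RAM;
        } else if (!strcmp(parts[i], "device-state")) {
            roles[i] = MULTIFD_CHANNEL_DEVICE_STATE;
        } else {
            error_setg(errp, "unknown multifd channel role '%s'", parts[i]);
            return NULL;
        }
    }

    *n_channels = n;
    return g_steal_pointer(&roles);
}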
>>
>> I think one of the reasons for these results is that mixed (RAM + device
>> state) multifd channels participate in the RAM sync process
>> (MULTIFD_FLAG_SYNC) whereas device state dedicated channels don't.
>
> Firstly, I'm wondering whether we can have better names for these new
> hooks. Currently (only comment on the async* stuff):
>
> - complete_precopy_async
> - complete_precopy
> - complete_precopy_async_wait
>
> But perhaps better:
>
> - complete_precopy_begin
> - complete_precopy
> - complete_precopy_end
>
> ?
>
> As I don't see why the device must do something with async in such hook.
> To me it's more like you're splitting one process into multiple, then
> begin/end sounds more generic.
Ack, I will rename these hooks to begin/end.
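Just so we're talking about the same shape, a sketch of how the renamed
hooks could look in SaveVMHandlers (the begin/end signatures below are my
assumption, modeled on the existing save_live_complete_precopy one):

struct SaveVMHandlers {
    /* ... existing handlers ... */

    int (*save_live_complete_precopy)(QEMUFile *f, void *opaque);

    /* Starts the asynchronous transmission of the remaining device state,
     * for example by launching sender threads. */
    int (*save_live_complete_precopy_begin)(QEMUFile *f,
                                            char *idstr,
                                            uint32_t instance_id,
                                            void *opaque);

    /* Waits until the transmission started by the begin hook has finished,
     * for example until the sender threads have exited. */
    int (*save_live_complete_precopy_end)(QEMUFile *f, void *opaque);
};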
> Then, with that in mind, IIUC we can already split ram_save_complete()
> into >1 phases too. For example, I would be curious whether the performance
> will go back to normal if we offload multifd_send_sync_main() into
> complete_precopy_end(), because we really only need one shot of that, and I
> am quite surprised it already greatly affects VFIO dumping its own things.
AFAIK there's already just one multifd_send_sync_main() during downtime -
the one called from save_live_complete_precopy SaveVMHandler.
In order to truly never interfere with device state transfer the sync would
need to be ordered after the device state transfer is complete - that is,
after VFIO complete_precopy_end (complete_precopy_async_wait) handler
returns.
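I.e. the completion sequence would have to look roughly like the sketch
below (ordering only; the handler iteration is simplified and the hook
names are the begin/end ones proposed above):

/* Sketch: order the single final multifd RAM sync after the device state
 * transfer has fully completed. */
static void complete_precopy_ordered(QEMUFile *f)
{
    SaveStateEntry *se;

    /* 1. Start asynchronous VFIO device state transmission. */
    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
        if (se->ops && se->ops->save_live_complete_precopy_begin) {
            se->ops->save_live_complete_precopy_begin(f, se->idstr,
                                                      se->instance_id,
                                                      se->opaque);
        }
    }

    /* 2. Regular completion hooks; residual RAM goes out here. */
    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
        if (se->ops && se->ops->save_live_complete_precopy) {
            se->ops->save_live_complete_precopy(f, se->opaque);
        }
    }

    /* 3. Wait until the device state sender threads have finished. */
    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
        if (se->ops && se->ops->save_live_complete_precopy_end) {
            se->ops->save_live_complete_precopy_end(f, se->opaque);
        }
    }

    /* 4. Only now issue the one multifd sync (currently done earlier,
     * from within the RAM handler's save_live_complete_precopy). */
    multifd_send_sync_main();
}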
> I would even go one step further, as Dan was asking: have you thought
> about dumping VFIO states via multifd even during iterations? Would that
> help even more than this series (which IIUC only helps during the blackout
> phase)?
>
> It could mean that the "async*" hooks can be done differently, and I'm not
> sure whether they're needed at all, e.g. when threads are created during
> save_setup but cleaned up in save_cleanup.
Responded to this thread in another e-mail message.
> Thanks,
>
Thanks,
Maciej
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH RFC 00/26] Multifd device state transfer support with VFIO consumer
2024-04-19 15:31 ` Peter Xu
@ 2024-04-23 16:15 ` Maciej S. Szmigiero
2024-04-23 22:20 ` Peter Xu
0 siblings, 1 reply; 54+ messages in thread
From: Maciej S. Szmigiero @ 2024-04-23 16:15 UTC (permalink / raw)
To: Peter Xu, Daniel P. Berrangé
Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Avihai Horon, Joao Martins, qemu-devel
On 19.04.2024 17:31, Peter Xu wrote:
> On Fri, Apr 19, 2024 at 11:07:21AM +0100, Daniel P. Berrangé wrote:
>> On Thu, Apr 18, 2024 at 04:02:49PM -0400, Peter Xu wrote:
>>> On Thu, Apr 18, 2024 at 08:14:15PM +0200, Maciej S. Szmigiero wrote:
>>>> I think one of the reasons for these results is that mixed (RAM + device
>>>> state) multifd channels participate in the RAM sync process
>>>> (MULTIFD_FLAG_SYNC) whereas device state dedicated channels don't.
>>>
>>> Firstly, I'm wondering whether we can have better names for these new
>>> hooks. Currently (only comment on the async* stuff):
>>>
>>> - complete_precopy_async
>>> - complete_precopy
>>> - complete_precopy_async_wait
>>>
>>> But perhaps better:
>>>
>>> - complete_precopy_begin
>>> - complete_precopy
>>> - complete_precopy_end
>>>
>>> ?
>>>
>>> As I don't see why the device must do something with async in such hook.
>>> To me it's more like you're splitting one process into multiple, then
>>> begin/end sounds more generic.
>>>
>>> Then, with that in mind, IIUC we can already split ram_save_complete()
>>> into >1 phases too. For example, I would be curious whether the performance
>>> will go back to normal if we offload multifd_send_sync_main() into
>>> complete_precopy_end(), because we really only need one shot of that, and I
>>> am quite surprised it already greatly affects VFIO dumping its own things.
>>>
>>> I would even go one step further, as Dan was asking: have you thought
>>> about dumping VFIO states via multifd even during iterations? Would that
>>> help even more than this series (which IIUC only helps during the blackout
>>> phase)?
>>
>> To dump during RAM iteration, the VFIO device will need to have
>> dirty tracking and iterate on its state, because the guest CPUs
>> will still be running potentially changing VFIO state. That seems
>> impractical in the general case.
>
> We already do such iterations in vfio_save_iterate()?
>
> My understanding is the recent VFIO work is based on the fact that the VFIO
> device can track device state changes more or less (besides being able to
> save/load full states). E.g. I still remember in our QE tests some old
> devices report many more dirty pages than expected during the iterations
> when we were looking into an issue where a huge amount of dirty pages was
> reported. But newer models seem to have fixed that and report much less.
>
> That issue was about GPUs, not NICs, though, and IIUC a major portion of such
> tracking used to be for GPU vRAMs. So maybe I was mixing up these, and
> maybe they work differently.
The device which this series was developed against (Mellanox ConnectX-7)
is already transferring its live state before the VM gets stopped (via
save_live_iterate SaveVMHandler).
It's just that in addition to the live state it has more than 400 MiB
of state that cannot be transferred while the VM is still running.
And that fact hurts a lot with respect to the migration downtime.
AFAIK it's a very similar story for (some) GPUs.
> Thanks,
>
Thanks,
Maciej
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH RFC 00/26] Multifd device state transfer support with VFIO consumer
2024-04-23 16:15 ` Maciej S. Szmigiero
@ 2024-04-23 22:20 ` Peter Xu
2024-04-23 22:25 ` Maciej S. Szmigiero
0 siblings, 1 reply; 54+ messages in thread
From: Peter Xu @ 2024-04-23 22:20 UTC (permalink / raw)
To: Maciej S. Szmigiero
Cc: Daniel P. Berrangé, Fabiano Rosas, Alex Williamson,
Cédric Le Goater, Eric Blake, Markus Armbruster,
Avihai Horon, Joao Martins, qemu-devel
On Tue, Apr 23, 2024 at 06:15:35PM +0200, Maciej S. Szmigiero wrote:
> On 19.04.2024 17:31, Peter Xu wrote:
> > On Fri, Apr 19, 2024 at 11:07:21AM +0100, Daniel P. Berrangé wrote:
> > > On Thu, Apr 18, 2024 at 04:02:49PM -0400, Peter Xu wrote:
> > > > On Thu, Apr 18, 2024 at 08:14:15PM +0200, Maciej S. Szmigiero wrote:
> > > > > I think one of the reasons for these results is that mixed (RAM + device
> > > > > state) multifd channels participate in the RAM sync process
> > > > > (MULTIFD_FLAG_SYNC) whereas device state dedicated channels don't.
> > > >
> > > > Firstly, I'm wondering whether we can have better names for these new
> > > > hooks. Currently (only comment on the async* stuff):
> > > >
> > > > - complete_precopy_async
> > > > - complete_precopy
> > > > - complete_precopy_async_wait
> > > >
> > > > But perhaps better:
> > > >
> > > > - complete_precopy_begin
> > > > - complete_precopy
> > > > - complete_precopy_end
> > > >
> > > > ?
> > > >
> > > > As I don't see why the device must do something with async in such hook.
> > > > To me it's more like you're splitting one process into multiple, then
> > > > begin/end sounds more generic.
> > > >
> > > > Then, with that in mind, IIUC we can already split ram_save_complete()
> > > > into >1 phases too. For example, I would be curious whether the performance
> > > > will go back to normal if we offload multifd_send_sync_main() into
> > > > complete_precopy_end(), because we really only need one shot of that, and I
> > > > am quite surprised it already greatly affects VFIO dumping its own things.
> > > >
> > > > I would even go one step further, as Dan was asking: have you thought
> > > > about dumping VFIO states via multifd even during iterations? Would that
> > > > help even more than this series (which IIUC only helps during the blackout
> > > > phase)?
> > >
> > > To dump during RAM iteration, the VFIO device will need to have
> > > dirty tracking and iterate on its state, because the guest CPUs
> > > will still be running potentially changing VFIO state. That seems
> > > impractical in the general case.
> >
> > We already do such iterations in vfio_save_iterate()?
> >
> > My understanding is the recent VFIO work is based on the fact that the VFIO
> > device can track device state changes more or less (besides being able to
> > save/load full states). E.g. I still remember in our QE tests some old
> > devices report many more dirty pages than expected during the iterations
> > when we were looking into an issue where a huge amount of dirty pages was
> > reported. But newer models seem to have fixed that and report much less.
> >
> > That issue was about GPUs, not NICs, though, and IIUC a major portion of such
> > tracking used to be for GPU vRAMs. So maybe I was mixing up these, and
> > maybe they work differently.
>
> The device which this series was developed against (Mellanox ConnectX-7)
> is already transferring its live state before the VM gets stopped (via
> save_live_iterate SaveVMHandler).
>
> It's just that in addition to the live state it has more than 400 MiB
> of state that cannot be transferred while the VM is still running.
> And that fact hurts a lot with respect to the migration downtime.
>
> AFAIK it's a very similar story for (some) GPUs.
So during the iteration phase VFIO cannot yet leverage the multifd channels
with this series, am I right?
Is it possible to extend that use case too?
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH RFC 00/26] Multifd device state transfer support with VFIO consumer
2024-04-23 22:20 ` Peter Xu
@ 2024-04-23 22:25 ` Maciej S. Szmigiero
2024-04-23 22:35 ` Peter Xu
0 siblings, 1 reply; 54+ messages in thread
From: Maciej S. Szmigiero @ 2024-04-23 22:25 UTC (permalink / raw)
To: Peter Xu
Cc: Daniel P. Berrangé, Fabiano Rosas, Alex Williamson,
Cédric Le Goater, Eric Blake, Markus Armbruster,
Avihai Horon, Joao Martins, qemu-devel
On 24.04.2024 00:20, Peter Xu wrote:
> On Tue, Apr 23, 2024 at 06:15:35PM +0200, Maciej S. Szmigiero wrote:
>> On 19.04.2024 17:31, Peter Xu wrote:
>>> On Fri, Apr 19, 2024 at 11:07:21AM +0100, Daniel P. Berrangé wrote:
>>>> On Thu, Apr 18, 2024 at 04:02:49PM -0400, Peter Xu wrote:
>>>>> On Thu, Apr 18, 2024 at 08:14:15PM +0200, Maciej S. Szmigiero wrote:
>>>>>> I think one of the reasons for these results is that mixed (RAM + device
>>>>>> state) multifd channels participate in the RAM sync process
>>>>>> (MULTIFD_FLAG_SYNC) whereas device state dedicated channels don't.
>>>>>
>>>>> Firstly, I'm wondering whether we can have better names for these new
>>>>> hooks. Currently (only comment on the async* stuff):
>>>>>
>>>>> - complete_precopy_async
>>>>> - complete_precopy
>>>>> - complete_precopy_async_wait
>>>>>
>>>>> But perhaps better:
>>>>>
>>>>> - complete_precopy_begin
>>>>> - complete_precopy
>>>>> - complete_precopy_end
>>>>>
>>>>> ?
>>>>>
>>>>> As I don't see why the device must do something with async in such hook.
>>>>> To me it's more like you're splitting one process into multiple, then
>>>>> begin/end sounds more generic.
>>>>>
>>>>> Then, with that in mind, IIUC we can already split ram_save_complete()
>>>>> into >1 phases too. For example, I would be curious whether the performance
>>>>> will go back to normal if we offload multifd_send_sync_main() into
>>>>> complete_precopy_end(), because we really only need one shot of that, and I
>>>>> am quite surprised it already greatly affects VFIO dumping its own things.
>>>>>
>>>>> I would even go one step further, as Dan was asking: have you thought
>>>>> about dumping VFIO states via multifd even during iterations? Would that
>>>>> help even more than this series (which IIUC only helps during the blackout
>>>>> phase)?
>>>>
>>>> To dump during RAM iteration, the VFIO device will need to have
>>>> dirty tracking and iterate on its state, because the guest CPUs
>>>> will still be running potentially changing VFIO state. That seems
>>>> impractical in the general case.
>>>
>>> We already do such iterations in vfio_save_iterate()?
>>>
>>> My understanding is the recent VFIO work is based on the fact that the VFIO
>>> device can track device state changes more or less (besides being able to
>>> save/load full states). E.g. I still remember in our QE tests some old
>>> devices report many more dirty pages than expected during the iterations
>>> when we were looking into an issue where a huge amount of dirty pages was
>>> reported. But newer models seem to have fixed that and report much less.
>>>
>>> That issue was about GPUs, not NICs, though, and IIUC a major portion of such
>>> tracking used to be for GPU vRAMs. So maybe I was mixing up these, and
>>> maybe they work differently.
>>
>> The device which this series was developed against (Mellanox ConnectX-7)
>> is already transferring its live state before the VM gets stopped (via
>> save_live_iterate SaveVMHandler).
>>
>> It's just that in addition to the live state it has more than 400 MiB
>> of state that cannot be transferred while the VM is still running.
>> And that fact hurts a lot with respect to the migration downtime.
>>
>> AFAIK it's a very similar story for (some) GPUs.
>
> So during the iteration phase VFIO cannot yet leverage the multifd channels
> with this series, am I right?
That's right.
> Is it possible to extend that use case too?
I guess so, but since this phase (iteration while the VM is still
running) doesn't impact downtime it is much less critical.
> Thanks,
>
Thanks,
Maciej
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH RFC 00/26] Multifd device state transfer support with VFIO consumer
2024-04-23 16:14 ` Maciej S. Szmigiero
@ 2024-04-23 22:27 ` Peter Xu
2024-04-26 17:35 ` Maciej S. Szmigiero
0 siblings, 1 reply; 54+ messages in thread
From: Peter Xu @ 2024-04-23 22:27 UTC (permalink / raw)
To: Maciej S. Szmigiero
Cc: Daniel P. Berrangé, Fabiano Rosas, Alex Williamson,
Cédric Le Goater, Eric Blake, Markus Armbruster,
Avihai Horon, Joao Martins, qemu-devel
On Tue, Apr 23, 2024 at 06:14:18PM +0200, Maciej S. Szmigiero wrote:
> We don't lose any genericity since by default the transfer is done via
> mixed RAM / device state multifd channels from a shared pool.
>
> It's only when x-multifd-channels-device-state is set to a value > 0 that
> the requested number of multifd channels gets dedicated to device state.
>
> It could be seen as a fine-tuning option for cases where tests show that
> it provides some benefits to the particular workload - just like many
> other existing migration options are.
>
> 14% downtime improvement is too much to waste - I'm not sure that's only
> due to avoiding RAM syncs, it's possible that there are other subtle
> performance interactions too.
>
> For even more genericity this option could be named something like
> x-multifd-channels-map and contain an array of channel settings like
> "ram,ram,ram,device-state,device-state".
> Then possible future uses of multifd channels wouldn't even need
> a new dedicated option.
Yeah, I understand such an option would only provide more choices.
However, as soon as such an option gets introduced, users will start to do
their own "optimizations" on how to provision the multifd channels, and IMHO
it'll be great if we as developers can be crystal clear on why it needs to
be introduced in the first place, rather than making all channels open to
all purposes.
So I don't think I'm strongly against such a parameter, but I want to double
check we really understand what's behind this to justify it.
Meanwhile I'd always be pretty cautious about introducing any migration
parameters, due to the compatibility nightmares. The fewer parameters the
better...
>
> > >
> > > I think one of the reasons for these results is that mixed (RAM + device
> > > state) multifd channels participate in the RAM sync process
> > > (MULTIFD_FLAG_SYNC) whereas device state dedicated channels don't.
> >
> > Firstly, I'm wondering whether we can have better names for these new
> > hooks. Currently (only comment on the async* stuff):
> >
> > - complete_precopy_async
> > - complete_precopy
> > - complete_precopy_async_wait
> >
> > But perhaps better:
> >
> > - complete_precopy_begin
> > - complete_precopy
> > - complete_precopy_end
> >
> > ?
> >
> > As I don't see why the device must do something with async in such hook.
> > To me it's more like you're splitting one process into multiple, then
> > begin/end sounds more generic.
>
> Ack, I will rename these hooks to begin/end.
>
> > Then, with that in mind, IIUC we can already split ram_save_complete()
> > into >1 phases too. For example, I would be curious whether the performance
> > will go back to normal if we offload multifd_send_sync_main() into
> > complete_precopy_end(), because we really only need one shot of that, and I
> > am quite surprised it already greatly affects VFIO dumping its own things.
>
> AFAIK there's already just one multifd_send_sync_main() during downtime -
> the one called from save_live_complete_precopy SaveVMHandler.
>
> In order to truly never interfere with device state transfer the sync would
> need to be ordered after the device state transfer is complete - that is,
> after VFIO complete_precopy_end (complete_precopy_async_wait) handler
> returns.
Do you think it'll be worthwhile to give it a shot, even if we can't decide
yet on the order of end()s to be called?
It'll be great if we could look into these issues instead of adding
workarounds, and figure out what's behind the performance difference, and
also whether that can be fixed without such a parameter.
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH RFC 00/26] Multifd device state transfer support with VFIO consumer
2024-04-23 22:25 ` Maciej S. Szmigiero
@ 2024-04-23 22:35 ` Peter Xu
2024-04-26 17:34 ` Maciej S. Szmigiero
0 siblings, 1 reply; 54+ messages in thread
From: Peter Xu @ 2024-04-23 22:35 UTC (permalink / raw)
To: Maciej S. Szmigiero
Cc: Daniel P. Berrangé, Fabiano Rosas, Alex Williamson,
Cédric Le Goater, Eric Blake, Markus Armbruster,
Avihai Horon, Joao Martins, qemu-devel
On Wed, Apr 24, 2024 at 12:25:08AM +0200, Maciej S. Szmigiero wrote:
> On 24.04.2024 00:20, Peter Xu wrote:
> > On Tue, Apr 23, 2024 at 06:15:35PM +0200, Maciej S. Szmigiero wrote:
> > > On 19.04.2024 17:31, Peter Xu wrote:
> > > > On Fri, Apr 19, 2024 at 11:07:21AM +0100, Daniel P. Berrangé wrote:
> > > > > On Thu, Apr 18, 2024 at 04:02:49PM -0400, Peter Xu wrote:
> > > > > > On Thu, Apr 18, 2024 at 08:14:15PM +0200, Maciej S. Szmigiero wrote:
> > > > > > > I think one of the reasons for these results is that mixed (RAM + device
> > > > > > > state) multifd channels participate in the RAM sync process
> > > > > > > (MULTIFD_FLAG_SYNC) whereas device state dedicated channels don't.
> > > > > >
> > > > > > Firstly, I'm wondering whether we can have better names for these new
> > > > > > hooks. Currently (only comment on the async* stuff):
> > > > > >
> > > > > > - complete_precopy_async
> > > > > > - complete_precopy
> > > > > > - complete_precopy_async_wait
> > > > > >
> > > > > > But perhaps better:
> > > > > >
> > > > > > - complete_precopy_begin
> > > > > > - complete_precopy
> > > > > > - complete_precopy_end
> > > > > >
> > > > > > ?
> > > > > >
> > > > > > As I don't see why the device must do something with async in such hook.
> > > > > > To me it's more like you're splitting one process into multiple, then
> > > > > > begin/end sounds more generic.
> > > > > >
> > > > > > Then, with that in mind, IIUC we can already split ram_save_complete()
> > > > > > into >1 phases too. For example, I would be curious whether the performance
> > > > > > will go back to normal if we offload multifd_send_sync_main() into
> > > > > > complete_precopy_end(), because we really only need one shot of that, and I
> > > > > > am quite surprised it already greatly affects VFIO dumping its own things.
> > > > > >
> > > > > > I would even go one step further, as Dan was asking: have you thought
> > > > > > about dumping VFIO states via multifd even during iterations? Would that
> > > > > > help even more than this series (which IIUC only helps during the blackout
> > > > > > phase)?
> > > > >
> > > > > To dump during RAM iteration, the VFIO device will need to have
> > > > > dirty tracking and iterate on its state, because the guest CPUs
> > > > > will still be running potentially changing VFIO state. That seems
> > > > > impractical in the general case.
> > > >
> > > > We already do such iterations in vfio_save_iterate()?
> > > >
> > > > My understanding is the recent VFIO work is based on the fact that the VFIO
> > > > device can track device state changes more or less (besides being able to
> > > > save/load full states). E.g. I still remember in our QE tests some old
> > > > devices report many more dirty pages than expected during the iterations
> > > > when we were looking into an issue where a huge amount of dirty pages was
> > > > reported. But newer models seem to have fixed that and report much less.
> > > >
> > > > That issue was about GPUs, not NICs, though, and IIUC a major portion of such
> > > > tracking used to be for GPU vRAMs. So maybe I was mixing up these, and
> > > > maybe they work differently.
> > >
> > > The device which this series was developed against (Mellanox ConnectX-7)
> > > is already transferring its live state before the VM gets stopped (via
> > > save_live_iterate SaveVMHandler).
> > >
> > > It's just that in addition to the live state it has more than 400 MiB
> > > of state that cannot be transferred while the VM is still running.
> > > And that fact hurts a lot with respect to the migration downtime.
> > >
> > > AFAIK it's a very similar story for (some) GPUs.
> >
> > So during the iteration phase VFIO cannot yet leverage the multifd channels
> > with this series, am I right?
>
> That's right.
>
> > Is it possible to extend that use case too?
>
> I guess so, but since this phase (iteration while the VM is still
> running) doesn't impact downtime it is much less critical.
But it affects the bandwidth, e.g. even with multifd enabled, the device
iteration data will still bottleneck at ~15Gbps on a common system setup in
the best case, even if the hosts are 100Gbps direct connected. Would that
be a concern in the future too, or is it a known problem that won't be fixed
anyway?
I remember Avihai used to have a plan to look into similar issues, and I hope
this is exactly what he is looking for. Otherwise changing the migration
protocol from time to time is cumbersome; we always need to provide a flag
to make sure old systems migrate in the old ways, new systems run the new
ways, and for such a relatively major change I'd want to double check on
how far away we are from supporting offload of VFIO iteration data to multifd.
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH RFC 00/26] Multifd device state transfer support with VFIO consumer
2024-04-23 22:35 ` Peter Xu
@ 2024-04-26 17:34 ` Maciej S. Szmigiero
2024-04-29 15:09 ` Peter Xu
0 siblings, 1 reply; 54+ messages in thread
From: Maciej S. Szmigiero @ 2024-04-26 17:34 UTC (permalink / raw)
To: Peter Xu
Cc: Daniel P. Berrangé, Fabiano Rosas, Alex Williamson,
Cédric Le Goater, Eric Blake, Markus Armbruster,
Avihai Horon, Joao Martins, qemu-devel
On 24.04.2024 00:35, Peter Xu wrote:
> On Wed, Apr 24, 2024 at 12:25:08AM +0200, Maciej S. Szmigiero wrote:
>> On 24.04.2024 00:20, Peter Xu wrote:
>>> On Tue, Apr 23, 2024 at 06:15:35PM +0200, Maciej S. Szmigiero wrote:
>>>> On 19.04.2024 17:31, Peter Xu wrote:
>>>>> On Fri, Apr 19, 2024 at 11:07:21AM +0100, Daniel P. Berrangé wrote:
>>>>>> On Thu, Apr 18, 2024 at 04:02:49PM -0400, Peter Xu wrote:
>>>>>>> On Thu, Apr 18, 2024 at 08:14:15PM +0200, Maciej S. Szmigiero wrote:
>>>>>>>> I think one of the reasons for these results is that mixed (RAM + device
>>>>>>>> state) multifd channels participate in the RAM sync process
>>>>>>>> (MULTIFD_FLAG_SYNC) whereas device state dedicated channels don't.
>>>>>>>
>>>>>>> Firstly, I'm wondering whether we can have better names for these new
>>>>>>> hooks. Currently (only comment on the async* stuff):
>>>>>>>
>>>>>>> - complete_precopy_async
>>>>>>> - complete_precopy
>>>>>>> - complete_precopy_async_wait
>>>>>>>
>>>>>>> But perhaps better:
>>>>>>>
>>>>>>> - complete_precopy_begin
>>>>>>> - complete_precopy
>>>>>>> - complete_precopy_end
>>>>>>>
>>>>>>> ?
>>>>>>>
>>>>>>> As I don't see why the device must do something with async in such hook.
>>>>>>> To me it's more like you're splitting one process into multiple, then
>>>>>>> begin/end sounds more generic.
>>>>>>>
>>>>>>> Then, with that in mind, IIUC we can already split ram_save_complete()
>>>>>>> into >1 phases too. For example, I would be curious whether the performance
>>>>>>> will go back to normal if we offload multifd_send_sync_main() into
>>>>>>> complete_precopy_end(), because we really only need one shot of that, and I
>>>>>>> am quite surprised it already greatly affects VFIO dumping its own things.
>>>>>>>
>>>>>>> I would even go one step further, as Dan was asking: have you thought
>>>>>>> about dumping VFIO states via multifd even during iterations? Would that
>>>>>>> help even more than this series (which IIUC only helps during the blackout
>>>>>>> phase)?
>>>>>>
>>>>>> To dump during RAM iteration, the VFIO device will need to have
>>>>>> dirty tracking and iterate on its state, because the guest CPUs
>>>>>> will still be running potentially changing VFIO state. That seems
>>>>>> impractical in the general case.
>>>>>
>>>>> We already do such iterations in vfio_save_iterate()?
>>>>>
>>>>> My understanding is the recent VFIO work is based on the fact that the VFIO
>>>>> device can track device state changes more or less (besides being able to
>>>>> save/load full states). E.g. I still remember in our QE tests some old
>>>>> devices report many more dirty pages than expected during the iterations
>>>>> when we were looking into an issue where a huge amount of dirty pages was
>>>>> reported. But newer models seem to have fixed that and report much less.
>>>>>
>>>>> That issue was about GPUs, not NICs, though, and IIUC a major portion of such
>>>>> tracking used to be for GPU vRAMs. So maybe I was mixing up these, and
>>>>> maybe they work differently.
>>>>
>>>> The device which this series was developed against (Mellanox ConnectX-7)
>>>> is already transferring its live state before the VM gets stopped (via
>>>> save_live_iterate SaveVMHandler).
>>>>
>>>> It's just that in addition to the live state it has more than 400 MiB
>>>> of state that cannot be transferred while the VM is still running.
>>>> And that fact hurts a lot with respect to the migration downtime.
>>>>
>>>> AFAIK it's a very similar story for (some) GPUs.
>>>
>>> So during the iteration phase VFIO cannot yet leverage the multifd channels
>>> with this series, am I right?
>>
>> That's right.
>>
>>> Is it possible to extend that use case too?
>>
>> I guess so, but since this phase (iteration while the VM is still
>> running) doesn't impact downtime it is much less critical.
>
> But it affects the bandwidth, e.g. even with multifd enabled, the device
> iteration data will still bottleneck at ~15Gbps on a common system setup in
> the best case, even if the hosts are 100Gbps direct connected. Would that
> be a concern in the future too, or is it a known problem that won't be fixed
> anyway?
I think any improvements to the migration performance are good, even if
they don't impact downtime.
It's just that this patch set focuses on the downtime phase as the more
critical thing.
After this gets improved there's no reason not to look at improving
performance of the VM live phase too, if that brings sensible improvements.
> I remember Avihai used to have a plan to look into similar issues, and I hope
> this is exactly what he is looking for. Otherwise changing the migration
> protocol from time to time is cumbersome; we always need to provide a flag
> to make sure old systems migrate in the old ways, new systems run the new
> ways, and for such a relatively major change I'd want to double check on
> how far away we are from supporting offload of VFIO iteration data to multifd.
The device state transfer is indicated by a new flag in the multifd
header (MULTIFD_FLAG_DEVICE_STATE).
If we are to use multifd channels for VM live phase transfers, these
could simply re-use the same flag type.
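On the receive side the dispatch could then stay a single flag check,
along the lines of this sketch (only MULTIFD_FLAG_DEVICE_STATE is from
the series; the two helpers are placeholder names):

/* Sketch of per-packet dispatch in a multifd receive thread. */
static int multifd_recv_dispatch(MultiFDRecvParams *p, uint32_t flags)
{
    if (flags & MULTIFD_FLAG_DEVICE_STATE) {
        /* Device state packet: hand it to the device state path
         * (reassembly + device load thread). */
        return multifd_device_state_recv(p);
    }

    /* Anything else is RAM data, handled as today. */
    return multifd_ram_recv(p);
}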
> Thanks,
>
Thanks,
Maciej
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH RFC 00/26] Multifd device state transfer support with VFIO consumer
2024-04-23 22:27 ` Peter Xu
@ 2024-04-26 17:35 ` Maciej S. Szmigiero
2024-04-29 20:34 ` Peter Xu
0 siblings, 1 reply; 54+ messages in thread
From: Maciej S. Szmigiero @ 2024-04-26 17:35 UTC (permalink / raw)
To: Peter Xu
Cc: Daniel P. Berrangé, Fabiano Rosas, Alex Williamson,
Cédric Le Goater, Eric Blake, Markus Armbruster,
Avihai Horon, Joao Martins, qemu-devel
On 24.04.2024 00:27, Peter Xu wrote:
> On Tue, Apr 23, 2024 at 06:14:18PM +0200, Maciej S. Szmigiero wrote:
>> We don't lose any genericity since by default the transfer is done via
>> mixed RAM / device state multifd channels from a shared pool.
>>
>> It's only when x-multifd-channels-device-state is set to a value > 0 that
>> the requested number of multifd channels gets dedicated to device state.
>>
>> It could be seen as a fine-tuning option for cases where tests show that
>> it provides some benefits to the particular workload - just like many
>> other existing migration options are.
>>
>> 14% downtime improvement is too much to waste - I'm not sure that's only
>> due to avoiding RAM syncs, it's possible that there are other subtle
>> performance interactions too.
>>
>> For even more genericity this option could be named something like
>> x-multifd-channels-map and contain an array of channel settings like
>> "ram,ram,ram,device-state,device-state".
>> Then possible future uses of multifd channels wouldn't even need
>> a new dedicated option.
>
> Yeah, I understand such an option would only provide more choices.
>
> However, as soon as such an option gets introduced, users will start to do
> their own "optimizations" on how to provision the multifd channels, and IMHO
> it'll be great if we as developers can be crystal clear on why it needs to
> be introduced in the first place, rather than making all channels open to
> all purposes.
>
> So I don't think I'm strongly against such a parameter, but I want to double
> check we really understand what's behind this to justify it.
> Meanwhile I'd always be pretty cautious about introducing any migration
> parameters, due to the compatibility nightmares. The fewer parameters the
> better...
Ack, I am also curious why dedicated device state multifd channels bring
such a downtime improvement.
>>
>>>>
>>>> I think one of the reasons for these results is that mixed (RAM + device
>>>> state) multifd channels participate in the RAM sync process
>>>> (MULTIFD_FLAG_SYNC) whereas device state dedicated channels don't.
>>>
>>> Firstly, I'm wondering whether we can have better names for these new
>>> hooks. Currently (only comment on the async* stuff):
>>>
>>> - complete_precopy_async
>>> - complete_precopy
>>> - complete_precopy_async_wait
>>>
>>> But perhaps better:
>>>
>>> - complete_precopy_begin
>>> - complete_precopy
>>> - complete_precopy_end
>>>
>>> ?
>>>
>>> As I don't see why the device must do something with async in such hook.
>>> To me it's more like you're splitting one process into multiple, then
>>> begin/end sounds more generic.
>>
>> Ack, I will rename these hooks to begin/end.
>>
>>> Then, with that in mind, IIUC we can already split ram_save_complete()
>>> into >1 phases too. For example, I would be curious whether the performance
>>> will go back to normal if we offload multifd_send_sync_main() into
>>> complete_precopy_end(), because we really only need one shot of that, and I
>>> am quite surprised it already greatly affects VFIO dumping its own things.
>>
>> AFAIK there's already just one multifd_send_sync_main() during downtime -
>> the one called from save_live_complete_precopy SaveVMHandler.
>>
>> In order to truly never interfere with device state transfer the sync would
>> need to be ordered after the device state transfer is complete - that is,
>> after VFIO complete_precopy_end (complete_precopy_async_wait) handler
>> returns.
>
> > Do you think it'll be worthwhile to give it a shot, even if we can't decide
> yet on the order of end()s to be called?
Upon closer inspection it looks like there are, in fact, *two*
RAM syncs done during the downtime - besides the one at the end of
ram_save_complete() there's another one in find_dirty_block(). This function
is called earlier from ram_save_complete() -> ram_find_and_save_block().
Unfortunately, skipping that intermediate sync in find_dirty_block() and
moving the one from the end of ram_save_complete() to another SaveVMHandler
that's called only after VFIO device state transfer doesn't actually
improve downtime (at least not on its own).
> It'll be great if we could look into these issues instead of adding
> workarounds, and figure out what's behind the performance difference, and
> also whether that can be fixed without such a parameter.
I've been looking at this and added some measurements around device state
queuing for submission in multifd_queue_device_state().
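(Roughly like the sketch below; the exact multifd_queue_device_state()
argument list is my assumption here, as is the accounting variable:)

static int64_t total_queue_ns;

/* Sketch of the timing wrapper used for the measurement. */
static int multifd_queue_device_state_timed(char *idstr,
                                            uint32_t instance_id,
                                            char *buf, size_t buf_len)
{
    int64_t t0 = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
    int ret = multifd_queue_device_state(idstr, instance_id, buf, buf_len);

    total_queue_ns += qemu_clock_get_ns(QEMU_CLOCK_REALTIME) - t0;
    return ret;
}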
To my surprise, the mixed RAM / device state config of 15/0 has *much*
lower total queuing time of around 100 msec compared to the dedicated
device state channels 15/4 config with total queuing time of around
300 msec.
Despite that, the 15/4 config still has significantly lower overall
downtime.
This means that the cause of the downtime difference is probably on
the receive / load side of the migration rather than on the save /
send side.
I guess the reason for the lower device state queuing time in the 15/0
case is that this data could be sent via any of the 15 multifd channels
rather than just the 4 dedicated ones in the 15/4 case.
Nevertheless, I will continue to look at this problem to at least find
some explanation for the difference in downtime that dedicated device
state multifd channels bring.
> Thanks,
>
Thanks,
Maciej
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH RFC 00/26] Multifd device state transfer support with VFIO consumer
2024-04-26 17:34 ` Maciej S. Szmigiero
@ 2024-04-29 15:09 ` Peter Xu
2024-05-06 16:26 ` Maciej S. Szmigiero
0 siblings, 1 reply; 54+ messages in thread
From: Peter Xu @ 2024-04-29 15:09 UTC (permalink / raw)
To: Maciej S. Szmigiero
Cc: Daniel P. Berrangé, Fabiano Rosas, Alex Williamson,
Cédric Le Goater, Eric Blake, Markus Armbruster,
Avihai Horon, Joao Martins, qemu-devel
On Fri, Apr 26, 2024 at 07:34:09PM +0200, Maciej S. Szmigiero wrote:
> On 24.04.2024 00:35, Peter Xu wrote:
> > On Wed, Apr 24, 2024 at 12:25:08AM +0200, Maciej S. Szmigiero wrote:
> > > On 24.04.2024 00:20, Peter Xu wrote:
> > > > On Tue, Apr 23, 2024 at 06:15:35PM +0200, Maciej S. Szmigiero wrote:
> > > > > On 19.04.2024 17:31, Peter Xu wrote:
> > > > > > On Fri, Apr 19, 2024 at 11:07:21AM +0100, Daniel P. Berrangé wrote:
> > > > > > > On Thu, Apr 18, 2024 at 04:02:49PM -0400, Peter Xu wrote:
> > > > > > > > On Thu, Apr 18, 2024 at 08:14:15PM +0200, Maciej S. Szmigiero wrote:
> > > > > > > > > I think one of the reasons for these results is that mixed (RAM + device
> > > > > > > > > state) multifd channels participate in the RAM sync process
> > > > > > > > > (MULTIFD_FLAG_SYNC) whereas device state dedicated channels don't.
> > > > > > > >
> > > > > > > > Firstly, I'm wondering whether we can have better names for these new
> > > > > > > > hooks. Currently (only comment on the async* stuff):
> > > > > > > >
> > > > > > > > - complete_precopy_async
> > > > > > > > - complete_precopy
> > > > > > > > - complete_precopy_async_wait
> > > > > > > >
> > > > > > > > But perhaps better:
> > > > > > > >
> > > > > > > > - complete_precopy_begin
> > > > > > > > - complete_precopy
> > > > > > > > - complete_precopy_end
> > > > > > > >
> > > > > > > > ?
> > > > > > > >
> > > > > > > > As I don't see why the device must do something with async in such hook.
> > > > > > > > To me it's more like you're splitting one process into multiple, then
> > > > > > > > begin/end sounds more generic.
> > > > > > > >
> > > > Then, with that in mind, IIUC we can already split ram_save_complete()
> > > > into >1 phases too. For example, I would be curious whether the performance
> > > > will go back to normal if we offload multifd_send_sync_main() into
> > > > complete_precopy_end(), because we really only need one shot of that, and I
> > > > > > > > am quite surprised it already greatly affects VFIO dumping its own things.
> > > > > > > >
> > > > I would even go one step further, as Dan was asking: have you thought
> > > > > > > > about dumping VFIO states via multifd even during iterations? Would that
> > > > > > > > help even more than this series (which IIUC only helps during the blackout
> > > > > > > > phase)?
> > > > > > >
> > > > > > > To dump during RAM iteration, the VFIO device will need to have
> > > > > > > dirty tracking and iterate on its state, because the guest CPUs
> > > > > > > will still be running potentially changing VFIO state. That seems
> > > > > > > impractical in the general case.
> > > > > >
> > > > > > We already do such iterations in vfio_save_iterate()?
> > > > > >
> > > > > > My understanding is the recent VFIO work is based on the fact that the VFIO
> > > > > > device can track device state changes more or less (besides being able to
> > > > > > save/load full states). E.g. I still remember in our QE tests some old
> > > > > > devices report many more dirty pages than expected during the iterations
> > > > > > when we were looking into an issue where a huge amount of dirty pages was
> > > > > > reported. But newer models seem to have fixed that and report much less.
> > > > > >
> > > > > > That issue was about GPUs, not NICs, though, and IIUC a major portion of such
> > > > > > tracking used to be for GPU vRAMs. So maybe I was mixing up these, and
> > > > > > maybe they work differently.
> > > > >
> > > > > The device which this series was developed against (Mellanox ConnectX-7)
> > > > > is already transferring its live state before the VM gets stopped (via
> > > > > save_live_iterate SaveVMHandler).
> > > > >
> > > > > It's just that in addition to the live state it has more than 400 MiB
> > > > > of state that cannot be transferred while the VM is still running.
> > > > > And that fact hurts a lot with respect to the migration downtime.
> > > > >
> > > > > AFAIK it's a very similar story for (some) GPUs.
> > > >
> > > > So during the iteration phase VFIO cannot yet leverage the multifd channels
> > > > with this series, am I right?
> > >
> > > That's right.
> > >
> > > > Is it possible to extend that use case too?
> > >
> > > I guess so, but since this phase (iteration while the VM is still
> > > running) doesn't impact downtime it is much less critical.
> >
> > But it affects the bandwidth, e.g. even with multifd enabled, the device
> > iteration data will still bottleneck at ~15Gbps on a common system setup in
> > the best case, even if the hosts are 100Gbps direct connected. Would that
> > be a concern in the future too, or is it a known problem that won't be fixed
> > anyway?
>
> I think any improvements to the migration performance are good, even if
> they don't impact downtime.
>
> It's just that this patch set focuses on the downtime phase as the more
> critical thing.
>
> After this gets improved there's no reason not to look at improving
> performance of the VM live phase too, if that brings sensible improvements.
>
> I remember Avihai used to have a plan to look into similar issues, and I hope
> this is exactly what he is looking for. Otherwise changing the migration
> protocol from time to time is cumbersome; we always need to provide a flag
> to make sure old systems migrate in the old ways, new systems run the new
> ways, and for such a relatively major change I'd want to double check on
> how far away we are from supporting offload of VFIO iteration data to multifd.
>
> The device state transfer is indicated by a new flag in the multifd
> header (MULTIFD_FLAG_DEVICE_STATE).
>
> If we are to use multifd channels for VM live phase transfers, these
> could simply re-use the same flag type.
Right, and that's also the major purpose of my request to consider both
issues.
If supporting the iteration phase can be easy on top of this, I am thinking
whether we should do this in one shot. The problem is that even if the flag
type can be
reused, old/new qemu binaries may not be compatible and may not migrate
well when:
- The old qemu only supports the downtime optimizations
- The new qemu supports both downtime + iteration optimizations
IIUC, at least the device threads are currently created only at the end of
migration when switching over, for the downtime-only optimization (aka this
series). That means it won't be compatible with a new QEMU, as the
threads there will need to be created before iteration starts in order to take
iteration data. So I believe we'll need yet another flag to tune that
behavior, one for each optimization (downtime vs. data during
iterations). If they work mostly similarly, I want to avoid two flags.
It'll be chaos for users to see such similar flags and they'll be pretty
confusing.
If possible, I wish we could spend some time looking into that since they're so
close, and if it's low-hanging fruit on top of this series, maybe we
can consider doing that in one shot.
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH RFC 23/26] migration/multifd: Device state transfer support - send side
2024-04-16 14:43 ` [PATCH RFC 23/26] migration/multifd: Device state transfer support - send side Maciej S. Szmigiero
@ 2024-04-29 20:04 ` Peter Xu
2024-05-06 16:25 ` Maciej S. Szmigiero
0 siblings, 1 reply; 54+ messages in thread
From: Peter Xu @ 2024-04-29 20:04 UTC (permalink / raw)
To: Maciej S. Szmigiero
Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Avihai Horon, Joao Martins, qemu-devel
On Tue, Apr 16, 2024 at 04:43:02PM +0200, Maciej S. Szmigiero wrote:
> +bool multifd_queue_page(RAMBlock *block, ram_addr_t offset)
> +{
> + g_autoptr(GMutexLocker) locker = NULL;
> +
> + /*
> + * Device state submissions for shared channels can come
> + * from multiple threads and conflict with page submissions
> + * with respect to multifd_send_state access.
> + */
> + if (!multifd_send_state->device_state_dedicated_channels) {
> + locker = g_mutex_locker_new(&multifd_send_state->queue_job_mutex);
Haven't read the rest, but I suggest sticking with QemuMutex for the whole
patchset, as that's what we use in the rest of the migration code, along with
QEMU_LOCK_GUARD().
> + }
> +
> + return multifd_queue_page_locked(block, offset);
> +}
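For example, a sketch of the suggested shape (assuming queue_job_mutex
becomes a QemuMutex; the field and function names are from the patch
above):

bool multifd_queue_page(RAMBlock *block, ram_addr_t offset)
{
    /*
     * With dedicated device state channels there is no cross-thread
     * contention on multifd_send_state, so no locking is needed.
     */
    if (multifd_send_state->device_state_dedicated_channels) {
        return multifd_queue_page_locked(block, offset);
    }

    /* Locks queue_job_mutex until the end of the enclosing scope. */
    QEMU_LOCK_GUARD(&multifd_send_state->queue_job_mutex);
    return multifd_queue_page_locked(block, offset);
}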
--
Peter Xu
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH RFC 00/26] Multifd device state transfer support with VFIO consumer
2024-04-26 17:35 ` Maciej S. Szmigiero
@ 2024-04-29 20:34 ` Peter Xu
0 siblings, 0 replies; 54+ messages in thread
From: Peter Xu @ 2024-04-29 20:34 UTC (permalink / raw)
To: Maciej S. Szmigiero
Cc: Daniel P. Berrangé, Fabiano Rosas, Alex Williamson,
Cédric Le Goater, Eric Blake, Markus Armbruster,
Avihai Horon, Joao Martins, qemu-devel
On Fri, Apr 26, 2024 at 07:35:36PM +0200, Maciej S. Szmigiero wrote:
> On 24.04.2024 00:27, Peter Xu wrote:
> > On Tue, Apr 23, 2024 at 06:14:18PM +0200, Maciej S. Szmigiero wrote:
> > > We don't lose any genericity since by default the transfer is done via
> > > mixed RAM / device state multifd channels from a shared pool.
> > >
> > > It's only when x-multifd-channels-device-state is set to a value > 0 that
> > > the requested number of multifd channels gets dedicated to device state.
> > >
> > > It could be seen as a fine-tuning option for cases where tests show that
> > > it provides some benefits to the particular workload - just like many
> > > other existing migration options are.
> > >
> > > 14% downtime improvement is too much to waste - I'm not sure that's only
> > > due to avoiding RAM syncs, it's possible that there are other subtle
> > > performance interactions too.
> > >
> > > For even more genericity this option could be named something like
> > > x-multifd-channels-map and contain an array of channel settings like
> > > "ram,ram,ram,device-state,device-state".
> > > Then possible future uses of multifd channels wouldn't even need
> > > a new dedicated option.
> >
> > Yeah, I understand such an option would only provide more choices.
> >
> > However, as soon as such an option gets introduced, users will start to do
> > their own "optimizations" on how to provision the multifd channels, and IMHO
> > it'll be great if we as developers can be crystal clear on why it needs to
> > be introduced in the first place, rather than making all channels open to
> > all purposes.
> >
> > So I don't think I'm strongly against such a parameter, but I want to double
> > check we really understand what's behind this to justify it.
> > Meanwhile I'd always be pretty cautious about introducing any migration
> > parameters, due to the compatibility nightmares. The fewer parameters the
> > better...
>
> Ack, I am also curious why dedicated device state multifd channels bring
> such a downtime improvement.
>
> > >
> > > > >
> > > > > I think one of the reasons for these results is that mixed (RAM + device
> > > > > state) multifd channels participate in the RAM sync process
> > > > > (MULTIFD_FLAG_SYNC) whereas device state dedicated channels don't.
> > > >
> > > > Firstly, I'm wondering whether we can have better names for these new
> > > > hooks. Currently (only comment on the async* stuff):
> > > >
> > > > - complete_precopy_async
> > > > - complete_precopy
> > > > - complete_precopy_async_wait
> > > >
> > > > But perhaps better:
> > > >
> > > > - complete_precopy_begin
> > > > - complete_precopy
> > > > - complete_precopy_end
> > > >
> > > > ?
> > > >
> > > > As I don't see why the device must do something asynchronous in such a hook.
> > > > To me it's more like you're splitting one process into multiple, then
> > > > begin/end sounds more generic.
> > >
> > > Ack, I will rename these hooks to begin/end.
> > >
> > > > Then, with that in mind, IIUC we can already split ram_save_complete()
> > > > into more than one phase too. For example, I would be curious whether the
> > > > performance will go back to normal if we offload multifd_send_sync_main()
> > > > into complete_precopy_end(), because we really only need one shot of that,
> > > > and I am quite surprised it already greatly affects VFIO dumping its own
> > > > things.
> > >
> > > AFAIK there's already just one multifd_send_sync_main() during downtime -
> > > the one called from the save_live_complete_precopy SaveVMHandler.
> > >
> > > In order to truly never interfere with device state transfer the sync would
> > > need to be ordered after the device state transfer is complete - that is,
> > > after the VFIO complete_precopy_end (complete_precopy_async_wait) handler
> > > returns.
> >
> > Do you think it'll be worthwhile to give it a shot, even if we can't decide
> > yet on the order in which the end()s are called?
>
> Upon closer inspection it looks like there are, in fact, *two*
> RAM syncs done during the downtime - besides the one at the end of
> ram_save_complete() there's another one in find_dirty_block(). This function
> is called earlier from ram_save_complete() -> ram_find_and_save_block().
Fabiano and I used to discuss this when he was working on the mapped-ram
feature, and AFAIU the flush in complete() is not needed when the other one
exists.
I tried to remove it and at least the qtests all run well:
@@ -3415,10 +3415,6 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
}
}
- if (migrate_multifd() && !migrate_multifd_flush_after_each_section() &&
- !migrate_mapped_ram()) {
- qemu_put_be64(f, RAM_SAVE_FLAG_MULTIFD_FLUSH);
- }
qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
return qemu_fflush(f);
}
>
> Unfortunately, skipping that intermediate sync in find_dirty_block() and
> moving the one from the end of ram_save_complete() to another SaveVMHandler
> that's called only after VFIO device state transfer doesn't actually
> improve downtime (at least not on its own).
>
> > It'll be great if we could look into these issues instead of workarounds,
> > and figure out what's behind the performance difference, and also whether
> > that can be fixed without such a parameter.
>
> I've been looking at this and added some measurements around device state
> queuing for submission in multifd_queue_device_state().
>
> To my surprise, the mixed RAM / device state config of 15/0 has a *much*
> lower total queuing time of around 100 msec compared to the 15/4 config
> with dedicated device state channels, whose total queuing time is around
> 300 msec.
Did that account for device state only, or device + RAM?
I'd expect RAM enqueue time to grow in 15/0 due to the sharing with device
threads.
However, even if so it may not be that fair a comparison, as the CPU
resources aren't equal. It's fairer if we compare 15/0 (mixed) vs. 10/5
(dedicated), for example.
>
> Despite that, the 15/4 config still has significantly lower overall
> downtime.
>
> This means that the reason for the downtime difference is probably on
> the receive / load side of the migration rather than on the save /
> send side.
>
> I guess the reason for the lower device state queuing time in the 15/0
> case is that this data could be sent via any of the 15 multifd channels
> rather than just the 4 dedicated ones in the 15/4 case.
Agree.
>
> Nevertheless, I will continue to look at this problem to at least find
> some explanation for the difference in downtime that dedicated device
> state multifd channels bring.
Thanks for looking at this.
--
Peter Xu
* Re: [PATCH RFC 23/26] migration/multifd: Device state transfer support - send side
2024-04-29 20:04 ` Peter Xu
@ 2024-05-06 16:25 ` Maciej S. Szmigiero
0 siblings, 0 replies; 54+ messages in thread
From: Maciej S. Szmigiero @ 2024-05-06 16:25 UTC (permalink / raw)
To: Peter Xu
Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Avihai Horon, Joao Martins, qemu-devel
On 29.04.2024 22:04, Peter Xu wrote:
> On Tue, Apr 16, 2024 at 04:43:02PM +0200, Maciej S. Szmigiero wrote:
>> +bool multifd_queue_page(RAMBlock *block, ram_addr_t offset)
>> +{
>> + g_autoptr(GMutexLocker) locker = NULL;
>> +
>> + /*
>> + * Device state submissions for shared channels can come
>> + * from multiple threads and conflict with page submissions
>> + * with respect to multifd_send_state access.
>> + */
>> + if (!multifd_send_state->device_state_dedicated_channels) {
>> + locker = g_mutex_locker_new(&multifd_send_state->queue_job_mutex);
>
> Haven't read the rest, but I suggest sticking with QemuMutex for the whole
> patchset, as that's what we use in the rest of the migration code, along with
> QEMU_LOCK_GUARD().
>
Ack, from a quick scan of QEMU thread sync primitives it seems that
QemuMutex with QemuLockable and QemuCond should fulfill the
requirements to replace GMutex, GMutexLocker and GCond.
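A rough sketch of that mapping (illustrative only, not code from this
series; init calls and a placeholder condition are included just to make it
self-contained):

#include "qemu/thread.h"    /* QemuMutex, QemuCond */
#include "qemu/lockable.h"  /* WITH_QEMU_LOCK_GUARD() */

/*
 * GLib                   QEMU equivalent
 * GMutex / g_mutex_*  -> QemuMutex / qemu_mutex_*
 * GCond / g_cond_*    -> QemuCond / qemu_cond_*
 * GMutexLocker        -> QEMU_LOCK_GUARD() / WITH_QEMU_LOCK_GUARD()
 */
static QemuMutex mutex;  /* init via qemu_mutex_init(&mutex) */
static QemuCond cond;    /* init via qemu_cond_init(&cond) */
static bool done;        /* placeholder condition for the sketch */

static void wait_side(void)
{
    WITH_QEMU_LOCK_GUARD(&mutex) {
        while (!done) {
            qemu_cond_wait(&cond, &mutex);
        }
    }
}

static void signal_side(void)
{
    WITH_QEMU_LOCK_GUARD(&mutex) {
        done = true;
        qemu_cond_signal(&cond);
    }
}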
Thanks,
Maciej
* Re: [PATCH RFC 00/26] Multifd 🔀 device state transfer support with VFIO consumer
2024-04-29 15:09 ` Peter Xu
@ 2024-05-06 16:26 ` Maciej S. Szmigiero
2024-05-06 17:56 ` Peter Xu
0 siblings, 1 reply; 54+ messages in thread
From: Maciej S. Szmigiero @ 2024-05-06 16:26 UTC (permalink / raw)
To: Peter Xu
Cc: Daniel P. Berrangé, Fabiano Rosas, Alex Williamson,
Cédric Le Goater, Eric Blake, Markus Armbruster,
Avihai Horon, Joao Martins, qemu-devel
On 29.04.2024 17:09, Peter Xu wrote:
> On Fri, Apr 26, 2024 at 07:34:09PM +0200, Maciej S. Szmigiero wrote:
>> On 24.04.2024 00:35, Peter Xu wrote:
>>> On Wed, Apr 24, 2024 at 12:25:08AM +0200, Maciej S. Szmigiero wrote:
>>>> On 24.04.2024 00:20, Peter Xu wrote:
>>>>> On Tue, Apr 23, 2024 at 06:15:35PM +0200, Maciej S. Szmigiero wrote:
>>>>>> On 19.04.2024 17:31, Peter Xu wrote:
>>>>>>> On Fri, Apr 19, 2024 at 11:07:21AM +0100, Daniel P. Berrangé wrote:
>>>>>>>> On Thu, Apr 18, 2024 at 04:02:49PM -0400, Peter Xu wrote:
>>>>>>>>> On Thu, Apr 18, 2024 at 08:14:15PM +0200, Maciej S. Szmigiero wrote:
>>>>>>>>>> I think one of the reasons for these results is that mixed (RAM + device
>>>>>>>>>> state) multifd channels participate in the RAM sync process
>>>>>>>>>> (MULTIFD_FLAG_SYNC) whereas device state dedicated channels don't.
>>>>>>>>>
>>>>>>>>> Firstly, I'm wondering whether we can have better names for these new
>>>>>>>>> hooks. Currently (only comment on the async* stuff):
>>>>>>>>>
>>>>>>>>> - complete_precopy_async
>>>>>>>>> - complete_precopy
>>>>>>>>> - complete_precopy_async_wait
>>>>>>>>>
>>>>>>>>> But perhaps better:
>>>>>>>>>
>>>>>>>>> - complete_precopy_begin
>>>>>>>>> - complete_precopy
>>>>>>>>> - complete_precopy_end
>>>>>>>>>
>>>>>>>>> ?
>>>>>>>>>
>>>>>>>>> As I don't see why the device must do something asynchronous in such a hook.
>>>>>>>>> To me it's more like you're splitting one process into multiple, then
>>>>>>>>> begin/end sounds more generic.
>>>>>>>>>
>>>>>>>>> Then, with that in mind, IIUC we can already split ram_save_complete()
>>>>>>>>> into more than one phase too. For example, I would be curious whether the
>>>>>>>>> performance will go back to normal if we offload multifd_send_sync_main()
>>>>>>>>> into complete_precopy_end(), because we really only need one shot of that,
>>>>>>>>> and I am quite surprised it already greatly affects VFIO dumping its own
>>>>>>>>> things.
>>>>>>>>>
>>>>>>>>> I would even ask one step further, as Dan was asking: have you thought
>>>>>>>>> about dumping VFIO states via multifd even during iterations? Would that
>>>>>>>>> help even more than this series (which IIUC only helps during the blackout
>>>>>>>>> phase)?
>>>>>>>>
>>>>>>>> To dump during RAM iteration, the VFIO device will need to have
>>>>>>>> dirty tracking and iterate on its state, because the guest CPUs
>>>>>>>> will still be running potentially changing VFIO state. That seems
>>>>>>>> impractical in the general case.
>>>>>>>
>>>>>>> We already do such iterations in vfio_save_iterate()?
>>>>>>>
>>>>>>> My understanding is the recent VFIO work is based on the fact that the VFIO
>>>>>>> device can track device state changes more or less (besides being able to
>>>>>>> save/load full states). E.g. I still remember in our QE tests some old
>>>>>>> devices reported many more dirty pages than expected during the iterations
>>>>>>> when we were looking into an issue where a huge amount of dirty pages was
>>>>>>> reported. But newer models seem to have fixed that and report much less.
>>>>>>>
>>>>>>> That issue was about GPUs, not NICs, though, and IIUC a major portion of such
>>>>>>> tracking used to be for GPU vRAMs. So maybe I was mixing up these, and
>>>>>>> maybe they work differently.
>>>>>>
>>>>>> The device which this series was developed against (Mellanox ConnectX-7)
>>>>>> is already transferring its live state before the VM gets stopped (via
>>>>>> save_live_iterate SaveVMHandler).
>>>>>>
>>>>>> It's just that in addition to the live state it has more than 400 MiB
>>>>>> of state that cannot be transferred while the VM is still running.
>>>>>> And that fact hurts a lot with respect to the migration downtime.
>>>>>>
>>>>>> AFAIK it's a very similar story for (some) GPUs.
>>>>>
>>>>> So during the iteration phase VFIO cannot yet leverage the multifd channels
>>>>> with this series, am I right?
>>>>
>>>> That's right.
>>>>
>>>>> Is it possible to extend that use case too?
>>>>
>>>> I guess so, but since this phase (iteration while the VM is still
>>>> running) doesn't impact downtime it is much less critical.
>>>
>>> But it affects the bandwidth, e.g. even with multifd enabled, the device
>>> iteration data will still bottleneck at ~15Gbps on a common system setup in
>>> the best case, even if the hosts are 100Gbps direct connected. Would that
>>> be a concern in the future too, or is it a known problem that won't be fixed
>>> anyway?
>>
>> I think any improvements to the migration performance are good, even if
>> they don't impact downtime.
>>
>> It's just that this patch set focuses on the downtime phase as the more
>> critical thing.
>>
>> After this gets improved there's no reason not to look at improving the
>> performance of the VM live phase too if it brings sensible improvements.
>>
>>> I remember Avihai used to have a plan to look into similar issues, I hope
>>> this is exactly what he is looking for. Otherwise changing the migration
>>> protocol from time to time is cumbersome; we always need to provide a flag
>>> to make sure old systems migrate in the old ways, new systems run the new
>>> ways, and for such a relatively major change I'd want to double check
>>> how far away we are from supporting offload of VFIO iteration data to multifd.
>>
>> The device state transfer is indicated by a new flag in the multifd
>> header (MULTIFD_FLAG_DEVICE_STATE).
>>
>> If we are to use multifd channels for VM live phase transfers these
>> could simply re-use the same flag type.
>
> Right, and that's also the major purpose of my request to consider both
> issues.
>
> If supporting iterations can be easy on top of this, I am wondering whether
> we should do it in one shot. The problem is even if the flag type can be
> reused, old/new qemu binaries may not be compatible and may not migrate
> well when:
>
> - The old qemu only supports the downtime optimizations
> - The new qemu supports both downtime + iteration optimizations
I think the situation here will be the same as with any new flag
affecting the migration wire protocol - if the old version of QEMU
doesn't support that flag then it has to be kept at its backward-compatible
setting for migration to succeed.
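For instance, that gating sketched in code (the accessor name here is an
assumption for illustration, not necessarily what this series adds):

/*
 * Sketch: the new wire format is only used when the corresponding option
 * is explicitly enabled, so the default stays compatible with old QEMUs.
 */
static void send_channel_prefix_sketch(void)
{
    if (migrate_channel_header()) {
        /* new wire format: send a channel header before the payload */
    } else {
        /* legacy wire format: the destination peeks at channel content */
    }
}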
> IIUC, at least the device threads are currently created only at the end of
> migration when switching over for the downtime-only optimization (aka, this
> series). Then it means it won't be compatible with a new QEMU, as the
> threads there will need to be created before iteration starts in order to
> take iteration data. So I believe we'll need yet another flag to tune that
> behavior, one for each optimization (downtime vs. data during
> iterations). If they work mostly similarly, I want to avoid two flags.
> It'll be chaos for users to see such similar flags and they'll be pretty
> confusing.
The VFIO loading threads are created from vfio_load_setup(), which is
called at the very beginning of the migration, so they should already be
there.
However, they aren't currently prepared to receive VM live phase data.
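For illustration, a sketch of how such a thread gets spawned at setup time
(the VFIOMigration field names below are assumptions for the sketch, not
necessarily the actual code from this series):

/* Sketch: a per-device load thread started early, fed by the receive path */
static void *vfio_load_thread(void *opaque)
{
    VFIOMigration *migration = opaque;

    qemu_mutex_lock(&migration->load_bufs_mutex);
    while (!migration->load_done) {
        /* wait for qemu_loadvm_load_state_buffer() to queue more data */
        qemu_cond_wait(&migration->load_bufs_cond,
                       &migration->load_bufs_mutex);
        /* ... consume queued device state buffers, in order ... */
    }
    qemu_mutex_unlock(&migration->load_bufs_mutex);
    return NULL;
}

static void vfio_spawn_load_thread(VFIOMigration *migration)
{
    /* called from vfio_load_setup(), i.e. before any data arrives */
    qemu_thread_create(&migration->load_thread, "vfio-load",
                       vfio_load_thread, migration, QEMU_THREAD_JOINABLE);
}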
> If possible, I wish we could spend some time looking into that if they're so
> close, and if it's low-hanging fruit on top of this series, maybe we
> can consider doing that in one shot.
I'm still trying to figure out the complete explanation of why dedicated
device state channels improve downtime, as there were a bunch of holidays
here last week.
I will have a look later at what it would take to add VM live phase multifd
device state transfer support and also how invasive it would be, as I
think it's better to keep the number of code conflicts in a patch set
to a manageable size since that reduces the chance of accidentally
introducing regressions when forward-porting the patch set to git master.
> Thanks,
>
Thanks,
Maciej
* Re: [PATCH RFC 00/26] Multifd 🔀 device state transfer support with VFIO consumer
2024-05-06 16:26 ` Maciej S. Szmigiero
@ 2024-05-06 17:56 ` Peter Xu
2024-05-07 8:41 ` Avihai Horon
0 siblings, 1 reply; 54+ messages in thread
From: Peter Xu @ 2024-05-06 17:56 UTC (permalink / raw)
To: Maciej S. Szmigiero
Cc: Daniel P. Berrangé, Fabiano Rosas, Alex Williamson,
Cédric Le Goater, Eric Blake, Markus Armbruster,
Avihai Horon, Joao Martins, qemu-devel
On Mon, May 06, 2024 at 06:26:46PM +0200, Maciej S. Szmigiero wrote:
> On 29.04.2024 17:09, Peter Xu wrote:
> > On Fri, Apr 26, 2024 at 07:34:09PM +0200, Maciej S. Szmigiero wrote:
> > > On 24.04.2024 00:35, Peter Xu wrote:
> > > > On Wed, Apr 24, 2024 at 12:25:08AM +0200, Maciej S. Szmigiero wrote:
> > > > > On 24.04.2024 00:20, Peter Xu wrote:
> > > > > > On Tue, Apr 23, 2024 at 06:15:35PM +0200, Maciej S. Szmigiero wrote:
> > > > > > > On 19.04.2024 17:31, Peter Xu wrote:
> > > > > > > > On Fri, Apr 19, 2024 at 11:07:21AM +0100, Daniel P. Berrangé wrote:
> > > > > > > > > On Thu, Apr 18, 2024 at 04:02:49PM -0400, Peter Xu wrote:
> > > > > > > > > > On Thu, Apr 18, 2024 at 08:14:15PM +0200, Maciej S. Szmigiero wrote:
> > > > > > > > > > > I think one of the reasons for these results is that mixed (RAM + device
> > > > > > > > > > > state) multifd channels participate in the RAM sync process
> > > > > > > > > > > (MULTIFD_FLAG_SYNC) whereas device state dedicated channels don't.
> > > > > > > > > >
> > > > > > > > > > Firstly, I'm wondering whether we can have better names for these new
> > > > > > > > > > hooks. Currently (only comment on the async* stuff):
> > > > > > > > > >
> > > > > > > > > > - complete_precopy_async
> > > > > > > > > > - complete_precopy
> > > > > > > > > > - complete_precopy_async_wait
> > > > > > > > > >
> > > > > > > > > > But perhaps better:
> > > > > > > > > >
> > > > > > > > > > - complete_precopy_begin
> > > > > > > > > > - complete_precopy
> > > > > > > > > > - complete_precopy_end
> > > > > > > > > >
> > > > > > > > > > ?
> > > > > > > > > >
> > > > > > > > > > As I don't see why the device must do something asynchronous in such a hook.
> > > > > > > > > > To me it's more like you're splitting one process into multiple, then
> > > > > > > > > > begin/end sounds more generic.
> > > > > > > > > >
> > > > > > > > > > Then, with that in mind, IIUC we can already split ram_save_complete()
> > > > > > > > > > into more than one phase too. For example, I would be curious whether the
> > > > > > > > > > performance will go back to normal if we offload multifd_send_sync_main()
> > > > > > > > > > into complete_precopy_end(), because we really only need one shot of that,
> > > > > > > > > > and I am quite surprised it already greatly affects VFIO dumping its own
> > > > > > > > > > things.
> > > > > > > > > >
> > > > > > > > > > I would even ask one step further, as Dan was asking: have you thought
> > > > > > > > > > about dumping VFIO states via multifd even during iterations? Would that
> > > > > > > > > > help even more than this series (which IIUC only helps during the blackout
> > > > > > > > > > phase)?
> > > > > > > > >
> > > > > > > > > To dump during RAM iteration, the VFIO device will need to have
> > > > > > > > > dirty tracking and iterate on its state, because the guest CPUs
> > > > > > > > > will still be running potentially changing VFIO state. That seems
> > > > > > > > > impractical in the general case.
> > > > > > > >
> > > > > > > > We already do such iterations in vfio_save_iterate()?
> > > > > > > >
> > > > > > > > My understanding is the recent VFIO work is based on the fact that the VFIO
> > > > > > > > device can track device state changes more or less (besides being able to
> > > > > > > > save/load full states). E.g. I still remember in our QE tests some old
> > > > > > > > devices reported many more dirty pages than expected during the iterations
> > > > > > > > when we were looking into an issue where a huge amount of dirty pages was
> > > > > > > > reported. But newer models seem to have fixed that and report much less.
> > > > > > > >
> > > > > > > > That issue was about GPUs, not NICs, though, and IIUC a major portion of such
> > > > > > > > tracking used to be for GPU vRAMs. So maybe I was mixing up these, and
> > > > > > > > maybe they work differently.
> > > > > > >
> > > > > > > The device which this series was developed against (Mellanox ConnectX-7)
> > > > > > > is already transferring its live state before the VM gets stopped (via
> > > > > > > save_live_iterate SaveVMHandler).
> > > > > > >
> > > > > > > It's just that in addition to the live state it has more than 400 MiB
> > > > > > > of state that cannot be transferred while the VM is still running.
> > > > > > > And that fact hurts a lot with respect to the migration downtime.
> > > > > > >
> > > > > > > AFAIK it's a very similar story for (some) GPUs.
> > > > > >
> > > > > > So during the iteration phase VFIO cannot yet leverage the multifd channels
> > > > > > with this series, am I right?
> > > > >
> > > > > That's right.
> > > > >
> > > > > > Is it possible to extend that use case too?
> > > > >
> > > > > I guess so, but since this phase (iteration while the VM is still
> > > > > running) doesn't impact downtime it is much less critical.
> > > >
> > > > But it affects the bandwidth, e.g. even with multifd enabled, the device
> > > > iteration data will still bottleneck at ~15Gbps on a common system setup in
> > > > the best case, even if the hosts are 100Gbps direct connected. Would that
> > > > be a concern in the future too, or is it a known problem that won't be fixed
> > > > anyway?
> > >
> > > I think any improvements to the migration performance are good, even if
> > > they don't impact downtime.
> > >
> > > It's just that this patch set focuses on the downtime phase as the more
> > > critical thing.
> > >
> > > After this gets improved there's no reason not to look at improving the
> > > performance of the VM live phase too if it brings sensible improvements.
> > >
> > > > I remember Avihai used to have a plan to look into similar issues, I hope
> > > > this is exactly what he is looking for. Otherwise changing the migration
> > > > protocol from time to time is cumbersome; we always need to provide a flag
> > > > to make sure old systems migrate in the old ways, new systems run the new
> > > > ways, and for such a relatively major change I'd want to double check
> > > > how far away we are from supporting offload of VFIO iteration data to multifd.
> > >
> > > The device state transfer is indicated by a new flag in the multifd
> > > header (MULTIFD_FLAG_DEVICE_STATE).
> > >
> > > If we are to use multifd channels for VM live phase transfers these
> > > could simply re-use the same flag type.
> >
> > Right, and that's also the major purpose of my request to consider both
> > issues.
> >
> > If supporting iterations can be easy on top of this, I am wondering whether
> > we should do it in one shot. The problem is even if the flag type can be
> > reused, old/new qemu binaries may not be compatible and may not migrate
> > well when:
> >
> > - The old qemu only supports the downtime optimizations
> > - The new qemu supports both downtime + iteration optimizations
>
> I think the situation here will be the same as with any new flag
> affecting the migration wire protocol - if the old version of QEMU
> doesn't support that flag then it has to be kept at its backward-compatible
> setting for migration to succeed.
>
> > IIUC, at least the device threads are currently created only at the end of
> > migration when switching over for the downtime-only optimization (aka, this
> > series). Then it means it won't be compatible with a new QEMU, as the
> > threads there will need to be created before iteration starts in order to
> > take iteration data. So I believe we'll need yet another flag to tune that
> > behavior, one for each optimization (downtime vs. data during
> > iterations). If they work mostly similarly, I want to avoid two flags.
> > It'll be chaos for users to see such similar flags and they'll be pretty
> > confusing.
>
> The VFIO loading threads are created from vfio_load_setup(), which is
> called at the very beginning of the migration, so they should already be
> there.
>
> However, they aren't currently prepared to receive VM live phase data.
>
> > If possible, I wish we could spend some time looking into that if they're so
> > close, and if it's low-hanging fruit on top of this series, maybe we
> > can consider doing that in one shot.
>
> I'm still trying to figure out the complete explanation of why dedicated
> device state channels improve downtime, as there were a bunch of holidays
> here last week.
No rush. I am not sure whether it'll reduce downtime, but it may improve
total migration time when multiple devices are used.
>
> I will have a look later at what it would take to add VM live phase multifd
> device state transfer support and also how invasive it would be, as I
> think it's better to keep the number of code conflicts in a patch set
> to a manageable size since that reduces the chance of accidentally
> introducing regressions when forward-porting the patch set to git master.
Yes, it makes sense. It'll be good to look one step further in this case,
then:
- If it's easy to add support then we do it in one batch, or
- If it's not easy to add support, but we can find a compatible way so
that the ABI can stay transparent when adding that later, it'll also be nice, or
- If we have a solid clue that it should be a major separate piece of work,
and we really need a new flag, then we at least know we should simply split
the effort due to that complexity.
The hope is that option (1)/(2) works out.
I hope Avihai can also chime in here (or please reach out to him) because I
remember he used to consider proposing such a whole solution, but maybe I
just misunderstood. I suppose there's no harm in checking with him.
Thanks,
--
Peter Xu
* Re: [PATCH RFC 00/26] Multifd 🔀 device state transfer support with VFIO consumer
2024-05-06 17:56 ` Peter Xu
@ 2024-05-07 8:41 ` Avihai Horon
2024-05-07 16:13 ` Peter Xu
0 siblings, 1 reply; 54+ messages in thread
From: Avihai Horon @ 2024-05-07 8:41 UTC (permalink / raw)
To: Peter Xu, Maciej S. Szmigiero
Cc: Daniel P. Berrangé, Fabiano Rosas, Alex Williamson,
Cédric Le Goater, Eric Blake, Markus Armbruster,
Joao Martins, qemu-devel
On 06/05/2024 20:56, Peter Xu wrote:
> On Mon, May 06, 2024 at 06:26:46PM +0200, Maciej S. Szmigiero wrote:
>> On 29.04.2024 17:09, Peter Xu wrote:
>>> On Fri, Apr 26, 2024 at 07:34:09PM +0200, Maciej S. Szmigiero wrote:
>>>> On 24.04.2024 00:35, Peter Xu wrote:
>>>>> On Wed, Apr 24, 2024 at 12:25:08AM +0200, Maciej S. Szmigiero wrote:
>>>>>> On 24.04.2024 00:20, Peter Xu wrote:
>>>>>>> On Tue, Apr 23, 2024 at 06:15:35PM +0200, Maciej S. Szmigiero wrote:
>>>>>>>> On 19.04.2024 17:31, Peter Xu wrote:
>>>>>>>>> On Fri, Apr 19, 2024 at 11:07:21AM +0100, Daniel P. Berrangé wrote:
>>>>>>>>>> On Thu, Apr 18, 2024 at 04:02:49PM -0400, Peter Xu wrote:
>>>>>>>>>>> On Thu, Apr 18, 2024 at 08:14:15PM +0200, Maciej S. Szmigiero wrote:
>>>>>>>>>>>> I think one of the reasons for these results is that mixed (RAM + device
>>>>>>>>>>>> state) multifd channels participate in the RAM sync process
>>>>>>>>>>>> (MULTIFD_FLAG_SYNC) whereas device state dedicated channels don't.
>>>>>>>>>>> Firstly, I'm wondering whether we can have better names for these new
>>>>>>>>>>> hooks. Currently (only comment on the async* stuff):
>>>>>>>>>>>
>>>>>>>>>>> - complete_precopy_async
>>>>>>>>>>> - complete_precopy
>>>>>>>>>>> - complete_precopy_async_wait
>>>>>>>>>>>
>>>>>>>>>>> But perhaps better:
>>>>>>>>>>>
>>>>>>>>>>> - complete_precopy_begin
>>>>>>>>>>> - complete_precopy
>>>>>>>>>>> - complete_precopy_end
>>>>>>>>>>>
>>>>>>>>>>> ?
>>>>>>>>>>>
>>>>>>>>>>> As I don't see why the device must do something asynchronous in such a hook.
>>>>>>>>>>> To me it's more like you're splitting one process into multiple, then
>>>>>>>>>>> begin/end sounds more generic.
>>>>>>>>>>>
>>>>>>>>>>> Then, with that in mind, IIUC we can already split ram_save_complete()
>>>>>>>>>>> into more than one phase too. For example, I would be curious whether the
>>>>>>>>>>> performance will go back to normal if we offload multifd_send_sync_main()
>>>>>>>>>>> into complete_precopy_end(), because we really only need one shot of that,
>>>>>>>>>>> and I am quite surprised it already greatly affects VFIO dumping its own
>>>>>>>>>>> things.
>>>>>>>>>>>
>>>>>>>>>>> I would even ask one step further, as Dan was asking: have you thought
>>>>>>>>>>> about dumping VFIO states via multifd even during iterations? Would that
>>>>>>>>>>> help even more than this series (which IIUC only helps during the blackout
>>>>>>>>>>> phase)?
>>>>>>>>>> To dump during RAM iteration, the VFIO device will need to have
>>>>>>>>>> dirty tracking and iterate on its state, because the guest CPUs
>>>>>>>>>> will still be running potentially changing VFIO state. That seems
>>>>>>>>>> impractical in the general case.
>>>>>>>>> We already do such iterations in vfio_save_iterate()?
>>>>>>>>>
>>>>>>>>> My understanding is the recent VFIO work is based on the fact that the VFIO
>>>>>>>>> device can track device state changes more or less (besides being able to
>>>>>>>>> save/load full states). E.g. I still remember in our QE tests some old
>>>>>>>>> devices reported many more dirty pages than expected during the iterations
>>>>>>>>> when we were looking into an issue where a huge amount of dirty pages was
>>>>>>>>> reported. But newer models seem to have fixed that and report much less.
>>>>>>>>>
>>>>>>>>> That issue was about GPUs, not NICs, though, and IIUC a major portion of such
>>>>>>>>> tracking used to be for GPU vRAMs. So maybe I was mixing up these, and
>>>>>>>>> maybe they work differently.
>>>>>>>> The device which this series was developed against (Mellanox ConnectX-7)
>>>>>>>> is already transferring its live state before the VM gets stopped (via
>>>>>>>> save_live_iterate SaveVMHandler).
>>>>>>>>
>>>>>>>> It's just that in addition to the live state it has more than 400 MiB
>>>>>>>> of state that cannot be transferred while the VM is still running.
>>>>>>>> And that fact hurts a lot with respect to the migration downtime.
>>>>>>>>
>>>>>>>> AFAIK it's a very similar story for (some) GPUs.
>>>>>>> So during the iteration phase VFIO cannot yet leverage the multifd channels
>>>>>>> with this series, am I right?
>>>>>> That's right.
>>>>>>
>>>>>>> Is it possible to extend that use case too?
>>>>>> I guess so, but since this phase (iteration while the VM is still
>>>>>> running) doesn't impact downtime it is much less critical.
>>>>> But it affects the bandwidth, e.g. even with multifd enabled, the device
>>>>> iteration data will still bottleneck at ~15Gbps on a common system setup in
>>>>> the best case, even if the hosts are 100Gbps direct connected. Would that
>>>>> be a concern in the future too, or is it a known problem that won't be fixed
>>>>> anyway?
>>>> I think any improvements to the migration performance are good, even if
>>>> they don't impact downtime.
>>>>
>>>> It's just that this patch set focuses on the downtime phase as the more
>>>> critical thing.
>>>>
>>>> After this gets improved there's no reason not to look at improving the
>>>> performance of the VM live phase too if it brings sensible improvements.
>>>>
>>>>> I remember Avihai used to have a plan to look into similar issues, I hope
>>>>> this is exactly what he is looking for. Otherwise changing the migration
>>>>> protocol from time to time is cumbersome; we always need to provide a flag
>>>>> to make sure old systems migrate in the old ways, new systems run the new
>>>>> ways, and for such a relatively major change I'd want to double check
>>>>> how far away we are from supporting offload of VFIO iteration data to multifd.
>>>> The device state transfer is indicated by a new flag in the multifd
>>>> header (MULTIFD_FLAG_DEVICE_STATE).
>>>>
>>>> If we are to use multifd channels for VM live phase transfers these
>>>> could simply re-use the same flag type.
>>> Right, and that's also the major purpose of my request to consider both
>>> issues.
>>>
>>> If supporting iterations can be easy on top of this, I am wondering whether
>>> we should do it in one shot. The problem is even if the flag type can be
>>> reused, old/new qemu binaries may not be compatible and may not migrate
>>> well when:
>>>
>>> - The old qemu only supports the downtime optimizations
>>> - The new qemu supports both downtime + iteration optimizations
>> I think the situation here will be the same as with any new flag
>> affecting the migration wire protocol - if the old version of QEMU
>> doesn't support that flag then it has to be kept at its backward-compatible
>> setting for migration to succeed.
>>
>>> IIUC, at least the device threads are currently created only at the end of
>>> migration when switching over for the downtime-only optimization (aka, this
>>> series). Then it means it won't be compatible with a new QEMU, as the
>>> threads there will need to be created before iteration starts in order to
>>> take iteration data. So I believe we'll need yet another flag to tune that
>>> behavior, one for each optimization (downtime vs. data during
>>> iterations). If they work mostly similarly, I want to avoid two flags.
>>> It'll be chaos for users to see such similar flags and they'll be pretty
>>> confusing.
>> The VFIO loading threads are created from vfio_load_setup(), which is
>> called at the very beginning of the migration, so they should already be
>> there.
>>
>> However, they aren't currently prepared to receive VM live phase data.
>>
>>> If possible, I wish we could spend some time looking into that if they're so
>>> close, and if it's low-hanging fruit on top of this series, maybe we
>>> can consider doing that in one shot.
>> I'm still trying to figure out the complete explanation of why dedicated
>> device state channels improve downtime, as there were a bunch of holidays
>> here last week.
> No rush. I am not sure whether it'll reduce downtime, but it may improve
> total migration time when multiple devices are used.
>
>> I will have a look later at what it would take to add VM live phase multifd
>> device state transfer support and also how invasive it would be, as I
>> think it's better to keep the number of code conflicts in a patch set
>> to a manageable size since that reduces the chance of accidentally
>> introducing regressions when forward-porting the patch set to git master.
> Yes, it makes sense. It'll be good to look one step further in this case,
> then:
>
> - If it's easy to add support then we do it in one batch, or
>
> - If it's not easy to add support, but we can find a compatible way so
> that the ABI can stay transparent when adding that later, it'll also be nice, or
>
> - If we have a solid clue that it should be a major separate piece of work,
> and we really need a new flag, then we at least know we should simply split
> the effort due to that complexity.
>
> The hope is that option (1)/(2) works out.
>
> I hope Avihai can also chime in here (or please reach out to him) because I
> remember he used to consider proposing such a whole solution, but maybe I
> just misunderstood. I suppose there's no harm in checking with him.
Yes, I was working on parallel VFIO migration, but with a different
approach (not over multifd), which I'm not sure is relevant to this series.
I've been skimming over your discussions but haven't had the time to go
over Maciej's series thoroughly.
I will try to find time to do this next week and see if I can help.
Thanks.
* Re: [PATCH RFC 00/26] Multifd 🔀 device state transfer support with VFIO consumer
2024-05-07 8:41 ` Avihai Horon
@ 2024-05-07 16:13 ` Peter Xu
2024-05-07 17:23 ` Avihai Horon
0 siblings, 1 reply; 54+ messages in thread
From: Peter Xu @ 2024-05-07 16:13 UTC (permalink / raw)
To: Avihai Horon
Cc: Maciej S. Szmigiero, Daniel P. Berrangé, Fabiano Rosas,
Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Joao Martins, qemu-devel
On Tue, May 07, 2024 at 11:41:05AM +0300, Avihai Horon wrote:
> Yes, I was working on parallel VFIO migration, but with a different approach
> (not over multifd), which I'm not sure is relevant to this series.
> I've been skimming over your discussions but haven't had the time to go over
> Maciej's series thoroughly.
> I will try to find time to do this next week and see if I can help.
IIUC your solution could also improve downtime; it's just that it bypasses
migration in general, so from that POV a multifd-based solution is
preferred.
Fundamentally I think you more or less share the goal of allowing
concurrent VFIO migrations, so it will be greatly helpful to have your
input / reviews, and to make sure the ultimate solution will work for all
the use cases.
Thanks,
--
Peter Xu
* Re: [PATCH RFC 00/26] Multifd 🔀 device state transfer support with VFIO consumer
2024-05-07 16:13 ` Peter Xu
@ 2024-05-07 17:23 ` Avihai Horon
0 siblings, 0 replies; 54+ messages in thread
From: Avihai Horon @ 2024-05-07 17:23 UTC (permalink / raw)
To: Peter Xu
Cc: Maciej S. Szmigiero, Daniel P. Berrangé, Fabiano Rosas,
Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Joao Martins, qemu-devel
On 07/05/2024 19:13, Peter Xu wrote:
> On Tue, May 07, 2024 at 11:41:05AM +0300, Avihai Horon wrote:
>> Yes, I was working on parallel VFIO migration, but in a different approach
>> (not over multifd) which I'm not sure is relevant to this series.
>> I've been skimming over your discussions but haven't had the time to go over
>> Maciej's series thoroughly.
>> I will try to find time to do this next week and see if I can help.
> IIUC your solution could also improve downtime; it's just that it bypasses
> migration in general, so from that POV a multifd-based solution is
> preferred.
>
> Fundamentally I think you more or less share the goal of allowing
> concurrent VFIO migrations, so it will be greatly helpful to have your
> input / reviews, and to make sure the ultimate solution will work for all
> the use cases.
Yes, of course, I am planning to review and test this series once I have
time.
As I previously mentioned [1], I'm in sync with Maciej and his work is
the reason why I pulled back from mine.
Thanks.
[1]
https://lore.kernel.org/qemu-devel/f1882336-15ac-40a4-b481-03efdb152510@nvidia.com/
end of thread, other threads: [~2024-05-07 17:24 UTC | newest]
Thread overview: 54+ messages
2024-04-16 14:42 [PATCH RFC 00/26] Multifd 🔀 device state transfer support with VFIO consumer Maciej S. Szmigiero
2024-04-16 14:42 ` [PATCH RFC 01/26] migration: Add x-channel-header pseudo-capability Maciej S. Szmigiero
2024-04-16 14:42 ` [PATCH RFC 02/26] migration: Add migration channel header send/receive Maciej S. Szmigiero
2024-04-16 14:42 ` [PATCH RFC 03/26] migration: Add send/receive header for main channel Maciej S. Szmigiero
2024-04-16 14:42 ` [PATCH RFC 04/26] multifd: change multifd_new_send_channel_create() param type Maciej S. Szmigiero
2024-04-16 14:42 ` [PATCH RFC 05/26] migration: Add a DestroyNotify parameter to socket_send_channel_create() Maciej S. Szmigiero
2024-04-16 14:42 ` [PATCH RFC 06/26] multifd: pass MFDSendChannelConnectData when connecting sending socket Maciej S. Szmigiero
2024-04-16 14:42 ` [PATCH RFC 07/26] migration/postcopy: pass PostcopyPChannelConnectData when connecting sending preempt socket Maciej S. Szmigiero
2024-04-16 14:42 ` [PATCH RFC 08/26] migration: Allow passing migration header in migration channel creation Maciej S. Szmigiero
2024-04-16 14:42 ` [PATCH RFC 09/26] migration: Add send/receive header for postcopy preempt channel Maciej S. Szmigiero
2024-04-16 14:42 ` [PATCH RFC 10/26] migration: Add send/receive header for multifd channel Maciej S. Szmigiero
2024-04-16 14:42 ` [PATCH RFC 11/26] migration/options: Mapped-ram is not channel header compatible Maciej S. Szmigiero
2024-04-16 14:42 ` [PATCH RFC 12/26] migration: Enable x-channel-header pseudo-capability Maciej S. Szmigiero
2024-04-16 14:42 ` [PATCH RFC 13/26] vfio/migration: Add save_{iterate, complete_precopy}_started trace events Maciej S. Szmigiero
2024-04-16 14:42 ` [PATCH RFC 14/26] migration/ram: Add load start trace event Maciej S. Szmigiero
2024-04-16 14:42 ` [PATCH RFC 15/26] migration/multifd: Zero p->flags before starting filling a packet Maciej S. Szmigiero
2024-04-16 14:42 ` [PATCH RFC 16/26] migration: Add save_live_complete_precopy_async{, wait} handlers Maciej S. Szmigiero
2024-04-16 14:42 ` [PATCH RFC 17/26] migration: Add qemu_loadvm_load_state_buffer() and its handler Maciej S. Szmigiero
2024-04-16 14:42 ` [PATCH RFC 18/26] migration: Add load_finish handler and associated functions Maciej S. Szmigiero
2024-04-16 14:42 ` [PATCH RFC 19/26] migration: Add x-multifd-channels-device-state parameter Maciej S. Szmigiero
2024-04-16 14:42 ` [PATCH RFC 20/26] migration: Add MULTIFD_DEVICE_STATE migration channel type Maciej S. Szmigiero
2024-04-16 14:43 ` [PATCH RFC 21/26] migration/multifd: Device state transfer support - receive side Maciej S. Szmigiero
2024-04-16 14:43 ` [PATCH RFC 22/26] migration/multifd: Convert multifd_send_pages::next_channel to atomic Maciej S. Szmigiero
2024-04-16 14:43 ` [PATCH RFC 23/26] migration/multifd: Device state transfer support - send side Maciej S. Szmigiero
2024-04-29 20:04 ` Peter Xu
2024-05-06 16:25 ` Maciej S. Szmigiero
2024-04-16 14:43 ` [PATCH RFC 24/26] migration/multifd: Add migration_has_device_state_support() Maciej S. Szmigiero
2024-04-16 14:43 ` [PATCH RFC 25/26] vfio/migration: Multifd device state transfer support - receive side Maciej S. Szmigiero
2024-04-16 14:43 ` [PATCH RFC 26/26] vfio/migration: Multifd device state transfer support - send side Maciej S. Szmigiero
2024-04-17 8:36 ` [PATCH RFC 00/26] Multifd 🔀 device state transfer support with VFIO consumer Daniel P. Berrangé
2024-04-17 12:11 ` Maciej S. Szmigiero
2024-04-17 16:35 ` Daniel P. Berrangé
2024-04-18 9:50 ` Maciej S. Szmigiero
2024-04-18 10:39 ` Daniel P. Berrangé
2024-04-18 18:14 ` Maciej S. Szmigiero
2024-04-18 20:02 ` Peter Xu
2024-04-19 10:07 ` Daniel P. Berrangé
2024-04-19 15:31 ` Peter Xu
2024-04-23 16:15 ` Maciej S. Szmigiero
2024-04-23 22:20 ` Peter Xu
2024-04-23 22:25 ` Maciej S. Szmigiero
2024-04-23 22:35 ` Peter Xu
2024-04-26 17:34 ` Maciej S. Szmigiero
2024-04-29 15:09 ` Peter Xu
2024-05-06 16:26 ` Maciej S. Szmigiero
2024-05-06 17:56 ` Peter Xu
2024-05-07 8:41 ` Avihai Horon
2024-05-07 16:13 ` Peter Xu
2024-05-07 17:23 ` Avihai Horon
2024-04-23 16:14 ` Maciej S. Szmigiero
2024-04-23 22:27 ` Peter Xu
2024-04-26 17:35 ` Maciej S. Szmigiero
2024-04-29 20:34 ` Peter Xu
2024-04-19 10:20 ` Daniel P. Berrangé