* [PATCH v1 01/13] vfio/migration: Add save_{iterate, complete_precopy}_started trace events
2024-06-18 16:12 [PATCH v1 00/13] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
@ 2024-06-18 16:12 ` Maciej S. Szmigiero
2024-06-18 16:12 ` [PATCH v1 02/13] migration/ram: Add load start trace event Maciej S. Szmigiero
` (12 subsequent siblings)
13 siblings, 0 replies; 29+ messages in thread
From: Maciej S. Szmigiero @ 2024-06-18 16:12 UTC (permalink / raw)
To: Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
This way both the start and end points of migrating a particular VFIO
device are known.
Also add a vfio_save_iterate_empty_hit trace event so it is known when
there's no more data to send for that device.
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
hw/vfio/migration.c | 13 +++++++++++++
hw/vfio/trace-events | 3 +++
include/hw/vfio/vfio-common.h | 3 +++
3 files changed, 19 insertions(+)
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 34d4be2ce1b1..93f767e3c2dd 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -472,6 +472,9 @@ static int vfio_save_setup(QEMUFile *f, void *opaque, Error **errp)
return -ENOMEM;
}
+ migration->save_iterate_run = false;
+ migration->save_iterate_empty_hit = false;
+
if (vfio_precopy_supported(vbasedev)) {
switch (migration->device_state) {
case VFIO_DEVICE_STATE_RUNNING:
@@ -605,9 +608,17 @@ static int vfio_save_iterate(QEMUFile *f, void *opaque)
VFIOMigration *migration = vbasedev->migration;
ssize_t data_size;
+ if (!migration->save_iterate_run) {
+ trace_vfio_save_iterate_started(vbasedev->name);
+ migration->save_iterate_run = true;
+ }
+
data_size = vfio_save_block(f, migration);
if (data_size < 0) {
return data_size;
+ } else if (data_size == 0 && !migration->save_iterate_empty_hit) {
+ trace_vfio_save_iterate_empty_hit(vbasedev->name);
+ migration->save_iterate_empty_hit = true;
}
vfio_update_estimated_pending_data(migration, data_size);
@@ -633,6 +644,8 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
int ret;
Error *local_err = NULL;
+ trace_vfio_save_complete_precopy_started(vbasedev->name);
+
/* We reach here with device state STOP or STOP_COPY only */
ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP_COPY,
VFIO_DEVICE_STATE_STOP, &local_err);
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 64161bf6f44c..814000796687 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -158,8 +158,11 @@ vfio_migration_state_notifier(const char *name, int state) " (%s) state %d"
vfio_save_block(const char *name, int data_size) " (%s) data_size %d"
vfio_save_cleanup(const char *name) " (%s)"
vfio_save_complete_precopy(const char *name, int ret) " (%s) ret %d"
+vfio_save_complete_precopy_started(const char *name) " (%s)"
vfio_save_device_config_state(const char *name) " (%s)"
vfio_save_iterate(const char *name, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64
+vfio_save_iterate_started(const char *name) " (%s)"
+vfio_save_iterate_empty_hit(const char *name) " (%s)"
vfio_save_setup(const char *name, uint64_t data_buffer_size) " (%s) data buffer size 0x%"PRIx64
vfio_state_pending_estimate(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64
vfio_state_pending_exact(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t stopcopy_size, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" stopcopy size 0x%"PRIx64" precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 4cb1ab8645dc..510818f4dae3 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -71,6 +71,9 @@ typedef struct VFIOMigration {
uint64_t precopy_init_size;
uint64_t precopy_dirty_size;
bool initial_data_sent;
+
+ bool save_iterate_run;
+ bool save_iterate_empty_hit;
} VFIOMigration;
struct VFIOGroup;
* [PATCH v1 02/13] migration/ram: Add load start trace event
2024-06-18 16:12 [PATCH v1 00/13] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
2024-06-18 16:12 ` [PATCH v1 01/13] vfio/migration: Add save_{iterate, complete_precopy}_started trace events Maciej S. Szmigiero
@ 2024-06-18 16:12 ` Maciej S. Szmigiero
2024-06-18 16:12 ` [PATCH v1 03/13] migration/multifd: Zero p->flags before starting filling a packet Maciej S. Szmigiero
` (11 subsequent siblings)
13 siblings, 0 replies; 29+ messages in thread
From: Maciej S. Szmigiero @ 2024-06-18 16:12 UTC (permalink / raw)
To: Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
There's a RAM load complete trace event but no corresponding load start event.
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
migration/ram.c | 1 +
migration/trace-events | 1 +
2 files changed, 2 insertions(+)
diff --git a/migration/ram.c b/migration/ram.c
index ceea586b06ba..87b0cf86db0c 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -4129,6 +4129,7 @@ static int ram_load_precopy(QEMUFile *f)
RAM_SAVE_FLAG_ZERO);
}
+ trace_ram_load_start();
while (!ret && !(flags & RAM_SAVE_FLAG_EOS)) {
ram_addr_t addr;
void *host = NULL, *host_bak = NULL;
diff --git a/migration/trace-events b/migration/trace-events
index 0b7c3324fb5e..43dfe4a4bc03 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -115,6 +115,7 @@ colo_flush_ram_cache_end(void) ""
save_xbzrle_page_skipping(void) ""
save_xbzrle_page_overflow(void) ""
ram_save_iterate_big_wait(uint64_t milliconds, int iterations) "big wait: %" PRIu64 " milliseconds, %d iterations"
+ram_load_start(void) ""
ram_load_complete(int ret, uint64_t seq_iter) "exit_code %d seq iteration %" PRIu64
ram_write_tracking_ramblock_start(const char *block_id, size_t page_size, void *addr, size_t length) "%s: page_size: %zu addr: %p length: %zu"
ram_write_tracking_ramblock_stop(const char *block_id, size_t page_size, void *addr, size_t length) "%s: page_size: %zu addr: %p length: %zu"
* [PATCH v1 03/13] migration/multifd: Zero p->flags before starting filling a packet
2024-06-18 16:12 [PATCH v1 00/13] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
2024-06-18 16:12 ` [PATCH v1 01/13] vfio/migration: Add save_{iterate, complete_precopy}_started trace events Maciej S. Szmigiero
2024-06-18 16:12 ` [PATCH v1 02/13] migration/ram: Add load start trace event Maciej S. Szmigiero
@ 2024-06-18 16:12 ` Maciej S. Szmigiero
2024-06-18 16:12 ` [PATCH v1 04/13] migration: Add save_live_complete_precopy_{begin, end} handlers Maciej S. Szmigiero
` (10 subsequent siblings)
13 siblings, 0 replies; 29+ messages in thread
From: Maciej S. Szmigiero @ 2024-06-18 16:12 UTC (permalink / raw)
To: Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
This way no stale flags are left over from the previous packet.
p->flags can no longer carry a SYNC flag into the next RAM packet since syncs
are now handled separately in multifd_send_thread().
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
migration/multifd.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/migration/multifd.c b/migration/multifd.c
index f317bff07746..c8a5b363f7d4 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -933,6 +933,7 @@ static void *multifd_send_thread(void *opaque)
if (qatomic_load_acquire(&p->pending_job)) {
MultiFDPages_t *pages = p->pages;
+ p->flags = 0;
p->iovs_num = 0;
assert(pages->num);
@@ -986,7 +987,6 @@ static void *multifd_send_thread(void *opaque)
}
/* p->next_packet_size will always be zero for a SYNC packet */
stat64_add(&mig_stats.multifd_bytes, p->packet_len);
- p->flags = 0;
}
qatomic_set(&p->pending_sync, false);
* [PATCH v1 04/13] migration: Add save_live_complete_precopy_{begin, end} handlers
2024-06-18 16:12 [PATCH v1 00/13] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
` (2 preceding siblings ...)
2024-06-18 16:12 ` [PATCH v1 03/13] migration/multifd: Zero p->flags before starting filling a packet Maciej S. Szmigiero
@ 2024-06-18 16:12 ` Maciej S. Szmigiero
2024-06-18 16:12 ` [PATCH v1 05/13] migration: Add qemu_loadvm_load_state_buffer() and its handler Maciej S. Szmigiero
` (9 subsequent siblings)
13 siblings, 0 replies; 29+ messages in thread
From: Maciej S. Szmigiero @ 2024-06-18 16:12 UTC (permalink / raw)
To: Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
These SaveVMHandlers allow a device to provide its own asynchronous
transmission of the remaining data at the end of a precopy phase.
In this use case the save_live_complete_precopy_begin handler is
supposed to start such a transmission (for example, by launching
appropriate threads) while the save_live_complete_precopy_end
handler is supposed to wait until that transfer has finished (for
example, until the sending threads have exited).
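As an illustrative sketch (not part of this patch), a device might wire up
these two handlers roughly as below; DevState and the dev_*_send_threads()
helpers are hypothetical names:

    /* Hypothetical device-side usage of the new hooks. */
    static int dev_complete_precopy_begin(QEMUFile *f, char *idstr,
                                          uint32_t instance_id, void *opaque)
    {
        DevState *dev = opaque;   /* hypothetical device state struct */

        /* Start asynchronous transmission of the remaining device data. */
        return dev_start_send_threads(dev, idstr, instance_id);
    }

    static int dev_complete_precopy_end(QEMUFile *f, void *opaque)
    {
        DevState *dev = opaque;

        /* Wait until all sending threads have finished and exited. */
        return dev_join_send_threads(dev);
    }

    static const SaveVMHandlers dev_savevm_handlers = {
        .save_live_complete_precopy_begin = dev_complete_precopy_begin,
        .save_live_complete_precopy_end   = dev_complete_precopy_end,
        /* ... the usual handlers ... */
    };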
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
include/migration/register.h | 34 ++++++++++++++++++++++++++++++++++
migration/savevm.c | 35 +++++++++++++++++++++++++++++++++++
2 files changed, 69 insertions(+)
diff --git a/include/migration/register.h b/include/migration/register.h
index f60e797894e5..f7b3df799991 100644
--- a/include/migration/register.h
+++ b/include/migration/register.h
@@ -103,6 +103,40 @@ typedef struct SaveVMHandlers {
*/
int (*save_live_complete_precopy)(QEMUFile *f, void *opaque);
+ /**
+ * @save_live_complete_precopy_begin
+ *
+ * Called at the end of a precopy phase, before all @save_live_complete_precopy
+ * handlers. The handler might, for example, arrange for device-specific
+ * asynchronous transmission of the remaining data. When postcopy is enabled,
+ * devices that support postcopy will skip this step.
+ *
+ * @f: QEMUFile where the handler can synchronously send data before returning
+ * @idstr: this device section idstr
+ * @instance_id: this device section instance_id
+ * @opaque: data pointer passed to register_savevm_live()
+ *
+ * Returns zero to indicate success and negative for error
+ */
+ int (*save_live_complete_precopy_begin)(QEMUFile *f,
+ char *idstr, uint32_t instance_id,
+ void *opaque);
+ /**
+ * @save_live_complete_precopy_end
+ *
+ * Called at the end of a precopy phase, after all @save_live_complete_precopy
+ * handlers. The handler might, for example, wait for the asynchronous
+ * transmission started by the @save_live_complete_precopy_begin handler
+ * to complete. When postcopy is enabled, devices that support postcopy will
+ * skip this step.
+ *
+ * @f: QEMUFile where the handler can synchronously send data before returning
+ * @opaque: data pointer passed to register_savevm_live()
+ *
+ * Returns zero to indicate success and negative for error
+ */
+ int (*save_live_complete_precopy_end)(QEMUFile *f, void *opaque);
+
/* This runs both outside and inside the BQL. */
/**
diff --git a/migration/savevm.c b/migration/savevm.c
index c621f2359ba3..56fb1c4c2563 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -1494,6 +1494,27 @@ int qemu_savevm_state_complete_precopy_iterable(QEMUFile *f, bool in_postcopy)
SaveStateEntry *se;
int ret;
+ QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+ if (!se->ops || (in_postcopy && se->ops->has_postcopy &&
+ se->ops->has_postcopy(se->opaque)) ||
+ !se->ops->save_live_complete_precopy_begin) {
+ continue;
+ }
+
+ save_section_header(f, se, QEMU_VM_SECTION_END);
+
+ ret = se->ops->save_live_complete_precopy_begin(f,
+ se->idstr, se->instance_id,
+ se->opaque);
+
+ save_section_footer(f, se);
+
+ if (ret < 0) {
+ qemu_file_set_error(f, ret);
+ return -1;
+ }
+ }
+
QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
if (!se->ops ||
(in_postcopy && se->ops->has_postcopy &&
@@ -1525,6 +1546,20 @@ int qemu_savevm_state_complete_precopy_iterable(QEMUFile *f, bool in_postcopy)
end_ts_each - start_ts_each);
}
+ QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+ if (!se->ops || (in_postcopy && se->ops->has_postcopy &&
+ se->ops->has_postcopy(se->opaque)) ||
+ !se->ops->save_live_complete_precopy_end) {
+ continue;
+ }
+
+ ret = se->ops->save_live_complete_precopy_end(f, se->opaque);
+ if (ret < 0) {
+ qemu_file_set_error(f, ret);
+ return -1;
+ }
+ }
+
trace_vmstate_downtime_checkpoint("src-iterable-saved");
return 0;
* [PATCH v1 05/13] migration: Add qemu_loadvm_load_state_buffer() and its handler
2024-06-18 16:12 [PATCH v1 00/13] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
` (3 preceding siblings ...)
2024-06-18 16:12 ` [PATCH v1 04/13] migration: Add save_live_complete_precopy_{begin, end} handlers Maciej S. Szmigiero
@ 2024-06-18 16:12 ` Maciej S. Szmigiero
2024-06-18 16:12 ` [PATCH v1 06/13] migration: Add load_finish handler and associated functions Maciej S. Szmigiero
` (8 subsequent siblings)
13 siblings, 0 replies; 29+ messages in thread
From: Maciej S. Szmigiero @ 2024-06-18 16:12 UTC (permalink / raw)
To: Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
qemu_loadvm_load_state_buffer() and its load_state_buffer
SaveVMHandler allow providing a device state buffer to an explicitly
specified device via its idstr and instance id.
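For illustration only, a device's load_state_buffer handler could then be a
thin wrapper that hands the buffer off for processing; dev_queue_buffer() and
DevState are hypothetical:

    static int dev_load_state_buffer(void *opaque, char *data, size_t data_size,
                                     Error **errp)
    {
        DevState *dev = opaque;   /* hypothetical device state struct */

        /* Hand the raw buffer to the device for (possibly async) processing. */
        if (dev_queue_buffer(dev, data, data_size) < 0) {
            error_setg(errp, "failed to queue %zu bytes of device state",
                       data_size);
            return -1;
        }

        return 0;
    }

The migration core reaches such a handler via
qemu_loadvm_load_state_buffer(idstr, instance_id, buf, len, errp).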
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
include/migration/register.h | 15 +++++++++++++++
migration/savevm.c | 25 +++++++++++++++++++++++++
migration/savevm.h | 3 +++
3 files changed, 43 insertions(+)
diff --git a/include/migration/register.h b/include/migration/register.h
index f7b3df799991..ce7641c90cea 100644
--- a/include/migration/register.h
+++ b/include/migration/register.h
@@ -261,6 +261,21 @@ typedef struct SaveVMHandlers {
*/
int (*load_state)(QEMUFile *f, void *opaque, int version_id);
+ /**
+ * @load_state_buffer
+ *
+ * Load device state buffer provided to qemu_loadvm_load_state_buffer().
+ *
+ * @opaque: data pointer passed to register_savevm_live()
+ * @data: the data buffer to load
+ * @data_size: the data length in buffer
+ * @errp: pointer to Error*, to store an error if it happens.
+ *
+ * Returns zero to indicate success and negative for error
+ */
+ int (*load_state_buffer)(void *opaque, char *data, size_t data_size,
+ Error **errp);
+
/**
* @load_setup
*
diff --git a/migration/savevm.c b/migration/savevm.c
index 56fb1c4c2563..2e538cb02936 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -3099,6 +3099,31 @@ int qemu_loadvm_approve_switchover(void)
return migrate_send_rp_switchover_ack(mis);
}
+int qemu_loadvm_load_state_buffer(const char *idstr, uint32_t instance_id,
+ char *buf, size_t len, Error **errp)
+{
+ SaveStateEntry *se;
+
+ se = find_se(idstr, instance_id);
+ if (!se) {
+ error_setg(errp, "Unknown idstr %s or instance id %u for load state buffer",
+ idstr, instance_id);
+ return -1;
+ }
+
+ if (!se->ops || !se->ops->load_state_buffer) {
+ error_setg(errp, "idstr %s / instance %u has no load state buffer operation",
+ idstr, instance_id);
+ return -1;
+ }
+
+ if (se->ops->load_state_buffer(se->opaque, buf, len, errp) != 0) {
+ return -1;
+ }
+
+ return 0;
+}
+
bool save_snapshot(const char *name, bool overwrite, const char *vmstate,
bool has_devices, strList *devices, Error **errp)
{
diff --git a/migration/savevm.h b/migration/savevm.h
index 9ec96a995c93..d388f1bfca98 100644
--- a/migration/savevm.h
+++ b/migration/savevm.h
@@ -70,4 +70,7 @@ int qemu_loadvm_approve_switchover(void);
int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
bool in_postcopy, bool inactivate_disks);
+int qemu_loadvm_load_state_buffer(const char *idstr, uint32_t instance_id,
+ char *buf, size_t len, Error **errp);
+
#endif
* [PATCH v1 06/13] migration: Add load_finish handler and associated functions
2024-06-18 16:12 [PATCH v1 00/13] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
` (4 preceding siblings ...)
2024-06-18 16:12 ` [PATCH v1 05/13] migration: Add qemu_loadvm_load_state_buffer() and its handler Maciej S. Szmigiero
@ 2024-06-18 16:12 ` Maciej S. Szmigiero
2024-06-18 16:12 ` [PATCH v1 07/13] migration/multifd: Device state transfer support - receive side Maciej S. Szmigiero
` (7 subsequent siblings)
13 siblings, 0 replies; 29+ messages in thread
From: Maciej S. Szmigiero @ 2024-06-18 16:12 UTC (permalink / raw)
To: Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
The load_finish SaveVMHandler allows the migration code to poll whether
a device-specific asynchronous device state loading operation has finished.
In order to avoid calling this handler needlessly the device is supposed
to notify the migration code of its possible readiness via a call to
qemu_loadvm_load_finish_ready_broadcast() while holding the
qemu_loadvm_load_finish_ready_lock.
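A minimal device-side sketch of this contract, with everything except the
new qemu_loadvm_* helpers being hypothetical, could look like:

    /* Runs in a device-owned loading thread once all data is applied. */
    static void dev_async_load_done(DevState *dev)
    {
        qemu_loadvm_load_finish_ready_lock();
        dev->load_done = true;
        qemu_loadvm_load_finish_ready_broadcast();
        qemu_loadvm_load_finish_ready_unlock();
    }

    /* Called by migration code with qemu_loadvm_load_finish_ready_lock held. */
    static int dev_load_finish(void *opaque, bool *is_finished, Error **errp)
    {
        DevState *dev = opaque;

        *is_finished = dev->load_done;
        return 0;
    }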
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
include/migration/register.h | 21 +++++++++++++++
migration/migration.c | 6 +++++
migration/migration.h | 3 +++
migration/savevm.c | 52 ++++++++++++++++++++++++++++++++++++
migration/savevm.h | 4 +++
5 files changed, 86 insertions(+)
diff --git a/include/migration/register.h b/include/migration/register.h
index ce7641c90cea..7c20a9fb86ff 100644
--- a/include/migration/register.h
+++ b/include/migration/register.h
@@ -276,6 +276,27 @@ typedef struct SaveVMHandlers {
int (*load_state_buffer)(void *opaque, char *data, size_t data_size,
Error **errp);
+ /**
+ * @load_finish
+ *
+ * Poll whether all asynchronous device state loading had finished.
+ * Not called on the load failure path.
+ *
+ * Called while holding the qemu_loadvm_load_finish_ready_lock.
+ *
+ * If this method signals "not ready" then it might not be called
+ * again until qemu_loadvm_load_finish_ready_broadcast() is invoked
+ * while holding qemu_loadvm_load_finish_ready_lock.
+ *
+ * @opaque: data pointer passed to register_savevm_live()
+ * @is_finished: whether the loading had finished (output parameter)
+ * @errp: pointer to Error*, to store an error if it happens.
+ *
+ * Returns zero to indicate success and negative for error
+ * It's not an error that the loading still hasn't finished.
+ */
+ int (*load_finish)(void *opaque, bool *is_finished, Error **errp);
+
/**
* @load_setup
*
diff --git a/migration/migration.c b/migration/migration.c
index e1b269624c01..ff149e00132f 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -236,6 +236,9 @@ void migration_object_init(void)
current_incoming->exit_on_error = INMIGRATE_DEFAULT_EXIT_ON_ERROR;
+ qemu_mutex_init(¤t_incoming->load_finish_ready_mutex);
+ qemu_cond_init(¤t_incoming->load_finish_ready_cond);
+
migration_object_check(current_migration, &error_fatal);
ram_mig_init();
@@ -387,6 +390,9 @@ void migration_incoming_state_destroy(void)
mis->postcopy_qemufile_dst = NULL;
}
+ qemu_mutex_destroy(&mis->load_finish_ready_mutex);
+ qemu_cond_destroy(&mis->load_finish_ready_cond);
+
yank_unregister_instance(MIGRATION_YANK_INSTANCE);
}
diff --git a/migration/migration.h b/migration/migration.h
index 6af01362d424..0f2716ac42c6 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -230,6 +230,9 @@ struct MigrationIncomingState {
/* Do exit on incoming migration failure */
bool exit_on_error;
+
+ QemuCond load_finish_ready_cond;
+ QemuMutex load_finish_ready_mutex;
};
MigrationIncomingState *migration_incoming_get_current(void);
diff --git a/migration/savevm.c b/migration/savevm.c
index 2e538cb02936..46cfb73eae79 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -3020,6 +3020,37 @@ int qemu_loadvm_state(QEMUFile *f)
return ret;
}
+ qemu_loadvm_load_finish_ready_lock();
+ while (!ret) { /* Don't call load_finish() handlers on the load failure path */
+ bool all_ready = true;
+ SaveStateEntry *se = NULL;
+
+ QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+ bool this_ready;
+
+ if (!se->ops || !se->ops->load_finish) {
+ continue;
+ }
+
+ ret = se->ops->load_finish(se->opaque, &this_ready, &local_err);
+ if (ret) {
+ error_report_err(local_err);
+
+ qemu_loadvm_load_finish_ready_unlock();
+ return -EINVAL;
+ } else if (!this_ready) {
+ all_ready = false;
+ }
+ }
+
+ if (all_ready) {
+ break;
+ }
+
+ qemu_cond_wait(&mis->load_finish_ready_cond, &mis->load_finish_ready_mutex);
+ }
+ qemu_loadvm_load_finish_ready_unlock();
+
if (ret == 0) {
ret = qemu_file_get_error(f);
}
@@ -3124,6 +3155,27 @@ int qemu_loadvm_load_state_buffer(const char *idstr, uint32_t instance_id,
return 0;
}
+void qemu_loadvm_load_finish_ready_lock(void)
+{
+ MigrationIncomingState *mis = migration_incoming_get_current();
+
+ qemu_mutex_lock(&mis->load_finish_ready_mutex);
+}
+
+void qemu_loadvm_load_finish_ready_unlock(void)
+{
+ MigrationIncomingState *mis = migration_incoming_get_current();
+
+ qemu_mutex_unlock(&mis->load_finish_ready_mutex);
+}
+
+void qemu_loadvm_load_finish_ready_broadcast(void)
+{
+ MigrationIncomingState *mis = migration_incoming_get_current();
+
+ qemu_cond_broadcast(&mis->load_finish_ready_cond);
+}
+
bool save_snapshot(const char *name, bool overwrite, const char *vmstate,
bool has_devices, strList *devices, Error **errp)
{
diff --git a/migration/savevm.h b/migration/savevm.h
index d388f1bfca98..69ae22cded7a 100644
--- a/migration/savevm.h
+++ b/migration/savevm.h
@@ -73,4 +73,8 @@ int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
int qemu_loadvm_load_state_buffer(const char *idstr, uint32_t instance_id,
char *buf, size_t len, Error **errp);
+void qemu_loadvm_load_finish_ready_lock(void);
+void qemu_loadvm_load_finish_ready_unlock(void);
+void qemu_loadvm_load_finish_ready_broadcast(void);
+
#endif
* [PATCH v1 07/13] migration/multifd: Device state transfer support - receive side
2024-06-18 16:12 [PATCH v1 00/13] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
` (5 preceding siblings ...)
2024-06-18 16:12 ` [PATCH v1 06/13] migration: Add load_finish handler and associated functions Maciej S. Szmigiero
@ 2024-06-18 16:12 ` Maciej S. Szmigiero
2024-06-18 16:12 ` [PATCH v1 08/13] migration/multifd: Convert multifd_send_pages::next_channel to atomic Maciej S. Szmigiero
` (6 subsequent siblings)
13 siblings, 0 replies; 29+ messages in thread
From: Maciej S. Szmigiero @ 2024-06-18 16:12 UTC (permalink / raw)
To: Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
Add basic support for receiving device state via multifd channels -
channels that are shared with RAM transfers.
To differentiate between a device state packet and a RAM packet the packet
header is read first.
Depending on whether the MULTIFD_FLAG_DEVICE_STATE flag is present in the
packet header, either device state (MultiFDPacketDeviceState_t) or RAM
data (the existing MultiFDPacket_t) is then read.
The received device state data is passed to the
qemu_loadvm_load_state_buffer() function for processing by the
device's load_state_buffer handler.
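In outline (simplified from the diff below, with buffer reads and error
handling elided), the receive thread now does:

    MultiFDPacketHdr_t hdr;

    /* Read and validate the shared packet header first. */
    qio_channel_read_all_eof(p->c, (void *)&hdr, sizeof(hdr), &local_err);
    multifd_recv_unfill_packet_header(p, &hdr, &local_err);

    if (p->flags & MULTIFD_FLAG_DEVICE_STATE) {
        /* Read the rest of MultiFDPacketDeviceState_t plus its payload,
         * then pass the payload to the target device. */
        qemu_loadvm_load_state_buffer(idstr, instance_id,
                                      dev_state_buf, p->next_packet_size,
                                      &local_err);
    } else {
        /* Read the rest of MultiFDPacket_t and process RAM pages as before. */
    }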
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
migration/multifd.c | 123 +++++++++++++++++++++++++++++++++++++-------
migration/multifd.h | 31 ++++++++++-
2 files changed, 134 insertions(+), 20 deletions(-)
diff --git a/migration/multifd.c b/migration/multifd.c
index c8a5b363f7d4..6e0af84bb9a1 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -21,6 +21,7 @@
#include "file.h"
#include "migration.h"
#include "migration-stats.h"
+#include "savevm.h"
#include "socket.h"
#include "tls.h"
#include "qemu-file.h"
@@ -404,7 +405,7 @@ void multifd_send_fill_packet(MultiFDSendParams *p)
uint32_t zero_num = pages->num - pages->normal_num;
int i;
- packet->flags = cpu_to_be32(p->flags);
+ packet->hdr.flags = cpu_to_be32(p->flags);
packet->pages_alloc = cpu_to_be32(p->pages->allocated);
packet->normal_pages = cpu_to_be32(pages->normal_num);
packet->zero_pages = cpu_to_be32(zero_num);
@@ -432,28 +433,44 @@ void multifd_send_fill_packet(MultiFDSendParams *p)
p->flags, p->next_packet_size);
}
-static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error **errp)
+static int multifd_recv_unfill_packet_header(MultiFDRecvParams *p, MultiFDPacketHdr_t *hdr,
+ Error **errp)
{
- MultiFDPacket_t *packet = p->packet;
- int i;
-
- packet->magic = be32_to_cpu(packet->magic);
- if (packet->magic != MULTIFD_MAGIC) {
+ hdr->magic = be32_to_cpu(hdr->magic);
+ if (hdr->magic != MULTIFD_MAGIC) {
error_setg(errp, "multifd: received packet "
"magic %x and expected magic %x",
- packet->magic, MULTIFD_MAGIC);
+ hdr->magic, MULTIFD_MAGIC);
return -1;
}
- packet->version = be32_to_cpu(packet->version);
- if (packet->version != MULTIFD_VERSION) {
+ hdr->version = be32_to_cpu(hdr->version);
+ if (hdr->version != MULTIFD_VERSION) {
error_setg(errp, "multifd: received packet "
"version %u and expected version %u",
- packet->version, MULTIFD_VERSION);
+ hdr->version, MULTIFD_VERSION);
return -1;
}
- p->flags = be32_to_cpu(packet->flags);
+ p->flags = be32_to_cpu(hdr->flags);
+
+ return 0;
+}
+
+static int multifd_recv_unfill_packet_device_state(MultiFDRecvParams *p, Error **errp)
+{
+ MultiFDPacketDeviceState_t *packet = p->packet_dev_state;
+
+ packet->instance_id = be32_to_cpu(packet->instance_id);
+ p->next_packet_size = be32_to_cpu(packet->next_packet_size);
+
+ return 0;
+}
+
+static int multifd_recv_unfill_packet_ram(MultiFDRecvParams *p, Error **errp)
+{
+ MultiFDPacket_t *packet = p->packet;
+ int i;
packet->pages_alloc = be32_to_cpu(packet->pages_alloc);
/*
@@ -485,7 +502,6 @@ static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error **errp)
p->next_packet_size = be32_to_cpu(packet->next_packet_size);
p->packet_num = be64_to_cpu(packet->packet_num);
- p->packets_recved++;
p->total_normal_pages += p->normal_num;
p->total_zero_pages += p->zero_num;
@@ -533,6 +549,19 @@ static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error **errp)
return 0;
}
+static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error **errp)
+{
+ p->packets_recved++;
+
+ if (p->flags & MULTIFD_FLAG_DEVICE_STATE) {
+ return multifd_recv_unfill_packet_device_state(p, errp);
+ } else {
+ return multifd_recv_unfill_packet_ram(p, errp);
+ }
+
+ g_assert_not_reached();
+}
+
static bool multifd_send_should_exit(void)
{
return qatomic_read(&multifd_send_state->exiting);
@@ -1177,8 +1206,8 @@ bool multifd_send_setup(void)
p->packet_len = sizeof(MultiFDPacket_t)
+ sizeof(uint64_t) * page_count;
p->packet = g_malloc0(p->packet_len);
- p->packet->magic = cpu_to_be32(MULTIFD_MAGIC);
- p->packet->version = cpu_to_be32(MULTIFD_VERSION);
+ p->packet->hdr.magic = cpu_to_be32(MULTIFD_MAGIC);
+ p->packet->hdr.version = cpu_to_be32(MULTIFD_VERSION);
/* We need one extra place for the packet header */
p->iov = g_new0(struct iovec, page_count + 1);
@@ -1353,6 +1382,7 @@ static void multifd_recv_cleanup_channel(MultiFDRecvParams *p)
p->packet_len = 0;
g_free(p->packet);
p->packet = NULL;
+ g_clear_pointer(&p->packet_dev_state, g_free);
g_free(p->iov);
p->iov = NULL;
g_free(p->normal);
@@ -1467,8 +1497,13 @@ static void *multifd_recv_thread(void *opaque)
rcu_register_thread();
while (true) {
+ MultiFDPacketHdr_t hdr;
uint32_t flags = 0;
+ bool is_device_state = false;
bool has_data = false;
+ uint8_t *pkt_buf;
+ size_t pkt_len;
+
p->normal_num = 0;
if (use_packets) {
@@ -1476,8 +1511,27 @@ static void *multifd_recv_thread(void *opaque)
break;
}
- ret = qio_channel_read_all_eof(p->c, (void *)p->packet,
- p->packet_len, &local_err);
+ ret = qio_channel_read_all_eof(p->c, (void *)&hdr,
+ sizeof(hdr), &local_err);
+ if (ret == 0 || ret == -1) { /* 0: EOF -1: Error */
+ break;
+ }
+
+ ret = multifd_recv_unfill_packet_header(p, &hdr, &local_err);
+ if (ret) {
+ break;
+ }
+
+ is_device_state = p->flags & MULTIFD_FLAG_DEVICE_STATE;
+ if (is_device_state) {
+ pkt_buf = (uint8_t *)p->packet_dev_state + sizeof(hdr);
+ pkt_len = sizeof(*p->packet_dev_state) - sizeof(hdr);
+ } else {
+ pkt_buf = (uint8_t *)p->packet + sizeof(hdr);
+ pkt_len = p->packet_len - sizeof(hdr);
+ }
+
+ ret = qio_channel_read_all_eof(p->c, (char *)pkt_buf, pkt_len, &local_err);
if (ret == 0 || ret == -1) { /* 0: EOF -1: Error */
break;
}
@@ -1520,8 +1574,33 @@ static void *multifd_recv_thread(void *opaque)
has_data = !!p->data->size;
}
- if (has_data) {
- ret = multifd_recv_state->ops->recv(p, &local_err);
+ if (!is_device_state) {
+ if (has_data) {
+ ret = multifd_recv_state->ops->recv(p, &local_err);
+ if (ret != 0) {
+ break;
+ }
+ }
+ } else {
+ g_autofree char *idstr = NULL;
+ g_autofree char *dev_state_buf = NULL;
+
+ assert(use_packets);
+
+ if (p->next_packet_size > 0) {
+ dev_state_buf = g_malloc(p->next_packet_size);
+
+ ret = qio_channel_read_all(p->c, dev_state_buf, p->next_packet_size, &local_err);
+ if (ret != 0) {
+ break;
+ }
+ }
+
+ idstr = g_strndup(p->packet_dev_state->idstr, sizeof(p->packet_dev_state->idstr));
+ ret = qemu_loadvm_load_state_buffer(idstr,
+ p->packet_dev_state->instance_id,
+ dev_state_buf, p->next_packet_size,
+ &local_err);
if (ret != 0) {
break;
}
@@ -1529,6 +1608,11 @@ static void *multifd_recv_thread(void *opaque)
if (use_packets) {
if (flags & MULTIFD_FLAG_SYNC) {
+ if (is_device_state) {
+ error_setg(&local_err, "multifd: received SYNC device state packet");
+ break;
+ }
+
qemu_sem_post(&multifd_recv_state->sem_sync);
qemu_sem_wait(&p->sem_sync);
}
@@ -1600,6 +1684,7 @@ int multifd_recv_setup(Error **errp)
p->packet_len = sizeof(MultiFDPacket_t)
+ sizeof(uint64_t) * page_count;
p->packet = g_malloc0(p->packet_len);
+ p->packet_dev_state = g_malloc0(sizeof(*p->packet_dev_state));
}
p->name = g_strdup_printf("multifdrecv_%d", i);
p->iov = g_new0(struct iovec, page_count);
diff --git a/migration/multifd.h b/migration/multifd.h
index c9d9b0923953..40ee613dd88a 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -41,6 +41,12 @@ MultiFDRecvData *multifd_get_recv_data(void);
#define MULTIFD_FLAG_ZLIB (1 << 1)
#define MULTIFD_FLAG_ZSTD (2 << 1)
+/*
+ * If set it means that this packet contains device state
+ * (MultiFDPacketDeviceState_t), not RAM data (MultiFDPacket_t).
+ */
+#define MULTIFD_FLAG_DEVICE_STATE (1 << 4)
+
/* This value needs to be a multiple of qemu_target_page_size() */
#define MULTIFD_PACKET_SIZE (512 * 1024)
@@ -48,6 +54,11 @@ typedef struct {
uint32_t magic;
uint32_t version;
uint32_t flags;
+} __attribute__((packed)) MultiFDPacketHdr_t;
+
+typedef struct {
+ MultiFDPacketHdr_t hdr;
+
/* maximum number of allocated pages */
uint32_t pages_alloc;
/* non zero pages */
@@ -68,6 +79,16 @@ typedef struct {
uint64_t offset[];
} __attribute__((packed)) MultiFDPacket_t;
+typedef struct {
+ MultiFDPacketHdr_t hdr;
+
+ char idstr[256] QEMU_NONSTRING;
+ uint32_t instance_id;
+
+ /* size of the next packet that contains the actual data */
+ uint32_t next_packet_size;
+} __attribute__((packed)) MultiFDPacketDeviceState_t;
+
typedef struct {
/* number of used pages */
uint32_t num;
@@ -87,6 +108,13 @@ struct MultiFDRecvData {
off_t file_offset;
};
+typedef struct {
+ char *idstr;
+ uint32_t instance_id;
+ char *buf;
+ size_t buf_len;
+} MultiFDDeviceState_t;
+
typedef struct {
/* Fields are only written at creating/deletion time */
/* No lock required for them, they are read only */
@@ -194,8 +222,9 @@ typedef struct {
/* thread local variables. No locking required */
- /* pointer to the packet */
+ /* pointers to the possible packet types */
MultiFDPacket_t *packet;
+ MultiFDPacketDeviceState_t *packet_dev_state;
/* size of the next packet that contains pages */
uint32_t next_packet_size;
/* packets received through this channel */
* [PATCH v1 08/13] migration/multifd: Convert multifd_send_pages::next_channel to atomic
2024-06-18 16:12 [PATCH v1 00/13] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
` (6 preceding siblings ...)
2024-06-18 16:12 ` [PATCH v1 07/13] migration/multifd: Device state transfer support - receive side Maciej S. Szmigiero
@ 2024-06-18 16:12 ` Maciej S. Szmigiero
2024-06-18 16:12 ` [PATCH v1 09/13] migration/multifd: Device state transfer support - send side Maciej S. Szmigiero
` (5 subsequent siblings)
13 siblings, 0 replies; 29+ messages in thread
From: Maciej S. Szmigiero @ 2024-06-18 16:12 UTC (permalink / raw)
To: Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
This is necessary so that multifd_send_pages() can be called
from multiple threads.
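The resulting channel selection is the usual lock-free round-robin claim; in
outline (simplified from the diff below, which also skips channels that are
still busy):

    int i, i_next;

    do {
        i = qatomic_load_acquire(&next_channel);
        i_next = (i + 1) % migrate_multifd_channels();
        /* Retry if another thread advanced next_channel in the meantime. */
    } while (qatomic_cmpxchg(&next_channel, i, i_next) != i);

    /* Channel index i is now ours to use. */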
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
migration/multifd.c | 24 ++++++++++++++++++------
1 file changed, 18 insertions(+), 6 deletions(-)
diff --git a/migration/multifd.c b/migration/multifd.c
index 6e0af84bb9a1..daa34172bf24 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -614,26 +614,38 @@ static bool multifd_send_pages(void)
return false;
}
- /* We wait here, until at least one channel is ready */
- qemu_sem_wait(&multifd_send_state->channels_ready);
-
/*
* next_channel can remain from a previous migration that was
* using more channels, so ensure it doesn't overflow if the
* limit is lower now.
*/
- next_channel %= migrate_multifd_channels();
- for (i = next_channel;; i = (i + 1) % migrate_multifd_channels()) {
+ i = qatomic_load_acquire(&next_channel);
+ if (unlikely(i >= migrate_multifd_channels())) {
+ qatomic_cmpxchg(&next_channel, i, 0);
+ }
+
+ /* We wait here, until at least one channel is ready */
+ qemu_sem_wait(&multifd_send_state->channels_ready);
+
+ while (true) {
+ int i_next;
+
if (multifd_send_should_exit()) {
return false;
}
+
+ i = qatomic_load_acquire(&next_channel);
+ i_next = (i + 1) % migrate_multifd_channels();
+ if (qatomic_cmpxchg(&next_channel, i, i_next) != i) {
+ continue;
+ }
+
p = &multifd_send_state->params[i];
/*
* Lockless read to p->pending_job is safe, because only multifd
* sender thread can clear it.
*/
if (qatomic_read(&p->pending_job) == false) {
- next_channel = (i + 1) % migrate_multifd_channels();
break;
}
}
* [PATCH v1 09/13] migration/multifd: Device state transfer support - send side
2024-06-18 16:12 [PATCH v1 00/13] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
` (7 preceding siblings ...)
2024-06-18 16:12 ` [PATCH v1 08/13] migration/multifd: Convert multifd_send_pages::next_channel to atomic Maciej S. Szmigiero
@ 2024-06-18 16:12 ` Maciej S. Szmigiero
2024-06-18 16:12 ` [PATCH v1 10/13] migration/multifd: Add migration_has_device_state_support() Maciej S. Szmigiero
` (4 subsequent siblings)
13 siblings, 0 replies; 29+ messages in thread
From: Maciej S. Szmigiero @ 2024-06-18 16:12 UTC (permalink / raw)
To: Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
A new function, multifd_queue_device_state(), is provided for a device to
queue its state for transmission via a multifd channel.
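For illustration, a device sending thread (for example one started from the
save_live_complete_precopy_begin handler added earlier in this series) might
push its state in chunks like the hypothetical sketch below; DevState,
CHUNK_SIZE and dev_read_next_chunk() are made-up names:

    static void *dev_send_thread(void *opaque)
    {
        DevState *dev = opaque;
        char buf[CHUNK_SIZE];
        ssize_t len;

        while ((len = dev_read_next_chunk(dev, buf, sizeof(buf))) > 0) {
            if (multifd_queue_device_state(dev->idstr, dev->instance_id,
                                           buf, len)) {
                break;   /* queueing failed, bail out */
            }
        }

        return NULL;
    }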
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
include/migration/misc.h | 4 +
migration/multifd-zlib.c | 2 +-
migration/multifd-zstd.c | 2 +-
migration/multifd.c | 181 +++++++++++++++++++++++++++++++++------
migration/multifd.h | 26 ++++--
5 files changed, 182 insertions(+), 33 deletions(-)
diff --git a/include/migration/misc.h b/include/migration/misc.h
index bfadc5613bac..abf6f33eeae8 100644
--- a/include/migration/misc.h
+++ b/include/migration/misc.h
@@ -111,4 +111,8 @@ bool migration_in_bg_snapshot(void);
/* migration/block-dirty-bitmap.c */
void dirty_bitmap_mig_init(void);
+/* migration/multifd.c */
+int multifd_queue_device_state(char *idstr, uint32_t instance_id,
+ char *data, size_t len);
+
#endif
diff --git a/migration/multifd-zlib.c b/migration/multifd-zlib.c
index 737a9645d2fe..424547aa5be0 100644
--- a/migration/multifd-zlib.c
+++ b/migration/multifd-zlib.c
@@ -177,7 +177,7 @@ static int zlib_send_prepare(MultiFDSendParams *p, Error **errp)
out:
p->flags |= MULTIFD_FLAG_ZLIB;
- multifd_send_fill_packet(p);
+ multifd_send_fill_packet_ram(p);
return 0;
}
diff --git a/migration/multifd-zstd.c b/migration/multifd-zstd.c
index 256858df0a0a..89ef21898485 100644
--- a/migration/multifd-zstd.c
+++ b/migration/multifd-zstd.c
@@ -166,7 +166,7 @@ static int zstd_send_prepare(MultiFDSendParams *p, Error **errp)
out:
p->flags |= MULTIFD_FLAG_ZSTD;
- multifd_send_fill_packet(p);
+ multifd_send_fill_packet_ram(p);
return 0;
}
diff --git a/migration/multifd.c b/migration/multifd.c
index daa34172bf24..6a7e5d659925 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -12,6 +12,7 @@
#include "qemu/osdep.h"
#include "qemu/cutils.h"
+#include "qemu/iov.h"
#include "qemu/rcu.h"
#include "exec/target_page.h"
#include "sysemu/sysemu.h"
@@ -19,6 +20,7 @@
#include "qemu/error-report.h"
#include "qapi/error.h"
#include "file.h"
+#include "migration/misc.h"
#include "migration.h"
#include "migration-stats.h"
#include "savevm.h"
@@ -49,9 +51,12 @@ typedef struct {
} __attribute__((packed)) MultiFDInit_t;
struct {
+ QemuMutex queue_job_mutex;
+
MultiFDSendParams *params;
- /* array of pages to sent */
+ /* array of pages or device state to be sent */
MultiFDPages_t *pages;
+ MultiFDDeviceState_t *device_state;
/*
* Global number of generated multifd packets.
*
@@ -168,7 +173,7 @@ static void multifd_send_prepare_iovs(MultiFDSendParams *p)
}
/**
- * nocomp_send_prepare: prepare date to be able to send
+ * nocomp_send_prepare_ram: prepare RAM data for sending
*
* For no compression we just have to calculate the size of the
* packet.
@@ -178,7 +183,7 @@ static void multifd_send_prepare_iovs(MultiFDSendParams *p)
* @p: Params for the channel that we are using
* @errp: pointer to an error
*/
-static int nocomp_send_prepare(MultiFDSendParams *p, Error **errp)
+static int nocomp_send_prepare_ram(MultiFDSendParams *p, Error **errp)
{
bool use_zero_copy_send = migrate_zero_copy_send();
int ret;
@@ -197,13 +202,13 @@ static int nocomp_send_prepare(MultiFDSendParams *p, Error **errp)
* Only !zerocopy needs the header in IOV; zerocopy will
* send it separately.
*/
- multifd_send_prepare_header(p);
+ multifd_send_prepare_header_ram(p);
}
multifd_send_prepare_iovs(p);
p->flags |= MULTIFD_FLAG_NOCOMP;
- multifd_send_fill_packet(p);
+ multifd_send_fill_packet_ram(p);
if (use_zero_copy_send) {
/* Send header first, without zerocopy */
@@ -217,6 +222,56 @@ static int nocomp_send_prepare(MultiFDSendParams *p, Error **errp)
return 0;
}
+static void multifd_send_fill_packet_device_state(MultiFDSendParams *p)
+{
+ MultiFDPacketDeviceState_t *packet = p->packet_device_state;
+
+ packet->hdr.flags = cpu_to_be32(p->flags);
+ strncpy(packet->idstr, p->device_state->idstr, sizeof(packet->idstr));
+ packet->instance_id = cpu_to_be32(p->device_state->instance_id);
+ packet->next_packet_size = cpu_to_be32(p->next_packet_size);
+}
+
+/**
+ * nocomp_send_prepare_device_state: prepare device state data for sending
+ *
+ * Returns 0 for success or -1 for error
+ *
+ * @p: Params for the channel that we are using
+ * @errp: pointer to an error
+ */
+static int nocomp_send_prepare_device_state(MultiFDSendParams *p,
+ Error **errp)
+{
+ multifd_send_prepare_header_device_state(p);
+
+ assert(!(p->flags & MULTIFD_FLAG_SYNC));
+
+ p->next_packet_size = p->device_state->buf_len;
+ if (p->next_packet_size > 0) {
+ p->iov[p->iovs_num].iov_base = p->device_state->buf;
+ p->iov[p->iovs_num].iov_len = p->next_packet_size;
+ p->iovs_num++;
+ }
+
+ p->flags |= MULTIFD_FLAG_NOCOMP | MULTIFD_FLAG_DEVICE_STATE;
+
+ multifd_send_fill_packet_device_state(p);
+
+ return 0;
+}
+
+static int nocomp_send_prepare(MultiFDSendParams *p, Error **errp)
+{
+ if (p->is_device_state_job) {
+ return nocomp_send_prepare_device_state(p, errp);
+ } else {
+ return nocomp_send_prepare_ram(p, errp);
+ }
+
+ g_assert_not_reached();
+}
+
/**
* nocomp_recv_setup: setup receive side
*
@@ -397,7 +452,18 @@ static void multifd_pages_clear(MultiFDPages_t *pages)
g_free(pages);
}
-void multifd_send_fill_packet(MultiFDSendParams *p)
+static void multifd_device_state_free(MultiFDDeviceState_t *device_state)
+{
+ if (!device_state) {
+ return;
+ }
+
+ g_clear_pointer(&device_state->idstr, g_free);
+ g_clear_pointer(&device_state->buf, g_free);
+ g_free(device_state);
+}
+
+void multifd_send_fill_packet_ram(MultiFDSendParams *p)
{
MultiFDPacket_t *packet = p->packet;
MultiFDPages_t *pages = p->pages;
@@ -585,7 +651,8 @@ static void multifd_send_kick_main(MultiFDSendParams *p)
}
/*
- * How we use multifd_send_state->pages and channel->pages?
+ * How we use multifd_send_state->pages + channel->pages
+ * and multifd_send_state->device_state + channel->device_state?
*
* We create a pages for each channel, and a main one. Each time that
* we need to send a batch of pages we interchange the ones between
@@ -601,14 +668,15 @@ static void multifd_send_kick_main(MultiFDSendParams *p)
* have to had finish with its own, otherwise pending_job can't be
* false.
*
+ * 'device_state' struct has similar handling.
+ *
* Returns true if succeed, false otherwise.
*/
-static bool multifd_send_pages(void)
+static bool multifd_send_queue_job(bool is_device_state)
{
int i;
static int next_channel;
MultiFDSendParams *p = NULL; /* make happy gcc */
- MultiFDPages_t *pages = multifd_send_state->pages;
if (multifd_send_should_exit()) {
return false;
@@ -645,7 +713,7 @@ static bool multifd_send_pages(void)
* Lockless read to p->pending_job is safe, because only multifd
* sender thread can clear it.
*/
- if (qatomic_read(&p->pending_job) == false) {
+ if (qatomic_cmpxchg(&p->pending_job_preparing, false, true) == false) {
break;
}
}
@@ -655,12 +723,30 @@ static bool multifd_send_pages(void)
* qatomic_store_release() in multifd_send_thread().
*/
smp_mb_acquire();
- assert(!p->pages->num);
- multifd_send_state->pages = p->pages;
- p->pages = pages;
+
+ if (!is_device_state) {
+ assert(!p->pages->num);
+ } else {
+ assert(!p->device_state->buf);
+ }
+
+ p->is_device_state_job = is_device_state;
+
+ if (!is_device_state) {
+ MultiFDPages_t *pages = multifd_send_state->pages;
+
+ multifd_send_state->pages = p->pages;
+ p->pages = pages;
+ } else {
+ MultiFDDeviceState_t *device_state = multifd_send_state->device_state;
+
+ multifd_send_state->device_state = p->device_state;
+ p->device_state = device_state;
+ }
+
/*
- * Making sure p->pages is setup before marking pending_job=true. Pairs
- * with the qatomic_load_acquire() in multifd_send_thread().
+ * Making sure p->pages or p->device state is setup before marking
+ * pending_job=true. Pairs with the qatomic_load_acquire() in multifd_send_thread().
*/
qatomic_store_release(&p->pending_job, true);
qemu_sem_post(&p->sem);
@@ -707,7 +793,7 @@ retry:
* After flush, always retry.
*/
if (pages->block != block || multifd_queue_full(pages)) {
- if (!multifd_send_pages()) {
+ if (!multifd_send_queue_job(false)) {
return false;
}
goto retry;
@@ -718,6 +804,28 @@ retry:
return true;
}
+int multifd_queue_device_state(char *idstr, uint32_t instance_id,
+ char *data, size_t len)
+{
+ /* Device state submissions can come from multiple threads */
+ QEMU_LOCK_GUARD(&multifd_send_state->queue_job_mutex);
+ MultiFDDeviceState_t *device_state = multifd_send_state->device_state;
+
+ assert(!device_state->buf);
+ device_state->idstr = g_strdup(idstr);
+ device_state->instance_id = instance_id;
+ device_state->buf = g_memdup2(data, len);
+ device_state->buf_len = len;
+
+ if (!multifd_send_queue_job(true)) {
+ g_clear_pointer(&device_state->idstr, g_free);
+ g_clear_pointer(&device_state->buf, g_free);
+ return -1;
+ }
+
+ return 0;
+}
+
/* Multifd send side hit an error; remember it and prepare to quit */
static void multifd_send_set_error(Error *err)
{
@@ -822,10 +930,12 @@ static bool multifd_send_cleanup_channel(MultiFDSendParams *p, Error **errp)
multifd_pages_clear(p->pages);
p->pages = NULL;
p->packet_len = 0;
+ g_clear_pointer(&p->packet_device_state, g_free);
g_free(p->packet);
p->packet = NULL;
g_free(p->iov);
p->iov = NULL;
+ g_clear_pointer(&p->device_state, multifd_device_state_free);
multifd_send_state->ops->send_cleanup(p, errp);
return *errp == NULL;
@@ -840,7 +950,9 @@ static void multifd_send_cleanup_state(void)
g_free(multifd_send_state->params);
multifd_send_state->params = NULL;
multifd_pages_clear(multifd_send_state->pages);
+ g_clear_pointer(&multifd_send_state->device_state, multifd_device_state_free);
multifd_send_state->pages = NULL;
+ qemu_mutex_destroy(&multifd_send_state->queue_job_mutex);
g_free(multifd_send_state);
multifd_send_state = NULL;
}
@@ -894,10 +1006,11 @@ int multifd_send_sync_main(void)
return 0;
}
if (multifd_send_state->pages->num) {
- if (!multifd_send_pages()) {
+ if (!multifd_send_queue_job(false)) {
error_report("%s: multifd_send_pages fail", __func__);
return -1;
}
+ assert(!multifd_send_state->pages->num);
}
flush_zero_copy = migrate_zero_copy_send();
@@ -973,17 +1086,22 @@ static void *multifd_send_thread(void *opaque)
*/
if (qatomic_load_acquire(&p->pending_job)) {
MultiFDPages_t *pages = p->pages;
+ bool is_device_state = p->is_device_state_job;
+ size_t total_size;
p->flags = 0;
p->iovs_num = 0;
- assert(pages->num);
+ assert(is_device_state || pages->num);
ret = multifd_send_state->ops->send_prepare(p, &local_err);
if (ret != 0) {
break;
}
+ total_size = iov_size(p->iov, p->iovs_num);
if (migrate_mapped_ram()) {
+ assert(!is_device_state);
+
ret = file_write_ramblock_iov(p->c, p->iov, p->iovs_num,
p->pages->block, &local_err);
} else {
@@ -996,12 +1114,18 @@ static void *multifd_send_thread(void *opaque)
break;
}
- stat64_add(&mig_stats.multifd_bytes,
- p->next_packet_size + p->packet_len);
- stat64_add(&mig_stats.normal_pages, pages->normal_num);
- stat64_add(&mig_stats.zero_pages, pages->num - pages->normal_num);
+ stat64_add(&mig_stats.multifd_bytes, total_size);
+ if (!is_device_state) {
+ stat64_add(&mig_stats.normal_pages, pages->normal_num);
+ stat64_add(&mig_stats.zero_pages, pages->num - pages->normal_num);
+ }
- multifd_pages_reset(p->pages);
+ if (is_device_state) {
+ g_clear_pointer(&p->device_state->idstr, g_free);
+ g_clear_pointer(&p->device_state->buf, g_free);
+ } else {
+ multifd_pages_reset(p->pages);
+ }
p->next_packet_size = 0;
/*
@@ -1010,6 +1134,7 @@ static void *multifd_send_thread(void *opaque)
* multifd_send_pages().
*/
qatomic_store_release(&p->pending_job, false);
+ qatomic_store_release(&p->pending_job_preparing, false);
} else {
/*
* If not a normal job, must be a sync request. Note that
@@ -1020,7 +1145,7 @@ static void *multifd_send_thread(void *opaque)
if (use_packets) {
p->flags = MULTIFD_FLAG_SYNC;
- multifd_send_fill_packet(p);
+ multifd_send_fill_packet_ram(p);
ret = qio_channel_write_all(p->c, (void *)p->packet,
p->packet_len, &local_err);
if (ret != 0) {
@@ -1199,9 +1324,11 @@ bool multifd_send_setup(void)
thread_count = migrate_multifd_channels();
multifd_send_state = g_malloc0(sizeof(*multifd_send_state));
+ qemu_mutex_init(&multifd_send_state->queue_job_mutex);
multifd_send_state->params = g_new0(MultiFDSendParams, thread_count);
multifd_send_state->pages = multifd_pages_init(page_count);
qemu_sem_init(&multifd_send_state->channels_created, 0);
+ multifd_send_state->device_state = g_malloc0(sizeof(*multifd_send_state->device_state));
qemu_sem_init(&multifd_send_state->channels_ready, 0);
qatomic_set(&multifd_send_state->exiting, 0);
multifd_send_state->ops = multifd_ops[migrate_multifd_compression()];
@@ -1215,11 +1342,15 @@ bool multifd_send_setup(void)
p->pages = multifd_pages_init(page_count);
if (use_packets) {
+ p->device_state = g_malloc0(sizeof(*p->device_state));
+
p->packet_len = sizeof(MultiFDPacket_t)
+ sizeof(uint64_t) * page_count;
p->packet = g_malloc0(p->packet_len);
p->packet->hdr.magic = cpu_to_be32(MULTIFD_MAGIC);
p->packet->hdr.version = cpu_to_be32(MULTIFD_VERSION);
+ p->packet_device_state = g_malloc0(sizeof(*p->packet_device_state));
+ p->packet_device_state->hdr = p->packet->hdr;
/* We need one extra place for the packet header */
p->iov = g_new0(struct iovec, page_count + 1);
@@ -1786,7 +1917,7 @@ bool multifd_send_prepare_common(MultiFDSendParams *p)
return false;
}
- multifd_send_prepare_header(p);
+ multifd_send_prepare_header_ram(p);
return true;
}
diff --git a/migration/multifd.h b/migration/multifd.h
index 40ee613dd88a..655bec110f87 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -156,18 +156,25 @@ typedef struct {
* cleared by the multifd sender threads.
*/
bool pending_job;
+ bool pending_job_preparing;
bool pending_sync;
- /* array of pages to sent.
- * The owner of 'pages' depends of 'pending_job' value:
+
+ /* Whether the pending job is pages (false) or device state (true) */
+ bool is_device_state_job;
+
+ /* Array of pages or device state to be sent (depending on the flag above).
+ * The owner of these depends of 'pending_job' value:
* pending_job == 0 -> migration_thread can use it.
* pending_job != 0 -> multifd_channel can use it.
*/
MultiFDPages_t *pages;
+ MultiFDDeviceState_t *device_state;
/* thread local variables. No locking required */
- /* pointer to the packet */
+ /* pointers to the possible packet types */
MultiFDPacket_t *packet;
+ MultiFDPacketDeviceState_t *packet_device_state;
/* size of the next packet that contains pages */
uint32_t next_packet_size;
/* packets sent through this channel */
@@ -267,18 +274,25 @@ typedef struct {
} MultiFDMethods;
void multifd_register_ops(int method, MultiFDMethods *ops);
-void multifd_send_fill_packet(MultiFDSendParams *p);
+void multifd_send_fill_packet_ram(MultiFDSendParams *p);
bool multifd_send_prepare_common(MultiFDSendParams *p);
void multifd_send_zero_page_detect(MultiFDSendParams *p);
void multifd_recv_zero_page_process(MultiFDRecvParams *p);
-static inline void multifd_send_prepare_header(MultiFDSendParams *p)
+void multifd_channel_connect(MultiFDSendParams *p, QIOChannel *ioc);
+
+static inline void multifd_send_prepare_header_ram(MultiFDSendParams *p)
{
p->iov[0].iov_len = p->packet_len;
p->iov[0].iov_base = p->packet;
p->iovs_num++;
}
-void multifd_channel_connect(MultiFDSendParams *p, QIOChannel *ioc);
+static inline void multifd_send_prepare_header_device_state(MultiFDSendParams *p)
+{
+ p->iov[0].iov_len = sizeof(*p->packet_device_state);
+ p->iov[0].iov_base = p->packet_device_state;
+ p->iovs_num++;
+}
#endif
* [PATCH v1 10/13] migration/multifd: Add migration_has_device_state_support()
2024-06-18 16:12 [PATCH v1 00/13] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
` (8 preceding siblings ...)
2024-06-18 16:12 ` [PATCH v1 09/13] migration/multifd: Device state transfer support - send side Maciej S. Szmigiero
@ 2024-06-18 16:12 ` Maciej S. Szmigiero
2024-06-18 16:12 ` [PATCH v1 11/13] vfio/migration: Multifd device state transfer support - receive side Maciej S. Szmigiero
` (3 subsequent siblings)
13 siblings, 0 replies; 29+ messages in thread
From: Maciej S. Szmigiero @ 2024-06-18 16:12 UTC (permalink / raw)
To: Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
Since device state transfer via multifd channels requires multifd
channels with packets and is currently not compatible with multifd
compression, add an appropriate query function so a device can learn
whether it can actually make use of it.
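A device would then gate its use of the multifd path on this query, roughly
as in the sketch below (dev_save_via_main_channel() is a hypothetical
fallback):

    if (migration_has_device_state_support()) {
        /* Queue state buffers onto multifd channels. */
        ret = multifd_queue_device_state(idstr, instance_id, buf, len);
    } else {
        /* Fall back to streaming everything through the main channel. */
        ret = dev_save_via_main_channel(f, buf, len);
    }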
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
include/migration/misc.h | 1 +
migration/multifd.c | 6 ++++++
2 files changed, 7 insertions(+)
diff --git a/include/migration/misc.h b/include/migration/misc.h
index abf6f33eeae8..4f3de2f23819 100644
--- a/include/migration/misc.h
+++ b/include/migration/misc.h
@@ -112,6 +112,7 @@ bool migration_in_bg_snapshot(void);
void dirty_bitmap_mig_init(void);
/* migration/multifd.c */
+bool migration_has_device_state_support(void);
int multifd_queue_device_state(char *idstr, uint32_t instance_id,
char *data, size_t len);
diff --git a/migration/multifd.c b/migration/multifd.c
index 6a7e5d659925..e5f7021465ec 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -804,6 +804,12 @@ retry:
return true;
}
+bool migration_has_device_state_support(void)
+{
+ return migrate_multifd() && !migrate_mapped_ram() &&
+ migrate_multifd_compression() == MULTIFD_COMPRESSION_NONE;
+}
+
int multifd_queue_device_state(char *idstr, uint32_t instance_id,
char *data, size_t len)
{
* [PATCH v1 11/13] vfio/migration: Multifd device state transfer support - receive side
2024-06-18 16:12 [PATCH v1 00/13] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
` (9 preceding siblings ...)
2024-06-18 16:12 ` [PATCH v1 10/13] migration/multifd: Add migration_has_device_state_support() Maciej S. Szmigiero
@ 2024-06-18 16:12 ` Maciej S. Szmigiero
2024-06-18 16:12 ` [PATCH v1 12/13] vfio/migration: Add x-migration-multifd-transfer VFIO property Maciej S. Szmigiero
` (2 subsequent siblings)
13 siblings, 0 replies; 29+ messages in thread
From: Maciej S. Szmigiero @ 2024-06-18 16:12 UTC (permalink / raw)
To: Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
The multifd received data needs to be reassembled since device state
packets sent via different multifd channels can arrive out-of-order.
Therefore, each VFIO device state packet carries a header indicating
its position in the stream.
The last such VFIO device state packet should have
VFIO_DEVICE_STATE_CONFIG_STATE flag set and carry the device config
state.
Since it's important to finish loading the device state transferred via
the main migration channel (via the save_live_iterate handler) before
starting to load the data asynchronously transferred via multifd,
a new VFIO_MIG_FLAG_DEV_DATA_STATE_COMPLETE flag is introduced to
mark the end of the main migration channel data.
The device state loading process waits until that flag is seen before
commencing loading of the multifd-transferred device state.
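Schematically (mirroring the VFIODeviceStatePacket definition in the diff
below), each multifd-transferred chunk is framed as:

    /*
     * Wire framing of one multifd-transferred VFIO state chunk:
     *
     *   VFIODeviceStatePacket { version = 0, idx, flags } + raw state bytes
     *
     * The receiver stores chunk idx N into its load buffer array at slot N,
     * whichever channel delivered it.  The chunk with
     * VFIO_DEVICE_STATE_CONFIG_STATE set is the last one and carries the
     * device config state; actual loading starts only after the main channel
     * has signalled VFIO_MIG_FLAG_DEV_DATA_STATE_COMPLETE.
     */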
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
hw/vfio/migration.c | 325 +++++++++++++++++++++++++++++++++-
hw/vfio/trace-events | 9 +-
include/hw/vfio/vfio-common.h | 14 ++
3 files changed, 344 insertions(+), 4 deletions(-)
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 93f767e3c2dd..719e36800ab5 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -15,6 +15,7 @@
#include <linux/vfio.h>
#include <sys/ioctl.h>
+#include "io/channel-buffer.h"
#include "sysemu/runstate.h"
#include "hw/vfio/vfio-common.h"
#include "migration/misc.h"
@@ -47,6 +48,7 @@
#define VFIO_MIG_FLAG_DEV_SETUP_STATE (0xffffffffef100003ULL)
#define VFIO_MIG_FLAG_DEV_DATA_STATE (0xffffffffef100004ULL)
#define VFIO_MIG_FLAG_DEV_INIT_DATA_SENT (0xffffffffef100005ULL)
+#define VFIO_MIG_FLAG_DEV_DATA_STATE_COMPLETE (0xffffffffef100006ULL)
/*
* This is an arbitrary size based on migration of mlx5 devices, where typically
@@ -55,6 +57,15 @@
*/
#define VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE (1 * MiB)
+#define VFIO_DEVICE_STATE_CONFIG_STATE (1)
+
+typedef struct VFIODeviceStatePacket {
+ uint32_t version;
+ uint32_t idx;
+ uint32_t flags;
+ uint8_t data[0];
+} QEMU_PACKED VFIODeviceStatePacket;
+
static int64_t bytes_transferred;
static const char *mig_state_to_str(enum vfio_device_mig_state state)
@@ -254,6 +265,176 @@ static int vfio_load_buffer(QEMUFile *f, VFIODevice *vbasedev,
return ret;
}
+typedef struct LoadedBuffer {
+ bool is_present;
+ char *data;
+ size_t len;
+} LoadedBuffer;
+
+static void loaded_buffer_clear(gpointer data)
+{
+ LoadedBuffer *lb = data;
+
+ if (!lb->is_present) {
+ return;
+ }
+
+ g_clear_pointer(&lb->data, g_free);
+ lb->is_present = false;
+}
+
+static int vfio_load_state_buffer(void *opaque, char *data, size_t data_size,
+ Error **errp)
+{
+ VFIODevice *vbasedev = opaque;
+ VFIOMigration *migration = vbasedev->migration;
+ VFIODeviceStatePacket *packet = (VFIODeviceStatePacket *)data;
+ QEMU_LOCK_GUARD(&migration->load_bufs_mutex);
+ LoadedBuffer *lb;
+
+ if (data_size < sizeof(*packet)) {
+ error_setg(errp, "packet too short at %zu (min is %zu)",
+ data_size, sizeof(*packet));
+ return -1;
+ }
+
+ if (packet->version != 0) {
+ error_setg(errp, "packet has unknown version %" PRIu32,
+ packet->version);
+ return -1;
+ }
+
+ if (packet->idx == UINT32_MAX) {
+ error_setg(errp, "packet has too high idx %" PRIu32,
+ packet->idx);
+ return -1;
+ }
+
+ trace_vfio_load_state_device_buffer_incoming(vbasedev->name, packet->idx);
+
+ /* config state packet should be the last one in the stream */
+ if (packet->flags & VFIO_DEVICE_STATE_CONFIG_STATE) {
+ migration->load_buf_idx_last = packet->idx;
+ }
+
+ assert(migration->load_bufs);
+ if (packet->idx >= migration->load_bufs->len) {
+ g_array_set_size(migration->load_bufs, packet->idx + 1);
+ }
+
+ lb = &g_array_index(migration->load_bufs, typeof(*lb), packet->idx);
+ if (lb->is_present) {
+ error_setg(errp, "state buffer %" PRIu32 " already filled", packet->idx);
+ return -1;
+ }
+
+ assert(packet->idx >= migration->load_buf_idx);
+
+ lb->data = g_memdup2(&packet->data, data_size - sizeof(*packet));
+ lb->len = data_size - sizeof(*packet);
+ lb->is_present = true;
+
+ qemu_cond_broadcast(&migration->load_bufs_buffer_ready_cond);
+
+ return 0;
+}
+
+static void *vfio_load_bufs_thread(void *opaque)
+{
+ VFIODevice *vbasedev = opaque;
+ VFIOMigration *migration = vbasedev->migration;
+ Error **errp = &migration->load_bufs_thread_errp;
+ g_autoptr(QemuLockable) locker = qemu_lockable_auto_lock(
+ QEMU_MAKE_LOCKABLE(&migration->load_bufs_mutex));
+ LoadedBuffer *lb;
+
+ while (!migration->load_bufs_device_ready &&
+ !migration->load_bufs_thread_want_exit) {
+ qemu_cond_wait(&migration->load_bufs_device_ready_cond, &migration->load_bufs_mutex);
+ }
+
+ while (!migration->load_bufs_thread_want_exit) {
+ bool starved;
+ ssize_t ret;
+
+ assert(migration->load_buf_idx <= migration->load_buf_idx_last);
+
+ if (migration->load_buf_idx >= migration->load_bufs->len) {
+ assert(migration->load_buf_idx == migration->load_bufs->len);
+ starved = true;
+ } else {
+ lb = &g_array_index(migration->load_bufs, typeof(*lb), migration->load_buf_idx);
+ starved = !lb->is_present;
+ }
+
+ if (starved) {
+ trace_vfio_load_state_device_buffer_starved(vbasedev->name, migration->load_buf_idx);
+ qemu_cond_wait(&migration->load_bufs_buffer_ready_cond, &migration->load_bufs_mutex);
+ continue;
+ }
+
+ if (migration->load_buf_idx == migration->load_buf_idx_last) {
+ break;
+ }
+
+ if (migration->load_buf_idx == 0) {
+ trace_vfio_load_state_device_buffer_start(vbasedev->name);
+ }
+
+ if (lb->len) {
+ g_autofree char *buf = NULL;
+ size_t buf_len;
+ int errno_save;
+
+ trace_vfio_load_state_device_buffer_load_start(vbasedev->name,
+ migration->load_buf_idx);
+
+ /* lb might become re-allocated when we drop the lock */
+ buf = g_steal_pointer(&lb->data);
+ buf_len = lb->len;
+
+ /* Loading data to the device takes a while, drop the lock during this process */
+ qemu_mutex_unlock(&migration->load_bufs_mutex);
+ ret = write(migration->data_fd, buf, buf_len);
+ errno_save = errno;
+ qemu_mutex_lock(&migration->load_bufs_mutex);
+
+ if (ret < 0) {
+ error_setg(errp, "write to state buffer %" PRIu32 " failed with %d",
+ migration->load_buf_idx, errno_save);
+ break;
+ } else if (ret < buf_len) {
+ error_setg(errp, "write to state buffer %" PRIu32 " incomplete %zd / %zu",
+ migration->load_buf_idx, ret, buf_len);
+ break;
+ }
+
+ trace_vfio_load_state_device_buffer_load_end(vbasedev->name,
+ migration->load_buf_idx);
+ }
+
+ if (migration->load_buf_idx == migration->load_buf_idx_last - 1) {
+ trace_vfio_load_state_device_buffer_end(vbasedev->name);
+ }
+
+ migration->load_buf_idx++;
+ }
+
+ if (migration->load_bufs_thread_want_exit &&
+ !*errp) {
+ error_setg(errp, "load bufs thread asked to quit");
+ }
+
+ g_clear_pointer(&locker, qemu_lockable_auto_unlock);
+
+ qemu_loadvm_load_finish_ready_lock();
+ migration->load_bufs_thread_finished = true;
+ qemu_loadvm_load_finish_ready_broadcast();
+ qemu_loadvm_load_finish_ready_unlock();
+
+ return NULL;
+}
+
static int vfio_save_device_config_state(QEMUFile *f, void *opaque,
Error **errp)
{
@@ -285,6 +466,8 @@ static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
VFIODevice *vbasedev = opaque;
uint64_t data;
+ trace_vfio_load_device_config_state_start(vbasedev->name);
+
if (vbasedev->ops && vbasedev->ops->vfio_load_config) {
int ret;
@@ -303,7 +486,7 @@ static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
return -EINVAL;
}
- trace_vfio_load_device_config_state(vbasedev->name);
+ trace_vfio_load_device_config_state_end(vbasedev->name);
return qemu_file_get_error(f);
}
@@ -687,16 +870,69 @@ static void vfio_save_state(QEMUFile *f, void *opaque)
static int vfio_load_setup(QEMUFile *f, void *opaque, Error **errp)
{
VFIODevice *vbasedev = opaque;
+ VFIOMigration *migration = vbasedev->migration;
+ int ret;
+
+ ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RESUMING,
+ vbasedev->migration->device_state, errp);
+ if (ret) {
+ return ret;
+ }
+
+ assert(!migration->load_bufs);
+ migration->load_bufs = g_array_new(FALSE, TRUE, sizeof(LoadedBuffer));
+ g_array_set_clear_func(migration->load_bufs, loaded_buffer_clear);
+
+ qemu_mutex_init(&migration->load_bufs_mutex);
+
+ migration->load_bufs_device_ready = false;
+ qemu_cond_init(&migration->load_bufs_device_ready_cond);
+
+ migration->load_buf_idx = 0;
+ migration->load_buf_idx_last = UINT32_MAX;
+ qemu_cond_init(&migration->load_bufs_buffer_ready_cond);
+
+ migration->config_state_loaded_to_dev = false;
+
+ assert(!migration->load_bufs_thread_started);
+
+ migration->load_bufs_thread_finished = false;
+ migration->load_bufs_thread_want_exit = false;
+ qemu_thread_create(&migration->load_bufs_thread, "vfio-load-bufs",
+ vfio_load_bufs_thread, opaque, QEMU_THREAD_JOINABLE);
- return vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RESUMING,
- vbasedev->migration->device_state, errp);
+ migration->load_bufs_thread_started = true;
+
+ return 0;
}
static int vfio_load_cleanup(void *opaque)
{
VFIODevice *vbasedev = opaque;
+ VFIOMigration *migration = vbasedev->migration;
+
+ if (migration->load_bufs_thread_started) {
+ qemu_mutex_lock(&migration->load_bufs_mutex);
+ migration->load_bufs_thread_want_exit = true;
+ qemu_mutex_unlock(&migration->load_bufs_mutex);
+
+ qemu_cond_broadcast(&migration->load_bufs_device_ready_cond);
+ qemu_cond_broadcast(&migration->load_bufs_buffer_ready_cond);
+
+ qemu_thread_join(&migration->load_bufs_thread);
+
+ assert(migration->load_bufs_thread_finished);
+
+ migration->load_bufs_thread_started = false;
+ }
vfio_migration_cleanup(vbasedev);
+
+ g_clear_pointer(&migration->load_bufs, g_array_unref);
+ qemu_cond_destroy(&migration->load_bufs_buffer_ready_cond);
+ qemu_cond_destroy(&migration->load_bufs_device_ready_cond);
+ qemu_mutex_destroy(&migration->load_bufs_mutex);
+
trace_vfio_load_cleanup(vbasedev->name);
return 0;
@@ -705,6 +941,7 @@ static int vfio_load_cleanup(void *opaque)
static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
{
VFIODevice *vbasedev = opaque;
+ VFIOMigration *migration = vbasedev->migration;
int ret = 0;
uint64_t data;
@@ -716,6 +953,7 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
switch (data) {
case VFIO_MIG_FLAG_DEV_CONFIG_STATE:
{
+ migration->config_state_loaded_to_dev = true;
return vfio_load_device_config_state(f, opaque);
}
case VFIO_MIG_FLAG_DEV_SETUP_STATE:
@@ -742,6 +980,15 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
}
break;
}
+ case VFIO_MIG_FLAG_DEV_DATA_STATE_COMPLETE:
+ {
+ QEMU_LOCK_GUARD(&migration->load_bufs_mutex);
+
+ migration->load_bufs_device_ready = true;
+ qemu_cond_broadcast(&migration->load_bufs_device_ready_cond);
+
+ break;
+ }
case VFIO_MIG_FLAG_DEV_INIT_DATA_SENT:
{
if (!vfio_precopy_supported(vbasedev) ||
@@ -774,6 +1021,76 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
return ret;
}
+static int vfio_load_finish(void *opaque, bool *is_finished, Error **errp)
+{
+ VFIODevice *vbasedev = opaque;
+ VFIOMigration *migration = vbasedev->migration;
+ g_autoptr(QemuLockable) locker = NULL;
+ LoadedBuffer *lb;
+ g_autoptr(QIOChannelBuffer) bioc = NULL;
+ QEMUFile *f_out = NULL, *f_in = NULL;
+ uint64_t mig_header;
+ int ret;
+
+ if (migration->config_state_loaded_to_dev) {
+ *is_finished = true;
+ return 0;
+ }
+
+ if (!migration->load_bufs_thread_finished) {
+ assert(migration->load_bufs_thread_started);
+ *is_finished = false;
+ return 0;
+ }
+
+ if (migration->load_bufs_thread_errp) {
+ error_propagate(errp, g_steal_pointer(&migration->load_bufs_thread_errp));
+ return -1;
+ }
+
+ locker = qemu_lockable_auto_lock(QEMU_MAKE_LOCKABLE(&migration->load_bufs_mutex));
+
+ assert(migration->load_buf_idx == migration->load_buf_idx_last);
+ lb = &g_array_index(migration->load_bufs, typeof(*lb), migration->load_buf_idx);
+ assert(lb->is_present);
+
+ bioc = qio_channel_buffer_new(lb->len);
+ qio_channel_set_name(QIO_CHANNEL(bioc), "vfio-device-config-load");
+
+ f_out = qemu_file_new_output(QIO_CHANNEL(bioc));
+ qemu_put_buffer(f_out, (uint8_t *)lb->data, lb->len);
+
+ ret = qemu_fflush(f_out);
+ if (ret) {
+ error_setg(errp, "load device config state file flush failed with %d", ret);
+ g_clear_pointer(&f_out, qemu_fclose);
+ return -1;
+ }
+
+ qio_channel_io_seek(QIO_CHANNEL(bioc), 0, 0, NULL);
+ f_in = qemu_file_new_input(QIO_CHANNEL(bioc));
+
+ mig_header = qemu_get_be64(f_in);
+ if (mig_header != VFIO_MIG_FLAG_DEV_CONFIG_STATE) {
+ error_setg(errp, "load device config state invalid header %"PRIu64, mig_header);
+ g_clear_pointer(&f_out, qemu_fclose);
+ g_clear_pointer(&f_in, qemu_fclose);
+ return -1;
+ }
+
+ ret = vfio_load_device_config_state(f_in, opaque);
+ g_clear_pointer(&f_out, qemu_fclose);
+ g_clear_pointer(&f_in, qemu_fclose);
+ if (ret < 0) {
+ error_setg(errp, "load device config state failed with %d", ret);
+ return -1;
+ }
+
+ migration->config_state_loaded_to_dev = true;
+ *is_finished = true;
+ return 0;
+}
+
static bool vfio_switchover_ack_needed(void *opaque)
{
VFIODevice *vbasedev = opaque;
@@ -794,6 +1111,8 @@ static const SaveVMHandlers savevm_vfio_handlers = {
.load_setup = vfio_load_setup,
.load_cleanup = vfio_load_cleanup,
.load_state = vfio_load_state,
+ .load_state_buffer = vfio_load_state_buffer,
+ .load_finish = vfio_load_finish,
.switchover_ack_needed = vfio_switchover_ack_needed,
};
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 814000796687..7f224e4d240f 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -148,9 +148,16 @@ vfio_display_edid_write_error(void) ""
# migration.c
vfio_load_cleanup(const char *name) " (%s)"
-vfio_load_device_config_state(const char *name) " (%s)"
+vfio_load_device_config_state_start(const char *name) " (%s)"
+vfio_load_device_config_state_end(const char *name) " (%s)"
vfio_load_state(const char *name, uint64_t data) " (%s) data 0x%"PRIx64
vfio_load_state_device_data(const char *name, uint64_t data_size, int ret) " (%s) size 0x%"PRIx64" ret %d"
+vfio_load_state_device_buffer_incoming(const char *name, uint32_t idx) " (%s) idx %"PRIu32
+vfio_load_state_device_buffer_start(const char *name) " (%s)"
+vfio_load_state_device_buffer_starved(const char *name, uint32_t idx) " (%s) idx %"PRIu32
+vfio_load_state_device_buffer_load_start(const char *name, uint32_t idx) " (%s) idx %"PRIu32
+vfio_load_state_device_buffer_load_end(const char *name, uint32_t idx) " (%s) idx %"PRIu32
+vfio_load_state_device_buffer_end(const char *name) " (%s)"
vfio_migration_realize(const char *name) " (%s)"
vfio_migration_set_device_state(const char *name, const char *state) " (%s) state %s"
vfio_migration_set_state(const char *name, const char *new_state, const char *recover_state) " (%s) new state %s, recover state %s"
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 510818f4dae3..aa8476a859a6 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -74,6 +74,20 @@ typedef struct VFIOMigration {
bool save_iterate_run;
bool save_iterate_empty_hit;
+ QemuThread load_bufs_thread;
+ Error *load_bufs_thread_errp;
+ bool load_bufs_thread_started;
+ bool load_bufs_thread_finished;
+ bool load_bufs_thread_want_exit;
+
+ GArray *load_bufs;
+ bool load_bufs_device_ready;
+ QemuCond load_bufs_device_ready_cond;
+ QemuCond load_bufs_buffer_ready_cond;
+ QemuMutex load_bufs_mutex;
+ uint32_t load_buf_idx;
+ uint32_t load_buf_idx_last;
+ bool config_state_loaded_to_dev;
} VFIOMigration;
struct VFIOGroup;
^ permalink raw reply related [flat|nested] 29+ messages in thread
* [PATCH v1 12/13] vfio/migration: Add x-migration-multifd-transfer VFIO property
2024-06-18 16:12 [PATCH v1 00/13] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
` (10 preceding siblings ...)
2024-06-18 16:12 ` [PATCH v1 11/13] vfio/migration: Multifd device state transfer support - receive side Maciej S. Szmigiero
@ 2024-06-18 16:12 ` Maciej S. Szmigiero
2024-06-18 16:12 ` [PATCH v1 13/13] vfio/migration: Multifd device state transfer support - send side Maciej S. Szmigiero
2024-06-23 20:27 ` [PATCH v1 00/13] Multifd device state transfer support with VFIO consumer Peter Xu
13 siblings, 0 replies; 29+ messages in thread
From: Maciej S. Szmigiero @ 2024-06-18 16:12 UTC (permalink / raw)
To: Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
This property allows configuring at runtime whether to send the
particular device state via multifd channels when live migrating that
device.
It is ignored on the receive side and defaults to "false" for bit stream
compatibility with older QEMU versions.
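For illustration, the property can be enabled on the migration source on
the command line like this (the host address below is just an example):

  -device vfio-pci,host=0000:01:00.0,x-migration-multifd-transfer=on

Since the property is backed by a runtime-mutable bool (realized_set_allowed
is set), it can also be changed after the device is realized, e.g. via
qom-set, right before starting the migration.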
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
hw/vfio/pci.c | 7 +++++++
include/hw/vfio/vfio-common.h | 1 +
2 files changed, 8 insertions(+)
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 74a79bdf61f9..e2ac1db96002 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -3346,6 +3346,8 @@ static void vfio_instance_init(Object *obj)
pci_dev->cap_present |= QEMU_PCI_CAP_EXPRESS;
}
+static PropertyInfo qdev_prop_bool_mutable;
+
static Property vfio_pci_dev_properties[] = {
DEFINE_PROP_PCI_HOST_DEVADDR("host", VFIOPCIDevice, host),
DEFINE_PROP_UUID_NODEFAULT("vf-token", VFIOPCIDevice, vf_token),
@@ -3367,6 +3369,8 @@ static Property vfio_pci_dev_properties[] = {
VFIO_FEATURE_ENABLE_IGD_OPREGION_BIT, false),
DEFINE_PROP_ON_OFF_AUTO("enable-migration", VFIOPCIDevice,
vbasedev.enable_migration, ON_OFF_AUTO_AUTO),
+ DEFINE_PROP("x-migration-multifd-transfer", VFIOPCIDevice,
+ vbasedev.migration_multifd_transfer, qdev_prop_bool_mutable, bool),
DEFINE_PROP_BOOL("migration-events", VFIOPCIDevice,
vbasedev.migration_events, false),
DEFINE_PROP_BOOL("x-no-mmap", VFIOPCIDevice, vbasedev.no_mmap, false),
@@ -3464,6 +3468,9 @@ static const TypeInfo vfio_pci_nohotplug_dev_info = {
static void register_vfio_pci_dev_type(void)
{
+ qdev_prop_bool_mutable = qdev_prop_bool;
+ qdev_prop_bool_mutable.realized_set_allowed = true;
+
type_register_static(&vfio_pci_dev_info);
type_register_static(&vfio_pci_nohotplug_dev_info);
}
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index aa8476a859a6..bc85891d8fff 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -132,6 +132,7 @@ typedef struct VFIODevice {
bool no_mmap;
bool ram_block_discard_allowed;
OnOffAuto enable_migration;
+ bool migration_multifd_transfer;
bool migration_events;
VFIODeviceOps *ops;
unsigned int num_irqs;
^ permalink raw reply related [flat|nested] 29+ messages in thread
* [PATCH v1 13/13] vfio/migration: Multifd device state transfer support - send side
2024-06-18 16:12 [PATCH v1 00/13] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
` (11 preceding siblings ...)
2024-06-18 16:12 ` [PATCH v1 12/13] vfio/migration: Add x-migration-multifd-transfer VFIO property Maciej S. Szmigiero
@ 2024-06-18 16:12 ` Maciej S. Szmigiero
2024-06-23 20:27 ` [PATCH v1 00/13] Multifd device state transfer support with VFIO consumer Peter Xu
13 siblings, 0 replies; 29+ messages in thread
From: Maciej S. Szmigiero @ 2024-06-18 16:12 UTC (permalink / raw)
To: Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
Implement the multifd device state transfer via an additional per-device
thread spawned from the save_live_complete_precopy_begin handler.
Switch between doing the data transfer in the new handler and doing it
in the old save_state handler depending on the
x-migration-multifd-transfer device property value.
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
hw/vfio/migration.c | 207 ++++++++++++++++++++++++++++++++++
hw/vfio/trace-events | 3 +
include/hw/vfio/vfio-common.h | 9 ++
3 files changed, 219 insertions(+)
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 719e36800ab5..28a835f8a945 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -643,6 +643,16 @@ static int vfio_save_setup(QEMUFile *f, void *opaque, Error **errp)
uint64_t stop_copy_size = VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE;
int ret;
+ /* Make a copy of this setting at the start in case it is changed mid-migration */
+ migration->multifd_transfer = vbasedev->migration_multifd_transfer;
+
+ if (migration->multifd_transfer && !migration_has_device_state_support()) {
+ error_setg(errp,
+ "%s: Multifd device transfer requested but unsupported in the current config",
+ vbasedev->name);
+ return -EINVAL;
+ }
+
qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
vfio_query_stop_copy_size(vbasedev, &stop_copy_size);
@@ -692,6 +702,8 @@ static int vfio_save_setup(QEMUFile *f, void *opaque, Error **errp)
return ret;
}
+static void vfio_save_complete_precopy_async_thread_thread_terminate(VFIODevice *vbasedev);
+
static void vfio_save_cleanup(void *opaque)
{
VFIODevice *vbasedev = opaque;
@@ -699,6 +711,8 @@ static void vfio_save_cleanup(void *opaque)
Error *local_err = NULL;
int ret;
+ vfio_save_complete_precopy_async_thread_thread_terminate(vbasedev);
+
/*
* Changing device state from STOP_COPY to STOP can take time. Do it here,
* after migration has completed, so it won't increase downtime.
@@ -712,6 +726,7 @@ static void vfio_save_cleanup(void *opaque)
}
}
+ g_clear_pointer(&migration->idstr, g_free);
g_free(migration->data_buffer);
migration->data_buffer = NULL;
migration->precopy_init_size = 0;
@@ -823,10 +838,17 @@ static int vfio_save_iterate(QEMUFile *f, void *opaque)
static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
{
VFIODevice *vbasedev = opaque;
+ VFIOMigration *migration = vbasedev->migration;
ssize_t data_size;
int ret;
Error *local_err = NULL;
+ if (migration->multifd_transfer) {
+ /* Emit dummy NOP data */
+ qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
+ return 0;
+ }
+
trace_vfio_save_complete_precopy_started(vbasedev->name);
/* We reach here with device state STOP or STOP_COPY only */
@@ -852,12 +874,188 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
return ret;
}
+static int vfio_save_complete_precopy_async_thread_config_state(VFIODevice *vbasedev, uint32_t idx)
+{
+ VFIOMigration *migration = vbasedev->migration;
+ g_autoptr(QIOChannelBuffer) bioc = NULL;
+ QEMUFile *f = NULL;
+ int ret;
+ g_autofree VFIODeviceStatePacket *packet = NULL;
+ size_t packet_len;
+
+ bioc = qio_channel_buffer_new(0);
+ qio_channel_set_name(QIO_CHANNEL(bioc), "vfio-device-config-save");
+
+ f = qemu_file_new_output(QIO_CHANNEL(bioc));
+
+ ret = vfio_save_device_config_state(f, vbasedev, NULL);
+ if (ret) {
+ return ret;
+ }
+
+ ret = qemu_fflush(f);
+ if (ret) {
+ goto ret_close_file;
+ }
+
+ packet_len = sizeof(*packet) + bioc->usage;
+ packet = g_malloc0(packet_len);
+ packet->idx = idx;
+ packet->flags = VFIO_DEVICE_STATE_CONFIG_STATE;
+ memcpy(&packet->data, bioc->data, bioc->usage);
+
+ ret = multifd_queue_device_state(migration->idstr, migration->instance_id,
+ (char *)packet, packet_len);
+
+ bytes_transferred += packet_len;
+
+ret_close_file:
+ g_clear_pointer(&f, qemu_fclose);
+ return ret;
+}
+
+static void *vfio_save_complete_precopy_async_thread(void *opaque)
+{
+ VFIODevice *vbasedev = opaque;
+ VFIOMigration *migration = vbasedev->migration;
+ int *ret = &migration->save_complete_precopy_thread_ret;
+ g_autofree VFIODeviceStatePacket *packet = NULL;
+ uint32_t idx;
+
+ /* We reach here with device state STOP or STOP_COPY only */
+ *ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP_COPY,
+ VFIO_DEVICE_STATE_STOP, NULL);
+ if (*ret) {
+ return NULL;
+ }
+
+ packet = g_malloc0(sizeof(*packet) + migration->data_buffer_size);
+
+ for (idx = 0; ; idx++) {
+ ssize_t data_size;
+ size_t packet_size;
+
+ data_size = read(migration->data_fd, &packet->data,
+ migration->data_buffer_size);
+ if (data_size < 0) {
+ if (errno != ENOMSG) {
+ *ret = -errno;
+ return NULL;
+ }
+
+ /*
+ * Pre-copy emptied all the device state for now. For more information,
+ * please refer to the Linux kernel VFIO uAPI.
+ */
+ data_size = 0;
+ }
+
+ if (data_size == 0)
+ break;
+
+ packet->idx = idx;
+ packet_size = sizeof(*packet) + data_size;
+
+ *ret = multifd_queue_device_state(migration->idstr, migration->instance_id,
+ (char *)packet, packet_size);
+ if (*ret) {
+ return NULL;
+ }
+
+ bytes_transferred += packet_size;
+ }
+
+ *ret = vfio_save_complete_precopy_async_thread_config_state(vbasedev, idx);
+ if (*ret) {
+ return NULL;
+ }
+
+ trace_vfio_save_complete_precopy_async_finished(vbasedev->name);
+
+ return NULL;
+}
+
+static int vfio_save_complete_precopy_begin(QEMUFile *f,
+ char *idstr, uint32_t instance_id,
+ void *opaque)
+{
+ VFIODevice *vbasedev = opaque;
+ VFIOMigration *migration = vbasedev->migration;
+ int ret;
+
+ migration->save_complete_precopy_thread_ret = 0;
+
+ if (!migration->multifd_transfer) {
+ /* Emit dummy NOP data */
+ qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
+ return 0;
+ }
+
+ qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE_COMPLETE);
+ qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
+
+ ret = qemu_fflush(f);
+ if (ret) {
+ return ret;
+ }
+
+ assert(!migration->save_complete_precopy_thread_started);
+
+ assert(!migration->idstr);
+ migration->idstr = g_strdup(idstr);
+ migration->instance_id = instance_id;
+
+ qemu_thread_create(&migration->save_complete_precopy_thread,
+ "vfio-save_complete_precopy",
+ vfio_save_complete_precopy_async_thread,
+ opaque, QEMU_THREAD_JOINABLE);
+
+ migration->save_complete_precopy_thread_started = true;
+
+ trace_vfio_save_complete_precopy_async_started(vbasedev->name, idstr, instance_id);
+
+ return 0;
+}
+
+static void vfio_save_complete_precopy_async_thread_thread_terminate(VFIODevice *vbasedev)
+{
+ VFIOMigration *migration = vbasedev->migration;
+
+ if (!migration->save_complete_precopy_thread_started) {
+ return;
+ }
+
+ qemu_thread_join(&migration->save_complete_precopy_thread);
+
+ migration->save_complete_precopy_thread_started = false;
+
+ trace_vfio_save_complete_precopy_async_joined(vbasedev->name,
+ migration->save_complete_precopy_thread_ret);
+}
+
+static int vfio_save_complete_precopy_end(QEMUFile *f, void *opaque)
+{
+ VFIODevice *vbasedev = opaque;
+ VFIOMigration *migration = vbasedev->migration;
+
+ vfio_save_complete_precopy_async_thread_thread_terminate(vbasedev);
+
+ return migration->save_complete_precopy_thread_ret;
+}
+
static void vfio_save_state(QEMUFile *f, void *opaque)
{
VFIODevice *vbasedev = opaque;
+ VFIOMigration *migration = vbasedev->migration;
Error *local_err = NULL;
int ret;
+ if (migration->multifd_transfer) {
+ /* Emit dummy NOP data */
+ qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
+ return;
+ }
+
ret = vfio_save_device_config_state(f, opaque, &local_err);
if (ret) {
error_prepend(&local_err,
@@ -1106,6 +1304,8 @@ static const SaveVMHandlers savevm_vfio_handlers = {
.state_pending_exact = vfio_state_pending_exact,
.is_active_iterate = vfio_is_active_iterate,
.save_live_iterate = vfio_save_iterate,
+ .save_live_complete_precopy_begin = vfio_save_complete_precopy_begin,
+ .save_live_complete_precopy_end = vfio_save_complete_precopy_end,
.save_live_complete_precopy = vfio_save_complete_precopy,
.save_state = vfio_save_state,
.load_setup = vfio_load_setup,
@@ -1127,6 +1327,10 @@ static void vfio_vmstate_change_prepare(void *opaque, bool running,
Error *local_err = NULL;
int ret;
+ if (running) {
+ vfio_save_complete_precopy_async_thread_thread_terminate(vbasedev);
+ }
+
new_state = migration->device_state == VFIO_DEVICE_STATE_PRE_COPY ?
VFIO_DEVICE_STATE_PRE_COPY_P2P :
VFIO_DEVICE_STATE_RUNNING_P2P;
@@ -1153,6 +1357,9 @@ static void vfio_vmstate_change(void *opaque, bool running, RunState state)
int ret;
if (running) {
+ /* In case "prepare" callback wasn't registered */
+ vfio_save_complete_precopy_async_thread_thread_terminate(vbasedev);
+
new_state = VFIO_DEVICE_STATE_RUNNING;
} else {
new_state =
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 7f224e4d240f..569bb02434f1 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -166,6 +166,9 @@ vfio_save_block(const char *name, int data_size) " (%s) data_size %d"
vfio_save_cleanup(const char *name) " (%s)"
vfio_save_complete_precopy(const char *name, int ret) " (%s) ret %d"
vfio_save_complete_precopy_started(const char *name) " (%s)"
+vfio_save_complete_precopy_async_started(const char *name, const char *idstr, uint32_t instance_id) " (%s) idstr %s instance %"PRIu32
+vfio_save_complete_precopy_async_finished(const char *name) " (%s)"
+vfio_save_complete_precopy_async_joined(const char *name, int ret) " (%s) ret %d"
vfio_save_device_config_state(const char *name) " (%s)"
vfio_save_iterate(const char *name, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64
vfio_save_iterate_started(const char *name) " (%s)"
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index bc85891d8fff..2d76d3fc8bba 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -70,16 +70,25 @@ typedef struct VFIOMigration {
uint64_t mig_flags;
uint64_t precopy_init_size;
uint64_t precopy_dirty_size;
+ bool multifd_transfer;
bool initial_data_sent;
bool save_iterate_run;
bool save_iterate_empty_hit;
+
+ QemuThread save_complete_precopy_thread;
+ int save_complete_precopy_thread_ret;
+ bool save_complete_precopy_thread_started;
+
QemuThread load_bufs_thread;
Error *load_bufs_thread_errp;
bool load_bufs_thread_started;
bool load_bufs_thread_finished;
bool load_bufs_thread_want_exit;
+ char *idstr;
+ uint32_t instance_id;
+
GArray *load_bufs;
bool load_bufs_device_ready;
QemuCond load_bufs_device_ready_cond;
^ permalink raw reply related [flat|nested] 29+ messages in thread
* Re: [PATCH v1 00/13] Multifd device state transfer support with VFIO consumer
2024-06-18 16:12 [PATCH v1 00/13] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
` (12 preceding siblings ...)
2024-06-18 16:12 ` [PATCH v1 13/13] vfio/migration: Multifd device state transfer support - send side Maciej S. Szmigiero
@ 2024-06-23 20:27 ` Peter Xu
2024-06-24 19:51 ` Maciej S. Szmigiero
13 siblings, 1 reply; 29+ messages in thread
From: Peter Xu @ 2024-06-23 20:27 UTC (permalink / raw)
To: Maciej S. Szmigiero
Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On Tue, Jun 18, 2024 at 06:12:18PM +0200, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>
> This is an updated v1 patch series of the RFC (v0) series located here:
> https://lore.kernel.org/qemu-devel/cover.1713269378.git.maciej.szmigiero@oracle.com/
OK I took some hours thinking about this today, and here's some high level
comments for this series. I'll start with the ones most relevant to what
Fabiano has already suggested in the other thread, then I'll add some more.
https://lore.kernel.org/r/20240620212111.29319-1-farosas@suse.de
1. Multifd device state support
===============================
As Fabiano suggested in his RFC post, we may need one more layer of
abstraction to represent VFIO's demand on allowing multifd to send
arbitrary buffer to the wire. This can be more than "how to pass the
device state buffer to the sender threads".
So far, MultiFDMethods is only about RAM. If you pull the latest master
branch Fabiano just merged yet two more RAM compressors that are extended
on top of MultiFDMethods model. However still they're all about RAM. I
think it's better to keep it this way, so maybe MultiFDMethods should some
day be called MultiFDRamMethods.
multifd_send_fill_packet() may only be suitable for RAM buffers, not adhoc
buffers like what VFIO is using. multifd_send_zero_page_detect() may not be
needed either for arbitrary buffers. Most of those are still page-based.
I think it also means we shouldn't call ->send_prepare() when multifd send
thread notices that it's going to send a VFIO buffer. So it should look
like this:
int type = multifd_payload_type(p->data);
if (type == MULTIFD_PAYLOAD_RAM) {
multifd_send_state->ops->send_prepare(p, &local_err);
} else {
// VFIO buffers should belong here
assert(type == MULTIFD_PAYLOAD_DEVICE_STATE);
...
}
It also means it shouldn't contain code like:
nocomp_send_prepare():
if (p->is_device_state_job) {
return nocomp_send_prepare_device_state(p, errp);
} else {
return nocomp_send_prepare_ram(p, errp);
}
nocomp should only exist in RAM world, not VFIO's.
And it looks like you agree with Fabiano's RFC proposal, please work on top
of that to provide that layer. Please make sure it outputs the minimum in
"$ git grep device_state migration/multifd.c" when you work on the new
version. Currently:
$ git grep device_state migration/multifd.c | wc -l
59
The hope is zero, or at least a minimum with good reasons.
2. Frequent mallocs/frees
=========================
Fabiano's series can also help to address some of these, but it looks like
this series used malloc/free more than the opaque data buffer. This is not
required to get things merged, but it'll be nice to avoid those if possible.
3. load_state_buffer() and VFIODeviceStatePacket protocol
=========================================================
VFIODeviceStatePacket is the new protocol you introduced into multifd
packets, along with the new load_state_buffer() hook for loading such
buffers. My question is whether it's needed at all, or.. whether it can be
more generic (and also easier) to just allow taking any device state in the
multifd packets, then load it with vmstate load().
I mean, the vmstate_load() should really have worked on these buffers, if
after all VFIO is looking for: (1) VFIO_MIG_FLAG_DEV_DATA_STATE as the
first flag (uint64), size as the 2nd, then (2) load that rest buffer into
VFIO kernel driver. That is the same to happen during the blackout window.
It's not clear to me why load_state_buffer() is needed.
I also see that you're also using exactly the same chunk size for such
buffering (VFIOMigration.data_buffer_size).
I think you have a "reason": VFIODeviceStatePacket and loading of the
buffer data resolved one major issue that wasn't there before but we start to
have now: multifd allows concurrent arrivals of vfio buffers, even if the
buffer *must* be sequentially loaded.
That's a major pain for current VFIO kernel ioctl design, IMHO. I think I
used to ask nVidia people on whether the VFIO get_state/set_state interface
can allow indexing or tagging of buffers but I never got a real response.
IMHO that'll be extremely helpful for migration purpose on concurrency if
it can happen, rather than using a serialized buffer. It means
concurrently save/load one VFIO device could be extremely hard, if not
impossible.
Now in your series IIUC you resolved that by using vfio_load_bufs_thread(),
holding off the load process but only until sequential buffers are
received. I think that causes one issue that I'll mention below as a
separate topic. But besides that, my point is, this is not the reason that
you need to introduce VFIODeviceStatePacket, load_state_buffer() and so on.
My understanding is that we do need one way to re-serialize the buffers,
but it doesn't need load_state_buffer(), instead it can call vmstate_load()
in order, properly invoke vfio_load_state() with the right buffers. It'll
just be nice if VFIO can keep its "load state" logic at one place.
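To illustrate (a rough sketch only - "data"/"len" here stand for one
re-ordered contiguous device state blob and the version_id is a
placeholder), such a blob could be fed through the existing path much like
patch 11 already does for the config state:

  g_autoptr(QIOChannelBuffer) bioc = qio_channel_buffer_new(len);
  QEMUFile *f_out = qemu_file_new_output(QIO_CHANNEL(bioc));
  QEMUFile *f_in;

  qemu_put_buffer(f_out, (uint8_t *)data, len);
  qemu_fflush(f_out);
  qio_channel_io_seek(QIO_CHANNEL(bioc), 0, 0, NULL);

  f_in = qemu_file_new_input(QIO_CHANNEL(bioc));
  ret = vfio_load_state(f_in, vbasedev, 1); /* or go through vmstate_load() */
  qemu_fclose(f_out);
  qemu_fclose(f_in);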
One benefit of that is with such a more generic framework, QEMU can easily
extend this infra to other device states, so that logically we can consider
sending non-VFIO device states also in the multifd buffers. However with
your current solution, new structures are needed, new hooks, a lot of new
code around, yet fewer problems solved.. That's not optimal.
4. Risk of OOM on unlimited VFIO buffering
==========================================
This follows from the above bullet, but my pure question to ask here is how
does VFIO guarantee no OOM condition by buffering VFIO state?
I mean, currently your proposal used vfio_load_bufs_thread() as a separate
thread to only load the vfio states until sequential data is received,
however is there an upper limit of how much buffering it could do? IOW:
vfio_load_state_buffer():
if (packet->idx >= migration->load_bufs->len) {
g_array_set_size(migration->load_bufs, packet->idx + 1);
}
lb = &g_array_index(migration->load_bufs, typeof(*lb), packet->idx);
...
lb->data = g_memdup2(&packet->data, data_size - sizeof(*packet));
lb->len = data_size - sizeof(*packet);
lb->is_present = true;
What if the GArray keeps growing with lb->data allocated, which triggers the
memcg limit of the process (if QEMU is in such a process)? Or it could just
deplete host memory, causing an OOM kill.
I think we may need to find a way to throttle max memory usage of such
buffering.
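Just to illustrate the direction (the field and the limit below are made up,
they don't exist in this series), vfio_load_state_buffer() could for example
refuse to queue more data once a cap is reached:

  /* illustrative only: cap the amount of not-yet-loaded buffered data */
  if (migration->load_bufs_queued_bytes + data_size >
      VFIO_MIG_MAX_QUEUED_BUFFER_BYTES) {
      error_setg(errp, "too much VFIO state buffered (%zu bytes)",
                 migration->load_bufs_queued_bytes);
      return -1;
  }
  migration->load_bufs_queued_bytes += data_size;

Or, instead of failing, it could wait on a condition variable until the load
thread has consumed enough of the queued buffers.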
So far this will be more of a problem indeed if this will be done during
VFIO iteration phases, but I still hope a solution can work with both
iteration phase and the switchover phase, even if you only do that in
switchover phase (and I don't know why you don't care about VFIO iteration
phase, if you cared enough on how VFIO works now with migration.. literally
that should help VFIO migrate faster on 25G+ networks, with/without a
shorter blackout window).
5. Worker thread model
======================
I'm so far not happy with what this proposal suggests on creating the
threads, also the two new hooks mostly just to create these threads..
I know I suggested that.. but that's comparing to what I read in the even
earlier version, and sorry I wasn't able to suggest something better at
that time because I simply thought less.
As I mentioned in the other reply elsewhere, I think we should firstly have
these threads ready to take data at the start of migration, so that it'll
work when someone wants to add vfio iteration support. Then the jobs
(mostly what vfio_save_complete_precopy_async_thread() does now) can be
enqueued into the thread pools.
It's better to create the thread pool owned by migration, rather than
threads owned by VFIO, because it also paves the way for non-VFIO device state
save()s, as I mentioned also above on the multifd packet header. Maybe we
can have a flag in the packet header saying "this is device xxx's state,
just load it".
I'd start looking at util/thread-pool.c, removing all the AIO implications
but simply provide a raw thread pool for what thread_pool_submit() is
doing.
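The interface shape I'm thinking of is roughly this (all names here are
invented for illustration; it's not an existing API):

  typedef struct MigrationThreadPool MigrationThreadPool;

  /* created once, early during migration setup */
  MigrationThreadPool *migration_thread_pool_new(unsigned int nthreads);

  /* enqueue one device state save/load job */
  void migration_thread_pool_submit(MigrationThreadPool *pool,
                                    void (*fn)(void *opaque), void *opaque);

  /* wait until all queued jobs have completed, e.g. at switchover */
  void migration_thread_pool_wait(MigrationThreadPool *pool);

  void migration_thread_pool_free(MigrationThreadPool *pool);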
I know this is a lot, but I really think this is the right thing.. but we
can discuss, and you can correct me on my mistakes if there are any.
If you want I can have a look at this pool model and prepare a patch, so
you can work on other vfio relevant stuff and pick that up, if that helps
you reach the goal of landing this whole stuff in 9.1.
But I hope I explained more or less in this email why I think this
feature is more involved than it looks, and not yet mature in design.
And I hope I'm not purely asking for too much: merging this VFIO series
first and then refactoring on top can mean dropping too much unneeded code
after adding it, not to mention that the protocol will need another break.
It just doesn't sound like the right thing to do.
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v1 00/13] Multifd device state transfer support with VFIO consumer
2024-06-23 20:27 ` [PATCH v1 00/13] Multifd device state transfer support with VFIO consumer Peter Xu
@ 2024-06-24 19:51 ` Maciej S. Szmigiero
2024-06-25 17:25 ` Peter Xu
0 siblings, 1 reply; 29+ messages in thread
From: Maciej S. Szmigiero @ 2024-06-24 19:51 UTC (permalink / raw)
To: Peter Xu
Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P. Berrangé, Avihai Horon,
Joao Martins, qemu-devel
Hi Peter,
On 23.06.2024 22:27, Peter Xu wrote:
> On Tue, Jun 18, 2024 at 06:12:18PM +0200, Maciej S. Szmigiero wrote:
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> This is an updated v1 patch series of the RFC (v0) series located here:
>> https://lore.kernel.org/qemu-devel/cover.1713269378.git.maciej.szmigiero@oracle.com/
>
> OK I took some hours thinking about this today, and here's some high level
> comments for this series. I'll start with the ones most relevant to what
> Fabiano has already suggested in the other thread, then I'll add some more.
>
> https://lore.kernel.org/r/20240620212111.29319-1-farosas@suse.de
That's a long list, thanks for these comments.
I have responded to them inline below.
> 1. Multifd device state support
> ===============================
>
> As Fabiano suggested in his RFC post, we may need one more layer of
> abstraction to represent VFIO's demand on allowing multifd to send
> arbitrary buffer to the wire. This can be more than "how to pass the
> device state buffer to the sender threads".
>
> So far, MultiFDMethods is only about RAM. If you pull the latest master
> branch Fabiano just merged yet two more RAM compressors that are extended
> on top of MultiFDMethods model. However still they're all about RAM. I
> think it's better to keep it this way, so maybe MultiFDMethods should some
> day be called MultiFDRamMethods.
>
> multifd_send_fill_packet() may only be suitable for RAM buffers, not adhoc
> buffers like what VFIO is using. multifd_send_zero_page_detect() may not be
> needed either for arbitrary buffers. Most of those are still page-based.
>
> I think it also means we shouldn't call ->send_prepare() when multifd send
> thread notices that it's going to send a VFIO buffer. So it should look
> like this:
>
> int type = multifd_payload_type(p->data);
> if (type == MULTIFD_PAYLOAD_RAM) {
> multifd_send_state->ops->send_prepare(p, &local_err);
> } else {
> // VFIO buffers should belong here
> assert(type == MULTIFD_PAYLOAD_DEVICE_STATE);
> ...
> }
>
> It also means it shouldn't contain code like:
>
> nocomp_send_prepare():
> if (p->is_device_state_job) {
> return nocomp_send_prepare_device_state(p, errp);
> } else {
> return nocomp_send_prepare_ram(p, errp);
> }
>
> nocomp should only exist in RAM world, not VFIO's.
>
> And it looks like you agree with Fabiano's RFC proposal, please work on top
> of that to provide that layer. Please make sure it outputs the minimum in
> "$ git grep device_state migration/multifd.c" when you work on the new
> version. Currently:
>
> $ git grep device_state migration/multifd.c | wc -l
> 59
>
> The hope is zero, or at least a minimum with good reasons.
I guess you mean "grep -i" in the above example, since otherwise
the above command will find only lowercase "device_state".
On the other hand, your example code above has uppercase
"DEVICE_STATE", suggesting that it might be okay?
Overall, using Fabiano's patch set as a base for mine makes sense to me.
> 2. Frequent mallocs/frees
> =========================
>
> Fabiano's series can also help to address some of these, but it looks like
> this series used malloc/free more than the opaque data buffer. This is not
> required to get things merged, but it'll be nice to avoid those if possible.
Ack - as long as it's not making the code messy/fragile, of course.
> 3. load_state_buffer() and VFIODeviceStatePacket protocol
> =========================================================
>
> VFIODeviceStatePacket is the new protocol you introduced into multifd
> packets, along with the new load_state_buffer() hook for loading such
> buffers. My question is whether it's needed at all, or.. whether it can be
> more generic (and also easier) to just allow taking any device state in the
> multifd packets, then load it with vmstate load().
>
> I mean, the vmstate_load() should really have worked on these buffers, if
> after all VFIO is looking for: (1) VFIO_MIG_FLAG_DEV_DATA_STATE as the
> first flag (uint64), size as the 2nd, then (2) load that rest buffer into
> VFIO kernel driver. That is the same to happen during the blackout window.
> It's not clear to me why load_state_buffer() is needed.
>
> I also see that you're also using exactly the same chunk size for such
> buffering (VFIOMigration.data_buffer_size).
>
> I think you have a "reason": VFIODeviceStatePacket and loading of the
> buffer data resolved one major issue that wasn't there before but we start to
> have now: multifd allows concurrent arrivals of vfio buffers, even if the
> buffer *must* be sequentially loaded.
>
> That's a major pain for current VFIO kernel ioctl design, IMHO. I think I
> used to ask nVidia people on whether the VFIO get_state/set_state interface
> can allow indexing or tagging of buffers but I never got a real response.
> IMHO that'll be extremely helpful for migration purpose on concurrency if
> it can happen, rather than using a serialized buffer. It means
> concurrently save/load one VFIO device could be extremely hard, if not
> impossible.
I am pretty sure that the current kernel VFIO interface requires the
buffers to be loaded in order - accidentally providing them out of order
definitely breaks the restore procedure.
> Now in your series IIUC you resolved that by using vfio_load_bufs_thread(),
> holding off the load process but only until sequential buffers are
> received. I think that causes one issue that I'll mention below as a
> separate topic. But besides that, my point is, this is not the reason that
> you need to introduce VFIODeviceStatePacket, load_state_buffer() and so on.
> My understanding is that we do need one way to re-serialize the buffers,
> but it doesn't need load_state_buffer(), instead it can call vmstate_load()
> in order, properly invoke vfio_load_state() with the right buffers. It'll
> just be nice if VFIO can keep its "load state" logic at one place.
Re-using the .load_state hook for multifd device state data has a few
additional issues:
* This hook accepts a QEMUFile parameter, not a buffer.
* Due to the above, it (and the functions it calls) expects to be able to
read all the required data in one go.
In other words, there's no way for this hook to suspend its execution and
return because the next piece of data it wants hasn't arrived yet.
Specifically, this hook is only able to exit at VFIO_MIG_FLAG_END_OF_STATE
boundaries in the incoming stream.
* The hook is expected to be called from the main migration thread, and
so calls to it are expected to be effectively serialized.
It can also safely call core QEMU functions, like it does from
vfio_load_device_config_state() -> vfio_pci_load_config() -> vmstate_load_state().
This actually fails when called from any other thread (in some memory region
modification function, as far as I remember).
In contrast to that, .load_state_buffer hook is prepared to deal with getting
calls from multiple multifd receive threads.
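For reference, these are the shapes of the two hooks as implemented by VFIO
in this series:

  /* existing hook: QEMUFile based, called from the main migration thread */
  int (*load_state)(QEMUFile *f, void *opaque, int version_id);

  /* new hook: buffer based, callable from the multifd receive threads */
  int (*load_state_buffer)(void *opaque, char *data, size_t data_size,
                           Error **errp);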
> One benefit of that is with such a more generic framework, QEMU can easily
> extend this infra to other device states, so that logically we can consider
> sending non-VFIO device states also in the multifd buffers. However with
> your current solution, new structures are needed, new hooks, a lot of new
> code around, yet fewer problems solved.. That's not optimal.
This all relies on a premise that the current .load_state hooks can be
efficiently called from multiple multifd receive threads, which isn't true
today.
Due to the reasons I specified above I think modifying these existing
hooks would be more complex than just introducing a new one with the
proper semantics.
> 4. Risk of OOM on unlimited VFIO buffering
> ==========================================
>
> This follows from the above bullet, but my pure question to ask here is how
> does VFIO guarantee no OOM condition by buffering VFIO state?
>
> I mean, currently your proposal used vfio_load_bufs_thread() as a separate
> thread to only load the vfio states until sequential data is received,
> however is there an upper limit of how much buffering it could do? IOW:
>
> vfio_load_state_buffer():
>
> if (packet->idx >= migration->load_bufs->len) {
> g_array_set_size(migration->load_bufs, packet->idx + 1);
> }
>
> lb = &g_array_index(migration->load_bufs, typeof(*lb), packet->idx);
> ...
> lb->data = g_memdup2(&packet->data, data_size - sizeof(*packet));
> lb->len = data_size - sizeof(*packet);
> lb->is_present = true;
>
> What if the GArray keeps growing with lb->data allocated, which triggers the
> memcg limit of the process (if QEMU is in such a process)? Or it could just
> deplete host memory, causing an OOM kill.
>
> I think we may need to find a way to throttle max memory usage of such
> buffering.
>
> So far this will be more of a problem indeed if this will be done during
> VFIO iteration phases, but I still hope a solution can work with both
> iteration phase and the switchover phase, even if you only do that in
> switchover phase
Unfortunately, this issue will be hard to fix since the source can
legitimately send the very first buffer (chunk) of data as the last one
(at the very end of the transmission).
In this case, the target will need to buffer nearly all of the data.
We can't stop the receive on any channel, either, since the next missing
buffer can arrive at that channel.
However, I don't think purposely DoSing the target QEMU is a realistic
security concern in the typical live migration scenario.
I mean the source can easily force the target QEMU to exit just by
feeding it wrong migration data.
In case someone really wants to protect against the impact of
theoretically unbounded QEMU memory allocations during live migration
on the rest of the system, they can put the target QEMU process
(temporarily) into a memory-limited cgroup.
> (and I don't know why you don't care about VFIO iteration
> phase, if you cared enough on how VFIO works now with migration.. literally
> that should help VFIO migrate faster on 25G+ networks, with/without a
> shorter blackout window).
I do care about the VM live phase, too.
Just to keep the complexity in bounds for the first version I wanted to
deal with the most pressing issue first - downtime.
I am not against accommodating the VM live phase changes if they don't
significantly expand the patch set size.
> 5. Worker thread model
> ======================
>
> I'm so far not happy with what this proposal suggests on creating the
> threads, also the two new hooks mostly just to create these threads..
That the VFIO .save_live_complete_precopy_begin handler creates a new
per-device thread is an implementation detail of this particular
driver.
The whole idea behind this and the save_live_complete_precopy_end hook was
that the details of how the particular device driver does its own async
saving are abstracted away from the migration core.
The device then can do what's best / most efficient for it to do.
> I know I suggested that.. but that's comparing to what I read in the even
> earlier version, and sorry I wasn't able to suggest something better at
> that time because I simply thought less.
>
> As I mentioned in the other reply elsewhere, I think we should firstly have
> these threads ready to take data at the start of migration, so that it'll
> work when someone wants to add vfio iteration support. Then the jobs
> (mostly what vfio_save_complete_precopy_async_thread() does now) can be
> enqueued into the thread pools.
I'm not sure that we can get away with using fewer threads than devices
as these devices might not support AIO reads from their migration file
descriptor.
mlx5 devices, for example, seem to support only poll()ed / non-blocking
reads at best - with unknown performance in comparison with issuing
blocking reads from dedicated threads.
On the other hand, handling a single device from multiple threads in
parallel is generally not possible due to difficulty of establishing in
which order the buffers were read.
And if we need a per-VFIO device thread anyway then using a thread pool
doesn't help much - but brings extra complexity.
In terms of starting the loading thread earlier so that it also loads VM
live phase data, it looks like a small change to the code, so it shouldn't
be a problem.
> It's better to create the thread pool owned by migration, rather than
> threads owned by VFIO, because it also paves the way for non-VFIO device state
> save()s, as I mentioned also above on the multifd packet header. Maybe we
> can have a flag in the packet header saying "this is device xxx's state,
> just load it".
I think the same could be done by simply implementing these hooks in other
device types than VFIO, right?
And if we notice that these implementations share a bit of code then we
can think about making a common helper library out of this code.
After all, that's just an implementation detail that does not impact
the underlying bit stream protocol.
> I'd start looking at util/thread-pool.c, removing all the AIO implications
> but simply provide a raw thread pool for what thread_pool_submit() is
> doing.
>
> I know this is a lot, but I really think this is the right thing.. but we
> can discuss, and you can correct me on my mistakes if there are any.
>
> If you want I can have a look at this pool model and prepare a patch, so
> you can work on other vfio relevant stuff and pick that up, if that helps
> you reach the goal of landing this whole stuff in 9.1.
It would certainly help with the amount of work (thanks for the offer!),
but as I wrote above I am not really convinced that adapting the existing
thread pool utils for this usage is really the way to go here.
> But I hope I explained more or less in this email why I think this
> feature is more involved than it looks, and not yet mature in design.
> And I hope I'm not purely asking for too much: merging this VFIO series
> first and then refactoring on top can mean dropping too much unneeded code
> after adding it, not to mention that the protocol will need another break.
>
> It just doesn't sound like the right thing to do.
>
> Thanks,
>
Thanks,
Maciej
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v1 00/13] Multifd device state transfer support with VFIO consumer
2024-06-24 19:51 ` Maciej S. Szmigiero
@ 2024-06-25 17:25 ` Peter Xu
2024-06-25 22:44 ` Maciej S. Szmigiero
0 siblings, 1 reply; 29+ messages in thread
From: Peter Xu @ 2024-06-25 17:25 UTC (permalink / raw)
To: Maciej S. Szmigiero
Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P. Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On Mon, Jun 24, 2024 at 09:51:18PM +0200, Maciej S. Szmigiero wrote:
> Hi Peter,
Hi, Maciej,
>
> On 23.06.2024 22:27, Peter Xu wrote:
> > On Tue, Jun 18, 2024 at 06:12:18PM +0200, Maciej S. Szmigiero wrote:
> > > From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> > >
> > > This is an updated v1 patch series of the RFC (v0) series located here:
> > > https://lore.kernel.org/qemu-devel/cover.1713269378.git.maciej.szmigiero@oracle.com/
> >
> > OK I took some hours thinking about this today, and here's some high level
> > comments for this series. I'll start with the ones most relevant to what
> > Fabiano has already suggested in the other thread, then I'll add some more.
> >
> > https://lore.kernel.org/r/20240620212111.29319-1-farosas@suse.de
>
> That's a long list, thanks for these comments.
>
> I have responded to them inline below.
>
> > 1. Multifd device state support
> > ===============================
> >
> > As Fabiano suggested in his RFC post, we may need one more layer of
> > abstraction to represent VFIO's demand on allowing multifd to send
> > arbitrary buffer to the wire. This can be more than "how to pass the
> > device state buffer to the sender threads".
> >
> > So far, MultiFDMethods is only about RAM. If you pull the latest master
> > branch Fabiano just merged yet two more RAM compressors that are extended
> > on top of MultiFDMethods model. However still they're all about RAM. I
> > think it's better to keep it this way, so maybe MultiFDMethods should some
> > day be called MultiFDRamMethods.
> >
> > multifd_send_fill_packet() may only be suitable for RAM buffers, not adhoc
> > buffers like what VFIO is using. multifd_send_zero_page_detect() may not be
> > needed either for arbitrary buffers. Most of those are still page-based.
> >
> > I think it also means we shouldn't call ->send_prepare() when multifd send
> > thread notices that it's going to send a VFIO buffer. So it should look
> > like this:
> >
> > int type = multifd_payload_type(p->data);
> > if (type == MULTIFD_PAYLOAD_RAM) {
> > multifd_send_state->ops->send_prepare(p, &local_err);
> > } else {
> > // VFIO buffers should belong here
> > assert(type == MULTIFD_PAYLOAD_DEVICE_STATE);
> > ...
> > }
> >
> > It also means it shouldn't contain code like:
> >
> > nocomp_send_prepare():
> > if (p->is_device_state_job) {
> > return nocomp_send_prepare_device_state(p, errp);
> > } else {
> > return nocomp_send_prepare_ram(p, errp);
> > }
> >
> > nocomp should only exist in RAM world, not VFIO's.
> >
> > And it looks like you agree with Fabiano's RFC proposal, please work on top
> > of that to provide that layer. Please make sure it outputs the minimum in
> > "$ git grep device_state migration/multifd.c" when you work on the new
> > version. Currently:
> >
> > $ git grep device_state migration/multifd.c | wc -l
> > 59
> >
> > The hope is zero, or at least a minimum with good reasons.
>
> I guess you mean "grep -i" in the above example, since otherwise
> the above command will find only lowercase "device_state".
>
> On the other hand, your example code above has uppercase
> "DEVICE_STATE", suggesting that it might be okay?
Yes, that's definitely ok. I may have been over-cautious when I used the
grep example, but I hope you get my point that we should remove that
device_state pointer, rather than something as generic as supporting
multifd taking device state buffers.
Especially if that idea will also apply to non-VFIO devices, e.g. when we
extend it so that a normal VMSD buffer can be delivered too using the same
mechanism, then I think it's even okay to have some "device_state" there.
Personally I would still consider those justified usages that are generic
enough; after all I don't expect multifd to send other things besides RAM
and device states in generic terms.
>
> Overall, using Fabiano's patch set as a base for mine makes sense to me.
>
> > 2. Frequent mallocs/frees
> > =========================
> >
> > Fabiano's series can also help to address some of these, but it looks like
> > this series used malloc/free more than the opaque data buffer. This is not
> > required to get things merged, but it'll be nice to avoid those if possible.
>
> Ack - as long as it's not making the code messy/fragile, of course.
>
> > 3. load_state_buffer() and VFIODeviceStatePacket protocol
> > =========================================================
> >
> > VFIODeviceStatePacket is the new protocol you introduced into multifd
> > packets, along with the new load_state_buffer() hook for loading such
> > buffers. My question is whether it's needed at all, or.. whether it can be
> > more generic (and also easier) to just allow taking any device state in the
> > multifd packets, then load it with vmstate load().
> >
> > I mean, the vmstate_load() should really have worked on these buffers, if
> > after all VFIO is looking for: (1) VFIO_MIG_FLAG_DEV_DATA_STATE as the
> > first flag (uint64), size as the 2nd, then (2) load that rest buffer into
> > VFIO kernel driver. That is the same to happen during the blackout window.
> > It's not clear to me why load_state_buffer() is needed.
> >
> > I also see that you're also using exactly the same chunk size for such
> > buffering (VFIOMigration.data_buffer_size).
> >
> > I think you have a "reason": VFIODeviceStatePacket and loading of the
> > buffer data resolved one major issue that wasn't there before but start to
> > have now: multifd allows concurrent arrivals of vfio buffers, even if the
> > buffer *must* be sequentially loaded.
> >
> > That's a major pain for current VFIO kernel ioctl design, IMHO. I think I
> > used to ask nVidia people on whether the VFIO get_state/set_state interface
> > can allow indexing or tagging of buffers but I never get a real response.
> > IMHO that'll be extremely helpful for migration purpose on concurrency if
> > it can happen, rather than using a serialized buffer. It means
> > concurrently save/load one VFIO device could be extremely hard, if not
> > impossible.
>
> I am pretty sure that the current kernel VFIO interface requires for the
> buffers to be loaded in-order - accidentally providing the out of order
> definitely breaks the restore procedure.
Ah, I didn't mean that we need to do it with the current API. I'm talking
about whether it's possible to have a v2 that will support those; otherwise
we'll need "workarounds" like what you're doing with the "buffer these on the
dest without limit, until we receive a continuous chunk of data" trick.
And even with that trick, it'll still need to be serialized on the read()
syscall, so it won't scale either if the state is huge. For that issue
there's no workaround we can do from userspace.
>
> > Now in your series IIUC you resolved that by using vfio_load_bufs_thread(),
> > holding off the load process but only until sequential buffers are
> > received. I think that causes one issue that I'll mention below as a
> > separate topic. But besides that, my point is, this is not the reason that
> > you need to introduce VFIODeviceStatePacket, load_state_buffer() and so on.
> > My understanding is that we do need one way to re-serialize the buffers,
> > but it doesn't need load_state_buffer(), instead it can call vmstate_load()
> > in order, properly invoke vfio_load_state() with the right buffers. It'll
> > just be nice if VFIO can keep its "load state" logic at one place.
>
> Re-using the .load_state hook for multifd device state date has a few
> additional issues:
> * This hook accepts a QEMUFile parameter, not a buffer.
>
> * Due to the above, it (and the functions it calls) expects being able to
> read all the required data in one go.
>
> In other words, there's no way for this hook to suspend its execution and
> return because the next piece of data it wants hasn't arrived yet.
>
> Specifically, this hook is only able to exit at VFIO_MIG_FLAG_END_OF_STATE
> boundaries in the incoming stream.
>
> * The hook is expected to be called from the main migration thread, and
> so calls to it are expected to be effectively serialized.
>
> It can also safely call core QEMU functions, like it does from
> vfio_load_device_config_state() -> vfio_pci_load_config() -> vmstate_load_state().
> This actually fails when called from any other thread (in some memory region
> modification function, as far as I remember).
>
> In contrast to that, .load_state_buffer hook is prepared to deal with getting
> calls from multiple multifd receive threads.
Ah yes, I forgot we're still using QEMUFiles to load states... that's a
pity, and that makes sense. Also, when I read this again I noticed that
indeed any channel/file based approach won't work for VFIO, at least due to
the fact that it needs to cache out-of-order buffers.
Then when I looked closer, it's a pity that the "next_packet_size" field is
not right after "flags"; that should really be part of your new
MultiFDPacketHdr_t, but indeed we can't achieve that without breaking the
current protocol.
While at it, if you want, maybe you can also start renaming MultiFDPacket_t
to MultiFDRAMPacket_t already.
Now I think I agree with you: using a buffer as a generic concept to carry
device state seems like a good idea, and we can start doing that with VFIO.
Please keep in mind that all these paths should be generic enough for
non-VFIO devices to use them too. From that POV I think your series already
did a good job, so now I think I'm ok with it.
There's another slight misfortune in that MultiFDPacketDeviceState_t will
always need to send the idstr[].. I think it is a good start, though, as
generic VMSD migrations will also need that (QEMU_VM_SECTION_FULL), so it's
not super efficient but generic enough. You might want to look at how
current migration tackles that with the load_section_id field, but that can
be for later, just FYI. I wonder how find_se() could impact performance when
there are lots of devices; I never really measured it.
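Just to illustrate the idea (this is not the actual layout from your series;
the field names below are made up), such a device-state packet could look
roughly like:

    typedef struct {
        MultiFDPacketHdr_t hdr;     /* shared magic / version / flags */
        char idstr[256];            /* target device id, as in SECTION_FULL */
        uint32_t instance_id;
        uint32_t chunk_idx;         /* for in-order reassembly on the dest */
        uint64_t next_packet_size;  /* size of the device-state payload */
    } __attribute__((packed)) MultiFDPacketDeviceState_t;

The idstr[] is what makes it generic, at the cost of the extra bytes per
packet mentioned above.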
>
> > One benefit of that is with such a more generic framework, QEMU can easily
> > extend this infra to other device states, so that logically we can consider
> > sending non-VFIO device states also in the multifd buffers. However with
> > your current solution, new structures are needed, new hooks, a lot of new
> > codes around, however less problems it solved.. That's not optimal.
>
> This all relies on a promise than the current .load_state hooks can be
> efficiently called from multiple multifd receive threads, which isn't true
> today.
>
> Due to the reasons I specified above I think modifying these existing
> hooks would be more complex than just introducing a new one with the
> proper semantics.
>
> > 4. Risk of OOM on unlimited VFIO buffering
> > ==========================================
> >
> > This follows with above bullet, but my pure question to ask here is how
> > does VFIO guarantees no OOM condition by buffering VFIO state?
> >
> > I mean, currently your proposal used vfio_load_bufs_thread() as a separate
> > thread to only load the vfio states until sequential data is received,
> > however is there an upper limit of how much buffering it could do? IOW:
> >
> > vfio_load_state_buffer():
> >
> > if (packet->idx >= migration->load_bufs->len) {
> > g_array_set_size(migration->load_bufs, packet->idx + 1);
> > }
> >
> > lb = &g_array_index(migration->load_bufs, typeof(*lb), packet->idx);
> > ...
> > lb->data = g_memdup2(&packet->data, data_size - sizeof(*packet));
> > lb->len = data_size - sizeof(*packet);
> > lb->is_present = true;
> >
> > What if garray keeps growing with lb->data allocated, which triggers the
> > memcg limit of the process (if QEMU is in such process)? Or just deplete
> > host memory and causing OOM kill.
> >
> > I think we may need to find a way to throttle max memory usage of such
> > buffering.
> >
> > So far this will be more of a problem indeed if this will be done during
> > VFIO iteration phases, but I still hope a solution can work with both
> > iteration phase and the switchover phase, even if you only do that in
> > switchover phase
>
> Unfortunately, this issue will be hard to fix since the source can
> legitimately send the very first buffer (chunk) of data as the last one
> (at the very end of the transmission).
>
> In this case, the target will need to buffer nearly the whole data.
>
> We can't stop the receive on any channel, either, since the next missing
> buffer can arrive at that channel.
>
> However, I don't think purposely DoSing the target QEMU is a realistic
> security concern in the typical live migration scenario.
>
> I mean the source can easily force the target QEMU to exit just by
> feeding it wrong migration data.
>
> In case someone really wants to protect against the impact of
> theoretically unbounded QEMU memory allocations during live migration
> on the rest of the system they can put the target QEMU process
> (temporally) into a memory-limited cgroup.
Note that I'm not worried about a DoS from a malicious src QEMU; I'm talking
exactly about the generic case where QEMU (either src or dest, normally both
in that case) is put into a memcg, and if QEMU uses too much memory it'll
literally get killed even if there's no DoS issue at all.
In short, we hopefully will have a design that always works with QEMU running
in a container, without a 0.5% chance of the dest QEMU being killed, if you
see what I mean.
An upper bound on VFIO buffering will be needed so the admin can add it on
top of the memcg limit, and as long as QEMU keeps its word it'll always work
without sudden death.
I think I have some idea about resolving this problem. That idea may further
complicate the protocol a little bit, but before that let's first see whether
we can reach an initial consensus on whether this is a sane request. In
short, we'll need a configurable size saying how much VFIO can buffer, maybe
per-device, or globally. Then based on that we need some logic guaranteeing
that over-allocation won't happen, without heavily affecting concurrency
either (e.g., a single thread without caching is definitely safe, but it can
be slower).
>
> > (and I don't know why you don't care about VFIO iteration
> > phase, if you cared enough on how VFIO works now with migration.. literally
> > that should help VFIO migrates faster on 25G+ networks, with/without a
> > shorter blackout window).
>
> I do care about the VM live phase, too.
>
> Just to keep the complexity in bounds for the first version I wanted to
> deal with the most pressing issue first - downtime.
>
> I am not against accommodating the VM live phase changes if they don't
> significantly expand the patch set size.
Makes sense. Again, as long as QEMU (with this series applied) can also take
iterative VFIO data from newer QEMUs if that ever comes, I'll be perfectly
happy with that.
Or, one step back: if that can't be achieved, I hope we figure out the
complexity and justify it, rather than completely ignoring the iteration
phase; then that'll also be good enough for me.
> > 5. Worker thread model
> > ======================
> >
> > I'm so far not happy with what this proposal suggests on creating the
> > threads, also the two new hooks mostly just to create these threads..
>
> That VFIO .save_live_complete_precopy_begin handler crates a new
> per-device thread is an implementation detail for this particular
> driver.
>
> The whole idea behind this and save_live_complete_precopy_end hook was
> that details how the particular device driver does its own async saving
> is abstracted away from the migration core.
>
> The device then can do what's best / most efficient for it to do.
Yes, and what I was thinking is whether it can be done in the form of
"enqueue a task to migration worker threads", rather than "create its own
threads in the device hooks and manage those threads alone".
It's all about whether such threading can be reused by non-VFIO devices. It
can't be reused if VFIO is in charge here, and that will make migration less
generic.
My current opinion is that it can and should be reusable. Consider someone
starting to teach multifd to carry non-VFIO data (e.g. a generic VMSD): then
we can enqueue a task and do e.g. ioctl(KVM_GET_REGS) in those threads
(rather than a VFIO read()).
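To make that concrete, the migration-owned worker pool could have an API as
small as this (purely a sketch; none of these names exist today):

    typedef struct MigWorkerPool MigWorkerPool;
    typedef void (*MigWorkerFn)(void *opaque);

    MigWorkerPool *migration_worker_pool_new(unsigned int max_threads);
    void migration_worker_pool_submit(MigWorkerPool *pool,
                                      MigWorkerFn fn, void *opaque);
    /* Wait until every submitted job has finished. */
    void migration_worker_pool_wait(MigWorkerPool *pool);
    void migration_worker_pool_free(MigWorkerPool *pool);

VFIO would then only submit its (possibly blocking) per-device save job
instead of spawning and managing threads itself, and a KVM_GET_REGS style job
for some other device could go through exactly the same pool.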
>
> > I know I suggested that.. but that's comparing to what I read in the even
> > earlier version, and sorry I wasn't able to suggest something better at
> > that time because I simply thought less.
> >
> > As I mentioned in the other reply elsewhere, I think we should firstly have
> > these threads ready to take data at the start of migration, so that it'll
> > work when someone wants to add vfio iteration support. Then the jobs
> > (mostly what vfio_save_complete_precopy_async_thread() does now) can be
> > enqueued into the thread pools.
>
> I'm not sure that we can get way with using fewer threads than devices
> as these devices might not support AIO reads from their migration file
> descriptor.
It doesn't need to use AIO reads - I'll be happy if the thread model can be
generic; VFIO can still enqueue a task that does blocking reads.
That can take a lot of time, but it's fine: others who want to enqueue too
and see all threads busy should simply block there, waiting for the worker
threads to be freed again. It's the same as when there are no migration
worker threads at all, where the read() would block the main migration
thread instead.
Now we can have multiple worker threads doing things concurrently where
possible (some of them may not be able to, especially when the BQL is
required, but that's a separate thing; many device save()s may not need the
BQL, and when one does we can take it inside the enqueued task).
>
> mlx5 devices, for example, seems to support only poll()ed / non-blocking
> reads at best - with unknown performance in comparison with issuing
> blocking reads from dedicated threads.
>
> On the other hand, handling a single device from multiple threads in
> parallel is generally not possible due to difficulty of establishing in
> which order the buffers were read.
>
> And if we need a per-VFIO device thread anyway then using a thread pool
> doesn't help much - but brings extra complexity.
>
> In terms of starting the loading thread earlier to load also VM live
> phase data it looks like a small change to the code so it shouldn't be
> a problem.
That's good to know. Please still consider a generic thread model and see
whether that would also work for your VFIO use case.
If you look at what thread-pool.c does right now, it dynamically creates
threads on the fly. I think that's something we can do too, just with an
upper limit applied to the thread count.
>
> > It's better to create the thread pool owned by migration, rather than
> > threads owned by VFIO, because it also paves way for non-VFIO device state
> > save()s, as I mentioned also above on the multifd packet header. Maybe we
> > can have a flag in the packet header saying "this is device xxx's state,
> > just load it".
>
> I think the same could be done by simply implementing these hooks in other
> device types than VFIO, right?
>
> And if we notice that these implementations share a bit of code then we
> can think about making a common helper library out of this code.
>
> After, all that's just an implementation detail that does not impact
> the underlying bit stream protocol.
You're correct.
However, it still affects a few things.
Firstly, it may mean that we don't even need those two extra vmstate hooks:
the enqueue can already happen in save_state() if the migration worker model
exists.
So instead of this:

vfio_save_state():
    if (migration->multifd_transfer) {
        /* Emit dummy NOP data */
        qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
        return;
    }

We can already do:

    if (migration->multifd_transfer) {
        /* Enqueue a task to save this VFIO device's state */
        ...
        return;
    }
IMHO it'll be much cleaner in the VFIO code, and much cleaner for the
migration code too.
Another (possibly personal) reason is that I won't dare to touch the VFIO
code too much to do such a refactoring later. I simply don't have the VFIO
devices around and I won't be able to test. So compared to other things, I
hope the VFIO parts can land more stable than the rest, because I'm not
confident I could clean them up myself.
I also simply don't like random threads floating around, considering that we
already have a slight mess in migration for other reasons (we can still have
random TLS threads floating around, I think... and they can cause very hard
to debug issues). I'd feel shaky maintaining it if any device can also start
to create whatever threads it likes during migration.
>
> > I'd start looking at util/thread-pool.c, removing all the AIO implications
> > but simply provide a raw thread pool for what thread_pool_submit() is
> > doing.
> >
> > I know this is a lot, but I really think this is the right thing.. but we
> > can discuss, and you can correct me on my mistakes if there is.
> >
> > If you want I can have a look at this pool model and prepare a patch, so
> > you can work on other vfio relevant stuff and pick that up, if that helps
> > you trying to reach the goal of landing this whole stuff in 9.1.
>
> I would certainly help with the amount of work (thanks for the offer!),
> but as I wrote above I am not really convinced that adapting the existing
> thread pool utils for this usage is really the way to go here.
Yes, let's figure out those uncertainties first; the plan is for later.
Thanks,
--
Peter Xu
* Re: [PATCH v1 00/13] Multifd 🔀 device state transfer support with VFIO consumer
2024-06-25 17:25 ` Peter Xu
@ 2024-06-25 22:44 ` Maciej S. Szmigiero
2024-06-26 1:51 ` Peter Xu
0 siblings, 1 reply; 29+ messages in thread
From: Maciej S. Szmigiero @ 2024-06-25 22:44 UTC (permalink / raw)
To: Peter Xu
Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P. Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On 25.06.2024 19:25, Peter Xu wrote:
> On Mon, Jun 24, 2024 at 09:51:18PM +0200, Maciej S. Szmigiero wrote:
>> Hi Peter,
>
> Hi, Maciej,
>
>>
>> On 23.06.2024 22:27, Peter Xu wrote:
>>> On Tue, Jun 18, 2024 at 06:12:18PM +0200, Maciej S. Szmigiero wrote:
>>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>>
>>>> This is an updated v1 patch series of the RFC (v0) series located here:
>>>> https://lore.kernel.org/qemu-devel/cover.1713269378.git.maciej.szmigiero@oracle.com/
>>>
>>> OK I took some hours thinking about this today, and here's some high level
>>> comments for this series. I'll start with which are more relevant to what
>>> Fabiano has already suggested in the other thread, then I'll add some more.
>>>
>>> https://lore.kernel.org/r/20240620212111.29319-1-farosas@suse.de
>>
>> That's a long list, thanks for these comments.
>>
>> I have responded to them inline below.
>>
(..)
>>
>>> 3. load_state_buffer() and VFIODeviceStatePacket protocol
>>> =========================================================
>>>
>>> VFIODeviceStatePacket is the new protocol you introduced into multifd
>>> packets, along with the new load_state_buffer() hook for loading such
>>> buffers. My question is whether it's needed at all, or.. whether it can be
>>> more generic (and also easier) to just allow taking any device state in the
>>> multifd packets, then load it with vmstate load().
>>>
>>> I mean, the vmstate_load() should really have worked on these buffers, if
>>> after all VFIO is looking for: (1) VFIO_MIG_FLAG_DEV_DATA_STATE as the
>>> first flag (uint64), size as the 2nd, then (2) load that rest buffer into
>>> VFIO kernel driver. That is the same to happen during the blackout window.
>>> It's not clear to me why load_state_buffer() is needed.
>>>
>>> I also see that you're also using exactly the same chunk size for such
>>> buffering (VFIOMigration.data_buffer_size).
>>>
>>> I think you have a "reason": VFIODeviceStatePacket and loading of the
>>> buffer data resolved one major issue that wasn't there before but start to
>>> have now: multifd allows concurrent arrivals of vfio buffers, even if the
>>> buffer *must* be sequentially loaded.
>>>
>>> That's a major pain for current VFIO kernel ioctl design, IMHO. I think I
>>> used to ask nVidia people on whether the VFIO get_state/set_state interface
>>> can allow indexing or tagging of buffers but I never get a real response.
>>> IMHO that'll be extremely helpful for migration purpose on concurrency if
>>> it can happen, rather than using a serialized buffer. It means
>>> concurrently save/load one VFIO device could be extremely hard, if not
>>> impossible.
>>
>> I am pretty sure that the current kernel VFIO interface requires for the
>> buffers to be loaded in-order - accidentally providing the out of order
>> definitely breaks the restore procedure.
>
> Ah, I didn't mean that we need to do it with the current API. I'm talking
> about whether it's possible to have a v2 that will support those otherwise
> we'll need to do "workarounds" like what you're doing with "unlimited
> buffer these on dest, until we receive continuous chunk of data" tricks.
A better kernel API might be possible in the long term, but for now we have
to live with what we have right now.
After all, adding true unordered loading - I mean not just moving the
reassembly process from QEMU to the kernel but making the device itself
accept buffers out of order - will likely be pretty complex (requiring
adding such functionality to the device firmware, etc.).
> And even with that trick, it'll still need to be serialized on the read()
> syscall so it won't scale either if the state is huge. For that issue
> there's no workaround we can do from userspace.
The read() calls for multiple VFIO devices can be issued in parallel,
and in fact they are in my patch set.
(..)
>>> 4. Risk of OOM on unlimited VFIO buffering
>>> ==========================================
>>>
>>> This follows with above bullet, but my pure question to ask here is how
>>> does VFIO guarantees no OOM condition by buffering VFIO state?
>>>
>>> I mean, currently your proposal used vfio_load_bufs_thread() as a separate
>>> thread to only load the vfio states until sequential data is received,
>>> however is there an upper limit of how much buffering it could do? IOW:
>>>
>>> vfio_load_state_buffer():
>>>
>>> if (packet->idx >= migration->load_bufs->len) {
>>> g_array_set_size(migration->load_bufs, packet->idx + 1);
>>> }
>>>
>>> lb = &g_array_index(migration->load_bufs, typeof(*lb), packet->idx);
>>> ...
>>> lb->data = g_memdup2(&packet->data, data_size - sizeof(*packet));
>>> lb->len = data_size - sizeof(*packet);
>>> lb->is_present = true;
>>>
>>> What if garray keeps growing with lb->data allocated, which triggers the
>>> memcg limit of the process (if QEMU is in such process)? Or just deplete
>>> host memory and causing OOM kill.
>>>
>>> I think we may need to find a way to throttle max memory usage of such
>>> buffering.
>>>
>>> So far this will be more of a problem indeed if this will be done during
>>> VFIO iteration phases, but I still hope a solution can work with both
>>> iteration phase and the switchover phase, even if you only do that in
>>> switchover phase
>>
>> Unfortunately, this issue will be hard to fix since the source can
>> legitimately send the very first buffer (chunk) of data as the last one
>> (at the very end of the transmission).
>>
>> In this case, the target will need to buffer nearly the whole data.
>>
>> We can't stop the receive on any channel, either, since the next missing
>> buffer can arrive at that channel.
>>
>> However, I don't think purposely DoSing the target QEMU is a realistic
>> security concern in the typical live migration scenario.
>>
>> I mean the source can easily force the target QEMU to exit just by
>> feeding it wrong migration data.
>>
>> In case someone really wants to protect against the impact of
>> theoretically unbounded QEMU memory allocations during live migration
>> on the rest of the system they can put the target QEMU process
>> (temporally) into a memory-limited cgroup.
>
> Note that I'm not worrying about DoS of a malicious src QEMU, and I'm
> exactly talking about the generic case where QEMU (either src or dest, in
> that case normally both) is put into the memcg and if QEMU uses too much
> memory it'll literally get killed even if no DoS issue at all.
>
> In short, we hopefully will have a design that will always work with QEMU
> running in a container, without 0.5% chance dest qemu being killed, if you
> see what I meant.
>
> The upper bound of VFIO buffering will be needed so the admin can add that
> on top of the memcg limit and as long as QEMU keeps its words it'll always
> work without sudden death.
>
> I think I have some idea about resolving this problem. That idea can
> further complicate the protocol a little bit. But before that let's see
> whether we can reach an initial consensus on this matter first, on whether
> this is a sane request. In short, we'll need to start to have a
> configurable size to say how much VFIO can buffer, maybe per-device, or
> globally. Then based on that we need to have some logic guarantee that
> over-mem won't happen, also without heavily affecting concurrency (e.g.,
> single thread is definitely safe and without caching, but it can be
> slower).
Here, I think I can add a per-device limit parameter on the number of
buffers received out-of-order or waiting to be loaded into the device -
with a reasonable default.
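Roughly something like this, just as a sketch (the field and helper names
below are made up, they are not from the posted series):

    static bool vfio_load_buffer_within_limit(VFIOMigration *migration,
                                              Error **errp)
    {
        if (migration->load_bufs_queued >= migration->load_bufs_limit) {
            error_setg(errp,
                       "VFIO device state buffer limit (%u) exceeded",
                       migration->load_bufs_limit);
            return false;
        }
        migration->load_bufs_queued++;
        return true;
    }

with the counter decremented once a buffer has been loaded into the device
and freed.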
(..)
>>> 5. Worker thread model
>>> ======================
>>>
>>> I'm so far not happy with what this proposal suggests on creating the
>>> threads, also the two new hooks mostly just to create these threads..
>>
>> That VFIO .save_live_complete_precopy_begin handler crates a new
>> per-device thread is an implementation detail for this particular
>> driver.
>>
>> The whole idea behind this and save_live_complete_precopy_end hook was
>> that details how the particular device driver does its own async saving
>> is abstracted away from the migration core.
>>
>> The device then can do what's best / most efficient for it to do.
>
> Yes, and what I was thinking is whether it does it in form of "enqueue a
> task to migration worker threads", rather than "creating its own threads in
> the device hooks, and managing those threads alone".
>
> It's all about whether such threading can be reused by non-VFIO devices.
> They can't be reused if VFIO is in charge here, and it will make migration
> less generic.
>
> My current opinion is they can and should be re-usable. Consider if someone
> starts to teach multifd carry non-vfio data (e.g. a generic VMSD), then we
> can enqueue a task, do e.g. ioctl(KVM_GET_REGS) in those threads (rather
> than VFIO read()).
Theoretically, it's obviously possible to wrap every operation in a request
to some thread pool.
But that would bring a lot of complexity, since instead of performing these
operations directly the requester will now need to:
1) Prepare some "Operation" structure with the parameters of the requested
operation (task).
In your case this could be a QEMU_OP_GET_VCPU_REGS operation using an
"OperationGetVCPURegs" struct containing a vCPU number parameter = 1.
2) Submit this operation to the thread pool and wait for it to complete,
3) Thread pool needs to check whether it has any free threads in the pool
available to perform this operation.
If not, and the count of threads that are CPU-bound (i.e. not sleeping on
some I/O operation) is less than the number of logical CPUs in the system,
the thread pool needs to spawn a new thread since there's some CPU capacity
available,
4) The operation needs to be dispatched to the actual execution thread,
5) The execution thread needs to figure out which operation it needs to
actually do, fetch the necessary parameters from the proper "Operation"
structure, maybe take the necessary locks, before it can actually perform
the requested operation,
6) The execution thread needs to serialize (write) the operation result
back into some "OperationResult" structure, like "OperationGetVCPURegsResult",
7) The execution thread needs to submit this result back to the requester,
8) The thread pool needs to decide whether to keep this (now idle) execution
thread in the pool as a reserve thread or terminate it immediately,
9) The requester needs to be resumed somehow (returned from wait) now that
the operation it requested is complete,
10) The requester needs to fetch the operation results from the proper
"OperationResult" structure and decode them accordingly.
As you can see, that's *a lot* of extra code that needs to be maintained
for just a single operation type.
>>
>>> I know I suggested that.. but that's comparing to what I read in the even
>>> earlier version, and sorry I wasn't able to suggest something better at
>>> that time because I simply thought less.
>>>
>>> As I mentioned in the other reply elsewhere, I think we should firstly have
>>> these threads ready to take data at the start of migration, so that it'll
>>> work when someone wants to add vfio iteration support. Then the jobs
>>> (mostly what vfio_save_complete_precopy_async_thread() does now) can be
>>> enqueued into the thread pools.
>>
>> I'm not sure that we can get way with using fewer threads than devices
>> as these devices might not support AIO reads from their migration file
>> descriptor.
>
> It doesn't need to use AIO reads - I'll be happy if the thread model can be
> generic, VFIO can still enqueue a task that does blocking reads.
>
> It can take a lot of time, but it's fine: others who like to enqueue too
> and see all threads busy, they should simply block there and waiting for
> the worker threads to be freed again. It's the same when there's no
> migration worker threads as it means the read() will block the main
> migration thread.
Oh no, waiting for one device's blocking read to complete before scheduling
another device's blocking read is surely going to negatively impact
performance.
For best performance we need to maximize parallelism - that means
reading (and loading) all the VFIO devices present in parallel.
The whole point of having per-device threads is for the whole operation to
be I/O bound but never CPU bound on a reasonably fast machine - and
especially never bound by the number of threads in the pool.
> Now we can have multiple worker threads doing things concurrently if
> possible (some of them may not, especially when BQL will be required, but
> that's a separate thing, and many device save()s may not need BQL, and when
> it needs we can take it in the enqueued tasks).
>
>>
>> mlx5 devices, for example, seems to support only poll()ed / non-blocking
>> reads at best - with unknown performance in comparison with issuing
>> blocking reads from dedicated threads.
>>
>> On the other hand, handling a single device from multiple threads in
>> parallel is generally not possible due to difficulty of establishing in
>> which order the buffers were read.
>>
>> And if we need a per-VFIO device thread anyway then using a thread pool
>> doesn't help much - but brings extra complexity.
>>
>> In terms of starting the loading thread earlier to load also VM live
>> phase data it looks like a small change to the code so it shouldn't be
>> a problem.
>
> That's good to know. Please still consider a generic thread model and see
> what that would also work for your VFIO use case.
>
> If you see what thread-pool.c did right now is it'll dynamically create
> threads on the fly. I think that's something we can do too but just apply
> an upper limit to the thread numbers.
We have an upper limit on the count of saving threads already - it's the
count of VFIO devices in the VM.
The API in util/thread-pool.c is very basic and essentially only allows
submitting either AIO operations or a generic function call operation, still
within some AioContext.
There's almost none of the operation execution logic I described above - all
of it would need to be written and maintained.
>>
>>> It's better to create the thread pool owned by migration, rather than
>>> threads owned by VFIO, because it also paves way for non-VFIO device state
>>> save()s, as I mentioned also above on the multifd packet header. Maybe we
>>> can have a flag in the packet header saying "this is device xxx's state,
>>> just load it".
>>
>> I think the same could be done by simply implementing these hooks in other
>> device types than VFIO, right?
>>
>> And if we notice that these implementations share a bit of code then we
>> can think about making a common helper library out of this code.
>>
>> After, all that's just an implementation detail that does not impact
>> the underlying bit stream protocol.
>
> You're correct.
>
> However, it still affects a few things.
>
> Firstly, it may mean that we may not even need those two extra vmstate
> hooks: the enqueue can happen already with save_state() if the migration
> worker model exists.
>
> So instead of this:
>
> vfio_save_state():
> if (migration->multifd_transfer) {
> /* Emit dummy NOP data */
> qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> return;
> }
>
> We can already do:
>
> if (migration->multifd_transfer) {
> // enqueue task to load state for this vfio device
> ...
> return;
> }
>
> IMHO it'll be much cleaner in VFIO code, and much cleaner too for migration
> code.
The save_state hook is executed too late - only after all iterable
hooks have already transferred all their data.
We want to start saving this device state as early as possible to not
have to wait for any other device to transfer its data first.
That's why the code introduces the save_live_complete_precopy_begin hook,
which is guaranteed to be the very first hook called during switchover-phase
device state saving.
> Another (possibly personal) reason is, I will not dare to touch VFIO code
> too much to do such a refactoring later. I simply don't have the VFIO
> devices around and I won't be able to test. So comparing to other things,
> I hope VFIO stuff can land more stable than others because I am not
> confident at least myself to clean it.
That's a fair request, will keep this in mind.
> I simply also don't like random threads floating around, considering that
> how we already have slightly a mess with migration on other reasons (we can
> still have random TLS threads floating around, I think... and they can
> cause very hard to debug issues). I feel shaky to maintain it when any
> device can also start to create whatever threads they can during migration.
The threads themselves aren't very expensive as long as their number is kept
within reasonable bounds.
4 additional threads (present only during an active migration operation) for
4 VFIO devices are really not a lot.
(..)
>
> Thanks,
>
Thanks,
Maciej
* Re: [PATCH v1 00/13] Multifd 🔀 device state transfer support with VFIO consumer
2024-06-25 22:44 ` Maciej S. Szmigiero
@ 2024-06-26 1:51 ` Peter Xu
2024-06-26 15:47 ` Maciej S. Szmigiero
0 siblings, 1 reply; 29+ messages in thread
From: Peter Xu @ 2024-06-26 1:51 UTC (permalink / raw)
To: Maciej S. Szmigiero
Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P. Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On Wed, Jun 26, 2024 at 12:44:29AM +0200, Maciej S. Szmigiero wrote:
> On 25.06.2024 19:25, Peter Xu wrote:
> > On Mon, Jun 24, 2024 at 09:51:18PM +0200, Maciej S. Szmigiero wrote:
> > > Hi Peter,
> >
> > Hi, Maciej,
> >
> > >
> > > On 23.06.2024 22:27, Peter Xu wrote:
> > > > On Tue, Jun 18, 2024 at 06:12:18PM +0200, Maciej S. Szmigiero wrote:
> > > > > From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> > > > >
> > > > > This is an updated v1 patch series of the RFC (v0) series located here:
> > > > > https://lore.kernel.org/qemu-devel/cover.1713269378.git.maciej.szmigiero@oracle.com/
> > > >
> > > > OK I took some hours thinking about this today, and here's some high level
> > > > comments for this series. I'll start with which are more relevant to what
> > > > Fabiano has already suggested in the other thread, then I'll add some more.
> > > >
> > > > https://lore.kernel.org/r/20240620212111.29319-1-farosas@suse.de
> > >
> > > That's a long list, thanks for these comments.
> > >
> > > I have responded to them inline below.
> > >
> (..)
> > >
> > > > 3. load_state_buffer() and VFIODeviceStatePacket protocol
> > > > =========================================================
> > > >
> > > > VFIODeviceStatePacket is the new protocol you introduced into multifd
> > > > packets, along with the new load_state_buffer() hook for loading such
> > > > buffers. My question is whether it's needed at all, or.. whether it can be
> > > > more generic (and also easier) to just allow taking any device state in the
> > > > multifd packets, then load it with vmstate load().
> > > >
> > > > I mean, the vmstate_load() should really have worked on these buffers, if
> > > > after all VFIO is looking for: (1) VFIO_MIG_FLAG_DEV_DATA_STATE as the
> > > > first flag (uint64), size as the 2nd, then (2) load that rest buffer into
> > > > VFIO kernel driver. That is the same to happen during the blackout window.
> > > > It's not clear to me why load_state_buffer() is needed.
> > > >
> > > > I also see that you're also using exactly the same chunk size for such
> > > > buffering (VFIOMigration.data_buffer_size).
> > > >
> > > > I think you have a "reason": VFIODeviceStatePacket and loading of the
> > > > buffer data resolved one major issue that wasn't there before but start to
> > > > have now: multifd allows concurrent arrivals of vfio buffers, even if the
> > > > buffer *must* be sequentially loaded.
> > > >
> > > > That's a major pain for current VFIO kernel ioctl design, IMHO. I think I
> > > > used to ask nVidia people on whether the VFIO get_state/set_state interface
> > > > can allow indexing or tagging of buffers but I never get a real response.
> > > > IMHO that'll be extremely helpful for migration purpose on concurrency if
> > > > it can happen, rather than using a serialized buffer. It means
> > > > concurrently save/load one VFIO device could be extremely hard, if not
> > > > impossible.
> > >
> > > I am pretty sure that the current kernel VFIO interface requires for the
> > > buffers to be loaded in-order - accidentally providing the out of order
> > > definitely breaks the restore procedure.
> >
> > Ah, I didn't mean that we need to do it with the current API. I'm talking
> > about whether it's possible to have a v2 that will support those otherwise
> > we'll need to do "workarounds" like what you're doing with "unlimited
> > buffer these on dest, until we receive continuous chunk of data" tricks.
>
> Better kernel API might be possible in the long term but for now we have
> to live with what we have right now.
>
> After all, adding true unordered loading - I mean not just moving the
> reassembly process from QEMU to the kernel but making the device itself
> accept buffers out out order - will likely be pretty complex (requiring
> adding such functionality to the device firmware, etc).
I would expect the device would need to be able to provision the device
state as smaller objects rather than one binary object, with those objects
then being taggable or addressable.
>
> > And even with that trick, it'll still need to be serialized on the read()
> > syscall so it won't scale either if the state is huge. For that issue
> > there's no workaround we can do from userspace.
>
> The read() calls for multiple VFIO devices can be issued in parallel,
> and in fact they are in my patch set.
I was talking about concurrency for one device.
>
> (..)
> > > > 4. Risk of OOM on unlimited VFIO buffering
> > > > ==========================================
> > > >
> > > > This follows with above bullet, but my pure question to ask here is how
> > > > does VFIO guarantees no OOM condition by buffering VFIO state?
> > > >
> > > > I mean, currently your proposal used vfio_load_bufs_thread() as a separate
> > > > thread to only load the vfio states until sequential data is received,
> > > > however is there an upper limit of how much buffering it could do? IOW:
> > > >
> > > > vfio_load_state_buffer():
> > > >
> > > > if (packet->idx >= migration->load_bufs->len) {
> > > > g_array_set_size(migration->load_bufs, packet->idx + 1);
> > > > }
> > > >
> > > > lb = &g_array_index(migration->load_bufs, typeof(*lb), packet->idx);
> > > > ...
> > > > lb->data = g_memdup2(&packet->data, data_size - sizeof(*packet));
> > > > lb->len = data_size - sizeof(*packet);
> > > > lb->is_present = true;
> > > >
> > > > What if garray keeps growing with lb->data allocated, which triggers the
> > > > memcg limit of the process (if QEMU is in such process)? Or just deplete
> > > > host memory and causing OOM kill.
> > > >
> > > > I think we may need to find a way to throttle max memory usage of such
> > > > buffering.
> > > >
> > > > So far this will be more of a problem indeed if this will be done during
> > > > VFIO iteration phases, but I still hope a solution can work with both
> > > > iteration phase and the switchover phase, even if you only do that in
> > > > switchover phase
> > >
> > > Unfortunately, this issue will be hard to fix since the source can
> > > legitimately send the very first buffer (chunk) of data as the last one
> > > (at the very end of the transmission).
> > >
> > > In this case, the target will need to buffer nearly the whole data.
> > >
> > > We can't stop the receive on any channel, either, since the next missing
> > > buffer can arrive at that channel.
> > >
> > > However, I don't think purposely DoSing the target QEMU is a realistic
> > > security concern in the typical live migration scenario.
> > >
> > > I mean the source can easily force the target QEMU to exit just by
> > > feeding it wrong migration data.
> > >
> > > In case someone really wants to protect against the impact of
> > > theoretically unbounded QEMU memory allocations during live migration
> > > on the rest of the system they can put the target QEMU process
> > > (temporally) into a memory-limited cgroup.
> >
> > Note that I'm not worrying about DoS of a malicious src QEMU, and I'm
> > exactly talking about the generic case where QEMU (either src or dest, in
> > that case normally both) is put into the memcg and if QEMU uses too much
> > memory it'll literally get killed even if no DoS issue at all.
> >
> > In short, we hopefully will have a design that will always work with QEMU
> > running in a container, without 0.5% chance dest qemu being killed, if you
> > see what I meant.
> >
> > The upper bound of VFIO buffering will be needed so the admin can add that
> > on top of the memcg limit and as long as QEMU keeps its words it'll always
> > work without sudden death.
> >
> > I think I have some idea about resolving this problem. That idea can
> > further complicate the protocol a little bit. But before that let's see
> > whether we can reach an initial consensus on this matter first, on whether
> > this is a sane request. In short, we'll need to start to have a
> > configurable size to say how much VFIO can buffer, maybe per-device, or
> > globally. Then based on that we need to have some logic guarantee that
> > over-mem won't happen, also without heavily affecting concurrency (e.g.,
> > single thread is definitely safe and without caching, but it can be
> > slower).
>
> Here, I think I can add a per-device limit parameter on the number of
> buffers received out-of-order or waiting to be loaded into the device -
> with a reasonable default.
Yes, that should work.
I don't even expect people to change it, but this might be information people
need to know before putting QEMU into a container, if the limit is larger
than what QEMU dynamically consumes here and there anyway.
I'd expect it to still be small enough that nobody will notice it (maybe a
few tens of MBs? just wildly guessing, where tens of MBs could fall into the
"noise" memory allocation window of a VM).
>
> (..)
> > > > 5. Worker thread model
> > > > ======================
> > > >
> > > > I'm so far not happy with what this proposal suggests on creating the
> > > > threads, also the two new hooks mostly just to create these threads..
> > >
> > > That VFIO .save_live_complete_precopy_begin handler crates a new
> > > per-device thread is an implementation detail for this particular
> > > driver.
> > >
> > > The whole idea behind this and save_live_complete_precopy_end hook was
> > > that details how the particular device driver does its own async saving
> > > is abstracted away from the migration core.
> > >
> > > The device then can do what's best / most efficient for it to do.
> >
> > Yes, and what I was thinking is whether it does it in form of "enqueue a
> > task to migration worker threads", rather than "creating its own threads in
> > the device hooks, and managing those threads alone".
> >
> > It's all about whether such threading can be reused by non-VFIO devices.
> > They can't be reused if VFIO is in charge here, and it will make migration
> > less generic.
> >
> > My current opinion is they can and should be re-usable. Consider if someone
> > starts to teach multifd carry non-vfio data (e.g. a generic VMSD), then we
> > can enqueue a task, do e.g. ioctl(KVM_GET_REGS) in those threads (rather
> > than VFIO read()).
>
> Theoretically, it's obviously possible to wrap every operation in a request
> to some thread pool.
>
>
> But that would bring a lot of complexity, since instead of performing these
> operation directly now the requester will need to:
> 1) Prepare some "Operation" structure with the parameters of the requested
> operation (task).
> In your case this could be QEMU_OP_GET_VCPU_REGS operation using
> "OperationGetVCPURegs" struct containing vCPU number parameter = 1.
Why is such complexity needed?
Can it be as simple as queueing a func(opaque) pair, where here
func == vfio_save_complete_precopy_async_thread and opaque == VFIODevice*?
>
> 2) Submit this operation to the thread pool and wait for it to complete,
VFIO doesn't need to have its own code waiting. If this pool is for
migration purposes in general, the QEMU migration framework will need to
wait at some point for all jobs to finish before moving on. Perhaps that
should be at the end of the non-iterative session.
>
> 3) Thread pool needs to check whether it has any free threads in the pool
> available to perform this operation.
>
> If not, and the count of threads that are CPU-bound (~aren't sleeping on
> some I/O operation) is less than the number of logical CPUs in the system
> the thread pool needs to spawn a new thread since there's some CPU capacity
> available,
For this one it can follow what thread-pool.c is doing, and the upper bound
on the number of threads can start simple, e.g. min(n_channels_multifd, 8)?
>
> 4) The operation needs to be dispatched to the actual execution thread,
>
> 5) The execution thread needs to figure out which operation it needs to
> actually do, fetch the necessary parameters from the proper "Operation"
> structure, maybe take the necessary locks, before it can actually perform
> the requested operation,
>
> 6) The execution thread needs to serialize (write) the operation result
> back into some "OperationResult" structure, like "OperationGetVCPURegsResult",
I think in the simplest case the thread should simply run fn(opaque), in
which it starts calling multifd_queue_device_state() and queues multifd jobs
from the worker thread instead of from the VFIO-dedicated threads.
In that regard I don't expect much to change in your code compared to what
vfio_save_complete_precopy_async_thread() used to do.
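For example, reusing the made-up pool API sketched earlier (the wrapper and
helper names below are hypothetical too):

    static void vfio_save_job(void *opaque)
    {
        VFIODevice *vbasedev = opaque;

        /*
         * Same body as the dedicated per-device save thread in this series:
         * read the device state in chunks and push each chunk to multifd
         * via multifd_queue_device_state().
         */
        vfio_save_dev_state_chunks(vbasedev);
    }

    /* ...and at switchover, instead of qemu_thread_create(): */
    migration_worker_pool_submit(pool, vfio_save_job, vbasedev);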
>
> 7) The execution thread needs to submit this result back to the requester,
>
> 8) The thread pool needs to decide whether to keep this (now idle) execution
> thread in the pool as a reserve thread or terminate it immediately,
>
> 9) The requester needs to be resumed somehow (returned from wait) now that
> the operation it requested is complete,
>
> 10) The requester needs the fetch the operation results from the proper
> "OperationResult" structure and decode them accordingly.
>
>
> As you can see, that's *a lot* of extra code that needs to be maintained
> for just a single operation type.
I don't yet know why you designed it in such a complicated way, but if I
missed something above please let me know.
>
> > >
> > > > I know I suggested that.. but that's comparing to what I read in the even
> > > > earlier version, and sorry I wasn't able to suggest something better at
> > > > that time because I simply thought less.
> > > >
> > > > As I mentioned in the other reply elsewhere, I think we should firstly have
> > > > these threads ready to take data at the start of migration, so that it'll
> > > > work when someone wants to add vfio iteration support. Then the jobs
> > > > (mostly what vfio_save_complete_precopy_async_thread() does now) can be
> > > > enqueued into the thread pools.
> > >
> > > I'm not sure that we can get way with using fewer threads than devices
> > > as these devices might not support AIO reads from their migration file
> > > descriptor.
> >
> > It doesn't need to use AIO reads - I'll be happy if the thread model can be
> > generic, VFIO can still enqueue a task that does blocking reads.
> >
> > It can take a lot of time, but it's fine: others who like to enqueue too
> > and see all threads busy, they should simply block there and waiting for
> > the worker threads to be freed again. It's the same when there's no
> > migration worker threads as it means the read() will block the main
> > migration thread.
>
> Oh no, waiting for another device blocking read to complete before
> scheduling another device blocking read is surely going to negatively
> impact the performance.
There can be e.g. 8 worker threads. If you want, you can make sure the
worker threads at least outnumber the VFIO devices. Then it's guaranteed
that VFIO will dump / save() one device per thread concurrently.
>
> For best performance we need to maximize parallelism - that means
> reading (and loading) all the VFIO devices present in parallel.
>
> The whole point of having per-device threads is for the whole operation
> to be I/O bound but never CPU bound on a reasonably fast machine - and
> especially not number-of-threads-in-pool bound.
>
> > Now we can have multiple worker threads doing things concurrently if
> > possible (some of them may not, especially when BQL will be required, but
> > that's a separate thing, and many device save()s may not need BQL, and when
> > it needs we can take it in the enqueued tasks).
> >
> > >
> > > mlx5 devices, for example, seems to support only poll()ed / non-blocking
> > > reads at best - with unknown performance in comparison with issuing
> > > blocking reads from dedicated threads.
> > >
> > > On the other hand, handling a single device from multiple threads in
> > > parallel is generally not possible due to difficulty of establishing in
> > > which order the buffers were read.
> > >
> > > And if we need a per-VFIO device thread anyway then using a thread pool
> > > doesn't help much - but brings extra complexity.
> > >
> > > In terms of starting the loading thread earlier to load also VM live
> > > phase data it looks like a small change to the code so it shouldn't be
> > > a problem.
> >
> > That's good to know. Please still consider a generic thread model and see
> > what that would also work for your VFIO use case.
> >
> > If you see what thread-pool.c did right now is it'll dynamically create
> > threads on the fly. I think that's something we can do too but just apply
> > an upper limit to the thread numbers.
>
> We have an upper limit on the count of saving threads already - it's the
> count of VFIO devices in the VM.
>
> The API in util/thread-pool.c is very basic and basically only allows
> submitting either AIO operations or generic function call operation
> but still within some AioContext.
What I'm suggesting is a thread pool _without_ AIO. Maybe it could be called
ThreadPoolRaw, with ThreadPool depending on it, but I haven't checked
further yet.
>
> There's almost none of the operation execution logic I described above -
> all of these would need to be written and maintained.
>
> > >
> > > > It's better to create the thread pool owned by migration, rather than
> > > > threads owned by VFIO, because it also paves way for non-VFIO device state
> > > > save()s, as I mentioned also above on the multifd packet header. Maybe we
> > > > can have a flag in the packet header saying "this is device xxx's state,
> > > > just load it".
> > >
> > > I think the same could be done by simply implementing these hooks in other
> > > device types than VFIO, right?
> > >
> > > And if we notice that these implementations share a bit of code then we
> > > can think about making a common helper library out of this code.
> > >
> > > After, all that's just an implementation detail that does not impact
> > > the underlying bit stream protocol.
> >
> > You're correct.
> >
> > However, it still affects a few things.
> >
> > Firstly, it may mean that we may not even need those two extra vmstate
> > hooks: the enqueue can happen already with save_state() if the migration
> > worker model exists.
> >
> > So instead of this:
> >
> > vfio_save_state():
> > if (migration->multifd_transfer) {
> > /* Emit dummy NOP data */
> > qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> > return;
> > }
> >
> > We can already do:
> >
> > if (migration->multifd_transfer) {
> > // enqueue task to load state for this vfio device
> > ...
> > return;
> > }
> >
> > IMHO it'll be much cleaner in VFIO code, and much cleaner too for migration
> > code.
>
> The save_state hook is executed too late - only after all iterable
> hooks have already transferred all their data.
>
> We want to start saving this device state as early as possible to not
> have to wait for any other device to transfer its data first.
>
> That's why the code introduces save_live_complete_precopy_begin hook
> that's guaranteed to be the very first hook called during switchover
> phase device state saving.
I think I mistyped.. What I wanted to say is vfio_save_complete_precopy(),
not vfio_save_state().
There will be one challenge though: RAM is also an iterable, so RAM's
save_live_complete_precopy() can delay VFIO's even if the latter only needs
to enqueue a job.
Two solutions I can think of:
(1) Provide a separate hook, e.g. save_live_complete_precopy_async(); when
save_live_complete_precopy_async(opaque) is provided, instead of calling
save_live_complete_precopy() we inject that job into the worker threads. In
that case we can loop over the *_precopy_async() hooks before all the
remaining *_precopy() calls (see the rough sketch after this list).
(2) Make RAM's save_live_complete_precopy() do a similar enqueue when
multifd is enabled, so RAM is saved in a worker thread too.
However (2) can have other issues to work out. Do you think (1) is still
doable?
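Here is the sketch for (1); the hook name and the pool calls are assumptions
(reusing the made-up worker pool API from earlier), nothing like this exists
yet:

    SaveStateEntry *se;

    /* Pass 1: enqueue async completion jobs so they overlap with the rest. */
    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
        if (se->ops && se->ops->save_live_complete_precopy_async) {
            migration_worker_pool_submit(pool,
                se->ops->save_live_complete_precopy_async, se->opaque);
        }
    }

    /* Pass 2: the regular synchronous hooks (RAM, etc.); error handling
     * omitted here for brevity. */
    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
        if (se->ops && se->ops->save_live_complete_precopy) {
            se->ops->save_live_complete_precopy(f, se->opaque);
        }
    }

    /* Finally, wait for all enqueued jobs before ending the session. */
    migration_worker_pool_wait(pool);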
>
> > Another (possibly personal) reason is, I will not dare to touch VFIO code
> > too much to do such a refactoring later. I simply don't have the VFIO
> > devices around and I won't be able to test. So comparing to other things,
> > I hope VFIO stuff can land more stable than others because I am not
> > confident at least myself to clean it.
>
> That's a fair request, will keep this on mind.
>
> > I simply also don't like random threads floating around, considering that
> > how we already have slightly a mess with migration on other reasons (we can
> > still have random TLS threads floating around, I think... and they can
> > cause very hard to debug issues). I feel shaky to maintain it when any
> > device can also start to create whatever threads they can during migration.
>
> The threads themselves aren't very expensive as long as their number
> is kept within reasonable bounds.
>
> 4 additional threads (present only during active migration operation)
> with 4 VFIO devices is really not a lot.
It's not about the number, it's about management: when something crashes at
some unwanted point, we may want to know what happened to those threads and
how to recycle them.
Thanks,
--
Peter Xu
* Re: [PATCH v1 00/13] Multifd 🔀 device state transfer support with VFIO consumer
2024-06-26 1:51 ` Peter Xu
@ 2024-06-26 15:47 ` Maciej S. Szmigiero
2024-06-26 16:23 ` Peter Xu
0 siblings, 1 reply; 29+ messages in thread
From: Maciej S. Szmigiero @ 2024-06-26 15:47 UTC (permalink / raw)
To: Peter Xu
Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P. Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On 26.06.2024 03:51, Peter Xu wrote:
> On Wed, Jun 26, 2024 at 12:44:29AM +0200, Maciej S. Szmigiero wrote:
>> On 25.06.2024 19:25, Peter Xu wrote:
>>> On Mon, Jun 24, 2024 at 09:51:18PM +0200, Maciej S. Szmigiero wrote:
>>>> Hi Peter,
>>>
>>> Hi, Maciej,
>>>
>>>>
>>>> On 23.06.2024 22:27, Peter Xu wrote:
>>>>> On Tue, Jun 18, 2024 at 06:12:18PM +0200, Maciej S. Szmigiero wrote:
>>>>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>>>>
>>>>>> This is an updated v1 patch series of the RFC (v0) series located here:
>>>>>> https://lore.kernel.org/qemu-devel/cover.1713269378.git.maciej.szmigiero@oracle.com/
>>>>>
>>>>> OK I took some hours thinking about this today, and here's some high level
>>>>> comments for this series. I'll start with which are more relevant to what
>>>>> Fabiano has already suggested in the other thread, then I'll add some more.
>>>>>
>>>>> https://lore.kernel.org/r/20240620212111.29319-1-farosas@suse.de
>>>>
>>>> That's a long list, thanks for these comments.
>>>>
>>>> I have responded to them inline below.
>>>>
>> (..)
>>>>
>>>>> 3. load_state_buffer() and VFIODeviceStatePacket protocol
>>>>> =========================================================
>>>>>
>>>>> VFIODeviceStatePacket is the new protocol you introduced into multifd
>>>>> packets, along with the new load_state_buffer() hook for loading such
>>>>> buffers. My question is whether it's needed at all, or.. whether it can be
>>>>> more generic (and also easier) to just allow taking any device state in the
>>>>> multifd packets, then load it with vmstate load().
>>>>>
>>>>> I mean, the vmstate_load() should really have worked on these buffers, if
>>>>> after all VFIO is looking for: (1) VFIO_MIG_FLAG_DEV_DATA_STATE as the
>>>>> first flag (uint64), size as the 2nd, then (2) load that rest buffer into
>>>>> VFIO kernel driver. That is the same to happen during the blackout window.
>>>>> It's not clear to me why load_state_buffer() is needed.
>>>>>
>>>>> I also see that you're also using exactly the same chunk size for such
>>>>> buffering (VFIOMigration.data_buffer_size).
>>>>>
>>>>> I think you have a "reason": VFIODeviceStatePacket and loading of the
>>>>> buffer data resolved one major issue that wasn't there before but start to
>>>>> have now: multifd allows concurrent arrivals of vfio buffers, even if the
>>>>> buffer *must* be sequentially loaded.
>>>>>
>>>>> That's a major pain for current VFIO kernel ioctl design, IMHO. I think I
>>>>> used to ask nVidia people on whether the VFIO get_state/set_state interface
>>>>> can allow indexing or tagging of buffers but I never get a real response.
>>>>> IMHO that'll be extremely helpful for migration purpose on concurrency if
>>>>> it can happen, rather than using a serialized buffer. It means
>>>>> concurrently save/load one VFIO device could be extremely hard, if not
>>>>> impossible.
>>>>
>>>> I am pretty sure that the current kernel VFIO interface requires the
>>>> buffers to be loaded in-order - accidentally providing them out of order
>>>> definitely breaks the restore procedure.
>>>
>>> Ah, I didn't mean that we need to do it with the current API. I'm talking
>>> about whether it's possible to have a v2 that will support those otherwise
>>> we'll need to do "workarounds" like what you're doing with "unlimited
>>> buffer these on dest, until we receive continuous chunk of data" tricks.
>>
>> Better kernel API might be possible in the long term but for now we have
>> to live with what we have right now.
>>
>> After all, adding true unordered loading - I mean not just moving the
>> reassembly process from QEMU to the kernel but making the device itself
>> accept buffers out of order - will likely be pretty complex (requiring
>> adding such functionality to the device firmware, etc).
>
> I would expect the device will need to be able to provision the device
> states so it became smaller objects rather than one binary object, then
> either tag-able or address-able on those objects.
>
>>
>>> And even with that trick, it'll still need to be serialized on the read()
>>> syscall so it won't scale either if the state is huge. For that issue
>>> there's no workaround we can do from userspace.
>>
>> The read() calls for multiple VFIO devices can be issued in parallel,
>> and in fact they are in my patch set.
>
> I was talking about concurrency for one device.
AFAIK with the current hardware the read speed is limited by the device
itself, so adding additional reading threads wouldn't help.
Once someone has the hardware which is limited by single reading thread
that person can add the necessary kernel API (including unordered
loading) and then extend QEMU with such support.
>>
>> (..)
>>>>> 4. Risk of OOM on unlimited VFIO buffering
>>>>> ==========================================
>>>>>
>>>>> This follows with above bullet, but my pure question to ask here is how
>>>>> does VFIO guarantees no OOM condition by buffering VFIO state?
>>>>>
>>>>> I mean, currently your proposal used vfio_load_bufs_thread() as a separate
>>>>> thread to only load the vfio states until sequential data is received,
>>>>> however is there an upper limit of how much buffering it could do? IOW:
>>>>>
>>>>> vfio_load_state_buffer():
>>>>>
>>>>> if (packet->idx >= migration->load_bufs->len) {
>>>>> g_array_set_size(migration->load_bufs, packet->idx + 1);
>>>>> }
>>>>>
>>>>> lb = &g_array_index(migration->load_bufs, typeof(*lb), packet->idx);
>>>>> ...
>>>>> lb->data = g_memdup2(&packet->data, data_size - sizeof(*packet));
>>>>> lb->len = data_size - sizeof(*packet);
>>>>> lb->is_present = true;
>>>>>
>>>>> What if garray keeps growing with lb->data allocated, which triggers the
>>>>> memcg limit of the process (if QEMU is in such process)? Or just deplete
>>>>> host memory and causing OOM kill.
>>>>>
>>>>> I think we may need to find a way to throttle max memory usage of such
>>>>> buffering.
>>>>>
>>>>> So far this will be more of a problem indeed if this will be done during
>>>>> VFIO iteration phases, but I still hope a solution can work with both
>>>>> iteration phase and the switchover phase, even if you only do that in
>>>>> switchover phase
>>>>
>>>> Unfortunately, this issue will be hard to fix since the source can
>>>> legitimately send the very first buffer (chunk) of data as the last one
>>>> (at the very end of the transmission).
>>>>
>>>> In this case, the target will need to buffer nearly the whole data.
>>>>
>>>> We can't stop the receive on any channel, either, since the next missing
>>>> buffer can arrive at that channel.
>>>>
>>>> However, I don't think purposely DoSing the target QEMU is a realistic
>>>> security concern in the typical live migration scenario.
>>>>
>>>> I mean the source can easily force the target QEMU to exit just by
>>>> feeding it wrong migration data.
>>>>
>>>> In case someone really wants to protect against the impact of
>>>> theoretically unbounded QEMU memory allocations during live migration
>>>> on the rest of the system they can put the target QEMU process
>>>> (temporarily) into a memory-limited cgroup.
>>>
>>> Note that I'm not worrying about DoS of a malicious src QEMU, and I'm
>>> exactly talking about the generic case where QEMU (either src or dest, in
>>> that case normally both) is put into the memcg and if QEMU uses too much
>>> memory it'll literally get killed even if no DoS issue at all.
>>>
>>> In short, we hopefully will have a design that will always work with QEMU
>>> running in a container, without 0.5% chance dest qemu being killed, if you
>>> see what I meant.
>>>
>>> The upper bound of VFIO buffering will be needed so the admin can add that
>>> on top of the memcg limit and as long as QEMU keeps its words it'll always
>>> work without sudden death.
>>>
>>> I think I have some idea about resolving this problem. That idea can
>>> further complicate the protocol a little bit. But before that let's see
>>> whether we can reach an initial consensus on this matter first, on whether
>>> this is a sane request. In short, we'll need to start to have a
>>> configurable size to say how much VFIO can buffer, maybe per-device, or
>>> globally. Then based on that we need to have some logic guarantee that
>>> over-mem won't happen, also without heavily affecting concurrency (e.g.,
>>> single thread is definitely safe and without caching, but it can be
>>> slower).
>>
>> Here, I think I can add a per-device limit parameter on the number of
>> buffers received out-of-order or waiting to be loaded into the device -
>> with a reasonable default.
>
> Yes that should work.
>
> I don't even expect people would change that, but this might be the
> information people will need to know before putting it into a container if
> it's larger than how qemu dynamically consumes memories here and there.
> I'd expect it is still small enough so nobody will notice it (maybe a few
> tens of MBs? but just wildly guessing, where tens of MBs could fall into
> the "noise" memory allocation window of a VM).
The single buffer size is 8 MiB so I think the safe default should be
allowing 2 times the number of multifd channels.
With 5 multifd channels that's 10 buffers * 8 MiB = 80 MiB worst
case buffering per device.
But this will need to be determined experimentally once such parameter
is added to be sure it's enough.
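As a very rough sketch of what such a cap could look like on the receive
side (the names and exact policy below are hypothetical, not the current
patch code):

    #include <errno.h>

    typedef struct LoadBufferBudget {
        unsigned int buffered;       /* chunks held in memory, not yet loaded */
        unsigned int max_in_flight;  /* e.g. 2 * number of multifd channels */
    } LoadBufferBudget;

    /* Called before stashing a chunk that can't be loaded into the device yet. */
    static int load_buffer_reserve(LoadBufferBudget *b)
    {
        if (b->buffered >= b->max_in_flight) {
            /*
             * Over budget: fail cleanly (or, once the source learns to
             * throttle itself, never get here) instead of growing without
             * bound and risking a memcg / OOM kill.
             */
            return -ENOBUFS;
        }
        b->buffered++;
        return 0;
    }

    /* Called by the load thread once a chunk has been written to the device. */
    static void load_buffer_release(LoadBufferBudget *b)
    {
        b->buffered--;
    }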
>>
>> (..)
>>>>> 5. Worker thread model
>>>>> ======================
>>>>>
>>>>> I'm so far not happy with what this proposal suggests on creating the
>>>>> threads, also the two new hooks mostly just to create these threads..
>>>>
>>>> That the VFIO .save_live_complete_precopy_begin handler creates a new
>>>> per-device thread is an implementation detail for this particular
>>>> driver.
>>>>
>>>> The whole idea behind this and the save_live_complete_precopy_end hook was
>>>> that the details of how the particular device driver does its own async saving
>>>> are abstracted away from the migration core.
>>>>
>>>> The device then can do what's best / most efficient for it to do.
>>>
>>> Yes, and what I was thinking is whether it does it in form of "enqueue a
>>> task to migration worker threads", rather than "creating its own threads in
>>> the device hooks, and managing those threads alone".
>>>
>>> It's all about whether such threading can be reused by non-VFIO devices.
>>> They can't be reused if VFIO is in charge here, and it will make migration
>>> less generic.
>>>
>>> My current opinion is they can and should be re-usable. Consider if someone
>>> starts to teach multifd carry non-vfio data (e.g. a generic VMSD), then we
>>> can enqueue a task, do e.g. ioctl(KVM_GET_REGS) in those threads (rather
>>> than VFIO read()).
>>
>> Theoretically, it's obviously possible to wrap every operation in a request
>> to some thread pool.
>>
>>
>> But that would bring a lot of complexity, since instead of performing these
>> operation directly now the requester will need to:
>> 1) Prepare some "Operation" structure with the parameters of the requested
>> operation (task).
>> In your case this could be QEMU_OP_GET_VCPU_REGS operation using
>> "OperationGetVCPURegs" struct containing vCPU number parameter = 1.
>
> Why such complexity is needed?
I just gave an example of how running an individual task like
"ioctl(KVM_GET_REGS)" (the one you suggested above) in such a thread pool would
look.
> Can it be as simple as func(opaque) to be queued, then here
> func==vfio_save_complete_precopy_async_thread, opaque=VFIODevice*?
That would be possible, although in both implementations of:
1) adding a new thread pool type and wrapping device reading thread
creation around such pool, OR:
2) a direct qemu_thread_create() call.
the number of threads actually created would be the same.
That's unless someone sets the multifd channel count below the number
of VFIO devices - but one might argue that's not really a configuration
where good performance is expected anyway.
>>
>> 2) Submit this operation to the thread pool and wait for it to complete,
>
> VFIO doesn't need to have its own code waiting. If this pool is for
> migration purpose in general, qemu migration framework will need to wait at
> some point for all jobs to finish before moving on. Perhaps it should be
> at the end of the non-iterative session.
So essentially, instead of calling the save_live_complete_precopy_end handlers
from the migration code you would like to hard-code what its current VFIO
implementation does - calling vfio_save_complete_precopy_async_thread_thread_terminate().
Only then it wouldn't be called a VFIO precopy async thread terminate function
but some generic device state async precopy thread terminate function.
>>
>> 3) Thread pool needs to check whether it has any free threads in the pool
>> available to perform this operation.
>>
>> If not, and the count of threads that are CPU-bound (~aren't sleeping on
>> some I/O operation) is less than the number of logical CPUs in the system
>> the thread pool needs to spawn a new thread since there's some CPU capacity
>> available,
>
> For this one it can follow what thread-pool.c is doing, and the upper bound
> of n-threads can start from simple, e.g. min(n_channels_multifd, 8)?
It needs to be min(n_channels_multifd, n_device_state_devices), because
with 9 such devices and 9 multifd channels we need at least 9 threads.
>>
>> 4) The operation needs to be dispatched to the actual execution thread,
>>
>> 5) The execution thread needs to figure out which operation it needs to
>> actually do, fetch the necessary parameters from the proper "Operation"
>> structure, maybe take the necessary locks, before it can actually perform
>> the requested operation,
>>
>> 6) The execution thread needs to serialize (write) the operation result
>> back into some "OperationResult" structure, like "OperationGetVCPURegsResult",
>
> I think in this simplest case, the thread should simply run fn(opaque), in
> which it should start to call multifd_queue_device_state() and queue
> multifd jobs from the worker thread instead of the vfio dedicated threads.
> I don't yet expect much to change in your code from that regard inside what
> vfio_save_complete_precopy_async_thread() used to do.
>
>>
>> 7) The execution thread needs to submit this result back to the requester,
>>
>> 8) The thread pool needs to decide whether to keep this (now idle) execution
>> thread in the pool as a reserve thread or terminate it immediately,
>>
>> 9) The requester needs to be resumed somehow (returned from wait) now that
>> the operation it requested is complete,
>>
>> 10) The requester needs to fetch the operation results from the proper
>> "OperationResult" structure and decode them accordingly.
>>
>>
>> As you can see, that's *a lot* of extra code that needs to be maintained
>> for just a single operation type.
>
> I don't yet know why you designed it so complicated, but if I missed
> something above please let me know.
I explained above how running your example of "ioctl(KVM_GET_REGS)"
in such a thread pool would look.
(To be clear, it wasn't a proposal to be actually implemented.)
>>
>>>>
>>>>> I know I suggested that.. but that's comparing to what I read in the even
>>>>> earlier version, and sorry I wasn't able to suggest something better at
>>>>> that time because I simply thought less.
>>>>>
>>>>> As I mentioned in the other reply elsewhere, I think we should firstly have
>>>>> these threads ready to take data at the start of migration, so that it'll
>>>>> work when someone wants to add vfio iteration support. Then the jobs
>>>>> (mostly what vfio_save_complete_precopy_async_thread() does now) can be
>>>>> enqueued into the thread pools.
>>>>
>>>> I'm not sure that we can get away with using fewer threads than devices
>>>> as these devices might not support AIO reads from their migration file
>>>> descriptor.
>>>
>>> It doesn't need to use AIO reads - I'll be happy if the thread model can be
>>> generic, VFIO can still enqueue a task that does blocking reads.
>>>
>>> It can take a lot of time, but it's fine: others who like to enqueue too
>>> and see all threads busy, they should simply block there and waiting for
>>> the worker threads to be freed again. It's the same when there's no
>>> migration worker threads as it means the read() will block the main
>>> migration thread.
>>
>> Oh no, waiting for another device blocking read to complete before
>> scheduling another device blocking read is surely going to negatively
>> impact the performance.
>
> There can be e.g. 8 worker threads. If you want you can make sure the
> worker threads are at least more than vfio threads. Then it will guarantee
> vfio will dump / save() one device per thread concurrently.
Yes, I wrote this requirement above as
n_threads = min(n_channels_multifd, n_device_state_devices).
>>
>> For best performance we need to maximize parallelism - that means
>> reading (and loading) all the VFIO devices present in parallel.
>>
>> The whole point of having per-device threads is for the whole operation
>> to be I/O bound but never CPU bound on a reasonably fast machine - and
>> especially not number-of-threads-in-pool bound.
>>
>>> Now we can have multiple worker threads doing things concurrently if
>>> possible (some of them may not, especially when BQL will be required, but
>>> that's a separate thing, and many device save()s may not need BQL, and when
>>> it needs we can take it in the enqueued tasks).
>>>
>>>>
>>>> mlx5 devices, for example, seems to support only poll()ed / non-blocking
>>>> reads at best - with unknown performance in comparison with issuing
>>>> blocking reads from dedicated threads.
>>>>
>>>> On the other hand, handling a single device from multiple threads in
>>>> parallel is generally not possible due to difficulty of establishing in
>>>> which order the buffers were read.
>>>>
>>>> And if we need a per-VFIO device thread anyway then using a thread pool
>>>> doesn't help much - but brings extra complexity.
>>>>
>>>> In terms of starting the loading thread earlier to load also VM live
>>>> phase data it looks like a small change to the code so it shouldn't be
>>>> a problem.
>>>
>>> That's good to know. Please still consider a generic thread model and see
>>> what that would also work for your VFIO use case.
>>>
>>> If you see what thread-pool.c did right now is it'll dynamically create
>>> threads on the fly. I think that's something we can do too but just apply
>>> an upper limit to the thread numbers.
>>
>> We have an upper limit on the count of saving threads already - it's the
>> count of VFIO devices in the VM.
>>
>> The API in util/thread-pool.c is very basic and basically only allows
>> submitting either AIO operations or generic function call operation
>> but still within some AioContext.
>
> What I'm saying is a thread pool _without_ aio. I think it might be called
> ThreadPoolRaw and let ThreadPool depend on it, but I didn't further check yet.
So it's not using an existing thread pool implementation from util/thread-pool.c
but essentially creating a new one - with probably some code commonality
with the existing AIO one.
That's possible but since util/thread-pool.c AFAIK isn't owned by the
migration subsystem, such a new implementation will probably also need review by
other QEMU maintainers.
>>
>> There's almost none of the operation execution logic I described above -
>> all of these would need to be written and maintained.
>>
>>>>
>>>>> It's better to create the thread pool owned by migration, rather than
>>>>> threads owned by VFIO, because it also paves way for non-VFIO device state
>>>>> save()s, as I mentioned also above on the multifd packet header. Maybe we
>>>>> can have a flag in the packet header saying "this is device xxx's state,
>>>>> just load it".
>>>>
>>>> I think the same could be done by simply implementing these hooks in other
>>>> device types than VFIO, right?
>>>>
>>>> And if we notice that these implementations share a bit of code then we
>>>> can think about making a common helper library out of this code.
>>>>
>>>> After, all that's just an implementation detail that does not impact
>>>> the underlying bit stream protocol.
>>>
>>> You're correct.
>>>
>>> However, it still affects a few things.
>>>
>>> Firstly, it may mean that we may not even need those two extra vmstate
>>> hooks: the enqueue can happen already with save_state() if the migration
>>> worker model exists.
>>>
>>> So instead of this:
>>>
>>> vfio_save_state():
>>> if (migration->multifd_transfer) {
>>> /* Emit dummy NOP data */
>>> qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>>> return;
>>> }
>>>
>>> We can already do:
>>>
>>> if (migration->multifd_transfer) {
>>> // enqueue task to load state for this vfio device
>>> ...
>>> return;
>>> }
>>>
>>> IMHO it'll be much cleaner in VFIO code, and much cleaner too for migration
>>> code.
>>
>> The save_state hook is executed too late - only after all iterable
>> hooks have already transferred all their data.
>>
>> We want to start saving this device state as early as possible to not
>> have to wait for any other device to transfer its data first.
>>
>> That's why the code introduces save_live_complete_precopy_begin hook
>> that's guaranteed to be the very first hook called during switchover
>> phase device state saving.
>
> I think I mis-typed.. What I wanted to say is vfio_save_complete_precopy(),
> not vfio_save_state().
>
> There will be one challenge though where RAM is also an iterable, so RAM's
> save_live_complete_precopy() can delay VFIO's, even if it simply only needs
> to enqueue a job.
>
> Two solutions I can think of:
>
> (1) Provide a separate hook, e.g. save_live_complete_precopy_async(),
> when save_live_complete_precopy_async(opaque) is provided, instead of
> calling save_live_complete_precopy(), we inject that job into the worker
> threads. In that case we can loop over *_precopy_async() before all the
> rest *_precopy() calls.
That's basically the approach the current patch set is using, just not using
pool worker threads (yet).
Only the hook was renamed from save_live_complete_precopy_async to
save_live_complete_precopy_begin upon your comment on RFC requesting that.
> (2) Make RAM's save_live_complete_precopy() also do a similar enqueue
> when multifd enabled, so RAM will be saved in the worker thread too.
>
> However (2) can have other issues to work out. Do you think (1) is still
> doable?
>
Yes, I think (1) is the correct way to do it.
>>
>>> Another (possibly personal) reason is, I will not dare to touch VFIO code
>>> too much to do such a refactoring later. I simply don't have the VFIO
>>> devices around and I won't be able to test. So comparing to other things,
>>> I hope VFIO stuff can land more stable than others because I am not
>>> confident at least myself to clean it.
>>
>> That's a fair request, will keep this on mind.
>>
>>> I simply also don't like random threads floating around, considering that
>>> how we already have slightly a mess with migration on other reasons (we can
>>> still have random TLS threads floating around, I think... and they can
>>> cause very hard to debug issues). I feel shaky to maintain it when any
>>> device can also start to create whatever threads they can during migration.
>>
>> The threads themselves aren't very expensive as long as their number
>> is kept within reasonable bounds.
>>
>> 4 additional threads (present only during active migration operation)
>> with 4 VFIO devices is really not a lot.
>
> It's not about number, it's about management, and when something crashed at
> some unwanted point, then we may want to know what happened to those
> threads and how to recycle them.
I guess if you are more comfortable with maintaining code written in such a
way then that's some argument for it too.
>
> Thanks,
>
Thanks,
Maciej
* Re: [PATCH v1 00/13] Multifd device state transfer support with VFIO consumer
2024-06-26 15:47 ` Maciej S. Szmigiero
@ 2024-06-26 16:23 ` Peter Xu
2024-06-27 9:14 ` Maciej S. Szmigiero
0 siblings, 1 reply; 29+ messages in thread
From: Peter Xu @ 2024-06-26 16:23 UTC (permalink / raw)
To: Maciej S. Szmigiero
Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P. Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On Wed, Jun 26, 2024 at 05:47:34PM +0200, Maciej S. Szmigiero wrote:
> On 26.06.2024 03:51, Peter Xu wrote:
> > On Wed, Jun 26, 2024 at 12:44:29AM +0200, Maciej S. Szmigiero wrote:
> > > On 25.06.2024 19:25, Peter Xu wrote:
> > > > On Mon, Jun 24, 2024 at 09:51:18PM +0200, Maciej S. Szmigiero wrote:
> > > > > Hi Peter,
> > > >
> > > > Hi, Maciej,
> > > >
> > > > >
> > > > > On 23.06.2024 22:27, Peter Xu wrote:
> > > > > > On Tue, Jun 18, 2024 at 06:12:18PM +0200, Maciej S. Szmigiero wrote:
> > > > > > > From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> > > > > > >
> > > > > > > This is an updated v1 patch series of the RFC (v0) series located here:
> > > > > > > https://lore.kernel.org/qemu-devel/cover.1713269378.git.maciej.szmigiero@oracle.com/
> > > > > >
> > > > > > OK I took some hours thinking about this today, and here's some high level
> > > > > > comments for this series. I'll start with which are more relevant to what
> > > > > > Fabiano has already suggested in the other thread, then I'll add some more.
> > > > > >
> > > > > > https://lore.kernel.org/r/20240620212111.29319-1-farosas@suse.de
> > > > >
> > > > > That's a long list, thanks for these comments.
> > > > >
> > > > > I have responded to them inline below.
> > > > >
> > > (..)
> > > > >
> > > > > > 3. load_state_buffer() and VFIODeviceStatePacket protocol
> > > > > > =========================================================
> > > > > >
> > > > > > VFIODeviceStatePacket is the new protocol you introduced into multifd
> > > > > > packets, along with the new load_state_buffer() hook for loading such
> > > > > > buffers. My question is whether it's needed at all, or.. whether it can be
> > > > > > more generic (and also easier) to just allow taking any device state in the
> > > > > > multifd packets, then load it with vmstate load().
> > > > > >
> > > > > > I mean, the vmstate_load() should really have worked on these buffers, if
> > > > > > after all VFIO is looking for: (1) VFIO_MIG_FLAG_DEV_DATA_STATE as the
> > > > > > first flag (uint64), size as the 2nd, then (2) load that rest buffer into
> > > > > > VFIO kernel driver. That is the same to happen during the blackout window.
> > > > > > It's not clear to me why load_state_buffer() is needed.
> > > > > >
> > > > > > I also see that you're also using exactly the same chunk size for such
> > > > > > buffering (VFIOMigration.data_buffer_size).
> > > > > >
> > > > > > I think you have a "reason": VFIODeviceStatePacket and loading of the
> > > > > > buffer data resolved one major issue that wasn't there before but start to
> > > > > > have now: multifd allows concurrent arrivals of vfio buffers, even if the
> > > > > > buffer *must* be sequentially loaded.
> > > > > >
> > > > > > That's a major pain for current VFIO kernel ioctl design, IMHO. I think I
> > > > > > used to ask nVidia people on whether the VFIO get_state/set_state interface
> > > > > > can allow indexing or tagging of buffers but I never get a real response.
> > > > > > IMHO that'll be extremely helpful for migration purpose on concurrency if
> > > > > > it can happen, rather than using a serialized buffer. It means
> > > > > > concurrently save/load one VFIO device could be extremely hard, if not
> > > > > > impossible.
> > > > >
> > > > > I am pretty sure that the current kernel VFIO interface requires for the
> > > > > buffers to be loaded in-order - accidentally providing the out of order
> > > > > definitely breaks the restore procedure.
> > > >
> > > > Ah, I didn't mean that we need to do it with the current API. I'm talking
> > > > about whether it's possible to have a v2 that will support those otherwise
> > > > we'll need to do "workarounds" like what you're doing with "unlimited
> > > > buffer these on dest, until we receive continuous chunk of data" tricks.
> > >
> > > Better kernel API might be possible in the long term but for now we have
> > > to live with what we have right now.
> > >
> > > After all, adding true unordered loading - I mean not just moving the
> > > reassembly process from QEMU to the kernel but making the device itself
> > > accept buffers out of order - will likely be pretty complex (requiring
> > > adding such functionality to the device firmware, etc).
> >
> > I would expect the device will need to be able to provision the device
> > states so it became smaller objects rather than one binary object, then
> > either tag-able or address-able on those objects.
> >
> > >
> > > > And even with that trick, it'll still need to be serialized on the read()
> > > > syscall so it won't scale either if the state is huge. For that issue
> > > > there's no workaround we can do from userspace.
> > >
> > > The read() calls for multiple VFIO devices can be issued in parallel,
> > > and in fact they are in my patch set.
> >
> > I was talking about concurrency for one device.
>
> AFAIK with the current hardware the read speed is limited by the device
> itself, so adding additional reading threads wouldn't help.
OK.
>
> Once someone has the hardware which is limited by single reading thread
> that person can add the necessary kernel API (including unordered
> loading) and then extend QEMU with such support.
>
> > >
> > > (..)
> > > > > > 4. Risk of OOM on unlimited VFIO buffering
> > > > > > ==========================================
> > > > > >
> > > > > > This follows with above bullet, but my pure question to ask here is how
> > > > > > does VFIO guarantees no OOM condition by buffering VFIO state?
> > > > > >
> > > > > > I mean, currently your proposal used vfio_load_bufs_thread() as a separate
> > > > > > thread to only load the vfio states until sequential data is received,
> > > > > > however is there an upper limit of how much buffering it could do? IOW:
> > > > > >
> > > > > > vfio_load_state_buffer():
> > > > > >
> > > > > > if (packet->idx >= migration->load_bufs->len) {
> > > > > > g_array_set_size(migration->load_bufs, packet->idx + 1);
> > > > > > }
> > > > > >
> > > > > > lb = &g_array_index(migration->load_bufs, typeof(*lb), packet->idx);
> > > > > > ...
> > > > > > lb->data = g_memdup2(&packet->data, data_size - sizeof(*packet));
> > > > > > lb->len = data_size - sizeof(*packet);
> > > > > > lb->is_present = true;
> > > > > >
> > > > > > What if garray keeps growing with lb->data allocated, which triggers the
> > > > > > memcg limit of the process (if QEMU is in such process)? Or just deplete
> > > > > > host memory and causing OOM kill.
> > > > > >
> > > > > > I think we may need to find a way to throttle max memory usage of such
> > > > > > buffering.
> > > > > >
> > > > > > So far this will be more of a problem indeed if this will be done during
> > > > > > VFIO iteration phases, but I still hope a solution can work with both
> > > > > > iteration phase and the switchover phase, even if you only do that in
> > > > > > switchover phase
> > > > >
> > > > > Unfortunately, this issue will be hard to fix since the source can
> > > > > legitimately send the very first buffer (chunk) of data as the last one
> > > > > (at the very end of the transmission).
> > > > >
> > > > > In this case, the target will need to buffer nearly the whole data.
> > > > >
> > > > > We can't stop the receive on any channel, either, since the next missing
> > > > > buffer can arrive at that channel.
> > > > >
> > > > > However, I don't think purposely DoSing the target QEMU is a realistic
> > > > > security concern in the typical live migration scenario.
> > > > >
> > > > > I mean the source can easily force the target QEMU to exit just by
> > > > > feeding it wrong migration data.
> > > > >
> > > > > In case someone really wants to protect against the impact of
> > > > > theoretically unbounded QEMU memory allocations during live migration
> > > > > on the rest of the system they can put the target QEMU process
> > > > > (temporarily) into a memory-limited cgroup.
> > > >
> > > > Note that I'm not worrying about DoS of a malicious src QEMU, and I'm
> > > > exactly talking about the generic case where QEMU (either src or dest, in
> > > > that case normally both) is put into the memcg and if QEMU uses too much
> > > > memory it'll literally get killed even if no DoS issue at all.
> > > >
> > > > In short, we hopefully will have a design that will always work with QEMU
> > > > running in a container, without 0.5% chance dest qemu being killed, if you
> > > > see what I meant.
> > > >
> > > > The upper bound of VFIO buffering will be needed so the admin can add that
> > > > on top of the memcg limit and as long as QEMU keeps its words it'll always
> > > > work without sudden death.
> > > >
> > > > I think I have some idea about resolving this problem. That idea can
> > > > further complicate the protocol a little bit. But before that let's see
> > > > whether we can reach an initial consensus on this matter first, on whether
> > > > this is a sane request. In short, we'll need to start to have a
> > > > configurable size to say how much VFIO can buffer, maybe per-device, or
> > > > globally. Then based on that we need to have some logic guarantee that
> > > > over-mem won't happen, also without heavily affecting concurrency (e.g.,
> > > > single thread is definitely safe and without caching, but it can be
> > > > slower).
> > >
> > > Here, I think I can add a per-device limit parameter on the number of
> > > buffers received out-of-order or waiting to be loaded into the device -
> > > with a reasonable default.
> >
> > Yes that should work.
> >
> > I don't even expect people would change that, but this might be the
> > information people will need to know before putting it into a container if
> > it's larger than how qemu dynamically consumes memories here and there.
> > I'd expect it is still small enough so nobody will notice it (maybe a few
> > tens of MBs? but just wildly guessing, where tens of MBs could fall into
> > the "noise" memory allocation window of a VM).
>
> The single buffer size is 8 MiB so I think the safe default should be
> allowing 2 times the number of multifd channels.
>
> With 5 multifd channels that's 10 buffers * 8 MiB = 80 MiB worst
> case buffering per device.
>
> But this will need to be determined experimentally once such parameter
> is added to be sure it's enough.
Yes, you may want to test it with new logic to be able to throttle sending
on the src qemu (because otherwise, when the dest qemu is very unlucky, its
buffer is full but the initial index chunk is still missing), then make sure
it's relatively small but hopefully still keeps the flow running as much as
possible.
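One possible shape for that source-side throttle - purely illustrative, and
it assumes some acknowledgement path back from the destination that the
current series does not define:

    #include <pthread.h>

    typedef struct SendWindow {
        pthread_mutex_t lock;        /* assume PTHREAD_MUTEX_INITIALIZER */
        pthread_cond_t can_send;     /* assume PTHREAD_COND_INITIALIZER */
        unsigned int in_flight;      /* chunks sent but not yet acked as loaded */
        unsigned int max_in_flight;  /* matches the destination's buffer budget */
    } SendWindow;

    /* Source side: block until the destination has room for one more chunk. */
    static void send_window_acquire(SendWindow *w)
    {
        pthread_mutex_lock(&w->lock);
        while (w->in_flight >= w->max_in_flight) {
            pthread_cond_wait(&w->can_send, &w->lock);
        }
        w->in_flight++;
        pthread_mutex_unlock(&w->lock);
    }

    /* Called when the destination reports a chunk was consumed (the
     * hypothetical ack mentioned above). */
    static void send_window_release(SendWindow *w)
    {
        pthread_mutex_lock(&w->lock);
        w->in_flight--;
        pthread_cond_signal(&w->can_send);
        pthread_mutex_unlock(&w->lock);
    }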
>
> > >
> > > (..)
> > > > > > 5. Worker thread model
> > > > > > ======================
> > > > > >
> > > > > > I'm so far not happy with what this proposal suggests on creating the
> > > > > > threads, also the two new hooks mostly just to create these threads..
> > > > >
> > > > > That the VFIO .save_live_complete_precopy_begin handler creates a new
> > > > > per-device thread is an implementation detail for this particular
> > > > > driver.
> > > > >
> > > > > The whole idea behind this and save_live_complete_precopy_end hook was
> > > > > that details how the particular device driver does its own async saving
> > > > > is abstracted away from the migration core.
> > > > >
> > > > > The device then can do what's best / most efficient for it to do.
> > > >
> > > > Yes, and what I was thinking is whether it does it in form of "enqueue a
> > > > task to migration worker threads", rather than "creating its own threads in
> > > > the device hooks, and managing those threads alone".
> > > >
> > > > It's all about whether such threading can be reused by non-VFIO devices.
> > > > They can't be reused if VFIO is in charge here, and it will make migration
> > > > less generic.
> > > >
> > > > My current opinion is they can and should be re-usable. Consider if someone
> > > > starts to teach multifd carry non-vfio data (e.g. a generic VMSD), then we
> > > > can enqueue a task, do e.g. ioctl(KVM_GET_REGS) in those threads (rather
> > > > than VFIO read()).
> > >
> > > Theoretically, it's obviously possible to wrap every operation in a request
> > > to some thread pool.
> > >
> > >
> > > But that would bring a lot of complexity, since instead of performing these
> > > operation directly now the requester will need to:
> > > 1) Prepare some "Operation" structure with the parameters of the requested
> > > operation (task).
> > > In your case this could be QEMU_OP_GET_VCPU_REGS operation using
> > > "OperationGetVCPURegs" struct containing vCPU number parameter = 1.
> >
> > Why such complexity is needed?
>
> I just gave an example of how running an individual task like
> "ioctl(KVM_GET_REGS)" (the one you suggested above) in such a thread pool would
> look.
> > Can it be as simple as func(opaque) to be queued, then here
> > func==vfio_save_complete_precopy_async_thread, opaque=VFIODevice*?
>
> That would be possible, although in both implementations of:
> 1) adding a new thread pool type and wrapping device reading thread
> creation around such pool, OR:
> 2) a direct qemu_thread_create() call.
> the number of threads actually created would be the same.
Again, it's not the number of threads that I worry about.
If you create one thread but it's hard to manage, it's the same problem.
OTOH if it's a common model I think it's fine if you create 16 or 32,
especially when most of them are either idle or doing block IOs, so
they'll be put to sleep anyway. That's not a concern at all.
>
> That's unless someone sets the multifd channel count below the number
> of VFIO devices - but one might argue that's not really a configuration
> where good performance is expected anyway.
>
> > >
> > > 2) Submit this operation to the thread pool and wait for it to complete,
> >
> > VFIO doesn't need to have its own code waiting. If this pool is for
> > migration purpose in general, qemu migration framework will need to wait at
> > some point for all jobs to finish before moving on. Perhaps it should be
> > at the end of the non-iterative session.
>
> So essentially, instead of calling save_live_complete_precopy_end handlers
> from the migration code you would like to hard-code its current VFIO
> implementation of calling vfio_save_complete_precopy_async_thread_thread_terminate().
>
> Only it wouldn't be then called VFIO precopy async thread terminate but some
> generic device state async precopy thread terminate function.
I don't understand what you meant by "hard code".
What I was saying is that if we target the worker thread pool to be used for
"concurrently dumping vmstates", then it makes sense to make sure all the
jobs there are flushed after qemu dumps all non-iterables (because this
should be the last step of the switchover).
I expect it looks like this:
    while (pool->active_threads) {
        qemu_sem_wait(&pool->job_done);
    }
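Fleshing that out a bit - a simplified standalone model using plain pthreads
rather than the real QemuThread/QemuSemaphore helpers, and one thread per job
rather than a bounded pool, just to show the submit/flush shape:

    #include <pthread.h>
    #include <stdlib.h>

    typedef struct WorkerPool {
        pthread_mutex_t lock;      /* assume PTHREAD_MUTEX_INITIALIZER */
        pthread_cond_t job_done;   /* assume PTHREAD_COND_INITIALIZER */
        unsigned int active_jobs;  /* submitted and not yet finished */
    } WorkerPool;

    typedef struct Job {
        WorkerPool *pool;
        void (*func)(void *opaque);
        void *opaque;
    } Job;

    static void *pool_thread(void *arg)
    {
        Job *job = arg;

        job->func(job->opaque);    /* e.g. one VFIO device's switchover save */

        pthread_mutex_lock(&job->pool->lock);
        job->pool->active_jobs--;
        pthread_cond_signal(&job->pool->job_done);
        pthread_mutex_unlock(&job->pool->lock);

        free(job);
        return NULL;
    }

    /* "func(opaque)" is all a device hook has to hand over. */
    static int pool_submit(WorkerPool *pool, void (*func)(void *), void *opaque)
    {
        pthread_t thread;
        Job *job = malloc(sizeof(*job));

        if (!job) {
            return -1;
        }
        job->pool = pool;
        job->func = func;
        job->opaque = opaque;

        pthread_mutex_lock(&pool->lock);
        pool->active_jobs++;
        pthread_mutex_unlock(&pool->lock);

        if (pthread_create(&thread, NULL, pool_thread, job) != 0) {
            pthread_mutex_lock(&pool->lock);
            pool->active_jobs--;
            pthread_mutex_unlock(&pool->lock);
            free(job);
            return -1;
        }
        pthread_detach(thread);
        return 0;
    }

    /* Migration core: flush all jobs after dumping the non-iterables. */
    static void pool_flush(WorkerPool *pool)
    {
        pthread_mutex_lock(&pool->lock);
        while (pool->active_jobs) {
            pthread_cond_wait(&pool->job_done, &pool->lock);
        }
        pthread_mutex_unlock(&pool->lock);
    }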
>
> > >
> > > 3) Thread pool needs to check whether it has any free threads in the pool
> > > available to perform this operation.
> > >
> > > If not, and the count of threads that are CPU-bound (~aren't sleeping on
> > > some I/O operation) is less than the number of logical CPUs in the system
> > > the thread pool needs to spawn a new thread since there's some CPU capacity
> > > available,
> >
> > For this one it can follow what thread-pool.c is doing, and the upper bound
> > of n-threads can start from simple, e.g. min(n_channels_multifd, 8)?
>
> It needs to be min(n_channels_multifd, n_device_state_devices), because
> with 9 such devices and 9 multifd channels we need at least 9 threads.
>
> > >
> > > 4) The operation needs to be dispatched to the actual execution thread,
> > >
> > > 5) The execution thread needs to figure out which operation it needs to
> > > actually do, fetch the necessary parameters from the proper "Operation"
> > > structure, maybe take the necessary locks, before it can actually perform
> > > the requested operation,
> > >
> > > 6) The execution thread needs to serialize (write) the operation result
> > > back into some "OperationResult" structure, like "OperationGetVCPURegsResult",
> >
> > I think in this simplest case, the thread should simply run fn(opaque), in
> > which it should start to call multifd_queue_device_state() and queue
> > multifd jobs from the worker thread instead of the vfio dedicated threads.
> > I don't yet expect much to change in your code from that regard inside what
> > vfio_save_complete_precopy_async_thread() used to do.
> >
> > >
> > > 7) The execution thread needs to submit this result back to the requester,
> > >
> > > 8) The thread pool needs to decide whether to keep this (now idle) execution
> > > thread in the pool as a reserve thread or terminate it immediately,
> > >
> > > 9) The requester needs to be resumed somehow (returned from wait) now that
> > > the operation it requested is complete,
> > >
> > > 10) The requester needs to fetch the operation results from the proper
> > > "OperationResult" structure and decode them accordingly.
> > >
> > >
> > > As you can see, that's *a lot* of extra code that needs to be maintained
> > > for just a single operation type.
> >
> > I don't yet know why you designed it so complicated, but if I missed
> > something above please let me know.
>
> I explained above how running your example of "ioctl(KVM_GET_REGS)"
> in such a thread pool would look.
> (To be clear, it wasn't a proposal to be actually implemented.)
>
> > >
> > > > >
> > > > > > I know I suggested that.. but that's comparing to what I read in the even
> > > > > > earlier version, and sorry I wasn't able to suggest something better at
> > > > > > that time because I simply thought less.
> > > > > >
> > > > > > As I mentioned in the other reply elsewhere, I think we should firstly have
> > > > > > these threads ready to take data at the start of migration, so that it'll
> > > > > > work when someone wants to add vfio iteration support. Then the jobs
> > > > > > (mostly what vfio_save_complete_precopy_async_thread() does now) can be
> > > > > > enqueued into the thread pools.
> > > > >
> > > > > I'm not sure that we can get away with using fewer threads than devices
> > > > > as these devices might not support AIO reads from their migration file
> > > > > descriptor.
> > > >
> > > > It doesn't need to use AIO reads - I'll be happy if the thread model can be
> > > > generic, VFIO can still enqueue a task that does blocking reads.
> > > >
> > > > It can take a lot of time, but it's fine: others who like to enqueue too
> > > > and see all threads busy, they should simply block there and waiting for
> > > > the worker threads to be freed again. It's the same when there's no
> > > > migration worker threads as it means the read() will block the main
> > > > migration thread.
> > >
> > > Oh no, waiting for another device blocking read to complete before
> > > scheduling another device blocking read is surely going to negatively
> > > impact the performance.
> >
> > There can be e.g. 8 worker threads. If you want you can make sure the
> > worker threads are at least more than vfio threads. Then it will guarantee
> > vfio will dump / save() one device per thread concurrently.
>
> Yes, I wrote this requirement above as
> n_threads = min(n_channels_multifd, n_device_state_devices).
>
> > >
> > > For best performance we need to maximize parallelism - that means
> > > reading (and loading) all the VFIO devices present in parallel.
> > >
> > > The whole point of having per-device threads is for the whole operation
> > > to be I/O bound but never CPU bound on a reasonably fast machine - and
> > > especially not number-of-threads-in-pool bound.
> > >
> > > > Now we can have multiple worker threads doing things concurrently if
> > > > possible (some of them may not, especially when BQL will be required, but
> > > > that's a separate thing, and many device save()s may not need BQL, and when
> > > > it needs we can take it in the enqueued tasks).
> > > >
> > > > >
> > > > > mlx5 devices, for example, seems to support only poll()ed / non-blocking
> > > > > reads at best - with unknown performance in comparison with issuing
> > > > > blocking reads from dedicated threads.
> > > > >
> > > > > On the other hand, handling a single device from multiple threads in
> > > > > parallel is generally not possible due to difficulty of establishing in
> > > > > which order the buffers were read.
> > > > >
> > > > > And if we need a per-VFIO device thread anyway then using a thread pool
> > > > > doesn't help much - but brings extra complexity.
> > > > >
> > > > > In terms of starting the loading thread earlier to load also VM live
> > > > > phase data it looks like a small change to the code so it shouldn't be
> > > > > a problem.
> > > >
> > > > That's good to know. Please still consider a generic thread model and see
> > > > what that would also work for your VFIO use case.
> > > >
> > > > If you see what thread-pool.c did right now is it'll dynamically create
> > > > threads on the fly. I think that's something we can do too but just apply
> > > > an upper limit to the thread numbers.
> > >
> > > We have an upper limit on the count of saving threads already - it's the
> > > count of VFIO devices in the VM.
> > >
> > > The API in util/thread-pool.c is very basic and basically only allows
> > > submitting either AIO operations or generic function call operation
> > > but still within some AioContext.
> >
> > What I'm saying is a thread pool _without_ aio. I think it might be called
> > ThreadPoolRaw and let ThreadPool depend on it, but I didn't further check yet.
>
> So it's not using an existing thread pool implementation from util/thread-pool.c
> but essentially creating a new one - with probably some code commonality
> with the existing AIO one.
>
> That's possible but since util/thread-pool.c AFAIK isn't owned by the
> migration subsystem, such a new implementation will probably also need review by
> other QEMU maintainers.
Yes, that's how we normally should do it. Obviously you still want to push
that in 9.1, so if you want you can create that pool implementation under
migration/, and we can try to move it over as a future rework, having block
people review that later.
>
> > >
> > > There's almost none of the operation execution logic I described above -
> > > all of these would need to be written and maintained.
> > >
> > > > >
> > > > > > It's better to create the thread pool owned by migration, rather than
> > > > > > threads owned by VFIO, because it also paves way for non-VFIO device state
> > > > > > save()s, as I mentioned also above on the multifd packet header. Maybe we
> > > > > > can have a flag in the packet header saying "this is device xxx's state,
> > > > > > just load it".
> > > > >
> > > > > I think the same could be done by simply implementing these hooks in other
> > > > > device types than VFIO, right?
> > > > >
> > > > > And if we notice that these implementations share a bit of code then we
> > > > > can think about making a common helper library out of this code.
> > > > >
> > > > > After, all that's just an implementation detail that does not impact
> > > > > the underlying bit stream protocol.
> > > >
> > > > You're correct.
> > > >
> > > > However, it still affects a few things.
> > > >
> > > > Firstly, it may mean that we may not even need those two extra vmstate
> > > > hooks: the enqueue can happen already with save_state() if the migration
> > > > worker model exists.
> > > >
> > > > So instead of this:
> > > >
> > > > vfio_save_state():
> > > > if (migration->multifd_transfer) {
> > > > /* Emit dummy NOP data */
> > > > qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> > > > return;
> > > > }
> > > >
> > > > We can already do:
> > > >
> > > > if (migration->multifd_transfer) {
> > > > // enqueue task to load state for this vfio device
> > > > ...
> > > > return;
> > > > }
> > > >
> > > > IMHO it'll be much cleaner in VFIO code, and much cleaner too for migration
> > > > code.
> > >
> > > The save_state hook is executed too late - only after all iterable
> > > hooks have already transferred all their data.
> > >
> > > We want to start saving this device state as early as possible to not
> > > have to wait for any other device to transfer its data first.
> > >
> > > That's why the code introduces save_live_complete_precopy_begin hook
> > > that's guaranteed to be the very first hook called during switchover
> > > phase device state saving.
> >
> > I think I mis-typed.. What I wanted to say is vfio_save_complete_precopy(),
> > not vfio_save_state().
> >
> > There will be one challenge though where RAM is also an iterable, so RAM's
> > save_live_complete_precopy() can delay VFIO's, even if it simply only needs
> > to enqueue a job.
> >
> > Two solutions I can think of:
> >
> > (1) Provide a separate hook, e.g. save_live_complete_precopy_async(),
> > when save_live_complete_precopy_async(opaque) is provided, instead of
> > calling save_live_complete_precopy(), we inject that job into the worker
> > threads. In that case we can loop over *_precopy_async() before all the
> > rest *_precopy() calls.
>
> That's basically the approach the current patch set is using, just not using
> pool worker threads (yet).
>
> Only the hook was renamed from save_live_complete_precopy_async to
> save_live_complete_precopy_begin upon your comment on RFC requesting that.
>
> > (2) Make RAM's save_live_complete_precopy() also do a similar enqueue
> > when multifd enabled, so RAM will be saved in the worker thread too.
> >
> > However (2) can have other issues to work out. Do you think (1) is still
> > doable?
> >
>
> Yes, I think (1) is the correct way to do it.
I don't think "correct" is the correct word to put it.. it's really a
matter of whether you want to push this earlier in-tree.
The 2nd proposal will be more than correct to me, IMHO. That'll be really
helpful to VFIO too: when RAM can be saved concurrently, it means
these things can all be done in parallel:
- VFIO, one thread per device
- RAM, one thread
- non-iterables
Otherwise 2+3 needs to be serialized.
If you're looking for downtime optimizations that may also be relevant, afaiu.
And that's also one of the major points why I want to convince you not to
use a separate vfio thread, because AFAICT we simply have other users.
>
> > >
> > > > Another (possibly personal) reason is, I will not dare to touch VFIO code
> > > > too much to do such a refactoring later. I simply don't have the VFIO
> > > > devices around and I won't be able to test. So comparing to other things,
> > > > I hope VFIO stuff can land more stable than others because I am not
> > > > confident at least myself to clean it.
> > >
> > > That's a fair request, will keep this on mind.
> > >
> > > > I simply also don't like random threads floating around, considering that
> > > > how we already have slightly a mess with migration on other reasons (we can
> > > > still have random TLS threads floating around, I think... and they can
> > > > cause very hard to debug issues). I feel shaky to maintain it when any
> > > > device can also start to create whatever threads they can during migration.
> > >
> > > The threads themselves aren't very expensive as long as their number
> > > is kept within reasonable bounds.
> > >
> > > 4 additional threads (present only during active migration operation)
> > > with 4 VFIO devices is really not a lot.
> >
> > It's not about number, it's about management, and when something crashed at
> > some unwanted point, then we may want to know what happened to those
> > threads and how to recycle them.
>
> I guess if you are more comfortable with maintaining code written in such a
> way then that's some argument for it too.
It's not about my flavour of maintenance.
We used to work on issues where we saw a dangling thread operate on
migration objects even though it was created in the _previous_ migration,
which had been cancelled and retried. The thread didn't know that. It was
effectively leaked and it caused issues that were hard to debug.
VFIO can cause a similar thing if it creates threads that migration
developers may overlook and that are not easy to manage. Then it'll be the
same challenge when a vfio thread dangles for some reason, and it'll just
make things harder to debug when an issue happens.
I want to make sure, if at all possible, that the migration framework manages
threads on its own, so no thread will be fiddling around without being noticed.
Not to mention, as I said previously, that "having some async model to
dump vmstate" isn't something special to VFIO; it can easily be extended to
either RAM or other normal VMSDs if we can tackle other issues here and
there. The general request is the same. It'll be chaos if vfio starts
to create its own threads, then vDPA and others. To me it is much saner to
make it a generic model.
Thanks,
--
Peter Xu
* Re: [PATCH v1 00/13] Multifd device state transfer support with VFIO consumer
2024-06-26 16:23 ` Peter Xu
@ 2024-06-27 9:14 ` Maciej S. Szmigiero
2024-06-27 14:56 ` Peter Xu
2024-06-27 15:09 ` Peter Xu
0 siblings, 2 replies; 29+ messages in thread
From: Maciej S. Szmigiero @ 2024-06-27 9:14 UTC (permalink / raw)
To: Peter Xu
Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P. Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On 26.06.2024 18:23, Peter Xu wrote:
> On Wed, Jun 26, 2024 at 05:47:34PM +0200, Maciej S. Szmigiero wrote:
>> On 26.06.2024 03:51, Peter Xu wrote:
>>> On Wed, Jun 26, 2024 at 12:44:29AM +0200, Maciej S. Szmigiero wrote:
>>>> On 25.06.2024 19:25, Peter Xu wrote:
>>>>> On Mon, Jun 24, 2024 at 09:51:18PM +0200, Maciej S. Szmigiero wrote:
>>>>>> Hi Peter,
>>>>>
>>>>> Hi, Maciej,
>>>>>
>>>>>>
>>>>>> On 23.06.2024 22:27, Peter Xu wrote:
>>>>>>> On Tue, Jun 18, 2024 at 06:12:18PM +0200, Maciej S. Szmigiero wrote:
>>>>>>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>>>>>>
>>>>>>>> This is an updated v1 patch series of the RFC (v0) series located here:
>>>>>>>> https://lore.kernel.org/qemu-devel/cover.1713269378.git.maciej.szmigiero@oracle.com/
>>>>>>>
>>>>>>> OK I took some hours thinking about this today, and here's some high level
>>>>>>> comments for this series. I'll start with which are more relevant to what
>>>>>>> Fabiano has already suggested in the other thread, then I'll add some more.
>>>>>>>
>>>>>>> https://lore.kernel.org/r/20240620212111.29319-1-farosas@suse.de
>>>>>>
>>>>>> That's a long list, thanks for these comments.
>>>>>>
>>>>>> I have responded to them inline below.
>>>>>>
(..)
>>>>>>> 4. Risk of OOM on unlimited VFIO buffering
>>>>>>> ==========================================
>>>>>>>
>>>>>>> This follows with above bullet, but my pure question to ask here is how
>>>>>>> does VFIO guarantees no OOM condition by buffering VFIO state?
>>>>>>>
>>>>>>> I mean, currently your proposal used vfio_load_bufs_thread() as a separate
>>>>>>> thread to only load the vfio states until sequential data is received,
>>>>>>> however is there an upper limit of how much buffering it could do? IOW:
>>>>>>>
>>>>>>> vfio_load_state_buffer():
>>>>>>>
>>>>>>> if (packet->idx >= migration->load_bufs->len) {
>>>>>>> g_array_set_size(migration->load_bufs, packet->idx + 1);
>>>>>>> }
>>>>>>>
>>>>>>> lb = &g_array_index(migration->load_bufs, typeof(*lb), packet->idx);
>>>>>>> ...
>>>>>>> lb->data = g_memdup2(&packet->data, data_size - sizeof(*packet));
>>>>>>> lb->len = data_size - sizeof(*packet);
>>>>>>> lb->is_present = true;
>>>>>>>
>>>>>>> What if garray keeps growing with lb->data allocated, which triggers the
>>>>>>> memcg limit of the process (if QEMU is in such process)? Or just deplete
>>>>>>> host memory and causing OOM kill.
>>>>>>>
>>>>>>> I think we may need to find a way to throttle max memory usage of such
>>>>>>> buffering.
>>>>>>>
>>>>>>> So far this will be more of a problem indeed if this will be done during
>>>>>>> VFIO iteration phases, but I still hope a solution can work with both
>>>>>>> iteration phase and the switchover phase, even if you only do that in
>>>>>>> switchover phase
>>>>>>
>>>>>> Unfortunately, this issue will be hard to fix since the source can
>>>>>> legitimately send the very first buffer (chunk) of data as the last one
>>>>>> (at the very end of the transmission).
>>>>>>
>>>>>> In this case, the target will need to buffer nearly the whole data.
>>>>>>
>>>>>> We can't stop the receive on any channel, either, since the next missing
>>>>>> buffer can arrive at that channel.
>>>>>>
>>>>>> However, I don't think purposely DoSing the target QEMU is a realistic
>>>>>> security concern in the typical live migration scenario.
>>>>>>
>>>>>> I mean the source can easily force the target QEMU to exit just by
>>>>>> feeding it wrong migration data.
>>>>>>
>>>>>> In case someone really wants to protect against the impact of
>>>>>> theoretically unbounded QEMU memory allocations during live migration
>>>>>> on the rest of the system they can put the target QEMU process
>>>>>> (temporarily) into a memory-limited cgroup.
>>>>>
>>>>> Note that I'm not worrying about DoS of a malicious src QEMU, and I'm
>>>>> exactly talking about the generic case where QEMU (either src or dest, in
>>>>> that case normally both) is put into the memcg and if QEMU uses too much
>>>>> memory it'll literally get killed even if no DoS issue at all.
>>>>>
>>>>> In short, we hopefully will have a design that will always work with QEMU
>>>>> running in a container, without 0.5% chance dest qemu being killed, if you
>>>>> see what I meant.
>>>>>
>>>>> The upper bound of VFIO buffering will be needed so the admin can add that
>>>>> on top of the memcg limit and as long as QEMU keeps its words it'll always
>>>>> work without sudden death.
>>>>>
>>>>> I think I have some idea about resolving this problem. That idea can
>>>>> further complicate the protocol a little bit. But before that let's see
>>>>> whether we can reach an initial consensus on this matter first, on whether
>>>>> this is a sane request. In short, we'll need to start to have a
>>>>> configurable size to say how much VFIO can buffer, maybe per-device, or
>>>>> globally. Then based on that we need to have some logic guarantee that
>>>>> over-mem won't happen, also without heavily affecting concurrency (e.g.,
>>>>> single thread is definitely safe and without caching, but it can be
>>>>> slower).
>>>>
>>>> Here, I think I can add a per-device limit parameter on the number of
>>>> buffers received out-of-order or waiting to be loaded into the device -
>>>> with a reasonable default.
>>>
>>> Yes that should work.
>>>
>>> I don't even expect people would change that, but this might be the
>>> information people will need to know before putting it into a container if
>>> it's larger than how qemu dynamically consumes memories here and there.
>>> I'd expect it is still small enough so nobody will notice it (maybe a few
>>> tens of MBs? but just wildly guessing, where tens of MBs could fall into
>>> the "noise" memory allocation window of a VM).
>>
>> The single buffer size is 8 MiB so I think the safe default should be
>> allowing 2 times the number of multifd channels.
>>
>> With 5 multifd channels that's 10 buffers * 8 MiB = 80 MiB worst
>> case buffering per device.
>>
>> But this will need to be determined experimentally once such parameter
>> is added to be sure it's enough.
>
> Yes you may want to test it with a new logic to be able to throttle sending
> on src qemu (because otherwise when dest qemu is very unlucky its buffer is
> full but still the initial index chunk is missing), then making sure it's
> relatively small but hopefully still keep the flow running as much as
> possible.
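
For illustration only, the per-device cap discussed above could boil down to a
bounds check before the g_array_set_size() call quoted earlier; the helper and
the max_load_bufs parameter below are assumptions made for the sketch, not code
from the posted series:

    /*
     * Sketch: refuse to buffer more than a configured number of
     * out-of-order chunks for one device, so the worst-case memory use
     * stays at max_load_bufs * buffer_size per device.
     */
    static bool vfio_load_buf_within_limit(uint32_t idx, uint32_t max_load_bufs,
                                           Error **errp)
    {
        if (idx < max_load_bufs) {
            return true;
        }
        error_setg(errp, "buffer index %u exceeds per-device limit of %u",
                   idx, max_load_bufs);
        return false;
    }
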
>
>>
>>>>
>>>> (..)
>>>>>>> 5. Worker thread model
>>>>>>> ======================
>>>>>>>
>>>>>>> I'm so far not happy with what this proposal suggests on creating the
>>>>>>> threads, also the two new hooks mostly just to create these threads..
>>>>>>
>>>>>> That VFIO .save_live_complete_precopy_begin handler creates a new
>>>>>> per-device thread is an implementation detail for this particular
>>>>>> driver.
>>>>>>
>>>>>> The whole idea behind this and the save_live_complete_precopy_end hook was
>>>>>> that the details of how the particular device driver does its own async
>>>>>> saving are abstracted away from the migration core.
>>>>>>
>>>>>> The device then can do what's best / most efficient for it to do.
>>>>>
>>>>> Yes, and what I was thinking is whether it does it in form of "enqueue a
>>>>> task to migration worker threads", rather than "creating its own threads in
>>>>> the device hooks, and managing those threads alone".
>>>>>
>>>>> It's all about whether such threading can be reused by non-VFIO devices.
>>>>> They can't be reused if VFIO is in charge here, and it will make migration
>>>>> less generic.
>>>>>
>>>>> My current opinion is they can and should be re-usable. Consider if someone
>>>>> starts to teach multifd to carry non-vfio data (e.g. a generic VMSD), then we
>>>>> can enqueue a task, do e.g. ioctl(KVM_GET_REGS) in those threads (rather
>>>>> than VFIO read()).
>>>>
>>>> Theoretically, it's obviously possible to wrap every operation in a request
>>>> to some thread pool.
>>>>
>>>>
>>>> But that would bring a lot of complexity, since instead of performing these
>>>> operation directly now the requester will need to:
>>>> 1) Prepare some "Operation" structure with the parameters of the requested
>>>> operation (task).
>>>> In your case this could be QEMU_OP_GET_VCPU_REGS operation using
>>>> "OperationGetVCPURegs" struct containing vCPU number parameter = 1.
>>>
>>> Why such complexity is needed?
>>
>> I just gave an example of how running an individual task like
>> "ioctl(KVM_GET_REGS)" (that you suggested above) in such a thread pool
>> would look.
>>> Can it be as simple as func(opaque) to be queued, then here
>>> func==vfio_save_complete_precopy_async_thread, opaque=VFIODevice*?
>>
>> That would be possible, although in both implementations of:
>> 1) adding a new thread pool type and wrapping device reading thread
>> creation around such pool, OR:
>> 2) a direct qemu_thread_create() call.
>> the number of threads actually created would be the same.
>
> Again, it's not about the number of threads that I worry.
>
> If you create one thread but hard to manage it's the same.
>
> OTOH if it's a common model I think it's fine if you create 16 or 32,
> especially when most of them are either idle or doing block IOs, then
> they'll be put to sleep anyway. That's not a concern at all.
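
To make the func(opaque) idea above concrete, a minimal enqueue interface might
look like the sketch below; the MigWorker* names are invented for illustration
and no such API exists in the tree:

    /* Sketch of a migration-owned job queue entry (names hypothetical). */
    typedef void (*MigWorkerFn)(void *opaque);

    typedef struct MigWorkerJob {
        MigWorkerFn fn;       /* e.g. the per-device save routine */
        void *opaque;         /* e.g. a VFIODevice pointer */
        QSIMPLEQ_ENTRY(MigWorkerJob) next;
    } MigWorkerJob;

    /* Queued by a device hook; executed later by one of the pool's workers. */
    void migration_worker_enqueue(MigWorkerFn fn, void *opaque);
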
>
>>
>> That's unless someone sets the multifd channel count below the number
>> of VFIO devices - but one might argue that's not really a configuration
>> where good performance is expected anyway.
>>
>>>>
>>>> 2) Submit this operation to the thread pool and wait for it to complete,
>>>
>>> VFIO doesn't need to have its own code waiting. If this pool is for
>>> migration purpose in general, qemu migration framework will need to wait at
>>> some point for all jobs to finish before moving on. Perhaps it should be
>>> at the end of the non-iterative session.
>>
>> So essentially, instead of calling save_live_complete_precopy_end handlers
>> from the migration code you would like to hard-code its current VFIO
>> implementation of calling vfio_save_complete_precopy_async_thread_thread_terminate().
>>
>> Only it wouldn't be then called VFIO precopy async thread terminate but some
>> generic device state async precopy thread terminate function.
>
> I don't understand what you meant by "hard code".
"Hard code" wasn't maybe the best expression here.
I meant the move of the functionality that's provided by
vfio_save_complete_precopy_async_thread_thread_terminate() in this patch set
to the common migration code.
> What I was saying is if we target the worker thread pool to be used for
> "concurrently dump vmstates", then it'll make sense to make sure all the
> jobs there were flushed after qemu dumps all non-iterables (because this
> should be the last step of the switchover).
>
> I expect it looks like this:
>
> while (pool->active_threads) {
> qemu_sem_wait(&pool->job_done);
> }
>
>>
>>>>
>>>> 3) Thread pool needs to check whether it has any free threads in the pool
>>>> available to perform this operation.
>>>>
>>>> If not, and the count of threads that are CPU-bound (~aren't sleeping on
>>>> some I/O operation) is less than the number of logical CPUs in the system
>>>> the thread pool needs to spawn a new thread since there's some CPU capacity
>>>> available,
>>>
>>> For this one it can follow what thread-pool.c is doing, and the upper bound
>>> of n-threads can start from simple, e.g. min(n_channels_multifd, 8)?
>>
>> It needs to be min(n_channels_multifd, n_device_state_devices), because
>> with 9 such devices and 9 multifd channels we need at least 9 threads.
>>
>>>>
>>>> 4) The operation needs to be dispatched to the actual execution thread,
>>>>
>>>> 5) The execution thread needs to figure out which operation it needs to
>>>> actually do, fetch the necessary parameters from the proper "Operation"
>>>> structure, maybe take the necessary locks, before it can actually perform
>>>> the requested operation,
>>>>
>>>> 6) The execution thread needs to serialize (write) the operation result
>>>> back into some "OperationResult" structure, like "OperationGetVCPURegsResult",
>>>
>>> I think in this simplest case, the thread should simply run fn(opaque), in
>>> which it should start to call multifd_queue_device_state() and queue
>>> multifd jobs from the worker thread instead of the vfio dedicated threads.
>>> I don't yet expect much to change in your code from that regard inside what
>>> vfio_save_complete_precopy_async_thread() used to do.
>>>
>>>>
>>>> 7) The execution thread needs to submit this result back to the requester,
>>>>
>>>> 8) The thread pool needs to decide whether to keep this (now idle) execution
>>>> thread in the pool as a reserve thread or terminate it immediately,
>>>>
>>>> 9) The requester needs to be resumed somehow (returned from wait) now that
>>>> the operation it requested is complete,
>>>>
>>>> 10) The requester needs to fetch the operation results from the proper
>>>> "OperationResult" structure and decode them accordingly.
>>>>
>>>>
>>>> As you can see, that's *a lot* of extra code that needs to be maintained
>>>> for just a single operation type.
>>>
>>> I don't yet know why you designed it so complicated, but if I missed
>>> something above please let me know.
>>
>> I explained above that's how running your example of "ioctl(KVM_GET_REGS)"
>> in such a thread pool would look.
>> (It wasn't a proposal to actually be implemented, to be clear.)
>>
>>>>
>>>>>>
>>>>>>> I know I suggested that.. but that's comparing to what I read in the even
>>>>>>> earlier version, and sorry I wasn't able to suggest something better at
>>>>>>> that time because I simply thought less.
>>>>>>>
>>>>>>> As I mentioned in the other reply elsewhere, I think we should firstly have
>>>>>>> these threads ready to take data at the start of migration, so that it'll
>>>>>>> work when someone wants to add vfio iteration support. Then the jobs
>>>>>>> (mostly what vfio_save_complete_precopy_async_thread() does now) can be
>>>>>>> enqueued into the thread pools.
>>>>>>
>>>>>> I'm not sure that we can get away with using fewer threads than devices
>>>>>> as these devices might not support AIO reads from their migration file
>>>>>> descriptor.
>>>>>
>>>>> It doesn't need to use AIO reads - I'll be happy if the thread model can be
>>>>> generic, VFIO can still enqueue a task that does blocking reads.
>>>>>
>>>>> It can take a lot of time, but it's fine: others who like to enqueue too
>>>>> and see all threads busy, they should simply block there and waiting for
>>>>> the worker threads to be freed again. It's the same when there's no
>>>>> migration worker threads as it means the read() will block the main
>>>>> migration thread.
>>>>
>>>> Oh no, waiting for another device blocking read to complete before
>>>> scheduling another device blocking read is surely going to negatively
>>>> impact the performance.
>>>
>>> There can be e.g. 8 worker threads. If you want you can make sure the
>>> worker threads are at least more than vfio threads. Then it will guarantee
>>> vfio will dump / save() one device per thread concurrently.
>>
>> Yes, I wrote this requirement above as
>> n_threads = min(n_channels_multifd, n_device_state_devices).
>>
>>>>
>>>> For best performance we need to maximize parallelism - that means
>>>> reading (and loading) all the VFIO devices present in parallel.
>>>>
>>>> The whole point of having per-device threads is for the whole operation
>>>> to be I/O bound but never CPU bound on a reasonably fast machine - and
>>>> especially not number-of-threads-in-pool bound.
>>>>
>>>>> Now we can have multiple worker threads doing things concurrently if
>>>>> possible (some of them may not, especially when BQL will be required, but
>>>>> that's a separate thing, and many device save()s may not need BQL, and when
>>>>> it needs we can take it in the enqueued tasks).
>>>>>
>>>>>>
>>>>>> mlx5 devices, for example, seem to support only poll()ed / non-blocking
>>>>>> reads at best - with unknown performance in comparison with issuing
>>>>>> blocking reads from dedicated threads.
>>>>>>
>>>>>> On the other hand, handling a single device from multiple threads in
>>>>>> parallel is generally not possible due to difficulty of establishing in
>>>>>> which order the buffers were read.
>>>>>>
>>>>>> And if we need a per-VFIO device thread anyway then using a thread pool
>>>>>> doesn't help much - but brings extra complexity.
>>>>>>
>>>>>> In terms of starting the loading thread earlier to load also VM live
>>>>>> phase data it looks like a small change to the code so it shouldn't be
>>>>>> a problem.
>>>>>
>>>>> That's good to know. Please still consider a generic thread model and see
>>>>> whether that would also work for your VFIO use case.
>>>>>
>>>>> If you see what thread-pool.c did right now is it'll dynamically create
>>>>> threads on the fly. I think that's something we can do too but just apply
>>>>> an upper limit to the thread numbers.
>>>>
>>>> We have an upper limit on the count of saving threads already - it's the
>>>> count of VFIO devices in the VM.
>>>>
>>>> The API in util/thread-pool.c is very basic and basically only allows
>>>> submitting either AIO operations or generic function call operation
>>>> but still within some AioContext.
>>>
>>> What I'm saying is a thread pool _without_ aio. I think it might be called
>>> ThreadPoolRaw and let ThreadPool depend on it, but I didn't further check yet.
>>
>> So it's not using an existing thread pool implementation from util/thread-pool.c
>> but essentially creating a new one - with probably some code commonality
>> with the existing AIO one.
>>
>> That's possible but since util/thread-pool.c AFAIK isn't owned by the
>> migration subsystem, such a new implementation will probably also need review by
>> other QEMU maintainers.
>
> Yes, that's how we normally should do it. Obviously you still want to push
> that in 9.1, so if you want you can create that pool implementation under
> migration/, and we can try to move it over as a future rework, having block
> people review that later.

I think that with this thread pool introduction we'll unfortunately almost certainly
need to target this patch set at 9.2, since these overall changes (and Fabiano's
patches too) will need good testing, might uncover some performance regressions
(for example related to the number-of-buffers limit or Fabiano's multifd changes),
bring some review comments from other people, etc.

In addition to that, we are in the middle of the holiday season and a lot of people
aren't available - like Fabiano, who said he will be available only in a few weeks.
>>
>>>>
>>>> There's almost none of the operation execution logic I described above -
>>>> all of these would need to be written and maintained.
>>>>
>>>>>>
>>>>>>> It's better to create the thread pool owned by migration, rather than
>>>>>>> threads owned by VFIO, because it also paves way for non-VFIO device state
>>>>>>> save()s, as I mentioned also above on the multifd packet header. Maybe we
>>>>>>> can have a flag in the packet header saying "this is device xxx's state,
>>>>>>> just load it".
>>>>>>
>>>>>> I think the same could be done by simply implementing these hooks in other
>>>>>> device types than VFIO, right?
>>>>>>
>>>>>> And if we notice that these implementations share a bit of code then we
>>>>>> can think about making a common helper library out of this code.
>>>>>>
>>>>>> After all, that's just an implementation detail that does not impact
>>>>>> the underlying bit stream protocol.
>>>>>
>>>>> You're correct.
>>>>>
>>>>> However, it still affects a few things.
>>>>>
>>>>> Firstly, it may mean that we may not even need those two extra vmstate
>>>>> hooks: the enqueue can happen already with save_state() if the migration
>>>>> worker model exists.
>>>>>
>>>>> So instead of this:
>>>>>
>>>>> vfio_save_state():
>>>>> if (migration->multifd_transfer) {
>>>>> /* Emit dummy NOP data */
>>>>> qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>>>>> return;
>>>>> }
>>>>>
>>>>> We can already do:
>>>>>
>>>>> if (migration->multifd_transfer) {
>>>>> // enqueue task to save state for this vfio device
>>>>> ...
>>>>> return;
>>>>> }
>>>>>
>>>>> IMHO it'll be much cleaner in VFIO code, and much cleaner too for migration
>>>>> code.
>>>>
>>>> The save_state hook is executed too late - only after all iterable
>>>> hooks have already transferred all their data.
>>>>
>>>> We want to start saving this device state as early as possible to not
>>>> have to wait for any other device to transfer its data first.
>>>>
>>>> That's why the code introduces save_live_complete_precopy_begin hook
>>>> that's guaranteed to be the very first hook called during switchover
>>>> phase device state saving.
>>>
>>> I think I mis-typed.. What I wanted to say is vfio_save_complete_precopy(),
>>> not vfio_save_state().
>>>
>>> There will be one challenge though where RAM is also an iterable, so RAM's
>>> save_live_complete_precopy() can delay VFIO's, even if it simply only need
>>> to enqueue a job.
>>>
>>> Two solutions I can think of:
>>>
>>> (1) Provide a separate hook, e.g. save_live_complete_precopy_async(),
>>> when save_live_complete_precopy_async(opaque) is provided, instead of
>>> calling save_live_complete_precopy(), we inject that job into the worker
>>> threads. In that case we can loop over *_precopy_async() before all the
>>> rest *_precopy() calls.
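
A rough sketch of what (1) could look like in the switchover path is below; the
*_async hook name and the loop placement are assumptions for illustration, not
code from any posted patch (error handling omitted):

    /* Sketch: enqueue all async completion hooks first ... */
    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
        if (se->ops && se->ops->save_live_complete_precopy_async) {
            se->ops->save_live_complete_precopy_async(f, se->opaque);
        }
    }

    /* ... then run the ordinary (serialized) completion hooks, e.g. RAM's. */
    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
        if (se->ops && se->ops->save_live_complete_precopy) {
            se->ops->save_live_complete_precopy(f, se->opaque);
        }
    }
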
>>
>> That's basically the approach the current patch set is using, just not using
>> pool worker threads (yet).
>>
>> Only the hook was renamed from save_live_complete_precopy_async to
>> save_live_complete_precopy_begin upon your comment on RFC requesting that.
>>
>>> (2) Make RAM's save_live_complete_precopy() also does similar enqueue
>>> when multifd enabled, so RAM will be saved in the worker thread too.
>>>
>>> However (2) can have other issues to work out. Do you think (1) is still
>>> doable?
>>>
>>
>> Yes, I think (1) is the correct way to do it.
>
> I don't think "correct" is the correct word to put it.. it's really a
> matter of whether you want to push this earlier in-tree.
>
> The 2nd proposal will be more than correct to me, IMHO. That'll be really
> helpful too also to VFIO when RAM can be saved concurrently, then it means
> these things can be done all concurrently:
>
> - VFIO, one thread per one device
> - RAM, one thread
> - non-iterables
>
> Otherwise 2+3 needs to be serialized.
>
> If you're looking for downtime optimizations that may also be relevant, afaiu.

Having RAM sent in parallel with non-iterables would make sense to me,
but I am not 100% sure this is a safe thing to do - after all, currently
non-iterables can rely on the whole RAM being already transferred.

Currently, it seems that only RAM, VFIO, block-dirty-bitmap and some
s390x + ppc specific stuff implement .save_live_complete_precopy hooks.

While I am not really concerned about s390x and ppc, we'd need to make
sure that any data transferred via these hooks is transferred asynchronously,
so as not to delay starting the VFIO transmission.

Anyway, that's probably not for this patch set, since if we start widening
its scope beyond the basic device state transfer framework + VFIO we risk
missing 9.2 too.
> And that's also one of the major points why I want to convince you not to
> use a separate vfio thread, because AFAICT we simply have other users.
>
>>
>>>>
>>>>> Another (possibly personal) reason is, I will not dare to touch VFIO code
>>>>> too much to do such a refactoring later. I simply don't have the VFIO
>>>>> devices around and I won't be able to test. So compared to other things,
>>>>> I hope the VFIO stuff can land more stable than the rest, because I am not
>>>>> confident I could clean it up myself.
>>>>
>>>> That's a fair request, will keep this in mind.
>>>>
>>>>> I simply also don't like random threads floating around, considering
>>>>> we already have a bit of a mess in migration for other reasons (we can
>>>>> still have random TLS threads floating around, I think... and they can
>>>>> cause very hard to debug issues). I feel shaky maintaining it when any
>>>>> device can also start to create whatever threads it wants during migration.
>>>>
>>>> The threads themselves aren't very expensive as long as their number
>>>> is kept within reasonable bounds.
>>>>
>>>> 4 additional threads (present only during active migration operation)
>>>> with 4 VFIO devices is really not a lot.
>>>
>>> It's not about number, it's about management, and when something crashed at
>>> some unwanted point, then we may want to know what happened to those
>>> threads and how to recycle them.
>>
>> I guess if you are more comfortable with maintaining code written in such
>> way then that's some argument for it too.
>
> It's not about my flavour of maintenance.
>
> We used to work on issues where we saw a dangling thread operate on
> migration objects even though it was created in the _previous_ migration,
> which was cancelled and retried. The thread didn't know that. It was kind of
> leaked and it caused issues that were hard to debug.
>
> VFIO can cause a similar thing if it creates threads that migration
> developers may overlook and that are not easy to manage. Then it'll be the same
> challenge when a vfio thread dangles for some reason, and it'll just make
> things harder to debug when an issue happens.
>
> I want to make sure if ever possible migration framework manages threads on
> its own, so no thread will be fiddling around without being noticed.
>
> Not to mention as I mentioned previously, that "having some async model to
> dump vmstate" isn't something special to VFIO, it can easily be extended to
> either RAM, or other normal VMSDs if we can tackle other issues here and
> there. The general request is the same. It'll be a chaos if vfio starts
> to create its own threads, then vDPA and others. It is much saner to make
> it a generic model to me.

I more or less know now what the v2 of this patch set needs to look like
(at least architecturally).

Will try to prepare something in the coming weeks.
> Thanks,
>
Thanks,
Maciej
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v1 00/13] Multifd 🔀 device state transfer support with VFIO consumer
2024-06-27 9:14 ` Maciej S. Szmigiero
@ 2024-06-27 14:56 ` Peter Xu
2024-07-16 20:10 ` Maciej S. Szmigiero
2024-06-27 15:09 ` Peter Xu
1 sibling, 1 reply; 29+ messages in thread
From: Peter Xu @ 2024-06-27 14:56 UTC (permalink / raw)
To: Maciej S. Szmigiero
Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P. Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On Thu, Jun 27, 2024 at 11:14:28AM +0200, Maciej S. Szmigiero wrote:
> On 26.06.2024 18:23, Peter Xu wrote:
> > On Wed, Jun 26, 2024 at 05:47:34PM +0200, Maciej S. Szmigiero wrote:
> > > On 26.06.2024 03:51, Peter Xu wrote:
> > > > On Wed, Jun 26, 2024 at 12:44:29AM +0200, Maciej S. Szmigiero wrote:
> > > > > On 25.06.2024 19:25, Peter Xu wrote:
> > > > > > On Mon, Jun 24, 2024 at 09:51:18PM +0200, Maciej S. Szmigiero wrote:
> > > > > > > Hi Peter,
> > > > > >
> > > > > > Hi, Maciej,
> > > > > >
> > > > > > >
> > > > > > > On 23.06.2024 22:27, Peter Xu wrote:
> > > > > > > > On Tue, Jun 18, 2024 at 06:12:18PM +0200, Maciej S. Szmigiero wrote:
> > > > > > > > > From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> > > > > > > > >
> > > > > > > > > This is an updated v1 patch series of the RFC (v0) series located here:
> > > > > > > > > https://lore.kernel.org/qemu-devel/cover.1713269378.git.maciej.szmigiero@oracle.com/
> > > > > > > >
> > > > > > > > OK I took some hours thinking about this today, and here's some high level
> > > > > > > > comments for this series. I'll start with which are more relevant to what
> > > > > > > > Fabiano has already suggested in the other thread, then I'll add some more.
> > > > > > > >
> > > > > > > > https://lore.kernel.org/r/20240620212111.29319-1-farosas@suse.de
> > > > > > >
> > > > > > > That's a long list, thanks for these comments.
> > > > > > >
> > > > > > > I have responded to them inline below.
> > > > > > >
> (..)
> > > > > > > > 4. Risk of OOM on unlimited VFIO buffering
> > > > > > > > ==========================================
> > > > > > > >
> > > > > > > > This follows with above bullet, but my pure question to ask here is how
> > > > > > > > does VFIO guarantees no OOM condition by buffering VFIO state?
> > > > > > > >
> > > > > > > > I mean, currently your proposal used vfio_load_bufs_thread() as a separate
> > > > > > > > thread to only load the vfio states until sequential data is received,
> > > > > > > > however is there an upper limit of how much buffering it could do? IOW:
> > > > > > > >
> > > > > > > > vfio_load_state_buffer():
> > > > > > > >
> > > > > > > > if (packet->idx >= migration->load_bufs->len) {
> > > > > > > > g_array_set_size(migration->load_bufs, packet->idx + 1);
> > > > > > > > }
> > > > > > > >
> > > > > > > > lb = &g_array_index(migration->load_bufs, typeof(*lb), packet->idx);
> > > > > > > > ...
> > > > > > > > lb->data = g_memdup2(&packet->data, data_size - sizeof(*packet));
> > > > > > > > lb->len = data_size - sizeof(*packet);
> > > > > > > > lb->is_present = true;
> > > > > > > >
> > > > > > > > What if garray keeps growing with lb->data allocated, which triggers the
> > > > > > > > memcg limit of the process (if QEMU is in such process)? Or just deplete
> > > > > > > > host memory and causing OOM kill.
> > > > > > > >
> > > > > > > > I think we may need to find a way to throttle max memory usage of such
> > > > > > > > buffering.
> > > > > > > >
> > > > > > > > So far this will be more of a problem indeed if this will be done during
> > > > > > > > VFIO iteration phases, but I still hope a solution can work with both
> > > > > > > > iteration phase and the switchover phase, even if you only do that in
> > > > > > > > switchover phase
> > > > > > >
> > > > > > > Unfortunately, this issue will be hard to fix since the source can
> > > > > > > legitimately send the very first buffer (chunk) of data as the last one
> > > > > > > (at the very end of the transmission).
> > > > > > >
> > > > > > > In this case, the target will need to buffer nearly the whole data.
> > > > > > >
> > > > > > > We can't stop the receive on any channel, either, since the next missing
> > > > > > > buffer can arrive at that channel.
> > > > > > >
> > > > > > > However, I don't think purposely DoSing the target QEMU is a realistic
> > > > > > > security concern in the typical live migration scenario.
> > > > > > >
> > > > > > > I mean the source can easily force the target QEMU to exit just by
> > > > > > > feeding it wrong migration data.
> > > > > > >
> > > > > > > In case someone really wants to protect against the impact of
> > > > > > > theoretically unbounded QEMU memory allocations during live migration
> > > > > > > on the rest of the system they can put the target QEMU process
> > > > > > > (temporarily) into a memory-limited cgroup.
> > > > > >
> > > > > > Note that I'm not worrying about DoS of a malicious src QEMU, and I'm
> > > > > > exactly talking about the generic case where QEMU (either src or dest, in
> > > > > > that case normally both) is put into the memcg and if QEMU uses too much
> > > > > > memory it'll literally get killed even if no DoS issue at all.
> > > > > >
> > > > > > In short, we hopefully will have a design that will always work with QEMU
> > > > > > running in a container, without 0.5% chance dest qemu being killed, if you
> > > > > > see what I meant.
> > > > > >
> > > > > > The upper bound of VFIO buffering will be needed so the admin can add that
> > > > > > on top of the memcg limit and as long as QEMU keeps its words it'll always
> > > > > > work without sudden death.
> > > > > >
> > > > > > I think I have some idea about resolving this problem. That idea can
> > > > > > further complicate the protocol a little bit. But before that let's see
> > > > > > whether we can reach an initial consensus on this matter first, on whether
> > > > > > this is a sane request. In short, we'll need to start to have a
> > > > > > configurable size to say how much VFIO can buffer, maybe per-device, or
> > > > > > globally. Then based on that we need to have some logic guarantee that
> > > > > > over-mem won't happen, also without heavily affecting concurrency (e.g.,
> > > > > > single thread is definitely safe and without caching, but it can be
> > > > > > slower).
> > > > >
> > > > > Here, I think I can add a per-device limit parameter on the number of
> > > > > buffers received out-of-order or waiting to be loaded into the device -
> > > > > with a reasonable default.
> > > >
> > > > Yes that should work.
> > > >
> > > > I don't even expect people would change that, but this might be the
> > > > information people will need to know before putting it into a container if
> > > > it's larger than how qemu dynamically consumes memories here and there.
> > > > I'd expect it is still small enough so nobody will notice it (maybe a few
> > > > tens of MBs? but just wildly guessing, where tens of MBs could fall into
> > > > the "noise" memory allocation window of a VM).
> > >
> > > The single buffer size is 8 MiB so I think the safe default should be
> > > allowing 2 times the number of multifd channels.
> > >
> > > With 5 multifd channels that's 10 buffers * 8 MiB = 80 MiB worst
> > > case buffering per device.
> > >
> > > But this will need to be determined experimentally once such parameter
> > > is added to be sure it's enough.
> >
> > Yes you may want to test it with a new logic to be able to throttle sending
> > on src qemu (because otherwise when dest qemu is very unlucky its buffer is
> > full but still the initial index chunk is missing), then making sure it's
> > relatively small but hopefully still keep the flow running as much as
> > possible.
> >
> > >
> > > > >
> > > > > (..)
> > > > > > > > 5. Worker thread model
> > > > > > > > ======================
> > > > > > > >
> > > > > > > > I'm so far not happy with what this proposal suggests on creating the
> > > > > > > > threads, also the two new hooks mostly just to create these threads..
> > > > > > >
> > > > > > > That VFIO .save_live_complete_precopy_begin handler creates a new
> > > > > > > per-device thread is an implementation detail for this particular
> > > > > > > driver.
> > > > > > >
> > > > > > > The whole idea behind this and the save_live_complete_precopy_end hook was
> > > > > > > that the details of how the particular device driver does its own async
> > > > > > > saving are abstracted away from the migration core.
> > > > > > >
> > > > > > > The device then can do what's best / most efficient for it to do.
> > > > > >
> > > > > > Yes, and what I was thinking is whether it does it in form of "enqueue a
> > > > > > task to migration worker threads", rather than "creating its own threads in
> > > > > > the device hooks, and managing those threads alone".
> > > > > >
> > > > > > It's all about whether such threading can be reused by non-VFIO devices.
> > > > > > They can't be reused if VFIO is in charge here, and it will make migration
> > > > > > less generic.
> > > > > >
> > > > > > My current opinion is they can and should be re-usable. Consider if someone
> > > > > > starts to teach multifd to carry non-vfio data (e.g. a generic VMSD), then we
> > > > > > can enqueue a task, do e.g. ioctl(KVM_GET_REGS) in those threads (rather
> > > > > > than VFIO read()).
> > > > >
> > > > > Theoretically, it's obviously possible to wrap every operation in a request
> > > > > to some thread pool.
> > > > >
> > > > >
> > > > > But that would bring a lot of complexity, since instead of performing these
> > > > > operation directly now the requester will need to:
> > > > > 1) Prepare some "Operation" structure with the parameters of the requested
> > > > > operation (task).
> > > > > In your case this could be QEMU_OP_GET_VCPU_REGS operation using
> > > > > "OperationGetVCPURegs" struct containing vCPU number parameter = 1.
> > > >
> > > > Why such complexity is needed?
> > >
> > > I just gave an example of how running an individual task like
> > > "ioctl(KVM_GET_REGS)" (that you suggested above) in such a thread pool
> > > would look.
> > > > Can it be as simple as func(opaque) to be queued, then here
> > > > func==vfio_save_complete_precopy_async_thread, opaque=VFIODevice*?
> > >
> > > That would be possible, although in both implementations of:
> > > 1) adding a new thread pool type and wrapping device reading thread
> > > creation around such pool, OR:
> > > 2) a direct qemu_thread_create() call.
> > > the number of threads actually created would be the same.
> >
> > Again, it's not about the number of threads that I worry.
> >
> > If you create one thread but hard to manage it's the same.
> >
> > OTOH if it's a common model I think it's fine if you create 16 or 32,
> > especially when most of them are either idle or doing block IOs, then
> > they'll be put to sleep anyway. That's not a concern at all.
> >
> > >
> > > That's unless someone sets the multifd channel count below the number
> > > of VFIO devices - but one might argue that's not really a configuration
> > > where good performance is expected anyway.
> > >
> > > > >
> > > > > 2) Submit this operation to the thread pool and wait for it to complete,
> > > >
> > > > VFIO doesn't need to have its own code waiting. If this pool is for
> > > > migration purpose in general, qemu migration framework will need to wait at
> > > > some point for all jobs to finish before moving on. Perhaps it should be
> > > > at the end of the non-iterative session.
> > >
> > > So essentially, instead of calling save_live_complete_precopy_end handlers
> > > from the migration code you would like to hard-code its current VFIO
> > > implementation of calling vfio_save_complete_precopy_async_thread_thread_terminate().
> > >
> > > Only it wouldn't be then called VFIO precopy async thread terminate but some
> > > generic device state async precopy thread terminate function.
> >
> > I don't understand what you meant by "hard code".
>
> "Hard code" wasn't maybe the best expression here.
>
> I meant the move of the functionality that's provided by
> vfio_save_complete_precopy_async_thread_thread_terminate() in this patch set
> to the common migration code.

I see. That function only does a thread_join() so far.

So can I take it that [1] below should work for us, and it'll be clean
too (with nothing to hard-code)?

The time to join() the worker threads can be even later, until
migrate_fd_cleanup() on the sender side. You may have a better idea of the
best place to do it once you start working on it.
>
> > What I was saying is if we target the worker thread pool to be used for
> > "concurrently dump vmstates", then it'll make sense to make sure all the
> > jobs there were flushed after qemu dumps all non-iterables (because this
> > should be the last step of the switchover).
> >
> > I expect it looks like this:
> >
> > while (pool->active_threads) {
> > qemu_sem_wait(&pool->job_done);
> > }
[1]
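
Expanding [1] a little: the flush could run once all non-iterables are dumped,
with the join deferred as suggested above; the pool layout and function names
here are illustrative only, nothing like this exists yet:

    /* Sketch: wait for all queued jobs to finish (end of non-iterative phase). */
    static void migration_worker_pool_flush(MigWorkerPool *pool)
    {
        while (qatomic_read(&pool->active_threads)) {
            qemu_sem_wait(&pool->job_done);
        }
    }

    /* Sketch: join the worker threads later, e.g. from migrate_fd_cleanup(). */
    static void migration_worker_pool_join(MigWorkerPool *pool)
    {
        for (int i = 0; i < pool->n_threads; i++) {
            qemu_thread_join(&pool->threads[i]);
        }
    }
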
> >
> > >
> > > > >
> > > > > 3) Thread pool needs to check whether it has any free threads in the pool
> > > > > available to perform this operation.
> > > > >
> > > > > If not, and the count of threads that are CPU-bound (~aren't sleeping on
> > > > > some I/O operation) is less than the number of logical CPUs in the system
> > > > > the thread pool needs to spawn a new thread since there's some CPU capacity
> > > > > available,
> > > >
> > > > For this one it can follow what thread-pool.c is doing, and the upper bound
> > > > of n-threads can start from simple, e.g. min(n_channels_multifd, 8)?
> > >
> > > It needs to be min(n_channels_multifd, n_device_state_devices), because
> > > with 9 such devices and 9 multifd channels we need at least 9 threads.
> > >
> > > > >
> > > > > 4) The operation needs to be dispatched to the actual execution thread,
> > > > >
> > > > > 5) The execution thread needs to figure out which operation it needs to
> > > > > actually do, fetch the necessary parameters from the proper "Operation"
> > > > > structure, maybe take the necessary locks, before it can actually perform
> > > > > the requested operation,
> > > > >
> > > > > 6) The execution thread needs to serialize (write) the operation result
> > > > > back into some "OperationResult" structure, like "OperationGetVCPURegsResult",
> > > >
> > > > I think in this simplest case, the thread should simply run fn(opaque), in
> > > > which it should start to call multifd_queue_device_state() and queue
> > > > multifd jobs from the worker thread instead of the vfio dedicated threads.
> > > > I don't yet expect much to change in your code from that regard inside what
> > > > vfio_save_complete_precopy_async_thread() used to do.
> > > >
> > > > >
> > > > > 7) The execution thread needs to submit this result back to the requester,
> > > > >
> > > > > 8) The thread pool needs to decide whether to keep this (now idle) execution
> > > > > thread in the pool as a reserve thread or terminate it immediately,
> > > > >
> > > > > 9) The requester needs to be resumed somehow (returned from wait) now that
> > > > > the operation it requested is complete,
> > > > >
> > > > > 10) The requester needs to fetch the operation results from the proper
> > > > > "OperationResult" structure and decode them accordingly.
> > > > >
> > > > >
> > > > > As you can see, that's *a lot* of extra code that needs to be maintained
> > > > > for just a single operation type.
> > > >
> > > > I don't yet know why you designed it so complicated, but if I missed
> > > > something above please let me know.
> > >
> > > I explained above that's how running your example of "ioctl(KVM_GET_REGS)"
> > > in such a thread pool would look.
> > > (It wasn't a proposal to actually be implemented, to be clear.)
> > >
> > > > >
> > > > > > >
> > > > > > > > I know I suggested that.. but that's comparing to what I read in the even
> > > > > > > > earlier version, and sorry I wasn't able to suggest something better at
> > > > > > > > that time because I simply thought less.
> > > > > > > >
> > > > > > > > As I mentioned in the other reply elsewhere, I think we should firstly have
> > > > > > > > these threads ready to take data at the start of migration, so that it'll
> > > > > > > > work when someone wants to add vfio iteration support. Then the jobs
> > > > > > > > (mostly what vfio_save_complete_precopy_async_thread() does now) can be
> > > > > > > > enqueued into the thread pools.
> > > > > > >
> > > > > > > I'm not sure that we can get away with using fewer threads than devices
> > > > > > > as these devices might not support AIO reads from their migration file
> > > > > > > descriptor.
> > > > > >
> > > > > > It doesn't need to use AIO reads - I'll be happy if the thread model can be
> > > > > > generic, VFIO can still enqueue a task that does blocking reads.
> > > > > >
> > > > > > It can take a lot of time, but it's fine: others who want to enqueue too
> > > > > > but see all threads busy should simply block there, waiting for
> > > > > > the worker threads to be freed again. It's the same when there's no
> > > > > > migration worker threads as it means the read() will block the main
> > > > > > migration thread.
> > > > >
> > > > > Oh no, waiting for another device blocking read to complete before
> > > > > scheduling another device blocking read is surely going to negatively
> > > > > impact the performance.
> > > >
> > > > There can be e.g. 8 worker threads. If you want you can make sure the
> > > > worker threads are at least more than vfio threads. Then it will guarantee
> > > > vfio will dump / save() one device per thread concurrently.
> > >
> > > Yes, I wrote this requirement above as
> > > n_threads = min(n_channels_multifd, n_device_state_devices).
> > >
> > > > >
> > > > > For best performance we need to maximize parallelism - that means
> > > > > reading (and loading) all the VFIO devices present in parallel.
> > > > >
> > > > > The whole point of having per-device threads is for the whole operation
> > > > > to be I/O bound but never CPU bound on a reasonably fast machine - and
> > > > > especially not number-of-threads-in-pool bound.
> > > > >
> > > > > > Now we can have multiple worker threads doing things concurrently if
> > > > > > possible (some of them may not, especially when BQL will be required, but
> > > > > > that's a separate thing, and many device save()s may not need BQL, and when
> > > > > > it needs we can take it in the enqueued tasks).
> > > > > >
> > > > > > >
> > > > > > > mlx5 devices, for example, seem to support only poll()ed / non-blocking
> > > > > > > reads at best - with unknown performance in comparison with issuing
> > > > > > > blocking reads from dedicated threads.
> > > > > > >
> > > > > > > On the other hand, handling a single device from multiple threads in
> > > > > > > parallel is generally not possible due to difficulty of establishing in
> > > > > > > which order the buffers were read.
> > > > > > >
> > > > > > > And if we need a per-VFIO device thread anyway then using a thread pool
> > > > > > > doesn't help much - but brings extra complexity.
> > > > > > >
> > > > > > > In terms of starting the loading thread earlier to load also VM live
> > > > > > > phase data it looks like a small change to the code so it shouldn't be
> > > > > > > a problem.
> > > > > >
> > > > > > That's good to know. Please still consider a generic thread model and see
> > > > > > whether that would also work for your VFIO use case.
> > > > > >
> > > > > > If you see what thread-pool.c did right now is it'll dynamically create
> > > > > > threads on the fly. I think that's something we can do too but just apply
> > > > > > an upper limit to the thread numbers.
> > > > >
> > > > > We have an upper limit on the count of saving threads already - it's the
> > > > > count of VFIO devices in the VM.
> > > > >
> > > > > The API in util/thread-pool.c is very basic and basically only allows
> > > > > submitting either AIO operations or generic function call operation
> > > > > but still within some AioContext.
> > > >
> > > > What I'm saying is a thread pool _without_ aio. I think it might be called
> > > > ThreadPoolRaw and let ThreadPool depend on it, but I didn't further check yet.
> > >
> > > So it's not using an existing thread pool implementation from util/thread-pool.c
> > > but essentially creating a new one - with probably some code commonality
> > > with the existing AIO one.
> > >
> > > That's possible but since util/thread-pool.c AFAIK isn't owned by the
> > > migration subsystem, such a new implementation will probably also need review by
> > > other QEMU maintainers.
> >
> > Yes, that's how we normally should do it. Obviously you still want to push
> > that in 9.1, so if you want you can create that pool implementation under
> > migration/, and we can try to move it over as a future rework, having block
> > people review that later.
>
> I think that with this thread pool introduction we'll unfortunately almost certainly
> need to target this patch set at 9.2, since these overall changes (and Fabiano's
> patches too) will need good testing, might uncover some performance regressions
> (for example related to the number-of-buffers limit or Fabiano's multifd changes),
> bring some review comments from other people, etc.
>
> In addition to that, we are in the middle of the holiday season and a lot of people
> aren't available - like Fabiano, who said he will be available only in a few weeks.

Right, that's unfortunate. Let's see, but I still really hope we can also
get some feedback from Fabiano before it lands. Even with that we have a
chance for 9.1, but it's just challenging - it's the same condition I've
mentioned since the 1st email. And before Fabiano's back (he's the active
maintainer for this release), I'm personally happy if you can propose
something that can at least partly land earlier in this release. E.g., if
you want, we can at least upstream Fabiano's idea first, or some more on top.

For that, also feel free to have a look at my comment today:

https://lore.kernel.org/r/Zn15y693g0AkDbYD@x1n

Feel free to comment there too. There's a tiny uncertainty there so far on
specifying "max size for a device state" if we do what I suggested, as multifd
setup will need to allocate an enum buffer suitable for both ram + device.
But I think that's not an issue and you'll tackle that properly when
working on it. It's more about whether you agree with what I said as a
general concept.
>
> > >
> > > > >
> > > > > There's almost none of the operation execution logic I described above -
> > > > > all of these would need to be written and maintained.
> > > > >
> > > > > > >
> > > > > > > > It's better to create the thread pool owned by migration, rather than
> > > > > > > > threads owned by VFIO, because it also paves way for non-VFIO device state
> > > > > > > > save()s, as I mentioned also above on the multifd packet header. Maybe we
> > > > > > > > can have a flag in the packet header saying "this is device xxx's state,
> > > > > > > > just load it".
> > > > > > >
> > > > > > > I think the same could be done by simply implementing these hooks in other
> > > > > > > device types than VFIO, right?
> > > > > > >
> > > > > > > And if we notice that these implementations share a bit of code then we
> > > > > > > can think about making a common helper library out of this code.
> > > > > > >
> > > > > > > After all, that's just an implementation detail that does not impact
> > > > > > > the underlying bit stream protocol.
> > > > > >
> > > > > > You're correct.
> > > > > >
> > > > > > However, it still affects a few things.
> > > > > >
> > > > > > Firstly, it may mean that we may not even need those two extra vmstate
> > > > > > hooks: the enqueue can happen already with save_state() if the migration
> > > > > > worker model exists.
> > > > > >
> > > > > > So instead of this:
> > > > > >
> > > > > > vfio_save_state():
> > > > > > if (migration->multifd_transfer) {
> > > > > > /* Emit dummy NOP data */
> > > > > > qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> > > > > > return;
> > > > > > }
> > > > > >
> > > > > > We can already do:
> > > > > >
> > > > > > if (migration->multifd_transfer) {
> > > > > > // enqueue task to save state for this vfio device
> > > > > > ...
> > > > > > return;
> > > > > > }
> > > > > >
> > > > > > IMHO it'll be much cleaner in VFIO code, and much cleaner too for migration
> > > > > > code.
> > > > >
> > > > > The save_state hook is executed too late - only after all iterable
> > > > > hooks have already transferred all their data.
> > > > >
> > > > > We want to start saving this device state as early as possible to not
> > > > > have to wait for any other device to transfer its data first.
> > > > >
> > > > > That's why the code introduces save_live_complete_precopy_begin hook
> > > > > that's guaranteed to be the very first hook called during switchover
> > > > > phase device state saving.
> > > >
> > > > I think I mis-typed.. What I wanted to say is vfio_save_complete_precopy(),
> > > > not vfio_save_state().
> > > >
> > > > There will be one challenge though where RAM is also an iterable, so RAM's
> > > > save_live_complete_precopy() can delay VFIO's, even if it only needs
> > > > to enqueue a job.
> > > >
> > > > Two solutions I can think of:
> > > >
> > > > (1) Provide a separate hook, e.g. save_live_complete_precopy_async(),
> > > > when save_live_complete_precopy_async(opaque) is provided, instead of
> > > > calling save_live_complete_precopy(), we inject that job into the worker
> > > > threads. In that case we can loop over *_precopy_async() before all the
> > > > rest *_precopy() calls.
> > >
> > > That's basically the approach the current patch set is using, just not using
> > > pool worker threads (yet).
> > >
> > > Only the hook was renamed from save_live_complete_precopy_async to
> > > save_live_complete_precopy_begin upon your comment on RFC requesting that.
> > >
> > > > (2) Make RAM's save_live_complete_precopy() also does similar enqueue
> > > > when multifd enabled, so RAM will be saved in the worker thread too.
> > > >
> > > > However (2) can have other issues to work out. Do you think (1) is still
> > > > doable?
> > > >
> > >
> > > Yes, I think (1) is the correct way to do it.
> >
> > I don't think "correct" is the correct word to put it.. it's really a
> > matter of whether you want to push this earlier in-tree.
> >
> > The 2nd proposal will be more than correct to me, IMHO. That'll be really
> > helpful too also to VFIO when RAM can be saved concurrently, then it means
> > these things can be done all concurrently:
> >
> > - VFIO, one thread per one device
> > - RAM, one thread
> > - non-iterables
> >
> > Otherwise 2+3 needs to be serialized.
> >
> > If you're looking for downtime optimizations that may also be relevant, afaiu.
>
> Having RAM sent in parallel with non-iterables would make sense to me,
> but I am not 100% sure this is a safe thing to do - after all, currently
> non-iterables can rely on the whole RAM being already transferred.
>
> Currently, it seems that only RAM, VFIO, block-dirty-bitmap and some
> s390x + ppc specific stuff implement .save_live_complete_precopy hooks.
>
> While I am not really concerned about s390x and ppc, we'd need to make
> sure that any data transferred via these hooks is transferred asynchronously,
> so as not to delay starting the VFIO transmission.
>
> Anyway, that's probably not for this patch set, since if we start widening
> its scope beyond the basic device state transfer framework + VFIO we risk
> missing 9.2 too.

IMHO targeting that in 9.1 was simply too optimistic. Next time, if you
want to make sure it'll be in (or at least show that that is the goal), you
should really start early and spin a non-RFC series, rather than waiting.
Maybe that'll make the chance higher.

This series, as the 1st one to introduce (1) device state migrations on
multifd, and (2) async vmstate transfers, is just going to be involved, as I
mentioned, because we may want to do this first and do it right, paving the
way for others.
>
> > And that's also one of the major points why I want to convince you not to
> > use a separate vfio thread, because AFAICT we simply have other users.
> >
> > >
> > > > >
> > > > > > Another (possibly personal) reason is, I will not dare to touch VFIO code
> > > > > > too much to do such a refactoring later. I simply don't have the VFIO
> > > > > > devices around and I won't be able to test. So comparing to other things,
> > > > > > I hope VFIO stuff can land more stable than others because I am not
> > > > > > confident at least myself to clean it.
> > > > >
> > > > > That's a fair request, will keep this in mind.
> > > > >
> > > > > > I simply also don't like random threads floating around, considering
> > > > > > we already have a bit of a mess in migration for other reasons (we can
> > > > > > still have random TLS threads floating around, I think... and they can
> > > > > > cause very hard to debug issues). I feel shaky maintaining it when any
> > > > > > device can also start to create whatever threads it wants during migration.
> > > > >
> > > > > The threads themselves aren't very expensive as long as their number
> > > > > is kept within reasonable bounds.
> > > > >
> > > > > 4 additional threads (present only during active migration operation)
> > > > > with 4 VFIO devices is really not a lot.
> > > >
> > > > It's not about number, it's about management, and when something crashed at
> > > > some unwanted point, then we may want to know what happened to those
> > > > threads and how to recycle them.
> > >
> > > I guess if you are more comfortable with maintaining code written in such
> > > way then that's some argument for it too.
> >
> > It's not about my flavour of maintenance.
> >
> > We used to work on issues where we saw a dangling thread operate on
> > migration objects even though it was created in the _previous_ migration,
> > which was cancelled and retried. The thread didn't know that. It was kind of
> > leaked and it caused issues that were hard to debug.
> >
> > VFIO can cause a similar thing if it creates threads that migration
> > developers may overlook and that are not easy to manage. Then it'll be the same
> > challenge when a vfio thread dangles for some reason, and it'll just make
> > things harder to debug when an issue happens.
> >
> > I want to make sure if ever possible migration framework manages threads on
> > its own, so no thread will be fiddling around without being noticed.
> >
> > Not to mention as I mentioned previously, that "having some async model to
> > dump vmstate" isn't something special to VFIO, it can easily be extended to
> > either RAM, or other normal VMSDs if we can tackle other issues here and
> > there. The general request is the same. It'll be a chaos if vfio starts
> > to create its own threads, then vDPA and others. It is much saner to make
> > it a generic model to me.
>
> I more or less know now what the v2 of this patch set needs to look like
> (at least architecturally).
>
> Will try to prepare something in the coming weeks.
Thanks.
--
Peter Xu
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v1 00/13] Multifd 🔀 device state transfer support with VFIO consumer
2024-06-27 14:56 ` Peter Xu
@ 2024-07-16 20:10 ` Maciej S. Szmigiero
2024-07-17 18:49 ` Peter Xu
0 siblings, 1 reply; 29+ messages in thread
From: Maciej S. Szmigiero @ 2024-07-16 20:10 UTC (permalink / raw)
To: Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P. Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On 27.06.2024 16:56, Peter Xu wrote:
> On Thu, Jun 27, 2024 at 11:14:28AM +0200, Maciej S. Szmigiero wrote:
>> On 26.06.2024 18:23, Peter Xu wrote:
>>> On Wed, Jun 26, 2024 at 05:47:34PM +0200, Maciej S. Szmigiero wrote:
>>>> On 26.06.2024 03:51, Peter Xu wrote:
>>>>> On Wed, Jun 26, 2024 at 12:44:29AM +0200, Maciej S. Szmigiero wrote:
>>>>>> On 25.06.2024 19:25, Peter Xu wrote:
>>>>>>> On Mon, Jun 24, 2024 at 09:51:18PM +0200, Maciej S. Szmigiero wrote:
>>>>>>>> Hi Peter,
>>>>>>>
>>>>>>> Hi, Maciej,
>>>>>>>
>>>>>>>>
>>>>>>>> On 23.06.2024 22:27, Peter Xu wrote:
>>>>>>>>> On Tue, Jun 18, 2024 at 06:12:18PM +0200, Maciej S. Szmigiero wrote:
>>>>>>>>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>>>>>>>>
>>>>>>>>>> This is an updated v1 patch series of the RFC (v0) series located here:
>>>>>>>>>> https://lore.kernel.org/qemu-devel/cover.1713269378.git.maciej.szmigiero@oracle.com/
>>>>>>>>>
>>>>>>>>> OK I took some hours thinking about this today, and here's some high level
>>>>>>>>> comments for this series. I'll start with which are more relevant to what
>>>>>>>>> Fabiano has already suggested in the other thread, then I'll add some more.
>>>>>>>>>
>>>>>>>>> https://lore.kernel.org/r/20240620212111.29319-1-farosas@suse.de
>>>>>>>>
>>>>>>>> That's a long list, thanks for these comments.
>>>>>>>>
>>>>>>>> I have responded to them inline below.
>>>>>>>>(..)
>>>>>>
>>>>>> 2) Submit this operation to the thread pool and wait for it to complete,
>>>>>
>>>>> VFIO doesn't need to have its own code waiting. If this pool is for
>>>>> migration purpose in general, qemu migration framework will need to wait at
>>>>> some point for all jobs to finish before moving on. Perhaps it should be
>>>>> at the end of the non-iterative session.
>>>>
>>>> So essentially, instead of calling save_live_complete_precopy_end handlers
>>>> from the migration code you would like to hard-code its current VFIO
>>>> implementation of calling vfio_save_complete_precopy_async_thread_thread_terminate().
>>>>
>>>> Only it wouldn't be then called VFIO precopy async thread terminate but some
>>>> generic device state async precopy thread terminate function.
>>>
> > > I don't understand what you meant by "hard code".
>>
>> "Hard code" wasn't maybe the best expression here.
>>
>> I meant the move of the functionality that's provided by
>> vfio_save_complete_precopy_async_thread_thread_terminate() in this patch set
>> to the common migration code.
>
> I see. That function only does a thread_join() so far.
>
> So can I understand it as below [1] should work for us, and it'll be clean
> too (with nothing to hard-code)?
It will need some signal to the worker thread pool to terminate before
waiting for the threads to finish (as the code in [1] just waits).

In the case of the current vfio_save_complete_precopy_async_thread()
implementation this signal isn't necessary, as that thread simply
terminates when it has read all the data it needs from the device.

In a worker thread pool case there will be some threads waiting for
jobs to be queued to them, so they will need to be somehow signaled
to exit.
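
For illustration only, a minimal sketch of what such an exit signal could
look like, built on generic QEMU threading primitives; the WorkerPool type
and the worker_pool_terminate() name are hypothetical and not taken from
this series:

#include "qemu/osdep.h"
#include "qemu/thread.h"
#include "qemu/atomic.h"

typedef struct WorkerPool {
    QemuThread *threads;        /* the worker threads, n_threads of them */
    int n_threads;
    QemuSemaphore job_pending;  /* posted once per queued job */
    bool shutdown;              /* checked by workers after each wakeup */
} WorkerPool;

static void worker_pool_terminate(WorkerPool *pool)
{
    int i;

    /* Ask the workers to exit... */
    qatomic_set(&pool->shutdown, true);

    /* ...wake up any worker idling on the job semaphore... */
    for (i = 0; i < pool->n_threads; i++) {
        qemu_sem_post(&pool->job_pending);
    }

    /* ...and only then join them. */
    for (i = 0; i < pool->n_threads; i++) {
        qemu_thread_join(&pool->threads[i]);
    }
}

The important property is that the shutdown flag is set and the semaphore
posted once per worker before any join, so idle workers wake up and observe
the flag instead of blocking forever.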
> The time to join() the worker threads can be even later, until
> migrate_fd_cleanup() on sender side. You may have a better idea on when
> would be the best place to do it when start working on it.
>
>>
>>> What I was saying is if we target the worker thread pool to be used for
>>> "concurrently dump vmstates", then it'll make sense to make sure all the
>>> jobs there were flushed after qemu dumps all non-iterables (because this
>>> should be the last step of the switchover).
>>>
>>> I expect it looks like this:
>>>
>>> while (pool->active_threads) {
>>> qemu_sem_wait(&pool->job_done);
>>> }
>
> [1]
>
(..)
>> I think that with this thread pool introduction we'll unfortunately almost certainly
>> need to target this patch set at 9.2, since these overall changes (and Fabiano
>> patches too) will need good testing, might uncover some performance regressions
>> (for example related to the number of buffers limit or Fabiano multifd changes),
>> bring some review comments from other people, etc.
>>
>> In addition to that, we are in the middle of holiday season and a lot of people
>> aren't available - like Fabiano said he will be available only in a few weeks.
>
> Right, that's unfortunate. Let's see, but still I really hope we can also
> get some feedback from Fabiano before it lands, even with that we have
> chance for 9.1 but it's just challenging, it's the same condition I
> mentioned since the 1st email. And before Fabiano's back (he's the active
> maintainer for this release), I'm personally happy if you can propose
> something that can land earlier in this release partly. E.g., if you want
> we can at least upstream Fabiano's idea first, or some more on top.
>
> For that, also feel to have a look at my comment today:
>
> https://lore.kernel.org/r/Zn15y693g0AkDbYD@x1n
>
> Feel free to comment there too. There's a tiny uncertainty there so far on
> specifying "max size for a device state" if do what I suggested, as multifd
> setup will need to allocate an enum buffer suitable for both ram + device.
> But I think that's not an issue and you'll tackle that properly when
> working on it. It's more about whether you agree on what I said as a
> general concept.
>
Since it seems that the discussion on Fabiano's patch set has subsided I think
I will start by basing my updated patch set on top of his RFC and then if
Fabiano wants to submit v1/v2 of his patch set then I will rebase mine on top
of it.
Otherwise, you can wait until I have a v2 ready and then we can work with that.
Thanks,
Maciej
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v1 00/13] Multifd device state transfer support with VFIO consumer
2024-07-16 20:10 ` Maciej S. Szmigiero
@ 2024-07-17 18:49 ` Peter Xu
2024-07-17 20:19 ` Fabiano Rosas
0 siblings, 1 reply; 29+ messages in thread
From: Peter Xu @ 2024-07-17 18:49 UTC (permalink / raw)
To: Maciej S. Szmigiero
Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P. Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On Tue, Jul 16, 2024 at 10:10:12PM +0200, Maciej S. Szmigiero wrote:
> On 27.06.2024 16:56, Peter Xu wrote:
> > On Thu, Jun 27, 2024 at 11:14:28AM +0200, Maciej S. Szmigiero wrote:
> > > On 26.06.2024 18:23, Peter Xu wrote:
> > > > On Wed, Jun 26, 2024 at 05:47:34PM +0200, Maciej S. Szmigiero wrote:
> > > > > On 26.06.2024 03:51, Peter Xu wrote:
> > > > > > On Wed, Jun 26, 2024 at 12:44:29AM +0200, Maciej S. Szmigiero wrote:
> > > > > > > On 25.06.2024 19:25, Peter Xu wrote:
> > > > > > > > On Mon, Jun 24, 2024 at 09:51:18PM +0200, Maciej S. Szmigiero wrote:
> > > > > > > > > Hi Peter,
> > > > > > > >
> > > > > > > > Hi, Maciej,
> > > > > > > >
> > > > > > > > >
> > > > > > > > > On 23.06.2024 22:27, Peter Xu wrote:
> > > > > > > > > > On Tue, Jun 18, 2024 at 06:12:18PM +0200, Maciej S. Szmigiero wrote:
> > > > > > > > > > > From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> > > > > > > > > > >
> > > > > > > > > > > This is an updated v1 patch series of the RFC (v0) series located here:
> > > > > > > > > > > https://lore.kernel.org/qemu-devel/cover.1713269378.git.maciej.szmigiero@oracle.com/
> > > > > > > > > >
> > > > > > > > > > OK I took some hours thinking about this today, and here's some high level
> > > > > > > > > > comments for this series. I'll start with which are more relevant to what
> > > > > > > > > > Fabiano has already suggested in the other thread, then I'll add some more.
> > > > > > > > > >
> > > > > > > > > > https://lore.kernel.org/r/20240620212111.29319-1-farosas@suse.de
> > > > > > > > >
> > > > > > > > > That's a long list, thanks for these comments.
> > > > > > > > >
> > > > > > > > > I have responded to them inline below.
> > > > > > > > > (..)
> > > > > > >
> > > > > > > 2) Submit this operation to the thread pool and wait for it to complete,
> > > > > >
> > > > > > VFIO doesn't need to have its own code waiting. If this pool is for
> > > > > > migration purpose in general, qemu migration framework will need to wait at
> > > > > > some point for all jobs to finish before moving on. Perhaps it should be
> > > > > > at the end of the non-iterative session.
> > > > >
> > > > > So essentially, instead of calling save_live_complete_precopy_end handlers
> > > > > from the migration code you would like to hard-code its current VFIO
> > > > > implementation of calling vfio_save_complete_precopy_async_thread_thread_terminate().
> > > > >
> > > > > Only it wouldn't be then called VFIO precopy async thread terminate but some
> > > > > generic device state async precopy thread terminate function.
> > > >
> > > > I don't understand what did you mean by "hard code".
> > >
> > > "Hard code" wasn't maybe the best expression here.
> > >
> > > I meant the move of the functionality that's provided by
> > > vfio_save_complete_precopy_async_thread_thread_terminate() in this patch set
> > > to the common migration code.
> >
> > I see. That function only does a thread_join() so far.
> >
> > So can I understand it as below [1] should work for us, and it'll be clean
> > too (with nothing to hard-code)?
>
> It will need some signal to the worker thread pool to terminate before
> waiting for the threads to finish (as the code in [1] just waits).
>
> In the case of the current vfio_save_complete_precopy_async_thread()
> implementation this signal isn't necessary, as that thread simply
> terminates when it has read all the data it needs from the device.
>
> In a worker thread pool case there will be some threads waiting for
> jobs to be queued to them, so they will need to be somehow signaled
> to exit.
Right. We may need something like multifd_send_should_exit() +
MultiFDSendParams.sem. It'll be nicer if we can generalize that part so
multifd threads can also rebase to that thread model, but maybe I'm asking
too much.
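
Loosely following the existing multifd pattern (an exit check such as
multifd_send_should_exit() combined with a per-thread semaphore), a generic
worker loop might look roughly like the sketch below; all names here
(WorkerPool, worker_pool_run_one_job()) are hypothetical and not code from
either series:

#include "qemu/osdep.h"
#include "qemu/thread.h"
#include "qemu/atomic.h"

typedef struct WorkerPool {
    QemuSemaphore job_pending;  /* one post per queued job or exit request */
    QemuSemaphore job_done;     /* the pool->job_done waited on by [1] above */
    int active_threads;         /* the counter tested by [1] above */
    bool shutdown;
} WorkerPool;

/* Hypothetical per-job handler, standing in for the device's callback. */
static void worker_pool_run_one_job(WorkerPool *pool)
{
    /* dequeue one job and run its handler; omitted in this sketch */
}

static void *worker_pool_thread(void *opaque)
{
    WorkerPool *pool = opaque;

    for (;;) {
        /* Sleep until a job is queued or termination is requested. */
        qemu_sem_wait(&pool->job_pending);

        if (qatomic_read(&pool->shutdown)) {
            /* Let the "while (pool->active_threads)" loop in [1] finish. */
            qatomic_dec(&pool->active_threads);
            qemu_sem_post(&pool->job_done);
            break;
        }

        worker_pool_run_one_job(pool);

        /* Wake the flush loop in [1] so it can re-check progress. */
        qemu_sem_post(&pool->job_done);
    }

    return NULL;
}

The same loop shape should work whether the jobs are multifd sends or
device-state dumps, which is what generalizing that part would buy us.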
>
> > The time to join() the worker threads can be even later, until
> > migrate_fd_cleanup() on sender side. You may have a better idea on when
> > would be the best place to do it when start working on it.
> >
> > >
> > > > What I was saying is if we target the worker thread pool to be used for
> > > > "concurrently dump vmstates", then it'll make sense to make sure all the
> > > > jobs there were flushed after qemu dumps all non-iterables (because this
> > > > should be the last step of the switchover).
> > > >
> > > > I expect it looks like this:
> > > >
> > > > while (pool->active_threads) {
> > > > qemu_sem_wait(&pool->job_done);
> > > > }
> >
> > [1]
> >
> (..)
> > > I think that with this thread pool introduction we'll unfortunately almost certainly
> > > need to target this patch set at 9.2, since these overall changes (and Fabiano
> > > patches too) will need good testing, might uncover some performance regressions
> > > (for example related to the number of buffers limit or Fabiano multifd changes),
> > > bring some review comments from other people, etc.
> > >
> > > In addition to that, we are in the middle of holiday season and a lot of people
> > > aren't available - like Fabiano said he will be available only in a few weeks.
> >
> > Right, that's unfortunate. Let's see, but still I really hope we can also
> > get some feedback from Fabiano before it lands, even with that we have
> > chance for 9.1 but it's just challenging, it's the same condition I
> > mentioned since the 1st email. And before Fabiano's back (he's the active
> > maintainer for this release), I'm personally happy if you can propose
> > something that can land earlier in this release partly. E.g., if you want
> > we can at least upstream Fabiano's idea first, or some more on top.
> >
> > For that, also feel to have a look at my comment today:
> >
> > https://lore.kernel.org/r/Zn15y693g0AkDbYD@x1n
> >
> > Feel free to comment there too. There's a tiny uncertainty there so far on
> > specifying "max size for a device state" if do what I suggested, as multifd
> > setup will need to allocate an enum buffer suitable for both ram + device.
> > But I think that's not an issue and you'll tackle that properly when
> > working on it. It's more about whether you agree on what I said as a
> > general concept.
> >
>
> Since it seems that the discussion on Fabiano's patch set has subsided I think
> I will start by basing my updated patch set on top of his RFC and then if
> Fabiano wants to submit v1/v2 of his patch set then I will rebase mine on top
> of it.
>
> Otherwise, you can wait until I have a v2 ready and then we can work with that.
Oh I thought you had already started modifying his patchset.

In this case, AFAIR Fabiano plans to rework that RFC series, so maybe
you want to double-check with him; you can also wait for his new version
if that's easier, because I do expect there'll be major changes.

Fabiano?
--
Peter Xu
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v1 00/13] Multifd device state transfer support with VFIO consumer
2024-07-17 18:49 ` Peter Xu
@ 2024-07-17 20:19 ` Fabiano Rosas
2024-07-17 21:07 ` Maciej S. Szmigiero
0 siblings, 1 reply; 29+ messages in thread
From: Fabiano Rosas @ 2024-07-17 20:19 UTC (permalink / raw)
To: Peter Xu, Maciej S. Szmigiero
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P. Berrangé, Avihai Horon,
Joao Martins, qemu-devel
Peter Xu <peterx@redhat.com> writes:
> On Tue, Jul 16, 2024 at 10:10:12PM +0200, Maciej S. Szmigiero wrote:
>> On 27.06.2024 16:56, Peter Xu wrote:
>> > On Thu, Jun 27, 2024 at 11:14:28AM +0200, Maciej S. Szmigiero wrote:
>> > > On 26.06.2024 18:23, Peter Xu wrote:
>> > > > On Wed, Jun 26, 2024 at 05:47:34PM +0200, Maciej S. Szmigiero wrote:
>> > > > > On 26.06.2024 03:51, Peter Xu wrote:
>> > > > > > On Wed, Jun 26, 2024 at 12:44:29AM +0200, Maciej S. Szmigiero wrote:
>> > > > > > > On 25.06.2024 19:25, Peter Xu wrote:
>> > > > > > > > On Mon, Jun 24, 2024 at 09:51:18PM +0200, Maciej S. Szmigiero wrote:
>> > > > > > > > > Hi Peter,
>> > > > > > > >
>> > > > > > > > Hi, Maciej,
>> > > > > > > >
>> > > > > > > > >
>> > > > > > > > > On 23.06.2024 22:27, Peter Xu wrote:
>> > > > > > > > > > On Tue, Jun 18, 2024 at 06:12:18PM +0200, Maciej S. Szmigiero wrote:
>> > > > > > > > > > > From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>> > > > > > > > > > >
>> > > > > > > > > > > This is an updated v1 patch series of the RFC (v0) series located here:
>> > > > > > > > > > > https://lore.kernel.org/qemu-devel/cover.1713269378.git.maciej.szmigiero@oracle.com/
>> > > > > > > > > >
>> > > > > > > > > > OK I took some hours thinking about this today, and here's some high level
>> > > > > > > > > > comments for this series. I'll start with which are more relevant to what
>> > > > > > > > > > Fabiano has already suggested in the other thread, then I'll add some more.
>> > > > > > > > > >
>> > > > > > > > > > https://lore.kernel.org/r/20240620212111.29319-1-farosas@suse.de
>> > > > > > > > >
>> > > > > > > > > That's a long list, thanks for these comments.
>> > > > > > > > >
>> > > > > > > > > I have responded to them inline below.
>> > > > > > > > > (..)
>> > > > > > >
>> > > > > > > 2) Submit this operation to the thread pool and wait for it to complete,
>> > > > > >
>> > > > > > VFIO doesn't need to have its own code waiting. If this pool is for
>> > > > > > migration purpose in general, qemu migration framework will need to wait at
>> > > > > > some point for all jobs to finish before moving on. Perhaps it should be
>> > > > > > at the end of the non-iterative session.
>> > > > >
>> > > > > So essentially, instead of calling save_live_complete_precopy_end handlers
>> > > > > from the migration code you would like to hard-code its current VFIO
>> > > > > implementation of calling vfio_save_complete_precopy_async_thread_thread_terminate().
>> > > > >
>> > > > > Only it wouldn't be then called VFIO precopy async thread terminate but some
>> > > > > generic device state async precopy thread terminate function.
>> > > >
>> > > > I don't understand what did you mean by "hard code".
>> > >
>> > > "Hard code" wasn't maybe the best expression here.
>> > >
>> > > I meant the move of the functionality that's provided by
>> > > vfio_save_complete_precopy_async_thread_thread_terminate() in this patch set
>> > > to the common migration code.
>> >
>> > I see. That function only does a thread_join() so far.
>> >
>> > So can I understand it as below [1] should work for us, and it'll be clean
>> > too (with nothing to hard-code)?
>>
>> It will need some signal to the worker thread pool to terminate before
>> waiting for the threads to finish (as the code in [1] just waits).
>>
>> In the case of the current vfio_save_complete_precopy_async_thread()
>> implementation this signal isn't necessary, as that thread simply
>> terminates when it has read all the data it needs from the device.
>>
>> In a worker thread pool case there will be some threads waiting for
>> jobs to be queued to them, so they will need to be somehow signaled
>> to exit.
>
> Right. We may need something like multifd_send_should_exit() +
> MultiFDSendParams.sem. It'll be nicer if we can generalize that part so
> multifd threads can also rebase to that thread model, but maybe I'm asking
> too much.
>
>>
>> > The time to join() the worker threads can be even later, until
>> > migrate_fd_cleanup() on sender side. You may have a better idea on when
>> > would be the best place to do it when start working on it.
>> >
>> > >
>> > > > What I was saying is if we target the worker thread pool to be used for
>> > > > "concurrently dump vmstates", then it'll make sense to make sure all the
>> > > > jobs there were flushed after qemu dumps all non-iterables (because this
>> > > > should be the last step of the switchover).
>> > > >
>> > > > I expect it looks like this:
>> > > >
>> > > > while (pool->active_threads) {
>> > > > qemu_sem_wait(&pool->job_done);
>> > > > }
>> >
>> > [1]
>> >
>> (..)
>> > > I think that with this thread pool introduction we'll unfortunately almost certainly
>> > > need to target this patch set at 9.2, since these overall changes (and Fabiano
>> > > patches too) will need good testing, might uncover some performance regressions
>> > > (for example related to the number of buffers limit or Fabiano multifd changes),
>> > > bring some review comments from other people, etc.
>> > >
>> > > In addition to that, we are in the middle of holiday season and a lot of people
>> > > aren't available - like Fabiano said he will be available only in a few weeks.
>> >
>> > Right, that's unfortunate. Let's see, but still I really hope we can also
>> > get some feedback from Fabiano before it lands, even with that we have
>> > chance for 9.1 but it's just challenging, it's the same condition I
>> > mentioned since the 1st email. And before Fabiano's back (he's the active
>> > maintainer for this release), I'm personally happy if you can propose
>> > something that can land earlier in this release partly. E.g., if you want
>> > we can at least upstream Fabiano's idea first, or some more on top.
>> >
>> > For that, also feel to have a look at my comment today:
>> >
>> > https://lore.kernel.org/r/Zn15y693g0AkDbYD@x1n
>> >
>> > Feel free to comment there too. There's a tiny uncertainty there so far on
>> > specifying "max size for a device state" if do what I suggested, as multifd
>> > setup will need to allocate an enum buffer suitable for both ram + device.
>> > But I think that's not an issue and you'll tackle that properly when
>> > working on it. It's more about whether you agree on what I said as a
>> > general concept.
>> >
>>
>> Since it seems that the discussion on Fabiano's patch set has subsided I think
>> I will start by basing my updated patch set on top of his RFC and then if
>> Fabiano wants to submit v1/v2 of his patch set then I will rebase mine on top
>> of it.
>>
>> Otherwise, you can wait until I have a v2 ready and then we can work with that.
>
> Oh I thought you had already started modifying his patchset.
>
> In this case, AFAIR Fabiano plans to rework that RFC series, so maybe
> you want to double-check with him; you can also wait for his new version
> if that's easier, because I do expect there'll be major changes.
>
> Fabiano?
Don't wait on me. I think I can make the changes Peter suggested without
affecting the interfaces used by this series too much. If it comes to it,
I can rebase this series "under" Maciej's.
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v1 00/13] Multifd device state transfer support with VFIO consumer
2024-07-17 20:19 ` Fabiano Rosas
@ 2024-07-17 21:07 ` Maciej S. Szmigiero
2024-07-17 21:21 ` Peter Xu
0 siblings, 1 reply; 29+ messages in thread
From: Maciej S. Szmigiero @ 2024-07-17 21:07 UTC (permalink / raw)
To: Fabiano Rosas
Cc: Alex Williamson, Peter Xu, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P. Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On 17.07.2024 22:19, Fabiano Rosas wrote:
> Peter Xu <peterx@redhat.com> writes:
>
>> On Tue, Jul 16, 2024 at 10:10:12PM +0200, Maciej S. Szmigiero wrote:
>>> On 27.06.2024 16:56, Peter Xu wrote:
>>>> On Thu, Jun 27, 2024 at 11:14:28AM +0200, Maciej S. Szmigiero wrote:
>>>>> On 26.06.2024 18:23, Peter Xu wrote:
>>>>>> On Wed, Jun 26, 2024 at 05:47:34PM +0200, Maciej S. Szmigiero wrote:
>>>>>>> On 26.06.2024 03:51, Peter Xu wrote:
>>>>>>>> On Wed, Jun 26, 2024 at 12:44:29AM +0200, Maciej S. Szmigiero wrote:
>>>>>>>>> On 25.06.2024 19:25, Peter Xu wrote:
>>>>>>>>>> On Mon, Jun 24, 2024 at 09:51:18PM +0200, Maciej S. Szmigiero wrote:
>>>>>>>>>>> Hi Peter,
>>>>>>>>>>
>>>>>>>>>> Hi, Maciej,
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 23.06.2024 22:27, Peter Xu wrote:
>>>>>>>>>>>> On Tue, Jun 18, 2024 at 06:12:18PM +0200, Maciej S. Szmigiero wrote:
>>>>>>>>>>>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>>>>>>>>>>>
>>>>>>>>>>>>> This is an updated v1 patch series of the RFC (v0) series located here:
>>>>>>>>>>>>> https://lore.kernel.org/qemu-devel/cover.1713269378.git.maciej.szmigiero@oracle.com/
>>>>>>>>>>>>
>>>>>>>>>>>> OK I took some hours thinking about this today, and here's some high level
>>>>>>>>>>>> comments for this series. I'll start with which are more relevant to what
>>>>>>>>>>>> Fabiano has already suggested in the other thread, then I'll add some more.
>>>>>>>>>>>>
>>>>>>>>>>>> https://lore.kernel.org/r/20240620212111.29319-1-farosas@suse.de
>>>>>>>>>>>
>>>>>>>>>>> That's a long list, thanks for these comments.
>>>>>>>>>>>
>>>>>>>>>>> I have responded to them inline below.
>>>>>>>>>>> (..)
>>>>>>>>>
>>>>>>>>> 2) Submit this operation to the thread pool and wait for it to complete,
>>>>>>>>
>>>>>>>> VFIO doesn't need to have its own code waiting. If this pool is for
>>>>>>>> migration purpose in general, qemu migration framework will need to wait at
>>>>>>>> some point for all jobs to finish before moving on. Perhaps it should be
>>>>>>>> at the end of the non-iterative session.
>>>>>>>
>>>>>>> So essentially, instead of calling save_live_complete_precopy_end handlers
>>>>>>> from the migration code you would like to hard-code its current VFIO
>>>>>>> implementation of calling vfio_save_complete_precopy_async_thread_thread_terminate().
>>>>>>>
>>>>>>> Only it wouldn't be then called VFIO precopy async thread terminate but some
>>>>>>> generic device state async precopy thread terminate function.
>>>>>>
>>>>>> I don't understand what did you mean by "hard code".
>>>>>
>>>>> "Hard code" wasn't maybe the best expression here.
>>>>>
>>>>> I meant the move of the functionality that's provided by
>>>>> vfio_save_complete_precopy_async_thread_thread_terminate() in this patch set
>>>>> to the common migration code.
>>>>
>>>> I see. That function only does a thread_join() so far.
>>>>
>>>> So can I understand it as below [1] should work for us, and it'll be clean
>>>> too (with nothing to hard-code)?
>>>
>>> It will need some signal to the worker thread pool to terminate before
>>> waiting for the threads to finish (as the code in [1] just waits).
>>>
>>> In the case of the current vfio_save_complete_precopy_async_thread()
>>> implementation this signal isn't necessary, as that thread simply
>>> terminates when it has read all the data it needs from the device.
>>>
>>> In a worker thread pool case there will be some threads waiting for
>>> jobs to be queued to them, so they will need to be somehow signaled
>>> to exit.
>>
>> Right. We may need something like multifd_send_should_exit() +
>> MultiFDSendParams.sem. It'll be nicer if we can generalize that part so
>> multifd threads can also rebase to that thread model, but maybe I'm asking
>> too much.
>>
>>>
>>>> The time to join() the worker threads can be even later, until
>>>> migrate_fd_cleanup() on sender side. You may have a better idea on when
>>>> would be the best place to do it when start working on it.
>>>>
>>>>>
>>>>>> What I was saying is if we target the worker thread pool to be used for
>>>>>> "concurrently dump vmstates", then it'll make sense to make sure all the
>>>>>> jobs there were flushed after qemu dumps all non-iterables (because this
>>>>>> should be the last step of the switchover).
>>>>>>
>>>>>> I expect it looks like this:
>>>>>>
>>>>>> while (pool->active_threads) {
>>>>>> qemu_sem_wait(&pool->job_done);
>>>>>> }
>>>>
>>>> [1]
>>>>
>>> (..)
>>>>> I think that with this thread pool introduction we'll unfortunately almost certainly
>>>>> need to target this patch set at 9.2, since these overall changes (and Fabiano
>>>>> patches too) will need good testing, might uncover some performance regressions
>>>>> (for example related to the number of buffers limit or Fabiano multifd changes),
>>>>> bring some review comments from other people, etc.
>>>>>
>>>>> In addition to that, we are in the middle of holiday season and a lot of people
>>>>> aren't available - like Fabiano said he will be available only in a few weeks.
>>>>
>>>> Right, that's unfortunate. Let's see, but still I really hope we can also
>>>> get some feedback from Fabiano before it lands, even with that we have
>>>> chance for 9.1 but it's just challenging, it's the same condition I
>>>> mentioned since the 1st email. And before Fabiano's back (he's the active
>>>> maintainer for this release), I'm personally happy if you can propose
>>>> something that can land earlier in this release partly. E.g., if you want
>>>> we can at least upstream Fabiano's idea first, or some more on top.
>>>>
>>>> For that, also feel to have a look at my comment today:
>>>>
>>>> https://lore.kernel.org/r/Zn15y693g0AkDbYD@x1n
>>>>
>>>> Feel free to comment there too. There's a tiny uncertainty there so far on
>>>> specifying "max size for a device state" if do what I suggested, as multifd
>>>> setup will need to allocate an enum buffer suitable for both ram + device.
>>>> But I think that's not an issue and you'll tackle that properly when
>>>> working on it. It's more about whether you agree on what I said as a
>>>> general concept.
>>>>
>>>
>>> Since it seems that the discussion on Fabiano's patch set has subsided I think
>>> I will start by basing my updated patch set on top of his RFC and then if
>>> Fabiano wants to submit v1/v2 of his patch set then I will rebase mine on top
>>> of it.
>>>
>>> Otherwise, you can wait until I have a v2 ready and then we can work with that.
>>
>> Oh I thought you had already started modifying his patchset.
>>
>> In this case, AFAIR Fabiano plans to rework that RFC series, so maybe
>> you want to double-check with him; you can also wait for his new version
>> if that's easier, because I do expect there'll be major changes.
>>
>> Fabiano?
>
> Don't wait on me. I think I can make the changes Peter suggested without
> affecting the interfaces used by this series too much. If it comes to it,
> I can rebase this series "under" Maciej's.
So to be clear, I should base my series on top of your existing RFC patch set
and then we'll swap these RFC patches for the updated versions, correct?
Thanks,
Maciej
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v1 00/13] Multifd device state transfer support with VFIO consumer
2024-07-17 21:07 ` Maciej S. Szmigiero
@ 2024-07-17 21:21 ` Peter Xu
0 siblings, 0 replies; 29+ messages in thread
From: Peter Xu @ 2024-07-17 21:21 UTC (permalink / raw)
To: Maciej S. Szmigiero
Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P. Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On Wed, Jul 17, 2024 at 11:07:43PM +0200, Maciej S. Szmigiero wrote:
> > Don't wait on me. I think I can make the changes Peter suggested without
> > affecting too much the interfaces used by this series. If it comes to
> > it, I can rebase this series "under" Maciej's.
>
> So to be clear, I should base my series on top of your existing RFC patch set
> and then we'll swap these RFC patches for the updated versions, correct?
I'm not sure that's a good idea... since the VFIO series should depend
heavily on that RFC series IIUC, and if the RFC is prone to major changes,
maybe we should still work that out first (otherwise the next rebase can
change a lot and again void most of the VFIO testing already carried out)?
--
Peter Xu
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v1 00/13] Multifd device state transfer support with VFIO consumer
2024-06-27 9:14 ` Maciej S. Szmigiero
2024-06-27 14:56 ` Peter Xu
@ 2024-06-27 15:09 ` Peter Xu
1 sibling, 0 replies; 29+ messages in thread
From: Peter Xu @ 2024-06-27 15:09 UTC (permalink / raw)
To: Maciej S. Szmigiero
Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P. Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On Thu, Jun 27, 2024 at 11:14:28AM +0200, Maciej S. Szmigiero wrote:
> Having RAM sent in parallel with non-iterables would make sense to me,
> but I am not 100% sure this is a safe thing to do - after all, currently
> non-iterables can rely on the whole RAM being already transferred.
And I forgot to comment on this one... but that's a good point.

I think this one indeed needs further investigation. Some devices may have
a special dependency like you said, either on memory being fully loaded or
on something else like the BQL, so at least concurrent load() won't work
for the latter. What I was hoping is that we can start to move some
time-consuming objects into the async model if they do not have such
dependencies. The thing in my mind so far is still vcpus: that's where I
observed major uncertainty causing major downtimes as well. I remember a
vcpu only needs the memory to be loaded by the time KVM_RUN triggers the
loading of CR3, so _maybe_ that'll be fine, but that needs some
double-checking.
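
Purely as an illustration of the kind of dependency being discussed (not
proposed code, and the names are made up): a device whose load needs guest
RAM to be present could block on an event that the main load path sets once
RAM loading has finished, while a device without that dependency would skip
the wait.

#include "qemu/osdep.h"
#include "qemu/thread.h"

/*
 * Hypothetical: set via qemu_event_set() by the main load path once all
 * of RAM has been loaded (init/set on the RAM side omitted here).
 */
static QemuEvent ram_load_complete;

/* Hypothetical worker loading one non-iterable device's state concurrently. */
static void *device_load_thread(void *opaque)
{
    /*
     * A device that needs guest memory in place (like the vcpu /
     * KVM_RUN / CR3 question above) would have to wait here; a
     * device without such a dependency could start loading at once.
     */
    qemu_event_wait(&ram_load_complete);

    /* ... load this device's state here ... */

    return NULL;
}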
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 29+ messages in thread