* [PATCH v2 00/17] Multifd device state transfer support with VFIO consumer
@ 2024-08-27 17:54 Maciej S. Szmigiero
2024-08-27 17:54 ` [PATCH v2 01/17] vfio/migration: Add save_{iterate,complete_precopy}_started trace events Maciej S. Szmigiero
` (18 more replies)
0 siblings, 19 replies; 128+ messages in thread
From: Maciej S. Szmigiero @ 2024-08-27 17:54 UTC (permalink / raw)
To: Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P. Berrangé, Avihai Horon,
Joao Martins, qemu-devel
From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
This is an updated v2 of the v1 patch series located here:
https://lore.kernel.org/qemu-devel/cover.1718717584.git.maciej.szmigiero@oracle.com/
Changes from v1:
* Extended the QEMU thread-pool with non-AIO (generic) pool support,
implemented automatic memory management support for its work element
function argument.
* Introduced a multifd device state save thread pool, ported the VFIO
multifd device state save implementation to use this thread pool instead
of VFIO internally managed individual threads.
* Re-implemented on top of Fabiano's v4 multifd sender refactor patch set from
https://lore.kernel.org/qemu-devel/20240823173911.6712-1-farosas@suse.de/
* Moved device state related multifd code to new multifd-device-state.c
file where it made sense.
* Implemented a max in-flight VFIO device state buffer count limit to
allow capping the maximum recipient memory usage.
* Removed unnecessary explicit memory barriers from multifd_send().
* A few small changes like updated comments, code formatting and a fix
for under-counting in the zero-copy RAM multifd bytes transferred counter.
For convenience, this patch set is also available as a git tree:
https://github.com/maciejsszmigiero/qemu/tree/multifd-device-state-transfer-vfio
Based-on: <20240823173911.6712-1-farosas@suse.de>
Maciej S. Szmigiero (17):
vfio/migration: Add save_{iterate,complete_precopy}_started trace
events
migration/ram: Add load start trace event
migration/multifd: Zero p->flags before starting filling a packet
thread-pool: Add a DestroyNotify parameter to
thread_pool_submit{,_aio}()
thread-pool: Implement non-AIO (generic) pool support
migration: Add save_live_complete_precopy_{begin,end} handlers
migration: Add qemu_loadvm_load_state_buffer() and its handler
migration: Add load_finish handler and associated functions
migration/multifd: Device state transfer support - receive side
migration/multifd: Convert multifd_send()::next_channel to atomic
migration/multifd: Add an explicit MultiFDSendData destructor
migration/multifd: Device state transfer support - send side
migration/multifd: Add migration_has_device_state_support()
migration: Add save_live_complete_precopy_thread handler
vfio/migration: Multifd device state transfer support - receive side
vfio/migration: Add x-migration-multifd-transfer VFIO property
vfio/migration: Multifd device state transfer support - send side
backends/tpm/tpm_backend.c | 2 +-
block/file-win32.c | 2 +-
hw/9pfs/coth.c | 3 +-
hw/ppc/spapr_nvdimm.c | 4 +-
hw/vfio/migration.c | 520 ++++++++++++++++++++++++++++++-
hw/vfio/pci.c | 9 +
hw/vfio/trace-events | 14 +-
hw/virtio/virtio-pmem.c | 2 +-
include/block/thread-pool.h | 12 +-
include/hw/vfio/vfio-common.h | 22 ++
include/migration/misc.h | 15 +
include/migration/register.h | 97 ++++++
include/qemu/typedefs.h | 4 +
migration/meson.build | 1 +
migration/migration.c | 6 +
migration/migration.h | 3 +
migration/multifd-device-state.c | 193 ++++++++++++
migration/multifd-nocomp.c | 9 +-
migration/multifd-qpl.c | 2 +-
migration/multifd-uadk.c | 2 +-
migration/multifd-zlib.c | 2 +-
migration/multifd-zstd.c | 2 +-
migration/multifd.c | 249 ++++++++++++---
migration/multifd.h | 65 +++-
migration/ram.c | 1 +
migration/savevm.c | 152 ++++++++-
migration/savevm.h | 7 +
migration/trace-events | 1 +
tests/unit/test-thread-pool.c | 8 +-
util/thread-pool.c | 83 ++++-
30 files changed, 1406 insertions(+), 86 deletions(-)
create mode 100644 migration/multifd-device-state.c
* [PATCH v2 01/17] vfio/migration: Add save_{iterate,complete_precopy}_started trace events
2024-08-27 17:54 [PATCH v2 00/17] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
@ 2024-08-27 17:54 ` Maciej S. Szmigiero
2024-09-05 13:08 ` [PATCH v2 01/17] vfio/migration: Add save_{iterate,complete_precopy}_started " Avihai Horon
2024-08-27 17:54 ` [PATCH v2 02/17] migration/ram: Add load start trace event Maciej S. Szmigiero
` (17 subsequent siblings)
18 siblings, 1 reply; 128+ messages in thread
From: Maciej S. Szmigiero @ 2024-08-27 17:54 UTC (permalink / raw)
To: Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P. Berrangé, Avihai Horon,
Joao Martins, qemu-devel
From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
This way both the start and end points of migrating a particular VFIO
device are known.
Also add a vfio_save_iterate_empty_hit trace event so it is known when
there's no more data to send for that device.
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
hw/vfio/migration.c | 13 +++++++++++++
hw/vfio/trace-events | 3 +++
include/hw/vfio/vfio-common.h | 3 +++
3 files changed, 19 insertions(+)
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 262d42a46e58..24679d8c5034 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -472,6 +472,9 @@ static int vfio_save_setup(QEMUFile *f, void *opaque, Error **errp)
return -ENOMEM;
}
+ migration->save_iterate_run = false;
+ migration->save_iterate_empty_hit = false;
+
if (vfio_precopy_supported(vbasedev)) {
switch (migration->device_state) {
case VFIO_DEVICE_STATE_RUNNING:
@@ -605,9 +608,17 @@ static int vfio_save_iterate(QEMUFile *f, void *opaque)
VFIOMigration *migration = vbasedev->migration;
ssize_t data_size;
+ if (!migration->save_iterate_run) {
+ trace_vfio_save_iterate_started(vbasedev->name);
+ migration->save_iterate_run = true;
+ }
+
data_size = vfio_save_block(f, migration);
if (data_size < 0) {
return data_size;
+ } else if (data_size == 0 && !migration->save_iterate_empty_hit) {
+ trace_vfio_save_iterate_empty_hit(vbasedev->name);
+ migration->save_iterate_empty_hit = true;
}
vfio_update_estimated_pending_data(migration, data_size);
@@ -633,6 +644,8 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
int ret;
Error *local_err = NULL;
+ trace_vfio_save_complete_precopy_started(vbasedev->name);
+
/* We reach here with device state STOP or STOP_COPY only */
ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP_COPY,
VFIO_DEVICE_STATE_STOP, &local_err);
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 98bd4dcceadc..013c602f30fa 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -159,8 +159,11 @@ vfio_migration_state_notifier(const char *name, int state) " (%s) state %d"
vfio_save_block(const char *name, int data_size) " (%s) data_size %d"
vfio_save_cleanup(const char *name) " (%s)"
vfio_save_complete_precopy(const char *name, int ret) " (%s) ret %d"
+vfio_save_complete_precopy_started(const char *name) " (%s)"
vfio_save_device_config_state(const char *name) " (%s)"
vfio_save_iterate(const char *name, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64
+vfio_save_iterate_started(const char *name) " (%s)"
+vfio_save_iterate_empty_hit(const char *name) " (%s)"
vfio_save_setup(const char *name, uint64_t data_buffer_size) " (%s) data buffer size 0x%"PRIx64
vfio_state_pending_estimate(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64
vfio_state_pending_exact(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t stopcopy_size, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" stopcopy size 0x%"PRIx64" precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index fed499b199f0..32d58e3e025b 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -73,6 +73,9 @@ typedef struct VFIOMigration {
uint64_t precopy_init_size;
uint64_t precopy_dirty_size;
bool initial_data_sent;
+
+ bool save_iterate_run;
+ bool save_iterate_empty_hit;
} VFIOMigration;
struct VFIOGroup;
* [PATCH v2 02/17] migration/ram: Add load start trace event
2024-08-27 17:54 [PATCH v2 00/17] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
2024-08-27 17:54 ` [PATCH v2 01/17] vfio/migration: Add save_{iterate,complete_precopy}_started trace events Maciej S. Szmigiero
@ 2024-08-27 17:54 ` Maciej S. Szmigiero
2024-08-28 18:44 ` Fabiano Rosas
2024-08-27 17:54 ` [PATCH v2 03/17] migration/multifd: Zero p->flags before starting filling a packet Maciej S. Szmigiero
` (16 subsequent siblings)
18 siblings, 1 reply; 128+ messages in thread
From: Maciej S. Szmigiero @ 2024-08-27 17:54 UTC (permalink / raw)
To: Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P. Berrangé, Avihai Horon,
Joao Martins, qemu-devel
From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
There's a RAM load complete trace event, but it lacked a corresponding
start event.
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
migration/ram.c | 1 +
migration/trace-events | 1 +
2 files changed, 2 insertions(+)
diff --git a/migration/ram.c b/migration/ram.c
index 67ca3d5d51a1..7997bd830b9c 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -4127,6 +4127,7 @@ static int ram_load_precopy(QEMUFile *f)
RAM_SAVE_FLAG_ZERO);
}
+ trace_ram_load_start();
while (!ret && !(flags & RAM_SAVE_FLAG_EOS)) {
ram_addr_t addr;
void *host = NULL, *host_bak = NULL;
diff --git a/migration/trace-events b/migration/trace-events
index c65902f042bd..2a99a7baaea6 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -115,6 +115,7 @@ colo_flush_ram_cache_end(void) ""
save_xbzrle_page_skipping(void) ""
save_xbzrle_page_overflow(void) ""
ram_save_iterate_big_wait(uint64_t milliconds, int iterations) "big wait: %" PRIu64 " milliseconds, %d iterations"
+ram_load_start(void) ""
ram_load_complete(int ret, uint64_t seq_iter) "exit_code %d seq iteration %" PRIu64
ram_write_tracking_ramblock_start(const char *block_id, size_t page_size, void *addr, size_t length) "%s: page_size: %zu addr: %p length: %zu"
ram_write_tracking_ramblock_stop(const char *block_id, size_t page_size, void *addr, size_t length) "%s: page_size: %zu addr: %p length: %zu"
* [PATCH v2 03/17] migration/multifd: Zero p->flags before starting filling a packet
2024-08-27 17:54 [PATCH v2 00/17] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
2024-08-27 17:54 ` [PATCH v2 01/17] vfio/migration: Add save_{iterate,complete_precopy}_started trace events Maciej S. Szmigiero
2024-08-27 17:54 ` [PATCH v2 02/17] migration/ram: Add load start trace event Maciej S. Szmigiero
@ 2024-08-27 17:54 ` Maciej S. Szmigiero
2024-08-28 18:50 ` Fabiano Rosas
2024-09-09 15:41 ` Peter Xu
2024-08-27 17:54 ` [PATCH v2 04/17] thread-pool: Add a DestroyNotify parameter to thread_pool_submit{,_aio}() Maciej S. Szmigiero
` (15 subsequent siblings)
18 siblings, 2 replies; 128+ messages in thread
From: Maciej S. Szmigiero @ 2024-08-27 17:54 UTC (permalink / raw)
To: Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P. Berrangé, Avihai Horon,
Joao Martins, qemu-devel
From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
This way no stale flags are left in the packet header.
p->flags can no longer carry a stale SYNC flag into the next RAM packet,
since syncs are now handled separately in multifd_send_thread.
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
migration/multifd.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/migration/multifd.c b/migration/multifd.c
index 0c07a2040ba8..b06a9fab500e 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -601,6 +601,7 @@ static void *multifd_send_thread(void *opaque)
* qatomic_store_release() in multifd_send().
*/
if (qatomic_load_acquire(&p->pending_job)) {
+ p->flags = 0;
p->iovs_num = 0;
assert(!multifd_payload_empty(p->data));
@@ -652,7 +653,6 @@ static void *multifd_send_thread(void *opaque)
}
/* p->next_packet_size will always be zero for a SYNC packet */
stat64_add(&mig_stats.multifd_bytes, p->packet_len);
- p->flags = 0;
}
qatomic_set(&p->pending_sync, false);
* [PATCH v2 04/17] thread-pool: Add a DestroyNotify parameter to thread_pool_submit{,_aio}()
2024-08-27 17:54 [PATCH v2 00/17] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
` (2 preceding siblings ...)
2024-08-27 17:54 ` [PATCH v2 03/17] migration/multifd: Zero p->flags before starting filling a packet Maciej S. Szmigiero
@ 2024-08-27 17:54 ` Maciej S. Szmigiero
2024-08-27 17:54 ` [PATCH v2 05/17] thread-pool: Implement non-AIO (generic) pool support Maciej S. Szmigiero
` (14 subsequent siblings)
18 siblings, 0 replies; 128+ messages in thread
From: Maciej S. Szmigiero @ 2024-08-27 17:54 UTC (permalink / raw)
To: Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P. Berrangé, Avihai Horon,
Joao Martins, qemu-devel
From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
Automatic memory management is less prone to mistakes or to confusion
about who is responsible for freeing the memory backing the "arg"
parameter or for dropping a strong reference to the object that this
parameter points to.
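As a hedged illustration (not part of this patch; "WorkRequest",
"worker_fn", "done_cb" and "submit_work" are hypothetical names), a
caller can now hand ownership of a heap-allocated argument to the pool
by passing g_free as the destroy notifier:

#include "qemu/osdep.h"
#include "block/thread-pool.h"

/* A minimal sketch; all names here are hypothetical. */
typedef struct WorkRequest {
    int fd;
} WorkRequest;

static int worker_fn(void *arg)
{
    WorkRequest *req = arg;

    /* ... perform blocking work on req->fd ... */
    return 0;
}

static void done_cb(void *opaque, int ret)
{
    /*
     * The pool has already called g_free() on the WorkRequest by the
     * time this runs, so only @opaque may be dereferenced here.
     */
}

static void submit_work(int fd, void *opaque)
{
    WorkRequest *req = g_new0(WorkRequest, 1);

    req->fd = fd;
    /* Passing g_free transfers ownership of @req to the pool. */
    thread_pool_submit_aio(worker_fn, req, g_free, done_cb, opaque);
}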
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
backends/tpm/tpm_backend.c | 2 +-
block/file-win32.c | 2 +-
hw/9pfs/coth.c | 3 ++-
hw/ppc/spapr_nvdimm.c | 4 ++--
hw/virtio/virtio-pmem.c | 2 +-
include/block/thread-pool.h | 6 ++++--
tests/unit/test-thread-pool.c | 8 ++++----
util/thread-pool.c | 16 ++++++++++++----
8 files changed, 27 insertions(+), 16 deletions(-)
diff --git a/backends/tpm/tpm_backend.c b/backends/tpm/tpm_backend.c
index 485a20b9e09f..65ef961b59ae 100644
--- a/backends/tpm/tpm_backend.c
+++ b/backends/tpm/tpm_backend.c
@@ -107,7 +107,7 @@ void tpm_backend_deliver_request(TPMBackend *s, TPMBackendCmd *cmd)
s->cmd = cmd;
object_ref(OBJECT(s));
- thread_pool_submit_aio(tpm_backend_worker_thread, s,
+ thread_pool_submit_aio(tpm_backend_worker_thread, s, NULL,
tpm_backend_request_completed, s);
}
diff --git a/block/file-win32.c b/block/file-win32.c
index 7e1baa1ece6a..9b99ae2f89e1 100644
--- a/block/file-win32.c
+++ b/block/file-win32.c
@@ -167,7 +167,7 @@ static BlockAIOCB *paio_submit(BlockDriverState *bs, HANDLE hfile,
acb->aio_offset = offset;
trace_file_paio_submit(acb, opaque, offset, count, type);
- return thread_pool_submit_aio(aio_worker, acb, cb, opaque);
+ return thread_pool_submit_aio(aio_worker, acb, NULL, cb, opaque);
}
int qemu_ftruncate64(int fd, int64_t length)
diff --git a/hw/9pfs/coth.c b/hw/9pfs/coth.c
index 598f46add993..fe5bfa6920fe 100644
--- a/hw/9pfs/coth.c
+++ b/hw/9pfs/coth.c
@@ -41,5 +41,6 @@ static int coroutine_enter_func(void *arg)
void co_run_in_worker_bh(void *opaque)
{
Coroutine *co = opaque;
- thread_pool_submit_aio(coroutine_enter_func, co, coroutine_enter_cb, co);
+ thread_pool_submit_aio(coroutine_enter_func, co, NULL,
+ coroutine_enter_cb, co);
}
diff --git a/hw/ppc/spapr_nvdimm.c b/hw/ppc/spapr_nvdimm.c
index 7d2dfe5e3d2f..f9ee45935d1d 100644
--- a/hw/ppc/spapr_nvdimm.c
+++ b/hw/ppc/spapr_nvdimm.c
@@ -517,7 +517,7 @@ static int spapr_nvdimm_flush_post_load(void *opaque, int version_id)
}
QLIST_FOREACH(state, &s_nvdimm->pending_nvdimm_flush_states, node) {
- thread_pool_submit_aio(flush_worker_cb, state,
+ thread_pool_submit_aio(flush_worker_cb, state, NULL,
spapr_nvdimm_flush_completion_cb, state);
}
@@ -698,7 +698,7 @@ static target_ulong h_scm_flush(PowerPCCPU *cpu, SpaprMachineState *spapr,
state->drcidx = drc_index;
- thread_pool_submit_aio(flush_worker_cb, state,
+ thread_pool_submit_aio(flush_worker_cb, state, NULL,
spapr_nvdimm_flush_completion_cb, state);
continue_token = state->continue_token;
diff --git a/hw/virtio/virtio-pmem.c b/hw/virtio/virtio-pmem.c
index c3512c2dae3f..f1331c03f474 100644
--- a/hw/virtio/virtio-pmem.c
+++ b/hw/virtio/virtio-pmem.c
@@ -87,7 +87,7 @@ static void virtio_pmem_flush(VirtIODevice *vdev, VirtQueue *vq)
req_data->fd = memory_region_get_fd(&backend->mr);
req_data->pmem = pmem;
req_data->vdev = vdev;
- thread_pool_submit_aio(worker_cb, req_data, done_cb, req_data);
+ thread_pool_submit_aio(worker_cb, req_data, NULL, done_cb, req_data);
}
static void virtio_pmem_get_config(VirtIODevice *vdev, uint8_t *config)
diff --git a/include/block/thread-pool.h b/include/block/thread-pool.h
index 948ff5f30c31..b484c4780ea6 100644
--- a/include/block/thread-pool.h
+++ b/include/block/thread-pool.h
@@ -33,10 +33,12 @@ void thread_pool_free(ThreadPool *pool);
* thread_pool_submit* API: submit I/O requests in the thread's
* current AioContext.
*/
-BlockAIOCB *thread_pool_submit_aio(ThreadPoolFunc *func, void *arg,
+BlockAIOCB *thread_pool_submit_aio(ThreadPoolFunc *func,
+ void *arg, GDestroyNotify arg_destroy,
BlockCompletionFunc *cb, void *opaque);
int coroutine_fn thread_pool_submit_co(ThreadPoolFunc *func, void *arg);
-void thread_pool_submit(ThreadPoolFunc *func, void *arg);
+void thread_pool_submit(ThreadPoolFunc *func,
+ void *arg, GDestroyNotify arg_destroy);
void thread_pool_update_params(ThreadPool *pool, struct AioContext *ctx);
diff --git a/tests/unit/test-thread-pool.c b/tests/unit/test-thread-pool.c
index 1483e53473db..e4afb9e36292 100644
--- a/tests/unit/test-thread-pool.c
+++ b/tests/unit/test-thread-pool.c
@@ -46,7 +46,7 @@ static void done_cb(void *opaque, int ret)
static void test_submit(void)
{
WorkerTestData data = { .n = 0 };
- thread_pool_submit(worker_cb, &data);
+ thread_pool_submit(worker_cb, &data, NULL);
while (data.n == 0) {
aio_poll(ctx, true);
}
@@ -56,7 +56,7 @@ static void test_submit(void)
static void test_submit_aio(void)
{
WorkerTestData data = { .n = 0, .ret = -EINPROGRESS };
- data.aiocb = thread_pool_submit_aio(worker_cb, &data,
+ data.aiocb = thread_pool_submit_aio(worker_cb, &data, NULL,
done_cb, &data);
/* The callbacks are not called until after the first wait. */
@@ -121,7 +121,7 @@ static void test_submit_many(void)
for (i = 0; i < 100; i++) {
data[i].n = 0;
data[i].ret = -EINPROGRESS;
- thread_pool_submit_aio(worker_cb, &data[i], done_cb, &data[i]);
+ thread_pool_submit_aio(worker_cb, &data[i], NULL, done_cb, &data[i]);
}
active = 100;
@@ -149,7 +149,7 @@ static void do_test_cancel(bool sync)
for (i = 0; i < 100; i++) {
data[i].n = 0;
data[i].ret = -EINPROGRESS;
- data[i].aiocb = thread_pool_submit_aio(long_cb, &data[i],
+ data[i].aiocb = thread_pool_submit_aio(long_cb, &data[i], NULL,
done_cb, &data[i]);
}
diff --git a/util/thread-pool.c b/util/thread-pool.c
index 27eb777e855b..69a87ee79252 100644
--- a/util/thread-pool.c
+++ b/util/thread-pool.c
@@ -38,6 +38,7 @@ struct ThreadPoolElement {
ThreadPool *pool;
ThreadPoolFunc *func;
void *arg;
+ GDestroyNotify arg_destroy;
/* Moving state out of THREAD_QUEUED is protected by lock. After
* that, only the worker thread can write to it. Reads and writes
@@ -188,6 +189,10 @@ restart:
elem->ret);
QLIST_REMOVE(elem, all);
+ if (elem->arg_destroy) {
+ elem->arg_destroy(elem->arg);
+ }
+
if (elem->common.cb) {
/* Read state before ret. */
smp_rmb();
@@ -238,7 +243,8 @@ static const AIOCBInfo thread_pool_aiocb_info = {
.cancel_async = thread_pool_cancel,
};
-BlockAIOCB *thread_pool_submit_aio(ThreadPoolFunc *func, void *arg,
+BlockAIOCB *thread_pool_submit_aio(ThreadPoolFunc *func,
+ void *arg, GDestroyNotify arg_destroy,
BlockCompletionFunc *cb, void *opaque)
{
ThreadPoolElement *req;
@@ -251,6 +257,7 @@ BlockAIOCB *thread_pool_submit_aio(ThreadPoolFunc *func, void *arg,
req = qemu_aio_get(&thread_pool_aiocb_info, NULL, cb, opaque);
req->func = func;
req->arg = arg;
+ req->arg_destroy = arg_destroy;
req->state = THREAD_QUEUED;
req->pool = pool;
@@ -285,14 +292,15 @@ int coroutine_fn thread_pool_submit_co(ThreadPoolFunc *func, void *arg)
{
ThreadPoolCo tpc = { .co = qemu_coroutine_self(), .ret = -EINPROGRESS };
assert(qemu_in_coroutine());
- thread_pool_submit_aio(func, arg, thread_pool_co_cb, &tpc);
+ thread_pool_submit_aio(func, arg, NULL, thread_pool_co_cb, &tpc);
qemu_coroutine_yield();
return tpc.ret;
}
-void thread_pool_submit(ThreadPoolFunc *func, void *arg)
+void thread_pool_submit(ThreadPoolFunc *func,
+ void *arg, GDestroyNotify arg_destroy)
{
- thread_pool_submit_aio(func, arg, NULL, NULL);
+ thread_pool_submit_aio(func, arg, arg_destroy, NULL, NULL);
}
void thread_pool_update_params(ThreadPool *pool, AioContext *ctx)
* [PATCH v2 05/17] thread-pool: Implement non-AIO (generic) pool support
2024-08-27 17:54 [PATCH v2 00/17] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
` (3 preceding siblings ...)
2024-08-27 17:54 ` [PATCH v2 04/17] thread-pool: Add a DestroyNotify parameter to thread_pool_submit{,_aio}() Maciej S. Szmigiero
@ 2024-08-27 17:54 ` Maciej S. Szmigiero
2024-09-02 22:07 ` Fabiano Rosas
2024-09-03 13:55 ` Stefan Hajnoczi
2024-08-27 17:54 ` [PATCH v2 06/17] migration: Add save_live_complete_precopy_{begin,end} handlers Maciej S. Szmigiero
` (13 subsequent siblings)
18 siblings, 2 replies; 128+ messages in thread
From: Maciej S. Szmigiero @ 2024-08-27 17:54 UTC (permalink / raw)
To: Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P. Berrangé, Avihai Horon,
Joao Martins, qemu-devel
From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
Migration code wants to manage device data sending threads in one place.
QEMU has an existing thread pool implementation; however, it was limited
to queuing AIO operations only and essentially had a 1:1 mapping between
the current AioContext and the ThreadPool in use.
Implement what is necessary to queue generic (non-AIO) work on a ThreadPool
too.
This brings a few new operations on a pool (a combined usage sketch
follows the list):
* thread_pool_set_minmax_threads() explicitly sets the minimum and maximum
thread count in the pool.
* thread_pool_join() operation waits until all the submitted work requests
have finished.
* thread_pool_poll() lets the new-thread and/or thread-completion bottom
halves run (if they are indeed scheduled to run).
It is useful for thread pool users that need to launch or terminate new
threads without returning to the QEMU main loop.
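A combined usage sketch, assuming the thread_pool_new()/thread_pool_free()
constructors that already exist in include/block/thread-pool.h (error
handling elided; "work_fn" and "args" are hypothetical, this is not a
drop-in from this series):

#include "qemu/osdep.h"
#include "block/thread-pool.h"

/*
 * Run @count generic (non-AIO) work items on a private pool and wait
 * for all of them to finish.
 */
static void run_generic_work(ThreadPoolFunc *work_fn,
                             void **args, size_t count)
{
    /* NULL means "bind the pool to the current AioContext" */
    ThreadPool *pool = thread_pool_new(NULL);
    size_t i;

    thread_pool_set_minmax_threads(pool, 0, 4);

    for (i = 0; i < count; i++) {
        thread_pool_submit(pool, work_fn, args[i], NULL, NULL, NULL);
    }

    /*
     * Waits until the request queue is drained and no request is
     * executing, then lets the completion bottom halves run.
     */
    thread_pool_join(pool);

    thread_pool_free(pool);
}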
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
include/block/thread-pool.h | 10 ++++-
tests/unit/test-thread-pool.c | 2 +-
util/thread-pool.c | 77 ++++++++++++++++++++++++++++++-----
3 files changed, 76 insertions(+), 13 deletions(-)
diff --git a/include/block/thread-pool.h b/include/block/thread-pool.h
index b484c4780ea6..1769496056cd 100644
--- a/include/block/thread-pool.h
+++ b/include/block/thread-pool.h
@@ -37,9 +37,15 @@ BlockAIOCB *thread_pool_submit_aio(ThreadPoolFunc *func,
void *arg, GDestroyNotify arg_destroy,
BlockCompletionFunc *cb, void *opaque);
int coroutine_fn thread_pool_submit_co(ThreadPoolFunc *func, void *arg);
-void thread_pool_submit(ThreadPoolFunc *func,
- void *arg, GDestroyNotify arg_destroy);
+BlockAIOCB *thread_pool_submit(ThreadPool *pool, ThreadPoolFunc *func,
+ void *arg, GDestroyNotify arg_destroy,
+ BlockCompletionFunc *cb, void *opaque);
+void thread_pool_join(ThreadPool *pool);
+void thread_pool_poll(ThreadPool *pool);
+
+void thread_pool_set_minmax_threads(ThreadPool *pool,
+ int min_threads, int max_threads);
void thread_pool_update_params(ThreadPool *pool, struct AioContext *ctx);
#endif
diff --git a/tests/unit/test-thread-pool.c b/tests/unit/test-thread-pool.c
index e4afb9e36292..469c0f7057b6 100644
--- a/tests/unit/test-thread-pool.c
+++ b/tests/unit/test-thread-pool.c
@@ -46,7 +46,7 @@ static void done_cb(void *opaque, int ret)
static void test_submit(void)
{
WorkerTestData data = { .n = 0 };
- thread_pool_submit(worker_cb, &data, NULL);
+ thread_pool_submit(NULL, worker_cb, &data, NULL, NULL, NULL);
while (data.n == 0) {
aio_poll(ctx, true);
}
diff --git a/util/thread-pool.c b/util/thread-pool.c
index 69a87ee79252..2bf3be875a51 100644
--- a/util/thread-pool.c
+++ b/util/thread-pool.c
@@ -60,6 +60,7 @@ struct ThreadPool {
QemuMutex lock;
QemuCond worker_stopped;
QemuCond request_cond;
+ QemuCond no_requests_cond;
QEMUBH *new_thread_bh;
/* The following variables are only accessed from one AioContext. */
@@ -73,6 +74,7 @@ struct ThreadPool {
int pending_threads; /* threads created but not running yet */
int min_threads;
int max_threads;
+ size_t requests_executing;
};
static void *worker_thread(void *opaque)
@@ -107,6 +109,10 @@ static void *worker_thread(void *opaque)
req = QTAILQ_FIRST(&pool->request_list);
QTAILQ_REMOVE(&pool->request_list, req, reqs);
req->state = THREAD_ACTIVE;
+
+ assert(pool->requests_executing < SIZE_MAX);
+ pool->requests_executing++;
+
qemu_mutex_unlock(&pool->lock);
ret = req->func(req->arg);
@@ -118,6 +124,14 @@ static void *worker_thread(void *opaque)
qemu_bh_schedule(pool->completion_bh);
qemu_mutex_lock(&pool->lock);
+
+ assert(pool->requests_executing > 0);
+ pool->requests_executing--;
+
+ if (pool->requests_executing == 0 &&
+ QTAILQ_EMPTY(&pool->request_list)) {
+ qemu_cond_signal(&pool->no_requests_cond);
+ }
}
pool->cur_threads--;
@@ -243,13 +257,16 @@ static const AIOCBInfo thread_pool_aiocb_info = {
.cancel_async = thread_pool_cancel,
};
-BlockAIOCB *thread_pool_submit_aio(ThreadPoolFunc *func,
- void *arg, GDestroyNotify arg_destroy,
- BlockCompletionFunc *cb, void *opaque)
+BlockAIOCB *thread_pool_submit(ThreadPool *pool, ThreadPoolFunc *func,
+ void *arg, GDestroyNotify arg_destroy,
+ BlockCompletionFunc *cb, void *opaque)
{
ThreadPoolElement *req;
AioContext *ctx = qemu_get_current_aio_context();
- ThreadPool *pool = aio_get_thread_pool(ctx);
+
+ if (!pool) {
+ pool = aio_get_thread_pool(ctx);
+ }
/* Assert that the thread submitting work is the same running the pool */
assert(pool->ctx == qemu_get_current_aio_context());
@@ -275,6 +292,18 @@ BlockAIOCB *thread_pool_submit_aio(ThreadPoolFunc *func,
return &req->common;
}
+BlockAIOCB *thread_pool_submit_aio(ThreadPoolFunc *func,
+ void *arg, GDestroyNotify arg_destroy,
+ BlockCompletionFunc *cb, void *opaque)
+{
+ return thread_pool_submit(NULL, func, arg, arg_destroy, cb, opaque);
+}
+
+void thread_pool_poll(ThreadPool *pool)
+{
+ aio_bh_poll(pool->ctx);
+}
+
typedef struct ThreadPoolCo {
Coroutine *co;
int ret;
@@ -297,18 +326,38 @@ int coroutine_fn thread_pool_submit_co(ThreadPoolFunc *func, void *arg)
return tpc.ret;
}
-void thread_pool_submit(ThreadPoolFunc *func,
- void *arg, GDestroyNotify arg_destroy)
+void thread_pool_join(ThreadPool *pool)
{
- thread_pool_submit_aio(func, arg, arg_destroy, NULL, NULL);
+ /* Assert that the thread waiting is the same running the pool */
+ assert(pool->ctx == qemu_get_current_aio_context());
+
+ qemu_mutex_lock(&pool->lock);
+
+ if (pool->requests_executing > 0 ||
+ !QTAILQ_EMPTY(&pool->request_list)) {
+ qemu_cond_wait(&pool->no_requests_cond, &pool->lock);
+ }
+ assert(pool->requests_executing == 0 &&
+ QTAILQ_EMPTY(&pool->request_list));
+
+ qemu_mutex_unlock(&pool->lock);
+
+ aio_bh_poll(pool->ctx);
+
+ assert(QLIST_EMPTY(&pool->head));
}
-void thread_pool_update_params(ThreadPool *pool, AioContext *ctx)
+void thread_pool_set_minmax_threads(ThreadPool *pool,
+ int min_threads, int max_threads)
{
+ assert(min_threads >= 0);
+ assert(max_threads > 0);
+ assert(max_threads >= min_threads);
+
qemu_mutex_lock(&pool->lock);
- pool->min_threads = ctx->thread_pool_min;
- pool->max_threads = ctx->thread_pool_max;
+ pool->min_threads = min_threads;
+ pool->max_threads = max_threads;
/*
* We either have to:
@@ -330,6 +379,12 @@ void thread_pool_update_params(ThreadPool *pool, AioContext *ctx)
qemu_mutex_unlock(&pool->lock);
}
+void thread_pool_update_params(ThreadPool *pool, AioContext *ctx)
+{
+ thread_pool_set_minmax_threads(pool,
+ ctx->thread_pool_min, ctx->thread_pool_max);
+}
+
static void thread_pool_init_one(ThreadPool *pool, AioContext *ctx)
{
if (!ctx) {
@@ -342,6 +397,7 @@ static void thread_pool_init_one(ThreadPool *pool, AioContext *ctx)
qemu_mutex_init(&pool->lock);
qemu_cond_init(&pool->worker_stopped);
qemu_cond_init(&pool->request_cond);
+ qemu_cond_init(&pool->no_requests_cond);
pool->new_thread_bh = aio_bh_new(ctx, spawn_thread_bh_fn, pool);
QLIST_INIT(&pool->head);
@@ -382,6 +438,7 @@ void thread_pool_free(ThreadPool *pool)
qemu_mutex_unlock(&pool->lock);
qemu_bh_delete(pool->completion_bh);
+ qemu_cond_destroy(&pool->no_requests_cond);
qemu_cond_destroy(&pool->request_cond);
qemu_cond_destroy(&pool->worker_stopped);
qemu_mutex_destroy(&pool->lock);
* [PATCH v2 06/17] migration: Add save_live_complete_precopy_{begin,end} handlers
2024-08-27 17:54 [PATCH v2 00/17] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
` (4 preceding siblings ...)
2024-08-27 17:54 ` [PATCH v2 05/17] thread-pool: Implement non-AIO (generic) pool support Maciej S. Szmigiero
@ 2024-08-27 17:54 ` Maciej S. Szmigiero
2024-08-28 19:03 ` [PATCH v2 06/17] migration: Add save_live_complete_precopy_{begin,end} handlers Fabiano Rosas
2024-09-05 13:45 ` Avihai Horon
2024-08-27 17:54 ` [PATCH v2 07/17] migration: Add qemu_loadvm_load_state_buffer() and its handler Maciej S. Szmigiero
` (12 subsequent siblings)
18 siblings, 2 replies; 128+ messages in thread
From: Maciej S. Szmigiero @ 2024-08-27 17:54 UTC (permalink / raw)
To: Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P. Berrangé, Avihai Horon,
Joao Martins, qemu-devel
From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
These SaveVMHandlers help a device provide its own asynchronous
transmission of the remaining data at the end of a precopy phase.
In this use case the save_live_complete_precopy_begin handler might
be used to mark the stream boundary before proceeding with asynchronous
transmission of the remaining data, while the
save_live_complete_precopy_end handler might be used to mark the
stream boundary after performing the asynchronous transmission.
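As a rough sketch of how a device might wire these up (everything
prefixed my_dev_ is a hypothetical name, as is MY_DEV_BOUNDARY_MAGIC;
this is not taken from this series):

static int my_dev_complete_precopy_begin(QEMUFile *f,
                                         char *idstr, uint32_t instance_id,
                                         void *opaque)
{
    /*
     * Synchronously mark the stream boundary in the main migration
     * stream before the asynchronous transmission threads start.
     */
    qemu_put_be64(f, MY_DEV_BOUNDARY_MAGIC);
    return 0;
}

static int my_dev_complete_precopy_end(QEMUFile *f, void *opaque)
{
    /* All asynchronous transmission has completed by this point. */
    qemu_put_be64(f, MY_DEV_BOUNDARY_MAGIC);
    return 0;
}

static const SaveVMHandlers my_dev_savevm_handlers = {
    .save_live_complete_precopy_begin = my_dev_complete_precopy_begin,
    .save_live_complete_precopy_end = my_dev_complete_precopy_end,
    /* ... remaining handlers ... */
};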
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
include/migration/register.h | 36 ++++++++++++++++++++++++++++++++++++
migration/savevm.c | 35 +++++++++++++++++++++++++++++++++++
2 files changed, 71 insertions(+)
diff --git a/include/migration/register.h b/include/migration/register.h
index f60e797894e5..9de123252edf 100644
--- a/include/migration/register.h
+++ b/include/migration/register.h
@@ -103,6 +103,42 @@ typedef struct SaveVMHandlers {
*/
int (*save_live_complete_precopy)(QEMUFile *f, void *opaque);
+ /**
+ * @save_live_complete_precopy_begin
+ *
+ * Called at the end of a precopy phase, before all
+ * @save_live_complete_precopy handlers and before launching
+ * all @save_live_complete_precopy_thread threads.
+ * The handler might, for example, mark the stream boundary before
+ * proceeding with asynchronous transmission of the remaining data via
+ * @save_live_complete_precopy_thread.
+ * When postcopy is enabled, devices that support postcopy will skip this step.
+ *
+ * @f: QEMUFile where the handler can synchronously send data before returning
+ * @idstr: this device section idstr
+ * @instance_id: this device section instance_id
+ * @opaque: data pointer passed to register_savevm_live()
+ *
+ * Returns zero to indicate success and negative for error
+ */
+ int (*save_live_complete_precopy_begin)(QEMUFile *f,
+ char *idstr, uint32_t instance_id,
+ void *opaque);
+ /**
+ * @save_live_complete_precopy_end
+ *
+ * Called at the end of a precopy phase, after @save_live_complete_precopy
+ * handlers and after all @save_live_complete_precopy_thread threads have
+ * finished. When postcopy is enabled, devices that support postcopy will
+ * skip this step.
+ *
+ * @f: QEMUFile where the handler can synchronously send data before returning
+ * @opaque: data pointer passed to register_savevm_live()
+ *
+ * Returns zero to indicate success and negative for error
+ */
+ int (*save_live_complete_precopy_end)(QEMUFile *f, void *opaque);
+
/* This runs both outside and inside the BQL. */
/**
diff --git a/migration/savevm.c b/migration/savevm.c
index 6bb404b9c86f..d43acbbf20cf 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -1496,6 +1496,27 @@ int qemu_savevm_state_complete_precopy_iterable(QEMUFile *f, bool in_postcopy)
SaveStateEntry *se;
int ret;
+ QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+ if (!se->ops || (in_postcopy && se->ops->has_postcopy &&
+ se->ops->has_postcopy(se->opaque)) ||
+ !se->ops->save_live_complete_precopy_begin) {
+ continue;
+ }
+
+ save_section_header(f, se, QEMU_VM_SECTION_END);
+
+ ret = se->ops->save_live_complete_precopy_begin(f,
+ se->idstr, se->instance_id,
+ se->opaque);
+
+ save_section_footer(f, se);
+
+ if (ret < 0) {
+ qemu_file_set_error(f, ret);
+ return -1;
+ }
+ }
+
QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
if (!se->ops ||
(in_postcopy && se->ops->has_postcopy &&
@@ -1527,6 +1548,20 @@ int qemu_savevm_state_complete_precopy_iterable(QEMUFile *f, bool in_postcopy)
end_ts_each - start_ts_each);
}
+ QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+ if (!se->ops || (in_postcopy && se->ops->has_postcopy &&
+ se->ops->has_postcopy(se->opaque)) ||
+ !se->ops->save_live_complete_precopy_end) {
+ continue;
+ }
+
+ ret = se->ops->save_live_complete_precopy_end(f, se->opaque);
+ if (ret < 0) {
+ qemu_file_set_error(f, ret);
+ return -1;
+ }
+ }
+
trace_vmstate_downtime_checkpoint("src-iterable-saved");
return 0;
* [PATCH v2 07/17] migration: Add qemu_loadvm_load_state_buffer() and its handler
2024-08-27 17:54 [PATCH v2 00/17] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
` (5 preceding siblings ...)
2024-08-27 17:54 ` [PATCH v2 06/17] migration: Add save_live_complete_precopy_{begin,end} handlers Maciej S. Szmigiero
@ 2024-08-27 17:54 ` Maciej S. Szmigiero
2024-08-30 19:05 ` Fabiano Rosas
2024-09-05 14:15 ` Avihai Horon
2024-08-27 17:54 ` [PATCH v2 08/17] migration: Add load_finish handler and associated functions Maciej S. Szmigiero
` (11 subsequent siblings)
18 siblings, 2 replies; 128+ messages in thread
From: Maciej S. Szmigiero @ 2024-08-27 17:54 UTC (permalink / raw)
To: Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P. Berrangé, Avihai Horon,
Joao Martins, qemu-devel
From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
qemu_loadvm_load_state_buffer() and its load_state_buffer
SaveVMHandler allow providing a device state buffer to an explicitly
specified device via its idstr and instance id.
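A hedged sketch of a matching device-side handler (MyDevState and
my_dev_queue_buffer are hypothetical; the real VFIO consumer arrives
later in this series):

static int my_dev_load_state_buffer(void *opaque, char *data,
                                    size_t data_size, Error **errp)
{
    MyDevState *s = opaque;

    if (data_size == 0) {
        error_setg(errp, "empty device state buffer");
        return -1;
    }

    /* Copy the data: the caller may free @data after we return. */
    my_dev_queue_buffer(s, g_memdup2(data, data_size), data_size);
    return 0;
}

The handler would then be hooked up via the .load_state_buffer member
of the device's SaveVMHandlers.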
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
include/migration/register.h | 15 +++++++++++++++
migration/savevm.c | 25 +++++++++++++++++++++++++
migration/savevm.h | 3 +++
3 files changed, 43 insertions(+)
diff --git a/include/migration/register.h b/include/migration/register.h
index 9de123252edf..4a578f140713 100644
--- a/include/migration/register.h
+++ b/include/migration/register.h
@@ -263,6 +263,21 @@ typedef struct SaveVMHandlers {
*/
int (*load_state)(QEMUFile *f, void *opaque, int version_id);
+ /**
+ * @load_state_buffer
+ *
+ * Load device state buffer provided to qemu_loadvm_load_state_buffer().
+ *
+ * @opaque: data pointer passed to register_savevm_live()
+ * @data: the data buffer to load
+ * @data_size: the data length in buffer
+ * @errp: pointer to Error*, to store an error if it happens.
+ *
+ * Returns zero to indicate success and negative for error
+ */
+ int (*load_state_buffer)(void *opaque, char *data, size_t data_size,
+ Error **errp);
+
/**
* @load_setup
*
diff --git a/migration/savevm.c b/migration/savevm.c
index d43acbbf20cf..3fde5ca8c26b 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -3101,6 +3101,31 @@ int qemu_loadvm_approve_switchover(void)
return migrate_send_rp_switchover_ack(mis);
}
+int qemu_loadvm_load_state_buffer(const char *idstr, uint32_t instance_id,
+ char *buf, size_t len, Error **errp)
+{
+ SaveStateEntry *se;
+
+ se = find_se(idstr, instance_id);
+ if (!se) {
+ error_setg(errp, "Unknown idstr %s or instance id %u for load state buffer",
+ idstr, instance_id);
+ return -1;
+ }
+
+ if (!se->ops || !se->ops->load_state_buffer) {
+ error_setg(errp, "idstr %s / instance %u has no load state buffer operation",
+ idstr, instance_id);
+ return -1;
+ }
+
+ if (se->ops->load_state_buffer(se->opaque, buf, len, errp) != 0) {
+ return -1;
+ }
+
+ return 0;
+}
+
bool save_snapshot(const char *name, bool overwrite, const char *vmstate,
bool has_devices, strList *devices, Error **errp)
{
diff --git a/migration/savevm.h b/migration/savevm.h
index 9ec96a995c93..d388f1bfca98 100644
--- a/migration/savevm.h
+++ b/migration/savevm.h
@@ -70,4 +70,7 @@ int qemu_loadvm_approve_switchover(void);
int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
bool in_postcopy, bool inactivate_disks);
+int qemu_loadvm_load_state_buffer(const char *idstr, uint32_t instance_id,
+ char *buf, size_t len, Error **errp);
+
#endif
* [PATCH v2 08/17] migration: Add load_finish handler and associated functions
2024-08-27 17:54 [PATCH v2 00/17] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
` (6 preceding siblings ...)
2024-08-27 17:54 ` [PATCH v2 07/17] migration: Add qemu_loadvm_load_state_buffer() and its handler Maciej S. Szmigiero
@ 2024-08-27 17:54 ` Maciej S. Szmigiero
2024-08-30 19:28 ` Fabiano Rosas
` (2 more replies)
2024-08-27 17:54 ` [PATCH v2 09/17] migration/multifd: Device state transfer support - receive side Maciej S. Szmigiero
` (10 subsequent siblings)
18 siblings, 3 replies; 128+ messages in thread
From: Maciej S. Szmigiero @ 2024-08-27 17:54 UTC (permalink / raw)
To: Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P. Berrangé, Avihai Horon,
Joao Martins, qemu-devel
From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
The load_finish SaveVMHandler allows the migration code to poll whether
a device-specific asynchronous device state loading operation has finished.
In order to avoid calling this handler needlessly, the device is supposed
to notify the migration code of its possible readiness via a call to
qemu_loadvm_load_finish_ready_broadcast() while holding
qemu_loadvm_load_finish_ready_lock.
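A hedged sketch of both sides of this contract (MyDevState and its
load_done field are hypothetical names):

/* Called from the device's own loading thread once it is done. */
static void my_dev_signal_load_done(MyDevState *s)
{
    qemu_loadvm_load_finish_ready_lock();
    s->load_done = true;
    qemu_loadvm_load_finish_ready_broadcast();
    qemu_loadvm_load_finish_ready_unlock();
}

/*
 * The load_finish SaveVMHandler; the migration code calls it with
 * qemu_loadvm_load_finish_ready_lock already held.
 */
static int my_dev_load_finish(void *opaque, bool *is_finished,
                              Error **errp)
{
    MyDevState *s = opaque;

    *is_finished = s->load_done;
    return 0;
}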
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
include/migration/register.h | 21 +++++++++++++++
migration/migration.c | 6 +++++
migration/migration.h | 3 +++
migration/savevm.c | 52 ++++++++++++++++++++++++++++++++++++
migration/savevm.h | 4 +++
5 files changed, 86 insertions(+)
diff --git a/include/migration/register.h b/include/migration/register.h
index 4a578f140713..44d8cf5192ae 100644
--- a/include/migration/register.h
+++ b/include/migration/register.h
@@ -278,6 +278,27 @@ typedef struct SaveVMHandlers {
int (*load_state_buffer)(void *opaque, char *data, size_t data_size,
Error **errp);
+ /**
+ * @load_finish
+ *
+ * Poll whether all asynchronous device state loading had finished.
+ * Not called on the load failure path.
+ *
+ * Called while holding the qemu_loadvm_load_finish_ready_lock.
+ *
+ * If this method signals "not ready" then it might not be called
+ * again until qemu_loadvm_load_finish_ready_broadcast() is invoked
+ * while holding qemu_loadvm_load_finish_ready_lock.
+ *
+ * @opaque: data pointer passed to register_savevm_live()
+ * @is_finished: whether the loading had finished (output parameter)
+ * @errp: pointer to Error*, to store an error if it happens.
+ *
+ * Returns zero to indicate success and negative for error
+ * It's not an error that the loading still hasn't finished.
+ */
+ int (*load_finish)(void *opaque, bool *is_finished, Error **errp);
+
/**
* @load_setup
*
diff --git a/migration/migration.c b/migration/migration.c
index 3dea06d57732..d61e7b055e07 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -259,6 +259,9 @@ void migration_object_init(void)
current_incoming->exit_on_error = INMIGRATE_DEFAULT_EXIT_ON_ERROR;
+ qemu_mutex_init(&current_incoming->load_finish_ready_mutex);
+ qemu_cond_init(&current_incoming->load_finish_ready_cond);
+
migration_object_check(current_migration, &error_fatal);
ram_mig_init();
@@ -410,6 +413,9 @@ void migration_incoming_state_destroy(void)
mis->postcopy_qemufile_dst = NULL;
}
+ qemu_mutex_destroy(&mis->load_finish_ready_mutex);
+ qemu_cond_destroy(&mis->load_finish_ready_cond);
+
yank_unregister_instance(MIGRATION_YANK_INSTANCE);
}
diff --git a/migration/migration.h b/migration/migration.h
index 38aa1402d516..4e2443e6c8ec 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -230,6 +230,9 @@ struct MigrationIncomingState {
/* Do exit on incoming migration failure */
bool exit_on_error;
+
+ QemuCond load_finish_ready_cond;
+ QemuMutex load_finish_ready_mutex;
};
MigrationIncomingState *migration_incoming_get_current(void);
diff --git a/migration/savevm.c b/migration/savevm.c
index 3fde5ca8c26b..33c9200d1e78 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -3022,6 +3022,37 @@ int qemu_loadvm_state(QEMUFile *f)
return ret;
}
+ qemu_loadvm_load_finish_ready_lock();
+ while (!ret) { /* Don't call load_finish() handlers on the load failure path */
+ bool all_ready = true;
+ SaveStateEntry *se = NULL;
+
+ QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+ bool this_ready;
+
+ if (!se->ops || !se->ops->load_finish) {
+ continue;
+ }
+
+ ret = se->ops->load_finish(se->opaque, &this_ready, &local_err);
+ if (ret) {
+ error_report_err(local_err);
+
+ qemu_loadvm_load_finish_ready_unlock();
+ return -EINVAL;
+ } else if (!this_ready) {
+ all_ready = false;
+ }
+ }
+
+ if (all_ready) {
+ break;
+ }
+
+ qemu_cond_wait(&mis->load_finish_ready_cond, &mis->load_finish_ready_mutex);
+ }
+ qemu_loadvm_load_finish_ready_unlock();
+
if (ret == 0) {
ret = qemu_file_get_error(f);
}
@@ -3126,6 +3157,27 @@ int qemu_loadvm_load_state_buffer(const char *idstr, uint32_t instance_id,
return 0;
}
+void qemu_loadvm_load_finish_ready_lock(void)
+{
+ MigrationIncomingState *mis = migration_incoming_get_current();
+
+ qemu_mutex_lock(&mis->load_finish_ready_mutex);
+}
+
+void qemu_loadvm_load_finish_ready_unlock(void)
+{
+ MigrationIncomingState *mis = migration_incoming_get_current();
+
+ qemu_mutex_unlock(&mis->load_finish_ready_mutex);
+}
+
+void qemu_loadvm_load_finish_ready_broadcast(void)
+{
+ MigrationIncomingState *mis = migration_incoming_get_current();
+
+ qemu_cond_broadcast(&mis->load_finish_ready_cond);
+}
+
bool save_snapshot(const char *name, bool overwrite, const char *vmstate,
bool has_devices, strList *devices, Error **errp)
{
diff --git a/migration/savevm.h b/migration/savevm.h
index d388f1bfca98..69ae22cded7a 100644
--- a/migration/savevm.h
+++ b/migration/savevm.h
@@ -73,4 +73,8 @@ int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
int qemu_loadvm_load_state_buffer(const char *idstr, uint32_t instance_id,
char *buf, size_t len, Error **errp);
+void qemu_loadvm_load_finish_ready_lock(void);
+void qemu_loadvm_load_finish_ready_unlock(void);
+void qemu_loadvm_load_finish_ready_broadcast(void);
+
#endif
* [PATCH v2 09/17] migration/multifd: Device state transfer support - receive side
2024-08-27 17:54 [PATCH v2 00/17] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
` (7 preceding siblings ...)
2024-08-27 17:54 ` [PATCH v2 08/17] migration: Add load_finish handler and associated functions Maciej S. Szmigiero
@ 2024-08-27 17:54 ` Maciej S. Szmigiero
2024-08-30 20:22 ` Fabiano Rosas
2024-09-05 16:47 ` Avihai Horon
2024-08-27 17:54 ` [PATCH v2 10/17] migration/multifd: Convert multifd_send()::next_channel to atomic Maciej S. Szmigiero
` (9 subsequent siblings)
18 siblings, 2 replies; 128+ messages in thread
From: Maciej S. Szmigiero @ 2024-08-27 17:54 UTC (permalink / raw)
To: Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P. Berrangé, Avihai Horon,
Joao Martins, qemu-devel
From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
Add basic support for receiving device state via multifd channels -
channels that are shared with RAM transfers.
To differentiate between a device state packet and a RAM packet, the
packet header is read first.
Depending on whether the MULTIFD_FLAG_DEVICE_STATE flag is present in
the packet header, either device state (MultiFDPacketDeviceState_t) or
RAM data (the existing MultiFDPacket_t) is then read.
The received device state data is provided to the
qemu_loadvm_load_state_buffer() function for processing in the
device's load_state_buffer handler.
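Condensed, the receive-side dispatch added below works like this (a
sketch with error and EOF handling elided, not a drop-in replacement):

MultiFDPacketHdr_t hdr;
uint8_t *pkt_buf;
size_t pkt_len;

/* the shared header tells us which packet type follows */
qio_channel_read_all_eof(p->c, (void *)&hdr, sizeof(hdr), &local_err);
multifd_recv_unfill_packet_header(p, &hdr, &local_err);

if (p->flags & MULTIFD_FLAG_DEVICE_STATE) {
    /* the rest of MultiFDPacketDeviceState_t follows the header */
    pkt_buf = (uint8_t *)p->packet_dev_state + sizeof(hdr);
    pkt_len = sizeof(*p->packet_dev_state) - sizeof(hdr);
} else {
    /* the rest of the existing RAM MultiFDPacket_t follows it */
    pkt_buf = (uint8_t *)p->packet + sizeof(hdr);
    pkt_len = p->packet_len - sizeof(hdr);
}
qio_channel_read_all_eof(p->c, (char *)pkt_buf, pkt_len, &local_err);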
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
migration/multifd.c | 127 +++++++++++++++++++++++++++++++++++++-------
migration/multifd.h | 31 ++++++++++-
2 files changed, 138 insertions(+), 20 deletions(-)
diff --git a/migration/multifd.c b/migration/multifd.c
index b06a9fab500e..d5a8e5a9c9b5 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -21,6 +21,7 @@
#include "file.h"
#include "migration.h"
#include "migration-stats.h"
+#include "savevm.h"
#include "socket.h"
#include "tls.h"
#include "qemu-file.h"
@@ -209,10 +210,10 @@ void multifd_send_fill_packet(MultiFDSendParams *p)
memset(packet, 0, p->packet_len);
- packet->magic = cpu_to_be32(MULTIFD_MAGIC);
- packet->version = cpu_to_be32(MULTIFD_VERSION);
+ packet->hdr.magic = cpu_to_be32(MULTIFD_MAGIC);
+ packet->hdr.version = cpu_to_be32(MULTIFD_VERSION);
- packet->flags = cpu_to_be32(p->flags);
+ packet->hdr.flags = cpu_to_be32(p->flags);
packet->next_packet_size = cpu_to_be32(p->next_packet_size);
packet_num = qatomic_fetch_inc(&multifd_send_state->packet_num);
@@ -228,31 +229,49 @@ void multifd_send_fill_packet(MultiFDSendParams *p)
p->flags, p->next_packet_size);
}
-static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error **errp)
+static int multifd_recv_unfill_packet_header(MultiFDRecvParams *p,
+ MultiFDPacketHdr_t *hdr,
+ Error **errp)
{
- MultiFDPacket_t *packet = p->packet;
- int ret = 0;
-
- packet->magic = be32_to_cpu(packet->magic);
- if (packet->magic != MULTIFD_MAGIC) {
+ hdr->magic = be32_to_cpu(hdr->magic);
+ if (hdr->magic != MULTIFD_MAGIC) {
error_setg(errp, "multifd: received packet "
"magic %x and expected magic %x",
- packet->magic, MULTIFD_MAGIC);
+ hdr->magic, MULTIFD_MAGIC);
return -1;
}
- packet->version = be32_to_cpu(packet->version);
- if (packet->version != MULTIFD_VERSION) {
+ hdr->version = be32_to_cpu(hdr->version);
+ if (hdr->version != MULTIFD_VERSION) {
error_setg(errp, "multifd: received packet "
"version %u and expected version %u",
- packet->version, MULTIFD_VERSION);
+ hdr->version, MULTIFD_VERSION);
return -1;
}
- p->flags = be32_to_cpu(packet->flags);
+ p->flags = be32_to_cpu(hdr->flags);
+
+ return 0;
+}
+
+static int multifd_recv_unfill_packet_device_state(MultiFDRecvParams *p,
+ Error **errp)
+{
+ MultiFDPacketDeviceState_t *packet = p->packet_dev_state;
+
+ packet->instance_id = be32_to_cpu(packet->instance_id);
+ p->next_packet_size = be32_to_cpu(packet->next_packet_size);
+
+ return 0;
+}
+
+static int multifd_recv_unfill_packet_ram(MultiFDRecvParams *p, Error **errp)
+{
+ MultiFDPacket_t *packet = p->packet;
+ int ret = 0;
+
p->next_packet_size = be32_to_cpu(packet->next_packet_size);
p->packet_num = be64_to_cpu(packet->packet_num);
- p->packets_recved++;
if (!(p->flags & MULTIFD_FLAG_SYNC)) {
ret = multifd_ram_unfill_packet(p, errp);
@@ -264,6 +283,19 @@ static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error **errp)
return ret;
}
+static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error **errp)
+{
+ p->packets_recved++;
+
+ if (p->flags & MULTIFD_FLAG_DEVICE_STATE) {
+ return multifd_recv_unfill_packet_device_state(p, errp);
+ } else {
+ return multifd_recv_unfill_packet_ram(p, errp);
+ }
+
+ g_assert_not_reached();
+}
+
static bool multifd_send_should_exit(void)
{
return qatomic_read(&multifd_send_state->exiting);
@@ -1014,6 +1046,7 @@ static void multifd_recv_cleanup_channel(MultiFDRecvParams *p)
p->packet_len = 0;
g_free(p->packet);
p->packet = NULL;
+ g_clear_pointer(&p->packet_dev_state, g_free);
g_free(p->normal);
p->normal = NULL;
g_free(p->zero);
@@ -1126,8 +1159,13 @@ static void *multifd_recv_thread(void *opaque)
rcu_register_thread();
while (true) {
+ MultiFDPacketHdr_t hdr;
uint32_t flags = 0;
+ bool is_device_state = false;
bool has_data = false;
+ uint8_t *pkt_buf;
+ size_t pkt_len;
+
p->normal_num = 0;
if (use_packets) {
@@ -1135,8 +1173,28 @@ static void *multifd_recv_thread(void *opaque)
break;
}
- ret = qio_channel_read_all_eof(p->c, (void *)p->packet,
- p->packet_len, &local_err);
+ ret = qio_channel_read_all_eof(p->c, (void *)&hdr,
+ sizeof(hdr), &local_err);
+ if (ret == 0 || ret == -1) { /* 0: EOF -1: Error */
+ break;
+ }
+
+ ret = multifd_recv_unfill_packet_header(p, &hdr, &local_err);
+ if (ret) {
+ break;
+ }
+
+ is_device_state = p->flags & MULTIFD_FLAG_DEVICE_STATE;
+ if (is_device_state) {
+ pkt_buf = (uint8_t *)p->packet_dev_state + sizeof(hdr);
+ pkt_len = sizeof(*p->packet_dev_state) - sizeof(hdr);
+ } else {
+ pkt_buf = (uint8_t *)p->packet + sizeof(hdr);
+ pkt_len = p->packet_len - sizeof(hdr);
+ }
+
+ ret = qio_channel_read_all_eof(p->c, (char *)pkt_buf, pkt_len,
+ &local_err);
if (ret == 0 || ret == -1) { /* 0: EOF -1: Error */
break;
}
@@ -1181,8 +1239,33 @@ static void *multifd_recv_thread(void *opaque)
has_data = !!p->data->size;
}
- if (has_data) {
- ret = multifd_recv_state->ops->recv(p, &local_err);
+ if (!is_device_state) {
+ if (has_data) {
+ ret = multifd_recv_state->ops->recv(p, &local_err);
+ if (ret != 0) {
+ break;
+ }
+ }
+ } else {
+ g_autofree char *idstr = NULL;
+ g_autofree char *dev_state_buf = NULL;
+
+ assert(use_packets);
+
+ if (p->next_packet_size > 0) {
+ dev_state_buf = g_malloc(p->next_packet_size);
+
+ ret = qio_channel_read_all(p->c, dev_state_buf, p->next_packet_size, &local_err);
+ if (ret != 0) {
+ break;
+ }
+ }
+
+ idstr = g_strndup(p->packet_dev_state->idstr, sizeof(p->packet_dev_state->idstr));
+ ret = qemu_loadvm_load_state_buffer(idstr,
+ p->packet_dev_state->instance_id,
+ dev_state_buf, p->next_packet_size,
+ &local_err);
if (ret != 0) {
break;
}
@@ -1190,6 +1273,11 @@ static void *multifd_recv_thread(void *opaque)
if (use_packets) {
if (flags & MULTIFD_FLAG_SYNC) {
+ if (is_device_state) {
+ error_setg(&local_err, "multifd: received SYNC device state packet");
+ break;
+ }
+
qemu_sem_post(&multifd_recv_state->sem_sync);
qemu_sem_wait(&p->sem_sync);
}
@@ -1258,6 +1346,7 @@ int multifd_recv_setup(Error **errp)
p->packet_len = sizeof(MultiFDPacket_t)
+ sizeof(uint64_t) * page_count;
p->packet = g_malloc0(p->packet_len);
+ p->packet_dev_state = g_malloc0(sizeof(*p->packet_dev_state));
}
p->name = g_strdup_printf("mig/dst/recv_%d", i);
p->normal = g_new0(ram_addr_t, page_count);
diff --git a/migration/multifd.h b/migration/multifd.h
index a3e35196d179..a8f3e4838c01 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -45,6 +45,12 @@ MultiFDRecvData *multifd_get_recv_data(void);
#define MULTIFD_FLAG_QPL (4 << 1)
#define MULTIFD_FLAG_UADK (8 << 1)
+/*
+ * If set it means that this packet contains device state
+ * (MultiFDPacketDeviceState_t), not RAM data (MultiFDPacket_t).
+ */
+#define MULTIFD_FLAG_DEVICE_STATE (1 << 4)
+
/* This value needs to be a multiple of qemu_target_page_size() */
#define MULTIFD_PACKET_SIZE (512 * 1024)
@@ -52,6 +58,11 @@ typedef struct {
uint32_t magic;
uint32_t version;
uint32_t flags;
+} __attribute__((packed)) MultiFDPacketHdr_t;
+
+typedef struct {
+ MultiFDPacketHdr_t hdr;
+
/* maximum number of allocated pages */
uint32_t pages_alloc;
/* non zero pages */
@@ -72,6 +83,16 @@ typedef struct {
uint64_t offset[];
} __attribute__((packed)) MultiFDPacket_t;
+typedef struct {
+ MultiFDPacketHdr_t hdr;
+
+ char idstr[256] QEMU_NONSTRING;
+ uint32_t instance_id;
+
+ /* size of the next packet that contains the actual data */
+ uint32_t next_packet_size;
+} __attribute__((packed)) MultiFDPacketDeviceState_t;
+
typedef struct {
/* number of used pages */
uint32_t num;
@@ -89,6 +110,13 @@ struct MultiFDRecvData {
off_t file_offset;
};
+typedef struct {
+ char *idstr;
+ uint32_t instance_id;
+ char *buf;
+ size_t buf_len;
+} MultiFDDeviceState_t;
+
typedef enum {
MULTIFD_PAYLOAD_NONE,
MULTIFD_PAYLOAD_RAM,
@@ -204,8 +232,9 @@ typedef struct {
/* thread local variables. No locking required */
- /* pointer to the packet */
+ /* pointers to the possible packet types */
MultiFDPacket_t *packet;
+ MultiFDPacketDeviceState_t *packet_dev_state;
/* size of the next packet that contains pages */
uint32_t next_packet_size;
/* packets received through this channel */
* [PATCH v2 10/17] migration/multifd: Convert multifd_send()::next_channel to atomic
2024-08-27 17:54 [PATCH v2 00/17] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
` (8 preceding siblings ...)
2024-08-27 17:54 ` [PATCH v2 09/17] migration/multifd: Device state transfer support - receive side Maciej S. Szmigiero
@ 2024-08-27 17:54 ` Maciej S. Szmigiero
2024-08-30 18:13 ` Fabiano Rosas
2024-09-10 14:13 ` Peter Xu
2024-08-27 17:54 ` [PATCH v2 11/17] migration/multifd: Add an explicit MultiFDSendData destructor Maciej S. Szmigiero
` (8 subsequent siblings)
18 siblings, 2 replies; 128+ messages in thread
From: Maciej S. Szmigiero @ 2024-08-27 17:54 UTC (permalink / raw)
To: Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P. Berrangé, Avihai Horon,
Joao Martins, qemu-devel
From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
This is necessary for multifd_send() to be safely callable from
multiple threads.
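The core of the change is a lockless round-robin claim of the channel
index; extracted from the diff below, the claim loop looks like this
(a sketch, without the pending_job check and exit handling):

int i, i_next;

do {
    i = qatomic_load_acquire(&next_channel);
    i_next = (i + 1) % migrate_multifd_channels();
    /*
     * Only one concurrent caller can advance next_channel from @i;
     * a caller that loses the race simply retries with the new value.
     */
} while (qatomic_cmpxchg(&next_channel, i, i_next) != i);

/* channel index @i is now exclusively claimed by this caller */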
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
migration/multifd.c | 24 ++++++++++++++++++------
1 file changed, 18 insertions(+), 6 deletions(-)
diff --git a/migration/multifd.c b/migration/multifd.c
index d5a8e5a9c9b5..b25789dde0b3 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -343,26 +343,38 @@ bool multifd_send(MultiFDSendData **send_data)
return false;
}
- /* We wait here, until at least one channel is ready */
- qemu_sem_wait(&multifd_send_state->channels_ready);
-
/*
* next_channel can remain from a previous migration that was
* using more channels, so ensure it doesn't overflow if the
* limit is lower now.
*/
- next_channel %= migrate_multifd_channels();
- for (i = next_channel;; i = (i + 1) % migrate_multifd_channels()) {
+ i = qatomic_load_acquire(&next_channel);
+ if (unlikely(i >= migrate_multifd_channels())) {
+ qatomic_cmpxchg(&next_channel, i, 0);
+ }
+
+ /* We wait here, until at least one channel is ready */
+ qemu_sem_wait(&multifd_send_state->channels_ready);
+
+ while (true) {
+ int i_next;
+
if (multifd_send_should_exit()) {
return false;
}
+
+ i = qatomic_load_acquire(&next_channel);
+ i_next = (i + 1) % migrate_multifd_channels();
+ if (qatomic_cmpxchg(&next_channel, i, i_next) != i) {
+ continue;
+ }
+
p = &multifd_send_state->params[i];
/*
* Lockless read to p->pending_job is safe, because only multifd
* sender thread can clear it.
*/
if (qatomic_read(&p->pending_job) == false) {
- next_channel = (i + 1) % migrate_multifd_channels();
break;
}
}
* [PATCH v2 11/17] migration/multifd: Add an explicit MultiFDSendData destructor
2024-08-27 17:54 [PATCH v2 00/17] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
` (9 preceding siblings ...)
2024-08-27 17:54 ` [PATCH v2 10/17] migration/multifd: Convert multifd_send()::next_channel to atomic Maciej S. Szmigiero
@ 2024-08-27 17:54 ` Maciej S. Szmigiero
2024-08-30 13:12 ` Fabiano Rosas
2024-08-27 17:54 ` [PATCH v2 12/17] migration/multifd: Device state transfer support - send side Maciej S. Szmigiero
` (7 subsequent siblings)
18 siblings, 1 reply; 128+ messages in thread
From: Maciej S. Szmigiero @ 2024-08-27 17:54 UTC (permalink / raw)
To: Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
This way, if there are fields that need explicit disposal (for example,
attached buffers) they will be handled appropriately.
Add a related assert to multifd_set_payload_type() in order to make sure
that this function is only used to fill a previously empty MultiFDSendData
with some payload, not the other way around.
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
migration/multifd-nocomp.c | 3 +--
migration/multifd.c | 31 ++++++++++++++++++++++++++++---
migration/multifd.h | 5 +++++
3 files changed, 34 insertions(+), 5 deletions(-)
diff --git a/migration/multifd-nocomp.c b/migration/multifd-nocomp.c
index 53ea9f9c8371..39eb77c9b3b7 100644
--- a/migration/multifd-nocomp.c
+++ b/migration/multifd-nocomp.c
@@ -40,8 +40,7 @@ void multifd_ram_save_setup(void)
void multifd_ram_save_cleanup(void)
{
- g_free(multifd_ram_send);
- multifd_ram_send = NULL;
+ g_clear_pointer(&multifd_ram_send, multifd_send_data_free);
}
static void multifd_set_file_bitmap(MultiFDSendParams *p)
diff --git a/migration/multifd.c b/migration/multifd.c
index b25789dde0b3..a74e8a5cc891 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -119,6 +119,32 @@ MultiFDSendData *multifd_send_data_alloc(void)
return g_malloc0(size_minus_payload + max_payload_size);
}
+void multifd_send_data_clear(MultiFDSendData *data)
+{
+ if (multifd_payload_empty(data)) {
+ return;
+ }
+
+ switch (data->type) {
+ default:
+ /* Nothing to do */
+ break;
+ }
+
+ data->type = MULTIFD_PAYLOAD_NONE;
+}
+
+void multifd_send_data_free(MultiFDSendData *data)
+{
+ if (!data) {
+ return;
+ }
+
+ multifd_send_data_clear(data);
+
+ g_free(data);
+}
+
static bool multifd_use_packets(void)
{
return !migrate_mapped_ram();
@@ -506,8 +532,7 @@ static bool multifd_send_cleanup_channel(MultiFDSendParams *p, Error **errp)
qemu_sem_destroy(&p->sem_sync);
g_free(p->name);
p->name = NULL;
- g_free(p->data);
- p->data = NULL;
+ g_clear_pointer(&p->data, multifd_send_data_free);
p->packet_len = 0;
g_free(p->packet);
p->packet = NULL;
@@ -671,7 +696,7 @@ static void *multifd_send_thread(void *opaque)
p->next_packet_size + p->packet_len);
p->next_packet_size = 0;
- multifd_set_payload_type(p->data, MULTIFD_PAYLOAD_NONE);
+ multifd_send_data_clear(p->data);
/*
* Making sure p->data is published before saying "we're
diff --git a/migration/multifd.h b/migration/multifd.h
index a8f3e4838c01..a0853622153e 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -139,6 +139,9 @@ static inline bool multifd_payload_empty(MultiFDSendData *data)
static inline void multifd_set_payload_type(MultiFDSendData *data,
MultiFDPayloadType type)
{
+ assert(multifd_payload_empty(data));
+ assert(type != MULTIFD_PAYLOAD_NONE);
+
data->type = type;
}
@@ -288,6 +291,8 @@ static inline void multifd_send_prepare_header(MultiFDSendParams *p)
void multifd_channel_connect(MultiFDSendParams *p, QIOChannel *ioc);
bool multifd_send(MultiFDSendData **send_data);
MultiFDSendData *multifd_send_data_alloc(void);
+void multifd_send_data_clear(MultiFDSendData *data);
+void multifd_send_data_free(MultiFDSendData *data);
static inline uint32_t multifd_ram_page_size(void)
{
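As a rough stand-alone illustration of the clear-versus-free split introduced
here (the types below are simplified stand-ins, not the actual QEMU
structures):

    #include <stdlib.h>
    #include <string.h>

    typedef enum { PAYLOAD_NONE, PAYLOAD_DEVICE_STATE } PayloadType;

    typedef struct {
        PayloadType type;
        char *buf;          /* owned when type == PAYLOAD_DEVICE_STATE */
    } SendData;

    /* Dispose of payload-owned resources but keep the container reusable. */
    static void send_data_clear(SendData *d)
    {
        if (d->type == PAYLOAD_NONE) {
            return;
        }
        if (d->type == PAYLOAD_DEVICE_STATE) {
            free(d->buf);
            d->buf = NULL;
        }
        d->type = PAYLOAD_NONE;
    }

    /* Full destructor: clear the payload, then free the container itself. */
    static void send_data_free(SendData *d)
    {
        if (!d) {
            return;
        }
        send_data_clear(d);
        free(d);
    }

    int main(void)
    {
        SendData *d = calloc(1, sizeof(*d));
        d->type = PAYLOAD_DEVICE_STATE;
        d->buf = strdup("some device state");
        send_data_free(d);  /* frees both the buffer and the container */
        return 0;
    }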
* [PATCH v2 12/17] migration/multifd: Device state transfer support - send side
2024-08-27 17:54 [PATCH v2 00/17] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
` (10 preceding siblings ...)
2024-08-27 17:54 ` [PATCH v2 11/17] migration/multifd: Add an explicit MultiFDSendData destructor Maciej S. Szmigiero
@ 2024-08-27 17:54 ` Maciej S. Szmigiero
2024-08-29 0:41 ` Fabiano Rosas
2024-09-10 16:06 ` Peter Xu
2024-08-27 17:54 ` [PATCH v2 13/17] migration/multifd: Add migration_has_device_state_support() Maciej S. Szmigiero
` (6 subsequent siblings)
18 siblings, 2 replies; 128+ messages in thread
From: Maciej S. Szmigiero @ 2024-08-27 17:54 UTC (permalink / raw)
To: Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
A new function multifd_queue_device_state() is provided for a device to queue
its state for transmission via a multifd channel.
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
include/migration/misc.h | 4 ++
migration/meson.build | 1 +
migration/multifd-device-state.c | 99 ++++++++++++++++++++++++++++++++
migration/multifd-nocomp.c | 6 +-
migration/multifd-qpl.c | 2 +-
migration/multifd-uadk.c | 2 +-
migration/multifd-zlib.c | 2 +-
migration/multifd-zstd.c | 2 +-
migration/multifd.c | 65 +++++++++++++++------
migration/multifd.h | 29 +++++++++-
10 files changed, 184 insertions(+), 28 deletions(-)
create mode 100644 migration/multifd-device-state.c
diff --git a/include/migration/misc.h b/include/migration/misc.h
index bfadc5613bac..7266b1b77d1f 100644
--- a/include/migration/misc.h
+++ b/include/migration/misc.h
@@ -111,4 +111,8 @@ bool migration_in_bg_snapshot(void);
/* migration/block-dirty-bitmap.c */
void dirty_bitmap_mig_init(void);
+/* migration/multifd-device-state.c */
+bool multifd_queue_device_state(char *idstr, uint32_t instance_id,
+ char *data, size_t len);
+
#endif
diff --git a/migration/meson.build b/migration/meson.build
index 77f3abf08eb1..00853595894f 100644
--- a/migration/meson.build
+++ b/migration/meson.build
@@ -21,6 +21,7 @@ system_ss.add(files(
'migration-hmp-cmds.c',
'migration.c',
'multifd.c',
+ 'multifd-device-state.c',
'multifd-nocomp.c',
'multifd-zlib.c',
'multifd-zero-page.c',
diff --git a/migration/multifd-device-state.c b/migration/multifd-device-state.c
new file mode 100644
index 000000000000..c9b44f0b5ab9
--- /dev/null
+++ b/migration/multifd-device-state.c
@@ -0,0 +1,99 @@
+/*
+ * Multifd device state migration
+ *
+ * Copyright (C) 2024 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/lockable.h"
+#include "migration/misc.h"
+#include "multifd.h"
+
+static QemuMutex queue_job_mutex;
+
+static MultiFDSendData *device_state_send;
+
+size_t multifd_device_state_payload_size(void)
+{
+ return sizeof(MultiFDDeviceState_t);
+}
+
+void multifd_device_state_save_setup(void)
+{
+ qemu_mutex_init(&queue_job_mutex);
+
+ device_state_send = multifd_send_data_alloc();
+}
+
+void multifd_device_state_clear(MultiFDDeviceState_t *device_state)
+{
+ g_clear_pointer(&device_state->idstr, g_free);
+ g_clear_pointer(&device_state->buf, g_free);
+}
+
+void multifd_device_state_save_cleanup(void)
+{
+ g_clear_pointer(&device_state_send, multifd_send_data_free);
+
+ qemu_mutex_destroy(&queue_job_mutex);
+}
+
+static void multifd_device_state_fill_packet(MultiFDSendParams *p)
+{
+ MultiFDDeviceState_t *device_state = &p->data->u.device_state;
+ MultiFDPacketDeviceState_t *packet = p->packet_device_state;
+
+ packet->hdr.flags = cpu_to_be32(p->flags);
+ strncpy(packet->idstr, device_state->idstr, sizeof(packet->idstr));
+ packet->instance_id = cpu_to_be32(device_state->instance_id);
+ packet->next_packet_size = cpu_to_be32(p->next_packet_size);
+}
+
+void multifd_device_state_send_prepare(MultiFDSendParams *p)
+{
+ MultiFDDeviceState_t *device_state = &p->data->u.device_state;
+
+ assert(multifd_payload_device_state(p->data));
+
+ multifd_send_prepare_header_device_state(p);
+
+ assert(!(p->flags & MULTIFD_FLAG_SYNC));
+
+ p->next_packet_size = device_state->buf_len;
+ if (p->next_packet_size > 0) {
+ p->iov[p->iovs_num].iov_base = device_state->buf;
+ p->iov[p->iovs_num].iov_len = p->next_packet_size;
+ p->iovs_num++;
+ }
+
+ p->flags |= MULTIFD_FLAG_NOCOMP | MULTIFD_FLAG_DEVICE_STATE;
+
+ multifd_device_state_fill_packet(p);
+}
+
+bool multifd_queue_device_state(char *idstr, uint32_t instance_id,
+ char *data, size_t len)
+{
+ /* Device state submissions can come from multiple threads */
+ QEMU_LOCK_GUARD(&queue_job_mutex);
+ MultiFDDeviceState_t *device_state;
+
+ assert(multifd_payload_empty(device_state_send));
+
+ multifd_set_payload_type(device_state_send, MULTIFD_PAYLOAD_DEVICE_STATE);
+ device_state = &device_state_send->u.device_state;
+ device_state->idstr = g_strdup(idstr);
+ device_state->instance_id = instance_id;
+ device_state->buf = g_memdup2(data, len);
+ device_state->buf_len = len;
+
+ if (!multifd_send(&device_state_send)) {
+ multifd_send_data_clear(device_state_send);
+ return false;
+ }
+
+ return true;
+}
diff --git a/migration/multifd-nocomp.c b/migration/multifd-nocomp.c
index 39eb77c9b3b7..0b7b543f44db 100644
--- a/migration/multifd-nocomp.c
+++ b/migration/multifd-nocomp.c
@@ -116,13 +116,13 @@ static int multifd_nocomp_send_prepare(MultiFDSendParams *p, Error **errp)
* Only !zerocopy needs the header in IOV; zerocopy will
* send it separately.
*/
- multifd_send_prepare_header(p);
+ multifd_send_prepare_header_ram(p);
}
multifd_send_prepare_iovs(p);
p->flags |= MULTIFD_FLAG_NOCOMP;
- multifd_send_fill_packet(p);
+ multifd_send_fill_packet_ram(p);
if (use_zero_copy_send) {
/* Send header first, without zerocopy */
@@ -371,7 +371,7 @@ bool multifd_send_prepare_common(MultiFDSendParams *p)
return false;
}
- multifd_send_prepare_header(p);
+ multifd_send_prepare_header_ram(p);
return true;
}
diff --git a/migration/multifd-qpl.c b/migration/multifd-qpl.c
index 75041a4c4dfe..bd6b5b6a3868 100644
--- a/migration/multifd-qpl.c
+++ b/migration/multifd-qpl.c
@@ -490,7 +490,7 @@ static int multifd_qpl_send_prepare(MultiFDSendParams *p, Error **errp)
out:
p->flags |= MULTIFD_FLAG_QPL;
- multifd_send_fill_packet(p);
+ multifd_send_fill_packet_ram(p);
return 0;
}
diff --git a/migration/multifd-uadk.c b/migration/multifd-uadk.c
index db2549f59bfe..6e2d26010742 100644
--- a/migration/multifd-uadk.c
+++ b/migration/multifd-uadk.c
@@ -198,7 +198,7 @@ static int multifd_uadk_send_prepare(MultiFDSendParams *p, Error **errp)
}
out:
p->flags |= MULTIFD_FLAG_UADK;
- multifd_send_fill_packet(p);
+ multifd_send_fill_packet_ram(p);
return 0;
}
diff --git a/migration/multifd-zlib.c b/migration/multifd-zlib.c
index 6787538762d2..62a1fe59ad3e 100644
--- a/migration/multifd-zlib.c
+++ b/migration/multifd-zlib.c
@@ -156,7 +156,7 @@ static int multifd_zlib_send_prepare(MultiFDSendParams *p, Error **errp)
out:
p->flags |= MULTIFD_FLAG_ZLIB;
- multifd_send_fill_packet(p);
+ multifd_send_fill_packet_ram(p);
return 0;
}
diff --git a/migration/multifd-zstd.c b/migration/multifd-zstd.c
index 1576b1e2adc6..f98b07e7f9f5 100644
--- a/migration/multifd-zstd.c
+++ b/migration/multifd-zstd.c
@@ -143,7 +143,7 @@ static int multifd_zstd_send_prepare(MultiFDSendParams *p, Error **errp)
out:
p->flags |= MULTIFD_FLAG_ZSTD;
- multifd_send_fill_packet(p);
+ multifd_send_fill_packet_ram(p);
return 0;
}
diff --git a/migration/multifd.c b/migration/multifd.c
index a74e8a5cc891..bebe5b5a9b9c 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -12,6 +12,7 @@
#include "qemu/osdep.h"
#include "qemu/cutils.h"
+#include "qemu/iov.h"
#include "qemu/rcu.h"
#include "exec/target_page.h"
#include "sysemu/sysemu.h"
@@ -19,6 +20,7 @@
#include "qemu/error-report.h"
#include "qapi/error.h"
#include "file.h"
+#include "migration/misc.h"
#include "migration.h"
#include "migration-stats.h"
#include "savevm.h"
@@ -107,7 +109,9 @@ MultiFDSendData *multifd_send_data_alloc(void)
* added to the union in the future are larger than
* (MultiFDPages_t + flex array).
*/
- max_payload_size = MAX(multifd_ram_payload_size(), sizeof(MultiFDPayload));
+ max_payload_size = MAX(multifd_ram_payload_size(),
+ multifd_device_state_payload_size());
+ max_payload_size = MAX(max_payload_size, sizeof(MultiFDPayload));
/*
* Account for any holes the compiler might insert. We can't pack
@@ -126,6 +130,9 @@ void multifd_send_data_clear(MultiFDSendData *data)
}
switch (data->type) {
+ case MULTIFD_PAYLOAD_DEVICE_STATE:
+ multifd_device_state_clear(&data->u.device_state);
+ break;
default:
/* Nothing to do */
break;
@@ -228,7 +235,7 @@ static int multifd_recv_initial_packet(QIOChannel *c, Error **errp)
return msg.id;
}
-void multifd_send_fill_packet(MultiFDSendParams *p)
+void multifd_send_fill_packet_ram(MultiFDSendParams *p)
{
MultiFDPacket_t *packet = p->packet;
uint64_t packet_num;
@@ -397,20 +404,16 @@ bool multifd_send(MultiFDSendData **send_data)
p = &multifd_send_state->params[i];
/*
- * Lockless read to p->pending_job is safe, because only multifd
- * sender thread can clear it.
+ * Lockless RMW on p->pending_job_preparing is safe, because only multifd
+ * sender thread can clear it after it had seen p->pending_job being set.
+ *
+ * Pairs with qatomic_store_release() in multifd_send_thread().
*/
- if (qatomic_read(&p->pending_job) == false) {
+ if (qatomic_cmpxchg(&p->pending_job_preparing, false, true) == false) {
break;
}
}
- /*
- * Make sure we read p->pending_job before all the rest. Pairs with
- * qatomic_store_release() in multifd_send_thread().
- */
- smp_mb_acquire();
-
assert(multifd_payload_empty(p->data));
/*
@@ -534,6 +537,7 @@ static bool multifd_send_cleanup_channel(MultiFDSendParams *p, Error **errp)
p->name = NULL;
g_clear_pointer(&p->data, multifd_send_data_free);
p->packet_len = 0;
+ g_clear_pointer(&p->packet_device_state, g_free);
g_free(p->packet);
p->packet = NULL;
multifd_send_state->ops->send_cleanup(p, errp);
@@ -545,6 +549,7 @@ static void multifd_send_cleanup_state(void)
{
file_cleanup_outgoing_migration();
socket_cleanup_outgoing_migration();
+ multifd_device_state_save_cleanup();
qemu_sem_destroy(&multifd_send_state->channels_created);
qemu_sem_destroy(&multifd_send_state->channels_ready);
g_free(multifd_send_state->params);
@@ -670,19 +675,29 @@ static void *multifd_send_thread(void *opaque)
* qatomic_store_release() in multifd_send().
*/
if (qatomic_load_acquire(&p->pending_job)) {
+ bool is_device_state = multifd_payload_device_state(p->data);
+ size_t total_size;
+
p->flags = 0;
p->iovs_num = 0;
assert(!multifd_payload_empty(p->data));
- ret = multifd_send_state->ops->send_prepare(p, &local_err);
- if (ret != 0) {
- break;
+ if (is_device_state) {
+ multifd_device_state_send_prepare(p);
+ } else {
+ ret = multifd_send_state->ops->send_prepare(p, &local_err);
+ if (ret != 0) {
+ break;
+ }
}
if (migrate_mapped_ram()) {
+ assert(!is_device_state);
+
ret = file_write_ramblock_iov(p->c, p->iov, p->iovs_num,
&p->data->u.ram, &local_err);
} else {
+ total_size = iov_size(p->iov, p->iovs_num);
ret = qio_channel_writev_full_all(p->c, p->iov, p->iovs_num,
NULL, 0, p->write_flags,
&local_err);
@@ -692,18 +707,27 @@ static void *multifd_send_thread(void *opaque)
break;
}
- stat64_add(&mig_stats.multifd_bytes,
- p->next_packet_size + p->packet_len);
+ if (is_device_state) {
+ stat64_add(&mig_stats.multifd_bytes, total_size);
+ } else {
+ /*
+ * Can't just always add total_size since IOVs do not include
+ * packet header in the zerocopy RAM case.
+ */
+ stat64_add(&mig_stats.multifd_bytes,
+ p->next_packet_size + p->packet_len);
+ }
p->next_packet_size = 0;
multifd_send_data_clear(p->data);
/*
* Making sure p->data is published before saying "we're
- * free". Pairs with the smp_mb_acquire() in
+ * free". Pairs with the qatomic_cmpxchg() in
* multifd_send().
*/
qatomic_store_release(&p->pending_job, false);
+ qatomic_store_release(&p->pending_job_preparing, false);
} else {
/*
* If not a normal job, must be a sync request. Note that
@@ -714,7 +738,7 @@ static void *multifd_send_thread(void *opaque)
if (use_packets) {
p->flags = MULTIFD_FLAG_SYNC;
- multifd_send_fill_packet(p);
+ multifd_send_fill_packet_ram(p);
ret = qio_channel_write_all(p->c, (void *)p->packet,
p->packet_len, &local_err);
if (ret != 0) {
@@ -910,6 +934,9 @@ bool multifd_send_setup(void)
p->packet_len = sizeof(MultiFDPacket_t)
+ sizeof(uint64_t) * page_count;
p->packet = g_malloc0(p->packet_len);
+ p->packet_device_state = g_malloc0(sizeof(*p->packet_device_state));
+ p->packet_device_state->hdr.magic = cpu_to_be32(MULTIFD_MAGIC);
+ p->packet_device_state->hdr.version = cpu_to_be32(MULTIFD_VERSION);
}
p->name = g_strdup_printf("mig/src/send_%d", i);
p->write_flags = 0;
@@ -944,6 +971,8 @@ bool multifd_send_setup(void)
}
}
+ multifd_device_state_save_setup();
+
return true;
err:
diff --git a/migration/multifd.h b/migration/multifd.h
index a0853622153e..c15c83104c8b 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -120,10 +120,12 @@ typedef struct {
typedef enum {
MULTIFD_PAYLOAD_NONE,
MULTIFD_PAYLOAD_RAM,
+ MULTIFD_PAYLOAD_DEVICE_STATE,
} MultiFDPayloadType;
typedef union MultiFDPayload {
MultiFDPages_t ram;
+ MultiFDDeviceState_t device_state;
} MultiFDPayload;
struct MultiFDSendData {
@@ -136,6 +138,11 @@ static inline bool multifd_payload_empty(MultiFDSendData *data)
return data->type == MULTIFD_PAYLOAD_NONE;
}
+static inline bool multifd_payload_device_state(MultiFDSendData *data)
+{
+ return data->type == MULTIFD_PAYLOAD_DEVICE_STATE;
+}
+
static inline void multifd_set_payload_type(MultiFDSendData *data,
MultiFDPayloadType type)
{
@@ -182,13 +189,15 @@ typedef struct {
* cleared by the multifd sender threads.
*/
bool pending_job;
+ bool pending_job_preparing;
bool pending_sync;
MultiFDSendData *data;
/* thread local variables. No locking required */
- /* pointer to the packet */
+ /* pointers to the possible packet types */
MultiFDPacket_t *packet;
+ MultiFDPacketDeviceState_t *packet_device_state;
/* size of the next packet that contains pages */
uint32_t next_packet_size;
/* packets sent through this channel */
@@ -276,18 +285,25 @@ typedef struct {
} MultiFDMethods;
void multifd_register_ops(int method, MultiFDMethods *ops);
-void multifd_send_fill_packet(MultiFDSendParams *p);
+void multifd_send_fill_packet_ram(MultiFDSendParams *p);
bool multifd_send_prepare_common(MultiFDSendParams *p);
void multifd_send_zero_page_detect(MultiFDSendParams *p);
void multifd_recv_zero_page_process(MultiFDRecvParams *p);
-static inline void multifd_send_prepare_header(MultiFDSendParams *p)
+static inline void multifd_send_prepare_header_ram(MultiFDSendParams *p)
{
p->iov[0].iov_len = p->packet_len;
p->iov[0].iov_base = p->packet;
p->iovs_num++;
}
+static inline void multifd_send_prepare_header_device_state(MultiFDSendParams *p)
+{
+ p->iov[0].iov_len = sizeof(*p->packet_device_state);
+ p->iov[0].iov_base = p->packet_device_state;
+ p->iovs_num++;
+}
+
void multifd_channel_connect(MultiFDSendParams *p, QIOChannel *ioc);
bool multifd_send(MultiFDSendData **send_data);
MultiFDSendData *multifd_send_data_alloc(void);
@@ -310,4 +326,11 @@ int multifd_ram_flush_and_sync(void);
size_t multifd_ram_payload_size(void);
void multifd_ram_fill_packet(MultiFDSendParams *p);
int multifd_ram_unfill_packet(MultiFDRecvParams *p, Error **errp);
+
+size_t multifd_device_state_payload_size(void);
+void multifd_device_state_save_setup(void);
+void multifd_device_state_clear(MultiFDDeviceState_t *device_state);
+void multifd_device_state_save_cleanup(void);
+void multifd_device_state_send_prepare(MultiFDSendParams *p);
+
#endif
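To make the intended usage concrete, here is a hypothetical sender-side
sketch that splits a device state blob into fixed-size chunks and queues each
one; multifd_queue_device_state() is stubbed out here (and unlike the real
prototype it takes const pointers), so the idstr and chunk size are purely
illustrative:

    #include <inttypes.h>
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Stub standing in for the real multifd_queue_device_state(). */
    static bool multifd_queue_device_state(const char *idstr,
                                           uint32_t instance_id,
                                           const char *data, size_t len)
    {
        printf("queued %zu bytes for %s/%" PRIu32 "\n", len, idstr,
               instance_id);
        return true;
    }

    /* Queue a device state blob in fixed-size chunks, as a sender might. */
    static bool queue_state_chunked(const char *idstr, uint32_t instance_id,
                                    const char *state, size_t total,
                                    size_t chunk)
    {
        for (size_t off = 0; off < total; off += chunk) {
            size_t len = (total - off < chunk) ? total - off : chunk;
            if (!multifd_queue_device_state(idstr, instance_id,
                                            state + off, len)) {
                return false;
            }
        }
        return true;
    }

    int main(void)
    {
        static const char state[10000];
        return queue_state_chunked("0000:00:02.0/vfio", 0, state,
                                   sizeof(state), 4096) ? 0 : 1;
    }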
* [PATCH v2 13/17] migration/multifd: Add migration_has_device_state_support()
2024-08-27 17:54 [PATCH v2 00/17] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
` (11 preceding siblings ...)
2024-08-27 17:54 ` [PATCH v2 12/17] migration/multifd: Device state transfer support - send side Maciej S. Szmigiero
@ 2024-08-27 17:54 ` Maciej S. Szmigiero
2024-08-30 18:55 ` Fabiano Rosas
2024-08-27 17:54 ` [PATCH v2 14/17] migration: Add save_live_complete_precopy_thread handler Maciej S. Szmigiero
` (5 subsequent siblings)
18 siblings, 1 reply; 128+ messages in thread
From: Maciej S. Szmigiero @ 2024-08-27 17:54 UTC (permalink / raw)
To: Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
Since device state transfer via multifd channels requires multifd
channels with packets and is currently not compatible with multifd
compression, add an appropriate query function so a device can learn
whether it can actually make use of it.
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
include/migration/misc.h | 1 +
migration/multifd-device-state.c | 7 +++++++
2 files changed, 8 insertions(+)
diff --git a/include/migration/misc.h b/include/migration/misc.h
index 7266b1b77d1f..189de6d02ad6 100644
--- a/include/migration/misc.h
+++ b/include/migration/misc.h
@@ -114,5 +114,6 @@ void dirty_bitmap_mig_init(void);
/* migration/multifd-device-state.c */
bool multifd_queue_device_state(char *idstr, uint32_t instance_id,
char *data, size_t len);
+bool migration_has_device_state_support(void);
#endif
diff --git a/migration/multifd-device-state.c b/migration/multifd-device-state.c
index c9b44f0b5ab9..7b34fe736c7f 100644
--- a/migration/multifd-device-state.c
+++ b/migration/multifd-device-state.c
@@ -11,6 +11,7 @@
#include "qemu/lockable.h"
#include "migration/misc.h"
#include "multifd.h"
+#include "options.h"
static QemuMutex queue_job_mutex;
@@ -97,3 +98,9 @@ bool multifd_queue_device_state(char *idstr, uint32_t instance_id,
return true;
}
+
+bool migration_has_device_state_support(void)
+{
+ return migrate_multifd() && !migrate_mapped_ram() &&
+ migrate_multifd_compression() == MULTIFD_COMPRESSION_NONE;
+}
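A hypothetical caller-side sketch showing how a device might use this query
to fall back to the main migration channel; the stub and both code paths are
illustrative, not taken from the series:

    #include <stdbool.h>
    #include <stdio.h>

    /* Stub standing in for the real query in multifd-device-state.c. */
    static bool migration_has_device_state_support(void)
    {
        return true;
    }

    static void device_pick_transfer_path(void)
    {
        if (migration_has_device_state_support()) {
            printf("sending device state via multifd channels\n");
        } else {
            /* e.g. mapped-ram or multifd compression is enabled */
            printf("falling back to the main migration channel\n");
        }
    }

    int main(void)
    {
        device_pick_transfer_path();
        return 0;
    }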
* [PATCH v2 14/17] migration: Add save_live_complete_precopy_thread handler
2024-08-27 17:54 [PATCH v2 00/17] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
` (12 preceding siblings ...)
2024-08-27 17:54 ` [PATCH v2 13/17] migration/multifd: Add migration_has_device_state_support() Maciej S. Szmigiero
@ 2024-08-27 17:54 ` Maciej S. Szmigiero
2024-08-27 17:54 ` [PATCH v2 15/17] vfio/migration: Multifd device state transfer support - receive side Maciej S. Szmigiero
` (4 subsequent siblings)
18 siblings, 0 replies; 128+ messages in thread
From: Maciej S. Szmigiero @ 2024-08-27 17:54 UTC (permalink / raw)
To: Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
This SaveVMHandler helps a device provide its own asynchronous transmission
of the remaining data at the end of the precopy phase via multifd channels,
in parallel with the transfer done by the save_live_complete_precopy handlers.
These threads are launched only when multifd device state transfer is
supported, after all save_live_complete_precopy_begin handlers have
already finished (for stream synchronization purposes).
Management of these threads is done in the multifd migration code,
wrapping them in the generic thread pool.
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
include/migration/misc.h | 10 ++++
include/migration/register.h | 25 +++++++++
include/qemu/typedefs.h | 4 ++
migration/multifd-device-state.c | 87 ++++++++++++++++++++++++++++++++
migration/savevm.c | 40 ++++++++++++++-
5 files changed, 165 insertions(+), 1 deletion(-)
diff --git a/include/migration/misc.h b/include/migration/misc.h
index 189de6d02ad6..26f7f3140f03 100644
--- a/include/migration/misc.h
+++ b/include/migration/misc.h
@@ -116,4 +116,14 @@ bool multifd_queue_device_state(char *idstr, uint32_t instance_id,
char *data, size_t len);
bool migration_has_device_state_support(void);
+void
+multifd_spawn_device_state_save_thread(SaveLiveCompletePrecopyThreadHandler hdlr,
+ char *idstr, uint32_t instance_id,
+ void *opaque);
+
+void multifd_launch_device_state_save_threads(int max_count);
+
+void multifd_abort_device_state_save_threads(void);
+int multifd_join_device_state_save_threads(void);
+
#endif
diff --git a/include/migration/register.h b/include/migration/register.h
index 44d8cf5192ae..ace2cfc0f75e 100644
--- a/include/migration/register.h
+++ b/include/migration/register.h
@@ -139,6 +139,31 @@ typedef struct SaveVMHandlers {
*/
int (*save_live_complete_precopy_end)(QEMUFile *f, void *opaque);
+ /* This runs in a separate thread. */
+
+ /**
+ * @save_live_complete_precopy_thread
+ *
+ * Called at the end of a precopy phase from a separate worker thread
+ * in configurations where multifd device state transfer is supported
+ * in order to perform asynchronous transmission of the remaining data in
+ * parallel with @save_live_complete_precopy handlers.
+ * The call happens after all @save_live_complete_precopy_begin handlers
+ * have finished.
+ * When postcopy is enabled, devices that support postcopy will skip this
+ * step.
+ *
+ * @idstr: this device section idstr
+ * @instance_id: this device section instance_id
+ * @abort_flag: flag indicating that the migration core wants to abort
+ * the transmission and so the handler should exit ASAP. To be read by
+ * qatomic_read() or similar.
+ * @opaque: data pointer passed to register_savevm_live()
+ *
+ * Returns zero to indicate success and negative for error
+ */
+ SaveLiveCompletePrecopyThreadHandler save_live_complete_precopy_thread;
+
/* This runs both outside and inside the BQL. */
/**
diff --git a/include/qemu/typedefs.h b/include/qemu/typedefs.h
index 9d222dc37628..edd6e7b9c116 100644
--- a/include/qemu/typedefs.h
+++ b/include/qemu/typedefs.h
@@ -130,5 +130,9 @@ typedef struct IRQState *qemu_irq;
* Function types
*/
typedef void (*qemu_irq_handler)(void *opaque, int n, int level);
+typedef int (*SaveLiveCompletePrecopyThreadHandler)(char *idstr,
+ uint32_t instance_id,
+ bool *abort_flag,
+ void *opaque);
#endif /* QEMU_TYPEDEFS_H */
diff --git a/migration/multifd-device-state.c b/migration/multifd-device-state.c
index 7b34fe736c7f..9b364e8ef33c 100644
--- a/migration/multifd-device-state.c
+++ b/migration/multifd-device-state.c
@@ -9,12 +9,17 @@
#include "qemu/osdep.h"
#include "qemu/lockable.h"
+#include "block/thread-pool.h"
#include "migration/misc.h"
#include "multifd.h"
#include "options.h"
static QemuMutex queue_job_mutex;
+ThreadPool *send_threads;
+int send_threads_ret;
+bool send_threads_abort;
+
static MultiFDSendData *device_state_send;
size_t multifd_device_state_payload_size(void)
@@ -27,6 +32,10 @@ void multifd_device_state_save_setup(void)
qemu_mutex_init(&queue_job_mutex);
device_state_send = multifd_send_data_alloc();
+
+ send_threads = thread_pool_new(NULL);
+ send_threads_ret = 0;
+ send_threads_abort = false;
}
void multifd_device_state_clear(MultiFDDeviceState_t *device_state)
@@ -37,6 +46,7 @@ void multifd_device_state_clear(MultiFDDeviceState_t *device_state)
void multifd_device_state_save_cleanup(void)
{
+ g_clear_pointer(&send_threads, thread_pool_free);
g_clear_pointer(&device_state_send, multifd_send_data_free);
qemu_mutex_destroy(&queue_job_mutex);
@@ -104,3 +114,80 @@ bool migration_has_device_state_support(void)
return migrate_multifd() && !migrate_mapped_ram() &&
migrate_multifd_compression() == MULTIFD_COMPRESSION_NONE;
}
+
+static void multifd_device_state_save_thread_complete(void *opaque, int ret)
+{
+ if (ret && !send_threads_ret) {
+ send_threads_ret = ret;
+ }
+}
+
+struct MultiFDDSSaveThreadData {
+ SaveLiveCompletePrecopyThreadHandler hdlr;
+ char *idstr;
+ uint32_t instance_id;
+ void *opaque;
+};
+
+static void multifd_device_state_save_thread_data_free(void *opaque)
+{
+ struct MultiFDDSSaveThreadData *data = opaque;
+
+ g_clear_pointer(&data->idstr, g_free);
+ g_free(data);
+}
+
+static int multifd_device_state_save_thread(void *opaque)
+{
+ struct MultiFDDSSaveThreadData *data = opaque;
+
+ return data->hdlr(data->idstr, data->instance_id, &send_threads_abort,
+ data->opaque);
+}
+
+void
+multifd_spawn_device_state_save_thread(SaveLiveCompletePrecopyThreadHandler hdlr,
+ char *idstr, uint32_t instance_id,
+ void *opaque)
+{
+ struct MultiFDDSSaveThreadData *data;
+
+ assert(migration_has_device_state_support());
+
+ data = g_new(struct MultiFDDSSaveThreadData, 1);
+ data->hdlr = hdlr;
+ data->idstr = g_strdup(idstr);
+ data->instance_id = instance_id;
+ data->opaque = opaque;
+
+ thread_pool_submit(send_threads,
+ multifd_device_state_save_thread,
+ data, multifd_device_state_save_thread_data_free,
+ multifd_device_state_save_thread_complete, NULL);
+}
+
+void multifd_launch_device_state_save_threads(int max_count)
+{
+ assert(migration_has_device_state_support());
+
+ thread_pool_set_minmax_threads(send_threads,
+ 0, max_count);
+
+ thread_pool_poll(send_threads);
+}
+
+void multifd_abort_device_state_save_threads(void)
+{
+ assert(migration_has_device_state_support());
+
+ qatomic_set(&send_threads_abort, true);
+}
+
+int multifd_join_device_state_save_threads(void)
+{
+ assert(migration_has_device_state_support());
+
+ thread_pool_join(send_threads);
+
+ return send_threads_ret;
+}
diff --git a/migration/savevm.c b/migration/savevm.c
index 33c9200d1e78..a70f6ed006f2 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -1495,6 +1495,7 @@ int qemu_savevm_state_complete_precopy_iterable(QEMUFile *f, bool in_postcopy)
int64_t start_ts_each, end_ts_each;
SaveStateEntry *se;
int ret;
+ bool multifd_device_state = migration_has_device_state_support();
QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
if (!se->ops || (in_postcopy && se->ops->has_postcopy &&
@@ -1517,6 +1518,27 @@ int qemu_savevm_state_complete_precopy_iterable(QEMUFile *f, bool in_postcopy)
}
}
+ if (multifd_device_state) {
+ int thread_count = 0;
+
+ QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+ SaveLiveCompletePrecopyThreadHandler hdlr;
+
+ if (!se->ops || (in_postcopy && se->ops->has_postcopy &&
+ se->ops->has_postcopy(se->opaque)) ||
+ !se->ops->save_live_complete_precopy_thread) {
+ continue;
+ }
+
+ hdlr = se->ops->save_live_complete_precopy_thread;
+ multifd_spawn_device_state_save_thread(hdlr,
+ se->idstr, se->instance_id,
+ se->opaque);
+ thread_count++;
+ }
+ multifd_launch_device_state_save_threads(thread_count);
+ }
+
QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
if (!se->ops ||
(in_postcopy && se->ops->has_postcopy &&
@@ -1541,13 +1563,21 @@ int qemu_savevm_state_complete_precopy_iterable(QEMUFile *f, bool in_postcopy)
save_section_footer(f, se);
if (ret < 0) {
qemu_file_set_error(f, ret);
- return -1;
+ goto ret_fail_abort_threads;
}
end_ts_each = qemu_clock_get_us(QEMU_CLOCK_REALTIME);
trace_vmstate_downtime_save("iterable", se->idstr, se->instance_id,
end_ts_each - start_ts_each);
}
+ if (multifd_device_state) {
+ ret = multifd_join_device_state_save_threads();
+ if (ret) {
+ qemu_file_set_error(f, ret);
+ return -1;
+ }
+ }
+
QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
if (!se->ops || (in_postcopy && se->ops->has_postcopy &&
se->ops->has_postcopy(se->opaque)) ||
@@ -1565,6 +1595,14 @@ int qemu_savevm_state_complete_precopy_iterable(QEMUFile *f, bool in_postcopy)
trace_vmstate_downtime_checkpoint("src-iterable-saved");
return 0;
+
+ret_fail_abort_threads:
+ if (multifd_device_state) {
+ multifd_abort_device_state_save_threads();
+ multifd_join_device_state_save_threads();
+ }
+
+ return -1;
}
int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
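For illustration, a minimal stand-alone skeleton of what a device-side
save_live_complete_precopy_thread handler could look like; the device read
and the multifd queueing are stubs, and a plain GCC atomic builtin stands in
for qatomic_read():

    #include <errno.h>
    #include <inttypes.h>
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Stub: pretend the device produces three 4 KiB chunks of state. */
    static size_t read_device_chunk(void *opaque, char *buf, size_t len)
    {
        static int chunks_left = 3;
        (void)opaque;
        if (chunks_left-- <= 0) {
            return 0;       /* device state exhausted */
        }
        memset(buf, 0, len);
        return len;
    }

    /* Stub standing in for multifd_queue_device_state(). */
    static bool queue_chunk(const char *idstr, uint32_t instance_id,
                            const char *buf, size_t len)
    {
        (void)buf;
        printf("%s/%" PRIu32 ": queued %zu bytes\n", idstr, instance_id, len);
        return true;
    }

    /* Shape of a SaveLiveCompletePrecopyThreadHandler implementation. */
    static int save_complete_precopy_thread(char *idstr, uint32_t instance_id,
                                            bool *abort_flag, void *opaque)
    {
        char buf[4096];
        size_t len;

        while ((len = read_device_chunk(opaque, buf, sizeof(buf))) > 0) {
            /* The core sets *abort_flag on error; exit ASAP when seen. */
            if (__atomic_load_n(abort_flag, __ATOMIC_RELAXED)) {
                return -ECANCELED;
            }
            if (!queue_chunk(idstr, instance_id, buf, len)) {
                return -1;
            }
        }
        return 0;
    }

    int main(void)
    {
        bool abort_flag = false;
        return save_complete_precopy_thread((char *)"vfio", 0,
                                            &abort_flag, NULL);
    }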
* [PATCH v2 15/17] vfio/migration: Multifd device state transfer support - receive side
2024-08-27 17:54 [PATCH v2 00/17] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
` (13 preceding siblings ...)
2024-08-27 17:54 ` [PATCH v2 14/17] migration: Add save_live_complete_precopy_thread handler Maciej S. Szmigiero
@ 2024-08-27 17:54 ` Maciej S. Szmigiero
2024-09-09 8:55 ` Avihai Horon
2024-08-27 17:54 ` [PATCH v2 16/17] vfio/migration: Add x-migration-multifd-transfer VFIO property Maciej S. Szmigiero
` (3 subsequent siblings)
18 siblings, 1 reply; 128+ messages in thread
From: Maciej S. Szmigiero @ 2024-08-27 17:54 UTC (permalink / raw)
To: Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
The multifd received data needs to be reassembled since device state
packets sent via different multifd channels can arrive out-of-order.
Therefore, each VFIO device state packet carries a header indicating
its position in the stream.
The last such VFIO device state packet should have
VFIO_DEVICE_STATE_CONFIG_STATE flag set and carry the device config
state.
Since it's important to finish loading the device state transferred via
the main migration channel (via the save_live_iterate handler) before
starting to load the data asynchronously transferred via multifd,
a new VFIO_MIG_FLAG_DEV_DATA_STATE_COMPLETE flag is introduced to
mark the end of the main migration channel data.
The device state loading process waits until that flag is seen before
commencing loading of the multifd-transferred device state.
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
hw/vfio/migration.c | 338 +++++++++++++++++++++++++++++++++-
hw/vfio/pci.c | 2 +
hw/vfio/trace-events | 9 +-
include/hw/vfio/vfio-common.h | 17 ++
4 files changed, 362 insertions(+), 4 deletions(-)
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 24679d8c5034..57c1542528dc 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -15,6 +15,7 @@
#include <linux/vfio.h>
#include <sys/ioctl.h>
+#include "io/channel-buffer.h"
#include "sysemu/runstate.h"
#include "hw/vfio/vfio-common.h"
#include "migration/misc.h"
@@ -47,6 +48,7 @@
#define VFIO_MIG_FLAG_DEV_SETUP_STATE (0xffffffffef100003ULL)
#define VFIO_MIG_FLAG_DEV_DATA_STATE (0xffffffffef100004ULL)
#define VFIO_MIG_FLAG_DEV_INIT_DATA_SENT (0xffffffffef100005ULL)
+#define VFIO_MIG_FLAG_DEV_DATA_STATE_COMPLETE (0xffffffffef100006ULL)
/*
* This is an arbitrary size based on migration of mlx5 devices, where typically
@@ -55,6 +57,15 @@
*/
#define VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE (1 * MiB)
+#define VFIO_DEVICE_STATE_CONFIG_STATE (1)
+
+typedef struct VFIODeviceStatePacket {
+ uint32_t version;
+ uint32_t idx;
+ uint32_t flags;
+ uint8_t data[0];
+} QEMU_PACKED VFIODeviceStatePacket;
+
static int64_t bytes_transferred;
static const char *mig_state_to_str(enum vfio_device_mig_state state)
@@ -254,6 +265,188 @@ static int vfio_load_buffer(QEMUFile *f, VFIODevice *vbasedev,
return ret;
}
+typedef struct LoadedBuffer {
+ bool is_present;
+ char *data;
+ size_t len;
+} LoadedBuffer;
+
+static void loaded_buffer_clear(gpointer data)
+{
+ LoadedBuffer *lb = data;
+
+ if (!lb->is_present) {
+ return;
+ }
+
+ g_clear_pointer(&lb->data, g_free);
+ lb->is_present = false;
+}
+
+static int vfio_load_state_buffer(void *opaque, char *data, size_t data_size,
+ Error **errp)
+{
+ VFIODevice *vbasedev = opaque;
+ VFIOMigration *migration = vbasedev->migration;
+ VFIODeviceStatePacket *packet = (VFIODeviceStatePacket *)data;
+ QEMU_LOCK_GUARD(&migration->load_bufs_mutex);
+ LoadedBuffer *lb;
+
+ if (data_size < sizeof(*packet)) {
+ error_setg(errp, "packet too short at %zu (min is %zu)",
+ data_size, sizeof(*packet));
+ return -1;
+ }
+
+ if (packet->version != 0) {
+ error_setg(errp, "packet has unknown version %" PRIu32,
+ packet->version);
+ return -1;
+ }
+
+ if (packet->idx == UINT32_MAX) {
+ error_setg(errp, "packet has too high idx %" PRIu32,
+ packet->idx);
+ return -1;
+ }
+
+ trace_vfio_load_state_device_buffer_incoming(vbasedev->name, packet->idx);
+
+ /* config state packet should be the last one in the stream */
+ if (packet->flags & VFIO_DEVICE_STATE_CONFIG_STATE) {
+ migration->load_buf_idx_last = packet->idx;
+ }
+
+ assert(migration->load_bufs);
+ if (packet->idx >= migration->load_bufs->len) {
+ g_array_set_size(migration->load_bufs, packet->idx + 1);
+ }
+
+ lb = &g_array_index(migration->load_bufs, typeof(*lb), packet->idx);
+ if (lb->is_present) {
+ error_setg(errp, "state buffer %" PRIu32 " already filled", packet->idx);
+ return -1;
+ }
+
+ assert(packet->idx >= migration->load_buf_idx);
+
+ migration->load_buf_queued_pending_buffers++;
+ if (migration->load_buf_queued_pending_buffers >
+ vbasedev->migration_max_queued_buffers) {
+ error_setg(errp,
+ "queuing state buffer %" PRIu32 " would exceed the max of %" PRIu64,
+ packet->idx, vbasedev->migration_max_queued_buffers);
+ return -1;
+ }
+
+ lb->data = g_memdup2(&packet->data, data_size - sizeof(*packet));
+ lb->len = data_size - sizeof(*packet);
+ lb->is_present = true;
+
+ qemu_cond_broadcast(&migration->load_bufs_buffer_ready_cond);
+
+ return 0;
+}
+
+static void *vfio_load_bufs_thread(void *opaque)
+{
+ VFIODevice *vbasedev = opaque;
+ VFIOMigration *migration = vbasedev->migration;
+ Error **errp = &migration->load_bufs_thread_errp;
+ g_autoptr(QemuLockable) locker = qemu_lockable_auto_lock(
+ QEMU_MAKE_LOCKABLE(&migration->load_bufs_mutex));
+ LoadedBuffer *lb;
+
+ while (!migration->load_bufs_device_ready &&
+ !migration->load_bufs_thread_want_exit) {
+ qemu_cond_wait(&migration->load_bufs_device_ready_cond, &migration->load_bufs_mutex);
+ }
+
+ while (!migration->load_bufs_thread_want_exit) {
+ bool starved;
+ ssize_t ret;
+
+ assert(migration->load_buf_idx <= migration->load_buf_idx_last);
+
+ if (migration->load_buf_idx >= migration->load_bufs->len) {
+ assert(migration->load_buf_idx == migration->load_bufs->len);
+ starved = true;
+ } else {
+ lb = &g_array_index(migration->load_bufs, typeof(*lb), migration->load_buf_idx);
+ starved = !lb->is_present;
+ }
+
+ if (starved) {
+ trace_vfio_load_state_device_buffer_starved(vbasedev->name, migration->load_buf_idx);
+ qemu_cond_wait(&migration->load_bufs_buffer_ready_cond, &migration->load_bufs_mutex);
+ continue;
+ }
+
+ if (migration->load_buf_idx == migration->load_buf_idx_last) {
+ break;
+ }
+
+ if (migration->load_buf_idx == 0) {
+ trace_vfio_load_state_device_buffer_start(vbasedev->name);
+ }
+
+ if (lb->len) {
+ g_autofree char *buf = NULL;
+ size_t buf_len;
+ int errno_save;
+
+ trace_vfio_load_state_device_buffer_load_start(vbasedev->name,
+ migration->load_buf_idx);
+
+ /* lb might become re-allocated when we drop the lock */
+ buf = g_steal_pointer(&lb->data);
+ buf_len = lb->len;
+
+ /* Loading data to the device takes a while, drop the lock during this process */
+ qemu_mutex_unlock(&migration->load_bufs_mutex);
+ ret = write(migration->data_fd, buf, buf_len);
+ errno_save = errno;
+ qemu_mutex_lock(&migration->load_bufs_mutex);
+
+ if (ret < 0) {
+ error_setg(errp, "write to state buffer %" PRIu32 " failed with %d",
+ migration->load_buf_idx, errno_save);
+ break;
+ } else if (ret < buf_len) {
+ error_setg(errp, "write to state buffer %" PRIu32 " incomplete %zd / %zu",
+ migration->load_buf_idx, ret, buf_len);
+ break;
+ }
+
+ trace_vfio_load_state_device_buffer_load_end(vbasedev->name,
+ migration->load_buf_idx);
+ }
+
+ assert(migration->load_buf_queued_pending_buffers > 0);
+ migration->load_buf_queued_pending_buffers--;
+
+ if (migration->load_buf_idx == migration->load_buf_idx_last - 1) {
+ trace_vfio_load_state_device_buffer_end(vbasedev->name);
+ }
+
+ migration->load_buf_idx++;
+ }
+
+ if (migration->load_bufs_thread_want_exit &&
+ !*errp) {
+ error_setg(errp, "load bufs thread asked to quit");
+ }
+
+ g_clear_pointer(&locker, qemu_lockable_auto_unlock);
+
+ qemu_loadvm_load_finish_ready_lock();
+ migration->load_bufs_thread_finished = true;
+ qemu_loadvm_load_finish_ready_broadcast();
+ qemu_loadvm_load_finish_ready_unlock();
+
+ return NULL;
+}
+
static int vfio_save_device_config_state(QEMUFile *f, void *opaque,
Error **errp)
{
@@ -285,6 +478,8 @@ static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
VFIODevice *vbasedev = opaque;
uint64_t data;
+ trace_vfio_load_device_config_state_start(vbasedev->name);
+
if (vbasedev->ops && vbasedev->ops->vfio_load_config) {
int ret;
@@ -303,7 +498,7 @@ static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
return -EINVAL;
}
- trace_vfio_load_device_config_state(vbasedev->name);
+ trace_vfio_load_device_config_state_end(vbasedev->name);
return qemu_file_get_error(f);
}
@@ -687,16 +882,70 @@ static void vfio_save_state(QEMUFile *f, void *opaque)
static int vfio_load_setup(QEMUFile *f, void *opaque, Error **errp)
{
VFIODevice *vbasedev = opaque;
+ VFIOMigration *migration = vbasedev->migration;
+ int ret;
+
+ ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RESUMING,
+ vbasedev->migration->device_state, errp);
+ if (ret) {
+ return ret;
+ }
+
+ assert(!migration->load_bufs);
+ migration->load_bufs = g_array_new(FALSE, TRUE, sizeof(LoadedBuffer));
+ g_array_set_clear_func(migration->load_bufs, loaded_buffer_clear);
+
+ qemu_mutex_init(&migration->load_bufs_mutex);
+
+ migration->load_bufs_device_ready = false;
+ qemu_cond_init(&migration->load_bufs_device_ready_cond);
+
+ migration->load_buf_idx = 0;
+ migration->load_buf_idx_last = UINT32_MAX;
+ migration->load_buf_queued_pending_buffers = 0;
+ qemu_cond_init(&migration->load_bufs_buffer_ready_cond);
+
+ migration->config_state_loaded_to_dev = false;
+
+ assert(!migration->load_bufs_thread_started);
- return vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RESUMING,
- vbasedev->migration->device_state, errp);
+ migration->load_bufs_thread_finished = false;
+ migration->load_bufs_thread_want_exit = false;
+ qemu_thread_create(&migration->load_bufs_thread, "vfio-load-bufs",
+ vfio_load_bufs_thread, opaque, QEMU_THREAD_JOINABLE);
+
+ migration->load_bufs_thread_started = true;
+
+ return 0;
}
static int vfio_load_cleanup(void *opaque)
{
VFIODevice *vbasedev = opaque;
+ VFIOMigration *migration = vbasedev->migration;
+
+ if (migration->load_bufs_thread_started) {
+ qemu_mutex_lock(&migration->load_bufs_mutex);
+ migration->load_bufs_thread_want_exit = true;
+ qemu_mutex_unlock(&migration->load_bufs_mutex);
+
+ qemu_cond_broadcast(&migration->load_bufs_device_ready_cond);
+ qemu_cond_broadcast(&migration->load_bufs_buffer_ready_cond);
+
+ qemu_thread_join(&migration->load_bufs_thread);
+
+ assert(migration->load_bufs_thread_finished);
+
+ migration->load_bufs_thread_started = false;
+ }
vfio_migration_cleanup(vbasedev);
+
+ g_clear_pointer(&migration->load_bufs, g_array_unref);
+ qemu_cond_destroy(&migration->load_bufs_buffer_ready_cond);
+ qemu_cond_destroy(&migration->load_bufs_device_ready_cond);
+ qemu_mutex_destroy(&migration->load_bufs_mutex);
+
trace_vfio_load_cleanup(vbasedev->name);
return 0;
@@ -705,6 +954,7 @@ static int vfio_load_cleanup(void *opaque)
static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
{
VFIODevice *vbasedev = opaque;
+ VFIOMigration *migration = vbasedev->migration;
int ret = 0;
uint64_t data;
@@ -716,6 +966,7 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
switch (data) {
case VFIO_MIG_FLAG_DEV_CONFIG_STATE:
{
+ migration->config_state_loaded_to_dev = true;
return vfio_load_device_config_state(f, opaque);
}
case VFIO_MIG_FLAG_DEV_SETUP_STATE:
@@ -742,6 +993,15 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
}
break;
}
+ case VFIO_MIG_FLAG_DEV_DATA_STATE_COMPLETE:
+ {
+ QEMU_LOCK_GUARD(&migration->load_bufs_mutex);
+
+ migration->load_bufs_device_ready = true;
+ qemu_cond_broadcast(&migration->load_bufs_device_ready_cond);
+
+ break;
+ }
case VFIO_MIG_FLAG_DEV_INIT_DATA_SENT:
{
if (!vfio_precopy_supported(vbasedev) ||
@@ -774,6 +1034,76 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
return ret;
}
+static int vfio_load_finish(void *opaque, bool *is_finished, Error **errp)
+{
+ VFIODevice *vbasedev = opaque;
+ VFIOMigration *migration = vbasedev->migration;
+ g_autoptr(QemuLockable) locker = NULL;
+ LoadedBuffer *lb;
+ g_autoptr(QIOChannelBuffer) bioc = NULL;
+ QEMUFile *f_out = NULL, *f_in = NULL;
+ uint64_t mig_header;
+ int ret;
+
+ if (migration->config_state_loaded_to_dev) {
+ *is_finished = true;
+ return 0;
+ }
+
+ if (!migration->load_bufs_thread_finished) {
+ assert(migration->load_bufs_thread_started);
+ *is_finished = false;
+ return 0;
+ }
+
+ if (migration->load_bufs_thread_errp) {
+ error_propagate(errp, g_steal_pointer(&migration->load_bufs_thread_errp));
+ return -1;
+ }
+
+ locker = qemu_lockable_auto_lock(QEMU_MAKE_LOCKABLE(&migration->load_bufs_mutex));
+
+ assert(migration->load_buf_idx == migration->load_buf_idx_last);
+ lb = &g_array_index(migration->load_bufs, typeof(*lb), migration->load_buf_idx);
+ assert(lb->is_present);
+
+ bioc = qio_channel_buffer_new(lb->len);
+ qio_channel_set_name(QIO_CHANNEL(bioc), "vfio-device-config-load");
+
+ f_out = qemu_file_new_output(QIO_CHANNEL(bioc));
+ qemu_put_buffer(f_out, (uint8_t *)lb->data, lb->len);
+
+ ret = qemu_fflush(f_out);
+ if (ret) {
+ error_setg(errp, "load device config state file flush failed with %d", ret);
+ g_clear_pointer(&f_out, qemu_fclose);
+ return -1;
+ }
+
+ qio_channel_io_seek(QIO_CHANNEL(bioc), 0, 0, NULL);
+ f_in = qemu_file_new_input(QIO_CHANNEL(bioc));
+
+ mig_header = qemu_get_be64(f_in);
+ if (mig_header != VFIO_MIG_FLAG_DEV_CONFIG_STATE) {
+ error_setg(errp, "load device config state invalid header %"PRIu64, mig_header);
+ g_clear_pointer(&f_out, qemu_fclose);
+ g_clear_pointer(&f_in, qemu_fclose);
+ return -1;
+ }
+
+ ret = vfio_load_device_config_state(f_in, opaque);
+ g_clear_pointer(&f_out, qemu_fclose);
+ g_clear_pointer(&f_in, qemu_fclose);
+ if (ret < 0) {
+ error_setg(errp, "load device config state failed with %d", ret);
+ return -1;
+ }
+
+ migration->config_state_loaded_to_dev = true;
+ *is_finished = true;
+ return 0;
+}
+
static bool vfio_switchover_ack_needed(void *opaque)
{
VFIODevice *vbasedev = opaque;
@@ -794,6 +1124,8 @@ static const SaveVMHandlers savevm_vfio_handlers = {
.load_setup = vfio_load_setup,
.load_cleanup = vfio_load_cleanup,
.load_state = vfio_load_state,
+ .load_state_buffer = vfio_load_state_buffer,
+ .load_finish = vfio_load_finish,
.switchover_ack_needed = vfio_switchover_ack_needed,
};
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 2407720c3530..08cb56d27a05 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -3378,6 +3378,8 @@ static Property vfio_pci_dev_properties[] = {
VFIO_FEATURE_ENABLE_IGD_OPREGION_BIT, false),
DEFINE_PROP_ON_OFF_AUTO("enable-migration", VFIOPCIDevice,
vbasedev.enable_migration, ON_OFF_AUTO_AUTO),
+ DEFINE_PROP_UINT64("x-migration-max-queued-buffers", VFIOPCIDevice,
+ vbasedev.migration_max_queued_buffers, UINT64_MAX),
DEFINE_PROP_BOOL("migration-events", VFIOPCIDevice,
vbasedev.migration_events, false),
DEFINE_PROP_BOOL("x-no-mmap", VFIOPCIDevice, vbasedev.no_mmap, false),
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 013c602f30fa..9d2519a28a7e 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -149,9 +149,16 @@ vfio_display_edid_write_error(void) ""
# migration.c
vfio_load_cleanup(const char *name) " (%s)"
-vfio_load_device_config_state(const char *name) " (%s)"
+vfio_load_device_config_state_start(const char *name) " (%s)"
+vfio_load_device_config_state_end(const char *name) " (%s)"
vfio_load_state(const char *name, uint64_t data) " (%s) data 0x%"PRIx64
vfio_load_state_device_data(const char *name, uint64_t data_size, int ret) " (%s) size 0x%"PRIx64" ret %d"
+vfio_load_state_device_buffer_incoming(const char *name, uint32_t idx) " (%s) idx %"PRIu32
+vfio_load_state_device_buffer_start(const char *name) " (%s)"
+vfio_load_state_device_buffer_starved(const char *name, uint32_t idx) " (%s) idx %"PRIu32
+vfio_load_state_device_buffer_load_start(const char *name, uint32_t idx) " (%s) idx %"PRIu32
+vfio_load_state_device_buffer_load_end(const char *name, uint32_t idx) " (%s) idx %"PRIu32
+vfio_load_state_device_buffer_end(const char *name) " (%s)"
vfio_migration_realize(const char *name) " (%s)"
vfio_migration_set_device_state(const char *name, const char *state) " (%s) state %s"
vfio_migration_set_state(const char *name, const char *new_state, const char *recover_state) " (%s) new state %s, recover state %s"
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 32d58e3e025b..ba5b9464e79a 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -76,6 +76,22 @@ typedef struct VFIOMigration {
bool save_iterate_run;
bool save_iterate_empty_hit;
+
+ QemuThread load_bufs_thread;
+ Error *load_bufs_thread_errp;
+ bool load_bufs_thread_started;
+ bool load_bufs_thread_finished;
+ bool load_bufs_thread_want_exit;
+
+ GArray *load_bufs;
+ bool load_bufs_device_ready;
+ QemuCond load_bufs_device_ready_cond;
+ QemuCond load_bufs_buffer_ready_cond;
+ QemuMutex load_bufs_mutex;
+ uint32_t load_buf_idx;
+ uint32_t load_buf_idx_last;
+ uint32_t load_buf_queued_pending_buffers;
+ bool config_state_loaded_to_dev;
} VFIOMigration;
struct VFIOGroup;
@@ -134,6 +150,7 @@ typedef struct VFIODevice {
bool ram_block_discard_allowed;
OnOffAuto enable_migration;
bool migration_events;
+ uint64_t migration_max_queued_buffers;
VFIODeviceOps *ops;
unsigned int num_irqs;
unsigned int num_regions;
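A toy stand-alone model of the idx-based reassembly above: buffers may arrive
in any order, but are consumed strictly in sequence, stalling at gaps. The
real code uses a GArray plus a mutex and condition variables, and writes to
the device's data_fd instead of printing; everything here is simplified for
illustration:

    #include <inttypes.h>
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define MAX_BUFS 64

    typedef struct {
        bool present;
        char *data;
        size_t len;
    } LoadedBuf;

    static LoadedBuf bufs[MAX_BUFS];
    static uint32_t next_idx;   /* next buffer to feed to the device */

    /* A packet arrived on some multifd channel; park it at its idx slot. */
    static void buffer_arrived(uint32_t idx, const char *data, size_t len)
    {
        bufs[idx].data = malloc(len);
        memcpy(bufs[idx].data, data, len);
        bufs[idx].len = len;
        bufs[idx].present = true;
    }

    /* Consume buffers strictly in idx order, stalling at the first gap. */
    static void drain_ready(void)
    {
        while (next_idx < MAX_BUFS && bufs[next_idx].present) {
            printf("loading buffer %" PRIu32 " (%zu bytes)\n",
                   next_idx, bufs[next_idx].len);
            free(bufs[next_idx].data);
            bufs[next_idx].present = false;
            next_idx++;
        }
    }

    int main(void)
    {
        buffer_arrived(1, "bbbb", 4);   /* early arrival: nothing drains yet */
        drain_ready();
        buffer_arrived(0, "aaaa", 4);   /* gap filled: 0 and 1 both drain */
        drain_ready();
        return 0;
    }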
* [PATCH v2 16/17] vfio/migration: Add x-migration-multifd-transfer VFIO property
2024-08-27 17:54 [PATCH v2 00/17] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
` (14 preceding siblings ...)
2024-08-27 17:54 ` [PATCH v2 15/17] vfio/migration: Multifd device state transfer support - receive side Maciej S. Szmigiero
@ 2024-08-27 17:54 ` Maciej S. Szmigiero
2024-08-27 17:54 ` [PATCH v2 17/17] vfio/migration: Multifd device state transfer support - send side Maciej S. Szmigiero
` (2 subsequent siblings)
18 siblings, 0 replies; 128+ messages in thread
From: Maciej S. Szmigiero @ 2024-08-27 17:54 UTC (permalink / raw)
To: Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
This property allows configuring at runtime whether to send a particular
device's state via multifd channels when live migrating that
device.
It is ignored on the receive side and defaults to "false" for bit stream
compatibility with older QEMU versions.
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
hw/vfio/pci.c | 7 +++++++
include/hw/vfio/vfio-common.h | 1 +
2 files changed, 8 insertions(+)
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 08cb56d27a05..b68f08ba8a4f 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -3354,6 +3354,8 @@ static void vfio_instance_init(Object *obj)
pci_dev->cap_present |= QEMU_PCI_CAP_EXPRESS;
}
+static PropertyInfo qdev_prop_bool_mutable;
+
static Property vfio_pci_dev_properties[] = {
DEFINE_PROP_PCI_HOST_DEVADDR("host", VFIOPCIDevice, host),
DEFINE_PROP_UUID_NODEFAULT("vf-token", VFIOPCIDevice, vf_token),
@@ -3378,6 +3380,8 @@ static Property vfio_pci_dev_properties[] = {
VFIO_FEATURE_ENABLE_IGD_OPREGION_BIT, false),
DEFINE_PROP_ON_OFF_AUTO("enable-migration", VFIOPCIDevice,
vbasedev.enable_migration, ON_OFF_AUTO_AUTO),
+ DEFINE_PROP("x-migration-multifd-transfer", VFIOPCIDevice,
+ vbasedev.migration_multifd_transfer, qdev_prop_bool_mutable, bool),
DEFINE_PROP_UINT64("x-migration-max-queued-buffers", VFIOPCIDevice,
vbasedev.migration_max_queued_buffers, UINT64_MAX),
DEFINE_PROP_BOOL("migration-events", VFIOPCIDevice,
@@ -3477,6 +3481,9 @@ static const TypeInfo vfio_pci_nohotplug_dev_info = {
static void register_vfio_pci_dev_type(void)
{
+ qdev_prop_bool_mutable = qdev_prop_bool;
+ qdev_prop_bool_mutable.realized_set_allowed = true;
+
type_register_static(&vfio_pci_dev_info);
type_register_static(&vfio_pci_nohotplug_dev_info);
}
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index ba5b9464e79a..fe05acb9a5d1 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -149,6 +149,7 @@ typedef struct VFIODevice {
bool no_mmap;
bool ram_block_discard_allowed;
OnOffAuto enable_migration;
+ bool migration_multifd_transfer;
bool migration_events;
uint64_t migration_max_queued_buffers;
VFIODeviceOps *ops;
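Since realized_set_allowed is enabled for this property, it can presumably
also be flipped at runtime. For example (the device id and host address below
are hypothetical), one could start with the transfer disabled and enable it
via HMP just before migration:

    -device vfio-pci,host=0000:01:00.0,id=vfio0,x-migration-multifd-transfer=off

    (qemu) qom-set /machine/peripheral/vfio0 x-migration-multifd-transfer on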
* [PATCH v2 17/17] vfio/migration: Multifd device state transfer support - send side
2024-08-27 17:54 [PATCH v2 00/17] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
` (15 preceding siblings ...)
2024-08-27 17:54 ` [PATCH v2 16/17] vfio/migration: Add x-migration-multifd-transfer VFIO property Maciej S. Szmigiero
@ 2024-08-27 17:54 ` Maciej S. Szmigiero
2024-09-09 11:41 ` Avihai Horon
2024-08-28 20:46 ` [PATCH v2 00/17] Multifd device state transfer support with VFIO consumer Fabiano Rosas
2024-10-11 13:58 ` Cédric Le Goater
18 siblings, 1 reply; 128+ messages in thread
From: Maciej S. Szmigiero @ 2024-08-27 17:54 UTC (permalink / raw)
To: Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
Implement the multifd device state transfer via an additional per-device
thread inside the save_live_complete_precopy_thread handler.
Switch between doing the data transfer in the new handler and doing it
in the old save_state handler depending on the
x-migration-multifd-transfer device property value.
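As a stand-alone illustration of the framing used on the wire (mirroring the
VFIODeviceStatePacket layout from the receive side in patch 15), here is a
purely illustrative helper; note the real code converts the header fields to
big-endian with cpu_to_be32(), which is omitted here:

    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Illustrative copy of the wire layout from patch 15. */
    typedef struct {
        uint32_t version;
        uint32_t idx;
        uint32_t flags;
        uint8_t data[];
    } __attribute__((packed)) DeviceStatePacket;

    /* Frame one chunk of device state; idx orders chunks for reassembly. */
    static DeviceStatePacket *make_packet(uint32_t idx, const void *buf,
                                          size_t len, size_t *out_len)
    {
        DeviceStatePacket *p = calloc(1, sizeof(*p) + len);
        p->idx = idx;
        memcpy(p->data, buf, len);
        *out_len = sizeof(*p) + len;
        return p;
    }

    int main(void)
    {
        size_t len;
        DeviceStatePacket *p = make_packet(0, "chunk", 5, &len);
        printf("packet %" PRIu32 ": %zu bytes total\n", p->idx, len);
        free(p);
        return 0;
    }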
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
hw/vfio/migration.c | 169 ++++++++++++++++++++++++++++++++++
hw/vfio/trace-events | 2 +
include/hw/vfio/vfio-common.h | 1 +
3 files changed, 172 insertions(+)
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 57c1542528dc..67996aa2df8b 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -655,6 +655,16 @@ static int vfio_save_setup(QEMUFile *f, void *opaque, Error **errp)
uint64_t stop_copy_size = VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE;
int ret;
+ /* Make a copy of this setting at the start in case it is changed mid-migration */
+ migration->multifd_transfer = vbasedev->migration_multifd_transfer;
+
+ if (migration->multifd_transfer && !migration_has_device_state_support()) {
+ error_setg(errp,
+ "%s: Multifd device transfer requested but unsupported in the current config",
+ vbasedev->name);
+ return -EINVAL;
+ }
+
qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
vfio_query_stop_copy_size(vbasedev, &stop_copy_size);
@@ -835,10 +845,20 @@ static int vfio_save_iterate(QEMUFile *f, void *opaque)
static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
{
VFIODevice *vbasedev = opaque;
+ VFIOMigration *migration = vbasedev->migration;
ssize_t data_size;
int ret;
Error *local_err = NULL;
+ if (migration->multifd_transfer) {
+ /*
+ * Emit dummy NOP data, vfio_save_complete_precopy_thread()
+ * does the actual transfer.
+ */
+ qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
+ return 0;
+ }
+
trace_vfio_save_complete_precopy_started(vbasedev->name);
/* We reach here with device state STOP or STOP_COPY only */
@@ -864,12 +884,159 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
return ret;
}
+static int vfio_save_complete_precopy_async_thread_config_state(VFIODevice *vbasedev,
+ char *idstr,
+ uint32_t instance_id,
+ uint32_t idx)
+{
+ g_autoptr(QIOChannelBuffer) bioc = NULL;
+ QEMUFile *f = NULL;
+ int ret;
+ g_autofree VFIODeviceStatePacket *packet = NULL;
+ size_t packet_len;
+
+ bioc = qio_channel_buffer_new(0);
+ qio_channel_set_name(QIO_CHANNEL(bioc), "vfio-device-config-save");
+
+ f = qemu_file_new_output(QIO_CHANNEL(bioc));
+
+ ret = vfio_save_device_config_state(f, vbasedev, NULL);
+ if (ret) {
+ return ret;
+ }
+
+ ret = qemu_fflush(f);
+ if (ret) {
+ goto ret_close_file;
+ }
+
+ packet_len = sizeof(*packet) + bioc->usage;
+ packet = g_malloc0(packet_len);
+ packet->idx = idx;
+ packet->flags = VFIO_DEVICE_STATE_CONFIG_STATE;
+ memcpy(&packet->data, bioc->data, bioc->usage);
+
+ if (!multifd_queue_device_state(idstr, instance_id,
+ (char *)packet, packet_len)) {
+ ret = -1;
+ }
+
+ bytes_transferred += packet_len;
+
+ret_close_file:
+ g_clear_pointer(&f, qemu_fclose);
+ return ret;
+}
+
+static int vfio_save_complete_precopy_thread(char *idstr,
+ uint32_t instance_id,
+ bool *abort_flag,
+ void *opaque)
+{
+ VFIODevice *vbasedev = opaque;
+ VFIOMigration *migration = vbasedev->migration;
+ int ret;
+ g_autofree VFIODeviceStatePacket *packet = NULL;
+ uint32_t idx;
+
+ if (!migration->multifd_transfer) {
+ /* Nothing to do, vfio_save_complete_precopy() does the transfer. */
+ return 0;
+ }
+
+ trace_vfio_save_complete_precopy_thread_started(vbasedev->name,
+ idstr, instance_id);
+
+ /* We reach here with device state STOP or STOP_COPY only */
+ ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP_COPY,
+ VFIO_DEVICE_STATE_STOP, NULL);
+ if (ret) {
+ goto ret_finish;
+ }
+
+ packet = g_malloc0(sizeof(*packet) + migration->data_buffer_size);
+
+ for (idx = 0; ; idx++) {
+ ssize_t data_size;
+ size_t packet_size;
+
+ if (qatomic_read(abort_flag)) {
+ ret = -ECANCELED;
+ goto ret_finish;
+ }
+
+ data_size = read(migration->data_fd, &packet->data,
+ migration->data_buffer_size);
+ if (data_size < 0) {
+ if (errno != ENOMSG) {
+ ret = -errno;
+ goto ret_finish;
+ }
+
+ /*
+ * Pre-copy emptied all the device state for now. For more information,
+ * please refer to the Linux kernel VFIO uAPI.
+ */
+ data_size = 0;
+ }
+
+        if (data_size == 0) {
+            break;
+        }
+ packet->idx = idx;
+ packet_size = sizeof(*packet) + data_size;
+
+ if (!multifd_queue_device_state(idstr, instance_id,
+ (char *)packet, packet_size)) {
+ ret = -1;
+ goto ret_finish;
+ }
+
+ bytes_transferred += packet_size;
+ }
+
+ ret = vfio_save_complete_precopy_async_thread_config_state(vbasedev, idstr,
+ instance_id,
+ idx);
+
+ret_finish:
+ trace_vfio_save_complete_precopy_thread_finished(vbasedev->name, ret);
+
+ return ret;
+}
+
+static int vfio_save_complete_precopy_begin(QEMUFile *f,
+ char *idstr, uint32_t instance_id,
+ void *opaque)
+{
+ VFIODevice *vbasedev = opaque;
+ VFIOMigration *migration = vbasedev->migration;
+
+ if (!migration->multifd_transfer) {
+ /* Emit dummy NOP data */
+ qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
+ return 0;
+ }
+
+ qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE_COMPLETE);
+ qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
+
+ return qemu_fflush(f);
+}
+
static void vfio_save_state(QEMUFile *f, void *opaque)
{
VFIODevice *vbasedev = opaque;
+ VFIOMigration *migration = vbasedev->migration;
Error *local_err = NULL;
int ret;
+ if (migration->multifd_transfer) {
+ /* Emit dummy NOP data */
+ qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
+ return;
+ }
+
ret = vfio_save_device_config_state(f, opaque, &local_err);
if (ret) {
error_prepend(&local_err,
@@ -1119,7 +1286,9 @@ static const SaveVMHandlers savevm_vfio_handlers = {
.state_pending_exact = vfio_state_pending_exact,
.is_active_iterate = vfio_is_active_iterate,
.save_live_iterate = vfio_save_iterate,
+ .save_live_complete_precopy_begin = vfio_save_complete_precopy_begin,
.save_live_complete_precopy = vfio_save_complete_precopy,
+ .save_live_complete_precopy_thread = vfio_save_complete_precopy_thread,
.save_state = vfio_save_state,
.load_setup = vfio_load_setup,
.load_cleanup = vfio_load_cleanup,
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 9d2519a28a7e..b1d9c9d5f2e1 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -167,6 +167,8 @@ vfio_save_block(const char *name, int data_size) " (%s) data_size %d"
vfio_save_cleanup(const char *name) " (%s)"
vfio_save_complete_precopy(const char *name, int ret) " (%s) ret %d"
vfio_save_complete_precopy_started(const char *name) " (%s)"
+vfio_save_complete_precopy_thread_started(const char *name, const char *idstr, uint32_t instance_id) " (%s) idstr %s instance %"PRIu32
+vfio_save_complete_precopy_thread_finished(const char *name, int ret) " (%s) ret %d"
vfio_save_device_config_state(const char *name) " (%s)"
vfio_save_iterate(const char *name, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64
vfio_save_iterate_started(const char *name) " (%s)"
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index fe05acb9a5d1..4578a0ca6a5c 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -72,6 +72,7 @@ typedef struct VFIOMigration {
uint64_t mig_flags;
uint64_t precopy_init_size;
uint64_t precopy_dirty_size;
+ bool multifd_transfer;
bool initial_data_sent;
bool save_iterate_run;
^ permalink raw reply related [flat|nested] 128+ messages in thread
* Re: [PATCH v2 02/17] migration/ram: Add load start trace event
2024-08-27 17:54 ` [PATCH v2 02/17] migration/ram: Add load start trace event Maciej S. Szmigiero
@ 2024-08-28 18:44 ` Fabiano Rosas
2024-08-28 20:21 ` Maciej S. Szmigiero
0 siblings, 1 reply; 128+ messages in thread
From: Fabiano Rosas @ 2024-08-28 18:44 UTC (permalink / raw)
To: Maciej S. Szmigiero, Peter Xu
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
"Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>
> There's a RAM load complete trace event, but there was no start equivalent.
>
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> ---
> migration/ram.c | 1 +
> migration/trace-events | 1 +
> 2 files changed, 2 insertions(+)
>
> diff --git a/migration/ram.c b/migration/ram.c
> index 67ca3d5d51a1..7997bd830b9c 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -4127,6 +4127,7 @@ static int ram_load_precopy(QEMUFile *f)
> RAM_SAVE_FLAG_ZERO);
> }
>
> + trace_ram_load_start();
This would fit better at ram_load() paired with trace_ram_load_complete(), no?
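Something like this, I mean (rough sketch only - the surrounding ram_load()
lines are elided/assumed here, not copied from the tree):

    static int ram_load(QEMUFile *f, void *opaque, int version_id)
    {
        int ret;

        trace_ram_load_start();
        /* ... existing load path that sets ret and seq_iter ... */
        trace_ram_load_complete(ret, seq_iter);
        return ret;
    }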
> while (!ret && !(flags & RAM_SAVE_FLAG_EOS)) {
> ram_addr_t addr;
> void *host = NULL, *host_bak = NULL;
> diff --git a/migration/trace-events b/migration/trace-events
> index c65902f042bd..2a99a7baaea6 100644
> --- a/migration/trace-events
> +++ b/migration/trace-events
> @@ -115,6 +115,7 @@ colo_flush_ram_cache_end(void) ""
> save_xbzrle_page_skipping(void) ""
> save_xbzrle_page_overflow(void) ""
> ram_save_iterate_big_wait(uint64_t milliconds, int iterations) "big wait: %" PRIu64 " milliseconds, %d iterations"
> +ram_load_start(void) ""
> ram_load_complete(int ret, uint64_t seq_iter) "exit_code %d seq iteration %" PRIu64
> ram_write_tracking_ramblock_start(const char *block_id, size_t page_size, void *addr, size_t length) "%s: page_size: %zu addr: %p length: %zu"
> ram_write_tracking_ramblock_stop(const char *block_id, size_t page_size, void *addr, size_t length) "%s: page_size: %zu addr: %p length: %zu"
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [PATCH v2 03/17] migration/multifd: Zero p->flags before starting filling a packet
2024-08-27 17:54 ` [PATCH v2 03/17] migration/multifd: Zero p->flags before starting filling a packet Maciej S. Szmigiero
@ 2024-08-28 18:50 ` Fabiano Rosas
2024-09-09 15:41 ` Peter Xu
1 sibling, 0 replies; 128+ messages in thread
From: Fabiano Rosas @ 2024-08-28 18:50 UTC (permalink / raw)
To: Maciej S. Szmigiero, Peter Xu
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
"Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>
> This way there aren't stale flags there.
>
> p->flags can no longer carry a stale SYNC into the next RAM packet, since
> syncs are now handled separately in multifd_send_thread.
>
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [PATCH v2 06/17] migration: Add save_live_complete_precopy_{begin,end} handlers
2024-08-27 17:54 ` [PATCH v2 06/17] migration: Add save_live_complete_precopy_{begin, end} handlers Maciej S. Szmigiero
@ 2024-08-28 19:03 ` Fabiano Rosas
2024-09-05 13:45 ` Avihai Horon
1 sibling, 0 replies; 128+ messages in thread
From: Fabiano Rosas @ 2024-08-28 19:03 UTC (permalink / raw)
To: Maciej S. Szmigiero, Peter Xu
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
"Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>
> These SaveVMHandlers help a device provide its own asynchronous
> transmission of the remaining data at the end of the precopy phase.
>
> In this use case the save_live_complete_precopy_begin handler might
> be used to mark the stream boundary before proceeding with asynchronous
> transmission of the remaining data, while the
> save_live_complete_precopy_end handler might be used to mark the
> stream boundary after performing the asynchronous transmission.
>
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [PATCH v2 02/17] migration/ram: Add load start trace event
2024-08-28 18:44 ` Fabiano Rosas
@ 2024-08-28 20:21 ` Maciej S. Szmigiero
0 siblings, 0 replies; 128+ messages in thread
From: Maciej S. Szmigiero @ 2024-08-28 20:21 UTC (permalink / raw)
To: Fabiano Rosas
Cc: Alex Williamson, Peter Xu, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On 28.08.2024 20:44, Fabiano Rosas wrote:
> "Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:
>
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> There's a RAM load complete trace event, but there was no start equivalent.
>>
>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>> ---
>> migration/ram.c | 1 +
>> migration/trace-events | 1 +
>> 2 files changed, 2 insertions(+)
>>
>> diff --git a/migration/ram.c b/migration/ram.c
>> index 67ca3d5d51a1..7997bd830b9c 100644
>> --- a/migration/ram.c
>> +++ b/migration/ram.c
>> @@ -4127,6 +4127,7 @@ static int ram_load_precopy(QEMUFile *f)
>> RAM_SAVE_FLAG_ZERO);
>> }
>>
>> + trace_ram_load_start();
>
> This would fit better at ram_load() paired with trace_ram_load_complete(), no?
>
Right - will move it there in the next version of this patch set.
Thanks,
Maciej
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [PATCH v2 00/17] Multifd device state transfer support with VFIO consumer
2024-08-27 17:54 [PATCH v2 00/17] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
` (16 preceding siblings ...)
2024-08-27 17:54 ` [PATCH v2 17/17] vfio/migration: Multifd device state transfer support - send side Maciej S. Szmigiero
@ 2024-08-28 20:46 ` Fabiano Rosas
2024-08-28 21:58 ` Maciej S. Szmigiero
2024-10-11 13:58 ` Cédric Le Goater
18 siblings, 1 reply; 128+ messages in thread
From: Fabiano Rosas @ 2024-08-28 20:46 UTC (permalink / raw)
To: Maciej S. Szmigiero, Peter Xu
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
"Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>
> This is an updated v2 patch series of the v1 series located here:
> https://lore.kernel.org/qemu-devel/cover.1718717584.git.maciej.szmigiero@oracle.com/
>
> Changes from v1:
> * Extended the QEMU thread-pool with non-AIO (generic) pool support,
> implemented automatic memory management support for its work element
> function argument.
>
> * Introduced a multifd device state save thread pool, ported the VFIO
> multifd device state save implementation to use this thread pool instead
> of VFIO internally managed individual threads.
>
> * Re-implemented on top of Fabiano's v4 multifd sender refactor patch set from
> https://lore.kernel.org/qemu-devel/20240823173911.6712-1-farosas@suse.de/
>
> * Moved device state related multifd code to new multifd-device-state.c
> file where it made sense.
>
> * Implemented a max in-flight VFIO device state buffer count limit to
> allow capping the maximum recipient memory usage.
>
> * Removed unnecessary explicit memory barriers from multifd_send().
>
> * A few small changes like updated comments, code formatting,
> fixed zero-copy RAM multifd bytes transferred counter under-counting, etc.
>
>
> For convenience, this patch set is also available as a git tree:
> https://github.com/maciejsszmigiero/qemu/tree/multifd-device-state-transfer-vfio
With this branch I'm getting:
$ QTEST_QEMU_BINARY=./qemu-system-x86_64 ./tests/qtest/migration-test -p /x86_64/migration/multifd/tcp/uri/plain/none
...
qemu-system-x86_64: ../util/thread-pool.c:354: thread_pool_set_minmax_threads: Assertion `max_threads > 0' failed.
Broken pipe
$ ./tests/qemu-iotests/check -p -qcow2 068
...
+qemu-system-x86_64: ../util/qemu-thread-posix.c:92: qemu_mutex_lock_impl: Assertion `mutex->initialized' failed.
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [PATCH v2 00/17] Multifd device state transfer support with VFIO consumer
2024-08-28 20:46 ` [PATCH v2 00/17] Multifd device state transfer support with VFIO consumer Fabiano Rosas
@ 2024-08-28 21:58 ` Maciej S. Szmigiero
2024-08-29 0:51 ` Fabiano Rosas
0 siblings, 1 reply; 128+ messages in thread
From: Maciej S. Szmigiero @ 2024-08-28 21:58 UTC (permalink / raw)
To: Fabiano Rosas
Cc: Alex Williamson, Peter Xu, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On 28.08.2024 22:46, Fabiano Rosas wrote:
> "Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:
>
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> This is an updated v2 patch series of the v1 series located here:
>> https://lore.kernel.org/qemu-devel/cover.1718717584.git.maciej.szmigiero@oracle.com/
>>
>> Changes from v1:
>> * Extended the QEMU thread-pool with non-AIO (generic) pool support,
>> implemented automatic memory management support for its work element
>> function argument.
>>
>> * Introduced a multifd device state save thread pool, ported the VFIO
>> multifd device state save implementation to use this thread pool instead
>> of VFIO internally managed individual threads.
>>
>> * Re-implemented on top of Fabiano's v4 multifd sender refactor patch set from
>> https://lore.kernel.org/qemu-devel/20240823173911.6712-1-farosas@suse.de/
>>
>> * Moved device state related multifd code to new multifd-device-state.c
>> file where it made sense.
>>
>> * Implemented a max in-flight VFIO device state buffer count limit to
>> allow capping the maximum recipient memory usage.
>>
>> * Removed unnecessary explicit memory barriers from multifd_send().
>>
>> * A few small changes like updated comments, code formatting,
>> fixed zero-copy RAM multifd bytes transferred counter under-counting, etc.
>>
>>
>> For convenience, this patch set is also available as a git tree:
>> https://github.com/maciejsszmigiero/qemu/tree/multifd-device-state-transfer-vfio
>
> With this branch I'm getting:
>
> $ QTEST_QEMU_BINARY=./qemu-system-x86_64 ./tests/qtest/migration-test -p /x86_64/migration/multifd/tcp/uri/plain/none
> ...
> qemu-system-x86_64: ../util/thread-pool.c:354: thread_pool_set_minmax_threads: Assertion `max_threads > 0' failed.
> Broken pipe
>
Oops, I should have tested this patch set in setups without any VFIO devices too.
Fixed this now (together with that RAM tracepoint thing) and updated the GitHub tree -
the above test now passes.
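(The assertion fires because with no VFIO devices present the computed
thread count is zero, so the fix is presumably to skip sizing the pool in
that case - a sketch only, not the actual commit:

    if (max_threads > 0) {
        thread_pool_set_minmax_threads(pool, 0, max_threads);
    }
)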
Tomorrow I will test the whole multifd VFIO migration once again to be sure.
> $ ./tests/qemu-iotests/check -p -qcow2 068
> ...
> +qemu-system-x86_64: ../util/qemu-thread-posix.c:92: qemu_mutex_lock_impl: Assertion `mutex->initialized' failed.
>
I'm not sure how this can happen - it looks like qemu_loadvm_state() might be called
somehow after migration_incoming_state_destroy() already destroyed the migration state?
Will investigate this in detail tomorrow.
By the way, this test seems to not be run by the default "make check".
Thanks,
Maciej
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [PATCH v2 12/17] migration/multifd: Device state transfer support - send side
2024-08-27 17:54 ` [PATCH v2 12/17] migration/multifd: Device state transfer support - send side Maciej S. Szmigiero
@ 2024-08-29 0:41 ` Fabiano Rosas
2024-08-29 20:03 ` Maciej S. Szmigiero
2024-09-10 19:48 ` Peter Xu
2024-09-10 16:06 ` Peter Xu
1 sibling, 2 replies; 128+ messages in thread
From: Fabiano Rosas @ 2024-08-29 0:41 UTC (permalink / raw)
To: Maciej S. Szmigiero, Peter Xu
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
"Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>
> A new function multifd_queue_device_state() is provided for device to queue
> its state for transmission via a multifd channel.
>
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> ---
> include/migration/misc.h | 4 ++
> migration/meson.build | 1 +
> migration/multifd-device-state.c | 99 ++++++++++++++++++++++++++++++++
> migration/multifd-nocomp.c | 6 +-
> migration/multifd-qpl.c | 2 +-
> migration/multifd-uadk.c | 2 +-
> migration/multifd-zlib.c | 2 +-
> migration/multifd-zstd.c | 2 +-
> migration/multifd.c | 65 +++++++++++++++------
> migration/multifd.h | 29 +++++++++-
> 10 files changed, 184 insertions(+), 28 deletions(-)
> create mode 100644 migration/multifd-device-state.c
>
> diff --git a/include/migration/misc.h b/include/migration/misc.h
> index bfadc5613bac..7266b1b77d1f 100644
> --- a/include/migration/misc.h
> +++ b/include/migration/misc.h
> @@ -111,4 +111,8 @@ bool migration_in_bg_snapshot(void);
> /* migration/block-dirty-bitmap.c */
> void dirty_bitmap_mig_init(void);
>
> +/* migration/multifd-device-state.c */
> +bool multifd_queue_device_state(char *idstr, uint32_t instance_id,
> + char *data, size_t len);
> +
> #endif
> diff --git a/migration/meson.build b/migration/meson.build
> index 77f3abf08eb1..00853595894f 100644
> --- a/migration/meson.build
> +++ b/migration/meson.build
> @@ -21,6 +21,7 @@ system_ss.add(files(
> 'migration-hmp-cmds.c',
> 'migration.c',
> 'multifd.c',
> + 'multifd-device-state.c',
> 'multifd-nocomp.c',
> 'multifd-zlib.c',
> 'multifd-zero-page.c',
> diff --git a/migration/multifd-device-state.c b/migration/multifd-device-state.c
> new file mode 100644
> index 000000000000..c9b44f0b5ab9
> --- /dev/null
> +++ b/migration/multifd-device-state.c
> @@ -0,0 +1,99 @@
> +/*
> + * Multifd device state migration
> + *
> + * Copyright (C) 2024 Oracle and/or its affiliates.
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + */
> +
> +#include "qemu/osdep.h"
> +#include "qemu/lockable.h"
> +#include "migration/misc.h"
> +#include "multifd.h"
> +
> +static QemuMutex queue_job_mutex;
> +
> +static MultiFDSendData *device_state_send;
> +
> +size_t multifd_device_state_payload_size(void)
> +{
> + return sizeof(MultiFDDeviceState_t);
> +}
This will not be necessary because the payload size is the same as the
data type. We only need it for the special case where the MultiFDPages_t
is smaller than the total ram payload size.
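For reference, the ram helper exists only because of the flex array at the
tail of MultiFDPages_t - roughly (sketch from memory, exact expression may
differ):

    size_t multifd_ram_payload_size(void)
    {
        uint32_t n = multifd_ram_page_count();

        /* MultiFDPages_t ends in a flexible array of page offsets */
        return sizeof(MultiFDPages_t) + n * sizeof(ram_addr_t);
    }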
> +
> +void multifd_device_state_save_setup(void)
s/save/send/. The ram ones are only called "save" because they're called
from ram_save_setup(), but we then have the proper nocomp_send_setup
hook.
> +{
> + qemu_mutex_init(&queue_job_mutex);
> +
> + device_state_send = multifd_send_data_alloc();
> +}
> +
> +void multifd_device_state_clear(MultiFDDeviceState_t *device_state)
> +{
> + g_clear_pointer(&device_state->idstr, g_free);
> + g_clear_pointer(&device_state->buf, g_free);
> +}
> +
> +void multifd_device_state_save_cleanup(void)
s/save/send/
> +{
> + g_clear_pointer(&device_state_send, multifd_send_data_free);
> +
> + qemu_mutex_destroy(&queue_job_mutex);
> +}
> +
> +static void multifd_device_state_fill_packet(MultiFDSendParams *p)
> +{
> + MultiFDDeviceState_t *device_state = &p->data->u.device_state;
> + MultiFDPacketDeviceState_t *packet = p->packet_device_state;
> +
> + packet->hdr.flags = cpu_to_be32(p->flags);
> + strncpy(packet->idstr, device_state->idstr, sizeof(packet->idstr));
> + packet->instance_id = cpu_to_be32(device_state->instance_id);
> + packet->next_packet_size = cpu_to_be32(p->next_packet_size);
> +}
> +
> +void multifd_device_state_send_prepare(MultiFDSendParams *p)
> +{
> + MultiFDDeviceState_t *device_state = &p->data->u.device_state;
> +
> + assert(multifd_payload_device_state(p->data));
> +
> + multifd_send_prepare_header_device_state(p);
> +
> + assert(!(p->flags & MULTIFD_FLAG_SYNC));
> +
> + p->next_packet_size = device_state->buf_len;
> + if (p->next_packet_size > 0) {
> + p->iov[p->iovs_num].iov_base = device_state->buf;
> + p->iov[p->iovs_num].iov_len = p->next_packet_size;
> + p->iovs_num++;
> + }
> +
> + p->flags |= MULTIFD_FLAG_NOCOMP | MULTIFD_FLAG_DEVICE_STATE;
> +
> + multifd_device_state_fill_packet(p);
> +}
> +
> +bool multifd_queue_device_state(char *idstr, uint32_t instance_id,
> + char *data, size_t len)
> +{
> + /* Device state submissions can come from multiple threads */
> + QEMU_LOCK_GUARD(&queue_job_mutex);
> + MultiFDDeviceState_t *device_state;
> +
> + assert(multifd_payload_empty(device_state_send));
> +
> + multifd_set_payload_type(device_state_send, MULTIFD_PAYLOAD_DEVICE_STATE);
> + device_state = &device_state_send->u.device_state;
> + device_state->idstr = g_strdup(idstr);
> + device_state->instance_id = instance_id;
> + device_state->buf = g_memdup2(data, len);
> + device_state->buf_len = len;
> +
> + if (!multifd_send(&device_state_send)) {
> + multifd_send_data_clear(device_state_send);
> + return false;
> + }
> +
> + return true;
> +}
> diff --git a/migration/multifd-nocomp.c b/migration/multifd-nocomp.c
> index 39eb77c9b3b7..0b7b543f44db 100644
> --- a/migration/multifd-nocomp.c
> +++ b/migration/multifd-nocomp.c
> @@ -116,13 +116,13 @@ static int multifd_nocomp_send_prepare(MultiFDSendParams *p, Error **errp)
> * Only !zerocopy needs the header in IOV; zerocopy will
> * send it separately.
> */
> - multifd_send_prepare_header(p);
> + multifd_send_prepare_header_ram(p);
> }
>
> multifd_send_prepare_iovs(p);
> p->flags |= MULTIFD_FLAG_NOCOMP;
>
> - multifd_send_fill_packet(p);
> + multifd_send_fill_packet_ram(p);
>
> if (use_zero_copy_send) {
> /* Send header first, without zerocopy */
> @@ -371,7 +371,7 @@ bool multifd_send_prepare_common(MultiFDSendParams *p)
> return false;
> }
>
> - multifd_send_prepare_header(p);
> + multifd_send_prepare_header_ram(p);
>
> return true;
> }
> diff --git a/migration/multifd-qpl.c b/migration/multifd-qpl.c
> index 75041a4c4dfe..bd6b5b6a3868 100644
> --- a/migration/multifd-qpl.c
> +++ b/migration/multifd-qpl.c
> @@ -490,7 +490,7 @@ static int multifd_qpl_send_prepare(MultiFDSendParams *p, Error **errp)
>
> out:
> p->flags |= MULTIFD_FLAG_QPL;
> - multifd_send_fill_packet(p);
> + multifd_send_fill_packet_ram(p);
> return 0;
> }
>
> diff --git a/migration/multifd-uadk.c b/migration/multifd-uadk.c
> index db2549f59bfe..6e2d26010742 100644
> --- a/migration/multifd-uadk.c
> +++ b/migration/multifd-uadk.c
> @@ -198,7 +198,7 @@ static int multifd_uadk_send_prepare(MultiFDSendParams *p, Error **errp)
> }
> out:
> p->flags |= MULTIFD_FLAG_UADK;
> - multifd_send_fill_packet(p);
> + multifd_send_fill_packet_ram(p);
> return 0;
> }
>
> diff --git a/migration/multifd-zlib.c b/migration/multifd-zlib.c
> index 6787538762d2..62a1fe59ad3e 100644
> --- a/migration/multifd-zlib.c
> +++ b/migration/multifd-zlib.c
> @@ -156,7 +156,7 @@ static int multifd_zlib_send_prepare(MultiFDSendParams *p, Error **errp)
>
> out:
> p->flags |= MULTIFD_FLAG_ZLIB;
> - multifd_send_fill_packet(p);
> + multifd_send_fill_packet_ram(p);
> return 0;
> }
>
> diff --git a/migration/multifd-zstd.c b/migration/multifd-zstd.c
> index 1576b1e2adc6..f98b07e7f9f5 100644
> --- a/migration/multifd-zstd.c
> +++ b/migration/multifd-zstd.c
> @@ -143,7 +143,7 @@ static int multifd_zstd_send_prepare(MultiFDSendParams *p, Error **errp)
>
> out:
> p->flags |= MULTIFD_FLAG_ZSTD;
> - multifd_send_fill_packet(p);
> + multifd_send_fill_packet_ram(p);
> return 0;
> }
>
> diff --git a/migration/multifd.c b/migration/multifd.c
> index a74e8a5cc891..bebe5b5a9b9c 100644
> --- a/migration/multifd.c
> +++ b/migration/multifd.c
> @@ -12,6 +12,7 @@
>
> #include "qemu/osdep.h"
> #include "qemu/cutils.h"
> +#include "qemu/iov.h"
> #include "qemu/rcu.h"
> #include "exec/target_page.h"
> #include "sysemu/sysemu.h"
> @@ -19,6 +20,7 @@
> #include "qemu/error-report.h"
> #include "qapi/error.h"
> #include "file.h"
> +#include "migration/misc.h"
> #include "migration.h"
> #include "migration-stats.h"
> #include "savevm.h"
> @@ -107,7 +109,9 @@ MultiFDSendData *multifd_send_data_alloc(void)
> * added to the union in the future are larger than
> * (MultiFDPages_t + flex array).
> */
> - max_payload_size = MAX(multifd_ram_payload_size(), sizeof(MultiFDPayload));
> + max_payload_size = MAX(multifd_ram_payload_size(),
> + multifd_device_state_payload_size());
This is not needed, the sizeof(MultiFDPayload) below already has the
same effect.
> + max_payload_size = MAX(max_payload_size, sizeof(MultiFDPayload));
>
> /*
> * Account for any holes the compiler might insert. We can't pack
> @@ -126,6 +130,9 @@ void multifd_send_data_clear(MultiFDSendData *data)
> }
>
> switch (data->type) {
> + case MULTIFD_PAYLOAD_DEVICE_STATE:
> + multifd_device_state_clear(&data->u.device_state);
> + break;
> default:
> /* Nothing to do */
> break;
> @@ -228,7 +235,7 @@ static int multifd_recv_initial_packet(QIOChannel *c, Error **errp)
> return msg.id;
> }
>
> -void multifd_send_fill_packet(MultiFDSendParams *p)
> +void multifd_send_fill_packet_ram(MultiFDSendParams *p)
Do we need this change if there's no counterpart for device_state? It
might be less confusing to just leave this one as it is.
> {
> MultiFDPacket_t *packet = p->packet;
> uint64_t packet_num;
> @@ -397,20 +404,16 @@ bool multifd_send(MultiFDSendData **send_data)
>
> p = &multifd_send_state->params[i];
> /*
> - * Lockless read to p->pending_job is safe, because only multifd
> - * sender thread can clear it.
> + * Lockless RMW on p->pending_job_preparing is safe, because only multifd
> + * sender thread can clear it after it had seen p->pending_job being set.
> + *
> + * Pairs with qatomic_store_release() in multifd_send_thread().
> */
> - if (qatomic_read(&p->pending_job) == false) {
> + if (qatomic_cmpxchg(&p->pending_job_preparing, false, true) == false) {
What's the motivation for this change? It would be better to have it in
a separate patch with a proper justification.
> break;
> }
> }
>
> - /*
> - * Make sure we read p->pending_job before all the rest. Pairs with
> - * qatomic_store_release() in multifd_send_thread().
> - */
> - smp_mb_acquire();
> -
> assert(multifd_payload_empty(p->data));
>
> /*
> @@ -534,6 +537,7 @@ static bool multifd_send_cleanup_channel(MultiFDSendParams *p, Error **errp)
> p->name = NULL;
> g_clear_pointer(&p->data, multifd_send_data_free);
> p->packet_len = 0;
> + g_clear_pointer(&p->packet_device_state, g_free);
> g_free(p->packet);
> p->packet = NULL;
> multifd_send_state->ops->send_cleanup(p, errp);
> @@ -545,6 +549,7 @@ static void multifd_send_cleanup_state(void)
> {
> file_cleanup_outgoing_migration();
> socket_cleanup_outgoing_migration();
> + multifd_device_state_save_cleanup();
> qemu_sem_destroy(&multifd_send_state->channels_created);
> qemu_sem_destroy(&multifd_send_state->channels_ready);
> g_free(multifd_send_state->params);
> @@ -670,19 +675,29 @@ static void *multifd_send_thread(void *opaque)
> * qatomic_store_release() in multifd_send().
> */
> if (qatomic_load_acquire(&p->pending_job)) {
> + bool is_device_state = multifd_payload_device_state(p->data);
> + size_t total_size;
> +
> p->flags = 0;
> p->iovs_num = 0;
> assert(!multifd_payload_empty(p->data));
>
> - ret = multifd_send_state->ops->send_prepare(p, &local_err);
> - if (ret != 0) {
> - break;
> + if (is_device_state) {
> + multifd_device_state_send_prepare(p);
> + } else {
> + ret = multifd_send_state->ops->send_prepare(p, &local_err);
> + if (ret != 0) {
> + break;
> + }
> }
>
> if (migrate_mapped_ram()) {
> + assert(!is_device_state);
> +
> ret = file_write_ramblock_iov(p->c, p->iov, p->iovs_num,
> &p->data->u.ram, &local_err);
> } else {
> + total_size = iov_size(p->iov, p->iovs_num);
> ret = qio_channel_writev_full_all(p->c, p->iov, p->iovs_num,
> NULL, 0, p->write_flags,
> &local_err);
> @@ -692,18 +707,27 @@ static void *multifd_send_thread(void *opaque)
> break;
> }
>
> - stat64_add(&mig_stats.multifd_bytes,
> - p->next_packet_size + p->packet_len);
> + if (is_device_state) {
> + stat64_add(&mig_stats.multifd_bytes, total_size);
> + } else {
> + /*
> + * Can't just always add total_size since IOVs do not include
> + * packet header in the zerocopy RAM case.
> + */
> + stat64_add(&mig_stats.multifd_bytes,
> + p->next_packet_size + p->packet_len);
You could set total_size for both branches after send_prepare and use it
here unconditionally.
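I.e. roughly (sketch):

    /* after send_prepare, before the write */
    if (is_device_state) {
        total_size = iov_size(p->iov, p->iovs_num);
    } else {
        /* zerocopy RAM sends the packet header outside the IOVs */
        total_size = p->next_packet_size + p->packet_len;
    }
    /* ... once the write succeeds ... */
    stat64_add(&mig_stats.multifd_bytes, total_size);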
> + }
>
> p->next_packet_size = 0;
> multifd_send_data_clear(p->data);
>
> /*
> * Making sure p->data is published before saying "we're
> - * free". Pairs with the smp_mb_acquire() in
> + * free". Pairs with the qatomic_cmpxchg() in
> * multifd_send().
> */
> qatomic_store_release(&p->pending_job, false);
> + qatomic_store_release(&p->pending_job_preparing, false);
> } else {
> /*
> * If not a normal job, must be a sync request. Note that
> @@ -714,7 +738,7 @@ static void *multifd_send_thread(void *opaque)
>
> if (use_packets) {
> p->flags = MULTIFD_FLAG_SYNC;
> - multifd_send_fill_packet(p);
> + multifd_send_fill_packet_ram(p);
> ret = qio_channel_write_all(p->c, (void *)p->packet,
> p->packet_len, &local_err);
> if (ret != 0) {
> @@ -910,6 +934,9 @@ bool multifd_send_setup(void)
> p->packet_len = sizeof(MultiFDPacket_t)
> + sizeof(uint64_t) * page_count;
> p->packet = g_malloc0(p->packet_len);
> + p->packet_device_state = g_malloc0(sizeof(*p->packet_device_state));
> + p->packet_device_state->hdr.magic = cpu_to_be32(MULTIFD_MAGIC);
> + p->packet_device_state->hdr.version = cpu_to_be32(MULTIFD_VERSION);
> }
> p->name = g_strdup_printf("mig/src/send_%d", i);
> p->write_flags = 0;
> @@ -944,6 +971,8 @@ bool multifd_send_setup(void)
> }
> }
>
> + multifd_device_state_save_setup();
> +
> return true;
>
> err:
> diff --git a/migration/multifd.h b/migration/multifd.h
> index a0853622153e..c15c83104c8b 100644
> --- a/migration/multifd.h
> +++ b/migration/multifd.h
> @@ -120,10 +120,12 @@ typedef struct {
> typedef enum {
> MULTIFD_PAYLOAD_NONE,
> MULTIFD_PAYLOAD_RAM,
> + MULTIFD_PAYLOAD_DEVICE_STATE,
> } MultiFDPayloadType;
>
> typedef union MultiFDPayload {
> MultiFDPages_t ram;
> + MultiFDDeviceState_t device_state;
> } MultiFDPayload;
>
> struct MultiFDSendData {
> @@ -136,6 +138,11 @@ static inline bool multifd_payload_empty(MultiFDSendData *data)
> return data->type == MULTIFD_PAYLOAD_NONE;
> }
>
> +static inline bool multifd_payload_device_state(MultiFDSendData *data)
> +{
> + return data->type == MULTIFD_PAYLOAD_DEVICE_STATE;
> +}
> +
> static inline void multifd_set_payload_type(MultiFDSendData *data,
> MultiFDPayloadType type)
> {
> @@ -182,13 +189,15 @@ typedef struct {
> * cleared by the multifd sender threads.
> */
> bool pending_job;
> + bool pending_job_preparing;
> bool pending_sync;
> MultiFDSendData *data;
>
> /* thread local variables. No locking required */
>
> - /* pointer to the packet */
> + /* pointers to the possible packet types */
> MultiFDPacket_t *packet;
> + MultiFDPacketDeviceState_t *packet_device_state;
> /* size of the next packet that contains pages */
> uint32_t next_packet_size;
> /* packets sent through this channel */
> @@ -276,18 +285,25 @@ typedef struct {
> } MultiFDMethods;
>
> void multifd_register_ops(int method, MultiFDMethods *ops);
> -void multifd_send_fill_packet(MultiFDSendParams *p);
> +void multifd_send_fill_packet_ram(MultiFDSendParams *p);
> bool multifd_send_prepare_common(MultiFDSendParams *p);
> void multifd_send_zero_page_detect(MultiFDSendParams *p);
> void multifd_recv_zero_page_process(MultiFDRecvParams *p);
>
> -static inline void multifd_send_prepare_header(MultiFDSendParams *p)
> +static inline void multifd_send_prepare_header_ram(MultiFDSendParams *p)
This could instead go to multifd-nocomp.c and become multifd_ram_prepare_header.
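I.e. just (same body, moved and renamed - sketch):

    /* multifd-nocomp.c */
    void multifd_ram_prepare_header(MultiFDSendParams *p)
    {
        p->iov[0].iov_len = p->packet_len;
        p->iov[0].iov_base = p->packet;
        p->iovs_num++;
    }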
> {
> p->iov[0].iov_len = p->packet_len;
> p->iov[0].iov_base = p->packet;
> p->iovs_num++;
> }
>
> +static inline void multifd_send_prepare_header_device_state(MultiFDSendParams *p)
Seems like this could also move to multifd-device-state.c and drop the
"send" part.
> +{
> + p->iov[0].iov_len = sizeof(*p->packet_device_state);
> + p->iov[0].iov_base = p->packet_device_state;
> + p->iovs_num++;
> +}
> +
> void multifd_channel_connect(MultiFDSendParams *p, QIOChannel *ioc);
> bool multifd_send(MultiFDSendData **send_data);
> MultiFDSendData *multifd_send_data_alloc(void);
> @@ -310,4 +326,11 @@ int multifd_ram_flush_and_sync(void);
> size_t multifd_ram_payload_size(void);
> void multifd_ram_fill_packet(MultiFDSendParams *p);
> int multifd_ram_unfill_packet(MultiFDRecvParams *p, Error **errp);
> +
> +size_t multifd_device_state_payload_size(void);
> +void multifd_device_state_save_setup(void);
> +void multifd_device_state_clear(MultiFDDeviceState_t *device_state);
> +void multifd_device_state_save_cleanup(void);
> +void multifd_device_state_send_prepare(MultiFDSendParams *p);
> +
> #endif
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [PATCH v2 00/17] Multifd device state transfer support with VFIO consumer
2024-08-28 21:58 ` Maciej S. Szmigiero
@ 2024-08-29 0:51 ` Fabiano Rosas
2024-08-29 20:02 ` Maciej S. Szmigiero
0 siblings, 1 reply; 128+ messages in thread
From: Fabiano Rosas @ 2024-08-29 0:51 UTC (permalink / raw)
To: Maciej S. Szmigiero
Cc: Alex Williamson, Peter Xu, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
"Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:
> On 28.08.2024 22:46, Fabiano Rosas wrote:
>> "Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:
>>
>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>
>>> This is an updated v2 patch series of the v1 series located here:
>>> https://lore.kernel.org/qemu-devel/cover.1718717584.git.maciej.szmigiero@oracle.com/
>>>
>>> Changes from v1:
>>> * Extended the QEMU thread-pool with non-AIO (generic) pool support,
>>> implemented automatic memory management support for its work element
>>> function argument.
>>>
>>> * Introduced a multifd device state save thread pool, ported the VFIO
>>> multifd device state save implementation to use this thread pool instead
>>> of VFIO internally managed individual threads.
>>>
>>> * Re-implemented on top of Fabiano's v4 multifd sender refactor patch set from
>>> https://lore.kernel.org/qemu-devel/20240823173911.6712-1-farosas@suse.de/
>>>
>>> * Moved device state related multifd code to new multifd-device-state.c
>>> file where it made sense.
>>>
>>> * Implemented a max in-flight VFIO device state buffer count limit to
>>> allow capping the maximum recipient memory usage.
>>>
>>> * Removed unnecessary explicit memory barriers from multifd_send().
>>>
>>> * A few small changes like updated comments, code formatting,
>>> fixed zero-copy RAM multifd bytes transferred counter under-counting, etc.
>>>
>>>
>>> For convenience, this patch set is also available as a git tree:
>>> https://github.com/maciejsszmigiero/qemu/tree/multifd-device-state-transfer-vfio
>>
>> With this branch I'm getting:
>>
>> $ QTEST_QEMU_BINARY=./qemu-system-x86_64 ./tests/qtest/migration-test -p /x86_64/migration/multifd/tcp/uri/plain/none
>> ...
>> qemu-system-x86_64: ../util/thread-pool.c:354: thread_pool_set_minmax_threads: Assertion `max_threads > 0' failed.
>> Broken pipe
>>
>
> Oops, I should have tested this patch set in setups without any VFIO devices too.
>
> Fixed this now (together with that RAM tracepoint thing) and updated the GitHub tree -
> the above test now passes.
>
> Tomorrow I will test the whole multifd VFIO migration once again to be sure.
>
>> $ ./tests/qemu-iotests/check -p -qcow2 068
>> ...
>> +qemu-system-x86_64: ../util/qemu-thread-posix.c:92: qemu_mutex_lock_impl: Assertion `mutex->initialized' failed.
>>
>
> I'm not sure how this can happen - it looks like qemu_loadvm_state() might be called
> somehow after migration_incoming_state_destroy() already destroyed the migration state?
> Will investigate this in detail tomorrow.
Usually something breaks and then the cleanup code rushes in and frees
state while other parts are still using it.
We also had issues recently with code not incrementing the migration
state refcount properly:
27eb8499ed ("migration: Fix use-after-free of migration state object")
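The usual shape of that kind of fix is just holding a reference across the
window where cleanup can run (sketch; MigrationState is a QOM object, so
plain object_ref()/object_unref() apply):

    MigrationState *s = migrate_get_current();

    object_ref(OBJECT(s));
    /* ... use s, possibly racing with cleanup ... */
    object_unref(OBJECT(s));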
>
> By the way, this test seems to not be run by the default "make check".
>
> Thanks,
> Maciej
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [PATCH v2 00/17] Multifd device state transfer support with VFIO consumer
2024-08-29 0:51 ` Fabiano Rosas
@ 2024-08-29 20:02 ` Maciej S. Szmigiero
0 siblings, 0 replies; 128+ messages in thread
From: Maciej S. Szmigiero @ 2024-08-29 20:02 UTC (permalink / raw)
To: Fabiano Rosas
Cc: Alex Williamson, Peter Xu, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On 29.08.2024 02:51, Fabiano Rosas wrote:
> "Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:
>
>> On 28.08.2024 22:46, Fabiano Rosas wrote:
>>> "Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:
>>>
>>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>>
>>>> This is an updated v2 patch series of the v1 series located here:
>>>> https://lore.kernel.org/qemu-devel/cover.1718717584.git.maciej.szmigiero@oracle.com/
>>>>
>>>> Changes from v1:
>>>> * Extended the QEMU thread-pool with non-AIO (generic) pool support,
>>>> implemented automatic memory management support for its work element
>>>> function argument.
>>>>
>>>> * Introduced a multifd device state save thread pool, ported the VFIO
>>>> multifd device state save implementation to use this thread pool instead
>>>> of VFIO internally managed individual threads.
>>>>
>>>> * Re-implemented on top of Fabiano's v4 multifd sender refactor patch set from
>>>> https://lore.kernel.org/qemu-devel/20240823173911.6712-1-farosas@suse.de/
>>>>
>>>> * Moved device state related multifd code to new multifd-device-state.c
>>>> file where it made sense.
>>>>
>>>> * Implemented a max in-flight VFIO device state buffer count limit to
>>>> allow capping the maximum recipient memory usage.
>>>>
>>>> * Removed unnecessary explicit memory barriers from multifd_send().
>>>>
>>>> * A few small changes like updated comments, code formatting,
>>>> fixed zero-copy RAM multifd bytes transferred counter under-counting, etc.
>>>>
>>>>
>>>> For convenience, this patch set is also available as a git tree:
>>>> https://github.com/maciejsszmigiero/qemu/tree/multifd-device-state-transfer-vfio
>>>
>>> With this branch I'm getting:
>>>
(..)
>>> $ ./tests/qemu-iotests/check -p -qcow2 068
>>> ...
>>> +qemu-system-x86_64: ../util/qemu-thread-posix.c:92: qemu_mutex_lock_impl: Assertion `mutex->initialized' failed.
>>>
>>
>> I'm not sure how this can happen - it looks like qemu_loadvm_state() might be called
>> somehow after migration_incoming_state_destroy() already destroyed the migration state?
>> Will investigate this in detail tomorrow.
>
> Usually something breaks and then the clean up code rushes and frees
> state while other parts are still using it.
>
> We also had issues recently with code not incrementing the migration
> state refcount properly:
>
> 27eb8499ed ("migration: Fix use-after-free of migration state object")
Looks like MigrationIncomingState is just for "true" incoming migration,
which can be started only once - so it is destroyed after the first
attempt and never reinitialized.
On the other hand, MigrationState is used for both true incoming migration
and for snapshot load - the latter of which can be started multiple
times.
Moved these variables to MigrationState, updated the GitHub tree and now
this test passes.
Thanks,
Maciej
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [PATCH v2 12/17] migration/multifd: Device state transfer support - send side
2024-08-29 0:41 ` Fabiano Rosas
@ 2024-08-29 20:03 ` Maciej S. Szmigiero
2024-08-30 13:02 ` Fabiano Rosas
2024-09-10 19:48 ` Peter Xu
1 sibling, 1 reply; 128+ messages in thread
From: Maciej S. Szmigiero @ 2024-08-29 20:03 UTC (permalink / raw)
To: Fabiano Rosas
Cc: Alex Williamson, Peter Xu, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On 29.08.2024 02:41, Fabiano Rosas wrote:
> "Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:
>
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> A new function multifd_queue_device_state() is provided for device to queue
>> its state for transmission via a multifd channel.
>>
>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>> ---
>> include/migration/misc.h | 4 ++
>> migration/meson.build | 1 +
>> migration/multifd-device-state.c | 99 ++++++++++++++++++++++++++++++++
>> migration/multifd-nocomp.c | 6 +-
>> migration/multifd-qpl.c | 2 +-
>> migration/multifd-uadk.c | 2 +-
>> migration/multifd-zlib.c | 2 +-
>> migration/multifd-zstd.c | 2 +-
>> migration/multifd.c | 65 +++++++++++++++------
>> migration/multifd.h | 29 +++++++++-
>> 10 files changed, 184 insertions(+), 28 deletions(-)
>> create mode 100644 migration/multifd-device-state.c
>>
>> diff --git a/include/migration/misc.h b/include/migration/misc.h
>> index bfadc5613bac..7266b1b77d1f 100644
>> --- a/include/migration/misc.h
>> +++ b/include/migration/misc.h
>> @@ -111,4 +111,8 @@ bool migration_in_bg_snapshot(void);
>> /* migration/block-dirty-bitmap.c */
>> void dirty_bitmap_mig_init(void);
>>
>> +/* migration/multifd-device-state.c */
>> +bool multifd_queue_device_state(char *idstr, uint32_t instance_id,
>> + char *data, size_t len);
>> +
>> #endif
>> diff --git a/migration/meson.build b/migration/meson.build
>> index 77f3abf08eb1..00853595894f 100644
>> --- a/migration/meson.build
>> +++ b/migration/meson.build
>> @@ -21,6 +21,7 @@ system_ss.add(files(
>> 'migration-hmp-cmds.c',
>> 'migration.c',
>> 'multifd.c',
>> + 'multifd-device-state.c',
>> 'multifd-nocomp.c',
>> 'multifd-zlib.c',
>> 'multifd-zero-page.c',
>> diff --git a/migration/multifd-device-state.c b/migration/multifd-device-state.c
>> new file mode 100644
>> index 000000000000..c9b44f0b5ab9
>> --- /dev/null
>> +++ b/migration/multifd-device-state.c
>> @@ -0,0 +1,99 @@
>> +/*
>> + * Multifd device state migration
>> + *
>> + * Copyright (C) 2024 Oracle and/or its affiliates.
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
>> + * See the COPYING file in the top-level directory.
>> + */
>> +
>> +#include "qemu/osdep.h"
>> +#include "qemu/lockable.h"
>> +#include "migration/misc.h"
>> +#include "multifd.h"
>> +
>> +static QemuMutex queue_job_mutex;
>> +
>> +static MultiFDSendData *device_state_send;
>> +
>> +size_t multifd_device_state_payload_size(void)
>> +{
>> + return sizeof(MultiFDDeviceState_t);
>> +}
>
> This will not be necessary because the payload size is the same as the
> data type. We only need it for the special case where the MultiFDPages_t
> is smaller than the total ram payload size.
I know - I just wanted to make the API consistent with the one the RAM
handler provides, since these multifd_send_data_alloc() calls are done
just once per migration - this isn't any kind of hot path.
>> +
>> +void multifd_device_state_save_setup(void)
>
> s/save/send/. The ram ones are only called "save" because they're called
> from ram_save_setup(), but we then have the proper nocomp_send_setup
> hook.
Ack.
>> +{
>> + qemu_mutex_init(&queue_job_mutex);
>> +
>> + device_state_send = multifd_send_data_alloc();
>> +}
>> +
>> +void multifd_device_state_clear(MultiFDDeviceState_t *device_state)
>> +{
>> + g_clear_pointer(&device_state->idstr, g_free);
>> + g_clear_pointer(&device_state->buf, g_free);
>> +}
>> +
>> +void multifd_device_state_save_cleanup(void)
>
> s/save/send/
Ack.
>> +{
>> + g_clear_pointer(&device_state_send, multifd_send_data_free);
>> +
>> + qemu_mutex_destroy(&queue_job_mutex);
>> +}
>> +
>> +static void multifd_device_state_fill_packet(MultiFDSendParams *p)
>> +{
>> + MultiFDDeviceState_t *device_state = &p->data->u.device_state;
>> + MultiFDPacketDeviceState_t *packet = p->packet_device_state;
>> +
>> + packet->hdr.flags = cpu_to_be32(p->flags);
>> + strncpy(packet->idstr, device_state->idstr, sizeof(packet->idstr));
>> + packet->instance_id = cpu_to_be32(device_state->instance_id);
>> + packet->next_packet_size = cpu_to_be32(p->next_packet_size);
>> +}
>> +
>> +void multifd_device_state_send_prepare(MultiFDSendParams *p)
>> +{
>> + MultiFDDeviceState_t *device_state = &p->data->u.device_state;
>> +
>> + assert(multifd_payload_device_state(p->data));
>> +
>> + multifd_send_prepare_header_device_state(p);
>> +
>> + assert(!(p->flags & MULTIFD_FLAG_SYNC));
>> +
>> + p->next_packet_size = device_state->buf_len;
>> + if (p->next_packet_size > 0) {
>> + p->iov[p->iovs_num].iov_base = device_state->buf;
>> + p->iov[p->iovs_num].iov_len = p->next_packet_size;
>> + p->iovs_num++;
>> + }
>> +
>> + p->flags |= MULTIFD_FLAG_NOCOMP | MULTIFD_FLAG_DEVICE_STATE;
>> +
>> + multifd_device_state_fill_packet(p);
>> +}
>> +
>> +bool multifd_queue_device_state(char *idstr, uint32_t instance_id,
>> + char *data, size_t len)
>> +{
>> + /* Device state submissions can come from multiple threads */
>> + QEMU_LOCK_GUARD(&queue_job_mutex);
>> + MultiFDDeviceState_t *device_state;
>> +
>> + assert(multifd_payload_empty(device_state_send));
>> +
>> + multifd_set_payload_type(device_state_send, MULTIFD_PAYLOAD_DEVICE_STATE);
>> + device_state = &device_state_send->u.device_state;
>> + device_state->idstr = g_strdup(idstr);
>> + device_state->instance_id = instance_id;
>> + device_state->buf = g_memdup2(data, len);
>> + device_state->buf_len = len;
>> +
>> + if (!multifd_send(&device_state_send)) {
>> + multifd_send_data_clear(device_state_send);
>> + return false;
>> + }
>> +
>> + return true;
>> +}
>> diff --git a/migration/multifd-nocomp.c b/migration/multifd-nocomp.c
>> index 39eb77c9b3b7..0b7b543f44db 100644
>> --- a/migration/multifd-nocomp.c
>> +++ b/migration/multifd-nocomp.c
>> @@ -116,13 +116,13 @@ static int multifd_nocomp_send_prepare(MultiFDSendParams *p, Error **errp)
>> * Only !zerocopy needs the header in IOV; zerocopy will
>> * send it separately.
>> */
>> - multifd_send_prepare_header(p);
>> + multifd_send_prepare_header_ram(p);
>> }
>>
>> multifd_send_prepare_iovs(p);
>> p->flags |= MULTIFD_FLAG_NOCOMP;
>>
>> - multifd_send_fill_packet(p);
>> + multifd_send_fill_packet_ram(p);
>>
>> if (use_zero_copy_send) {
>> /* Send header first, without zerocopy */
>> @@ -371,7 +371,7 @@ bool multifd_send_prepare_common(MultiFDSendParams *p)
>> return false;
>> }
>>
>> - multifd_send_prepare_header(p);
>> + multifd_send_prepare_header_ram(p);
>>
>> return true;
>> }
>> diff --git a/migration/multifd-qpl.c b/migration/multifd-qpl.c
>> index 75041a4c4dfe..bd6b5b6a3868 100644
>> --- a/migration/multifd-qpl.c
>> +++ b/migration/multifd-qpl.c
>> @@ -490,7 +490,7 @@ static int multifd_qpl_send_prepare(MultiFDSendParams *p, Error **errp)
>>
>> out:
>> p->flags |= MULTIFD_FLAG_QPL;
>> - multifd_send_fill_packet(p);
>> + multifd_send_fill_packet_ram(p);
>> return 0;
>> }
>>
>> diff --git a/migration/multifd-uadk.c b/migration/multifd-uadk.c
>> index db2549f59bfe..6e2d26010742 100644
>> --- a/migration/multifd-uadk.c
>> +++ b/migration/multifd-uadk.c
>> @@ -198,7 +198,7 @@ static int multifd_uadk_send_prepare(MultiFDSendParams *p, Error **errp)
>> }
>> out:
>> p->flags |= MULTIFD_FLAG_UADK;
>> - multifd_send_fill_packet(p);
>> + multifd_send_fill_packet_ram(p);
>> return 0;
>> }
>>
>> diff --git a/migration/multifd-zlib.c b/migration/multifd-zlib.c
>> index 6787538762d2..62a1fe59ad3e 100644
>> --- a/migration/multifd-zlib.c
>> +++ b/migration/multifd-zlib.c
>> @@ -156,7 +156,7 @@ static int multifd_zlib_send_prepare(MultiFDSendParams *p, Error **errp)
>>
>> out:
>> p->flags |= MULTIFD_FLAG_ZLIB;
>> - multifd_send_fill_packet(p);
>> + multifd_send_fill_packet_ram(p);
>> return 0;
>> }
>>
>> diff --git a/migration/multifd-zstd.c b/migration/multifd-zstd.c
>> index 1576b1e2adc6..f98b07e7f9f5 100644
>> --- a/migration/multifd-zstd.c
>> +++ b/migration/multifd-zstd.c
>> @@ -143,7 +143,7 @@ static int multifd_zstd_send_prepare(MultiFDSendParams *p, Error **errp)
>>
>> out:
>> p->flags |= MULTIFD_FLAG_ZSTD;
>> - multifd_send_fill_packet(p);
>> + multifd_send_fill_packet_ram(p);
>> return 0;
>> }
>>
>> diff --git a/migration/multifd.c b/migration/multifd.c
>> index a74e8a5cc891..bebe5b5a9b9c 100644
>> --- a/migration/multifd.c
>> +++ b/migration/multifd.c
>> @@ -12,6 +12,7 @@
>>
>> #include "qemu/osdep.h"
>> #include "qemu/cutils.h"
>> +#include "qemu/iov.h"
>> #include "qemu/rcu.h"
>> #include "exec/target_page.h"
>> #include "sysemu/sysemu.h"
>> @@ -19,6 +20,7 @@
>> #include "qemu/error-report.h"
>> #include "qapi/error.h"
>> #include "file.h"
>> +#include "migration/misc.h"
>> #include "migration.h"
>> #include "migration-stats.h"
>> #include "savevm.h"
>> @@ -107,7 +109,9 @@ MultiFDSendData *multifd_send_data_alloc(void)
>> * added to the union in the future are larger than
>> * (MultiFDPages_t + flex array).
>> */
>> - max_payload_size = MAX(multifd_ram_payload_size(), sizeof(MultiFDPayload));
>> + max_payload_size = MAX(multifd_ram_payload_size(),
>> + multifd_device_state_payload_size());
>
> This is not needed, the sizeof(MultiFDPayload) below already has the
> same effect.
Same as above, I think it's good for consistency, but I don't
mind removing it either (maybe by replacing it with a comment
describing that it isn't currently needed).
>> + max_payload_size = MAX(max_payload_size, sizeof(MultiFDPayload));
>>
>> /*
>> * Account for any holes the compiler might insert. We can't pack
>> @@ -126,6 +130,9 @@ void multifd_send_data_clear(MultiFDSendData *data)
>> }
>>
>> switch (data->type) {
>> + case MULTIFD_PAYLOAD_DEVICE_STATE:
>> + multifd_device_state_clear(&data->u.device_state);
>> + break;
>> default:
>> /* Nothing to do */
>> break;
>> @@ -228,7 +235,7 @@ static int multifd_recv_initial_packet(QIOChannel *c, Error **errp)
>> return msg.id;
>> }
>>
>> -void multifd_send_fill_packet(MultiFDSendParams *p)
>> +void multifd_send_fill_packet_ram(MultiFDSendParams *p)
>
> Do we need this change if there's no counterpart for device_state? It
> might be less confusing to just leave this one as it is.
Not really, will drop this change in the next patch set version.
>> {
>> MultiFDPacket_t *packet = p->packet;
>> uint64_t packet_num;
>> @@ -397,20 +404,16 @@ bool multifd_send(MultiFDSendData **send_data)
>>
>> p = &multifd_send_state->params[i];
>> /*
>> - * Lockless read to p->pending_job is safe, because only multifd
>> - * sender thread can clear it.
>> + * Lockless RMW on p->pending_job_preparing is safe, because only multifd
>> + * sender thread can clear it after it had seen p->pending_job being set.
>> + *
>> + * Pairs with qatomic_store_release() in multifd_send_thread().
>> */
>> - if (qatomic_read(&p->pending_job) == false) {
>> + if (qatomic_cmpxchg(&p->pending_job_preparing, false, true) == false) {
>
> What's the motivation for this change? It would be better to have it in
> a separate patch with a proper justification.
The original RFC patch set used dedicated device state multifd channels.
Peter and other people wanted this functionality removed; however, removing
it caused a performance (downtime) regression.
One of the things that seemed to help mitigate this regression was making
the multifd channel selection fairer via this change.
But I can split it out into a separate commit in the next patch set version
and then see what performance improvement it currently brings.
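FWIW, the difference is easiest to see distilled (illustrative sketch, not
the actual code) - once there is more than one producer, the plain read is
no longer a safe claim:

    /* racy with two producers: both can observe false... */
    if (qatomic_read(&p->pending_job) == false) {
        /* ...and both proceed to fill the same channel's p->data */
    }

    /* the RMW claim is atomic: exactly one caller sees false */
    if (qatomic_cmpxchg(&p->pending_job_preparing, false, true) == false) {
        /* only the winner fills p->data */
    }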
>> break;
>> }
>> }
>>
>> - /*
>> - * Make sure we read p->pending_job before all the rest. Pairs with
>> - * qatomic_store_release() in multifd_send_thread().
>> - */
>> - smp_mb_acquire();
>> -
>> assert(multifd_payload_empty(p->data));
>>
>> /*
>> @@ -534,6 +537,7 @@ static bool multifd_send_cleanup_channel(MultiFDSendParams *p, Error **errp)
>> p->name = NULL;
>> g_clear_pointer(&p->data, multifd_send_data_free);
>> p->packet_len = 0;
>> + g_clear_pointer(&p->packet_device_state, g_free);
>> g_free(p->packet);
>> p->packet = NULL;
>> multifd_send_state->ops->send_cleanup(p, errp);
>> @@ -545,6 +549,7 @@ static void multifd_send_cleanup_state(void)
>> {
>> file_cleanup_outgoing_migration();
>> socket_cleanup_outgoing_migration();
>> + multifd_device_state_save_cleanup();
>> qemu_sem_destroy(&multifd_send_state->channels_created);
>> qemu_sem_destroy(&multifd_send_state->channels_ready);
>> g_free(multifd_send_state->params);
>> @@ -670,19 +675,29 @@ static void *multifd_send_thread(void *opaque)
>> * qatomic_store_release() in multifd_send().
>> */
>> if (qatomic_load_acquire(&p->pending_job)) {
>> + bool is_device_state = multifd_payload_device_state(p->data);
>> + size_t total_size;
>> +
>> p->flags = 0;
>> p->iovs_num = 0;
>> assert(!multifd_payload_empty(p->data));
>>
>> - ret = multifd_send_state->ops->send_prepare(p, &local_err);
>> - if (ret != 0) {
>> - break;
>> + if (is_device_state) {
>> + multifd_device_state_send_prepare(p);
>> + } else {
>> + ret = multifd_send_state->ops->send_prepare(p, &local_err);
>> + if (ret != 0) {
>> + break;
>> + }
>> }
>>
>> if (migrate_mapped_ram()) {
>> + assert(!is_device_state);
>> +
>> ret = file_write_ramblock_iov(p->c, p->iov, p->iovs_num,
>> &p->data->u.ram, &local_err);
>> } else {
>> + total_size = iov_size(p->iov, p->iovs_num);
>> ret = qio_channel_writev_full_all(p->c, p->iov, p->iovs_num,
>> NULL, 0, p->write_flags,
>> &local_err);
>> @@ -692,18 +707,27 @@ static void *multifd_send_thread(void *opaque)
>> break;
>> }
>>
>> - stat64_add(&mig_stats.multifd_bytes,
>> - p->next_packet_size + p->packet_len);
>> + if (is_device_state) {
>> + stat64_add(&mig_stats.multifd_bytes, total_size);
>> + } else {
>> + /*
>> + * Can't just always add total_size since IOVs do not include
>> + * packet header in the zerocopy RAM case.
>> + */
>> + stat64_add(&mig_stats.multifd_bytes,
>> + p->next_packet_size + p->packet_len);
>
> You could set total_size for both branches after send_prepare and use it
> here unconditionally.
Ack.
>> + }
>>
>> p->next_packet_size = 0;
>> multifd_send_data_clear(p->data);
>>
>> /*
>> * Making sure p->data is published before saying "we're
>> - * free". Pairs with the smp_mb_acquire() in
>> + * free". Pairs with the qatomic_cmpxchg() in
>> * multifd_send().
>> */
>> qatomic_store_release(&p->pending_job, false);
>> + qatomic_store_release(&p->pending_job_preparing, false);
>> } else {
>> /*
>> * If not a normal job, must be a sync request. Note that
>> @@ -714,7 +738,7 @@ static void *multifd_send_thread(void *opaque)
>>
>> if (use_packets) {
>> p->flags = MULTIFD_FLAG_SYNC;
>> - multifd_send_fill_packet(p);
>> + multifd_send_fill_packet_ram(p);
>> ret = qio_channel_write_all(p->c, (void *)p->packet,
>> p->packet_len, &local_err);
>> if (ret != 0) {
>> @@ -910,6 +934,9 @@ bool multifd_send_setup(void)
>> p->packet_len = sizeof(MultiFDPacket_t)
>> + sizeof(uint64_t) * page_count;
>> p->packet = g_malloc0(p->packet_len);
>> + p->packet_device_state = g_malloc0(sizeof(*p->packet_device_state));
>> + p->packet_device_state->hdr.magic = cpu_to_be32(MULTIFD_MAGIC);
>> + p->packet_device_state->hdr.version = cpu_to_be32(MULTIFD_VERSION);
>> }
>> p->name = g_strdup_printf("mig/src/send_%d", i);
>> p->write_flags = 0;
>> @@ -944,6 +971,8 @@ bool multifd_send_setup(void)
>> }
>> }
>>
>> + multifd_device_state_save_setup();
>> +
>> return true;
>>
>> err:
>> diff --git a/migration/multifd.h b/migration/multifd.h
>> index a0853622153e..c15c83104c8b 100644
>> --- a/migration/multifd.h
>> +++ b/migration/multifd.h
>> @@ -120,10 +120,12 @@ typedef struct {
>> typedef enum {
>> MULTIFD_PAYLOAD_NONE,
>> MULTIFD_PAYLOAD_RAM,
>> + MULTIFD_PAYLOAD_DEVICE_STATE,
>> } MultiFDPayloadType;
>>
>> typedef union MultiFDPayload {
>> MultiFDPages_t ram;
>> + MultiFDDeviceState_t device_state;
>> } MultiFDPayload;
>>
>> struct MultiFDSendData {
>> @@ -136,6 +138,11 @@ static inline bool multifd_payload_empty(MultiFDSendData *data)
>> return data->type == MULTIFD_PAYLOAD_NONE;
>> }
>>
>> +static inline bool multifd_payload_device_state(MultiFDSendData *data)
>> +{
>> + return data->type == MULTIFD_PAYLOAD_DEVICE_STATE;
>> +}
>> +
>> static inline void multifd_set_payload_type(MultiFDSendData *data,
>> MultiFDPayloadType type)
>> {
>> @@ -182,13 +189,15 @@ typedef struct {
>> * cleared by the multifd sender threads.
>> */
>> bool pending_job;
>> + bool pending_job_preparing;
>> bool pending_sync;
>> MultiFDSendData *data;
>>
>> /* thread local variables. No locking required */
>>
>> - /* pointer to the packet */
>> + /* pointers to the possible packet types */
>> MultiFDPacket_t *packet;
>> + MultiFDPacketDeviceState_t *packet_device_state;
>> /* size of the next packet that contains pages */
>> uint32_t next_packet_size;
>> /* packets sent through this channel */
>> @@ -276,18 +285,25 @@ typedef struct {
>> } MultiFDMethods;
>>
>> void multifd_register_ops(int method, MultiFDMethods *ops);
>> -void multifd_send_fill_packet(MultiFDSendParams *p);
>> +void multifd_send_fill_packet_ram(MultiFDSendParams *p);
>> bool multifd_send_prepare_common(MultiFDSendParams *p);
>> void multifd_send_zero_page_detect(MultiFDSendParams *p);
>> void multifd_recv_zero_page_process(MultiFDRecvParams *p);
>>
>> -static inline void multifd_send_prepare_header(MultiFDSendParams *p)
>> +static inline void multifd_send_prepare_header_ram(MultiFDSendParams *p)
>
> This could instead go to multifd-nocomp.c and become multifd_ram_prepare_header.
Ack.
>> {
>> p->iov[0].iov_len = p->packet_len;
>> p->iov[0].iov_base = p->packet;
>> p->iovs_num++;
>> }
>>
>> +static inline void multifd_send_prepare_header_device_state(MultiFDSendParams *p)
>
> Seems like this could also move to multifd-device-state.c and drop the
> "send" part.
>
Ack.
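So, roughly (sketch; exact naming to be decided):

    /* in migration/multifd-device-state.c: */
    static void multifd_device_state_prepare_header(MultiFDSendParams *p)
    {
        p->iov[0].iov_len = sizeof(*p->packet_device_state);
        p->iov[0].iov_base = p->packet_device_state;
        p->iovs_num++;
    }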
Thanks,
Maciej
* Re: [PATCH v2 12/17] migration/multifd: Device state transfer support - send side
2024-08-29 20:03 ` Maciej S. Szmigiero
@ 2024-08-30 13:02 ` Fabiano Rosas
2024-09-09 19:40 ` Peter Xu
0 siblings, 1 reply; 128+ messages in thread
From: Fabiano Rosas @ 2024-08-30 13:02 UTC (permalink / raw)
To: Maciej S. Szmigiero
Cc: Alex Williamson, Peter Xu, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
"Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:
> On 29.08.2024 02:41, Fabiano Rosas wrote:
>> "Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:
>>
>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>
>>> A new function multifd_queue_device_state() is provided for a device to queue
>>> its state for transmission via a multifd channel.
>>>
>>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>>> ---
>>> include/migration/misc.h | 4 ++
>>> migration/meson.build | 1 +
>>> migration/multifd-device-state.c | 99 ++++++++++++++++++++++++++++++++
>>> migration/multifd-nocomp.c | 6 +-
>>> migration/multifd-qpl.c | 2 +-
>>> migration/multifd-uadk.c | 2 +-
>>> migration/multifd-zlib.c | 2 +-
>>> migration/multifd-zstd.c | 2 +-
>>> migration/multifd.c | 65 +++++++++++++++------
>>> migration/multifd.h | 29 +++++++++-
>>> 10 files changed, 184 insertions(+), 28 deletions(-)
>>> create mode 100644 migration/multifd-device-state.c
>>>
>>> diff --git a/include/migration/misc.h b/include/migration/misc.h
>>> index bfadc5613bac..7266b1b77d1f 100644
>>> --- a/include/migration/misc.h
>>> +++ b/include/migration/misc.h
>>> @@ -111,4 +111,8 @@ bool migration_in_bg_snapshot(void);
>>> /* migration/block-dirty-bitmap.c */
>>> void dirty_bitmap_mig_init(void);
>>>
>>> +/* migration/multifd-device-state.c */
>>> +bool multifd_queue_device_state(char *idstr, uint32_t instance_id,
>>> + char *data, size_t len);
>>> +
>>> #endif
>>> diff --git a/migration/meson.build b/migration/meson.build
>>> index 77f3abf08eb1..00853595894f 100644
>>> --- a/migration/meson.build
>>> +++ b/migration/meson.build
>>> @@ -21,6 +21,7 @@ system_ss.add(files(
>>> 'migration-hmp-cmds.c',
>>> 'migration.c',
>>> 'multifd.c',
>>> + 'multifd-device-state.c',
>>> 'multifd-nocomp.c',
>>> 'multifd-zlib.c',
>>> 'multifd-zero-page.c',
>>> diff --git a/migration/multifd-device-state.c b/migration/multifd-device-state.c
>>> new file mode 100644
>>> index 000000000000..c9b44f0b5ab9
>>> --- /dev/null
>>> +++ b/migration/multifd-device-state.c
>>> @@ -0,0 +1,99 @@
>>> +/*
>>> + * Multifd device state migration
>>> + *
>>> + * Copyright (C) 2024 Oracle and/or its affiliates.
>>> + *
>>> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
>>> + * See the COPYING file in the top-level directory.
>>> + */
>>> +
>>> +#include "qemu/osdep.h"
>>> +#include "qemu/lockable.h"
>>> +#include "migration/misc.h"
>>> +#include "multifd.h"
>>> +
>>> +static QemuMutex queue_job_mutex;
>>> +
>>> +static MultiFDSendData *device_state_send;
>>> +
>>> +size_t multifd_device_state_payload_size(void)
>>> +{
>>> + return sizeof(MultiFDDeviceState_t);
>>> +}
>>
>> This will not be necessary because the payload size is the same as the
>> data type. We only need it for the special case where the MultiFDPages_t
>> is smaller than the total ram payload size.
>
> I know - I just wanted to make the API consistent with the one the RAM
> handler provides, since these multifd_send_data_alloc() calls are done
> just once per migration - it isn't any kind of hot path.
>
I think the array at the end of MultiFDPages_t should be considered
enough of a hack that we might want to keep anything related to it
outside of the interface. Other clients shouldn't have to think about
that at all.
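For reference, the RAM side needs a runtime size helper at all only because
of that trailing array - roughly (sketch from memory, not the exact code):

    size_t multifd_ram_payload_size(void)
    {
        uint32_t n = multifd_ram_page_count();

        /* MultiFDPages_t ends in a flexible ram_addr_t offset[] array */
        return sizeof(MultiFDPages_t) + n * sizeof(ram_addr_t);
    }

A fixed-size payload such as MultiFDDeviceState_t can be sized with a plain
sizeof() instead.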
>>> @@ -397,20 +404,16 @@ bool multifd_send(MultiFDSendData **send_data)
>>>
>>> p = &multifd_send_state->params[i];
>>> /*
>>> - * Lockless read to p->pending_job is safe, because only multifd
>>> - * sender thread can clear it.
>>> + * Lockless RMW on p->pending_job_preparing is safe, because only multifd
>>> + * sender thread can clear it after it had seen p->pending_job being set.
>>> + *
>>> + * Pairs with qatomic_store_release() in multifd_send_thread().
>>> */
>>> - if (qatomic_read(&p->pending_job) == false) {
>>> + if (qatomic_cmpxchg(&p->pending_job_preparing, false, true) == false) {
>>
>> What's the motivation for this change? It would be better to have it in
>> a separate patch with a proper justification.
>
> The original RFC patch set used dedicated device state multifd channels.
>
> Peter and other people wanted this functionality removed; however, this
> caused a performance (downtime) regression.
>
> One of the things that seemed to help mitigate this regression was making
> the multifd channel selection more fair via this change.
>
> But I can split it out into a separate commit in the next patch set version
> and then see what performance improvement it currently brings.
Yes, better to have it separate, if anything for documentation of the
rationale.
* Re: [PATCH v2 11/17] migration/multifd: Add an explicit MultiFDSendData destructor
2024-08-27 17:54 ` [PATCH v2 11/17] migration/multifd: Add an explicit MultiFDSendData destructor Maciej S. Szmigiero
@ 2024-08-30 13:12 ` Fabiano Rosas
0 siblings, 0 replies; 128+ messages in thread
From: Fabiano Rosas @ 2024-08-30 13:12 UTC (permalink / raw)
To: Maciej S. Szmigiero, Peter Xu
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
"Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>
> This way, if there are fields there that need explicit disposal (like, for
> example, some attached buffers) they will be handled appropriately.
>
> Add a related assert to multifd_set_payload_type() in order to make sure
> that this function is only used to fill a previously empty MultiFDSendData
> with some payload, not the other way around.
>
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
* Re: [PATCH v2 10/17] migration/multifd: Convert multifd_send()::next_channel to atomic
2024-08-27 17:54 ` [PATCH v2 10/17] migration/multifd: Convert multifd_send()::next_channel to atomic Maciej S. Szmigiero
@ 2024-08-30 18:13 ` Fabiano Rosas
2024-09-02 20:11 ` Maciej S. Szmigiero
2024-09-10 14:13 ` Peter Xu
1 sibling, 1 reply; 128+ messages in thread
From: Fabiano Rosas @ 2024-08-30 18:13 UTC (permalink / raw)
To: Maciej S. Szmigiero, Peter Xu
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
"Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>
> This is necessary for multifd_send() to be able to be called
> from multiple threads.
>
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> ---
> migration/multifd.c | 24 ++++++++++++++++++------
> 1 file changed, 18 insertions(+), 6 deletions(-)
>
> diff --git a/migration/multifd.c b/migration/multifd.c
> index d5a8e5a9c9b5..b25789dde0b3 100644
> --- a/migration/multifd.c
> +++ b/migration/multifd.c
> @@ -343,26 +343,38 @@ bool multifd_send(MultiFDSendData **send_data)
> return false;
> }
>
> - /* We wait here, until at least one channel is ready */
> - qemu_sem_wait(&multifd_send_state->channels_ready);
> -
> /*
> * next_channel can remain from a previous migration that was
> * using more channels, so ensure it doesn't overflow if the
> * limit is lower now.
> */
> - next_channel %= migrate_multifd_channels();
> - for (i = next_channel;; i = (i + 1) % migrate_multifd_channels()) {
> + i = qatomic_load_acquire(&next_channel);
> + if (unlikely(i >= migrate_multifd_channels())) {
> + qatomic_cmpxchg(&next_channel, i, 0);
> + }
Do we still need this? It seems not, because the mod down below would
already truncate to a value less than the number of channels. We don't
need it to start at 0 always, the channels are equivalent.
> +
> + /* We wait here, until at least one channel is ready */
> + qemu_sem_wait(&multifd_send_state->channels_ready);
> +
> + while (true) {
> + int i_next;
> +
> if (multifd_send_should_exit()) {
> return false;
> }
> +
> + i = qatomic_load_acquire(&next_channel);
> + i_next = (i + 1) % migrate_multifd_channels();
> + if (qatomic_cmpxchg(&next_channel, i, i_next) != i) {
> + continue;
> + }
Say channel 'i' is the only one that's idle. What's stopping the other
thread(s) from racing at this point and looping around to the same index?
> +
> p = &multifd_send_state->params[i];
> /*
> * Lockless read to p->pending_job is safe, because only multifd
> * sender thread can clear it.
> */
> if (qatomic_read(&p->pending_job) == false) {
With the cmpxchg your other patch adds here, the race I mentioned
above should be harmless. But we'd need to bring that code into this
patch.
> - next_channel = (i + 1) % migrate_multifd_channels();
> break;
> }
> }
* Re: [PATCH v2 13/17] migration/multifd: Add migration_has_device_state_support()
2024-08-27 17:54 ` [PATCH v2 13/17] migration/multifd: Add migration_has_device_state_support() Maciej S. Szmigiero
@ 2024-08-30 18:55 ` Fabiano Rosas
2024-09-02 20:11 ` Maciej S. Szmigiero
0 siblings, 1 reply; 128+ messages in thread
From: Fabiano Rosas @ 2024-08-30 18:55 UTC (permalink / raw)
To: Maciej S. Szmigiero, Peter Xu
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
"Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>
> Since device state transfer via multifd channels requires multifd
> channels with packets and is currently not compatible with multifd
> compression, add an appropriate query function so the device can learn
> whether it can actually make use of it.
>
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Out of curiosity, what do you see as a blocker for migrating to a file?
> We would just need to figure out a mapping from some unit of data to a
> file offset to be able to write in parallel like with ram (where the
> page offset is mapped to the file offset).
> ---
> include/migration/misc.h | 1 +
> migration/multifd-device-state.c | 7 +++++++
> 2 files changed, 8 insertions(+)
>
> diff --git a/include/migration/misc.h b/include/migration/misc.h
> index 7266b1b77d1f..189de6d02ad6 100644
> --- a/include/migration/misc.h
> +++ b/include/migration/misc.h
> @@ -114,5 +114,6 @@ void dirty_bitmap_mig_init(void);
> /* migration/multifd-device-state.c */
> bool multifd_queue_device_state(char *idstr, uint32_t instance_id,
> char *data, size_t len);
> +bool migration_has_device_state_support(void);
>
> #endif
> diff --git a/migration/multifd-device-state.c b/migration/multifd-device-state.c
> index c9b44f0b5ab9..7b34fe736c7f 100644
> --- a/migration/multifd-device-state.c
> +++ b/migration/multifd-device-state.c
> @@ -11,6 +11,7 @@
> #include "qemu/lockable.h"
> #include "migration/misc.h"
> #include "multifd.h"
> +#include "options.h"
>
> static QemuMutex queue_job_mutex;
>
> @@ -97,3 +98,9 @@ bool multifd_queue_device_state(char *idstr, uint32_t instance_id,
>
> return true;
> }
> +
> +bool migration_has_device_state_support(void)
> +{
> + return migrate_multifd() && !migrate_mapped_ram() &&
> + migrate_multifd_compression() == MULTIFD_COMPRESSION_NONE;
> +}
* Re: [PATCH v2 07/17] migration: Add qemu_loadvm_load_state_buffer() and its handler
2024-08-27 17:54 ` [PATCH v2 07/17] migration: Add qemu_loadvm_load_state_buffer() and its handler Maciej S. Szmigiero
@ 2024-08-30 19:05 ` Fabiano Rosas
2024-09-05 14:15 ` Avihai Horon
1 sibling, 0 replies; 128+ messages in thread
From: Fabiano Rosas @ 2024-08-30 19:05 UTC (permalink / raw)
To: Maciej S. Szmigiero, Peter Xu
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
"Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>
> qemu_loadvm_load_state_buffer() and its load_state_buffer
> SaveVMHandler allow providing a device state buffer to an explicitly
> specified device via its idstr and instance id.
>
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
* Re: [PATCH v2 08/17] migration: Add load_finish handler and associated functions
2024-08-27 17:54 ` [PATCH v2 08/17] migration: Add load_finish handler and associated functions Maciej S. Szmigiero
@ 2024-08-30 19:28 ` Fabiano Rosas
2024-09-05 15:13 ` Avihai Horon
2024-09-09 20:03 ` Peter Xu
2 siblings, 0 replies; 128+ messages in thread
From: Fabiano Rosas @ 2024-08-30 19:28 UTC (permalink / raw)
To: Maciej S. Szmigiero, Peter Xu
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
"Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>
> load_finish SaveVMHandler allows migration code to poll whether
> a device-specific asynchronous device state loading operation has finished.
>
> In order to avoid calling this handler needlessly, the device is supposed
> to notify the migration code of its possible readiness via a call to
> qemu_loadvm_load_finish_ready_broadcast() while holding
> qemu_loadvm_load_finish_ready_lock.
>
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> ---
> include/migration/register.h | 21 +++++++++++++++
> migration/migration.c | 6 +++++
> migration/migration.h | 3 +++
> migration/savevm.c | 52 ++++++++++++++++++++++++++++++++++++
> migration/savevm.h | 4 +++
> 5 files changed, 86 insertions(+)
>
> diff --git a/include/migration/register.h b/include/migration/register.h
> index 4a578f140713..44d8cf5192ae 100644
> --- a/include/migration/register.h
> +++ b/include/migration/register.h
> @@ -278,6 +278,27 @@ typedef struct SaveVMHandlers {
> int (*load_state_buffer)(void *opaque, char *data, size_t data_size,
> Error **errp);
>
> + /**
> + * @load_finish
> + *
> + * Poll whether all asynchronous device state loading had finished.
> + * Not called on the load failure path.
> + *
> + * Called while holding the qemu_loadvm_load_finish_ready_lock.
> + *
> + * If this method signals "not ready" then it might not be called
> + * again until qemu_loadvm_load_finish_ready_broadcast() is invoked
> + * while holding qemu_loadvm_load_finish_ready_lock.
> + *
> + * @opaque: data pointer passed to register_savevm_live()
> + * @is_finished: whether the loading had finished (output parameter)
> + * @errp: pointer to Error*, to store an error if it happens.
> + *
> + * Returns zero to indicate success and negative for error
> + * It's not an error that the loading still hasn't finished.
> + */
> + int (*load_finish)(void *opaque, bool *is_finished, Error **errp);
> +
> /**
> * @load_setup
> *
> diff --git a/migration/migration.c b/migration/migration.c
> index 3dea06d57732..d61e7b055e07 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -259,6 +259,9 @@ void migration_object_init(void)
>
> current_incoming->exit_on_error = INMIGRATE_DEFAULT_EXIT_ON_ERROR;
>
> + qemu_mutex_init(¤t_incoming->load_finish_ready_mutex);
> + qemu_cond_init(¤t_incoming->load_finish_ready_cond);
> +
> migration_object_check(current_migration, &error_fatal);
>
> ram_mig_init();
> @@ -410,6 +413,9 @@ void migration_incoming_state_destroy(void)
> mis->postcopy_qemufile_dst = NULL;
> }
>
> + qemu_mutex_destroy(&mis->load_finish_ready_mutex);
> + qemu_cond_destroy(&mis->load_finish_ready_cond);
> +
> yank_unregister_instance(MIGRATION_YANK_INSTANCE);
> }
>
> diff --git a/migration/migration.h b/migration/migration.h
> index 38aa1402d516..4e2443e6c8ec 100644
> --- a/migration/migration.h
> +++ b/migration/migration.h
> @@ -230,6 +230,9 @@ struct MigrationIncomingState {
>
> /* Do exit on incoming migration failure */
> bool exit_on_error;
> +
> + QemuCond load_finish_ready_cond;
> + QemuMutex load_finish_ready_mutex;
With these moved to MigrationState:
Reviewed-by: Fabiano Rosas <farosas@suse.de>
* Re: [PATCH v2 09/17] migration/multifd: Device state transfer support - receive side
2024-08-27 17:54 ` [PATCH v2 09/17] migration/multifd: Device state transfer support - receive side Maciej S. Szmigiero
@ 2024-08-30 20:22 ` Fabiano Rosas
2024-09-02 20:12 ` Maciej S. Szmigiero
2024-09-05 16:47 ` Avihai Horon
1 sibling, 1 reply; 128+ messages in thread
From: Fabiano Rosas @ 2024-08-30 20:22 UTC (permalink / raw)
To: Maciej S. Szmigiero, Peter Xu
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
"Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>
> Add basic support for receiving device state via multifd channels -
> channels that are shared with RAM transfers.
>
> To differentiate between a device state and a RAM packet, the packet
> header is read first.
>
> Depending on whether the MULTIFD_FLAG_DEVICE_STATE flag is present in
> the packet header, either device state (MultiFDPacketDeviceState_t) or
> RAM data (existing MultiFDPacket_t) is then read.
>
> The received device state data is provided to
> qemu_loadvm_load_state_buffer() function for processing in the
> device's load_state_buffer handler.
>
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> ---
> migration/multifd.c | 127 +++++++++++++++++++++++++++++++++++++-------
> migration/multifd.h | 31 ++++++++++-
> 2 files changed, 138 insertions(+), 20 deletions(-)
>
> diff --git a/migration/multifd.c b/migration/multifd.c
> index b06a9fab500e..d5a8e5a9c9b5 100644
> --- a/migration/multifd.c
> +++ b/migration/multifd.c
> @@ -21,6 +21,7 @@
> #include "file.h"
> #include "migration.h"
> #include "migration-stats.h"
> +#include "savevm.h"
> #include "socket.h"
> #include "tls.h"
> #include "qemu-file.h"
> @@ -209,10 +210,10 @@ void multifd_send_fill_packet(MultiFDSendParams *p)
>
> memset(packet, 0, p->packet_len);
>
> - packet->magic = cpu_to_be32(MULTIFD_MAGIC);
> - packet->version = cpu_to_be32(MULTIFD_VERSION);
> + packet->hdr.magic = cpu_to_be32(MULTIFD_MAGIC);
> + packet->hdr.version = cpu_to_be32(MULTIFD_VERSION);
>
> - packet->flags = cpu_to_be32(p->flags);
> + packet->hdr.flags = cpu_to_be32(p->flags);
> packet->next_packet_size = cpu_to_be32(p->next_packet_size);
>
> packet_num = qatomic_fetch_inc(&multifd_send_state->packet_num);
> @@ -228,31 +229,49 @@ void multifd_send_fill_packet(MultiFDSendParams *p)
> p->flags, p->next_packet_size);
> }
>
> -static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error **errp)
> +static int multifd_recv_unfill_packet_header(MultiFDRecvParams *p,
> + MultiFDPacketHdr_t *hdr,
> + Error **errp)
> {
> - MultiFDPacket_t *packet = p->packet;
> - int ret = 0;
> -
> - packet->magic = be32_to_cpu(packet->magic);
> - if (packet->magic != MULTIFD_MAGIC) {
> + hdr->magic = be32_to_cpu(hdr->magic);
> + if (hdr->magic != MULTIFD_MAGIC) {
> error_setg(errp, "multifd: received packet "
> "magic %x and expected magic %x",
> - packet->magic, MULTIFD_MAGIC);
> + hdr->magic, MULTIFD_MAGIC);
> return -1;
> }
>
> - packet->version = be32_to_cpu(packet->version);
> - if (packet->version != MULTIFD_VERSION) {
> + hdr->version = be32_to_cpu(hdr->version);
> + if (hdr->version != MULTIFD_VERSION) {
> error_setg(errp, "multifd: received packet "
> "version %u and expected version %u",
> - packet->version, MULTIFD_VERSION);
> + hdr->version, MULTIFD_VERSION);
> return -1;
> }
>
> - p->flags = be32_to_cpu(packet->flags);
> + p->flags = be32_to_cpu(hdr->flags);
> +
> + return 0;
> +}
> +
> +static int multifd_recv_unfill_packet_device_state(MultiFDRecvParams *p,
> + Error **errp)
> +{
> + MultiFDPacketDeviceState_t *packet = p->packet_dev_state;
> +
> + packet->instance_id = be32_to_cpu(packet->instance_id);
> + p->next_packet_size = be32_to_cpu(packet->next_packet_size);
> +
> + return 0;
> +}
> +
> +static int multifd_recv_unfill_packet_ram(MultiFDRecvParams *p, Error **errp)
> +{
> + MultiFDPacket_t *packet = p->packet;
> + int ret = 0;
> +
> p->next_packet_size = be32_to_cpu(packet->next_packet_size);
> p->packet_num = be64_to_cpu(packet->packet_num);
> - p->packets_recved++;
>
> if (!(p->flags & MULTIFD_FLAG_SYNC)) {
> ret = multifd_ram_unfill_packet(p, errp);
> @@ -264,6 +283,19 @@ static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error **errp)
> return ret;
> }
>
> +static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error **errp)
> +{
> + p->packets_recved++;
> +
> + if (p->flags & MULTIFD_FLAG_DEVICE_STATE) {
> + return multifd_recv_unfill_packet_device_state(p, errp);
> + } else {
> + return multifd_recv_unfill_packet_ram(p, errp);
> + }
> +
> + g_assert_not_reached();
> +}
> +
> static bool multifd_send_should_exit(void)
> {
> return qatomic_read(&multifd_send_state->exiting);
> @@ -1014,6 +1046,7 @@ static void multifd_recv_cleanup_channel(MultiFDRecvParams *p)
> p->packet_len = 0;
> g_free(p->packet);
> p->packet = NULL;
> + g_clear_pointer(&p->packet_dev_state, g_free);
> g_free(p->normal);
> p->normal = NULL;
> g_free(p->zero);
> @@ -1126,8 +1159,13 @@ static void *multifd_recv_thread(void *opaque)
> rcu_register_thread();
>
> while (true) {
> + MultiFDPacketHdr_t hdr;
> uint32_t flags = 0;
> + bool is_device_state = false;
> bool has_data = false;
> + uint8_t *pkt_buf;
> + size_t pkt_len;
> +
> p->normal_num = 0;
>
> if (use_packets) {
> @@ -1135,8 +1173,28 @@ static void *multifd_recv_thread(void *opaque)
> break;
> }
>
> - ret = qio_channel_read_all_eof(p->c, (void *)p->packet,
> - p->packet_len, &local_err);
> + ret = qio_channel_read_all_eof(p->c, (void *)&hdr,
> + sizeof(hdr), &local_err);
> + if (ret == 0 || ret == -1) { /* 0: EOF -1: Error */
> + break;
> + }
> +
> + ret = multifd_recv_unfill_packet_header(p, &hdr, &local_err);
> + if (ret) {
> + break;
> + }
> +
> + is_device_state = p->flags & MULTIFD_FLAG_DEVICE_STATE;
> + if (is_device_state) {
> + pkt_buf = (uint8_t *)p->packet_dev_state + sizeof(hdr);
> + pkt_len = sizeof(*p->packet_dev_state) - sizeof(hdr);
> + } else {
> + pkt_buf = (uint8_t *)p->packet + sizeof(hdr);
> + pkt_len = p->packet_len - sizeof(hdr);
> + }
Should we have made the packet a union as well? It would simplify these
sorts of operations. Not sure I want to start messing with that at this
point, to be honest. But OTOH, look at this...
> +
> + ret = qio_channel_read_all_eof(p->c, (char *)pkt_buf, pkt_len,
> + &local_err);
> if (ret == 0 || ret == -1) { /* 0: EOF -1: Error */
> break;
> }
> @@ -1181,8 +1239,33 @@ static void *multifd_recv_thread(void *opaque)
> has_data = !!p->data->size;
> }
>
> - if (has_data) {
> - ret = multifd_recv_state->ops->recv(p, &local_err);
> + if (!is_device_state) {
> + if (has_data) {
> + ret = multifd_recv_state->ops->recv(p, &local_err);
> + if (ret != 0) {
> + break;
> + }
> + }
> + } else {
> + g_autofree char *idstr = NULL;
> + g_autofree char *dev_state_buf = NULL;
> +
> + assert(use_packets);
> +
> + if (p->next_packet_size > 0) {
> + dev_state_buf = g_malloc(p->next_packet_size);
> +
> + ret = qio_channel_read_all(p->c, dev_state_buf, p->next_packet_size, &local_err);
> + if (ret != 0) {
> + break;
> + }
> + }
What's the use case for !next_packet_size while still calling
load_state_buffer below? I can't see it.
...because I would suggest setting has_data up there with
p->next_packet_size:
if (use_packets) {
...
has_data = p->next_packet_size || p->zero_num;
} else {
...
has_data = !!p->data_size;
}
and this whole block would be:
if (has_data) {
if (is_device_state) {
multifd_device_state_recv(p, &local_err);
} else {
ret = multifd_recv_state->ops->recv(p, &local_err);
}
}
> +
> + idstr = g_strndup(p->packet_dev_state->idstr, sizeof(p->packet_dev_state->idstr));
> + ret = qemu_loadvm_load_state_buffer(idstr,
> + p->packet_dev_state->instance_id,
> + dev_state_buf, p->next_packet_size,
> + &local_err);
> if (ret != 0) {
> break;
> }
> @@ -1190,6 +1273,11 @@ static void *multifd_recv_thread(void *opaque)
>
> if (use_packets) {
> if (flags & MULTIFD_FLAG_SYNC) {
> + if (is_device_state) {
> + error_setg(&local_err, "multifd: received SYNC device state packet");
> + break;
> + }
assert(!is_device_state) enough?
> +
> qemu_sem_post(&multifd_recv_state->sem_sync);
> qemu_sem_wait(&p->sem_sync);
> }
> @@ -1258,6 +1346,7 @@ int multifd_recv_setup(Error **errp)
> p->packet_len = sizeof(MultiFDPacket_t)
> + sizeof(uint64_t) * page_count;
> p->packet = g_malloc0(p->packet_len);
> + p->packet_dev_state = g_malloc0(sizeof(*p->packet_dev_state));
> }
> p->name = g_strdup_printf("mig/dst/recv_%d", i);
> p->normal = g_new0(ram_addr_t, page_count);
> diff --git a/migration/multifd.h b/migration/multifd.h
> index a3e35196d179..a8f3e4838c01 100644
> --- a/migration/multifd.h
> +++ b/migration/multifd.h
> @@ -45,6 +45,12 @@ MultiFDRecvData *multifd_get_recv_data(void);
> #define MULTIFD_FLAG_QPL (4 << 1)
> #define MULTIFD_FLAG_UADK (8 << 1)
>
> +/*
> + * If set it means that this packet contains device state
> + * (MultiFDPacketDeviceState_t), not RAM data (MultiFDPacket_t).
> + */
> +#define MULTIFD_FLAG_DEVICE_STATE (1 << 4)
Overlaps with UADK. I assume on purpose because device_state doesn't
support compression? Might be worth a comment.
> +
> /* This value needs to be a multiple of qemu_target_page_size() */
> #define MULTIFD_PACKET_SIZE (512 * 1024)
>
> @@ -52,6 +58,11 @@ typedef struct {
> uint32_t magic;
> uint32_t version;
> uint32_t flags;
> +} __attribute__((packed)) MultiFDPacketHdr_t;
> +
> +typedef struct {
> + MultiFDPacketHdr_t hdr;
> +
> /* maximum number of allocated pages */
> uint32_t pages_alloc;
> /* non zero pages */
> @@ -72,6 +83,16 @@ typedef struct {
> uint64_t offset[];
> } __attribute__((packed)) MultiFDPacket_t;
>
> +typedef struct {
> + MultiFDPacketHdr_t hdr;
> +
> + char idstr[256] QEMU_NONSTRING;
> + uint32_t instance_id;
> +
> + /* size of the next packet that contains the actual data */
> + uint32_t next_packet_size;
> +} __attribute__((packed)) MultiFDPacketDeviceState_t;
> +
> typedef struct {
> /* number of used pages */
> uint32_t num;
> @@ -89,6 +110,13 @@ struct MultiFDRecvData {
> off_t file_offset;
> };
>
> +typedef struct {
> + char *idstr;
> + uint32_t instance_id;
> + char *buf;
> + size_t buf_len;
> +} MultiFDDeviceState_t;
> +
> typedef enum {
> MULTIFD_PAYLOAD_NONE,
> MULTIFD_PAYLOAD_RAM,
> @@ -204,8 +232,9 @@ typedef struct {
>
> /* thread local variables. No locking required */
>
> - /* pointer to the packet */
> + /* pointers to the possible packet types */
> MultiFDPacket_t *packet;
> + MultiFDPacketDeviceState_t *packet_dev_state;
> /* size of the next packet that contains pages */
> uint32_t next_packet_size;
> /* packets received through this channel */
* Re: [PATCH v2 10/17] migration/multifd: Convert multifd_send()::next_channel to atomic
2024-08-30 18:13 ` Fabiano Rosas
@ 2024-09-02 20:11 ` Maciej S. Szmigiero
2024-09-03 15:01 ` Fabiano Rosas
0 siblings, 1 reply; 128+ messages in thread
From: Maciej S. Szmigiero @ 2024-09-02 20:11 UTC (permalink / raw)
To: Fabiano Rosas
Cc: Alex Williamson, Peter Xu, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On 30.08.2024 20:13, Fabiano Rosas wrote:
> "Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:
>
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> This is necessary for multifd_send() to be able to be called
>> from multiple threads.
>>
>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>> ---
>> migration/multifd.c | 24 ++++++++++++++++++------
>> 1 file changed, 18 insertions(+), 6 deletions(-)
>>
>> diff --git a/migration/multifd.c b/migration/multifd.c
>> index d5a8e5a9c9b5..b25789dde0b3 100644
>> --- a/migration/multifd.c
>> +++ b/migration/multifd.c
>> @@ -343,26 +343,38 @@ bool multifd_send(MultiFDSendData **send_data)
>> return false;
>> }
>>
>> - /* We wait here, until at least one channel is ready */
>> - qemu_sem_wait(&multifd_send_state->channels_ready);
>> -
>> /*
>> * next_channel can remain from a previous migration that was
>> * using more channels, so ensure it doesn't overflow if the
>> * limit is lower now.
>> */
>> - next_channel %= migrate_multifd_channels();
>> - for (i = next_channel;; i = (i + 1) % migrate_multifd_channels()) {
>> + i = qatomic_load_acquire(&next_channel);
>> + if (unlikely(i >= migrate_multifd_channels())) {
>> + qatomic_cmpxchg(&next_channel, i, 0);
>> + }
>
> Do we still need this? It seems not, because the mod down below would
> already truncate to a value less than the number of channels. We don't
> need it to start at 0 always, the channels are equivalent.
The "modulo" operation below forces i_next to be in the proper range,
not i.
If the qatomic_cmpxchg() ends up succeeding then we use the (now out of
bounds) i value to index multifd_send_state->params[].
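In other words, without that reset an interleaving like this would be
possible (sketch):

    i = qatomic_load_acquire(&next_channel);        /* e.g. 31, left over
                                                       from a 32-channel run */
    i_next = (i + 1) % migrate_multifd_channels();  /* in range, e.g. 0 of 2 */
    if (qatomic_cmpxchg(&next_channel, i, i_next) == i) {
        /* i itself was never truncated... */
        p = &multifd_send_state->params[i];   /* ...out-of-bounds access */
    }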
>> +
>> + /* We wait here, until at least one channel is ready */
>> + qemu_sem_wait(&multifd_send_state->channels_ready);
>> +
>> + while (true) {
>> + int i_next;
>> +
>> if (multifd_send_should_exit()) {
>> return false;
>> }
>> +
>> + i = qatomic_load_acquire(&next_channel);
>> + i_next = (i + 1) % migrate_multifd_channels();
>> + if (qatomic_cmpxchg(&next_channel, i, i_next) != i) {
>> + continue;
>> + }
>
> Say channel 'i' is the only one that's idle. What's stopping the other
> thread(s) from racing at this point and looping around to the same index?
See the reply below.
>> +
>> p = &multifd_send_state->params[i];
>> /*
>> * Lockless read to p->pending_job is safe, because only multifd
>> * sender thread can clear it.
>> */
>> if (qatomic_read(&p->pending_job) == false) {
>
> With the cmpxchg your other patch adds here, the race I mentioned
> above should be harmless. But we'd need to bring that code into this
> patch.
>
You're right - the sender code with this patch alone isn't thread safe
yet, but this commit is literally only about "converting
multifd_send()::next_channel to atomic".
At the time of this patch there aren't any multifd_send() calls from
multiple threads, and the commit that introduces such possible call
site (multifd_queue_device_state()) also modifies multifd_send()
to be fully thread safe by introducing p->pending_job_preparing.
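For reference, my understanding of the handshake once both changes are in
(sketch, not the literal patch):

    /* multifd_send(), potentially called from multiple threads: */
    if (qatomic_cmpxchg(&p->pending_job_preparing, false, true) == false) {
        /* channel claimed: fill p->data, then publish the job */
        qatomic_store_release(&p->pending_job, true);
    }

    /* multifd_send_thread(), the only clearer of both flags: */
    /* ...transmit p->data... */
    qatomic_store_release(&p->pending_job, false);
    qatomic_store_release(&p->pending_job_preparing, false);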
Thanks,
Maciej
* Re: [PATCH v2 13/17] migration/multifd: Add migration_has_device_state_support()
2024-08-30 18:55 ` Fabiano Rosas
@ 2024-09-02 20:11 ` Maciej S. Szmigiero
2024-09-03 15:09 ` Fabiano Rosas
0 siblings, 1 reply; 128+ messages in thread
From: Maciej S. Szmigiero @ 2024-09-02 20:11 UTC (permalink / raw)
To: Fabiano Rosas
Cc: Alex Williamson, Peter Xu, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On 30.08.2024 20:55, Fabiano Rosas wrote:
> "Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:
>
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> Since device state transfer via multifd channels requires multifd
>> channels with packets and is currently not compatible with multifd
>> compression, add an appropriate query function so the device can learn
>> whether it can actually make use of it.
>>
>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>
> Reviewed-by: Fabiano Rosas <farosas@suse.de>
>
> Out of curiosity, what do you see as a blocker for migrating to a file?
>
> We would just need to figure out a mapping from some unit of data to a
> file offset to be able to write in parallel like with ram (where the
> page offset is mapped to the file offset).
I'm not sure whether there's a point in that since VFIO devices
just provide a raw device state stream - there's no way to know
that some buffer is no longer needed because it consisted of
dirty data that was completely overwritten by a later buffer.
Also, the device type that the code was developed against - a (smart)
NIC - has such a large device state because (more or less) it keeps a lot
of data about network connections passing / made through it.
It doesn't really make sense to make a snapshot of such a device for later
reload since these connections will have long been dropped by their remote
peers by that point.
Such snapshotting might make more sense with GPU VFIO devices though.
If such file migration support is desired at some later point then for
sure the whole code would need to be carefully re-checked for implicit
assumptions.
Thanks,
Maciej
* Re: [PATCH v2 09/17] migration/multifd: Device state transfer support - receive side
2024-08-30 20:22 ` Fabiano Rosas
@ 2024-09-02 20:12 ` Maciej S. Szmigiero
2024-09-03 14:42 ` Fabiano Rosas
2024-09-09 19:52 ` Peter Xu
0 siblings, 2 replies; 128+ messages in thread
From: Maciej S. Szmigiero @ 2024-09-02 20:12 UTC (permalink / raw)
To: Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel, Peter Xu
On 30.08.2024 22:22, Fabiano Rosas wrote:
> "Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:
>
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> Add basic support for receiving device state via multifd channels -
>> channels that are shared with RAM transfers.
>>
>> To differentiate between a device state and a RAM packet, the packet
>> header is read first.
>>
>> Depending on whether the MULTIFD_FLAG_DEVICE_STATE flag is present in
>> the packet header, either device state (MultiFDPacketDeviceState_t) or
>> RAM data (existing MultiFDPacket_t) is then read.
>>
>> The received device state data is provided to
>> qemu_loadvm_load_state_buffer() function for processing in the
>> device's load_state_buffer handler.
>>
>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>> ---
>> migration/multifd.c | 127 +++++++++++++++++++++++++++++++++++++-------
>> migration/multifd.h | 31 ++++++++++-
>> 2 files changed, 138 insertions(+), 20 deletions(-)
>>
>> diff --git a/migration/multifd.c b/migration/multifd.c
>> index b06a9fab500e..d5a8e5a9c9b5 100644
>> --- a/migration/multifd.c
>> +++ b/migration/multifd.c
(..)
>> g_free(p->zero);
>> @@ -1126,8 +1159,13 @@ static void *multifd_recv_thread(void *opaque)
>> rcu_register_thread();
>>
>> while (true) {
>> + MultiFDPacketHdr_t hdr;
>> uint32_t flags = 0;
>> + bool is_device_state = false;
>> bool has_data = false;
>> + uint8_t *pkt_buf;
>> + size_t pkt_len;
>> +
>> p->normal_num = 0;
>>
>> if (use_packets) {
>> @@ -1135,8 +1173,28 @@ static void *multifd_recv_thread(void *opaque)
>> break;
>> }
>>
>> - ret = qio_channel_read_all_eof(p->c, (void *)p->packet,
>> - p->packet_len, &local_err);
>> + ret = qio_channel_read_all_eof(p->c, (void *)&hdr,
>> + sizeof(hdr), &local_err);
>> + if (ret == 0 || ret == -1) { /* 0: EOF -1: Error */
>> + break;
>> + }
>> +
>> + ret = multifd_recv_unfill_packet_header(p, &hdr, &local_err);
>> + if (ret) {
>> + break;
>> + }
>> +
>> + is_device_state = p->flags & MULTIFD_FLAG_DEVICE_STATE;
>> + if (is_device_state) {
>> + pkt_buf = (uint8_t *)p->packet_dev_state + sizeof(hdr);
>> + pkt_len = sizeof(*p->packet_dev_state) - sizeof(hdr);
>> + } else {
>> + pkt_buf = (uint8_t *)p->packet + sizeof(hdr);
>> + pkt_len = p->packet_len - sizeof(hdr);
>> + }
>
> Should we have made the packet a union as well? It would simplify these
> sorts of operations. Not sure I want to start messing with that at this
> point, to be honest. But OTOH, look at this...
RAM packet length is not constant (at least from the viewpoint of the
migration code), so the union allocation would need some kind of
"multifd_ram_packet_size()" runtime size determination.
Also, since the RAM and device state packet body sizes differ, all the
extra complexity introduced by that union would only get rid of that
single pkt_buf assignment.
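To illustrate, the hypothetical union would be something like:

    typedef union {
        MultiFDPacketHdr_t hdr;
        MultiFDPacket_t ram;              /* ends in uint64_t offset[], so its
                                             real size depends on page count */
        MultiFDPacketDeviceState_t dev;   /* fixed size */
    } MultiFDRecvPacket;

and its allocation would still need a runtime MAX() of the two body sizes,
so in the end only that single pkt_buf assignment would go away.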
>> +
>> + ret = qio_channel_read_all_eof(p->c, (char *)pkt_buf, pkt_len,
>> + &local_err);
>> if (ret == 0 || ret == -1) { /* 0: EOF -1: Error */
>> break;
>> }
>> @@ -1181,8 +1239,33 @@ static void *multifd_recv_thread(void *opaque)
>> has_data = !!p->data->size;
>> }
>>
>> - if (has_data) {
>> - ret = multifd_recv_state->ops->recv(p, &local_err);
>> + if (!is_device_state) {
>> + if (has_data) {
>> + ret = multifd_recv_state->ops->recv(p, &local_err);
>> + if (ret != 0) {
>> + break;
>> + }
>> + }
>> + } else {
>> + g_autofree char *idstr = NULL;
>> + g_autofree char *dev_state_buf = NULL;
>> +
>> + assert(use_packets);
>> +
>> + if (p->next_packet_size > 0) {
>> + dev_state_buf = g_malloc(p->next_packet_size);
>> +
>> + ret = qio_channel_read_all(p->c, dev_state_buf, p->next_packet_size, &local_err);
>> + if (ret != 0) {
>> + break;
>> + }
>> + }
>
> What's the use case for !next_packet_size while still calling
> load_state_buffer below? I can't see it.
Currently, next_packet_size == 0 indeed has no usage - it is
a leftover from an early version of the patch set (not public)
that had device state packet (chunk) indexing done by
the common migration code, rather than by the VFIO consumer.
Back then an empty packet could be used to mark the stream
boundary - like the max chunk number to expect.
> ...because I would suggest setting has_data up there with
> p->next_packet_size:
>
> if (use_packets) {
> ...
> has_data = p->next_packet_size || p->zero_num;
> } else {
> ...
> has_data = !!p->data_size;
> }
>
> and this whole block would be:
>
> if (has_data) {
> if (is_device_state) {
> multifd_device_state_recv(p, &local_err);
> } else {
> ret = multifd_recv_state->ops->recv(p, &local_err);
> }
> }
The above block makes sense to me with two caveats:
1) If empty device state packets (next_packet_size == 0) are to be
unsupported, they need to be rejected cleanly rather than silently
skipped,
2) has_data has to have its value computed depending on whether this
is a RAM or a device state packet: looking at p->normal_num and
p->zero_num makes no sense for a device state packet, while I am not
sure that looking at p->next_packet_size for a RAM packet won't
introduce some subtle regression.
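So, something along these lines instead (sketch):

    if (use_packets) {
        ...
        if (is_device_state) {
            has_data = p->next_packet_size > 0;
        } else {
            has_data = p->normal_num || p->zero_num;
        }
    } else {
        ...
        has_data = !!p->data->size;
    }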
>> +
>> + idstr = g_strndup(p->packet_dev_state->idstr, sizeof(p->packet_dev_state->idstr));
>> + ret = qemu_loadvm_load_state_buffer(idstr,
>> + p->packet_dev_state->instance_id,
>> + dev_state_buf, p->next_packet_size,
>> + &local_err);
>> if (ret != 0) {
>> break;
>> }
>> @@ -1190,6 +1273,11 @@ static void *multifd_recv_thread(void *opaque)
>>
>> if (use_packets) {
>> if (flags & MULTIFD_FLAG_SYNC) {
>> + if (is_device_state) {
>> + error_setg(&local_err, "multifd: received SYNC device state packet");
>> + break;
>> + }
>
> assert(!is_device_state) enough?
It's not a bug in the receiver code but rather an issue with the
remote QEMU sending us wrong data if we get a SYNC device state
packet.
So I think returning an error is more appropriate than triggering
an assert() failure for that.
>> +
>> qemu_sem_post(&multifd_recv_state->sem_sync);
>> qemu_sem_wait(&p->sem_sync);
>> }
>> @@ -1258,6 +1346,7 @@ int multifd_recv_setup(Error **errp)
>> p->packet_len = sizeof(MultiFDPacket_t)
>> + sizeof(uint64_t) * page_count;
>> p->packet = g_malloc0(p->packet_len);
>> + p->packet_dev_state = g_malloc0(sizeof(*p->packet_dev_state));
>> }
>> p->name = g_strdup_printf("mig/dst/recv_%d", i);
>> p->normal = g_new0(ram_addr_t, page_count);
>> diff --git a/migration/multifd.h b/migration/multifd.h
>> index a3e35196d179..a8f3e4838c01 100644
>> --- a/migration/multifd.h
>> +++ b/migration/multifd.h
>> @@ -45,6 +45,12 @@ MultiFDRecvData *multifd_get_recv_data(void);
>> #define MULTIFD_FLAG_QPL (4 << 1)
>> #define MULTIFD_FLAG_UADK (8 << 1)
>>
>> +/*
>> + * If set it means that this packet contains device state
>> + * (MultiFDPacketDeviceState_t), not RAM data (MultiFDPacket_t).
>> + */
>> +#define MULTIFD_FLAG_DEVICE_STATE (1 << 4)
>
> Overlaps with UADK. I assume on purpose because device_state doesn't
> support compression? Might be worth a comment.
>
Yes, the device state transfer bit stream does not support compression,
so it is not a problem since these "compression type" flags will never
be set in such a bit stream anyway.
Will add a relevant comment here.
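Perhaps something like this (wording TBD):

    /*
     * This flag deliberately aliases MULTIFD_FLAG_UADK: the compression
     * type flags are only ever set in RAM packets, and device state
     * packets never carry compressed data, so the two cannot be confused.
     */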
Thanks,
Maciej
* Re: [PATCH v2 05/17] thread-pool: Implement non-AIO (generic) pool support
2024-08-27 17:54 ` [PATCH v2 05/17] thread-pool: Implement non-AIO (generic) pool support Maciej S. Szmigiero
@ 2024-09-02 22:07 ` Fabiano Rosas
2024-09-03 12:02 ` Maciej S. Szmigiero
2024-09-03 13:55 ` Stefan Hajnoczi
1 sibling, 1 reply; 128+ messages in thread
From: Fabiano Rosas @ 2024-09-02 22:07 UTC (permalink / raw)
To: Maciej S. Szmigiero, Peter Xu
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel, Stefan Hajnoczi, Paolo Bonzini
"Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>
> Migration code wants to manage device data sending threads in one place.
>
> QEMU has an existing thread pool implementation; however, it was limited
> to queuing AIO operations only and essentially had a 1:1 mapping between
> the current AioContext and the ThreadPool in use.
>
> Implement what is necessary to queue generic (non-AIO) work on a ThreadPool
> too.
>
> This brings a few new operations on a pool:
> * thread_pool_set_minmax_threads() explicitly sets the minimum and maximum
> thread count in the pool.
>
> * thread_pool_join() operation waits until all the submitted work requests
> have finished.
>
> * thread_pool_poll() lets the new thread and / or thread completion bottom
> halves run (if they are indeed scheduled to be run).
> It is useful for thread pool users that need to launch or terminate new
> threads without returning to the QEMU main loop.
>
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> ---
> include/block/thread-pool.h | 10 ++++-
> tests/unit/test-thread-pool.c | 2 +-
> util/thread-pool.c | 77 ++++++++++++++++++++++++++++++-----
> 3 files changed, 76 insertions(+), 13 deletions(-)
>
> diff --git a/include/block/thread-pool.h b/include/block/thread-pool.h
> index b484c4780ea6..1769496056cd 100644
> --- a/include/block/thread-pool.h
> +++ b/include/block/thread-pool.h
> @@ -37,9 +37,15 @@ BlockAIOCB *thread_pool_submit_aio(ThreadPoolFunc *func,
> void *arg, GDestroyNotify arg_destroy,
> BlockCompletionFunc *cb, void *opaque);
> int coroutine_fn thread_pool_submit_co(ThreadPoolFunc *func, void *arg);
> -void thread_pool_submit(ThreadPoolFunc *func,
> - void *arg, GDestroyNotify arg_destroy);
> +BlockAIOCB *thread_pool_submit(ThreadPool *pool, ThreadPoolFunc *func,
> + void *arg, GDestroyNotify arg_destroy,
> + BlockCompletionFunc *cb, void *opaque);
These kinds of changes (create wrappers, change signatures, etc.) could
be in their own patch as they're just code motion that should not have
functional impact. The "no_requests" stuff would be better discussed in
a separate patch.
>
> +void thread_pool_join(ThreadPool *pool);
> +void thread_pool_poll(ThreadPool *pool);
> +
> +void thread_pool_set_minmax_threads(ThreadPool *pool,
> + int min_threads, int max_threads);
> void thread_pool_update_params(ThreadPool *pool, struct AioContext *ctx);
>
> #endif
> diff --git a/tests/unit/test-thread-pool.c b/tests/unit/test-thread-pool.c
> index e4afb9e36292..469c0f7057b6 100644
> --- a/tests/unit/test-thread-pool.c
> +++ b/tests/unit/test-thread-pool.c
> @@ -46,7 +46,7 @@ static void done_cb(void *opaque, int ret)
> static void test_submit(void)
> {
> WorkerTestData data = { .n = 0 };
> - thread_pool_submit(worker_cb, &data, NULL);
> + thread_pool_submit(NULL, worker_cb, &data, NULL, NULL, NULL);
> while (data.n == 0) {
> aio_poll(ctx, true);
> }
> diff --git a/util/thread-pool.c b/util/thread-pool.c
> index 69a87ee79252..2bf3be875a51 100644
> --- a/util/thread-pool.c
> +++ b/util/thread-pool.c
> @@ -60,6 +60,7 @@ struct ThreadPool {
> QemuMutex lock;
> QemuCond worker_stopped;
> QemuCond request_cond;
> + QemuCond no_requests_cond;
> QEMUBH *new_thread_bh;
>
> /* The following variables are only accessed from one AioContext. */
> @@ -73,6 +74,7 @@ struct ThreadPool {
> int pending_threads; /* threads created but not running yet */
> int min_threads;
> int max_threads;
> + size_t requests_executing;
What's with size_t? Should this be a uint32_t instead?
> };
>
> static void *worker_thread(void *opaque)
> @@ -107,6 +109,10 @@ static void *worker_thread(void *opaque)
> req = QTAILQ_FIRST(&pool->request_list);
> QTAILQ_REMOVE(&pool->request_list, req, reqs);
> req->state = THREAD_ACTIVE;
> +
> + assert(pool->requests_executing < SIZE_MAX);
> + pool->requests_executing++;
> +
> qemu_mutex_unlock(&pool->lock);
>
> ret = req->func(req->arg);
> @@ -118,6 +124,14 @@ static void *worker_thread(void *opaque)
>
> qemu_bh_schedule(pool->completion_bh);
> qemu_mutex_lock(&pool->lock);
> +
> + assert(pool->requests_executing > 0);
> + pool->requests_executing--;
> +
> + if (pool->requests_executing == 0 &&
> + QTAILQ_EMPTY(&pool->request_list)) {
> + qemu_cond_signal(&pool->no_requests_cond);
> + }
An empty requests list and no request in flight means the worker will
now exit after the timeout, no? Can you just kick the worker out of the
wait and use pool->worker_stopped instead of the new condition variable?
> }
>
> pool->cur_threads--;
> @@ -243,13 +257,16 @@ static const AIOCBInfo thread_pool_aiocb_info = {
> .cancel_async = thread_pool_cancel,
> };
>
> -BlockAIOCB *thread_pool_submit_aio(ThreadPoolFunc *func,
> - void *arg, GDestroyNotify arg_destroy,
> - BlockCompletionFunc *cb, void *opaque)
> +BlockAIOCB *thread_pool_submit(ThreadPool *pool, ThreadPoolFunc *func,
> + void *arg, GDestroyNotify arg_destroy,
> + BlockCompletionFunc *cb, void *opaque)
> {
> ThreadPoolElement *req;
> AioContext *ctx = qemu_get_current_aio_context();
> - ThreadPool *pool = aio_get_thread_pool(ctx);
> +
> + if (!pool) {
> + pool = aio_get_thread_pool(ctx);
> + }
I'd go for a separate implementation to really drive the point that this
new usage is different. See the code snippet below.
It seems we're a short step away from being able to use this
implementation in a general way. Is there something that can be done
with the 'common' field in the ThreadPoolElement?
========
static void thread_pool_submit_request(ThreadPool *pool, ThreadPoolElement *req)
{
req->state = THREAD_QUEUED;
req->pool = pool;
QLIST_INSERT_HEAD(&pool->head, req, all);
trace_thread_pool_submit(pool, req, req->arg);
qemu_mutex_lock(&pool->lock);
if (pool->idle_threads == 0 && pool->cur_threads < pool->max_threads) {
spawn_thread(pool);
}
QTAILQ_INSERT_TAIL(&pool->request_list, req, reqs);
qemu_mutex_unlock(&pool->lock);
qemu_cond_signal(&pool->request_cond);
}
BlockAIOCB *thread_pool_submit_aio(ThreadPoolFunc *func, void *arg,
BlockCompletionFunc *cb, void *opaque)
{
ThreadPoolElement *req;
AioContext *ctx = qemu_get_current_aio_context();
ThreadPool *pool = aio_get_thread_pool(ctx);
/* Assert that the thread submitting work is the same running the pool */
assert(pool->ctx == qemu_get_current_aio_context());
req = qemu_aio_get(&thread_pool_aiocb_info, NULL, cb, opaque);
req->func = func;
req->arg = arg;
thread_pool_submit_request(pool, req);
return &req->common;
}
void thread_pool_submit(ThreadPool *pool, ThreadPoolFunc *func, void *arg)
{
ThreadPoolElement *req;
req = g_malloc(sizeof(ThreadPoolElement));
req->func = func;
req->arg = arg;
thread_pool_submit_request(pool, req);
}
=================
>
> /* Assert that the thread submitting work is the same running the pool */
> assert(pool->ctx == qemu_get_current_aio_context());
> @@ -275,6 +292,18 @@ BlockAIOCB *thread_pool_submit_aio(ThreadPoolFunc *func,
> return &req->common;
> }
>
> +BlockAIOCB *thread_pool_submit_aio(ThreadPoolFunc *func,
> + void *arg, GDestroyNotify arg_destroy,
> + BlockCompletionFunc *cb, void *opaque)
> +{
> + return thread_pool_submit(NULL, func, arg, arg_destroy, cb, opaque);
> +}
> +
> +void thread_pool_poll(ThreadPool *pool)
> +{
> + aio_bh_poll(pool->ctx);
> +}
> +
> typedef struct ThreadPoolCo {
> Coroutine *co;
> int ret;
> @@ -297,18 +326,38 @@ int coroutine_fn thread_pool_submit_co(ThreadPoolFunc *func, void *arg)
> return tpc.ret;
> }
>
> -void thread_pool_submit(ThreadPoolFunc *func,
> - void *arg, GDestroyNotify arg_destroy)
> +void thread_pool_join(ThreadPool *pool)
This is misleading because it's about the requests, not the threads in
the pool. Compare with what thread_pool_free does:
/* Wait for worker threads to terminate */
pool->max_threads = 0;
qemu_cond_broadcast(&pool->request_cond);
while (pool->cur_threads > 0) {
qemu_cond_wait(&pool->worker_stopped, &pool->lock);
}
> {
> - thread_pool_submit_aio(func, arg, arg_destroy, NULL, NULL);
> + /* Assert that the thread waiting is the same running the pool */
> + assert(pool->ctx == qemu_get_current_aio_context());
> +
> + qemu_mutex_lock(&pool->lock);
> +
> + if (pool->requests_executing > 0 ||
> + !QTAILQ_EMPTY(&pool->request_list)) {
> + qemu_cond_wait(&pool->no_requests_cond, &pool->lock);
> + }
> + assert(pool->requests_executing == 0 &&
> + QTAILQ_EMPTY(&pool->request_list));
> +
> + qemu_mutex_unlock(&pool->lock);
> +
> + aio_bh_poll(pool->ctx);
> +
> + assert(QLIST_EMPTY(&pool->head));
> }
>
> -void thread_pool_update_params(ThreadPool *pool, AioContext *ctx)
> +void thread_pool_set_minmax_threads(ThreadPool *pool,
> + int min_threads, int max_threads)
> {
> + assert(min_threads >= 0);
> + assert(max_threads > 0);
> + assert(max_threads >= min_threads);
> +
> qemu_mutex_lock(&pool->lock);
>
> - pool->min_threads = ctx->thread_pool_min;
> - pool->max_threads = ctx->thread_pool_max;
> + pool->min_threads = min_threads;
> + pool->max_threads = max_threads;
>
> /*
> * We either have to:
> @@ -330,6 +379,12 @@ void thread_pool_update_params(ThreadPool *pool, AioContext *ctx)
> qemu_mutex_unlock(&pool->lock);
> }
>
> +void thread_pool_update_params(ThreadPool *pool, AioContext *ctx)
> +{
> + thread_pool_set_minmax_threads(pool,
> + ctx->thread_pool_min, ctx->thread_pool_max);
> +}
> +
> static void thread_pool_init_one(ThreadPool *pool, AioContext *ctx)
> {
> if (!ctx) {
> @@ -342,6 +397,7 @@ static void thread_pool_init_one(ThreadPool *pool, AioContext *ctx)
> qemu_mutex_init(&pool->lock);
> qemu_cond_init(&pool->worker_stopped);
> qemu_cond_init(&pool->request_cond);
> + qemu_cond_init(&pool->no_requests_cond);
> pool->new_thread_bh = aio_bh_new(ctx, spawn_thread_bh_fn, pool);
>
> QLIST_INIT(&pool->head);
> @@ -382,6 +438,7 @@ void thread_pool_free(ThreadPool *pool)
> qemu_mutex_unlock(&pool->lock);
>
> qemu_bh_delete(pool->completion_bh);
> + qemu_cond_destroy(&pool->no_requests_cond);
> qemu_cond_destroy(&pool->request_cond);
> qemu_cond_destroy(&pool->worker_stopped);
> qemu_mutex_destroy(&pool->lock);
* Re: [PATCH v2 05/17] thread-pool: Implement non-AIO (generic) pool support
2024-09-02 22:07 ` Fabiano Rosas
@ 2024-09-03 12:02 ` Maciej S. Szmigiero
2024-09-03 14:26 ` Fabiano Rosas
0 siblings, 1 reply; 128+ messages in thread
From: Maciej S. Szmigiero @ 2024-09-03 12:02 UTC (permalink / raw)
To: Fabiano Rosas
Cc: Alex Williamson, Peter Xu, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel, Stefan Hajnoczi, Paolo Bonzini
On 3.09.2024 00:07, Fabiano Rosas wrote:
> "Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:
>
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> Migration code wants to manage device data sending threads in one place.
>>
>> QEMU has an existing thread pool implementation; however, it was limited
>> to queuing AIO operations only and essentially had a 1:1 mapping between
>> the current AioContext and the ThreadPool in use.
>>
>> Implement what is necessary to queue generic (non-AIO) work on a ThreadPool
>> too.
>>
>> This brings a few new operations on a pool:
>> * thread_pool_set_minmax_threads() explicitly sets the minimum and maximum
>> thread count in the pool.
>>
>> * thread_pool_join() operation waits until all the submitted work requests
>> have finished.
>>
>> * thread_pool_poll() lets the new thread and / or thread completion bottom
>> halves run (if they are indeed scheduled to be run).
>> It is useful for thread pool users that need to launch or terminate new
>> threads without returning to the QEMU main loop.
>>
>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>> ---
>> include/block/thread-pool.h | 10 ++++-
>> tests/unit/test-thread-pool.c | 2 +-
>> util/thread-pool.c | 77 ++++++++++++++++++++++++++++++-----
>> 3 files changed, 76 insertions(+), 13 deletions(-)
>>
>> diff --git a/include/block/thread-pool.h b/include/block/thread-pool.h
>> index b484c4780ea6..1769496056cd 100644
>> --- a/include/block/thread-pool.h
>> +++ b/include/block/thread-pool.h
>> @@ -37,9 +37,15 @@ BlockAIOCB *thread_pool_submit_aio(ThreadPoolFunc *func,
>> void *arg, GDestroyNotify arg_destroy,
>> BlockCompletionFunc *cb, void *opaque);
>> int coroutine_fn thread_pool_submit_co(ThreadPoolFunc *func, void *arg);
>> -void thread_pool_submit(ThreadPoolFunc *func,
>> - void *arg, GDestroyNotify arg_destroy);
>> +BlockAIOCB *thread_pool_submit(ThreadPool *pool, ThreadPoolFunc *func,
>> + void *arg, GDestroyNotify arg_destroy,
>> + BlockCompletionFunc *cb, void *opaque);
>
> These kinds of changes (create wrappers, change signatures, etc.) could
> be in their own patch as they're just code motion that should not have
> functional impact. The "no_requests" stuff would be better discussed in
> a separate patch.
These changes *all* should have no functional impact on existing callers.
But I get your overall point, will try to separate these really trivial
parts.
>>
>> +void thread_pool_join(ThreadPool *pool);
>> +void thread_pool_poll(ThreadPool *pool);
>> +
>> +void thread_pool_set_minmax_threads(ThreadPool *pool,
>> + int min_threads, int max_threads);
>> void thread_pool_update_params(ThreadPool *pool, struct AioContext *ctx);
>>
>> #endif
>> diff --git a/tests/unit/test-thread-pool.c b/tests/unit/test-thread-pool.c
>> index e4afb9e36292..469c0f7057b6 100644
>> --- a/tests/unit/test-thread-pool.c
>> +++ b/tests/unit/test-thread-pool.c
>> @@ -46,7 +46,7 @@ static void done_cb(void *opaque, int ret)
>> static void test_submit(void)
>> {
>> WorkerTestData data = { .n = 0 };
>> - thread_pool_submit(worker_cb, &data, NULL);
>> + thread_pool_submit(NULL, worker_cb, &data, NULL, NULL, NULL);
>> while (data.n == 0) {
>> aio_poll(ctx, true);
>> }
>> diff --git a/util/thread-pool.c b/util/thread-pool.c
>> index 69a87ee79252..2bf3be875a51 100644
>> --- a/util/thread-pool.c
>> +++ b/util/thread-pool.c
>> @@ -60,6 +60,7 @@ struct ThreadPool {
>> QemuMutex lock;
>> QemuCond worker_stopped;
>> QemuCond request_cond;
>> + QemuCond no_requests_cond;
>> QEMUBH *new_thread_bh;
>>
>> /* The following variables are only accessed from one AioContext. */
>> @@ -73,6 +74,7 @@ struct ThreadPool {
>> int pending_threads; /* threads created but not running yet */
>> int min_threads;
>> int max_threads;
>> + size_t requests_executing;
>
> What's with size_t? Should this be a uint32_t instead?
Sizes of objects are normally size_t, since otherwise bad
things happen if objects are bigger than 4 GiB.
Considering that the minimum object size is 1 byte the
max count of distinct objects also needs a size_t to not
risk an overflow.
I think that while 2^32 requests executing seems unlikely,
saving 4 bytes is not worth worrying that someone will
find a vulnerability triggered by overflowing a 32-bit
variable (not necessarily in the migration code but in some
other thread pool user).
>> };
>>
>> static void *worker_thread(void *opaque)
>> @@ -107,6 +109,10 @@ static void *worker_thread(void *opaque)
>> req = QTAILQ_FIRST(&pool->request_list);
>> QTAILQ_REMOVE(&pool->request_list, req, reqs);
>> req->state = THREAD_ACTIVE;
>> +
>> + assert(pool->requests_executing < SIZE_MAX);
>> + pool->requests_executing++;
>> +
>> qemu_mutex_unlock(&pool->lock);
>>
>> ret = req->func(req->arg);
>> @@ -118,6 +124,14 @@ static void *worker_thread(void *opaque)
>>
>> qemu_bh_schedule(pool->completion_bh);
>> qemu_mutex_lock(&pool->lock);
>> +
>> + assert(pool->requests_executing > 0);
>> + pool->requests_executing--;
>> +
>> + if (pool->requests_executing == 0 &&
>> + QTAILQ_EMPTY(&pool->request_list)) {
>> + qemu_cond_signal(&pool->no_requests_cond);
>> + }
>
> An empty requests list and no request in flight means the worker will
> now exit after the timeout, no? Can you just kick the worker out of the
> wait and use pool->worker_stopped instead of the new condition variable?
First, all threads won't terminate if either min_threads or max_threads
isn't 0.
It might be in the migration thread pool case but we are adding a
generic thread pool so it should be as universal as possible.
thread_pool_free() can get away with overwriting these values since
it is destroying the pool anyway.
Also, the *_join() (or whatever its final name will be) operation is
about waiting for all requests / work items to finish, not about waiting
for threads to terminate.
It's essentially a synchronization point for a thread pool, not a cleanup.
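To illustrate, the intended usage pattern is roughly the following
(a sketch against the signatures from this patch;
save_all_buffers() / save_one_buffer() are made-up names):

static int save_one_buffer(void *opaque)
{
    /* serialize and send one device state buffer */
    return 0;
}

static void save_all_buffers(ThreadPool *pool, void **bufs, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++) {
        /* generic (non-AIO) work item, no completion callback needed */
        thread_pool_submit(pool, save_one_buffer, bufs[i],
                           NULL, NULL, NULL);
    }

    /* synchronization point: returns once all submitted work finished */
    thread_pool_join(pool);

    /* the pool and its worker threads live on for further submissions */
}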
>> }
>>
>> pool->cur_threads--;
>> @@ -243,13 +257,16 @@ static const AIOCBInfo thread_pool_aiocb_info = {
>> .cancel_async = thread_pool_cancel,
>> };
>>
>> -BlockAIOCB *thread_pool_submit_aio(ThreadPoolFunc *func,
>> - void *arg, GDestroyNotify arg_destroy,
>> - BlockCompletionFunc *cb, void *opaque)
>> +BlockAIOCB *thread_pool_submit(ThreadPool *pool, ThreadPoolFunc *func,
>> + void *arg, GDestroyNotify arg_destroy,
>> + BlockCompletionFunc *cb, void *opaque)
>> {
>> ThreadPoolElement *req;
>> AioContext *ctx = qemu_get_current_aio_context();
>> - ThreadPool *pool = aio_get_thread_pool(ctx);
>> +
>> + if (!pool) {
>> + pool = aio_get_thread_pool(ctx);
>> + }
>
> I'd go for a separate implementation to really drive the point that this
> new usage is different. See the code snippet below.
I see your point there - will split these implementations.
> It seems we're a short step away to being able to use this
> implementation in a general way. Is there something that can be done
> with the 'common' field in the ThreadPoolElement?
The non-AIO request flow still needs the completion callback from BlockAIOCB
(and its argument pointer) so removing the "common" field from these requests
would mean introducing two "flavors" of ThreadPoolElement.
Not sure the memory savings here are worth the increase in code complexity.
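For the record, such a split would presumably look roughly like this
(just a sketch to show the trade-off, field names are illustrative):

typedef struct ThreadPoolElementCommon {
    ThreadPoolFunc *func;
    void *arg;
    /* state, ret, queue links, ... */
} ThreadPoolElementCommon;

typedef struct ThreadPoolElementAio {
    ThreadPoolElementCommon elem;
    BlockAIOCB common;            /* AIO callback + opaque live here */
} ThreadPoolElementAio;

typedef struct ThreadPoolElementGeneric {
    ThreadPoolElementCommon elem;
    BlockCompletionFunc *cb;      /* still needed by the non-AIO flow */
    void *cb_opaque;
} ThreadPoolElementGeneric;

That is, the generic flavor would still carry the callback pair, so the
per-request saving is just the remainder of BlockAIOCB.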
> ========
> static void thread_pool_submit_request(ThreadPool *pool, ThreadPoolElement *req)
> {
> req->state = THREAD_QUEUED;
> req->pool = pool;
>
> QLIST_INSERT_HEAD(&pool->head, req, all);
>
> trace_thread_pool_submit(pool, req, req->arg);
>
> qemu_mutex_lock(&pool->lock);
> if (pool->idle_threads == 0 && pool->cur_threads < pool->max_threads) {
> spawn_thread(pool);
> }
> QTAILQ_INSERT_TAIL(&pool->request_list, req, reqs);
> qemu_mutex_unlock(&pool->lock);
> qemu_cond_signal(&pool->request_cond);
> }
>
> BlockAIOCB *thread_pool_submit_aio(ThreadPoolFunc *func, void *arg,
> BlockCompletionFunc *cb, void *opaque)
> {
> ThreadPoolElement *req;
> AioContext *ctx = qemu_get_current_aio_context();
> ThreadPool *pool = aio_get_thread_pool(ctx);
>
> /* Assert that the thread submitting work is the same running the pool */
> assert(pool->ctx == qemu_get_current_aio_context());
>
> req = qemu_aio_get(&thread_pool_aiocb_info, NULL, cb, opaque);
> req->func = func;
> req->arg = arg;
>
> thread_pool_submit_request(pool, req);
> return &req->common;
> }
>
> void thread_pool_submit(ThreadPool *pool, ThreadPoolFunc *func, void *arg)
> {
> ThreadPoolElement *req;
>
> req = g_malloc(sizeof(ThreadPoolElement));
> req->func = func;
> req->arg = arg;
>
> thread_pool_submit_request(pool, req);
> }
> =================
>
>>
>> /* Assert that the thread submitting work is the same running the pool */
>> assert(pool->ctx == qemu_get_current_aio_context());
>> @@ -275,6 +292,18 @@ BlockAIOCB *thread_pool_submit_aio(ThreadPoolFunc *func,
>> return &req->common;
>> }
>>
>> +BlockAIOCB *thread_pool_submit_aio(ThreadPoolFunc *func,
>> + void *arg, GDestroyNotify arg_destroy,
>> + BlockCompletionFunc *cb, void *opaque)
>> +{
>> + return thread_pool_submit(NULL, func, arg, arg_destroy, cb, opaque);
>> +}
>> +
>> +void thread_pool_poll(ThreadPool *pool)
>> +{
>> + aio_bh_poll(pool->ctx);
>> +}
>> +
>> typedef struct ThreadPoolCo {
>> Coroutine *co;
>> int ret;
>> @@ -297,18 +326,38 @@ int coroutine_fn thread_pool_submit_co(ThreadPoolFunc *func, void *arg)
>> return tpc.ret;
>> }
>>
>> -void thread_pool_submit(ThreadPoolFunc *func,
>> - void *arg, GDestroyNotify arg_destroy)
>> +void thread_pool_join(ThreadPool *pool)
>
> This is misleading because it's about the requests, not the threads in
> the pool. Compare with what thread_pool_free does:
>
> /* Wait for worker threads to terminate */
> pool->max_threads = 0;
> qemu_cond_broadcast(&pool->request_cond);
> while (pool->cur_threads > 0) {
> qemu_cond_wait(&pool->worker_stopped, &pool->lock);
> }
>
I'm open to thread_pool_join() better naming proposals.
Thanks,
Maciej
* Re: [PATCH v2 05/17] thread-pool: Implement non-AIO (generic) pool support
2024-08-27 17:54 ` [PATCH v2 05/17] thread-pool: Implement non-AIO (generic) pool support Maciej S. Szmigiero
2024-09-02 22:07 ` Fabiano Rosas
@ 2024-09-03 13:55 ` Stefan Hajnoczi
2024-09-03 16:54 ` Maciej S. Szmigiero
1 sibling, 1 reply; 128+ messages in thread
From: Stefan Hajnoczi @ 2024-09-03 13:55 UTC (permalink / raw)
To: Maciej S. Szmigiero
Cc: Peter Xu, Fabiano Rosas, Alex Williamson, Cédric Le Goater,
Eric Blake, Markus Armbruster, Daniel P . Berrangé,
Avihai Horon, Joao Martins, qemu-devel
On Tue, 27 Aug 2024 at 13:58, Maciej S. Szmigiero
<mail@maciej.szmigiero.name> wrote:
>
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>
> Migration code wants to manage device data sending threads in one place.
>
> QEMU has an existing thread pool implementation, however it was limited
> to queuing AIO operations only and essentially had a 1:1 mapping between
> the current AioContext and the ThreadPool in use.
>
> Implement what is necessary to queue generic (non-AIO) work on a ThreadPool
> too.
>
> This brings a few new operations on a pool:
> * thread_pool_set_minmax_threads() explicitly sets the minimum and maximum
> thread count in the pool.
>
> * thread_pool_join() operation waits until all the submitted work requests
> have finished.
>
> * thread_pool_poll() lets the new thread and / or thread completion bottom
> halves run (if they are indeed scheduled to be run).
> It is useful for thread pool users that need to launch or terminate new
> threads without returning to the QEMU main loop.
Did you consider glib's GThreadPool?
https://docs.gtk.org/glib/struct.ThreadPool.html
QEMU's thread pool is integrated into the QEMU event loop. If your
goal is to bypass the QEMU event loop, then you may as well use the
glib API instead.
thread_pool_join() and thread_pool_poll() will lead to code that
blocks the event loop. QEMU's aio_poll() and nested event loops in
general are a source of hangs and re-entrancy bugs. I would prefer not
introducing these issues in the QEMU ThreadPool API.
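For reference, a minimal GThreadPool usage sketch (worker() and the
thread count are placeholders):

#include <glib.h>

static void worker(gpointer data, gpointer user_data)
{
    /* runs on one of the pool threads, outside any QEMU event loop */
}

static void example(void)
{
    GError *err = NULL;
    /* non-exclusive pool with at most 8 threads */
    GThreadPool *pool = g_thread_pool_new(worker, NULL, 8, FALSE, &err);

    g_thread_pool_push(pool, GINT_TO_POINTER(1), &err);

    /* process all remaining work, wait for it, then destroy the pool */
    g_thread_pool_free(pool, FALSE, TRUE);
}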
>
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> ---
> include/block/thread-pool.h | 10 ++++-
> tests/unit/test-thread-pool.c | 2 +-
> util/thread-pool.c | 77 ++++++++++++++++++++++++++++++-----
> 3 files changed, 76 insertions(+), 13 deletions(-)
>
> diff --git a/include/block/thread-pool.h b/include/block/thread-pool.h
> index b484c4780ea6..1769496056cd 100644
> --- a/include/block/thread-pool.h
> +++ b/include/block/thread-pool.h
> @@ -37,9 +37,15 @@ BlockAIOCB *thread_pool_submit_aio(ThreadPoolFunc *func,
> void *arg, GDestroyNotify arg_destroy,
> BlockCompletionFunc *cb, void *opaque);
> int coroutine_fn thread_pool_submit_co(ThreadPoolFunc *func, void *arg);
> -void thread_pool_submit(ThreadPoolFunc *func,
> - void *arg, GDestroyNotify arg_destroy);
> +BlockAIOCB *thread_pool_submit(ThreadPool *pool, ThreadPoolFunc *func,
> + void *arg, GDestroyNotify arg_destroy,
> + BlockCompletionFunc *cb, void *opaque);
>
> +void thread_pool_join(ThreadPool *pool);
> +void thread_pool_poll(ThreadPool *pool);
> +
> +void thread_pool_set_minmax_threads(ThreadPool *pool,
> + int min_threads, int max_threads);
> void thread_pool_update_params(ThreadPool *pool, struct AioContext *ctx);
>
> #endif
> diff --git a/tests/unit/test-thread-pool.c b/tests/unit/test-thread-pool.c
> index e4afb9e36292..469c0f7057b6 100644
> --- a/tests/unit/test-thread-pool.c
> +++ b/tests/unit/test-thread-pool.c
> @@ -46,7 +46,7 @@ static void done_cb(void *opaque, int ret)
> static void test_submit(void)
> {
> WorkerTestData data = { .n = 0 };
> - thread_pool_submit(worker_cb, &data, NULL);
> + thread_pool_submit(NULL, worker_cb, &data, NULL, NULL, NULL);
> while (data.n == 0) {
> aio_poll(ctx, true);
> }
> diff --git a/util/thread-pool.c b/util/thread-pool.c
> index 69a87ee79252..2bf3be875a51 100644
> --- a/util/thread-pool.c
> +++ b/util/thread-pool.c
> @@ -60,6 +60,7 @@ struct ThreadPool {
> QemuMutex lock;
> QemuCond worker_stopped;
> QemuCond request_cond;
> + QemuCond no_requests_cond;
> QEMUBH *new_thread_bh;
>
> /* The following variables are only accessed from one AioContext. */
> @@ -73,6 +74,7 @@ struct ThreadPool {
> int pending_threads; /* threads created but not running yet */
> int min_threads;
> int max_threads;
> + size_t requests_executing;
> };
>
> static void *worker_thread(void *opaque)
> @@ -107,6 +109,10 @@ static void *worker_thread(void *opaque)
> req = QTAILQ_FIRST(&pool->request_list);
> QTAILQ_REMOVE(&pool->request_list, req, reqs);
> req->state = THREAD_ACTIVE;
> +
> + assert(pool->requests_executing < SIZE_MAX);
> + pool->requests_executing++;
> +
> qemu_mutex_unlock(&pool->lock);
>
> ret = req->func(req->arg);
> @@ -118,6 +124,14 @@ static void *worker_thread(void *opaque)
>
> qemu_bh_schedule(pool->completion_bh);
> qemu_mutex_lock(&pool->lock);
> +
> + assert(pool->requests_executing > 0);
> + pool->requests_executing--;
> +
> + if (pool->requests_executing == 0 &&
> + QTAILQ_EMPTY(&pool->request_list)) {
> + qemu_cond_signal(&pool->no_requests_cond);
> + }
> }
>
> pool->cur_threads--;
> @@ -243,13 +257,16 @@ static const AIOCBInfo thread_pool_aiocb_info = {
> .cancel_async = thread_pool_cancel,
> };
>
> -BlockAIOCB *thread_pool_submit_aio(ThreadPoolFunc *func,
> - void *arg, GDestroyNotify arg_destroy,
> - BlockCompletionFunc *cb, void *opaque)
> +BlockAIOCB *thread_pool_submit(ThreadPool *pool, ThreadPoolFunc *func,
> + void *arg, GDestroyNotify arg_destroy,
> + BlockCompletionFunc *cb, void *opaque)
> {
> ThreadPoolElement *req;
> AioContext *ctx = qemu_get_current_aio_context();
> - ThreadPool *pool = aio_get_thread_pool(ctx);
> +
> + if (!pool) {
> + pool = aio_get_thread_pool(ctx);
> + }
>
> /* Assert that the thread submitting work is the same running the pool */
> assert(pool->ctx == qemu_get_current_aio_context());
> @@ -275,6 +292,18 @@ BlockAIOCB *thread_pool_submit_aio(ThreadPoolFunc *func,
> return &req->common;
> }
>
> +BlockAIOCB *thread_pool_submit_aio(ThreadPoolFunc *func,
> + void *arg, GDestroyNotify arg_destroy,
> + BlockCompletionFunc *cb, void *opaque)
> +{
> + return thread_pool_submit(NULL, func, arg, arg_destroy, cb, opaque);
> +}
> +
> +void thread_pool_poll(ThreadPool *pool)
> +{
> + aio_bh_poll(pool->ctx);
> +}
> +
> typedef struct ThreadPoolCo {
> Coroutine *co;
> int ret;
> @@ -297,18 +326,38 @@ int coroutine_fn thread_pool_submit_co(ThreadPoolFunc *func, void *arg)
> return tpc.ret;
> }
>
> -void thread_pool_submit(ThreadPoolFunc *func,
> - void *arg, GDestroyNotify arg_destroy)
> +void thread_pool_join(ThreadPool *pool)
> {
> - thread_pool_submit_aio(func, arg, arg_destroy, NULL, NULL);
> + /* Assert that the thread waiting is the same running the pool */
> + assert(pool->ctx == qemu_get_current_aio_context());
> +
> + qemu_mutex_lock(&pool->lock);
> +
> + if (pool->requests_executing > 0 ||
> + !QTAILQ_EMPTY(&pool->request_list)) {
> + qemu_cond_wait(&pool->no_requests_cond, &pool->lock);
> + }
> + assert(pool->requests_executing == 0 &&
> + QTAILQ_EMPTY(&pool->request_list));
> +
> + qemu_mutex_unlock(&pool->lock);
> +
> + aio_bh_poll(pool->ctx);
> +
> + assert(QLIST_EMPTY(&pool->head));
> }
>
> -void thread_pool_update_params(ThreadPool *pool, AioContext *ctx)
> +void thread_pool_set_minmax_threads(ThreadPool *pool,
> + int min_threads, int max_threads)
> {
> + assert(min_threads >= 0);
> + assert(max_threads > 0);
> + assert(max_threads >= min_threads);
> +
> qemu_mutex_lock(&pool->lock);
>
> - pool->min_threads = ctx->thread_pool_min;
> - pool->max_threads = ctx->thread_pool_max;
> + pool->min_threads = min_threads;
> + pool->max_threads = max_threads;
>
> /*
> * We either have to:
> @@ -330,6 +379,12 @@ void thread_pool_update_params(ThreadPool *pool, AioContext *ctx)
> qemu_mutex_unlock(&pool->lock);
> }
>
> +void thread_pool_update_params(ThreadPool *pool, AioContext *ctx)
> +{
> + thread_pool_set_minmax_threads(pool,
> + ctx->thread_pool_min, ctx->thread_pool_max);
> +}
> +
> static void thread_pool_init_one(ThreadPool *pool, AioContext *ctx)
> {
> if (!ctx) {
> @@ -342,6 +397,7 @@ static void thread_pool_init_one(ThreadPool *pool, AioContext *ctx)
> qemu_mutex_init(&pool->lock);
> qemu_cond_init(&pool->worker_stopped);
> qemu_cond_init(&pool->request_cond);
> + qemu_cond_init(&pool->no_requests_cond);
> pool->new_thread_bh = aio_bh_new(ctx, spawn_thread_bh_fn, pool);
>
> QLIST_INIT(&pool->head);
> @@ -382,6 +438,7 @@ void thread_pool_free(ThreadPool *pool)
> qemu_mutex_unlock(&pool->lock);
>
> qemu_bh_delete(pool->completion_bh);
> + qemu_cond_destroy(&pool->no_requests_cond);
> qemu_cond_destroy(&pool->request_cond);
> qemu_cond_destroy(&pool->worker_stopped);
> qemu_mutex_destroy(&pool->lock);
>
* Re: [PATCH v2 05/17] thread-pool: Implement non-AIO (generic) pool support
2024-09-03 12:02 ` Maciej S. Szmigiero
@ 2024-09-03 14:26 ` Fabiano Rosas
2024-09-03 18:14 ` Maciej S. Szmigiero
0 siblings, 1 reply; 128+ messages in thread
From: Fabiano Rosas @ 2024-09-03 14:26 UTC (permalink / raw)
To: Maciej S. Szmigiero
Cc: Alex Williamson, Peter Xu, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel, Stefan Hajnoczi, Paolo Bonzini
"Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:
> On 3.09.2024 00:07, Fabiano Rosas wrote:
>> "Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:
>>
>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>
>>> Migration code wants to manage device data sending threads in one place.
>>>
>>> QEMU has an existing thread pool implementation, however it was limited
>>> to queuing AIO operations only and essentially had a 1:1 mapping between
>>> the current AioContext and the ThreadPool in use.
>>>
>>> Implement what is necessary to queue generic (non-AIO) work on a ThreadPool
>>> too.
>>>
>>> This brings a few new operations on a pool:
>>> * thread_pool_set_minmax_threads() explicitly sets the minimum and maximum
>>> thread count in the pool.
>>>
>>> * thread_pool_join() operation waits until all the submitted work requests
>>> have finished.
>>>
>>> * thread_pool_poll() lets the new thread and / or thread completion bottom
>>> halves run (if they are indeed scheduled to be run).
>>> It is useful for thread pool users that need to launch or terminate new
>>> threads without returning to the QEMU main loop.
>>>
>>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>>> ---
>>> include/block/thread-pool.h | 10 ++++-
>>> tests/unit/test-thread-pool.c | 2 +-
>>> util/thread-pool.c | 77 ++++++++++++++++++++++++++++++-----
>>> 3 files changed, 76 insertions(+), 13 deletions(-)
>>>
>>> diff --git a/include/block/thread-pool.h b/include/block/thread-pool.h
>>> index b484c4780ea6..1769496056cd 100644
>>> --- a/include/block/thread-pool.h
>>> +++ b/include/block/thread-pool.h
>>> @@ -37,9 +37,15 @@ BlockAIOCB *thread_pool_submit_aio(ThreadPoolFunc *func,
>>> void *arg, GDestroyNotify arg_destroy,
>>> BlockCompletionFunc *cb, void *opaque);
>>> int coroutine_fn thread_pool_submit_co(ThreadPoolFunc *func, void *arg);
>>> -void thread_pool_submit(ThreadPoolFunc *func,
>>> - void *arg, GDestroyNotify arg_destroy);
>>> +BlockAIOCB *thread_pool_submit(ThreadPool *pool, ThreadPoolFunc *func,
>>> + void *arg, GDestroyNotify arg_destroy,
>>> + BlockCompletionFunc *cb, void *opaque);
>>
>> These kinds of changes (create wrappers, change signatures, etc), could
>> be in their own patch as it's just code motion that should not have
>> functional impact. The "no_requests" stuff would be better discussed in
>> a separate patch.
>
> These changes *all* should have no functional impact on existing callers.
>
> But I get your overall point, will try to separate these really trivial
> parts.
Yeah, I guess I meant that one set of changes has a larger potential for
introducing a bug while the other is clearly harmless.
>
>>>
>>> +void thread_pool_join(ThreadPool *pool);
>>> +void thread_pool_poll(ThreadPool *pool);
>>> +
>>> +void thread_pool_set_minmax_threads(ThreadPool *pool,
>>> + int min_threads, int max_threads);
>>> void thread_pool_update_params(ThreadPool *pool, struct AioContext *ctx);
>>>
>>> #endif
>>> diff --git a/tests/unit/test-thread-pool.c b/tests/unit/test-thread-pool.c
>>> index e4afb9e36292..469c0f7057b6 100644
>>> --- a/tests/unit/test-thread-pool.c
>>> +++ b/tests/unit/test-thread-pool.c
>>> @@ -46,7 +46,7 @@ static void done_cb(void *opaque, int ret)
>>> static void test_submit(void)
>>> {
>>> WorkerTestData data = { .n = 0 };
>>> - thread_pool_submit(worker_cb, &data, NULL);
>>> + thread_pool_submit(NULL, worker_cb, &data, NULL, NULL, NULL);
>>> while (data.n == 0) {
>>> aio_poll(ctx, true);
>>> }
>>> diff --git a/util/thread-pool.c b/util/thread-pool.c
>>> index 69a87ee79252..2bf3be875a51 100644
>>> --- a/util/thread-pool.c
>>> +++ b/util/thread-pool.c
>>> @@ -60,6 +60,7 @@ struct ThreadPool {
>>> QemuMutex lock;
>>> QemuCond worker_stopped;
>>> QemuCond request_cond;
>>> + QemuCond no_requests_cond;
>>> QEMUBH *new_thread_bh;
>>>
>>> /* The following variables are only accessed from one AioContext. */
>>> @@ -73,6 +74,7 @@ struct ThreadPool {
>>> int pending_threads; /* threads created but not running yet */
>>> int min_threads;
>>> int max_threads;
>>> + size_t requests_executing;
>>
>> What's with size_t? Should this be a uint32_t instead?
>
> Sizes of objects are normally size_t, since otherwise bad
> things happen if objects are bigger than 4 GiB.
Ok, but requests_executing is not the size of an object. It's the number
of objects in a linked list that satisfy a certain predicate. There are
no address space size considerations here.
>
> Considering that the minimum object size is 1 byte the
> max count of distinct objects also needs a size_t to not
> risk an overflow.
I'm not sure I get you, there's no overflow since you're bounds checking
with the assert. Or is this a more abstract line of thought about how
many ThreadPoolElements can be present in memory at a time and you'd
like a type that's certain to fit the theoretical amount of objects?
>
> I think that while 2^32 requests executing seems unlikely,
> saving 4 bytes is not worth worrying that someone will
> find a vulnerability triggered by overflowing a 32-bit
> variable (not necessarily in the migration code but in some
> other thread pool user).
>
>>> };
>>>
>>> static void *worker_thread(void *opaque)
>>> @@ -107,6 +109,10 @@ static void *worker_thread(void *opaque)
>>> req = QTAILQ_FIRST(&pool->request_list);
>>> QTAILQ_REMOVE(&pool->request_list, req, reqs);
>>> req->state = THREAD_ACTIVE;
>>> +
>>> + assert(pool->requests_executing < SIZE_MAX);
>>> + pool->requests_executing++;
>>> +
>>> qemu_mutex_unlock(&pool->lock);
>>>
>>> ret = req->func(req->arg);
>>> @@ -118,6 +124,14 @@ static void *worker_thread(void *opaque)
>>>
>>> qemu_bh_schedule(pool->completion_bh);
>>> qemu_mutex_lock(&pool->lock);
>>> +
>>> + assert(pool->requests_executing > 0);
>>> + pool->requests_executing--;
>>> +
>>> + if (pool->requests_executing == 0 &&
>>> + QTAILQ_EMPTY(&pool->request_list)) {
>>> + qemu_cond_signal(&pool->no_requests_cond);
>>> + }
>>
>> An empty requests list and no request in flight means the worker will
>> now exit after the timeout, no? Can you just kick the worker out of the
>> wait and use pool->worker_stopped instead of the new condition variable?
>
> First, all threads won't terminate if either min_threads or max_threads
> isn't 0.
Ah I overlooked the break condition, nevermind.
> It might be in the migration thread pool case but we are adding a
> generic thread pool so it should be as universal as possible.
> thread_pool_free() can get away with overwriting these values since
> it is destroying the pool anyway.
>
> Also, the *_join() (or whatever its final name will be) operation is
> about waiting for all requests / work items to finish, not about waiting
> for threads to terminate.
Right, but the idea was to piggyback on the thread termination to infer
(the obvious) requests service termination. We cannot do that, as you've
explained, fine.
> It's essentially a synchronization point for a thread pool, not a cleanup.
>
>>> }
>>>
>>> pool->cur_threads--;
>>> @@ -243,13 +257,16 @@ static const AIOCBInfo thread_pool_aiocb_info = {
>>> .cancel_async = thread_pool_cancel,
>>> };
>>>
>>> -BlockAIOCB *thread_pool_submit_aio(ThreadPoolFunc *func,
>>> - void *arg, GDestroyNotify arg_destroy,
>>> - BlockCompletionFunc *cb, void *opaque)
>>> +BlockAIOCB *thread_pool_submit(ThreadPool *pool, ThreadPoolFunc *func,
>>> + void *arg, GDestroyNotify arg_destroy,
>>> + BlockCompletionFunc *cb, void *opaque)
>>> {
>>> ThreadPoolElement *req;
>>> AioContext *ctx = qemu_get_current_aio_context();
>>> - ThreadPool *pool = aio_get_thread_pool(ctx);
>>> +
>>> + if (!pool) {
>>> + pool = aio_get_thread_pool(ctx);
>>> + }
>>
>> I'd go for a separate implementation to really drive the point that this
>> new usage is different. See the code snippet below.
>
> I see your point there - will split these implementations.
>
>> It seems we're a short step away to being able to use this
>> implementation in a general way. Is there something that can be done
>> with the 'common' field in the ThreadPoolElement?
>
> The non-AIO request flow still needs the completion callback from BlockAIOCB
> (and its argument pointer) so removing the "common" field from these requests
> would mean introducing two "flavors" of ThreadPoolElement.
>
> Not sure the memory savings here are worth the increase in code complexity.
I'm not asking that of you, but I think it should be done
eventually. The QEMU block layer is very particular and I wouldn't want
the use-cases for the thread-pool to get confused. But I can't see a way
out right now, so let's postpone this, see if anyone else has comments.
>
>> ========
>> static void thread_pool_submit_request(ThreadPool *pool, ThreadPoolElement *req)
>> {
>> req->state = THREAD_QUEUED;
>> req->pool = pool;
>>
>> QLIST_INSERT_HEAD(&pool->head, req, all);
>>
>> trace_thread_pool_submit(pool, req, req->arg);
>>
>> qemu_mutex_lock(&pool->lock);
>> if (pool->idle_threads == 0 && pool->cur_threads < pool->max_threads) {
>> spawn_thread(pool);
>> }
>> QTAILQ_INSERT_TAIL(&pool->request_list, req, reqs);
>> qemu_mutex_unlock(&pool->lock);
>> qemu_cond_signal(&pool->request_cond);
>> }
>>
>> BlockAIOCB *thread_pool_submit_aio(ThreadPoolFunc *func, void *arg,
>> BlockCompletionFunc *cb, void *opaque)
>> {
>> ThreadPoolElement *req;
>> AioContext *ctx = qemu_get_current_aio_context();
>> ThreadPool *pool = aio_get_thread_pool(ctx);
>>
>> /* Assert that the thread submitting work is the same running the pool */
>> assert(pool->ctx == qemu_get_current_aio_context());
>>
>> req = qemu_aio_get(&thread_pool_aiocb_info, NULL, cb, opaque);
>> req->func = func;
>> req->arg = arg;
>>
>> thread_pool_submit_request(pool, req);
>> return &req->common;
>> }
>>
>> void thread_pool_submit(ThreadPool *pool, ThreadPoolFunc *func, void *arg)
>> {
>> ThreadPoolElement *req;
>>
>> req = g_malloc(sizeof(ThreadPoolElement));
>> req->func = func;
>> req->arg = arg;
>>
>> thread_pool_submit_request(pool, req);
>> }
>> =================
>>
>>>
>>> /* Assert that the thread submitting work is the same running the pool */
>>> assert(pool->ctx == qemu_get_current_aio_context());
>>> @@ -275,6 +292,18 @@ BlockAIOCB *thread_pool_submit_aio(ThreadPoolFunc *func,
>>> return &req->common;
>>> }
>>>
>>> +BlockAIOCB *thread_pool_submit_aio(ThreadPoolFunc *func,
>>> + void *arg, GDestroyNotify arg_destroy,
>>> + BlockCompletionFunc *cb, void *opaque)
>>> +{
>>> + return thread_pool_submit(NULL, func, arg, arg_destroy, cb, opaque);
>>> +}
>>> +
>>> +void thread_pool_poll(ThreadPool *pool)
>>> +{
>>> + aio_bh_poll(pool->ctx);
>>> +}
>>> +
>>> typedef struct ThreadPoolCo {
>>> Coroutine *co;
>>> int ret;
>>> @@ -297,18 +326,38 @@ int coroutine_fn thread_pool_submit_co(ThreadPoolFunc *func, void *arg)
>>> return tpc.ret;
>>> }
>>>
>>> -void thread_pool_submit(ThreadPoolFunc *func,
>>> - void *arg, GDestroyNotify arg_destroy)
>>> +void thread_pool_join(ThreadPool *pool)
>>
>> This is misleading because it's about the requests, not the threads in
>> the pool. Compare with what thread_pool_free does:
>>
>> /* Wait for worker threads to terminate */
>> pool->max_threads = 0;
>> qemu_cond_broadcast(&pool->request_cond);
>> while (pool->cur_threads > 0) {
>> qemu_cond_wait(&pool->worker_stopped, &pool->lock);
>> }
>>
>
> I'm open to thread_pool_join() better naming proposals.
thread_pool_wait() might be better.
>
> Thanks,
> Maciej
* Re: [PATCH v2 09/17] migration/multifd: Device state transfer support - receive side
2024-09-02 20:12 ` Maciej S. Szmigiero
@ 2024-09-03 14:42 ` Fabiano Rosas
2024-09-03 18:41 ` Maciej S. Szmigiero
2024-09-09 19:52 ` Peter Xu
1 sibling, 1 reply; 128+ messages in thread
From: Fabiano Rosas @ 2024-09-03 14:42 UTC (permalink / raw)
To: Maciej S. Szmigiero
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel, Peter Xu
"Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:
> On 30.08.2024 22:22, Fabiano Rosas wrote:
>> "Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:
>>
>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>
>>> Add basic support for receiving device state via multifd channels -
>>> channels that are shared with RAM transfers.
>>>
>>> To differentiate between a device state and a RAM packet the packet
>>> header is read first.
>>>
>>> Depending on whether the MULTIFD_FLAG_DEVICE_STATE flag is present or not
>>> in the packet header, either device state (MultiFDPacketDeviceState_t) or
>>> RAM data (existing MultiFDPacket_t) is then read.
>>>
>>> The received device state data is provided to
>>> qemu_loadvm_load_state_buffer() function for processing in the
>>> device's load_state_buffer handler.
>>>
>>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>>> ---
>>> migration/multifd.c | 127 +++++++++++++++++++++++++++++++++++++-------
>>> migration/multifd.h | 31 ++++++++++-
>>> 2 files changed, 138 insertions(+), 20 deletions(-)
>>>
>>> diff --git a/migration/multifd.c b/migration/multifd.c
>>> index b06a9fab500e..d5a8e5a9c9b5 100644
>>> --- a/migration/multifd.c
>>> +++ b/migration/multifd.c
> (..)
>>> g_free(p->zero);
>>> @@ -1126,8 +1159,13 @@ static void *multifd_recv_thread(void *opaque)
>>> rcu_register_thread();
>>>
>>> while (true) {
>>> + MultiFDPacketHdr_t hdr;
>>> uint32_t flags = 0;
>>> + bool is_device_state = false;
>>> bool has_data = false;
>>> + uint8_t *pkt_buf;
>>> + size_t pkt_len;
>>> +
>>> p->normal_num = 0;
>>>
>>> if (use_packets) {
>>> @@ -1135,8 +1173,28 @@ static void *multifd_recv_thread(void *opaque)
>>> break;
>>> }
>>>
>>> - ret = qio_channel_read_all_eof(p->c, (void *)p->packet,
>>> - p->packet_len, &local_err);
>>> + ret = qio_channel_read_all_eof(p->c, (void *)&hdr,
>>> + sizeof(hdr), &local_err);
>>> + if (ret == 0 || ret == -1) { /* 0: EOF -1: Error */
>>> + break;
>>> + }
>>> +
>>> + ret = multifd_recv_unfill_packet_header(p, &hdr, &local_err);
>>> + if (ret) {
>>> + break;
>>> + }
>>> +
>>> + is_device_state = p->flags & MULTIFD_FLAG_DEVICE_STATE;
>>> + if (is_device_state) {
>>> + pkt_buf = (uint8_t *)p->packet_dev_state + sizeof(hdr);
>>> + pkt_len = sizeof(*p->packet_dev_state) - sizeof(hdr);
>>> + } else {
>>> + pkt_buf = (uint8_t *)p->packet + sizeof(hdr);
>>> + pkt_len = p->packet_len - sizeof(hdr);
>>> + }
>>
>> Should we have made the packet an union as well? Would simplify these
>> sorts of operations. Not sure I want to start messing with that at this
>> point to be honest. But OTOH, look at this...
>
> RAM packet length is not constant (at least from the viewpoint of the
> migration code) so the union allocation would need some kind of a
> "multifd_ram_packet_size()" runtime size determination.
>
> Also, since the RAM and device state packet body sizes are different,
> all that the extra complexity introduced by such a union would buy us
> is getting rid of that single pkt_buf assignment.
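(For reference, such a union would presumably have to look something
like the sketch below - with multifd_ram_packet_size() being the
hypothetical runtime size helper mentioned above:

typedef union {
    MultiFDPacketHdr_t hdr;                /* common header, read first */
    MultiFDPacket_t ram;                   /* variable length, needs
                                              multifd_ram_packet_size() */
    MultiFDPacketDeviceState_t dev_state;  /* fixed length */
} MultiFDPacketAny;

so allocating it could no longer be a plain sizeof() + g_malloc0().)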
>
>>> +
>>> + ret = qio_channel_read_all_eof(p->c, (char *)pkt_buf, pkt_len,
>>> + &local_err);
>>> if (ret == 0 || ret == -1) { /* 0: EOF -1: Error */
>>> break;
>>> }
>>> @@ -1181,8 +1239,33 @@ static void *multifd_recv_thread(void *opaque)
>>> has_data = !!p->data->size;
>>> }
>>>
>>> - if (has_data) {
>>> - ret = multifd_recv_state->ops->recv(p, &local_err);
>>> + if (!is_device_state) {
>>> + if (has_data) {
>>> + ret = multifd_recv_state->ops->recv(p, &local_err);
>>> + if (ret != 0) {
>>> + break;
>>> + }
>>> + }
>>> + } else {
>>> + g_autofree char *idstr = NULL;
>>> + g_autofree char *dev_state_buf = NULL;
>>> +
>>> + assert(use_packets);
>>> +
>>> + if (p->next_packet_size > 0) {
>>> + dev_state_buf = g_malloc(p->next_packet_size);
>>> +
>>> + ret = qio_channel_read_all(p->c, dev_state_buf, p->next_packet_size, &local_err);
>>> + if (ret != 0) {
>>> + break;
>>> + }
>>> + }
>>
>> What's the use case for !next_packet_size and still call
>> load_state_buffer below? I can't see it.
>
> Currently, next_packet_size == 0 indeed has no use - it is
> a leftover from an early version of the patch set (not public)
> that had device state packet (chunk) indexing done by
> the common migration code, rather than by the VFIO consumer.
>
> And then an empty packet could be used to mark the stream
> boundary - like the max chunk number to expect.
>
>> ...because I would suggest to set has_data up there with
>> p->next_packet_size:
>>
>> if (use_packets) {
>> ...
>> has_data = p->next_packet_size || p->zero_num;
>> } else {
>> ...
>> has_data = !!p->data_size;
>> }
>>
>> and this whole block would be:
>>
>> if (has_data) {
>> if (is_device_state) {
>> multifd_device_state_recv(p, &local_err);
>> } else {
>> ret = multifd_recv_state->ops->recv(p, &local_err);
>> }
>> }
>
> The above block makes sense to me with two caveats:
I have suggestions below, but this is no big deal, so feel free to go
with what you think works best.
> 1) If empty device state packets (next_packet_size == 0) were
> to be unsupported they need to be rejected cleanly rather
> than silently skipped,
Should this be rejected on the send side? That's the most likely source
of the problem if it happens. Don't need to send something we know will
cause an error when loading.
And for the case of stream corruption of some sort we could hoist the
check from load_buffer into here:
else if (is_device_state) {
> error_setg(errp, "empty device state packet");
break;
}
> 2) has_data has to have its value computed depending on whether
> this is a RAM or a device state packet, since looking at
> p->normal_num and p->zero_num makes no sense for a device state
> packet, while I am not sure that looking at p->next_packet_size
> for a RAM packet won't introduce some subtle regression.
It should be ok to use next_packet_size for RAM, it must always be in
sync with normal_num.
>
>>> +
>>> + idstr = g_strndup(p->packet_dev_state->idstr, sizeof(p->packet_dev_state->idstr));
>>> + ret = qemu_loadvm_load_state_buffer(idstr,
>>> + p->packet_dev_state->instance_id,
>>> + dev_state_buf, p->next_packet_size,
>>> + &local_err);
>>> if (ret != 0) {
>>> break;
>>> }
>>> @@ -1190,6 +1273,11 @@ static void *multifd_recv_thread(void *opaque)
>>>
>>> if (use_packets) {
>>> if (flags & MULTIFD_FLAG_SYNC) {
>>> + if (is_device_state) {
>>> + error_setg(&local_err, "multifd: received SYNC device state packet");
>>> + break;
>>> + }
>>
>> assert(!is_device_state) enough?
>
> It's not a bug in the receiver code but rather an issue with the
> remote QEMU sending us wrong data if we get a SYNC device state
> packet.
>
> So I think returning an error is more appropriate than triggering
> an assert() failure for that.
ok
>>> +
>>> qemu_sem_post(&multifd_recv_state->sem_sync);
>>> qemu_sem_wait(&p->sem_sync);
>>> }
>>> @@ -1258,6 +1346,7 @@ int multifd_recv_setup(Error **errp)
>>> p->packet_len = sizeof(MultiFDPacket_t)
>>> + sizeof(uint64_t) * page_count;
>>> p->packet = g_malloc0(p->packet_len);
>>> + p->packet_dev_state = g_malloc0(sizeof(*p->packet_dev_state));
>>> }
>>> p->name = g_strdup_printf("mig/dst/recv_%d", i);
>>> p->normal = g_new0(ram_addr_t, page_count);
>>> diff --git a/migration/multifd.h b/migration/multifd.h
>>> index a3e35196d179..a8f3e4838c01 100644
>>> --- a/migration/multifd.h
>>> +++ b/migration/multifd.h
>>> @@ -45,6 +45,12 @@ MultiFDRecvData *multifd_get_recv_data(void);
>>> #define MULTIFD_FLAG_QPL (4 << 1)
>>> #define MULTIFD_FLAG_UADK (8 << 1)
>>>
>>> +/*
>>> + * If set it means that this packet contains device state
>>> + * (MultiFDPacketDeviceState_t), not RAM data (MultiFDPacket_t).
>>> + */
>>> +#define MULTIFD_FLAG_DEVICE_STATE (1 << 4)
>>
>> Overlaps with UADK. I assume on purpose because device_state doesn't
>> support compression? Might be worth a comment.
>>
>
> Yes, the device state transfer bit stream does not support compression
> so it is not a problem since these "compression type" flags will never
> be set in such a bit stream anyway.
>
> Will add a relevant comment here.
>
> Thanks,
> Maciej
* Re: [PATCH v2 10/17] migration/multifd: Convert multifd_send()::next_channel to atomic
2024-09-02 20:11 ` Maciej S. Szmigiero
@ 2024-09-03 15:01 ` Fabiano Rosas
2024-09-03 20:04 ` Maciej S. Szmigiero
0 siblings, 1 reply; 128+ messages in thread
From: Fabiano Rosas @ 2024-09-03 15:01 UTC (permalink / raw)
To: Maciej S. Szmigiero
Cc: Alex Williamson, Peter Xu, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
"Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:
> On 30.08.2024 20:13, Fabiano Rosas wrote:
>> "Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:
>>
>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>
>>> This is necessary for multifd_send() to be able to be called
>>> from multiple threads.
>>>
>>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>>> ---
>>> migration/multifd.c | 24 ++++++++++++++++++------
>>> 1 file changed, 18 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/migration/multifd.c b/migration/multifd.c
>>> index d5a8e5a9c9b5..b25789dde0b3 100644
>>> --- a/migration/multifd.c
>>> +++ b/migration/multifd.c
>>> @@ -343,26 +343,38 @@ bool multifd_send(MultiFDSendData **send_data)
>>> return false;
>>> }
>>>
>>> - /* We wait here, until at least one channel is ready */
>>> - qemu_sem_wait(&multifd_send_state->channels_ready);
>>> -
>>> /*
>>> * next_channel can remain from a previous migration that was
>>> * using more channels, so ensure it doesn't overflow if the
>>> * limit is lower now.
>>> */
>>> - next_channel %= migrate_multifd_channels();
>>> - for (i = next_channel;; i = (i + 1) % migrate_multifd_channels()) {
>>> + i = qatomic_load_acquire(&next_channel);
>>> + if (unlikely(i >= migrate_multifd_channels())) {
>>> + qatomic_cmpxchg(&next_channel, i, 0);
>>> + }
>>
>> Do we still need this? It seems not, because the mod down below would
>> already truncate to a value less than the number of channels. We don't
>> need it to start at 0 always, the channels are equivalent.
>
> The "modulo" operation below forces i_next to be in the proper range,
> not i.
>
> If the qatomic_cmpxchg() ends up succeeding then we use the (now out of
> bounds) i value to index multifd_send_state->params[].
Indeed.
>
>>> +
>>> + /* We wait here, until at least one channel is ready */
>>> + qemu_sem_wait(&multifd_send_state->channels_ready);
>>> +
>>> + while (true) {
>>> + int i_next;
>>> +
>>> if (multifd_send_should_exit()) {
>>> return false;
>>> }
>>> +
>>> + i = qatomic_load_acquire(&next_channel);
>>> + i_next = (i + 1) % migrate_multifd_channels();
>>> + if (qatomic_cmpxchg(&next_channel, i, i_next) != i) {
>>> + continue;
>>> + }
>>
>> Say channel 'i' is the only one that's idle. What's stopping the other
>> thread(s) to race at this point and loop around to the same index?
>
> See the reply below.
>
>>> +
>>> p = &multifd_send_state->params[i];
>>> /*
>>> * Lockless read to p->pending_job is safe, because only multifd
>>> * sender thread can clear it.
>>> */
>>> if (qatomic_read(&p->pending_job) == false) {
>>
>> With the cmpxchg your other patch adds here, then the race I mentioned
>> above should be harmless. But we'd need to bring that code into this
>> patch.
>>
>
> You're right - the sender code with this patch alone isn't thread safe
> yet but this commit is only literally about "converting
> multifd_send()::next_channel to atomic".
>
> At the time of this patch there aren't any multifd_send() calls from
> multiple threads, and the commit that introduces such possible call
> site (multifd_queue_device_state()) also modifies multifd_send()
> to be fully thread safe by introducing p->pending_job_preparing.
In general this would be a bad practice because this commit can end up
being moved around due to backporting or bisecting. It would be better
if it were complete from the start. It also affects backporting due to
ambiguity on where the Fixes tag should point to if someone eventually
finds a bug.
I already asked you to extract the other code into a separate patch, so
it's not that bad. If you prefer to keep both changes separate for
clarity, please note on the commit message that the next patch is
necessary for correctness.
>
> Thanks,
> Maciej
* Re: [PATCH v2 13/17] migration/multifd: Add migration_has_device_state_support()
2024-09-02 20:11 ` Maciej S. Szmigiero
@ 2024-09-03 15:09 ` Fabiano Rosas
0 siblings, 0 replies; 128+ messages in thread
From: Fabiano Rosas @ 2024-09-03 15:09 UTC (permalink / raw)
To: Maciej S. Szmigiero
Cc: Alex Williamson, Peter Xu, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
"Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:
> On 30.08.2024 20:55, Fabiano Rosas wrote:
>> "Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:
>>
>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>
>>> Since device state transfer via multifd channels requires multifd
>>> channels with packets and is currently not compatible with multifd
>>> compression add an appropriate query function so device can learn
>>> whether it can actually make use of it.
>>>
>>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>>
>> Reviewed-by: Fabiano Rosas <farosas@suse.de>
>>
>> Out of curiosity, what do you see as a blocker for migrating to a file?
>>
>> We would just need to figure out a mapping from file offset some unit of
>> data to be able to write in parallel like with ram (of which the page
>> offset is mapped to the file offset).
>
> I'm not sure whether there's a point in that since VFIO devices
> just provide a raw device state stream - there's no way to know
> that some buffer is no longer needed because it consisted of
> dirty data that was completely overwritten by a later buffer.
>
> Also, the device type that the code was developed against - a (smart)
> NIC - has so large device state because (more or less) it keeps a lot
> of data about network connections passing / made through it.
>
> It doesn't really make sense to make a snapshot of such a device for later
> reload since these connections will be long dropped by their remote
> peers by this point.
>
> Such snapshotting might make more sense with GPU VFIO devices though.
>
> If such file migration support is desired at some later point then for
> sure the whole code would need to be carefully re-checked for implicit
> assumptions.
Thanks, let's keep those arguments in mind, maybe we put them in a doc
somewhere so we remember this in the future.
>
> Thanks,
> Maciej
* Re: [PATCH v2 05/17] thread-pool: Implement non-AIO (generic) pool support
2024-09-03 13:55 ` Stefan Hajnoczi
@ 2024-09-03 16:54 ` Maciej S. Szmigiero
2024-09-03 19:04 ` Stefan Hajnoczi
0 siblings, 1 reply; 128+ messages in thread
From: Maciej S. Szmigiero @ 2024-09-03 16:54 UTC (permalink / raw)
To: Stefan Hajnoczi
Cc: Peter Xu, Fabiano Rosas, Alex Williamson, Cédric Le Goater,
Eric Blake, Markus Armbruster, Daniel P . Berrangé,
Avihai Horon, Joao Martins, qemu-devel
On 3.09.2024 15:55, Stefan Hajnoczi wrote:
> On Tue, 27 Aug 2024 at 13:58, Maciej S. Szmigiero
> <mail@maciej.szmigiero.name> wrote:
>>
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> Migration code wants to manage device data sending threads in one place.
>>
>> QEMU has an existing thread pool implementation, however it was limited
>> to queuing AIO operations only and essentially had a 1:1 mapping between
>> the current AioContext and the ThreadPool in use.
>>
>> Implement what is necessary to queue generic (non-AIO) work on a ThreadPool
>> too.
>>
>> This brings a few new operations on a pool:
>> * thread_pool_set_minmax_threads() explicitly sets the minimum and maximum
>> thread count in the pool.
>>
>> * thread_pool_join() operation waits until all the submitted work requests
>> have finished.
>>
>> * thread_pool_poll() lets the new thread and / or thread completion bottom
>> halves run (if they are indeed scheduled to be run).
>> It is useful for thread pool users that need to launch or terminate new
>> threads without returning to the QEMU main loop.
>
> Did you consider glib's GThreadPool?
> https://docs.gtk.org/glib/struct.ThreadPool.html
>
> QEMU's thread pool is integrated into the QEMU event loop. If your
> goal is to bypass the QEMU event loop, then you may as well use the
> glib API instead.
>
> thread_pool_join() and thread_pool_poll() will lead to code that
> blocks the event loop. QEMU's aio_poll() and nested event loops in
> general are a source of hangs and re-entrancy bugs. I would prefer not
> introducing these issues in the QEMU ThreadPool API.
>
Unfortunately, the problem with the migration code is that it is
synchronous - it does not return to the main event loop until the
migration is done.
So the only way to handle things that need a working event loop is to
pump it manually from inside the migration code.
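Roughly like this sketch, with all_work_done() standing for whatever
completion condition the migration code tracks (in practice interleaved
with blocking on a semaphore or similar rather than spinning):

/* in the migration code, without returning to the main event loop */
while (!all_work_done()) {
    /* let the new-thread / completion bottom halves make progress */
    thread_pool_poll(pool);
}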
The reason why I used the QEMU thread pool in the first place in this
patch set version is because Peter asked me to do so during the review
of its previous iteration [1].
Peter also asked me previously to move to QEMU synchronization
primitives from using the Glib ones in the early version of this
patch set [2].
I personally would rather use something common to many applications,
well tested and with more pairs of eyes looking at it, rather than
re-invent things in QEMU.
Looking at GThreadPool it seems that it lacks the ability to wait until
all queued work has finished, so this would need to be open-coded
in the migration code.
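The open-coding would presumably amount to something like the sketch
below, tracking in-flight work next to the GThreadPool (names are made
up; g_mutex_init() / g_cond_init() calls elided):

typedef struct {
    GMutex lock;
    GCond all_done;
    unsigned int in_flight; /* bumped on push, dropped by the worker */
} WorkTracker;

static void work_finished(WorkTracker *t)
{
    g_mutex_lock(&t->lock);
    if (--t->in_flight == 0) {
        g_cond_signal(&t->all_done);
    }
    g_mutex_unlock(&t->lock);
}

static void wait_for_all_work(WorkTracker *t)
{
    g_mutex_lock(&t->lock);
    while (t->in_flight > 0) {
        g_cond_wait(&t->all_done, &t->lock);
    }
    g_mutex_unlock(&t->lock);
}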
@Peter, what's your opinion on using Glib's thread pool instead of
QEMU's one, considering the above things?
Thanks,
Maciej
[1]: https://lore.kernel.org/qemu-devel/ZniFH14DT6ycjbrL@x1n/ point 5: "Worker thread model"
[2]: https://lore.kernel.org/qemu-devel/Zi_9SyJy__8wJTou@x1n/
* Re: [PATCH v2 05/17] thread-pool: Implement non-AIO (generic) pool support
2024-09-03 14:26 ` Fabiano Rosas
@ 2024-09-03 18:14 ` Maciej S. Szmigiero
0 siblings, 0 replies; 128+ messages in thread
From: Maciej S. Szmigiero @ 2024-09-03 18:14 UTC (permalink / raw)
To: Fabiano Rosas
Cc: Alex Williamson, Peter Xu, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel, Stefan Hajnoczi, Paolo Bonzini
On 3.09.2024 16:26, Fabiano Rosas wrote:
> "Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:
>
>> On 3.09.2024 00:07, Fabiano Rosas wrote:
>>> "Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:
>>>
>>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>>
>>>> Migration code wants to manage device data sending threads in one place.
>>>>
>>>> QEMU has an existing thread pool implementation, however it was limited
>>>> to queuing AIO operations only and essentially had a 1:1 mapping between
>>>> the current AioContext and the ThreadPool in use.
>>>>
>>>> Implement what is necessary to queue generic (non-AIO) work on a ThreadPool
>>>> too.
>>>>
>>>> This brings a few new operations on a pool:
>>>> * thread_pool_set_minmax_threads() explicitly sets the minimum and maximum
>>>> thread count in the pool.
>>>>
>>>> * thread_pool_join() operation waits until all the submitted work requests
>>>> have finished.
>>>>
>>>> * thread_pool_poll() lets the new thread and / or thread completion bottom
>>>> halves run (if they are indeed scheduled to be run).
>>>> It is useful for thread pool users that need to launch or terminate new
>>>> threads without returning to the QEMU main loop.
>>>>
>>>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>>>> ---
>>>> include/block/thread-pool.h | 10 ++++-
>>>> tests/unit/test-thread-pool.c | 2 +-
>>>> util/thread-pool.c | 77 ++++++++++++++++++++++++++++++-----
>>>> 3 files changed, 76 insertions(+), 13 deletions(-)
>>>>
>>>> diff --git a/include/block/thread-pool.h b/include/block/thread-pool.h
>>>> index b484c4780ea6..1769496056cd 100644
>>>> --- a/include/block/thread-pool.h
>>>> +++ b/include/block/thread-pool.h
>>>> @@ -37,9 +37,15 @@ BlockAIOCB *thread_pool_submit_aio(ThreadPoolFunc *func,
>>>> void *arg, GDestroyNotify arg_destroy,
>>>> BlockCompletionFunc *cb, void *opaque);
>>>> int coroutine_fn thread_pool_submit_co(ThreadPoolFunc *func, void *arg);
>>>> -void thread_pool_submit(ThreadPoolFunc *func,
>>>> - void *arg, GDestroyNotify arg_destroy);
>>>> +BlockAIOCB *thread_pool_submit(ThreadPool *pool, ThreadPoolFunc *func,
>>>> + void *arg, GDestroyNotify arg_destroy,
>>>> + BlockCompletionFunc *cb, void *opaque);
>>>
>>> These kinds of changes (create wrappers, change signatures, etc), could
>>> be in their own patch as it's just code motion that should not have
>>> functional impact. The "no_requests" stuff would be better discussed in
>>> a separate patch.
>>
>> These changes *all* should have no functional impact on existing callers.
>>
>> But I get your overall point, will try to separate these really trivial
>> parts.
>
> Yeah, I guess I meant that one set of changes has a larger potential for
> introducing a bug while the other is clearly harmless.
I understand.
>>
>>>>
>>>> +void thread_pool_join(ThreadPool *pool);
>>>> +void thread_pool_poll(ThreadPool *pool);
>>>> +
>>>> +void thread_pool_set_minmax_threads(ThreadPool *pool,
>>>> + int min_threads, int max_threads);
>>>> void thread_pool_update_params(ThreadPool *pool, struct AioContext *ctx);
>>>>
>>>> #endif
>>>> diff --git a/tests/unit/test-thread-pool.c b/tests/unit/test-thread-pool.c
>>>> index e4afb9e36292..469c0f7057b6 100644
>>>> --- a/tests/unit/test-thread-pool.c
>>>> +++ b/tests/unit/test-thread-pool.c
>>>> @@ -46,7 +46,7 @@ static void done_cb(void *opaque, int ret)
>>>> static void test_submit(void)
>>>> {
>>>> WorkerTestData data = { .n = 0 };
>>>> - thread_pool_submit(worker_cb, &data, NULL);
>>>> + thread_pool_submit(NULL, worker_cb, &data, NULL, NULL, NULL);
>>>> while (data.n == 0) {
>>>> aio_poll(ctx, true);
>>>> }
>>>> diff --git a/util/thread-pool.c b/util/thread-pool.c
>>>> index 69a87ee79252..2bf3be875a51 100644
>>>> --- a/util/thread-pool.c
>>>> +++ b/util/thread-pool.c
>>>> @@ -60,6 +60,7 @@ struct ThreadPool {
>>>> QemuMutex lock;
>>>> QemuCond worker_stopped;
>>>> QemuCond request_cond;
>>>> + QemuCond no_requests_cond;
>>>> QEMUBH *new_thread_bh;
>>>>
>>>> /* The following variables are only accessed from one AioContext. */
>>>> @@ -73,6 +74,7 @@ struct ThreadPool {
>>>> int pending_threads; /* threads created but not running yet */
>>>> int min_threads;
>>>> int max_threads;
>>>> + size_t requests_executing;
>>>
>>> What's with size_t? Should this be a uint32_t instead?
>>
>> Sizes of objects are normally size_t, since otherwise bad
>> things happen if objects are bigger than 4 GiB.
>
> Ok, but requests_executing is not the size of an object. It's the number
> of objects in a linked list that satisfy a certain predicate. There are
> no address space size considerations here.
Max object count = Address space size / Min object size
If min object size = 1 then Max object count = Address space size
>>
>> Considering that the minimum object size is 1 byte the
>> max count of distinct objects also needs a size_t to not
>> risk an overflow.
>
> I'm not sure I get you, there's no overflow since you're bounds checking
> with the assert. Or is this a more abstract line of thought about how
> many ThreadPoolElements can be present in memory at a time and you'd
> like a type that's certain to fit the theoretical amount of objects?
It's more of a theoretical thing (not wanting to introduce an
unnecessary limit), but you are right that assert() would cover any
possible issues due to counter overflow, so I can change it to uint32_t
indeed.
>>
>> I think that while 2^32 requests executing seems unlikely,
>> saving 4 bytes is not worth worrying that someone will
>> find a vulnerability triggered by overflowing a 32-bit
>> variable (not necessarily in the migration code but in some
>> other thread pool user).
>>
>>>> };
>>>>
>>>> static void *worker_thread(void *opaque)
>>>> @@ -107,6 +109,10 @@ static void *worker_thread(void *opaque)
>>>> req = QTAILQ_FIRST(&pool->request_list);
>>>> QTAILQ_REMOVE(&pool->request_list, req, reqs);
>>>> req->state = THREAD_ACTIVE;
>>>> +
>>>> + assert(pool->requests_executing < SIZE_MAX);
>>>> + pool->requests_executing++;
>>>> +
>>>> qemu_mutex_unlock(&pool->lock);
>>>>
>>>> ret = req->func(req->arg);
>>>> @@ -118,6 +124,14 @@ static void *worker_thread(void *opaque)
>>>>
>>>> qemu_bh_schedule(pool->completion_bh);
>>>> qemu_mutex_lock(&pool->lock);
>>>> +
>>>> + assert(pool->requests_executing > 0);
>>>> + pool->requests_executing--;
>>>> +
>>>> + if (pool->requests_executing == 0 &&
>>>> + QTAILQ_EMPTY(&pool->request_list)) {
>>>> + qemu_cond_signal(&pool->no_requests_cond);
>>>> + }
>>>
>>> An empty requests list and no request in flight means the worker will
>>> now exit after the timeout, no? Can you just kick the worker out of the
>>> wait and use pool->worker_stopped instead of the new condition variable?
>>
>> First, all threads won't terminate if either min_threads or max_threads
>> isn't 0.
>
> Ah I overlooked the break condition, nevermind.
>
>> It might be in the migration thread pool case but we are adding a
>> generic thread pool so it should be as universal as possible.
>> thread_pool_free() can get away with overwriting these values since
>> it is destroying the pool anyway.
>>
>> Also, the *_join() (or whatever its final name will be) operation is
>> about waiting for all requests / work items to finish, not about waiting
>> for threads to terminate.
>
> Right, but the idea was to piggyback on the thread termination to infer
> (the obvious) requests service termination. We cannot do that, as you've
> explained, fine.
>
>> It's essentially a synchronization point for a thread pool, not a cleanup.
>>
>>>> }
>>>>
>>>> pool->cur_threads--;
>>>> @@ -243,13 +257,16 @@ static const AIOCBInfo thread_pool_aiocb_info = {
>>>> .cancel_async = thread_pool_cancel,
>>>> };
>>>>
>>>> -BlockAIOCB *thread_pool_submit_aio(ThreadPoolFunc *func,
>>>> - void *arg, GDestroyNotify arg_destroy,
>>>> - BlockCompletionFunc *cb, void *opaque)
>>>> +BlockAIOCB *thread_pool_submit(ThreadPool *pool, ThreadPoolFunc *func,
>>>> + void *arg, GDestroyNotify arg_destroy,
>>>> + BlockCompletionFunc *cb, void *opaque)
>>>> {
>>>> ThreadPoolElement *req;
>>>> AioContext *ctx = qemu_get_current_aio_context();
>>>> - ThreadPool *pool = aio_get_thread_pool(ctx);
>>>> +
>>>> + if (!pool) {
>>>> + pool = aio_get_thread_pool(ctx);
>>>> + }
>>>
>>> I'd go for a separate implementation to really drive the point that this
>>> new usage is different. See the code snippet below.
>>
>> I see your point there - will split these implementations.
>>
>>> It seems we're a short step away to being able to use this
>>> implementation in a general way. Is there something that can be done
>>> with the 'common' field in the ThreadPoolElement?
>>
>> The non-AIO request flow still needs the completion callback from BlockAIOCB
>> (and its argument pointer) so removing the "common" field from these requests
>> would mean introducing two "flavors" of ThreadPoolElement.
>>
>> Not sure the memory savings here are worth the increase in code complexity.
>
> I'm not asking that of you, but I think it should be done
> eventually. The QEMU block layer is very particular and I wouldn't want
> the use-cases for the thread-pool to get confused. But I can't see a way
> out right now, so let's postpone this, see if anyone else has comments.
I understand.
>>
>>> ========
>>> static void thread_pool_submit_request(ThreadPool *pool, ThreadPoolElement *req)
>>> {
>>> req->state = THREAD_QUEUED;
>>> req->pool = pool;
>>>
>>> QLIST_INSERT_HEAD(&pool->head, req, all);
>>>
>>> trace_thread_pool_submit(pool, req, req->arg);
>>>
>>> qemu_mutex_lock(&pool->lock);
>>> if (pool->idle_threads == 0 && pool->cur_threads < pool->max_threads) {
>>> spawn_thread(pool);
>>> }
>>> QTAILQ_INSERT_TAIL(&pool->request_list, req, reqs);
>>> qemu_mutex_unlock(&pool->lock);
>>> qemu_cond_signal(&pool->request_cond);
>>> }
>>>
>>> BlockAIOCB *thread_pool_submit_aio(ThreadPoolFunc *func, void *arg,
>>> BlockCompletionFunc *cb, void *opaque)
>>> {
>>> ThreadPoolElement *req;
>>> AioContext *ctx = qemu_get_current_aio_context();
>>> ThreadPool *pool = aio_get_thread_pool(ctx);
>>>
>>> /* Assert that the thread submitting work is the same running the pool */
>>> assert(pool->ctx == qemu_get_current_aio_context());
>>>
>>> req = qemu_aio_get(&thread_pool_aiocb_info, NULL, cb, opaque);
>>> req->func = func;
>>> req->arg = arg;
>>>
>>> thread_pool_submit_request(pool, req);
>>> return &req->common;
>>> }
>>>
>>> void thread_pool_submit(ThreadPool *pool, ThreadPoolFunc *func, void *arg)
>>> {
>>> ThreadPoolElement *req;
>>>
>>> req = g_malloc(sizeof(ThreadPoolElement));
>>> req->func = func;
>>> req->arg = arg;
>>>
>>> thread_pool_submit_request(pool, req);
>>> }
>>> =================
>>>
>>>>
>>>> /* Assert that the thread submitting work is the same running the pool */
>>>> assert(pool->ctx == qemu_get_current_aio_context());
>>>> @@ -275,6 +292,18 @@ BlockAIOCB *thread_pool_submit_aio(ThreadPoolFunc *func,
>>>> return &req->common;
>>>> }
>>>>
>>>> +BlockAIOCB *thread_pool_submit_aio(ThreadPoolFunc *func,
>>>> + void *arg, GDestroyNotify arg_destroy,
>>>> + BlockCompletionFunc *cb, void *opaque)
>>>> +{
>>>> + return thread_pool_submit(NULL, func, arg, arg_destroy, cb, opaque);
>>>> +}
>>>> +
>>>> +void thread_pool_poll(ThreadPool *pool)
>>>> +{
>>>> + aio_bh_poll(pool->ctx);
>>>> +}
>>>> +
>>>> typedef struct ThreadPoolCo {
>>>> Coroutine *co;
>>>> int ret;
>>>> @@ -297,18 +326,38 @@ int coroutine_fn thread_pool_submit_co(ThreadPoolFunc *func, void *arg)
>>>> return tpc.ret;
>>>> }
>>>>
>>>> -void thread_pool_submit(ThreadPoolFunc *func,
>>>> - void *arg, GDestroyNotify arg_destroy)
>>>> +void thread_pool_join(ThreadPool *pool)
>>>
>>> This is misleading because it's about the requests, not the threads in
>>> the pool. Compare with what thread_pool_free does:
>>>
>>> /* Wait for worker threads to terminate */
>>> pool->max_threads = 0;
>>> qemu_cond_broadcast(&pool->request_cond);
>>> while (pool->cur_threads > 0) {
>>> qemu_cond_wait(&pool->worker_stopped, &pool->lock);
>>> }
>>>
>>
>> I'm open to thread_pool_join() better naming proposals.
>
> thread_pool_wait() might be better.
Ack.
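For reference, a minimal sketch of what the renamed operation could look
like, assuming the pool grew a pending-request counter and a matching
condition variable (both names below are hypothetical):

/* Wait for all submitted requests (not the worker threads) to finish */
void thread_pool_wait(ThreadPool *pool)
{
    qemu_mutex_lock(&pool->lock);
    while (pool->pending_requests > 0) {
        qemu_cond_wait(&pool->requests_done_cond, &pool->lock);
    }
    qemu_mutex_unlock(&pool->lock);
}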
Thanks,
Maciej
* Re: [PATCH v2 09/17] migration/multifd: Device state transfer support - receive side
2024-09-03 14:42 ` Fabiano Rosas
@ 2024-09-03 18:41 ` Maciej S. Szmigiero
0 siblings, 0 replies; 128+ messages in thread
From: Maciej S. Szmigiero @ 2024-09-03 18:41 UTC (permalink / raw)
To: Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel, Peter Xu
On 3.09.2024 16:42, Fabiano Rosas wrote:
> "Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:
>
>> On 30.08.2024 22:22, Fabiano Rosas wrote:
>>> "Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:
>>>
>>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>>
>>>> Add basic support for receiving device state via multifd channels -
>>>> channels that are shared with RAM transfers.
>>>>
>>>> To differentiate between a device state and a RAM packet the packet
>>>> header is read first.
>>>>
>>>> Depending on whether the MULTIFD_FLAG_DEVICE_STATE flag is present in the
>>>> packet header, either device state (MultiFDPacketDeviceState_t) or RAM
>>>> data (existing MultiFDPacket_t) is then read.
>>>>
>>>> The received device state data is provided to
>>>> the qemu_loadvm_load_state_buffer() function for processing in the
>>>> device's load_state_buffer handler.
>>>>
>>>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>>>> ---
>>>> migration/multifd.c | 127 +++++++++++++++++++++++++++++++++++++-------
>>>> migration/multifd.h | 31 ++++++++++-
>>>> 2 files changed, 138 insertions(+), 20 deletions(-)
>>>>
>>>> diff --git a/migration/multifd.c b/migration/multifd.c
>>>> index b06a9fab500e..d5a8e5a9c9b5 100644
>>>> --- a/migration/multifd.c
>>>> +++ b/migration/multifd.c
(..)
>>>> +
>>>> + ret = qio_channel_read_all_eof(p->c, (char *)pkt_buf, pkt_len,
>>>> + &local_err);
>>>> if (ret == 0 || ret == -1) { /* 0: EOF -1: Error */
>>>> break;
>>>> }
>>>> @@ -1181,8 +1239,33 @@ static void *multifd_recv_thread(void *opaque)
>>>> has_data = !!p->data->size;
>>>> }
>>>>
>>>> - if (has_data) {
>>>> - ret = multifd_recv_state->ops->recv(p, &local_err);
>>>> + if (!is_device_state) {
>>>> + if (has_data) {
>>>> + ret = multifd_recv_state->ops->recv(p, &local_err);
>>>> + if (ret != 0) {
>>>> + break;
>>>> + }
>>>> + }
>>>> + } else {
>>>> + g_autofree char *idstr = NULL;
>>>> + g_autofree char *dev_state_buf = NULL;
>>>> +
>>>> + assert(use_packets);
>>>> +
>>>> + if (p->next_packet_size > 0) {
>>>> + dev_state_buf = g_malloc(p->next_packet_size);
>>>> +
>>>> + ret = qio_channel_read_all(p->c, dev_state_buf, p->next_packet_size, &local_err);
>>>> + if (ret != 0) {
>>>> + break;
>>>> + }
>>>> + }
>>>
>>> What's the use case for !next_packet_size that still calls
>>> load_state_buffer below? I can't see it.
>>
>> Currently, next_packet_size == 0 indeed has no usage - it is
>> a leftover from an early version of the patch set (not public)
>> that had device state packet (chunk) indexing done by
>> the common migration code, rather than by the VFIO consumer.
>>
>> An empty packet could then be used to mark the stream
>> boundary - for example, the max chunk number to expect.
>>
>>> ...because I would suggest to set has_data up there with
>>> p->next_packet_size:
>>>
>>> if (use_packets) {
>>> ...
>>> has_data = p->next_packet_size || p->zero_num;
>>> } else {
>>> ...
>>> has_data = !!p->data_size;
>>> }
>>>
>>> and this whole block would be:
>>>
>>> if (has_data) {
>>> if (is_device_state) {
>>> multifd_device_state_recv(p, &local_err);
>>> } else {
>>> ret = multifd_recv_state->ops->recv(p, &local_err);
>>> }
>>> }
>>
>> The above block makes sense to me with two caveats:
>
> I have suggestions below, but this is no big deal, so feel free to go
> with what you think works best.
>
>> 1) If empty device state packets (next_packet_size == 0) were
>> to be unsupported, they would need to be rejected cleanly rather
>> than silently skipped,
>
> Should this be rejected on the send side? That's the most likely source
> of the problem if it happens. We don't need to send something we know will
> cause an error when loading.
Definitely we should send a correct bit stream :) - my point was about
the case of bit stream corruption, or simply about some future bit stream
format that a QEMU version with this patch set does not yet understand.
> And for the case of stream corruption of some sort we could hoist the
> check from load_buffer into here:
>
> else if (is_device_state) {
>     error_setg(errp, "empty device state packet");
>     break;
> }
Right.
>> 2) has_data has to have its value computed depending on whether
>> this is a RAM or a device state packet, since looking at
>> p->normal_num and p->zero_num makes no sense for a device state
>> packet, while I am not sure that looking at p->next_packet_size
>> for a RAM packet won't introduce some subtle regression.
>
> It should be ok to use next_packet_size for RAM; it must always be in
> sync with normal_num.
Then it should be ok, but I'll look into this more deeply to be sure
when preparing the next patch set version.
Thanks,
Maciej
* Re: [PATCH v2 05/17] thread-pool: Implement non-AIO (generic) pool support
2024-09-03 16:54 ` Maciej S. Szmigiero
@ 2024-09-03 19:04 ` Stefan Hajnoczi
2024-09-09 16:45 ` Peter Xu
0 siblings, 1 reply; 128+ messages in thread
From: Stefan Hajnoczi @ 2024-09-03 19:04 UTC (permalink / raw)
To: Maciej S. Szmigiero
Cc: Peter Xu, Fabiano Rosas, Alex Williamson, Cédric Le Goater,
Eric Blake, Markus Armbruster, Daniel P . Berrangé,
Avihai Horon, Joao Martins, qemu-devel, Paolo Bonzini
On Tue, 3 Sept 2024 at 12:54, Maciej S. Szmigiero
<mail@maciej.szmigiero.name> wrote:
>
> On 3.09.2024 15:55, Stefan Hajnoczi wrote:
> > On Tue, 27 Aug 2024 at 13:58, Maciej S. Szmigiero
> > <mail@maciej.szmigiero.name> wrote:
> >>
> >> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> >>
> >> Migration code wants to manage device data sending threads in one place.
> >>
> >> QEMU has an existing thread pool implementation; however, it was limited
> >> to queuing AIO operations only and essentially had a 1:1 mapping between
> >> the current AioContext and the ThreadPool in use.
> >>
> >> Implement what is necessary to queue generic (non-AIO) work on a ThreadPool
> >> too.
> >>
> >> This brings a few new operations on a pool:
> >> * thread_pool_set_minmax_threads() explicitly sets the minimum and maximum
> >> thread count in the pool.
> >>
> >> * thread_pool_join() operation waits until all the submitted work requests
> >> have finished.
> >>
> >> * thread_pool_poll() lets the new thread and / or thread completion bottom
> >> halves run (if they are indeed scheduled to be run).
> >> It is useful for thread pool users that need to launch or terminate new
> >> threads without returning to the QEMU main loop.
> >
> > Did you consider glib's GThreadPool?
> > https://docs.gtk.org/glib/struct.ThreadPool.html
> >
> > QEMU's thread pool is integrated into the QEMU event loop. If your
> > goal is to bypass the QEMU event loop, then you may as well use the
> > glib API instead.
> >
> > thread_pool_join() and thread_pool_poll() will lead to code that
> > blocks the event loop. QEMU's aio_poll() and nested event loops in
> > general are a source of hangs and re-entrancy bugs. I would prefer not
> > introducing these issues in the QEMU ThreadPool API.
> >
>
> Unfortunately, the problem with the migration code is that it is
> synchronous - it does not return to the main event loop until the
> migration is done.
>
> So the only way to handle things that need a working event loop is to
> pump it manually from inside the migration code.
>
> The reason why I used the QEMU thread pool in the first place in this
> patch set version is because Peter asked me to do so during the review
> of its previous iteration [1].
>
> Peter also asked me previously to move to QEMU synchronization
> primitives from using the Glib ones in the early version of this
> patch set [2].
>
> I personally would rather use something common to many applications,
> well tested and with more pairs of eyes looking at it, rather than
> re-invent things in QEMU.
>
> Looking at GThreadPool, it seems that it lacks the ability to wait until
> all queued work has finished, so this would need to be open-coded
> in the migration code.
>
> @Peter, what's your opinion on using Glib's thread pool instead of
> QEMU's one, considering the above things?
I'll add a bit more about my thinking:
Using QEMU's event-driven model is usually preferred because it makes
integrating with the rest of QEMU easy and avoids having lots of
single-purpose threads that are hard to observe/manage (e.g. through
the QMP monitor).
When there is a genuine need to spawn a thread and write synchronous
code (e.g. a blocking ioctl(2) call or something CPU-intensive), then
it's okay to do that. Use QEMUBH, EventNotifier, or other QEMU APIs to
synchronize between event loop threads and special-purpose synchronous
threads.
I haven't looked at the patch series enough to have an opinion about
whether this use case needs a special-purpose thread or not. I am
assuming it really needs to be a special-purpose thread. Peter and you
could discuss that further if you want.
I agree with Peter's request to use QEMU's synchronization primitives.
They do not depend on the event loop so they can be used outside the
event loop.
The issue I'm raising with this patch is that adding new join()/poll()
APIs that shouldn't be called from the event loop is bug-prone. It
will make the QEMU ThreadPool code harder to understand and maintain
because now there are two different contexts where different subsets
of this API can be used and mixing them leads to problems. To me the
non-event loop case is beyond the scope of QEMU's ThreadPool. I have
CCed Paolo, who wrote the thread pool in its current form, in case he
wants to participate in the discussion.
Using glib's ThreadPool solves the issue while still reusing an
existing thread pool implementation. Waiting for all work to complete
can be done using QemuSemaphore.
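Something along these lines (an untested sketch - WorkItem, worker_fn and
the submission-side variables are all made-up names):

typedef struct {
    QemuSemaphore *done_sem;
    /* ... the actual work parameters ... */
} WorkItem;

static void worker_fn(gpointer data, gpointer user_data)
{
    WorkItem *item = data;

    /* ... perform the synchronous work here ... */

    qemu_sem_post(item->done_sem);
}

    /* submission side */
    GThreadPool *pool = g_thread_pool_new(worker_fn, NULL, max_threads,
                                          FALSE, NULL);
    QemuSemaphore done_sem;
    int i;

    qemu_sem_init(&done_sem, 0);
    for (i = 0; i < n_items; i++) {
        items[i].done_sem = &done_sem;
        g_thread_pool_push(pool, &items[i], NULL);
    }

    /* wait until every queued work item has completed */
    for (i = 0; i < n_items; i++) {
        qemu_sem_wait(&done_sem);
    }
    g_thread_pool_free(pool, FALSE, TRUE);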
Thanks,
Stefan
> Thanks,
> Maciej
>
> [1]: https://lore.kernel.org/qemu-devel/ZniFH14DT6ycjbrL@x1n/ point 5: "Worker thread model"
> [2]: https://lore.kernel.org/qemu-devel/Zi_9SyJy__8wJTou@x1n/
>
* Re: [PATCH v2 10/17] migration/multifd: Convert multifd_send()::next_channel to atomic
2024-09-03 15:01 ` Fabiano Rosas
@ 2024-09-03 20:04 ` Maciej S. Szmigiero
0 siblings, 0 replies; 128+ messages in thread
From: Maciej S. Szmigiero @ 2024-09-03 20:04 UTC (permalink / raw)
To: Fabiano Rosas
Cc: Alex Williamson, Peter Xu, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On 3.09.2024 17:01, Fabiano Rosas wrote:
> "Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:
>
>> On 30.08.2024 20:13, Fabiano Rosas wrote:
>>> "Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:
>>>
>>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>>
>>>> This is necessary for multifd_send() to be able to be called
>>>> from multiple threads.
>>>>
>>>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>>>> ---
>>>> migration/multifd.c | 24 ++++++++++++++++++------
>>>> 1 file changed, 18 insertions(+), 6 deletions(-)
>>>>
>>>> diff --git a/migration/multifd.c b/migration/multifd.c
>>>> index d5a8e5a9c9b5..b25789dde0b3 100644
>>>> --- a/migration/multifd.c
>>>> +++ b/migration/multifd.c
(..)
>>>> +
>>>> + /* We wait here, until at least one channel is ready */
>>>> + qemu_sem_wait(&multifd_send_state->channels_ready);
>>>> +
>>>> + while (true) {
>>>> + int i_next;
>>>> +
>>>> if (multifd_send_should_exit()) {
>>>> return false;
>>>> }
>>>> +
>>>> + i = qatomic_load_acquire(&next_channel);
>>>> + i_next = (i + 1) % migrate_multifd_channels();
>>>> + if (qatomic_cmpxchg(&next_channel, i, i_next) != i) {
>>>> + continue;
>>>> + }
>>>
>>> Say channel 'i' is the only one that's idle. What's stopping the other
>>> thread(s) from racing at this point and looping around to the same index?
>>
>> See the reply below.
>>
>>>> +
>>>> p = &multifd_send_state->params[i];
>>>> /*
>>>> * Lockless read to p->pending_job is safe, because only multifd
>>>> * sender thread can clear it.
>>>> */
>>>> if (qatomic_read(&p->pending_job) == false) {
>>>
>>> With the cmpxchg your other patch adds here, the race I mentioned
>>> above should be harmless. But we'd need to bring that code into this
>>> patch.
>>>
>>
>> You're right - the sender code with this patch alone isn't thread safe
>> yet, but this commit is literally only about "converting
>> multifd_send()::next_channel to atomic".
>>
>> At the time of this patch there aren't any multifd_send() calls from
>> multiple threads, and the commit that introduces such possible call
>> site (multifd_queue_device_state()) also modifies multifd_send()
>> to be fully thread safe by introducing p->pending_job_preparing.
>
> In general this would be a bad practice because this commit can end up
> being moved around due to backporting or bisecting. It would be better
> if it were complete from the start. It also affects backporting due to
> ambiguity on where the Fixes tag should point to if someone eventually
> finds a bug.
>
> I already asked you to extract the other code into a separate patch, so
> it's not that bad. If you prefer to keep both changes separate for
> clarity, please note on the commit message that the next patch is
> necessary for correctness.
>
If someone picks parts of a patch set or reorders commits then, indeed,
things can break in many cases.
But it looks like I will be able to move the code changes around so that
multifd_send() is already thread safe by the time of this commit, so I
will do that.
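(For completeness, the thread-safe claim boils down to something like the
sketch below - atomically claiming the channel rather than just reading
its state; this is only an illustration of the idea, the actual patch
uses a separate pending_job_preparing flag:)

    p = &multifd_send_state->params[i];
    /* Atomically claim the channel; if it is already busy, try the next one */
    if (qatomic_cmpxchg(&p->pending_job, false, true) != false) {
        continue;
    }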
Thanks,
Maciej
* Re: [PATCH v2 01/17] vfio/migration: Add save_{iterate,complete_precopy}_started trace events
2024-08-27 17:54 ` [PATCH v2 01/17] vfio/migration: Add save_{iterate, complete_precopy}_started trace events Maciej S. Szmigiero
@ 2024-09-05 13:08 ` Avihai Horon
2024-09-09 18:04 ` Maciej S. Szmigiero
0 siblings, 1 reply; 128+ messages in thread
From: Avihai Horon @ 2024-09-05 13:08 UTC (permalink / raw)
To: Maciej S. Szmigiero, Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Joao Martins,
qemu-devel
Hi Maciej,
On 27/08/2024 20:54, Maciej S. Szmigiero wrote:
>
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>
> This way both the start and end points of migrating a particular VFIO
> device are known.
>
> Also add a vfio_save_iterate_empty_hit trace event so it is known when
> there's no more data to send for that device.
Out of curiosity, what are these traces used for?
>
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> ---
> hw/vfio/migration.c | 13 +++++++++++++
> hw/vfio/trace-events | 3 +++
> include/hw/vfio/vfio-common.h | 3 +++
> 3 files changed, 19 insertions(+)
>
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 262d42a46e58..24679d8c5034 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -472,6 +472,9 @@ static int vfio_save_setup(QEMUFile *f, void *opaque, Error **errp)
> return -ENOMEM;
> }
>
> + migration->save_iterate_run = false;
> + migration->save_iterate_empty_hit = false;
> +
> if (vfio_precopy_supported(vbasedev)) {
> switch (migration->device_state) {
> case VFIO_DEVICE_STATE_RUNNING:
> @@ -605,9 +608,17 @@ static int vfio_save_iterate(QEMUFile *f, void *opaque)
> VFIOMigration *migration = vbasedev->migration;
> ssize_t data_size;
>
> + if (!migration->save_iterate_run) {
> + trace_vfio_save_iterate_started(vbasedev->name);
> + migration->save_iterate_run = true;
Maybe rename save_iterate_run to save_iterate_started so it's aligned
with trace_vfio_save_iterate_started and
trace_vfio_save_complete_precopy_started?
> + }
> +
> data_size = vfio_save_block(f, migration);
> if (data_size < 0) {
> return data_size;
> + } else if (data_size == 0 && !migration->save_iterate_empty_hit) {
> + trace_vfio_save_iterate_empty_hit(vbasedev->name);
> + migration->save_iterate_empty_hit = true;
During precopy we could hit empty multiple times. Any reason why only
the first time should be traced?
> }
>
> vfio_update_estimated_pending_data(migration, data_size);
> @@ -633,6 +644,8 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
> int ret;
> Error *local_err = NULL;
>
> + trace_vfio_save_complete_precopy_started(vbasedev->name);
> +
> /* We reach here with device state STOP or STOP_COPY only */
> ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP_COPY,
> VFIO_DEVICE_STATE_STOP, &local_err);
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index 98bd4dcceadc..013c602f30fa 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -159,8 +159,11 @@ vfio_migration_state_notifier(const char *name, int state) " (%s) state %d"
> vfio_save_block(const char *name, int data_size) " (%s) data_size %d"
> vfio_save_cleanup(const char *name) " (%s)"
> vfio_save_complete_precopy(const char *name, int ret) " (%s) ret %d"
> +vfio_save_complete_precopy_started(const char *name) " (%s)"
> vfio_save_device_config_state(const char *name) " (%s)"
> vfio_save_iterate(const char *name, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64
> +vfio_save_iterate_started(const char *name) " (%s)"
> +vfio_save_iterate_empty_hit(const char *name) " (%s)"
Let's keep it sorted in alphabetical order.
Thanks.
> vfio_save_setup(const char *name, uint64_t data_buffer_size) " (%s) data buffer size 0x%"PRIx64
> vfio_state_pending_estimate(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64
> vfio_state_pending_exact(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t stopcopy_size, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" stopcopy size 0x%"PRIx64" precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index fed499b199f0..32d58e3e025b 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -73,6 +73,9 @@ typedef struct VFIOMigration {
> uint64_t precopy_init_size;
> uint64_t precopy_dirty_size;
> bool initial_data_sent;
> +
> + bool save_iterate_run;
> + bool save_iterate_empty_hit;
> } VFIOMigration;
>
> struct VFIOGroup;
* Re: [PATCH v2 06/17] migration: Add save_live_complete_precopy_{begin,end} handlers
2024-08-27 17:54 ` [PATCH v2 06/17] migration: Add save_live_complete_precopy_{begin, end} handlers Maciej S. Szmigiero
2024-08-28 19:03 ` [PATCH v2 06/17] migration: Add save_live_complete_precopy_{begin,end} handlers Fabiano Rosas
@ 2024-09-05 13:45 ` Avihai Horon
2024-09-09 17:59 ` Peter Xu
2024-09-09 18:05 ` Maciej S. Szmigiero
1 sibling, 2 replies; 128+ messages in thread
From: Avihai Horon @ 2024-09-05 13:45 UTC (permalink / raw)
To: Maciej S. Szmigiero, Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Joao Martins,
qemu-devel
On 27/08/2024 20:54, Maciej S. Szmigiero wrote:
>
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>
> These SaveVMHandlers help a device provide its own asynchronous
> transmission of the remaining data at the end of a precopy phase.
>
> In this use case the save_live_complete_precopy_begin handler might
> be used to mark the stream boundary before proceeding with asynchronous
> transmission of the remaining data, while the
> save_live_complete_precopy_end handler might be used to mark the
> stream boundary after performing the asynchronous transmission.
>
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> ---
> include/migration/register.h | 36 ++++++++++++++++++++++++++++++++++++
> migration/savevm.c | 35 +++++++++++++++++++++++++++++++++++
> 2 files changed, 71 insertions(+)
>
> diff --git a/include/migration/register.h b/include/migration/register.h
> index f60e797894e5..9de123252edf 100644
> --- a/include/migration/register.h
> +++ b/include/migration/register.h
> @@ -103,6 +103,42 @@ typedef struct SaveVMHandlers {
> */
> int (*save_live_complete_precopy)(QEMUFile *f, void *opaque);
>
> + /**
> + * @save_live_complete_precopy_begin
> + *
> + * Called at the end of a precopy phase, before all
> + * @save_live_complete_precopy handlers and before launching
> + * all @save_live_complete_precopy_thread threads.
> + * The handler might, for example, mark the stream boundary before
> + * proceeding with asynchronous transmission of the remaining data via
> + * @save_live_complete_precopy_thread.
> + * When postcopy is enabled, devices that support postcopy will skip this step.
> + *
> + * @f: QEMUFile where the handler can synchronously send data before returning
> + * @idstr: this device section idstr
> + * @instance_id: this device section instance_id
> + * @opaque: data pointer passed to register_savevm_live()
> + *
> + * Returns zero to indicate success and negative for error
> + */
> + int (*save_live_complete_precopy_begin)(QEMUFile *f,
> + char *idstr, uint32_t instance_id,
> + void *opaque);
> + /**
> + * @save_live_complete_precopy_end
> + *
> + * Called at the end of a precopy phase, after @save_live_complete_precopy
> + * handlers and after all @save_live_complete_precopy_thread threads have
> + * finished. When postcopy is enabled, devices that support postcopy will
> + * skip this step.
> + *
> + * @f: QEMUFile where the handler can synchronously send data before returning
> + * @opaque: data pointer passed to register_savevm_live()
> + *
> + * Returns zero to indicate success and negative for error
> + */
> + int (*save_live_complete_precopy_end)(QEMUFile *f, void *opaque);
Is this handler necessary now that the migration core is responsible for
the threads and joins them? I don't see VFIO implementing it later on.
Thanks.
> +
> /* This runs both outside and inside the BQL. */
>
> /**
> diff --git a/migration/savevm.c b/migration/savevm.c
> index 6bb404b9c86f..d43acbbf20cf 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -1496,6 +1496,27 @@ int qemu_savevm_state_complete_precopy_iterable(QEMUFile *f, bool in_postcopy)
> SaveStateEntry *se;
> int ret;
>
> + QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> + if (!se->ops || (in_postcopy && se->ops->has_postcopy &&
> + se->ops->has_postcopy(se->opaque)) ||
> + !se->ops->save_live_complete_precopy_begin) {
> + continue;
> + }
> +
> + save_section_header(f, se, QEMU_VM_SECTION_END);
> +
> + ret = se->ops->save_live_complete_precopy_begin(f,
> + se->idstr, se->instance_id,
> + se->opaque);
> +
> + save_section_footer(f, se);
> +
> + if (ret < 0) {
> + qemu_file_set_error(f, ret);
> + return -1;
> + }
> + }
> +
> QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> if (!se->ops ||
> (in_postcopy && se->ops->has_postcopy &&
> @@ -1527,6 +1548,20 @@ int qemu_savevm_state_complete_precopy_iterable(QEMUFile *f, bool in_postcopy)
> end_ts_each - start_ts_each);
> }
>
> + QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> + if (!se->ops || (in_postcopy && se->ops->has_postcopy &&
> + se->ops->has_postcopy(se->opaque)) ||
> + !se->ops->save_live_complete_precopy_end) {
> + continue;
> + }
> +
> + ret = se->ops->save_live_complete_precopy_end(f, se->opaque);
> + if (ret < 0) {
> + qemu_file_set_error(f, ret);
> + return -1;
> + }
> + }
> +
> trace_vmstate_downtime_checkpoint("src-iterable-saved");
>
> return 0;
* Re: [PATCH v2 07/17] migration: Add qemu_loadvm_load_state_buffer() and its handler
2024-08-27 17:54 ` [PATCH v2 07/17] migration: Add qemu_loadvm_load_state_buffer() and its handler Maciej S. Szmigiero
2024-08-30 19:05 ` Fabiano Rosas
@ 2024-09-05 14:15 ` Avihai Horon
2024-09-09 18:05 ` Maciej S. Szmigiero
1 sibling, 1 reply; 128+ messages in thread
From: Avihai Horon @ 2024-09-05 14:15 UTC (permalink / raw)
To: Maciej S. Szmigiero, Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Joao Martins,
qemu-devel
On 27/08/2024 20:54, Maciej S. Szmigiero wrote:
>
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>
> qemu_loadvm_load_state_buffer() and its load_state_buffer
> SaveVMHandler allow providing a device state buffer to an explicitly
> specified device via its idstr and instance id.
>
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> ---
> include/migration/register.h | 15 +++++++++++++++
> migration/savevm.c | 25 +++++++++++++++++++++++++
> migration/savevm.h | 3 +++
> 3 files changed, 43 insertions(+)
>
> diff --git a/include/migration/register.h b/include/migration/register.h
> index 9de123252edf..4a578f140713 100644
> --- a/include/migration/register.h
> +++ b/include/migration/register.h
> @@ -263,6 +263,21 @@ typedef struct SaveVMHandlers {
> */
> int (*load_state)(QEMUFile *f, void *opaque, int version_id);
>
> + /**
> + * @load_state_buffer
> + *
> + * Load device state buffer provided to qemu_loadvm_load_state_buffer().
> + *
> + * @opaque: data pointer passed to register_savevm_live()
> + * @data: the data buffer to load
> + * @data_size: the data length in buffer
> + * @errp: pointer to Error*, to store an error if it happens.
> + *
> + * Returns zero to indicate success and negative for error
> + */
> + int (*load_state_buffer)(void *opaque, char *data, size_t data_size,
> + Error **errp);
Nit: Maybe rename data to buf and data_size to len to be consistent with
qemu_loadvm_load_state_buffer()?
> +
> /**
> * @load_setup
> *
> diff --git a/migration/savevm.c b/migration/savevm.c
> index d43acbbf20cf..3fde5ca8c26b 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -3101,6 +3101,31 @@ int qemu_loadvm_approve_switchover(void)
> return migrate_send_rp_switchover_ack(mis);
> }
>
> +int qemu_loadvm_load_state_buffer(const char *idstr, uint32_t instance_id,
> + char *buf, size_t len, Error **errp)
> +{
> + SaveStateEntry *se;
> +
> + se = find_se(idstr, instance_id);
> + if (!se) {
> + error_setg(errp, "Unknown idstr %s or instance id %u for load state buffer",
> + idstr, instance_id);
> + return -1;
> + }
> +
> + if (!se->ops || !se->ops->load_state_buffer) {
> + error_setg(errp, "idstr %s / instance %u has no load state buffer operation",
> + idstr, instance_id);
> + return -1;
> + }
> +
> + if (se->ops->load_state_buffer(se->opaque, buf, len, errp) != 0) {
> + return -1;
> + }
> +
> + return 0;
Nit: this can be simplified to:
return se->ops->load_state_buffer(se->opaque, buf, len, errp);
Thanks.
> +}
> +
> bool save_snapshot(const char *name, bool overwrite, const char *vmstate,
> bool has_devices, strList *devices, Error **errp)
> {
> diff --git a/migration/savevm.h b/migration/savevm.h
> index 9ec96a995c93..d388f1bfca98 100644
> --- a/migration/savevm.h
> +++ b/migration/savevm.h
> @@ -70,4 +70,7 @@ int qemu_loadvm_approve_switchover(void);
> int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
> bool in_postcopy, bool inactivate_disks);
>
> +int qemu_loadvm_load_state_buffer(const char *idstr, uint32_t instance_id,
> + char *buf, size_t len, Error **errp);
> +
> #endif
* Re: [PATCH v2 08/17] migration: Add load_finish handler and associated functions
2024-08-27 17:54 ` [PATCH v2 08/17] migration: Add load_finish handler and associated functions Maciej S. Szmigiero
2024-08-30 19:28 ` Fabiano Rosas
@ 2024-09-05 15:13 ` Avihai Horon
2024-09-09 18:05 ` Maciej S. Szmigiero
2024-09-09 20:03 ` Peter Xu
2 siblings, 1 reply; 128+ messages in thread
From: Avihai Horon @ 2024-09-05 15:13 UTC (permalink / raw)
To: Maciej S. Szmigiero, Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Joao Martins,
qemu-devel
On 27/08/2024 20:54, Maciej S. Szmigiero wrote:
>
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>
> The load_finish SaveVMHandler allows migration code to poll whether
> a device-specific asynchronous device state loading operation has finished.
>
> In order to avoid calling this handler needlessly, the device is supposed
> to notify the migration code of its possible readiness via a call to
> qemu_loadvm_load_finish_ready_broadcast() while holding
> qemu_loadvm_load_finish_ready_lock.
>
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> ---
> include/migration/register.h | 21 +++++++++++++++
> migration/migration.c | 6 +++++
> migration/migration.h | 3 +++
> migration/savevm.c | 52 ++++++++++++++++++++++++++++++++++++
> migration/savevm.h | 4 +++
> 5 files changed, 86 insertions(+)
>
> diff --git a/include/migration/register.h b/include/migration/register.h
> index 4a578f140713..44d8cf5192ae 100644
> --- a/include/migration/register.h
> +++ b/include/migration/register.h
> @@ -278,6 +278,27 @@ typedef struct SaveVMHandlers {
> int (*load_state_buffer)(void *opaque, char *data, size_t data_size,
> Error **errp);
>
> + /**
> + * @load_finish
> + *
> + * Poll whether all asynchronous device state loading had finished.
> + * Not called on the load failure path.
> + *
> + * Called while holding the qemu_loadvm_load_finish_ready_lock.
> + *
> + * If this method signals "not ready" then it might not be called
> + * again until qemu_loadvm_load_finish_ready_broadcast() is invoked
> + * while holding qemu_loadvm_load_finish_ready_lock.
> + *
> + * @opaque: data pointer passed to register_savevm_live()
> + * @is_finished: whether the loading had finished (output parameter)
> + * @errp: pointer to Error*, to store an error if it happens.
> + *
> + * Returns zero to indicate success and negative for error
> + * It's not an error that the loading still hasn't finished.
> + */
> + int (*load_finish)(void *opaque, bool *is_finished, Error **errp);
> +
> /**
> * @load_setup
> *
> diff --git a/migration/migration.c b/migration/migration.c
> index 3dea06d57732..d61e7b055e07 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -259,6 +259,9 @@ void migration_object_init(void)
>
> current_incoming->exit_on_error = INMIGRATE_DEFAULT_EXIT_ON_ERROR;
>
> + qemu_mutex_init(¤t_incoming->load_finish_ready_mutex);
> + qemu_cond_init(¤t_incoming->load_finish_ready_cond);
> +
> migration_object_check(current_migration, &error_fatal);
>
> ram_mig_init();
> @@ -410,6 +413,9 @@ void migration_incoming_state_destroy(void)
> mis->postcopy_qemufile_dst = NULL;
> }
>
> + qemu_mutex_destroy(&mis->load_finish_ready_mutex);
> + qemu_cond_destroy(&mis->load_finish_ready_cond);
> +
> yank_unregister_instance(MIGRATION_YANK_INSTANCE);
> }
>
> diff --git a/migration/migration.h b/migration/migration.h
> index 38aa1402d516..4e2443e6c8ec 100644
> --- a/migration/migration.h
> +++ b/migration/migration.h
> @@ -230,6 +230,9 @@ struct MigrationIncomingState {
>
> /* Do exit on incoming migration failure */
> bool exit_on_error;
> +
> + QemuCond load_finish_ready_cond;
> + QemuMutex load_finish_ready_mutex;
> };
>
> MigrationIncomingState *migration_incoming_get_current(void);
> diff --git a/migration/savevm.c b/migration/savevm.c
> index 3fde5ca8c26b..33c9200d1e78 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -3022,6 +3022,37 @@ int qemu_loadvm_state(QEMUFile *f)
> return ret;
> }
>
> + qemu_loadvm_load_finish_ready_lock();
> + while (!ret) { /* Don't call load_finish() handlers on the load failure path */
> + bool all_ready = true;
Nit: Maybe rename all_ready to all_finished to be consistent with
load_finish() terminology? Same for this_ready.
> + SaveStateEntry *se = NULL;
> +
> + QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> + bool this_ready;
> +
> + if (!se->ops || !se->ops->load_finish) {
> + continue;
> + }
> +
> + ret = se->ops->load_finish(se->opaque, &this_ready, &local_err);
> + if (ret) {
> + error_report_err(local_err);
> +
> + qemu_loadvm_load_finish_ready_unlock();
> + return -EINVAL;
> + } else if (!this_ready) {
> + all_ready = false;
> + }
> + }
> +
> + if (all_ready) {
> + break;
> + }
> +
> + qemu_cond_wait(&mis->load_finish_ready_cond, &mis->load_finish_ready_mutex);
> + }
> + qemu_loadvm_load_finish_ready_unlock();
> +
> if (ret == 0) {
> ret = qemu_file_get_error(f);
> }
> @@ -3126,6 +3157,27 @@ int qemu_loadvm_load_state_buffer(const char *idstr, uint32_t instance_id,
> return 0;
> }
>
> +void qemu_loadvm_load_finish_ready_lock(void)
> +{
> + MigrationIncomingState *mis = migration_incoming_get_current();
> +
> + qemu_mutex_lock(&mis->load_finish_ready_mutex);
> +}
> +
> +void qemu_loadvm_load_finish_ready_unlock(void)
> +{
> + MigrationIncomingState *mis = migration_incoming_get_current();
> +
> + qemu_mutex_unlock(&mis->load_finish_ready_mutex);
> +}
> +
> +void qemu_loadvm_load_finish_ready_broadcast(void)
> +{
> + MigrationIncomingState *mis = migration_incoming_get_current();
> +
> + qemu_cond_broadcast(&mis->load_finish_ready_cond);
Do we need a broadcast? Isn't a signal enough, as we only have one waiter
thread?
Thanks.
> +}
> +
> bool save_snapshot(const char *name, bool overwrite, const char *vmstate,
> bool has_devices, strList *devices, Error **errp)
> {
> diff --git a/migration/savevm.h b/migration/savevm.h
> index d388f1bfca98..69ae22cded7a 100644
> --- a/migration/savevm.h
> +++ b/migration/savevm.h
> @@ -73,4 +73,8 @@ int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
> int qemu_loadvm_load_state_buffer(const char *idstr, uint32_t instance_id,
> char *buf, size_t len, Error **errp);
>
> +void qemu_loadvm_load_finish_ready_lock(void);
> +void qemu_loadvm_load_finish_ready_unlock(void);
> +void qemu_loadvm_load_finish_ready_broadcast(void);
> +
> #endif
* Re: [PATCH v2 09/17] migration/multifd: Device state transfer support - receive side
2024-08-27 17:54 ` [PATCH v2 09/17] migration/multifd: Device state transfer support - receive side Maciej S. Szmigiero
2024-08-30 20:22 ` Fabiano Rosas
@ 2024-09-05 16:47 ` Avihai Horon
2024-09-09 18:05 ` Maciej S. Szmigiero
1 sibling, 1 reply; 128+ messages in thread
From: Avihai Horon @ 2024-09-05 16:47 UTC (permalink / raw)
To: Maciej S. Szmigiero, Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Joao Martins,
qemu-devel
On 27/08/2024 20:54, Maciej S. Szmigiero wrote:
>
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>
> Add basic support for receiving device state via multifd channels -
> channels that are shared with RAM transfers.
>
> To differentiate between a device state and a RAM packet the packet
> header is read first.
>
> Depending on whether the MULTIFD_FLAG_DEVICE_STATE flag is present in the
> packet header, either device state (MultiFDPacketDeviceState_t) or RAM
> data (existing MultiFDPacket_t) is then read.
>
> The received device state data is provided to
> the qemu_loadvm_load_state_buffer() function for processing in the
> device's load_state_buffer handler.
>
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> ---
> migration/multifd.c | 127 +++++++++++++++++++++++++++++++++++++-------
> migration/multifd.h | 31 ++++++++++-
> 2 files changed, 138 insertions(+), 20 deletions(-)
>
> diff --git a/migration/multifd.c b/migration/multifd.c
> index b06a9fab500e..d5a8e5a9c9b5 100644
> --- a/migration/multifd.c
> +++ b/migration/multifd.c
> @@ -21,6 +21,7 @@
> #include "file.h"
> #include "migration.h"
> #include "migration-stats.h"
> +#include "savevm.h"
> #include "socket.h"
> #include "tls.h"
> #include "qemu-file.h"
> @@ -209,10 +210,10 @@ void multifd_send_fill_packet(MultiFDSendParams *p)
>
> memset(packet, 0, p->packet_len);
>
> - packet->magic = cpu_to_be32(MULTIFD_MAGIC);
> - packet->version = cpu_to_be32(MULTIFD_VERSION);
> + packet->hdr.magic = cpu_to_be32(MULTIFD_MAGIC);
> + packet->hdr.version = cpu_to_be32(MULTIFD_VERSION);
>
> - packet->flags = cpu_to_be32(p->flags);
> + packet->hdr.flags = cpu_to_be32(p->flags);
> packet->next_packet_size = cpu_to_be32(p->next_packet_size);
>
> packet_num = qatomic_fetch_inc(&multifd_send_state->packet_num);
> @@ -228,31 +229,49 @@ void multifd_send_fill_packet(MultiFDSendParams *p)
> p->flags, p->next_packet_size);
> }
>
> -static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error **errp)
> +static int multifd_recv_unfill_packet_header(MultiFDRecvParams *p,
> + MultiFDPacketHdr_t *hdr,
> + Error **errp)
> {
> - MultiFDPacket_t *packet = p->packet;
> - int ret = 0;
> -
> - packet->magic = be32_to_cpu(packet->magic);
> - if (packet->magic != MULTIFD_MAGIC) {
> + hdr->magic = be32_to_cpu(hdr->magic);
> + if (hdr->magic != MULTIFD_MAGIC) {
> error_setg(errp, "multifd: received packet "
> "magic %x and expected magic %x",
> - packet->magic, MULTIFD_MAGIC);
> + hdr->magic, MULTIFD_MAGIC);
> return -1;
> }
>
> - packet->version = be32_to_cpu(packet->version);
> - if (packet->version != MULTIFD_VERSION) {
> + hdr->version = be32_to_cpu(hdr->version);
> + if (hdr->version != MULTIFD_VERSION) {
> error_setg(errp, "multifd: received packet "
> "version %u and expected version %u",
> - packet->version, MULTIFD_VERSION);
> + hdr->version, MULTIFD_VERSION);
> return -1;
> }
>
> - p->flags = be32_to_cpu(packet->flags);
> + p->flags = be32_to_cpu(hdr->flags);
> +
> + return 0;
> +}
> +
> +static int multifd_recv_unfill_packet_device_state(MultiFDRecvParams *p,
> + Error **errp)
> +{
> + MultiFDPacketDeviceState_t *packet = p->packet_dev_state;
> +
> + packet->instance_id = be32_to_cpu(packet->instance_id);
> + p->next_packet_size = be32_to_cpu(packet->next_packet_size);
> +
> + return 0;
> +}
> +
> +static int multifd_recv_unfill_packet_ram(MultiFDRecvParams *p, Error **errp)
> +{
> + MultiFDPacket_t *packet = p->packet;
> + int ret = 0;
> +
> p->next_packet_size = be32_to_cpu(packet->next_packet_size);
> p->packet_num = be64_to_cpu(packet->packet_num);
> - p->packets_recved++;
>
> if (!(p->flags & MULTIFD_FLAG_SYNC)) {
> ret = multifd_ram_unfill_packet(p, errp);
> @@ -264,6 +283,19 @@ static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error **errp)
> return ret;
> }
>
> +static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error **errp)
> +{
> + p->packets_recved++;
> +
> + if (p->flags & MULTIFD_FLAG_DEVICE_STATE) {
> + return multifd_recv_unfill_packet_device_state(p, errp);
> + } else {
> + return multifd_recv_unfill_packet_ram(p, errp);
> + }
> +
> + g_assert_not_reached();
We can drop the assert and the "else":
if (p->flags & MULTIFD_FLAG_DEVICE_STATE) {
return multifd_recv_unfill_packet_device_state(p, errp);
}
return multifd_recv_unfill_packet_ram(p, errp);
> +}
> +
> static bool multifd_send_should_exit(void)
> {
> return qatomic_read(&multifd_send_state->exiting);
> @@ -1014,6 +1046,7 @@ static void multifd_recv_cleanup_channel(MultiFDRecvParams *p)
> p->packet_len = 0;
> g_free(p->packet);
> p->packet = NULL;
> + g_clear_pointer(&p->packet_dev_state, g_free);
> g_free(p->normal);
> p->normal = NULL;
> g_free(p->zero);
> @@ -1126,8 +1159,13 @@ static void *multifd_recv_thread(void *opaque)
> rcu_register_thread();
>
> while (true) {
> + MultiFDPacketHdr_t hdr;
> uint32_t flags = 0;
> + bool is_device_state = false;
> bool has_data = false;
> + uint8_t *pkt_buf;
> + size_t pkt_len;
> +
> p->normal_num = 0;
>
> if (use_packets) {
> @@ -1135,8 +1173,28 @@ static void *multifd_recv_thread(void *opaque)
> break;
> }
>
> - ret = qio_channel_read_all_eof(p->c, (void *)p->packet,
> - p->packet_len, &local_err);
> + ret = qio_channel_read_all_eof(p->c, (void *)&hdr,
> + sizeof(hdr), &local_err);
> + if (ret == 0 || ret == -1) { /* 0: EOF -1: Error */
> + break;
> + }
> +
> + ret = multifd_recv_unfill_packet_header(p, &hdr, &local_err);
> + if (ret) {
> + break;
> + }
> +
> + is_device_state = p->flags & MULTIFD_FLAG_DEVICE_STATE;
> + if (is_device_state) {
> + pkt_buf = (uint8_t *)p->packet_dev_state + sizeof(hdr);
> + pkt_len = sizeof(*p->packet_dev_state) - sizeof(hdr);
> + } else {
> + pkt_buf = (uint8_t *)p->packet + sizeof(hdr);
> + pkt_len = p->packet_len - sizeof(hdr);
> + }
> +
> + ret = qio_channel_read_all_eof(p->c, (char *)pkt_buf, pkt_len,
> + &local_err);
> if (ret == 0 || ret == -1) { /* 0: EOF -1: Error */
> break;
> }
> @@ -1181,8 +1239,33 @@ static void *multifd_recv_thread(void *opaque)
> has_data = !!p->data->size;
> }
>
> - if (has_data) {
> - ret = multifd_recv_state->ops->recv(p, &local_err);
> + if (!is_device_state) {
> + if (has_data) {
> + ret = multifd_recv_state->ops->recv(p, &local_err);
> + if (ret != 0) {
> + break;
> + }
> + }
> + } else {
> + g_autofree char *idstr = NULL;
> + g_autofree char *dev_state_buf = NULL;
> +
> + assert(use_packets);
> +
> + if (p->next_packet_size > 0) {
> + dev_state_buf = g_malloc(p->next_packet_size);
> +
> + ret = qio_channel_read_all(p->c, dev_state_buf, p->next_packet_size, &local_err);
> + if (ret != 0) {
> + break;
> + }
> + }
> +
> + idstr = g_strndup(p->packet_dev_state->idstr, sizeof(p->packet_dev_state->idstr));
> + ret = qemu_loadvm_load_state_buffer(idstr,
> + p->packet_dev_state->instance_id,
> + dev_state_buf, p->next_packet_size,
> + &local_err);
> if (ret != 0) {
> break;
> }
> @@ -1190,6 +1273,11 @@ static void *multifd_recv_thread(void *opaque)
>
> if (use_packets) {
> if (flags & MULTIFD_FLAG_SYNC) {
> + if (is_device_state) {
> + error_setg(&local_err, "multifd: received SYNC device state packet");
> + break;
> + }
> +
> qemu_sem_post(&multifd_recv_state->sem_sync);
> qemu_sem_wait(&p->sem_sync);
> }
> @@ -1258,6 +1346,7 @@ int multifd_recv_setup(Error **errp)
> p->packet_len = sizeof(MultiFDPacket_t)
> + sizeof(uint64_t) * page_count;
> p->packet = g_malloc0(p->packet_len);
> + p->packet_dev_state = g_malloc0(sizeof(*p->packet_dev_state));
> }
> p->name = g_strdup_printf("mig/dst/recv_%d", i);
> p->normal = g_new0(ram_addr_t, page_count);
> diff --git a/migration/multifd.h b/migration/multifd.h
> index a3e35196d179..a8f3e4838c01 100644
> --- a/migration/multifd.h
> +++ b/migration/multifd.h
> @@ -45,6 +45,12 @@ MultiFDRecvData *multifd_get_recv_data(void);
> #define MULTIFD_FLAG_QPL (4 << 1)
> #define MULTIFD_FLAG_UADK (8 << 1)
>
> +/*
> + * If set it means that this packet contains device state
> + * (MultiFDPacketDeviceState_t), not RAM data (MultiFDPacket_t).
> + */
> +#define MULTIFD_FLAG_DEVICE_STATE (1 << 4)
> +
> /* This value needs to be a multiple of qemu_target_page_size() */
> #define MULTIFD_PACKET_SIZE (512 * 1024)
>
> @@ -52,6 +58,11 @@ typedef struct {
> uint32_t magic;
> uint32_t version;
> uint32_t flags;
> +} __attribute__((packed)) MultiFDPacketHdr_t;
Maybe split this patch into two: one that adds the packet header concept
and another that adds the new device packet?
> +
> +typedef struct {
> + MultiFDPacketHdr_t hdr;
> +
> /* maximum number of allocated pages */
> uint32_t pages_alloc;
> /* non zero pages */
> @@ -72,6 +83,16 @@ typedef struct {
> uint64_t offset[];
> } __attribute__((packed)) MultiFDPacket_t;
>
> +typedef struct {
> + MultiFDPacketHdr_t hdr;
> +
> + char idstr[256] QEMU_NONSTRING;
idstr should be null terminated, or am I missing something?
Thanks.
> + uint32_t instance_id;
> +
> + /* size of the next packet that contains the actual data */
> + uint32_t next_packet_size;
> +} __attribute__((packed)) MultiFDPacketDeviceState_t;
> +
> typedef struct {
> /* number of used pages */
> uint32_t num;
> @@ -89,6 +110,13 @@ struct MultiFDRecvData {
> off_t file_offset;
> };
>
> +typedef struct {
> + char *idstr;
> + uint32_t instance_id;
> + char *buf;
> + size_t buf_len;
> +} MultiFDDeviceState_t;
> +
> typedef enum {
> MULTIFD_PAYLOAD_NONE,
> MULTIFD_PAYLOAD_RAM,
> @@ -204,8 +232,9 @@ typedef struct {
>
> /* thread local variables. No locking required */
>
> - /* pointer to the packet */
> + /* pointers to the possible packet types */
> MultiFDPacket_t *packet;
> + MultiFDPacketDeviceState_t *packet_dev_state;
> /* size of the next packet that contains pages */
> uint32_t next_packet_size;
> /* packets received through this channel */
* Re: [PATCH v2 15/17] vfio/migration: Multifd device state transfer support - receive side
2024-08-27 17:54 ` [PATCH v2 15/17] vfio/migration: Multifd device state transfer support - receive side Maciej S. Szmigiero
@ 2024-09-09 8:55 ` Avihai Horon
2024-09-09 18:06 ` Maciej S. Szmigiero
0 siblings, 1 reply; 128+ messages in thread
From: Avihai Horon @ 2024-09-09 8:55 UTC (permalink / raw)
To: Maciej S. Szmigiero, Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Joao Martins,
qemu-devel
On 27/08/2024 20:54, Maciej S. Szmigiero wrote:
>
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>
> The multifd received data needs to be reassembled since device state
> packets sent via different multifd channels can arrive out-of-order.
>
> Therefore, each VFIO device state packet carries a header indicating
> its position in the stream.
>
> The last such VFIO device state packet should have
> VFIO_DEVICE_STATE_CONFIG_STATE flag set and carry the device config
> state.
>
> Since it's important to finish loading the device state transferred via
> the main migration channel (via the save_live_iterate handler) before
> starting to load the data asynchronously transferred via multifd,
> a new VFIO_MIG_FLAG_DEV_DATA_STATE_COMPLETE flag is introduced to
> mark the end of the main migration channel data.
>
> The device state loading process waits until that flag is seen before
> commencing loading of the multifd-transferred device state.
>
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> ---
> hw/vfio/migration.c | 338 +++++++++++++++++++++++++++++++++-
> hw/vfio/pci.c | 2 +
> hw/vfio/trace-events | 9 +-
> include/hw/vfio/vfio-common.h | 17 ++
> 4 files changed, 362 insertions(+), 4 deletions(-)
>
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 24679d8c5034..57c1542528dc 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -15,6 +15,7 @@
> #include <linux/vfio.h>
> #include <sys/ioctl.h>
>
> +#include "io/channel-buffer.h"
> #include "sysemu/runstate.h"
> #include "hw/vfio/vfio-common.h"
> #include "migration/misc.h"
> @@ -47,6 +48,7 @@
> #define VFIO_MIG_FLAG_DEV_SETUP_STATE (0xffffffffef100003ULL)
> #define VFIO_MIG_FLAG_DEV_DATA_STATE (0xffffffffef100004ULL)
> #define VFIO_MIG_FLAG_DEV_INIT_DATA_SENT (0xffffffffef100005ULL)
> +#define VFIO_MIG_FLAG_DEV_DATA_STATE_COMPLETE (0xffffffffef100006ULL)
>
> /*
> * This is an arbitrary size based on migration of mlx5 devices, where typically
> @@ -55,6 +57,15 @@
> */
> #define VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE (1 * MiB)
>
> +#define VFIO_DEVICE_STATE_CONFIG_STATE (1)
> +
> +typedef struct VFIODeviceStatePacket {
> + uint32_t version;
> + uint32_t idx;
> + uint32_t flags;
> + uint8_t data[0];
> +} QEMU_PACKED VFIODeviceStatePacket;
> +
> static int64_t bytes_transferred;
>
> static const char *mig_state_to_str(enum vfio_device_mig_state state)
> @@ -254,6 +265,188 @@ static int vfio_load_buffer(QEMUFile *f, VFIODevice *vbasedev,
> return ret;
> }
>
> +typedef struct LoadedBuffer {
> + bool is_present;
> + char *data;
> + size_t len;
> +} LoadedBuffer;
Maybe rename LoadedBuffer to a more specific name, like VFIOStateBuffer?
I also feel like LoadedBuffer deserves a separate commit.
Plus, I think it would be good to add a full API for this that wraps the
g_array_* calls and holds the extra members.
E.g., VFIOStateBuffer, VFIOStateArray (which would hold load_buf_idx,
load_buf_idx_last, etc.), vfio_state_array_destroy(),
vfio_state_array_alloc(), vfio_state_array_get(), etc...
IMHO, this will make it clearer.
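To illustrate, a rough sketch of such a wrapper API (all names invented,
just following the suggestion above):

typedef struct VFIOStateBuffer {
    bool is_present;
    char *data;
    size_t len;
} VFIOStateBuffer;

typedef struct VFIOStateBuffers {
    GArray *array;
} VFIOStateBuffers;

static void vfio_state_buffer_clear(gpointer data)
{
    VFIOStateBuffer *sb = data;

    if (sb->is_present) {
        g_clear_pointer(&sb->data, g_free);
        sb->is_present = false;
    }
}

static void vfio_state_buffers_init(VFIOStateBuffers *bufs)
{
    bufs->array = g_array_new(FALSE, TRUE, sizeof(VFIOStateBuffer));
    g_array_set_clear_func(bufs->array, vfio_state_buffer_clear);
}

static void vfio_state_buffers_destroy(VFIOStateBuffers *bufs)
{
    g_clear_pointer(&bufs->array, g_array_unref);
}

static VFIOStateBuffer *vfio_state_buffers_at(VFIOStateBuffers *bufs,
                                              uint32_t idx)
{
    return &g_array_index(bufs->array, VFIOStateBuffer, idx);
}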
> +
> +static void loaded_buffer_clear(gpointer data)
> +{
> + LoadedBuffer *lb = data;
> +
> + if (!lb->is_present) {
> + return;
> + }
> +
> + g_clear_pointer(&lb->data, g_free);
> + lb->is_present = false;
> +}
> +
> +static int vfio_load_state_buffer(void *opaque, char *data, size_t data_size,
> + Error **errp)
> +{
> + VFIODevice *vbasedev = opaque;
> + VFIOMigration *migration = vbasedev->migration;
> + VFIODeviceStatePacket *packet = (VFIODeviceStatePacket *)data;
> + QEMU_LOCK_GUARD(&migration->load_bufs_mutex);
Move the lock to where it's needed? I.e., after
trace_vfio_load_state_device_buffer_incoming(vbasedev->name, packet->idx)
> + LoadedBuffer *lb;
> +
> + if (data_size < sizeof(*packet)) {
> + error_setg(errp, "packet too short at %zu (min is %zu)",
> + data_size, sizeof(*packet));
> + return -1;
> + }
> +
> + if (packet->version != 0) {
> + error_setg(errp, "packet has unknown version %" PRIu32,
> + packet->version);
> + return -1;
> + }
> +
> + if (packet->idx == UINT32_MAX) {
> + error_setg(errp, "packet has too high idx %" PRIu32,
> + packet->idx);
> + return -1;
> + }
> +
> + trace_vfio_load_state_device_buffer_incoming(vbasedev->name, packet->idx);
> +
> + /* config state packet should be the last one in the stream */
> + if (packet->flags & VFIO_DEVICE_STATE_CONFIG_STATE) {
> + migration->load_buf_idx_last = packet->idx;
> + }
> +
> + assert(migration->load_bufs);
> + if (packet->idx >= migration->load_bufs->len) {
> + g_array_set_size(migration->load_bufs, packet->idx + 1);
> + }
> +
> + lb = &g_array_index(migration->load_bufs, typeof(*lb), packet->idx);
> + if (lb->is_present) {
> + error_setg(errp, "state buffer %" PRIu32 " already filled", packet->idx);
> + return -1;
> + }
> +
> + assert(packet->idx >= migration->load_buf_idx);
> +
> + migration->load_buf_queued_pending_buffers++;
> + if (migration->load_buf_queued_pending_buffers >
> + vbasedev->migration_max_queued_buffers) {
> + error_setg(errp,
> + "queuing state buffer %" PRIu32 " would exceed the max of %" PRIu64,
> + packet->idx, vbasedev->migration_max_queued_buffers);
> + return -1;
> + }
I feel like the max_queued_buffers accounting/checking/configuration should
be split into a separate patch that comes after this patch.
Also, should we count bytes instead of buffers? The current buffer size is
1MB, but this could change, and a normal user should not care about or know
the buffer size.
So maybe rename it to migration_max_pending_bytes or such?
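E.g. something like this (all names hypothetical):

    /* account pending bytes rather than a buffer count */
    migration->load_buf_queued_pending_bytes += data_size - sizeof(*packet);
    if (migration->load_buf_queued_pending_bytes >
        vbasedev->migration_max_pending_bytes) {
        error_setg(errp,
                   "queuing state buffer %" PRIu32
                   " would exceed the max of %" PRIu64 " pending bytes",
                   packet->idx, vbasedev->migration_max_pending_bytes);
        return -1;
    }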
> +
> + lb->data = g_memdup2(&packet->data, data_size - sizeof(*packet));
> + lb->len = data_size - sizeof(*packet);
> + lb->is_present = true;
> +
> + qemu_cond_broadcast(&migration->load_bufs_buffer_ready_cond);
There is only one thread waiting; shouldn't a signal be enough?
> +
> + return 0;
> +}
> +
> +static void *vfio_load_bufs_thread(void *opaque)
> +{
> + VFIODevice *vbasedev = opaque;
> + VFIOMigration *migration = vbasedev->migration;
> + Error **errp = &migration->load_bufs_thread_errp;
> + g_autoptr(QemuLockable) locker = qemu_lockable_auto_lock(
> + QEMU_MAKE_LOCKABLE(&migration->load_bufs_mutex));
Any special reason to use QemuLockable?
> + LoadedBuffer *lb;
> +
> + while (!migration->load_bufs_device_ready &&
> + !migration->load_bufs_thread_want_exit) {
> + qemu_cond_wait(&migration->load_bufs_device_ready_cond, &migration->load_bufs_mutex);
> + }
> +
> + while (!migration->load_bufs_thread_want_exit) {
> + bool starved;
> + ssize_t ret;
> +
> + assert(migration->load_buf_idx <= migration->load_buf_idx_last);
> +
> + if (migration->load_buf_idx >= migration->load_bufs->len) {
> + assert(migration->load_buf_idx == migration->load_bufs->len);
> + starved = true;
> + } else {
> + lb = &g_array_index(migration->load_bufs, typeof(*lb), migration->load_buf_idx);
> + starved = !lb->is_present;
> + }
> +
> + if (starved) {
> + trace_vfio_load_state_device_buffer_starved(vbasedev->name, migration->load_buf_idx);
> + qemu_cond_wait(&migration->load_bufs_buffer_ready_cond, &migration->load_bufs_mutex);
> + continue;
> + }
> +
> + if (migration->load_buf_idx == migration->load_buf_idx_last) {
> + break;
> + }
> +
> + if (migration->load_buf_idx == 0) {
> + trace_vfio_load_state_device_buffer_start(vbasedev->name);
> + }
> +
> + if (lb->len) {
> + g_autofree char *buf = NULL;
> + size_t buf_len;
> + int errno_save;
> +
> + trace_vfio_load_state_device_buffer_load_start(vbasedev->name,
> + migration->load_buf_idx);
> +
> + /* lb might become re-allocated when we drop the lock */
> + buf = g_steal_pointer(&lb->data);
> + buf_len = lb->len;
> +
> + /* Loading data to the device takes a while, drop the lock during this process */
> + qemu_mutex_unlock(&migration->load_bufs_mutex);
> + ret = write(migration->data_fd, buf, buf_len);
> + errno_save = errno;
> + qemu_mutex_lock(&migration->load_bufs_mutex);
> +
> + if (ret < 0) {
> + error_setg(errp, "write to state buffer %" PRIu32 " failed with %d",
> + migration->load_buf_idx, errno_save);
> + break;
> + } else if (ret < buf_len) {
> + error_setg(errp, "write to state buffer %" PRIu32 " incomplete %zd / %zu",
> + migration->load_buf_idx, ret, buf_len);
> + break;
> + }
> +
> + trace_vfio_load_state_device_buffer_load_end(vbasedev->name,
> + migration->load_buf_idx);
> + }
> +
> + assert(migration->load_buf_queued_pending_buffers > 0);
> + migration->load_buf_queued_pending_buffers--;
> +
> + if (migration->load_buf_idx == migration->load_buf_idx_last - 1) {
> + trace_vfio_load_state_device_buffer_end(vbasedev->name);
> + }
> +
> + migration->load_buf_idx++;
> + }
> +
> + if (migration->load_bufs_thread_want_exit &&
> + !*errp) {
> + error_setg(errp, "load bufs thread asked to quit");
> + }
> +
> + g_clear_pointer(&locker, qemu_lockable_auto_unlock);
> +
> + qemu_loadvm_load_finish_ready_lock();
> + migration->load_bufs_thread_finished = true;
> + qemu_loadvm_load_finish_ready_broadcast();
> + qemu_loadvm_load_finish_ready_unlock();
> +
> + return NULL;
> +}
> +
> static int vfio_save_device_config_state(QEMUFile *f, void *opaque,
> Error **errp)
> {
> @@ -285,6 +478,8 @@ static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
> VFIODevice *vbasedev = opaque;
> uint64_t data;
>
> + trace_vfio_load_device_config_state_start(vbasedev->name);
Maybe split this and the trace_vfio_load_device_config_state_end below
into a separate patch?
> +
> if (vbasedev->ops && vbasedev->ops->vfio_load_config) {
> int ret;
>
> @@ -303,7 +498,7 @@ static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
> return -EINVAL;
> }
>
> - trace_vfio_load_device_config_state(vbasedev->name);
> + trace_vfio_load_device_config_state_end(vbasedev->name);
> return qemu_file_get_error(f);
> }
>
> @@ -687,16 +882,70 @@ static void vfio_save_state(QEMUFile *f, void *opaque)
> static int vfio_load_setup(QEMUFile *f, void *opaque, Error **errp)
> {
> VFIODevice *vbasedev = opaque;
> + VFIOMigration *migration = vbasedev->migration;
> + int ret;
> +
> + ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RESUMING,
> + vbasedev->migration->device_state, errp);
> + if (ret) {
> + return ret;
> + }
> +
> + assert(!migration->load_bufs);
> + migration->load_bufs = g_array_new(FALSE, TRUE, sizeof(LoadedBuffer));
> + g_array_set_clear_func(migration->load_bufs, loaded_buffer_clear);
> +
> + qemu_mutex_init(&migration->load_bufs_mutex);
> +
> + migration->load_bufs_device_ready = false;
> + qemu_cond_init(&migration->load_bufs_device_ready_cond);
> +
> + migration->load_buf_idx = 0;
> + migration->load_buf_idx_last = UINT32_MAX;
> + migration->load_buf_queued_pending_buffers = 0;
> + qemu_cond_init(&migration->load_bufs_buffer_ready_cond);
> +
> + migration->config_state_loaded_to_dev = false;
> +
> + assert(!migration->load_bufs_thread_started);
Maybe do all these allocations (and de-allocations) only if multifd
device state is supported and enabled?
Extracting this to its own function could also be good.
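One possible shape for such a split (a rough sketch only; the helper name and
the enablement check are assumptions, not something this patch defines):

static void vfio_load_state_buffers_init(VFIOMigration *migration)
{
    migration->load_bufs = g_array_new(FALSE, TRUE, sizeof(LoadedBuffer));
    g_array_set_clear_func(migration->load_bufs, loaded_buffer_clear);

    qemu_mutex_init(&migration->load_bufs_mutex);

    migration->load_bufs_device_ready = false;
    qemu_cond_init(&migration->load_bufs_device_ready_cond);

    migration->load_buf_idx = 0;
    migration->load_buf_idx_last = UINT32_MAX;
    migration->load_buf_queued_pending_buffers = 0;
    qemu_cond_init(&migration->load_bufs_buffer_ready_cond);
}

/* ...then in vfio_load_setup(), guarded by the multifd check: */
if (migration_has_device_state_support()) {
    vfio_load_state_buffers_init(migration);
}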
>
> - return vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RESUMING,
> - vbasedev->migration->device_state, errp);
> + migration->load_bufs_thread_finished = false;
> + migration->load_bufs_thread_want_exit = false;
> + qemu_thread_create(&migration->load_bufs_thread, "vfio-load-bufs",
> + vfio_load_bufs_thread, opaque, QEMU_THREAD_JOINABLE);
The device state save threads are managed by the migration core thread pool.
Don't we want to apply the same thread management scheme for the load
flow as well?
> +
> + migration->load_bufs_thread_started = true;
> +
> + return 0;
> }
>
> static int vfio_load_cleanup(void *opaque)
> {
> VFIODevice *vbasedev = opaque;
> + VFIOMigration *migration = vbasedev->migration;
> +
> + if (migration->load_bufs_thread_started) {
> + qemu_mutex_lock(&migration->load_bufs_mutex);
> + migration->load_bufs_thread_want_exit = true;
> + qemu_mutex_unlock(&migration->load_bufs_mutex);
> +
> + qemu_cond_broadcast(&migration->load_bufs_device_ready_cond);
> + qemu_cond_broadcast(&migration->load_bufs_buffer_ready_cond);
> +
> + qemu_thread_join(&migration->load_bufs_thread);
> +
> + assert(migration->load_bufs_thread_finished);
> +
> + migration->load_bufs_thread_started = false;
> + }
>
> vfio_migration_cleanup(vbasedev);
> +
> + g_clear_pointer(&migration->load_bufs, g_array_unref);
> + qemu_cond_destroy(&migration->load_bufs_buffer_ready_cond);
> + qemu_cond_destroy(&migration->load_bufs_device_ready_cond);
> + qemu_mutex_destroy(&migration->load_bufs_mutex);
> +
> trace_vfio_load_cleanup(vbasedev->name);
>
> return 0;
> @@ -705,6 +954,7 @@ static int vfio_load_cleanup(void *opaque)
> static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
> {
> VFIODevice *vbasedev = opaque;
> + VFIOMigration *migration = vbasedev->migration;
> int ret = 0;
> uint64_t data;
>
> @@ -716,6 +966,7 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
> switch (data) {
> case VFIO_MIG_FLAG_DEV_CONFIG_STATE:
> {
> + migration->config_state_loaded_to_dev = true;
> return vfio_load_device_config_state(f, opaque);
> }
> case VFIO_MIG_FLAG_DEV_SETUP_STATE:
> @@ -742,6 +993,15 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
> }
> break;
> }
> + case VFIO_MIG_FLAG_DEV_DATA_STATE_COMPLETE:
> + {
> + QEMU_LOCK_GUARD(&migration->load_bufs_mutex);
> +
> + migration->load_bufs_device_ready = true;
> + qemu_cond_broadcast(&migration->load_bufs_device_ready_cond);
> +
> + break;
> + }
> case VFIO_MIG_FLAG_DEV_INIT_DATA_SENT:
> {
> if (!vfio_precopy_supported(vbasedev) ||
> @@ -774,6 +1034,76 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
> return ret;
> }
>
> +static int vfio_load_finish(void *opaque, bool *is_finished, Error **errp)
> +{
> + VFIODevice *vbasedev = opaque;
> + VFIOMigration *migration = vbasedev->migration;
> + g_autoptr(QemuLockable) locker = NULL;
Any special reason to use QemuLockable?
Thanks.
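For context: compared with a plain QEMU_LOCK_GUARD, the g_autoptr-based
QemuLockable pattern used in this patch also permits an explicit early
unlock. A minimal sketch of the two alternatives, on the same mutex:

/* scope-bound guard: unlocks only when the enclosing scope ends */
QEMU_LOCK_GUARD(&migration->load_bufs_mutex);

/* auto lock that can additionally be released early */
g_autoptr(QemuLockable) locker =
    qemu_lockable_auto_lock(QEMU_MAKE_LOCKABLE(&migration->load_bufs_mutex));
/* ... */
g_clear_pointer(&locker, qemu_lockable_auto_unlock);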
> + LoadedBuffer *lb;
> + g_autoptr(QIOChannelBuffer) bioc = NULL;
> + QEMUFile *f_out = NULL, *f_in = NULL;
> + uint64_t mig_header;
> + int ret;
> +
> + if (migration->config_state_loaded_to_dev) {
> + *is_finished = true;
> + return 0;
> + }
> +
> + if (!migration->load_bufs_thread_finished) {
> + assert(migration->load_bufs_thread_started);
> + *is_finished = false;
> + return 0;
> + }
> +
> + if (migration->load_bufs_thread_errp) {
> + error_propagate(errp, g_steal_pointer(&migration->load_bufs_thread_errp));
> + return -1;
> + }
> +
> + locker = qemu_lockable_auto_lock(QEMU_MAKE_LOCKABLE(&migration->load_bufs_mutex));
> +
> + assert(migration->load_buf_idx == migration->load_buf_idx_last);
> + lb = &g_array_index(migration->load_bufs, typeof(*lb), migration->load_buf_idx);
> + assert(lb->is_present);
> +
> + bioc = qio_channel_buffer_new(lb->len);
> + qio_channel_set_name(QIO_CHANNEL(bioc), "vfio-device-config-load");
> +
> + f_out = qemu_file_new_output(QIO_CHANNEL(bioc));
> + qemu_put_buffer(f_out, (uint8_t *)lb->data, lb->len);
> +
> + ret = qemu_fflush(f_out);
> + if (ret) {
> + error_setg(errp, "load device config state file flush failed with %d", ret);
> + g_clear_pointer(&f_out, qemu_fclose);
> + return -1;
> + }
> +
> + qio_channel_io_seek(QIO_CHANNEL(bioc), 0, 0, NULL);
> + f_in = qemu_file_new_input(QIO_CHANNEL(bioc));
> +
> + mig_header = qemu_get_be64(f_in);
> + if (mig_header != VFIO_MIG_FLAG_DEV_CONFIG_STATE) {
> + error_setg(errp, "load device config state invalid header %"PRIu64, mig_header);
> + g_clear_pointer(&f_out, qemu_fclose);
> + g_clear_pointer(&f_in, qemu_fclose);
> + return -1;
> + }
> +
> + ret = vfio_load_device_config_state(f_in, opaque);
> + g_clear_pointer(&f_out, qemu_fclose);
> + g_clear_pointer(&f_in, qemu_fclose);
> + if (ret < 0) {
> + error_setg(errp, "load device config state failed with %d", ret);
> + return -1;
> + }
> +
> + migration->config_state_loaded_to_dev = true;
> + *is_finished = true;
> + return 0;
> +}
> +
> static bool vfio_switchover_ack_needed(void *opaque)
> {
> VFIODevice *vbasedev = opaque;
> @@ -794,6 +1124,8 @@ static const SaveVMHandlers savevm_vfio_handlers = {
> .load_setup = vfio_load_setup,
> .load_cleanup = vfio_load_cleanup,
> .load_state = vfio_load_state,
> + .load_state_buffer = vfio_load_state_buffer,
> + .load_finish = vfio_load_finish,
> .switchover_ack_needed = vfio_switchover_ack_needed,
> };
>
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 2407720c3530..08cb56d27a05 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -3378,6 +3378,8 @@ static Property vfio_pci_dev_properties[] = {
> VFIO_FEATURE_ENABLE_IGD_OPREGION_BIT, false),
> DEFINE_PROP_ON_OFF_AUTO("enable-migration", VFIOPCIDevice,
> vbasedev.enable_migration, ON_OFF_AUTO_AUTO),
> + DEFINE_PROP_UINT64("x-migration-max-queued-buffers", VFIOPCIDevice,
> + vbasedev.migration_max_queued_buffers, UINT64_MAX),
> DEFINE_PROP_BOOL("migration-events", VFIOPCIDevice,
> vbasedev.migration_events, false),
> DEFINE_PROP_BOOL("x-no-mmap", VFIOPCIDevice, vbasedev.no_mmap, false),
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index 013c602f30fa..9d2519a28a7e 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -149,9 +149,16 @@ vfio_display_edid_write_error(void) ""
>
> # migration.c
> vfio_load_cleanup(const char *name) " (%s)"
> -vfio_load_device_config_state(const char *name) " (%s)"
> +vfio_load_device_config_state_start(const char *name) " (%s)"
> +vfio_load_device_config_state_end(const char *name) " (%s)"
> vfio_load_state(const char *name, uint64_t data) " (%s) data 0x%"PRIx64
> vfio_load_state_device_data(const char *name, uint64_t data_size, int ret) " (%s) size 0x%"PRIx64" ret %d"
> +vfio_load_state_device_buffer_incoming(const char *name, uint32_t idx) " (%s) idx %"PRIu32
> +vfio_load_state_device_buffer_start(const char *name) " (%s)"
> +vfio_load_state_device_buffer_starved(const char *name, uint32_t idx) " (%s) idx %"PRIu32
> +vfio_load_state_device_buffer_load_start(const char *name, uint32_t idx) " (%s) idx %"PRIu32
> +vfio_load_state_device_buffer_load_end(const char *name, uint32_t idx) " (%s) idx %"PRIu32
> +vfio_load_state_device_buffer_end(const char *name) " (%s)"
> vfio_migration_realize(const char *name) " (%s)"
> vfio_migration_set_device_state(const char *name, const char *state) " (%s) state %s"
> vfio_migration_set_state(const char *name, const char *new_state, const char *recover_state) " (%s) new state %s, recover state %s"
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 32d58e3e025b..ba5b9464e79a 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -76,6 +76,22 @@ typedef struct VFIOMigration {
>
> bool save_iterate_run;
> bool save_iterate_empty_hit;
> +
> + QemuThread load_bufs_thread;
> + Error *load_bufs_thread_errp;
> + bool load_bufs_thread_started;
> + bool load_bufs_thread_finished;
> + bool load_bufs_thread_want_exit;
> +
> + GArray *load_bufs;
> + bool load_bufs_device_ready;
> + QemuCond load_bufs_device_ready_cond;
> + QemuCond load_bufs_buffer_ready_cond;
> + QemuMutex load_bufs_mutex;
> + uint32_t load_buf_idx;
> + uint32_t load_buf_idx_last;
> + uint32_t load_buf_queued_pending_buffers;
> + bool config_state_loaded_to_dev;
> } VFIOMigration;
>
> struct VFIOGroup;
> @@ -134,6 +150,7 @@ typedef struct VFIODevice {
> bool ram_block_discard_allowed;
> OnOffAuto enable_migration;
> bool migration_events;
> + uint64_t migration_max_queued_buffers;
> VFIODeviceOps *ops;
> unsigned int num_irqs;
> unsigned int num_regions;
* Re: [PATCH v2 17/17] vfio/migration: Multifd device state transfer support - send side
2024-08-27 17:54 ` [PATCH v2 17/17] vfio/migration: Multifd device state transfer support - send side Maciej S. Szmigiero
@ 2024-09-09 11:41 ` Avihai Horon
2024-09-09 18:07 ` Maciej S. Szmigiero
0 siblings, 1 reply; 128+ messages in thread
From: Avihai Horon @ 2024-09-09 11:41 UTC (permalink / raw)
To: Maciej S. Szmigiero, Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Joao Martins,
qemu-devel
On 27/08/2024 20:54, Maciej S. Szmigiero wrote:
>
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>
> Implement the multifd device state transfer via additional per-device
> thread inside save_live_complete_precopy_thread handler.
>
> Switch between doing the data transfer in the new handler and doing it
> in the old save_state handler depending on the
> x-migration-multifd-transfer device property value.
>
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> ---
> hw/vfio/migration.c | 169 ++++++++++++++++++++++++++++++++++
> hw/vfio/trace-events | 2 +
> include/hw/vfio/vfio-common.h | 1 +
> 3 files changed, 172 insertions(+)
>
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 57c1542528dc..67996aa2df8b 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -655,6 +655,16 @@ static int vfio_save_setup(QEMUFile *f, void *opaque, Error **errp)
> uint64_t stop_copy_size = VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE;
> int ret;
>
> + /* Make a copy of this setting at the start in case it is changed mid-migration */
> + migration->multifd_transfer = vbasedev->migration_multifd_transfer;
Should VFIO multifd be controlled by the main migration multifd capability,
with the per-VFIO-device migration_multifd_transfer property made immutable
and enabled by default?
Then we would have a single point of configuration (plus an extra one per
VFIO device just to disable it for backward compatibility).
Unless there are other benefits to having this property configurable?
> +
> + if (migration->multifd_transfer && !migration_has_device_state_support()) {
> + error_setg(errp,
> + "%s: Multifd device transfer requested but unsupported in the current config",
> + vbasedev->name);
> + return -EINVAL;
> + }
> +
> qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
>
> vfio_query_stop_copy_size(vbasedev, &stop_copy_size);
> @@ -835,10 +845,20 @@ static int vfio_save_iterate(QEMUFile *f, void *opaque)
> static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
> {
> VFIODevice *vbasedev = opaque;
> + VFIOMigration *migration = vbasedev->migration;
> ssize_t data_size;
> int ret;
> Error *local_err = NULL;
>
> + if (migration->multifd_transfer) {
> + /*
> + * Emit dummy NOP data, vfio_save_complete_precopy_thread()
> + * does the actual transfer.
> + */
> + qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
There are three places where we send this dummy end of state; maybe it's
worth extracting it into a helper, i.e. vfio_send_end_of_state(), and
documenting the rationale there.
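A minimal sketch of such a helper (name as suggested above; whether it should
also flush is left to the call sites):

static void vfio_send_end_of_state(QEMUFile *f)
{
    /*
     * Emit dummy NOP data: with multifd transfer the actual device state
     * bypasses the main migration stream, but the stream format still
     * expects an end-of-state marker for this device section.
     */
    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
}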
> + return 0;
> + }
> +
> trace_vfio_save_complete_precopy_started(vbasedev->name);
>
> /* We reach here with device state STOP or STOP_COPY only */
> @@ -864,12 +884,159 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
> return ret;
> }
>
> +static int vfio_save_complete_precopy_async_thread_config_state(VFIODevice *vbasedev,
> + char *idstr,
> + uint32_t instance_id,
> + uint32_t idx)
> +{
> + g_autoptr(QIOChannelBuffer) bioc = NULL;
> + QEMUFile *f = NULL;
> + int ret;
> + g_autofree VFIODeviceStatePacket *packet = NULL;
> + size_t packet_len;
> +
> + bioc = qio_channel_buffer_new(0);
> + qio_channel_set_name(QIO_CHANNEL(bioc), "vfio-device-config-save");
> +
> + f = qemu_file_new_output(QIO_CHANNEL(bioc));
> +
> + ret = vfio_save_device_config_state(f, vbasedev, NULL);
> + if (ret) {
> + return ret;
Need to close f in this case.
> + }
> +
> + ret = qemu_fflush(f);
> + if (ret) {
> + goto ret_close_file;
> + }
> +
> + packet_len = sizeof(*packet) + bioc->usage;
> + packet = g_malloc0(packet_len);
> + packet->idx = idx;
> + packet->flags = VFIO_DEVICE_STATE_CONFIG_STATE;
> + memcpy(&packet->data, bioc->data, bioc->usage);
> +
> + if (!multifd_queue_device_state(idstr, instance_id,
> + (char *)packet, packet_len)) {
> + ret = -1;
goto ret_close_file?
> + }
> +
> + bytes_transferred += packet_len;
bytes_transferred is a global variable. Now that we access it from
multiple threads it should be protected.
Note that the VFIO device data is now also reported in multifd stats (if
I am not mistaken). Is this the behavior we want? Maybe we should
enhance multifd stats to distinguish between RAM data and device data?
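One way to protect it, assuming bytes_transferred stays a plain int64_t,
would be the existing qatomic helpers rather than a new mutex (a sketch):

/* instead of: bytes_transferred += packet_len; */
qatomic_add(&bytes_transferred, packet_len);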
> +
> +ret_close_file:
Rename to "out" as we only have one exit point?
> + g_clear_pointer(&f, qemu_fclose);
f is a local variable, wouldn't qemu_fclose(f) be enough here?
> + return ret;
> +}
> +
> +static int vfio_save_complete_precopy_thread(char *idstr,
> + uint32_t instance_id,
> + bool *abort_flag,
> + void *opaque)
> +{
> + VFIODevice *vbasedev = opaque;
> + VFIOMigration *migration = vbasedev->migration;
> + int ret;
> + g_autofree VFIODeviceStatePacket *packet = NULL;
> + uint32_t idx;
> +
> + if (!migration->multifd_transfer) {
> + /* Nothing to do, vfio_save_complete_precopy() does the transfer. */
> + return 0;
> + }
> +
> + trace_vfio_save_complete_precopy_thread_started(vbasedev->name,
> + idstr, instance_id);
> +
> + /* We reach here with device state STOP or STOP_COPY only */
> + ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP_COPY,
> + VFIO_DEVICE_STATE_STOP, NULL);
> + if (ret) {
> + goto ret_finish;
> + }
> +
> + packet = g_malloc0(sizeof(*packet) + migration->data_buffer_size);
> +
> + for (idx = 0; ; idx++) {
> + ssize_t data_size;
> + size_t packet_size;
> +
> + if (qatomic_read(abort_flag)) {
> + ret = -ECANCELED;
> + goto ret_finish;
> + }
> +
> + data_size = read(migration->data_fd, &packet->data,
> + migration->data_buffer_size);
> + if (data_size < 0) {
> + if (errno != ENOMSG) {
> + ret = -errno;
> + goto ret_finish;
> + }
> +
> + /*
> + * Pre-copy emptied all the device state for now. For more information,
> + * please refer to the Linux kernel VFIO uAPI.
> + */
> + data_size = 0;
According to VFIO uAPI, ENOMSG can only be returned in the PRE_COPY
device states.
Here we are in STOP_COPY, so we can drop the ENOMSG handling.
Thanks.
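Under that assumption the read error handling would simplify to roughly:

data_size = read(migration->data_fd, &packet->data,
                 migration->data_buffer_size);
if (data_size < 0) {
    ret = -errno;
    goto ret_finish;
}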
> + }
> +
> + if (data_size == 0)
> + break;
> +
> + packet->idx = idx;
> + packet_size = sizeof(*packet) + data_size;
> +
> + if (!multifd_queue_device_state(idstr, instance_id,
> + (char *)packet, packet_size)) {
> + ret = -1;
> + goto ret_finish;
> + }
> +
> + bytes_transferred += packet_size;
> + }
> +
> + ret = vfio_save_complete_precopy_async_thread_config_state(vbasedev, idstr,
> + instance_id,
> + idx);
> +
> +ret_finish:
> + trace_vfio_save_complete_precopy_thread_finished(vbasedev->name, ret);
> +
> + return ret;
> +}
> +
> +static int vfio_save_complete_precopy_begin(QEMUFile *f,
> + char *idstr, uint32_t instance_id,
> + void *opaque)
> +{
> + VFIODevice *vbasedev = opaque;
> + VFIOMigration *migration = vbasedev->migration;
> +
> + if (!migration->multifd_transfer) {
> + /* Emit dummy NOP data */
> + qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> + return 0;
> + }
> +
> + qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE_COMPLETE);
> + qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> +
> + return qemu_fflush(f);
> +}
> +
> static void vfio_save_state(QEMUFile *f, void *opaque)
> {
> VFIODevice *vbasedev = opaque;
> + VFIOMigration *migration = vbasedev->migration;
> Error *local_err = NULL;
> int ret;
>
> + if (migration->multifd_transfer) {
> + /* Emit dummy NOP data */
> + qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> + return;
> + }
> +
> ret = vfio_save_device_config_state(f, opaque, &local_err);
> if (ret) {
> error_prepend(&local_err,
> @@ -1119,7 +1286,9 @@ static const SaveVMHandlers savevm_vfio_handlers = {
> .state_pending_exact = vfio_state_pending_exact,
> .is_active_iterate = vfio_is_active_iterate,
> .save_live_iterate = vfio_save_iterate,
> + .save_live_complete_precopy_begin = vfio_save_complete_precopy_begin,
> .save_live_complete_precopy = vfio_save_complete_precopy,
> + .save_live_complete_precopy_thread = vfio_save_complete_precopy_thread,
> .save_state = vfio_save_state,
> .load_setup = vfio_load_setup,
> .load_cleanup = vfio_load_cleanup,
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index 9d2519a28a7e..b1d9c9d5f2e1 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -167,6 +167,8 @@ vfio_save_block(const char *name, int data_size) " (%s) data_size %d"
> vfio_save_cleanup(const char *name) " (%s)"
> vfio_save_complete_precopy(const char *name, int ret) " (%s) ret %d"
> vfio_save_complete_precopy_started(const char *name) " (%s)"
> +vfio_save_complete_precopy_thread_started(const char *name, const char *idstr, uint32_t instance_id) " (%s) idstr %s instance %"PRIu32
> +vfio_save_complete_precopy_thread_finished(const char *name, int ret) " (%s) ret %d"
> vfio_save_device_config_state(const char *name) " (%s)"
> vfio_save_iterate(const char *name, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64
> vfio_save_iterate_started(const char *name) " (%s)"
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index fe05acb9a5d1..4578a0ca6a5c 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -72,6 +72,7 @@ typedef struct VFIOMigration {
> uint64_t mig_flags;
> uint64_t precopy_init_size;
> uint64_t precopy_dirty_size;
> + bool multifd_transfer;
> bool initial_data_sent;
>
> bool save_iterate_run;
* Re: [PATCH v2 03/17] migration/multifd: Zero p->flags before starting filling a packet
2024-08-27 17:54 ` [PATCH v2 03/17] migration/multifd: Zero p->flags before starting filling a packet Maciej S. Szmigiero
2024-08-28 18:50 ` Fabiano Rosas
@ 2024-09-09 15:41 ` Peter Xu
1 sibling, 0 replies; 128+ messages in thread
From: Peter Xu @ 2024-09-09 15:41 UTC (permalink / raw)
To: Maciej S. Szmigiero
Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On Tue, Aug 27, 2024 at 07:54:22PM +0200, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>
> This way there aren't stale flags there.
>
> p->flags can't contain SYNC to be sent at the next RAM packet since syncs
> are now handled separately in multifd_send_thread.
>
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
--
Peter Xu
* Re: [PATCH v2 05/17] thread-pool: Implement non-AIO (generic) pool support
2024-09-03 19:04 ` Stefan Hajnoczi
@ 2024-09-09 16:45 ` Peter Xu
2024-09-09 18:38 ` Maciej S. Szmigiero
0 siblings, 1 reply; 128+ messages in thread
From: Peter Xu @ 2024-09-09 16:45 UTC (permalink / raw)
To: Stefan Hajnoczi, Maciej S. Szmigiero
Cc: Maciej S. Szmigiero, Fabiano Rosas, Alex Williamson,
Cédric Le Goater, Eric Blake, Markus Armbruster,
Daniel P . Berrangé, Avihai Horon, Joao Martins, qemu-devel,
Paolo Bonzini
Hi Stefan, Maciej,
Sorry for being slow to respond.
On Tue, Sep 03, 2024 at 03:04:54PM -0400, Stefan Hajnoczi wrote:
> On Tue, 3 Sept 2024 at 12:54, Maciej S. Szmigiero
> <mail@maciej.szmigiero.name> wrote:
> >
> > On 3.09.2024 15:55, Stefan Hajnoczi wrote:
> > > On Tue, 27 Aug 2024 at 13:58, Maciej S. Szmigiero
> > > <mail@maciej.szmigiero.name> wrote:
> > >>
> > >> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> > >>
> > >> Migration code wants to manage device data sending threads in one place.
> > >>
> > >> QEMU has an existing thread pool implementation, however it was limited
> > >> to queuing AIO operations only and essentially had a 1:1 mapping between
> > >> the current AioContext and the ThreadPool in use.
> > >>
> > >> Implement what is necessary to queue generic (non-AIO) work on a ThreadPool
> > >> too.
> > >>
> > >> This brings a few new operations on a pool:
> > >> * thread_pool_set_minmax_threads() explicitly sets the minimum and maximum
> > >> thread count in the pool.
> > >>
> > >> * thread_pool_join() operation waits until all the submitted work requests
> > >> have finished.
> > >>
> > >> * thread_pool_poll() lets the new thread and / or thread completion bottom
> > >> halves run (if they are indeed scheduled to be run).
> > >> It is useful for thread pool users that need to launch or terminate new
> > >> threads without returning to the QEMU main loop.
> > >
> > > Did you consider glib's GThreadPool?
> > > https://docs.gtk.org/glib/struct.ThreadPool.html
> > >
> > > QEMU's thread pool is integrated into the QEMU event loop. If your
> > > goal is to bypass the QEMU event loop, then you may as well use the
> > > glib API instead.
> > >
> > > thread_pool_join() and thread_pool_poll() will lead to code that
> > > blocks the event loop. QEMU's aio_poll() and nested event loops in
> > > general are a source of hangs and re-entrancy bugs. I would prefer not
> > > introducing these issues in the QEMU ThreadPool API.
> > >
> >
> > Unfortunately, the problem with the migration code is that it is
> > synchronous - it does not return to the main event loop until the
> > migration is done.
> >
> > So the only way to handle things that need a working event loop is to
> > pump it manually from inside the migration code.
> >
> > The reason why I used the QEMU thread pool in the first place in this
> > patch set version is because Peter asked me to do so during the review
> > of its previous iteration [1].
> >
> > Peter also asked me previously to move to QEMU synchronization
> > primitives from using the Glib ones in the early version of this
> > patch set [2].
> >
> > I personally would rather use something common to many applications,
> > well tested and with more pairs of eyes looking at it, rather than
> > re-invent things in QEMU.
> >
> > Looking at GThreadPool it seems that it lacks the ability to wait until
> > all queued work has finished, so this would need to be open-coded
> > in the migration code.
> >
> > @Peter, what's your opinion on using Glib's thread pool instead of
> > QEMU's one, considering the above things?
>
> I'll add a bit more about my thinking:
>
> Using QEMU's event-driven model is usually preferred because it makes
> integrating with the rest of QEMU easy and avoids having lots of
> single-purpose threads that are hard to observe/manage (e.g. through
> the QMP monitor).
>
> When there is a genuine need to spawn a thread and write synchronous
> code (e.g. a blocking ioctl(2) call or something CPU-intensive), then
Right, AFAIU this is the current use case for VFIO, and anything beyond it in
the migration context, where we want to use genuine threads with no need to
integrate with the main event loop.
Currently the VFIO workfn should read() the VFIO fd in a blocking way, then
dump the data to multifd threads (which further dump it to the migration
channels), during which it can wait on a semaphore.
> it's okay to do that. Use QEMUBH, EventNotifier, or other QEMU APIs to
> synchronize between event loop threads and special-purpose synchronous
> threads.
>
> I haven't looked at the patch series enough to have an opinion about
> whether this use case needs a special-purpose thread or not. I am
> assuming it really needs to be a special-purpose thread. Peter and you
> could discuss that further if you want.
>
> I agree with Peter's request to use QEMU's synchronization primitives.
> They do not depend on the event loop so they can be used outside the
> event loop.
>
> The issue I'm raising with this patch is that adding new join()/poll()
> APIs that shouldn't be called from the event loop is bug-prone. It
> will make the QEMU ThreadPool code harder to understand and maintain
> because now there are two different contexts where different subsets
> of this API can be used and mixing them leads to problems. To me the
> non-event loop case is beyond the scope of QEMU's ThreadPool. I have
> CCed Paolo, who wrote the thread pool in its current form in case he
> wants to participate in the discussion.
>
> Using glib's ThreadPool solves the issue while still reusing an
> existing thread pool implementation. Waiting for all work to complete
> can be done using QemuSemaphore.
Right. It's a pity that g_thread_pool_unprocessed() only monitors
unqueuing of tasks, and it looks like there's no g_thread_pool_flush().
Indeed the current thread pool is very aio-centric, and if we worry about
misuse of the APIs we can switch to glib's thread pool. Sorry Maciej, it
looks like I routed you in a direction whose side effects I didn't foresee.
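For illustration, the open-coded wait that such a switch would need could look
roughly like this (a sketch only; all names here are assumed):

typedef struct SaveWorkItem {
    QemuSemaphore *done_sem;
    /* ...description of the device state save work... */
} SaveWorkItem;

static void save_worker_fn(gpointer opaque, gpointer pool_opaque)
{
    SaveWorkItem *item = opaque;

    /* ...perform the blocking save work... */
    qemu_sem_post(item->done_sem);
}

/* submitter side: each work item carries &done_sem */
GThreadPool *pool = g_thread_pool_new(save_worker_fn, NULL, max_threads,
                                      FALSE, NULL);
QemuSemaphore done_sem;
int i;

qemu_sem_init(&done_sem, 0);
for (i = 0; i < n_items; i++) {
    g_thread_pool_push(pool, &items[i], NULL);
}
for (i = 0; i < n_items; i++) {
    qemu_sem_wait(&done_sem); /* the "wait for all queued work" part */
}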
I think the fundamental request from my side (on behalf of migration) is that
we should avoid a specific vmstate handler managing threads on its own. E.g.,
any future devices (vdpa, vcpu, etc.) that may also be able to offload
save() processes concurrently to threads (just like what VFIO can already
do right now) should share the same pool of threads. As long as that can
be achieved I am ok.
Thanks,
--
Peter Xu
* Re: [PATCH v2 06/17] migration: Add save_live_complete_precopy_{begin,end} handlers
2024-09-05 13:45 ` Avihai Horon
@ 2024-09-09 17:59 ` Peter Xu
2024-09-09 18:32 ` Maciej S. Szmigiero
2024-09-09 18:05 ` Maciej S. Szmigiero
1 sibling, 1 reply; 128+ messages in thread
From: Peter Xu @ 2024-09-09 17:59 UTC (permalink / raw)
To: Avihai Horon
Cc: Maciej S. Szmigiero, Fabiano Rosas, Alex Williamson,
Cédric Le Goater, Eric Blake, Markus Armbruster,
Daniel P . Berrangé, Joao Martins, qemu-devel
On Thu, Sep 05, 2024 at 04:45:48PM +0300, Avihai Horon wrote:
>
> On 27/08/2024 20:54, Maciej S. Szmigiero wrote:
> >
> > From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> >
> > These SaveVMHandlers help device provide its own asynchronous
> > transmission of the remaining data at the end of a precopy phase.
> >
> > In this use case the save_live_complete_precopy_begin handler might
> > be used to mark the stream boundary before proceeding with asynchronous
> > transmission of the remaining data while the
> > save_live_complete_precopy_end handler might be used to mark the
> > stream boundary after performing the asynchronous transmission.
> >
> > Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> > ---
> > include/migration/register.h | 36 ++++++++++++++++++++++++++++++++++++
> > migration/savevm.c | 35 +++++++++++++++++++++++++++++++++++
> > 2 files changed, 71 insertions(+)
> >
> > diff --git a/include/migration/register.h b/include/migration/register.h
> > index f60e797894e5..9de123252edf 100644
> > --- a/include/migration/register.h
> > +++ b/include/migration/register.h
> > @@ -103,6 +103,42 @@ typedef struct SaveVMHandlers {
> > */
> > int (*save_live_complete_precopy)(QEMUFile *f, void *opaque);
> >
> > + /**
> > + * @save_live_complete_precopy_begin
> > + *
> > + * Called at the end of a precopy phase, before all
> > + * @save_live_complete_precopy handlers and before launching
> > + * all @save_live_complete_precopy_thread threads.
> > + * The handler might, for example, mark the stream boundary before
> > + * proceeding with asynchronous transmission of the remaining data via
> > + * @save_live_complete_precopy_thread.
> > + * When postcopy is enabled, devices that support postcopy will skip this step.
> > + *
> > + * @f: QEMUFile where the handler can synchronously send data before returning
> > + * @idstr: this device section idstr
> > + * @instance_id: this device section instance_id
> > + * @opaque: data pointer passed to register_savevm_live()
> > + *
> > + * Returns zero to indicate success and negative for error
> > + */
> > + int (*save_live_complete_precopy_begin)(QEMUFile *f,
> > + char *idstr, uint32_t instance_id,
> > + void *opaque);
> > + /**
> > + * @save_live_complete_precopy_end
> > + *
> > + * Called at the end of a precopy phase, after @save_live_complete_precopy
> > + * handlers and after all @save_live_complete_precopy_thread threads have
> > + * finished. When postcopy is enabled, devices that support postcopy will
> > + * skip this step.
> > + *
> > + * @f: QEMUFile where the handler can synchronously send data before returning
> > + * @opaque: data pointer passed to register_savevm_live()
> > + *
> > + * Returns zero to indicate success and negative for error
> > + */
> > + int (*save_live_complete_precopy_end)(QEMUFile *f, void *opaque);
>
> Is this handler necessary now that the migration core is responsible for the
> threads and joins them? I don't see VFIO implementing it later on.
Right, I spot the same thing.
This series added three hooks: begin, end, precopy_thread.
What I think is that it only needs one, which is precopy_async. My vague
memory is that this is what we used to discuss too, so that when migration
precopy flushes the final round of iterable data, it does:
(1) loop over all complete_precopy_async() handlers and enqueue the tasks,
if any exist, into the migration worker pool. Then,
(2) loop over all complete_precopy() handlers like before.
Optionally, we can enforce that one vmstate handler only provides either
complete_precopy_async() or complete_precopy(). In this case VFIO can
update the two hooks during setup() by detecting multifd && !mapped_ram &&
nocomp.
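In rough (pseudo) code, the final flush would then be (handler name per the
proposal above; the worker pool submission function is a placeholder):

SaveStateEntry *se;

/* (1) kick off the asynchronous completions first */
QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
    if (se->ops && se->ops->save_live_complete_precopy_async) {
        migration_worker_pool_submit(se->ops->save_live_complete_precopy_async,
                                     se->idstr, se->instance_id, se->opaque);
    }
}

/* (2) then run the synchronous completions like before */
QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
    if (se->ops && se->ops->save_live_complete_precopy) {
        se->ops->save_live_complete_precopy(f, se->opaque);
    }
}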
--
Peter Xu
* Re: [PATCH v2 01/17] vfio/migration: Add save_{iterate,complete_precopy}_started trace events
2024-09-05 13:08 ` [PATCH v2 01/17] vfio/migration: Add save_{iterate,complete_precopy}_started " Avihai Horon
@ 2024-09-09 18:04 ` Maciej S. Szmigiero
2024-09-11 14:50 ` Avihai Horon
0 siblings, 1 reply; 128+ messages in thread
From: Maciej S. Szmigiero @ 2024-09-09 18:04 UTC (permalink / raw)
To: Avihai Horon
Cc: Alex Williamson, Fabiano Rosas, Peter Xu, Cédric Le Goater,
Eric Blake, Markus Armbruster, Daniel P . Berrangé,
Joao Martins, qemu-devel
On 5.09.2024 15:08, Avihai Horon wrote:
> Hi Maciej,
>
> On 27/08/2024 20:54, Maciej S. Szmigiero wrote:
>>
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> This way both the start and end points of migrating a particular VFIO
>> device are known.
>>
>> Add also a vfio_save_iterate_empty_hit trace event so it is known when
>> there's no more data to send for that device.
>
> Out of curiosity, what are these traces used for?
Just for benchmarking; collecting this data makes it easier to reason about
where possible bottlenecks may be.
>>
>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>> ---
>> hw/vfio/migration.c | 13 +++++++++++++
>> hw/vfio/trace-events | 3 +++
>> include/hw/vfio/vfio-common.h | 3 +++
>> 3 files changed, 19 insertions(+)
>>
>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>> index 262d42a46e58..24679d8c5034 100644
>> --- a/hw/vfio/migration.c
>> +++ b/hw/vfio/migration.c
>> @@ -472,6 +472,9 @@ static int vfio_save_setup(QEMUFile *f, void *opaque, Error **errp)
>> return -ENOMEM;
>> }
>>
>> + migration->save_iterate_run = false;
>> + migration->save_iterate_empty_hit = false;
>> +
>> if (vfio_precopy_supported(vbasedev)) {
>> switch (migration->device_state) {
>> case VFIO_DEVICE_STATE_RUNNING:
>> @@ -605,9 +608,17 @@ static int vfio_save_iterate(QEMUFile *f, void *opaque)
>> VFIOMigration *migration = vbasedev->migration;
>> ssize_t data_size;
>>
>> + if (!migration->save_iterate_run) {
>> + trace_vfio_save_iterate_started(vbasedev->name);
>> + migration->save_iterate_run = true;
>
> Maybe rename save_iterate_run to save_iterate_started so it's aligned with trace_vfio_save_iterate_started and trace_vfio_save_complete_precopy_started?
Will do.
>> + }
>> +
>> data_size = vfio_save_block(f, migration);
>> if (data_size < 0) {
>> return data_size;
>> + } else if (data_size == 0 && !migration->save_iterate_empty_hit) {
>> + trace_vfio_save_iterate_empty_hit(vbasedev->name);
>> + migration->save_iterate_empty_hit = true;
>
> During precopy we could hit empty multiple times. Any reason why only the first time should be traced?
This trace point is supposed to indicate whether the device state
transfer performed while the VM was still running has likely
exhausted the amount of data that can be transferred during
that phase.
In other words, the stopped-time device state transfer then likely
only has to transfer the data which the device does not support
transferring during the live VM phase (plus a small possible
residual accrued since the trace point was hit).
If that trace point was hit then delaying the switchover point
further likely wouldn't help the device transfer less data during
the downtime.
>> }
>>
>> vfio_update_estimated_pending_data(migration, data_size);
>> @@ -633,6 +644,8 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
>> int ret;
>> Error *local_err = NULL;
>>
>> + trace_vfio_save_complete_precopy_started(vbasedev->name);
>> +
>> /* We reach here with device state STOP or STOP_COPY only */
>> ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP_COPY,
>> VFIO_DEVICE_STATE_STOP, &local_err);
>> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
>> index 98bd4dcceadc..013c602f30fa 100644
>> --- a/hw/vfio/trace-events
>> +++ b/hw/vfio/trace-events
>> @@ -159,8 +159,11 @@ vfio_migration_state_notifier(const char *name, int state) " (%s) state %d"
>> vfio_save_block(const char *name, int data_size) " (%s) data_size %d"
>> vfio_save_cleanup(const char *name) " (%s)"
>> vfio_save_complete_precopy(const char *name, int ret) " (%s) ret %d"
>> +vfio_save_complete_precopy_started(const char *name) " (%s)"
>> vfio_save_device_config_state(const char *name) " (%s)"
>> vfio_save_iterate(const char *name, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64
>> +vfio_save_iterate_started(const char *name) " (%s)"
>> +vfio_save_iterate_empty_hit(const char *name) " (%s)"
>
> Let's keep it sorted in alphabetical order.
Ack.
> Thanks.
Thanks,
Maciej
* Re: [PATCH v2 06/17] migration: Add save_live_complete_precopy_{begin,end} handlers
2024-09-05 13:45 ` Avihai Horon
2024-09-09 17:59 ` Peter Xu
@ 2024-09-09 18:05 ` Maciej S. Szmigiero
1 sibling, 0 replies; 128+ messages in thread
From: Maciej S. Szmigiero @ 2024-09-09 18:05 UTC (permalink / raw)
To: Avihai Horon
Cc: Alex Williamson, Peter Xu, Fabiano Rosas, Cédric Le Goater,
Eric Blake, Markus Armbruster, Daniel P . Berrangé,
Joao Martins, qemu-devel
On 5.09.2024 15:45, Avihai Horon wrote:
>
> On 27/08/2024 20:54, Maciej S. Szmigiero wrote:
>>
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> These SaveVMHandlers help device provide its own asynchronous
>> transmission of the remaining data at the end of a precopy phase.
>>
>> In this use case the save_live_complete_precopy_begin handler might
>> be used to mark the stream boundary before proceeding with asynchronous
>> transmission of the remaining data while the
>> save_live_complete_precopy_end handler might be used to mark the
>> stream boundary after performing the asynchronous transmission.
>>
>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>> ---
>> include/migration/register.h | 36 ++++++++++++++++++++++++++++++++++++
>> migration/savevm.c | 35 +++++++++++++++++++++++++++++++++++
>> 2 files changed, 71 insertions(+)
>>
>> diff --git a/include/migration/register.h b/include/migration/register.h
>> index f60e797894e5..9de123252edf 100644
>> --- a/include/migration/register.h
>> +++ b/include/migration/register.h
>> @@ -103,6 +103,42 @@ typedef struct SaveVMHandlers {
>> */
>> int (*save_live_complete_precopy)(QEMUFile *f, void *opaque);
>>
>> + /**
>> + * @save_live_complete_precopy_begin
>> + *
>> + * Called at the end of a precopy phase, before all
>> + * @save_live_complete_precopy handlers and before launching
>> + * all @save_live_complete_precopy_thread threads.
>> + * The handler might, for example, mark the stream boundary before
>> + * proceeding with asynchronous transmission of the remaining data via
>> + * @save_live_complete_precopy_thread.
>> + * When postcopy is enabled, devices that support postcopy will skip this step.
>> + *
>> + * @f: QEMUFile where the handler can synchronously send data before returning
>> + * @idstr: this device section idstr
>> + * @instance_id: this device section instance_id
>> + * @opaque: data pointer passed to register_savevm_live()
>> + *
>> + * Returns zero to indicate success and negative for error
>> + */
>> + int (*save_live_complete_precopy_begin)(QEMUFile *f,
>> + char *idstr, uint32_t instance_id,
>> + void *opaque);
>> + /**
>> + * @save_live_complete_precopy_end
>> + *
>> + * Called at the end of a precopy phase, after @save_live_complete_precopy
>> + * handlers and after all @save_live_complete_precopy_thread threads have
>> + * finished. When postcopy is enabled, devices that support postcopy will
>> + * skip this step.
>> + *
>> + * @f: QEMUFile where the handler can synchronously send data before returning
>> + * @opaque: data pointer passed to register_savevm_live()
>> + *
>> + * Returns zero to indicate success and negative for error
>> + */
>> + int (*save_live_complete_precopy_end)(QEMUFile *f, void *opaque);
>
> Is this handler necessary now that migration core is responsible for the threads and joins them? I don't see VFIO implementing it later on.
It's not 100% necessary for the current implementation but preserved
for future usage and code consistency with the "_begin" handler
(which IS necessary).
> Thanks.
Thanks,
Maciej
* Re: [PATCH v2 07/17] migration: Add qemu_loadvm_load_state_buffer() and its handler
2024-09-05 14:15 ` Avihai Horon
@ 2024-09-09 18:05 ` Maciej S. Szmigiero
0 siblings, 0 replies; 128+ messages in thread
From: Maciej S. Szmigiero @ 2024-09-09 18:05 UTC (permalink / raw)
To: Avihai Horon
Cc: Alex Williamson, Fabiano Rosas, Cédric Le Goater, Peter Xu,
Eric Blake, Markus Armbruster, Daniel P . Berrangé,
Joao Martins, qemu-devel
On 5.09.2024 16:15, Avihai Horon wrote:
>
> On 27/08/2024 20:54, Maciej S. Szmigiero wrote:
>>
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> qemu_loadvm_load_state_buffer() and its load_state_buffer
>> SaveVMHandler allow providing device state buffer to explicitly
>> specified device via its idstr and instance id.
>>
>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>> ---
>> include/migration/register.h | 15 +++++++++++++++
>> migration/savevm.c | 25 +++++++++++++++++++++++++
>> migration/savevm.h | 3 +++
>> 3 files changed, 43 insertions(+)
>>
>> diff --git a/include/migration/register.h b/include/migration/register.h
>> index 9de123252edf..4a578f140713 100644
>> --- a/include/migration/register.h
>> +++ b/include/migration/register.h
>> @@ -263,6 +263,21 @@ typedef struct SaveVMHandlers {
>> */
>> int (*load_state)(QEMUFile *f, void *opaque, int version_id);
>>
>> + /**
>> + * @load_state_buffer
>> + *
>> + * Load device state buffer provided to qemu_loadvm_load_state_buffer().
>> + *
>> + * @opaque: data pointer passed to register_savevm_live()
>> + * @data: the data buffer to load
>> + * @data_size: the data length in buffer
>> + * @errp: pointer to Error*, to store an error if it happens.
>> + *
>> + * Returns zero to indicate success and negative for error
>> + */
>> + int (*load_state_buffer)(void *opaque, char *data, size_t data_size,
>> + Error **errp);
>
> Nit: Maybe rename data to buf and data_size to len to be consistent with qemu_loadvm_load_state_buffer()?
Will do.
>> +
>> /**
>> * @load_setup
>> *
>> diff --git a/migration/savevm.c b/migration/savevm.c
>> index d43acbbf20cf..3fde5ca8c26b 100644
>> --- a/migration/savevm.c
>> +++ b/migration/savevm.c
>> @@ -3101,6 +3101,31 @@ int qemu_loadvm_approve_switchover(void)
>> return migrate_send_rp_switchover_ack(mis);
>> }
>>
>> +int qemu_loadvm_load_state_buffer(const char *idstr, uint32_t instance_id,
>> + char *buf, size_t len, Error **errp)
>> +{
>> + SaveStateEntry *se;
>> +
>> + se = find_se(idstr, instance_id);
>> + if (!se) {
>> + error_setg(errp, "Unknown idstr %s or instance id %u for load state buffer",
>> + idstr, instance_id);
>> + return -1;
>> + }
>> +
>> + if (!se->ops || !se->ops->load_state_buffer) {
>> + error_setg(errp, "idstr %s / instance %u has no load state buffer operation",
>> + idstr, instance_id);
>> + return -1;
>> + }
>> +
>> + if (se->ops->load_state_buffer(se->opaque, buf, len, errp) != 0) {
>> + return -1;
>> + }
>> +
>> + return 0;
>
> Nit: this can be simplified to:
> return se->ops->load_state_buffer(se->opaque, buf, len, errp);
You're right - will change it so.
> Thanks.
Thanks,
Maciej
* Re: [PATCH v2 08/17] migration: Add load_finish handler and associated functions
2024-09-05 15:13 ` Avihai Horon
@ 2024-09-09 18:05 ` Maciej S. Szmigiero
0 siblings, 0 replies; 128+ messages in thread
From: Maciej S. Szmigiero @ 2024-09-09 18:05 UTC (permalink / raw)
To: Avihai Horon
Cc: Alex Williamson, Fabiano Rosas, Peter Xu, Cédric Le Goater,
Eric Blake, Markus Armbruster, Daniel P . Berrangé,
Joao Martins, qemu-devel
On 5.09.2024 17:13, Avihai Horon wrote:
>
> On 27/08/2024 20:54, Maciej S. Szmigiero wrote:
>>
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> load_finish SaveVMHandler allows migration code to poll whether
>> a device-specific asynchronous device state loading operation had finished.
>>
>> In order to avoid calling this handler needlessly the device is supposed
>> to notify the migration code of its possible readiness via a call to
>> qemu_loadvm_load_finish_ready_broadcast() while holding
>> qemu_loadvm_load_finish_ready_lock.
>>
>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>> ---
>> include/migration/register.h | 21 +++++++++++++++
>> migration/migration.c | 6 +++++
>> migration/migration.h | 3 +++
>> migration/savevm.c | 52 ++++++++++++++++++++++++++++++++++++
>> migration/savevm.h | 4 +++
>> 5 files changed, 86 insertions(+)
>>
>> diff --git a/include/migration/register.h b/include/migration/register.h
>> index 4a578f140713..44d8cf5192ae 100644
>> --- a/include/migration/register.h
>> +++ b/include/migration/register.h
>> @@ -278,6 +278,27 @@ typedef struct SaveVMHandlers {
>> int (*load_state_buffer)(void *opaque, char *data, size_t data_size,
>> Error **errp);
>>
>> + /**
>> + * @load_finish
>> + *
>> + * Poll whether all asynchronous device state loading had finished.
>> + * Not called on the load failure path.
>> + *
>> + * Called while holding the qemu_loadvm_load_finish_ready_lock.
>> + *
>> + * If this method signals "not ready" then it might not be called
>> + * again until qemu_loadvm_load_finish_ready_broadcast() is invoked
>> + * while holding qemu_loadvm_load_finish_ready_lock.
>> + *
>> + * @opaque: data pointer passed to register_savevm_live()
>> + * @is_finished: whether the loading had finished (output parameter)
>> + * @errp: pointer to Error*, to store an error if it happens.
>> + *
>> + * Returns zero to indicate success and negative for error
>> + * It's not an error that the loading still hasn't finished.
>> + */
>> + int (*load_finish)(void *opaque, bool *is_finished, Error **errp);
>> +
>> /**
>> * @load_setup
>> *
>> diff --git a/migration/migration.c b/migration/migration.c
>> index 3dea06d57732..d61e7b055e07 100644
>> --- a/migration/migration.c
>> +++ b/migration/migration.c
>> @@ -259,6 +259,9 @@ void migration_object_init(void)
>>
>> current_incoming->exit_on_error = INMIGRATE_DEFAULT_EXIT_ON_ERROR;
>>
>> + qemu_mutex_init(¤t_incoming->load_finish_ready_mutex);
>> + qemu_cond_init(¤t_incoming->load_finish_ready_cond);
>> +
>> migration_object_check(current_migration, &error_fatal);
>>
>> ram_mig_init();
>> @@ -410,6 +413,9 @@ void migration_incoming_state_destroy(void)
>> mis->postcopy_qemufile_dst = NULL;
>> }
>>
>> + qemu_mutex_destroy(&mis->load_finish_ready_mutex);
>> + qemu_cond_destroy(&mis->load_finish_ready_cond);
>> +
>> yank_unregister_instance(MIGRATION_YANK_INSTANCE);
>> }
>>
>> diff --git a/migration/migration.h b/migration/migration.h
>> index 38aa1402d516..4e2443e6c8ec 100644
>> --- a/migration/migration.h
>> +++ b/migration/migration.h
>> @@ -230,6 +230,9 @@ struct MigrationIncomingState {
>>
>> /* Do exit on incoming migration failure */
>> bool exit_on_error;
>> +
>> + QemuCond load_finish_ready_cond;
>> + QemuMutex load_finish_ready_mutex;
>> };
>>
>> MigrationIncomingState *migration_incoming_get_current(void);
>> diff --git a/migration/savevm.c b/migration/savevm.c
>> index 3fde5ca8c26b..33c9200d1e78 100644
>> --- a/migration/savevm.c
>> +++ b/migration/savevm.c
>> @@ -3022,6 +3022,37 @@ int qemu_loadvm_state(QEMUFile *f)
>> return ret;
>> }
>>
>> + qemu_loadvm_load_finish_ready_lock();
>> + while (!ret) { /* Don't call load_finish() handlers on the load failure path */
>> + bool all_ready = true;
>
> Nit: Maybe rename all_ready to all_finished to be consistent with load_finish() terminology? Same for this_ready.
Will rename it accordingly.
>> + SaveStateEntry *se = NULL;
>> +
>> + QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
>> + bool this_ready;
>> +
>> + if (!se->ops || !se->ops->load_finish) {
>> + continue;
>> + }
>> +
>> + ret = se->ops->load_finish(se->opaque, &this_ready, &local_err);
>> + if (ret) {
>> + error_report_err(local_err);
>> +
>> + qemu_loadvm_load_finish_ready_unlock();
>> + return -EINVAL;
>> + } else if (!this_ready) {
>> + all_ready = false;
>> + }
>> + }
>> +
>> + if (all_ready) {
>> + break;
>> + }
>> +
>> + qemu_cond_wait(&mis->load_finish_ready_cond, &mis->load_finish_ready_mutex);
>> + }
>> + qemu_loadvm_load_finish_ready_unlock();
>> +
>> if (ret == 0) {
>> ret = qemu_file_get_error(f);
>> }
>> @@ -3126,6 +3157,27 @@ int qemu_loadvm_load_state_buffer(const char *idstr, uint32_t instance_id,
>> return 0;
>> }
>>
>> +void qemu_loadvm_load_finish_ready_lock(void)
>> +{
>> + MigrationIncomingState *mis = migration_incoming_get_current();
>> +
>> + qemu_mutex_lock(&mis->load_finish_ready_mutex);
>> +}
>> +
>> +void qemu_loadvm_load_finish_ready_unlock(void)
>> +{
>> + MigrationIncomingState *mis = migration_incoming_get_current();
>> +
>> + qemu_mutex_unlock(&mis->load_finish_ready_mutex);
>> +}
>> +
>> +void qemu_loadvm_load_finish_ready_broadcast(void)
>> +{
>> + MigrationIncomingState *mis = migration_incoming_get_current();
>> +
>> + qemu_cond_broadcast(&mis->load_finish_ready_cond);
>
> Do we need a broadcast? Isn't signal enough, as we only have one waiter thread?
Currently, there's just one waiter, but looking at the relatively small
implementation difference between pthread_cond_signal() and
pthread_cond_broadcast() I'm not sure whether it is worth changing it
to _signal() and giving up the possibility of signalling multiple
waiters upfront.
> Thanks.
Thanks,
Maciej
* Re: [PATCH v2 09/17] migration/multifd: Device state transfer support - receive side
2024-09-05 16:47 ` Avihai Horon
@ 2024-09-09 18:05 ` Maciej S. Szmigiero
2024-09-12 8:13 ` Avihai Horon
0 siblings, 1 reply; 128+ messages in thread
From: Maciej S. Szmigiero @ 2024-09-09 18:05 UTC (permalink / raw)
To: Avihai Horon
Cc: Alex Williamson, Fabiano Rosas, Peter Xu, Cédric Le Goater,
Eric Blake, Markus Armbruster, Daniel P . Berrangé,
Joao Martins, qemu-devel
On 5.09.2024 18:47, Avihai Horon wrote:
>
> On 27/08/2024 20:54, Maciej S. Szmigiero wrote:
>>
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> Add a basic support for receiving device state via multifd channels -
>> channels that are shared with RAM transfers.
>>
>> To differentiate between a device state and a RAM packet the packet
>> header is read first.
>>
>> Depending whether MULTIFD_FLAG_DEVICE_STATE flag is present or not in the
>> packet header either device state (MultiFDPacketDeviceState_t) or RAM
>> data (existing MultiFDPacket_t) is then read.
>>
>> The received device state data is provided to
>> qemu_loadvm_load_state_buffer() function for processing in the
>> device's load_state_buffer handler.
>>
>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>> ---
>> migration/multifd.c | 127 +++++++++++++++++++++++++++++++++++++-------
>> migration/multifd.h | 31 ++++++++++-
>> 2 files changed, 138 insertions(+), 20 deletions(-)
>>
>> diff --git a/migration/multifd.c b/migration/multifd.c
>> index b06a9fab500e..d5a8e5a9c9b5 100644
>> --- a/migration/multifd.c
>> +++ b/migration/multifd.c
>> @@ -21,6 +21,7 @@
>> #include "file.h"
>> #include "migration.h"
>> #include "migration-stats.h"
>> +#include "savevm.h"
>> #include "socket.h"
>> #include "tls.h"
>> #include "qemu-file.h"
>> @@ -209,10 +210,10 @@ void multifd_send_fill_packet(MultiFDSendParams *p)
>>
>> memset(packet, 0, p->packet_len);
>>
>> - packet->magic = cpu_to_be32(MULTIFD_MAGIC);
>> - packet->version = cpu_to_be32(MULTIFD_VERSION);
>> + packet->hdr.magic = cpu_to_be32(MULTIFD_MAGIC);
>> + packet->hdr.version = cpu_to_be32(MULTIFD_VERSION);
>>
>> - packet->flags = cpu_to_be32(p->flags);
>> + packet->hdr.flags = cpu_to_be32(p->flags);
>> packet->next_packet_size = cpu_to_be32(p->next_packet_size);
>>
>> packet_num = qatomic_fetch_inc(&multifd_send_state->packet_num);
>> @@ -228,31 +229,49 @@ void multifd_send_fill_packet(MultiFDSendParams *p)
>> p->flags, p->next_packet_size);
>> }
>>
>> -static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error **errp)
>> +static int multifd_recv_unfill_packet_header(MultiFDRecvParams *p,
>> + MultiFDPacketHdr_t *hdr,
>> + Error **errp)
>> {
>> - MultiFDPacket_t *packet = p->packet;
>> - int ret = 0;
>> -
>> - packet->magic = be32_to_cpu(packet->magic);
>> - if (packet->magic != MULTIFD_MAGIC) {
>> + hdr->magic = be32_to_cpu(hdr->magic);
>> + if (hdr->magic != MULTIFD_MAGIC) {
>> error_setg(errp, "multifd: received packet "
>> "magic %x and expected magic %x",
>> - packet->magic, MULTIFD_MAGIC);
>> + hdr->magic, MULTIFD_MAGIC);
>> return -1;
>> }
>>
>> - packet->version = be32_to_cpu(packet->version);
>> - if (packet->version != MULTIFD_VERSION) {
>> + hdr->version = be32_to_cpu(hdr->version);
>> + if (hdr->version != MULTIFD_VERSION) {
>> error_setg(errp, "multifd: received packet "
>> "version %u and expected version %u",
>> - packet->version, MULTIFD_VERSION);
>> + hdr->version, MULTIFD_VERSION);
>> return -1;
>> }
>>
>> - p->flags = be32_to_cpu(packet->flags);
>> + p->flags = be32_to_cpu(hdr->flags);
>> +
>> + return 0;
>> +}
>> +
>> +static int multifd_recv_unfill_packet_device_state(MultiFDRecvParams *p,
>> + Error **errp)
>> +{
>> + MultiFDPacketDeviceState_t *packet = p->packet_dev_state;
>> +
>> + packet->instance_id = be32_to_cpu(packet->instance_id);
>> + p->next_packet_size = be32_to_cpu(packet->next_packet_size);
>> +
>> + return 0;
>> +}
>> +
>> +static int multifd_recv_unfill_packet_ram(MultiFDRecvParams *p, Error **errp)
>> +{
>> + MultiFDPacket_t *packet = p->packet;
>> + int ret = 0;
>> +
>> p->next_packet_size = be32_to_cpu(packet->next_packet_size);
>> p->packet_num = be64_to_cpu(packet->packet_num);
>> - p->packets_recved++;
>>
>> if (!(p->flags & MULTIFD_FLAG_SYNC)) {
>> ret = multifd_ram_unfill_packet(p, errp);
>> @@ -264,6 +283,19 @@ static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error **errp)
>> return ret;
>> }
>>
>> +static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error **errp)
>> +{
>> + p->packets_recved++;
>> +
>> + if (p->flags & MULTIFD_FLAG_DEVICE_STATE) {
>> + return multifd_recv_unfill_packet_device_state(p, errp);
>> + } else {
>> + return multifd_recv_unfill_packet_ram(p, errp);
>> + }
>> +
>> + g_assert_not_reached();
>
> We can drop the assert and the "else":
> if (p->flags & MULTIFD_FLAG_DEVICE_STATE) {
>     return multifd_recv_unfill_packet_device_state(p, errp);
> }
>
> return multifd_recv_unfill_packet_ram(p, errp);
Ack.
>> +}
>> +
>> static bool multifd_send_should_exit(void)
>> {
>> return qatomic_read(&multifd_send_state->exiting);
>> diff --git a/migration/multifd.h b/migration/multifd.h
>> index a3e35196d179..a8f3e4838c01 100644
>> --- a/migration/multifd.h
>> +++ b/migration/multifd.h
>> @@ -45,6 +45,12 @@ MultiFDRecvData *multifd_get_recv_data(void);
>> #define MULTIFD_FLAG_QPL (4 << 1)
>> #define MULTIFD_FLAG_UADK (8 << 1)
>>
>> +/*
>> + * If set it means that this packet contains device state
>> + * (MultiFDPacketDeviceState_t), not RAM data (MultiFDPacket_t).
>> + */
>> +#define MULTIFD_FLAG_DEVICE_STATE (1 << 4)
>> +
>> /* This value needs to be a multiple of qemu_target_page_size() */
>> #define MULTIFD_PACKET_SIZE (512 * 1024)
>>
>> @@ -52,6 +58,11 @@ typedef struct {
>> uint32_t magic;
>> uint32_t version;
>> uint32_t flags;
>> +} __attribute__((packed)) MultiFDPacketHdr_t;
>
> Maybe split this patch into two: one that adds the packet header concept and another that adds the new device packet?
Can do.
>> +
>> +typedef struct {
>> + MultiFDPacketHdr_t hdr;
>> +
>> /* maximum number of allocated pages */
>> uint32_t pages_alloc;
>> /* non zero pages */
>> @@ -72,6 +83,16 @@ typedef struct {
>> uint64_t offset[];
>> } __attribute__((packed)) MultiFDPacket_t;
>>
>> +typedef struct {
>> + MultiFDPacketHdr_t hdr;
>> +
>> + char idstr[256] QEMU_NONSTRING;
>
> idstr should be null terminated, or am I missing something?
There's no need to always NULL-terminate a constant-size field,
since the strncpy() already stops at the field size, so we can
gain another byte for actual string use this way.
RAM block idstr also uses the same "trick":
> void multifd_ram_fill_packet(MultiFDSendParams *p):
> strncpy(packet->ramblock, pages->block->idstr, 256);
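For illustration: a receiver that needs a C string out of such a
QEMU_NONSTRING field can re-terminate it locally, e.g.:

char idstr[257];

memcpy(idstr, packet->idstr, sizeof(packet->idstr));
idstr[256] = '\0'; /* the wire field itself need not be NUL-terminated */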
> Thanks.
Thanks,
Maciej
* Re: [PATCH v2 15/17] vfio/migration: Multifd device state transfer support - receive side
2024-09-09 8:55 ` Avihai Horon
@ 2024-09-09 18:06 ` Maciej S. Szmigiero
2024-09-12 8:20 ` Avihai Horon
0 siblings, 1 reply; 128+ messages in thread
From: Maciej S. Szmigiero @ 2024-09-09 18:06 UTC (permalink / raw)
To: Avihai Horon, Peter Xu
Cc: Alex Williamson, Cédric Le Goater, Fabiano Rosas, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Joao Martins,
qemu-devel
On 9.09.2024 10:55, Avihai Horon wrote:
>
> On 27/08/2024 20:54, Maciej S. Szmigiero wrote:
>>
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> The multifd received data needs to be reassembled since device state
>> packets sent via different multifd channels can arrive out-of-order.
>>
>> Therefore, each VFIO device state packet carries a header indicating
>> its position in the stream.
>>
>> The last such VFIO device state packet should have
>> VFIO_DEVICE_STATE_CONFIG_STATE flag set and carry the device config
>> state.
>>
>> Since it's important to finish loading device state transferred via
>> the main migration channel (via save_live_iterate handler) before
>> starting loading the data asynchronously transferred via multifd
>> a new VFIO_MIG_FLAG_DEV_DATA_STATE_COMPLETE flag is introduced to
>> mark the end of the main migration channel data.
>>
>> The device state loading process waits until that flag is seen before
>> commencing loading of the multifd-transferred device state.
>>
>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>> ---
>> hw/vfio/migration.c | 338 +++++++++++++++++++++++++++++++++-
>> hw/vfio/pci.c | 2 +
>> hw/vfio/trace-events | 9 +-
>> include/hw/vfio/vfio-common.h | 17 ++
>> 4 files changed, 362 insertions(+), 4 deletions(-)
>>
>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>> index 24679d8c5034..57c1542528dc 100644
>> --- a/hw/vfio/migration.c
>> +++ b/hw/vfio/migration.c
>> @@ -15,6 +15,7 @@
>> #include <linux/vfio.h>
>> #include <sys/ioctl.h>
>>
>> +#include "io/channel-buffer.h"
>> #include "sysemu/runstate.h"
>> #include "hw/vfio/vfio-common.h"
>> #include "migration/misc.h"
>> @@ -47,6 +48,7 @@
>> #define VFIO_MIG_FLAG_DEV_SETUP_STATE (0xffffffffef100003ULL)
>> #define VFIO_MIG_FLAG_DEV_DATA_STATE (0xffffffffef100004ULL)
>> #define VFIO_MIG_FLAG_DEV_INIT_DATA_SENT (0xffffffffef100005ULL)
>> +#define VFIO_MIG_FLAG_DEV_DATA_STATE_COMPLETE (0xffffffffef100006ULL)
>>
>> /*
>> * This is an arbitrary size based on migration of mlx5 devices, where typically
>> @@ -55,6 +57,15 @@
>> */
>> #define VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE (1 * MiB)
>>
>> +#define VFIO_DEVICE_STATE_CONFIG_STATE (1)
>> +
>> +typedef struct VFIODeviceStatePacket {
>> + uint32_t version;
>> + uint32_t idx;
>> + uint32_t flags;
>> + uint8_t data[0];
>> +} QEMU_PACKED VFIODeviceStatePacket;
>> +
>> static int64_t bytes_transferred;
>>
>> static const char *mig_state_to_str(enum vfio_device_mig_state state)
>> @@ -254,6 +265,188 @@ static int vfio_load_buffer(QEMUFile *f, VFIODevice *vbasedev,
>> return ret;
>> }
>>
>> +typedef struct LoadedBuffer {
>> + bool is_present;
>> + char *data;
>> + size_t len;
>> +} LoadedBuffer;
>
> Maybe rename LoadedBuffer to a more specific name, like VFIOStateBuffer?
Will do.
> I also feel like LoadedBuffer deserves a separate commit.
> Plus, I think it will be good to add a full API for this, that wraps the g_array_* calls and holds the extra members.
> E.g, VFIOStateBuffer, VFIOStateArray (will hold load_buf_idx, load_buf_idx_last, etc.), vfio_state_array_destroy(), vfio_state_array_alloc(), vfio_state_array_get(), etc...
> IMHO, this will make it clearer.
Will think about wrapping the GArray accesses in separate methods;
however, wrapping a single-line GArray call in a separate function
would normally seem a bit excessive.
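(For the record, a thin wrapper of the kind suggested above might look
roughly like this — the struct and function names are merely
illustrative:)

typedef struct VFIOStateBuffers {
    GArray *array; /* of VFIOStateBuffer */
} VFIOStateBuffers;

static VFIOStateBuffer *vfio_state_buffers_at(VFIOStateBuffers *bufs,
                                              unsigned int idx)
{
    return &g_array_index(bufs->array, VFIOStateBuffer, idx);
}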
>> +
>> +static void loaded_buffer_clear(gpointer data)
>> +{
>> + LoadedBuffer *lb = data;
>> +
>> + if (!lb->is_present) {
>> + return;
>> + }
>> +
>> + g_clear_pointer(&lb->data, g_free);
>> + lb->is_present = false;
>> +}
>> +
>> +static int vfio_load_state_buffer(void *opaque, char *data, size_t data_size,
>> + Error **errp)
>> +{
>> + VFIODevice *vbasedev = opaque;
>> + VFIOMigration *migration = vbasedev->migration;
>> + VFIODeviceStatePacket *packet = (VFIODeviceStatePacket *)data;
>> + QEMU_LOCK_GUARD(&migration->load_bufs_mutex);
>
> Move lock to where it's needed? I.e., after trace_vfio_load_state_device_buffer_incoming(vbasedev->name, packet->idx)
It's a declaration of a new variable, so I guess it should always be
at the top of the code block per the kernel / QEMU code style?
Also, the checks below are very unlikely to fail, and even if they do,
I doubt a failed migration due to bit stream corruption is a scenario
worth optimizing run-time performance for.
>> + LoadedBuffer *lb;
>> +
>> + if (data_size < sizeof(*packet)) {
>> + error_setg(errp, "packet too short at %zu (min is %zu)",
>> + data_size, sizeof(*packet));
>> + return -1;
>> + }
>> +
>> + if (packet->version != 0) {
>> + error_setg(errp, "packet has unknown version %" PRIu32,
>> + packet->version);
>> + return -1;
>> + }
>> +
>> + if (packet->idx == UINT32_MAX) {
>> + error_setg(errp, "packet has too high idx %" PRIu32,
>> + packet->idx);
>> + return -1;
>> + }
>> +
>> + trace_vfio_load_state_device_buffer_incoming(vbasedev->name, packet->idx);
>> +
>> + /* config state packet should be the last one in the stream */
>> + if (packet->flags & VFIO_DEVICE_STATE_CONFIG_STATE) {
>> + migration->load_buf_idx_last = packet->idx;
>> + }
>> +
>> + assert(migration->load_bufs);
>> + if (packet->idx >= migration->load_bufs->len) {
>> + g_array_set_size(migration->load_bufs, packet->idx + 1);
>> + }
>> +
>> + lb = &g_array_index(migration->load_bufs, typeof(*lb), packet->idx);
>> + if (lb->is_present) {
>> + error_setg(errp, "state buffer %" PRIu32 " already filled", packet->idx);
>> + return -1;
>> + }
>> +
>> + assert(packet->idx >= migration->load_buf_idx);
>> +
>> + migration->load_buf_queued_pending_buffers++;
>> + if (migration->load_buf_queued_pending_buffers >
>> + vbasedev->migration_max_queued_buffers) {
>> + error_setg(errp,
>> + "queuing state buffer %" PRIu32 " would exceed the max of %" PRIu64,
>> + packet->idx, vbasedev->migration_max_queued_buffers);
>> + return -1;
>> + }
>
> I feel like max_queued_buffers accounting/checking/configuration should be split to a separate patch that will come after this patch.
> Also, should we count bytes instead of buffers? Current buffer size is 1MB but this could change, and the normal user should not care or know what is the buffer size.
> So maybe rename to migration_max_pending_bytes or such?
Since it's Peter who asked for this limit to be introduced in the first
place, I would like to ask him what his preference is here.
@Peter: max queued buffers or bytes?
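(For reference, a byte-based variant of the quoted check could look
roughly like the following — the pending-bytes counter and the
migration_max_pending_bytes property are hypothetical names:)

migration->load_buf_pending_bytes += data_size;
if (migration->load_buf_pending_bytes >
    vbasedev->migration_max_pending_bytes) {
    error_setg(errp,
               "queuing state buffer %" PRIu32
               " would exceed the max of %" PRIu64 " bytes",
               packet->idx, vbasedev->migration_max_pending_bytes);
    return -1;
}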
>> +
>> + lb->data = g_memdup2(&packet->data, data_size - sizeof(*packet));
>> + lb->len = data_size - sizeof(*packet);
>> + lb->is_present = true;
>> +
>> + qemu_cond_broadcast(&migration->load_bufs_buffer_ready_cond);
>
> There is only one thread waiting, shouldn't signal be enough?
Will change this to _signal() since it clearly doesn't
make sense to have a future-proof API here - it's an
implementation detail.
>> +
>> + return 0;
>> +}
>> +
>> +static void *vfio_load_bufs_thread(void *opaque)
>> +{
>> + VFIODevice *vbasedev = opaque;
>> + VFIOMigration *migration = vbasedev->migration;
>> + Error **errp = &migration->load_bufs_thread_errp;
>> + g_autoptr(QemuLockable) locker = qemu_lockable_auto_lock(
>> + QEMU_MAKE_LOCKABLE(&migration->load_bufs_mutex));
>
> Any special reason to use QemuLockable?
I prefer automatic lock management (RAII-like) for the same reason
I prefer automatic memory management: it makes it much harder to
forget to unlock the lock (or free memory) in some error path.
That's the reason these primitives were introduced in QEMU in the
first place (apparently modeled after their Glib equivalents) and
why they are (slowly) being introduced into the Linux kernel too.
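(To illustrate the point — a minimal sketch using the lock guard from
include/qemu/lockable.h; do_step() is a hypothetical stand-in for the
actual work:)

static int process_under_lock(QemuMutex *lock)
{
    QEMU_LOCK_GUARD(lock); /* released automatically on every return path */

    if (!do_step()) {
        return -1; /* no unlock call to forget here */
    }

    return 0;
}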
>> + LoadedBuffer *lb;
>> +
>> + while (!migration->load_bufs_device_ready &&
>> + !migration->load_bufs_thread_want_exit) {
>> + qemu_cond_wait(&migration->load_bufs_device_ready_cond, &migration->load_bufs_mutex);
>> + }
>> +
>> + while (!migration->load_bufs_thread_want_exit) {
>> + bool starved;
>> + ssize_t ret;
>> +
>> + assert(migration->load_buf_idx <= migration->load_buf_idx_last);
>> +
>> + if (migration->load_buf_idx >= migration->load_bufs->len) {
>> + assert(migration->load_buf_idx == migration->load_bufs->len);
>> + starved = true;
>> + } else {
>> + lb = &g_array_index(migration->load_bufs, typeof(*lb), migration->load_buf_idx);
>> + starved = !lb->is_present;
>> + }
>> +
>> + if (starved) {
>> + trace_vfio_load_state_device_buffer_starved(vbasedev->name, migration->load_buf_idx);
>> + qemu_cond_wait(&migration->load_bufs_buffer_ready_cond, &migration->load_bufs_mutex);
>> + continue;
>> + }
>> +
>> + if (migration->load_buf_idx == migration->load_buf_idx_last) {
>> + break;
>> + }
>> +
>> + if (migration->load_buf_idx == 0) {
>> + trace_vfio_load_state_device_buffer_start(vbasedev->name);
>> + }
>> +
>> + if (lb->len) {
>> + g_autofree char *buf = NULL;
>> + size_t buf_len;
>> + int errno_save;
>> +
>> + trace_vfio_load_state_device_buffer_load_start(vbasedev->name,
>> + migration->load_buf_idx);
>> +
>> + /* lb might become re-allocated when we drop the lock */
>> + buf = g_steal_pointer(&lb->data);
>> + buf_len = lb->len;
>> +
>> + /* Loading data to the device takes a while, drop the lock during this process */
>> + qemu_mutex_unlock(&migration->load_bufs_mutex);
>> + ret = write(migration->data_fd, buf, buf_len);
>> + errno_save = errno;
>> + qemu_mutex_lock(&migration->load_bufs_mutex);
>> +
>> + if (ret < 0) {
>> + error_setg(errp, "write to state buffer %" PRIu32 " failed with %d",
>> + migration->load_buf_idx, errno_save);
>> + break;
>> + } else if (ret < buf_len) {
>> + error_setg(errp, "write to state buffer %" PRIu32 " incomplete %zd / %zu",
>> + migration->load_buf_idx, ret, buf_len);
>> + break;
>> + }
>> +
>> + trace_vfio_load_state_device_buffer_load_end(vbasedev->name,
>> + migration->load_buf_idx);
>> + }
>> +
>> + assert(migration->load_buf_queued_pending_buffers > 0);
>> + migration->load_buf_queued_pending_buffers--;
>> +
>> + if (migration->load_buf_idx == migration->load_buf_idx_last - 1) {
>> + trace_vfio_load_state_device_buffer_end(vbasedev->name);
>> + }
>> +
>> + migration->load_buf_idx++;
>> + }
>> +
>> + if (migration->load_bufs_thread_want_exit &&
>> + !*errp) {
>> + error_setg(errp, "load bufs thread asked to quit");
>> + }
>> +
>> + g_clear_pointer(&locker, qemu_lockable_auto_unlock);
>> +
>> + qemu_loadvm_load_finish_ready_lock();
>> + migration->load_bufs_thread_finished = true;
>> + qemu_loadvm_load_finish_ready_broadcast();
>> + qemu_loadvm_load_finish_ready_unlock();
>> +
>> + return NULL;
>> +}
>> +
>> static int vfio_save_device_config_state(QEMUFile *f, void *opaque,
>> Error **errp)
>> {
>> @@ -285,6 +478,8 @@ static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
>> VFIODevice *vbasedev = opaque;
>> uint64_t data;
>>
>> + trace_vfio_load_device_config_state_start(vbasedev->name);
>
> Maybe split this and below trace_vfio_load_device_config_state_end to a separate patch?
I guess you mean to add these trace points in a separate patch?
Can do.
>> +
>> if (vbasedev->ops && vbasedev->ops->vfio_load_config) {
>> int ret;
>>
>> @@ -303,7 +498,7 @@ static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
>> return -EINVAL;
>> }
>>
>> - trace_vfio_load_device_config_state(vbasedev->name);
>> + trace_vfio_load_device_config_state_end(vbasedev->name);
>> return qemu_file_get_error(f);
>> }
>>
>> @@ -687,16 +882,70 @@ static void vfio_save_state(QEMUFile *f, void *opaque)
>> static int vfio_load_setup(QEMUFile *f, void *opaque, Error **errp)
>> {
>> VFIODevice *vbasedev = opaque;
>> + VFIOMigration *migration = vbasedev->migration;
>> + int ret;
>> +
>> + ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RESUMING,
>> + vbasedev->migration->device_state, errp);
>> + if (ret) {
>> + return ret;
>> + }
>> +
>> + assert(!migration->load_bufs);
>> + migration->load_bufs = g_array_new(FALSE, TRUE, sizeof(LoadedBuffer));
>> + g_array_set_clear_func(migration->load_bufs, loaded_buffer_clear);
>> +
>> + qemu_mutex_init(&migration->load_bufs_mutex);
>> +
>> + migration->load_bufs_device_ready = false;
>> + qemu_cond_init(&migration->load_bufs_device_ready_cond);
>> +
>> + migration->load_buf_idx = 0;
>> + migration->load_buf_idx_last = UINT32_MAX;
>> + migration->load_buf_queued_pending_buffers = 0;
>> + qemu_cond_init(&migration->load_bufs_buffer_ready_cond);
>> +
>> + migration->config_state_loaded_to_dev = false;
>> +
>> + assert(!migration->load_bufs_thread_started);
>
> Maybe do all these allocations (and de-allocations) only if multifd device state is supported and enabled?
> Extracting this to its own function could also be good.
Sure, will try to avoid unnecessarily allocating multifd device state
related things if this functionality is unavailable anyway.
>>
>> - return vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RESUMING,
>> - vbasedev->migration->device_state, errp);
>> + migration->load_bufs_thread_finished = false;
>> + migration->load_bufs_thread_want_exit = false;
>> + qemu_thread_create(&migration->load_bufs_thread, "vfio-load-bufs",
>> + vfio_load_bufs_thread, opaque, QEMU_THREAD_JOINABLE);
>
> The device state save threads are manged by migration core thread pool. Don't we want to apply the same thread management scheme for the load flow as well?
I think that (in contrast with the device state saving threads)
the buffer loading / reordering thread is an implementation detail
of the VFIO driver, so I don't think it really makes sense for multifd code
to manage it.
>> +
>> + migration->load_bufs_thread_started = true;
>> +
>> + return 0;
>> }
>>
>> static int vfio_load_cleanup(void *opaque)
>> {
>> VFIODevice *vbasedev = opaque;
>> + VFIOMigration *migration = vbasedev->migration;
>> +
>> + if (migration->load_bufs_thread_started) {
>> + qemu_mutex_lock(&migration->load_bufs_mutex);
>> + migration->load_bufs_thread_want_exit = true;
>> + qemu_mutex_unlock(&migration->load_bufs_mutex);
>> +
>> + qemu_cond_broadcast(&migration->load_bufs_device_ready_cond);
>> + qemu_cond_broadcast(&migration->load_bufs_buffer_ready_cond);
>> +
>> + qemu_thread_join(&migration->load_bufs_thread);
>> +
>> + assert(migration->load_bufs_thread_finished);
>> +
>> + migration->load_bufs_thread_started = false;
>> + }
>>
>> vfio_migration_cleanup(vbasedev);
>> +
>> + g_clear_pointer(&migration->load_bufs, g_array_unref);
>> + qemu_cond_destroy(&migration->load_bufs_buffer_ready_cond);
>> + qemu_cond_destroy(&migration->load_bufs_device_ready_cond);
>> + qemu_mutex_destroy(&migration->load_bufs_mutex);
>> +
>> trace_vfio_load_cleanup(vbasedev->name);
>>
>> return 0;
>> @@ -705,6 +954,7 @@ static int vfio_load_cleanup(void *opaque)
>> static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
>> {
>> VFIODevice *vbasedev = opaque;
>> + VFIOMigration *migration = vbasedev->migration;
>> int ret = 0;
>> uint64_t data;
>>
>> @@ -716,6 +966,7 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
>> switch (data) {
>> case VFIO_MIG_FLAG_DEV_CONFIG_STATE:
>> {
>> + migration->config_state_loaded_to_dev = true;
>> return vfio_load_device_config_state(f, opaque);
>> }
>> case VFIO_MIG_FLAG_DEV_SETUP_STATE:
>> @@ -742,6 +993,15 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
>> }
>> break;
>> }
>> + case VFIO_MIG_FLAG_DEV_DATA_STATE_COMPLETE:
>> + {
>> + QEMU_LOCK_GUARD(&migration->load_bufs_mutex);
>> +
>> + migration->load_bufs_device_ready = true;
>> + qemu_cond_broadcast(&migration->load_bufs_device_ready_cond);
>> +
>> + break;
>> + }
>> case VFIO_MIG_FLAG_DEV_INIT_DATA_SENT:
>> {
>> if (!vfio_precopy_supported(vbasedev) ||
>> @@ -774,6 +1034,76 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
>> return ret;
>> }
>>
>> +static int vfio_load_finish(void *opaque, bool *is_finished, Error **errp)
>> +{
>> + VFIODevice *vbasedev = opaque;
>> + VFIOMigration *migration = vbasedev->migration;
>> + g_autoptr(QemuLockable) locker = NULL;
>
> Any special reason to use QemuLockable?
The same reason as for the automatic locking above.
> Thanks.
>
Thanks,
Maciej
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [PATCH v2 17/17] vfio/migration: Multifd device state transfer support - send side
2024-09-09 11:41 ` Avihai Horon
@ 2024-09-09 18:07 ` Maciej S. Szmigiero
2024-09-12 8:26 ` Avihai Horon
0 siblings, 1 reply; 128+ messages in thread
From: Maciej S. Szmigiero @ 2024-09-09 18:07 UTC (permalink / raw)
To: Avihai Horon
Cc: Alex Williamson, Fabiano Rosas, Peter Xu, Cédric Le Goater,
Eric Blake, Markus Armbruster, Daniel P . Berrangé,
Joao Martins, qemu-devel
On 9.09.2024 13:41, Avihai Horon wrote:
>
> On 27/08/2024 20:54, Maciej S. Szmigiero wrote:
>> External email: Use caution opening links or attachments
>>
>>
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> Implement the multifd device state transfer via additional per-device
>> thread inside save_live_complete_precopy_thread handler.
>>
>> Switch between doing the data transfer in the new handler and doing it
>> in the old save_state handler depending on the
>> x-migration-multifd-transfer device property value.
>>
>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>> ---
>> hw/vfio/migration.c | 169 ++++++++++++++++++++++++++++++++++
>> hw/vfio/trace-events | 2 +
>> include/hw/vfio/vfio-common.h | 1 +
>> 3 files changed, 172 insertions(+)
>>
>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>> index 57c1542528dc..67996aa2df8b 100644
>> --- a/hw/vfio/migration.c
>> +++ b/hw/vfio/migration.c
>> @@ -655,6 +655,16 @@ static int vfio_save_setup(QEMUFile *f, void *opaque, Error **errp)
>> uint64_t stop_copy_size = VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE;
>> int ret;
>>
>> + /* Make a copy of this setting at the start in case it is changed mid-migration */
>> + migration->multifd_transfer = vbasedev->migration_multifd_transfer;
>
> Should VFIO multifd be controlled by main migration multifd capability, and let the per VFIO device migration_multifd_transfer property be immutable and enabled by default?
> Then we would have a single point of configuration (and an extra one per VFIO device just to disable for backward compatibility).
> Unless there are other benefits to have this property configurable?
We want the multifd device state transfer property to be configurable
per-device in case we add, in the future, another device type (besides
VFIO) that supports multifd device state transfer.
In this case, we might need to enable multifd device state transfer just
for VFIO devices, but not for this new device type, when we are migrating
to a QEMU target that supports just the VFIO multifd device state transfer.
TBH, I'm not opposed to adding an additional global multifd device state
transfer switch (if we keep the per-device ones too) but I am not sure
what value it adds.
>> +
>> + if (migration->multifd_transfer && !migration_has_device_state_support()) {
>> + error_setg(errp,
>> + "%s: Multifd device transfer requested but unsupported in the current config",
>> + vbasedev->name);
>> + return -EINVAL;
>> + }
>> +
>> qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
>>
>> vfio_query_stop_copy_size(vbasedev, &stop_copy_size);
>> @@ -835,10 +845,20 @@ static int vfio_save_iterate(QEMUFile *f, void *opaque)
>> static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
>> {
>> VFIODevice *vbasedev = opaque;
>> + VFIOMigration *migration = vbasedev->migration;
>> ssize_t data_size;
>> int ret;
>> Error *local_err = NULL;
>>
>> + if (migration->multifd_transfer) {
>> + /*
>> + * Emit dummy NOP data, vfio_save_complete_precopy_thread()
>> + * does the actual transfer.
>> + */
>> + qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>
> There are three places where we send this dummy end of state, maybe worth extracting it to a helper? I.e., vfio_send_end_of_state() and then document there the rationale.
I'm not totally against it, but it would be wrapping just a single line
of code in a separate function.
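(For completeness, the suggested helper would be essentially a
one-liner:)

static void vfio_send_end_of_state(QEMUFile *f)
{
    /* Mark the end of this device's data in the main migration stream */
    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
}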
>> + return 0;
>> + }
>> +
>> trace_vfio_save_complete_precopy_started(vbasedev->name);
>>
>> /* We reach here with device state STOP or STOP_COPY only */
>> @@ -864,12 +884,159 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
>> return ret;
>> }
>>
>> +static int vfio_save_complete_precopy_async_thread_config_state(VFIODevice *vbasedev,
>> + char *idstr,
>> + uint32_t instance_id,
>> + uint32_t idx)
>> +{
>> + g_autoptr(QIOChannelBuffer) bioc = NULL;
>> + QEMUFile *f = NULL;
>> + int ret;
>> + g_autofree VFIODeviceStatePacket *packet = NULL;
>> + size_t packet_len;
>> +
>> + bioc = qio_channel_buffer_new(0);
>> + qio_channel_set_name(QIO_CHANNEL(bioc), "vfio-device-config-save");
>> +
>> + f = qemu_file_new_output(QIO_CHANNEL(bioc));
>> +
>> + ret = vfio_save_device_config_state(f, vbasedev, NULL);
>> + if (ret) {
>> + return ret;
>
> Need to close f in this case.
Right - by the way, that's a good example of why RAII
helps avoid such mistakes.
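(In other words, the fix is to route the early error through the common
close path, roughly:)

ret = vfio_save_device_config_state(f, vbasedev, NULL);
if (ret) {
    goto ret_close_file; /* previously returned directly, leaking f */
}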
>> + }
>> +
>> + ret = qemu_fflush(f);
>> + if (ret) {
>> + goto ret_close_file;
>> + }
>> +
>> + packet_len = sizeof(*packet) + bioc->usage;
>> + packet = g_malloc0(packet_len);
>> + packet->idx = idx;
>> + packet->flags = VFIO_DEVICE_STATE_CONFIG_STATE;
>> + memcpy(&packet->data, bioc->data, bioc->usage);
>> +
>> + if (!multifd_queue_device_state(idstr, instance_id,
>> + (char *)packet, packet_len)) {
>> + ret = -1;
>
> goto ret_close_file?
Right, it would be better not to increment the counter in this case.
>> + }
>> +
>> + bytes_transferred += packet_len;
>
> bytes_transferred is a global variable. Now that we access it from multiple threads it should be protected.
Right, this stat needs some concurrent access protection.
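(One possible approach, assuming the counter stays a plain int64_t:
make every access go through the qatomic helpers, for example:)

/* writer side, callable from any of the save threads */
qatomic_add(&bytes_transferred, packet_len);

/* reader side */
int64_t vfio_mig_bytes_transferred(void)
{
    return qatomic_read(&bytes_transferred);
}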
> Note that now the VFIO device data is reported also in multifd stats (if I am not mistaken), is this the behavior we want? Maybe we should enhance multifd stats to distinguish between RAM data and device data?
Multifd stats report the total size of data transferred via multifd,
so they should include device state too.
It may make sense to add a dedicated device state transfer counter
at some point, though.
>> +
>> +ret_close_file:
>
> Rename to "out" as we only have one exit point?
>
>> + g_clear_pointer(&f, qemu_fclose);
>
> f is a local variable, wouldn't qemu_fclose(f) be enough here?
Sure, but why leave a dangling pointer?
Currently it is obviously a NOP (probably deleted by dead-store
elimination anyway), but the code might get refactored at some point,
and I think it's good practice to always NULL pointers after freeing
them where possible, to be on the safe side.
>> + return ret;
>> +}
>> +
>> +static int vfio_save_complete_precopy_thread(char *idstr,
>> + uint32_t instance_id,
>> + bool *abort_flag,
>> + void *opaque)
>> +{
>> + VFIODevice *vbasedev = opaque;
>> + VFIOMigration *migration = vbasedev->migration;
>> + int ret;
>> + g_autofree VFIODeviceStatePacket *packet = NULL;
>> + uint32_t idx;
>> +
>> + if (!migration->multifd_transfer) {
>> + /* Nothing to do, vfio_save_complete_precopy() does the transfer. */
>> + return 0;
>> + }
>> +
>> + trace_vfio_save_complete_precopy_thread_started(vbasedev->name,
>> + idstr, instance_id);
>> +
>> + /* We reach here with device state STOP or STOP_COPY only */
>> + ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP_COPY,
>> + VFIO_DEVICE_STATE_STOP, NULL);
>> + if (ret) {
>> + goto ret_finish;
>> + }
>> +
>> + packet = g_malloc0(sizeof(*packet) + migration->data_buffer_size);
>> +
>> + for (idx = 0; ; idx++) {
>> + ssize_t data_size;
>> + size_t packet_size;
>> +
>> + if (qatomic_read(abort_flag)) {
>> + ret = -ECANCELED;
>> + goto ret_finish;
>> + }
>> +
>> + data_size = read(migration->data_fd, &packet->data,
>> + migration->data_buffer_size);
>> + if (data_size < 0) {
>> + if (errno != ENOMSG) {
>> + ret = -errno;
>> + goto ret_finish;
>> + }
>> +
>> + /*
>> + * Pre-copy emptied all the device state for now. For more information,
>> + * please refer to the Linux kernel VFIO uAPI.
>> + */
>> + data_size = 0;
>
> According to VFIO uAPI, ENOMSG can only be returned in the PRE_COPY device states.
> Here we are in STOP_COPY, so we can drop the ENOMSG handling.
Will drop this ENOMSG handling.
> Thanks.
Thanks,
Maciej
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [PATCH v2 06/17] migration: Add save_live_complete_precopy_{begin,end} handlers
2024-09-09 17:59 ` Peter Xu
@ 2024-09-09 18:32 ` Maciej S. Szmigiero
2024-09-09 19:08 ` Peter Xu
0 siblings, 1 reply; 128+ messages in thread
From: Maciej S. Szmigiero @ 2024-09-09 18:32 UTC (permalink / raw)
To: Peter Xu, Avihai Horon
Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P. Berrangé, Joao Martins,
qemu-devel
On 9.09.2024 19:59, Peter Xu wrote:
> On Thu, Sep 05, 2024 at 04:45:48PM +0300, Avihai Horon wrote:
>>
>> On 27/08/2024 20:54, Maciej S. Szmigiero wrote:
>>> External email: Use caution opening links or attachments
>>>
>>>
>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>
>>> These SaveVMHandlers help device provide its own asynchronous
>>> transmission of the remaining data at the end of a precopy phase.
>>>
>>> In this use case the save_live_complete_precopy_begin handler might
>>> be used to mark the stream boundary before proceeding with asynchronous
>>> transmission of the remaining data while the
>>> save_live_complete_precopy_end handler might be used to mark the
>>> stream boundary after performing the asynchronous transmission.
>>>
>>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>>> ---
>>> include/migration/register.h | 36 ++++++++++++++++++++++++++++++++++++
>>> migration/savevm.c | 35 +++++++++++++++++++++++++++++++++++
>>> 2 files changed, 71 insertions(+)
>>>
>>> diff --git a/include/migration/register.h b/include/migration/register.h
>>> index f60e797894e5..9de123252edf 100644
>>> --- a/include/migration/register.h
>>> +++ b/include/migration/register.h
>>> @@ -103,6 +103,42 @@ typedef struct SaveVMHandlers {
>>> */
>>> int (*save_live_complete_precopy)(QEMUFile *f, void *opaque);
>>>
>>> + /**
>>> + * @save_live_complete_precopy_begin
>>> + *
>>> + * Called at the end of a precopy phase, before all
>>> + * @save_live_complete_precopy handlers and before launching
>>> + * all @save_live_complete_precopy_thread threads.
>>> + * The handler might, for example, mark the stream boundary before
>>> + * proceeding with asynchronous transmission of the remaining data via
>>> + * @save_live_complete_precopy_thread.
>>> + * When postcopy is enabled, devices that support postcopy will skip this step.
>>> + *
>>> + * @f: QEMUFile where the handler can synchronously send data before returning
>>> + * @idstr: this device section idstr
>>> + * @instance_id: this device section instance_id
>>> + * @opaque: data pointer passed to register_savevm_live()
>>> + *
>>> + * Returns zero to indicate success and negative for error
>>> + */
>>> + int (*save_live_complete_precopy_begin)(QEMUFile *f,
>>> + char *idstr, uint32_t instance_id,
>>> + void *opaque);
>>> + /**
>>> + * @save_live_complete_precopy_end
>>> + *
>>> + * Called at the end of a precopy phase, after @save_live_complete_precopy
>>> + * handlers and after all @save_live_complete_precopy_thread threads have
>>> + * finished. When postcopy is enabled, devices that support postcopy will
>>> + * skip this step.
>>> + *
>>> + * @f: QEMUFile where the handler can synchronously send data before returning
>>> + * @opaque: data pointer passed to register_savevm_live()
>>> + *
>>> + * Returns zero to indicate success and negative for error
>>> + */
>>> + int (*save_live_complete_precopy_end)(QEMUFile *f, void *opaque);
>>
>> Is this handler necessary now that migration core is responsible for the
>> threads and joins them? I don't see VFIO implementing it later on.
>
> Right, I spot the same thing.
>
> This series added three hooks: begin, end, precopy_thread.
>
> What I think is it only needs one, which is precopy_async. My vague memory
> was that was what we used to discuss too, so that when migration precopy
> flushes the final round of iterable data, it does:
>
> (1) loop over all complete_precopy_async() and enqueue the tasks if
> existed into the migration worker pool. Then,
>
> (2) loop over all complete_precopy() like before.
>
> Optionally, we can enforce one vmstate handler only provides either
> complete_precopy_async() or complete_precopy(). In this case VFIO can
> update the two hooks during setup() by detecting multifd && !mapped_ram &&
> nocomp.
>
The "_begin" hook is still necessary to mark the end of the device state
sent via the main migration stream (during the phase VM is still running)
since we can't start loading the multifd sent device state until all of
that earlier data finishes loading first.
We shouldn't send that boundary marker in .save_live_complete_precopy
either since it would meant unnecessary waiting for other devices
(not necessary VFIO ones) .save_live_complete_precopy bulk data.
And VFIO SaveVMHandlers are shared for all VFIO devices (and const) so
we can't really change them at runtime.
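(For context, the VFIO "_begin" implementation itself is not shown in the
quoted hunks, but based on the flags introduced earlier in the series it
plausibly boils down to something like this sketch:)

static int vfio_save_complete_precopy_begin(QEMUFile *f,
                                            char *idstr,
                                            uint32_t instance_id,
                                            void *opaque)
{
    VFIODevice *vbasedev = opaque;
    VFIOMigration *migration = vbasedev->migration;

    if (!migration->multifd_transfer) {
        return 0;
    }

    /* Tell the destination the main-channel device data is complete */
    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE_COMPLETE);

    return 0;
}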
Thanks,
Maciej
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [PATCH v2 05/17] thread-pool: Implement non-AIO (generic) pool support
2024-09-09 16:45 ` Peter Xu
@ 2024-09-09 18:38 ` Maciej S. Szmigiero
2024-09-09 19:12 ` Peter Xu
0 siblings, 1 reply; 128+ messages in thread
From: Maciej S. Szmigiero @ 2024-09-09 18:38 UTC (permalink / raw)
To: Peter Xu
Cc: Fabiano Rosas, Stefan Hajnoczi, Alex Williamson,
Cédric Le Goater, Eric Blake, Markus Armbruster,
Daniel P. Berrangé, Avihai Horon, Joao Martins, qemu-devel,
Paolo Bonzini
On 9.09.2024 18:45, Peter Xu wrote:
> Hi, Stefan, Maciej,
>
> Sorry to be slow on responding.
>
> On Tue, Sep 03, 2024 at 03:04:54PM -0400, Stefan Hajnoczi wrote:
>> On Tue, 3 Sept 2024 at 12:54, Maciej S. Szmigiero
>> <mail@maciej.szmigiero.name> wrote:
>>>
>>> On 3.09.2024 15:55, Stefan Hajnoczi wrote:
>>>> On Tue, 27 Aug 2024 at 13:58, Maciej S. Szmigiero
>>>> <mail@maciej.szmigiero.name> wrote:
>>>>>
>>>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>>>
>>>>> Migration code wants to manage device data sending threads in one place.
>>>>>
>>>>> QEMU has an existing thread pool implementation, however it was limited
>>>>> to queuing AIO operations only and essentially had a 1:1 mapping between
>>>>> the current AioContext and the ThreadPool in use.
>>>>>
>>>>> Implement what is necessary to queue generic (non-AIO) work on a ThreadPool
>>>>> too.
>>>>>
>>>>> This brings a few new operations on a pool:
>>>>> * thread_pool_set_minmax_threads() explicitly sets the minimum and maximum
>>>>> thread count in the pool.
>>>>>
>>>>> * thread_pool_join() operation waits until all the submitted work requests
>>>>> have finished.
>>>>>
>>>>> * thread_pool_poll() lets the new thread and / or thread completion bottom
>>>>> halves run (if they are indeed scheduled to be run).
>>>>> It is useful for thread pool users that need to launch or terminate new
>>>>> threads without returning to the QEMU main loop.
>>>>
>>>> Did you consider glib's GThreadPool?
>>>> https://docs.gtk.org/glib/struct.ThreadPool.html
>>>>
>>>> QEMU's thread pool is integrated into the QEMU event loop. If your
>>>> goal is to bypass the QEMU event loop, then you may as well use the
>>>> glib API instead.
>>>>
>>>> thread_pool_join() and thread_pool_poll() will lead to code that
>>>> blocks the event loop. QEMU's aio_poll() and nested event loops in
>>>> general are a source of hangs and re-entrancy bugs. I would prefer not
>>>> introducing these issues in the QEMU ThreadPool API.
>>>>
>>>
>>> Unfortunately, the problem with the migration code is that it is
>>> synchronous - it does not return to the main event loop until the
>>> migration is done.
>>>
>>> So the only way to handle things that need a working event loop is to
>>> pump it manually from inside the migration code.
>>>
>>> The reason why I used the QEMU thread pool in the first place in this
>>> patch set version is because Peter asked me to do so during the review
>>> of its previous iteration [1].
>>>
>>> Peter also asked me previously to move to QEMU synchronization
>>> primitives from using the Glib ones in the early version of this
>>> patch set [2].
>>>
>>> I personally would rather use something common to many applications,
>>> well tested and with more pairs of eyes looking at it, rather than
>>> re-invent things in QEMU.
>>>
>>> Looking at GThreadPool it seems that it lacks ability to wait until
>>> all queued work have finished, so this would need to be open-coded
>>> in the migration code.
>>>
>>> @Peter, what's your opinion on using Glib's thread pool instead of
>>> QEMU's one, considering the above things?
>>
>> I'll add a bit more about my thinking:
>>
>> Using QEMU's event-driven model is usually preferred because it makes
>> integrating with the rest of QEMU easy and avoids having lots of
>> single-purpose threads that are hard to observe/manage (e.g. through
>> the QMP monitor).
>>
>> When there is a genuine need to spawn a thread and write synchronous
>> code (e.g. a blocking ioctl(2) call or something CPU-intensive), then
>
> Right, AFAIU this is the current use case for VFIO, and anything beyond in
> migration context, where we want to use genuine threads with no need to
> integrate with the main event loop.
>
> Currently the VFIO workfn should read() the VFIO fd in a blocked way, then
> dump them to multifd threads (further dump to migration channels), during
> which it can wait on a semaphore.
>
>> it's okay to do that. Use QEMUBH, EventNotifier, or other QEMU APIs to
>> synchronize between event loop threads and special-purpose synchronous
>> threads.
>>
>> I haven't looked at the patch series enough to have an opinion about
>> whether this use case needs a special-purpose thread or not. I am
>> assuming it really needs to be a special-purpose thread. Peter and you
>> could discuss that further if you want.
>>
>> I agree with Peter's request to use QEMU's synchronization primitives.
>> They do not depend on the event loop so they can be used outside the
>> event loop.
>>
>> The issue I'm raising with this patch is that adding new join()/poll()
>> APIs that shouldn't be called from the event loop is bug-prone. It
>> will make the QEMU ThreadPool code harder to understand and maintain
>> because now there are two different contexts where different subsets
>> of this API can be used and mixing them leads to problems. To me the
>> non-event loop case is beyond the scope of QEMU's ThreadPool. I have
>> CCed Paolo, who wrote the thread pool in its current form in case he
>> wants to participate in the discussion.
>>
>> Using glib's ThreadPool solves the issue while still reusing an
>> existing thread pool implementation. Waiting for all work to complete
>> can be done using QemuSemaphore.
>
> Right. It's a pity that g_thread_pool_unprocessed() only monitors
> unqueuing of tasks, and looks like there's no g_thread_pool_flush().
>
> Indeed the current thread pool is very aio-centric, and if we worry about
> misuse of the APIs we can switch to glib's threadpool. Sorry Maciej, looks
> like I routed you in a direction where I didn't see the side effects..
>
> I think the fundamental request from my side (on behalf of migration) is we
> should avoid a specific vmstate handler managing threads on its own. E.g.,
> any future devices (vdpa, vcpu, etc.) that may also be able to offload
> save() processes concurrently to threads (just like what VFIO can already
> do right now) should share the same pool of threads. As long as that can
> be achieved I am ok.
So, to be clear - do you still prefer using the (extended) QEMU's thread pool
or rather prefer switching to the Glib thread pool instead (with
thread_pool_wait() equivalent reimplemented inside QEMU since Glib lacks it)?
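(For the record, a minimal sketch of such a wrapper — every name below is
made up, and error handling is omitted:)

typedef struct {
    void (*fn)(void *opaque);
    void *opaque;
} PoolTask;

typedef struct {
    GThreadPool *pool;
    QemuSemaphore done_sem; /* posted once per finished task */
    unsigned int queued;    /* tasks submitted so far */
} GenericThreadPool;

static void pool_worker(gpointer data, gpointer user_data)
{
    GenericThreadPool *p = user_data;
    PoolTask *t = data;

    t->fn(t->opaque);
    g_free(t);
    qemu_sem_post(&p->done_sem);
}

static void pool_init(GenericThreadPool *p, int max_threads)
{
    p->pool = g_thread_pool_new(pool_worker, p, max_threads, FALSE, NULL);
    qemu_sem_init(&p->done_sem, 0);
    p->queued = 0;
}

static void pool_submit(GenericThreadPool *p,
                        void (*fn)(void *opaque), void *opaque)
{
    PoolTask *t = g_new(PoolTask, 1);

    t->fn = fn;
    t->opaque = opaque;
    p->queued++;
    g_thread_pool_push(p->pool, t, NULL);
}

/* the thread_pool_wait() equivalent that Glib itself lacks */
static void pool_wait(GenericThreadPool *p)
{
    while (p->queued) {
        qemu_sem_wait(&p->done_sem);
        p->queued--;
    }
}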
> Thanks,
>
Thanks,
Maciej
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [PATCH v2 06/17] migration: Add save_live_complete_precopy_{begin,end} handlers
2024-09-09 18:32 ` Maciej S. Szmigiero
@ 2024-09-09 19:08 ` Peter Xu
2024-09-09 19:32 ` Peter Xu
2024-09-19 19:47 ` Maciej S. Szmigiero
0 siblings, 2 replies; 128+ messages in thread
From: Peter Xu @ 2024-09-09 19:08 UTC (permalink / raw)
To: Maciej S. Szmigiero
Cc: Avihai Horon, Fabiano Rosas, Alex Williamson,
Cédric Le Goater, Eric Blake, Markus Armbruster,
Daniel P. Berrangé, Joao Martins, qemu-devel
On Mon, Sep 09, 2024 at 08:32:45PM +0200, Maciej S. Szmigiero wrote:
> On 9.09.2024 19:59, Peter Xu wrote:
> > On Thu, Sep 05, 2024 at 04:45:48PM +0300, Avihai Horon wrote:
> > >
> > > On 27/08/2024 20:54, Maciej S. Szmigiero wrote:
> > > > External email: Use caution opening links or attachments
> > > >
> > > >
> > > > From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> > > >
> > > > These SaveVMHandlers help device provide its own asynchronous
> > > > transmission of the remaining data at the end of a precopy phase.
> > > >
> > > > In this use case the save_live_complete_precopy_begin handler might
> > > > be used to mark the stream boundary before proceeding with asynchronous
> > > > transmission of the remaining data while the
> > > > save_live_complete_precopy_end handler might be used to mark the
> > > > stream boundary after performing the asynchronous transmission.
> > > >
> > > > Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> > > > ---
> > > > include/migration/register.h | 36 ++++++++++++++++++++++++++++++++++++
> > > > migration/savevm.c | 35 +++++++++++++++++++++++++++++++++++
> > > > 2 files changed, 71 insertions(+)
> > > >
> > > > diff --git a/include/migration/register.h b/include/migration/register.h
> > > > index f60e797894e5..9de123252edf 100644
> > > > --- a/include/migration/register.h
> > > > +++ b/include/migration/register.h
> > > > @@ -103,6 +103,42 @@ typedef struct SaveVMHandlers {
> > > > */
> > > > int (*save_live_complete_precopy)(QEMUFile *f, void *opaque);
> > > >
> > > > + /**
> > > > + * @save_live_complete_precopy_begin
> > > > + *
> > > > + * Called at the end of a precopy phase, before all
> > > > + * @save_live_complete_precopy handlers and before launching
> > > > + * all @save_live_complete_precopy_thread threads.
> > > > + * The handler might, for example, mark the stream boundary before
> > > > + * proceeding with asynchronous transmission of the remaining data via
> > > > + * @save_live_complete_precopy_thread.
> > > > + * When postcopy is enabled, devices that support postcopy will skip this step.
> > > > + *
> > > > + * @f: QEMUFile where the handler can synchronously send data before returning
> > > > + * @idstr: this device section idstr
> > > > + * @instance_id: this device section instance_id
> > > > + * @opaque: data pointer passed to register_savevm_live()
> > > > + *
> > > > + * Returns zero to indicate success and negative for error
> > > > + */
> > > > + int (*save_live_complete_precopy_begin)(QEMUFile *f,
> > > > + char *idstr, uint32_t instance_id,
> > > > + void *opaque);
> > > > + /**
> > > > + * @save_live_complete_precopy_end
> > > > + *
> > > > + * Called at the end of a precopy phase, after @save_live_complete_precopy
> > > > + * handlers and after all @save_live_complete_precopy_thread threads have
> > > > + * finished. When postcopy is enabled, devices that support postcopy will
> > > > + * skip this step.
> > > > + *
> > > > + * @f: QEMUFile where the handler can synchronously send data before returning
> > > > + * @opaque: data pointer passed to register_savevm_live()
> > > > + *
> > > > + * Returns zero to indicate success and negative for error
> > > > + */
> > > > + int (*save_live_complete_precopy_end)(QEMUFile *f, void *opaque);
> > >
> > > Is this handler necessary now that migration core is responsible for the
> > > threads and joins them? I don't see VFIO implementing it later on.
> >
> > Right, I spot the same thing.
> >
> > This series added three hooks: begin, end, precopy_thread.
> >
> > What I think is it only needs one, which is precopy_async. My vague memory
> > was that was what we used to discuss too, so that when migration precopy
> > flushes the final round of iterable data, it does:
> >
> > (1) loop over all complete_precopy_async() and enqueue the tasks if
> > existed into the migration worker pool. Then,
> >
> > (2) loop over all complete_precopy() like before.
> >
> > Optionally, we can enforce one vmstate handler only provides either
> > complete_precopy_async() or complete_precopy(). In this case VFIO can
> > update the two hooks during setup() by detecting multifd && !mapped_ram &&
> > nocomp.
> >
>
> The "_begin" hook is still necessary to mark the end of the device state
> sent via the main migration stream (during the phase VM is still running)
> since we can't start loading the multifd sent device state until all of
> that earlier data finishes loading first.
Ah I remembered some more now, thanks.
If vfio can send data during iterations, this new hook will also not be
needed, right?
I remember you mentioned you'd have a look and see the challenges there;
is there any conclusion yet on whether we can use multifd even during that?
It's also a pity that we introduce this hook only because we want a
boundary between the "iterable stage" and the "final stage". IIUC, if we
had any kind of message telling the dest beforehand that "we're going to
the last stage", then this hook could be avoided. As it stands it's at
least inefficient, because we need to trigger begin() per-device, even
though I think it's more of a global request saying "we need to load all
main stream data first before moving on".
>
> We shouldn't send that boundary marker in .save_live_complete_precopy
> either since it would meant unnecessary waiting for other devices
> (not necessary VFIO ones) .save_live_complete_precopy bulk data.
>
> And VFIO SaveVMHandlers are shared for all VFIO devices (and const) so
> we can't really change them at runtime.
In all cases, please consider dropping end() if it's never used; IMO it's
fine if there is only begin(), and we shouldn't keep hooks that are never
used.
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [PATCH v2 05/17] thread-pool: Implement non-AIO (generic) pool support
2024-09-09 18:38 ` Maciej S. Szmigiero
@ 2024-09-09 19:12 ` Peter Xu
2024-09-09 19:16 ` Maciej S. Szmigiero
0 siblings, 1 reply; 128+ messages in thread
From: Peter Xu @ 2024-09-09 19:12 UTC (permalink / raw)
To: Maciej S. Szmigiero
Cc: Fabiano Rosas, Stefan Hajnoczi, Alex Williamson,
Cédric Le Goater, Eric Blake, Markus Armbruster,
Daniel P. Berrangé, Avihai Horon, Joao Martins, qemu-devel,
Paolo Bonzini
On Mon, Sep 09, 2024 at 08:38:45PM +0200, Maciej S. Szmigiero wrote:
> On 9.09.2024 18:45, Peter Xu wrote:
> > Hi, Stefan, Maciej,
> >
> > Sorry to be slow on responding.
> >
> > On Tue, Sep 03, 2024 at 03:04:54PM -0400, Stefan Hajnoczi wrote:
> > > On Tue, 3 Sept 2024 at 12:54, Maciej S. Szmigiero
> > > <mail@maciej.szmigiero.name> wrote:
> > > >
> > > > On 3.09.2024 15:55, Stefan Hajnoczi wrote:
> > > > > On Tue, 27 Aug 2024 at 13:58, Maciej S. Szmigiero
> > > > > <mail@maciej.szmigiero.name> wrote:
> > > > > >
> > > > > > From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> > > > > >
> > > > > > Migration code wants to manage device data sending threads in one place.
> > > > > >
> > > > > > QEMU has an existing thread pool implementation, however it was limited
> > > > > > to queuing AIO operations only and essentially had a 1:1 mapping between
> > > > > > the current AioContext and the ThreadPool in use.
> > > > > >
> > > > > > Implement what is necessary to queue generic (non-AIO) work on a ThreadPool
> > > > > > too.
> > > > > >
> > > > > > This brings a few new operations on a pool:
> > > > > > * thread_pool_set_minmax_threads() explicitly sets the minimum and maximum
> > > > > > thread count in the pool.
> > > > > >
> > > > > > * thread_pool_join() operation waits until all the submitted work requests
> > > > > > have finished.
> > > > > >
> > > > > > * thread_pool_poll() lets the new thread and / or thread completion bottom
> > > > > > halves run (if they are indeed scheduled to be run).
> > > > > > It is useful for thread pool users that need to launch or terminate new
> > > > > > threads without returning to the QEMU main loop.
> > > > >
> > > > > Did you consider glib's GThreadPool?
> > > > > https://docs.gtk.org/glib/struct.ThreadPool.html
> > > > >
> > > > > QEMU's thread pool is integrated into the QEMU event loop. If your
> > > > > goal is to bypass the QEMU event loop, then you may as well use the
> > > > > glib API instead.
> > > > >
> > > > > thread_pool_join() and thread_pool_poll() will lead to code that
> > > > > blocks the event loop. QEMU's aio_poll() and nested event loops in
> > > > > general are a source of hangs and re-entrancy bugs. I would prefer not
> > > > > introducing these issues in the QEMU ThreadPool API.
> > > > >
> > > >
> > > > Unfortunately, the problem with the migration code is that it is
> > > > synchronous - it does not return to the main event loop until the
> > > > migration is done.
> > > >
> > > > So the only way to handle things that need a working event loop is to
> > > > pump it manually from inside the migration code.
> > > >
> > > > The reason why I used the QEMU thread pool in the first place in this
> > > > patch set version is because Peter asked me to do so during the review
> > > > of its previous iteration [1].
> > > >
> > > > Peter also asked me previously to move to QEMU synchronization
> > > > primitives from using the Glib ones in the early version of this
> > > > patch set [2].
> > > >
> > > > I personally would rather use something common to many applications,
> > > > well tested and with more pairs of eyes looking at it, rather than
> > > > re-invent things in QEMU.
> > > >
> > > > Looking at GThreadPool it seems that it lacks ability to wait until
> > > > all queued work have finished, so this would need to be open-coded
> > > > in the migration code.
> > > >
> > > > @Peter, what's your opinion on using Glib's thread pool instead of
> > > > QEMU's one, considering the above things?
> > >
> > > I'll add a bit more about my thinking:
> > >
> > > Using QEMU's event-driven model is usually preferred because it makes
> > > integrating with the rest of QEMU easy and avoids having lots of
> > > single-purpose threads that are hard to observe/manage (e.g. through
> > > the QMP monitor).
> > >
> > > When there is a genuine need to spawn a thread and write synchronous
> > > code (e.g. a blocking ioctl(2) call or something CPU-intensive), then
> >
> > Right, AFAIU this is the current use case for VFIO, and anything beyond in
> > migration context, where we want to use genuine threads with no need to
> > integrate with the main event loop.
> >
> > Currently the VFIO workfn should read() the VFIO fd in a blocked way, then
> > dump them to multifd threads (further dump to migration channels), during
> > which it can wait on a semaphore.
> >
> > > it's okay to do that. Use QEMUBH, EventNotifier, or other QEMU APIs to
> > > synchronize between event loop threads and special-purpose synchronous
> > > threads.
> > >
> > > I haven't looked at the patch series enough to have an opinion about
> > > whether this use case needs a special-purpose thread or not. I am
> > > assuming it really needs to be a special-purpose thread. Peter and you
> > > could discuss that further if you want.
> > >
> > > I agree with Peter's request to use QEMU's synchronization primitives.
> > > They do not depend on the event loop so they can be used outside the
> > > event loop.
> > >
> > > The issue I'm raising with this patch is that adding new join()/poll()
> > > APIs that shouldn't be called from the event loop is bug-prone. It
> > > will make the QEMU ThreadPool code harder to understand and maintain
> > > because now there are two different contexts where different subsets
> > > of this API can be used and mixing them leads to problems. To me the
> > > non-event loop case is beyond the scope of QEMU's ThreadPool. I have
> > > CCed Paolo, who wrote the thread pool in its current form in case he
> > > wants to participate in the discussion.
> > >
> > > Using glib's ThreadPool solves the issue while still reusing an
> > > existing thread pool implementation. Waiting for all work to complete
> > > can be done using QemuSemaphore.
> >
> > Right. It's a pity that g_thread_pool_unprocessed() only monitors
> > unqueuing of tasks, and looks like there's no g_thread_pool_flush().
> >
> > Indeed the current thread pool is very aio-centric, and if we worry about
> > misuse of the APIs we can switch to glib's threadpool. Sorry Maciej, looks
> > like I routed you in a direction where I didn't see the side effects..
> >
> > I think the fundamental request from my side (on behalf of migration) is we
> > should avoid a specific vmstate handler managing threads on its own. E.g.,
> > any future devices (vdpa, vcpu, etc.) that may also be able to offload
> > save() processes concurrently to threads (just like what VFIO can already
> > do right now) should share the same pool of threads. As long as that can
> > be achieved I am ok.
>
> So, to be clear - do you still prefer using the (extended) QEMU's thread pool
> or rather prefer switching to the Glib thread pool instead (with
> thread_pool_wait() equivalent reimplemented inside QEMU since Glib lacks it)?
After reading Stefan's comment, I prefer the latter.
I wonder whether we should rename the current ThreadPool to AioThreadPool
or similar, so that it'll be crystal clear we want it to stick to aio
context. Then the new pool can be the raw thread pool (and I also wonder
whether at some point the aio thread pool can still reuse the raw thread
pool to some degree).
And yes, it'll be nice if we can wrap glib's with wait() semantics.
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [PATCH v2 05/17] thread-pool: Implement non-AIO (generic) pool support
2024-09-09 19:12 ` Peter Xu
@ 2024-09-09 19:16 ` Maciej S. Szmigiero
2024-09-09 19:24 ` Peter Xu
0 siblings, 1 reply; 128+ messages in thread
From: Maciej S. Szmigiero @ 2024-09-09 19:16 UTC (permalink / raw)
To: Peter Xu
Cc: Fabiano Rosas, Stefan Hajnoczi, Alex Williamson,
Cédric Le Goater, Eric Blake, Markus Armbruster,
Daniel P. Berrangé, Avihai Horon, Joao Martins, qemu-devel,
Paolo Bonzini
On 9.09.2024 21:12, Peter Xu wrote:
> On Mon, Sep 09, 2024 at 08:38:45PM +0200, Maciej S. Szmigiero wrote:
>> On 9.09.2024 18:45, Peter Xu wrote:
>>> Hi, Stefan, Maciej,
>>>
>>> Sorry to be slow on responding.
>>>
>>> On Tue, Sep 03, 2024 at 03:04:54PM -0400, Stefan Hajnoczi wrote:
>>>> On Tue, 3 Sept 2024 at 12:54, Maciej S. Szmigiero
>>>> <mail@maciej.szmigiero.name> wrote:
>>>>>
>>>>> On 3.09.2024 15:55, Stefan Hajnoczi wrote:
>>>>>> On Tue, 27 Aug 2024 at 13:58, Maciej S. Szmigiero
>>>>>> <mail@maciej.szmigiero.name> wrote:
>>>>>>>
>>>>>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>>>>>
>>>>>>> Migration code wants to manage device data sending threads in one place.
>>>>>>>
>>>>>>> QEMU has an existing thread pool implementation, however it was limited
>>>>>>> to queuing AIO operations only and essentially had a 1:1 mapping between
>>>>>>> the current AioContext and the ThreadPool in use.
>>>>>>>
>>>>>>> Implement what is necessary to queue generic (non-AIO) work on a ThreadPool
>>>>>>> too.
>>>>>>>
>>>>>>> This brings a few new operations on a pool:
>>>>>>> * thread_pool_set_minmax_threads() explicitly sets the minimum and maximum
>>>>>>> thread count in the pool.
>>>>>>>
>>>>>>> * thread_pool_join() operation waits until all the submitted work requests
>>>>>>> have finished.
>>>>>>>
>>>>>>> * thread_pool_poll() lets the new thread and / or thread completion bottom
>>>>>>> halves run (if they are indeed scheduled to be run).
>>>>>>> It is useful for thread pool users that need to launch or terminate new
>>>>>>> threads without returning to the QEMU main loop.
>>>>>>
>>>>>> Did you consider glib's GThreadPool?
>>>>>> https://docs.gtk.org/glib/struct.ThreadPool.html
>>>>>>
>>>>>> QEMU's thread pool is integrated into the QEMU event loop. If your
>>>>>> goal is to bypass the QEMU event loop, then you may as well use the
>>>>>> glib API instead.
>>>>>>
>>>>>> thread_pool_join() and thread_pool_poll() will lead to code that
>>>>>> blocks the event loop. QEMU's aio_poll() and nested event loops in
>>>>>> general are a source of hangs and re-entrancy bugs. I would prefer not
>>>>>> introducing these issues in the QEMU ThreadPool API.
>>>>>>
>>>>>
>>>>> Unfortunately, the problem with the migration code is that it is
>>>>> synchronous - it does not return to the main event loop until the
>>>>> migration is done.
>>>>>
>>>>> So the only way to handle things that need a working event loop is to
>>>>> pump it manually from inside the migration code.
>>>>>
>>>>> The reason why I used the QEMU thread pool in the first place in this
>>>>> patch set version is because Peter asked me to do so during the review
>>>>> of its previous iteration [1].
>>>>>
>>>>> Peter also asked me previously to move to QEMU synchronization
>>>>> primitives from using the Glib ones in the early version of this
>>>>> patch set [2].
>>>>>
>>>>> I personally would rather use something common to many applications,
>>>>> well tested and with more pairs of eyes looking at it, rather than
>>>>> re-invent things in QEMU.
>>>>>
>>>>> Looking at GThreadPool it seems that it lacks ability to wait until
>>>>> all queued work have finished, so this would need to be open-coded
>>>>> in the migration code.
>>>>>
>>>>> @Peter, what's your opinion on using Glib's thread pool instead of
>>>>> QEMU's one, considering the above things?
>>>>
>>>> I'll add a bit more about my thinking:
>>>>
>>>> Using QEMU's event-driven model is usually preferred because it makes
>>>> integrating with the rest of QEMU easy and avoids having lots of
>>>> single-purpose threads that are hard to observe/manage (e.g. through
>>>> the QMP monitor).
>>>>
>>>> When there is a genuine need to spawn a thread and write synchronous
>>>> code (e.g. a blocking ioctl(2) call or something CPU-intensive), then
>>>
>>> Right, AFAIU this is the current use case for VFIO, and anything beyond in
>>> migration context, where we want to use genuine threads with no need to
>>> integrate with the main event loop.
>>>
>>> Currently the VFIO workfn should read() the VFIO fd in a blocked way, then
>>> dump them to multifd threads (further dump to migration channels), during
>>> which it can wait on a semaphore.
>>>
>>>> it's okay to do that. Use QEMUBH, EventNotifier, or other QEMU APIs to
>>>> synchronize between event loop threads and special-purpose synchronous
>>>> threads.
>>>>
>>>> I haven't looked at the patch series enough to have an opinion about
>>>> whether this use case needs a special-purpose thread or not. I am
>>>> assuming it really needs to be a special-purpose thread. Peter and you
>>>> could discuss that further if you want.
>>>>
>>>> I agree with Peter's request to use QEMU's synchronization primitives.
>>>> They do not depend on the event loop so they can be used outside the
>>>> event loop.
>>>>
>>>> The issue I'm raising with this patch is that adding new join()/poll()
>>>> APIs that shouldn't be called from the event loop is bug-prone. It
>>>> will make the QEMU ThreadPool code harder to understand and maintain
>>>> because now there are two different contexts where different subsets
>>>> of this API can be used and mixing them leads to problems. To me the
>>>> non-event loop case is beyond the scope of QEMU's ThreadPool. I have
>>>> CCed Paolo, who wrote the thread pool in its current form in case he
>>>> wants to participate in the discussion.
>>>>
>>>> Using glib's ThreadPool solves the issue while still reusing an
>>>> existing thread pool implementation. Waiting for all work to complete
>>>> can be done using QemuSemaphore.
>>>
>>> Right. It's a pity that g_thread_pool_unprocessed() only monitors
>>> unqueuing of tasks, and looks like there's no g_thread_pool_flush().
>>>
>>> Indeed the current thread pool is very aio-centric, and if we worry about
>>> misuse of the APIs we can switch to glib's threadpool. Sorry Maciej, looks
>>> like I routed you in a direction where I didn't see the side effects..
>>>
>>> I think the fundamental request from my side (on behalf of migration) is we
>>> should avoid a specific vmstate handler managing threads on its own. E.g.,
>>> any future devices (vdpa, vcpu, etc.) that may also be able to offload
>>> save() processes concurrently to threads (just like what VFIO can already
>>> do right now) should share the same pool of threads. As long as that can
>>> be achieved I am ok.
>>
>> So, to be clear - do you still prefer using the (extended) QEMU's thread pool
>> or rather prefer switching to the Glib thread pool instead (with
>> thread_pool_wait() equivalent reimplemented inside QEMU since Glib lacks it)?
>
> After reading Stefan's comment, I prefer the latter.
>
> I wonder whether we should rename the current ThreadPool to AioThreadPool
> or similar, so that it'll be crystal clear we want it to stick to aio
> context. Then the new pool can be the raw thread pool (and I also wonder
> whether at some point the aio thread pool can still reuse the raw thread
> pool to some degree).
>
> And yes, it'll be nice if we can wrap glib's with wait() semantics.
So, if I understand your design correctly, you want to basically wrap
Glib's GThreadPool into some QEMU GenericThreadPool and then use the
latter in multifd code, right?
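Something like the following minimal sketch, I suppose? All the
generic_thread_pool_* names here are hypothetical, and since Glib has no
flush/wait primitive, the wait() below is built from a QemuMutex/QemuCond
pair (instead of the QemuSemaphore Stefan mentioned):

typedef struct GenericThreadPool {
    GThreadPool *pool;
    QemuMutex lock;
    QemuCond all_done;
    unsigned int in_flight;
} GenericThreadPool;

static void generic_thread_pool_func(gpointer data, gpointer user_data)
{
    GenericThreadPool *p = user_data;

    /* ... run the caller-provided work item "data" here ... */

    qemu_mutex_lock(&p->lock);
    if (--p->in_flight == 0) {
        qemu_cond_broadcast(&p->all_done);
    }
    qemu_mutex_unlock(&p->lock);
}

GenericThreadPool *generic_thread_pool_new(int max_threads)
{
    GenericThreadPool *p = g_new0(GenericThreadPool, 1);

    qemu_mutex_init(&p->lock);
    qemu_cond_init(&p->all_done);
    p->pool = g_thread_pool_new(generic_thread_pool_func, p,
                                max_threads, FALSE, NULL);
    return p;
}

void generic_thread_pool_submit(GenericThreadPool *p, gpointer data)
{
    qemu_mutex_lock(&p->lock);
    p->in_flight++;
    qemu_mutex_unlock(&p->lock);

    g_thread_pool_push(p->pool, data, NULL);
}

/* Block until every submitted work item has finished running */
void generic_thread_pool_wait(GenericThreadPool *p)
{
    qemu_mutex_lock(&p->lock);
    while (p->in_flight) {
        qemu_cond_wait(&p->all_done, &p->lock);
    }
    qemu_mutex_unlock(&p->lock);
}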
> Thanks,
>
Thanks,
Maciej
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [PATCH v2 05/17] thread-pool: Implement non-AIO (generic) pool support
2024-09-09 19:16 ` Maciej S. Szmigiero
@ 2024-09-09 19:24 ` Peter Xu
0 siblings, 0 replies; 128+ messages in thread
From: Peter Xu @ 2024-09-09 19:24 UTC (permalink / raw)
To: Maciej S. Szmigiero
Cc: Fabiano Rosas, Stefan Hajnoczi, Alex Williamson,
Cédric Le Goater, Eric Blake, Markus Armbruster,
Daniel P. Berrangé, Avihai Horon, Joao Martins, qemu-devel,
Paolo Bonzini
On Mon, Sep 09, 2024 at 09:16:32PM +0200, Maciej S. Szmigiero wrote:
> So, if I understand your design correctly, you want to basically wrap
> the Glib's GThreadPool into some QEMU GenericThreadPool and then use the
> later in multifd code, right?
Yes. I didn't have an explicit picture yet in mind, but what you said
makes sense to me.
--
Peter Xu
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [PATCH v2 06/17] migration: Add save_live_complete_precopy_{begin,end} handlers
2024-09-09 19:08 ` Peter Xu
@ 2024-09-09 19:32 ` Peter Xu
2024-09-19 19:48 ` Maciej S. Szmigiero
2024-09-19 19:47 ` Maciej S. Szmigiero
1 sibling, 1 reply; 128+ messages in thread
From: Peter Xu @ 2024-09-09 19:32 UTC (permalink / raw)
To: Maciej S. Szmigiero
Cc: Avihai Horon, Fabiano Rosas, Alex Williamson,
Cédric Le Goater, Eric Blake, Markus Armbruster,
Daniel P. Berrangé, Joao Martins, qemu-devel
On Mon, Sep 09, 2024 at 03:08:40PM -0400, Peter Xu wrote:
> On Mon, Sep 09, 2024 at 08:32:45PM +0200, Maciej S. Szmigiero wrote:
> > On 9.09.2024 19:59, Peter Xu wrote:
> > > On Thu, Sep 05, 2024 at 04:45:48PM +0300, Avihai Horon wrote:
> > > >
> > > > On 27/08/2024 20:54, Maciej S. Szmigiero wrote:
> > > > >
> > > > > From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> > > > >
> > > > > These SaveVMHandlers help a device provide its own asynchronous
> > > > > transmission of the remaining data at the end of a precopy phase.
> > > > >
> > > > > In this use case the save_live_complete_precopy_begin handler might
> > > > > be used to mark the stream boundary before proceeding with asynchronous
> > > > > transmission of the remaining data while the
> > > > > save_live_complete_precopy_end handler might be used to mark the
> > > > > stream boundary after performing the asynchronous transmission.
> > > > >
> > > > > Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> > > > > ---
> > > > > include/migration/register.h | 36 ++++++++++++++++++++++++++++++++++++
> > > > > migration/savevm.c | 35 +++++++++++++++++++++++++++++++++++
> > > > > 2 files changed, 71 insertions(+)
> > > > >
> > > > > diff --git a/include/migration/register.h b/include/migration/register.h
> > > > > index f60e797894e5..9de123252edf 100644
> > > > > --- a/include/migration/register.h
> > > > > +++ b/include/migration/register.h
> > > > > @@ -103,6 +103,42 @@ typedef struct SaveVMHandlers {
> > > > > */
> > > > > int (*save_live_complete_precopy)(QEMUFile *f, void *opaque);
> > > > >
> > > > > + /**
> > > > > + * @save_live_complete_precopy_begin
> > > > > + *
> > > > > + * Called at the end of a precopy phase, before all
> > > > > + * @save_live_complete_precopy handlers and before launching
> > > > > + * all @save_live_complete_precopy_thread threads.
> > > > > + * The handler might, for example, mark the stream boundary before
> > > > > + * proceeding with asynchronous transmission of the remaining data via
> > > > > + * @save_live_complete_precopy_thread.
> > > > > + * When postcopy is enabled, devices that support postcopy will skip this step.
> > > > > + *
> > > > > + * @f: QEMUFile where the handler can synchronously send data before returning
> > > > > + * @idstr: this device section idstr
> > > > > + * @instance_id: this device section instance_id
> > > > > + * @opaque: data pointer passed to register_savevm_live()
> > > > > + *
> > > > > + * Returns zero to indicate success and negative for error
> > > > > + */
> > > > > + int (*save_live_complete_precopy_begin)(QEMUFile *f,
> > > > > + char *idstr, uint32_t instance_id,
> > > > > + void *opaque);
> > > > > + /**
> > > > > + * @save_live_complete_precopy_end
> > > > > + *
> > > > > + * Called at the end of a precopy phase, after @save_live_complete_precopy
> > > > > + * handlers and after all @save_live_complete_precopy_thread threads have
> > > > > + * finished. When postcopy is enabled, devices that support postcopy will
> > > > > + * skip this step.
> > > > > + *
> > > > > + * @f: QEMUFile where the handler can synchronously send data before returning
> > > > > + * @opaque: data pointer passed to register_savevm_live()
> > > > > + *
> > > > > + * Returns zero to indicate success and negative for error
> > > > > + */
> > > > > + int (*save_live_complete_precopy_end)(QEMUFile *f, void *opaque);
> > > >
> > > > Is this handler necessary now that migration core is responsible for the
> > > > threads and joins them? I don't see VFIO implementing it later on.
> > >
> > > Right, I spot the same thing.
> > >
> > > This series added three hooks: begin, end, precopy_thread.
> > >
> > > What I think is it only needs one, which is precopy_async. My vague memory
> > > is that's what we used to discuss too, so that when migration precopy
> > > flushes the final round of iterable data, it does:
> > >
> > > (1) loop over all complete_precopy_async() and enqueue the tasks if
> > > existed into the migration worker pool. Then,
> > >
> > > (2) loop over all complete_precopy() like before.
> > >
> > > Optionally, we can enforce one vmstate handler only provides either
> > > complete_precopy_async() or complete_precopy(). In this case VFIO can
> > > update the two hooks during setup() by detecting multifd && !mapped_ram &&
> > > nocomp.
> > >
> >
> > The "_begin" hook is still necessary to mark the end of the device state
> > sent via the main migration stream (during the phase VM is still running)
> > since we can't start loading the multifd sent device state until all of
> > that earlier data finishes loading first.
>
> Ah I remembered some more now, thanks.
>
> If vfio can send data during iterations this new hook will also not be
> needed, right?
>
> I remember you mentioned you'd have a look and see the challenges there;
> is there any conclusion yet on whether we can use multifd even during that?
>
> It's also a pity that we introduce this hook only because we want a
> boundary between "iterable stage" and "final stage". IIUC if we have any
> kind of message telling the dest beforehand that "we're going to the last
> stage" then this hook can be avoided. Now it's at least inefficient
> because we need to trigger begin() per-device, even though I think it's more
> of a global request saying that "we need to load all main stream data first
> before moving on".
Or, we could add one MIG_CMD_SWITCHOVER under QEMU_VM_COMMAND, then send it
at the beginning of the switchover phase. Then we can have a generic
marker on the destination to be the boundary of "iterations" vs. "switchover".
Then I think we can also drop the begin() here, just to avoid one such sync
per-device (also in case others have such a need, like vdpa: then vdpa
wouldn't need that flag either).
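Just to sketch the shape I mean (the command name, the send helper, and the
destination handler below are all hypothetical, modeled on the existing
MIG_CMD_* handling in savevm.c):

/* migration/savevm.c */
enum qemu_vm_cmd {
    /* ... existing commands ... */
    MIG_CMD_SWITCHOVER,    /* Source -> dest: entering switchover phase */
    MIG_CMD_MAX
};

void qemu_savevm_send_switchover(QEMUFile *f)
{
    /* Zero-length command; it's purely a stream marker */
    qemu_savevm_command_send(f, MIG_CMD_SWITCHOVER, 0, NULL);
}

/* and in loadvm_process_command(): */
    case MIG_CMD_SWITCHOVER:
        /*
         * All main-stream device data before this point has been loaded;
         * multifd-transferred device state may be applied from now on.
         */
        return loadvm_handle_switchover(mis);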
Fundamentally, that makes VFIO_MIG_FLAG_DEV_DATA_STATE_COMPLETE a
migration flag..
But for sure the best is still if VFIO can enable multifd even during
iterations. Then the boundary guard may not be needed.
>
> >
> > We shouldn't send that boundary marker in .save_live_complete_precopy
> > either since it would mean unnecessarily waiting for other devices'
> > (not necessarily VFIO ones) .save_live_complete_precopy bulk data.
> >
> > And VFIO SaveVMHandlers are shared for all VFIO devices (and const) so
> > we can't really change them at runtime.
>
> In all cases, please consider dropping end() if it's never used; IMO it's
> fine if there is only begin(), and we shouldn't keep hooks that are never
> used.
>
> Thanks,
>
> --
> Peter Xu
--
Peter Xu
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [PATCH v2 12/17] migration/multifd: Device state transfer support - send side
2024-08-30 13:02 ` Fabiano Rosas
@ 2024-09-09 19:40 ` Peter Xu
2024-09-19 19:50 ` Maciej S. Szmigiero
0 siblings, 1 reply; 128+ messages in thread
From: Peter Xu @ 2024-09-09 19:40 UTC (permalink / raw)
To: Fabiano Rosas
Cc: Maciej S. Szmigiero, Alex Williamson, Cédric Le Goater,
Eric Blake, Markus Armbruster, Daniel P . Berrangé,
Avihai Horon, Joao Martins, qemu-devel
On Fri, Aug 30, 2024 at 10:02:40AM -0300, Fabiano Rosas wrote:
> >>> @@ -397,20 +404,16 @@ bool multifd_send(MultiFDSendData **send_data)
> >>>
> >>> p = &multifd_send_state->params[i];
> >>> /*
> >>> - * Lockless read to p->pending_job is safe, because only multifd
> >>> - * sender thread can clear it.
> >>> + * Lockless RMW on p->pending_job_preparing is safe, because only multifd
> >>> + * sender thread can clear it after it had seen p->pending_job being set.
> >>> + *
> >>> + * Pairs with qatomic_store_release() in multifd_send_thread().
> >>> */
> >>> - if (qatomic_read(&p->pending_job) == false) {
> >>> + if (qatomic_cmpxchg(&p->pending_job_preparing, false, true) == false) {
> >>
> >> What's the motivation for this change? It would be better to have it in
> >> a separate patch with a proper justification.
> >
> > The original RFC patch set used dedicated device state multifd channels.
> >
> > Peter and other people wanted this functionality removed; however, this
> > caused a performance (downtime) regression.
> >
> > One of the things that seemed to help mitigate this regression was making
> > the multifd channel selection more fair via this change.
> >
> > But I can split it out into a separate commit in the next patch set version
> > and then see what performance improvement it currently brings.
>
> Yes, better to have it separate if anything for documentation of the
> rationale.
And when drafting that patch, please add a comment explaining the field.
Currently it's missing:
/*
* The sender thread has work to do if either of below boolean is set.
*
* @pending_job: a job is pending
* @pending_sync: a sync request is pending
*
* For both of these fields, they're only set by the requesters, and
* cleared by the multifd sender threads.
*/
bool pending_job;
bool pending_job_preparing;
bool pending_sync;
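One possible wording once the new field is documented (just a sketch):

/*
 * The sender thread has work to do if either of below boolean is set.
 *
 * @pending_job: a job is pending
 * @pending_job_preparing: a requester has claimed this channel (via
 *     cmpxchg) and is still filling in the job data; cleared by the
 *     sender thread only after it has seen @pending_job set
 * @pending_sync: a sync request is pending
 *
 * For all of these fields, they're only set by the requesters, and
 * cleared by the multifd sender threads.
 */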
--
Peter Xu
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [PATCH v2 09/17] migration/multifd: Device state transfer support - receive side
2024-09-02 20:12 ` Maciej S. Szmigiero
2024-09-03 14:42 ` Fabiano Rosas
@ 2024-09-09 19:52 ` Peter Xu
2024-09-19 19:49 ` Maciej S. Szmigiero
1 sibling, 1 reply; 128+ messages in thread
From: Peter Xu @ 2024-09-09 19:52 UTC (permalink / raw)
To: Maciej S. Szmigiero
Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On Mon, Sep 02, 2024 at 10:12:01PM +0200, Maciej S. Szmigiero wrote:
> > > diff --git a/migration/multifd.h b/migration/multifd.h
> > > index a3e35196d179..a8f3e4838c01 100644
> > > --- a/migration/multifd.h
> > > +++ b/migration/multifd.h
> > > @@ -45,6 +45,12 @@ MultiFDRecvData *multifd_get_recv_data(void);
> > > #define MULTIFD_FLAG_QPL (4 << 1)
> > > #define MULTIFD_FLAG_UADK (8 << 1)
> > > +/*
> > > + * If set it means that this packet contains device state
> > > + * (MultiFDPacketDeviceState_t), not RAM data (MultiFDPacket_t).
> > > + */
> > > +#define MULTIFD_FLAG_DEVICE_STATE (1 << 4)
> >
> > Overlaps with UADK. I assume on purpose because device_state doesn't
> > support compression? Might be worth a comment.
> >
>
> Yes, the device state transfer bit stream does not support compression,
> so it is not a problem: these "compression type" flags will never
> be set in such a bit stream anyway.
>
> Will add a relevant comment here.
Why reuse? Would using a new bit be easier, if we still have plenty of bits
(just to tell what is what directly from a stream dump)?
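E.g. something like this (a sketch, assuming bit 5 is otherwise unused):

#define MULTIFD_FLAG_UADK (8 << 1)

/*
 * If set it means that this packet contains device state
 * (MultiFDPacketDeviceState_t), not RAM data (MultiFDPacket_t).
 * Deliberately outside the compression-type bits above, so a stream
 * dump stays unambiguous.
 */
#define MULTIFD_FLAG_DEVICE_STATE (1 << 5)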
--
Peter Xu
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [PATCH v2 08/17] migration: Add load_finish handler and associated functions
2024-08-27 17:54 ` [PATCH v2 08/17] migration: Add load_finish handler and associated functions Maciej S. Szmigiero
2024-08-30 19:28 ` Fabiano Rosas
2024-09-05 15:13 ` Avihai Horon
@ 2024-09-09 20:03 ` Peter Xu
2024-09-19 19:49 ` Maciej S. Szmigiero
2 siblings, 1 reply; 128+ messages in thread
From: Peter Xu @ 2024-09-09 20:03 UTC (permalink / raw)
To: Maciej S. Szmigiero
Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On Tue, Aug 27, 2024 at 07:54:27PM +0200, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>
> load_finish SaveVMHandler allows migration code to poll whether
> a device-specific asynchronous device state loading operation has finished.
>
> In order to avoid calling this handler needlessly the device is supposed
> to notify the migration code of its possible readiness via a call to
> qemu_loadvm_load_finish_ready_broadcast() while holding
> qemu_loadvm_load_finish_ready_lock.
>
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> ---
> include/migration/register.h | 21 +++++++++++++++
> migration/migration.c | 6 +++++
> migration/migration.h | 3 +++
> migration/savevm.c | 52 ++++++++++++++++++++++++++++++++++++
> migration/savevm.h | 4 +++
> 5 files changed, 86 insertions(+)
>
> diff --git a/include/migration/register.h b/include/migration/register.h
> index 4a578f140713..44d8cf5192ae 100644
> --- a/include/migration/register.h
> +++ b/include/migration/register.h
> @@ -278,6 +278,27 @@ typedef struct SaveVMHandlers {
> int (*load_state_buffer)(void *opaque, char *data, size_t data_size,
> Error **errp);
>
> + /**
> + * @load_finish
> + *
> + * Poll whether all asynchronous device state loading has finished.
> + * Not called on the load failure path.
> + *
> + * Called while holding the qemu_loadvm_load_finish_ready_lock.
> + *
> + * If this method signals "not ready" then it might not be called
> + * again until qemu_loadvm_load_finish_ready_broadcast() is invoked
> + * while holding qemu_loadvm_load_finish_ready_lock.
[1]
> + *
> + * @opaque: data pointer passed to register_savevm_live()
> + * @is_finished: whether the loading had finished (output parameter)
> + * @errp: pointer to Error*, to store an error if it happens.
> + *
> + * Returns zero to indicate success and negative for error
> + * It's not an error that the loading still hasn't finished.
> + */
> + int (*load_finish)(void *opaque, bool *is_finished, Error **errp);
The load_finish() semantics is a bit weird, especially above [1] on "only
allowed to be called once if ..." and also on the locks.
It looks to me vfio_load_finish() also does the final load of the device.
I wonder whether that final load can be done in the threads, and then after
everything is loaded the device posts a semaphore telling the main thread to
continue. See e.g.:
if (migrate_switchover_ack()) {
qemu_loadvm_state_switchover_ack_needed(mis);
}
IIUC, VFIO can register load_complete_ack similarly so it only sem_post()s
when all things are loaded? We can then get rid of this slightly awkward
interface. I had a feeling that things can be simplified (e.g., if the
thread will take care of loading the final vmstate then the mutex is also
not needed? etc.).
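Roughly this shape, I mean (only a sketch; the load_complete_sem field and
the thread count below are made up):

/* in each device's load thread, once its final buffer is applied: */
qemu_sem_post(&mis->load_complete_sem);

/* in the main thread, after the main migration stream is consumed: */
for (i = 0; i < num_load_threads; i++) {
    qemu_sem_wait(&mis->load_complete_sem);
}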
--
Peter Xu
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [PATCH v2 10/17] migration/multifd: Convert multifd_send()::next_channel to atomic
2024-08-27 17:54 ` [PATCH v2 10/17] migration/multifd: Convert multifd_send()::next_channel to atomic Maciej S. Szmigiero
2024-08-30 18:13 ` Fabiano Rosas
@ 2024-09-10 14:13 ` Peter Xu
1 sibling, 0 replies; 128+ messages in thread
From: Peter Xu @ 2024-09-10 14:13 UTC (permalink / raw)
To: Maciej S. Szmigiero
Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On Tue, Aug 27, 2024 at 07:54:29PM +0200, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>
> This is necessary for multifd_send() to be able to be called
> from multiple threads.
>
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Would it be much simpler to just use a mutex for enqueue?
Something like:
===8<===
diff --git a/migration/multifd.c b/migration/multifd.c
index 9b200f4ad9..979c9748b5 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -69,6 +69,8 @@ struct {
QemuSemaphore channels_created;
/* send channels ready */
QemuSemaphore channels_ready;
+ /* Mutex to serialize multifd enqueues */
+ QemuMutex multifd_send_mutex;
/*
* Have we already run terminate threads. There is a race when it
* happens that we got one error while we are exiting.
@@ -305,6 +307,8 @@ bool multifd_send(MultiFDSendData **send_data)
MultiFDSendParams *p = NULL; /* make happy gcc */
MultiFDSendData *tmp;
+    QEMU_LOCK_GUARD(&multifd_send_state->multifd_send_mutex);
+
if (multifd_send_should_exit()) {
return false;
}
@@ -824,6 +828,7 @@ bool multifd_send_setup(void)
multifd_send_state->params = g_new0(MultiFDSendParams, thread_count);
qemu_sem_init(&multifd_send_state->channels_created, 0);
qemu_sem_init(&multifd_send_state->channels_ready, 0);
+ qemu_mutex_init(&multifd_send_state->multifd_send_mutex);
qatomic_set(&multifd_send_state->exiting, 0);
multifd_send_state->ops = multifd_ops[migrate_multifd_compression()];
===8<===
Then all the details don't need to change (meanwhile the perf should be
similar)?
--
Peter Xu
^ permalink raw reply related [flat|nested] 128+ messages in thread
* Re: [PATCH v2 12/17] migration/multifd: Device state transfer support - send side
2024-08-27 17:54 ` [PATCH v2 12/17] migration/multifd: Device state transfer support - send side Maciej S. Szmigiero
2024-08-29 0:41 ` Fabiano Rosas
@ 2024-09-10 16:06 ` Peter Xu
2024-09-19 19:49 ` Maciej S. Szmigiero
1 sibling, 1 reply; 128+ messages in thread
From: Peter Xu @ 2024-09-10 16:06 UTC (permalink / raw)
To: Maciej S. Szmigiero
Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On Tue, Aug 27, 2024 at 07:54:31PM +0200, Maciej S. Szmigiero wrote:
> +bool multifd_queue_device_state(char *idstr, uint32_t instance_id,
> + char *data, size_t len)
> +{
> + /* Device state submissions can come from multiple threads */
> + QEMU_LOCK_GUARD(&queue_job_mutex);
Ah, just notice there's the mutex.
So please consider the reply in the other thread: IIUC we can make the
mutex in multifd_send() a generic one to simplify the other patch too, then
drop the one here.
I assume the RAM code should be fine taking one more mutex even without
VFIO: it only takes it once for each ~128 pages to enqueue, and only in
the main thread, so each update shouldn't be in the hot path
(e.g. no cache bouncing).
> + MultiFDDeviceState_t *device_state;
> +
> + assert(multifd_payload_empty(device_state_send));
> +
> + multifd_set_payload_type(device_state_send, MULTIFD_PAYLOAD_DEVICE_STATE);
> + device_state = &device_state_send->u.device_state;
> + device_state->idstr = g_strdup(idstr);
> + device_state->instance_id = instance_id;
> + device_state->buf = g_memdup2(data, len);
> + device_state->buf_len = len;
> +
> + if (!multifd_send(&device_state_send)) {
> + multifd_send_data_clear(device_state_send);
> + return false;
> + }
> +
> + return true;
> +}
--
Peter Xu
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [PATCH v2 12/17] migration/multifd: Device state transfer support - send side
2024-08-29 0:41 ` Fabiano Rosas
2024-08-29 20:03 ` Maciej S. Szmigiero
@ 2024-09-10 19:48 ` Peter Xu
2024-09-12 18:43 ` Fabiano Rosas
2024-09-19 19:49 ` Maciej S. Szmigiero
1 sibling, 2 replies; 128+ messages in thread
From: Peter Xu @ 2024-09-10 19:48 UTC (permalink / raw)
To: Fabiano Rosas, mail
Cc: Maciej S. Szmigiero, Alex Williamson, Cédric Le Goater,
Eric Blake, Markus Armbruster, Daniel P . Berrangé,
Avihai Horon, Joao Martins, qemu-devel
On Wed, Aug 28, 2024 at 09:41:17PM -0300, Fabiano Rosas wrote:
> > +size_t multifd_device_state_payload_size(void)
> > +{
> > + return sizeof(MultiFDDeviceState_t);
> > +}
>
> > This will not be necessary because the payload size is the same as the
> > size of the data type. We only need it for the special case where the
> > MultiFDPages_t is smaller than the total ram payload size.
Today I was thinking maybe we should really clean this up, as the current
multifd_send_data_alloc() is indeed too tricky (blame me.. who requested
that more or less). Knowing that VFIO can use dynamic buffers with ->idstr
and ->buf (I was thinking it could be buf[1M].. but I was wrong...) made
that feeling stronger.
I think we should change it now perhaps, otherwise we'll need to introduce
other helpers to e.g. reset the device buffers, and that's not only slow
but also not good looking, IMO.
So I went ahead with the idea from the previous discussion, and managed to
change the SendData union into a struct; the memory consumption is not super
important yet, IMHO, but we should still stick with the object model where
the multifd enqueue thread switches buffers with multifd, as that still
sounds like a sane way to do it.
Then when that patch is ready, I further tried to make VFIO reuse multifd
buffers just like what we do with MultiFDPages_t->offset[]: in RAM code we
don't allocate it every time we enqueue.
I hope it'll also work for VFIO. VFIO has a specialty in being able to
dump the config space, so it's more complex (and I noticed Maciej's current
design requires the final chunk of VFIO config data to be migrated in one
packet.. that is also part of the complexity there). So I allowed that
part to allocate a buffer, but only that. IOW, I made some API (see below)
that can either reuse a preallocated buffer, or use a separate one only for
the final bulk.
In short, could both of you have a look at what I came up with below? I
did that in patches because I think it's too much to describe in comments,
so patches may work better. If any of the below looks like a good change to
you, then either Maciej can squash whatever fits into the existing patches
(and I feel like some existing patches in this series can go away with the
design below), or I can post prerequisite patches, but only if either of you
prefers that.
I also wonder whether there can be any perf difference already (I tested
all multifd qtests with the below, but have no VFIO I can run); perhaps not
that much, but just to mention that the below should avoid both buffer
allocations and one round of copying (so VFIO's read() now writes directly
to the multifd buffers).
Thanks,
==========8<==========
From a6cbcf692b2376e72cc053219d67bb32eabfb7a6 Mon Sep 17 00:00:00 2001
From: Peter Xu <peterx@redhat.com>
Date: Tue, 10 Sep 2024 12:10:59 -0400
Subject: [PATCH 1/3] migration/multifd: Make MultiFDSendData a struct
The newly introduced device state buffer can be used either for storing
VFIO's raw read() data, or for storing generic device states. After
noticing that device states may not easily provide a max buffer size (and
that the RAM MultiFDPages_t, after all, also wants flexibility in managing
its offset[] array), it may not be a good idea to stick with a union in
MultiFDSendData, as it won't play well with such flexibility.
Switch MultiFDSendData to a struct.
It won't consume a lot more space in reality, after all the real buffers
were already dynamically allocated, so it's so far only about the two
structs (pages, device_state) that will be duplicated, but they're small.
With this, we can remove the pretty hard to understand alloc size logic,
because now we can allocate offset[] together with the SendData and
properly free it when the SendData is freed.
Signed-off-by: Peter Xu <peterx@redhat.com>
---
migration/multifd.h | 16 +++++++++++-----
migration/multifd-device-state.c | 8 ++++++--
migration/multifd-nocomp.c | 13 ++++++-------
migration/multifd.c | 25 ++++++-------------------
4 files changed, 29 insertions(+), 33 deletions(-)
diff --git a/migration/multifd.h b/migration/multifd.h
index c15c83104c..47203334b9 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -98,9 +98,13 @@ typedef struct {
uint32_t num;
/* number of normal pages */
uint32_t normal_num;
+ /*
+ * Pointer to the ramblock. NOTE: it's caller's responsibility to make
+ * sure the pointer is always valid!
+ */
RAMBlock *block;
- /* offset of each page */
- ram_addr_t offset[];
+ /* offset array of each page, managed by multifd */
+ ram_addr_t *offset;
} MultiFDPages_t;
struct MultiFDRecvData {
@@ -123,7 +127,7 @@ typedef enum {
MULTIFD_PAYLOAD_DEVICE_STATE,
} MultiFDPayloadType;
-typedef union MultiFDPayload {
+typedef struct MultiFDPayload {
MultiFDPages_t ram;
MultiFDDeviceState_t device_state;
} MultiFDPayload;
@@ -323,11 +327,13 @@ static inline uint32_t multifd_ram_page_count(void)
void multifd_ram_save_setup(void);
void multifd_ram_save_cleanup(void);
int multifd_ram_flush_and_sync(void);
-size_t multifd_ram_payload_size(void);
+void multifd_ram_payload_alloc(MultiFDPages_t *pages);
+void multifd_ram_payload_free(MultiFDPages_t *pages);
void multifd_ram_fill_packet(MultiFDSendParams *p);
int multifd_ram_unfill_packet(MultiFDRecvParams *p, Error **errp);
-size_t multifd_device_state_payload_size(void);
+void multifd_device_state_payload_alloc(MultiFDDeviceState_t *device_state);
+void multifd_device_state_payload_free(MultiFDDeviceState_t *device_state);
void multifd_device_state_save_setup(void);
void multifd_device_state_clear(MultiFDDeviceState_t *device_state);
void multifd_device_state_save_cleanup(void);
diff --git a/migration/multifd-device-state.c b/migration/multifd-device-state.c
index 9b364e8ef3..72b72b6e62 100644
--- a/migration/multifd-device-state.c
+++ b/migration/multifd-device-state.c
@@ -22,9 +22,13 @@ bool send_threads_abort;
static MultiFDSendData *device_state_send;
-size_t multifd_device_state_payload_size(void)
+/* TODO: use static buffers for idstr and buf */
+void multifd_device_state_payload_alloc(MultiFDDeviceState_t *device_state)
+{
+}
+
+void multifd_device_state_payload_free(MultiFDDeviceState_t *device_state)
{
- return sizeof(MultiFDDeviceState_t);
}
void multifd_device_state_save_setup(void)
diff --git a/migration/multifd-nocomp.c b/migration/multifd-nocomp.c
index 0b7b543f44..c1b95fee0d 100644
--- a/migration/multifd-nocomp.c
+++ b/migration/multifd-nocomp.c
@@ -22,15 +22,14 @@
static MultiFDSendData *multifd_ram_send;
-size_t multifd_ram_payload_size(void)
+void multifd_ram_payload_alloc(MultiFDPages_t *pages)
{
- uint32_t n = multifd_ram_page_count();
+ pages->offset = g_new0(ram_addr_t, multifd_ram_page_count());
+}
- /*
- * We keep an array of page offsets at the end of MultiFDPages_t,
- * add space for it in the allocation.
- */
- return sizeof(MultiFDPages_t) + n * sizeof(ram_addr_t);
+void multifd_ram_payload_free(MultiFDPages_t *pages)
+{
+ g_clear_pointer(&pages->offset, g_free);
}
void multifd_ram_save_setup(void)
diff --git a/migration/multifd.c b/migration/multifd.c
index bebe5b5a9b..5a20b831cf 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -101,26 +101,12 @@ struct {
MultiFDSendData *multifd_send_data_alloc(void)
{
- size_t max_payload_size, size_minus_payload;
+ MultiFDSendData *new = g_new0(MultiFDSendData, 1);
- /*
- * MultiFDPages_t has a flexible array at the end, account for it
- * when allocating MultiFDSendData. Use max() in case other types
- * added to the union in the future are larger than
- * (MultiFDPages_t + flex array).
- */
- max_payload_size = MAX(multifd_ram_payload_size(),
- multifd_device_state_payload_size());
- max_payload_size = MAX(max_payload_size, sizeof(MultiFDPayload));
-
- /*
- * Account for any holes the compiler might insert. We can't pack
- * the structure because that misaligns the members and triggers
- * Waddress-of-packed-member.
- */
- size_minus_payload = sizeof(MultiFDSendData) - sizeof(MultiFDPayload);
+ multifd_ram_payload_alloc(&new->u.ram);
+ multifd_device_state_payload_alloc(&new->u.device_state);
- return g_malloc0(size_minus_payload + max_payload_size);
+ return new;
}
void multifd_send_data_clear(MultiFDSendData *data)
@@ -147,7 +133,8 @@ void multifd_send_data_free(MultiFDSendData *data)
return;
}
- multifd_send_data_clear(data);
+ multifd_ram_payload_free(&data->u.ram);
+ multifd_device_state_payload_free(&data->u.device_state);
g_free(data);
}
--
2.45.0
From 6695d134c0818f42183f5ea03c21e6d56e7b57ea Mon Sep 17 00:00:00 2001
From: Peter Xu <peterx@redhat.com>
Date: Tue, 10 Sep 2024 12:24:14 -0400
Subject: [PATCH 2/3] migration/multifd: Optimize device_state->idstr updates
The duplication / allocation of idstr for each VFIO blob is overkill, as
idstr isn't something that changes frequently. Also, the idstr always comes
from the upper layer's se->idstr, so it's always guaranteed to
exist (e.g. no device unplug is allowed during migration).
Signed-off-by: Peter Xu <peterx@redhat.com>
---
migration/multifd.h | 4 ++++
migration/multifd-device-state.c | 10 +++++++---
2 files changed, 11 insertions(+), 3 deletions(-)
diff --git a/migration/multifd.h b/migration/multifd.h
index 47203334b9..1eaa5d4496 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -115,6 +115,10 @@ struct MultiFDRecvData {
};
typedef struct {
+ /*
+ * Name of the owner device. NOTE: it's caller's responsibility to
+ * make sure the pointer is always valid!
+ */
char *idstr;
uint32_t instance_id;
char *buf;
diff --git a/migration/multifd-device-state.c b/migration/multifd-device-state.c
index 72b72b6e62..cfd0465bac 100644
--- a/migration/multifd-device-state.c
+++ b/migration/multifd-device-state.c
@@ -44,7 +44,7 @@ void multifd_device_state_save_setup(void)
void multifd_device_state_clear(MultiFDDeviceState_t *device_state)
{
- g_clear_pointer(&device_state->idstr, g_free);
+ device_state->idstr = NULL;
g_clear_pointer(&device_state->buf, g_free);
}
@@ -100,7 +100,12 @@ bool multifd_queue_device_state(char *idstr, uint32_t instance_id,
multifd_set_payload_type(device_state_send, MULTIFD_PAYLOAD_DEVICE_STATE);
device_state = &device_state_send->u.device_state;
- device_state->idstr = g_strdup(idstr);
+ /*
+ * NOTE: here we must use a static idstr (e.g. of a savevm state
+ * entry) rather than any dynamically allocated buffer, because multifd
+ * assumes this pointer is always valid!
+ */
+ device_state->idstr = idstr;
device_state->instance_id = instance_id;
device_state->buf = g_memdup2(data, len);
device_state->buf_len = len;
@@ -137,7 +142,6 @@ static void multifd_device_state_save_thread_data_free(void *opaque)
{
struct MultiFDDSSaveThreadData *data = opaque;
- g_clear_pointer(&data->idstr, g_free);
g_free(data);
}
--
2.45.0
From abfea9698ff46ad0e0175e1dcc6e005e0b2ece2a Mon Sep 17 00:00:00 2001
From: Peter Xu <peterx@redhat.com>
Date: Tue, 10 Sep 2024 12:27:49 -0400
Subject: [PATCH 3/3] migration/multifd: Optimize device_state buffer
allocations
Provide a device_state->buf_prealloc so that the buffers can be reused if
possible. Provide a set of APIs to use it right. Please see the
documentation for the API in the code.
The default buffer size came from VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE as of
now.
Signed-off-by: Peter Xu <peterx@redhat.com>
---
include/hw/vfio/vfio-common.h | 9 ++++
include/migration/misc.h | 12 ++++-
migration/multifd.h | 11 +++-
hw/vfio/migration.c | 43 ++++++++-------
migration/multifd-device-state.c | 93 +++++++++++++++++++++++++-------
migration/multifd.c | 9 ----
6 files changed, 126 insertions(+), 51 deletions(-)
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 4578a0ca6a..c1f2f4ae55 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -61,6 +61,13 @@ typedef struct VFIORegion {
uint8_t nr; /* cache the region number for debug */
} VFIORegion;
+typedef struct VFIODeviceStatePacket {
+ uint32_t version;
+ uint32_t idx;
+ uint32_t flags;
+ uint8_t data[0];
+} QEMU_PACKED VFIODeviceStatePacket;
+
typedef struct VFIOMigration {
struct VFIODevice *vbasedev;
VMChangeStateEntry *vm_state;
@@ -168,6 +175,8 @@ typedef struct VFIODevice {
int devid;
IOMMUFDBackend *iommufd;
VFIOIOASHwpt *hwpt;
+ /* Only used on sender side when multifd is enabled */
+ VFIODeviceStatePacket *multifd_packet;
QLIST_ENTRY(VFIODevice) hwpt_next;
} VFIODevice;
diff --git a/include/migration/misc.h b/include/migration/misc.h
index 26f7f3140f..1a8676ed3d 100644
--- a/include/migration/misc.h
+++ b/include/migration/misc.h
@@ -112,8 +112,16 @@ bool migration_in_bg_snapshot(void);
void dirty_bitmap_mig_init(void);
/* migration/multifd-device-state.c */
-bool multifd_queue_device_state(char *idstr, uint32_t instance_id,
- char *data, size_t len);
+struct MultiFDDeviceState_t;
+typedef struct MultiFDDeviceState_t MultiFDDeviceState_t;
+
+MultiFDDeviceState_t *
+multifd_device_state_prepare(char *idstr, uint32_t instance_id);
+void *multifd_device_state_get_buffer(MultiFDDeviceState_t *s,
+ int64_t *buf_len);
+bool multifd_device_state_finish(MultiFDDeviceState_t *state,
+ void *buf, int64_t buf_len);
+
bool migration_has_device_state_support(void);
void
diff --git a/migration/multifd.h b/migration/multifd.h
index 1eaa5d4496..1ccdeeb8c5 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -15,6 +15,7 @@
#include "exec/target_page.h"
#include "ram.h"
+#include "migration/misc.h"
typedef struct MultiFDRecvData MultiFDRecvData;
typedef struct MultiFDSendData MultiFDSendData;
@@ -114,16 +115,22 @@ struct MultiFDRecvData {
off_t file_offset;
};
-typedef struct {
+struct MultiFDDeviceState_t {
/*
* Name of the owner device. NOTE: it's caller's responsibility to
* make sure the pointer is always valid!
*/
char *idstr;
uint32_t instance_id;
+ /*
+ * Points to the buffer to send via multifd. Normally it's the same as
+ * buf_prealloc, otherwise the caller needs to make sure the buffer is
+     * available while multifd is running.
+ */
char *buf;
+ char *buf_prealloc;
size_t buf_len;
-} MultiFDDeviceState_t;
+};
typedef enum {
MULTIFD_PAYLOAD_NONE,
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 67996aa2df..e36422b7c5 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -59,13 +59,6 @@
#define VFIO_DEVICE_STATE_CONFIG_STATE (1)
-typedef struct VFIODeviceStatePacket {
- uint32_t version;
- uint32_t idx;
- uint32_t flags;
- uint8_t data[0];
-} QEMU_PACKED VFIODeviceStatePacket;
-
static int64_t bytes_transferred;
static const char *mig_state_to_str(enum vfio_device_mig_state state)
@@ -741,6 +734,9 @@ static void vfio_save_cleanup(void *opaque)
migration->initial_data_sent = false;
vfio_migration_cleanup(vbasedev);
trace_vfio_save_cleanup(vbasedev->name);
+ if (vbasedev->multifd_packet) {
+ g_clear_pointer(&vbasedev->multifd_packet, g_free);
+ }
}
static void vfio_state_pending_estimate(void *opaque, uint64_t *must_precopy,
@@ -892,7 +888,8 @@ static int vfio_save_complete_precopy_async_thread_config_state(VFIODevice *vbas
g_autoptr(QIOChannelBuffer) bioc = NULL;
QEMUFile *f = NULL;
int ret;
- g_autofree VFIODeviceStatePacket *packet = NULL;
+ VFIODeviceStatePacket *packet;
+ MultiFDDeviceState_t *state;
size_t packet_len;
bioc = qio_channel_buffer_new(0);
@@ -911,13 +908,19 @@ static int vfio_save_complete_precopy_async_thread_config_state(VFIODevice *vbas
}
packet_len = sizeof(*packet) + bioc->usage;
- packet = g_malloc0(packet_len);
+
+ state = multifd_device_state_prepare(idstr, instance_id);
+ /*
+     * Do not reuse the multifd buffer; use our own due to the variable size.
+     * The buffer will be freed only at save cleanup.
+ */
+ vbasedev->multifd_packet = g_malloc0(packet_len);
+ packet = vbasedev->multifd_packet;
packet->idx = idx;
packet->flags = VFIO_DEVICE_STATE_CONFIG_STATE;
memcpy(&packet->data, bioc->data, bioc->usage);
- if (!multifd_queue_device_state(idstr, instance_id,
- (char *)packet, packet_len)) {
+ if (!multifd_device_state_finish(state, packet, packet_len)) {
ret = -1;
}
@@ -936,7 +939,6 @@ static int vfio_save_complete_precopy_thread(char *idstr,
VFIODevice *vbasedev = opaque;
VFIOMigration *migration = vbasedev->migration;
int ret;
- g_autofree VFIODeviceStatePacket *packet = NULL;
uint32_t idx;
if (!migration->multifd_transfer) {
@@ -954,21 +956,25 @@ static int vfio_save_complete_precopy_thread(char *idstr,
goto ret_finish;
}
- packet = g_malloc0(sizeof(*packet) + migration->data_buffer_size);
-
for (idx = 0; ; idx++) {
+ VFIODeviceStatePacket *packet;
+ MultiFDDeviceState_t *state;
ssize_t data_size;
size_t packet_size;
+ int64_t buf_size;
if (qatomic_read(abort_flag)) {
ret = -ECANCELED;
goto ret_finish;
}
+ state = multifd_device_state_prepare(idstr, instance_id);
+ packet = multifd_device_state_get_buffer(state, &buf_size);
data_size = read(migration->data_fd, &packet->data,
- migration->data_buffer_size);
+ buf_size - sizeof(*packet));
if (data_size < 0) {
if (errno != ENOMSG) {
+ multifd_device_state_finish(state, NULL, 0);
ret = -errno;
goto ret_finish;
}
@@ -980,14 +986,15 @@ static int vfio_save_complete_precopy_thread(char *idstr,
data_size = 0;
}
- if (data_size == 0)
+ if (data_size == 0) {
+ multifd_device_state_finish(state, NULL, 0);
break;
+ }
packet->idx = idx;
packet_size = sizeof(*packet) + data_size;
- if (!multifd_queue_device_state(idstr, instance_id,
- (char *)packet, packet_size)) {
+ if (!multifd_device_state_finish(state, packet, packet_size)) {
ret = -1;
goto ret_finish;
}
diff --git a/migration/multifd-device-state.c b/migration/multifd-device-state.c
index cfd0465bac..6f0259426d 100644
--- a/migration/multifd-device-state.c
+++ b/migration/multifd-device-state.c
@@ -20,15 +20,18 @@ ThreadPool *send_threads;
int send_threads_ret;
bool send_threads_abort;
+#define MULTIFD_DEVICE_STATE_BUFLEN (1UL << 20)
+
static MultiFDSendData *device_state_send;
-/* TODO: use static buffers for idstr and buf */
void multifd_device_state_payload_alloc(MultiFDDeviceState_t *device_state)
{
+ device_state->buf_prealloc = g_malloc0(MULTIFD_DEVICE_STATE_BUFLEN);
}
void multifd_device_state_payload_free(MultiFDDeviceState_t *device_state)
{
+ g_clear_pointer(&device_state->buf_prealloc, g_free);
}
void multifd_device_state_save_setup(void)
@@ -42,12 +45,6 @@ void multifd_device_state_save_setup(void)
send_threads_abort = false;
}
-void multifd_device_state_clear(MultiFDDeviceState_t *device_state)
-{
- device_state->idstr = NULL;
- g_clear_pointer(&device_state->buf, g_free);
-}
-
void multifd_device_state_save_cleanup(void)
{
g_clear_pointer(&send_threads, thread_pool_free);
@@ -89,33 +86,89 @@ void multifd_device_state_send_prepare(MultiFDSendParams *p)
multifd_device_state_fill_packet(p);
}
-bool multifd_queue_device_state(char *idstr, uint32_t instance_id,
- char *data, size_t len)
+/*
+ * Prepare to send some device state via multifd. Returns the current idle
+ * MultiFDDeviceState_t*.
+ *
+ * As a follow up, the caller must call multifd_device_state_finish() to
+ * release the resources.
+ *
+ * One example usage of the API:
+ *
+ * // Fetch a free multifd device state object
+ * state = multifd_device_state_prepare(idstr, instance_id);
+ *
+ * // Optional: fetch the buffer to reuse
+ * buf = multifd_device_state_get_buffer(state, &buf_size);
+ *
+ * // Here len>0 means success, otherwise failure
+ * len = buffer_fill(buf, buf_size);
+ *
+ * // Finish the transaction, either enqueue or cancel the request. Here
+ * // len>0 will enqueue, <=0 will cancel.
+ * multifd_device_state_finish(state, buf, len);
+ */
+MultiFDDeviceState_t *
+multifd_device_state_prepare(char *idstr, uint32_t instance_id)
{
- /* Device state submissions can come from multiple threads */
- QEMU_LOCK_GUARD(&queue_job_mutex);
MultiFDDeviceState_t *device_state;
assert(multifd_payload_empty(device_state_send));
- multifd_set_payload_type(device_state_send, MULTIFD_PAYLOAD_DEVICE_STATE);
+ /*
+ * TODO: The lock name may need change, but I'm reusing just for
+ * simplicity.
+ */
+ qemu_mutex_lock(&queue_job_mutex);
+
device_state = &device_state_send->u.device_state;
/*
- * NOTE: here we must use a static idstr (e.g. of a savevm state
- * entry) rather than any dynamically allocated buffer, because multifd
+ * NOTE: here we must use a static idstr (e.g. of a savevm state entry)
+ * rather than any dynamically allocated buffer, because multifd
* assumes this pointer is always valid!
*/
device_state->idstr = idstr;
device_state->instance_id = instance_id;
- device_state->buf = g_memdup2(data, len);
- device_state->buf_len = len;
- if (!multifd_send(&device_state_send)) {
- multifd_send_data_clear(device_state_send);
- return false;
+ return &device_state_send->u.device_state;
+}
+
+/*
+ * Must be used after a previous call to multifd_device_state_prepare();
+ * the buffer must not be used after invoking multifd_device_state_finish().
+ */
+void *multifd_device_state_get_buffer(MultiFDDeviceState_t *s,
+ int64_t *buf_len)
+{
+ *buf_len = MULTIFD_DEVICE_STATE_BUFLEN;
+ return s->buf_prealloc;
+}
+
+/*
+ * Must be used only in pair with a previous call to
+ * multifd_device_state_prepare(). Returns true if the enqueue was
+ * successful, false otherwise.
+ */
+bool multifd_device_state_finish(MultiFDDeviceState_t *state,
+ void *buf, int64_t buf_len)
+{
+ bool result = false;
+
+ /* Currently we only have one global free buffer */
+ assert(state == &device_state_send->u.device_state);
+
+ if (buf_len < 0) {
+ goto out;
}
- return true;
+ multifd_set_payload_type(device_state_send, MULTIFD_PAYLOAD_DEVICE_STATE);
+ /* This normally will be the state->buf_prealloc, but not required */
+ state->buf = buf;
+ state->buf_len = buf_len;
+ result = multifd_send(&device_state_send);
+out:
+ qemu_mutex_unlock(&queue_job_mutex);
+ return result;
}
bool migration_has_device_state_support(void)
diff --git a/migration/multifd.c b/migration/multifd.c
index 5a20b831cf..2b5185e298 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -115,15 +115,6 @@ void multifd_send_data_clear(MultiFDSendData *data)
return;
}
- switch (data->type) {
- case MULTIFD_PAYLOAD_DEVICE_STATE:
- multifd_device_state_clear(&data->u.device_state);
- break;
- default:
- /* Nothing to do */
- break;
- }
-
data->type = MULTIFD_PAYLOAD_NONE;
}
--
2.45.0
--
Peter Xu
^ permalink raw reply related [flat|nested] 128+ messages in thread
* Re: [PATCH v2 01/17] vfio/migration: Add save_{iterate,complete_precopy}_started trace events
2024-09-09 18:04 ` Maciej S. Szmigiero
@ 2024-09-11 14:50 ` Avihai Horon
0 siblings, 0 replies; 128+ messages in thread
From: Avihai Horon @ 2024-09-11 14:50 UTC (permalink / raw)
To: Maciej S. Szmigiero
Cc: Alex Williamson, Fabiano Rosas, Peter Xu, Cédric Le Goater,
Eric Blake, Markus Armbruster, Daniel P . Berrangé,
Joao Martins, qemu-devel
On 09/09/2024 21:04, Maciej S. Szmigiero wrote:
>
> On 5.09.2024 15:08, Avihai Horon wrote:
>> Hi Maciej,
>>
>> On 27/08/2024 20:54, Maciej S. Szmigiero wrote:
>>>
>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>
>>> This way both the start and end points of migrating a particular VFIO
>>> device are known.
>>>
>>> Add also a vfio_save_iterate_empty_hit trace event so it is known when
>>> there's no more data to send for that device.
>>
>> Out of curiosity, what are these traces used for?
>
> Just for benchmarking; collecting this data makes it easier to
> reason about where possible bottlenecks may be.
>
>>>
>>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>>> ---
>>> hw/vfio/migration.c | 13 +++++++++++++
>>> hw/vfio/trace-events | 3 +++
>>> include/hw/vfio/vfio-common.h | 3 +++
>>> 3 files changed, 19 insertions(+)
>>>
>>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>>> index 262d42a46e58..24679d8c5034 100644
>>> --- a/hw/vfio/migration.c
>>> +++ b/hw/vfio/migration.c
>>> @@ -472,6 +472,9 @@ static int vfio_save_setup(QEMUFile *f, void
>>> *opaque, Error **errp)
>>> return -ENOMEM;
>>> }
>>>
>>> + migration->save_iterate_run = false;
>>> + migration->save_iterate_empty_hit = false;
>>> +
>>> if (vfio_precopy_supported(vbasedev)) {
>>> switch (migration->device_state) {
>>> case VFIO_DEVICE_STATE_RUNNING:
>>> @@ -605,9 +608,17 @@ static int vfio_save_iterate(QEMUFile *f, void
>>> *opaque)
>>> VFIOMigration *migration = vbasedev->migration;
>>> ssize_t data_size;
>>>
>>> + if (!migration->save_iterate_run) {
>>> + trace_vfio_save_iterate_started(vbasedev->name);
>>> + migration->save_iterate_run = true;
>>
>> Maybe rename save_iterate_run to save_iterate_started so it's aligned
>> with trace_vfio_save_iterate_started and
>> trace_vfio_save_complete_precopy_started?
>
> Will do.
>
>>> + }
>>> +
>>> data_size = vfio_save_block(f, migration);
>>> if (data_size < 0) {
>>> return data_size;
>>> + } else if (data_size == 0 && !migration->save_iterate_empty_hit) {
>>> + trace_vfio_save_iterate_empty_hit(vbasedev->name);
>>> + migration->save_iterate_empty_hit = true;
>>
>> During precopy we could hit empty multiple times. Any reason why only
>> the first time should be traced?
>
> This trace point is supposed to indicate whether the device state
> transfer during the time the VM was still running likely has
> exhausted the amount of data that can be transferred during
> that phase.
>
> In other words, the stopped-time device state transfer likely
> only had to transfer the data which the device does not support
> transferring during the live VM phase (with just a small possible
> residual accrued since that trace point was hit).
>
> If that trace point was hit then delaying the switchover point
> further likely wouldn't help the device transfer less data during
> the downtime.
Ah, I see.
Can we achieve the same goal by using trace_vfio_state_pending_exact()
instead?
Thanks.
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [PATCH v2 09/17] migration/multifd: Device state transfer support - receive side
2024-09-09 18:05 ` Maciej S. Szmigiero
@ 2024-09-12 8:13 ` Avihai Horon
2024-09-12 13:52 ` Fabiano Rosas
0 siblings, 1 reply; 128+ messages in thread
From: Avihai Horon @ 2024-09-12 8:13 UTC (permalink / raw)
To: Maciej S. Szmigiero
Cc: Alex Williamson, Fabiano Rosas, Peter Xu, Cédric Le Goater,
Eric Blake, Markus Armbruster, Daniel P . Berrangé,
Joao Martins, qemu-devel
On 09/09/2024 21:05, Maciej S. Szmigiero wrote:
>
> On 5.09.2024 18:47, Avihai Horon wrote:
>>
>> On 27/08/2024 20:54, Maciej S. Szmigiero wrote:
>>>
>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>
>>> Add a basic support for receiving device state via multifd channels -
>>> channels that are shared with RAM transfers.
>>>
>>> To differentiate between a device state and a RAM packet the packet
>>> header is read first.
>>>
>>> Depending whether MULTIFD_FLAG_DEVICE_STATE flag is present or not
>>> in the
>>> packet header either device state (MultiFDPacketDeviceState_t) or RAM
>>> data (existing MultiFDPacket_t) is then read.
>>>
>>> The received device state data is provided to
>>> qemu_loadvm_load_state_buffer() function for processing in the
>>> device's load_state_buffer handler.
>>>
>>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>>> ---
>>> migration/multifd.c | 127
>>> +++++++++++++++++++++++++++++++++++++-------
>>> migration/multifd.h | 31 ++++++++++-
>>> 2 files changed, 138 insertions(+), 20 deletions(-)
>>>
>>> diff --git a/migration/multifd.c b/migration/multifd.c
>>> index b06a9fab500e..d5a8e5a9c9b5 100644
>>> --- a/migration/multifd.c
>>> +++ b/migration/multifd.c
>>> @@ -21,6 +21,7 @@
>>> #include "file.h"
>>> #include "migration.h"
>>> #include "migration-stats.h"
>>> +#include "savevm.h"
>>> #include "socket.h"
>>> #include "tls.h"
>>> #include "qemu-file.h"
>>> @@ -209,10 +210,10 @@ void
>>> multifd_send_fill_packet(MultiFDSendParams *p)
>>>
>>> memset(packet, 0, p->packet_len);
>>>
>>> - packet->magic = cpu_to_be32(MULTIFD_MAGIC);
>>> - packet->version = cpu_to_be32(MULTIFD_VERSION);
>>> + packet->hdr.magic = cpu_to_be32(MULTIFD_MAGIC);
>>> + packet->hdr.version = cpu_to_be32(MULTIFD_VERSION);
>>>
>>> - packet->flags = cpu_to_be32(p->flags);
>>> + packet->hdr.flags = cpu_to_be32(p->flags);
>>> packet->next_packet_size = cpu_to_be32(p->next_packet_size);
>>>
>>> packet_num = qatomic_fetch_inc(&multifd_send_state->packet_num);
>>> @@ -228,31 +229,49 @@ void
>>> multifd_send_fill_packet(MultiFDSendParams *p)
>>> p->flags, p->next_packet_size);
>>> }
>>>
>>> -static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error
>>> **errp)
>>> +static int multifd_recv_unfill_packet_header(MultiFDRecvParams *p,
>>> + MultiFDPacketHdr_t *hdr,
>>> + Error **errp)
>>> {
>>> - MultiFDPacket_t *packet = p->packet;
>>> - int ret = 0;
>>> -
>>> - packet->magic = be32_to_cpu(packet->magic);
>>> - if (packet->magic != MULTIFD_MAGIC) {
>>> + hdr->magic = be32_to_cpu(hdr->magic);
>>> + if (hdr->magic != MULTIFD_MAGIC) {
>>> error_setg(errp, "multifd: received packet "
>>> "magic %x and expected magic %x",
>>> - packet->magic, MULTIFD_MAGIC);
>>> + hdr->magic, MULTIFD_MAGIC);
>>> return -1;
>>> }
>>>
>>> - packet->version = be32_to_cpu(packet->version);
>>> - if (packet->version != MULTIFD_VERSION) {
>>> + hdr->version = be32_to_cpu(hdr->version);
>>> + if (hdr->version != MULTIFD_VERSION) {
>>> error_setg(errp, "multifd: received packet "
>>> "version %u and expected version %u",
>>> - packet->version, MULTIFD_VERSION);
>>> + hdr->version, MULTIFD_VERSION);
>>> return -1;
>>> }
>>>
>>> - p->flags = be32_to_cpu(packet->flags);
>>> + p->flags = be32_to_cpu(hdr->flags);
>>> +
>>> + return 0;
>>> +}
>>> +
>>> +static int
>>> multifd_recv_unfill_packet_device_state(MultiFDRecvParams *p,
>>> + Error **errp)
>>> +{
>>> + MultiFDPacketDeviceState_t *packet = p->packet_dev_state;
>>> +
>>> + packet->instance_id = be32_to_cpu(packet->instance_id);
>>> + p->next_packet_size = be32_to_cpu(packet->next_packet_size);
>>> +
>>> + return 0;
>>> +}
>>> +
>>> +static int multifd_recv_unfill_packet_ram(MultiFDRecvParams *p,
>>> Error **errp)
>>> +{
>>> + MultiFDPacket_t *packet = p->packet;
>>> + int ret = 0;
>>> +
>>> p->next_packet_size = be32_to_cpu(packet->next_packet_size);
>>> p->packet_num = be64_to_cpu(packet->packet_num);
>>> - p->packets_recved++;
>>>
>>> if (!(p->flags & MULTIFD_FLAG_SYNC)) {
>>> ret = multifd_ram_unfill_packet(p, errp);
>>> @@ -264,6 +283,19 @@ static int
>>> multifd_recv_unfill_packet(MultiFDRecvParams *p, Error **errp)
>>> return ret;
>>> }
>>>
>>> +static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error
>>> **errp)
>>> +{
>>> + p->packets_recved++;
>>> +
>>> + if (p->flags & MULTIFD_FLAG_DEVICE_STATE) {
>>> + return multifd_recv_unfill_packet_device_state(p, errp);
>>> + } else {
>>> + return multifd_recv_unfill_packet_ram(p, errp);
>>> + }
>>> +
>>> + g_assert_not_reached();
>>
>> We can drop the assert and the "else":
>> if (p->flags & MULTIFD_FLAG_DEVICE_STATE) {
>> return multifd_recv_unfill_packet_device_state(p, errp);
>> }
>>
>> return multifd_recv_unfill_packet_ram(p, errp);
>
> Ack.
>
>>> +}
>>> +
>>> static bool multifd_send_should_exit(void)
>>> {
>>> return qatomic_read(&multifd_send_state->exiting);
>>> diff --git a/migration/multifd.h b/migration/multifd.h
>>> index a3e35196d179..a8f3e4838c01 100644
>>> --- a/migration/multifd.h
>>> +++ b/migration/multifd.h
>>> @@ -45,6 +45,12 @@ MultiFDRecvData *multifd_get_recv_data(void);
>>> #define MULTIFD_FLAG_QPL (4 << 1)
>>> #define MULTIFD_FLAG_UADK (8 << 1)
>>>
>>> +/*
>>> + * If set it means that this packet contains device state
>>> + * (MultiFDPacketDeviceState_t), not RAM data (MultiFDPacket_t).
>>> + */
>>> +#define MULTIFD_FLAG_DEVICE_STATE (1 << 4)
>>> +
>>> /* This value needs to be a multiple of qemu_target_page_size() */
>>> #define MULTIFD_PACKET_SIZE (512 * 1024)
>>>
>>> @@ -52,6 +58,11 @@ typedef struct {
>>> uint32_t magic;
>>> uint32_t version;
>>> uint32_t flags;
>>> +} __attribute__((packed)) MultiFDPacketHdr_t;
>>
>> Maybe split this patch into two: one that adds the packet header
>> concept and another that adds the new device packet?
>
> Can do.
>
>>> +
>>> +typedef struct {
>>> + MultiFDPacketHdr_t hdr;
>>> +
>>> /* maximum number of allocated pages */
>>> uint32_t pages_alloc;
>>> /* non zero pages */
>>> @@ -72,6 +83,16 @@ typedef struct {
>>> uint64_t offset[];
>>> } __attribute__((packed)) MultiFDPacket_t;
>>>
>>> +typedef struct {
>>> + MultiFDPacketHdr_t hdr;
>>> +
>>> + char idstr[256] QEMU_NONSTRING;
>>
>> idstr should be null terminated, or am I missing something?
>
> There's no need to always NULL-terminate a constant-size field,
> since the strncpy() already stops at the field size, so we can
> gain another byte for actual string use this way.
>
> RAM block idstr also uses the same "trick":
>> void multifd_ram_fill_packet(MultiFDSendParams *p):
>> strncpy(packet->ramblock, pages->block->idstr, 256);
>
But can idstr actually be 256 bytes long without a null byte?
There are a lot of places where idstr is a parameter for functions that
expect a null-terminated string, and it is also printed as such.
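E.g. a receiver that wants to print it or pass it on would have to
terminate it explicitly first, something like:

char idstr[sizeof(packet->idstr) + 1];

/* packet->idstr may legally lack a trailing NUL, so add one */
memcpy(idstr, packet->idstr, sizeof(packet->idstr));
idstr[sizeof(packet->idstr)] = '\0';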
Thanks.
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [PATCH v2 15/17] vfio/migration: Multifd device state transfer support - receive side
2024-09-09 18:06 ` Maciej S. Szmigiero
@ 2024-09-12 8:20 ` Avihai Horon
2024-09-12 8:45 ` Cédric Le Goater
0 siblings, 1 reply; 128+ messages in thread
From: Avihai Horon @ 2024-09-12 8:20 UTC (permalink / raw)
To: Maciej S. Szmigiero, Peter Xu
Cc: Alex Williamson, Cédric Le Goater, Fabiano Rosas, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Joao Martins,
qemu-devel
On 09/09/2024 21:06, Maciej S. Szmigiero wrote:
>
> On 9.09.2024 10:55, Avihai Horon wrote:
>>
>> On 27/08/2024 20:54, Maciej S. Szmigiero wrote:
>>>
>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>
>>> The multifd received data needs to be reassembled since device state
>>> packets sent via different multifd channels can arrive out-of-order.
>>>
>>> Therefore, each VFIO device state packet carries a header indicating
>>> its position in the stream.
>>>
>>> The last such VFIO device state packet should have
>>> VFIO_DEVICE_STATE_CONFIG_STATE flag set and carry the device config
>>> state.
>>>
>>> Since it's important to finish loading device state transferred via
>>> the main migration channel (via save_live_iterate handler) before
>>> starting loading the data asynchronously transferred via multifd
>>> a new VFIO_MIG_FLAG_DEV_DATA_STATE_COMPLETE flag is introduced to
>>> mark the end of the main migration channel data.
>>>
>>> The device state loading process waits until that flag is seen before
>>> commencing loading of the multifd-transferred device state.
>>>
>>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>>> ---
>>> hw/vfio/migration.c | 338
>>> +++++++++++++++++++++++++++++++++-
>>> hw/vfio/pci.c | 2 +
>>> hw/vfio/trace-events | 9 +-
>>> include/hw/vfio/vfio-common.h | 17 ++
>>> 4 files changed, 362 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>>> index 24679d8c5034..57c1542528dc 100644
>>> --- a/hw/vfio/migration.c
>>> +++ b/hw/vfio/migration.c
>>> @@ -15,6 +15,7 @@
>>> #include <linux/vfio.h>
>>> #include <sys/ioctl.h>
>>>
>>> +#include "io/channel-buffer.h"
>>> #include "sysemu/runstate.h"
>>> #include "hw/vfio/vfio-common.h"
>>> #include "migration/misc.h"
>>> @@ -47,6 +48,7 @@
>>> #define VFIO_MIG_FLAG_DEV_SETUP_STATE (0xffffffffef100003ULL)
>>> #define VFIO_MIG_FLAG_DEV_DATA_STATE (0xffffffffef100004ULL)
>>> #define VFIO_MIG_FLAG_DEV_INIT_DATA_SENT (0xffffffffef100005ULL)
>>> +#define VFIO_MIG_FLAG_DEV_DATA_STATE_COMPLETE (0xffffffffef100006ULL)
>>>
>>> /*
>>> * This is an arbitrary size based on migration of mlx5 devices,
>>> where typically
>>> @@ -55,6 +57,15 @@
>>> */
>>> #define VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE (1 * MiB)
>>>
>>> +#define VFIO_DEVICE_STATE_CONFIG_STATE (1)
>>> +
>>> +typedef struct VFIODeviceStatePacket {
>>> + uint32_t version;
>>> + uint32_t idx;
>>> + uint32_t flags;
>>> + uint8_t data[0];
>>> +} QEMU_PACKED VFIODeviceStatePacket;
>>> +
>>> static int64_t bytes_transferred;
>>>
>>> static const char *mig_state_to_str(enum vfio_device_mig_state state)
>>> @@ -254,6 +265,188 @@ static int vfio_load_buffer(QEMUFile *f,
>>> VFIODevice *vbasedev,
>>> return ret;
>>> }
>>>
>>> +typedef struct LoadedBuffer {
>>> + bool is_present;
>>> + char *data;
>>> + size_t len;
>>> +} LoadedBuffer;
>>
>> Maybe rename LoadedBuffer to a more specific name, like VFIOStateBuffer?
>
> Will do.
>
>> I also feel like LoadedBuffer deserves a separate commit.
>> Plus, I think it will be good to add a full API for this, that wraps
>> the g_array_* calls and holds the extra members.
>> E.g, VFIOStateBuffer, VFIOStateArray (will hold load_buf_idx,
>> load_buf_idx_last, etc.), vfio_state_array_destroy(),
>> vfio_state_array_alloc(), vfio_state_array_get(), etc...
>> IMHO, this will make it clearer.
>
> Will think about wrapping GArray accesses in separate methods,
> however wrapping a single line GArray call in a separate function
> normally would seem a bit excessive.
Sure, let's do it only if it makes the code cleaner.
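For concreteness, a minimal sketch of the suggested wrapper API (the names
follow the suggestion above and are hypothetical; the clear function is just
loaded_buffer_clear() renamed):

typedef struct VFIOStateBuffer {
    bool is_present;
    char *data;
    size_t len;
} VFIOStateBuffer;

static void vfio_state_buffer_clear(gpointer data)
{
    VFIOStateBuffer *lb = data;

    if (!lb->is_present) {
        return;
    }

    g_clear_pointer(&lb->data, g_free);
    lb->is_present = false;
}

static void vfio_state_buffers_init(GArray **bufs)
{
    *bufs = g_array_new(FALSE, TRUE, sizeof(VFIOStateBuffer));
    g_array_set_clear_func(*bufs, vfio_state_buffer_clear);
}

static void vfio_state_buffers_destroy(GArray **bufs)
{
    g_clear_pointer(bufs, g_array_unref);
}

/* Grows the array on demand so out-of-order packet indices just work */
static VFIOStateBuffer *vfio_state_buffers_at(GArray *bufs, uint32_t idx)
{
    if (idx >= bufs->len) {
        g_array_set_size(bufs, idx + 1);
    }
    return &g_array_index(bufs, VFIOStateBuffer, idx);
}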
>
>>> +
>>> +static void loaded_buffer_clear(gpointer data)
>>> +{
>>> + LoadedBuffer *lb = data;
>>> +
>>> + if (!lb->is_present) {
>>> + return;
>>> + }
>>> +
>>> + g_clear_pointer(&lb->data, g_free);
>>> + lb->is_present = false;
>>> +}
>>> +
>>> +static int vfio_load_state_buffer(void *opaque, char *data, size_t
>>> data_size,
>>> + Error **errp)
>>> +{
>>> + VFIODevice *vbasedev = opaque;
>>> + VFIOMigration *migration = vbasedev->migration;
>>> + VFIODeviceStatePacket *packet = (VFIODeviceStatePacket *)data;
>>> + QEMU_LOCK_GUARD(&migration->load_bufs_mutex);
>>
>> Move lock to where it's needed? I.e., after
>> trace_vfio_load_state_device_buffer_incoming(vbasedev->name,
>> packet->idx)
>
> It's a declaration of a new variable so I guess it should always be
> at the top of the code block in the kernel / QEMU code style?
Yes, but it's opaque to the user.
Looking at other QEMU_LOCK_GUARD call sites in the code, it seems
like people are using it in the middle of code blocks as well.
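I.e., a sketch of the reordering being suggested (assuming the validation
really doesn't touch any state guarded by the mutex):

    /* Validate the packet first, without taking the lock... */
    if (packet->version != 0) {
        error_setg(errp, "packet has unknown version %" PRIu32,
                   packet->version);
        return -1;
    }

    trace_vfio_load_state_device_buffer_incoming(vbasedev->name, packet->idx);

    /* ...and lock only once we start touching migration->load_buf_* state */
    QEMU_LOCK_GUARD(&migration->load_bufs_mutex);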
>
> Also, these checks below are very unlikely to fail and even if they do,
> I doubt a failed migration due to bit stream corruption is a scenario
> worth optimizing run time performance for.
IMO, in this case it's more for readability, but we can go either way
and let the maintainer decide.
>
>>> + LoadedBuffer *lb;
>>> +
>>> + if (data_size < sizeof(*packet)) {
>>> + error_setg(errp, "packet too short at %zu (min is %zu)",
>>> + data_size, sizeof(*packet));
>>> + return -1;
>>> + }
>>> +
>>> + if (packet->version != 0) {
>>> + error_setg(errp, "packet has unknown version %" PRIu32,
>>> + packet->version);
>>> + return -1;
>>> + }
>>> +
>>> + if (packet->idx == UINT32_MAX) {
>>> + error_setg(errp, "packet has too high idx %" PRIu32,
>>> + packet->idx);
>>> + return -1;
>>> + }
>>> +
>>> + trace_vfio_load_state_device_buffer_incoming(vbasedev->name,
>>> packet->idx);
>>> +
>>> + /* config state packet should be the last one in the stream */
>>> + if (packet->flags & VFIO_DEVICE_STATE_CONFIG_STATE) {
>>> + migration->load_buf_idx_last = packet->idx;
>>> + }
>>> +
>>> + assert(migration->load_bufs);
>>> + if (packet->idx >= migration->load_bufs->len) {
>>> + g_array_set_size(migration->load_bufs, packet->idx + 1);
>>> + }
>>> +
>>> + lb = &g_array_index(migration->load_bufs, typeof(*lb),
>>> packet->idx);
>>> + if (lb->is_present) {
>>> + error_setg(errp, "state buffer %" PRIu32 " already filled",
>>> packet->idx);
>>> + return -1;
>>> + }
>>> +
>>> + assert(packet->idx >= migration->load_buf_idx);
>>> +
>>> + migration->load_buf_queued_pending_buffers++;
>>> + if (migration->load_buf_queued_pending_buffers >
>>> + vbasedev->migration_max_queued_buffers) {
>>> + error_setg(errp,
>>> + "queuing state buffer %" PRIu32 " would exceed
>>> the max of %" PRIu64,
>>> + packet->idx,
>>> vbasedev->migration_max_queued_buffers);
>>> + return -1;
>>> + }
>>
>> I feel like max_queued_buffers accounting/checking/configuration
>> should be split to a separate patch that will come after this patch.
>> Also, should we count bytes instead of buffers? Current buffer size
>> is 1MB but this could change, and the normal user should not care or
>> know what is the buffer size.
>> So maybe rename to migration_max_pending_bytes or such?
>
> Since it's Peter that asked for this limit to be introduced in the
> first place
> I would like to ask him what his preference here.
>
> @Peter: max queued buffers or bytes?
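For illustration, the byte-based variant of the check might look roughly like
this (load_buf_pending_bytes and migration_max_pending_bytes are hypothetical
names following the suggestion above):

    migration->load_buf_pending_bytes += data_size - sizeof(*packet);
    if (migration->load_buf_pending_bytes >
        vbasedev->migration_max_pending_bytes) {
        error_setg(errp,
                   "queuing state buffer %" PRIu32 " would exceed the max of %" PRIu64 " bytes",
                   packet->idx, vbasedev->migration_max_pending_bytes);
        return -1;
    }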
>
>>> +
>>> + lb->data = g_memdup2(&packet->data, data_size - sizeof(*packet));
>>> + lb->len = data_size - sizeof(*packet);
>>> + lb->is_present = true;
>>> +
>>> + qemu_cond_broadcast(&migration->load_bufs_buffer_ready_cond);
>>
>> There is only one thread waiting, shouldn't signal be enough?
>
> Will change this to _signal() since it clearly doesn't
> make sense to have a future-proof API here - it's an
> implementation detail.
>
>>> +
>>> + return 0;
>>> +}
>>> +
>>> +static void *vfio_load_bufs_thread(void *opaque)
>>> +{
>>> + VFIODevice *vbasedev = opaque;
>>> + VFIOMigration *migration = vbasedev->migration;
>>> + Error **errp = &migration->load_bufs_thread_errp;
>>> + g_autoptr(QemuLockable) locker = qemu_lockable_auto_lock(
>>> + QEMU_MAKE_LOCKABLE(&migration->load_bufs_mutex));
>>
>> Any special reason to use QemuLockable?
>
> I prefer automatic lock management (RAII-like) for the same reason
> I prefer automatic memory management: it makes it much harder to
> forget to unlock the lock (or free memory) in some error path.
>
> That's the reason these primitives were introduced in QEMU in the
> first place (apparently modeled after its Glib equivalents) and
> why these are being (slowly) introduced to Linux kernel too.
Agree, I guess what I really meant is why not use QEMU_LOCK_GUARD()?
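For what it's worth, one practical difference here: QEMU_LOCK_GUARD() can
only release the mutex at scope exit, while the g_autoptr(QemuLockable) form
additionally allows an early manual unlock, which this thread does before the
final broadcast. A sketch of the contrast:

    /* QEMU_LOCK_GUARD(): unlock happens only when the scope ends */
    {
        QEMU_LOCK_GUARD(&mutex);
        /* ... critical section ... */
    }

    /* g_autoptr(QemuLockable): can also be released early */
    {
        g_autoptr(QemuLockable) locker =
            qemu_lockable_auto_lock(QEMU_MAKE_LOCKABLE(&mutex));
        /* ... critical section ... */
        g_clear_pointer(&locker, qemu_lockable_auto_unlock);
        /* ... code that must run with the mutex dropped ... */
    }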
>
>>> + LoadedBuffer *lb;
>>> +
>>> + while (!migration->load_bufs_device_ready &&
>>> + !migration->load_bufs_thread_want_exit) {
>>> + qemu_cond_wait(&migration->load_bufs_device_ready_cond,
>>> &migration->load_bufs_mutex);
>>> + }
>>> +
>>> + while (!migration->load_bufs_thread_want_exit) {
>>> + bool starved;
>>> + ssize_t ret;
>>> +
>>> + assert(migration->load_buf_idx <=
>>> migration->load_buf_idx_last);
>>> +
>>> + if (migration->load_buf_idx >= migration->load_bufs->len) {
>>> + assert(migration->load_buf_idx ==
>>> migration->load_bufs->len);
>>> + starved = true;
>>> + } else {
>>> + lb = &g_array_index(migration->load_bufs, typeof(*lb),
>>> migration->load_buf_idx);
>>> + starved = !lb->is_present;
>>> + }
>>> +
>>> + if (starved) {
>>> + trace_vfio_load_state_device_buffer_starved(vbasedev->name,
>>> migration->load_buf_idx);
>>> + qemu_cond_wait(&migration->load_bufs_buffer_ready_cond,
>>> &migration->load_bufs_mutex);
>>> + continue;
>>> + }
>>> +
>>> + if (migration->load_buf_idx == migration->load_buf_idx_last) {
>>> + break;
>>> + }
>>> +
>>> + if (migration->load_buf_idx == 0) {
>>> + trace_vfio_load_state_device_buffer_start(vbasedev->name);
>>> + }
>>> +
>>> + if (lb->len) {
>>> + g_autofree char *buf = NULL;
>>> + size_t buf_len;
>>> + int errno_save;
>>> +
>>> + trace_vfio_load_state_device_buffer_load_start(vbasedev->name,
>>> + migration->load_buf_idx);
>>> +
>>> + /* lb might become re-allocated when we drop the lock */
>>> + buf = g_steal_pointer(&lb->data);
>>> + buf_len = lb->len;
>>> +
>>> + /* Loading data to the device takes a while, drop the
>>> lock during this process */
>>> + qemu_mutex_unlock(&migration->load_bufs_mutex);
>>> + ret = write(migration->data_fd, buf, buf_len);
>>> + errno_save = errno;
>>> + qemu_mutex_lock(&migration->load_bufs_mutex);
>>> +
>>> + if (ret < 0) {
>>> + error_setg(errp, "write to state buffer %" PRIu32 "
>>> failed with %d",
>>> + migration->load_buf_idx, errno_save);
>>> + break;
>>> + } else if (ret < buf_len) {
>>> + error_setg(errp, "write to state buffer %" PRIu32 "
>>> incomplete %zd / %zu",
>>> + migration->load_buf_idx, ret, buf_len);
>>> + break;
>>> + }
>>> +
>>> + trace_vfio_load_state_device_buffer_load_end(vbasedev->name,
>>> + migration->load_buf_idx);
>>> + }
>>> +
>>> + assert(migration->load_buf_queued_pending_buffers > 0);
>>> + migration->load_buf_queued_pending_buffers--;
>>> +
>>> + if (migration->load_buf_idx == migration->load_buf_idx_last
>>> - 1) {
>>> + trace_vfio_load_state_device_buffer_end(vbasedev->name);
>>> + }
>>> +
>>> + migration->load_buf_idx++;
>>> + }
>>> +
>>> + if (migration->load_bufs_thread_want_exit &&
>>> + !*errp) {
>>> + error_setg(errp, "load bufs thread asked to quit");
>>> + }
>>> +
>>> + g_clear_pointer(&locker, qemu_lockable_auto_unlock);
>>> +
>>> + qemu_loadvm_load_finish_ready_lock();
>>> + migration->load_bufs_thread_finished = true;
>>> + qemu_loadvm_load_finish_ready_broadcast();
>>> + qemu_loadvm_load_finish_ready_unlock();
>>> +
>>> + return NULL;
>>> +}
>>> +
>>> static int vfio_save_device_config_state(QEMUFile *f, void *opaque,
>>> Error **errp)
>>> {
>>> @@ -285,6 +478,8 @@ static int
>>> vfio_load_device_config_state(QEMUFile *f, void *opaque)
>>> VFIODevice *vbasedev = opaque;
>>> uint64_t data;
>>>
>>> + trace_vfio_load_device_config_state_start(vbasedev->name);
>>
>> Maybe split this and below trace_vfio_load_device_config_state_end to
>> a separate patch?
>
> I guess you mean to add these trace points in a separate patch?
> Can do.
>
>>> +
>>> if (vbasedev->ops && vbasedev->ops->vfio_load_config) {
>>> int ret;
>>>
>>> @@ -303,7 +498,7 @@ static int
>>> vfio_load_device_config_state(QEMUFile *f, void *opaque)
>>> return -EINVAL;
>>> }
>>>
>>> - trace_vfio_load_device_config_state(vbasedev->name);
>>> + trace_vfio_load_device_config_state_end(vbasedev->name);
>>> return qemu_file_get_error(f);
>>> }
>>>
>>> @@ -687,16 +882,70 @@ static void vfio_save_state(QEMUFile *f, void
>>> *opaque)
>>> static int vfio_load_setup(QEMUFile *f, void *opaque, Error **errp)
>>> {
>>> VFIODevice *vbasedev = opaque;
>>> + VFIOMigration *migration = vbasedev->migration;
>>> + int ret;
>>> +
>>> + ret = vfio_migration_set_state(vbasedev,
>>> VFIO_DEVICE_STATE_RESUMING,
>>> + vbasedev->migration->device_state, errp);
>>> + if (ret) {
>>> + return ret;
>>> + }
>>> +
>>> + assert(!migration->load_bufs);
>>> + migration->load_bufs = g_array_new(FALSE, TRUE,
>>> sizeof(LoadedBuffer));
>>> + g_array_set_clear_func(migration->load_bufs, loaded_buffer_clear);
>>> +
>>> + qemu_mutex_init(&migration->load_bufs_mutex);
>>> +
>>> + migration->load_bufs_device_ready = false;
>>> + qemu_cond_init(&migration->load_bufs_device_ready_cond);
>>> +
>>> + migration->load_buf_idx = 0;
>>> + migration->load_buf_idx_last = UINT32_MAX;
>>> + migration->load_buf_queued_pending_buffers = 0;
>>> + qemu_cond_init(&migration->load_bufs_buffer_ready_cond);
>>> +
>>> + migration->config_state_loaded_to_dev = false;
>>> +
>>> + assert(!migration->load_bufs_thread_started);
>>
>> Maybe do all these allocations (and de-allocations) only if multifd
>> device state is supported and enabled?
>> Extracting this to its own function could also be good.
>
> Sure, will try to avoid unnecessarily allocating multifd device state
> related things if this functionality is unavailable anyway.
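A sketch of how the extraction could look, based purely on the code above
(the function name is hypothetical), to be called from vfio_load_setup() only
when multifd device state transfer is actually in use:

static void vfio_multifd_load_setup(VFIODevice *vbasedev)
{
    VFIOMigration *migration = vbasedev->migration;

    assert(!migration->load_bufs);
    migration->load_bufs = g_array_new(FALSE, TRUE, sizeof(LoadedBuffer));
    g_array_set_clear_func(migration->load_bufs, loaded_buffer_clear);

    qemu_mutex_init(&migration->load_bufs_mutex);

    migration->load_bufs_device_ready = false;
    qemu_cond_init(&migration->load_bufs_device_ready_cond);

    migration->load_buf_idx = 0;
    migration->load_buf_idx_last = UINT32_MAX;
    migration->load_buf_queued_pending_buffers = 0;
    qemu_cond_init(&migration->load_bufs_buffer_ready_cond);

    migration->config_state_loaded_to_dev = false;
}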
>
>>>
>>> - return vfio_migration_set_state(vbasedev,
>>> VFIO_DEVICE_STATE_RESUMING,
>>> - vbasedev->migration->device_state, errp);
>>> + migration->load_bufs_thread_finished = false;
>>> + migration->load_bufs_thread_want_exit = false;
>>> + qemu_thread_create(&migration->load_bufs_thread, "vfio-load-bufs",
>>> + vfio_load_bufs_thread, opaque,
>>> QEMU_THREAD_JOINABLE);
>>
>> The device state save threads are managed by the migration core thread
>> pool. Don't we want to apply the same thread management scheme for
>> the load flow as well?
>
> I think that (in contrast with the device state saving threads)
> the buffer loading / reordering thread is an implementation detail
> of the VFIO driver, so I don't think it really makes sense for multifd
> code
> to manage it.
Hmm, yes I understand.
Thanks.
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [PATCH v2 17/17] vfio/migration: Multifd device state transfer support - send side
2024-09-09 18:07 ` Maciej S. Szmigiero
@ 2024-09-12 8:26 ` Avihai Horon
2024-09-12 8:57 ` Cédric Le Goater
0 siblings, 1 reply; 128+ messages in thread
From: Avihai Horon @ 2024-09-12 8:26 UTC (permalink / raw)
To: Maciej S. Szmigiero
Cc: Alex Williamson, Fabiano Rosas, Peter Xu, Cédric Le Goater,
Eric Blake, Markus Armbruster, Daniel P . Berrangé,
Joao Martins, qemu-devel
On 09/09/2024 21:07, Maciej S. Szmigiero wrote:
> On 9.09.2024 13:41, Avihai Horon wrote:
>>
>> On 27/08/2024 20:54, Maciej S. Szmigiero wrote:
>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>
>>> Implement the multifd device state transfer via additional per-device
>>> thread inside save_live_complete_precopy_thread handler.
>>>
>>> Switch between doing the data transfer in the new handler and doing it
>>> in the old save_state handler depending on the
>>> x-migration-multifd-transfer device property value.
>>>
>>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>>> ---
>>> hw/vfio/migration.c | 169 ++++++++++++++++++++++++++++++++++
>>> hw/vfio/trace-events | 2 +
>>> include/hw/vfio/vfio-common.h | 1 +
>>> 3 files changed, 172 insertions(+)
>>>
>>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>>> index 57c1542528dc..67996aa2df8b 100644
>>> --- a/hw/vfio/migration.c
>>> +++ b/hw/vfio/migration.c
>>> @@ -655,6 +655,16 @@ static int vfio_save_setup(QEMUFile *f, void
>>> *opaque, Error **errp)
>>> uint64_t stop_copy_size = VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE;
>>> int ret;
>>>
>>> + /* Make a copy of this setting at the start in case it is
>>> changed mid-migration */
>>> + migration->multifd_transfer =
>>> vbasedev->migration_multifd_transfer;
>>
>> Should VFIO multifd be controlled by main migration multifd
>> capability, and let the per VFIO device migration_multifd_transfer
>> property be immutable and enabled by default?
>> Then we would have a single point of configuration (and an extra one
>> per VFIO device just to disable for backward compatibility).
>> Unless there are other benefits to have this property configurable?
>
> We want multifd device state transfer property to be configurable
> per-device
> in case in the future we add another device type (besides VFIO) that
> supports
> multifd device state transfer.
>
> In this case, we might need to enable the multifd device state
> transfer just
> for VFIO devices, but not for this new device type when we are
> migrating to a
> QEMU target that supports just the VFIO multifd device state transfer.
I think for this case we can use hw/core/machine.c:hw_compat_X_Y arrays [1].
[1]
https://www.qemu.org/docs/master/devel/migration/compatibility.html#how-backwards-compatibility-works
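For reference, such a compat entry is a one-liner in the machine compat
arrays, along these lines (the exact array depends on which release the
default changes in, so treat the version number as hypothetical):

/* hw/core/machine.c - sketch only */
GlobalProperty hw_compat_9_1[] = {
    { "vfio-pci", "x-migration-multifd-transfer", "off" },
};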
>
> TBH, I'm not opposed to adding an additional global multifd device
> state transfer
> switch (if we keep the per-device ones too) but I am not sure what
> value it adds.
>
>>> +
>>> + if (migration->multifd_transfer &&
>>> !migration_has_device_state_support()) {
>>> + error_setg(errp,
>>> + "%s: Multifd device transfer requested but
>>> unsupported in the current config",
>>> + vbasedev->name);
>>> + return -EINVAL;
>>> + }
>>> +
>>> qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
>>>
>>> vfio_query_stop_copy_size(vbasedev, &stop_copy_size);
>>> @@ -835,10 +845,20 @@ static int vfio_save_iterate(QEMUFile *f, void
>>> *opaque)
>>> static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
>>> {
>>> VFIODevice *vbasedev = opaque;
>>> + VFIOMigration *migration = vbasedev->migration;
>>> ssize_t data_size;
>>> int ret;
>>> Error *local_err = NULL;
>>>
>>> + if (migration->multifd_transfer) {
>>> + /*
>>> + * Emit dummy NOP data, vfio_save_complete_precopy_thread()
>>> + * does the actual transfer.
>>> + */
>>> + qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>>
>> There are three places where we send this dummy end of state, maybe
>> worth extracting it to a helper? I.e., vfio_send_end_of_state() and
>> then document there the rationale.
>
> I'm not totally against it but it's wrapping just a single line of
> code in
> a separate function?
Yes, it's more for self-documentation purposes and to avoid duplicating
comments.
I guess it's a matter of taste, so we can go either way and let
maintainer decide.
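For concreteness, the helper itself would be trivial, something like (the
name is hypothetical):

static void vfio_save_multifd_emit_dummy_eos(QEMUFile *f)
{
    /*
     * Emit dummy NOP data; with multifd transfer enabled the actual
     * device state is sent via the multifd channels instead.
     */
    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
}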
>
>>> + return 0;
>>> + }
>>> +
>>> trace_vfio_save_complete_precopy_started(vbasedev->name);
>>>
>>> /* We reach here with device state STOP or STOP_COPY only */
>>> @@ -864,12 +884,159 @@ static int
>>> vfio_save_complete_precopy(QEMUFile *f, void *opaque)
>>> return ret;
>>> }
>>>
>>> +static int vfio_save_complete_precopy_async_thread_config_state(VFIODevice *vbasedev,
>>> + char *idstr,
>>> + uint32_t instance_id,
>>> + uint32_t idx)
>>> +{
>>> + g_autoptr(QIOChannelBuffer) bioc = NULL;
>>> + QEMUFile *f = NULL;
>>> + int ret;
>>> + g_autofree VFIODeviceStatePacket *packet = NULL;
>>> + size_t packet_len;
>>> +
>>> + bioc = qio_channel_buffer_new(0);
>>> + qio_channel_set_name(QIO_CHANNEL(bioc),
>>> "vfio-device-config-save");
>>> +
>>> + f = qemu_file_new_output(QIO_CHANNEL(bioc));
>>> +
>>> + ret = vfio_save_device_config_state(f, vbasedev, NULL);
>>> + if (ret) {
>>> + return ret;
>>
>> Need to close f in this case.
>
> Right - by the way, that's a good example why RAII
> helps avoid such mistakes.
Agreed :)
>
>>> + }
>>> +
>>> + ret = qemu_fflush(f);
>>> + if (ret) {
>>> + goto ret_close_file;
>>> + }
>>> +
>>> + packet_len = sizeof(*packet) + bioc->usage;
>>> + packet = g_malloc0(packet_len);
>>> + packet->idx = idx;
>>> + packet->flags = VFIO_DEVICE_STATE_CONFIG_STATE;
>>> + memcpy(&packet->data, bioc->data, bioc->usage);
>>> +
>>> + if (!multifd_queue_device_state(idstr, instance_id,
>>> + (char *)packet, packet_len)) {
>>> + ret = -1;
>>
>> goto ret_close_file?
>
> Right, it would be better not to increment the counter in this case.
>
>>> + }
>>> +
>>> + bytes_transferred += packet_len;
>>
>> bytes_transferred is a global variable. Now that we access it from
>> multiple threads it should be protected.
>
> Right, this stat needs some concurrent access protection.
>
>> Note that now the VFIO device data is reported also in multifd stats
>> (if I am not mistaken), is this the behavior we want? Maybe we should
>> enhance multifd stats to distinguish between RAM data and device data?
>
> Multifd stats report total size of data transferred via multifd so
> they should include device state too.
Yes I agree. But now we are reporting double the amount of VFIO data
that we actually transfer (once in "vfio device transferred" and again
in multifd stats), and this may be misleading.
So maybe we should add a dedicated multifd device state counter and
report VFIO multifd bytes there instead of in bytes_transferred?
We can wait for other people's opinion about that.
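As for the concurrent access protection itself, making the counter atomic
would presumably be enough, e.g. a sketch:

    /* writers (main migration thread and the save threads): */
    qatomic_add(&bytes_transferred, packet_len);

    /* reader (the stats query side): */
    return qatomic_read(&bytes_transferred);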
>
> It may make sense to add a dedicated device state transfer counter
> at some time though.
>
>>> +
>>> +ret_close_file:
>>
>> Rename to "out" as we only have one exit point?
>>
>>> + g_clear_pointer(&f, qemu_fclose);
>>
>> f is a local variable, wouldn't qemu_fclose(f) be enough here?
>
> Sure, but why leave a dangling pointer?
>
> Currently, it is obviously a NOP (probably deleted by dead store
> elimination anyway) but the code might get refactored at some point
> and I think it's good practice to always NULL pointers after freeing
> them where possible and so be on the safe side.
Ack.
Thanks.
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [PATCH v2 15/17] vfio/migration: Multifd device state transfer support - receive side
2024-09-12 8:20 ` Avihai Horon
@ 2024-09-12 8:45 ` Cédric Le Goater
0 siblings, 0 replies; 128+ messages in thread
From: Cédric Le Goater @ 2024-09-12 8:45 UTC (permalink / raw)
To: Avihai Horon, Maciej S. Szmigiero, Peter Xu
Cc: Alex Williamson, Fabiano Rosas, Eric Blake, Markus Armbruster,
Daniel P . Berrangé, Joao Martins, qemu-devel
>>>>
>>>> - return vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RESUMING,
>>>> - vbasedev->migration->device_state, errp);
>>>> + migration->load_bufs_thread_finished = false;
>>>> + migration->load_bufs_thread_want_exit = false;
>>>> + qemu_thread_create(&migration->load_bufs_thread, "vfio-load-bufs",
>>>> + vfio_load_bufs_thread, opaque, QEMU_THREAD_JOINABLE);
>>>
>>> The device state save threads are managed by the migration core thread pool. Don't we want to apply the same thread management scheme for the load flow as well?
>>
>> I think that (in contrast with the device state saving threads)
>> the buffer loading / reordering thread is an implementation detail
>> of the VFIO driver, so I don't think it really makes sense for multifd code
>> to manage it.
Is it an optimisation then? In that case, could the implementation not
use threads?
VFIO is complex, migration is complex, VFIO migration is even more. TBH,
the idea of doing thread management in the VFIO subsystem makes me feel
uncomfortable.
Thanks,
C.
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [PATCH v2 17/17] vfio/migration: Multifd device state transfer support - send side
2024-09-12 8:26 ` Avihai Horon
@ 2024-09-12 8:57 ` Cédric Le Goater
0 siblings, 0 replies; 128+ messages in thread
From: Cédric Le Goater @ 2024-09-12 8:57 UTC (permalink / raw)
To: Avihai Horon, Maciej S. Szmigiero
Cc: Alex Williamson, Fabiano Rosas, Peter Xu, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Joao Martins,
qemu-devel
On 9/12/24 10:26, Avihai Horon wrote:
>
> On 09/09/2024 21:07, Maciej S. Szmigiero wrote:
>> On 9.09.2024 13:41, Avihai Horon wrote:
>>>
>>> On 27/08/2024 20:54, Maciej S. Szmigiero wrote:
>>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>>
>>>> Implement the multifd device state transfer via additional per-device
>>>> thread inside save_live_complete_precopy_thread handler.
>>>>
>>>> Switch between doing the data transfer in the new handler and doing it
>>>> in the old save_state handler depending on the
>>>> x-migration-multifd-transfer device property value.
>>>>
>>>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>>>> ---
>>>> hw/vfio/migration.c | 169 ++++++++++++++++++++++++++++++++++
>>>> hw/vfio/trace-events | 2 +
>>>> include/hw/vfio/vfio-common.h | 1 +
>>>> 3 files changed, 172 insertions(+)
>>>>
>>>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>>>> index 57c1542528dc..67996aa2df8b 100644
>>>> --- a/hw/vfio/migration.c
>>>> +++ b/hw/vfio/migration.c
>>>> @@ -655,6 +655,16 @@ static int vfio_save_setup(QEMUFile *f, void *opaque, Error **errp)
>>>> uint64_t stop_copy_size = VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE;
>>>> int ret;
>>>>
>>>> + /* Make a copy of this setting at the start in case it is changed mid-migration */
>>>> + migration->multifd_transfer = vbasedev->migration_multifd_transfer;
>>>
>>> Should VFIO multifd be controlled by main migration multifd capability, and let the per VFIO device migration_multifd_transfer property be immutable and enabled by default?
>>> Then we would have a single point of configuration (and an extra one per VFIO device just to disable for backward compatibility).
>>> Unless there are other benefits to have this property configurable?
>>
>> We want multifd device state transfer property to be configurable per-device
>> in case in the future we add another device type (besides VFIO) that supports
>> multifd device state transfer.
>>
>> In this case, we might need to enable the multifd device state transfer just
>> for VFIO devices, but not for this new device type when we are migrating to a
>> QEMU target that supports just the VFIO multifd device state transfer.
>
> I think for this case we can use hw/core/machine.c:hw_compat_X_Y arrays [1].
>
> [1] https://www.qemu.org/docs/master/devel/migration/compatibility.html#how-backwards-compatibility-works
>
>>
>> TBH, I'm not opposed to adding an additional global multifd device state transfer
>> switch (if we keep the per-device ones too) but I am not sure what value it adds.
>>
>>>> +
>>>> + if (migration->multifd_transfer && !migration_has_device_state_support()) {
>>>> + error_setg(errp,
>>>> + "%s: Multifd device transfer requested but unsupported in the current config",
>>>> + vbasedev->name);
>>>> + return -EINVAL;
>>>> + }
>>>> +
>>>> qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
>>>>
>>>> vfio_query_stop_copy_size(vbasedev, &stop_copy_size);
>>>> @@ -835,10 +845,20 @@ static int vfio_save_iterate(QEMUFile *f, void *opaque)
>>>> static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
>>>> {
>>>> VFIODevice *vbasedev = opaque;
>>>> + VFIOMigration *migration = vbasedev->migration;
>>>> ssize_t data_size;
>>>> int ret;
>>>> Error *local_err = NULL;
>>>>
>>>> + if (migration->multifd_transfer) {
>>>> + /*
>>>> + * Emit dummy NOP data, vfio_save_complete_precopy_thread()
>>>> + * does the actual transfer.
>>>> + */
>>>> + qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>>>
>>> There are three places where we send this dummy end of state, maybe worth extracting it to a helper? I.e., vfio_send_end_of_state() and then document there the rationale.
>>
>> I'm not totally against it but it's wrapping just a single line of code in
>> a separate function?
>
> Yes, it's more for self-documentation purposes and to avoid duplicating comments.
> I guess it's a matter of taste, so we can go either way and let maintainer decide.
I'd prefer an helper too. This comment applies to all additions
in pre-existing code. Ideally new routines should have a
'vfio_{migration,save,load}_multifd' prefix so that the reader
understands what the code is for.
Thanks,
C.
>
>>
>>>> + return 0;
>>>> + }
>>>> +
>>>> trace_vfio_save_complete_precopy_started(vbasedev->name);
>>>>
>>>> /* We reach here with device state STOP or STOP_COPY only */
>>>> @@ -864,12 +884,159 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
>>>> return ret;
>>>> }
>>>>
>>>> +static int vfio_save_complete_precopy_async_thread_config_state(VFIODevice *vbasedev,
>>>> + char *idstr,
>>>> + uint32_t instance_id,
>>>> + uint32_t idx)
>>>> +{
>>>> + g_autoptr(QIOChannelBuffer) bioc = NULL;
>>>> + QEMUFile *f = NULL;
>>>> + int ret;
>>>> + g_autofree VFIODeviceStatePacket *packet = NULL;
>>>> + size_t packet_len;
>>>> +
>>>> + bioc = qio_channel_buffer_new(0);
>>>> + qio_channel_set_name(QIO_CHANNEL(bioc), "vfio-device-config-save");
>>>> +
>>>> + f = qemu_file_new_output(QIO_CHANNEL(bioc));
>>>> +
>>>> + ret = vfio_save_device_config_state(f, vbasedev, NULL);
>>>> + if (ret) {
>>>> + return ret;
>>>
>>> Need to close f in this case.
>>
>> Right - by the way, that's a good example why RAII
>> helps avoid such mistakes.
>
> Agreed :)
>
>>
>>>> + }
>>>> +
>>>> + ret = qemu_fflush(f);
>>>> + if (ret) {
>>>> + goto ret_close_file;
>>>> + }
>>>> +
>>>> + packet_len = sizeof(*packet) + bioc->usage;
>>>> + packet = g_malloc0(packet_len);
>>>> + packet->idx = idx;
>>>> + packet->flags = VFIO_DEVICE_STATE_CONFIG_STATE;
>>>> + memcpy(&packet->data, bioc->data, bioc->usage);
>>>> +
>>>> + if (!multifd_queue_device_state(idstr, instance_id,
>>>> + (char *)packet, packet_len)) {
>>>> + ret = -1;
>>>
>>> goto ret_close_file?
>>
>> Right, it would be better not to increment the counter in this case.
>>
>>>> + }
>>>> +
>>>> + bytes_transferred += packet_len;
>>>
>>> bytes_transferred is a global variable. Now that we access it from multiple threads it should be protected.
>>
>> Right, this stat needs some concurrent access protection.
>>
>>> Note that now the VFIO device data is reported also in multifd stats (if I am not mistaken), is this the behavior we want? Maybe we should enhance multifd stats to distinguish between RAM data and device data?
>>
>> Multifd stats report total size of data transferred via multifd so
>> they should include device state too.
>
> Yes I agree. But now we are reporting double the amount of VFIO data that we actually transfer (once in "vfio device transferred" and again in multifd stats), and this may be misleading.
> So maybe we should add a dedicated multifd device state counter and report VFIO multifd bytes there instead of in bytes_transferred?
> We can wait for other people's opinion about that.
>
>>
>> It may make sense to add a dedicated device state transfer counter
>> at some time though.
>>
>>>> +
>>>> +ret_close_file:
>>>
>>> Rename to "out" as we only have one exit point?
>>>
>>>> + g_clear_pointer(&f, qemu_fclose);
>>>
>>> f is a local variable, wouldn't qemu_fclose(f) be enough here?
>>
>> Sure, but why leave a dangling pointer?
>>
>> Currently, it is obviously a NOP (probably deleted by dead store
>> elimination anyway) but the code might get refactored at some point
>> and I think it's good practice to always NULL pointers after freeing
>> them where possible and so be on the safe side.
>
> Ack.
>
> Thanks.
>
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [PATCH v2 09/17] migration/multifd: Device state transfer support - receive side
2024-09-12 8:13 ` Avihai Horon
@ 2024-09-12 13:52 ` Fabiano Rosas
2024-09-19 19:59 ` Maciej S. Szmigiero
0 siblings, 1 reply; 128+ messages in thread
From: Fabiano Rosas @ 2024-09-12 13:52 UTC (permalink / raw)
To: Avihai Horon, Maciej S. Szmigiero
Cc: Alex Williamson, Peter Xu, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Joao Martins,
qemu-devel
Avihai Horon <avihaih@nvidia.com> writes:
> On 09/09/2024 21:05, Maciej S. Szmigiero wrote:
>> On 5.09.2024 18:47, Avihai Horon wrote:
>>>
>>> On 27/08/2024 20:54, Maciej S. Szmigiero wrote:
>>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>>
>>>> Add a basic support for receiving device state via multifd channels -
>>>> channels that are shared with RAM transfers.
>>>>
>>>> To differentiate between a device state and a RAM packet the packet
>>>> header is read first.
>>>>
>>>> Depending whether MULTIFD_FLAG_DEVICE_STATE flag is present or not
>>>> in the
>>>> packet header either device state (MultiFDPacketDeviceState_t) or RAM
>>>> data (existing MultiFDPacket_t) is then read.
>>>>
>>>> The received device state data is provided to
>>>> qemu_loadvm_load_state_buffer() function for processing in the
>>>> device's load_state_buffer handler.
>>>>
>>>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>>>> ---
>>>> migration/multifd.c | 127
>>>> +++++++++++++++++++++++++++++++++++++-------
>>>> migration/multifd.h | 31 ++++++++++-
>>>> 2 files changed, 138 insertions(+), 20 deletions(-)
>>>>
>>>> diff --git a/migration/multifd.c b/migration/multifd.c
>>>> index b06a9fab500e..d5a8e5a9c9b5 100644
>>>> --- a/migration/multifd.c
>>>> +++ b/migration/multifd.c
>>>> @@ -21,6 +21,7 @@
>>>> #include "file.h"
>>>> #include "migration.h"
>>>> #include "migration-stats.h"
>>>> +#include "savevm.h"
>>>> #include "socket.h"
>>>> #include "tls.h"
>>>> #include "qemu-file.h"
>>>> @@ -209,10 +210,10 @@ void
>>>> multifd_send_fill_packet(MultiFDSendParams *p)
>>>>
>>>> memset(packet, 0, p->packet_len);
>>>>
>>>> - packet->magic = cpu_to_be32(MULTIFD_MAGIC);
>>>> - packet->version = cpu_to_be32(MULTIFD_VERSION);
>>>> + packet->hdr.magic = cpu_to_be32(MULTIFD_MAGIC);
>>>> + packet->hdr.version = cpu_to_be32(MULTIFD_VERSION);
>>>>
>>>> - packet->flags = cpu_to_be32(p->flags);
>>>> + packet->hdr.flags = cpu_to_be32(p->flags);
>>>> packet->next_packet_size = cpu_to_be32(p->next_packet_size);
>>>>
>>>> packet_num = qatomic_fetch_inc(&multifd_send_state->packet_num);
>>>> @@ -228,31 +229,49 @@ void
>>>> multifd_send_fill_packet(MultiFDSendParams *p)
>>>> p->flags, p->next_packet_size);
>>>> }
>>>>
>>>> -static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error
>>>> **errp)
>>>> +static int multifd_recv_unfill_packet_header(MultiFDRecvParams *p,
>>>> + MultiFDPacketHdr_t *hdr,
>>>> + Error **errp)
>>>> {
>>>> - MultiFDPacket_t *packet = p->packet;
>>>> - int ret = 0;
>>>> -
>>>> - packet->magic = be32_to_cpu(packet->magic);
>>>> - if (packet->magic != MULTIFD_MAGIC) {
>>>> + hdr->magic = be32_to_cpu(hdr->magic);
>>>> + if (hdr->magic != MULTIFD_MAGIC) {
>>>> error_setg(errp, "multifd: received packet "
>>>> "magic %x and expected magic %x",
>>>> - packet->magic, MULTIFD_MAGIC);
>>>> + hdr->magic, MULTIFD_MAGIC);
>>>> return -1;
>>>> }
>>>>
>>>> - packet->version = be32_to_cpu(packet->version);
>>>> - if (packet->version != MULTIFD_VERSION) {
>>>> + hdr->version = be32_to_cpu(hdr->version);
>>>> + if (hdr->version != MULTIFD_VERSION) {
>>>> error_setg(errp, "multifd: received packet "
>>>> "version %u and expected version %u",
>>>> - packet->version, MULTIFD_VERSION);
>>>> + hdr->version, MULTIFD_VERSION);
>>>> return -1;
>>>> }
>>>>
>>>> - p->flags = be32_to_cpu(packet->flags);
>>>> + p->flags = be32_to_cpu(hdr->flags);
>>>> +
>>>> + return 0;
>>>> +}
>>>> +
>>>> +static int
>>>> multifd_recv_unfill_packet_device_state(MultiFDRecvParams *p,
>>>> + Error **errp)
>>>> +{
>>>> + MultiFDPacketDeviceState_t *packet = p->packet_dev_state;
>>>> +
>>>> + packet->instance_id = be32_to_cpu(packet->instance_id);
>>>> + p->next_packet_size = be32_to_cpu(packet->next_packet_size);
>>>> +
>>>> + return 0;
>>>> +}
>>>> +
>>>> +static int multifd_recv_unfill_packet_ram(MultiFDRecvParams *p,
>>>> Error **errp)
>>>> +{
>>>> + MultiFDPacket_t *packet = p->packet;
>>>> + int ret = 0;
>>>> +
>>>> p->next_packet_size = be32_to_cpu(packet->next_packet_size);
>>>> p->packet_num = be64_to_cpu(packet->packet_num);
>>>> - p->packets_recved++;
>>>>
>>>> if (!(p->flags & MULTIFD_FLAG_SYNC)) {
>>>> ret = multifd_ram_unfill_packet(p, errp);
>>>> @@ -264,6 +283,19 @@ static int
>>>> multifd_recv_unfill_packet(MultiFDRecvParams *p, Error **errp)
>>>> return ret;
>>>> }
>>>>
>>>> +static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error
>>>> **errp)
>>>> +{
>>>> + p->packets_recved++;
>>>> +
>>>> + if (p->flags & MULTIFD_FLAG_DEVICE_STATE) {
>>>> + return multifd_recv_unfill_packet_device_state(p, errp);
>>>> + } else {
>>>> + return multifd_recv_unfill_packet_ram(p, errp);
>>>> + }
>>>> +
>>>> + g_assert_not_reached();
>>>
>>> We can drop the assert and the "else":
>>> if (p->flags & MULTIFD_FLAG_DEVICE_STATE) {
>>> return multifd_recv_unfill_packet_device_state(p, errp);
>>> }
>>>
>>> return multifd_recv_unfill_packet_ram(p, errp);
>>
>> Ack.
>>
>>>> +}
>>>> +
>>>> static bool multifd_send_should_exit(void)
>>>> {
>>>> return qatomic_read(&multifd_send_state->exiting);
>>>> diff --git a/migration/multifd.h b/migration/multifd.h
>>>> index a3e35196d179..a8f3e4838c01 100644
>>>> --- a/migration/multifd.h
>>>> +++ b/migration/multifd.h
>>>> @@ -45,6 +45,12 @@ MultiFDRecvData *multifd_get_recv_data(void);
>>>> #define MULTIFD_FLAG_QPL (4 << 1)
>>>> #define MULTIFD_FLAG_UADK (8 << 1)
>>>>
>>>> +/*
>>>> + * If set it means that this packet contains device state
>>>> + * (MultiFDPacketDeviceState_t), not RAM data (MultiFDPacket_t).
>>>> + */
>>>> +#define MULTIFD_FLAG_DEVICE_STATE (1 << 4)
>>>> +
>>>> /* This value needs to be a multiple of qemu_target_page_size() */
>>>> #define MULTIFD_PACKET_SIZE (512 * 1024)
>>>>
>>>> @@ -52,6 +58,11 @@ typedef struct {
>>>> uint32_t magic;
>>>> uint32_t version;
>>>> uint32_t flags;
>>>> +} __attribute__((packed)) MultiFDPacketHdr_t;
>>>
>>> Maybe split this patch into two: one that adds the packet header
>>> concept and another that adds the new device packet?
>>
>> Can do.
>>
>>>> +
>>>> +typedef struct {
>>>> + MultiFDPacketHdr_t hdr;
>>>> +
>>>> /* maximum number of allocated pages */
>>>> uint32_t pages_alloc;
>>>> /* non zero pages */
>>>> @@ -72,6 +83,16 @@ typedef struct {
>>>> uint64_t offset[];
>>>> } __attribute__((packed)) MultiFDPacket_t;
>>>>
>>>> +typedef struct {
>>>> + MultiFDPacketHdr_t hdr;
>>>> +
>>>> + char idstr[256] QEMU_NONSTRING;
>>>
>>> idstr should be null terminated, or am I missing something?
>>
>> There's no need to always NULL-terminate a constant-size field,
>> since the strncpy() already stops at the field size, so we can
>> gain another byte for actual string use this way.
>>
>> RAM block idstr also uses the same "trick":
>>> void multifd_ram_fill_packet(MultiFDSendParams *p):
>>> strncpy(packet->ramblock, pages->block->idstr, 256);
>>
> But can idstr actually be 256 bytes long without null byte?
> There are a lot of places where idstr is a parameter for functions that
> expect null terminated string and it is also printed as such.
Yeah, and I actually don't see the "trick" being used in
RAMBlock. Anyway, it's best to null terminate to be more predictable. We
also had Coverity reports about similar things:
https://lore.kernel.org/r/CAFEAcA_F2qrSAacY=V5Hez1qFGuNW0-XqL2LQ=Y_UKYuHEJWhw@mail.gmail.com
I haven't got the time to send that patch yet.
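Either way the fix is small; a sketch of the send-side variant (trading the
256th byte for guaranteed termination), plus a defensive receive-side option:

    /* fill side: leave room for the terminator */
    strncpy(packet->ramblock, pages->block->idstr,
            sizeof(packet->ramblock) - 1);
    packet->ramblock[sizeof(packet->ramblock) - 1] = '\0';

    /* and/or on the receive side, before using it as a C string: */
    packet->ramblock[sizeof(packet->ramblock) - 1] = '\0';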
>
> Thanks.
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [PATCH v2 12/17] migration/multifd: Device state transfer support - send side
2024-09-10 19:48 ` Peter Xu
@ 2024-09-12 18:43 ` Fabiano Rosas
2024-09-13 0:23 ` Peter Xu
2024-09-19 19:49 ` Maciej S. Szmigiero
1 sibling, 1 reply; 128+ messages in thread
From: Fabiano Rosas @ 2024-09-12 18:43 UTC (permalink / raw)
To: Peter Xu, mail
Cc: Maciej S. Szmigiero, Alex Williamson, Cédric Le Goater,
Eric Blake, Markus Armbruster, Daniel P . Berrangé,
Avihai Horon, Joao Martins, qemu-devel
Peter Xu <peterx@redhat.com> writes:
Hi Peter, sorry if I'm not very enthusiastic about this; I'm sure you
understand the rework is a little frustrating.
> On Wed, Aug 28, 2024 at 09:41:17PM -0300, Fabiano Rosas wrote:
>> > +size_t multifd_device_state_payload_size(void)
>> > +{
>> > + return sizeof(MultiFDDeviceState_t);
>> > +}
>>
>> This will not be necessary because the payload size is the same as the
>> data type. We only need it for the special case where the MultiFDPages_t
>> is smaller than the total ram payload size.
>
> Today I was thinking maybe we should really clean this up, as the current
> multifd_send_data_alloc() is indeed too tricky (blame me.. who requested
> that more or less). Knowing that VFIO can use dynamic buffers with ->idstr
> and ->buf (I was thinking it could be buf[1M].. but I was wrong...) made
> that feeling stronger.
If we're going to commit bad code and then rewrite it a week later, we
could have just let the original series from Maciej merge without any of
this. I already suggested it a couple of times, we shouldn't be doing
core refactorings underneath contributors' patches, this is too
fragile. Just let people contribute their code and we can change it
later.
This is also why I've been trying hard to separate core multifd
functionality from migration code that uses multifd to transmit their
data.
My original RFC plus the suggestion to extend multifd_ops for device
state would have (almost) made it so that no client code would be left
in multifd. We could have been turning this thing upside down and it
wouldn't affect anyone in terms of code conflicts.
The ship has already sailed, so your patches below are fine, I have just
some small comments.
>
> I think we should change it now perhaps, otherwise we'll need to introduce
> other helpers to e.g. reset the device buffers, and that's not only slow
> but also not good looking, IMO.
I agree that part is kind of a sore thumb.
>
> So I went ahead with the idea from the previous discussion and changed the
> SendData union into a struct; the memory consumption is not super
> important yet, IMHO, but we should still stick with the object model where
> the multifd enqueue thread switches buffers with multifd, as that still
> sounds like a sane way to do it.
>
> Then when that patch is ready, I further tried to make VFIO reuse multifd
> buffers just like what we do with MultiFDPages_t->offset[]: in RAM code we
> don't allocate it every time we enqueue.
>
> I hope it'll also work for VFIO.  VFIO is special in being able to
> dump the config space, so it's more complex (and I noticed Maciej's current
> design requires the final chunk of VFIO config data be migrated in one
> packet.. that is also part of the complexity there). So I allowed that
> part to allocate a buffer but only that. IOW, I made some API (see below)
> that can either reuse preallocated buffer, or use a separate one only for
> the final bulk.
>
> In short, could both of you have a look at what I came up with below? I
> did that in patches because I think it's too much to comment, so patches
> may work better. No concern if any of below could be good changes to you,
> then either Maciej can squash whatever into existing patches (and I feel
> like some existing patches in this series can go away with below design),
> or I can post pre-requisite patch but only if any of you prefer that.
>
> Anyway, let me know, the patches apply on top of this whole series applied
> first.
>
> I also wonder whether there can be any perf difference already (I tested
> all the multifd qtests with the below, but have no VFIO I can run); perhaps
> not that much, but just to mention that the below should avoid both buffer
> allocations and one round of copy (so VFIO read() directly writes to the
> multifd buffers now).
>
> Thanks,
>
> ==========8<==========
> From a6cbcf692b2376e72cc053219d67bb32eabfb7a6 Mon Sep 17 00:00:00 2001
> From: Peter Xu <peterx@redhat.com>
> Date: Tue, 10 Sep 2024 12:10:59 -0400
> Subject: [PATCH 1/3] migration/multifd: Make MultiFDSendData a struct
>
> The newly introduced device state buffer can be used for either storing
> VFIO's read() raw data, but already also possible to store generic device
> states. After noticing that device states may not easily provide a max
> buffer size (also the fact that RAM MultiFDPages_t after all also want to
> have flexibility on managing offset[] array), it may not be a good idea to
> stick with union on MultiFDSendData.. as it won't play well with such
> flexibility.
>
> Switch MultiFDSendData to a struct.
>
> It won't consume a lot more space in reality, after all the real buffers
> were already dynamically allocated, so it's so far only about the two
> structs (pages, device_state) that will be duplicated, but they're small.
>
> With this, we can remove the pretty hard to understand alloc size logic.
> Because now we can allocate offset[] together with the SendData, and
> properly free it when the SendData is freed.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> migration/multifd.h | 16 +++++++++++-----
> migration/multifd-device-state.c | 8 ++++++--
> migration/multifd-nocomp.c | 13 ++++++-------
> migration/multifd.c | 25 ++++++-------------------
> 4 files changed, 29 insertions(+), 33 deletions(-)
>
> diff --git a/migration/multifd.h b/migration/multifd.h
> index c15c83104c..47203334b9 100644
> --- a/migration/multifd.h
> +++ b/migration/multifd.h
> @@ -98,9 +98,13 @@ typedef struct {
> uint32_t num;
> /* number of normal pages */
> uint32_t normal_num;
> + /*
> + * Pointer to the ramblock. NOTE: it's caller's responsibility to make
> + * sure the pointer is always valid!
> + */
This could use some rewording, it's not clear what "caller" means here.
> RAMBlock *block;
> - /* offset of each page */
> - ram_addr_t offset[];
> + /* offset array of each page, managed by multifd */
I'd drop the part after the comma, it's not very accurate and also gives
an impression that something sophisticated is being done to this.
> + ram_addr_t *offset;
> } MultiFDPages_t;
>
> struct MultiFDRecvData {
> @@ -123,7 +127,7 @@ typedef enum {
> MULTIFD_PAYLOAD_DEVICE_STATE,
> } MultiFDPayloadType;
>
> -typedef union MultiFDPayload {
> +typedef struct MultiFDPayload {
> MultiFDPages_t ram;
> MultiFDDeviceState_t device_state;
> } MultiFDPayload;
> @@ -323,11 +327,13 @@ static inline uint32_t multifd_ram_page_count(void)
> void multifd_ram_save_setup(void);
> void multifd_ram_save_cleanup(void);
> int multifd_ram_flush_and_sync(void);
> -size_t multifd_ram_payload_size(void);
> +void multifd_ram_payload_alloc(MultiFDPages_t *pages);
> +void multifd_ram_payload_free(MultiFDPages_t *pages);
> void multifd_ram_fill_packet(MultiFDSendParams *p);
> int multifd_ram_unfill_packet(MultiFDRecvParams *p, Error **errp);
>
> -size_t multifd_device_state_payload_size(void);
> +void multifd_device_state_payload_alloc(MultiFDDeviceState_t *device_state);
> +void multifd_device_state_payload_free(MultiFDDeviceState_t *device_state);
> void multifd_device_state_save_setup(void);
> void multifd_device_state_clear(MultiFDDeviceState_t *device_state);
> void multifd_device_state_save_cleanup(void);
> diff --git a/migration/multifd-device-state.c b/migration/multifd-device-state.c
> index 9b364e8ef3..72b72b6e62 100644
> --- a/migration/multifd-device-state.c
> +++ b/migration/multifd-device-state.c
> @@ -22,9 +22,13 @@ bool send_threads_abort;
>
> static MultiFDSendData *device_state_send;
>
> -size_t multifd_device_state_payload_size(void)
> +/* TODO: use static buffers for idstr and buf */
> +void multifd_device_state_payload_alloc(MultiFDDeviceState_t *device_state)
> +{
> +}
> +
> +void multifd_device_state_payload_free(MultiFDDeviceState_t *device_state)
> {
> - return sizeof(MultiFDDeviceState_t);
> }
>
> void multifd_device_state_save_setup(void)
> diff --git a/migration/multifd-nocomp.c b/migration/multifd-nocomp.c
> index 0b7b543f44..c1b95fee0d 100644
> --- a/migration/multifd-nocomp.c
> +++ b/migration/multifd-nocomp.c
> @@ -22,15 +22,14 @@
>
> static MultiFDSendData *multifd_ram_send;
>
> -size_t multifd_ram_payload_size(void)
> +void multifd_ram_payload_alloc(MultiFDPages_t *pages)
> {
> - uint32_t n = multifd_ram_page_count();
> + pages->offset = g_new0(ram_addr_t, multifd_ram_page_count());
> +}
>
> - /*
> - * We keep an array of page offsets at the end of MultiFDPages_t,
> - * add space for it in the allocation.
> - */
> - return sizeof(MultiFDPages_t) + n * sizeof(ram_addr_t);
> +void multifd_ram_payload_free(MultiFDPages_t *pages)
> +{
> + g_clear_pointer(&pages->offset, g_free);
> }
>
> void multifd_ram_save_setup(void)
> diff --git a/migration/multifd.c b/migration/multifd.c
> index bebe5b5a9b..5a20b831cf 100644
> --- a/migration/multifd.c
> +++ b/migration/multifd.c
> @@ -101,26 +101,12 @@ struct {
>
> MultiFDSendData *multifd_send_data_alloc(void)
> {
> - size_t max_payload_size, size_minus_payload;
> + MultiFDSendData *new = g_new0(MultiFDSendData, 1);
>
> - /*
> - * MultiFDPages_t has a flexible array at the end, account for it
> - * when allocating MultiFDSendData. Use max() in case other types
> - * added to the union in the future are larger than
> - * (MultiFDPages_t + flex array).
> - */
> - max_payload_size = MAX(multifd_ram_payload_size(),
> - multifd_device_state_payload_size());
> - max_payload_size = MAX(max_payload_size, sizeof(MultiFDPayload));
> -
> - /*
> - * Account for any holes the compiler might insert. We can't pack
> - * the structure because that misaligns the members and triggers
> - * Waddress-of-packed-member.
> - */
> - size_minus_payload = sizeof(MultiFDSendData) - sizeof(MultiFDPayload);
> + multifd_ram_payload_alloc(&new->u.ram);
> + multifd_device_state_payload_alloc(&new->u.device_state);
>
> - return g_malloc0(size_minus_payload + max_payload_size);
> + return new;
> }
>
> void multifd_send_data_clear(MultiFDSendData *data)
> @@ -147,7 +133,8 @@ void multifd_send_data_free(MultiFDSendData *data)
> return;
> }
>
> - multifd_send_data_clear(data);
> + multifd_ram_payload_free(&data->u.ram);
> + multifd_device_state_payload_free(&data->u.device_state);
The "u" needs to be dropped now. Could just change to "p". Or can we now
move the whole struct inside MultiFDSendData?
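I.e., something like this (a sketch of what moving it inline could look like):

typedef struct MultiFDSendData {
    MultiFDPayloadType type;
    struct {
        MultiFDPages_t ram;
        MultiFDDeviceState_t device_state;
    } p;
} MultiFDSendData;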
>
> g_free(data);
> }
> --
> 2.45.0
>
>
>
> From 6695d134c0818f42183f5ea03c21e6d56e7b57ea Mon Sep 17 00:00:00 2001
> From: Peter Xu <peterx@redhat.com>
> Date: Tue, 10 Sep 2024 12:24:14 -0400
> Subject: [PATCH 2/3] migration/multifd: Optimize device_state->idstr updates
>
> The duplication / allocation of idstr for each VFIO blob is an overkill, as
> idstr isn't something that changes frequently. Also, the idstr always came
> from the upper layer of se->idstr so it's always guaranteed to
> exist (e.g. no device unplug allowed during migration).
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> migration/multifd.h | 4 ++++
> migration/multifd-device-state.c | 10 +++++++---
> 2 files changed, 11 insertions(+), 3 deletions(-)
>
> diff --git a/migration/multifd.h b/migration/multifd.h
> index 47203334b9..1eaa5d4496 100644
> --- a/migration/multifd.h
> +++ b/migration/multifd.h
> @@ -115,6 +115,10 @@ struct MultiFDRecvData {
> };
>
> typedef struct {
> + /*
> + * Name of the owner device. NOTE: it's caller's responsibility to
> + * make sure the pointer is always valid!
> + */
Same comment as the other one here.
> char *idstr;
> uint32_t instance_id;
> char *buf;
> diff --git a/migration/multifd-device-state.c b/migration/multifd-device-state.c
> index 72b72b6e62..cfd0465bac 100644
> --- a/migration/multifd-device-state.c
> +++ b/migration/multifd-device-state.c
> @@ -44,7 +44,7 @@ void multifd_device_state_save_setup(void)
>
> void multifd_device_state_clear(MultiFDDeviceState_t *device_state)
> {
> - g_clear_pointer(&device_state->idstr, g_free);
> + device_state->idstr = NULL;
> g_clear_pointer(&device_state->buf, g_free);
> }
>
> @@ -100,7 +100,12 @@ bool multifd_queue_device_state(char *idstr, uint32_t instance_id,
>
> multifd_set_payload_type(device_state_send, MULTIFD_PAYLOAD_DEVICE_STATE);
> device_state = &device_state_send->u.device_state;
> - device_state->idstr = g_strdup(idstr);
> + /*
> + * NOTE: here we must use a static idstr (e.g. of a savevm state
> + * entry) rather than any dynamically allocated buffer, because multifd
> + * assumes this pointer is always valid!
> + */
> + device_state->idstr = idstr;
> device_state->instance_id = instance_id;
> device_state->buf = g_memdup2(data, len);
> device_state->buf_len = len;
> @@ -137,7 +142,6 @@ static void multifd_device_state_save_thread_data_free(void *opaque)
> {
> struct MultiFDDSSaveThreadData *data = opaque;
>
> - g_clear_pointer(&data->idstr, g_free);
> g_free(data);
> }
>
> --
> 2.45.0
>
>
> From abfea9698ff46ad0e0175e1dcc6e005e0b2ece2a Mon Sep 17 00:00:00 2001
> From: Peter Xu <peterx@redhat.com>
> Date: Tue, 10 Sep 2024 12:27:49 -0400
> Subject: [PATCH 3/3] migration/multifd: Optimize device_state buffer
> allocations
>
> Provide a device_state->buf_prealloc so that the buffers can be reused if
> possible. Provide a set of APIs to use it right. Please see the
> documentation for the API in the code.
>
> The default buffer size came from VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE as of
> now.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> include/hw/vfio/vfio-common.h | 9 ++++
> include/migration/misc.h | 12 ++++-
> migration/multifd.h | 11 +++-
> hw/vfio/migration.c | 43 ++++++++-------
> migration/multifd-device-state.c | 93 +++++++++++++++++++++++++-------
> migration/multifd.c | 9 ----
> 6 files changed, 126 insertions(+), 51 deletions(-)
>
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 4578a0ca6a..c1f2f4ae55 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -61,6 +61,13 @@ typedef struct VFIORegion {
> uint8_t nr; /* cache the region number for debug */
> } VFIORegion;
>
> +typedef struct VFIODeviceStatePacket {
> + uint32_t version;
> + uint32_t idx;
> + uint32_t flags;
> + uint8_t data[0];
> +} QEMU_PACKED VFIODeviceStatePacket;
> +
> typedef struct VFIOMigration {
> struct VFIODevice *vbasedev;
> VMChangeStateEntry *vm_state;
> @@ -168,6 +175,8 @@ typedef struct VFIODevice {
> int devid;
> IOMMUFDBackend *iommufd;
> VFIOIOASHwpt *hwpt;
> + /* Only used on sender side when multifd is enabled */
> + VFIODeviceStatePacket *multifd_packet;
> QLIST_ENTRY(VFIODevice) hwpt_next;
> } VFIODevice;
>
> diff --git a/include/migration/misc.h b/include/migration/misc.h
> index 26f7f3140f..1a8676ed3d 100644
> --- a/include/migration/misc.h
> +++ b/include/migration/misc.h
> @@ -112,8 +112,16 @@ bool migration_in_bg_snapshot(void);
> void dirty_bitmap_mig_init(void);
>
> /* migration/multifd-device-state.c */
> -bool multifd_queue_device_state(char *idstr, uint32_t instance_id,
> - char *data, size_t len);
> +struct MultiFDDeviceState_t;
> +typedef struct MultiFDDeviceState_t MultiFDDeviceState_t;
> +
> +MultiFDDeviceState_t *
> +multifd_device_state_prepare(char *idstr, uint32_t instance_id);
> +void *multifd_device_state_get_buffer(MultiFDDeviceState_t *s,
> + int64_t *buf_len);
> +bool multifd_device_state_finish(MultiFDDeviceState_t *state,
> + void *buf, int64_t buf_len);
> +
> bool migration_has_device_state_support(void);
>
> void
> diff --git a/migration/multifd.h b/migration/multifd.h
> index 1eaa5d4496..1ccdeeb8c5 100644
> --- a/migration/multifd.h
> +++ b/migration/multifd.h
> @@ -15,6 +15,7 @@
>
> #include "exec/target_page.h"
> #include "ram.h"
> +#include "migration/misc.h"
>
> typedef struct MultiFDRecvData MultiFDRecvData;
> typedef struct MultiFDSendData MultiFDSendData;
> @@ -114,16 +115,22 @@ struct MultiFDRecvData {
> off_t file_offset;
> };
>
> -typedef struct {
> +struct MultiFDDeviceState_t {
> /*
> * Name of the owner device. NOTE: it's caller's responsibility to
> * make sure the pointer is always valid!
> */
> char *idstr;
> uint32_t instance_id;
> + /*
> + * Points to the buffer to send via multifd. Normally it's the same as
> + * buf_prealloc, otherwise the caller needs to make sure the buffer is
> + * avaliable through multifd running.
"throughout multifd runtime" maybe.
> + */
> char *buf;
> + char *buf_prealloc;
> size_t buf_len;
> -} MultiFDDeviceState_t;
> +};
>
> typedef enum {
> MULTIFD_PAYLOAD_NONE,
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 67996aa2df..e36422b7c5 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -59,13 +59,6 @@
>
> #define VFIO_DEVICE_STATE_CONFIG_STATE (1)
>
> -typedef struct VFIODeviceStatePacket {
> - uint32_t version;
> - uint32_t idx;
> - uint32_t flags;
> - uint8_t data[0];
> -} QEMU_PACKED VFIODeviceStatePacket;
> -
> static int64_t bytes_transferred;
>
> static const char *mig_state_to_str(enum vfio_device_mig_state state)
> @@ -741,6 +734,9 @@ static void vfio_save_cleanup(void *opaque)
> migration->initial_data_sent = false;
> vfio_migration_cleanup(vbasedev);
> trace_vfio_save_cleanup(vbasedev->name);
> + if (vbasedev->multifd_packet) {
> + g_clear_pointer(&vbasedev->multifd_packet, g_free);
> + }
> }
>
> static void vfio_state_pending_estimate(void *opaque, uint64_t *must_precopy,
> @@ -892,7 +888,8 @@ static int vfio_save_complete_precopy_async_thread_config_state(VFIODevice *vbas
> g_autoptr(QIOChannelBuffer) bioc = NULL;
> QEMUFile *f = NULL;
> int ret;
> - g_autofree VFIODeviceStatePacket *packet = NULL;
> + VFIODeviceStatePacket *packet;
> + MultiFDDeviceState_t *state;
> size_t packet_len;
>
> bioc = qio_channel_buffer_new(0);
> @@ -911,13 +908,19 @@ static int vfio_save_complete_precopy_async_thread_config_state(VFIODevice *vbas
> }
>
> packet_len = sizeof(*packet) + bioc->usage;
> - packet = g_malloc0(packet_len);
> +
> + state = multifd_device_state_prepare(idstr, instance_id);
> + /*
> + * Do not reuse multifd buffer, but use our own due to random size.
> + * The buffer will be freed only when save cleanup.
> + */
> + vbasedev->multifd_packet = g_malloc0(packet_len);
> + packet = vbasedev->multifd_packet;
> packet->idx = idx;
> packet->flags = VFIO_DEVICE_STATE_CONFIG_STATE;
> memcpy(&packet->data, bioc->data, bioc->usage);
>
> - if (!multifd_queue_device_state(idstr, instance_id,
> - (char *)packet, packet_len)) {
> + if (!multifd_device_state_finish(state, packet, packet_len)) {
> ret = -1;
> }
>
> @@ -936,7 +939,6 @@ static int vfio_save_complete_precopy_thread(char *idstr,
> VFIODevice *vbasedev = opaque;
> VFIOMigration *migration = vbasedev->migration;
> int ret;
> - g_autofree VFIODeviceStatePacket *packet = NULL;
> uint32_t idx;
>
> if (!migration->multifd_transfer) {
> @@ -954,21 +956,25 @@ static int vfio_save_complete_precopy_thread(char *idstr,
> goto ret_finish;
> }
>
> - packet = g_malloc0(sizeof(*packet) + migration->data_buffer_size);
> -
> for (idx = 0; ; idx++) {
> + VFIODeviceStatePacket *packet;
> + MultiFDDeviceState_t *state;
> ssize_t data_size;
> size_t packet_size;
> + int64_t buf_size;
>
> if (qatomic_read(abort_flag)) {
> ret = -ECANCELED;
> goto ret_finish;
> }
>
> + state = multifd_device_state_prepare(idstr, instance_id);
> + packet = multifd_device_state_get_buffer(state, &buf_size);
> data_size = read(migration->data_fd, &packet->data,
> - migration->data_buffer_size);
> + buf_size - sizeof(*packet));
> if (data_size < 0) {
> if (errno != ENOMSG) {
> + multifd_device_state_finish(state, NULL, 0);
> ret = -errno;
> goto ret_finish;
> }
> @@ -980,14 +986,15 @@ static int vfio_save_complete_precopy_thread(char *idstr,
> data_size = 0;
> }
>
> - if (data_size == 0)
> + if (data_size == 0) {
> + multifd_device_state_finish(state, NULL, 0);
> break;
> + }
>
> packet->idx = idx;
> packet_size = sizeof(*packet) + data_size;
>
> - if (!multifd_queue_device_state(idstr, instance_id,
> - (char *)packet, packet_size)) {
> + if (!multifd_device_state_finish(state, packet, packet_size)) {
> ret = -1;
> goto ret_finish;
> }
> diff --git a/migration/multifd-device-state.c b/migration/multifd-device-state.c
> index cfd0465bac..6f0259426d 100644
> --- a/migration/multifd-device-state.c
> +++ b/migration/multifd-device-state.c
> @@ -20,15 +20,18 @@ ThreadPool *send_threads;
> int send_threads_ret;
> bool send_threads_abort;
>
> +#define MULTIFD_DEVICE_STATE_BUFLEN (1UL << 20)
> +
> static MultiFDSendData *device_state_send;
>
> -/* TODO: use static buffers for idstr and buf */
> void multifd_device_state_payload_alloc(MultiFDDeviceState_t *device_state)
> {
> + device_state->buf_prealloc = g_malloc0(MULTIFD_DEVICE_STATE_BUFLEN);
> }
>
> void multifd_device_state_payload_free(MultiFDDeviceState_t *device_state)
> {
> + g_clear_pointer(&device_state->buf_prealloc, g_free);
> }
>
> void multifd_device_state_save_setup(void)
> @@ -42,12 +45,6 @@ void multifd_device_state_save_setup(void)
> send_threads_abort = false;
> }
>
> -void multifd_device_state_clear(MultiFDDeviceState_t *device_state)
> -{
> - device_state->idstr = NULL;
> - g_clear_pointer(&device_state->buf, g_free);
> -}
> -
> void multifd_device_state_save_cleanup(void)
> {
> g_clear_pointer(&send_threads, thread_pool_free);
> @@ -89,33 +86,89 @@ void multifd_device_state_send_prepare(MultiFDSendParams *p)
> multifd_device_state_fill_packet(p);
> }
>
> -bool multifd_queue_device_state(char *idstr, uint32_t instance_id,
> - char *data, size_t len)
> +/*
> + * Prepare to send some device state via multifd. Returns the current idle
> + * MultiFDDeviceState_t*.
> + *
> + * As a follow up, the caller must call multifd_device_state_finish() to
> + * release the resources.
> + *
> + * One example usage of the API:
> + *
> + * // Fetch a free multifd device state object
> + * state = multifd_device_state_prepare(idstr, instance_id);
> + *
> + * // Optional: fetch the buffer to reuse
> + * buf = multifd_device_state_get_buffer(state, &buf_size);
> + *
> + * // Here len>0 means success, otherwise failure
> + * len = buffer_fill(buf, buf_size);
> + *
> + * // Finish the transaction, either enqueue or cancel the request. Here
> + * // len>0 will enqueue, <=0 will cancel.
> + * multifd_device_state_finish(state, buf, len);
> + */
> +MultiFDDeviceState_t *
> +multifd_device_state_prepare(char *idstr, uint32_t instance_id)
> {
> - /* Device state submissions can come from multiple threads */
> - QEMU_LOCK_GUARD(&queue_job_mutex);
> MultiFDDeviceState_t *device_state;
>
> assert(multifd_payload_empty(device_state_send));
>
> - multifd_set_payload_type(device_state_send, MULTIFD_PAYLOAD_DEVICE_STATE);
> + /*
> + * TODO: The lock name may need changing, but I'm reusing it just for
> + * simplicity.
> + */
> + qemu_mutex_lock(&queue_job_mutex);
> +
> device_state = &device_state_send->u.device_state;
> /*
> - * NOTE: here we must use a static idstr (e.g. of a savevm state
> - * entry) rather than any dynamically allocated buffer, because multifd
> + * NOTE: here we must use a static idstr (e.g. of a savevm state entry)
> + * rather than any dynamically allocated buffer, because multifd
> * assumes this pointer is always valid!
> */
> device_state->idstr = idstr;
> device_state->instance_id = instance_id;
> - device_state->buf = g_memdup2(data, len);
> - device_state->buf_len = len;
>
> - if (!multifd_send(&device_state_send)) {
> - multifd_send_data_clear(device_state_send);
> - return false;
> + return &device_state_send->u.device_state;
> +}
> +
> +/*
> + * Must be used after a previous call to multifd_device_state_prepare();
> + * the buffer must not be used after invoking multifd_device_state_finish().
> + */
> +void *multifd_device_state_get_buffer(MultiFDDeviceState_t *s,
> + int64_t *buf_len)
> +{
> + *buf_len = MULTIFD_DEVICE_STATE_BUFLEN;
> + return s->buf_prealloc;
> +}
> +
> +/*
> + * Must be used only in pair with a previous call to
> + * multifd_device_state_prepare(). Returns true if the enqueue was
> + * successful, false otherwise.
> + */
> +bool multifd_device_state_finish(MultiFDDeviceState_t *state,
> + void *buf, int64_t buf_len)
> +{
> + bool result = false;
> +
> + /* Currently we only have one global free buffer */
> + assert(state == &device_state_send->u.device_state);
> +
> + if (buf_len <= 0) {
> + goto out;
> }
>
> - return true;
> + multifd_set_payload_type(device_state_send, MULTIFD_PAYLOAD_DEVICE_STATE);
> + /* This normally will be the state->buf_prealloc, but not required */
> + state->buf = buf;
> + state->buf_len = buf_len;
> + result = multifd_send(&device_state_send);
> +out:
> + qemu_mutex_unlock(&queue_job_mutex);
> + return result;
> }
>
> bool migration_has_device_state_support(void)
> diff --git a/migration/multifd.c b/migration/multifd.c
> index 5a20b831cf..2b5185e298 100644
> --- a/migration/multifd.c
> +++ b/migration/multifd.c
> @@ -115,15 +115,6 @@ void multifd_send_data_clear(MultiFDSendData *data)
> return;
> }
>
> - switch (data->type) {
> - case MULTIFD_PAYLOAD_DEVICE_STATE:
> - multifd_device_state_clear(&data->u.device_state);
> - break;
> - default:
> - /* Nothing to do */
> - break;
> - }
> -
> data->type = MULTIFD_PAYLOAD_NONE;
> }
>
> --
> 2.45.0
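For readers following along, this is how I read the caller contract of the
new API; a minimal sketch (read() straight into the reused buffer, as the
VFIO loop above does), where every prepare() must be paired with a finish(),
since finish() is also what drops the queue lock (send_one_chunk() itself is
just an illustration, not in the patches):

static int send_one_chunk(int fd, char *idstr, uint32_t instance_id)
{
    MultiFDDeviceState_t *state;
    int64_t buf_size;
    ssize_t len;
    void *buf;

    /* Grab the (single) free device state object; takes the queue lock */
    state = multifd_device_state_prepare(idstr, instance_id);
    /* Reuse the preallocated 1M buffer instead of allocating a new one */
    buf = multifd_device_state_get_buffer(state, &buf_size);

    len = read(fd, buf, buf_size);
    if (len <= 0) {
        /* EOF or error: cancel; releases the lock without enqueuing */
        multifd_device_state_finish(state, NULL, 0);
        return len < 0 ? -1 : 0;
    }

    /* Enqueue for multifd to send; also releases the lock */
    return multifd_device_state_finish(state, buf, len) ? 0 : -1;
}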
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [PATCH v2 12/17] migration/multifd: Device state transfer support - send side
2024-09-12 18:43 ` Fabiano Rosas
@ 2024-09-13 0:23 ` Peter Xu
2024-09-13 13:21 ` Fabiano Rosas
0 siblings, 1 reply; 128+ messages in thread
From: Peter Xu @ 2024-09-13 0:23 UTC (permalink / raw)
To: Fabiano Rosas
Cc: mail, Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On Thu, Sep 12, 2024 at 03:43:39PM -0300, Fabiano Rosas wrote:
> Peter Xu <peterx@redhat.com> writes:
>
> Hi Peter, sorry if I'm not very enthusiastic about this; I'm sure you
> understand the rework is a little frustrating.
That's OK.
[For some reason my email sync program decided to give up working for
hours. I got more time looking at a tsc bug, which is good, but then I
missed a lot of emails..]
>
> > On Wed, Aug 28, 2024 at 09:41:17PM -0300, Fabiano Rosas wrote:
> >> > +size_t multifd_device_state_payload_size(void)
> >> > +{
> >> > + return sizeof(MultiFDDeviceState_t);
> >> > +}
> >>
> >> This will not be necessary because the payload size is the same as the
> >> data type. We only need it for the special case where the MultiFDPages_t
> >> is smaller than the total ram payload size.
> >
> > Today I was thinking maybe we should really clean this up, as the current
> > multifd_send_data_alloc() is indeed too tricky (blame me.. who requested
> > that more or less). Knowing that VFIO can use dynamic buffers with ->idstr
> > and ->buf (I was thinking it could be buf[1M].. but I was wrong...) made
> > that feeling stronger.
>
> If we're going to commit bad code and then rewrite it a week later, we
> could have just let the original series from Maciej merge without any of
> this.
Why it's "bad code"?
It runs pretty well, I don't think it's bad code. You wrote it, and I
don't think it's bad at all.
But now we're rethinking after reading Maciej's new series. Personally I
don't think it's a major problem.
Note that we're not changing the design back: what was initially proposed
was the client submitting an array of multifd objects. I still don't think
that's right.
What the change does so far is make the union a struct; however, that's
still N+2 objects, not 2*N, where the 2 comes from RAM+VFIO. I think the
important bits are still there (from your previous refactor series).
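Spelling the counting out, with N multifd channels and the two clients
being RAM and VFIO device state:

  N + 2 = N per-channel MultiFDSendData objects
        + 1 staging object owned by the RAM client
        + 1 staging object owned by the device state client

  2 * N = a full set of N staging objects owned by each of the 2 clients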
> I already suggested it a couple of times, we shouldn't be doing
> core refactorings underneath contributors' patches, this is too
> fragile. Just let people contribute their code and we can change it
> later.
I sincerely don't think a lot needs changing... only patch 1. Basically
patch 1 on top of your previous rework series will be at least what I want,
but I'm open to comments from you guys.
Note that patches 2-3 will be on top of Maciej's changes and they're totally
not relevant to what we merged so far. Hence, nothing relevant there to
what you worked on. And this is the diff of patch 1:
migration/multifd.h | 16 +++++++++++-----
migration/multifd-device-state.c | 8 ++++++--
migration/multifd-nocomp.c | 13 ++++++-------
migration/multifd.c | 25 ++++++-------------------
4 files changed, 29 insertions(+), 33 deletions(-)
It's only 33 lines removed (many of which are comments..); it's not a huge
amount. I don't know why you feel so bad about this...
It's probably because we maintain migration together; otherwise we could each
keep our own way of working. I don't think we did anything wrong so far.
We can definitely talk about this in next 1:1.
>
> This is also why I've been trying hard to separate core multifd
> functionality from migration code that uses multifd to transmit their
> data.
>
> My original RFC plus the suggestion to extend multifd_ops for device
> state would have (almost) made it so that no client code would be left
> in multifd. We could have been turning this thing upside down and it
> wouldn't affect anyone in terms of code conflicts.
Do you mean you preferred the 2*N approach?
>
> The ship has already sailed, so your patches below are fine, I have just
> some small comments.
I'm not sure what you meant by "ship sailed", but we should merge code
whenever we think it is the most correct. I hope you meant that after the
changes below everything looks the best; if not, please shoot. That's
exactly what I'm requesting comments for.
>
> >
> > I think we should change it now perhaps, otherwise we'll need to introduce
> > other helpers to e.g. reset the device buffers, and that's not only slow
> > but also not good looking, IMO.
>
> I agree that part is kind of a sore thumb.
>
> >
> > So I went ahead with the idea in previous discussion, that I managed to
> > change the SendData union into struct; the memory consumption is not super
> > important yet, IMHO, but we should still stick with the object model where
> > the multifd enqueue thread switches buffers with multifd, as it still
> > sounds like a sane way to do it.
> >
> > Then when that patch is ready, I further tried to make VFIO reuse multifd
> > buffers just like what we do with MultiFDPages_t->offset[]: in RAM code we
> > don't allocate it every time we enqueue.
> >
> > I hope it'll also work for VFIO. VFIO has the specialty of being able to
> > dump the config space, so it's more complex (and I noticed Maciej's current
> > design requires the final chunk of VFIO config data be migrated in one
> > packet.. that is also part of the complexity there). So I allowed that
> > part to allocate a buffer but only that. IOW, I made some API (see below)
> > that can either reuse preallocated buffer, or use a separate one only for
> > the final bulk.
> >
> > In short, could both of you have a look at what I came up with below? I
> > did that in patches because I think it's too much to comment, so patches
> > may work better. If any of the below look like good changes to you,
> > then either Maciej can squash whatever into the existing patches (and I
> > feel like some existing patches in this series can go away with the design
> > below), or I can post prerequisite patches, but only if any of you prefer
> > that.
> >
> > Anyway, let me know, the patches apply on top of this whole series applied
> > first.
> >
> > I also wonder whether there can be any perf difference already (I tested
> > all multifd qtests with the below, but have no VFIO I can run), perhaps not
> > that much, but just to mention that the below should avoid both buffer
> > allocations and
> > one round of copy (so VFIO read() directly writes to the multifd buffers
> > now).
> >
> > Thanks,
> >
> > ==========8<==========
> > From a6cbcf692b2376e72cc053219d67bb32eabfb7a6 Mon Sep 17 00:00:00 2001
> > From: Peter Xu <peterx@redhat.com>
> > Date: Tue, 10 Sep 2024 12:10:59 -0400
> > Subject: [PATCH 1/3] migration/multifd: Make MultiFDSendData a struct
> >
> > The newly introduced device state buffer can be used not only for storing
> > VFIO's read() raw data, but also for storing generic device
> > states. After noticing that device states may not easily provide a max
> > buffer size (also the fact that RAM MultiFDPages_t after all also want to
> > have flexibility on managing offset[] array), it may not be a good idea to
> > stick with union on MultiFDSendData.. as it won't play well with such
> > flexibility.
> >
> > Switch MultiFDSendData to a struct.
> >
> > It won't consume a lot more space in reality, after all the real buffers
> > were already dynamically allocated, so it's so far only about the two
> > structs (pages, device_state) that will be duplicated, but they're small.
> >
> > With this, we can remove the pretty hard to understand alloc size logic.
> > Because now we can allocate offset[] together with the SendData, and
> > properly free it when the SendData is freed.
> >
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> > migration/multifd.h | 16 +++++++++++-----
> > migration/multifd-device-state.c | 8 ++++++--
> > migration/multifd-nocomp.c | 13 ++++++-------
> > migration/multifd.c | 25 ++++++-------------------
> > 4 files changed, 29 insertions(+), 33 deletions(-)
> >
> > diff --git a/migration/multifd.h b/migration/multifd.h
> > index c15c83104c..47203334b9 100644
> > --- a/migration/multifd.h
> > +++ b/migration/multifd.h
> > @@ -98,9 +98,13 @@ typedef struct {
> > uint32_t num;
> > /* number of normal pages */
> > uint32_t normal_num;
> > + /*
> > + * Pointer to the ramblock. NOTE: it's caller's responsibility to make
> > + * sure the pointer is always valid!
> > + */
>
> This could use some rewording, it's not clear what "caller" means here.
>
> > RAMBlock *block;
> > - /* offset of each page */
> > - ram_addr_t offset[];
> > + /* offset array of each page, managed by multifd */
>
> I'd drop the part after the comma, it's not very accurate and also gives
> an impression that something sophisticated is being done to this.
>
> > + ram_addr_t *offset;
> > } MultiFDPages_t;
> >
> > struct MultiFDRecvData {
> > @@ -123,7 +127,7 @@ typedef enum {
> > MULTIFD_PAYLOAD_DEVICE_STATE,
> > } MultiFDPayloadType;
> >
> > -typedef union MultiFDPayload {
> > +typedef struct MultiFDPayload {
> > MultiFDPages_t ram;
> > MultiFDDeviceState_t device_state;
> > } MultiFDPayload;
> > @@ -323,11 +327,13 @@ static inline uint32_t multifd_ram_page_count(void)
> > void multifd_ram_save_setup(void);
> > void multifd_ram_save_cleanup(void);
> > int multifd_ram_flush_and_sync(void);
> > -size_t multifd_ram_payload_size(void);
> > +void multifd_ram_payload_alloc(MultiFDPages_t *pages);
> > +void multifd_ram_payload_free(MultiFDPages_t *pages);
> > void multifd_ram_fill_packet(MultiFDSendParams *p);
> > int multifd_ram_unfill_packet(MultiFDRecvParams *p, Error **errp);
> >
> > -size_t multifd_device_state_payload_size(void);
> > +void multifd_device_state_payload_alloc(MultiFDDeviceState_t *device_state);
> > +void multifd_device_state_payload_free(MultiFDDeviceState_t *device_state);
> > void multifd_device_state_save_setup(void);
> > void multifd_device_state_clear(MultiFDDeviceState_t *device_state);
> > void multifd_device_state_save_cleanup(void);
> > diff --git a/migration/multifd-device-state.c b/migration/multifd-device-state.c
> > index 9b364e8ef3..72b72b6e62 100644
> > --- a/migration/multifd-device-state.c
> > +++ b/migration/multifd-device-state.c
> > @@ -22,9 +22,13 @@ bool send_threads_abort;
> >
> > static MultiFDSendData *device_state_send;
> >
> > -size_t multifd_device_state_payload_size(void)
> > +/* TODO: use static buffers for idstr and buf */
> > +void multifd_device_state_payload_alloc(MultiFDDeviceState_t *device_state)
> > +{
> > +}
> > +
> > +void multifd_device_state_payload_free(MultiFDDeviceState_t *device_state)
> > {
> > - return sizeof(MultiFDDeviceState_t);
> > }
> >
> > void multifd_device_state_save_setup(void)
> > diff --git a/migration/multifd-nocomp.c b/migration/multifd-nocomp.c
> > index 0b7b543f44..c1b95fee0d 100644
> > --- a/migration/multifd-nocomp.c
> > +++ b/migration/multifd-nocomp.c
> > @@ -22,15 +22,14 @@
> >
> > static MultiFDSendData *multifd_ram_send;
> >
> > -size_t multifd_ram_payload_size(void)
> > +void multifd_ram_payload_alloc(MultiFDPages_t *pages)
> > {
> > - uint32_t n = multifd_ram_page_count();
> > + pages->offset = g_new0(ram_addr_t, multifd_ram_page_count());
> > +}
> >
> > - /*
> > - * We keep an array of page offsets at the end of MultiFDPages_t,
> > - * add space for it in the allocation.
> > - */
> > - return sizeof(MultiFDPages_t) + n * sizeof(ram_addr_t);
> > +void multifd_ram_payload_free(MultiFDPages_t *pages)
> > +{
> > + g_clear_pointer(&pages->offset, g_free);
> > }
> >
> > void multifd_ram_save_setup(void)
> > diff --git a/migration/multifd.c b/migration/multifd.c
> > index bebe5b5a9b..5a20b831cf 100644
> > --- a/migration/multifd.c
> > +++ b/migration/multifd.c
> > @@ -101,26 +101,12 @@ struct {
> >
> > MultiFDSendData *multifd_send_data_alloc(void)
> > {
> > - size_t max_payload_size, size_minus_payload;
> > + MultiFDSendData *new = g_new0(MultiFDSendData, 1);
> >
> > - /*
> > - * MultiFDPages_t has a flexible array at the end, account for it
> > - * when allocating MultiFDSendData. Use max() in case other types
> > - * added to the union in the future are larger than
> > - * (MultiFDPages_t + flex array).
> > - */
> > - max_payload_size = MAX(multifd_ram_payload_size(),
> > - multifd_device_state_payload_size());
> > - max_payload_size = MAX(max_payload_size, sizeof(MultiFDPayload));
> > -
> > - /*
> > - * Account for any holes the compiler might insert. We can't pack
> > - * the structure because that misaligns the members and triggers
> > - * Waddress-of-packed-member.
> > - */
> > - size_minus_payload = sizeof(MultiFDSendData) - sizeof(MultiFDPayload);
> > + multifd_ram_payload_alloc(&new->u.ram);
> > + multifd_device_state_payload_alloc(&new->u.device_state);
> >
> > - return g_malloc0(size_minus_payload + max_payload_size);
> > + return new;
> > }
> >
> > void multifd_send_data_clear(MultiFDSendData *data)
> > @@ -147,7 +133,8 @@ void multifd_send_data_free(MultiFDSendData *data)
> > return;
> > }
> >
> > - multifd_send_data_clear(data);
> > + multifd_ram_payload_free(&data->u.ram);
> > + multifd_device_state_payload_free(&data->u.device_state);
>
> The "u" needs to be dropped now. Could just change to "p". Or can we now
> move the whole struct inside MultiFDSendData?
Yep, all your comments look good to me.
A note here: I intentionally didn't touch "u." as that requires more
changes, which doesn't help while I'm leaving these as patch-styled
comments. As I said, feel free to see the patches as comments, not as
patches for merging yet.
I / Maciej can prepare a patch, but only if the idea in general can be
accepted.
As I mentioned, patches 2-3 are not much relevant to the current master
branch, afaiu, so if you guys like I can repost patch 1 as a formal one,
but only if Maciej thinks it's easier for him.
>
> >
> > g_free(data);
> > }
> > --
> > 2.45.0
> >
> >
> >
> > From 6695d134c0818f42183f5ea03c21e6d56e7b57ea Mon Sep 17 00:00:00 2001
> > From: Peter Xu <peterx@redhat.com>
> > Date: Tue, 10 Sep 2024 12:24:14 -0400
> > Subject: [PATCH 2/3] migration/multifd: Optimize device_state->idstr updates
> >
> > The duplication / allocation of idstr for each VFIO blob is overkill, as
> > idstr isn't something that changes frequently. Also, the idstr always came
> > from the upper layer of se->idstr so it's always guaranteed to
> > exist (e.g. no device unplug allowed during migration).
> >
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> > migration/multifd.h | 4 ++++
> > migration/multifd-device-state.c | 10 +++++++---
> > 2 files changed, 11 insertions(+), 3 deletions(-)
> >
> > diff --git a/migration/multifd.h b/migration/multifd.h
> > index 47203334b9..1eaa5d4496 100644
> > --- a/migration/multifd.h
> > +++ b/migration/multifd.h
> > @@ -115,6 +115,10 @@ struct MultiFDRecvData {
> > };
> >
> > typedef struct {
> > + /*
> > + * Name of the owner device. NOTE: it's caller's responsibility to
> > + * make sure the pointer is always valid!
> > + */
>
> Same comment as the other one here.
>
> > char *idstr;
> > uint32_t instance_id;
> > char *buf;
> > diff --git a/migration/multifd-device-state.c b/migration/multifd-device-state.c
> > index 72b72b6e62..cfd0465bac 100644
> > --- a/migration/multifd-device-state.c
> > +++ b/migration/multifd-device-state.c
> > @@ -44,7 +44,7 @@ void multifd_device_state_save_setup(void)
> >
> > void multifd_device_state_clear(MultiFDDeviceState_t *device_state)
> > {
> > - g_clear_pointer(&device_state->idstr, g_free);
> > + device_state->idstr = NULL;
> > g_clear_pointer(&device_state->buf, g_free);
> > }
> >
> > @@ -100,7 +100,12 @@ bool multifd_queue_device_state(char *idstr, uint32_t instance_id,
> >
> > multifd_set_payload_type(device_state_send, MULTIFD_PAYLOAD_DEVICE_STATE);
> > device_state = &device_state_send->u.device_state;
> > - device_state->idstr = g_strdup(idstr);
> > + /*
> > + * NOTE: here we must use a static idstr (e.g. of a savevm state
> > + * entry) rather than any dynamically allocated buffer, because multifd
> > + * assumes this pointer is always valid!
> > + */
> > + device_state->idstr = idstr;
> > device_state->instance_id = instance_id;
> > device_state->buf = g_memdup2(data, len);
> > device_state->buf_len = len;
> > @@ -137,7 +142,6 @@ static void multifd_device_state_save_thread_data_free(void *opaque)
> > {
> > struct MultiFDDSSaveThreadData *data = opaque;
> >
> > - g_clear_pointer(&data->idstr, g_free);
> > g_free(data);
> > }
> >
> > --
> > 2.45.0
> >
> >
> > From abfea9698ff46ad0e0175e1dcc6e005e0b2ece2a Mon Sep 17 00:00:00 2001
> > From: Peter Xu <peterx@redhat.com>
> > Date: Tue, 10 Sep 2024 12:27:49 -0400
> > Subject: [PATCH 3/3] migration/multifd: Optimize device_state buffer
> > allocations
> >
> > Provide a device_state->buf_prealloc so that the buffers can be reused if
> > possible. Provide a set of APIs to use it right. Please see the
> > documentation for the API in the code.
> >
> > The default buffer size came from VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE as of
> > now.
> >
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> > include/hw/vfio/vfio-common.h | 9 ++++
> > include/migration/misc.h | 12 ++++-
> > migration/multifd.h | 11 +++-
> > hw/vfio/migration.c | 43 ++++++++-------
> > migration/multifd-device-state.c | 93 +++++++++++++++++++++++++-------
> > migration/multifd.c | 9 ----
> > 6 files changed, 126 insertions(+), 51 deletions(-)
> >
> > diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> > index 4578a0ca6a..c1f2f4ae55 100644
> > --- a/include/hw/vfio/vfio-common.h
> > +++ b/include/hw/vfio/vfio-common.h
> > @@ -61,6 +61,13 @@ typedef struct VFIORegion {
> > uint8_t nr; /* cache the region number for debug */
> > } VFIORegion;
> >
> > +typedef struct VFIODeviceStatePacket {
> > + uint32_t version;
> > + uint32_t idx;
> > + uint32_t flags;
> > + uint8_t data[0];
> > +} QEMU_PACKED VFIODeviceStatePacket;
> > +
> > typedef struct VFIOMigration {
> > struct VFIODevice *vbasedev;
> > VMChangeStateEntry *vm_state;
> > @@ -168,6 +175,8 @@ typedef struct VFIODevice {
> > int devid;
> > IOMMUFDBackend *iommufd;
> > VFIOIOASHwpt *hwpt;
> > + /* Only used on sender side when multifd is enabled */
> > + VFIODeviceStatePacket *multifd_packet;
> > QLIST_ENTRY(VFIODevice) hwpt_next;
> > } VFIODevice;
> >
> > diff --git a/include/migration/misc.h b/include/migration/misc.h
> > index 26f7f3140f..1a8676ed3d 100644
> > --- a/include/migration/misc.h
> > +++ b/include/migration/misc.h
> > @@ -112,8 +112,16 @@ bool migration_in_bg_snapshot(void);
> > void dirty_bitmap_mig_init(void);
> >
> > /* migration/multifd-device-state.c */
> > -bool multifd_queue_device_state(char *idstr, uint32_t instance_id,
> > - char *data, size_t len);
> > +struct MultiFDDeviceState_t;
> > +typedef struct MultiFDDeviceState_t MultiFDDeviceState_t;
> > +
> > +MultiFDDeviceState_t *
> > +multifd_device_state_prepare(char *idstr, uint32_t instance_id);
> > +void *multifd_device_state_get_buffer(MultiFDDeviceState_t *s,
> > + int64_t *buf_len);
> > +bool multifd_device_state_finish(MultiFDDeviceState_t *state,
> > + void *buf, int64_t buf_len);
> > +
> > bool migration_has_device_state_support(void);
> >
> > void
> > diff --git a/migration/multifd.h b/migration/multifd.h
> > index 1eaa5d4496..1ccdeeb8c5 100644
> > --- a/migration/multifd.h
> > +++ b/migration/multifd.h
> > @@ -15,6 +15,7 @@
> >
> > #include "exec/target_page.h"
> > #include "ram.h"
> > +#include "migration/misc.h"
> >
> > typedef struct MultiFDRecvData MultiFDRecvData;
> > typedef struct MultiFDSendData MultiFDSendData;
> > @@ -114,16 +115,22 @@ struct MultiFDRecvData {
> > off_t file_offset;
> > };
> >
> > -typedef struct {
> > +struct MultiFDDeviceState_t {
> > /*
> > * Name of the owner device. NOTE: it's caller's responsibility to
> > * make sure the pointer is always valid!
> > */
> > char *idstr;
> > uint32_t instance_id;
> > + /*
> > + * Points to the buffer to send via multifd. Normally it's the same as
> > + * buf_prealloc, otherwise the caller needs to make sure the buffer is
> > + * avaliable through multifd running.
>
> "throughout multifd runtime" maybe.
>
> > + */
> > char *buf;
> > + char *buf_prealloc;
> > size_t buf_len;
> > -} MultiFDDeviceState_t;
> > +};
> >
> > typedef enum {
> > MULTIFD_PAYLOAD_NONE,
> > diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> > index 67996aa2df..e36422b7c5 100644
> > --- a/hw/vfio/migration.c
> > +++ b/hw/vfio/migration.c
> > @@ -59,13 +59,6 @@
> >
> > #define VFIO_DEVICE_STATE_CONFIG_STATE (1)
> >
> > -typedef struct VFIODeviceStatePacket {
> > - uint32_t version;
> > - uint32_t idx;
> > - uint32_t flags;
> > - uint8_t data[0];
> > -} QEMU_PACKED VFIODeviceStatePacket;
> > -
> > static int64_t bytes_transferred;
> >
> > static const char *mig_state_to_str(enum vfio_device_mig_state state)
> > @@ -741,6 +734,9 @@ static void vfio_save_cleanup(void *opaque)
> > migration->initial_data_sent = false;
> > vfio_migration_cleanup(vbasedev);
> > trace_vfio_save_cleanup(vbasedev->name);
> > + if (vbasedev->multifd_packet) {
> > + g_clear_pointer(&vbasedev->multifd_packet, g_free);
> > + }
> > }
> >
> > static void vfio_state_pending_estimate(void *opaque, uint64_t *must_precopy,
> > @@ -892,7 +888,8 @@ static int vfio_save_complete_precopy_async_thread_config_state(VFIODevice *vbas
> > g_autoptr(QIOChannelBuffer) bioc = NULL;
> > QEMUFile *f = NULL;
> > int ret;
> > - g_autofree VFIODeviceStatePacket *packet = NULL;
> > + VFIODeviceStatePacket *packet;
> > + MultiFDDeviceState_t *state;
> > size_t packet_len;
> >
> > bioc = qio_channel_buffer_new(0);
> > @@ -911,13 +908,19 @@ static int vfio_save_complete_precopy_async_thread_config_state(VFIODevice *vbas
> > }
> >
> > packet_len = sizeof(*packet) + bioc->usage;
> > - packet = g_malloc0(packet_len);
> > +
> > + state = multifd_device_state_prepare(idstr, instance_id);
> > + /*
> > + * Do not reuse the multifd buffer; use our own due to the arbitrary size.
> > + * The buffer will be freed only at save cleanup.
> > + */
> > + vbasedev->multifd_packet = g_malloc0(packet_len);
> > + packet = vbasedev->multifd_packet;
> > packet->idx = idx;
> > packet->flags = VFIO_DEVICE_STATE_CONFIG_STATE;
> > memcpy(&packet->data, bioc->data, bioc->usage);
> >
> > - if (!multifd_queue_device_state(idstr, instance_id,
> > - (char *)packet, packet_len)) {
> > + if (!multifd_device_state_finish(state, packet, packet_len)) {
> > ret = -1;
> > }
> >
> > @@ -936,7 +939,6 @@ static int vfio_save_complete_precopy_thread(char *idstr,
> > VFIODevice *vbasedev = opaque;
> > VFIOMigration *migration = vbasedev->migration;
> > int ret;
> > - g_autofree VFIODeviceStatePacket *packet = NULL;
> > uint32_t idx;
> >
> > if (!migration->multifd_transfer) {
> > @@ -954,21 +956,25 @@ static int vfio_save_complete_precopy_thread(char *idstr,
> > goto ret_finish;
> > }
> >
> > - packet = g_malloc0(sizeof(*packet) + migration->data_buffer_size);
> > -
> > for (idx = 0; ; idx++) {
> > + VFIODeviceStatePacket *packet;
> > + MultiFDDeviceState_t *state;
> > ssize_t data_size;
> > size_t packet_size;
> > + int64_t buf_size;
> >
> > if (qatomic_read(abort_flag)) {
> > ret = -ECANCELED;
> > goto ret_finish;
> > }
> >
> > + state = multifd_device_state_prepare(idstr, instance_id);
> > + packet = multifd_device_state_get_buffer(state, &buf_size);
> > data_size = read(migration->data_fd, &packet->data,
> > - migration->data_buffer_size);
> > + buf_size - sizeof(*packet));
> > if (data_size < 0) {
> > if (errno != ENOMSG) {
> > + multifd_device_state_finish(state, NULL, 0);
> > ret = -errno;
> > goto ret_finish;
> > }
> > @@ -980,14 +986,15 @@ static int vfio_save_complete_precopy_thread(char *idstr,
> > data_size = 0;
> > }
> >
> > - if (data_size == 0)
> > + if (data_size == 0) {
> > + multifd_device_state_finish(state, NULL, 0);
> > break;
> > + }
> >
> > packet->idx = idx;
> > packet_size = sizeof(*packet) + data_size;
> >
> > - if (!multifd_queue_device_state(idstr, instance_id,
> > - (char *)packet, packet_size)) {
> > + if (!multifd_device_state_finish(state, packet, packet_size)) {
> > ret = -1;
> > goto ret_finish;
> > }
> > diff --git a/migration/multifd-device-state.c b/migration/multifd-device-state.c
> > index cfd0465bac..6f0259426d 100644
> > --- a/migration/multifd-device-state.c
> > +++ b/migration/multifd-device-state.c
> > @@ -20,15 +20,18 @@ ThreadPool *send_threads;
> > int send_threads_ret;
> > bool send_threads_abort;
> >
> > +#define MULTIFD_DEVICE_STATE_BUFLEN (1UL << 20)
> > +
> > static MultiFDSendData *device_state_send;
> >
> > -/* TODO: use static buffers for idstr and buf */
> > void multifd_device_state_payload_alloc(MultiFDDeviceState_t *device_state)
> > {
> > + device_state->buf_prealloc = g_malloc0(MULTIFD_DEVICE_STATE_BUFLEN);
> > }
> >
> > void multifd_device_state_payload_free(MultiFDDeviceState_t *device_state)
> > {
> > + g_clear_pointer(&device_state->buf_prealloc, g_free);
> > }
> >
> > void multifd_device_state_save_setup(void)
> > @@ -42,12 +45,6 @@ void multifd_device_state_save_setup(void)
> > send_threads_abort = false;
> > }
> >
> > -void multifd_device_state_clear(MultiFDDeviceState_t *device_state)
> > -{
> > - device_state->idstr = NULL;
> > - g_clear_pointer(&device_state->buf, g_free);
> > -}
> > -
> > void multifd_device_state_save_cleanup(void)
> > {
> > g_clear_pointer(&send_threads, thread_pool_free);
> > @@ -89,33 +86,89 @@ void multifd_device_state_send_prepare(MultiFDSendParams *p)
> > multifd_device_state_fill_packet(p);
> > }
> >
> > -bool multifd_queue_device_state(char *idstr, uint32_t instance_id,
> > - char *data, size_t len)
> > +/*
> > + * Prepare to send some device state via multifd. Returns the current idle
> > + * MultiFDDeviceState_t*.
> > + *
> > + * As a follow up, the caller must call multifd_device_state_finish() to
> > + * release the resources.
> > + *
> > + * One example usage of the API:
> > + *
> > + * // Fetch a free multifd device state object
> > + * state = multifd_device_state_prepare(idstr, instance_id);
> > + *
> > + * // Optional: fetch the buffer to reuse
> > + * buf = multifd_device_state_get_buffer(state, &buf_size);
> > + *
> > + * // Here len>0 means success, otherwise failure
> > + * len = buffer_fill(buf, buf_size);
> > + *
> > + * // Finish the transaction, either enqueue or cancel the request. Here
> > + * // len>0 will enqueue, <=0 will cancel.
> > + * multifd_device_state_finish(state, buf, len);
> > + */
> > +MultiFDDeviceState_t *
> > +multifd_device_state_prepare(char *idstr, uint32_t instance_id)
> > {
> > - /* Device state submissions can come from multiple threads */
> > - QEMU_LOCK_GUARD(&queue_job_mutex);
> > MultiFDDeviceState_t *device_state;
> >
> > assert(multifd_payload_empty(device_state_send));
> >
> > - multifd_set_payload_type(device_state_send, MULTIFD_PAYLOAD_DEVICE_STATE);
> > + /*
> > + * TODO: The lock name may need changing, but I'm reusing it just for
> > + * simplicity.
> > + */
> > + qemu_mutex_lock(&queue_job_mutex);
> > +
> > device_state = &device_state_send->u.device_state;
> > /*
> > - * NOTE: here we must use a static idstr (e.g. of a savevm state
> > - * entry) rather than any dynamically allocated buffer, because multifd
> > + * NOTE: here we must use a static idstr (e.g. of a savevm state entry)
> > + * rather than any dynamically allocated buffer, because multifd
> > * assumes this pointer is always valid!
> > */
> > device_state->idstr = idstr;
> > device_state->instance_id = instance_id;
> > - device_state->buf = g_memdup2(data, len);
> > - device_state->buf_len = len;
> >
> > - if (!multifd_send(&device_state_send)) {
> > - multifd_send_data_clear(device_state_send);
> > - return false;
> > + return &device_state_send->u.device_state;
> > +}
> > +
> > +/*
> > + * Must be used after a previous call to multifd_device_state_prepare();
> > + * the buffer must not be used after invoking multifd_device_state_finish().
> > + */
> > +void *multifd_device_state_get_buffer(MultiFDDeviceState_t *s,
> > + int64_t *buf_len)
> > +{
> > + *buf_len = MULTIFD_DEVICE_STATE_BUFLEN;
> > + return s->buf_prealloc;
> > +}
> > +
> > +/*
> > + * Must be used only in pair with a previous call to
> > + * multifd_device_state_prepare(). Returns true if the enqueue was
> > + * successful, false otherwise.
> > + */
> > +bool multifd_device_state_finish(MultiFDDeviceState_t *state,
> > + void *buf, int64_t buf_len)
> > +{
> > + bool result = false;
> > +
> > + /* Currently we only have one global free buffer */
> > + assert(state == &device_state_send->u.device_state);
> > +
> > + if (buf_len <= 0) {
> > + goto out;
> > }
> >
> > - return true;
> > + multifd_set_payload_type(device_state_send, MULTIFD_PAYLOAD_DEVICE_STATE);
> > + /* This normally will be the state->buf_prealloc, but not required */
> > + state->buf = buf;
> > + state->buf_len = buf_len;
> > + result = multifd_send(&device_state_send);
> > +out:
> > + qemu_mutex_unlock(&queue_job_mutex);
> > + return result;
> > }
> >
> > bool migration_has_device_state_support(void)
> > diff --git a/migration/multifd.c b/migration/multifd.c
> > index 5a20b831cf..2b5185e298 100644
> > --- a/migration/multifd.c
> > +++ b/migration/multifd.c
> > @@ -115,15 +115,6 @@ void multifd_send_data_clear(MultiFDSendData *data)
> > return;
> > }
> >
> > - switch (data->type) {
> > - case MULTIFD_PAYLOAD_DEVICE_STATE:
> > - multifd_device_state_clear(&data->u.device_state);
> > - break;
> > - default:
> > - /* Nothing to do */
> > - break;
> > - }
> > -
> > data->type = MULTIFD_PAYLOAD_NONE;
> > }
> >
> > --
> > 2.45.0
>
--
Peter Xu
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [PATCH v2 12/17] migration/multifd: Device state transfer support - send side
2024-09-13 0:23 ` Peter Xu
@ 2024-09-13 13:21 ` Fabiano Rosas
2024-09-13 14:19 ` Peter Xu
0 siblings, 1 reply; 128+ messages in thread
From: Fabiano Rosas @ 2024-09-13 13:21 UTC (permalink / raw)
To: Peter Xu
Cc: mail, Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
Peter Xu <peterx@redhat.com> writes:
> On Thu, Sep 12, 2024 at 03:43:39PM -0300, Fabiano Rosas wrote:
>> Peter Xu <peterx@redhat.com> writes:
>>
>> Hi Peter, sorry if I'm not very enthusiastic about this; I'm sure you
>> understand the rework is a little frustrating.
>
> That's OK.
>
> [For some reason my email sync program decided to give up working for
> hours. I got more time looking at a tsc bug, which is good, but then I
> missed a lot of emails..]
>
>>
>> > On Wed, Aug 28, 2024 at 09:41:17PM -0300, Fabiano Rosas wrote:
>> >> > +size_t multifd_device_state_payload_size(void)
>> >> > +{
>> >> > + return sizeof(MultiFDDeviceState_t);
>> >> > +}
>> >>
>> >> This will not be necessary because the payload size is the same as the
>> >> data type. We only need it for the special case where the MultiFDPages_t
>> >> is smaller than the total ram payload size.
>> >
>> > Today I was thinking maybe we should really clean this up, as the current
>> > multifd_send_data_alloc() is indeed too tricky (blame me.. who requested
>> > that more or less). Knowing that VFIO can use dynamic buffers with ->idstr
>> > and ->buf (I was thinking it could be buf[1M].. but I was wrong...) made
>> > that feeling stronger.
>>
>> If we're going to commit bad code and then rewrite it a week later, we
>> could have just let the original series from Maciej merge without any of
>> this.
>
> Why it's "bad code"?
>
> It runs pretty well, I don't think it's bad code. You wrote it, and I
> don't think it's bad at all.
Code that forces us to do arithmetic in order to properly allocate
memory and comes with a big comment explaining how we're dodging
compiler warnings is bad code in my book.
>
> But now we're rethinking after reading Maciej's new series.
> Personally I don't think it's a major problem.
>
> Note that we're not changing the design back: what was initially proposed
> was the client submitting an array of multifd objects. I still don't think
> that's right.
>
> What the change does so far is make the union a struct; however, that's
> still N+2 objects, not 2*N, where the 2 comes from RAM+VFIO. I think the
> important bits are still there (from your previous refactor series).
>
You fail to appreciate that before the RFC series, multifd already
allocated N for the pages. The device state adds another client, so that
would be another N anyway. The problem the RFC tried to solve was that
multifd channels owned that 2N, so the array was added to move the
memory into the client's ownership. IOW, it wasn't even the RFC series
that made it 2N; that was the multifd design all along. Now in hindsight
I don't think we should have gone with the memory saving discussion.
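Schematically (a sketch of the ownership, not code from any series):

  /* pre-RFC: multifd owned all the staging memory */
  MultiFDSendParams channels[N];      /* each with its own payload */

  /* post-RFC: each client owns one staging object and swaps pointers
   * with an idle channel when it enqueues */
  MultiFDSendData *ram_staging;           /* RAM client */
  MultiFDSendData *device_state_staging;  /* device state client */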
>> I already suggested it a couple of times, we shouldn't be doing
>> core refactorings underneath contributors' patches, this is too
>> fragile. Just let people contribute their code and we can change it
>> later.
>
> I sincerely don't think a lot needs changing... only patch 1. Basically
> patch 1 on top of your previous rework series will be at least what I want,
> but I'm open to comments from you guys.
Don't get me wrong, I'm very much in favor of what you're doing
here. However, I don't think it's ok to be backtracking on our design
while other people have series in flight that depend on it. You
certainly know the feeling of trying to merge a feature and having
maintainers ask you to rewrite a bunch code just to be able to start
working. That's not ideal.
I tried to quickly insert the RFC series before the device state series
progressed too much, but it's 3 months later and we're still discussing
it, maybe we don't need to do it this way.
And ok, let's consider the current situation a special case. But I would
like to avoid this kind of uncertainty in the future.
>
> Note that patches 2-3 will be on top of Maciej's changes and they're totally
> not relevant to what we merged so far. Hence, nothing relevant there to
> what you worked on. And this is the diff of patch 1:
>
> migration/multifd.h | 16 +++++++++++-----
> migration/multifd-device-state.c | 8 ++++++--
> migration/multifd-nocomp.c | 13 ++++++-------
> migration/multifd.c | 25 ++++++-------------------
> 4 files changed, 29 insertions(+), 33 deletions(-)
>
> It's only 33 lines removed (many of which are comments..); it's not a huge
> amount. I don't know why you feel so bad about this...
>
> It's probably because we maintain migration together; otherwise we could each
> keep our own way of working. I don't think we did anything wrong so far.
>
> We can definitely talk about this in next 1:1.
>
>>
>> This is also why I've been trying hard to separate core multifd
>> functionality from migration code that uses multifd to transmit their
>> data.
>>
>> My original RFC plus the suggestion to extend multifd_ops for device
>> state would have (almost) made it so that no client code would be left
>> in multifd. We could have been turning this thing upside down and it
>> wouldn't affect anyone in terms of code conflicts.
>
> Do you mean you preferred the 2*N approach?
>
2*N, where N is usually not larger than 32 and the payload size is
1k. Yes, I'd trade that off no problem.
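Taking those numbers at face value, the worst case is:

  2 * N * payload = 2 * 32 * ~1 KiB = ~64 KiB

of extra staging memory for the whole migration.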
>>
>> The ship has already sailed, so your patches below are fine, I have just
>> some small comments.
>
> I'm not sure what you meant by "ship sailed", but we should merge code
> whenever we think it is the most correct.
As you put it above, I agree that the important bits of the original series
have been preserved, but other secondary goals were lost, such as the
more abstract separation between multifd & client code; that is the
ship that has sailed.
That series was not "introduce this array for no reason"; we also lost
the ability to abstract the payload from the multifd threads when we
dropped the .alloc_fn callback, for instance. The last patch you posted
here now adds multifd_device_state_prepare(), somewhat ignoring that the
ram code also has the same pattern and could be made to use the same
API.
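Something along these lines would have covered both clients (a hypothetical
sketch; none of these names exist in the posted patches):

/* Generic counterpart of multifd_device_state_prepare(): grab the idle
 * staging object of the given payload type, holding the queue lock. */
MultiFDSendData *multifd_payload_prepare(MultiFDPayloadType type);

/* Enqueue the staged payload (enqueue == true) or cancel it; either
 * way the queue lock is released. */
bool multifd_payload_finish(MultiFDSendData *data, bool enqueue);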
I did accept your premise that ram+compression is one thing while
device_state is another, so I'm not asking it to be changed, just
pointing out that the RFC series also addressed those issues. I might
not have made that clear back then.
> I hope you meant that after the changes below everything looks the best;
> if not, please shoot. That's exactly what I'm requesting comments for.
What you have here is certainly an improvement over the current
state. I'm just ranting about the path we took here.
>>
>> >
>> > I think we should change it now perhaps, otherwise we'll need to introduce
>> > other helpers to e.g. reset the device buffers, and that's not only slow
>> > but also not good looking, IMO.
>>
>> I agree that part is kind of a sore thumb.
>>
>> >
>> > So I went ahead with the idea in previous discussion, that I managed to
>> > change the SendData union into struct; the memory consumption is not super
>> > important yet, IMHO, but we should still stick with the object model where
>> > the multifd enqueue thread switches buffers with multifd, as it still
>> > sounds like a sane way to do it.
>> >
>> > Then when that patch is ready, I further tried to make VFIO reuse multifd
>> > buffers just like what we do with MultiFDPages_t->offset[]: in RAM code we
>> > don't allocate it every time we enqueue.
>> >
>> > I hope it'll also work for VFIO. VFIO has the specialty of being able to
>> > dump the config space, so it's more complex (and I noticed Maciej's current
>> > design requires the final chunk of VFIO config data be migrated in one
>> > packet.. that is also part of the complexity there). So I allowed that
>> > part to allocate a buffer but only that. IOW, I made some API (see below)
>> > that can either reuse preallocated buffer, or use a separate one only for
>> > the final bulk.
>> >
>> > In short, could both of you have a look at what I came up with below? I
>> > did that in patches because I think it's too much to comment, so patches
>> > may work better. If any of the below look like good changes to you,
>> > then either Maciej can squash whatever into the existing patches (and I
>> > feel like some existing patches in this series can go away with the design
>> > below), or I can post prerequisite patches, but only if any of you prefer
>> > that.
>> >
>> > Anyway, let me know, the patches apply on top of this whole series applied
>> > first.
>> >
>> > I also wonder whether there can be any perf difference already (I tested
>> > all multifd qtests with the below, but have no VFIO I can run), perhaps not
>> > that much, but just to mention that the below should avoid both buffer
>> > allocations and
>> > one round of copy (so VFIO read() directly writes to the multifd buffers
>> > now).
>> >
>> > Thanks,
>> >
>> > ==========8<==========
>> > From a6cbcf692b2376e72cc053219d67bb32eabfb7a6 Mon Sep 17 00:00:00 2001
>> > From: Peter Xu <peterx@redhat.com>
>> > Date: Tue, 10 Sep 2024 12:10:59 -0400
>> > Subject: [PATCH 1/3] migration/multifd: Make MultiFDSendData a struct
>> >
>> > The newly introduced device state buffer can be used not only for storing
>> > VFIO's read() raw data, but also for storing generic device
>> > states. After noticing that device states may not easily provide a max
>> > buffer size (also the fact that RAM MultiFDPages_t after all also want to
>> > have flexibility on managing offset[] array), it may not be a good idea to
>> > stick with union on MultiFDSendData.. as it won't play well with such
>> > flexibility.
>> >
>> > Switch MultiFDSendData to a struct.
>> >
>> > It won't consume a lot more space in reality, after all the real buffers
>> > were already dynamically allocated, so it's so far only about the two
>> > structs (pages, device_state) that will be duplicated, but they're small.
>> >
>> > With this, we can remove the pretty hard to understand alloc size logic.
>> > Because now we can allocate offset[] together with the SendData, and
>> > properly free it when the SendData is freed.
>> >
>> > Signed-off-by: Peter Xu <peterx@redhat.com>
>> > ---
>> > migration/multifd.h | 16 +++++++++++-----
>> > migration/multifd-device-state.c | 8 ++++++--
>> > migration/multifd-nocomp.c | 13 ++++++-------
>> > migration/multifd.c | 25 ++++++-------------------
>> > 4 files changed, 29 insertions(+), 33 deletions(-)
>> >
>> > diff --git a/migration/multifd.h b/migration/multifd.h
>> > index c15c83104c..47203334b9 100644
>> > --- a/migration/multifd.h
>> > +++ b/migration/multifd.h
>> > @@ -98,9 +98,13 @@ typedef struct {
>> > uint32_t num;
>> > /* number of normal pages */
>> > uint32_t normal_num;
>> > + /*
>> > + * Pointer to the ramblock. NOTE: it's caller's responsibility to make
>> > + * sure the pointer is always valid!
>> > + */
>>
>> This could use some rewording, it's not clear what "caller" means here.
>>
>> > RAMBlock *block;
>> > - /* offset of each page */
>> > - ram_addr_t offset[];
>> > + /* offset array of each page, managed by multifd */
>>
>> I'd drop the part after the comma, it's not very accurate and also gives
>> an impression that something sophisticated is being done to this.
>>
>> > + ram_addr_t *offset;
>> > } MultiFDPages_t;
>> >
>> > struct MultiFDRecvData {
>> > @@ -123,7 +127,7 @@ typedef enum {
>> > MULTIFD_PAYLOAD_DEVICE_STATE,
>> > } MultiFDPayloadType;
>> >
>> > -typedef union MultiFDPayload {
>> > +typedef struct MultiFDPayload {
>> > MultiFDPages_t ram;
>> > MultiFDDeviceState_t device_state;
>> > } MultiFDPayload;
>> > @@ -323,11 +327,13 @@ static inline uint32_t multifd_ram_page_count(void)
>> > void multifd_ram_save_setup(void);
>> > void multifd_ram_save_cleanup(void);
>> > int multifd_ram_flush_and_sync(void);
>> > -size_t multifd_ram_payload_size(void);
>> > +void multifd_ram_payload_alloc(MultiFDPages_t *pages);
>> > +void multifd_ram_payload_free(MultiFDPages_t *pages);
>> > void multifd_ram_fill_packet(MultiFDSendParams *p);
>> > int multifd_ram_unfill_packet(MultiFDRecvParams *p, Error **errp);
>> >
>> > -size_t multifd_device_state_payload_size(void);
>> > +void multifd_device_state_payload_alloc(MultiFDDeviceState_t *device_state);
>> > +void multifd_device_state_payload_free(MultiFDDeviceState_t *device_state);
>> > void multifd_device_state_save_setup(void);
>> > void multifd_device_state_clear(MultiFDDeviceState_t *device_state);
>> > void multifd_device_state_save_cleanup(void);
>> > diff --git a/migration/multifd-device-state.c b/migration/multifd-device-state.c
>> > index 9b364e8ef3..72b72b6e62 100644
>> > --- a/migration/multifd-device-state.c
>> > +++ b/migration/multifd-device-state.c
>> > @@ -22,9 +22,13 @@ bool send_threads_abort;
>> >
>> > static MultiFDSendData *device_state_send;
>> >
>> > -size_t multifd_device_state_payload_size(void)
>> > +/* TODO: use static buffers for idstr and buf */
>> > +void multifd_device_state_payload_alloc(MultiFDDeviceState_t *device_state)
>> > +{
>> > +}
>> > +
>> > +void multifd_device_state_payload_free(MultiFDDeviceState_t *device_state)
>> > {
>> > - return sizeof(MultiFDDeviceState_t);
>> > }
>> >
>> > void multifd_device_state_save_setup(void)
>> > diff --git a/migration/multifd-nocomp.c b/migration/multifd-nocomp.c
>> > index 0b7b543f44..c1b95fee0d 100644
>> > --- a/migration/multifd-nocomp.c
>> > +++ b/migration/multifd-nocomp.c
>> > @@ -22,15 +22,14 @@
>> >
>> > static MultiFDSendData *multifd_ram_send;
>> >
>> > -size_t multifd_ram_payload_size(void)
>> > +void multifd_ram_payload_alloc(MultiFDPages_t *pages)
>> > {
>> > - uint32_t n = multifd_ram_page_count();
>> > + pages->offset = g_new0(ram_addr_t, multifd_ram_page_count());
>> > +}
>> >
>> > - /*
>> > - * We keep an array of page offsets at the end of MultiFDPages_t,
>> > - * add space for it in the allocation.
>> > - */
>> > - return sizeof(MultiFDPages_t) + n * sizeof(ram_addr_t);
>> > +void multifd_ram_payload_free(MultiFDPages_t *pages)
>> > +{
>> > + g_clear_pointer(&pages->offset, g_free);
>> > }
>> >
>> > void multifd_ram_save_setup(void)
>> > diff --git a/migration/multifd.c b/migration/multifd.c
>> > index bebe5b5a9b..5a20b831cf 100644
>> > --- a/migration/multifd.c
>> > +++ b/migration/multifd.c
>> > @@ -101,26 +101,12 @@ struct {
>> >
>> > MultiFDSendData *multifd_send_data_alloc(void)
>> > {
>> > - size_t max_payload_size, size_minus_payload;
>> > + MultiFDSendData *new = g_new0(MultiFDSendData, 1);
>> >
>> > - /*
>> > - * MultiFDPages_t has a flexible array at the end, account for it
>> > - * when allocating MultiFDSendData. Use max() in case other types
>> > - * added to the union in the future are larger than
>> > - * (MultiFDPages_t + flex array).
>> > - */
>> > - max_payload_size = MAX(multifd_ram_payload_size(),
>> > - multifd_device_state_payload_size());
>> > - max_payload_size = MAX(max_payload_size, sizeof(MultiFDPayload));
>> > -
>> > - /*
>> > - * Account for any holes the compiler might insert. We can't pack
>> > - * the structure because that misaligns the members and triggers
>> > - * Waddress-of-packed-member.
>> > - */
>> > - size_minus_payload = sizeof(MultiFDSendData) - sizeof(MultiFDPayload);
>> > + multifd_ram_payload_alloc(&new->u.ram);
>> > + multifd_device_state_payload_alloc(&new->u.device_state);
>> >
>> > - return g_malloc0(size_minus_payload + max_payload_size);
>> > + return new;
>> > }
>> >
>> > void multifd_send_data_clear(MultiFDSendData *data)
>> > @@ -147,7 +133,8 @@ void multifd_send_data_free(MultiFDSendData *data)
>> > return;
>> > }
>> >
>> > - multifd_send_data_clear(data);
>> > + multifd_ram_payload_free(&data->u.ram);
>> > + multifd_device_state_payload_free(&data->u.device_state);
>>
>> The "u" needs to be dropped now. Could just change to "p". Or can we now
>> move the whole struct inside MultiFDSendData?
>
> Yep, all your comments look good to me.
>
> A note here: I intentionally didn't touch "u." as that requires more
> changes, which doesn't help while I'm leaving these as patch-styled
> comments. As I said, feel free to see the patches as comments, not as
> patches for merging yet. I / Maciej can prepare a patch, but only if
> the idea in general can be accepted.
>
> As I mentioned, patches 2-3 are not much relevant to the current master
> branch, afaiu, so if you guys like I can repost patch 1 as a formal one,
> but only if Maciej thinks it's easier for him.
>
I don't mind either way. If it were a proper series, it could be fetched
with b4, maybe that helps.
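E.g. (the message-id below is made up, just to show the shape):

  b4 am 20240910121059.12345-1-peterx@redhat.com

and then 'git am' the mbox that b4 writes out.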
>>
>> >
>> > g_free(data);
>> > }
>> > --
>> > 2.45.0
>> >
>> >
>> >
>> > From 6695d134c0818f42183f5ea03c21e6d56e7b57ea Mon Sep 17 00:00:00 2001
>> > From: Peter Xu <peterx@redhat.com>
>> > Date: Tue, 10 Sep 2024 12:24:14 -0400
>> > Subject: [PATCH 2/3] migration/multifd: Optimize device_state->idstr updates
>> >
>> > The duplication / allocation of idstr for each VFIO blob is overkill, as
>> > idstr isn't something that changes frequently. Also, the idstr always came
>> > from the upper layer of se->idstr so it's always guaranteed to
>> > exist (e.g. no device unplug allowed during migration).
>> >
>> > Signed-off-by: Peter Xu <peterx@redhat.com>
>> > ---
>> > migration/multifd.h | 4 ++++
>> > migration/multifd-device-state.c | 10 +++++++---
>> > 2 files changed, 11 insertions(+), 3 deletions(-)
>> >
>> > diff --git a/migration/multifd.h b/migration/multifd.h
>> > index 47203334b9..1eaa5d4496 100644
>> > --- a/migration/multifd.h
>> > +++ b/migration/multifd.h
>> > @@ -115,6 +115,10 @@ struct MultiFDRecvData {
>> > };
>> >
>> > typedef struct {
>> > + /*
>> > + * Name of the owner device. NOTE: it's caller's responsibility to
>> > + * make sure the pointer is always valid!
>> > + */
>>
>> Same comment as the other one here.
>>
>> > char *idstr;
>> > uint32_t instance_id;
>> > char *buf;
>> > diff --git a/migration/multifd-device-state.c b/migration/multifd-device-state.c
>> > index 72b72b6e62..cfd0465bac 100644
>> > --- a/migration/multifd-device-state.c
>> > +++ b/migration/multifd-device-state.c
>> > @@ -44,7 +44,7 @@ void multifd_device_state_save_setup(void)
>> >
>> > void multifd_device_state_clear(MultiFDDeviceState_t *device_state)
>> > {
>> > - g_clear_pointer(&device_state->idstr, g_free);
>> > + device_state->idstr = NULL;
>> > g_clear_pointer(&device_state->buf, g_free);
>> > }
>> >
>> > @@ -100,7 +100,12 @@ bool multifd_queue_device_state(char *idstr, uint32_t instance_id,
>> >
>> > multifd_set_payload_type(device_state_send, MULTIFD_PAYLOAD_DEVICE_STATE);
>> > device_state = &device_state_send->u.device_state;
>> > - device_state->idstr = g_strdup(idstr);
>> > + /*
>> > + * NOTE: here we must use a static idstr (e.g. of a savevm state
>> > + * entry) rather than any dynamically allocated buffer, because multifd
>> > + * assumes this pointer is always valid!
>> > + */
>> > + device_state->idstr = idstr;
>> > device_state->instance_id = instance_id;
>> > device_state->buf = g_memdup2(data, len);
>> > device_state->buf_len = len;
>> > @@ -137,7 +142,6 @@ static void multifd_device_state_save_thread_data_free(void *opaque)
>> > {
>> > struct MultiFDDSSaveThreadData *data = opaque;
>> >
>> > - g_clear_pointer(&data->idstr, g_free);
>> > g_free(data);
>> > }
>> >
>> > --
>> > 2.45.0
>> >
>> >
>> > From abfea9698ff46ad0e0175e1dcc6e005e0b2ece2a Mon Sep 17 00:00:00 2001
>> > From: Peter Xu <peterx@redhat.com>
>> > Date: Tue, 10 Sep 2024 12:27:49 -0400
>> > Subject: [PATCH 3/3] migration/multifd: Optimize device_state buffer
>> > allocations
>> >
>> > Provide a device_state->buf_prealloc so that the buffers can be reused if
>> > possible. Provide a set of APIs to use it right. Please see the
>> > documentation for the API in the code.
>> >
>> > The default buffer size came from VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE as of
>> > now.
>> >
>> > Signed-off-by: Peter Xu <peterx@redhat.com>
>> > ---
>> > include/hw/vfio/vfio-common.h | 9 ++++
>> > include/migration/misc.h | 12 ++++-
>> > migration/multifd.h | 11 +++-
>> > hw/vfio/migration.c | 43 ++++++++-------
>> > migration/multifd-device-state.c | 93 +++++++++++++++++++++++++-------
>> > migration/multifd.c | 9 ----
>> > 6 files changed, 126 insertions(+), 51 deletions(-)
>> >
>> > diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>> > index 4578a0ca6a..c1f2f4ae55 100644
>> > --- a/include/hw/vfio/vfio-common.h
>> > +++ b/include/hw/vfio/vfio-common.h
>> > @@ -61,6 +61,13 @@ typedef struct VFIORegion {
>> > uint8_t nr; /* cache the region number for debug */
>> > } VFIORegion;
>> >
>> > +typedef struct VFIODeviceStatePacket {
>> > + uint32_t version;
>> > + uint32_t idx;
>> > + uint32_t flags;
>> > + uint8_t data[0];
>> > +} QEMU_PACKED VFIODeviceStatePacket;
>> > +
>> > typedef struct VFIOMigration {
>> > struct VFIODevice *vbasedev;
>> > VMChangeStateEntry *vm_state;
>> > @@ -168,6 +175,8 @@ typedef struct VFIODevice {
>> > int devid;
>> > IOMMUFDBackend *iommufd;
>> > VFIOIOASHwpt *hwpt;
>> > + /* Only used on sender side when multifd is enabled */
>> > + VFIODeviceStatePacket *multifd_packet;
>> > QLIST_ENTRY(VFIODevice) hwpt_next;
>> > } VFIODevice;
>> >
>> > diff --git a/include/migration/misc.h b/include/migration/misc.h
>> > index 26f7f3140f..1a8676ed3d 100644
>> > --- a/include/migration/misc.h
>> > +++ b/include/migration/misc.h
>> > @@ -112,8 +112,16 @@ bool migration_in_bg_snapshot(void);
>> > void dirty_bitmap_mig_init(void);
>> >
>> > /* migration/multifd-device-state.c */
>> > -bool multifd_queue_device_state(char *idstr, uint32_t instance_id,
>> > - char *data, size_t len);
>> > +struct MultiFDDeviceState_t;
>> > +typedef struct MultiFDDeviceState_t MultiFDDeviceState_t;
>> > +
>> > +MultiFDDeviceState_t *
>> > +multifd_device_state_prepare(char *idstr, uint32_t instance_id);
>> > +void *multifd_device_state_get_buffer(MultiFDDeviceState_t *s,
>> > + int64_t *buf_len);
>> > +bool multifd_device_state_finish(MultiFDDeviceState_t *state,
>> > + void *buf, int64_t buf_len);
>> > +
>> > bool migration_has_device_state_support(void);
>> >
>> > void
>> > diff --git a/migration/multifd.h b/migration/multifd.h
>> > index 1eaa5d4496..1ccdeeb8c5 100644
>> > --- a/migration/multifd.h
>> > +++ b/migration/multifd.h
>> > @@ -15,6 +15,7 @@
>> >
>> > #include "exec/target_page.h"
>> > #include "ram.h"
>> > +#include "migration/misc.h"
>> >
>> > typedef struct MultiFDRecvData MultiFDRecvData;
>> > typedef struct MultiFDSendData MultiFDSendData;
>> > @@ -114,16 +115,22 @@ struct MultiFDRecvData {
>> > off_t file_offset;
>> > };
>> >
>> > -typedef struct {
>> > +struct MultiFDDeviceState_t {
>> > /*
>> > * Name of the owner device. NOTE: it's the caller's responsibility to
>> > * make sure the pointer is always valid!
>> > */
>> > char *idstr;
>> > uint32_t instance_id;
>> > + /*
>> > + * Points to the buffer to send via multifd. Normally it's the same as
>> > + * buf_prealloc, otherwise the caller needs to make sure the buffer is
>> > + * avaliable through multifd running.
>>
>> "throughout multifd runtime" maybe.
>>
>> > + */
>> > char *buf;
>> > + char *buf_prealloc;
>> > size_t buf_len;
>> > -} MultiFDDeviceState_t;
>> > +};
>> >
>> > typedef enum {
>> > MULTIFD_PAYLOAD_NONE,
>> > diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>> > index 67996aa2df..e36422b7c5 100644
>> > --- a/hw/vfio/migration.c
>> > +++ b/hw/vfio/migration.c
>> > @@ -59,13 +59,6 @@
>> >
>> > #define VFIO_DEVICE_STATE_CONFIG_STATE (1)
>> >
>> > -typedef struct VFIODeviceStatePacket {
>> > - uint32_t version;
>> > - uint32_t idx;
>> > - uint32_t flags;
>> > - uint8_t data[0];
>> > -} QEMU_PACKED VFIODeviceStatePacket;
>> > -
>> > static int64_t bytes_transferred;
>> >
>> > static const char *mig_state_to_str(enum vfio_device_mig_state state)
>> > @@ -741,6 +734,9 @@ static void vfio_save_cleanup(void *opaque)
>> > migration->initial_data_sent = false;
>> > vfio_migration_cleanup(vbasedev);
>> > trace_vfio_save_cleanup(vbasedev->name);
>> > + if (vbasedev->multifd_packet) {
>> > + g_clear_pointer(&vbasedev->multifd_packet, g_free);
>> > + }
>> > }
>> >
>> > static void vfio_state_pending_estimate(void *opaque, uint64_t *must_precopy,
>> > @@ -892,7 +888,8 @@ static int vfio_save_complete_precopy_async_thread_config_state(VFIODevice *vbas
>> > g_autoptr(QIOChannelBuffer) bioc = NULL;
>> > QEMUFile *f = NULL;
>> > int ret;
>> > - g_autofree VFIODeviceStatePacket *packet = NULL;
>> > + VFIODeviceStatePacket *packet;
>> > + MultiFDDeviceState_t *state;
>> > size_t packet_len;
>> >
>> > bioc = qio_channel_buffer_new(0);
>> > @@ -911,13 +908,19 @@ static int vfio_save_complete_precopy_async_thread_config_state(VFIODevice *vbas
>> > }
>> >
>> > packet_len = sizeof(*packet) + bioc->usage;
>> > - packet = g_malloc0(packet_len);
>> > +
>> > + state = multifd_device_state_prepare(idstr, instance_id);
>> > + /*
>> > + * Do not reuse the multifd buffer; use our own instead, since the size varies.
>> > + * The buffer is freed only at save cleanup time.
>> > + */
>> > + vbasedev->multifd_packet = g_malloc0(packet_len);
>> > + packet = vbasedev->multifd_packet;
>> > packet->idx = idx;
>> > packet->flags = VFIO_DEVICE_STATE_CONFIG_STATE;
>> > memcpy(&packet->data, bioc->data, bioc->usage);
>> >
>> > - if (!multifd_queue_device_state(idstr, instance_id,
>> > - (char *)packet, packet_len)) {
>> > + if (!multifd_device_state_finish(state, packet, packet_len)) {
>> > ret = -1;
>> > }
>> >
>> > @@ -936,7 +939,6 @@ static int vfio_save_complete_precopy_thread(char *idstr,
>> > VFIODevice *vbasedev = opaque;
>> > VFIOMigration *migration = vbasedev->migration;
>> > int ret;
>> > - g_autofree VFIODeviceStatePacket *packet = NULL;
>> > uint32_t idx;
>> >
>> > if (!migration->multifd_transfer) {
>> > @@ -954,21 +956,25 @@ static int vfio_save_complete_precopy_thread(char *idstr,
>> > goto ret_finish;
>> > }
>> >
>> > - packet = g_malloc0(sizeof(*packet) + migration->data_buffer_size);
>> > -
>> > for (idx = 0; ; idx++) {
>> > + VFIODeviceStatePacket *packet;
>> > + MultiFDDeviceState_t *state;
>> > ssize_t data_size;
>> > size_t packet_size;
>> > + int64_t buf_size;
>> >
>> > if (qatomic_read(abort_flag)) {
>> > ret = -ECANCELED;
>> > goto ret_finish;
>> > }
>> >
>> > + state = multifd_device_state_prepare(idstr, instance_id);
>> > + packet = multifd_device_state_get_buffer(state, &buf_size);
>> > data_size = read(migration->data_fd, &packet->data,
>> > - migration->data_buffer_size);
>> > + buf_size - sizeof(*packet));
>> > if (data_size < 0) {
>> > if (errno != ENOMSG) {
>> > + multifd_device_state_finish(state, NULL, 0);
>> > ret = -errno;
>> > goto ret_finish;
>> > }
>> > @@ -980,14 +986,15 @@ static int vfio_save_complete_precopy_thread(char *idstr,
>> > data_size = 0;
>> > }
>> >
>> > - if (data_size == 0)
>> > + if (data_size == 0) {
>> > + multifd_device_state_finish(state, NULL, 0);
>> > break;
>> > + }
>> >
>> > packet->idx = idx;
>> > packet_size = sizeof(*packet) + data_size;
>> >
>> > - if (!multifd_queue_device_state(idstr, instance_id,
>> > - (char *)packet, packet_size)) {
>> > + if (!multifd_device_state_finish(state, packet, packet_size)) {
>> > ret = -1;
>> > goto ret_finish;
>> > }
>> > diff --git a/migration/multifd-device-state.c b/migration/multifd-device-state.c
>> > index cfd0465bac..6f0259426d 100644
>> > --- a/migration/multifd-device-state.c
>> > +++ b/migration/multifd-device-state.c
>> > @@ -20,15 +20,18 @@ ThreadPool *send_threads;
>> > int send_threads_ret;
>> > bool send_threads_abort;
>> >
>> > +#define MULTIFD_DEVICE_STATE_BUFLEN (1UL << 20)
>> > +
>> > static MultiFDSendData *device_state_send;
>> >
>> > -/* TODO: use static buffers for idstr and buf */
>> > void multifd_device_state_payload_alloc(MultiFDDeviceState_t *device_state)
>> > {
>> > + device_state->buf_prealloc = g_malloc0(MULTIFD_DEVICE_STATE_BUFLEN);
>> > }
>> >
>> > void multifd_device_state_payload_free(MultiFDDeviceState_t *device_state)
>> > {
>> > + g_clear_pointer(&device_state->buf_prealloc, g_free);
>> > }
>> >
>> > void multifd_device_state_save_setup(void)
>> > @@ -42,12 +45,6 @@ void multifd_device_state_save_setup(void)
>> > send_threads_abort = false;
>> > }
>> >
>> > -void multifd_device_state_clear(MultiFDDeviceState_t *device_state)
>> > -{
>> > - device_state->idstr = NULL;
>> > - g_clear_pointer(&device_state->buf, g_free);
>> > -}
>> > -
>> > void multifd_device_state_save_cleanup(void)
>> > {
>> > g_clear_pointer(&send_threads, thread_pool_free);
>> > @@ -89,33 +86,89 @@ void multifd_device_state_send_prepare(MultiFDSendParams *p)
>> > multifd_device_state_fill_packet(p);
>> > }
>> >
>> > -bool multifd_queue_device_state(char *idstr, uint32_t instance_id,
>> > - char *data, size_t len)
>> > +/*
>> > + * Prepare to send some device state via multifd. Returns the current idle
>> > + * MultiFDDeviceState_t*.
>> > + *
>> > + * As a follow up, the caller must call multifd_device_state_finish() to
>> > + * release the resources.
>> > + *
>> > + * One example usage of the API:
>> > + *
>> > + * // Fetch a free multifd device state object
>> > + * state = multifd_device_state_prepare(idstr, instance_id);
>> > + *
>> > + * // Optional: fetch the buffer to reuse
>> > + * buf = multifd_device_state_get_buffer(state, &buf_size);
>> > + *
>> > + * // Here len>0 means success, otherwise failure
>> > + * len = buffer_fill(buf, buf_size);
>> > + *
>> > + * // Finish the transaction, either enqueue or cancel the request. Here
>> > + * // len>0 will enqueue, <=0 will cancel.
>> > + * multifd_device_state_finish(state, buf, len);
>> > + */
>> > +MultiFDDeviceState_t *
>> > +multifd_device_state_prepare(char *idstr, uint32_t instance_id)
>> > {
>> > - /* Device state submissions can come from multiple threads */
>> > - QEMU_LOCK_GUARD(&queue_job_mutex);
>> > MultiFDDeviceState_t *device_state;
>> >
>> > assert(multifd_payload_empty(device_state_send));
>> >
>> > - multifd_set_payload_type(device_state_send, MULTIFD_PAYLOAD_DEVICE_STATE);
>> > + /*
>> > + * TODO: The lock name may need changing, but I'm reusing it just
>> > + * for simplicity.
>> > + */
>> > + qemu_mutex_lock(&queue_job_mutex);
>> > +
>> > device_state = &device_state_send->u.device_state;
>> > /*
>> > - * NOTE: here we must use a static idstr (e.g. of a savevm state
>> > - * entry) rather than any dynamically allocated buffer, because multifd
>> > + * NOTE: here we must use a static idstr (e.g. of a savevm state entry)
>> > + * rather than any dynamically allocated buffer, because multifd
>> > * assumes this pointer is always valid!
>> > */
>> > device_state->idstr = idstr;
>> > device_state->instance_id = instance_id;
>> > - device_state->buf = g_memdup2(data, len);
>> > - device_state->buf_len = len;
>> >
>> > - if (!multifd_send(&device_state_send)) {
>> > - multifd_send_data_clear(device_state_send);
>> > - return false;
>> > + return &device_state_send->u.device_state;
>> > +}
>> > +
>> > +/*
>> > + * Must be used after a previous call to multifd_device_state_prepare();
>> > + * the buffer must not be used after invoking multifd_device_state_finish().
>> > + */
>> > +void *multifd_device_state_get_buffer(MultiFDDeviceState_t *s,
>> > + int64_t *buf_len)
>> > +{
>> > + *buf_len = MULTIFD_DEVICE_STATE_BUFLEN;
>> > + return s->buf_prealloc;
>> > +}
>> > +
>> > +/*
>> > + * Must be used only paired with a previous call to
>> > + * multifd_device_state_prepare(). Returns true if the enqueue succeeded,
>> > + * false otherwise.
>> > + */
>> > +bool multifd_device_state_finish(MultiFDDeviceState_t *state,
>> > + void *buf, int64_t buf_len)
>> > +{
>> > + bool result = false;
>> > +
>> > + /* Currently we only have one global free buffer */
>> > + assert(state == &device_state_send->u.device_state);
>> > +
>> > + if (buf_len < 0) {
>> > + goto out;
>> > }
>> >
>> > - return true;
>> > + multifd_set_payload_type(device_state_send, MULTIFD_PAYLOAD_DEVICE_STATE);
>> > + /* This normally will be the state->buf_prealloc, but not required */
>> > + state->buf = buf;
>> > + state->buf_len = buf_len;
>> > + result = multifd_send(&device_state_send);
>> > +out:
>> > + qemu_mutex_unlock(&queue_job_mutex);
>> > + return result;
>> > }
>> >
>> > bool migration_has_device_state_support(void)
>> > diff --git a/migration/multifd.c b/migration/multifd.c
>> > index 5a20b831cf..2b5185e298 100644
>> > --- a/migration/multifd.c
>> > +++ b/migration/multifd.c
>> > @@ -115,15 +115,6 @@ void multifd_send_data_clear(MultiFDSendData *data)
>> > return;
>> > }
>> >
>> > - switch (data->type) {
>> > - case MULTIFD_PAYLOAD_DEVICE_STATE:
>> > - multifd_device_state_clear(&data->u.device_state);
>> > - break;
>> > - default:
>> > - /* Nothing to do */
>> > - break;
>> > - }
>> > -
>> > data->type = MULTIFD_PAYLOAD_NONE;
>> > }
>> >
>> > --
>> > 2.45.0
>>
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [PATCH v2 12/17] migration/multifd: Device state transfer support - send side
2024-09-13 13:21 ` Fabiano Rosas
@ 2024-09-13 14:19 ` Peter Xu
2024-09-13 15:04 ` Fabiano Rosas
0 siblings, 1 reply; 128+ messages in thread
From: Peter Xu @ 2024-09-13 14:19 UTC (permalink / raw)
To: Fabiano Rosas
Cc: mail, Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On Fri, Sep 13, 2024 at 10:21:39AM -0300, Fabiano Rosas wrote:
> Peter Xu <peterx@redhat.com> writes:
>
> > On Thu, Sep 12, 2024 at 03:43:39PM -0300, Fabiano Rosas wrote:
> >> Peter Xu <peterx@redhat.com> writes:
> >>
> >> Hi Peter, sorry if I'm not very enthusiastic about this; I'm sure you
> >> understand the rework is a little frustrating.
> >
> > That's OK.
> >
> > [For some reason my email sync program decided to give up working for
> > hours. I got more time looking at a tsc bug, which is good, but then I
> > miss a lot of emails..]
> >
> >>
> >> > On Wed, Aug 28, 2024 at 09:41:17PM -0300, Fabiano Rosas wrote:
> >> >> > +size_t multifd_device_state_payload_size(void)
> >> >> > +{
> >> >> > + return sizeof(MultiFDDeviceState_t);
> >> >> > +}
> >> >>
> >> >> This will not be necessary because the payload size is the same as the
> >> >> data type. We only need it for the special case where the MultiFDPages_t
> >> >> is smaller than the total ram payload size.
> >> >
> >> > Today I was thinking maybe we should really clean this up, as the current
> >> > multifd_send_data_alloc() is indeed too tricky (blame me.. who requested
> >> > that more or less). Knowing that VFIO can use dynamic buffers with ->idstr
> >> > and ->buf (I was thinking it could be buf[1M].. but I was wrong...) made
> >> > that feeling stronger.
> >>
> >> If we're going to commit bad code and then rewrite it a week later, we
> >> could have just let the original series from Maciej merge without any of
> >> this.
> >
> > Why is it "bad code"?
> >
> > It runs pretty well, I don't think it's bad code. You wrote it, and I
> > don't think it's bad at all.
>
> Code that forces us to do arithmetic in order to properly allocate
> memory and comes with a big comment explaining how we're dodging
> compiler warnings is bad code in my book.
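> For reference, the pattern I mean looks roughly like this (a sketch
> of multifd_send_data_alloc() from memory; names and details are
> approximate):
>
>     MultiFDSendData *multifd_send_data_alloc(void)
>     {
>         size_t max_payload_size, size_minus_payload;
>
>         /*
>          * MultiFDPages_t ends in a flexible array, so the union alone
>          * is not big enough for a full RAM payload; compute the larger
>          * of the two payload sizes by hand.
>          */
>         max_payload_size = MAX(multifd_ram_payload_size(),
>                                sizeof(MultiFDDeviceState_t));
>         /*
>          * Account for any holes the compiler might add; we can't pack
>          * the struct without triggering -Waddress-of-packed-member
>          * warnings elsewhere.
>          */
>         size_minus_payload = sizeof(MultiFDSendData) - sizeof(MultiFDPayload);
>
>         return g_malloc0(size_minus_payload + max_payload_size);
>     }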
>
> >
> > But now we're rethinking after reading Maciej's new series.
> > Personally I don't think it's a major problem.
> >
> > Note that we're not changing the design back: what was initially proposed
> > was the client submitting an array of multifd objects. I still don't think
> > that's right.
> >
> > What the change does so far is make the union a struct; however, that's
> > still N+2 objects not 2*N, where 2 came from RAM+VFIO. I think the
> > important bits are still there (from your previous refactor series).
> >
>
> You fail to appreciate that before the RFC series, multifd already
> allocated N for the pages.
It depends on how you see it, IMHO. I think it allocates N not for the
"pages" but for the "threads", because the threads can be busy with those
buffers, no matter if it's "page" or "device data".
> The device state adds another client, so that
> would be another N anyway. The problem the RFC tried to solve was that
> multifd channels owned that 2N, so the array was added to move the
> memory into the client's ownership. IOW, it wasn't even the RFC series
> that made it 2N, that was the multifd design all along. Now in hindsight
> I don't think we should have gone with the memory saving discussion.
I may have given the impression that I only wanted to save memory; if
so, I'm sorry. But do you still remember I mentioned "we can make it a
struct, too" before your series landed? Then you thought it was ok to
keep using the union, and I was ok with that too! At least at that
time. I don't think that's a huge deal. Not every route we take must
be perfect, but we try our best to make it as good as we can.
I don't think any discussion is forbidden. I agree memory consumption
is not the 1st thing to worry about, but I don't see why it can't be
discussed.
Note that I never said we can't save that memory either - I plan to
have follow-up patches (for this, after Maciej's series lands.. that's
why I haven't even mentioned it yet) to allow modules to report their
device state buffer size. I just haven't said so yet, and don't plan
to worry about it before the vfio series lands. With that, we'll save
1M*N when no vfio device is attached (but we'll need to handle
hotplug). So I think in the end we don't need to lose anything.
However I think we need to get the base design of how vfio should use
multifd straight, because it's an important bit and I don't want to
rework important bits after a huge feature, if we know a better
direction.
I don't even think what I proposed in patches 1-3 here is a must for
me. I should be clear again here, just in case we have similar
discussions afterwards, that I'm ok with the below being done after
Maciej's series:
- Avoid memory allocations per-packet (done in patch 2-3)
- Avoid unnecessary data copy (done in patch 2-3)
- Avoid allocating device buffers when no device will use them (not proposed)
But I'm not ok with building everything on top of the idea of not
using multifd buffers in the current way, because that can involve a
lot of changes: that's where buffers pass from top to bottom or
backwards, and the interface would need to change a lot too. We
already have that in master, so it's not a problem now.
>
> >> I already suggested it a couple of times, we shouldn't be doing
> >> core refactorings underneath contributors' patches, this is too
> >> fragile. Just let people contribute their code and we can change it
> >> later.
> >
> > I sincerely don't think a lot needs changing... only patch 1. Basically
> > patch 1 on top of your previous rework series will be at least what I want,
> > but I'm open to comments from you guys.
>
> Don't get me wrong, I'm very much in favor of what you're doing
> here. However, I don't think it's ok to be backtracking on our design
> while other people have series in flight that depend on it. You
> certainly know the feeling of trying to merge a feature and having
> maintainers ask you to rewrite a bunch of code just to be able to start
> working. That's not ideal.
As a patch writer I always like to do that when it's essential.
Normally the problem is that I don't have enough reviewer resources to
help me get a better design, or to discuss it.
When vfio is the new user of multifd, vfio needs to do the heavy
lifting of drafting the api.
>
> I tried to quickly insert the RFC series before the device state series
> progressed too much, but it's 3 months later and we're still discussing
> it, maybe we don't need to do it this way.
Is that how you felt when you were working on mapped-ram?
That series did take long enough, I agree. It's not so bad yet with
the VFIO series - it's good to have you around because you provide
great reviews. I'm also trying my best to not let a series dangle for
more than a year. I don't think 3 months is long for this feature:
this is the 1st external multifd user (and file mapping comes at it
from another angle), so it can take some time.
Sorry if that's so, but sorry again, I'm still not convinced: I think
we need to go this way to build the blocks one by one, and we need to
make sure the lower blocks are solid enough to take the upper ones.
Again, I'm ok with small things that go against that, but not major
designs. We shouldn't have to go rewrite major designs if we already
seem to know a better one.
>
> And ok, let's consider the current situation a special case. But I would
> like to avoid in the future this kind of uncertainty.
>
> >
> > Note that patch 2-3 will be on top of Maciej's changes and they're totally
> > not relevant to what we merged so far. Hence, nothing relevant there to
> > what you worked. And this is the diff of patch 1:
> >
> > migration/multifd.h | 16 +++++++++++-----
> > migration/multifd-device-state.c | 8 ++++++--
> > migration/multifd-nocomp.c | 13 ++++++-------
> > migration/multifd.c | 25 ++++++-------------------
> > 4 files changed, 29 insertions(+), 33 deletions(-)
> >
> > It's only 33 lines removed (many of which are comments..), it's not a huge
> > lot. I don't know why you feel so bad at this...
> >
> > It's probably because we maintain migration together, or we can keep our
> > own way of work. I don't think we did anything wrong yet so far.
> >
> > We can definitely talk about this in next 1:1.
> >
> >>
> >> This is also why I've been trying hard to separate core multifd
> >> functionality from migration code that uses multifd to transmit their
> >> data.
> >>
> >> My original RFC plus the suggestion to extend multifd_ops for device
> >> state would have (almost) made it so that no client code would be left
> >> in multifd. We could have been turning this thing upside down and it
> >> wouldn't affect anyone in terms of code conflicts.
> >
> > Do you mean you preferred the 2*N approach?
> >
>
> 2*N, where N is usually not larger than 32 and the payload size is
> 1k. Yes, I'd trade that off no problem.
I think it's a problem.
When vdpa gets involved with exactly the same pattern as vfio uses
(as they're really alike underneath), vdpa will need its own array of
buffers, or it'll need to take the same vfio lock, which doesn't make
sense to me.
N+2, or rather N+M (where M is the number of users), is the minimum
number of buffers we need: N because in the worst case all multifd
threads can be busy, each occupying a buffer; M because in the worst
case all M users can be pre-filling at the same time. It's either
about memory consumption, or about logical sensibility.
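To put illustrative numbers on it: with, say, N=8 channels and M=2
users (vfio plus a future vdpa), N+M means 10 buffers, i.e. roughly
10M given the 1M preallocated device state buffer from patch 3;
per-user arrays would instead mean 2*N=16 buffers, and every
additional user would add another N.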
>
> >>
> >> The ship has already sailed, so your patches below are fine, I have just
> >> some small comments.
> >
> > I'm not sure what you meant about "ship sailed", but we should merge code
> > whenever we think is the most correct.
>
> As you put above, I agree that the important bits of the original series
> have been preserved, but other secondary goals were lost, such as the
> more abstract separation between multifd & client code and that is the
> ship that has sailed.
>
> That series was not: "introduce this array for no reason", we also lost
> the ability to abstract the payload from the multifd threads when we
> dropped the .alloc_fn callback for instance. The last patch you posted
I don't remember the details there, but my memory was that it was too
flexible while we seem to reach the consensus that we only process either
RAM or device, nothing else.
> here now adds multifd_device_state_prepare, somewhat ignoring that the
> ram code also has the same pattern and it could be made to use the same
> API.
I need some further elaborations to understand.
multifd_device_state_prepare currently does a few things: it takes
ownership of the temp device state object, fills in idstr /
instance_id, and takes the lock (so far needed because we only have
one device state object). None of that seems to be needed for RAM yet.
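Condensed, the contract is (matching the patch earlier in this thread;
fill() is just a stand-in for the caller's producer):

    /* Takes queue_job_mutex; returns the single device state slot */
    state = multifd_device_state_prepare(idstr, instance_id);
    /* Optional: borrow the preallocated 1M buffer */
    buf = multifd_device_state_get_buffer(state, &buf_size);
    len = fill(buf, buf_size);     /* len > 0 means success */
    /* len > 0 enqueues, otherwise cancels; either way drops the mutex */
    multifd_device_state_finish(state, buf, len);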
Feel free to send a rfc patch if that helps.
>
> I did accept your premise that ram+compression is one thing while
> device_state is another, so I'm not asking it to be changed, just
> pointing out that the RFC series also addressed those issues. I might
> not have made that clear back then.
>
> > I hope you meant after below all things look the best, or please shoot.
> > That's exactly what I'm requesting for as comments.
>
> What you have here is certainly an improvement from the current
> state. I'm just ranting about the path we took here.
--
Peter Xu
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [PATCH v2 12/17] migration/multifd: Device state transfer support - send side
2024-09-13 14:19 ` Peter Xu
@ 2024-09-13 15:04 ` Fabiano Rosas
2024-09-13 15:22 ` Peter Xu
0 siblings, 1 reply; 128+ messages in thread
From: Fabiano Rosas @ 2024-09-13 15:04 UTC (permalink / raw)
To: Peter Xu
Cc: mail, Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
Peter Xu <peterx@redhat.com> writes:
> On Fri, Sep 13, 2024 at 10:21:39AM -0300, Fabiano Rosas wrote:
>> Peter Xu <peterx@redhat.com> writes:
>>
>> > On Thu, Sep 12, 2024 at 03:43:39PM -0300, Fabiano Rosas wrote:
>> >> Peter Xu <peterx@redhat.com> writes:
>> >>
>> >> Hi Peter, sorry if I'm not very enthusiastic about this; I'm sure you
>> >> understand the rework is a little frustrating.
>> >
>> > That's OK.
>> >
>> > [For some reason my email sync program decided to give up working for
>> > hours. I got more time looking at a tsc bug, which is good, but then I
>> > miss a lot of emails..]
>> >
>> >>
>> >> > On Wed, Aug 28, 2024 at 09:41:17PM -0300, Fabiano Rosas wrote:
>> >> >> > +size_t multifd_device_state_payload_size(void)
>> >> >> > +{
>> >> >> > + return sizeof(MultiFDDeviceState_t);
>> >> >> > +}
>> >> >>
>> >> >> This will not be necessary because the payload size is the same as the
>> >> >> data type. We only need it for the special case where the MultiFDPages_t
>> >> >> is smaller than the total ram payload size.
>> >> >
>> >> > Today I was thinking maybe we should really clean this up, as the current
>> >> > multifd_send_data_alloc() is indeed too tricky (blame me.. who requested
>> >> > that more or less). Knowing that VFIO can use dynamic buffers with ->idstr
>> >> > and ->buf (I was thinking it could be buf[1M].. but I was wrong...) made
>> >> > that feeling stronger.
>> >>
>> >> If we're going to commit bad code and then rewrite it a week later, we
>> >> could have just let the original series from Maciej merge without any of
>> >> this.
>> >
>> > Why is it "bad code"?
>> >
>> > It runs pretty well, I don't think it's bad code. You wrote it, and I
>> > don't think it's bad at all.
>>
>> Code that forces us to do arithmetic in order to properly allocate
>> memory and comes with a big comment explaining how we're dodging
>> compiler warnings is bad code in my book.
>>
>> >
>> > But now we're rethinking after reading Maciej's new series.
>> > Personally I don't think it's a major problem.
>> >
>> > Note that we're not changing the design back: what was initially proposed
>> > was the client submitting an array of multifd objects. I still don't think
>> > that's right.
>> >
>> > What the change does so far is make the union a struct; however, that's
>> > still N+2 objects not 2*N, where 2 came from RAM+VFIO. I think the
>> > important bits are still there (from your previous refactor series).
>> >
>>
>> You fail to appreciate that before the RFC series, multifd already
>> allocated N for the pages.
>
> It depends on how you see it, IMHO. I think it allocates N not for the
> "pages" but for the "threads", because the threads can be busy with those
> buffers, no matter if it's "page" or "device data".
Each MultiFD*Params had a p->pages, so N channels, N p->pages. The
device state series would add p->device_state, one per channel. So 2N +
2 (for the extra slot).
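(With, say, 32 channels that's 2*32+2 = 66 payload objects, versus
32+2 = 34 with the shared SendData approach.)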
>
>> The device state adds another client, so that
>> would be another N anyway. The problem the RFC tried to solve was that
>> multifd channels owned that 2N, so the array was added to move the
>> memory into the client's ownership. IOW, it wasn't even the RFC series
>> that made it 2N, that was the multifd design all along. Now in hindsight
>> I don't think we should have gone with the memory saving discussion.
>
> I may have given the impression that I only wanted to save memory; if
> so, I'm sorry. But do you still remember I mentioned "we can make it a
> struct, too" before your series landed? Then you thought it was ok to
> keep using the union, and I was ok with that too! At least at that
> time. I don't think that's a huge deal. Not every route we take must
> be perfect, but we try our best to make it as good as we can.
Yep, I did agree with all of this. I'm just saying I now think I
shouldn't have.
>
> I don't think any discussion is forbidden. I agree memory consumption
> is not the 1st thing to worry about, but I don't see why it can't be
> discussed.
It can be discussed, sure, but then 3 months pass and we're still
talking about it. Saved ~64k and spent 3 months. We could have just as
well said: "let's do a pass to optimize memory consumption after the
device state series is in".
>
> Note that I never said we can't save that memory either - I plan to
> have follow-up patches (for this, after Maciej's series lands.. that's
> why I haven't even mentioned it yet) to allow modules to report their
> device state buffer size. I just haven't said so yet, and don't plan
> to worry about it before the vfio series lands. With that, we'll save
> 1M*N when no vfio device is attached (but we'll need to handle
> hotplug). So I think in the end we don't need to lose anything.
>
> However I think we need to get the base design of how vfio should use
> multifd straight, because it's an important bit and I don't want to
> rework important bits after a huge feature, if we know a better
> direction.
>
> I don't even think what I proposed in patches 1-3 here is a must for
> me. I should be clear again here, just in case we have similar
> discussions afterwards, that I'm ok with the below being done after
> Maciej's series:
>
> - Avoid memory allocations per-packet (done in patch 2-3)
> - Avoid unnecessary data copy (done in patch 2-3)
> - Avoid allocating device buffers when no device will use them (not proposed)
>
> But I'm not ok with building everything on top of the idea of not
> using multifd buffers in the current way, because that can involve a
> lot of changes: that's where buffers pass from top to bottom or
> backwards, and the interface would need to change a lot too. We
> already have that in master, so it's not a problem now.
>
>>
>> >> I already suggested it a couple of times, we shouldn't be doing
>> >> core refactorings underneath contributors' patches, this is too
>> >> fragile. Just let people contribute their code and we can change it
>> >> later.
>> >
>> > I sincerely don't think a lot needs changing... only patch 1. Basically
>> > patch 1 on top of your previous rework series will be at least what I want,
>> > but I'm open to comments from you guys.
>>
>> Don't get me wrong, I'm very much in favor of what you're doing
>> here. However, I don't think it's ok to be backtracking on our design
>> while other people have series in flight that depend on it. You
>> certainly know the feeling of trying to merge a feature and having
>> maintainers ask you to rewrite a bunch of code just to be able to start
>> working. That's not ideal.
>
> As a patch writer I always like to do that when it's essential.
> Normally the problem is that I don't have enough reviewer resources
> to help me get a better design, or to discuss it.
Right, but we can't keep providing a moving target. See the thread pool
discussion for an example. It's hard to work that way. The discussion
here is similar, we introduced the union, now we're moving to the
struct. And you're right that the changes here are small, so let's not
get caught in that.
>
> When vfio is the new user of multifd, vfio needs to do the heavy
> lifting of drafting the api.
Well, multifd could have provided a flexible API to begin with. That's
entirely on us. I have been toying with allowing more clients since at
least 1 year ago. We just couldn't get there in time.
>
>>
>> I tried to quickly insert the RFC series before the device state series
>> progressed too much, but it's 3 months later and we're still discussing
>> it, maybe we don't need to do it this way.
>
> Is that how you felt when you were working on mapped-ram?
At that time I had already committed to helping maintain the code, so
the time spent there already went into the maintainer bucket anyway. If
I were instead just trying to drive-by, then that would have been a
pain.
> That series did take long enough, I agree. It's not so bad yet with
> the VFIO series - it's good to have you around because you provide
> great reviews. I'm also trying my best to not let a series dangle for
> more than a year. I don't think 3 months is long for this feature:
> this is the 1st external multifd user (and file mapping comes at it
> from another angle), so it can take some time.
Oh, I don't mean the VFIO series is taking long. That's a complex
feature indeed. I just mean going from p->pages to p->data could have
taken less time. I'm suggesting we might have overdone there a bit.
>
> Sorry if that's so, but sorry again, I'm still not convinced: I think
> we need to go this way to build the blocks one by one, and we need to
> make sure the lower blocks are solid enough to take the upper ones.
> Again, I'm ok with small things that go against that, but not major
> designs. We shouldn't have to go rewrite major designs if we already
> seem to know a better one.
>
>>
>> And ok, let's consider the current situation a special case. But I would
>> like to avoid in the future this kind of uncertainty.
>>
>> >
>> > Note that patch 2-3 will be on top of Maciej's changes and they're totally
>> > not relevant to what we merged so far. Hence, nothing relevant there to
>> > what you worked. And this is the diff of patch 1:
>> >
>> > migration/multifd.h | 16 +++++++++++-----
>> > migration/multifd-device-state.c | 8 ++++++--
>> > migration/multifd-nocomp.c | 13 ++++++-------
>> > migration/multifd.c | 25 ++++++-------------------
>> > 4 files changed, 29 insertions(+), 33 deletions(-)
>> >
>> > It's only 33 lines removed (many of which are comments..), it's not a huge
>> > lot. I don't know why you feel so bad at this...
>> >
>> > It's probably because we maintain migration together, or we can keep our
>> > own way of work. I don't think we did anything wrong yet so far.
>> >
>> > We can definitely talk about this in next 1:1.
>> >
>> >>
>> >> This is also why I've been trying hard to separate core multifd
>> >> functionality from migration code that uses multifd to transmit their
>> >> data.
>> >>
>> >> My original RFC plus the suggestion to extend multifd_ops for device
>> >> state would have (almost) made it so that no client code would be left
>> >> in multifd. We could have been turning this thing upside down and it
>> >> wouldn't affect anyone in terms of code conflicts.
>> >
>> > Do you mean you preferred the 2*N approach?
>> >
>>
>> 2*N, where N is usually not larger than 32 and the payload size is
>> 1k. Yes, I'd trade that off no problem.
>
> I think it's a problem.
>
> When vdpa gets involved with exactly the same pattern as vfio uses
> (as they're really alike underneath), vdpa will need its own array of
> buffers, or it'll need to take the same vfio lock, which doesn't make
> sense to me.
>
> N+2, or rather N+M (where M is the number of users), is the minimum
> number of buffers we need: N because in the worst case all multifd
> threads can be busy, each occupying a buffer; M because in the worst
> case all M users can be pre-filling at the same time. It's either
> about memory consumption, or about logical sensibility.
I'm aware of the memory consumption. Still, we're not forced to use the
minimum amount of space we can. If using more memory can lead to a
better design in the medium term, we're allowed to make that choice.
Hey, I'm not even saying we *should* have gone with 2N. I think it's
good that we're now N+M. But I think we also lost some design
flexibility due to that.
>
>>
>> >>
>> >> The ship has already sailed, so your patches below are fine, I have just
>> >> some small comments.
>> >
>> > I'm not sure what you meant about "ship sailed", but we should merge code
>> > whenever we think is the most correct.
>>
>> As you put above, I agree that the important bits of the original series
>> have been preserved, but other secondary goals were lost, such as the
>> more abstract separation between multifd & client code and that is the
>> ship that has sailed.
>>
>> That series was not: "introduce this array for no reason", we also lost
>> the ability to abstract the payload from the multifd threads when we
>> dropped the .alloc_fn callback for instance. The last patch you posted
>
> I don't remember the details there, but my memory was that it was too
> flexible while we seem to reach the consensus that we only process either
> RAM or device, nothing else.
Indeed. I'm being unfair here, sorry.
>
>> here now adds multifd_device_state_prepare, somewhat ignoring that the
>> ram code also has the same pattern and it could be made to use the same
>> API.
>
> I need some further elaborations to understand.
>
> multifd_device_state_prepare currently does a few things: it takes
> ownership of the temp device state object, fills in idstr /
> instance_id, and takes the lock (so far needed because we only have
> one device state object). None of that seems to be needed for RAM yet.
>
> Feel free to send a rfc patch if that helps.
What if I don't send a patch, wait for it to get merged and then send a
refactoring on top so we don't add yet another detour to this
conversation? =)
>
>>
>> I did accept your premise that ram+compression is one thing while
>> device_state is another, so I'm not asking it to be changed, just
>> pointing out that the RFC series also addressed those issues. I might
>> not have made that clear back then.
>>
>> > I hope you meant after below all things look the best, or please shoot.
>> > That's exactly what I'm requesting for as comments.
>>
>> What you have here is certainly an improvement from the current
>> state. I'm just ranting about the path we took here.
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [PATCH v2 12/17] migration/multifd: Device state transfer support - send side
2024-09-13 15:04 ` Fabiano Rosas
@ 2024-09-13 15:22 ` Peter Xu
2024-09-13 18:26 ` Fabiano Rosas
2024-09-17 17:07 ` Cédric Le Goater
0 siblings, 2 replies; 128+ messages in thread
From: Peter Xu @ 2024-09-13 15:22 UTC (permalink / raw)
To: Fabiano Rosas
Cc: mail, Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On Fri, Sep 13, 2024 at 12:04:00PM -0300, Fabiano Rosas wrote:
> Peter Xu <peterx@redhat.com> writes:
>
> > On Fri, Sep 13, 2024 at 10:21:39AM -0300, Fabiano Rosas wrote:
> >> Peter Xu <peterx@redhat.com> writes:
> >>
> >> > On Thu, Sep 12, 2024 at 03:43:39PM -0300, Fabiano Rosas wrote:
> >> >> Peter Xu <peterx@redhat.com> writes:
> >> >>
> >> >> Hi Peter, sorry if I'm not very enthusiastic about this; I'm sure you
> >> >> understand the rework is a little frustrating.
> >> >
> >> > That's OK.
> >> >
> >> > [For some reason my email sync program decided to give up working for
> >> > hours. I got more time looking at a tsc bug, which is good, but then I
> >> > miss a lot of emails..]
> >> >
> >> >>
> >> >> > On Wed, Aug 28, 2024 at 09:41:17PM -0300, Fabiano Rosas wrote:
> >> >> >> > +size_t multifd_device_state_payload_size(void)
> >> >> >> > +{
> >> >> >> > + return sizeof(MultiFDDeviceState_t);
> >> >> >> > +}
> >> >> >>
> >> >> >> This will not be necessary because the payload size is the same as the
> >> >> >> data type. We only need it for the special case where the MultiFDPages_t
> >> >> >> is smaller than the total ram payload size.
> >> >> >
> >> >> > Today I was thinking maybe we should really clean this up, as the current
> >> >> > multifd_send_data_alloc() is indeed too tricky (blame me.. who requested
> >> >> > that more or less). Knowing that VFIO can use dynamic buffers with ->idstr
> >> >> > and ->buf (I was thinking it could be buf[1M].. but I was wrong...) made
> >> >> > that feeling stronger.
> >> >>
> >> >> If we're going to commit bad code and then rewrite it a week later, we
> >> >> could have just let the original series from Maciej merge without any of
> >> >> this.
> >> >
> >> > Why is it "bad code"?
> >> >
> >> > It runs pretty well, I don't think it's bad code. You wrote it, and I
> >> > don't think it's bad at all.
> >>
> >> Code that forces us to do arithmetic in order to properly allocate
> >> memory and comes with a big comment explaining how we're dodging
> >> compiler warnings is bad code in my book.
> >>
> >> >
> >> > But now we're rethinking after reading Maciej's new series.
> >> > Personally I don't think it's a major problem.
> >> >
> >> > Note that we're not changing the design back: what was initially proposed
> >> > was the client submitting an array of multifd objects. I still don't think
> >> > that's right.
> >> >
> >> > What the change does so far is make the union a struct; however, that's
> >> > still N+2 objects not 2*N, where 2 came from RAM+VFIO. I think the
> >> > important bits are still there (from your previous refactor series).
> >> >
> >>
> >> You fail to appreciate that before the RFC series, multifd already
> >> allocated N for the pages.
> >
> > It depends on how you see it, IMHO. I think it allocates N not for the
> > "pages" but for the "threads", because the threads can be busy with those
> > buffers, no matter if it's "page" or "device data".
>
> Each MultiFD*Params had a p->pages, so N channels, N p->pages. The
> device state series would add p->device_state, one per channel. So 2N +
> 2 (for the extra slot).
Then it makes sense to have SendData covering pages+device_state. I think
it's what we have now, but maybe I missed the point.
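Roughly this shape, I mean (a sketch only; the exact field names may
differ):

    typedef struct {
        MultiFDPages_t ram;
        MultiFDDeviceState_t device_state;
    } MultiFDPayload;

    struct MultiFDSendData {
        MultiFDPayloadType type;
        MultiFDPayload u;  /* patch 1 turns the old union into a struct */
    };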
>
> >
> >> The device state adds another client, so that
> >> would be another N anyway. The problem the RFC tried to solve was that
> >> multifd channels owned that 2N, so the array was added to move the
> >> memory into the client's ownership. IOW, it wasn't even the RFC series
> >> that made it 2N, that was the multifd design all along. Now in hindsight
> >> I don't think we should have gone with the memory saving discussion.
> >
> > I may have given the impression that I only wanted to save memory; if
> > so, I'm sorry. But do you still remember I mentioned "we can make it a
> > struct, too" before your series landed? Then you thought it was ok to
> > keep using the union, and I was ok with that too! At least at that
> > time. I don't think that's a huge deal. Not every route we take must
> > be perfect, but we try our best to make it as good as we can.
>
> Yep, I did agree with all of this. I'm just saying I now think I
> shouldn't have.
>
> >
> > I don't think any discussion is forbidden. I agree memory consumption
> > is not the 1st thing to worry about, but I don't see why it can't be
> > discussed.
>
> It can be discussed, sure, but then 3 months pass and we're still
> talking about it. Saved ~64k and spent 3 months. We could have just as
> well said: "let's do a pass to optimize memory consumption after the
> device state series is in".
We didn't spend 3 months discussing memory consumption alone! It's
unfair to put it like that.
>
> >
> > Note that I never said we can't save that memory either - I plan to
> > have follow-up patches (for this, after Maciej's series lands.. that's
> > why I haven't even mentioned it yet) to allow modules to report their
> > device state buffer size. I just haven't said so yet, and don't plan
> > to worry about it before the vfio series lands. With that, we'll save
> > 1M*N when no vfio device is attached (but we'll need to handle
> > hotplug). So I think in the end we don't need to lose anything.
> >
> > However I think we need to get the base design of how vfio should use
> > multifd straight, because it's an important bit and I don't want to
> > rework important bits after a huge feature, if we know a better
> > direction.
> >
> > I don't even think what I proposed in patches 1-3 here is a must for
> > me. I should be clear again here, just in case we have similar
> > discussions afterwards, that I'm ok with the below being done after
> > Maciej's series:
> >
> > - Avoid memory allocations per-packet (done in patch 2-3)
> > - Avoid unnecessary data copy (done in patch 2-3)
> > - Avoid allocating device buffers when no device will use them (not proposed)
> >
> > But I'm not ok with building everything on top of the idea of not
> > using multifd buffers in the current way, because that can involve a
> > lot of changes: that's where buffers pass from top to bottom or
> > backwards, and the interface would need to change a lot too. We
> > already have that in master, so it's not a problem now.
> >
> >>
> >> >> I already suggested it a couple of times, we shouldn't be doing
> >> >> core refactorings underneath contributors' patches, this is too
> >> >> fragile. Just let people contribute their code and we can change it
> >> >> later.
> >> >
> >> > I sincerely don't think a lot needs changing... only patch 1. Basically
> >> > patch 1 on top of your previous rework series will be at least what I want,
> >> > but I'm open to comments from you guys.
> >>
> >> Don't get me wrong, I'm very much in favor of what you're doing
> >> here. However, I don't think it's ok to be backtracking on our design
> >> while other people have series in flight that depend on it. You
> >> certainly know the feeling of trying to merge a feature and having
> >> maintainers ask you to rewrite a bunch of code just to be able to start
> >> working. That's not ideal.
> >
> > As a patch writer I always like to do that when it's essential.
> > Normally the problem is that I don't have enough reviewer resources
> > to help me get a better design, or to discuss it.
>
> Right, but we can't keep providing a moving target. See the thread pool
> discussion for an example. It's hard to work that way. The discussion
> here is similar, we introduced the union, now we're moving to the
> struct. And you're right that the changes here are small, so let's not
> get caught in that.
What's your suggestion on the thread pool? Should we merge the change
where vfio creates the threads on its own (assuming vfio maintainers are ok
with it)?
I would say no; that's what I suggested. I'd start with reusing
ThreadPool; then we found an issue when Stefan raised a concern about
abusing the API. All these discussions seem sensible to me so far. We
can't rush these until we figure things out step by step. I don't see
another way.
I saw Cedric suggesting not even creating a thread on the recv side.
I am not sure whether that's easy, but I'd agree with Cedric if it's
possible. I think Maciej could have a point in that it can block
multifd threads, aka IO threads, which might be unwanted.
That said, I still think a device (even if threads are needed) should
not randomly create threads during migration. It'll be a nightmare.
>
> >
> > When vfio is the new user of multifd, vfio needs to do the heavy
> > lifting of drafting the api.
>
> Well, multifd could have provided a flexible API to begin with. That's
> entirely on us. I have been toying with allowing more clients since at
> least 1 year ago. We just couldn't get there in time.
>
> >
> >>
> >> I tried to quickly insert the RFC series before the device state series
> >> progressed too much, but it's 3 months later and we're still discussing
> >> it, maybe we don't need to do it this way.
> >
> > Is that how you felt when you were working on mapped-ram?
>
> At that time I had already committed to helping maintain the code, so
> the time spent there already went into the maintainer bucket anyway. If
> I were instead just trying to drive-by, then that would have been a
> pain.
I don't think your becoming a maintainer changed how I would review
the mapped-ram series.
OTOH, "I became a maintainer" could, because I know I am more
responsible for a chunk of code until I leave (and please let me know
any time you think you're ready to take migration on your own).
That's a real difference to me.
>
> > That series did take long enough, I agree. It's not so bad yet with
> > the VFIO series - it's good to have you around because you provide
> > great reviews. I'm also trying my best to not let a series dangle for
> > more than a year. I don't think 3 months is long for this feature:
> > this is the 1st external multifd user (and file mapping comes at it
> > from another angle), so it can take some time.
>
> Oh, I don't mean the VFIO series is taking long. That's a complex
> feature indeed. I just mean going from p->pages to p->data could have
> taken less time. I'm suggesting we might have overdone there a bit.
>
> >
> > Sorry if that's so, but sorry again, I'm still not convinced: I think
> > we need to go this way to build the blocks one by one, and we need to
> > make sure the lower blocks are solid enough to take the upper ones.
> > Again, I'm ok with small things that go against that, but not major
> > designs. We shouldn't have to go rewrite major designs if we already
> > seem to know a better one.
> >
> >>
> >> And ok, let's consider the current situation a special case. But I would
> >> like to avoid in the future this kind of uncertainty.
> >>
> >> >
> >> > Note that patch 2-3 will be on top of Maciej's changes and they're totally
> >> > not relevant to what we merged so far. Hence, nothing relevant there to
> >> > what you worked. And this is the diff of patch 1:
> >> >
> >> > migration/multifd.h | 16 +++++++++++-----
> >> > migration/multifd-device-state.c | 8 ++++++--
> >> > migration/multifd-nocomp.c | 13 ++++++-------
> >> > migration/multifd.c | 25 ++++++-------------------
> >> > 4 files changed, 29 insertions(+), 33 deletions(-)
> >> >
> >> > It's only 33 lines removed (many of which are comments..), it's not a huge
> >> > lot. I don't know why you feel so bad at this...
> >> >
> >> > It's probably because we maintain migration together, or we can keep our
> >> > own way of work. I don't think we did anything wrong yet so far.
> >> >
> >> > We can definitely talk about this in next 1:1.
> >> >
> >> >>
> >> >> This is also why I've been trying hard to separate core multifd
> >> >> functionality from migration code that uses multifd to transmit their
> >> >> data.
> >> >>
> >> >> My original RFC plus the suggestion to extend multifd_ops for device
> >> >> state would have (almost) made it so that no client code would be left
> >> >> in multifd. We could have been turning this thing upside down and it
> >> >> wouldn't affect anyone in terms of code conflicts.
> >> >
> >> > Do you mean you preferred the 2*N approach?
> >> >
> >>
> >> 2*N, where N is usually not larger than 32 and the payload size is
> >> 1k. Yes, I'd trade that off no problem.
> >
> > I think it's a problem.
> >
> > When vdpa gets involved with exactly the same pattern as vfio uses
> > (as they're really alike underneath), vdpa will need its own array of
> > buffers, or it'll need to take the same vfio lock, which doesn't make
> > sense to me.
> >
> > N+2, or rather N+M (where M is the number of users), is the minimum
> > number of buffers we need: N because in the worst case all multifd
> > threads can be busy, each occupying a buffer; M because in the worst
> > case all M users can be pre-filling at the same time. It's either
> > about memory consumption, or about logical sensibility.
>
> I'm aware of the memory consumption. Still, we're not forced to use the
> minimum amount of space we can. If using more memory can lead to a
> better design in the medium term, we're allowed to make that choice.
>
> Hey, I'm not even saying we *should* have gone with 2N. I think it's
> good that we're now N+M. But I think we also lost some design
> flexibility due to that.
>
> >
> >>
> >> >>
> >> >> The ship has already sailed, so your patches below are fine, I have just
> >> >> some small comments.
> >> >
> >> > I'm not sure what you meant about "ship sailed", but we should merge code
> >> > whenever we think is the most correct.
> >>
> >> As you put above, I agree that the important bits of the original series
> >> have been preserved, but other secondary goals were lost, such as the
> >> more abstract separation between multifd & client code and that is the
> >> ship that has sailed.
> >>
> >> That series was not: "introduce this array for no reason", we also lost
> >> the ability to abstract the payload from the multifd threads when we
> >> dropped the .alloc_fn callback for instance. The last patch you posted
> >
> > I don't remember the details there, but my memory was that it was too
> > flexible while we seem to reach the consensus that we only process either
> > RAM or device, nothing else.
>
> Indeed. I'm being unfair here, sorry.
>
> >
> >> here now adds multifd_device_state_prepare, somewhat ignoring that the
> >> ram code also has the same pattern and it could be made to use the same
> >> API.
> >
> > I need some further elaborations to understand.
> >
> > multifd_device_state_prepare currently does a few things: it takes
> > ownership of the temp device state object, fills in idstr /
> > instance_id, and takes the lock (so far needed because we only have
> > one device state object). None of that seems to be needed for RAM yet.
> >
> > Feel free to send a rfc patch if that helps.
>
> What if I don't send a patch, wait for it to get merged and then send a
> refactoring on top so we don't add yet another detour to this
> conversation? =)
I thought it shouldn't conflict much if it touches ram only, and what
I meant was that it could be a "comment in the form of a patch". But
yeah, sure thing.
--
Peter Xu
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [PATCH v2 12/17] migration/multifd: Device state transfer support - send side
2024-09-13 15:22 ` Peter Xu
@ 2024-09-13 18:26 ` Fabiano Rosas
2024-09-17 15:39 ` Peter Xu
2024-09-17 17:07 ` Cédric Le Goater
1 sibling, 1 reply; 128+ messages in thread
From: Fabiano Rosas @ 2024-09-13 18:26 UTC (permalink / raw)
To: Peter Xu
Cc: mail, Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
Peter Xu <peterx@redhat.com> writes:
> On Fri, Sep 13, 2024 at 12:04:00PM -0300, Fabiano Rosas wrote:
>> Peter Xu <peterx@redhat.com> writes:
>>
>> > On Fri, Sep 13, 2024 at 10:21:39AM -0300, Fabiano Rosas wrote:
>> >> Peter Xu <peterx@redhat.com> writes:
>> >>
>> >> > On Thu, Sep 12, 2024 at 03:43:39PM -0300, Fabiano Rosas wrote:
>> >> >> Peter Xu <peterx@redhat.com> writes:
>> >> >>
>> >> >> Hi Peter, sorry if I'm not very enthusiastic about this; I'm sure you
>> >> >> understand the rework is a little frustrating.
>> >> >
>> >> > That's OK.
>> >> >
>> >> > [For some reason my email sync program decided to give up working for
>> >> > hours. I got more time looking at a tsc bug, which is good, but then I
>> >> > miss a lot of emails..]
>> >> >
>> >> >>
>> >> >> > On Wed, Aug 28, 2024 at 09:41:17PM -0300, Fabiano Rosas wrote:
>> >> >> >> > +size_t multifd_device_state_payload_size(void)
>> >> >> >> > +{
>> >> >> >> > + return sizeof(MultiFDDeviceState_t);
>> >> >> >> > +}
>> >> >> >>
>> >> >> >> This will not be necessary because the payload size is the same as the
>> >> >> >> data type. We only need it for the special case where the MultiFDPages_t
>> >> >> >> is smaller than the total ram payload size.
>> >> >> >
>> >> >> > Today I was thinking maybe we should really clean this up, as the current
>> >> >> > multifd_send_data_alloc() is indeed too tricky (blame me.. who requested
>> >> >> > that more or less). Knowing that VFIO can use dynamic buffers with ->idstr
>> >> >> > and ->buf (I was thinking it could be buf[1M].. but I was wrong...) made
>> >> >> > that feeling stronger.
>> >> >>
>> >> >> If we're going to commit bad code and then rewrite it a week later, we
>> >> >> could have just let the original series from Maciej merge without any of
>> >> >> this.
>> >> >
>> >> > Why is it "bad code"?
>> >> >
>> >> > It runs pretty well, I don't think it's bad code. You wrote it, and I
>> >> > don't think it's bad at all.
>> >>
>> >> Code that forces us to do arithmetic in order to properly allocate
>> >> memory and comes with a big comment explaining how we're dodging
>> >> compiler warnings is bad code in my book.
>> >>
>> >> >
>> >> > But now we're rethinking after reading Maciej's new series.
>> >> > Personally I don't think it's a major problem.
>> >> >
>> >> > Note that we're not changing the design back: what was initially proposed
>> >> > was the client submitting an array of multifd objects. I still don't think
>> >> > that's right.
>> >> >
>> >> > What the change does so far is make the union a struct; however, that's
>> >> > still N+2 objects not 2*N, where 2 came from RAM+VFIO. I think the
>> >> > important bits are still there (from your previous refactor series).
>> >> >
>> >>
>> >> You fail to appreciate that before the RFC series, multifd already
>> >> allocated N for the pages.
>> >
>> > It depends on how you see it, IMHO. I think it allocates N not for the
>> > "pages" but for the "threads", because the threads can be busy with those
>> > buffers, no matter if it's "page" or "device data".
>>
>> Each MultiFD*Params had a p->pages, so N channels, N p->pages. The
>> device state series would add p->device_state, one per channel. So 2N +
>> 2 (for the extra slot).
>
> Then it makes sense to have SendData covering pages+device_state. I think
> it's what we have now, but maybe I missed the point.
I misunderstood you. You're saying that you see the N as per-thread
instead of per-client-per-thread. That's one perspective indeed. It was
not what the device state series had going for it, so I still think that
this could have been a separate discussion independent of the p->pages
-> p->data change. But let's not drag this argument on, it's been
discussed and the code has been merged.
>
>>
>> >
>> >> The device state adds another client, so that
>> >> would be another N anyway. The problem the RFC tried to solve was that
>> >> multifd channels owned that 2N, so the array was added to move the
>> >> memory into the client's ownership. IOW, it wasn't even the RFC series
>> >> that made it 2N, that was the multifd design all along. Now in hindsight
>> >> I don't think we should have gone with the memory saving discussion.
>> >
>> > I think I could have given the impression that I only wanted to save memory;
>> > if so, I'm sorry. But do you still remember I mentioned "we can make it a
>> > struct, too" before your series landed? Then you thought it was ok to keep
>> > using the union, and I was ok with that too! At least at that time. I don't
>> > think that's a huge deal. I don't think each route we take must be perfect,
>> > but we try our best to make it as good as we can.
>>
>> Yep, I did agree with all of this. I'm just saying I now think I
>> shouldn't have.
>>
>> >
>> > I don't think any discussion should be off-limits. I agree memory consumption
>> > is not the 1st thing to worry about, but I don't see why it can't be discussed.
>>
>> It can be discussed, sure, but then 3 months pass and we're still
>> talking about it. Saved ~64k and spent 3 months. We could have just as
>> well said: "let's do a pass to optimize memory consumption after the
>> device state series is in".
>
> We didn't spend 3 months discussing memory consumption only! It's
> unfair to see it like that.
Ok, you're right.
>
>>
>> >
>> > Note that I never said we can't save that memory either - I plan to have
>> > follow-up patches (for this, after Maciej's series lands.. that's why I
>> > haven't even mentioned it yet) to allow modules to report their device state
>> > buffer size. I just didn't say so yet, and don't plan to worry about it before
>> > the vfio series lands. With that, we'll save 1M*N when no vfio device is
>> > attached (but we'll need to handle hotplug). So I think we don't need to lose anything in the end.
>> >
>> > However I think we need to get the base design straight on how vfio
>> > should use multifd, because it's an important bit and I don't want to rework
>> > important bits after a huge feature if we already know a better direction.
>> >
>> > I don't even think what I proposed in patches 1-3 here is a must to me; I
>> > should be clear again here just in case we have similar discussions
>> > afterwards.. I'm ok with the below being done after Maciej's series:
>> >
>> > - Avoid memory allocations per-packet (done in patch 2-3)
>> > - Avoid unnecessary data copy (done in patch 2-3)
>> > - Avoid allocating device buffers when no device will use them (not proposed)
>> >
>> > But I'm not ok building everything on top of the idea of not using multifd
>> > buffers in the current way, because that can involve a lot of changes:
>> > that's where buffers pass from top to bottom or backwards, and the interface
>> > would need to change a lot too. We already have that in master so it's not a
>> > problem now.
>> >
>> >>
>> >> >> I already suggested it a couple of times, we shouldn't be doing
>> >> >> core refactorings underneath contributors' patches, this is too
>> >> >> fragile. Just let people contribute their code and we can change it
>> >> >> later.
>> >> >
>> >> > I sincerely don't think a lot needs changing... only patch 1. Basically
>> >> > patch 1 on top of your previous rework series will be at least what I want,
>> >> > but I'm open to comments from you guys.
>> >>
>> >> Don't get me wrong, I'm very much in favor of what you're doing
>> >> here. However, I don't think it's ok to be backtracking on our design
>> >> while other people have series in flight that depend on it. You
>> >> certainly know the feeling of trying to merge a feature and having
>> >> maintainers ask you to rewrite a bunch of code just to be able to start
>> >> working. That's not ideal.
>> >
>> > I as a patch writer always like to do that when it's essential. Normally
>> > the case is I don't have enough reviewer resources to help me get a better
>> > design, or discuss it.
>>
>> Right, but we can't keep providing a moving target. See the thread pool
>> discussion for an example. It's hard to work that way. The discussion
>> here is similar, we introduced the union, now we're moving to the
>> struct. And you're right that the changes here are small, so let's not
>> get caught in that.
>
> What's your suggestion on the thread pool? Should we merge the change
> where vfio creates the threads on its own (assuming vfio maintainers are ok
> with it)?
This is not a simple answer and I'm not exactly sure where to draw the
line, but in this case I'm inclined to say: yes.
>
> I would say no, that's what I suggested. I'd start with reusing
> ThreadPool, then we found issue when Stefan reported worry on abusing the
> API. All these discussions seem sensible to me so far. We can't rush on
> these until we figure things out step by step. I don't see a way.
The problem is that using a thread pool is something that we've agreed
on for a while and is even in the migration TODO list. It's not
something that came up as a result of the device state series. I know
this is not anyone's intent, but it starts to feel like gatekeeping.
The fact that migration lacks a thread pool, that multifd threads have
historically caused issues and that what's in util/thread-pool.c is only
useful to the block layer are all preexisting problems of this code
base. We could be (and are) working to improve those regardless of what
new features are being contributed. I'm not sure it's productive to pick
problems we have had for a while and present those as prerequisites for
merging new code.
But as I said, there's a line to be drawn somewhere and I don't know
exactly where it lies. I understand the points that were brought up in
favor of first figuring out the thread pool situation, those are not at
all unreasonable.
>
> I saw Cedric suggesting to not even create a thread on recv side. I am not
> sure whether that's easy, but I'd agree with Cedric if possible. I think
> Maciej could have a point where it can block multifd threads, aka, IO
> threads, which might be unwanted.
>
> That said, I still think devices (even if threads are needed) should not
> randomly create threads during migration. It'll be a nightmare.
The thread-pool approach is being looked at to solve this particular
problem, but we have also discussed in the past that multifd threads
themselves should be managed by a thread pool. Will we add this
requirement to what is being done now? Otherwise, don't we risk having
an implementation that doesn't serve the rest of multifd? Do we even
know what the requirements are? Keep in mind that we're already not
modifying the existing ThreadPool, but planning to write a new one.
>
>>
>> >
>> > When vfio is the new user of multifd, vfio needs to do the heavy lifting to
>> > draft the API.
>>
>> Well, multifd could have provided a flexible API to begin with. That's
>> entirely on us. I have been toying with allowing more clients since at
>> least 1 year ago. We just couldn't get there in time.
>>
>> >
>> >>
>> >> I tried to quickly insert the RFC series before the device state series
>> >> progressed too much, but it's 3 months later and we're still discussing
>> >> it, maybe we don't need to do it this way.
>> >
>> > Did you get that feeling from when you were working on
>> > mapped-ram?
>>
>> At that time I had already committed to helping maintain the code, so
>> the time spent there already went into the maintainer bucket anyway. If
>> I were instead just trying to drive-by, then that would have been a
>> pain.
>
> I don't think you becoming a maintainer changed how I would have reviewed
> the mapped-ram series.
>
> OTOH, "I became a maintainer" could, because I know I am more responsible
> for a chunk of code until I leave (and please let me know any time when you
> think you're ready to take migration on your own). That's a real
> difference to me.
Right, that's my point. I don't mind that what I'm doing takes time. I'm
not going anywhere. I do mind that because we're not going anywhere, we
start to drag people into a constant state of improving the next little
thing. Again, our definition of what constitutes "a little thing" is of
course different.
>
>>
>> > That series did take long enough, I agree. Not so bad yet with the VFIO
>> > series - it's good to have you around because you provide great reviews.
>> > I'm also trying my best to not let a series dangle for more than a year.
>> > I don't think 3 months is long with this feature: this is the 1st multifd
>> > external user (and file mapping comes at it from another angle), it can take some
>> > time.
>>
>> Oh, I don't mean the VFIO series is taking long. That's a complex
>> feature indeed. I just mean going from p->pages to p->data could have
>> taken less time. I'm suggesting we might have overdone there a bit.
>>
>> >
>> > Sorry if it's so, but sorry again, I don't think I'm convinced: I think we
>> > need to go this way to build blocks one by one, and we need to make sure the
>> > lower blocks are solid enough to carry the upper ones. Again I'm
>> > ok with small things that go against it, but not major designs. We
>> > shouldn't end up rewriting major designs if we already seem to know a better one.
>> >
>> >>
>> >> And ok, let's consider the current situation a special case. But I would
>> >> like to avoid this kind of uncertainty in the future.
>> >>
>> >> >
>> >> > Note that patch 2-3 will be on top of Maciej's changes and they're totally
>> >> > not relevant to what we merged so far. Hence, nothing relevant there to
>> >> > what you worked on. And this is the diff of patch 1:
>> >> >
>> >> > migration/multifd.h | 16 +++++++++++-----
>> >> > migration/multifd-device-state.c | 8 ++++++--
>> >> > migration/multifd-nocomp.c | 13 ++++++-------
>> >> > migration/multifd.c | 25 ++++++-------------------
>> >> > 4 files changed, 29 insertions(+), 33 deletions(-)
>> >> >
>> >> > It's only 33 lines removed (many of which are comments..), it's not a huge
>> >> > amount. I don't know why you feel so bad about this...
>> >> >
>> >> > It's probably because we maintain migration together, or we can each keep
>> >> > our own way of working. I don't think we've done anything wrong so far.
>> >> >
>> >> > We can definitely talk about this in next 1:1.
>> >> >
>> >> >>
>> >> >> This is also why I've been trying hard to separate core multifd
>> >> >> functionality from migration code that uses multifd to transmit their
>> >> >> data.
>> >> >>
>> >> >> My original RFC plus the suggestion to extend multifd_ops for device
>> >> >> state would have (almost) made it so that no client code would be left
>> >> >> in multifd. We could have been turning this thing upside down and it
>> >> >> wouldn't affect anyone in terms of code conflicts.
>> >> >
>> >> > Do you mean you preferred the 2*N approach?
>> >> >
>> >>
>> >> 2*N, where N is usually not larger than 32 and the payload size is
>> >> 1k. Yes, I'd trade that off no problem.
>> >
>> > I think it's a problem.
>> >
>> > If vdpa gets involved with exactly the same pattern as vfio uses (as
>> > they're really alike underneath), then vdpa will need its own array of
>> > buffers, or it'll need to take the same vfio lock, which doesn't make sense
>> > to me.
>> >
>> > N+2, or N+M (M being the number of users), is the minimum number of
>> > buffers we need. N because multifd can be worst case 100% busy on all
>> > threads occupying the buffers. M because M users can be worst case 100%
>> > pre-filling. It's either about memory consumption, or about logical
>> > sensibility.
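
(A rough back-of-the-envelope with the numbers mentioned upthread - N up
to 32 channels and a payload object of about 1k: N+2 = 34 objects is
roughly 34k, while 2N+2 = 66 objects is roughly 66k, so the saving under
discussion is on the order of tens of kilobytes.)
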
>>
>> I'm aware of the memory consumption. Still, we're not forced to use the
>> minimum amount of space we can. If using more memory can lead to a
>> better design in the medium term, we're allowed to make that choice.
>>
>> Hey, I'm not even saying we *should* have gone with 2N. I think it's
>> good that we're now N+M. But I think we also lost some design
>> flexibility due to that.
>>
>> >
>> >>
>> >> >>
>> >> >> The ship has already sailed, so your patches below are fine, I have just
>> >> >> some small comments.
>> >> >
>> >> > I'm not sure what you meant about "ship sailed", but we should merge code
>> >> > whenever we think it's the most correct.
>> >>
>> >> As you put above, I agree that the important bits of the original series
>> >> have been preserved, but other secondary goals were lost, such as the
>> >> more abstract separation between multifd & client code and that is the
>> >> ship that has sailed.
>> >>
>> >> That series was not: "introduce this array for no reason", we also lost
>> >> the ability to abstract the payload from the multifd threads when we
>> >> dropped the .alloc_fn callback for instance. The last patch you posted
>> >
>> > I don't remember the details there, but my memory was that it was too
>> > flexible, while we seemed to reach the consensus that we only process either
>> > RAM or device state, nothing else.
>>
>> Indeed. I'm being unfair here, sorry.
>>
>> >
>> >> here now adds multifd_device_state_prepare, somewhat ignoring that the
>> >> ram code also has the same pattern and it could be made to use the same
>> >> API.
>> >
>> > I need some further elaborations to understand.
>> >
>> > multifd_device_state_prepare currently does a few things: taking ownership
>> > of the temp device state object, filling in idstr / instance_id, and taking the
>> > lock (so far needed because we only have one device state object). None
>> > of them seems to be needed for RAM yet.
>> >
>> > Feel free to send a rfc patch if that helps.
>>
>> What if I don't send a patch, wait for it to get merged and then send a
>> refactoring on top so we don't add yet another detour to this
>> conversation? =)
>
> I thought it shouldn't conflict much if it's ram-only, and what I meant was
> that it could be a "comment in the form of a patch". But yeah, sure thing.
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [PATCH v2 12/17] migration/multifd: Device state transfer support - send side
2024-09-13 18:26 ` Fabiano Rosas
@ 2024-09-17 15:39 ` Peter Xu
0 siblings, 0 replies; 128+ messages in thread
From: Peter Xu @ 2024-09-17 15:39 UTC (permalink / raw)
To: Fabiano Rosas
Cc: mail, Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On Fri, Sep 13, 2024 at 03:26:43PM -0300, Fabiano Rosas wrote:
> The thread-pool approach is being looked at to solve this particular
> problem, but we have also discussed in the past that multifd threads
> themselves should be managed by a thread pool. Will we add this
> requirement to what is being done now? Otherwise, don't we risk having
> an implementation that doesn't serve the rest of multifd? Do we even
> know what the requirements are? Keep in mind that we're already not
> modifying the existing ThreadPool, but planning to write a new one.
Multifd currently has the below specialties:
- Multifd threads have a 1:1 mapping with iochannels
- Multifd thread count should be relevant to the target bandwidth (e.g., the
  NIC performance)
While for a generic thread pool:
- Threads have no correlation to any iochannel; they run some async cpu
  intensive workloads during migration (either during switchover, or
  maybe even before that?)
- Thread number should have no correlation to NIC/bandwidth; a sane start
  could be $(nproc), but maybe not..
I don't remember what I was thinking previously, but now it sounds ok to
me to keep multifd separate for now, because multifd does serve a
slightly different purpose (maximum network throughput) v.s. where we want
a pool of threads doing async tasks (which can be anything).
--
Peter Xu
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [PATCH v2 12/17] migration/multifd: Device state transfer support - send side
2024-09-13 15:22 ` Peter Xu
2024-09-13 18:26 ` Fabiano Rosas
@ 2024-09-17 17:07 ` Cédric Le Goater
2024-09-17 17:50 ` Peter Xu
1 sibling, 1 reply; 128+ messages in thread
From: Cédric Le Goater @ 2024-09-17 17:07 UTC (permalink / raw)
To: Peter Xu, Fabiano Rosas
Cc: mail, Alex Williamson, Eric Blake, Markus Armbruster,
Daniel P. Berrangé, Avihai Horon, Joao Martins, qemu-devel
[ ... ]
>>> I as a patch writer always like to do that when it's essential. Normally
>>> the case is I don't have enough reviewer resources to help me get a better
>>> design, or discuss it.
>>
>> Right, but we can't keep providing a moving target. See the thread pool
>> discussion for an example. It's hard to work that way. The discussion
>> here is similar, we introduced the union, now we're moving to the
>> struct. And you're right that the changes here are small, so let's not
>> get caught in that.
>
> What's your suggestion on the thread pool? Should we merge the change
> where vfio creates the threads on its own (assuming vfio maintainers are ok
> with it)?
>
> I would say no, that's what I suggested. I'd start with reusing
> ThreadPool, then we found issue when Stefan reported worry on abusing the
> API. All these discussions seem sensible to me so far. We can't rush on
> these until we figure things out step by step. I don't see a way.
>
> I saw Cedric suggesting to not even create a thread on recv side. I am not
> sure whether that's easy, but I'd agree with Cedric if possible. I think
> Maciej could have a point where it can block multifd threads, aka, IO
> threads, which might be unwanted.
Sorry if I am adding noise on this topic. I made this suggestion
because I spotted some asymmetry in the proposal.
The send and recv implementation in VFIO relies on different
interfaces with different level of complexity. The send part is
using a set of multifd callbacks called from multifd threads,
if I am correct. Whereas the recv part is directly implemented
in VFIO with local thread(s?) doing their own state receive cookery.
I was expecting a common interface to minimize assumptions on both
ends. It doesn't have to be callback based. It could be a set of
services a subsystem could use to transfer state in parallel.
<side note>
VFIO migration is driven by numerous callbacks and it is
difficult to understand the context in which these are called.
Adding more callbacks might not be the best approach.
</side note>
The other comment was on optimisation. If this is an optimisation
then I would expect, first, a non-optimized version not using threads
(on the recv side).
VFIO Migration is a "new" feature which needs some more run-in.
That said, it is stable, MLX5 VFs devices have good support, you
can rely on me to evaluate the future respins.
Thanks,
C.
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [PATCH v2 12/17] migration/multifd: Device state transfer support - send side
2024-09-17 17:07 ` Cédric Le Goater
@ 2024-09-17 17:50 ` Peter Xu
2024-09-19 19:51 ` Maciej S. Szmigiero
0 siblings, 1 reply; 128+ messages in thread
From: Peter Xu @ 2024-09-17 17:50 UTC (permalink / raw)
To: Cédric Le Goater
Cc: Fabiano Rosas, mail, Alex Williamson, Eric Blake,
Markus Armbruster, Daniel P. Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On Tue, Sep 17, 2024 at 07:07:10PM +0200, Cédric Le Goater wrote:
> [ ... ]
>
> > > > I as a patch writer always like to do that when it's essential. Normally
> > > > the case is I don't have enough reviewer resources to help me get a better
> > > > design, or discuss it.
> > >
> > > Right, but we can't keep providing a moving target. See the thread pool
> > > discussion for an example. It's hard to work that way. The discussion
> > > here is similar, we introduced the union, now we're moving to the
> > > struct. And you're right that the changes here are small, so let's not
> > > get caught in that.
> >
> > What's your suggestion on the thread pool? Should we merge the change
> > where vfio creates the threads on its own (assuming vfio maintainers are ok
> > with it)?
> >
> > I would say no, that's what I suggested. I'd start with reusing
> > ThreadPool, then we found issue when Stefan reported worry on abusing the
> > API. All these discussions seem sensible to me so far. We can't rush on
> > these until we figure things out step by step. I don't see a way.
> >
> > I saw Cedric suggesting to not even create a thread on recv side. I am not
> > sure whether that's easy, but I'd agree with Cedric if possible. I think
> > Maciej could have a point where it can block multifd threads, aka, IO
> > threads, which might be unwanted.
>
> Sorry if I am adding noise on this topic. I made this suggestion
> because I spotted some asymmetry in the proposal.
>
> The send and recv implementations in VFIO rely on different
> interfaces with different levels of complexity. The send part is
> using a set of multifd callbacks called from multifd threads,
> if I am correct. Whereas the recv part is directly implemented
> in VFIO with local thread(s?) doing their own state receive cookery.
Yeh, the send/recv sides are indeed not fully symmetrical in the case of
multifd - the recv side is more IO-driven, e.g., QEMU reacts based on what
it receives (which was encoded in the headers of the received packets).
The src is more of a generic consumer / producer model where threads can
enqueue tasks / data to different iochannels.
>
> I was expecting a common interface to minimize assumptions on both
> ends. It doesn't have to be callback based. It could be a set of
> services a subsystem could use to transfer state in parallel.
> <side note>
> VFIO migration is driven by numerous callbacks and it is
> difficult to understand the context in which these are called.
> Adding more callbacks might not be the best approach.
> </side note>
>
> The other comment was on optimisation. If this is an optimisation
> then I would expect, first, a non-optimized version not using threads
> (on the recv side).
As commented in a previous email, I had a feeling that Maciej wanted to
avoid blocking multifd threads when applying VFIO data chunks to the kernel
driver, but Maciej please correct me.. I could be wrong.
To me I think I'm fine even if it blocks multifd threads, as it'll only
happen with VFIO (we may want to consider basing n_multifd_threads on the
number of vfio devices then, so we still always have some idle
threads taking IOs out of the NIC buffers).
So I agree with Cedric that if we can provide a functional working version
first, then we can at least start with the simpler approach.
>
> VFIO Migration is a "new" feature which needs some more run-in.
> That said, it is stable, MLX5 VFs devices have good support, you
> can rely on me to evaluate the future respins.
>
> Thanks,
>
> C.
>
--
Peter Xu
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [PATCH v2 06/17] migration: Add save_live_complete_precopy_{begin,end} handlers
2024-09-09 19:08 ` Peter Xu
2024-09-09 19:32 ` Peter Xu
@ 2024-09-19 19:47 ` Maciej S. Szmigiero
2024-09-19 20:54 ` Peter Xu
1 sibling, 1 reply; 128+ messages in thread
From: Maciej S. Szmigiero @ 2024-09-19 19:47 UTC (permalink / raw)
To: Peter Xu
Cc: Avihai Horon, Fabiano Rosas, Alex Williamson,
Cédric Le Goater, Eric Blake, Markus Armbruster,
Daniel P. Berrangé, Joao Martins, qemu-devel
On 9.09.2024 21:08, Peter Xu wrote:
> On Mon, Sep 09, 2024 at 08:32:45PM +0200, Maciej S. Szmigiero wrote:
>> On 9.09.2024 19:59, Peter Xu wrote:
>>> On Thu, Sep 05, 2024 at 04:45:48PM +0300, Avihai Horon wrote:
>>>>
>>>> On 27/08/2024 20:54, Maciej S. Szmigiero wrote:
>>>>> External email: Use caution opening links or attachments
>>>>>
>>>>>
>>>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>>>
>>>>> These SaveVMHandlers help a device provide its own asynchronous
>>>>> transmission of the remaining data at the end of a precopy phase.
>>>>>
>>>>> In this use case the save_live_complete_precopy_begin handler might
>>>>> be used to mark the stream boundary before proceeding with asynchronous
>>>>> transmission of the remaining data while the
>>>>> save_live_complete_precopy_end handler might be used to mark the
>>>>> stream boundary after performing the asynchronous transmission.
>>>>>
>>>>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>>>>> ---
>>>>> include/migration/register.h | 36 ++++++++++++++++++++++++++++++++++++
>>>>> migration/savevm.c | 35 +++++++++++++++++++++++++++++++++++
>>>>> 2 files changed, 71 insertions(+)
>>>>>
>>>>> diff --git a/include/migration/register.h b/include/migration/register.h
>>>>> index f60e797894e5..9de123252edf 100644
>>>>> --- a/include/migration/register.h
>>>>> +++ b/include/migration/register.h
>>>>> @@ -103,6 +103,42 @@ typedef struct SaveVMHandlers {
>>>>> */
>>>>> int (*save_live_complete_precopy)(QEMUFile *f, void *opaque);
>>>>>
>>>>> + /**
>>>>> + * @save_live_complete_precopy_begin
>>>>> + *
>>>>> + * Called at the end of a precopy phase, before all
>>>>> + * @save_live_complete_precopy handlers and before launching
>>>>> + * all @save_live_complete_precopy_thread threads.
>>>>> + * The handler might, for example, mark the stream boundary before
>>>>> + * proceeding with asynchronous transmission of the remaining data via
>>>>> + * @save_live_complete_precopy_thread.
>>>>> + * When postcopy is enabled, devices that support postcopy will skip this step.
>>>>> + *
>>>>> + * @f: QEMUFile where the handler can synchronously send data before returning
>>>>> + * @idstr: this device section idstr
>>>>> + * @instance_id: this device section instance_id
>>>>> + * @opaque: data pointer passed to register_savevm_live()
>>>>> + *
>>>>> + * Returns zero to indicate success and negative for error
>>>>> + */
>>>>> + int (*save_live_complete_precopy_begin)(QEMUFile *f,
>>>>> + char *idstr, uint32_t instance_id,
>>>>> + void *opaque);
>>>>> + /**
>>>>> + * @save_live_complete_precopy_end
>>>>> + *
>>>>> + * Called at the end of a precopy phase, after @save_live_complete_precopy
>>>>> + * handlers and after all @save_live_complete_precopy_thread threads have
>>>>> + * finished. When postcopy is enabled, devices that support postcopy will
>>>>> + * skip this step.
>>>>> + *
>>>>> + * @f: QEMUFile where the handler can synchronously send data before returning
>>>>> + * @opaque: data pointer passed to register_savevm_live()
>>>>> + *
>>>>> + * Returns zero to indicate success and negative for error
>>>>> + */
>>>>> + int (*save_live_complete_precopy_end)(QEMUFile *f, void *opaque);
>>>>
>>>> Is this handler necessary now that migration core is responsible for the
>>>> threads and joins them? I don't see VFIO implementing it later on.
>>>
>>> Right, I spot the same thing.
>>>
>>> This series added three hooks: begin, end, precopy_thread.
>>>
>>> What I think is it only needs one, which is precopy_async. My vague memory
>>> was that was what we used to discuss too, so that when migration precopy
>>> flushes the final round of iterable data, it does:
>>>
>>> (1) loop over all complete_precopy_async() and enqueue the tasks if
>>> existed into the migration worker pool. Then,
>>>
>>> (2) loop over all complete_precopy() like before.
>>>
>>> Optionally, we can enforce one vmstate handler only provides either
>>> complete_precopy_async() or complete_precopy(). In this case VFIO can
>>> update the two hooks during setup() by detecting multifd && !mapped_ram &&
>>> nocomp.
>>>
>>
>> The "_begin" hook is still necessary to mark the end of the device state
>> sent via the main migration stream (during the phase VM is still running)
>> since we can't start loading the multifd sent device state until all of
>> that earlier data finishes loading first.
>
> Ah I remembered some more now, thanks.
>
> If vfio can send data during iterations this new hook will also not be
> needed, right?
>
> I remember you mentioned you'd have a look and see the challenges there, is
> there any conclusion yet on whether we can use multifd even during that?
Yeah, I looked at that and it wasn't a straightforward thing to introduce.
I am worried that with all the things that have already piled up (including the
new thread pool implementation) we risk missing QEMU 9.2 too if this is
included.
> It's also a pity that we introduce this hook only because we want a
> boundary between "iterable stage" and "final stage". IIUC if we have any
> kind of message telling dest beforehand that "we're going to the last
> stage" then this hook can be avoided. Now it's at least inefficient
> because we need to trigger begin() per-device, even if I think it's more of
> a global request saying that "we need to load all main stream data first
> before moving on".
It should be pretty easy to remove that begin() hook once it is no longer
needed - after all, it's only necessary for the sender.
>>
>> We shouldn't send that boundary marker in .save_live_complete_precopy
>> either since it would mean unnecessarily waiting for other devices'
>> (not necessarily VFIO ones) .save_live_complete_precopy bulk data.
>>
>> And VFIO SaveVMHandlers are shared for all VFIO devices (and const) so
>> we can't really change them at runtime.
>
> In all cases, please consider dropping end() if it's never used; IMO it's
> fine if there is only begin(), and we shouldn't keep hooks that are never
> used.
Okay, will remove the end() hook then.
> Thanks,
>
Thanks,
Maciej
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [PATCH v2 06/17] migration: Add save_live_complete_precopy_{begin,end} handlers
2024-09-09 19:32 ` Peter Xu
@ 2024-09-19 19:48 ` Maciej S. Szmigiero
0 siblings, 0 replies; 128+ messages in thread
From: Maciej S. Szmigiero @ 2024-09-19 19:48 UTC (permalink / raw)
To: Peter Xu
Cc: Avihai Horon, Fabiano Rosas, Alex Williamson,
Cédric Le Goater, Eric Blake, Markus Armbruster,
Daniel P. Berrangé, Joao Martins, qemu-devel
On 9.09.2024 21:32, Peter Xu wrote:
> On Mon, Sep 09, 2024 at 03:08:40PM -0400, Peter Xu wrote:
>> On Mon, Sep 09, 2024 at 08:32:45PM +0200, Maciej S. Szmigiero wrote:
>>> On 9.09.2024 19:59, Peter Xu wrote:
>>>> On Thu, Sep 05, 2024 at 04:45:48PM +0300, Avihai Horon wrote:
>>>>>
>>>>> On 27/08/2024 20:54, Maciej S. Szmigiero wrote:
>>>>>> External email: Use caution opening links or attachments
>>>>>>
>>>>>>
>>>>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>>>>
>>>>>> These SaveVMHandlers help a device provide its own asynchronous
>>>>>> transmission of the remaining data at the end of a precopy phase.
>>>>>>
>>>>>> In this use case the save_live_complete_precopy_begin handler might
>>>>>> be used to mark the stream boundary before proceeding with asynchronous
>>>>>> transmission of the remaining data while the
>>>>>> save_live_complete_precopy_end handler might be used to mark the
>>>>>> stream boundary after performing the asynchronous transmission.
>>>>>>
>>>>>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>>>>>> ---
>>>>>> include/migration/register.h | 36 ++++++++++++++++++++++++++++++++++++
>>>>>> migration/savevm.c | 35 +++++++++++++++++++++++++++++++++++
>>>>>> 2 files changed, 71 insertions(+)
>>>>>>
>>>>>> diff --git a/include/migration/register.h b/include/migration/register.h
>>>>>> index f60e797894e5..9de123252edf 100644
>>>>>> --- a/include/migration/register.h
>>>>>> +++ b/include/migration/register.h
>>>>>> @@ -103,6 +103,42 @@ typedef struct SaveVMHandlers {
>>>>>> */
>>>>>> int (*save_live_complete_precopy)(QEMUFile *f, void *opaque);
>>>>>>
>>>>>> + /**
>>>>>> + * @save_live_complete_precopy_begin
>>>>>> + *
>>>>>> + * Called at the end of a precopy phase, before all
>>>>>> + * @save_live_complete_precopy handlers and before launching
>>>>>> + * all @save_live_complete_precopy_thread threads.
>>>>>> + * The handler might, for example, mark the stream boundary before
>>>>>> + * proceeding with asynchronous transmission of the remaining data via
>>>>>> + * @save_live_complete_precopy_thread.
>>>>>> + * When postcopy is enabled, devices that support postcopy will skip this step.
>>>>>> + *
>>>>>> + * @f: QEMUFile where the handler can synchronously send data before returning
>>>>>> + * @idstr: this device section idstr
>>>>>> + * @instance_id: this device section instance_id
>>>>>> + * @opaque: data pointer passed to register_savevm_live()
>>>>>> + *
>>>>>> + * Returns zero to indicate success and negative for error
>>>>>> + */
>>>>>> + int (*save_live_complete_precopy_begin)(QEMUFile *f,
>>>>>> + char *idstr, uint32_t instance_id,
>>>>>> + void *opaque);
>>>>>> + /**
>>>>>> + * @save_live_complete_precopy_end
>>>>>> + *
>>>>>> + * Called at the end of a precopy phase, after @save_live_complete_precopy
>>>>>> + * handlers and after all @save_live_complete_precopy_thread threads have
>>>>>> + * finished. When postcopy is enabled, devices that support postcopy will
>>>>>> + * skip this step.
>>>>>> + *
>>>>>> + * @f: QEMUFile where the handler can synchronously send data before returning
>>>>>> + * @opaque: data pointer passed to register_savevm_live()
>>>>>> + *
>>>>>> + * Returns zero to indicate success and negative for error
>>>>>> + */
>>>>>> + int (*save_live_complete_precopy_end)(QEMUFile *f, void *opaque);
>>>>>
>>>>> Is this handler necessary now that migration core is responsible for the
>>>>> threads and joins them? I don't see VFIO implementing it later on.
>>>>
>>>> Right, I spot the same thing.
>>>>
>>>> This series added three hooks: begin, end, precopy_thread.
>>>>
>>>> What I think is it only needs one, which is precopy_async. My vague memory
>>>> was that was what we used to discuss too, so that when migration precopy
>>>> flushes the final round of iterable data, it does:
>>>>
>>>> (1) loop over all complete_precopy_async() and enqueue the tasks if
>>>> existed into the migration worker pool. Then,
>>>>
>>>> (2) loop over all complete_precopy() like before.
>>>>
>>>> Optionally, we can enforce one vmstate handler only provides either
>>>> complete_precopy_async() or complete_precopy(). In this case VFIO can
>>>> update the two hooks during setup() by detecting multifd && !mapped_ram &&
>>>> nocomp.
>>>>
>>>
>>> The "_begin" hook is still necessary to mark the end of the device state
>>> sent via the main migration stream (during the phase VM is still running)
>>> since we can't start loading the multifd sent device state until all of
>>> that earlier data finishes loading first.
>>
>> Ah I remembered some more now, thanks.
>>
>> If vfio can send data during iterations this new hook will also not be
>> needed, right?
>>
>> I remember you mentioned you'd have a look and see the challenges there, is
>> there any conclusion yet on whether we can use multifd even during that?
>>
>> It's also a pity that we introduce this hook only because we want a
>> boundary between "iterable stage" and "final stage". IIUC if we have any
>> kind of message telling dest before hand that "we're going to the last
>> stage" then this hook can be avoided. Now it's at least inefficient
>> because we need to trigger begin() per-device, even if I think it's more of
>> a global request saying that "we need to load all main stream data first
>> before moving on".
>
> Or, we could add one MIG_CMD_SWITCHOVER under QEMU_VM_COMMAND, then send it
> at the beginning of the switchover phase. Then we can have a generic
> marker on destination to be the boundary of "iterations" v.s. "switchover".
> Then I think we can also drop the begin() here, just to avoid one such sync
> per-device (also in case others may have such a need, like vdpa; then vdpa
> doesn't need that flag either).
>
> Fundamentally, that makes the VFIO_MIG_FLAG_DEV_DATA_STATE_COMPLETE to be a
> migration flag..
>
> But for sure the best is still if VFIO can enable multifd even during
> iterations. Then the boundary guard may not be needed.
That begin() handler was supposed to be generic for multiple device types,
that's why it was paired with the end() one that has no current use.
But you are right that using a single "barrier" or "sync" command for all
device types makes sense, so I will change it to MIG_CMD_SWITCHOVER.
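
A minimal sketch of what that could look like on the send side, reusing
the existing qemu_savevm_command_send() helper (the command name is the
one proposed above; the exact value and destination-side wiring are
illustrative):

    /* Emit a global marker so the destination knows all main-stream
     * data sent before this point must finish loading before any
     * multifd-provided device state is applied. */
    void qemu_savevm_send_switchover_start(QEMUFile *f)
    {
        /* No payload: the command itself is the boundary between the
         * iterative phase and the switchover phase. */
        qemu_savevm_command_send(f, MIG_CMD_SWITCHOVER, 0, NULL);
    }

On the destination, loadvm_process_command() would then flip a
"switchover started" flag that gates the device state load threads.
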
Thanks,
Maciej
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [PATCH v2 08/17] migration: Add load_finish handler and associated functions
2024-09-09 20:03 ` Peter Xu
@ 2024-09-19 19:49 ` Maciej S. Szmigiero
2024-09-19 21:11 ` Peter Xu
0 siblings, 1 reply; 128+ messages in thread
From: Maciej S. Szmigiero @ 2024-09-19 19:49 UTC (permalink / raw)
To: Peter Xu
Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P. Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On 9.09.2024 22:03, Peter Xu wrote:
> On Tue, Aug 27, 2024 at 07:54:27PM +0200, Maciej S. Szmigiero wrote:
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> load_finish SaveVMHandler allows migration code to poll whether
>> a device-specific asynchronous device state loading operation had finished.
>>
>> In order to avoid calling this handler needlessly the device is supposed
>> to notify the migration code of its possible readiness via a call to
>> qemu_loadvm_load_finish_ready_broadcast() while holding
>> qemu_loadvm_load_finish_ready_lock.
>>
>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>> ---
>> include/migration/register.h | 21 +++++++++++++++
>> migration/migration.c | 6 +++++
>> migration/migration.h | 3 +++
>> migration/savevm.c | 52 ++++++++++++++++++++++++++++++++++++
>> migration/savevm.h | 4 +++
>> 5 files changed, 86 insertions(+)
>>
>> diff --git a/include/migration/register.h b/include/migration/register.h
>> index 4a578f140713..44d8cf5192ae 100644
>> --- a/include/migration/register.h
>> +++ b/include/migration/register.h
>> @@ -278,6 +278,27 @@ typedef struct SaveVMHandlers {
>> int (*load_state_buffer)(void *opaque, char *data, size_t data_size,
>> Error **errp);
>>
>> + /**
>> + * @load_finish
>> + *
>> + * Poll whether all asynchronous device state loading had finished.
>> + * Not called on the load failure path.
>> + *
>> + * Called while holding the qemu_loadvm_load_finish_ready_lock.
>> + *
>> + * If this method signals "not ready" then it might not be called
>> + * again until qemu_loadvm_load_finish_ready_broadcast() is invoked
>> + * while holding qemu_loadvm_load_finish_ready_lock.
>
> [1]
>
>> + *
>> + * @opaque: data pointer passed to register_savevm_live()
>> + * @is_finished: whether the loading had finished (output parameter)
>> + * @errp: pointer to Error*, to store an error if it happens.
>> + *
>> + * Returns zero to indicate success and negative for error
>> + * It's not an error that the loading still hasn't finished.
>> + */
>> + int (*load_finish)(void *opaque, bool *is_finished, Error **errp);
>
> The load_finish() semantics is a bit weird, especially above [1] on "only
> allowed to be called once if ..." and also on the locks.
The point of this remark is that a driver needs to call
qemu_loadvm_load_finish_ready_broadcast() if it wants the migration
core to call its load_finish handler again.
> It looks to me vfio_load_finish() also does the final load of the device.
>
> I wonder whether that final load can be done in the threads,
Here, the problem is that the current VFIO VMState has to be loaded from the main
migration thread as it internally calls QEMU core address space modification
methods which explode if called from other threads.
> then after
> everything loaded the device post a semaphore telling the main thread to
> continue. See e.g.:
>
> if (migrate_switchover_ack()) {
> qemu_loadvm_state_switchover_ack_needed(mis);
> }
>
> IIUC, VFIO can register load_complete_ack similarly so it only sem_post()
> when all things are loaded? We can then get rid of this slightly awkward
> interface. I had a feeling that things can be simplified (e.g., if the
> thread will take care of loading the final vmstate then the mutex is also
> not needed? etc.).
With just a single call to switchover_ack_needed per VFIO device it would
need to do a blocking wait for the device buffers and config state load
to finish, therefore blocking other VFIO devices from potentially loading
their config state if they are ready to begin this operation earlier.
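
For illustration, the polling pattern described above would look roughly
like this on the migration-core side (a sketch against the proposed API;
qemu_loadvm_load_finish_ready_wait() is a hypothetical helper wrapping
the condvar wait, and error handling is trimmed):

    static int qemu_loadvm_load_finish_wait_all(Error **errp)
    {
        bool all_finished = false;

        qemu_loadvm_load_finish_ready_lock();
        while (!all_finished) {
            SaveStateEntry *se;

            all_finished = true;
            QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
                bool finished;

                if (!se->ops || !se->ops->load_finish) {
                    continue;
                }
                if (se->ops->load_finish(se->opaque, &finished, errp)) {
                    qemu_loadvm_load_finish_ready_unlock();
                    return -1;
                }
                all_finished &= finished;
            }
            if (!all_finished) {
                /* Sleeps until a device calls
                 * qemu_loadvm_load_finish_ready_broadcast(). */
                qemu_loadvm_load_finish_ready_wait();
            }
        }
        qemu_loadvm_load_finish_ready_unlock();
        return 0;
    }
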
Thanks,
Maciej
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [PATCH v2 09/17] migration/multifd: Device state transfer support - receive side
2024-09-09 19:52 ` Peter Xu
@ 2024-09-19 19:49 ` Maciej S. Szmigiero
0 siblings, 0 replies; 128+ messages in thread
From: Maciej S. Szmigiero @ 2024-09-19 19:49 UTC (permalink / raw)
To: Peter Xu
Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P. Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On 9.09.2024 21:52, Peter Xu wrote:
> On Mon, Sep 02, 2024 at 10:12:01PM +0200, Maciej S. Szmigiero wrote:
>>>> diff --git a/migration/multifd.h b/migration/multifd.h
>>>> index a3e35196d179..a8f3e4838c01 100644
>>>> --- a/migration/multifd.h
>>>> +++ b/migration/multifd.h
>>>> @@ -45,6 +45,12 @@ MultiFDRecvData *multifd_get_recv_data(void);
>>>> #define MULTIFD_FLAG_QPL (4 << 1)
>>>> #define MULTIFD_FLAG_UADK (8 << 1)
>>>> +/*
>>>> + * If set it means that this packet contains device state
>>>> + * (MultiFDPacketDeviceState_t), not RAM data (MultiFDPacket_t).
>>>> + */
>>>> +#define MULTIFD_FLAG_DEVICE_STATE (1 << 4)
>>>
>>> Overlaps with UADK. I assume on purpose because device_state doesn't
>>> support compression? Might be worth a comment.
>>>
>>
>> Yes, the device state transfer bit stream does not support compression,
>> so it is not a problem, since these "compression type" flags will never
>> be set in such a bit stream anyway.
>>
>> Will add a relevant comment here.
>
> Why reuse? Would using a new bit be easier if we still have plenty of bits
> (just to tell what is what directly from a stream dump)?
>
Will move that flag to the next unique bit then.
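
For the record, the resulting layout would simply give device state its
own bit (illustrative values, extending the existing flags):

    #define MULTIFD_FLAG_QPL          (4 << 1)
    #define MULTIFD_FLAG_UADK         (8 << 1)
    /* Unique bit, so a stream dump can't confuse it with a
     * compression type: */
    #define MULTIFD_FLAG_DEVICE_STATE (16 << 1)
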
Thanks,
Maciej
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [PATCH v2 12/17] migration/multifd: Device state transfer support - send side
2024-09-10 19:48 ` Peter Xu
2024-09-12 18:43 ` Fabiano Rosas
@ 2024-09-19 19:49 ` Maciej S. Szmigiero
2024-09-19 21:17 ` Peter Xu
1 sibling, 1 reply; 128+ messages in thread
From: Maciej S. Szmigiero @ 2024-09-19 19:49 UTC (permalink / raw)
To: Peter Xu
Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P. Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On 10.09.2024 21:48, Peter Xu wrote:
> On Wed, Aug 28, 2024 at 09:41:17PM -0300, Fabiano Rosas wrote:
>>> +size_t multifd_device_state_payload_size(void)
>>> +{
>>> + return sizeof(MultiFDDeviceState_t);
>>> +}
>>
>> This will not be necessary because the payload size is the same as the
>> data type. We only need it for the special case where the MultiFDPages_t
>> is smaller than the total ram payload size.
>
> Today I was thinking maybe we should really clean this up, as the current
> multifd_send_data_alloc() is indeed too tricky (blame me.. who requested
> that more or less). Knowing that VFIO can use dynamic buffers with ->idstr
> and ->buf (I was thinking it could be buf[1M].. but I was wrong...) made
> that feeling stronger.
>
> I think we should change it now perhaps, otherwise we'll need to introduce
> other helpers to e.g. reset the device buffers, and that's not only slow
> but also not good looking, IMO.
>
> So I went ahead with the idea from the previous discussion, and managed to
> change the SendData union into a struct; the memory consumption is not super
> important yet, IMHO, but we should still stick with the object model where
> the enqueuing thread switches buffers with multifd, as it still sounds a
> sane way to do it.
>
> Then when that patch is ready, I further tried to make VFIO reuse multifd
> buffers just like what we do with MultiFDPages_t->offset[]: in RAM code we
> don't allocate it every time we enqueue.
>
> I hope it'll also work for VFIO. VFIO has a specialty on being able to
> dump the config space so it's more complex (and I noticed Maciej's current
> design requires the final chunk of VFIO config data to be migrated in one
> packet.. that is also part of the complexity there). So I allowed that
> part to allocate a buffer but only that. IOW, I made some API (see below)
> that can either reuse a preallocated buffer, or use a separate one only for
> the final bulk.
>
> In short, could both of you have a look at what I came up with below? I
> did that in patches because I think it's too much to comment on, so patches
> may work better. If any of the below looks like good changes to you,
> then either Maciej can squash whatever fits into existing patches (and I feel
> like some existing patches in this series can go away with the below design),
> or I can post a pre-requisite patch, but only if either of you prefers that.
>
> Anyway, let me know, the patches apply on top of this whole series applied
> first.
>
> I also wonder whether there can be any perf difference already (I tested
> all multifd qtests with the below, but no VFIO I can run), perhaps not that
> much, but just to mention: the below should avoid both buffer allocations and
> one round of copying (so VFIO read() directly writes to the multifd buffers
> now).
I am not against making MultiFDSendData a struct and maybe introducing
some pre-allocated buffer.
But to be honest, that manual memory management, with having to remember
to call multifd_device_state_finish() on error paths as in your
proposed patch 3, really invites memory leaks.
I will think about some other way to have a reusable buffer.
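
One possible shape for that, sketched here under the assumption that
multifd_device_state_finish() takes the state pointer
(fill_device_state() and multifd_device_state_submit() are hypothetical
stand-ins for the real filling/queueing steps):

    /* Scope-based cleanup: early error returns can no longer leak the
     * prepared state object. */
    G_DEFINE_AUTOPTR_CLEANUP_FUNC(MultiFDDeviceState_t,
                                  multifd_device_state_finish)

    static int send_device_state_chunk(char *idstr, uint32_t instance_id,
                                       const void *data, size_t len,
                                       Error **errp)
    {
        g_autoptr(MultiFDDeviceState_t) state =
            multifd_device_state_prepare(idstr, instance_id);

        if (fill_device_state(state, data, len, errp) < 0) {
            return -1;              /* finish() runs automatically */
        }
        /* Hand ownership to multifd on success. */
        multifd_device_state_submit(g_steal_pointer(&state));
        return 0;
    }
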
In terms of not making an idstr copy (your proposed patch 2), I am not
100% sure that avoiding such a tiny allocation really justifies the risk
of a possible use-after-free via a dangling pointer.
I'm not 100% against it either, if you are confident that it will never happen.
By the way, I guess it makes sense to carry these changes in the main patch
set rather than as separate changes?
> Thanks,
Thanks,
Maciej
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [PATCH v2 12/17] migration/multifd: Device state transfer support - send side
2024-09-10 16:06 ` Peter Xu
@ 2024-09-19 19:49 ` Maciej S. Szmigiero
2024-09-19 21:18 ` Peter Xu
0 siblings, 1 reply; 128+ messages in thread
From: Maciej S. Szmigiero @ 2024-09-19 19:49 UTC (permalink / raw)
To: Peter Xu
Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P. Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On 10.09.2024 18:06, Peter Xu wrote:
> On Tue, Aug 27, 2024 at 07:54:31PM +0200, Maciej S. Szmigiero wrote:
>> +bool multifd_queue_device_state(char *idstr, uint32_t instance_id,
>> + char *data, size_t len)
>> +{
>> + /* Device state submissions can come from multiple threads */
>> + QEMU_LOCK_GUARD(&queue_job_mutex);
>
> Ah, just notice there's the mutex.
>
> So please consider the reply in the other thread; IIUC we can make this a
> generic mutex for multifd_send() to simplify the other patch too, then
> drop it here.
>
> I assume the ram code should be fine taking one more mutex even without
> vfio, if it only takes it once for each ~128 pages to enqueue, and only takes
> it in the main thread; then each update also shouldn't be in the hot path
> (e.g. no cache bouncing).
>
Will check whether it is possible to use a common mutex here for both RAM
and device state submission without a drop in performance.
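
(For scale: with 4 KiB pages and ~128 pages per multifd packet, a shared
mutex would be taken roughly once per 512 KiB of RAM enqueued, so
contention on the RAM path should be negligible - a back-of-the-envelope,
assuming the default page and packet sizes.)
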
Thanks,
Maciej
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [PATCH v2 12/17] migration/multifd: Device state transfer support - send side
2024-09-09 19:40 ` Peter Xu
@ 2024-09-19 19:50 ` Maciej S. Szmigiero
0 siblings, 0 replies; 128+ messages in thread
From: Maciej S. Szmigiero @ 2024-09-19 19:50 UTC (permalink / raw)
To: Peter Xu
Cc: Alex Williamson, Fabiano Rosas, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P. Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On 9.09.2024 21:40, Peter Xu wrote:
> On Fri, Aug 30, 2024 at 10:02:40AM -0300, Fabiano Rosas wrote:
>>>>> @@ -397,20 +404,16 @@ bool multifd_send(MultiFDSendData **send_data)
>>>>>
>>>>> p = &multifd_send_state->params[i];
>>>>> /*
>>>>> - * Lockless read to p->pending_job is safe, because only multifd
>>>>> - * sender thread can clear it.
>>>>> + * Lockless RMW on p->pending_job_preparing is safe, because only multifd
>>>>> + * sender thread can clear it after it had seen p->pending_job being set.
>>>>> + *
>>>>> + * Pairs with qatomic_store_release() in multifd_send_thread().
>>>>> */
>>>>> - if (qatomic_read(&p->pending_job) == false) {
>>>>> + if (qatomic_cmpxchg(&p->pending_job_preparing, false, true) == false) {
>>>>
>>>> What's the motivation for this change? It would be better to have it in
>>>> a separate patch with a proper justification.
>>>
>>> The original RFC patch set used dedicated device state multifd channels.
>>>
>>> Peter and other people wanted this functionality removed, however this caused
>>> a performance (downtime) regression.
>>>
>>> One of the things that seemed to help mitigate this regression was making
>>> the multifd channel selection more fair via this change.
>>>
>>> But I can split it out into a separate commit in the next patch set version and
>>> then see what performance improvement it currently brings.
>>
>> Yes, better to have it separate if anything for documentation of the
>> rationale.
>
> And when drafting that patch, please add a comment explaining the field.
> Currently it's missing:
>
> /*
> * The sender thread has work to do if either of below boolean is set.
> *
> * @pending_job: a job is pending
> * @pending_sync: a sync request is pending
> *
> * For both of these fields, they're only set by the requesters, and
> * cleared by the multifd sender threads.
> */
> bool pending_job;
> bool pending_job_preparing;
> bool pending_sync;
>
Will do if these variables end up staying in the patch (instead of being
replaced by the common send mutex, for example).
Thanks,
Maciej
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [PATCH v2 12/17] migration/multifd: Device state transfer support - send side
2024-09-17 17:50 ` Peter Xu
@ 2024-09-19 19:51 ` Maciej S. Szmigiero
0 siblings, 0 replies; 128+ messages in thread
From: Maciej S. Szmigiero @ 2024-09-19 19:51 UTC (permalink / raw)
To: Peter Xu, Cédric Le Goater
Cc: Fabiano Rosas, Alex Williamson, Eric Blake, Markus Armbruster,
Daniel P. Berrangé, Avihai Horon, Joao Martins, qemu-devel
On 17.09.2024 19:50, Peter Xu wrote:
> On Tue, Sep 17, 2024 at 07:07:10PM +0200, Cédric Le Goater wrote:
>> [ ... ]
>>
>>>>> I as a patch writer always like to do that when it's essential. Normally
>>>>> the case is I don't have enough reviewer resources to help me get a better
>>>>> design, or discuss it.
>>>>
>>>> Right, but we can't keep providing a moving target. See the thread pool
>>>> discussion for an example. It's hard to work that way. The discussion
>>>> here is similar, we introduced the union, now we're moving to the
>>>> struct. And you're right that the changes here are small, so let's not
>>>> get caught in that.
>>>
>>> What's your suggestion on the thread pool? Should we merge the change
>>> where vfio creates the threads on its own (assuming vfio maintainers are ok
>>> with it)?
>>>
>>> I would say no, that's what I suggested. I'd start with reusing
>>> ThreadPool, then we found issue when Stefan reported worry on abusing the
>>> API. All these discussions seem sensible to me so far. We can't rush on
>>> these until we figure things out step by step. I don't see a way.
>>>
>>> I saw Cedric suggesting to not even create a thread on recv side. I am not
>>> sure whether that's easy, but I'd agree with Cedric if possible. I think
> > > Maciej could have a point where it can block multifd threads, aka, IO
>>> threads, which might be unwanted.
>>
>> Sorry if I am adding noise on this topic. I made this suggestion
>> because I spotted some asymmetry in the proposal.
>>
>> The send and recv implementations in VFIO rely on different
>> interfaces with different levels of complexity. The send part is
>> using a set of multifd callbacks called from multifd threads,
>> if I am correct. Whereas the recv part is directly implemented
>> in VFIO with local thread(s?) doing their own state receive cookery.
>
> Yeh, the send/recv sides are indeed not fully symmetrical in the case of
> multifd - the recv side is more IO-driven, e.g., QEMU reacts based on what
> it receives (which was encoded in the headers of the received packets).
>
> The src is more of a generic consumer / producer model where threads can
> enqueue tasks / data to different iochannels.
Currently, the best case happens if both sides are I/O bound with respect
to the VFIO devices - reading device state from the source device as
fast as it can produce it and loading device state into the target device
as fast as it can consume it.
These devices aren't normally seriously network bandwidth constrained
here.
>>
>> I was expecting a common interface to minimize assumptions on both
>> ends. It doesn't have to be callback based. It could be a set of
>> services a subsystem could use to transfer state in parallel.
>> <side note>
>> VFIO migration is driven by numerous callbacks and it is
>> difficult to understand the context in which these are called.
>> Adding more callbacks might not be the best approach.
>> </side note>
>>
>> The other comment was on optimisation. If this is an optimisation
>> then I would expect, first, a non-optimized version not using threads
>> (on the recv side).
>
> As commented in a previous email, I had a feeling that Maciej wanted to
> avoid blocking multifd threads when applying VFIO data chunks to the kernel
> driver, but Maciej please correct me.. I could be wrong.
Yes, we don't want the case that loading device state into one VFIO device
blocks loading such state into another VFIO device and so the second VFIO
device ends up being partially idle during that time.
> To me I think I'm fine even if it blocks multifd threads, as it'll only
> happen with VFIO (we may want to consider basing n_multifd_threads on the
> number of vfio devices then, so we still always have some idle
> threads taking IOs out of the NIC buffers).
The current design uses exactly one loading thread per VFIO device.
> So I agree with Cedric that if we can provide a functional working version
> first, then we can at least start with the simpler approach.
>
>>
>> VFIO Migration is a "new" feature which needs some more run-in.
>> That said, it is stable, MLX5 VFs devices have good support, you
>> can rely on me to evaluate the future respins.
>>
>> Thanks,
>>
>> C.
>>
>
Thanks,
Maciej
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [PATCH v2 09/17] migration/multifd: Device state transfer support - receive side
2024-09-12 13:52 ` Fabiano Rosas
@ 2024-09-19 19:59 ` Maciej S. Szmigiero
0 siblings, 0 replies; 128+ messages in thread
From: Maciej S. Szmigiero @ 2024-09-19 19:59 UTC (permalink / raw)
To: Fabiano Rosas
Cc: Alex Williamson, Avihai Horon, Peter Xu, Cédric Le Goater,
Eric Blake, Markus Armbruster, Daniel P . Berrangé,
Joao Martins, qemu-devel
On 12.09.2024 15:52, Fabiano Rosas wrote:
> Avihai Horon <avihaih@nvidia.com> writes:
>
>> On 09/09/2024 21:05, Maciej S. Szmigiero wrote:
>>> External email: Use caution opening links or attachments
>>>
>>>
>>> On 5.09.2024 18:47, Avihai Horon wrote:
>>>>
>>>> On 27/08/2024 20:54, Maciej S. Szmigiero wrote:
>>>>> External email: Use caution opening links or attachments
>>>>>
>>>>>
>>>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>>>
>>>>> Add a basic support for receiving device state via multifd channels -
>>>>> channels that are shared with RAM transfers.
>>>>>
>>>>> To differentiate between a device state and a RAM packet the packet
>>>>> header is read first.
>>>>>
>>>>> Depending whether MULTIFD_FLAG_DEVICE_STATE flag is present or not
>>>>> in the
>>>>> packet header either device state (MultiFDPacketDeviceState_t) or RAM
>>>>> data (existing MultiFDPacket_t) is then read.
>>>>>
>>>>> The received device state data is provided to
>>>>> qemu_loadvm_load_state_buffer() function for processing in the
>>>>> device's load_state_buffer handler.
>>>>>
>>>>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>>>>> ---
>>>>> migration/multifd.c | 127
>>>>> +++++++++++++++++++++++++++++++++++++-------
>>>>> migration/multifd.h | 31 ++++++++++-
>>>>> 2 files changed, 138 insertions(+), 20 deletions(-)
>>>>>
>>>>> diff --git a/migration/multifd.c b/migration/multifd.c
>>>>> index b06a9fab500e..d5a8e5a9c9b5 100644
>>>>> --- a/migration/multifd.c
>>>>> +++ b/migration/multifd.c
>>>>> @@ -21,6 +21,7 @@
>>>>> #include "file.h"
>>>>> #include "migration.h"
>>>>> #include "migration-stats.h"
>>>>> +#include "savevm.h"
>>>>> #include "socket.h"
>>>>> #include "tls.h"
>>>>> #include "qemu-file.h"
>>>>> @@ -209,10 +210,10 @@ void
>>>>> multifd_send_fill_packet(MultiFDSendParams *p)
>>>>>
>>>>> memset(packet, 0, p->packet_len);
>>>>>
>>>>> - packet->magic = cpu_to_be32(MULTIFD_MAGIC);
>>>>> - packet->version = cpu_to_be32(MULTIFD_VERSION);
>>>>> + packet->hdr.magic = cpu_to_be32(MULTIFD_MAGIC);
>>>>> + packet->hdr.version = cpu_to_be32(MULTIFD_VERSION);
>>>>>
>>>>> - packet->flags = cpu_to_be32(p->flags);
>>>>> + packet->hdr.flags = cpu_to_be32(p->flags);
>>>>> packet->next_packet_size = cpu_to_be32(p->next_packet_size);
>>>>>
>>>>> packet_num = qatomic_fetch_inc(&multifd_send_state->packet_num);
>>>>> @@ -228,31 +229,49 @@ void multifd_send_fill_packet(MultiFDSendParams *p)
>>>>> p->flags, p->next_packet_size);
>>>>> }
>>>>>
>>>>> -static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error **errp)
>>>>> +static int multifd_recv_unfill_packet_header(MultiFDRecvParams *p,
>>>>> + MultiFDPacketHdr_t *hdr,
>>>>> + Error **errp)
>>>>> {
>>>>> - MultiFDPacket_t *packet = p->packet;
>>>>> - int ret = 0;
>>>>> -
>>>>> - packet->magic = be32_to_cpu(packet->magic);
>>>>> - if (packet->magic != MULTIFD_MAGIC) {
>>>>> + hdr->magic = be32_to_cpu(hdr->magic);
>>>>> + if (hdr->magic != MULTIFD_MAGIC) {
>>>>> error_setg(errp, "multifd: received packet "
>>>>> "magic %x and expected magic %x",
>>>>> - packet->magic, MULTIFD_MAGIC);
>>>>> + hdr->magic, MULTIFD_MAGIC);
>>>>> return -1;
>>>>> }
>>>>>
>>>>> - packet->version = be32_to_cpu(packet->version);
>>>>> - if (packet->version != MULTIFD_VERSION) {
>>>>> + hdr->version = be32_to_cpu(hdr->version);
>>>>> + if (hdr->version != MULTIFD_VERSION) {
>>>>> error_setg(errp, "multifd: received packet "
>>>>> "version %u and expected version %u",
>>>>> - packet->version, MULTIFD_VERSION);
>>>>> + hdr->version, MULTIFD_VERSION);
>>>>> return -1;
>>>>> }
>>>>>
>>>>> - p->flags = be32_to_cpu(packet->flags);
>>>>> + p->flags = be32_to_cpu(hdr->flags);
>>>>> +
>>>>> + return 0;
>>>>> +}
>>>>> +
>>>>> +static int multifd_recv_unfill_packet_device_state(MultiFDRecvParams *p,
>>>>> + Error **errp)
>>>>> +{
>>>>> + MultiFDPacketDeviceState_t *packet = p->packet_dev_state;
>>>>> +
>>>>> + packet->instance_id = be32_to_cpu(packet->instance_id);
>>>>> + p->next_packet_size = be32_to_cpu(packet->next_packet_size);
>>>>> +
>>>>> + return 0;
>>>>> +}
>>>>> +
>>>>> +static int multifd_recv_unfill_packet_ram(MultiFDRecvParams *p, Error **errp)
>>>>> +{
>>>>> + MultiFDPacket_t *packet = p->packet;
>>>>> + int ret = 0;
>>>>> +
>>>>> p->next_packet_size = be32_to_cpu(packet->next_packet_size);
>>>>> p->packet_num = be64_to_cpu(packet->packet_num);
>>>>> - p->packets_recved++;
>>>>>
>>>>> if (!(p->flags & MULTIFD_FLAG_SYNC)) {
>>>>> ret = multifd_ram_unfill_packet(p, errp);
>>>>> @@ -264,6 +283,19 @@ static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error **errp)
>>>>> return ret;
>>>>> }
>>>>>
>>>>> +static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error **errp)
>>>>> +{
>>>>> + p->packets_recved++;
>>>>> +
>>>>> + if (p->flags & MULTIFD_FLAG_DEVICE_STATE) {
>>>>> + return multifd_recv_unfill_packet_device_state(p, errp);
>>>>> + } else {
>>>>> + return multifd_recv_unfill_packet_ram(p, errp);
>>>>> + }
>>>>> +
>>>>> + g_assert_not_reached();
>>>>
>>>> We can drop the assert and the "else":
>>>> if (p->flags & MULTIFD_FLAG_DEVICE_STATE) {
>>>> return multifd_recv_unfill_packet_device_state(p, errp);
>>>> }
>>>>
>>>> return multifd_recv_unfill_packet_ram(p, errp);
>>>
>>> Ack.
>>>
>>>>> +}
>>>>> +
>>>>> static bool multifd_send_should_exit(void)
>>>>> {
>>>>> return qatomic_read(&multifd_send_state->exiting);
>>>>> diff --git a/migration/multifd.h b/migration/multifd.h
>>>>> index a3e35196d179..a8f3e4838c01 100644
>>>>> --- a/migration/multifd.h
>>>>> +++ b/migration/multifd.h
>>>>> @@ -45,6 +45,12 @@ MultiFDRecvData *multifd_get_recv_data(void);
>>>>> #define MULTIFD_FLAG_QPL (4 << 1)
>>>>> #define MULTIFD_FLAG_UADK (8 << 1)
>>>>>
>>>>> +/*
>>>>> + * If set it means that this packet contains device state
>>>>> + * (MultiFDPacketDeviceState_t), not RAM data (MultiFDPacket_t).
>>>>> + */
>>>>> +#define MULTIFD_FLAG_DEVICE_STATE (1 << 4)
>>>>> +
>>>>> /* This value needs to be a multiple of qemu_target_page_size() */
>>>>> #define MULTIFD_PACKET_SIZE (512 * 1024)
>>>>>
>>>>> @@ -52,6 +58,11 @@ typedef struct {
>>>>> uint32_t magic;
>>>>> uint32_t version;
>>>>> uint32_t flags;
>>>>> +} __attribute__((packed)) MultiFDPacketHdr_t;
>>>>
>>>> Maybe split this patch into two: one that adds the packet header
>>>> concept and another that adds the new device packet?
>>>
>>> Can do.
>>>
>>>>> +
>>>>> +typedef struct {
>>>>> + MultiFDPacketHdr_t hdr;
>>>>> +
>>>>> /* maximum number of allocated pages */
>>>>> uint32_t pages_alloc;
>>>>> /* non zero pages */
>>>>> @@ -72,6 +83,16 @@ typedef struct {
>>>>> uint64_t offset[];
>>>>> } __attribute__((packed)) MultiFDPacket_t;
>>>>>
>>>>> +typedef struct {
>>>>> + MultiFDPacketHdr_t hdr;
>>>>> +
>>>>> + char idstr[256] QEMU_NONSTRING;
>>>>
>>>> idstr should be null terminated, or am I missing something?
>>>
>>> There's no need to always NULL-terminate a constant-size field,
>>> since the strncpy() already stops at the field size, so we can
>>> gain another byte for actual string use this way.
>>>
>>> RAM block idstr also uses the same "trick":
>>>> void multifd_ram_fill_packet(MultiFDSendParams *p):
>>>> strncpy(packet->ramblock, pages->block->idstr, 256);
>>>
>> But can idstr actually be 256 bytes long without a null byte?
>> There are a lot of places where idstr is a parameter for functions that
>> expect a null-terminated string, and it is also printed as such.
>
> Yeah, and I actually don't see the "trick" being used in
> RAMBlock. Anyway, it's best to null terminate to be more predictable. We
> also had Coverity reports about similar things:
>
> https://lore.kernel.org/r/CAFEAcA_F2qrSAacY=V5Hez1qFGuNW0-XqL2LQ=Y_UKYuHEJWhw@mail.gmail.com
That's because MultiFDPacket_t::ramblock is missing the QEMU_NONSTRING
annotation (which the proposed MultiFDPacketDeviceState_t::idstr has),
so Coverity assumes it was meant to be a NUL-terminated C string.
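For reference, the pattern under discussion is roughly the following (the
receive-side part is a sketch, not code from the patch):

    char idstr[256] QEMU_NONSTRING;  /* may legitimately lack a NUL */

    /* Send side: strncpy() NUL-pads shorter strings but does not
     * terminate one that fills all 256 bytes: */
    strncpy(packet->idstr, idstr_src, sizeof(packet->idstr));

    /* Receive side: force termination into a local buffer before
     * treating the field as a C string: */
    char idstr_buf[sizeof(packet->idstr) + 1];
    memcpy(idstr_buf, packet->idstr, sizeof(packet->idstr));
    idstr_buf[sizeof(packet->idstr)] = '\0';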
> I haven't got the time to send that patch yet.
Thanks,
Maciej
* Re: [PATCH v2 06/17] migration: Add save_live_complete_precopy_{begin,end} handlers
2024-09-19 19:47 ` Maciej S. Szmigiero
@ 2024-09-19 20:54 ` Peter Xu
2024-09-20 15:22 ` Maciej S. Szmigiero
0 siblings, 1 reply; 128+ messages in thread
From: Peter Xu @ 2024-09-19 20:54 UTC (permalink / raw)
To: Maciej S. Szmigiero
Cc: Avihai Horon, Fabiano Rosas, Alex Williamson,
Cédric Le Goater, Eric Blake, Markus Armbruster,
Daniel P. Berrangé, Joao Martins, qemu-devel
On Thu, Sep 19, 2024 at 09:47:53PM +0200, Maciej S. Szmigiero wrote:
> On 9.09.2024 21:08, Peter Xu wrote:
> > On Mon, Sep 09, 2024 at 08:32:45PM +0200, Maciej S. Szmigiero wrote:
> > > On 9.09.2024 19:59, Peter Xu wrote:
> > > > On Thu, Sep 05, 2024 at 04:45:48PM +0300, Avihai Horon wrote:
> > > > >
> > > > > On 27/08/2024 20:54, Maciej S. Szmigiero wrote:
> > > > > > From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> > > > > >
> > > > > > These SaveVMHandlers help a device provide its own asynchronous
> > > > > > transmission of the remaining data at the end of a precopy phase.
> > > > > >
> > > > > > In this use case the save_live_complete_precopy_begin handler might
> > > > > > be used to mark the stream boundary before proceeding with asynchronous
> > > > > > transmission of the remaining data while the
> > > > > > save_live_complete_precopy_end handler might be used to mark the
> > > > > > stream boundary after performing the asynchronous transmission.
> > > > > >
> > > > > > Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> > > > > > ---
> > > > > > include/migration/register.h | 36 ++++++++++++++++++++++++++++++++++++
> > > > > > migration/savevm.c | 35 +++++++++++++++++++++++++++++++++++
> > > > > > 2 files changed, 71 insertions(+)
> > > > > >
> > > > > > diff --git a/include/migration/register.h b/include/migration/register.h
> > > > > > index f60e797894e5..9de123252edf 100644
> > > > > > --- a/include/migration/register.h
> > > > > > +++ b/include/migration/register.h
> > > > > > @@ -103,6 +103,42 @@ typedef struct SaveVMHandlers {
> > > > > > */
> > > > > > int (*save_live_complete_precopy)(QEMUFile *f, void *opaque);
> > > > > >
> > > > > > + /**
> > > > > > + * @save_live_complete_precopy_begin
> > > > > > + *
> > > > > > + * Called at the end of a precopy phase, before all
> > > > > > + * @save_live_complete_precopy handlers and before launching
> > > > > > + * all @save_live_complete_precopy_thread threads.
> > > > > > + * The handler might, for example, mark the stream boundary before
> > > > > > + * proceeding with asynchronous transmission of the remaining data via
> > > > > > + * @save_live_complete_precopy_thread.
> > > > > > + * When postcopy is enabled, devices that support postcopy will skip this step.
> > > > > > + *
> > > > > > + * @f: QEMUFile where the handler can synchronously send data before returning
> > > > > > + * @idstr: this device section idstr
> > > > > > + * @instance_id: this device section instance_id
> > > > > > + * @opaque: data pointer passed to register_savevm_live()
> > > > > > + *
> > > > > > + * Returns zero to indicate success and negative for error
> > > > > > + */
> > > > > > + int (*save_live_complete_precopy_begin)(QEMUFile *f,
> > > > > > + char *idstr, uint32_t instance_id,
> > > > > > + void *opaque);
> > > > > > + /**
> > > > > > + * @save_live_complete_precopy_end
> > > > > > + *
> > > > > > + * Called at the end of a precopy phase, after @save_live_complete_precopy
> > > > > > + * handlers and after all @save_live_complete_precopy_thread threads have
> > > > > > + * finished. When postcopy is enabled, devices that support postcopy will
> > > > > > + * skip this step.
> > > > > > + *
> > > > > > + * @f: QEMUFile where the handler can synchronously send data before returning
> > > > > > + * @opaque: data pointer passed to register_savevm_live()
> > > > > > + *
> > > > > > + * Returns zero to indicate success and negative for error
> > > > > > + */
> > > > > > + int (*save_live_complete_precopy_end)(QEMUFile *f, void *opaque);
> > > > >
> > > > > Is this handler necessary now that migration core is responsible for the
> > > > > threads and joins them? I don't see VFIO implementing it later on.
> > > >
> > > > Right, I spot the same thing.
> > > >
> > > > This series added three hooks: begin, end, precopy_thread.
> > > >
> > > > What I think is it only needs one, which is precopy_async. My vague memory
> > > > was that was what we used to discuss too, so that when migration precopy
> > > > flushes the final round of iterable data, it does:
> > > >
> > > > (1) loop over all complete_precopy_async() and enqueue the tasks if
> > > > existed into the migration worker pool. Then,
> > > >
> > > > (2) loop over all complete_precopy() like before.
> > > >
> > > > Optionally, we can enforce one vmstate handler only provides either
> > > > complete_precopy_async() or complete_precopy(). In this case VFIO can
> > > > update the two hooks during setup() by detecting multifd && !mapped_ram &&
> > > > nocomp.
> > > >
> > >
> > > The "_begin" hook is still necessary to mark the end of the device state
> > > sent via the main migration stream (during the phase when the VM is still
> > > running) since we can't start loading the multifd-sent device state until
> > > all of that earlier data finishes loading first.
> >
> > Ah I remembered some more now, thanks.
> >
> > If vfio can send data during iterations this new hook will also not be
> > needed, right?
> >
> > I remember you mentioned you'd have a look and see the challenges there, is
> > there any conclusion yet on whether we can use multifd even during that?
>
> Yeah, I looked at that and it wasn't a straightforward thing to introduce.
>
> I am worried that with all the things that have already piled up (including
> the new thread pool implementation) we risk missing QEMU 9.2 too if this is
> included.
Not explicitly required, but IMHO it'll be nice to provide a paragraph in
the new version when reposting, explaining the challenges of using it during
iterations. It'll be useful not only for me but for whoever may want to
extend your solution to iterations.
I asked this question again mostly because I found that with iteration
support the design of begin() looks simpler, so the extra sync is not
needed. But I confess you know better than me, so whatever you think best
is ok here.
>
> > It's also a pity that we introduce this hook only because we want a
> > boundary between "iterable stage" and "final stage". IIUC if we have any
> > kind of message telling dest beforehand that "we're going to the last
> > stage" then this hook can be avoided. Now it's at least inefficient
> > because we need to trigger begin() per-device, even if I think it's more of
> > a global request saying that "we need to load all main stream data first
> > before moving on".
>
> It should be pretty easy to remove that begin() hook once it is no longer
> needed - after all, it's only necessary for the sender.
Do you mean you have a plan to remove the begin() hook even without making
iterate() work too? That's definitely nice if so.
>
> > >
> > > We shouldn't send that boundary marker in .save_live_complete_precopy
> > > either since it would mean unnecessarily waiting for other devices'
> > > (not necessarily VFIO ones) .save_live_complete_precopy bulk data.
> > >
> > > And VFIO SaveVMHandlers are shared for all VFIO devices (and const) so
> > > we can't really change them at runtime.
> >
> > In all cases, please consider dropping end() if it's never used; IMO it's
> > fine if there is only begin(), and we shouldn't keep hooks that are never
> > used.
>
> Okay, will remove the end() hook then.
>
> > Thanks,
> >
>
> Thanks,
> Maciej
>
--
Peter Xu
* Re: [PATCH v2 08/17] migration: Add load_finish handler and associated functions
2024-09-19 19:49 ` Maciej S. Szmigiero
@ 2024-09-19 21:11 ` Peter Xu
2024-09-20 15:23 ` Maciej S. Szmigiero
0 siblings, 1 reply; 128+ messages in thread
From: Peter Xu @ 2024-09-19 21:11 UTC (permalink / raw)
To: Maciej S. Szmigiero
Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P. Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On Thu, Sep 19, 2024 at 09:49:10PM +0200, Maciej S. Szmigiero wrote:
> On 9.09.2024 22:03, Peter Xu wrote:
> > On Tue, Aug 27, 2024 at 07:54:27PM +0200, Maciej S. Szmigiero wrote:
> > > From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> > >
> > > load_finish SaveVMHandler allows the migration code to poll whether
> > > a device-specific asynchronous device state loading operation has finished.
> > >
> > > In order to avoid calling this handler needlessly the device is supposed
> > > to notify the migration code of its possible readiness via a call to
> > > qemu_loadvm_load_finish_ready_broadcast() while holding
> > > qemu_loadvm_load_finish_ready_lock.
> > >
> > > Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> > > ---
> > > include/migration/register.h | 21 +++++++++++++++
> > > migration/migration.c | 6 +++++
> > > migration/migration.h | 3 +++
> > > migration/savevm.c | 52 ++++++++++++++++++++++++++++++++++++
> > > migration/savevm.h | 4 +++
> > > 5 files changed, 86 insertions(+)
> > >
> > > diff --git a/include/migration/register.h b/include/migration/register.h
> > > index 4a578f140713..44d8cf5192ae 100644
> > > --- a/include/migration/register.h
> > > +++ b/include/migration/register.h
> > > @@ -278,6 +278,27 @@ typedef struct SaveVMHandlers {
> > > int (*load_state_buffer)(void *opaque, char *data, size_t data_size,
> > > Error **errp);
> > > + /**
> > > + * @load_finish
> > > + *
> > > + * Poll whether all asynchronous device state loading had finished.
> > > + * Not called on the load failure path.
> > > + *
> > > + * Called while holding the qemu_loadvm_load_finish_ready_lock.
> > > + *
> > > + * If this method signals "not ready" then it might not be called
> > > + * again until qemu_loadvm_load_finish_ready_broadcast() is invoked
> > > + * while holding qemu_loadvm_load_finish_ready_lock.
> >
> > [1]
> >
> > > + *
> > > + * @opaque: data pointer passed to register_savevm_live()
> > > + * @is_finished: whether the loading had finished (output parameter)
> > > + * @errp: pointer to Error*, to store an error if it happens.
> > > + *
> > > + * Returns zero to indicate success and negative for error
> > > + * It's not an error that the loading still hasn't finished.
> > > + */
> > > + int (*load_finish)(void *opaque, bool *is_finished, Error **errp);
> >
> > The load_finish() semantics is a bit weird, especially above [1] on "only
> > allowed to be called once if ..." and also on the locks.
>
> The point of this remark is that a driver needs to call
> qemu_loadvm_load_finish_ready_broadcast() if it wants the migration
> core to call its load_finish handler again.
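In other words, the migration core is expected to sit in a poll loop roughly
like the one below (a sketch of the semantics described above only; "se" and
"local_err" are as usual in savevm.c, and the wait helper name is an
assumption, not the proposed API):

    bool device_load_finished = false;

    qemu_loadvm_load_finish_ready_lock();
    while (!device_load_finished) {
        if (se->ops->load_finish(se->opaque, &device_load_finished,
                                 &local_err)) {
            break;  /* load error */
        }
        if (!device_load_finished) {
            /* sleeps until a device thread calls
             * qemu_loadvm_load_finish_ready_broadcast() */
            qemu_loadvm_load_finish_ready_wait();  /* assumed name */
        }
    }
    qemu_loadvm_load_finish_ready_unlock();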
>
> > It looks to me vfio_load_finish() also does the final load of the device.
> >
> > I wonder whether that final load can be done in the threads,
>
> Here, the problem is that the current VFIO VMState has to be loaded from the
> main migration thread as it internally calls QEMU core address space
> modification methods which explode if called from other threads.
Ahh, I see. I'm trying to make dest QEMU do loadvm in a thread too and yield
the BQL if possible; when that's ready, then in your case here IIUC you can
simply take the BQL in whichever thread loads it.. but yeah it's not ready
at least..
Would it be possible for vfio_save_complete_precopy_async_thread_config_state()
to be done in VFIO's save_live_complete_precopy() through the main channel
somehow? IOW, does it rely on iterative data to be fetched first from the
kernel, or is it a completely separate state? And just curious: how large is it
normally (and I suppose this decides whether it's applicable to be sent via
the main channel at all..)?
>
> > then after
> > everything is loaded the device posts a semaphore telling the main thread to
> > continue. See e.g.:
> >
> > if (migrate_switchover_ack()) {
> > qemu_loadvm_state_switchover_ack_needed(mis);
> > }
> >
> > IIUC, VFIO can register load_complete_ack similarly so it only sem_post()
> > when all things are loaded? We can then get rid of this slightly awkward
> > interface. I had a feeling that things can be simplified (e.g., if the
> > thread will take care of loading the final vmstate then the mutex is also
> > not needed? etc.).
>
> With just a single call to switchover_ack_needed per VFIO device it would
> need to do a blocking wait for the device buffers and config state load
> to finish, therefore blocking other VFIO devices from potentially loading
> their config state if they are ready to begin this operation earlier.
I am not sure I get you here: loading VFIO device states (I mean, the
non-iterable part) will need to be done sequentially IIUC due to what you
said and should rely on the BQL, so I don't know how that could happen
concurrently for now. But I think indeed the BQL is a problem.
So IMHO this recv side interface so far is the major pain that I really
want to avoid (compared to the rest) in the series. Let's see whether we
can come up with something better..
One other (probably not pretty..) idea is that, when waiting here, the main
thread yields the BQL, then other threads can take it and load the final
chunk of VFIO data. But I could be missing something else.
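I.e., something like this (a rough sketch of that idea, assuming the current
bql_lock()/bql_unlock() API; the semaphore name is made up):

    /* main migration thread, waiting for the device load threads: */
    bql_unlock();                            /* let loaders take the BQL */
    qemu_sem_wait(&mis->load_complete_sem);  /* posted by the last loader */
    bql_lock();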
--
Peter Xu
* Re: [PATCH v2 12/17] migration/multifd: Device state transfer support - send side
2024-09-19 19:49 ` Maciej S. Szmigiero
@ 2024-09-19 21:17 ` Peter Xu
2024-09-20 15:23 ` Maciej S. Szmigiero
0 siblings, 1 reply; 128+ messages in thread
From: Peter Xu @ 2024-09-19 21:17 UTC (permalink / raw)
To: Maciej S. Szmigiero
Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P. Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On Thu, Sep 19, 2024 at 09:49:43PM +0200, Maciej S. Szmigiero wrote:
> On 10.09.2024 21:48, Peter Xu wrote:
> > On Wed, Aug 28, 2024 at 09:41:17PM -0300, Fabiano Rosas wrote:
> > > > +size_t multifd_device_state_payload_size(void)
> > > > +{
> > > > + return sizeof(MultiFDDeviceState_t);
> > > > +}
> > >
> > > This will not be necessary because the payload size is the same as the
> > > data type. We only need it for the special case where the MultiFDPages_t
> > > is smaller than the total ram payload size.
> >
> > Today I was thinking maybe we should really clean this up, as the current
> > multifd_send_data_alloc() is indeed too tricky (blame me.. who requested
> > that more or less). Knowing that VFIO can use dynamic buffers with ->idstr
> > and ->buf (I was thinking it could be buf[1M].. but I was wrong...) made
> > that feeling stronger.
> >
> > I think we should change it now perhaps, otherwise we'll need to introduce
> > other helpers to e.g. reset the device buffers, and that's not only slow
> > but also not good looking, IMO.
> >
> > So I went ahead with the idea in previous discussion, that I managed to
> > change the SendData union into struct; the memory consumption is not super
> > important yet, IMHO, but we should still stick with the object model where
> > multifd enqueue thread switch buffer with multifd, as it still sounds a
> > sane way to do.
> >
> > Then when that patch is ready, I further tried to make VFIO reuse multifd
> > buffers just like what we do with MultiFDPages_t->offset[]: in RAM code we
> > don't allocate it every time we enqueue.
> >
> > I hope it'll also work for VFIO. VFIO is special in being able to
> > dump the config space, so it's more complex (and I noticed Maciej's current
> > design requires the final chunk of VFIO config data be migrated in one
> > packet.. that is also part of the complexity there). So I allowed that
> > part to allocate a buffer but only that. IOW, I made some API (see below)
> > that can either reuse preallocated buffer, or use a separate one only for
> > the final bulk.
> >
> > In short, could both of you have a look at what I came up with below? I
> > did that in patches because I think it's too much to comment, so patches
> > may work better. No concern if any of below could be good changes to you,
> > then either Maciej can squash whatever into existing patches (and I feel
> > like some existing patches in this series can go away with below design),
> > or I can post pre-requisite patch but only if any of you prefer that.
> >
> > Anyway, let me know, the patches apply on top of this whole series applied
> > first.
> >
> > I also wonder whether there can be any perf difference already (I tested
> > all multifd qtest with below, but no VFIO I can run), perhaps not that
> > much, but just to mention below should avoid both buffer allocations and
> > one round of copy (so VFIO read() directly writes to the multifd buffers
> > now).
>
> I am not against making MultiFDSendData a struct and maybe introducing
> some pre-allocated buffer.
>
> But to be honest, that manual memory management with having to remember
> to call multifd_device_state_finish() on error paths as in your
> proposed patch 3 really invites memory leaks.
>
> Will think about some other way to have a reusable buffer.
Sure. That's patch 3, and I suppose then it looks like patch 1 is still
OK in one way or another.
>
> In terms of not making an idstr copy (your proposed patch 2) I am not
> 100% sure that avoiding such a tiny allocation really justifies the risk
> of a possible use-after-free of a dangling pointer.
Why is there a risk? Someone strdup() on the stack? That only goes via VFIO
itself, so I thought it wasn't that complicated. But yeah, as I said, this
part (patch 2) is optional.
> Not 100% against it either if you are confident that it will never happen.
>
> By the way, I guess it makes sense to carry these changes in the main patch
> set rather than as a separate changes?
Whatever you prefer.
I wrote those patches only because I thought maybe you'd like to run some
perf test to see whether they would help at all, and when the patches are
there it'll be much easier for you; then you can decide whether it's worth
integrating already, or leave that for later.
If not I'd say they're even lower priority, so feel free to stick with
whatever is easier for you. I'm ok there.
However, it'll always be good if we can still have patch 1 as I mentioned
before (as part of your series, if you don't disagree), to make the
SendData interface slightly cleaner and easier to follow.
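For reference, the patch 1 change mentioned here essentially turns the
payload union into a struct, along the lines of (a sketch):

    typedef struct {
        MultiFDPayloadType type;
        MultiFDPages_t ram;                 /* preallocated and reused */
        MultiFDDeviceState_t device_state;  /* reused between enqueues */
    } MultiFDSendData;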
--
Peter Xu
* Re: [PATCH v2 12/17] migration/multifd: Device state transfer support - send side
2024-09-19 19:49 ` Maciej S. Szmigiero
@ 2024-09-19 21:18 ` Peter Xu
0 siblings, 0 replies; 128+ messages in thread
From: Peter Xu @ 2024-09-19 21:18 UTC (permalink / raw)
To: Maciej S. Szmigiero
Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P. Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On Thu, Sep 19, 2024 at 09:49:57PM +0200, Maciej S. Szmigiero wrote:
> On 10.09.2024 18:06, Peter Xu wrote:
> > On Tue, Aug 27, 2024 at 07:54:31PM +0200, Maciej S. Szmigiero wrote:
> > > +bool multifd_queue_device_state(char *idstr, uint32_t instance_id,
> > > + char *data, size_t len)
> > > +{
> > > + /* Device state submissions can come from multiple threads */
> > > + QEMU_LOCK_GUARD(&queue_job_mutex);
> >
> > Ah, just notice there's the mutex.
> >
> > So please consider the reply in the other thread, IIUC we can make it for
> > multifd_send() to be a generic mutex to simplify the other patch too, then
> > drop here.
> >
> > I assume the ram code should be fine taking one more mutex even without
> > vfio, if it only takes once for each ~128 pages to enqueue, and only take
> > in the main thread, then each update should be also in the hot path
> > (e.g. no cache bouncing).
> >
>
> Will check whether it is possible to use a common mutex here for both RAM
> and device state submission without a drop in performance.
Thanks.
--
Peter Xu
* Re: [PATCH v2 06/17] migration: Add save_live_complete_precopy_{begin,end} handlers
2024-09-19 20:54 ` Peter Xu
@ 2024-09-20 15:22 ` Maciej S. Szmigiero
2024-09-20 16:08 ` Peter Xu
0 siblings, 1 reply; 128+ messages in thread
From: Maciej S. Szmigiero @ 2024-09-20 15:22 UTC (permalink / raw)
To: Peter Xu
Cc: Avihai Horon, Fabiano Rosas, Alex Williamson,
Cédric Le Goater, Eric Blake, Markus Armbruster,
Daniel P. Berrangé, Joao Martins, qemu-devel
On 19.09.2024 22:54, Peter Xu wrote:
> On Thu, Sep 19, 2024 at 09:47:53PM +0200, Maciej S. Szmigiero wrote:
>> On 9.09.2024 21:08, Peter Xu wrote:
>>> On Mon, Sep 09, 2024 at 08:32:45PM +0200, Maciej S. Szmigiero wrote:
>>>> On 9.09.2024 19:59, Peter Xu wrote:
>>>>> On Thu, Sep 05, 2024 at 04:45:48PM +0300, Avihai Horon wrote:
>>>>>>
>>>>>> On 27/08/2024 20:54, Maciej S. Szmigiero wrote:
>>>>>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>>>>>
>>>>>>> These SaveVMHandlers help a device provide its own asynchronous
>>>>>>> transmission of the remaining data at the end of a precopy phase.
>>>>>>>
>>>>>>> In this use case the save_live_complete_precopy_begin handler might
>>>>>>> be used to mark the stream boundary before proceeding with asynchronous
>>>>>>> transmission of the remaining data while the
>>>>>>> save_live_complete_precopy_end handler might be used to mark the
>>>>>>> stream boundary after performing the asynchronous transmission.
>>>>>>>
>>>>>>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>>>>>>> ---
>>>>>>> include/migration/register.h | 36 ++++++++++++++++++++++++++++++++++++
>>>>>>> migration/savevm.c | 35 +++++++++++++++++++++++++++++++++++
>>>>>>> 2 files changed, 71 insertions(+)
>>>>>>>
>>>>>>> diff --git a/include/migration/register.h b/include/migration/register.h
>>>>>>> index f60e797894e5..9de123252edf 100644
>>>>>>> --- a/include/migration/register.h
>>>>>>> +++ b/include/migration/register.h
>>>>>>> @@ -103,6 +103,42 @@ typedef struct SaveVMHandlers {
>>>>>>> */
>>>>>>> int (*save_live_complete_precopy)(QEMUFile *f, void *opaque);
>>>>>>>
>>>>>>> + /**
>>>>>>> + * @save_live_complete_precopy_begin
>>>>>>> + *
>>>>>>> + * Called at the end of a precopy phase, before all
>>>>>>> + * @save_live_complete_precopy handlers and before launching
>>>>>>> + * all @save_live_complete_precopy_thread threads.
>>>>>>> + * The handler might, for example, mark the stream boundary before
>>>>>>> + * proceeding with asynchronous transmission of the remaining data via
>>>>>>> + * @save_live_complete_precopy_thread.
>>>>>>> + * When postcopy is enabled, devices that support postcopy will skip this step.
>>>>>>> + *
>>>>>>> + * @f: QEMUFile where the handler can synchronously send data before returning
>>>>>>> + * @idstr: this device section idstr
>>>>>>> + * @instance_id: this device section instance_id
>>>>>>> + * @opaque: data pointer passed to register_savevm_live()
>>>>>>> + *
>>>>>>> + * Returns zero to indicate success and negative for error
>>>>>>> + */
>>>>>>> + int (*save_live_complete_precopy_begin)(QEMUFile *f,
>>>>>>> + char *idstr, uint32_t instance_id,
>>>>>>> + void *opaque);
>>>>>>> + /**
>>>>>>> + * @save_live_complete_precopy_end
>>>>>>> + *
>>>>>>> + * Called at the end of a precopy phase, after @save_live_complete_precopy
>>>>>>> + * handlers and after all @save_live_complete_precopy_thread threads have
>>>>>>> + * finished. When postcopy is enabled, devices that support postcopy will
>>>>>>> + * skip this step.
>>>>>>> + *
>>>>>>> + * @f: QEMUFile where the handler can synchronously send data before returning
>>>>>>> + * @opaque: data pointer passed to register_savevm_live()
>>>>>>> + *
>>>>>>> + * Returns zero to indicate success and negative for error
>>>>>>> + */
>>>>>>> + int (*save_live_complete_precopy_end)(QEMUFile *f, void *opaque);
>>>>>>
>>>>>> Is this handler necessary now that migration core is responsible for the
>>>>>> threads and joins them? I don't see VFIO implementing it later on.
>>>>>
>>>>> Right, I spot the same thing.
>>>>>
>>>>> This series added three hooks: begin, end, precopy_thread.
>>>>>
>>>>> What I think is it only needs one, which is precopy_async. My vague memory
>>>>> was that was what we used to discuss too, so that when migration precopy
>>>>> flushes the final round of iterable data, it does:
>>>>>
>>>>> (1) loop over all complete_precopy_async() and enqueue the tasks if
>>>>> existed into the migration worker pool. Then,
>>>>>
>>>>> (2) loop over all complete_precopy() like before.
>>>>>
>>>>> Optionally, we can enforce one vmstate handler only provides either
>>>>> complete_precopy_async() or complete_precopy(). In this case VFIO can
>>>>> update the two hooks during setup() by detecting multifd && !mapped_ram &&
>>>>> nocomp.
>>>>>
>>>>
>>>> The "_begin" hook is still necessary to mark the end of the device state
>>>> sent via the main migration stream (during the phase when the VM is still
>>>> running) since we can't start loading the multifd-sent device state until
>>>> all of that earlier data finishes loading first.
>>>
>>> Ah I remembered some more now, thanks.
>>>
>>> If vfio can send data during iterations this new hook will also not be
>>> needed, right?
>>>
>>> I remember you mentioned you'd have a look and see the challenges there, is
>>> there any conclusion yet on whether we can use multifd even during that?
>>
>> Yeah, I looked at that and it wasn't a straightforward thing to introduce.
>>
>> I am worried that with all the things that have already piled up (including
>> the new thread pool implementation) we risk missing QEMU 9.2 too if this is
>> included.
>
> Not explicitly required, but IMHO it'll be nice to provide a paragraph in
> the new version when reposting, explaining the challenges of using it during
> iterations. It'll be useful not only for me but for whoever may want to
> extend your solution to iterations.
Will do.
> I asked this question again mostly because I found that with iteration
> support the design of begin() looks simpler, so the extra sync is not
> needed. But I confess you know better than me, so whatever you think best
> is ok here.
If we make the MIG_CMD_SWITCHOVER / QEMU_VM_COMMAND thing common to all
devices then we don't need begin() even without live-phase multifd
device state transfer.
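On the send side this could look roughly like the following (a sketch only;
MIG_CMD_SWITCHOVER_START and the helper below are hypothetical at this point
in the discussion):

    /*
     * Source: emit one global marker before the final-stage device data,
     * replacing the per-device begin() hook.
     */
    static void qemu_savevm_send_switchover_start(QEMUFile *f)
    {
        /* a QEMU_VM_COMMAND sub-command with no payload */
        qemu_savevm_command_send(f, MIG_CMD_SWITCHOVER_START, 0, NULL);
    }

The destination would then make sure all main-stream device state has been
loaded upon seeing the marker, before applying any multifd-received device
state buffers.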
>>
>>> It's also a pity that we introduce this hook only because we want a
>>> boundary between "iterable stage" and "final stage". IIUC if we have any
>>> kind of message telling dest beforehand that "we're going to the last
>>> stage" then this hook can be avoided. Now it's at least inefficient
>>> because we need to trigger begin() per-device, even if I think it's more of
>>> a global request saying that "we need to load all main stream data first
>>> before moving on".
>>
>> It should be pretty easy to remove that begin() hook once it is no longer
>> needed - after all, it's only necessary for the sender.
>
> Do you mean you have a plan to remove the begin() hook even without making
> iterate() work too? That's definitely nice if so.
As I wrote above, I think with MIG_CMD_SWITCHOVER it shouldn't be needed?
Thanks,
Maciej
* Re: [PATCH v2 08/17] migration: Add load_finish handler and associated functions
2024-09-19 21:11 ` Peter Xu
@ 2024-09-20 15:23 ` Maciej S. Szmigiero
2024-09-20 16:45 ` Peter Xu
0 siblings, 1 reply; 128+ messages in thread
From: Maciej S. Szmigiero @ 2024-09-20 15:23 UTC (permalink / raw)
To: Peter Xu
Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P. Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On 19.09.2024 23:11, Peter Xu wrote:
> On Thu, Sep 19, 2024 at 09:49:10PM +0200, Maciej S. Szmigiero wrote:
>> On 9.09.2024 22:03, Peter Xu wrote:
>>> On Tue, Aug 27, 2024 at 07:54:27PM +0200, Maciej S. Szmigiero wrote:
>>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>>
>>>> load_finish SaveVMHandler allows the migration code to poll whether
>>>> a device-specific asynchronous device state loading operation has finished.
>>>>
>>>> In order to avoid calling this handler needlessly the device is supposed
>>>> to notify the migration code of its possible readiness via a call to
>>>> qemu_loadvm_load_finish_ready_broadcast() while holding
>>>> qemu_loadvm_load_finish_ready_lock.
>>>>
>>>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>>>> ---
>>>> include/migration/register.h | 21 +++++++++++++++
>>>> migration/migration.c | 6 +++++
>>>> migration/migration.h | 3 +++
>>>> migration/savevm.c | 52 ++++++++++++++++++++++++++++++++++++
>>>> migration/savevm.h | 4 +++
>>>> 5 files changed, 86 insertions(+)
>>>>
>>>> diff --git a/include/migration/register.h b/include/migration/register.h
>>>> index 4a578f140713..44d8cf5192ae 100644
>>>> --- a/include/migration/register.h
>>>> +++ b/include/migration/register.h
>>>> @@ -278,6 +278,27 @@ typedef struct SaveVMHandlers {
>>>> int (*load_state_buffer)(void *opaque, char *data, size_t data_size,
>>>> Error **errp);
>>>> + /**
>>>> + * @load_finish
>>>> + *
>>>> + * Poll whether all asynchronous device state loading had finished.
>>>> + * Not called on the load failure path.
>>>> + *
>>>> + * Called while holding the qemu_loadvm_load_finish_ready_lock.
>>>> + *
>>>> + * If this method signals "not ready" then it might not be called
>>>> + * again until qemu_loadvm_load_finish_ready_broadcast() is invoked
>>>> + * while holding qemu_loadvm_load_finish_ready_lock.
>>>
>>> [1]
>>>
>>>> + *
>>>> + * @opaque: data pointer passed to register_savevm_live()
>>>> + * @is_finished: whether the loading had finished (output parameter)
>>>> + * @errp: pointer to Error*, to store an error if it happens.
>>>> + *
>>>> + * Returns zero to indicate success and negative for error
>>>> + * It's not an error that the loading still hasn't finished.
>>>> + */
>>>> + int (*load_finish)(void *opaque, bool *is_finished, Error **errp);
>>>
>>> The load_finish() semantics is a bit weird, especially above [1] on "only
>>> allowed to be called once if ..." and also on the locks.
>>
>> The point of this remark is that a driver needs to call
>> qemu_loadvm_load_finish_ready_broadcast() if it wants the migration
>> core to call its load_finish handler again.
>>
>>> It looks to me vfio_load_finish() also does the final load of the device.
>>>
>>> I wonder whether that final load can be done in the threads,
>>
>> Here, the problem is that the current VFIO VMState has to be loaded from the
>> main migration thread as it internally calls QEMU core address space
>> modification methods which explode if called from other threads.
>
> Ahh, I see. I'm trying to make dest QEMU do loadvm in a thread too and yield
> the BQL if possible; when that's ready, then in your case here IIUC you can
> simply take the BQL in whichever thread loads it.. but yeah it's not ready
> at least..
Yeah, long term we might want to work on making these QEMU core address space
modification methods somehow callable from multiple threads but that's
definitely not something for the initial patch set.
> Would it be possible for vfio_save_complete_precopy_async_thread_config_state()
> to be done in VFIO's save_live_complete_precopy() through the main channel
> somehow? IOW, does it rely on iterative data to be fetched first from the
> kernel, or is it a completely separate state?
The device state data needs to be fully loaded first before "activating"
the device by loading its config state.
> And just curious: how large is it
> normally (and I suppose this decides whether it's applicable to be sent via
> the main channel at all..)?
Config data is *much* smaller than device state data - as far as I remember
it was on the order of kilobytes.
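So the required per-device ordering is roughly (a sketch; helper names are
hypothetical):

    /* bulk device state, potentially large, arrives via multifd: */
    wait_for_all_state_buffers(vbasedev);
    /*
     * Only then "activate" the device by loading its (small) config
     * state; currently this part must run in the main migration thread.
     */
    vfio_load_config_state(vbasedev);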
>>
>>> then after
>>> everything is loaded the device posts a semaphore telling the main thread to
>>> continue. See e.g.:
>>>
>>> if (migrate_switchover_ack()) {
>>> qemu_loadvm_state_switchover_ack_needed(mis);
>>> }
>>>
>>> IIUC, VFIO can register load_complete_ack similarly so it only sem_post()
>>> when all things are loaded? We can then get rid of this slightly awkward
>>> interface. I had a feeling that things can be simplified (e.g., if the
>>> thread will take care of loading the final vmstate then the mutex is also
>>> not needed? etc.).
>>
>> With just a single call to switchover_ack_needed per VFIO device it would
>> need to do a blocking wait for the device buffers and config state load
>> to finish, therefore blocking other VFIO devices from potentially loading
>> their config state if they are ready to begin this operation earlier.
>
> I am not sure I get you here: loading VFIO device states (I mean, the
> non-iterable part) will need to be done sequentially IIUC due to what you
> said and should rely on the BQL, so I don't know how that could happen
> concurrently for now. But I think indeed the BQL is a problem.
Consider that we have two VFIO devices (A and B), with the following order
of switchover_ack_needed handler calls for them: first A gets this call, and
once the call for A finishes then B gets this call.
Now consider what happens if B has loaded all its buffers (in the loading
thread) and is ready for its config load before A has finished loading its
buffers.
B has to sit idle in this situation (even though it could have been already
loading its config) since the switchover_ack_needed handler for A won't
return until A is fully done.
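I.e., if a switchover_ack_needed-style handler did such a blocking wait, the
handlers would fully serialize (a sketch; the blocking call is hypothetical):

    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
        /*
         * Device A blocks here until it is fully loaded, while device B,
         * already ready for its config load, sits idle in the queue.
         */
        wait_for_device_load_and_config(se->opaque);
    }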
> So IMHO this recv side interface so far is the major pain that I really
> want to avoid (compared to the rest) in the series. Let's see whether we
> can come up with something better..
>
> One other (probably not pretty..) idea is that, when waiting here, the main
> thread yields the BQL, then other threads can take it and load the final
> chunk of VFIO data. But I could be missing something else.
>
I think temporarily dropping the BQL deep inside migration code is similar
to running the QEMU event loop deep inside migration code (about which
people complained in my generic thread pool implementation): it's easy
to miss some subtle dependency/race somewhere and accidentally cause a rare,
hard-to-debug deadlock.
That's why I think that it's ultimately probably better to make the QEMU core
address space modification methods thread-safe / re-entrant instead.
Thanks,
Maciej
* Re: [PATCH v2 12/17] migration/multifd: Device state transfer support - send side
2024-09-19 21:17 ` Peter Xu
@ 2024-09-20 15:23 ` Maciej S. Szmigiero
2024-09-20 17:09 ` Peter Xu
0 siblings, 1 reply; 128+ messages in thread
From: Maciej S. Szmigiero @ 2024-09-20 15:23 UTC (permalink / raw)
To: Peter Xu
Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P. Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On 19.09.2024 23:17, Peter Xu wrote:
> On Thu, Sep 19, 2024 at 09:49:43PM +0200, Maciej S. Szmigiero wrote:
>> On 10.09.2024 21:48, Peter Xu wrote:
>>> On Wed, Aug 28, 2024 at 09:41:17PM -0300, Fabiano Rosas wrote:
>>>>> +size_t multifd_device_state_payload_size(void)
>>>>> +{
>>>>> + return sizeof(MultiFDDeviceState_t);
>>>>> +}
>>>>
>>>> This will not be necessary because the payload size is the same as the
>>>> data type. We only need it for the special case where the MultiFDPages_t
>>>> is smaller than the total ram payload size.
>>>
>>> Today I was thinking maybe we should really clean this up, as the current
>>> multifd_send_data_alloc() is indeed too tricky (blame me.. who requested
>>> that more or less). Knowing that VFIO can use dynamic buffers with ->idstr
>>> and ->buf (I was thinking it could be buf[1M].. but I was wrong...) made
>>> that feeling stronger.
>>>
>>> I think we should change it now perhaps, otherwise we'll need to introduce
>>> other helpers to e.g. reset the device buffers, and that's not only slow
>>> but also not good looking, IMO.
>>>
>>> So I went ahead with the idea in previous discussion, that I managed to
>>> change the SendData union into struct; the memory consumption is not super
>>> important yet, IMHO, but we should still stick with the object model where
>>> multifd enqueue thread switch buffer with multifd, as it still sounds a
>>> sane way to do.
>>>
>>> Then when that patch is ready, I further tried to make VFIO reuse multifd
>>> buffers just like what we do with MultiFDPages_t->offset[]: in RAM code we
>>> don't allocate it every time we enqueue.
>>>
>>> I hope it'll also work for VFIO. VFIO is special in being able to
>>> dump the config space, so it's more complex (and I noticed Maciej's current
>>> design requires the final chunk of VFIO config data be migrated in one
>>> packet.. that is also part of the complexity there). So I allowed that
>>> part to allocate a buffer but only that. IOW, I made some API (see below)
>>> that can either reuse preallocated buffer, or use a separate one only for
>>> the final bulk.
>>>
>>> In short, could both of you have a look at what I came up with below? I
>>> did that in patches because I think it's too much to comment, so patches
>>> may work better. No concern if any of below could be good changes to you,
>>> then either Maciej can squash whatever into existing patches (and I feel
>>> like some existing patches in this series can go away with below design),
>>> or I can post pre-requisite patch but only if any of you prefer that.
>>>
>>> Anyway, let me know, the patches apply on top of this whole series applied
>>> first.
>>>
>>> I also wonder whether there can be any perf difference already (I tested
>>> all multifd qtest with below, but no VFIO I can run), perhaps not that
>>> much, but just to mention below should avoid both buffer allocations and
>>> one round of copy (so VFIO read() directly writes to the multifd buffers
>>> now).
>>
>> I am not against making MultiFDSendData a struct and maybe introducing
>> some pre-allocated buffer.
>>
>> But to be honest, that manual memory management with having to remember
>> to call multifd_device_state_finish() on error paths as in your
>> proposed patch 3 really invites memory leaks.
>>
>> Will think about some other way to have a reusable buffer.
>
> Sure. That's patch 3, and I suppose then it looks like patch 1 is still
> OK in one way or another.
>
>>
>> In terms of not making an idstr copy (your proposed patch 2) I am not
>> 100% sure that avoiding such a tiny allocation really justifies the risk
>> of a possible use-after-free of a dangling pointer.
>
> Why is there a risk? Someone strdup() on the stack? That only goes via VFIO
> itself, so I thought it wasn't that complicated. But yeah, as I said, this
> part (patch 2) is optional.
I mean the risk here is somebody providing an idstr that somehow gets freed
or overwritten before the device state buffer gets sent.
With a static idstr that's obviously not an issue, but I see that, for example,
vmstate_register_with_alias_id() generates the idstr dynamically and this API
is used by all qdevs that have a VMSD (in device_set_realized()).
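So the safe option is for the submission to own a copy, e.g. (a sketch):

    /*
     * Defensive copy, since the caller's idstr lifetime is not
     * guaranteed to outlive the queued buffer.
     */
    device_state->idstr = g_strdup(idstr);

    /* ... and after the buffer has been sent: */
    g_free(device_state->idstr);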
>> Not 100% against it either if you are confident that it will never happen.
>>
>> By the way, I guess it makes sense to carry these changes in the main patch
>> set rather than as a separate changes?
>
> Whatever you prefer.
>
> I wrote those patches only because I thought maybe you'd like to run some
> perf test to see whether they would help at all, and when the patches are
> there it'll be much easier for you; then you can decide whether it's worth
> integrating already, or leave that for later.
>
> If not I'd say they're even lower priority, so feel free to stick with
> whatever is easier for you. I'm ok there.
>
> However, it'll always be good if we can still have patch 1 as I mentioned
> before (as part of your series, if you don't disagree), to make the
> SendData interface slightly cleaner and easier to follow.
>
Will try to include these patches in my patch set if they don't cause any
downtime regressions.
Thanks,
Maciej
* Re: [PATCH v2 06/17] migration: Add save_live_complete_precopy_{begin,end} handlers
2024-09-20 15:22 ` Maciej S. Szmigiero
@ 2024-09-20 16:08 ` Peter Xu
0 siblings, 0 replies; 128+ messages in thread
From: Peter Xu @ 2024-09-20 16:08 UTC (permalink / raw)
To: Maciej S. Szmigiero
Cc: Avihai Horon, Fabiano Rosas, Alex Williamson,
Cédric Le Goater, Eric Blake, Markus Armbruster,
Daniel P. Berrangé, Joao Martins, qemu-devel
On Fri, Sep 20, 2024 at 05:22:54PM +0200, Maciej S. Szmigiero wrote:
> On 19.09.2024 22:54, Peter Xu wrote:
> > On Thu, Sep 19, 2024 at 09:47:53PM +0200, Maciej S. Szmigiero wrote:
> > > On 9.09.2024 21:08, Peter Xu wrote:
> > > > On Mon, Sep 09, 2024 at 08:32:45PM +0200, Maciej S. Szmigiero wrote:
> > > > > On 9.09.2024 19:59, Peter Xu wrote:
> > > > > > On Thu, Sep 05, 2024 at 04:45:48PM +0300, Avihai Horon wrote:
> > > > > > >
> > > > > > > On 27/08/2024 20:54, Maciej S. Szmigiero wrote:
> > > > > > > > From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> > > > > > > >
> > > > > > > > These SaveVMHandlers help a device provide its own asynchronous
> > > > > > > > transmission of the remaining data at the end of a precopy phase.
> > > > > > > >
> > > > > > > > In this use case the save_live_complete_precopy_begin handler might
> > > > > > > > be used to mark the stream boundary before proceeding with asynchronous
> > > > > > > > transmission of the remaining data while the
> > > > > > > > save_live_complete_precopy_end handler might be used to mark the
> > > > > > > > stream boundary after performing the asynchronous transmission.
> > > > > > > >
> > > > > > > > Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> > > > > > > > ---
> > > > > > > > include/migration/register.h | 36 ++++++++++++++++++++++++++++++++++++
> > > > > > > > migration/savevm.c | 35 +++++++++++++++++++++++++++++++++++
> > > > > > > > 2 files changed, 71 insertions(+)
> > > > > > > >
> > > > > > > > diff --git a/include/migration/register.h b/include/migration/register.h
> > > > > > > > index f60e797894e5..9de123252edf 100644
> > > > > > > > --- a/include/migration/register.h
> > > > > > > > +++ b/include/migration/register.h
> > > > > > > > @@ -103,6 +103,42 @@ typedef struct SaveVMHandlers {
> > > > > > > > */
> > > > > > > > int (*save_live_complete_precopy)(QEMUFile *f, void *opaque);
> > > > > > > >
> > > > > > > > + /**
> > > > > > > > + * @save_live_complete_precopy_begin
> > > > > > > > + *
> > > > > > > > + * Called at the end of a precopy phase, before all
> > > > > > > > + * @save_live_complete_precopy handlers and before launching
> > > > > > > > + * all @save_live_complete_precopy_thread threads.
> > > > > > > > + * The handler might, for example, mark the stream boundary before
> > > > > > > > + * proceeding with asynchronous transmission of the remaining data via
> > > > > > > > + * @save_live_complete_precopy_thread.
> > > > > > > > + * When postcopy is enabled, devices that support postcopy will skip this step.
> > > > > > > > + *
> > > > > > > > + * @f: QEMUFile where the handler can synchronously send data before returning
> > > > > > > > + * @idstr: this device section idstr
> > > > > > > > + * @instance_id: this device section instance_id
> > > > > > > > + * @opaque: data pointer passed to register_savevm_live()
> > > > > > > > + *
> > > > > > > > + * Returns zero to indicate success and negative for error
> > > > > > > > + */
> > > > > > > > + int (*save_live_complete_precopy_begin)(QEMUFile *f,
> > > > > > > > + char *idstr, uint32_t instance_id,
> > > > > > > > + void *opaque);
> > > > > > > > + /**
> > > > > > > > + * @save_live_complete_precopy_end
> > > > > > > > + *
> > > > > > > > + * Called at the end of a precopy phase, after @save_live_complete_precopy
> > > > > > > > + * handlers and after all @save_live_complete_precopy_thread threads have
> > > > > > > > + * finished. When postcopy is enabled, devices that support postcopy will
> > > > > > > > + * skip this step.
> > > > > > > > + *
> > > > > > > > + * @f: QEMUFile where the handler can synchronously send data before returning
> > > > > > > > + * @opaque: data pointer passed to register_savevm_live()
> > > > > > > > + *
> > > > > > > > + * Returns zero to indicate success and negative for error
> > > > > > > > + */
> > > > > > > > + int (*save_live_complete_precopy_end)(QEMUFile *f, void *opaque);
> > > > > > >
> > > > > > > Is this handler necessary now that migration core is responsible for the
> > > > > > > threads and joins them? I don't see VFIO implementing it later on.
> > > > > >
> > > > > > Right, I spot the same thing.
> > > > > >
> > > > > > This series added three hooks: begin, end, precopy_thread.
> > > > > >
> > > > > > What I think is it only needs one, which is precopy_async. My vague memory
> > > > > > was that was what we used to discuss too, so that when migration precopy
> > > > > > flushes the final round of iterable data, it does:
> > > > > >
> > > > > > (1) loop over all complete_precopy_async() and enqueue the tasks if
> > > > > > existed into the migration worker pool. Then,
> > > > > >
> > > > > > (2) loop over all complete_precopy() like before.
> > > > > >
> > > > > > Optionally, we can enforce one vmstate handler only provides either
> > > > > > complete_precopy_async() or complete_precopy(). In this case VFIO can
> > > > > > update the two hooks during setup() by detecting multifd && !mapped_ram &&
> > > > > > nocomp.
> > > > > >
> > > > >
> > > > > The "_begin" hook is still necessary to mark the end of the device state
> > > > > sent via the main migration stream (during the phase when the VM is still
> > > > > running) since we can't start loading the multifd-sent device state until
> > > > > all of that earlier data finishes loading first.
> > > >
> > > > Ah I remembered some more now, thanks.
> > > >
> > > > If vfio can send data during iterations this new hook will also not be
> > > > needed, right?
> > > >
> > > > I remember you mentioned you'd have a look and see the challenges there, is
> > > > there any conclusion yet on whether we can use multifd even during that?
> > >
> > > Yeah, I looked at that and it wasn't a straightforward thing to introduce.
> > >
> > > I am worried that with all the things that have already piled up (including
> > > the new thread pool implementation) we risk missing QEMU 9.2 too if this is
> > > included.
> >
> > Not explicitly required, but IMHO it'll be nice to provide a paragraph in
> > the new version when reposting, explaining the challenges of using it during
> > iterations. It'll be useful not only for me but for whoever may want to
> > extend your solution to iterations.
>
> Will do.
>
> > I asked this question again mostly because I found that with iteration
> > support the design of begin() looks simpler, so the extra sync is not
> > needed. But I confess you know better than me, so whatever you think best
> > is ok here.
>
> If we make the MIG_CMD_SWITCHOVER / QEMU_VM_COMMAND thing common to all
> devices then we don't need begin() even without live-phase multifd
> device state transfer.
>
> > >
> > > > It's also a pity that we introduce this hook only because we want a
> > > > boundary between "iterable stage" and "final stage". IIUC if we have any
> > > > kind of message telling dest beforehand that "we're going to the last
> > > > stage" then this hook can be avoided. Now it's at least inefficient
> > > > because we need to trigger begin() per-device, even if I think it's more of
> > > > a global request saying that "we need to load all main stream data first
> > > > before moving on".
> > >
> > > It should be pretty easy to remove that begin() hook once it is no longer
> > > needed - after all, it's only necessary for the sender.
> >
> > Do you mean you have a plan to remove the begin() hook even without making
> > iterate() work too? That's definitely nice if so.
>
> As I wrote above, I think with MIG_CMD_SWITCHOVER it shouldn't be needed?
Ah I see, yes, with that it's ok.
Just a heads-up - please remember to add one migration_properties[] entry
and a compat property for pre-9.1 so that we don't generate that message
when migrating to old binaries.
Meanwhile if we're going to add it, let's also make sure postcopy also has
it, as it shares the same SWITCHOVER idea.
Thanks,
--
Peter Xu
* Re: [PATCH v2 08/17] migration: Add load_finish handler and associated functions
2024-09-20 15:23 ` Maciej S. Szmigiero
@ 2024-09-20 16:45 ` Peter Xu
2024-09-26 22:34 ` Maciej S. Szmigiero
0 siblings, 1 reply; 128+ messages in thread
From: Peter Xu @ 2024-09-20 16:45 UTC (permalink / raw)
To: Maciej S. Szmigiero
Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P. Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On Fri, Sep 20, 2024 at 05:23:08PM +0200, Maciej S. Szmigiero wrote:
> On 19.09.2024 23:11, Peter Xu wrote:
> > On Thu, Sep 19, 2024 at 09:49:10PM +0200, Maciej S. Szmigiero wrote:
> > > On 9.09.2024 22:03, Peter Xu wrote:
> > > > On Tue, Aug 27, 2024 at 07:54:27PM +0200, Maciej S. Szmigiero wrote:
> > > > > From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> > > > >
> > > > > load_finish SaveVMHandler allows the migration code to poll whether
> > > > > a device-specific asynchronous device state loading operation has finished.
> > > > >
> > > > > In order to avoid calling this handler needlessly the device is supposed
> > > > > to notify the migration code of its possible readiness via a call to
> > > > > qemu_loadvm_load_finish_ready_broadcast() while holding
> > > > > qemu_loadvm_load_finish_ready_lock.
> > > > >
> > > > > Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> > > > > ---
> > > > > include/migration/register.h | 21 +++++++++++++++
> > > > > migration/migration.c | 6 +++++
> > > > > migration/migration.h | 3 +++
> > > > > migration/savevm.c | 52 ++++++++++++++++++++++++++++++++++++
> > > > > migration/savevm.h | 4 +++
> > > > > 5 files changed, 86 insertions(+)
> > > > >
> > > > > diff --git a/include/migration/register.h b/include/migration/register.h
> > > > > index 4a578f140713..44d8cf5192ae 100644
> > > > > --- a/include/migration/register.h
> > > > > +++ b/include/migration/register.h
> > > > > @@ -278,6 +278,27 @@ typedef struct SaveVMHandlers {
> > > > > int (*load_state_buffer)(void *opaque, char *data, size_t data_size,
> > > > > Error **errp);
> > > > > + /**
> > > > > + * @load_finish
> > > > > + *
> > > > > + * Poll whether all asynchronous device state loading had finished.
> > > > > + * Not called on the load failure path.
> > > > > + *
> > > > > + * Called while holding the qemu_loadvm_load_finish_ready_lock.
> > > > > + *
> > > > > + * If this method signals "not ready" then it might not be called
> > > > > + * again until qemu_loadvm_load_finish_ready_broadcast() is invoked
> > > > > + * while holding qemu_loadvm_load_finish_ready_lock.
> > > >
> > > > [1]
> > > >
> > > > > + *
> > > > > + * @opaque: data pointer passed to register_savevm_live()
> > > > > + * @is_finished: whether the loading had finished (output parameter)
> > > > > + * @errp: pointer to Error*, to store an error if it happens.
> > > > > + *
> > > > > + * Returns zero to indicate success and negative for error
> > > > > + * It's not an error that the loading still hasn't finished.
> > > > > + */
> > > > > + int (*load_finish)(void *opaque, bool *is_finished, Error **errp);
> > > >
> > > > The load_finish() semantics is a bit weird, especially above [1] on "only
> > > > allowed to be called once if ..." and also on the locks.
> > >
> > > The point of this remark is that a driver needs to call
> > > qemu_loadvm_load_finish_ready_broadcast() if it wants for the migration
> > > core to call its load_finish handler again.
> > >
> > > > It looks to me vfio_load_finish() also does the final load of the device.
> > > >
> > > > I wonder whether that final load can be done in the threads,
> > >
> > > Here, the problem is that current VFIO VMState has to be loaded from the main
> > > migration thread as it internally calls QEMU core address space modification
> > > methods which explode if called from another thread(s).
> >
> > Ahh, I see. I'm trying to make dest qemu loadvm in a thread too and yield
> > BQL if possible, when that's ready then in your case here IIUC you can
> > simply take BQL in whichever thread that loads it.. but yeah it's not ready
> > at least..
>
> Yeah, long term we might want to work on making these QEMU core address space
> modification methods somehow callable from multiple threads but that's
> definitely not something for the initial patch set.
>
> > Would it be possible vfio_save_complete_precopy_async_thread_config_state()
> > be done in VFIO's save_live_complete_precopy() through the main channel
> > somehow? IOW, does it rely on iterative data to be fetched first from
> > kernel, or completely separate states?
>
> The device state data needs to be fully loaded first before "activating"
> the device by loading its config state.
>
> > And just curious: how large is it
> > normally (and I suppose this decides whether it's applicable to be sent via
> > the main channel at all..)?
>
> Config data is *much* smaller than device state data - as far as I remember
> it was on the order of kilobytes.
>
> > >
> > > > then after
> > > > everything loaded the device post a semaphore telling the main thread to
> > > > continue. See e.g.:
> > > >
> > > > if (migrate_switchover_ack()) {
> > > > qemu_loadvm_state_switchover_ack_needed(mis);
> > > > }
> > > >
> > > > IIUC, VFIO can register load_complete_ack similarly so it only sem_post()
> > > > when all things are loaded? We can then get rid of this slightly awkward
> > > > interface. I had a feeling that things can be simplified (e.g., if the
> > > > thread will take care of loading the final vmstate then the mutex is also
> > > > not needed? etc.).
> > >
> > > With just a single call to switchover_ack_needed per VFIO device it would
> > > need to do a blocking wait for the device buffers and config state load
> > > to finish, therefore blocking other VFIO devices from potentially loading
> > > their config state if they are ready to begin this operation earlier.
> >
> > I am not sure I get you here, loading VFIO device states (I mean, the
> > non-iterable part) will need to be done sequentially IIUC due to what you
> > said and should rely on BQL, so I don't know how that could happen
> > concurrently for now. But I think indeed BQL is a problem.
> Consider that we have two VFIO devices (A and B), with the following order
> of switchover_ack_needed handler calls for them: first A gets this call,
> once the call for A finishes then B gets this call.
>
> Now consider what happens if B had loaded all its buffers (in the loading
> thread) and it is ready for its config load before A finished loading its
> buffers.
>
> B has to wait idle in this situation (even though it could have been already
> loading its config) since the switchover_ack_needed handler for A won't
> return until A is fully done.
This sounds like a performance concern, and I wonder how much this impacts
the real workload (that you run a test and measure, with/without such
concurrency) when we can save two devices in parallel anyway; I would
expect the real diff is small due to the fact I mentioned that we save >1
VFIO devices concurrently via multifd.
Do you think we can start with a simpler approach?
So what I'm thinking could be very clean is: we just discussed
MIG_CMD_SWITCHOVER and it looks like you also think it's an OK approach.
With it in place, I wonder why not move one step further and have
MIG_CMD_SEND_NON_ITERABLE, just to mark that "iterable devices are all
done, ready to send non-iterable". It can be controlled by the same
migration property so we only send these two flags in 9.2+ machine types.
Then IIUC VFIO can send config data through the main wire (just like most
other pci devices! which is IMHO a good fit..) and on the destination VFIO
holds off loading it until passing the MIG_CMD_SEND_NON_ITERABLE phase.
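On the stream level that should be nearly trivial; a sketch (the command
name and the helper are hypothetical here, only qemu_savevm_command_send()
is existing code):

    /* migration/savevm.c - sketch: emitted on the main channel once all
     * iterable device data has been fully sent */
    static void qemu_savevm_send_non_iterable(QEMUFile *f)
    {
        /* MIG_CMD_SEND_NON_ITERABLE: a new entry in the qemu_vm_cmd enum */
        qemu_savevm_command_send(f, MIG_CMD_SEND_NON_ITERABLE, 0, NULL);
    }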
Side note: when looking again, I really think we should clean up some
migration switchover phase functions, e.g. I think the
qemu_savevm_state_complete_precopy() parameters are pretty confusing,
especially iterable_only, even more so since inside it there are also
some implicit postcopy checks, urgh.. but this is not relevant to our
discussion, and I won't draft that before your series lands; that could
complicate stuff.
>
> > So IMHO this recv side interface so far is the major pain that I really
> > want to avoid (comparing to the rest) in the series. Let's see whether we
> > can come up with something better..
> >
> > One other (probably not pretty..) idea is when waiting here in the main
> > thread it yields BQL, then other threads can take it and load the VFIO
> > final chunk of data. But I could miss something else.
> >
>
> I think temporarily dropping the BQL deep inside migration code is similar
> to running the QEMU event loop deep inside migration code (about which
> people complained in my generic thread pool implementation): it's easy
> to miss some subtle dependency/race somewhere and accidentally cause a
> rare, hard-to-debug deadlock.
>
> That's why I think that it's ultimately probably better to make QEMU core
> address space modification methods thread safe / re-entrant instead.
Right, let's see how you think about above.
Thanks,
--
Peter Xu
* Re: [PATCH v2 12/17] migration/multifd: Device state transfer support - send side
2024-09-20 15:23 ` Maciej S. Szmigiero
@ 2024-09-20 17:09 ` Peter Xu
0 siblings, 0 replies; 128+ messages in thread
From: Peter Xu @ 2024-09-20 17:09 UTC (permalink / raw)
To: Maciej S. Szmigiero
Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P. Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On Fri, Sep 20, 2024 at 05:23:20PM +0200, Maciej S. Szmigiero wrote:
> On 19.09.2024 23:17, Peter Xu wrote:
> > On Thu, Sep 19, 2024 at 09:49:43PM +0200, Maciej S. Szmigiero wrote:
> > > On 10.09.2024 21:48, Peter Xu wrote:
> > > > On Wed, Aug 28, 2024 at 09:41:17PM -0300, Fabiano Rosas wrote:
> > > > > > +size_t multifd_device_state_payload_size(void)
> > > > > > +{
> > > > > > + return sizeof(MultiFDDeviceState_t);
> > > > > > +}
> > > > >
> > > > > This will not be necessary because the payload size is the same as the
> > > > > data type. We only need it for the special case where the MultiFDPages_t
> > > > > is smaller than the total ram payload size.
> > > >
> > > > Today I was thinking maybe we should really clean this up, as the current
> > > > multifd_send_data_alloc() is indeed too tricky (blame me.. who requested
> > > > that more or less). Knowing that VFIO can use dynamic buffers with ->idstr
> > > > and ->buf (I was thinking it could be buf[1M].. but I was wrong...) made
> > > > that feeling stronger.
> > > >
> > > > I think we should change it now perhaps, otherwise we'll need to introduce
> > > > other helpers to e.g. reset the device buffers, and that's not only slow
> > > > but also not good looking, IMO.
> > > >
> > > > So I went ahead with the idea in previous discussion, that I managed to
> > > > change the SendData union into a struct; the memory consumption is not
> > > > super important yet, IMHO, but we should still stick with the object model
> > > > where the multifd enqueue thread switches buffers with multifd, as it
> > > > still sounds like a sane way to do it.
> > > >
> > > > Then when that patch is ready, I further tried to make VFIO reuse multifd
> > > > buffers just like what we do with MultiFDPages_t->offset[]: in RAM code we
> > > > don't allocate it every time we enqueue.
> > > >
> > > > I hope it'll also work for VFIO. VFIO has a specialty on being able to
> > > > dump the config space so it's more complex (and I noticed Maciej's current
> > > > design requires the final chunk of VFIO config data be migrated in one
> > > > packet.. that is also part of the complexity there). So I allowed that
> > > > part to allocate a buffer but only that. IOW, I made some API (see below)
> > > > that can either reuse preallocated buffer, or use a separate one only for
> > > > the final bulk.
> > > >
> > > > In short, could both of you have a look at what I came up with below? I
> > > > did that in patches because I think it's too much to comment, so patches
> > > > may work better. No concern if any of below could be good changes to you,
> > > > then either Maciej can squash whatever into existing patches (and I feel
> > > > like some existing patches in this series can go away with below design),
> > > > or I can post pre-requisite patch but only if any of you prefer that.
> > > >
> > > > Anyway, let me know, the patches apply on top of this whole series applied
> > > > first.
> > > >
> > > > I also wonder whether there can be any perf difference already (I tested
> > > > all multifd qtest with below, but no VFIO I can run), perhaps not that
> > > > much, but just to mention below should avoid both buffer allocations and
> > > > one round of copy (so VFIO read() directly writes to the multifd buffers
> > > > now).
> > >
> > > I am not against making MultiFDSendData a struct and maybe introducing
> > > some pre-allocated buffer.
> > >
> > > But to be honest, that manual memory management with having to remember
> > > to call multifd_device_state_finish() on error paths as in your
> > > proposed patch 3 really invites memory leaks.
> > >
> > > Will think about some other way to have a reusable buffer.
> >
> > Sure. That's patch 3, and I suppose then it looks like patch 1 is still
> > OK in one way or another.
> >
> > >
> > > In terms of not making an idstr copy (your proposed patch 2) I am not
> > > 100% sure that avoiding such a tiny allocation really justifies the risk
> > > of a possible use-after-free of a dangling pointer.
> >
> > Why there's risk? Someone strdup() on the stack? That only goes via VFIO
> > itself, so I thought it wasn't that complicated. But yeah as I said this
> > part (patch 2) is optional.
>
> I mean the risk here is somebody providing an idstr that somehow gets free'd
> or overwritten before the device state buffer gets sent.
>
> With a static idstr that's obviously not an issue, but I see that, for example,
> vmstate_register_with_alias_id() generates idstr dynamically and this API
> is used by all qdevs that have a VMSD (in device_set_realized()).
>
> > > Not 100% against it either if you are confident that it will never happen.
> > >
> > > By the way, I guess it makes sense to carry these changes in the main patch
> > > set rather than as separate changes?
> >
> > Whatever you prefer.
> >
> > I wrote those patches only because I thought maybe you'd like to run some
> > perf test to see whether they would help at all, and when the patches are
> > there it'll be much easier for you, then you can decide whether it's worth
> > integrating already, or leaving that for later.
> >
> > If not I'd say they're even lower priority, so feel free to stick with
> > whatever easier for you. I'm ok there.
> >
> > However it'll be always good we can still have patch 1 as I mentioned
> > before (as part of your series, if you won't disagree), to make the
> > SendData interface slightly cleaner and easier to follow.
> >
>
> Will try to include these patches in my patch set if they don't cause any
> downtime regressions.
Thanks - note that it's not my request to have patches 2-3. :) Please use
your own judgement.
Again, I'd run a round of perf tests (only if my patches can directly run
through without heavy debugging on top of yours); if nothing shows a
benefit I'd go with what you have right now and we drop patches 2-3 -
they're not justified if they provide zero perf benefit.
--
Peter Xu
* Re: [PATCH v2 08/17] migration: Add load_finish handler and associated functions
2024-09-20 16:45 ` Peter Xu
@ 2024-09-26 22:34 ` Maciej S. Szmigiero
2024-09-27 0:53 ` Peter Xu
0 siblings, 1 reply; 128+ messages in thread
From: Maciej S. Szmigiero @ 2024-09-26 22:34 UTC (permalink / raw)
To: Peter Xu
Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P. Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On 20.09.2024 18:45, Peter Xu wrote:
> On Fri, Sep 20, 2024 at 05:23:08PM +0200, Maciej S. Szmigiero wrote:
>> On 19.09.2024 23:11, Peter Xu wrote:
>>> On Thu, Sep 19, 2024 at 09:49:10PM +0200, Maciej S. Szmigiero wrote:
>>>> On 9.09.2024 22:03, Peter Xu wrote:
>>>>> On Tue, Aug 27, 2024 at 07:54:27PM +0200, Maciej S. Szmigiero wrote:
>>>>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>>>>
>>>>>> load_finish SaveVMHandler allows migration code to poll whether
>>>>>> a device-specific asynchronous device state loading operation had finished.
>>>>>>
>>>>>> In order to avoid calling this handler needlessly the device is supposed
>>>>>> to notify the migration code of its possible readiness via a call to
>>>>>> qemu_loadvm_load_finish_ready_broadcast() while holding
>>>>>> qemu_loadvm_load_finish_ready_lock.
>>>>>>
>>>>>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>>>>>> ---
>>>>>> include/migration/register.h | 21 +++++++++++++++
>>>>>> migration/migration.c | 6 +++++
>>>>>> migration/migration.h | 3 +++
>>>>>> migration/savevm.c | 52 ++++++++++++++++++++++++++++++++++++
>>>>>> migration/savevm.h | 4 +++
>>>>>> 5 files changed, 86 insertions(+)
>>>>>>
>>>>>> diff --git a/include/migration/register.h b/include/migration/register.h
>>>>>> index 4a578f140713..44d8cf5192ae 100644
>>>>>> --- a/include/migration/register.h
>>>>>> +++ b/include/migration/register.h
>>>>>> @@ -278,6 +278,27 @@ typedef struct SaveVMHandlers {
>>>>>> int (*load_state_buffer)(void *opaque, char *data, size_t data_size,
>>>>>> Error **errp);
>>>>>> + /**
>>>>>> + * @load_finish
>>>>>> + *
>>>>>> + * Poll whether all asynchronous device state loading had finished.
>>>>>> + * Not called on the load failure path.
>>>>>> + *
>>>>>> + * Called while holding the qemu_loadvm_load_finish_ready_lock.
>>>>>> + *
>>>>>> + * If this method signals "not ready" then it might not be called
>>>>>> + * again until qemu_loadvm_load_finish_ready_broadcast() is invoked
>>>>>> + * while holding qemu_loadvm_load_finish_ready_lock.
>>>>>
>>>>> [1]
>>>>>
>>>>>> + *
>>>>>> + * @opaque: data pointer passed to register_savevm_live()
>>>>>> + * @is_finished: whether the loading had finished (output parameter)
>>>>>> + * @errp: pointer to Error*, to store an error if it happens.
>>>>>> + *
>>>>>> + * Returns zero to indicate success and negative for error
>>>>>> + * It's not an error that the loading still hasn't finished.
>>>>>> + */
>>>>>> + int (*load_finish)(void *opaque, bool *is_finished, Error **errp);
>>>>>
>>>>> The load_finish() semantics is a bit weird, especially above [1] on "only
>>>>> allowed to be called once if ..." and also on the locks.
>>>>
>>>> The point of this remark is that a driver needs to call
>>>> qemu_loadvm_load_finish_ready_broadcast() if it wants for the migration
>>>> core to call its load_finish handler again.
>>>>
>>>>> It looks to me vfio_load_finish() also does the final load of the device.
>>>>>
>>>>> I wonder whether that final load can be done in the threads,
>>>>
>>>> Here, the problem is that current VFIO VMState has to be loaded from the main
>>>> migration thread as it internally calls QEMU core address space modification
>>>> methods which explode if called from another thread(s).
>>>
>>> Ahh, I see. I'm trying to make dest qemu loadvm in a thread too and yield
>>> BQL if possible, when that's ready then in your case here IIUC you can
>>> simply take BQL in whichever thread that loads it.. but yeah it's not ready
>>> at least..
>>
>> Yeah, long term we might want to work on making these QEMU core address space
>> modification methods somehow callable from multiple threads but that's
>> definitely not something for the initial patch set.
>>
>>> Would it be possible vfio_save_complete_precopy_async_thread_config_state()
>>> be done in VFIO's save_live_complete_precopy() through the main channel
>>> somehow? IOW, does it rely on iterative data to be fetched first from
>>> kernel, or completely separate states?
>>
>> The device state data needs to be fully loaded first before "activating"
>> the device by loading its config state.
>>
>>> And just curious: how large is it
>>> normally (and I suppose this decides whether it's applicable to be sent via
>>> the main channel at all..)?
>>
>> Config data is *much* smaller than device state data - as far as I remember
>> it was on the order of kilobytes.
>>
>>>>
>>>>> then after
>>>>> everything loaded the device post a semaphore telling the main thread to
>>>>> continue. See e.g.:
>>>>>
>>>>> if (migrate_switchover_ack()) {
>>>>> qemu_loadvm_state_switchover_ack_needed(mis);
>>>>> }
>>>>>
>>>>> IIUC, VFIO can register load_complete_ack similarly so it only sem_post()
>>>>> when all things are loaded? We can then get rid of this slightly awkward
>>>>> interface. I had a feeling that things can be simplified (e.g., if the
>>>>> thread will take care of loading the final vmstate then the mutex is also
>>>>> not needed? etc.).
>>>>
>>>> With just a single call to switchover_ack_needed per VFIO device it would
>>>> need to do a blocking wait for the device buffers and config state load
>>>> to finish, therefore blocking other VFIO devices from potentially loading
>>>> their config state if they are ready to begin this operation earlier.
>>>
>>> I am not sure I get you here, loading VFIO device states (I mean, the
>>> non-iterable part) will need to be done sequentially IIUC due to what you
>>> said and should rely on BQL, so I don't know how that could happen
>>> concurrently for now. But I think indeed BQL is a problem.
>> Consider that we have two VFIO devices (A and B), with the following order
>> of switchover_ack_needed handler calls for them: first A gets this call,
>> once the call for A finishes then B gets this call.
>>
>> Now consider what happens if B had loaded all its buffers (in the loading
>> thread) and it is ready for its config load before A finished loading its
>> buffers.
>>
>> B has to wait idle in this situation (even though it could have been already
>> loading its config) since the switchover_ack_needed handler for A won't
>> return until A is fully done.
>
> This sounds like a performance concern, and I wonder how much this impacts
> the real workload (that you run a test and measure, with/without such
> concurrency) when we can save two devices in parallel anyway; I would
> expect the real diff is small due to the fact I mentioned that we save >1
> VFIO devices concurrently via multifd.
>
> Do you think we can start with a simpler approach?
I don't think introducing a performance/scalability issue like that is
a good thing, especially when we already have a design that avoids it.
Unfortunately, my current setup does not allow live migrating VMs with
more than 4 VFs so I can't benchmark that.
But I'm almost certain that with more VFs the situation of devices being
ready out-of-order will get even more likely.
> So what I'm thinking could be very clean is: we just discussed
> MIG_CMD_SWITCHOVER and it looks like you also think it's an OK approach.
> With it in place, I wonder why not move one step further and have
> MIG_CMD_SEND_NON_ITERABLE, just to mark that "iterable devices are all
> done, ready to send non-iterable". It can be controlled by the same
> migration property so we only send these two flags in 9.2+ machine types.
>
> Then IIUC VFIO can send config data through the main wire (just like most
> other pci devices! which is IMHO a good fit..) and on the destination VFIO
> holds off loading it until passing the MIG_CMD_SEND_NON_ITERABLE phase.
Starting the config load only on MIG_CMD_SEND_NON_ITERABLE would (in
addition to the considerations above) delay starting the config load until
all iterable devices were read/transferred/loaded, and would also
complicate future efforts at loading that config data in parallel.
>
> Thanks,
>
Thanks,
Maciej
* Re: [PATCH v2 08/17] migration: Add load_finish handler and associated functions
2024-09-26 22:34 ` Maciej S. Szmigiero
@ 2024-09-27 0:53 ` Peter Xu
2024-09-30 19:25 ` Maciej S. Szmigiero
0 siblings, 1 reply; 128+ messages in thread
From: Peter Xu @ 2024-09-27 0:53 UTC (permalink / raw)
To: Maciej S. Szmigiero
Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P. Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On Fri, Sep 27, 2024 at 12:34:31AM +0200, Maciej S. Szmigiero wrote:
> On 20.09.2024 18:45, Peter Xu wrote:
> > On Fri, Sep 20, 2024 at 05:23:08PM +0200, Maciej S. Szmigiero wrote:
> > > On 19.09.2024 23:11, Peter Xu wrote:
> > > > On Thu, Sep 19, 2024 at 09:49:10PM +0200, Maciej S. Szmigiero wrote:
> > > > > On 9.09.2024 22:03, Peter Xu wrote:
> > > > > > On Tue, Aug 27, 2024 at 07:54:27PM +0200, Maciej S. Szmigiero wrote:
> > > > > > > From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> > > > > > >
> > > > > > > load_finish SaveVMHandler allows migration code to poll whether
> > > > > > > a device-specific asynchronous device state loading operation had finished.
> > > > > > >
> > > > > > > In order to avoid calling this handler needlessly the device is supposed
> > > > > > > to notify the migration code of its possible readiness via a call to
> > > > > > > qemu_loadvm_load_finish_ready_broadcast() while holding
> > > > > > > qemu_loadvm_load_finish_ready_lock.
> > > > > > >
> > > > > > > Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> > > > > > > ---
> > > > > > > include/migration/register.h | 21 +++++++++++++++
> > > > > > > migration/migration.c | 6 +++++
> > > > > > > migration/migration.h | 3 +++
> > > > > > > migration/savevm.c | 52 ++++++++++++++++++++++++++++++++++++
> > > > > > > migration/savevm.h | 4 +++
> > > > > > > 5 files changed, 86 insertions(+)
> > > > > > >
> > > > > > > diff --git a/include/migration/register.h b/include/migration/register.h
> > > > > > > index 4a578f140713..44d8cf5192ae 100644
> > > > > > > --- a/include/migration/register.h
> > > > > > > +++ b/include/migration/register.h
> > > > > > > @@ -278,6 +278,27 @@ typedef struct SaveVMHandlers {
> > > > > > > int (*load_state_buffer)(void *opaque, char *data, size_t data_size,
> > > > > > > Error **errp);
> > > > > > > + /**
> > > > > > > + * @load_finish
> > > > > > > + *
> > > > > > > + * Poll whether all asynchronous device state loading had finished.
> > > > > > > + * Not called on the load failure path.
> > > > > > > + *
> > > > > > > + * Called while holding the qemu_loadvm_load_finish_ready_lock.
> > > > > > > + *
> > > > > > > + * If this method signals "not ready" then it might not be called
> > > > > > > + * again until qemu_loadvm_load_finish_ready_broadcast() is invoked
> > > > > > > + * while holding qemu_loadvm_load_finish_ready_lock.
> > > > > >
> > > > > > [1]
> > > > > >
> > > > > > > + *
> > > > > > > + * @opaque: data pointer passed to register_savevm_live()
> > > > > > > + * @is_finished: whether the loading had finished (output parameter)
> > > > > > > + * @errp: pointer to Error*, to store an error if it happens.
> > > > > > > + *
> > > > > > > + * Returns zero to indicate success and negative for error
> > > > > > > + * It's not an error that the loading still hasn't finished.
> > > > > > > + */
> > > > > > > + int (*load_finish)(void *opaque, bool *is_finished, Error **errp);
> > > > > >
> > > > > > The load_finish() semantics is a bit weird, especially above [1] on "only
> > > > > > allowed to be called once if ..." and also on the locks.
> > > > >
> > > > > The point of this remark is that a driver needs to call
> > > > > qemu_loadvm_load_finish_ready_broadcast() if it wants for the migration
> > > > > core to call its load_finish handler again.
> > > > >
> > > > > > It looks to me vfio_load_finish() also does the final load of the device.
> > > > > >
> > > > > > I wonder whether that final load can be done in the threads,
> > > > >
> > > > > Here, the problem is that current VFIO VMState has to be loaded from the main
> > > > > migration thread as it internally calls QEMU core address space modification
> > > > > methods which explode if called from another thread(s).
> > > >
> > > > Ahh, I see. I'm trying to make dest qemu loadvm in a thread too and yield
> > > > BQL if possible, when that's ready then in your case here IIUC you can
> > > > simply take BQL in whichever thread that loads it.. but yeah it's not ready
> > > > at least..
> > >
> > > Yeah, long term we might want to work on making these QEMU core address space
> > > modification methods somehow callable from multiple threads but that's
> > > definitely not something for the initial patch set.
> > >
> > > > Would it be possible vfio_save_complete_precopy_async_thread_config_state()
> > > > be done in VFIO's save_live_complete_precopy() through the main channel
> > > > somehow? IOW, does it rely on iterative data to be fetched first from
> > > > kernel, or completely separate states?
> > >
> > > The device state data needs to be fully loaded first before "activating"
> > > the device by loading its config state.
> > >
> > > > And just curious: how large is it
> > > > normally (and I suppose this decides whether it's applicable to be sent via
> > > > the main channel at all..)?
> > >
> > > Config data is *much* smaller than device state data - as far as I remember
> > > it was on the order of kilobytes.
> > >
> > > > >
> > > > > > then after
> > > > > > everything loaded the device post a semaphore telling the main thread to
> > > > > > continue. See e.g.:
> > > > > >
> > > > > > if (migrate_switchover_ack()) {
> > > > > > qemu_loadvm_state_switchover_ack_needed(mis);
> > > > > > }
> > > > > >
> > > > > > IIUC, VFIO can register load_complete_ack similarly so it only sem_post()
> > > > > > when all things are loaded? We can then get rid of this slightly awkward
> > > > > > interface. I had a feeling that things can be simplified (e.g., if the
> > > > > > thread will take care of loading the final vmstate then the mutex is also
> > > > > > not needed? etc.).
> > > > >
> > > > > With just a single call to switchover_ack_needed per VFIO device it would
> > > > > need to do a blocking wait for the device buffers and config state load
> > > > > to finish, therefore blocking other VFIO devices from potentially loading
> > > > > their config state if they are ready to begin this operation earlier.
> > > >
> > > > I am not sure I get you here, loading VFIO device states (I mean, the
> > > > non-iterable part) will need to be done sequentially IIUC due to what you
> > > > said and should rely on BQL, so I don't know how that could happen
> > > > concurrently for now. But I think indeed BQL is a problem.
> > > Consider that we have two VFIO devices (A and B), with the following order
> > > of switchover_ack_needed handler calls for them: first A gets this call,
> > > once the call for A finishes then B gets this call.
> > >
> > > Now consider what happens if B had loaded all its buffers (in the loading
> > > thread) and it is ready for its config load before A finished loading its
> > > buffers.
> > >
> > > B has to wait idle in this situation (even though it could have been already
> > > loading its config) since the switchover_ack_needed handler for A won't
> > > return until A is fully done.
> >
> > This sounds like a performance concern, and I wonder how much this impacts
> > the real workload (that you run a test and measure, with/without such
> > concurrency) when we can save two devices in parallel anyway; I would
> > expect the real diff is small due to the fact I mentioned that we save >1
> > VFIO devices concurrently via multifd.
> >
> > Do you think we can start with a simpler approach?
>
> I don't think introducing a performance/scalability issue like that is
> a good thing, especially when we already have a design that avoids it.
>
> Unfortunately, my current setup does not allow live migrating VMs with
> more than 4 VFs so I can't benchmark that.
/me wonders why benchmarking it requires more than 4 VFs.
>
> But I'm almost certain that with more VFs the situation of devices being
> ready out-of-order will get even more likely.
If the config space is small, why would loading it in sequence be a
problem?
Have you measured how much time it takes to load the config space of one
of the VFs you're using? I suppose that's vfio_load_device_config_state()
alone?
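(Measuring it should be as trivial as the sketch below - the trace event
is made up, the rest is existing code:)

    int64_t t0 = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
    ret = vfio_load_device_config_state(f, opaque);
    trace_vfio_config_load_time(vbasedev->name,
                                qemu_clock_get_ns(QEMU_CLOCK_REALTIME) - t0);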
>
> > So what I'm thinking could be very clean is: we just discussed
> > MIG_CMD_SWITCHOVER and it looks like you also think it's an OK approach.
> > With it in place, I wonder why not move one step further and have
> > MIG_CMD_SEND_NON_ITERABLE, just to mark that "iterable devices are all
> > done, ready to send non-iterable". It can be controlled by the same
> > migration property so we only send these two flags in 9.2+ machine types.
> >
> > Then IIUC VFIO can send config data through the main wire (just like most
> > other pci devices! which is IMHO a good fit..) and on the destination VFIO
> > holds off loading it until passing the MIG_CMD_SEND_NON_ITERABLE phase.
>
> Starting the config load only on MIG_CMD_SEND_NON_ITERABLE would (in
> addition to the considerations above) delay starting the config load until
> all iterable devices were read/transferred/loaded, and would also
> complicate future efforts at loading that config data in parallel.
However I wonder whether we can keep it simple, in that VFIO's config
space is still always saved in vfio_save_state(). I still think it's
easier if we stick with the main channel whenever possible. For this
specific case, if the config space is small, I think it's tricky that you
bypass it with:
if (migration->multifd_transfer) {
    /* Emit dummy NOP data */
    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
    return;
}
Then squash this as the tail of the iterable data.
On the src, I think it could use a per-device semaphore, so that the
iterable save() thread will post() only when it finishes dumping all the
data; that orders VFIO iterable data vs. config space save().
On the dst, on second thought, MIG_CMD_SEND_NON_ITERABLE may not work or
be needed after all, because multifd bypasses the main channel, so sending
anything like MIG_CMD_SEND_NON_ITERABLE on the main channel won't
guarantee that all the multifd loads are complete. However, IIUC a
per-device semaphore can be used in a similar way as on the src qemu
mentioned above: it would only post() once all the iterable data of this
device has been loaded and applied to the HW; before that,
vfio_load_state() should wait() on that sem for the data to be ready
(while the multifd threads will be doing that part). I wonder whether we
may reuse the multifd recv thread in the initial version, so maybe we
don't need any other threads on the destination.
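A sketch of that ordering, for both sides (only the QemuSemaphore API is
existing code; everything else here is made up for illustration):

    /* one semaphore per VFIO device, initialized to 0, posted exactly once */
    typedef struct VFIODeviceSync {
        QemuSemaphore iter_data_done;
    } VFIODeviceSync;

    /* src: the save thread, after the last iterable chunk was queued;
     * dst: the multifd load thread, after the last buffer hit the HW */
    static void vfio_iter_data_done(VFIODeviceSync *sync)
    {
        qemu_sem_post(&sync->iter_data_done);
    }

    /* src: vfio_save_state(); dst: vfio_load_state() - both run on the
     * main channel and block until the iterable data is fully handled */
    static void vfio_wait_iter_data(VFIODeviceSync *sync)
    {
        qemu_sem_wait(&sync->iter_data_done);
    }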
The load_finish() interface is currently not able to be reused for that,
afaict. Just have a look at its definition:
/**
* @load_finish
*
* Poll whether all asynchronous device state loading had finished.
* Not called on the load failure path.
*
* Called while holding the qemu_loadvm_load_finish_ready_lock.
*
* If this method signals "not ready" then it might not be called
* again until qemu_loadvm_load_finish_ready_broadcast() is invoked
* while holding qemu_loadvm_load_finish_ready_lock.
*
* @opaque: data pointer passed to register_savevm_live()
* @is_finished: whether the loading had finished (output parameter)
* @errp: pointer to Error*, to store an error if it happens.
*
* Returns zero to indicate success and negative for error
* It's not an error that the loading still hasn't finished.
*/
int (*load_finish)(void *opaque, bool *is_finished, Error **errp);
It's overcomplicated in defining all its details:
- Not re-entrant by default.. it is so weirdly designed that the
caller needs to know which is even the "1st invocation of the
function"... It is just weird.
- Requires one more global mutex that no vmstate handler ever requested,
which I feel can perhaps be replaced by a sem (then we could drop the
condvar)?
- How qemu_loadvm_load_finish_ready_broadcast() interacts with all of the
above..
So if you really think it matters to load first whatever VFIO device
whose iterable data is ready, then let's try to come up with some better
interface.. I can try to think about it too, but please answer my
questions above so I can understand what I am missing on why that's
important. Numbers could help, even with just 4 VFs - I wonder how much
diff there can be. Mostly, I don't know why it's slow right now if it is;
I thought it should be pretty fast, at least not a concern in the VFIO
migration world (which can take seconds of downtime or more..).
IOW, it sounds more reasonable to me that, no matter whether vfio will
support multifd, it'll be nice if we stick with vfio_load_state() /
vfio_save_state() for config space, and hopefully it's also easier if it
always goes via the main channel for everyone. In these two hooks, VFIO
can do whatever it wants to sync with other things (on src, sync with the
concurrent thread pool saving iterable data and dumping things to multifd
channels; on dst, sync with the concurrent multifd loads). I think it can
remove the requirement on the load_finish() interface completely. Yes,
this can only load VFIO's pci config space one by one, but I think this
is much simpler, and I hope it's also not that slow, but I'm not sure.
Thanks,
--
Peter Xu
* Re: [PATCH v2 08/17] migration: Add load_finish handler and associated functions
2024-09-27 0:53 ` Peter Xu
@ 2024-09-30 19:25 ` Maciej S. Szmigiero
2024-09-30 21:57 ` Peter Xu
0 siblings, 1 reply; 128+ messages in thread
From: Maciej S. Szmigiero @ 2024-09-30 19:25 UTC (permalink / raw)
To: Peter Xu
Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P. Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On 27.09.2024 02:53, Peter Xu wrote:
> On Fri, Sep 27, 2024 at 12:34:31AM +0200, Maciej S. Szmigiero wrote:
>> On 20.09.2024 18:45, Peter Xu wrote:
>>> On Fri, Sep 20, 2024 at 05:23:08PM +0200, Maciej S. Szmigiero wrote:
>>>> On 19.09.2024 23:11, Peter Xu wrote:
>>>>> On Thu, Sep 19, 2024 at 09:49:10PM +0200, Maciej S. Szmigiero wrote:
>>>>>> On 9.09.2024 22:03, Peter Xu wrote:
>>>>>>> On Tue, Aug 27, 2024 at 07:54:27PM +0200, Maciej S. Szmigiero wrote:
>>>>>>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>>>>>>
>>>>>>>> load_finish SaveVMHandler allows migration code to poll whether
>>>>>>>> a device-specific asynchronous device state loading operation had finished.
>>>>>>>>
>>>>>>>> In order to avoid calling this handler needlessly the device is supposed
>>>>>>>> to notify the migration code of its possible readiness via a call to
>>>>>>>> qemu_loadvm_load_finish_ready_broadcast() while holding
>>>>>>>> qemu_loadvm_load_finish_ready_lock.
>>>>>>>>
>>>>>>>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>>>>>>>> ---
>>>>>>>> include/migration/register.h | 21 +++++++++++++++
>>>>>>>> migration/migration.c | 6 +++++
>>>>>>>> migration/migration.h | 3 +++
>>>>>>>> migration/savevm.c | 52 ++++++++++++++++++++++++++++++++++++
>>>>>>>> migration/savevm.h | 4 +++
>>>>>>>> 5 files changed, 86 insertions(+)
>>>>>>>>
>>>>>>>> diff --git a/include/migration/register.h b/include/migration/register.h
>>>>>>>> index 4a578f140713..44d8cf5192ae 100644
>>>>>>>> --- a/include/migration/register.h
>>>>>>>> +++ b/include/migration/register.h
>>>>>>>> @@ -278,6 +278,27 @@ typedef struct SaveVMHandlers {
>>>>>>>> int (*load_state_buffer)(void *opaque, char *data, size_t data_size,
>>>>>>>> Error **errp);
>>>>>>>> + /**
>>>>>>>> + * @load_finish
>>>>>>>> + *
>>>>>>>> + * Poll whether all asynchronous device state loading had finished.
>>>>>>>> + * Not called on the load failure path.
>>>>>>>> + *
>>>>>>>> + * Called while holding the qemu_loadvm_load_finish_ready_lock.
>>>>>>>> + *
>>>>>>>> + * If this method signals "not ready" then it might not be called
>>>>>>>> + * again until qemu_loadvm_load_finish_ready_broadcast() is invoked
>>>>>>>> + * while holding qemu_loadvm_load_finish_ready_lock.
>>>>>>>
>>>>>>> [1]
>>>>>>>
>>>>>>>> + *
>>>>>>>> + * @opaque: data pointer passed to register_savevm_live()
>>>>>>>> + * @is_finished: whether the loading had finished (output parameter)
>>>>>>>> + * @errp: pointer to Error*, to store an error if it happens.
>>>>>>>> + *
>>>>>>>> + * Returns zero to indicate success and negative for error
>>>>>>>> + * It's not an error that the loading still hasn't finished.
>>>>>>>> + */
>>>>>>>> + int (*load_finish)(void *opaque, bool *is_finished, Error **errp);
>>>>>>>
>>>>>>> The load_finish() semantics is a bit weird, especially above [1] on "only
>>>>>>> allowed to be called once if ..." and also on the locks.
>>>>>>
>>>>>> The point of this remark is that a driver needs to call
>>>>>> qemu_loadvm_load_finish_ready_broadcast() if it wants for the migration
>>>>>> core to call its load_finish handler again.
>>>>>>
>>>>>>> It looks to me vfio_load_finish() also does the final load of the device.
>>>>>>>
>>>>>>> I wonder whether that final load can be done in the threads,
>>>>>>
>>>>>> Here, the problem is that current VFIO VMState has to be loaded from the main
>>>>>> migration thread as it internally calls QEMU core address space modification
>>>>>> methods which explode if called from another thread(s).
>>>>>
>>>>> Ahh, I see. I'm trying to make dest qemu loadvm in a thread too and yield
>>>>> BQL if possible, when that's ready then in your case here IIUC you can
>>>>> simply take BQL in whichever thread that loads it.. but yeah it's not ready
>>>>> at least..
>>>>
>>>> Yeah, long term we might want to work on making these QEMU core address space
>>>> modification methods somehow callable from multiple threads but that's
>>>> definitely not something for the initial patch set.
>>>>
>>>>> Would it be possible vfio_save_complete_precopy_async_thread_config_state()
>>>>> be done in VFIO's save_live_complete_precopy() through the main channel
>>>>> somehow? IOW, does it rely on iterative data to be fetched first from
>>>>> kernel, or completely separate states?
>>>>
>>>> The device state data needs to be fully loaded first before "activating"
>>>> the device by loading its config state.
>>>>
>>>>> And just curious: how large is it
>>>>> normally (and I suppose this decides whether it's applicable to be sent via
>>>>> the main channel at all..)?
>>>>
>>>> Config data is *much* smaller than device state data - as far as I remember
>>>> it was on the order of kilobytes.
>>>>
>>>>>>
>>>>>>> then after
>>>>>>> everything loaded the device post a semaphore telling the main thread to
>>>>>>> continue. See e.g.:
>>>>>>>
>>>>>>> if (migrate_switchover_ack()) {
>>>>>>> qemu_loadvm_state_switchover_ack_needed(mis);
>>>>>>> }
>>>>>>>
>>>>>>> IIUC, VFIO can register load_complete_ack similarly so it only sem_post()
>>>>>>> when all things are loaded? We can then get rid of this slightly awkward
>>>>>>> interface. I had a feeling that things can be simplified (e.g., if the
>>>>>>> thread will take care of loading the final vmstate then the mutex is also
>>>>>>> not needed? etc.).
>>>>>>
>>>>>> With just a single call to switchover_ack_needed per VFIO device it would
>>>>>> need to do a blocking wait for the device buffers and config state load
>>>>>> to finish, therefore blocking other VFIO devices from potentially loading
>>>>>> their config state if they are ready to begin this operation earlier.
>>>>>
>>>>> I am not sure I get you here, loading VFIO device states (I mean, the
>>>>> non-iterable part) will need to be done sequentially IIUC due to what you
>>>>> said and should rely on BQL, so I don't know how that could happen
>>>>> concurrently for now. But I think indeed BQL is a problem.
>>>> Consider that we have two VFIO devices (A and B), with the following order
>>>> of switchover_ack_needed handler calls for them: first A gets this call,
>>>> once the call for A finishes then B gets this call.
>>>>
>>>> Now consider what happens if B had loaded all its buffers (in the loading
>>>> thread) and it is ready for its config load before A finished loading its
>>>> buffers.
>>>>
>>>> B has to wait idle in this situation (even though it could have been already
>>>> loading its config) since the switchover_ack_needed handler for A won't
>>>> return until A is fully done.
>>>
>>> This sounds like a performance concern, and I wonder how much this impacts
>>> the real workload (that you run a test and measure, with/without such
>>> concurrency) when we can save two devices in parallel anyway; I would
>>> expect the real diff is small due to the fact I mentioned that we save >1
>>> VFIO devices concurrently via multifd.
>>>
>>> Do you think we can start with a simpler approach?
>>
>> I don't think introducing a performance/scalability issue like that is
>> a good thing, especially when we already have a design that avoids it.
>>
>> Unfortunately, my current setup does not allow live migrating VMs with
>> more than 4 VFs so I can't benchmark that.
>
> /me wonders why benchmarking it requires more than 4 VFs.
My point here was that the scalability problem will most likely get more
pronounced with more VFs.
>>
>> But I'm almost certain that with more VFs the situation of devices being
>> ready out-of-order will get even more likely.
>
> If the config space is small, why would loading it in sequence be a
> problem?
>
> Have you measured how much time it takes to load the config space of one
> of the VFs you're using? I suppose that's vfio_load_device_config_state()
> alone?
It's not the amount of data to load that matters here but that these
address space operations are slow.
The whole config load takes ~70 ms per device - at 100 GBit/s (~12.5 GB/s)
that's the time equivalent of transferring ~875 MB of device state.
>>
>>> So what I'm thinking could be very clean is: we just discussed
>>> MIG_CMD_SWITCHOVER and it looks like you also think it's an OK approach.
>>> With it in place, I wonder why not move one step further and have
>>> MIG_CMD_SEND_NON_ITERABLE, just to mark that "iterable devices are all
>>> done, ready to send non-iterable". It can be controlled by the same
>>> migration property so we only send these two flags in 9.2+ machine types.
>>>
>>> Then IIUC VFIO can send config data through the main wire (just like most
>>> other pci devices! which is IMHO a good fit..) and on the destination VFIO
>>> holds off loading it until passing the MIG_CMD_SEND_NON_ITERABLE phase.
>>
>> Starting the config load only on MIG_CMD_SEND_NON_ITERABLE would (in
>> addition to the considerations above) delay starting the config load until
>> all iterable devices were read/transferred/loaded, and would also
>> complicate future efforts at loading that config data in parallel.
>
> However I wonder whether we can keep it simple, in that VFIO's config
> space is still always saved in vfio_save_state(). I still think it's
> easier if we stick with the main channel whenever possible. For this
> specific case, if the config space is small, I think it's tricky that you
> bypass it with:
>
> if (migration->multifd_transfer) {
>     /* Emit dummy NOP data */
>     qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>     return;
> }
>
> Then squash this as the tail of the iterable data.
>
> On the src, I think it could use a per-device semaphore, so that the
> iterable save() thread will post() only when it finishes dumping all the
> data; that orders VFIO iterable data vs. config space save().
In the future we want to not only transfer but also load the config data
in parallel.
So going back to transferring this data serialized via the main migration
channel would be taking a step back here.
By the way, we already have a serialization point in
qemu_savevm_state_complete_precopy_iterable() after the iterables have
been sent - waiting for the device state sending threads to finish their
work.
Whether this thread_pool_wait() operation will be implemented using
semaphores I'm not sure yet - it will depend on how well this fits the
GThreadPool internals.
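For illustration, a minimal sketch of the semaphore variant, assuming one
completion callback per work item (the names here are illustrative, not
the actual thread pool API):

    /* requires "qemu/thread.h" for the QemuSemaphore API */
    typedef struct {
        QemuSemaphore done;  /* posted once per finished work item */
        int submitted;       /* number of work items handed to the pool */
    } PoolWait;

    /* called by each pool worker when its work item completes */
    static void pool_work_done(PoolWait *pw)
    {
        qemu_sem_post(&pw->done);
    }

    /* called from the migration thread: returns once all items finished */
    static void pool_wait_all(PoolWait *pw)
    {
        for (int i = 0; i < pw->submitted; i++) {
            qemu_sem_wait(&pw->done);
        }
    }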
> On the dst, on second thought, MIG_CMD_SEND_NON_ITERABLE may not work or
> be needed after all, because multifd bypasses the main channel, so sending
> anything like MIG_CMD_SEND_NON_ITERABLE on the main channel won't
> guarantee that all the multifd loads are complete. However, IIUC a
> per-device semaphore can be used in a similar way as on the src qemu
> mentioned above: it would only post() once all the iterable data of this
> device has been loaded and applied to the HW; before that,
> vfio_load_state() should wait() on that sem for the data to be ready
> (while the multifd threads will be doing that part). I wonder whether we
> may reuse the multifd recv thread in the initial version, so maybe we
> don't need any other threads on the destination.
>
> The load_finish() interface is currently not able to be reused for that,
> afaict. Just have a look at its definition:
>
> /**
> * @load_finish
> *
> * Poll whether all asynchronous device state loading had finished.
> * Not called on the load failure path.
> *
> * Called while holding the qemu_loadvm_load_finish_ready_lock.
> *
> * If this method signals "not ready" then it might not be called
> * again until qemu_loadvm_load_finish_ready_broadcast() is invoked
> * while holding qemu_loadvm_load_finish_ready_lock.
> *
> * @opaque: data pointer passed to register_savevm_live()
> * @is_finished: whether the loading had finished (output parameter)
> * @errp: pointer to Error*, to store an error if it happens.
> *
> * Returns zero to indicate success and negative for error
> * It's not an error that the loading still hasn't finished.
> */
> int (*load_finish)(void *opaque, bool *is_finished, Error **errp);
>
> It's overcomplicated in defining all its details:
>
> - Not re-entrant by default..
What do you mean by "re-entrant" here?
This handler is called only from the single migration thread, so it cannot
be re-entered anyway, since control doesn't return to the migration
code until this handler exits (and obviously the handler won't call
itself recursively).
> it is so weirdly designed that the
> caller needs to know which is even the "1st invocation of the
> function"... It is just weird.
I don't quite understand that - why do you think the caller needs to
know whether this is the "1st invocation of the function"?
The caller only tracks whether all these handlers have reported that they
finished their work:
> bool all_ready = true;
> QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
>     bool this_ready;
>
>     ret = se->ops->load_finish(se->opaque, &this_ready, &local_err);
>     if (ret) {
>     } else if (!this_ready) {
>         all_ready = false;
>     }
>
> }
> if (all_ready) {
>     break;
> }
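And for completeness, the device side of that contract is roughly (a
sketch - the lock/unlock helpers are assumed wrappers around this series'
qemu_loadvm_load_finish_ready_lock, the flag is illustrative):

    /* the device's loading thread, once all its buffers have been loaded */
    static void vfio_load_bufs_thread_done(VFIOMigration *migration)
    {
        qemu_loadvm_load_finish_ready_lock();
        migration->load_finished = true;           /* device-local state */
        qemu_loadvm_load_finish_ready_broadcast(); /* re-poll load_finish() */
        qemu_loadvm_load_finish_ready_unlock();
    }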
> - Requires one more global mutex that no vmstate handler ever requested,
Could you elaborate on what you mean by "that no vmstate handler ever requested"?
> which I feel can perhaps be replaced by a sem (then we could drop the
> condvar)?
Once we have the ability to load device config state outside the main
migration thread, replacing the "load_finish" handler with a semaphore
should indeed be possible (it's an internal migration API, so there should
be no issue removing it once it's no longer necessary).
But for now, the devices need the ability to run their config load
code on the main migration thread, and for that they need to be called
from this "load_finish" handler.
> - How qemu_loadvm_load_finish_ready_broadcast() interacts with all of the
> above..
>
> So if you really think it matters to load first whatever VFIO device
> whose iterable data is ready, then let's try to come up with some better
> interface.. I can try to think about it too, but please answer my
> questions above so I can understand what I am missing on why that's
> important. Numbers could help, even with just 4 VFs - I wonder how much
> diff there can be. Mostly, I don't know why it's slow right now if it is;
> I thought it should be pretty fast, at least not a concern in the VFIO
> migration world (which can take seconds of downtime or more..).
>
> IOW, it sounds more reasonable to me that, no matter whether vfio will
> support multifd, it'll be nice if we stick with vfio_load_state() /
> vfio_save_state() for config space, and hopefully it's also easier if it
> always goes via the main channel for everyone. In these two hooks, VFIO
> can do whatever it wants to sync with other things (on src, sync with the
> concurrent thread pool saving iterable data and dumping things to multifd
> channels; on dst, sync with the concurrent multifd loads). I think it can
> remove the requirement on the load_finish() interface completely. Yes,
> this can only load VFIO's pci config space one by one, but I think this
> is much simpler, and I hope it's also not that slow, but I'm not sure.
To be clear, I made the following diagram describing how the patch set
is supposed to work right now, including changing the per-device
VFIO_MIG_FLAG_DEV_DATA_STATE_COMPLETE into a common MIG_CMD_SWITCHOVER.
Time flows on it left to right (->).
----------- DIAGRAM START -----------
Source overall flow:
  Main channel:     live VM phase data -> MIG_CMD_SWITCHOVER -> iterable -> non-iterable
  Multifd channels:  \ multifd device state read and queue (1) -> multifd config data read and queue (1) /

Target overall flow:
  Main channel:     live VM phase data -> MIG_CMD_SWITCHOVER -> iterable -> non-iterable -> config data load operations
  Multifd channels:  \ multifd device state (1) -> multifd config data read (1)

Target config data load operations flow:
  multifd config data read (1) -> config data load (2)

Notes:
(1): per-device threads running in parallel
(2): currently serialized (only one such operation running at a particular
     time), will hopefully be parallelized in the future
----------- DIAGRAM END -----------
Hope the diagram survived being pasted into an e-mail message.
One can see that even now there's a bit of "low-hanging fruit" of missed
possible parallelism:
it seems that the source could wait for the multifd device state + multifd
config data *after* the non-iterables are sent, rather than before as is
done currently - so they would be sent in parallel with the multifd data.
Since a written description is often prone to misunderstanding,
could you please annotate that diagram with your proposed new flow?
Thanks,
Maciej
* Re: [PATCH v2 08/17] migration: Add load_finish handler and associated functions
2024-09-30 19:25 ` Maciej S. Szmigiero
@ 2024-09-30 21:57 ` Peter Xu
2024-10-01 20:41 ` Maciej S. Szmigiero
0 siblings, 1 reply; 128+ messages in thread
From: Peter Xu @ 2024-09-30 21:57 UTC (permalink / raw)
To: Maciej S. Szmigiero
Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P. Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On Mon, Sep 30, 2024 at 09:25:54PM +0200, Maciej S. Szmigiero wrote:
> On 27.09.2024 02:53, Peter Xu wrote:
> > On Fri, Sep 27, 2024 at 12:34:31AM +0200, Maciej S. Szmigiero wrote:
> > > On 20.09.2024 18:45, Peter Xu wrote:
> > > > On Fri, Sep 20, 2024 at 05:23:08PM +0200, Maciej S. Szmigiero wrote:
> > > > > On 19.09.2024 23:11, Peter Xu wrote:
> > > > > > On Thu, Sep 19, 2024 at 09:49:10PM +0200, Maciej S. Szmigiero wrote:
> > > > > > > On 9.09.2024 22:03, Peter Xu wrote:
> > > > > > > > On Tue, Aug 27, 2024 at 07:54:27PM +0200, Maciej S. Szmigiero wrote:
> > > > > > > > > From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> > > > > > > > >
> > > > > > > > > load_finish SaveVMHandler allows migration code to poll whether
> > > > > > > > > a device-specific asynchronous device state loading operation had finished.
> > > > > > > > >
> > > > > > > > > In order to avoid calling this handler needlessly the device is supposed
> > > > > > > > > to notify the migration code of its possible readiness via a call to
> > > > > > > > > qemu_loadvm_load_finish_ready_broadcast() while holding
> > > > > > > > > qemu_loadvm_load_finish_ready_lock.
> > > > > > > > >
> > > > > > > > > Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> > > > > > > > > ---
> > > > > > > > > include/migration/register.h | 21 +++++++++++++++
> > > > > > > > > migration/migration.c | 6 +++++
> > > > > > > > > migration/migration.h | 3 +++
> > > > > > > > > migration/savevm.c | 52 ++++++++++++++++++++++++++++++++++++
> > > > > > > > > migration/savevm.h | 4 +++
> > > > > > > > > 5 files changed, 86 insertions(+)
> > > > > > > > >
> > > > > > > > > diff --git a/include/migration/register.h b/include/migration/register.h
> > > > > > > > > index 4a578f140713..44d8cf5192ae 100644
> > > > > > > > > --- a/include/migration/register.h
> > > > > > > > > +++ b/include/migration/register.h
> > > > > > > > > @@ -278,6 +278,27 @@ typedef struct SaveVMHandlers {
> > > > > > > > > int (*load_state_buffer)(void *opaque, char *data, size_t data_size,
> > > > > > > > > Error **errp);
> > > > > > > > > + /**
> > > > > > > > > + * @load_finish
> > > > > > > > > + *
> > > > > > > > > + * Poll whether all asynchronous device state loading had finished.
> > > > > > > > > + * Not called on the load failure path.
> > > > > > > > > + *
> > > > > > > > > + * Called while holding the qemu_loadvm_load_finish_ready_lock.
> > > > > > > > > + *
> > > > > > > > > + * If this method signals "not ready" then it might not be called
> > > > > > > > > + * again until qemu_loadvm_load_finish_ready_broadcast() is invoked
> > > > > > > > > + * while holding qemu_loadvm_load_finish_ready_lock.
> > > > > > > >
> > > > > > > > [1]
> > > > > > > >
> > > > > > > > > + *
> > > > > > > > > + * @opaque: data pointer passed to register_savevm_live()
> > > > > > > > > + * @is_finished: whether the loading had finished (output parameter)
> > > > > > > > > + * @errp: pointer to Error*, to store an error if it happens.
> > > > > > > > > + *
> > > > > > > > > + * Returns zero to indicate success and negative for error
> > > > > > > > > + * It's not an error that the loading still hasn't finished.
> > > > > > > > > + */
> > > > > > > > > + int (*load_finish)(void *opaque, bool *is_finished, Error **errp);
> > > > > > > >
> > > > > > > > The load_finish() semantics is a bit weird, especially above [1] on "only
> > > > > > > > allowed to be called once if ..." and also on the locks.
> > > > > > >
> > > > > > > The point of this remark is that a driver needs to call
> > > > > > > qemu_loadvm_load_finish_ready_broadcast() if it wants for the migration
> > > > > > > core to call its load_finish handler again.
> > > > > > >
> > > > > > > > It looks to me vfio_load_finish() also does the final load of the device.
> > > > > > > >
> > > > > > > > I wonder whether that final load can be done in the threads,
> > > > > > >
> > > > > > > Here, the problem is that current VFIO VMState has to be loaded from the main
> > > > > > > migration thread as it internally calls QEMU core address space modification
> > > > > > > methods which explode if called from another thread(s).
> > > > > >
> > > > > > Ahh, I see. I'm trying to make dest qemu loadvm in a thread too and yield
> > > > > > BQL if possible, when that's ready then in your case here IIUC you can
> > > > > > simply take BQL in whichever thread that loads it.. but yeah it's not ready
> > > > > > at least..
> > > > >
> > > > > Yeah, long term we might want to work on making these QEMU core address space
> > > > > modification methods somehow callable from multiple threads but that's
> > > > > definitely not something for the initial patch set.
> > > > >
> > > > > > Would it be possible vfio_save_complete_precopy_async_thread_config_state()
> > > > > > be done in VFIO's save_live_complete_precopy() through the main channel
> > > > > > somehow? IOW, does it rely on iterative data to be fetched first from
> > > > > > kernel, or completely separate states?
> > > > >
> > > > > The device state data needs to be fully loaded first before "activating"
> > > > > the device by loading its config state.
> > > > >
> > > > > > And just curious: how large is it
> > > > > > normally (and I suppose this decides whether it's applicable to be sent via
> > > > > > the main channel at all..)?
> > > > >
> > > > > Config data is *much* smaller than device state data - as far as I remember
> > > > > it was on order of kilobytes.
> > > > >
> > > > > > >
> > > > > > > > then after
> > > > > > > > everything loaded the device post a semaphore telling the main thread to
> > > > > > > > continue. See e.g.:
> > > > > > > >
> > > > > > > >     if (migrate_switchover_ack()) {
> > > > > > > >         qemu_loadvm_state_switchover_ack_needed(mis);
> > > > > > > >     }
> > > > > > > >
> > > > > > > > IIUC, VFIO can register load_complete_ack similarly so it only sem_post()
> > > > > > > > when all things are loaded? We can then get rid of this slightly awkward
> > > > > > > > interface. I had a feeling that things can be simplified (e.g., if the
> > > > > > > > thread will take care of loading the final vmstate then the mutex is also
> > > > > > > > not needed? etc.).
> > > > > > >
> > > > > > > With just a single call to switchover_ack_needed per VFIO device it would
> > > > > > > need to do a blocking wait for the device buffers and config state load
> > > > > > > to finish, therefore blocking other VFIO devices from potentially loading
> > > > > > > their config state if they are ready to begin this operation earlier.
> > > > > >
> > > > > > I am not sure I get you here, loading VFIO device states (I mean, the
> > > > > > non-iterable part) will need to be done sequentially IIUC due to what you
> > > > > > said and should rely on BQL, so I don't know how that could happen
> > > > > > concurrently for now. But I think indeed BQL is a problem.
> > > > > Consider that we have two VFIO devices (A and B), with the following order
> > > > > of switchover_ack_needed handler calls for them: first A gets this call,
> > > > > once the call for A finishes then B gets this call.
> > > > >
> > > > > Now consider what happens if B had loaded all its buffers (in the loading
> > > > > thread) and it is ready for its config load before A finished loading its
> > > > > buffers.
> > > > >
> > > > > B has to wait idle in this situation (even though it could have been already
> > > > > loading its config) since the switchover_ack_needed handler for A won't
> > > > > return until A is fully done.
> > > >
> > > > This sounds like a performance concern, and I wonder how much this impacts
> > > > the real workload (that you run a test and measure, with/without such
> > > > concurrency) when we can save two devices in parallel anyway; I would
> > > > expect the real diff is small due to the fact I mentioned that we save >1
> > > > VFIO devices concurrently via multifd.
> > > >
> > > > Do you think we can start with a simpler approach?
> > >
> > > I don't think introducing a performance/scalability issue like that is
> > > a good thing, especially that we already have a design that avoids it.
> > >
> > > Unfortunately, my current setup does not allow live migrating VMs with
> > > more than 4 VFs so I can't benchmark that.
> >
> > /me wonders why benchmarking it requires more than 4 VFs.
>
> My point here was that the scalability problem will most likely get more
> pronounced with more VFs.
>
> > >
> > > But I'm almost certain that with more VFs the situation with devices being
> > > ready out-of-order will get even more likely.
> >
> > If the config space is small, why loading it in sequence would be a
> > problem?
> >
> > Have you measured how much time it needs to load one VF's config space that
> > you're using? I suppose that's vfio_load_device_config_state() alone?
>
> It's not the amount of data to load that matters here but that these address
> space operations are slow.
>
> The whole config load takes ~70 ms per device - that's time equivalent
> of transferring 875 MiB of device state via a 100 GBit/s link.
What's the downtime of migration with 1/2/4 VFs? I remember I saw some
data somewhere but it's not in the cover letter. It'll be good to mention
these results in the cover letter when you repost.
I'm guessing 70ms isn't a huge deal here, if your NIC has 128GB internal
device state to migrate.. but maybe I'm wrong.
I also wonder whether you profiled a bit on how that 70ms contributes to
what is slow.
>
> > >
> > > > So what I'm thinking could be very clean is: we just discussed
> > > > MIG_CMD_SWITCHOVER and it looks like you also think it's an OK approach.
> > > > With it in place, I wonder why not move one step further and have
> > > > MIG_CMD_SEND_NON_ITERABLE just to mark that "iterable devices all done,
> > > > ready to send non-iterable". It can be controlled by the same migration
> > > > property so we only send these two flags in 9.2+ machine types.
> > > >
> > > > Then IIUC VFIO can send config data through main wire (just like most of
> > > > other pci devices! which is IMHO a good fit..) and on destination VFIO
> > > > holds off loading them until passing the MIG_CMD_SEND_NON_ITERABLE phase.
> > >
> > > Starting the config load only on MIG_CMD_SEND_NON_ITERABLE would (in addition
> > > to the considerations above) also delay starting the config load until all
> > > iterable devices were read/transferred/loaded and also would complicate
> > > future efforts at loading that config data in parallel.
> >
> > However I wonder whether we can keep it simple in that VFIO's config space
> > is still always saved in vfio_save_state(). I still think it's easier we
> > stick with the main channel whenever possible. For this specific case, if
> > the config space is small I think it's tricky you bypass this with:
> >
> >     if (migration->multifd_transfer) {
> >         /* Emit dummy NOP data */
> >         qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> >         return;
> >     }
> >
> > Then squash this as the tail of the iterable data.
> >
> > On the src, I think it could use a per-device semaphore, so that iterable
> > save() thread will post() only if it finishes dumping all the data, then
> > that orders VFIO iterable data v.s. config space save().
>
> In the future we want to not only transfer but also load the config data
> in parallel.
How feasible do you think this idea is? E.g. does it involve BQL so far
(e.g. memory updates, others)? What's still missing to make it concurrent?
>
> So going back to transferring this data serialized via the main migration
> channel would be taking a step back here.
If below holds true:
- 70ms is still a very small amount of the total downtime, and,
- this can avoid the below load_finish() API
Then I'd go for it.. or again, at least the load_finish() needs change,
IMHO..
>
> By the way, we already have a serialization point in
> qemu_savevm_state_complete_precopy_iterable() after iterables have been sent -
> waiting for device state sending threads to finish their work.
>
> Whether this thread_pool_wait() operation will be implemented using
> semaphores I'm not sure yet - will depend on how well this will fit other
> GThreadPool internals.
>
> > On the dst, after a 2nd thought, MIG_CMD_SEND_NON_ITERABLE may not work or
> > even be needed, because multifd bypasses the main channel, so if we send
> > anything like MIG_CMD_SEND_NON_ITERABLE on the main channel it won't
> > guarantee that all multifd loads are complete. However IIUC that can be used in a
> > similar way as the src qemu I mentioned above with a per-device semaphore,
> > so that only all the iterable data of this device loaded and applied to the
> > HW would it post(), before that, vfio_load_state() should wait() on that
> > sem waiting for data to ready (while multifd threads will be doing that
> > part). I wonder whether we may reuse the multifd recv thread in the
> > initial version, so maybe we don't need any other threads on destination.
> >
> > The load_finish() interface is currently not able to be reused right,
> > afaict. Just have a look at its definition:
> >
> > /**
> > * @load_finish
> > *
> > * Poll whether all asynchronous device state loading has finished.
> > * Not called on the load failure path.
> > *
> > * Called while holding the qemu_loadvm_load_finish_ready_lock.
> > *
> > * If this method signals "not ready" then it might not be called
> > * again until qemu_loadvm_load_finish_ready_broadcast() is invoked
> > * while holding qemu_loadvm_load_finish_ready_lock.
> > *
> > * @opaque: data pointer passed to register_savevm_live()
> > * @is_finished: whether the loading had finished (output parameter)
> > * @errp: pointer to Error*, to store an error if it happens.
> > *
> > * Returns zero to indicate success and negative for error.
> > * It's not an error that the loading still hasn't finished.
> > */
> > int (*load_finish)(void *opaque, bool *is_finished, Error **errp);
> >
> > It's over complicated on defining all its details:
> >
> > - Not re-entrant by default..
>
> What do you mean by "re-entrant" here?
>
> This handler is called only from single migration thread, so it cannot
> be re-entered anyway since the control doesn't return to the migration
> code until this handler exits (and obviously the handler won't call
> itself recursively).
I think it's not a good design to say "you can call this function once, but
not the 2nd time until you wait on a semaphore".
>
> > this is so weirdly designed that the
> > caller even needs to know which is the "1st invocation of the
> > function"... It is just weird.
>
> I don't quite understand that - why do you think that caller needs to
> know whether this is the "1st invocation of the function"?
>
> The caller only tracks whether all these handlers reported that they
> finished their work:
> >     bool all_ready = true;
> >     QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> >         bool this_ready;
> >
> >         ret = se->ops->load_finish(se->opaque, &this_ready, &local_err);
> >         if (ret) {
> >         } else if (!this_ready) {
> >             all_ready = false;
> >         }
> >
> >     }
> >     if (all_ready) {
> >         break;
> >     }
>
>
> > - Requires one more global mutex that non vmstate handler ever requested,
>
> Could you elaborate on what you mean by "that non vmstate handler ever requested"?
I meant that no historical vmstate handler hook function requires such
complicated locking to work.
>
> > that I feel like perhaps can be replaced by a sem (then to drop the
> > condvar)?
>
> Once we have ability to load device config state outside main migration
> thread replacing "load_finish" handler with a semaphore should indeed be
> possible (that's internal migration API so there should be no issue
> removing it as not necessary anymore at this point).
>
> But for now, the devices need to have ability to run their config load
> code on the main migration thread, and for that they need to be called
> from this handler "load_finish".
A sem seems a must here to notify that the iterable data finished loading,
but that doesn't need to hook into the vmstate handler; it can be some
post-process tasks, like what we do around cpu_synchronize_all_post_init()
time.

If the per-device vmstate handler hook version of load_finish() is destined
to look this weird, I'd rather consider a totally separate way to enqueue
jobs that need to be run after all vmstates are loaded. Then after one
VFIO device fully loads its data, it enqueues the task and post()s to one
migration sem saying "there's one post-process task, please run it in the
migration thread". There can be a total number of tasks registered so that
the migration thread knows not to continue until that number of tasks is
processed. That counter can be part of the vmstate handler, maybe,
reporting that "this vmstate handler has one post-process task".
Maybe you have other ideas, but please no, let's avoid this load_finish()
thing..
I'd rather still see justification showing that saving this 70ms really is
worth it.. I'd rather we accept +70ms*Ndev downtime but drop this hook
until we have a clearer idea of how all config space can be loaded
concurrently, for example. So we start simple.
>
> > - How qemu_loadvm_load_finish_ready_broadcast() interacts with all
> > above..
> >
> > So if you really think it matters to load whatever VFIO device whose
> > iterable data is ready first, then let's try come up with some better
> > interface.. I can try to think about it too, but please answer me
> > questions above so I can understand what I am missing on why that's
> > important. Numbers could help, even with 4 VFs, and I wonder how much diff
> > there can be. Mostly, I don't know why it's slow right now if it is; I
> > thought it should be pretty fast, at least not a concern in VFIO migration
> > world (which can take seconds of downtime or more..).
> >
> > IOW, it sounds more reasonable to me that no matter whether vfio will
> > support multifd, it'll be nice if we stick with vfio_load_state() /
> > vfio_save_state() for config space, and hopefully it's also easier if it
> > always goes via the main channel for everyone. In these two hooks, VFIO can
> > do whatever it wants to sync with other things (on src, sync with
> > concurrent thread pool saving iterable data and dumping things to multifd
> > channels; on dst, sync with multifd concurrent loads). I think it can
> > remove the requirement on the load_finish() interface completely. Yes,
> > this can only load VFIO's pci config space one by one, but I think this is
> > much simpler, and I hope it's also not that slow, but I'm not sure.
>
> To be clear, I made the following diagram describing how the patch set
> is supposed to work right now, including changing per-device
> VFIO_MIG_FLAG_DEV_DATA_STATE_COMPLETE into a common MIG_CMD_SWITCHOVER.
>
> Time flows on it left to right (->).
>
> ----------- DIAGRAM START -----------
> Source overall flow:
> Main channel: live VM phase data -> MIG_CMD_SWITCHOVER -> iterable -> non iterable
> Multifd channels: \ multifd device state read and queue (1) -> multifd config data read and queue (1) /
>
> Target overall flow:
> Main channel: live VM phase data -> MIG_CMD_SWITCHOVER -> iterable -> non iterable -> config data load operations
> Multifd channels: \ multifd device state (1) -> multifd config data read (1)
>
> Target config data load operations flow:
> multifd config data read (1) -> config data load (2)
>
> Notes:
> (1): per device threads running in parallel
Here I raised this question before, but I'll ask again: do you think we can
avoid using a separate thread on dest qemu, but reuse multifd recv threads?
Src probably needs its own threads because multifd sender threads take
requests, so they can't block on their own.
However dest qemu isn't like that, it's packet driven so I think maybe it's
ok VFIO directly loads the data in the multifd threads. We may want to
have enough multifd threads to make sure IO still doesn't block much on the
NIC, but I think tuning the num of multifd threads should work in this
case.
> (2): currently serialized (only one such operation running at a particular time), will hopefully be parallelized in the future
> ----------- DIAGRAM END -----------
>
> Hope the diagram survived being pasted into an e-mail message.
>
> One can see that even now there's a bit of "low hanging fruit" of missing
> possible parallelism:
> It seems that the source could wait for multifd device state + multifd config
> data *after* non-iterables are sent rather than before as it is done
> currently - so they will be sent in parallel with multifd data.
Currently it's blocked by this chunk of code of yours:
    if (multifd_device_state) {
        ret = multifd_join_device_state_save_threads();
        if (ret) {
            qemu_file_set_error(f, ret);
            return -1;
        }
    }
If, with your proposal, the vfio config space is sent via multifd channels,
then indeed I don't see why it can't be moved to be after non-iterable save()
completes. Is that what you implied as "low hanging fruit"?
[***]
>
> Since written description is often prone to misunderstanding
> could you please annotate that diagram with your proposed new flow?
What I was suggesting (removing load_finish()) is mostly the same as what
you drew I think, especially on src:
===============
Source overall flow:
Main channel: live VM phase data -> MIG_CMD_SWITCHOVER -> iterable -> non iterable
Multifd channels: \ multifd device state read and queue (1) -> multifd config data read and queue (1) /
===============
In this case we can't do the optimization above [***], since what I
suggested requires VFIO's vfio_save_state() to dump the config space, so
the original order will be needed here.
While on dest, config data load will need to also load using vfio's
vfio_load_state() so it'll be invoked just like what we normally do with
non-iterable device states (so here "config data load operations" is part
of loading all non-iterable devices):
===============
Target overall flow: (X)
Main channel: live VM phase data -> MIG_CMD_SWITCHOVER -> iterable -> non iterable (multifd config data read -> config data load operations)
Multifd channels: \ multifd device state load /
(lower part done via multifd recv threads, not separate threads)
===============
So here the ordering of (X) is not guarded by anything, however in
vfio_load_state() the device can sem_wait() on a semaphore that is only
posted once this device's device state is fully loaded. So it's not
completely serialized - "config data load operations" of DEV1 can still
happen concurrently with "multifd device state load" of DEV2.
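As a rough sketch of that dest side (the multifd_bufs_loaded semaphore is a
made-up name, and I'm only approximating the real
vfio_load_device_config_state() signature):

    /* Posted by whoever applied the last multifd device state buffer: */
    static void vfio_multifd_all_bufs_loaded(VFIODevice *vbasedev)
    {
        qemu_sem_post(&vbasedev->migration->multifd_bufs_loaded);
    }

    /* In vfio_load_state(), right before the config data part: */
    static int vfio_load_config_in_order(VFIODevice *vbasedev, QEMUFile *f)
    {
        /* blocks only this device; other devices' buffers keep loading */
        qemu_sem_wait(&vbasedev->migration->multifd_bufs_loaded);
        return vfio_load_device_config_state(f, vbasedev);
    }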
Sorry, this might not be that clear since it's not easy to draw in the graph,
but I hope the words can help clarify what I meant.
If 70ms is not a major deal, I suggest we consider the above approach; I think
it can simplify at least the vmstate handler API. If 70ms matters, let's
try to refactor load_finish() into something usable.
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [PATCH v2 08/17] migration: Add load_finish handler and associated functions
2024-09-30 21:57 ` Peter Xu
@ 2024-10-01 20:41 ` Maciej S. Szmigiero
2024-10-01 21:30 ` Peter Xu
0 siblings, 1 reply; 128+ messages in thread
From: Maciej S. Szmigiero @ 2024-10-01 20:41 UTC (permalink / raw)
To: Peter Xu
Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P. Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On 30.09.2024 23:57, Peter Xu wrote:
> On Mon, Sep 30, 2024 at 09:25:54PM +0200, Maciej S. Szmigiero wrote:
>> On 27.09.2024 02:53, Peter Xu wrote:
>>> On Fri, Sep 27, 2024 at 12:34:31AM +0200, Maciej S. Szmigiero wrote:
>>>> On 20.09.2024 18:45, Peter Xu wrote:
>>>>> On Fri, Sep 20, 2024 at 05:23:08PM +0200, Maciej S. Szmigiero wrote:
>>>>>> On 19.09.2024 23:11, Peter Xu wrote:
>>>>>>> On Thu, Sep 19, 2024 at 09:49:10PM +0200, Maciej S. Szmigiero wrote:
>>>>>>>> On 9.09.2024 22:03, Peter Xu wrote:
>>>>>>>>> On Tue, Aug 27, 2024 at 07:54:27PM +0200, Maciej S. Szmigiero wrote:
>>>>>>>>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>>>>>>>>
>>>>>>>>>> load_finish SaveVMHandler allows migration code to poll whether
>>>>>>>>>> a device-specific asynchronous device state loading operation has finished.
>>>>>>>>>>
>>>>>>>>>> In order to avoid calling this handler needlessly the device is supposed
>>>>>>>>>> to notify the migration code of its possible readiness via a call to
>>>>>>>>>> qemu_loadvm_load_finish_ready_broadcast() while holding
>>>>>>>>>> qemu_loadvm_load_finish_ready_lock.
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>>>>>>>>>> ---
>>>>>>>>>> include/migration/register.h | 21 +++++++++++++++
>>>>>>>>>> migration/migration.c | 6 +++++
>>>>>>>>>> migration/migration.h | 3 +++
>>>>>>>>>> migration/savevm.c | 52 ++++++++++++++++++++++++++++++++++++
>>>>>>>>>> migration/savevm.h | 4 +++
>>>>>>>>>> 5 files changed, 86 insertions(+)
>>>>>>>>>>
>>>>>>>>>> diff --git a/include/migration/register.h b/include/migration/register.h
>>>>>>>>>> index 4a578f140713..44d8cf5192ae 100644
>>>>>>>>>> --- a/include/migration/register.h
>>>>>>>>>> +++ b/include/migration/register.h
>>>>>>>>>> @@ -278,6 +278,27 @@ typedef struct SaveVMHandlers {
>>>>>>>>>> int (*load_state_buffer)(void *opaque, char *data, size_t data_size,
>>>>>>>>>>                          Error **errp);
>>>>>>>>>> + /**
>>>>>>>>>> + * @load_finish
>>>>>>>>>> + *
>>>>>>>>>> + * Poll whether all asynchronous device state loading has finished.
>>>>>>>>>> + * Not called on the load failure path.
>>>>>>>>>> + *
>>>>>>>>>> + * Called while holding the qemu_loadvm_load_finish_ready_lock.
>>>>>>>>>> + *
>>>>>>>>>> + * If this method signals "not ready" then it might not be called
>>>>>>>>>> + * again until qemu_loadvm_load_finish_ready_broadcast() is invoked
>>>>>>>>>> + * while holding qemu_loadvm_load_finish_ready_lock.
>>>>>>>>>
>>>>>>>>> [1]
>>>>>>>>>
>>>>>>>>>> + *
>>>>>>>>>> + * @opaque: data pointer passed to register_savevm_live()
>>>>>>>>>> + * @is_finished: whether the loading had finished (output parameter)
>>>>>>>>>> + * @errp: pointer to Error*, to store an error if it happens.
>>>>>>>>>> + *
>>>>>>>>>> + * Returns zero to indicate success and negative for error.
>>>>>>>>>> + * It's not an error that the loading still hasn't finished.
>>>>>>>>>> + */
>>>>>>>>>> + int (*load_finish)(void *opaque, bool *is_finished, Error **errp);
>>>>>>>>>
>>>>>>>>> The load_finish() semantics is a bit weird, especially above [1] on "only
>>>>>>>>> allowed to be called once if ..." and also on the locks.
>>>>>>>>
>>>>>>>> The point of this remark is that a driver needs to call
>>>>>>>> qemu_loadvm_load_finish_ready_broadcast() if it wants the migration
>>>>>>>> core to call its load_finish handler again.
>>>>>>>>
>>>>>>>>> It looks to me vfio_load_finish() also does the final load of the device.
>>>>>>>>>
>>>>>>>>> I wonder whether that final load can be done in the threads,
>>>>>>>>
>>>>>>>> Here, the problem is that current VFIO VMState has to be loaded from the main
>>>>>>>> migration thread as it internally calls QEMU core address space modification
>>>>>>>> methods which explode if called from another thread(s).
>>>>>>>
>>>>>>> Ahh, I see. I'm trying to make dest qemu loadvm in a thread too and yield
>>>>>>> BQL if possible, when that's ready then in your case here IIUC you can
>>>>>>> simply take BQL in whichever thread that loads it.. but yeah it's not ready
>>>>>>> at least..
>>>>>>
>>>>>> Yeah, long term we might want to work on making these QEMU core address space
>>>>>> modification methods somehow callable from multiple threads but that's
>>>>>> definitely not something for the initial patch set.
>>>>>>
>>>>>>> Would it be possible vfio_save_complete_precopy_async_thread_config_state()
>>>>>>> be done in VFIO's save_live_complete_precopy() through the main channel
>>>>>>> somehow? IOW, does it rely on iterative data to be fetched first from
>>>>>>> kernel, or completely separate states?
>>>>>>
>>>>>> The device state data needs to be fully loaded first before "activating"
>>>>>> the device by loading its config state.
>>>>>>
>>>>>>> And just curious: how large is it
>>>>>>> normally (and I suppose this decides whether it's applicable to be sent via
>>>>>>> the main channel at all..)?
>>>>>>
>>>>>> Config data is *much* smaller than device state data - as far as I remember
>>>>>> it was on order of kilobytes.
>>>>>>
>>>>>>>>
>>>>>>>>> then after
>>>>>>>>> everything loaded the device post a semaphore telling the main thread to
>>>>>>>>> continue. See e.g.:
>>>>>>>>>
>>>>>>>>>>     if (migrate_switchover_ack()) {
>>>>>>>>>>         qemu_loadvm_state_switchover_ack_needed(mis);
>>>>>>>>>>     }
>>>>>>>>>
>>>>>>>>> IIUC, VFIO can register load_complete_ack similarly so it only sem_post()
>>>>>>>>> when all things are loaded? We can then get rid of this slightly awkward
>>>>>>>>> interface. I had a feeling that things can be simplified (e.g., if the
>>>>>>>>> thread will take care of loading the final vmstate then the mutex is also
>>>>>>>>> not needed? etc.).
>>>>>>>>
>>>>>>>> With just a single call to switchover_ack_needed per VFIO device it would
>>>>>>>> need to do a blocking wait for the device buffers and config state load
>>>>>>>> to finish, therefore blocking other VFIO devices from potentially loading
>>>>>>>> their config state if they are ready to begin this operation earlier.
>>>>>>>
>>>>>>> I am not sure I get you here, loading VFIO device states (I mean, the
>>>>>>> non-iterable part) will need to be done sequentially IIUC due to what you
>>>>>>> said and should rely on BQL, so I don't know how that could happen
>>>>>>> concurrently for now. But I think indeed BQL is a problem.
>>>>>> Consider that we have two VFIO devices (A and B), with the following order
>>>>>> of switchover_ack_needed handler calls for them: first A gets this call,
>>>>>> once the call for A finishes then B gets this call.
>>>>>>
>>>>>> Now consider what happens if B had loaded all its buffers (in the loading
>>>>>> thread) and it is ready for its config load before A finished loading its
>>>>>> buffers.
>>>>>>
>>>>>> B has to wait idle in this situation (even though it could have been already
>>>>>> loading its config) since the switchover_ack_needed handler for A won't
>>>>>> return until A is fully done.
>>>>>
>>>>> This sounds like a performance concern, and I wonder how much this impacts
>>>>> the real workload (that you run a test and measure, with/without such
>>>>> concurrency) when we can save two devices in parallel anyway; I would
>>>>> expect the real diff is small due to the fact I mentioned that we save >1
>>>>> VFIO devices concurrently via multifd.
>>>>>
>>>>> Do you think we can start with a simpler approach?
>>>>
>>>> I don't think introducing a performance/scalability issue like that is
>>>> a good thing, especially that we already have a design that avoids it.
>>>>
>>>> Unfortunately, my current setup does not allow live migrating VMs with
>>>> more than 4 VFs so I can't benchmark that.
>>>
>>> /me wonders why benchmarking it requires more than 4 VFs.
>>
>> My point here was that the scalability problem will most likely get more
>> pronounced with more VFs.
>>
>>>>
>>>> But I'm almost certain that with more VFs the situation with devices being
>>>> ready out-of-order will get even more likely.
>>>
>>> If the config space is small, why loading it in sequence would be a
>>> problem?
>>>
>>> Have you measured how much time it needs to load one VF's config space that
>>> you're using? I suppose that's vfio_load_device_config_state() alone?
>>
>> It's not the amount of data to load that matters here but that these address
>> space operations are slow.
>>
>> The whole config load takes ~70 ms per device - that's time equivalent
>> of transferring 875 MiB of device state via a 100 GBit/s link.
>
> What's the downtime of migration with 1/2/4 VFs? I remember I saw some
> data somewhere but it's not in the cover letter. It'll be good to mention
> these results in the cover letter when you repost.
Downtimes with the device state transfer being disabled / enabled:
              4 VFs     2 VFs     1 VF
Disabled:   1783 ms    614 ms   283 ms
Enabled:    1068 ms    434 ms   274 ms
Will add these numbers to the cover letter of the next patch set version.
> I'm guessing 70ms isn't a huge deal here, if your NIC has 128GB internal
> device state to migrate.. but maybe I'm wrong.
It's ~100 MiB of device state per VF here.
And it's 70ms of downtime *per device*:
so with 4 VFs it's ~280ms of downtime taken by the config loads.
That's a lot - with perfect parallelization this downtime should
*reduce by* 210ms.
> I also wonder whether you profiled a bit on how that 70ms contributes to
> what is slow.
I think that's something we can do after we have parallel config loads,
if it turns out their downtime for some reason still scales strongly
linearly with the number of VFIO devices (rather than taking roughly
constant time regardless of the count of these devices if running perfectly
in parallel).
>>
>>>>
>>>>> So what I'm thinking could be very clean is: we just discussed
>>>>> MIG_CMD_SWITCHOVER and it looks like you also think it's an OK approach.
>>>>> With it in place, I wonder why not move one step further and have
>>>>> MIG_CMD_SEND_NON_ITERABLE just to mark that "iterable devices all done,
>>>>> ready to send non-iterable". It can be controlled by the same migration
>>>>> property so we only send these two flags in 9.2+ machine types.
>>>>>
>>>>> Then IIUC VFIO can send config data through main wire (just like most of
>>>>> other pci devices! which is IMHO a good fit..) and on destination VFIO
>>>>> holds off loading them until passing the MIG_CMD_SEND_NON_ITERABLE phase.
>>>>
>>>> Starting the config load only on MIG_CMD_SEND_NON_ITERABLE would (in addition
>>>> to the considerations above) also delay starting the config load until all
>>>> iterable devices were read/transferred/loaded and also would complicate
>>>> future efforts at loading that config data in parallel.
>>>
>>> However I wonder whether we can keep it simple in that VFIO's config space
>>> is still always saved in vfio_save_state(). I still think it's easier we
>>> stick with the main channel whenever possible. For this specific case, if
>>> the config space is small I think it's tricky you bypass this with:
>>>
>>>     if (migration->multifd_transfer) {
>>>         /* Emit dummy NOP data */
>>>         qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>>>         return;
>>>     }
>>>
>>> Then squash this as the tail of the iterable data.
>>>
>>> On the src, I think it could use a per-device semaphore, so that iterable
>>> save() thread will post() only if it finishes dumping all the data, then
>>> that orders VFIO iterable data v.s. config space save().
>>
>> In the future we want to not only transfer but also load the config data
>> in parallel.
>
> How feasible do you think this idea is? E.g. does it involve BQL so far
> (e.g. memory updates, others)? What's still missing to make it concurrent?
My gut feeling is that it's feasible overall but it's too much of a rabbit
hole for the first version of this device state transfer feature.
I think it will need some deeper QEMU core address space management changes,
which need to be researched/developed/tested/reviewed/etc. on their own.
If it was an easy task I would have gladly included such support in this
patch set version already for extra downtime reduction :)
>>
>> So going back to transferring this data serialized via the main migration
>> channel would be taking a step back here.
>
> If below holds true:
>
> - 70ms is still a very small amount of the total downtime, and,
>
> - this can avoid the below load_finish() API
>
> Then I'd go for it.. or again, at least the load_finish() needs change,
> IMHO..
As I wrote above, it's not 70 ms total but 70 ms per device.
Also, even 70 ms is a lot, considering that the default downtime limit
is 300 ms - with a single device that's nearly 1/4 of the limit already.
>>
>> By the way, we already have a serialization point in
>> qemu_savevm_state_complete_precopy_iterable() after iterables have been sent -
>> waiting for device state sending threads to finish their work.
>>
>> Whether this thread_pool_wait() operation will be implemented using
>> semaphores I'm not sure yet - will depend on how well this will fit other
>> GThreadPool internals.
>>
>>> On the dst, after a 2nd thought, MIG_CMD_SEND_NON_ITERABLE may not work or
>>> even be needed, because multifd bypasses the main channel, so if we send
>>> anything like MIG_CMD_SEND_NON_ITERABLE on the main channel it won't
>>> guarantee that all multifd loads are complete. However IIUC that can be used in a
>>> similar way as the src qemu I mentioned above with a per-device semaphore,
>>> so that only all the iterable data of this device loaded and applied to the
>>> HW would it post(), before that, vfio_load_state() should wait() on that
>>> sem waiting for data to ready (while multifd threads will be doing that
>>> part). I wonder whether we may reuse the multifd recv thread in the
>>> initial version, so maybe we don't need any other threads on destination.
>>>
>>> The load_finish() interface is currently not able to be reused right,
>>> afaict. Just have a look at its definition:
>>>
>>> /**
>>> * @load_finish
>>> *
>>> * Poll whether all asynchronous device state loading has finished.
>>> * Not called on the load failure path.
>>> *
>>> * Called while holding the qemu_loadvm_load_finish_ready_lock.
>>> *
>>> * If this method signals "not ready" then it might not be called
>>> * again until qemu_loadvm_load_finish_ready_broadcast() is invoked
>>> * while holding qemu_loadvm_load_finish_ready_lock.
>>> *
>>> * @opaque: data pointer passed to register_savevm_live()
>>> * @is_finished: whether the loading had finished (output parameter)
>>> * @errp: pointer to Error*, to store an error if it happens.
>>> *
>>> * Returns zero to indicate success and negative for error.
>>> * It's not an error that the loading still hasn't finished.
>>> */
>>> int (*load_finish)(void *opaque, bool *is_finished, Error **errp);
>>>
>>> It's over complicated on defining all its details:
>>>
>>> - Not re-entrant by default..
>>
>> What do you mean by "re-entrant" here?
>>
>> This handler is called only from single migration thread, so it cannot
>> be re-entered anyway since the control doesn't return to the migration
>> code until this handler exits (and obviously the handler won't call
>> itself recursively).
>
> I think it's not a good design to say "you can call this function once, but
> not the 2nd time until you wait on a semaphore".
That's not exactly how this API is supposed to work.
I suspect that you took "it might not be called again until
qemu_loadvm_load_finish_ready_broadcast() is invoked" as a prohibition
on calling it again until that signal is broadcast.
The intended meaning of that sentence was "it is possible that it won't
be called again until qemu_loadvm_load_finish_ready_broadcast() is invoked".
In other words, the migration core is free to call this handler as many
times as it wants.
But if the handler wants to be *sure* that it will get called by the
migration core after the handler has returned "not ready" then it needs
to arrange for load_finish_ready_broadcast() to be invoked somehow.
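For illustration, the expected device-side usage looks roughly like this
(a sketch only: load_bufs_done is a driver-private example flag,
vfio_load_config_after_bufs() is a made-up helper, and I assume the
qemu_loadvm_load_finish_ready_lock is taken via matching lock/unlock
helpers):

    /* Device loading thread, once all device state buffers were loaded: */
    static void vfio_load_signal_ready(VFIOMigration *migration)
    {
        qemu_loadvm_load_finish_ready_lock();
        migration->load_bufs_done = true;
        qemu_loadvm_load_finish_ready_broadcast();
        qemu_loadvm_load_finish_ready_unlock();
    }

    /* load_finish handler, called by the migration core with the lock held: */
    static int vfio_load_finish(void *opaque, bool *is_finished, Error **errp)
    {
        VFIODevice *vbasedev = opaque;

        if (!vbasedev->migration->load_bufs_done) {
            *is_finished = false;   /* not an error - may be polled again */
            return 0;
        }

        /* all buffers loaded - "activate" the device by loading its config
         * state, which must run in the main migration thread (a real
         * implementation would also track that this was already done) */
        if (vfio_load_config_after_bufs(vbasedev, errp)) {
            return -1;
        }

        *is_finished = true;
        return 0;
    }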
(..)
>>
>>> that I feel like perhaps can be replaced by a sem (then to drop the
>>> condvar)?
>>
>> Once we have ability to load device config state outside main migration
>> thread replacing "load_finish" handler with a semaphore should indeed be
>> possible (that's internal migration API so there should be no issue
>> removing it as not necessary anymore at this point).
>>
>> But for now, the devices need to have ability to run their config load
>> code on the main migration thread, and for that they need to be called
>> from this handler "load_finish".
>
> A sem seems a must here to notify that the iterable data finished loading,
> but that doesn't need to hook into the vmstate handler; it can be some
> post-process tasks, like what we do around cpu_synchronize_all_post_init()
> time.
>
> If the per-device vmstate handler hook version of load_finish() is destined
> to look this weird, I'd rather consider a totally separate way to enqueue
> jobs that need to be run after all vmstates are loaded. Then after one
> VFIO device fully loads its data, it enqueues the task and post()s to one
> migration sem saying "there's one post-process task, please run it in the
> migration thread". There can be a total number of tasks registered so that
> the migration thread knows not to continue until that number of tasks is
> processed. That counter can be part of the vmstate handler, maybe,
> reporting that "this vmstate handler has one post-process task".
>
> Maybe you have other ideas, but please no, let's avoid this load_finish()
> thing..
I can certainly implement the task-queuing approach instead of the
load_finish() handler API if you prefer that approach.
> I'd rather still see justification showing that saving this 70ms really is
> worth it.. I'd rather we accept +70ms*Ndev downtime but drop this hook
> until we have a clearer idea of how all config space can be loaded
> concurrently, for example. So we start simple.
As I wrote above, even 70ms for a single device is a lot considering the
default downtime limit - and that's even more true if multiplied by
multiple devices.
>>
>>> - How qemu_loadvm_load_finish_ready_broadcast() interacts with all
>>> above..
>>>
>>> So if you really think it matters to load whatever VFIO device whose
>>> iterable data is ready first, then let's try come up with some better
>>> interface.. I can try to think about it too, but please answer me
>>> questions above so I can understand what I am missing on why that's
>>> important. Numbers could help, even with 4 VFs, and I wonder how much diff
>>> there can be. Mostly, I don't know why it's slow right now if it is; I
>>> thought it should be pretty fast, at least not a concern in VFIO migration
>>> world (which can take seconds of downtime or more..).
>>>
>>> IOW, it sounds more reasonable to me that no matter whether vfio will
>>> support multifd, it'll be nice if we stick with vfio_load_state() /
>>> vfio_save_state() for config space, and hopefully it's also easier if it
>>> always goes via the main channel for everyone. In these two hooks, VFIO can
>>> do whatever it wants to sync with other things (on src, sync with
>>> concurrent thread pool saving iterable data and dumping things to multifd
>>> channels; on dst, sync with multifd concurrent loads). I think it can
>>> remove the requirement on the load_finish() interface completely. Yes,
>>> this can only load VFIO's pci config space one by one, but I think this is
>>> much simpler, and I hope it's also not that slow, but I'm not sure.
>>
>> To be clear, I made the following diagram describing how the patch set
>> is supposed to work right now, including changing per-device
>> VFIO_MIG_FLAG_DEV_DATA_STATE_COMPLETE into a common MIG_CMD_SWITCHOVER.
>>
>> Time flows on it left to right (->).
>>
>> ----------- DIAGRAM START -----------
>> Source overall flow:
>> Main channel: live VM phase data -> MIG_CMD_SWITCHOVER -> iterable -> non iterable
>> Multifd channels: \ multifd device state read and queue (1) -> multifd config data read and queue (1) /
>>
>> Target overall flow:
>> Main channel: live VM phase data -> MIG_CMD_SWITCHOVER -> iterable -> non iterable -> config data load operations
>> Multifd channels: \ multifd device state (1) -> multifd config data read (1)
>>
>> Target config data load operations flow:
>> multifd config data read (1) -> config data load (2)
>>
>> Notes:
>> (1): per device threads running in parallel
>
> Here I raised this question before, but I'll ask again: do you think we can
> avoid using a separate thread on dest qemu, but reuse multifd recv threads?
>
> Src probably needs its own threads because multifd sender threads take
> requests, so they can't block on their own.
>
> However dest qemu isn't like that, it's packet driven so I think maybe it's
> ok VFIO directly loads the data in the multifd threads. We may want to
> have enough multifd threads to make sure IO still doesn't block much on the
> NIC, but I think tuning the num of multifd threads should work in this
> case.
We need to have the receiving threads decoupled from the VFIO device state
loading threads at least because otherwise:
1) You can have a deadlock if device state for multiple devices arrives
out of order, like here:
Time flows left to right (->).
Multifd channel 1: (VFIO device 1 buffer 2) (VFIO device 2 buffer 1)
Multifd channel 2: (VFIO device 2 buffer 2) (VFIO device 1 buffer 1)
Both channel receive/load threads would be stuck forever in this case,
since they can't load buffer 2 for devices 1 and 2 until they load
buffer 1 for each of these devices.
2) If devices are loading buffers at different speeds you don't want
to block the faster device from receiving new buffers just because
the slower one hasn't finished its loading yet.
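So the receive side needs roughly this split (a sketch with simplified,
made-up helper names; the real code would track the buffer index carried
in each device state packet):

    /* Multifd recv thread: only queues the buffer, never blocks on loading */
    static void multifd_device_state_recv(VFIOMigration *migration,
                                          uint32_t buf_idx,
                                          void *data, size_t len)
    {
        qemu_mutex_lock(&migration->load_bufs_mutex);
        vfio_state_buffers_insert(&migration->load_bufs, buf_idx, data, len);
        qemu_cond_signal(&migration->load_bufs_cond);
        qemu_mutex_unlock(&migration->load_bufs_mutex);
    }

    /* Per-device load thread: consumes buffers strictly in index order */
    static void *vfio_load_bufs_thread(void *opaque)
    {
        VFIOMigration *migration = opaque;
        uint32_t next_idx = 0;

        qemu_mutex_lock(&migration->load_bufs_mutex);
        while (!vfio_load_bufs_all_done(migration, next_idx)) {
            void *data;
            size_t len;

            while (!vfio_state_buffers_get(&migration->load_bufs, next_idx,
                                           &data, &len)) {
                /* out-of-order arrival: wait while recv threads keep going */
                qemu_cond_wait(&migration->load_bufs_cond,
                               &migration->load_bufs_mutex);
            }
            qemu_mutex_unlock(&migration->load_bufs_mutex);

            vfio_load_one_buffer(migration, data, len); /* write() to VFIO fd */
            next_idx++;

            qemu_mutex_lock(&migration->load_bufs_mutex);
        }
        qemu_mutex_unlock(&migration->load_bufs_mutex);

        return NULL;
    }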
>> (2): currently serialized (only one such operation running at a particular time), will hopefully be parallelized in the future
>> ----------- DIAGRAM END -----------
>>
>> Hope the diagram survived being pasted into an e-mail message.
>>
>> One can see that even now there's a bit of "low hanging fruit" of missing
>> possible parallelism:
>> It seems that the source could wait for multifd device state + multifd config
>> data *after* non-iterables are sent rather than before as it is done
>> currently - so they will be sent in parallel with multifd data.
>
> Currently it's blocked by this chunk of code of yours:
>
>     if (multifd_device_state) {
>         ret = multifd_join_device_state_save_threads();
>         if (ret) {
>             qemu_file_set_error(f, ret);
>             return -1;
>         }
>     }
>
> If, with your proposal, the vfio config space is sent via multifd channels,
> then indeed I don't see why it can't be moved to be after non-iterable save()
> completes. Is that what you implied as "low hanging fruit"?
Yes, exactly - to wait for save threads to finish only after non-iterables
have already been saved.
By "low hanging fruit" I meant it should be a fairly easy change.
> [***]
>
>>
>> Since written description is often prone to misunderstanding
>> could you please annotate that diagram with your proposed new flow?
>
> What I was suggesting (removing load_finish()) is mostly the same as what
> you drew I think, especially on src:
>
> ===============
> Source overall flow:
> Main channel: live VM phase data -> MIG_CMD_SWITCHOVER -> iterable -> non iterable
> Multifd channels: \ multifd device state read and queue (1) -> multifd config data read and queue (1) /
> ===============
>
> In this case we can't do the optimization above [***], since what I
> suggested requires VFIO's vfio_save_state() to dump the config space, so
> the original order will be needed here.
>
> While on dest, config data load will need to also load using vfio's
> vfio_load_state() so it'll be invoked just like what we normally do with
> non-iterable device states (so here "config data load operations" is part
> of loading all non-iterable devices):
>
> ===============
> Target overall flow: (X)
> Main channel: live VM phase data -> MIG_CMD_SWITCHOVER -> iterable -> non iterable (multifd config data read -> config data load operations)
> Multifd channels: \ multifd device state load /
> (lower part done via multifd recv threads, not separate threads)
> ===============
>
> So here the ordering of (X) is not guarded by anything, however in
> vfio_load_state() the device can sem_wait() on a semaphore that is only
> posted once this device's device state is fully loaded. So it's not
> completely serialized - "config data load operations" of DEV1 can still
> happen concurrently with "multifd device state load" of DEV2.
>
> Sorry, this might not be that clear since it's not easy to draw in the graph,
> but I hope the words can help clarify what I meant.
>
> If 70ms is not a major deal, I suggest we consider the above approach; I think
> it can simplify at least the vmstate handler API. If 70ms matters, let's
> try to refactor load_finish() into something usable.
I understand your point here; however, as I wrote above, I think that's too
much downtime to "waste", so I will try to rework the load_finish() handler
into the task-queuing approach as you suggested earlier.
> Thanks,
>
Thanks,
Maciej
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [PATCH v2 08/17] migration: Add load_finish handler and associated functions
2024-10-01 20:41 ` Maciej S. Szmigiero
@ 2024-10-01 21:30 ` Peter Xu
2024-10-02 20:11 ` Maciej S. Szmigiero
0 siblings, 1 reply; 128+ messages in thread
From: Peter Xu @ 2024-10-01 21:30 UTC (permalink / raw)
To: Maciej S. Szmigiero
Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P. Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On Tue, Oct 01, 2024 at 10:41:14PM +0200, Maciej S. Szmigiero wrote:
> On 30.09.2024 23:57, Peter Xu wrote:
> > On Mon, Sep 30, 2024 at 09:25:54PM +0200, Maciej S. Szmigiero wrote:
> > > On 27.09.2024 02:53, Peter Xu wrote:
> > > > On Fri, Sep 27, 2024 at 12:34:31AM +0200, Maciej S. Szmigiero wrote:
> > > > > On 20.09.2024 18:45, Peter Xu wrote:
> > > > > > On Fri, Sep 20, 2024 at 05:23:08PM +0200, Maciej S. Szmigiero wrote:
> > > > > > > On 19.09.2024 23:11, Peter Xu wrote:
> > > > > > > > On Thu, Sep 19, 2024 at 09:49:10PM +0200, Maciej S. Szmigiero wrote:
> > > > > > > > > On 9.09.2024 22:03, Peter Xu wrote:
> > > > > > > > > > On Tue, Aug 27, 2024 at 07:54:27PM +0200, Maciej S. Szmigiero wrote:
> > > > > > > > > > > From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> > > > > > > > > > >
> > > > > > > > > > > load_finish SaveVMHandler allows migration code to poll whether
> > > > > > > > > > > a device-specific asynchronous device state loading operation has finished.
> > > > > > > > > > >
> > > > > > > > > > > In order to avoid calling this handler needlessly the device is supposed
> > > > > > > > > > > to notify the migration code of its possible readiness via a call to
> > > > > > > > > > > qemu_loadvm_load_finish_ready_broadcast() while holding
> > > > > > > > > > > qemu_loadvm_load_finish_ready_lock.
> > > > > > > > > > >
> > > > > > > > > > > Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> > > > > > > > > > > ---
> > > > > > > > > > > include/migration/register.h | 21 +++++++++++++++
> > > > > > > > > > > migration/migration.c | 6 +++++
> > > > > > > > > > > migration/migration.h | 3 +++
> > > > > > > > > > > migration/savevm.c | 52 ++++++++++++++++++++++++++++++++++++
> > > > > > > > > > > migration/savevm.h | 4 +++
> > > > > > > > > > > 5 files changed, 86 insertions(+)
> > > > > > > > > > >
> > > > > > > > > > > diff --git a/include/migration/register.h b/include/migration/register.h
> > > > > > > > > > > index 4a578f140713..44d8cf5192ae 100644
> > > > > > > > > > > --- a/include/migration/register.h
> > > > > > > > > > > +++ b/include/migration/register.h
> > > > > > > > > > > @@ -278,6 +278,27 @@ typedef struct SaveVMHandlers {
> > > > > > > > > > > int (*load_state_buffer)(void *opaque, char *data, size_t data_size,
> > > > > > > > > > >                          Error **errp);
> > > > > > > > > > > + /**
> > > > > > > > > > > + * @load_finish
> > > > > > > > > > > + *
> > > > > > > > > > > + * Poll whether all asynchronous device state loading has finished.
> > > > > > > > > > > + * Not called on the load failure path.
> > > > > > > > > > > + *
> > > > > > > > > > > + * Called while holding the qemu_loadvm_load_finish_ready_lock.
> > > > > > > > > > > + *
> > > > > > > > > > > + * If this method signals "not ready" then it might not be called
> > > > > > > > > > > + * again until qemu_loadvm_load_finish_ready_broadcast() is invoked
> > > > > > > > > > > + * while holding qemu_loadvm_load_finish_ready_lock.
> > > > > > > > > >
> > > > > > > > > > [1]
> > > > > > > > > >
> > > > > > > > > > > + *
> > > > > > > > > > > + * @opaque: data pointer passed to register_savevm_live()
> > > > > > > > > > > + * @is_finished: whether the loading had finished (output parameter)
> > > > > > > > > > > + * @errp: pointer to Error*, to store an error if it happens.
> > > > > > > > > > > + *
> > > > > > > > > > > + * Returns zero to indicate success and negative for error.
> > > > > > > > > > > + * It's not an error that the loading still hasn't finished.
> > > > > > > > > > > + */
> > > > > > > > > > > + int (*load_finish)(void *opaque, bool *is_finished, Error **errp);
> > > > > > > > > >
> > > > > > > > > > The load_finish() semantics is a bit weird, especially above [1] on "only
> > > > > > > > > > allowed to be called once if ..." and also on the locks.
> > > > > > > > >
> > > > > > > > > The point of this remark is that a driver needs to call
> > > > > > > > > qemu_loadvm_load_finish_ready_broadcast() if it wants the migration
> > > > > > > > > core to call its load_finish handler again.
> > > > > > > > >
> > > > > > > > > > It looks to me vfio_load_finish() also does the final load of the device.
> > > > > > > > > >
> > > > > > > > > > I wonder whether that final load can be done in the threads,
> > > > > > > > >
> > > > > > > > > Here, the problem is that current VFIO VMState has to be loaded from the main
> > > > > > > > > migration thread as it internally calls QEMU core address space modification
> > > > > > > > > methods which explode if called from another thread(s).
> > > > > > > >
> > > > > > > > Ahh, I see. I'm trying to make dest qemu loadvm in a thread too and yield
> > > > > > > > BQL if possible, when that's ready then in your case here IIUC you can
> > > > > > > > simply take BQL in whichever thread that loads it.. but yeah it's not ready
> > > > > > > > at least..
> > > > > > >
> > > > > > > Yeah, long term we might want to work on making these QEMU core address space
> > > > > > > modification methods somehow callable from multiple threads but that's
> > > > > > > definitely not something for the initial patch set.
> > > > > > >
> > > > > > > > Would it be possible vfio_save_complete_precopy_async_thread_config_state()
> > > > > > > > be done in VFIO's save_live_complete_precopy() through the main channel
> > > > > > > > somehow? IOW, does it rely on iterative data to be fetched first from
> > > > > > > > kernel, or completely separate states?
> > > > > > >
> > > > > > > The device state data needs to be fully loaded first before "activating"
> > > > > > > the device by loading its config state.
> > > > > > >
> > > > > > > > And just curious: how large is it
> > > > > > > > normally (and I suppose this decides whether it's applicable to be sent via
> > > > > > > > the main channel at all..)?
> > > > > > >
> > > > > > > Config data is *much* smaller than device state data - as far as I remember
> > > > > > > it was on order of kilobytes.
> > > > > > >
> > > > > > > > >
> > > > > > > > > > then after
> > > > > > > > > > everything loaded the device post a semaphore telling the main thread to
> > > > > > > > > > continue. See e.g.:
> > > > > > > > > >
> > > > > > > > > >     if (migrate_switchover_ack()) {
> > > > > > > > > >         qemu_loadvm_state_switchover_ack_needed(mis);
> > > > > > > > > >     }
> > > > > > > > > >
> > > > > > > > > > IIUC, VFIO can register load_complete_ack similarly so it only sem_post()
> > > > > > > > > > when all things are loaded? We can then get rid of this slightly awkward
> > > > > > > > > > interface. I had a feeling that things can be simplified (e.g., if the
> > > > > > > > > > thread will take care of loading the final vmstate then the mutex is also
> > > > > > > > > > not needed? etc.).
> > > > > > > > >
> > > > > > > > > With just a single call to switchover_ack_needed per VFIO device it would
> > > > > > > > > need to do a blocking wait for the device buffers and config state load
> > > > > > > > > to finish, therefore blocking other VFIO devices from potentially loading
> > > > > > > > > their config state if they are ready to begin this operation earlier.
> > > > > > > >
> > > > > > > > I am not sure I get you here, loading VFIO device states (I mean, the
> > > > > > > > non-iterable part) will need to be done sequentially IIUC due to what you
> > > > > > > > said and should rely on BQL, so I don't know how that could happen
> > > > > > > > concurrently for now. But I think indeed BQL is a problem.
> > > > > > > Consider that we have two VFIO devices (A and B), with the following order
> > > > > > > of switchover_ack_needed handler calls for them: first A gets this call,
> > > > > > > once the call for A finishes then B gets this call.
> > > > > > >
> > > > > > > Now consider what happens if B had loaded all its buffers (in the loading
> > > > > > > thread) and it is ready for its config load before A finished loading its
> > > > > > > buffers.
> > > > > > >
> > > > > > > B has to wait idle in this situation (even though it could have been already
> > > > > > > loading its config) since the switchover_ack_needed handler for A won't
> > > > > > > return until A is fully done.
> > > > > >
> > > > > > This sounds like a performance concern, and I wonder how much this impacts
> > > > > > the real workload (that you run a test and measure, with/without such
> > > > > > concurrency) when we can save two devices in parallel anyway; I would
> > > > > > expect the real diff is small due to the fact I mentioned that we save >1
> > > > > > VFIO devices concurrently via multifd.
> > > > > >
> > > > > > Do you think we can start with a simpler approach?
> > > > >
> > > > > I don't think introducing a performance/scalability issue like that is
> > > > > a good thing, especially that we already have a design that avoids it.
> > > > >
> > > > > Unfortunately, my current setup does not allow live migrating VMs with
> > > > > more than 4 VFs so I can't benchmark that.
> > > >
> > > > /me wonders why benchmarking it requires more than 4 VFs.
> > >
> > > My point here was that the scalability problem will most likely get more
> > > pronounced with more VFs.
> > >
> > > > >
> > > > > But I'm almost certain that with more VFs the situation with devices being
> > > > > ready out-of-order will get even more likely.
> > > >
> > > > If the config space is small, why loading it in sequence would be a
> > > > problem?
> > > >
> > > > Have you measured how much time it needs to load one VF's config space that
> > > > you're using? I suppose that's vfio_load_device_config_state() alone?
> > >
> > > It's not the amount of data to load that matters here but that these address
> > > space operations are slow.
> > >
> > > The whole config load takes ~70 ms per device - that's time equivalent
> > > of transferring 875 MiB of device state via a 100 GBit/s link.
> >
> > What's the downtime of migration with 1/2/4 VFs? I remember I saw some
> > data somewhere but it's not in the cover letter. It'll be good to mention
> > these results in the cover letter when you repost.
>
> Downtimes with the device state transfer being disabled / enabled:
>               4 VFs     2 VFs     1 VF
> Disabled:   1783 ms    614 ms   283 ms
> Enabled:    1068 ms    434 ms   274 ms
>
> Will add these numbers to the cover letter of the next patch set version.
Thanks.
>
> > I'm guessing 70ms isn't a huge deal here, if your NIC has 128GB internal
> > device state to migrate.. but maybe I'm wrong.
>
> It's ~100 MiB of device state per VF here.
Ouch..
I watched your kvm forum talk recording, I remember that's where I got that
128 number but probably got the unit wrong.. ok that makes sense.
>
> And it's 70ms of downtime *per device*:
> so with 4 VFs it's ~280ms of downtime taken by the config loads.
> That's a lot - with perfect parallelization this downtime should
> *reduce by* 210ms.
Yes, in this case it's a lot. I wonder why it won't scale as well even
with your patchset.
Did you profile why? I highly doubt network is an issue in your case, as
there's only 100MB of per-device data, so even on 10gbps it takes only
~100ms to transfer each, and that's assuming they run concurrently. I
think you mentioned you were using 100gbps, right?
Logically, with multiple threads, VFIO read()s should happen at least
concurrently per device. Have you checked that there's no kernel-side
global VFIO lock etc. that serializes portions of the threads' read()s /
write()s on the VFIO fds?
It's just a pity that you went this far, added all this logic, but
without making it fully concurrent at least per device.
I'm OK if you want this in without that figured out, but if I were you I'd
probably try to dig a bit to at least know why.
>
> > I also wonder whether you profiled a bit on how that 70ms contributes to
> > what is slow.
>
> I think that's something we can do after we have parallel config loads,
> if it turns out their downtime for some reason still scales strongly
> linearly with the number of VFIO devices (rather than taking roughly
> constant time regardless of the count of these devices if running perfectly
> in parallel).
Similarly, I wonder whether the config space load() can involve something
globally shared. I'd also dig a bit here, but I'll leave that to you to
decide.
>
> > >
> > > > >
> > > > > > So what I'm thinking could be very clean is: we just discussed
> > > > > > MIG_CMD_SWITCHOVER and it looks like you also think it's an OK approach.
> > > > > > With it in place, I wonder why not move one step further and have
> > > > > > MIG_CMD_SEND_NON_ITERABLE just to mark that "iterable devices all done,
> > > > > > ready to send non-iterable". It can be controlled by the same migration
> > > > > > property so we only send these two flags in 9.2+ machine types.
> > > > > >
> > > > > > Then IIUC VFIO can send config data through the main wire (just like most
> > > > > > other PCI devices! which is IMHO a good fit..) and on the destination VFIO
> > > > > > holds off loading it until passing the MIG_CMD_SEND_NON_ITERABLE phase.
> > > > >
> > > > > Starting the config load only on MIG_CMD_SEND_NON_ITERABLE would (in addition
> > > > > to the considerations above) also delay starting the config load until all
> > > > > iterable devices were read/transferred/loaded and also would complicate
> > > > > future efforts at loading that config data in parallel.
> > > >
> > > > However I wonder whether we can keep it simple in that VFIO's config space
> > > > is still always saved in vfio_save_state().  I still think it's easier if we
> > > > stick with the main channel whenever possible.  For this specific case, if
> > > > the config space is small I think it's tricky that you bypass this with:
> > > >
> > > > if (migration->multifd_transfer) {
> > > > /* Emit dummy NOP data */
> > > > qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> > > > return;
> > > > }
> > > >
> > > > Then squash this as the tail of the iterable data.
> > > >
> > > > On the src, I think it could use a per-device semaphore, so that iterable
> > > > save() thread will post() only if it finishes dumping all the data, then
> > > > that orders VFIO iterable data vs. config space save().
> > >
> > > In the future we want to not only transfer but also load the config data
> > > in parallel.
> >
> > How feasible do you think this idea is? E.g. does it involve BQL so far
> > (e.g. memory updates, others)? What's still missing to make it concurrent?
>
> My gut feeling is that it's feasible overall but it's too much of a rabbit
> hole for the first version of this device state transfer feature.
>
> I think it will need some deeper QEMU core address space management changes,
> which need to be researched/developed/tested/reviewed/etc. on their own.
>
> If it was an easy task I would have gladly included such support in this
> patch set version already for extra downtime reduction :)
Yes I understand.
Note that it doesn't need to be implemented and resolved in one shot, but I
wonder if it'll still be good to debug the issue and know what exactly is
not scaling.
Considering that your design is fully concurrent as of now on iterable data
from the QEMU side, it's less persuasive to provide perf numbers that still
don't scale that much; 1.78s -> 1.06s is a good improvement, but it
doesn't seem to solve the scalability issue that this whole series wanted
to address in general.
An extreme (bad) example is if VFIO has all ioctl()/read()/write() take a
global lock, then any work in QEMU trying to run things in parallel will be
in vain.  Such a patchset cannot be accepted because the other issue needs
to be resolved first.
Now it's in the middle between the best and worst cases, where it did
improve but it still doesn't scale that well.  I think it can be accepted,
but still I feel like we're ignoring some of the real issues.  We can
choose to ignore the kernel side saying that "it's too much to do
together", but IMHO the issues should be tackled the other way round..
normally one should work out the kernel scalability issues first, then
build QEMU on top..  Simply because any kernel change that might make >1
device save()/load() scale can affect future QEMU changes and design, not
vice versa.
Again, I know you wished we make some progress, so I don't have a strong
opinion. Just FYI.
>
> > >
> > > So going back to transferring this data serialized via the main migration
> > > channel would be taking a step back here.
> >
> > If below holds true:
> >
> > - 70ms is still very small amount in the total downtime, and,
> >
> > - this can avoid the below load_finish() API
> >
> > Then I'd go for it.. or again, at least the load_finish() needs change,
> > IMHO..
>
> As I wrote above, it's not 70 ms total but 70 ms per device.
>
> Also, even 70 ms is a lot, considering that the default downtime limit
> is 300 ms - with a single device that's nearly 1/4 of the limit already.
>
> > >
> > > By the way, we already have a serialization point in
> > > qemu_savevm_state_complete_precopy_iterable() after iterables have been sent -
> > > waiting for device state sending threads to finish their work.
> > >
> > > Whether this thread_pool_wait() operation will be implemented using
> > > semaphores I'm not sure yet - will depend on how well this will fit other
> > > GThreadPool internals.
> > >
> > > > On the dst, after a 2nd thought, MIG_CMD_SEND_NON_ITERABLE may not work or
> > > > be needed indeed, because multifd bypasses the main channel, so if we send
> > > > anything like MIG_CMD_SEND_NON_ITERABLE on the main channel it won't
> > > > guarantee multifd load all complete.  However IIUC that can be used in a
> > > > similar way as the src qemu I mentioned above with a per-device semaphore,
> > > > so that it only post()s once all the iterable data of this device has been
> > > > loaded and applied to the HW; before that, vfio_load_state() should wait()
> > > > on that sem for the data to be ready (while multifd threads will be doing
> > > > that part).  I wonder whether we may reuse the multifd recv thread in the
> > > > initial version, so maybe we don't need any other threads on destination.
> > > >
> > > > The load_finish() interface is currently not able to be reused right,
> > > > afaict. Just have a look at its definition:
> > > >
> > > > /**
> > > > * @load_finish
> > > > *
> > > > * Poll whether all asynchronous device state loading had finished.
> > > > * Not called on the load failure path.
> > > > *
> > > > * Called while holding the qemu_loadvm_load_finish_ready_lock.
> > > > *
> > > > * If this method signals "not ready" then it might not be called
> > > > * again until qemu_loadvm_load_finish_ready_broadcast() is invoked
> > > > * while holding qemu_loadvm_load_finish_ready_lock.
> > > > *
> > > > * @opaque: data pointer passed to register_savevm_live()
> > > > * @is_finished: whether the loading had finished (output parameter)
> > > > * @errp: pointer to Error*, to store an error if it happens.
> > > > *
> > > > * Returns zero to indicate success and negative for error
> > > > * It's not an error that the loading still hasn't finished.
> > > > */
> > > > int (*load_finish)(void *opaque, bool *is_finished, Error **errp);
> > > >
> > > > It's overcomplicated in how it defines all its details:
> > > >
> > > > - Not re-entrant by default..
> > >
> > > What do you mean by "re-entrant" here?
> > >
> > > This handler is called only from single migration thread, so it cannot
> > > be re-entered anyway since the control doesn't return to the migration
> > > code until this handler exits (and obviously the handler won't call
> > > itself recursively).
> >
> > I think it's not a good design to say "you can call this function once, but
> > not the 2nd time until you wait on a semaphore".
>
> That's not exactly how this API is supposed to work.
>
> I suspect that you took that "it might not be called again until
> qemu_loadvm_load_finish_ready_broadcast() is invoked" as prohibition
> from being called again until that signal is broadcast.
>
> The intended meaning of that sentence was "it is possible that it won't
> be called again until qemu_loadvm_load_finish_ready_broadcast() is invoked".
>
> In other words, the migration core is free to call this handler as
> many times as it wants.
>
> But if the handler wants to be *sure* that it will get called by the
> migration core after the handler has returned "not ready" then it needs
> to arrange for load_finish_ready_broadcast() to be invoked somehow.
OK I see.
>
> (..)
> > >
> > > > that I feel like perhaps can be replaced by a sem (then to drop the
> > > > condvar)?
> > >
> > > Once we have ability to load device config state outside main migration
> > > thread replacing "load_finish" handler with a semaphore should indeed be
> > > possible (that's internal migration API so there should be no issue
> > > removing it as not necessary anymore at this point).
> > >
> > > But for now, the devices need to have ability to run their config load
> > > code on the main migration thread, and for that they need to be called
> > > from this handler "load_finish".
> >
> > A sem seems a must here to notify the iterable data finished loading, but
> > that doesn't need to hook to the vmstate handler, but some post-process
> > tasks, like what we do around cpu_synchronize_all_post_init() time.
> >
> > If per-device vmstate handler hook version of load_finish() is destined to
> > look as weird in this case, I'd rather consider a totally separate way to
> > enqueue some jobs that needs to be run after all vmstates loaded. Then
> > after one VFIO device fully loads its data, it enqueues the task and post()
> > to one migration sem saying that "there's one post-process task, please run
> > it in the migration thread".  There can be a total number of tasks registered
> > so that the migration thread knows not to continue until that number of tasks
> > has been processed.  That counter can be part of the vmstate handler, maybe, reporting
> > that "this vmstate handler has one post-process task".
> >
> > Maybe you have other ideas, but please no, let's avoid this load_finish()
> > thing..
>
> I can certainly implement the task-queuing approach instead of the
> load_finish() handler API if you like such approach more.
I have an even simpler solution now. I think you can reuse precopy
notifiers.
You can add one new PRECOPY_NOTIFY_INCOMING_COMPLETE event, invoke it after
all the vmstate loading is done.
As long as VFIO devices exist, VFIO can register with that event, then it
can do whatever it wants in the main loader thread with BQL held.
You can hide that sem post() / wait() all there, then it's completely VFIO
internal. Then we leave vmstate handler alone; it just doesn't sound
suitable when the hooks need to be called out of order.
>
> > I'd rather still see justifications showing that this 70ms really is
> > helpful.. I'd rather wish we have +70ms*Ndev downtime but drop this hook
> > until we have a clearer mind when all config space can be loaded
> > concurrently, for example. So we start from simple.
>
> As I wrote above, even 70ms for a single device is a lot considering the
> default downtime limit - and that's even more true if multiplied by
> multiple devices.
>
> > >
> > > > - How qemu_loadvm_load_finish_ready_broadcast() interacts with all
> > > > above..
> > > >
> > > > So if you really think it matters to load whichever VFIO device whose
> > > > iterable data is ready first, then let's try to come up with some better
> > > > interface..  I can try to think about it too, but please answer my
> > > > questions above so I can understand what I am missing on why that's
> > > > important.  Numbers could help, even with 4 VFs, and I wonder how much diff
> > > > there can be. Mostly, I don't know why it's slow right now if it is; I
> > > > thought it should be pretty fast, at least not a concern in VFIO migration
> > > > world (which can take seconds of downtime or more..).
> > > >
> > > > IOW, it sounds more reasonable to me that no matter whether vfio will
> > > > support multifd, it'll be nice we stick with vfio_load_state() /
> > > > vfio_save_state() for config space, and hopefully it's also easier it
> > > > always go via the main channel to everyone. In these two hooks, VFIO can
> > > > do whatever it wants to sync with other things (on src, sync with
> > > > concurrent thread pool saving iterable data and dumping things to multifd
> > > > channels; on dst, sync with multifd concurrent loads). I think it can
> > > > remove the requirement on the load_finish() interface completely. Yes,
> > > > this can only load VFIO's pci config space one by one, but I think this is
> > > > much simpler, and I hope it's also not that slow, but I'm not sure.
> > >
> > > To be clear, I made the following diagram describing how the patch set
> > > is supposed to work right now, including changing per-device
> > > VFIO_MIG_FLAG_DEV_DATA_STATE_COMPLETE into a common MIG_CMD_SWITCHOVER.
> > >
> > > Time flows on it left to right (->).
> > >
> > > ----------- DIAGRAM START -----------
> > > Source overall flow:
> > > Main channel: live VM phase data -> MIG_CMD_SWITCHOVER -> iterable -> non iterable
> > > Multifd channels: \ multifd device state read and queue (1) -> multifd config data read and queue (1) /
> > >
> > > Target overall flow:
> > > Main channel: live VM phase data -> MIG_CMD_SWITCHOVER -> iterable -> non iterable -> config data load operations
> > > Multifd channels: \ multifd device state (1) -> multifd config data read (1)
> > >
> > > Target config data load operations flow:
> > > multifd config data read (1) -> config data load (2)
> > >
> > > Notes:
> > > (1): per device threads running in parallel
> >
> > Here I raised this question before, but I'll ask again: do you think we can
> > avoid using a separate thread on dest qemu, but reuse multifd recv threads?
> >
> > Src probably needs its own threads because multifd sender threads take
> > requests, so they can't block on their own.
> >
> > However dest qemu isn't like that, it's packet driven, so I think maybe it's
> > OK if VFIO directly loads the data in the multifd threads.  We may want to
> > have enough multifd threads to make sure IO still doesn't block much on the
> > NIC, but I think tuning the num of multifd threads should work in this
> > case.
>
> We need to have the receiving threads decoupled from the VFIO device state
> loading threads at least because otherwise:
> 1) You can have a deadlock if device state for multiple devices arrives
> out of order, like here:
>
> Time flows left to right (->).
> Multifd channel 1: (VFIO device 1 buffer 2) (VFIO device 2 buffer 1)
> Multifd channel 2: (VFIO device 2 buffer 2) (VFIO device 1 buffer 1)
>
> Both channel receive/load threads would be stuck forever in this case,
> since they can't load buffer 2 for devices 1 and 2 until they load
> buffer 1 for each of these devices.
>
> 2) If devices are loading buffers at different speeds you don't want
> to block the faster device from receiving new buffers just because
> the slower one hasn't finished its loading yet.
I don't see why it can't be avoided. Let me draw this in columns.
How I picture this is:
multifd recv thread 1 multifd recv thread 2
--------------------- ---------------------
recv VFIO device 1 buffer 2 recv VFIO device 2 buffer 2
-> found that (dev1, buf1) missing, -> found that (dev2, buf1) missing,
skip load skip load
recv VFIO device 2 buffer 1 recv VFIO device 1 buffer 1
-> found that (dev2, buf1+buf2) ready, -> found that (dev1, buf1+buf2) ready,
load buf1+2 for dev2 here load buf1+2 for dev1 here
Here right after one multifd thread recvs a buffer, it needs to be injected
into the cache array (with proper locking), so that whoever receives a full
series of those buffers will do the load (again, with proper locking..).
Would this not work?
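Roughly like below - a sketch only, with the locking simplified and all
names hypothetical (vfio_load_one_buffer() stands in for whatever write()s
one buffer to the device fd):

  typedef struct VFIODevLoadState {
      QemuMutex lock;
      GHashTable *bufs;      /* buffer index -> received data */
      uint32_t next_idx;     /* next buffer to write to the device */
      bool load_active;      /* a thread is already loading this device */
  } VFIODevLoadState;

  /* Called by whichever multifd recv thread got a buffer for this device */
  static void vfio_multifd_recv_buffer(VFIODevLoadState *s,
                                       uint32_t idx, void *data)
  {
      qemu_mutex_lock(&s->lock);
      g_hash_table_insert(s->bufs, GUINT_TO_POINTER(idx), data);

      /* Whoever finds the next expected buffer in the cache does the load */
      while (!s->load_active &&
             (data = g_hash_table_lookup(s->bufs,
                                         GUINT_TO_POINTER(s->next_idx)))) {
          s->load_active = true;
          qemu_mutex_unlock(&s->lock);
          vfio_load_one_buffer(data);   /* write() it to the device fd */
          qemu_mutex_lock(&s->lock);
          g_hash_table_remove(s->bufs, GUINT_TO_POINTER(s->next_idx));
          s->next_idx++;
          s->load_active = false;
      }
      qemu_mutex_unlock(&s->lock);
  }

This way a recv thread never blocks waiting for a missing buffer; it only
ever loads what is already complete.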
>
> > > (2): currently serialized (only one such operation running at a particular time), will hopefully be parallelized in the future
> > > ----------- DIAGRAM END -----------
> > >
> > > Hope the diagram survived being pasted into an e-mail message.
> > >
> > > One can see that even now there's a bit of "low hanging fruit" of missing
> > > possible parallelism:
> > > It seems that the source could wait for multifd device state + multifd config
> > > data *after* non-iterables are sent rather than before as it is done
> > > currently - so they will be sent in parallel with multifd data.
> >
> > Currently it's blocked by this chunk of code of yours:
> >
> > if (multifd_device_state) {
> > ret = multifd_join_device_state_save_threads();
> > if (ret) {
> > qemu_file_set_error(f, ret);
> > return -1;
> > }
> > }
> >
> > If, with your proposal, the vfio config space is sent via multifd channels,
> > then indeed I don't see why it can't be moved to after non-iterable save()
> > completes.  Is that what you implied as "low hanging fruit"?
>
> Yes, exactly - to wait for save threads to finish only after non-iterables
> have already been saved.
>
> By "low hanging fruit" I meant it should be a fairly easy change.
>
> > [***]
> >
> > >
> > > Since a written description is often prone to misunderstanding,
> > > could you please annotate that diagram with your proposed new flow?
> >
> > What I was suggesting (removing load_finish()) is mostly the same as what
> > you drew I think, especially on src:
> >
> > ===============
> > Source overall flow:
> > Main channel: live VM phase data -> MIG_CMD_SWITCHOVER -> iterable -> non iterable
> > Multifd channels: \ multifd device state read and queue (1) -> multifd config data read and queue (1) /
> > ===============
> >
> > In this case we can't do the optimization above [***], since what I
> > suggested requires VFIO's vfio_save_state() to dump the config space, so
> > the original order will be needed here.
> >
> > While on the dest, the config data will also need to be loaded using vfio's
> > vfio_load_state(), so it'll be invoked just like what we normally do with
> > non-iterable device states (so here "config data load operations" is part
> > of loading all non-iterable devices):
> >
> > ===============
> > Target overall flow: (X)
> > Main channel: live VM phase data -> MIG_CMD_SWITCHOVER -> iterable -> non iterable (multifd config data read -> config data load operations)
> > Multifd channels: \ multifd device state load /
> > (lower part done via multifd recv threads, not separate threads)
> > ===============
> >
> > So here the ordering of (X) is not guarded by anything, however in
> > vfio_load_state() the device can sem_wait() on a semaphore that is only
> > posted once this device's device state is fully loaded.  So it's not
> > completely serialized - "config data load operations" of DEV1 can still
> > happen concurrently with "multifd device state load" of DEV2.
> >
> > Sorry, this might not be very clear, as it's not easy to draw in the graph,
> > but I hope the words can help clarify what I meant.
> >
> > If 70ms is not a major deal, I suggest we consider the above approach - I think
> > it can simplify at least the vmstate handler API. If 70ms matters, let's
> > try refactor load_finish() to something usable.
>
> I understand your point here, however as I wrote above, I think that's too
> much downtime to "waste" so I will try to rework the load_finish() handler
> into the task-queuing approach as you suggested earlier.
Thanks.
--
Peter Xu
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [PATCH v2 08/17] migration: Add load_finish handler and associated functions
2024-10-01 21:30 ` Peter Xu
@ 2024-10-02 20:11 ` Maciej S. Szmigiero
2024-10-02 21:25 ` Peter Xu
0 siblings, 1 reply; 128+ messages in thread
From: Maciej S. Szmigiero @ 2024-10-02 20:11 UTC (permalink / raw)
To: Peter Xu
Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P. Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On 1.10.2024 23:30, Peter Xu wrote:
> On Tue, Oct 01, 2024 at 10:41:14PM +0200, Maciej S. Szmigiero wrote:
>> On 30.09.2024 23:57, Peter Xu wrote:
>>> On Mon, Sep 30, 2024 at 09:25:54PM +0200, Maciej S. Szmigiero wrote:
>>>> On 27.09.2024 02:53, Peter Xu wrote:
>>>>> On Fri, Sep 27, 2024 at 12:34:31AM +0200, Maciej S. Szmigiero wrote:
>>>>>> On 20.09.2024 18:45, Peter Xu wrote:
>>>>>>> On Fri, Sep 20, 2024 at 05:23:08PM +0200, Maciej S. Szmigiero wrote:
>>>>>>>> On 19.09.2024 23:11, Peter Xu wrote:
>>>>>>>>> On Thu, Sep 19, 2024 at 09:49:10PM +0200, Maciej S. Szmigiero wrote:
>>>>>>>>>> On 9.09.2024 22:03, Peter Xu wrote:
>>>>>>>>>>> On Tue, Aug 27, 2024 at 07:54:27PM +0200, Maciej S. Szmigiero wrote:
>>>>>>>>>>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>>>>>>>>>>
>>>>>>>>>>>> load_finish SaveVMHandler allows migration code to poll whether
>>>>>>>>>>>> a device-specific asynchronous device state loading operation had finished.
>>>>>>>>>>>>
>>>>>>>>>>>> In order to avoid calling this handler needlessly the device is supposed
>>>>>>>>>>>> to notify the migration code of its possible readiness via a call to
>>>>>>>>>>>> qemu_loadvm_load_finish_ready_broadcast() while holding
>>>>>>>>>>>> qemu_loadvm_load_finish_ready_lock.
>>>>>>>>>>>>
>>>>>>>>>>>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>>>>>>>>>>>> ---
>>>>>>>>>>>> include/migration/register.h | 21 +++++++++++++++
>>>>>>>>>>>> migration/migration.c | 6 +++++
>>>>>>>>>>>> migration/migration.h | 3 +++
>>>>>>>>>>>> migration/savevm.c | 52 ++++++++++++++++++++++++++++++++++++
>>>>>>>>>>>> migration/savevm.h | 4 +++
>>>>>>>>>>>> 5 files changed, 86 insertions(+)
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/include/migration/register.h b/include/migration/register.h
>>>>>>>>>>>> index 4a578f140713..44d8cf5192ae 100644
>>>>>>>>>>>> --- a/include/migration/register.h
>>>>>>>>>>>> +++ b/include/migration/register.h
>>>>>>>>>>>> @@ -278,6 +278,27 @@ typedef struct SaveVMHandlers {
>>>>>>>>>>>> int (*load_state_buffer)(void *opaque, char *data, size_t data_size,
>>>>>>>>>>>> Error **errp);
>>>>>>>>>>>> + /**
>>>>>>>>>>>> + * @load_finish
>>>>>>>>>>>> + *
>>>>>>>>>>>> + * Poll whether all asynchronous device state loading had finished.
>>>>>>>>>>>> + * Not called on the load failure path.
>>>>>>>>>>>> + *
>>>>>>>>>>>> + * Called while holding the qemu_loadvm_load_finish_ready_lock.
>>>>>>>>>>>> + *
>>>>>>>>>>>> + * If this method signals "not ready" then it might not be called
>>>>>>>>>>>> + * again until qemu_loadvm_load_finish_ready_broadcast() is invoked
>>>>>>>>>>>> + * while holding qemu_loadvm_load_finish_ready_lock.
>>>>>>>>>>>
>>>>>>>>>>> [1]
>>>>>>>>>>>
>>>>>>>>>>>> + *
>>>>>>>>>>>> + * @opaque: data pointer passed to register_savevm_live()
>>>>>>>>>>>> + * @is_finished: whether the loading had finished (output parameter)
>>>>>>>>>>>> + * @errp: pointer to Error*, to store an error if it happens.
>>>>>>>>>>>> + *
>>>>>>>>>>>> + * Returns zero to indicate success and negative for error
>>>>>>>>>>>> + * It's not an error that the loading still hasn't finished.
>>>>>>>>>>>> + */
>>>>>>>>>>>> + int (*load_finish)(void *opaque, bool *is_finished, Error **errp);
>>>>>>>>>>>
>>>>>>>>>>> The load_finish() semantics is a bit weird, especially above [1] on "only
>>>>>>>>>>> allowed to be called once if ..." and also on the locks.
>>>>>>>>>>
>>>>>>>>>> The point of this remark is that a driver needs to call
>>>>>>>>>> qemu_loadvm_load_finish_ready_broadcast() if it wants for the migration
>>>>>>>>>> core to call its load_finish handler again.
>>>>>>>>>>
>>>>>>>>>>> It looks to me vfio_load_finish() also does the final load of the device.
>>>>>>>>>>>
>>>>>>>>>>> I wonder whether that final load can be done in the threads,
>>>>>>>>>>
>>>>>>>>>> Here, the problem is that current VFIO VMState has to be loaded from the main
>>>>>>>>>> migration thread as it internally calls QEMU core address space modification
>>>>>>>>>> methods which explode if called from another thread(s).
>>>>>>>>>
>>>>>>>>> Ahh, I see. I'm trying to make dest qemu loadvm in a thread too and yield
>>>>>>>>> BQL if possible, when that's ready then in your case here IIUC you can
>>>>>>>>> simply take BQL in whichever thread that loads it.. but yeah it's not ready
>>>>>>>>> at least..
>>>>>>>>
>>>>>>>> Yeah, long term we might want to work on making these QEMU core address space
>>>>>>>> modification methods somehow callable from multiple threads but that's
>>>>>>>> definitely not something for the initial patch set.
>>>>>>>>
>>>>>>>>> Would it be possible vfio_save_complete_precopy_async_thread_config_state()
>>>>>>>>> be done in VFIO's save_live_complete_precopy() through the main channel
>>>>>>>>> somehow? IOW, does it rely on iterative data to be fetched first from
>>>>>>>>> kernel, or completely separate states?
>>>>>>>>
>>>>>>>> The device state data needs to be fully loaded first before "activating"
>>>>>>>> the device by loading its config state.
>>>>>>>>
>>>>>>>>> And just curious: how large is it
>>>>>>>>> normally (and I suppose this decides whether it's applicable to be sent via
>>>>>>>>> the main channel at all..)?
>>>>>>>>
>>>>>>>> Config data is *much* smaller than device state data - as far as I remember
>>>>>>>> it was on the order of kilobytes.
>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> then after
>>>>>>>>>>> everything loaded the device post a semaphore telling the main thread to
>>>>>>>>>>> continue. See e.g.:
>>>>>>>>>>>
>>>>>>>>>>> if (migrate_switchover_ack()) {
>>>>>>>>>>> qemu_loadvm_state_switchover_ack_needed(mis);
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> IIUC, VFIO can register load_complete_ack similarly so it only sem_post()
>>>>>>>>>>> when all things are loaded? We can then get rid of this slightly awkward
>>>>>>>>>>> interface. I had a feeling that things can be simplified (e.g., if the
>>>>>>>>>>> thread will take care of loading the final vmstate then the mutex is also
>>>>>>>>>>> not needed? etc.).
>>>>>>>>>>
>>>>>>>>>> With just a single call to switchover_ack_needed per VFIO device it would
>>>>>>>>>> need to do a blocking wait for the device buffers and config state load
>>>>>>>>>> to finish, therefore blocking other VFIO devices from potentially loading
>>>>>>>>>> their config state if they are ready to begin this operation earlier.
>>>>>>>>>
>>>>>>>>> I am not sure I get you here, loading VFIO device states (I mean, the
>>>>>>>>> non-iterable part) will need to be done sequentially IIUC due to what you
>>>>>>>>> said and should rely on BQL, so I don't know how that could happen
>>>>>>>>> concurrently for now. But I think indeed BQL is a problem.
>>>>>>>> Consider that we have two VFIO devices (A and B), with the following order
>>>>>>>> of switchover_ack_needed handler calls for them: first A gets this call,
>>>>>>>> once the call for A finishes then B gets this call.
>>>>>>>>
>>>>>>>> Now consider what happens if B had loaded all its buffers (in the loading
>>>>>>>> thread) and it is ready for its config load before A finished loading its
>>>>>>>> buffers.
>>>>>>>>
>>>>>>>> B has to wait idle in this situation (even though it could have been already
>>>>>>>> loading its config) since the switchover_ack_needed handler for A won't
>>>>>>>> return until A is fully done.
>>>>>>>
>>>>>>> This sounds like a performance concern, and I wonder how much this impacts
>>>>>>> the real workload (that you run a test and measure, with/without such
>>>>>>> concurrency) when we can save two devices in parallel anyway; I would
>>>>>>> expect the real diff is small due to the fact I mentioned that we save >1
>>>>>>> VFIO devices concurrently via multifd.
>>>>>>>
>>>>>>> Do you think we can start with a simpler approach?
>>>>>>
>>>>>> I don't think introducing a performance/scalability issue like that is
>>>>>> a good thing, especially that we already have a design that avoids it.
>>>>>>
>>>>>> Unfortunately, my current setup does not allow live migrating VMs with
>>>>>> more than 4 VFs so I can't benchmark that.
>>>>>
>>>>> /me wonders why benchmarking it requires more than 4 VFs.
>>>>
>>>> My point here was that the scalability problem will most likely get more
>>>> pronounced with more VFs.
>>>>
>>>>>>
>>>>>> But I'm almost certain that with more VFs the situation with devices being
>>>>>> ready out-of-order will get even more likely.
>>>>>
>>>>> If the config space is small, why would loading it in sequence be a
>>>>> problem?
>>>>>
>>>>> Have you measured how much time it needs to load one VF's config space that
>>>>> you're using? I suppose that's vfio_load_device_config_state() alone?
>>>>
>>>> It's not the amount of data to load that matters here, but that these
>>>> address space operations are slow.
>>>>
>>>> The whole config load takes ~70 ms per device - that's time equivalent
>>>> of transferring 875 MiB of device state via a 100 GBit/s link.
>>>
>>> What's the downtime of migration with 1/2/4 VFs? I remember I saw some
>>> data somewhere but it's not in the cover letter. It'll be good to mention
>>> these results in the cover letter when you repost.
>>
>> Downtimes with the device state transfer being disabled / enabled:
>>              4 VFs     2 VFs      1 VF
>> Disabled:  1783 ms    614 ms    283 ms
>> Enabled:   1068 ms    434 ms    274 ms
>>
>> Will add these numbers to the cover letter of the next patch set version.
>
> Thanks.
>
>>
>>> I'm guessing 70ms isn't a huge deal here, if your NIC has 128GB internal
>>> device state to migrate.. but maybe I'm wrong.
>>
>> It's ~100 MiB of device state per VF here.
>
> Ouch..
>
> I watched your KVM Forum talk recording, I remember that's where I got that
> 128 number but probably got the unit wrong.. OK, that makes sense.
>
>>
>> And it's 70ms of downtime *per device*:
>> so with 4 VF it's ~280ms of downtime taken by the config loads.
>> That's a lot - with perfect parallelization this downtime should
>> *reduce by* 210ms.
>
> Yes, in this case it's a lot.  I wonder why it won't scale as well even
> with your patchset.
>
> Did you profile why?  I highly doubt that in your case the network is an
> issue, as there's only 100MB of per-device data, so even on 10gbps it would
> take only ~100ms to transfer each, and now these transfers can run
> concurrently.  I think you mentioned you were using 100gbps, right?
Right, these 2 test machines are connected via a 100 Gbps network.
> Logically, with multiple threads, VFIO read()s should happen at least
> concurrently per device.  Have you checked that there's no kernel-side
> global VFIO lock etc. that serializes portions of the threads' read()s /
> write()s on the VFIO fds?
For these devices the kernel side was significantly improved a year ago:
https://lore.kernel.org/kvm/20230911093856.81910-1-yishaih@nvidia.com/
In the mlx5 driver the in-kernel device reading task (work) is separated
from the userspace (QEMU) read()ing task via a double/multi buffering scheme.
If there were indeed some global lock serializing all device accesses we
wouldn't be seeing as much improvement from this patch set as we are -
especially since the improvement seems to *increase* with the
increased VF count in a single PF.
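To put numbers on that: from the downtime table above, the absolute saving
is ~9 ms for 1 VF (283 ms - 274 ms), ~180 ms for 2 VFs (614 ms - 434 ms)
and ~715 ms for 4 VFs (1783 ms - 1068 ms) - in other words, the per-VF
saving grows from ~9 ms through ~90 ms to ~179 ms as the VF count increases.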
> It's just a pity that you went this far, added all this logic, but
> without making it fully concurrent at least per device.
AFAIK NVIDIA/Mellanox are continuously working on improving the mlx5 driver,
but to benefit from the driver parallelism we need parallelism in QEMU
too so the userspace won't become the serialization point/bottleneck.
In other words, it's kind of a chicken and egg problem.
That's why I want to preserve as much parallelism in this patch set as
possible to avoid accidental serialization which (even if not a problem
right now) may become the bottleneck at some point.
> I'm OK if you want this in without that figured out, but if I were you I'd
> probably try to dig a bit to at least know why.
>
>>
>>> I also wonder whether you profiled a bit on how that 70ms contributes to
>>> what is slow.
>>
>> I think that's something we can do after we have parallel config loads
>> and it turns out their downtime for some reason still scales strongly
>> linearly with the number of VFIO devices (rather than taking roughly
>> constant time regardless of the count of these devices if running perfectly
>> in parallel).
>
> Similarly, I wonder whether the config space load() can involve something
> globally shared.  I'd also dig a bit here, but I'll leave that to you to
> decide.
Making config loads thread-safe/parallelizable is definitely on my future
TODO list.
Just wanted to keep the amount of changes in the first version of this
patch set within reasonable bounds - one has to draw a line somewhere
otherwise we'll keep working on this patch set forever, with the
QEMU code being a moving target meanwhile.
>>
>>>>
>>>>>>
>>>>>>> So what I'm thinking could be very clean is, we just discussed
>>>>>>> MIG_CMD_SWITCHOVER and it looks like you also think it's an OK approach.  I
>>>>>>> wonder, with that in place, why not move one step further and have
>>>>>>> MIG_CMD_SEND_NON_ITERABLE just to mark that "iterable devices all done,
>>>>>>> ready to send non-iterable".  It can be controlled by the same migration
>>>>>>> property so we only send these two flags in 9.2+ machine types.
>>>>>>>
>>>>>>> Then IIUC VFIO can send config data through the main wire (just like most
>>>>>>> other PCI devices! which is IMHO a good fit..) and on the destination VFIO
>>>>>>> holds off loading it until passing the MIG_CMD_SEND_NON_ITERABLE phase.
>>>>>>
>>>>>> Starting the config load only on MIG_CMD_SEND_NON_ITERABLE would (in addition
>>>>>> to the considerations above) also delay starting the config load until all
>>>>>> iterable devices were read/transferred/loaded and also would complicate
>>>>>> future efforts at loading that config data in parallel.
>>>>>
>>>>> However I wonder whether we can keep it simple in that VFIO's config space
>>>>> is still always saved in vfio_save_state().  I still think it's easier if we
>>>>> stick with the main channel whenever possible.  For this specific case, if
>>>>> the config space is small I think it's tricky that you bypass this with:
>>>>>
>>>>> if (migration->multifd_transfer) {
>>>>> /* Emit dummy NOP data */
>>>>> qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>>>>> return;
>>>>> }
>>>>>
>>>>> Then squash this as the tail of the iterable data.
>>>>>
>>>>> On the src, I think it could use a per-device semaphore, so that iterable
>>>>> save() thread will post() only if it finishes dumping all the data, then
>>>>> that orders VFIO iterable data vs. config space save().
>>>>
>>>> In the future we want to not only transfer but also load the config data
>>>> in parallel.
>>>
>>> How feasible do you think this idea is? E.g. does it involve BQL so far
>>> (e.g. memory updates, others)? What's still missing to make it concurrent?
>>
>> My gut feeling is that it's feasible overall but it's too much of a rabbit
>> hole for the first version of this device state transfer feature.
>>
>> I think it will need some deeper QEMU core address space management changes,
>> which need to be researched/developed/tested/reviewed/etc. on their own.
>>
>> If it was an easy task I would have gladly included such support in this
>> patch set version already for extra downtime reduction :)
>
> Yes I understand.
>
> Note that it doesn't need to be implemented and resolved in one shot, but I
> wonder if it'll still be good to debug the issue and know what exactly is
> not scaling.
>
> Considering that your design is fully concurrent as of now on iterable data
> from the QEMU side, it's less persuasive to provide perf numbers that still
> don't scale that much; 1.78s -> 1.06s is a good improvement, but it
> doesn't seem to solve the scalability issue that this whole series wanted
> to address in general.
>
> An extreme (bad) example is if VFIO has all ioctl()/read()/write() take a
> global lock, then any work in QEMU trying to run things in parallel will be
> in vain.  Such a patchset cannot be accepted because the other issue needs
> to be resolved first.
>
> Now it's in the middle between the best and worst cases, where it did
> improve but it still doesn't scale that well.  I think it can be accepted,
> but still I feel like we're ignoring some of the real issues.  We can
> choose to ignore the kernel side saying that "it's too much to do
> together", but IMHO the issues should be tackled the other way round..
> normally one should work out the kernel scalability issues first, then
> build QEMU on top..  Simply because any kernel change that might make >1
> device save()/load() scale can affect future QEMU changes and design, not
> vice versa.
>
> Again, I know you wished we make some progress, so I don't have a strong
> opinion. Just FYI.
>
As I wrote above, the kernel side of things is being taken care of by
the mlx5 driver maintainers.
And these performance numbers suggest that there isn't some global lock
serializing all device accesses, as otherwise it would quickly become
the bottleneck and we would be seeing diminishing improvement with
increased VF count instead of increasing improvement.
(..)
>>>>
>>>>> that I feel like perhaps can be replaced by a sem (then to drop the
>>>>> condvar)?
>>>>
>>>> Once we have ability to load device config state outside main migration
>>>> thread replacing "load_finish" handler with a semaphore should indeed be
>>>> possible (that's internal migration API so there should be no issue
>>>> removing it as not necessary anymore at this point).
>>>>
>>>> But for now, the devices need to have ability to run their config load
>>>> code on the main migration thread, and for that they need to be called
>>>> from this handler "load_finish".
>>>
>>> A sem seems a must here to notify the iterable data finished loading, but
>>> that doesn't need to hook to the vmstate handler, but some post-process
>>> tasks, like what we do around cpu_synchronize_all_post_init() time.
>>>
>>> If per-device vmstate handler hook version of load_finish() is destined to
>>> look as weird in this case, I'd rather consider a totally separate way to
>>> enqueue some jobs that needs to be run after all vmstates loaded. Then
>>> after one VFIO device fully loads its data, it enqueues the task and post()
>>> to one migration sem saying that "there's one post-process task, please run
>>> it in the migration thread".  There can be a total number of tasks registered
>>> so that the migration thread knows not to continue until that number of tasks
>>> has been processed.  That counter can be part of the vmstate handler, maybe, reporting
>>> that "this vmstate handler has one post-process task".
>>>
>>> Maybe you have other ideas, but please no, let's avoid this load_finish()
>>> thing..
>>
>> I can certainly implement the task-queuing approach instead of the
>> load_finish() handler API if you like such approach more.
>
> I have an even simpler solution now. I think you can reuse precopy
> notifiers.
>
> You can add one new PRECOPY_NOTIFY_INCOMING_COMPLETE event, invoke it after
> all the vmstate loading is done.
>
> As long as VFIO devices exist, VFIO can register with that event, then it
> can do whatever it wants in the main loader thread with BQL held.
>
> You can hide that sem post() / wait() all there, then it's completely VFIO
> internal. Then we leave vmstate handler alone; it just doesn't sound
> suitable when the hooks need to be called out of order.
I can certainly implement this functionality via a new
precopy_notify(PRECOPY_NOTIFY_INCOMING_COMPLETE) notifier - for example,
by having a single notify handler registered by the VFIO driver, common
to all VFIO devices.
This handler on the VFIO driver side will then take care of the proper
ordering of operations between the existing VFIO devices.
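A rough sketch of such a handler, assuming a per-device semaphore
(load_bufs_done_sem - hypothetical, as is the config-load helper below)
that each device's loading thread posts once all of that device's state
buffers have been written to the device:

  static int vfio_incoming_complete_notify(NotifierWithReturn *notifier,
                                           void *data, Error **errp)
  {
      VFIODevice *vbasedev;

      /*
       * For each VFIO device: wait until its loading thread signals
       * that all device state buffers have been written to the device,
       * then load the received config data here, in the main loader
       * thread, with the BQL held.
       */
      QLIST_FOREACH(vbasedev, &vfio_device_list, global_next) {
          qemu_sem_wait(&vbasedev->migration->load_bufs_done_sem);

          if (vfio_load_config_data(vbasedev, errp)) {
              return -EINVAL;
          }
      }

      return 0;
  }

Each device's config load is still serialized on the main thread (as
discussed above), but a device whose buffers finished loading earlier
isn't blocked by anything other than the devices ahead of it in the list.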
>>>>
>>>>> - How qemu_loadvm_load_finish_ready_broadcast() interacts with all
>>>>> above..
>>>>>
>>>>> So if you really think it matters to load whichever VFIO device whose
>>>>> iterable data is ready first, then let's try to come up with some better
>>>>> interface..  I can try to think about it too, but please answer my
>>>>> questions above so I can understand what I am missing on why that's
>>>>> important.  Numbers could help, even with 4 VFs, and I wonder how much diff
>>>>> there can be. Mostly, I don't know why it's slow right now if it is; I
>>>>> thought it should be pretty fast, at least not a concern in VFIO migration
>>>>> world (which can take seconds of downtime or more..).
>>>>>
>>>>> IOW, it sounds more reasonable to me that no matter whether vfio will
>>>>> support multifd, it'll be nice we stick with vfio_load_state() /
>>>>> vfio_save_state() for config space, and hopefully it's also easier it
>>>>> always go via the main channel to everyone. In these two hooks, VFIO can
>>>>> do whatever it wants to sync with other things (on src, sync with
>>>>> concurrent thread pool saving iterable data and dumping things to multifd
>>>>> channels; on dst, sync with multifd concurrent loads). I think it can
>>>>> remove the requirement on the load_finish() interface completely. Yes,
>>>>> this can only load VFIO's pci config space one by one, but I think this is
>>>>> much simpler, and I hope it's also not that slow, but I'm not sure.
>>>>
>>>> To be clear, I made the following diagram describing how the patch set
>>>> is supposed to work right now, including changing per-device
>>>> VFIO_MIG_FLAG_DEV_DATA_STATE_COMPLETE into a common MIG_CMD_SWITCHOVER.
>>>>
>>>> Time flows on it left to right (->).
>>>>
>>>> ----------- DIAGRAM START -----------
>>>> Source overall flow:
>>>> Main channel: live VM phase data -> MIG_CMD_SWITCHOVER -> iterable -> non iterable
>>>> Multifd channels: \ multifd device state read and queue (1) -> multifd config data read and queue (1) /
>>>>
>>>> Target overall flow:
>>>> Main channel: live VM phase data -> MIG_CMD_SWITCHOVER -> iterable -> non iterable -> config data load operations
>>>> Multifd channels: \ multifd device state (1) -> multifd config data read (1)
>>>>
>>>> Target config data load operations flow:
>>>> multifd config data read (1) -> config data load (2)
>>>>
>>>> Notes:
>>>> (1): per device threads running in parallel
>>>
>>> Here I raised this question before, but I'll ask again: do you think we can
>>> avoid using a separate thread on dest qemu, but reuse multifd recv threads?
>>>
>>> Src probably needs its own threads because multifd sender threads take
>>> requests, so they can't block on their own.
>>>
>>> However dest qemu isn't like that, it's packet driven, so I think maybe it's
>>> OK if VFIO directly loads the data in the multifd threads.  We may want to
>>> have enough multifd threads to make sure IO still doesn't block much on the
>>> NIC, but I think tuning the num of multifd threads should work in this
>>> case.
>>
>> We need to have the receiving threads decoupled from the VFIO device state
>> loading threads at least because otherwise:
>> 1) You can have a deadlock if device state for multiple devices arrives
>> out of order, like here:
>>
>> Time flows left to right (->).
>> Multifd channel 1: (VFIO device 1 buffer 2) (VFIO device 2 buffer 1)
>> Multifd channel 2: (VFIO device 2 buffer 2) (VFIO device 1 buffer 1)
>>
>> Both channel receive/load threads would be stuck forever in this case,
>> since they can't load buffer 2 for devices 1 and 2 until they load
>> buffer 1 for each of these devices.
>>
>> 2) If devices are loading buffers at different speeds you don't want
>> to block the faster device from receiving new buffers just because
>> the slower one hasn't finished its loading yet.
>
> I don't see why it can't be avoided. Let me draw this in columns.
>
> How I picture this is:
>
> multifd recv thread 1 multifd recv thread 2
> --------------------- ---------------------
> recv VFIO device 1 buffer 2 recv VFIO device 2 buffer 2
> -> found that (dev1, buf1) missing, -> found that (dev2, buf1) missing,
> skip load skip load
> recv VFIO device 2 buffer 1 recv VFIO device 1 buffer 1
> -> found that (dev2, buf1+buf2) ready, -> found that (dev1, buf1+buf2) ready,
> load buf1+2 for dev2 here load buf1+2 for dev1 here
>
> Here right after one multifd thread recvs a buffer, it needs to be injected
> into the cache array (with proper locking), so that whoever receives a full
> series of those buffers will do the load (again, with proper locking..).
>
> Would this not work?
>
For sure, but that's definitely more complicated logic than just having
a simple device loading thread that naturally loads incoming buffers
for that device in-order.
That thread isn't even in the purview of the migration code since
it's a VFIO driver internal implementation detail.
And we'd still lose parallelism if two buffers that are to be loaded
next for two different devices happen to arrive in the same
multifd channel:
Multifd channel 1: (VFIO device 1 buffer 1) (VFIO device 2 buffer 1)
Multifd channel 2: (VFIO device 2 buffer 2) (VFIO device 1 buffer 2)
Now device 2 buffer 1 has to wait until loading device 1 buffer 1
finishes, even though with the decoupled loading thread implementation
from this patch set these would be loaded in parallel.
>
> Thanks.
>
Thanks,
Maciej
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: [PATCH v2 08/17] migration: Add load_finish handler and associated functions
2024-10-02 20:11 ` Maciej S. Szmigiero
@ 2024-10-02 21:25 ` Peter Xu
2024-10-03 20:34 ` Maciej S. Szmigiero
0 siblings, 1 reply; 128+ messages in thread
From: Peter Xu @ 2024-10-02 21:25 UTC (permalink / raw)
To: Maciej S. Szmigiero
Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P. Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On Wed, Oct 02, 2024 at 10:11:33PM +0200, Maciej S. Szmigiero wrote:
> On 1.10.2024 23:30, Peter Xu wrote:
> > On Tue, Oct 01, 2024 at 10:41:14PM +0200, Maciej S. Szmigiero wrote:
> > > On 30.09.2024 23:57, Peter Xu wrote:
> > > > On Mon, Sep 30, 2024 at 09:25:54PM +0200, Maciej S. Szmigiero wrote:
> > > > > On 27.09.2024 02:53, Peter Xu wrote:
> > > > > > On Fri, Sep 27, 2024 at 12:34:31AM +0200, Maciej S. Szmigiero wrote:
> > > > > > > On 20.09.2024 18:45, Peter Xu wrote:
> > > > > > > > On Fri, Sep 20, 2024 at 05:23:08PM +0200, Maciej S. Szmigiero wrote:
> > > > > > > > > On 19.09.2024 23:11, Peter Xu wrote:
> > > > > > > > > > On Thu, Sep 19, 2024 at 09:49:10PM +0200, Maciej S. Szmigiero wrote:
> > > > > > > > > > > On 9.09.2024 22:03, Peter Xu wrote:
> > > > > > > > > > > > On Tue, Aug 27, 2024 at 07:54:27PM +0200, Maciej S. Szmigiero wrote:
> > > > > > > > > > > > > From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> > > > > > > > > > > > >
> > > > > > > > > > > > > load_finish SaveVMHandler allows migration code to poll whether
> > > > > > > > > > > > > a device-specific asynchronous device state loading operation had finished.
> > > > > > > > > > > > >
> > > > > > > > > > > > > In order to avoid calling this handler needlessly the device is supposed
> > > > > > > > > > > > > to notify the migration code of its possible readiness via a call to
> > > > > > > > > > > > > qemu_loadvm_load_finish_ready_broadcast() while holding
> > > > > > > > > > > > > qemu_loadvm_load_finish_ready_lock.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> > > > > > > > > > > > > ---
> > > > > > > > > > > > > include/migration/register.h | 21 +++++++++++++++
> > > > > > > > > > > > > migration/migration.c | 6 +++++
> > > > > > > > > > > > > migration/migration.h | 3 +++
> > > > > > > > > > > > > migration/savevm.c | 52 ++++++++++++++++++++++++++++++++++++
> > > > > > > > > > > > > migration/savevm.h | 4 +++
> > > > > > > > > > > > > 5 files changed, 86 insertions(+)
> > > > > > > > > > > > >
> > > > > > > > > > > > > diff --git a/include/migration/register.h b/include/migration/register.h
> > > > > > > > > > > > > index 4a578f140713..44d8cf5192ae 100644
> > > > > > > > > > > > > --- a/include/migration/register.h
> > > > > > > > > > > > > +++ b/include/migration/register.h
> > > > > > > > > > > > > @@ -278,6 +278,27 @@ typedef struct SaveVMHandlers {
> > > > > > > > > > > > > int (*load_state_buffer)(void *opaque, char *data, size_t data_size,
> > > > > > > > > > > > > Error **errp);
> > > > > > > > > > > > > + /**
> > > > > > > > > > > > > + * @load_finish
> > > > > > > > > > > > > + *
> > > > > > > > > > > > > + * Poll whether all asynchronous device state loading had finished.
> > > > > > > > > > > > > + * Not called on the load failure path.
> > > > > > > > > > > > > + *
> > > > > > > > > > > > > + * Called while holding the qemu_loadvm_load_finish_ready_lock.
> > > > > > > > > > > > > + *
> > > > > > > > > > > > > + * If this method signals "not ready" then it might not be called
> > > > > > > > > > > > > + * again until qemu_loadvm_load_finish_ready_broadcast() is invoked
> > > > > > > > > > > > > + * while holding qemu_loadvm_load_finish_ready_lock.
> > > > > > > > > > > >
> > > > > > > > > > > > [1]
> > > > > > > > > > > >
> > > > > > > > > > > > > + *
> > > > > > > > > > > > > + * @opaque: data pointer passed to register_savevm_live()
> > > > > > > > > > > > > + * @is_finished: whether the loading had finished (output parameter)
> > > > > > > > > > > > > + * @errp: pointer to Error*, to store an error if it happens.
> > > > > > > > > > > > > + *
> > > > > > > > > > > > > + * Returns zero to indicate success and negative for error
> > > > > > > > > > > > > + * It's not an error that the loading still hasn't finished.
> > > > > > > > > > > > > + */
> > > > > > > > > > > > > + int (*load_finish)(void *opaque, bool *is_finished, Error **errp);
> > > > > > > > > > > >
> > > > > > > > > > > > The load_finish() semantics is a bit weird, especially above [1] on "only
> > > > > > > > > > > > allowed to be called once if ..." and also on the locks.
> > > > > > > > > > >
> > > > > > > > > > > The point of this remark is that a driver needs to call
> > > > > > > > > > > qemu_loadvm_load_finish_ready_broadcast() if it wants for the migration
> > > > > > > > > > > core to call its load_finish handler again.
> > > > > > > > > > >
> > > > > > > > > > > > It looks to me vfio_load_finish() also does the final load of the device.
> > > > > > > > > > > >
> > > > > > > > > > > > I wonder whether that final load can be done in the threads,
> > > > > > > > > > >
> > > > > > > > > > > Here, the problem is that current VFIO VMState has to be loaded from the main
> > > > > > > > > > > migration thread as it internally calls QEMU core address space modification
> > > > > > > > > > > methods which explode if called from another thread(s).
> > > > > > > > > >
> > > > > > > > > > Ahh, I see. I'm trying to make dest qemu loadvm in a thread too and yield
> > > > > > > > > > BQL if possible, when that's ready then in your case here IIUC you can
> > > > > > > > > > simply take BQL in whichever thread that loads it.. but yeah it's not ready
> > > > > > > > > > at least..
> > > > > > > > >
> > > > > > > > > Yeah, long term we might want to work on making these QEMU core address space
> > > > > > > > > modification methods somehow callable from multiple threads but that's
> > > > > > > > > definitely not something for the initial patch set.
> > > > > > > > >
> > > > > > > > > > Would it be possible vfio_save_complete_precopy_async_thread_config_state()
> > > > > > > > > > be done in VFIO's save_live_complete_precopy() through the main channel
> > > > > > > > > > somehow? IOW, does it rely on iterative data to be fetched first from
> > > > > > > > > > kernel, or completely separate states?
> > > > > > > > >
> > > > > > > > > The device state data needs to be fully loaded first before "activating"
> > > > > > > > > the device by loading its config state.
> > > > > > > > >
> > > > > > > > > > And just curious: how large is it
> > > > > > > > > > normally (and I suppose this decides whether it's applicable to be sent via
> > > > > > > > > > the main channel at all..)?
> > > > > > > > >
> > > > > > > > > Config data is *much* smaller than device state data - as far as I remember
> > > > > > > > > it was on the order of kilobytes.
> > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > then after
> > > > > > > > > > > > everything loaded the device post a semaphore telling the main thread to
> > > > > > > > > > > > continue. See e.g.:
> > > > > > > > > > > >
> > > > > > > > > > > > if (migrate_switchover_ack()) {
> > > > > > > > > > > > qemu_loadvm_state_switchover_ack_needed(mis);
> > > > > > > > > > > > }
> > > > > > > > > > > >
> > > > > > > > > > > > IIUC, VFIO can register load_complete_ack similarly so it only sem_post()
> > > > > > > > > > > > when all things are loaded? We can then get rid of this slightly awkward
> > > > > > > > > > > > interface. I had a feeling that things can be simplified (e.g., if the
> > > > > > > > > > > > thread will take care of loading the final vmstate then the mutex is also
> > > > > > > > > > > > not needed? etc.).
> > > > > > > > > > >
> > > > > > > > > > > With just a single call to switchover_ack_needed per VFIO device it would
> > > > > > > > > > > need to do a blocking wait for the device buffers and config state load
> > > > > > > > > > > to finish, therefore blocking other VFIO devices from potentially loading
> > > > > > > > > > > their config state if they are ready to begin this operation earlier.
> > > > > > > > > >
> > > > > > > > > > I am not sure I get you here, loading VFIO device states (I mean, the
> > > > > > > > > > non-iterable part) will need to be done sequentially IIUC due to what you
> > > > > > > > > > said and should rely on BQL, so I don't know how that could happen
> > > > > > > > > > concurrently for now. But I think indeed BQL is a problem.
> > > > > > > > > Consider that we have two VFIO devices (A and B), with the following order
> > > > > > > > > of switchover_ack_needed handler calls for them: first A gets this call,
> > > > > > > > > once the call for A finishes then B gets this call.
> > > > > > > > >
> > > > > > > > > Now consider what happens if B had loaded all its buffers (in the loading
> > > > > > > > > thread) and it is ready for its config load before A finished loading its
> > > > > > > > > buffers.
> > > > > > > > >
> > > > > > > > > B has to wait idle in this situation (even though it could have been already
> > > > > > > > > loading its config) since the switchover_ack_needed handler for A won't
> > > > > > > > > return until A is fully done.
> > > > > > > >
> > > > > > > > This sounds like a performance concern, and I wonder how much this impacts
> > > > > > > > the real workload (that you run a test and measure, with/without such
> > > > > > > > concurrency) when we can save two devices in parallel anyway; I would
> > > > > > > > expect the real diff is small due to the fact I mentioned that we save >1
> > > > > > > > VFIO devices concurrently via multifd.
> > > > > > > >
> > > > > > > > Do you think we can start with a simpler approach?
> > > > > > >
> > > > > > > I don't think introducing a performance/scalability issue like that is
> > > > > > > a good thing, especially that we already have a design that avoids it.
> > > > > > >
> > > > > > > Unfortunately, my current setup does not allow live migrating VMs with
> > > > > > > more than 4 VFs so I can't benchmark that.
> > > > > >
> > > > > > /me wonders why benchmarking it requires more than 4 VFs.
> > > > >
> > > > > My point here was that the scalability problem will most likely get more
> > > > > pronounced with more VFs.
> > > > >
> > > > > > >
> > > > > > > But I'm almost certain that with more VFs the situation with devices being
> > > > > > > ready out-of-order will get even more likely.
> > > > > >
> > > > > > If the config space is small, why would loading it in sequence be a
> > > > > > problem?
> > > > > >
> > > > > > Have you measured how much time it needs to load one VF's config space that
> > > > > > you're using? I suppose that's vfio_load_device_config_state() alone?
> > > > >
> > > > > It's not the amount of data to load that matters here, but that these
> > > > > address space operations are slow.
> > > > >
> > > > > The whole config load takes ~70 ms per device - that's time equivalent
> > > > > of transferring 875 MiB of device state via a 100 GBit/s link.
> > > >
> > > > What's the downtime of migration with 1/2/4 VFs? I remember I saw some
> > > > data somewhere but it's not in the cover letter. It'll be good to mention
> > > > these results in the cover letter when you repost.
> > >
> > > Downtimes with the device state transfer being disabled / enabled:
> > >              4 VFs     2 VFs      1 VF
> > > Disabled:  1783 ms    614 ms    283 ms
> > > Enabled:   1068 ms    434 ms    274 ms
> > >
> > > Will add these numbers to the cover letter of the next patch set version.
> >
> > Thanks.
> >
> > >
> > > > I'm guessing 70ms isn't a huge deal here, if your NIC has 128GB internal
> > > > device state to migrate.. but maybe I'm wrong.
> > >
> > > It's ~100 MiB of device state per VF here.
> >
> > Ouch..
> >
> > I watched your KVM Forum talk recording, I remember that's where I got that
> > 128 number but probably got the unit wrong.. OK, that makes sense.
> >
> > >
> > > And it's 70ms of downtime *per device*:
> > > so with 4 VF it's ~280ms of downtime taken by the config loads.
> > > That's a lot - with perfect parallelization this downtime should
> > > *reduce by* 210ms.
> >
> > Yes, in this case it's a lot.  I wonder why it won't scale as well even
> > with your patchset.
> >
> > Did you profile why?  I highly doubt that in your case the network is an
> > issue, as there's only 100MB of per-device data, so even on 10gbps it would
> > take only ~100ms to transfer each, and now these transfers can run
> > concurrently.  I think you mentioned you were using 100gbps, right?
>
> Right, these 2 test machines are connected via a 100 Gbps network.
>
> > Logically, with multiple threads, VFIO read()s should happen at least
> > concurrently per device.  Have you checked that there's no kernel-side
> > global VFIO lock etc. that serializes portions of the threads' read()s /
> > write()s on the VFIO fds?
>
> For these devices the kernel side was significantly improved a year ago:
> https://lore.kernel.org/kvm/20230911093856.81910-1-yishaih@nvidia.com/
>
> In the mlx5 driver the in-kernel device reading task (work) is separated
> from the userspace (QEMU) read()ing task via a double/multi buffering scheme.
>
> If there was indeed some global lock serializing all device accesses we
> wouldn't be seeing as much improvement from this patch set as we are -
> especially since the improvement seems to *increase* with the
> increased VF count in a single PF.
>
> > It's just a pity that you went this far, added all this logic, but
> > without making it fully concurrent at least per device.
>
> AFAIK NVIDIA/Mellanox are continuously working on improving the mlx5 driver,
> but to benefit from the driver parallelism we need parallelism in QEMU
> too so the userspace won't become the serialization point/bottleneck.
>
> In other words, it's kind of a chicken and egg problem.
>
> That's why I want to preserve as much parallelism in this patch set as
> possible to avoid accidental serialization which (even if not a problem
> right now) may become the bottleneck at some point.
>
> > I'm OK if you want this in without that figured out, but if I were you I'd
> > probably try to dig a bit to at least know why.
> >
> > >
> > > > I also wonder whether you profiled a bit on what that 70ms is actually
> > > > spent on.
> > >
> > > I think that's something we can do after we have parallel config loads,
> > > if it turns out that their downtime for some reason still scales strongly
> > > linearly with the number of VFIO devices (rather than taking roughly
> > > constant time regardless of the count of these devices when running perfectly
> > > in parallel).
> >
> > Similarly, I wonder whether the config space load() can involve something
> > globally shared. I'd also dig a bit here, but I'll leave that to you to
> > decide.
>
> Making config loads thread-safe/parallelizable is definitely on my future
> TODO list.
>
> Just wanted to keep the amount of changes in the first version of this
> patch set within reasonable bounds - one has to draw a line somewhere,
> otherwise we'll keep working on this patch set forever, with the
> QEMU code being a moving target in the meantime.
>
> > >
> > > > >
> > > > > > >
> > > > > > > > So what I'm thinking could be very clean is: we just discussed
> > > > > > > > MIG_CMD_SWITCHOVER and it looks like you also think it's an OK approach. I
> > > > > > > > wonder, with that in place, why not move one step further and have
> > > > > > > > MIG_CMD_SEND_NON_ITERABLE just to mark "iterable devices all done,
> > > > > > > > ready to send non-iterable". It can be controlled by the same migration
> > > > > > > > property so we only send these two flags in 9.2+ machine types.
> > > > > > > >
> > > > > > > > Then IIUC VFIO can send config data through the main wire (just like most
> > > > > > > > other PCI devices! which is IMHO a good fit..) and on the destination VFIO
> > > > > > > > holds off loading it until passing the MIG_CMD_SEND_NON_ITERABLE phase.
> > > > > > >
> > > > > > > Starting the config load only on MIG_CMD_SEND_NON_ITERABLE would (in addition
> > > > > > > to the considerations above) also delay starting the config load until all
> > > > > > > iterable device data was read/transferred/loaded, and would also complicate
> > > > > > > future efforts at loading that config data in parallel.
> > > > > >
> > > > > > However I wonder whether we can keep it simple in that VFIO's config space
> > > > > > is still always saved in vfio_save_state(). I still think it's easier if we
> > > > > > stick with the main channel whenever possible. For this specific case, if
> > > > > > the config space is small I think it's tricky that you bypass this with:
> > > > > >
> > > > > >     if (migration->multifd_transfer) {
> > > > > >         /* Emit dummy NOP data */
> > > > > >         qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> > > > > >         return;
> > > > > >     }
> > > > > >
> > > > > > Then squash this as the tail of the iterable data.
> > > > > >
> > > > > > On the src, I think it could use a per-device semaphore, so that the iterable
> > > > > > save() thread will post() only when it finishes dumping all the data; that
> > > > > > then orders VFIO iterable data vs. the config space save().
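For illustration, a minimal sketch of this per-device semaphore ordering (all
struct, field and function names below are hypothetical, not actual patch
code; QemuSemaphore is QEMU's semaphore wrapper from "qemu/thread.h"):

    /* Hypothetical per-device save state; init with qemu_sem_init(&sem, 0). */
    typedef struct VFIODeviceSave {
        QemuSemaphore iter_done_sem;
    } VFIODeviceSave;

    /* Per-device save thread: dumps all iterable data to multifd channels. */
    static void vfio_save_iterable_thread(VFIODeviceSave *d)
    {
        /* ... queue every iterable device state buffer to multifd ... */
        qemu_sem_post(&d->iter_done_sem);    /* iterable data fully dumped */
    }

    /* Main migration thread, e.g. called from vfio_save_state(): */
    static void vfio_save_config_ordered(VFIODeviceSave *d, QEMUFile *f)
    {
        qemu_sem_wait(&d->iter_done_sem);    /* order config after iterable data */
        /* ... save the config space to the main channel as usual ... */
    }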
> > > > >
> > > > > In the future we want to not only transfer but also load the config data
> > > > > in parallel.
> > > >
> > > > How feasible do you think this idea is? E.g. does it involve the BQL so far
> > > > (e.g. memory updates, others)? What's still missing to make it concurrent?
> > >
> > > My gut feeling is that it's feasible overall, but it's too much of a rabbit
> > > hole for the first version of this device state transfer feature.
> > >
> > > I think it will need some deeper QEMU core address space management changes,
> > > which need to be researched/developed/tested/reviewed/etc. on their own.
> > >
> > > If it was an easy task I would have gladly included such support in this
> > > patch set version already for extra downtime reduction :)
> >
> > Yes I understand.
> >
> > Note that it doesn't need to be implemented and resolved in one shot, but I
> > wonder if it'll still be good to debug the issue and know what is not
> > scaling.
> >
> > Considering that your design is fully concurrent as of now on iterable data
> > from the QEMU side, it's less persuasive to provide perf numbers that still
> > don't scale that much; 1.78s -> 1.06s is a good improvement, but it
> > doesn't seem to solve the scalability issue that this whole series wanted
> > to address in general.
> >
> > An extreme (bad) example is if VFIO had all ioctl()/read()/write() take a
> > global lock: then any work in QEMU trying to run things in parallel would be
> > in vain. Such a patch set couldn't be accepted because the other issue would
> > need to be resolved first.
> >
> > Now it's in the middle between the best and worst conditions: it did improve,
> > but it still doesn't scale that well. I think it can be accepted, but still I
> > feel like we're ignoring some of the real issues. We can choose to ignore
> > the kernel side saying that "it's too much to do together", but IMHO the issues
> > should be tackled the other way round.. normally one should work out the
> > kernel scalability issues first, then build QEMU on top.. Simply
> > because any kernel change that improves scaling of >1 device save()/load() can
> > affect future QEMU changes and design, not vice versa.
> >
> > Again, I know you wish to make some progress, so I don't have a strong
> > opinion. Just FYI.
> >
>
> As I wrote above, the kernel side of things is being taken care of by
> the mlx5 driver maintainers.
>
> And these performance numbers suggest that there isn't some global lock
> serializing all device accesses, since otherwise it would quickly become
> the bottleneck and we would be seeing diminishing improvement from
> increased VF count instead of increasing improvement.
Personally I am not satisfied with the scaling in these numbers..
  1 VF      2 VFs     4 VFs
274 ms -> 434 ms -> 1068 ms
The lock doesn't need to be as stupid as a global lock that all ioctl()s
take, and it might not be so obvious that we can easily see it. It can hide
internally; it may not be in the form of a lock at all.
1068 is almost 4x of 274 here; that's really not scalable at all, even if it
is an improvement for sure.. I still feel like something is off. If you
think the kernel isn't the bottleneck, I am actually more curious why,
especially if that could be relevant to the QEMU design.
>
> (..)
> > > > >
> > > > > > that I feel like perhaps can be replaced by a sem (then to drop the
> > > > > > condvar)?
> > > > >
> > > > > Once we have the ability to load device config state outside the main
> > > > > migration thread, replacing the "load_finish" handler with a semaphore
> > > > > should indeed be possible (that's an internal migration API, so there should
> > > > > be no issue removing it once it's no longer necessary).
> > > > >
> > > > > But for now, the devices need to have the ability to run their config load
> > > > > code on the main migration thread, and for that they need to be called
> > > > > from this "load_finish" handler.
> > > >
> > > > A sem seems a must here to signal that the iterable data finished loading, but
> > > > that doesn't need to hook into the vmstate handler; it can be some post-process
> > > > tasks, like what we do around cpu_synchronize_all_post_init() time.
> > > >
> > > > If the per-device vmstate handler hook version of load_finish() is destined to
> > > > look this weird, I'd rather consider a totally separate way to
> > > > enqueue jobs that need to be run after all vmstates are loaded. Then,
> > > > after one VFIO device fully loads its data, it enqueues the task and post()s
> > > > to one migration sem saying that "there's one post-process task, please run
> > > > it in the migration thread". There can be a total number of tasks registered
> > > > so that the migration thread knows not to continue until that number of tasks
> > > > has been processed. That counter can be part of the vmstate handler, maybe,
> > > > reporting that "this vmstate handler has one post-process task".
> > > >
> > > > Maybe you have other ideas, but please no, let's avoid this load_finish()
> > > > thing..
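For illustration, a rough sketch of such a post-process task queue (the API
and all names below are hypothetical, not existing QEMU code):

    typedef void (*LoadPostTaskFn)(void *opaque);

    typedef struct {
        LoadPostTaskFn fn;
        void *opaque;
    } LoadPostTask;

    static QemuMutex post_task_lock;
    static QemuSemaphore post_task_sem;
    static GQueue post_tasks = G_QUEUE_INIT;  /* queue of LoadPostTask * */
    static unsigned int post_tasks_expected;  /* announced by vmstate handlers */

    /* Called e.g. by a VFIO loading thread once its device data is fully loaded. */
    void migration_queue_post_task(LoadPostTaskFn fn, void *opaque)
    {
        LoadPostTask *t = g_new(LoadPostTask, 1);

        t->fn = fn;
        t->opaque = opaque;
        qemu_mutex_lock(&post_task_lock);
        g_queue_push_tail(&post_tasks, t);
        qemu_mutex_unlock(&post_task_lock);
        qemu_sem_post(&post_task_sem);        /* wake the main migration thread */
    }

    /* Main migration thread, after all vmstates were loaded: */
    void migration_run_post_tasks(void)
    {
        for (unsigned int i = 0; i < post_tasks_expected; i++) {
            LoadPostTask *t;

            qemu_sem_wait(&post_task_sem);
            qemu_mutex_lock(&post_task_lock);
            t = g_queue_pop_head(&post_tasks);
            qemu_mutex_unlock(&post_task_lock);
            t->fn(t->opaque);                 /* runs with the BQL held */
            g_free(t);
        }
    }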
> > >
> > > I can certainly implement the task-queuing approach instead of the
> > > load_finish() handler API if you like that approach more.
> >
> > I have an even simpler solution now. I think you can reuse precopy
> > notifiers.
> >
> > You can add one new PRECOPY_NOTIFY_INCOMING_COMPLETE event and invoke it after
> > all vmstates have been loaded.
> >
> > As long as VFIO devices exist, VFIO can register with that event, then it
> > can do whatever it wants in the main loader thread with BQL held.
> >
> > You can hide all that sem post() / wait() there; then it's completely VFIO
> > internal and we leave the vmstate handler alone - it just doesn't sound
> > suitable when the hooks need to be called out of order.
>
> I can certainly implement this functionality via a new
> precopy_notify(PRECOPY_NOTIFY_INCOMING_COMPLETE) notifier, for example
> by having a single notify handler registered by the VFIO driver that is
> common to all VFIO devices.
>
> This handler on the VFIO driver side will then take care of proper operation
> ordering between the existing VFIO devices.
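For illustration, a rough sketch of that notifier (hedged: the
PRECOPY_NOTIFY_INCOMING_COMPLETE reason is only proposed in this thread and
does not exist in QEMU, vfio_load_config_all_devices() is a hypothetical
helper, and the exact NotifierWithReturn callback signature should be checked
against the current tree):

    static int vfio_incoming_complete_notify(NotifierWithReturn *notifier,
                                             void *data, Error **errp)
    {
        PrecopyNotifyData *pnd = data;

        if (pnd->reason != PRECOPY_NOTIFY_INCOMING_COMPLETE) {
            return 0;
        }
        /*
         * Runs in the main loader thread with the BQL held: wait for all
         * VFIO loading threads to finish, then load each device's config
         * data in a well-defined order.
         */
        vfio_load_config_all_devices();
        return 0;
    }

    static NotifierWithReturn vfio_incoming_complete_notifier = {
        .notify = vfio_incoming_complete_notify,
    };

    /* Registered once, e.g. when the first VFIO device is realized: */
    precopy_add_notifier(&vfio_incoming_complete_notifier);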
Great!
>
> > > > >
> > > > > > - How qemu_loadvm_load_finish_ready_broadcast() interacts with all
> > > > > > above..
> > > > > >
> > > > > > So if you really think it matters to load whatever VFIO device whose
> > > > > > iterable data is ready first, then let's try to come up with some better
> > > > > > interface.. I can try to think about it too, but please answer my
> > > > > > questions above so I can understand what I am missing on why that's
> > > > > > important. Numbers could help, even with 4 VFs, and I wonder how much diff
> > > > > > there can be. Mostly, I don't know why it's slow right now if it is; I
> > > > > > thought it should be pretty fast, at least not a concern in the VFIO migration
> > > > > > world (which can take seconds of downtime or more..).
> > > > > >
> > > > > > IOW, it sounds more reasonable to me that no matter whether VFIO will
> > > > > > support multifd, it'd be nice to stick with vfio_load_state() /
> > > > > > vfio_save_state() for config space, and hopefully it's also easier if it
> > > > > > always goes via the main channel for everyone. In these two hooks, VFIO can
> > > > > > do whatever it wants to sync with other things (on src, sync with the
> > > > > > concurrent thread pool saving iterable data and dumping things to multifd
> > > > > > channels; on dst, sync with multifd concurrent loads). I think it can
> > > > > > remove the requirement on the load_finish() interface completely. Yes,
> > > > > > this can only load VFIO's PCI config space one by one, but I think this is
> > > > > > much simpler, and I hope it's also not that slow, but I'm not sure.
> > > > >
> > > > > To be clear, I made the following diagram describing how the patch set
> > > > > is supposed to work right now, including changing the per-device
> > > > > VFIO_MIG_FLAG_DEV_DATA_STATE_COMPLETE into a common MIG_CMD_SWITCHOVER.
> > > > >
> > > > > Time flows on it left to right (->).
> > > > >
> > > > > ----------- DIAGRAM START -----------
> > > > > Source overall flow:
> > > > > Main channel: live VM phase data -> MIG_CMD_SWITCHOVER -> iterable -> non iterable
> > > > > Multifd channels: \ multifd device state read and queue (1) -> multifd config data read and queue (1) /
> > > > >
> > > > > Target overall flow:
> > > > > Main channel: live VM phase data -> MIG_CMD_SWITCHOVER -> iterable -> non iterable -> config data load operations
> > > > > Multifd channels: \ multifd device state (1) -> multifd config data read (1)
> > > > >
> > > > > Target config data load operations flow:
> > > > > multifd config data read (1) -> config data load (2)
> > > > >
> > > > > Notes:
> > > > > (1): per device threads running in parallel
> > > >
> > > > Here I raised this question before, but I'll ask again: do you think we can
> > > > avoid using a separate thread on dest qemu and reuse the multifd recv threads?
> > > >
> > > > Src probably needs its own threads because the multifd sender threads take
> > > > requests, so they can't block on their own.
> > > >
> > > > However dest qemu isn't like that; it's packet driven, so I think maybe it's
> > > > OK if VFIO directly loads the data in the multifd threads. We may want to
> > > > have enough multifd threads to make sure I/O still doesn't block much on the
> > > > NIC, but I think tuning the number of multifd threads should work in this
> > > > case.
> > >
> > > We need to have the receiving threads decoupled from the VFIO device state
> > > loading threads at least because otherwise:
> > > 1) You can have a deadlock if device state for multiple devices arrives
> > > out of order, like here:
> > >
> > > Time flows left to right (->).
> > > Multifd channel 1: (VFIO device 1 buffer 2) (VFIO device 2 buffer 1)
> > > Multifd channel 2: (VFIO device 2 buffer 2) (VFIO device 1 buffer 1)
> > >
> > > Both channel receive/load threads would be stuck forever in this case,
> > > since they can't load buffer 2 for devices 1 and 2 until they load
> > > buffer 1 for each of these devices.
> > >
> > > 2) If devices are loading buffers at different speeds you don't want
> > > to block the faster device from receiving a new buffer just because
> > > the slower one hasn't finished its loading yet.
> >
> > I don't see why it can't be avoided. Let me draw this in columns.
> >
> > How I picture this is:
> >
> > multifd recv thread 1 multifd recv thread 2
> > --------------------- ---------------------
> > recv VFIO device 1 buffer 2 recv VFIO device 2 buffer 2
> > -> found that (dev1, buf1) missing, -> found that (dev2, buf1) missing,
> > skip load skip load
> > recv VFIO device 2 buffer 1 recv VFIO device 1 buffer 1
> > -> found that (dev2, buf1+buf2) ready, -> found that (dev1, buf1+buf2) ready,
> > load buf1+2 for dev2 here load buf1+2 for dev1 here
> > Here right after one multifd thread recvs a buffer, it needs to be injected
> > into the cache array (with proper locking), so that whoever receives a full
> > series of those buffers will do the load (again, with proper locking..).
> >
> > Would this not work?
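For reference, a rough sketch of this shared-cache idea (all structure and
function names below are hypothetical):

    typedef struct {
        QemuMutex lock;
        void **buf;            /* buf[i] != NULL once buffer i was received */
        size_t *len;
        unsigned int next;     /* next in-order buffer index to load */
        unsigned int total;    /* total number of buffers expected */
        bool loading;          /* some recv thread is currently loading */
    } DevLoadCache;

    /* Called by whichever multifd recv thread received buffer `idx`. */
    static void dev_cache_insert_and_maybe_load(DevLoadCache *c, unsigned int idx,
                                                void *data, size_t len)
    {
        qemu_mutex_lock(&c->lock);
        c->buf[idx] = data;
        c->len[idx] = len;
        if (c->loading) {      /* the recv thread currently loading picks it up */
            qemu_mutex_unlock(&c->lock);
            return;
        }
        c->loading = true;
        /* Load the longest contiguous run of buffers that is now complete. */
        while (c->next < c->total && c->buf[c->next]) {
            unsigned int i = c->next++;

            qemu_mutex_unlock(&c->lock);   /* don't block other recv threads */
            vfio_load_one_buffer(i, c->buf[i], c->len[i]);  /* hypothetical */
            qemu_mutex_lock(&c->lock);
        }
        c->loading = false;
        qemu_mutex_unlock(&c->lock);
    }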
> >
>
> For sure, but that's definitely more complicated logic than just having
> a simple device loading thread that naturally loads incoming buffers
> for that device in-order.
I thought it was mostly your logic that got implemented, but yeah, I didn't
check the details on the VFIO side too much.
> That thread isn't even in the purview of the migration code since
> it's a VFIO driver internal implementation detail.
>
> And we'd still lose parallelism if it happens that two buffers that
> are to be loaded next for two devices happen to arrive in the same
> multifd channel:
> Multifd channel 1: (VFIO device 1 buffer 1) (VFIO device 2 buffer 1)
> Multifd channel 2: (VFIO device 2 buffer 2) (VFIO device 1 buffer 2)
>
> Now device 2 buffer 1 has to wait until loading device 1 buffer 1
> finishes, even though with the decoupled loading thread implementation
> from this patch set these would be loaded in parallel.
Well it's possible indeed, but with normally 8 or more threads present, the
possibility of having such a dependency is low.
Cédric had a similar comment about starting simple with the thread model.
I'd still suggest that, if at all possible, we try to reuse the multifd recv
threads; I do expect the results to be similar.
I am sorry to ask for this, Fabiano already blames me for this, but..
logically it'll be best if we use no new threads in the series, then one patch
on top with your new thread solution to justify its performance benefits
and whether having those threads is worthwhile at all.
PS: I'd suggest that if you really need those threads they should still be
managed by the migration framework, like the src thread pool. Sorry, I'm pretty
stubborn on this, especially after I noticed we have the query-migrationthreads
API just recently.. even if now I'm not sure whether we should remove that API.
I assume that shouldn't need much change, even if necessary.
Thanks,
--
Peter Xu
* Re: [PATCH v2 08/17] migration: Add load_finish handler and associated functions
2024-10-02 21:25 ` Peter Xu
@ 2024-10-03 20:34 ` Maciej S. Szmigiero
2024-10-03 21:17 ` Peter Xu
0 siblings, 1 reply; 128+ messages in thread
From: Maciej S. Szmigiero @ 2024-10-03 20:34 UTC (permalink / raw)
To: Peter Xu
Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P. Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On 2.10.2024 23:25, Peter Xu wrote:
> On Wed, Oct 02, 2024 at 10:11:33PM +0200, Maciej S. Szmigiero wrote:
>> On 1.10.2024 23:30, Peter Xu wrote:
>>> On Tue, Oct 01, 2024 at 10:41:14PM +0200, Maciej S. Szmigiero wrote:
>>>> On 30.09.2024 23:57, Peter Xu wrote:
>>>>> On Mon, Sep 30, 2024 at 09:25:54PM +0200, Maciej S. Szmigiero wrote:
>>>>>> On 27.09.2024 02:53, Peter Xu wrote:
>>>>>>> On Fri, Sep 27, 2024 at 12:34:31AM +0200, Maciej S. Szmigiero wrote:
>>>>>>>> On 20.09.2024 18:45, Peter Xu wrote:
>>>>>>>>> On Fri, Sep 20, 2024 at 05:23:08PM +0200, Maciej S. Szmigiero wrote:
>>>>>>>>>> On 19.09.2024 23:11, Peter Xu wrote:
>>>>>>>>>>> On Thu, Sep 19, 2024 at 09:49:10PM +0200, Maciej S. Szmigiero wrote:
>>>>>>>>>>>> On 9.09.2024 22:03, Peter Xu wrote:
>>>>>>>>>>>>> On Tue, Aug 27, 2024 at 07:54:27PM +0200, Maciej S. Szmigiero wrote:
>>>>>>>>>>>>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> load_finish SaveVMHandler allows migration code to poll whether
>>>>>>>>>>>>>> a device-specific asynchronous device state loading operation had finished.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In order to avoid calling this handler needlessly the device is supposed
>>>>>>>>>>>>>> to notify the migration code of its possible readiness via a call to
>>>>>>>>>>>>>> qemu_loadvm_load_finish_ready_broadcast() while holding
>>>>>>>>>>>>>> qemu_loadvm_load_finish_ready_lock.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>>>>>>>>>>>>>> ---
(..)
>> As I wrote above, the kernel side of things is being taken care of by
>> the mlx5 driver maintainers.
>>
>> And these performance numbers suggest that there isn't some global lock
>> serializing all device accesses, since otherwise it would quickly become
>> the bottleneck and we would be seeing diminishing improvement from
>> increased VF count instead of increasing improvement.
>
> Personally I am not satisfied with the scaling in these numbers..
>
>   1 VF      2 VFs     4 VFs
> 274 ms -> 434 ms -> 1068 ms
>
> The lock doesn't need to be as stupid as a global lock that all ioctl()s
> take, and it might not be so obvious that we can easily see it. It can hide
> internally; it may not be in the form of a lock at all.
>
> 1068 is almost 4x of 274 here; that's really not scalable at all, even if it
> is an improvement for sure.. I still feel like something is off. If you
> think the kernel isn't the bottleneck, I am actually more curious why,
> especially if that could be relevant to the QEMU design.
>
These are 4 VFs of a single PF NIC, so it's not only the kernel driver
that's involved here but also the whole physical device itself.
Without the userspace/QEMU side being parallelized it was hard to even
measure the driver/device-side bottlenecks.
However, even with the current state of things we still get a nice 67%
improvement in downtime.
As I wrote yesterday, AFAIK there is also work in progress on the
mlx5/device side of things.
(..)
>>>>>>
>>>>>>> - How qemu_loadvm_load_finish_ready_broadcast() interacts with all
>>>>>>> above..
>>>>>>>
>>>>>>> So if you really think it matters to load whatever VFIO device whose
>>>>>>> iterable data is ready first, then let's try to come up with some better
>>>>>>> interface.. I can try to think about it too, but please answer my
>>>>>>> questions above so I can understand what I am missing on why that's
>>>>>>> important. Numbers could help, even with 4 VFs, and I wonder how much diff
>>>>>>> there can be. Mostly, I don't know why it's slow right now if it is; I
>>>>>>> thought it should be pretty fast, at least not a concern in the VFIO migration
>>>>>>> world (which can take seconds of downtime or more..).
>>>>>>>
>>>>>>> IOW, it sounds more reasonable to me that no matter whether VFIO will
>>>>>>> support multifd, it'd be nice to stick with vfio_load_state() /
>>>>>>> vfio_save_state() for config space, and hopefully it's also easier if it
>>>>>>> always goes via the main channel for everyone. In these two hooks, VFIO can
>>>>>>> do whatever it wants to sync with other things (on src, sync with the
>>>>>>> concurrent thread pool saving iterable data and dumping things to multifd
>>>>>>> channels; on dst, sync with multifd concurrent loads). I think it can
>>>>>>> remove the requirement on the load_finish() interface completely. Yes,
>>>>>>> this can only load VFIO's PCI config space one by one, but I think this is
>>>>>>> much simpler, and I hope it's also not that slow, but I'm not sure.
>>>>>>
>>>>>> To be clear, I made the following diagram describing how the patch set
>>>>>> is supposed to work right now, including changing the per-device
>>>>>> VFIO_MIG_FLAG_DEV_DATA_STATE_COMPLETE into a common MIG_CMD_SWITCHOVER.
>>>>>>
>>>>>> Time flows on it left to right (->).
>>>>>>
>>>>>> ----------- DIAGRAM START -----------
>>>>>> Source overall flow:
>>>>>> Main channel: live VM phase data -> MIG_CMD_SWITCHOVER -> iterable -> non iterable
>>>>>> Multifd channels: \ multifd device state read and queue (1) -> multifd config data read and queue (1) /
>>>>>>
>>>>>> Target overall flow:
>>>>>> Main channel: live VM phase data -> MIG_CMD_SWITCHOVER -> iterable -> non iterable -> config data load operations
>>>>>> Multifd channels: \ multifd device state (1) -> multifd config data read (1)
>>>>>>
>>>>>> Target config data load operations flow:
>>>>>> multifd config data read (1) -> config data load (2)
>>>>>>
>>>>>> Notes:
>>>>>> (1): per device threads running in parallel
>>>>>
>>>>> Here I raised this question before, but I'll ask again: do you think we can
>>>>> avoid using a separate thread on dest qemu and reuse the multifd recv threads?
>>>>>
>>>>> Src probably needs its own threads because the multifd sender threads take
>>>>> requests, so they can't block on their own.
>>>>>
>>>>> However dest qemu isn't like that; it's packet driven, so I think maybe it's
>>>>> OK if VFIO directly loads the data in the multifd threads. We may want to
>>>>> have enough multifd threads to make sure I/O still doesn't block much on the
>>>>> NIC, but I think tuning the number of multifd threads should work in this
>>>>> case.
>>>>
>>>> We need to have the receiving threads decoupled from the VFIO device state
>>>> loading threads at least because otherwise:
>>>> 1) You can have a deadlock if device state for multiple devices arrives
>>>> out of order, like here:
>>>>
>>>> Time flows left to right (->).
>>>> Multifd channel 1: (VFIO device 1 buffer 2) (VFIO device 2 buffer 1)
>>>> Multifd channel 2: (VFIO device 2 buffer 2) (VFIO device 1 buffer 1)
>>>>
>>>> Both channel receive/load threads would be stuck forever in this case,
>>>> since they can't load buffer 2 for devices 1 and 2 until they load
>>>> buffer 1 for each of these devices.
>>>>
>>>> 2) If devices are loading buffers at different speeds you don't want
>>>> to block the faster device from receiving a new buffer just because
>>>> the slower one hasn't finished its loading yet.
>>>
>>> I don't see why it can't be avoided. Let me draw this in columns.
>>>
>>> How I picture this is:
>>>
>>> multifd recv thread 1 multifd recv thread 2
>>> --------------------- ---------------------
>>> recv VFIO device 1 buffer 2 recv VFIO device 2 buffer 2
>>> -> found that (dev1, buf1) missing, -> found that (dev2, buf1) missing,
>>> skip load skip load
>>> recv VFIO device 2 buffer 1 recv VFIO device 1 buffer 1
>>> -> found that (dev2, buf1+buf2) ready, -> found that (dev1, buf1+buf2) ready,
>>> load buf1+2 for dev2 here load buf1+2 for dev1 here
>>> Here right after one multifd thread recvs a buffer, it needs to be injected
>>> into the cache array (with proper locking), so that whoever receives a full
>>> series of those buffers will do the load (again, with proper locking..).
>>>
>>> Would this not work?
>>>
>>
>> For sure, but that's definitely more complicated logic than just having
>> a simple device loading thread that naturally loads incoming buffers
>> for that device in-order.
>
> I thought it was mostly your logic that got implemented, but yeah, I didn't
> check the details on the VFIO side too much.
>
>> That thread isn't even in the purview of the migration code since
>> it's a VFIO driver internal implementation detail.
>>
>> And we'd still lose parallelism if it happens that two buffers that
>> are to be loaded next for two devices happen to arrive in the same
>> multifd channel:
>> Multifd channel 1: (VFIO device 1 buffer 1) (VFIO device 2 buffer 1)
>> Multifd channel 2: (VFIO device 2 buffer 2) (VFIO device 1 buffer 2)
>>
>> Now device 2 buffer 1 has to wait until loading device 1 buffer 1
>> finishes, even though with the decoupled loading thread implementation
>> from this patch set these would be loaded in parallel.
>
> Well it's possible indeed, but with normally 8 or more threads present, the
> possibility of having such a dependency is low.
>
> Cédric had a similar comment about starting simple with the thread model.
> I'd still suggest that, if at all possible, we try to reuse the multifd recv
> threads; I do expect the results to be similar.
>
> I am sorry to ask for this, Fabiano already blames me for this, but..
> logically it'll be best if we use no new threads in the series, then one patch
> on top with your new thread solution to justify its performance benefits
> and whether having those threads is worthwhile at all.
To be clear, these loading threads are mostly blocking I/O threads, NOT
compute threads.
This means that the usual "rule of thumb" that the count of threads should
not exceed the total number of logical CPUs does NOT apply to them.
They are similar to what glibc uses under the hood to simulate POSIX AIO
(aio_read(), aio_write()) and to implement an async DNS resolver (getaddrinfo_a()),
and to what GLib's GIO uses to simulate its own async file operations.
Using helper threads for turning blocking I/O into "AIO" is a pretty common
thing.
To show that these loading threads mostly spend their time sleeping (waiting
for I/O) I made a quick patch at [1] tracing how much time they spend waiting
for incoming buffers and how much time they spend waiting for these buffers
to be loaded into the device.
The results (without patch [2] described later) are like this:
> 5919@1727974993.403280:vfio_load_state_device_buffer_start (0000:af:00.2)
> 5921@1727974993.407932:vfio_load_state_device_buffer_start (0000:af:00.4)
> 5922@1727974993.407964:vfio_load_state_device_buffer_start (0000:af:00.5)
> 5920@1727974993.408480:vfio_load_state_device_buffer_start (0000:af:00.3)
> 5920@1727974993.666843:vfio_load_state_device_buffer_end (0000:af:00.3) wait 43 ms load 217 ms
> 5921@1727974993.686005:vfio_load_state_device_buffer_end (0000:af:00.4) wait 75 ms load 206 ms
> 5919@1727974993.686054:vfio_load_state_device_buffer_end (0000:af:00.2) wait 69 ms load 210 ms
> 5922@1727974993.689919:vfio_load_state_device_buffer_end (0000:af:00.5) wait 79 ms load 204 ms
Summing up:
0000:af:00.2 total loading time 283 ms, wait 69 ms load 210 ms
0000:af:00.3 total loading time 258 ms, wait 43 ms load 217 ms
0000:af:00.4 total loading time 278 ms, wait 75 ms load 206 ms
0000:af:00.5 total loading time 282 ms, wait 79 ms load 204 ms
In other words, these threads spend ~100% of their total runtime waiting
for I/O, 70%-75% of that time waiting for buffers to get loaded into their
target device.
So having more threads here won't negatively affect the host CPU
consumption since these threads barely use the host CPU at all.
Also, their count is capped at the number of VFIO devices in the VM.
I also did a quick test with the same config as usual: 4 VFs, 6 multifd
channels, but with the patch at [2] simulating forced coupling of loading
threads to multifd receive channel threads.
With this patch the load_state_buffer() handler returns to the multifd
channel thread only when the loading thread has finished loading the available
buffers and is about to wait for the next buffers to arrive - just as
loading buffers directly from these channel threads would do.
The resulting lowest downtime from 115 live migration runs was 1295ms -
that's 21% worse than 1068ms of downtime with these loading threads running
on their own.
I expect this performance penalty to get even worse with more than
4 VFs.
So no, we can't load buffers directly from multifd channel receive threads.
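For reference, the decoupled per-device loading thread boils down to roughly
the following (a sketch with hypothetical names, not the actual series code):

    static void *vfio_device_load_thread(void *opaque)
    {
        VFIODeviceLoad *d = opaque;

        qemu_mutex_lock(&d->lock);
        while (d->next < d->total) {
            VFIOStateBuffer *b = d->buf[d->next];

            if (!b) {
                /* Wait for a multifd recv thread to hand over the next buffer. */
                qemu_cond_wait(&d->cond, &d->lock);
                continue;
            }
            d->next++;
            qemu_mutex_unlock(&d->lock);
            /*
             * Blocking write() of the buffer into the device's data_fd - per
             * the trace numbers above this is where ~70%-75% of the thread's
             * runtime is spent, sleeping in the kernel.
             */
            vfio_load_one_state_buffer(d, b);  /* hypothetical */
            qemu_mutex_lock(&d->lock);
        }
        qemu_mutex_unlock(&d->lock);
        return NULL;
    }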
> PS: I'd suggest that if you really need those threads they should still be
> managed by the migration framework, like the src thread pool. Sorry, I'm pretty
> stubborn on this, especially after I noticed we have the query-migrationthreads
> API just recently.. even if now I'm not sure whether we should remove that API.
> I assume that shouldn't need much change, even if necessary.
I can certainly make these loading threads managed in a thread pool if that's
easier for you.
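For illustration, such pool-managed submission might look roughly like this
(hedged: the generic-pool API below is only assumed from the v2 changelog's
thread-pool patches; the actual names and signature may differ):

    /* The loading loop sketched above, run as a generic pool work item. */
    static void vfio_load_bufs_thread(void *opaque)
    {
        VFIODeviceLoad *d = opaque;
        /* ... in-order buffer loading loop ... */
    }

    /* On incoming migration setup, once per VFIO device; `load_pool` and
     * vfio_device_load_free() are hypothetical. */
    thread_pool_submit(load_pool, vfio_load_bufs_thread, d,
                       (GDestroyNotify)vfio_device_load_free);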
> Thanks,
>
Thanks,
Maciej
[1]: https://github.com/maciejsszmigiero/qemu/commit/b0833053359715c604070f64fc058f90ec61d180
[2]: https://github.com/maciejsszmigiero/qemu/commit/0c9b4072eaebf8e7bd9560dd27a14cd048097565
* Re: [PATCH v2 08/17] migration: Add load_finish handler and associated functions
2024-10-03 20:34 ` Maciej S. Szmigiero
@ 2024-10-03 21:17 ` Peter Xu
0 siblings, 0 replies; 128+ messages in thread
From: Peter Xu @ 2024-10-03 21:17 UTC (permalink / raw)
To: Maciej S. Szmigiero
Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P. Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On Thu, Oct 03, 2024 at 10:34:28PM +0200, Maciej S. Szmigiero wrote:
> To be clear, these loading threads are mostly blocking I/O threads, NOT
> compute threads.
> This means that the usual "rule of thumb" that the count of threads should
> not exceed the total number of logical CPUs does NOT apply to them.
>
> They are similar to what glibc uses under the hood to simulate POSIX AIO
> (aio_read(), aio_write()) and to implement an async DNS resolver (getaddrinfo_a()),
> and to what GLib's GIO uses to simulate its own async file operations.
> Using helper threads for turning blocking I/O into "AIO" is a pretty common
> thing.
Fair enough. Yes, I could be overly cautious due to previous experience
managing all kinds of migration threads.
>
> To show that these loading threads mostly spend their time sleeping (waiting
> for I/O) I made a quick patch at [1] tracing how much time they spend waiting
> for incoming buffers and how much time they spend waiting for these buffers
> to be loaded into the device.
>
> The results (without patch [2] described later) are like this:
> > 5919@1727974993.403280:vfio_load_state_device_buffer_start (0000:af:00.2)
> > 5921@1727974993.407932:vfio_load_state_device_buffer_start (0000:af:00.4)
> > 5922@1727974993.407964:vfio_load_state_device_buffer_start (0000:af:00.5)
> > 5920@1727974993.408480:vfio_load_state_device_buffer_start (0000:af:00.3)
> > 5920@1727974993.666843:vfio_load_state_device_buffer_end (0000:af:00.3) wait 43 ms load 217 ms
> > 5921@1727974993.686005:vfio_load_state_device_buffer_end (0000:af:00.4) wait 75 ms load 206 ms
> > 5919@1727974993.686054:vfio_load_state_device_buffer_end (0000:af:00.2) wait 69 ms load 210 ms
> > 5922@1727974993.689919:vfio_load_state_device_buffer_end (0000:af:00.5) wait 79 ms load 204 ms
>
> Summing up:
> 0000:af:00.2 total loading time 283 ms, wait 69 ms load 210 ms
> 0000:af:00.3 total loading time 258 ms, wait 43 ms load 217 ms
> 0000:af:00.4 total loading time 278 ms, wait 75 ms load 206 ms
> 0000:af:00.5 total loading time 282 ms, wait 79 ms load 204 ms
>
> In other words, these threads spend ~100% of their total runtime waiting
> for I/O, 70%-75% of that time waiting for buffers to get loaded into their
> target device.
>
> So having more threads here won't negatively affect the host CPU
> consumption since these threads barely use the host CPU at all.
> Also, their count is capped at the number of VFIO devices in the VM.
>
> I also did a quick test with the same config as usual: 4 VFs, 6 multifd
> channels, but with the patch at [2] simulating forced coupling of loading
> threads to multifd receive channel threads.
>
> With this patch the load_state_buffer() handler returns to the multifd
> channel thread only when the loading thread has finished loading the available
> buffers and is about to wait for the next buffers to arrive - just as
> loading buffers directly from these channel threads would do.
>
> The resulting lowest downtime from 115 live migration runs was 1295ms -
> that's 21% worse than 1068ms of downtime with these loading threads running
> on their own.
>
> I expect this performance penalty to get even worse with more than
> 4 VFs.
>
> So no, we can't load buffers directly from multifd channel receive threads.
6 channels may be a bit few in this test case with 4 VFs, but indeed
adding such a dependency on the number of multifd threads isn't great either, I
agree. I'm OK as long as the VFIO reviewers are fine with it.
>
> > PS: I'd suggest that if you really need those threads they should still be
> > managed by the migration framework, like the src thread pool. Sorry, I'm pretty
> > stubborn on this, especially after I noticed we have the query-migrationthreads
> > API just recently.. even if now I'm not sure whether we should remove that API.
> > I assume that shouldn't need much change, even if necessary.
>
> I can certainly make these loading threads managed in a thread pool if that's
> easier for you.
Yes, if you want to use separate threads it'll be great to match the src
thread model with a similar pool. I hope the pool interface you have is
easily applicable on both sides.
Thanks,
--
Peter Xu
* Re: [PATCH v2 00/17] Multifd device state transfer support with VFIO consumer
2024-08-27 17:54 [PATCH v2 00/17] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
` (17 preceding siblings ...)
2024-08-28 20:46 ` [PATCH v2 00/17] Multifd device state transfer support with VFIO consumer Fabiano Rosas
@ 2024-10-11 13:58 ` Cédric Le Goater
2024-10-15 21:12 ` Maciej S. Szmigiero
18 siblings, 1 reply; 128+ messages in thread
From: Cédric Le Goater @ 2024-10-11 13:58 UTC (permalink / raw)
To: Maciej S. Szmigiero, Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Eric Blake, Markus Armbruster,
Daniel P . Berrangé, Avihai Horon, Joao Martins, qemu-devel
Hello Maciej,
On 8/27/24 19:54, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>
> This is an updated v2 patch series of the v1 series located here:
> https://lore.kernel.org/qemu-devel/cover.1718717584.git.maciej.szmigiero@oracle.com/
>
> Changes from v1:
> * Extended the QEMU thread-pool with non-AIO (generic) pool support,
> implemented automatic memory management support for its work element
> function argument.
>
> * Introduced a multifd device state save thread pool, ported the VFIO
> multifd device state save implementation to use this thread pool instead
> of VFIO internally managed individual threads.
>
> * Re-implemented on top of Fabiano's v4 multifd sender refactor patch set from
> https://lore.kernel.org/qemu-devel/20240823173911.6712-1-farosas@suse.de/
>
> * Moved device state related multifd code to new multifd-device-state.c
> file where it made sense.
>
> * Implemented a max in-flight VFIO device state buffer count limit to
> allow capping the maximum recipient memory usage.
>
> * Removed unnecessary explicit memory barriers from multifd_send().
>
> * A few small changes like updated comments, code formatting,
> fixed zero-copy RAM multifd bytes transferred counter under-counting, etc.
>
>
> For convenience, this patch set is also available as a git tree:
> https://github.com/maciejsszmigiero/qemu/tree/multifd-device-state-transfer-vfio
>
> Based-on: <20240823173911.6712-1-farosas@suse.de>
I must admit, I’m a bit lost in all the discussions. Could you please
resend a v3 on top of the master branch, incorporating the points
discussed and agreed upon? Many thanks to Peter for leading the
discussion, his expertise in this area is invaluable.
Please include a summary of the proposed design (and alternatives) in
the cover letter. Diagrams of the communication flows between src/dest
threads would be a plus to better understand the proposal. This level
of detail should go under docs/devel/migration in the end. So, it might
be good to invest some time in that.
Performance figures would be good to have in the cover. The ones from
your presentation at KVM forum 2024 should be fine unless you have
changed the design since.
From there, we can test and stress the changes to evaluate their benefits
for mlx5 VF and vGPU migration. Once we have the results,
we can determine how to upstream the changes, either all at once
or by splitting the series.
Quoting Peter,
"I am sorry to ask for this, Fabiano already blames me for this,
but.. logically it'll be best we use no new thread in the series,
then one patch on top with your new thread solution to justify its
performance benefits and worthwhile to having those threads at all."
I fully support this step-by-step approach. VFIO migration is a recent
feature. It can be stressed in a complex environment and is not fully
optimized for certain workloads. However, I would prefer to introduce
changes progressively to ensure stability is maintained. It is now
acceptable to introduce experimental knobs to explore alternative
solutions.
Also, quoting again Peter,
"PS: I'd suggest if you really need those threads it should still be
managed by migration framework like the src thread pool. "
yes, I would prefer to see the VFIO subsystem rely on common QEMU APIs
and, in this case, on a QEMU multifd API too.
Thanks,
C.
* Re: [PATCH v2 00/17] Multifd device state transfer support with VFIO consumer
2024-10-11 13:58 ` Cédric Le Goater
@ 2024-10-15 21:12 ` Maciej S. Szmigiero
0 siblings, 0 replies; 128+ messages in thread
From: Maciej S. Szmigiero @ 2024-10-15 21:12 UTC (permalink / raw)
To: Cédric Le Goater
Cc: Alex Williamson, Peter Xu, Eric Blake, Markus Armbruster,
Daniel P . Berrangé, Avihai Horon, Joao Martins, qemu-devel,
Fabiano Rosas
Hi Cédric,
On 11.10.2024 15:58, Cédric Le Goater wrote:
> Hello Maciej,
>
> On 8/27/24 19:54, Maciej S. Szmigiero wrote:
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> This is an updated v2 patch series of the v1 series located here:
>> https://lore.kernel.org/qemu-devel/cover.1718717584.git.maciej.szmigiero@oracle.com/
>>
>> Changes from v1:
>> * Extended the QEMU thread-pool with non-AIO (generic) pool support,
>> implemented automatic memory management support for its work element
>> function argument.
>>
>> * Introduced a multifd device state save thread pool, ported the VFIO
>> multifd device state save implementation to use this thread pool instead
>> of VFIO internally managed individual threads.
>>
>> * Re-implemented on top of Fabiano's v4 multifd sender refactor patch set from
>> https://lore.kernel.org/qemu-devel/20240823173911.6712-1-farosas@suse.de/
>>
>> * Moved device state related multifd code to new multifd-device-state.c
>> file where it made sense.
>>
>> * Implemented a max in-flight VFIO device state buffer count limit to
>> allow capping the maximum recipient memory usage.
>>
>> * Removed unnecessary explicit memory barriers from multifd_send().
>>
>> * A few small changes like updated comments, code formatting,
>> fixed zero-copy RAM multifd bytes transferred counter under-counting, etc.
>>
>>
>> For convenience, this patch set is also available as a git tree:
>> https://github.com/maciejsszmigiero/qemu/tree/multifd-device-state-transfer-vfio
>>
>> Based-on: <20240823173911.6712-1-farosas@suse.de>
>
>
> I must admit, I’m a bit lost in all the discussions. Could you please
> resend a v3 on top of the master branch, incorporating the points
> discussed and agreed upon? Many thanks to Peter for leading the
> discussion, his expertise in this area is invaluable.
>
> Please include a summary of the proposed design (and alternatives) in
> the cover letter. Diagrams of the communication flows between src/dest
> threads would be a plus to better understand the proposal. This level
> of detail should go under docs/devel/migration in the end. So, it might
> be good to invest some time in that.
>
> Performance figures would be good to have in the cover. The ones from
> your presentation at KVM forum 2024 should be fine unless you have
> changed the design since.
>
> From there, we can test and stress the changes to evaluate their benefits
> for mlx5 VF and vGPU migration. Once we have the results,
> we can determine how to upstream the changes, either all at once
> or by splitting the series.
>
>
> Quoting Peter,
>
> "I am sorry to ask for this, Fabiano already blames me for this,
> but.. logically it'll be best we use no new thread in the series,
> then one patch on top with your new thread solution to justify its
> performance benefits and worthwhile to having those threads at all."
>
> I fully support this step-by-step approach. VFIO migration is a recent
> feature. It can be stressed in a complex environment and is not fully
> optimized for certain workloads. However, I would prefer to introduce
> changes progressively to ensure stability is maintained. It is now
> acceptable to introduce experimental knobs to explore alternative
> solutions.
>
> Also, quoting again Peter,
>
> "PS: I'd suggest if you really need those threads it should still be
> managed by migration framework like the src thread pool. "
>
> yes, I would prefer to see the VFIO subsystem rely on common QEMU APIs
> and, in this case, on a QEMU multifd API too.
I am preparing a v3 right now with the review comments (hopefully)
addressed.
The updated patch set will be based on the current QEMU git master
since I see that Fabiano's patches have already been merged there.
> Thanks,
>
> C.
Thanks,
Maciej
Thread overview: 128+ messages
2024-08-27 17:54 [PATCH v2 00/17] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
2024-08-27 17:54 ` [PATCH v2 01/17] vfio/migration: Add save_{iterate, complete_precopy}_started trace events Maciej S. Szmigiero
2024-09-05 13:08 ` [PATCH v2 01/17] vfio/migration: Add save_{iterate,complete_precopy}_started " Avihai Horon
2024-09-09 18:04 ` Maciej S. Szmigiero
2024-09-11 14:50 ` Avihai Horon
2024-08-27 17:54 ` [PATCH v2 02/17] migration/ram: Add load start trace event Maciej S. Szmigiero
2024-08-28 18:44 ` Fabiano Rosas
2024-08-28 20:21 ` Maciej S. Szmigiero
2024-08-27 17:54 ` [PATCH v2 03/17] migration/multifd: Zero p->flags before starting filling a packet Maciej S. Szmigiero
2024-08-28 18:50 ` Fabiano Rosas
2024-09-09 15:41 ` Peter Xu
2024-08-27 17:54 ` [PATCH v2 04/17] thread-pool: Add a DestroyNotify parameter to thread_pool_submit{, _aio)() Maciej S. Szmigiero
2024-08-27 17:54 ` [PATCH v2 05/17] thread-pool: Implement non-AIO (generic) pool support Maciej S. Szmigiero
2024-09-02 22:07 ` Fabiano Rosas
2024-09-03 12:02 ` Maciej S. Szmigiero
2024-09-03 14:26 ` Fabiano Rosas
2024-09-03 18:14 ` Maciej S. Szmigiero
2024-09-03 13:55 ` Stefan Hajnoczi
2024-09-03 16:54 ` Maciej S. Szmigiero
2024-09-03 19:04 ` Stefan Hajnoczi
2024-09-09 16:45 ` Peter Xu
2024-09-09 18:38 ` Maciej S. Szmigiero
2024-09-09 19:12 ` Peter Xu
2024-09-09 19:16 ` Maciej S. Szmigiero
2024-09-09 19:24 ` Peter Xu
2024-08-27 17:54 ` [PATCH v2 06/17] migration: Add save_live_complete_precopy_{begin, end} handlers Maciej S. Szmigiero
2024-08-28 19:03 ` [PATCH v2 06/17] migration: Add save_live_complete_precopy_{begin,end} handlers Fabiano Rosas
2024-09-05 13:45 ` Avihai Horon
2024-09-09 17:59 ` Peter Xu
2024-09-09 18:32 ` Maciej S. Szmigiero
2024-09-09 19:08 ` Peter Xu
2024-09-09 19:32 ` Peter Xu
2024-09-19 19:48 ` Maciej S. Szmigiero
2024-09-19 19:47 ` Maciej S. Szmigiero
2024-09-19 20:54 ` Peter Xu
2024-09-20 15:22 ` Maciej S. Szmigiero
2024-09-20 16:08 ` Peter Xu
2024-09-09 18:05 ` Maciej S. Szmigiero
2024-08-27 17:54 ` [PATCH v2 07/17] migration: Add qemu_loadvm_load_state_buffer() and its handler Maciej S. Szmigiero
2024-08-30 19:05 ` Fabiano Rosas
2024-09-05 14:15 ` Avihai Horon
2024-09-09 18:05 ` Maciej S. Szmigiero
2024-08-27 17:54 ` [PATCH v2 08/17] migration: Add load_finish handler and associated functions Maciej S. Szmigiero
2024-08-30 19:28 ` Fabiano Rosas
2024-09-05 15:13 ` Avihai Horon
2024-09-09 18:05 ` Maciej S. Szmigiero
2024-09-09 20:03 ` Peter Xu
2024-09-19 19:49 ` Maciej S. Szmigiero
2024-09-19 21:11 ` Peter Xu
2024-09-20 15:23 ` Maciej S. Szmigiero
2024-09-20 16:45 ` Peter Xu
2024-09-26 22:34 ` Maciej S. Szmigiero
2024-09-27 0:53 ` Peter Xu
2024-09-30 19:25 ` Maciej S. Szmigiero
2024-09-30 21:57 ` Peter Xu
2024-10-01 20:41 ` Maciej S. Szmigiero
2024-10-01 21:30 ` Peter Xu
2024-10-02 20:11 ` Maciej S. Szmigiero
2024-10-02 21:25 ` Peter Xu
2024-10-03 20:34 ` Maciej S. Szmigiero
2024-10-03 21:17 ` Peter Xu
2024-08-27 17:54 ` [PATCH v2 09/17] migration/multifd: Device state transfer support - receive side Maciej S. Szmigiero
2024-08-30 20:22 ` Fabiano Rosas
2024-09-02 20:12 ` Maciej S. Szmigiero
2024-09-03 14:42 ` Fabiano Rosas
2024-09-03 18:41 ` Maciej S. Szmigiero
2024-09-09 19:52 ` Peter Xu
2024-09-19 19:49 ` Maciej S. Szmigiero
2024-09-05 16:47 ` Avihai Horon
2024-09-09 18:05 ` Maciej S. Szmigiero
2024-09-12 8:13 ` Avihai Horon
2024-09-12 13:52 ` Fabiano Rosas
2024-09-19 19:59 ` Maciej S. Szmigiero
2024-08-27 17:54 ` [PATCH v2 10/17] migration/multifd: Convert multifd_send()::next_channel to atomic Maciej S. Szmigiero
2024-08-30 18:13 ` Fabiano Rosas
2024-09-02 20:11 ` Maciej S. Szmigiero
2024-09-03 15:01 ` Fabiano Rosas
2024-09-03 20:04 ` Maciej S. Szmigiero
2024-09-10 14:13 ` Peter Xu
2024-08-27 17:54 ` [PATCH v2 11/17] migration/multifd: Add an explicit MultiFDSendData destructor Maciej S. Szmigiero
2024-08-30 13:12 ` Fabiano Rosas
2024-08-27 17:54 ` [PATCH v2 12/17] migration/multifd: Device state transfer support - send side Maciej S. Szmigiero
2024-08-29 0:41 ` Fabiano Rosas
2024-08-29 20:03 ` Maciej S. Szmigiero
2024-08-30 13:02 ` Fabiano Rosas
2024-09-09 19:40 ` Peter Xu
2024-09-19 19:50 ` Maciej S. Szmigiero
2024-09-10 19:48 ` Peter Xu
2024-09-12 18:43 ` Fabiano Rosas
2024-09-13 0:23 ` Peter Xu
2024-09-13 13:21 ` Fabiano Rosas
2024-09-13 14:19 ` Peter Xu
2024-09-13 15:04 ` Fabiano Rosas
2024-09-13 15:22 ` Peter Xu
2024-09-13 18:26 ` Fabiano Rosas
2024-09-17 15:39 ` Peter Xu
2024-09-17 17:07 ` Cédric Le Goater
2024-09-17 17:50 ` Peter Xu
2024-09-19 19:51 ` Maciej S. Szmigiero
2024-09-19 19:49 ` Maciej S. Szmigiero
2024-09-19 21:17 ` Peter Xu
2024-09-20 15:23 ` Maciej S. Szmigiero
2024-09-20 17:09 ` Peter Xu
2024-09-10 16:06 ` Peter Xu
2024-09-19 19:49 ` Maciej S. Szmigiero
2024-09-19 21:18 ` Peter Xu
2024-08-27 17:54 ` [PATCH v2 13/17] migration/multifd: Add migration_has_device_state_support() Maciej S. Szmigiero
2024-08-30 18:55 ` Fabiano Rosas
2024-09-02 20:11 ` Maciej S. Szmigiero
2024-09-03 15:09 ` Fabiano Rosas
2024-08-27 17:54 ` [PATCH v2 14/17] migration: Add save_live_complete_precopy_thread handler Maciej S. Szmigiero
2024-08-27 17:54 ` [PATCH v2 15/17] vfio/migration: Multifd device state transfer support - receive side Maciej S. Szmigiero
2024-09-09 8:55 ` Avihai Horon
2024-09-09 18:06 ` Maciej S. Szmigiero
2024-09-12 8:20 ` Avihai Horon
2024-09-12 8:45 ` Cédric Le Goater
2024-08-27 17:54 ` [PATCH v2 16/17] vfio/migration: Add x-migration-multifd-transfer VFIO property Maciej S. Szmigiero
2024-08-27 17:54 ` [PATCH v2 17/17] vfio/migration: Multifd device state transfer support - send side Maciej S. Szmigiero
2024-09-09 11:41 ` Avihai Horon
2024-09-09 18:07 ` Maciej S. Szmigiero
2024-09-12 8:26 ` Avihai Horon
2024-09-12 8:57 ` Cédric Le Goater
2024-08-28 20:46 ` [PATCH v2 00/17] Multifd device state transfer support with VFIO consumer Fabiano Rosas
2024-08-28 21:58 ` Maciej S. Szmigiero
2024-08-29 0:51 ` Fabiano Rosas
2024-08-29 20:02 ` Maciej S. Szmigiero
2024-10-11 13:58 ` Cédric Le Goater
2024-10-15 21:12 ` Maciej S. Szmigiero