qemu-devel.nongnu.org archive mirror
* [PATCH v5 00/36] Multifd 🔀 device state transfer support with VFIO consumer
@ 2025-02-19 20:33 Maciej S. Szmigiero
  2025-02-19 20:33 ` [PATCH v5 01/36] migration: Clarify that {load, save}_cleanup handlers can run without setup Maciej S. Szmigiero
                   ` (35 more replies)
  0 siblings, 36 replies; 120+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-19 20:33 UTC (permalink / raw)
  To: Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

This is an updated v5 of the patch series; the previous v4 series is located here:
https://lore.kernel.org/qemu-devel/cover.1738171076.git.maciej.szmigiero@oracle.com/

Changes from v4:
* Use "unsigned long" for the VFIO bytes transferred counter to fix
test issues on 32-bit platforms.

* Instead of taking the BQL around qemu_loadvm_state_main() in
postcopy_ram_listen_thread(), just add a TODO comment there.

* Drop the patches for gracefully handling an improperly terminated TLS
session and rely on Fabiano's recent changes to handle this case instead.

* Adapt how the MULTIFD_FLAG_DEVICE_STATE value is defined for consistency
with neighboring flags.

* Return an Error and set it via migrate_set_error() for save threads too,
much like was previously done for load threads.

* Export SaveLiveCompletePrecopyThreadData and make save threads
take it directly instead of passing the individual parameters stored there
to the thread entry point.

* Group all multifd device state save variables in a
multifd_send_device_state variable allocated on demand instead of
using globals in multifd-device-state.c.

* Export the save threads' abort flag via a
multifd_device_state_save_thread_should_exit() getter function rather
than passing it directly.

* Separate the VFIO multifd code into migration-multifd.{c,h} files.
This required moving the VFIO migration channel flags to the vfio-common.h header file.

* Move x-migration-load-config-after-iter feature to a separate patch near
the end of the series.

* Move the x-migration-max-queued-buffers feature to yet another separate
patch near the end of the series.

* Introduce a common save/load vfio_multifd_transfer_setup() and a
vfio_multifd_transfer_enabled() getter function for the multifd transfer switch.

* Move introduction of VFIOMigration->multifd_transfer and
VFIODevice->migration_multifd_transfer into the very patch that introduces
the x-migration-multifd-transfer property.

* Introduce vfio_multifd_cleanup() for clearing migration->multifd.

* Split making x-migration-multifd-transfer mutable at runtime into
a separate patch.

* Rename vfio_switchover_start() to vfio_multifd_switchover_start() and
add a new vfio_switchover_start() in migration.c that calls
vfio_multifd_switchover_start().

* Introduce VFIO_DEVICE_STATE_PACKET_VER_CURRENT.

* Don't print UINT32_MAX value in vfio_load_state_buffer().

* Introduce a new routine for parsing VFIO_MIG_FLAG_DEV_CONFIG_LOAD_READY.

* Add an Error parameter to vfio_save_complete_precopy_thread_config_state()
and propagate it from the vfio_save_device_config_state() function that it calls.

* Update the VFIO developer doc in docs/devel/migration/vfio.rst.

* Add comments describing how and from where the VFIO multifd threads are
launched, and how they are terminated.

* Other small changes, like renamed functions, added review tags, code
formatting, and a rebase on top of the latest QEMU git master.

========================================================================

This patch set is targeting QEMU 10.0.

========================================================================

Maciej S. Szmigiero (35):
  migration: Clarify that {load,save}_cleanup handlers can run without
    setup
  thread-pool: Remove thread_pool_submit() function
  thread-pool: Rename AIO pool functions to *_aio() and data types to
    *Aio
  thread-pool: Implement generic (non-AIO) pool support
  migration: Add MIG_CMD_SWITCHOVER_START and its load handler
  migration: Add qemu_loadvm_load_state_buffer() and its handler
  migration: postcopy_ram_listen_thread() should take BQL for some calls
  error: define g_autoptr() cleanup function for the Error type
  migration: Add thread pool of optional load threads
  migration/multifd: Split packet into header and RAM data
  migration/multifd: Device state transfer support - receive side
  migration/multifd: Make multifd_send() thread safe
  migration/multifd: Add an explicit MultiFDSendData destructor
  migration/multifd: Device state transfer support - send side
  migration/multifd: Add multifd_device_state_supported()
  migration: Add save_live_complete_precopy_thread handler
  vfio/migration: Add load_device_config_state_start trace event
  vfio/migration: Convert bytes_transferred counter to atomic
  vfio/migration: Add vfio_add_bytes_transferred()
  vfio/migration: Move migration channel flags to vfio-common.h header
    file
  vfio/migration: Multifd device state transfer support - basic types
  vfio/migration: Multifd device state transfer support -
    VFIOStateBuffer(s)
  vfio/migration: Multifd device state transfer - add support checking
    function
  vfio/migration: Multifd device state transfer support - receive
    init/cleanup
  vfio/migration: Multifd device state transfer support - received
    buffers queuing
  vfio/migration: Multifd device state transfer support - load thread
  vfio/migration: Multifd device state transfer support - config loading
    support
  migration/qemu-file: Define g_autoptr() cleanup function for QEMUFile
  vfio/migration: Multifd device state transfer support - send side
  vfio/migration: Add x-migration-multifd-transfer VFIO property
  vfio/migration: Make x-migration-multifd-transfer VFIO property
    mutable
  hw/core/machine: Add compat for x-migration-multifd-transfer VFIO
    property
  vfio/migration: Max in-flight VFIO device state buffer count limit
  vfio/migration: Add x-migration-load-config-after-iter VFIO property
  vfio/migration: Update VFIO migration documentation

Peter Xu (1):
  migration/multifd: Make MultiFDSendData a struct

 docs/devel/migration/vfio.rst      |  80 ++-
 hw/core/machine.c                  |   2 +
 hw/vfio/meson.build                |   1 +
 hw/vfio/migration-multifd.c        | 757 +++++++++++++++++++++++++++++
 hw/vfio/migration-multifd.h        |  38 ++
 hw/vfio/migration.c                | 119 +++--
 hw/vfio/pci.c                      |  14 +
 hw/vfio/trace-events               |  11 +-
 include/block/aio.h                |   8 +-
 include/block/thread-pool.h        |  62 ++-
 include/hw/vfio/vfio-common.h      |  36 ++
 include/migration/client-options.h |   4 +
 include/migration/misc.h           |  25 +
 include/migration/register.h       |  52 +-
 include/qapi/error.h               |   2 +
 include/qemu/typedefs.h            |   5 +
 migration/colo.c                   |   3 +
 migration/meson.build              |   1 +
 migration/migration-hmp-cmds.c     |   2 +
 migration/migration.c              |   4 +-
 migration/migration.h              |   7 +
 migration/multifd-device-state.c   | 202 ++++++++
 migration/multifd-nocomp.c         |  30 +-
 migration/multifd.c                | 246 ++++++++--
 migration/multifd.h                |  74 ++-
 migration/options.c                |   9 +
 migration/qemu-file.h              |   2 +
 migration/savevm.c                 | 190 +++++++-
 migration/savevm.h                 |   6 +-
 migration/trace-events             |   1 +
 scripts/analyze-migration.py       |  11 +
 tests/unit/test-thread-pool.c      |   6 +-
 util/async.c                       |   6 +-
 util/thread-pool.c                 | 184 +++++--
 util/trace-events                  |   6 +-
 35 files changed, 2039 insertions(+), 167 deletions(-)
 create mode 100644 hw/vfio/migration-multifd.c
 create mode 100644 hw/vfio/migration-multifd.h
 create mode 100644 migration/multifd-device-state.c



^ permalink raw reply	[flat|nested] 120+ messages in thread

* [PATCH v5 01/36] migration: Clarify that {load, save}_cleanup handlers can run without setup
  2025-02-19 20:33 [PATCH v5 00/36] Multifd 🔀 device state transfer support with VFIO consumer Maciej S. Szmigiero
@ 2025-02-19 20:33 ` Maciej S. Szmigiero
  2025-02-19 20:33 ` [PATCH v5 02/36] thread-pool: Remove thread_pool_submit() function Maciej S. Szmigiero
                   ` (34 subsequent siblings)
  35 siblings, 0 replies; 120+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-19 20:33 UTC (permalink / raw)
  To: Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

It's possible for {load,save}_cleanup SaveVMHandlers to get called without
the corresponding {load,save}_setup handler being called first.

One such example is when the {load,save}_setup handler of a preceding device
returns an error.
In this case the migration core cleanup code will call all corresponding
cleanup handlers, even for those devices whose setup handler was never
called.

Since this behavior can generate some surprises, let's clearly document it
in the description of these SaveVMHandlers.

Reviewed-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
 include/migration/register.h | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/include/migration/register.h b/include/migration/register.h
index f60e797894e5..0b0292738320 100644
--- a/include/migration/register.h
+++ b/include/migration/register.h
@@ -69,7 +69,9 @@ typedef struct SaveVMHandlers {
     /**
      * @save_cleanup
      *
-     * Uninitializes the data structures on the source
+     * Uninitializes the data structures on the source.
+     * Note that this handler can be called even if save_setup
+     * wasn't called earlier.
      *
      * @opaque: data pointer passed to register_savevm_live()
      */
@@ -244,6 +246,8 @@ typedef struct SaveVMHandlers {
      * @load_cleanup
      *
      * Uninitializes the data structures on the destination.
+     * Note that this handler can be called even if load_setup
+     * wasn't called earlier.
      *
      * @opaque: data pointer passed to register_savevm_live()
      *


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v5 02/36] thread-pool: Remove thread_pool_submit() function
  2025-02-19 20:33 [PATCH v5 00/36] Multifd 🔀 device state transfer support with VFIO consumer Maciej S. Szmigiero
  2025-02-19 20:33 ` [PATCH v5 01/36] migration: Clarify that {load, save}_cleanup handlers can run without setup Maciej S. Szmigiero
@ 2025-02-19 20:33 ` Maciej S. Szmigiero
  2025-02-19 20:33 ` [PATCH v5 03/36] thread-pool: Rename AIO pool functions to *_aio() and data types to *Aio Maciej S. Szmigiero
                   ` (33 subsequent siblings)
  35 siblings, 0 replies; 120+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-19 20:33 UTC (permalink / raw)
  To: Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

This function name conflicts with one used by a future generic thread pool
function, and it was only used by one test anyway.

Update the trace event name in thread_pool_submit_aio() accordingly.

Acked-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
 include/block/thread-pool.h   | 3 +--
 tests/unit/test-thread-pool.c | 6 +++---
 util/thread-pool.c            | 7 +------
 util/trace-events             | 2 +-
 4 files changed, 6 insertions(+), 12 deletions(-)

diff --git a/include/block/thread-pool.h b/include/block/thread-pool.h
index 948ff5f30c31..4f6694026123 100644
--- a/include/block/thread-pool.h
+++ b/include/block/thread-pool.h
@@ -30,13 +30,12 @@ ThreadPool *thread_pool_new(struct AioContext *ctx);
 void thread_pool_free(ThreadPool *pool);
 
 /*
- * thread_pool_submit* API: submit I/O requests in the thread's
+ * thread_pool_submit_{aio,co} API: submit I/O requests in the thread's
  * current AioContext.
  */
 BlockAIOCB *thread_pool_submit_aio(ThreadPoolFunc *func, void *arg,
                                    BlockCompletionFunc *cb, void *opaque);
 int coroutine_fn thread_pool_submit_co(ThreadPoolFunc *func, void *arg);
-void thread_pool_submit(ThreadPoolFunc *func, void *arg);
 
 void thread_pool_update_params(ThreadPool *pool, struct AioContext *ctx);
 
diff --git a/tests/unit/test-thread-pool.c b/tests/unit/test-thread-pool.c
index 1483e53473db..33407b595d35 100644
--- a/tests/unit/test-thread-pool.c
+++ b/tests/unit/test-thread-pool.c
@@ -43,10 +43,10 @@ static void done_cb(void *opaque, int ret)
     active--;
 }
 
-static void test_submit(void)
+static void test_submit_no_complete(void)
 {
     WorkerTestData data = { .n = 0 };
-    thread_pool_submit(worker_cb, &data);
+    thread_pool_submit_aio(worker_cb, &data, NULL, NULL);
     while (data.n == 0) {
         aio_poll(ctx, true);
     }
@@ -236,7 +236,7 @@ int main(int argc, char **argv)
     ctx = qemu_get_current_aio_context();
 
     g_test_init(&argc, &argv, NULL);
-    g_test_add_func("/thread-pool/submit", test_submit);
+    g_test_add_func("/thread-pool/submit-no-complete", test_submit_no_complete);
     g_test_add_func("/thread-pool/submit-aio", test_submit_aio);
     g_test_add_func("/thread-pool/submit-co", test_submit_co);
     g_test_add_func("/thread-pool/submit-many", test_submit_many);
diff --git a/util/thread-pool.c b/util/thread-pool.c
index 27eb777e855b..2f751d55b33f 100644
--- a/util/thread-pool.c
+++ b/util/thread-pool.c
@@ -256,7 +256,7 @@ BlockAIOCB *thread_pool_submit_aio(ThreadPoolFunc *func, void *arg,
 
     QLIST_INSERT_HEAD(&pool->head, req, all);
 
-    trace_thread_pool_submit(pool, req, arg);
+    trace_thread_pool_submit_aio(pool, req, arg);
 
     qemu_mutex_lock(&pool->lock);
     if (pool->idle_threads == 0 && pool->cur_threads < pool->max_threads) {
@@ -290,11 +290,6 @@ int coroutine_fn thread_pool_submit_co(ThreadPoolFunc *func, void *arg)
     return tpc.ret;
 }
 
-void thread_pool_submit(ThreadPoolFunc *func, void *arg)
-{
-    thread_pool_submit_aio(func, arg, NULL, NULL);
-}
-
 void thread_pool_update_params(ThreadPool *pool, AioContext *ctx)
 {
     qemu_mutex_lock(&pool->lock);
diff --git a/util/trace-events b/util/trace-events
index 49a4962e1886..5be12d7fab89 100644
--- a/util/trace-events
+++ b/util/trace-events
@@ -14,7 +14,7 @@ aio_co_schedule_bh_cb(void *ctx, void *co) "ctx %p co %p"
 reentrant_aio(void *ctx, const char *name) "ctx %p name %s"
 
 # thread-pool.c
-thread_pool_submit(void *pool, void *req, void *opaque) "pool %p req %p opaque %p"
+thread_pool_submit_aio(void *pool, void *req, void *opaque) "pool %p req %p opaque %p"
 thread_pool_complete(void *pool, void *req, void *opaque, int ret) "pool %p req %p opaque %p ret %d"
 thread_pool_cancel(void *req, void *opaque) "req %p opaque %p"
 


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v5 03/36] thread-pool: Rename AIO pool functions to *_aio() and data types to *Aio
  2025-02-19 20:33 [PATCH v5 00/36] Multifd 🔀 device state transfer support with VFIO consumer Maciej S. Szmigiero
  2025-02-19 20:33 ` [PATCH v5 01/36] migration: Clarify that {load, save}_cleanup handlers can run without setup Maciej S. Szmigiero
  2025-02-19 20:33 ` [PATCH v5 02/36] thread-pool: Remove thread_pool_submit() function Maciej S. Szmigiero
@ 2025-02-19 20:33 ` Maciej S. Szmigiero
  2025-02-19 20:33 ` [PATCH v5 04/36] thread-pool: Implement generic (non-AIO) pool support Maciej S. Szmigiero
                   ` (32 subsequent siblings)
  35 siblings, 0 replies; 120+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-19 20:33 UTC (permalink / raw)
  To: Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

These names conflict with ones used by future generic thread pool
equivalents.
Generic names should belong to the generic pool type, not the specific
(AIO) one.

Acked-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
 include/block/aio.h         |  8 ++---
 include/block/thread-pool.h |  8 ++---
 util/async.c                |  6 ++--
 util/thread-pool.c          | 58 ++++++++++++++++++-------------------
 util/trace-events           |  4 +--
 5 files changed, 42 insertions(+), 42 deletions(-)

diff --git a/include/block/aio.h b/include/block/aio.h
index 43883a8a33a8..b2ab3514de23 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -54,7 +54,7 @@ typedef void QEMUBHFunc(void *opaque);
 typedef bool AioPollFn(void *opaque);
 typedef void IOHandler(void *opaque);
 
-struct ThreadPool;
+struct ThreadPoolAio;
 struct LinuxAioState;
 typedef struct LuringState LuringState;
 
@@ -207,7 +207,7 @@ struct AioContext {
     /* Thread pool for performing work and receiving completion callbacks.
      * Has its own locking.
      */
-    struct ThreadPool *thread_pool;
+    struct ThreadPoolAio *thread_pool;
 
 #ifdef CONFIG_LINUX_AIO
     struct LinuxAioState *linux_aio;
@@ -500,8 +500,8 @@ void aio_set_event_notifier_poll(AioContext *ctx,
  */
 GSource *aio_get_g_source(AioContext *ctx);
 
-/* Return the ThreadPool bound to this AioContext */
-struct ThreadPool *aio_get_thread_pool(AioContext *ctx);
+/* Return the ThreadPoolAio bound to this AioContext */
+struct ThreadPoolAio *aio_get_thread_pool(AioContext *ctx);
 
 /* Setup the LinuxAioState bound to this AioContext */
 struct LinuxAioState *aio_setup_linux_aio(AioContext *ctx, Error **errp);
diff --git a/include/block/thread-pool.h b/include/block/thread-pool.h
index 4f6694026123..6f27eb085b45 100644
--- a/include/block/thread-pool.h
+++ b/include/block/thread-pool.h
@@ -24,10 +24,10 @@
 
 typedef int ThreadPoolFunc(void *opaque);
 
-typedef struct ThreadPool ThreadPool;
+typedef struct ThreadPoolAio ThreadPoolAio;
 
-ThreadPool *thread_pool_new(struct AioContext *ctx);
-void thread_pool_free(ThreadPool *pool);
+ThreadPoolAio *thread_pool_new_aio(struct AioContext *ctx);
+void thread_pool_free_aio(ThreadPoolAio *pool);
 
 /*
  * thread_pool_submit_{aio,co} API: submit I/O requests in the thread's
@@ -36,7 +36,7 @@ void thread_pool_free(ThreadPool *pool);
 BlockAIOCB *thread_pool_submit_aio(ThreadPoolFunc *func, void *arg,
                                    BlockCompletionFunc *cb, void *opaque);
 int coroutine_fn thread_pool_submit_co(ThreadPoolFunc *func, void *arg);
+void thread_pool_update_params(ThreadPoolAio *pool, struct AioContext *ctx);
 
-void thread_pool_update_params(ThreadPool *pool, struct AioContext *ctx);
 
 #endif
diff --git a/util/async.c b/util/async.c
index 0fe29436090d..47e3d35a263f 100644
--- a/util/async.c
+++ b/util/async.c
@@ -369,7 +369,7 @@ aio_ctx_finalize(GSource     *source)
     QEMUBH *bh;
     unsigned flags;
 
-    thread_pool_free(ctx->thread_pool);
+    thread_pool_free_aio(ctx->thread_pool);
 
 #ifdef CONFIG_LINUX_AIO
     if (ctx->linux_aio) {
@@ -435,10 +435,10 @@ GSource *aio_get_g_source(AioContext *ctx)
     return &ctx->source;
 }
 
-ThreadPool *aio_get_thread_pool(AioContext *ctx)
+ThreadPoolAio *aio_get_thread_pool(AioContext *ctx)
 {
     if (!ctx->thread_pool) {
-        ctx->thread_pool = thread_pool_new(ctx);
+        ctx->thread_pool = thread_pool_new_aio(ctx);
     }
     return ctx->thread_pool;
 }
diff --git a/util/thread-pool.c b/util/thread-pool.c
index 2f751d55b33f..908194dc070f 100644
--- a/util/thread-pool.c
+++ b/util/thread-pool.c
@@ -23,9 +23,9 @@
 #include "block/thread-pool.h"
 #include "qemu/main-loop.h"
 
-static void do_spawn_thread(ThreadPool *pool);
+static void do_spawn_thread(ThreadPoolAio *pool);
 
-typedef struct ThreadPoolElement ThreadPoolElement;
+typedef struct ThreadPoolElementAio ThreadPoolElementAio;
 
 enum ThreadState {
     THREAD_QUEUED,
@@ -33,9 +33,9 @@ enum ThreadState {
     THREAD_DONE,
 };
 
-struct ThreadPoolElement {
+struct ThreadPoolElementAio {
     BlockAIOCB common;
-    ThreadPool *pool;
+    ThreadPoolAio *pool;
     ThreadPoolFunc *func;
     void *arg;
 
@@ -47,13 +47,13 @@ struct ThreadPoolElement {
     int ret;
 
     /* Access to this list is protected by lock.  */
-    QTAILQ_ENTRY(ThreadPoolElement) reqs;
+    QTAILQ_ENTRY(ThreadPoolElementAio) reqs;
 
     /* This list is only written by the thread pool's mother thread.  */
-    QLIST_ENTRY(ThreadPoolElement) all;
+    QLIST_ENTRY(ThreadPoolElementAio) all;
 };
 
-struct ThreadPool {
+struct ThreadPoolAio {
     AioContext *ctx;
     QEMUBH *completion_bh;
     QemuMutex lock;
@@ -62,10 +62,10 @@ struct ThreadPool {
     QEMUBH *new_thread_bh;
 
     /* The following variables are only accessed from one AioContext. */
-    QLIST_HEAD(, ThreadPoolElement) head;
+    QLIST_HEAD(, ThreadPoolElementAio) head;
 
     /* The following variables are protected by lock.  */
-    QTAILQ_HEAD(, ThreadPoolElement) request_list;
+    QTAILQ_HEAD(, ThreadPoolElementAio) request_list;
     int cur_threads;
     int idle_threads;
     int new_threads;     /* backlog of threads we need to create */
@@ -76,14 +76,14 @@ struct ThreadPool {
 
 static void *worker_thread(void *opaque)
 {
-    ThreadPool *pool = opaque;
+    ThreadPoolAio *pool = opaque;
 
     qemu_mutex_lock(&pool->lock);
     pool->pending_threads--;
     do_spawn_thread(pool);
 
     while (pool->cur_threads <= pool->max_threads) {
-        ThreadPoolElement *req;
+        ThreadPoolElementAio *req;
         int ret;
 
         if (QTAILQ_EMPTY(&pool->request_list)) {
@@ -131,7 +131,7 @@ static void *worker_thread(void *opaque)
     return NULL;
 }
 
-static void do_spawn_thread(ThreadPool *pool)
+static void do_spawn_thread(ThreadPoolAio *pool)
 {
     QemuThread t;
 
@@ -148,14 +148,14 @@ static void do_spawn_thread(ThreadPool *pool)
 
 static void spawn_thread_bh_fn(void *opaque)
 {
-    ThreadPool *pool = opaque;
+    ThreadPoolAio *pool = opaque;
 
     qemu_mutex_lock(&pool->lock);
     do_spawn_thread(pool);
     qemu_mutex_unlock(&pool->lock);
 }
 
-static void spawn_thread(ThreadPool *pool)
+static void spawn_thread(ThreadPoolAio *pool)
 {
     pool->cur_threads++;
     pool->new_threads++;
@@ -173,8 +173,8 @@ static void spawn_thread(ThreadPool *pool)
 
 static void thread_pool_completion_bh(void *opaque)
 {
-    ThreadPool *pool = opaque;
-    ThreadPoolElement *elem, *next;
+    ThreadPoolAio *pool = opaque;
+    ThreadPoolElementAio *elem, *next;
 
     defer_call_begin(); /* cb() may use defer_call() to coalesce work */
 
@@ -184,8 +184,8 @@ restart:
             continue;
         }
 
-        trace_thread_pool_complete(pool, elem, elem->common.opaque,
-                                   elem->ret);
+        trace_thread_pool_complete_aio(pool, elem, elem->common.opaque,
+                                       elem->ret);
         QLIST_REMOVE(elem, all);
 
         if (elem->common.cb) {
@@ -217,10 +217,10 @@ restart:
 
 static void thread_pool_cancel(BlockAIOCB *acb)
 {
-    ThreadPoolElement *elem = (ThreadPoolElement *)acb;
-    ThreadPool *pool = elem->pool;
+    ThreadPoolElementAio *elem = (ThreadPoolElementAio *)acb;
+    ThreadPoolAio *pool = elem->pool;
 
-    trace_thread_pool_cancel(elem, elem->common.opaque);
+    trace_thread_pool_cancel_aio(elem, elem->common.opaque);
 
     QEMU_LOCK_GUARD(&pool->lock);
     if (elem->state == THREAD_QUEUED) {
@@ -234,16 +234,16 @@ static void thread_pool_cancel(BlockAIOCB *acb)
 }
 
 static const AIOCBInfo thread_pool_aiocb_info = {
-    .aiocb_size         = sizeof(ThreadPoolElement),
+    .aiocb_size         = sizeof(ThreadPoolElementAio),
     .cancel_async       = thread_pool_cancel,
 };
 
 BlockAIOCB *thread_pool_submit_aio(ThreadPoolFunc *func, void *arg,
                                    BlockCompletionFunc *cb, void *opaque)
 {
-    ThreadPoolElement *req;
+    ThreadPoolElementAio *req;
     AioContext *ctx = qemu_get_current_aio_context();
-    ThreadPool *pool = aio_get_thread_pool(ctx);
+    ThreadPoolAio *pool = aio_get_thread_pool(ctx);
 
     /* Assert that the thread submitting work is the same running the pool */
     assert(pool->ctx == qemu_get_current_aio_context());
@@ -290,7 +290,7 @@ int coroutine_fn thread_pool_submit_co(ThreadPoolFunc *func, void *arg)
     return tpc.ret;
 }
 
-void thread_pool_update_params(ThreadPool *pool, AioContext *ctx)
+void thread_pool_update_params(ThreadPoolAio *pool, AioContext *ctx)
 {
     qemu_mutex_lock(&pool->lock);
 
@@ -317,7 +317,7 @@ void thread_pool_update_params(ThreadPool *pool, AioContext *ctx)
     qemu_mutex_unlock(&pool->lock);
 }
 
-static void thread_pool_init_one(ThreadPool *pool, AioContext *ctx)
+static void thread_pool_init_one(ThreadPoolAio *pool, AioContext *ctx)
 {
     if (!ctx) {
         ctx = qemu_get_aio_context();
@@ -337,14 +337,14 @@ static void thread_pool_init_one(ThreadPool *pool, AioContext *ctx)
     thread_pool_update_params(pool, ctx);
 }
 
-ThreadPool *thread_pool_new(AioContext *ctx)
+ThreadPoolAio *thread_pool_new_aio(AioContext *ctx)
 {
-    ThreadPool *pool = g_new(ThreadPool, 1);
+    ThreadPoolAio *pool = g_new(ThreadPoolAio, 1);
     thread_pool_init_one(pool, ctx);
     return pool;
 }
 
-void thread_pool_free(ThreadPool *pool)
+void thread_pool_free_aio(ThreadPoolAio *pool)
 {
     if (!pool) {
         return;
diff --git a/util/trace-events b/util/trace-events
index 5be12d7fab89..bd8f25fb5920 100644
--- a/util/trace-events
+++ b/util/trace-events
@@ -15,8 +15,8 @@ reentrant_aio(void *ctx, const char *name) "ctx %p name %s"
 
 # thread-pool.c
 thread_pool_submit_aio(void *pool, void *req, void *opaque) "pool %p req %p opaque %p"
-thread_pool_complete(void *pool, void *req, void *opaque, int ret) "pool %p req %p opaque %p ret %d"
-thread_pool_cancel(void *req, void *opaque) "req %p opaque %p"
+thread_pool_complete_aio(void *pool, void *req, void *opaque, int ret) "pool %p req %p opaque %p ret %d"
+thread_pool_cancel_aio(void *req, void *opaque) "req %p opaque %p"
 
 # buffer.c
 buffer_resize(const char *buf, size_t olen, size_t len) "%s: old %zd, new %zd"


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v5 04/36] thread-pool: Implement generic (non-AIO) pool support
  2025-02-19 20:33 [PATCH v5 00/36] Multifd 🔀 device state transfer support with VFIO consumer Maciej S. Szmigiero
                   ` (2 preceding siblings ...)
  2025-02-19 20:33 ` [PATCH v5 03/36] thread-pool: Rename AIO pool functions to *_aio() and data types to *Aio Maciej S. Szmigiero
@ 2025-02-19 20:33 ` Maciej S. Szmigiero
  2025-02-19 20:33 ` [PATCH v5 05/36] migration: Add MIG_CMD_SWITCHOVER_START and its load handler Maciej S. Szmigiero
                   ` (31 subsequent siblings)
  35 siblings, 0 replies; 120+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-19 20:33 UTC (permalink / raw)
  To: Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

Migration code wants to manage device data sending threads in one place.

QEMU has an existing thread pool implementation; however, it is limited
to queuing AIO operations only and essentially has a 1:1 mapping between
the current AioContext and the AIO ThreadPool in use.

Implement a generic (non-AIO) ThreadPool by essentially wrapping GLib's
GThreadPool.

This brings a few new operations on a pool:
* thread_pool_wait() operation waits until all the submitted work requests
have finished.

* thread_pool_set_max_threads() explicitly sets the maximum thread count
in the pool.

* thread_pool_adjust_max_threads_to_work() adjusts the maximum thread count
in the pool to equal the number of work items still queued or unfinished.

Reviewed-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
 include/block/thread-pool.h |  51 ++++++++++++++++
 util/thread-pool.c          | 119 ++++++++++++++++++++++++++++++++++++
 2 files changed, 170 insertions(+)

diff --git a/include/block/thread-pool.h b/include/block/thread-pool.h
index 6f27eb085b45..dd48cf07e85f 100644
--- a/include/block/thread-pool.h
+++ b/include/block/thread-pool.h
@@ -38,5 +38,56 @@ BlockAIOCB *thread_pool_submit_aio(ThreadPoolFunc *func, void *arg,
 int coroutine_fn thread_pool_submit_co(ThreadPoolFunc *func, void *arg);
 void thread_pool_update_params(ThreadPoolAio *pool, struct AioContext *ctx);
 
+/* ------------------------------------------- */
+/* Generic thread pool types and methods below */
+typedef struct ThreadPool ThreadPool;
+
+/* Create a new thread pool. Never returns NULL. */
+ThreadPool *thread_pool_new(void);
+
+/*
+ * Free the thread pool.
+ * Waits for all the previously submitted work to complete before performing
+ * the actual freeing operation.
+ */
+void thread_pool_free(ThreadPool *pool);
+
+/*
+ * Submit a new work (task) for the pool.
+ *
+ * @opaque_destroy is an optional GDestroyNotify for the @opaque argument
+ * to the work function at @func.
+ */
+void thread_pool_submit(ThreadPool *pool, ThreadPoolFunc *func,
+                        void *opaque, GDestroyNotify opaque_destroy);
+
+/*
+ * Submit a new work (task) for the pool, making sure it starts getting
+ * processed immediately, launching a new thread for it if necessary.
+ *
+ * @opaque_destroy is an optional GDestroyNotify for the @opaque argument
+ * to the work function at @func.
+ */
+void thread_pool_submit_immediate(ThreadPool *pool, ThreadPoolFunc *func,
+                                  void *opaque, GDestroyNotify opaque_destroy);
+
+/*
+ * Wait for all previously submitted work to complete before returning.
+ *
+ * Can be used as a barrier between two sets of tasks executed on a thread
+ * pool without destroying it or in a performance sensitive path where the
+ * caller just wants to wait for all tasks to complete while deferring the
+ * pool free operation for later, less performance sensitive time.
+ */
+void thread_pool_wait(ThreadPool *pool);
+
+/* Set the maximum number of threads in the pool. */
+bool thread_pool_set_max_threads(ThreadPool *pool, int max_threads);
+
+/*
+ * Adjust the maximum number of threads in the pool to give each task its
+ * own thread (exactly one thread per task).
+ */
+bool thread_pool_adjust_max_threads_to_work(ThreadPool *pool);
 
 #endif
diff --git a/util/thread-pool.c b/util/thread-pool.c
index 908194dc070f..d2ead6b72857 100644
--- a/util/thread-pool.c
+++ b/util/thread-pool.c
@@ -374,3 +374,122 @@ void thread_pool_free_aio(ThreadPoolAio *pool)
     qemu_mutex_destroy(&pool->lock);
     g_free(pool);
 }
+
+struct ThreadPool {
+    GThreadPool *t;
+    size_t cur_work;
+    QemuMutex cur_work_lock;
+    QemuCond all_finished_cond;
+};
+
+typedef struct {
+    ThreadPoolFunc *func;
+    void *opaque;
+    GDestroyNotify opaque_destroy;
+} ThreadPoolElement;
+
+static void thread_pool_func(gpointer data, gpointer user_data)
+{
+    ThreadPool *pool = user_data;
+    g_autofree ThreadPoolElement *el = data;
+
+    el->func(el->opaque);
+
+    if (el->opaque_destroy) {
+        el->opaque_destroy(el->opaque);
+    }
+
+    QEMU_LOCK_GUARD(&pool->cur_work_lock);
+
+    assert(pool->cur_work > 0);
+    pool->cur_work--;
+
+    if (pool->cur_work == 0) {
+        qemu_cond_signal(&pool->all_finished_cond);
+    }
+}
+
+ThreadPool *thread_pool_new(void)
+{
+    ThreadPool *pool = g_new(ThreadPool, 1);
+
+    pool->cur_work = 0;
+    qemu_mutex_init(&pool->cur_work_lock);
+    qemu_cond_init(&pool->all_finished_cond);
+
+    pool->t = g_thread_pool_new(thread_pool_func, pool, 0, TRUE, NULL);
+    /*
+     * g_thread_pool_new() can only return errors if initial thread(s)
+     * creation fails but we ask for 0 initial threads above.
+     */
+    assert(pool->t);
+
+    return pool;
+}
+
+void thread_pool_free(ThreadPool *pool)
+{
+    /*
+     * With _wait = TRUE this effectively waits for all
+     * previously submitted work to complete first.
+     */
+    g_thread_pool_free(pool->t, FALSE, TRUE);
+
+    qemu_cond_destroy(&pool->all_finished_cond);
+    qemu_mutex_destroy(&pool->cur_work_lock);
+
+    g_free(pool);
+}
+
+void thread_pool_submit(ThreadPool *pool, ThreadPoolFunc *func,
+                        void *opaque, GDestroyNotify opaque_destroy)
+{
+    ThreadPoolElement *el = g_new(ThreadPoolElement, 1);
+
+    el->func = func;
+    el->opaque = opaque;
+    el->opaque_destroy = opaque_destroy;
+
+    WITH_QEMU_LOCK_GUARD(&pool->cur_work_lock) {
+        pool->cur_work++;
+    }
+
+    /*
+     * Ignore the return value since this function can only return errors
+     * if creation of an additional thread fails but even in this case the
+     * provided work is still getting queued (just for the existing threads).
+     */
+    g_thread_pool_push(pool->t, el, NULL);
+}
+
+void thread_pool_submit_immediate(ThreadPool *pool, ThreadPoolFunc *func,
+                                  void *opaque, GDestroyNotify opaque_destroy)
+{
+    thread_pool_submit(pool, func, opaque, opaque_destroy);
+    thread_pool_adjust_max_threads_to_work(pool);
+}
+
+void thread_pool_wait(ThreadPool *pool)
+{
+    QEMU_LOCK_GUARD(&pool->cur_work_lock);
+
+    while (pool->cur_work > 0) {
+        qemu_cond_wait(&pool->all_finished_cond,
+                       &pool->cur_work_lock);
+    }
+}
+
+bool thread_pool_set_max_threads(ThreadPool *pool,
+                                 int max_threads)
+{
+    assert(max_threads > 0);
+
+    return g_thread_pool_set_max_threads(pool->t, max_threads, NULL);
+}
+
+bool thread_pool_adjust_max_threads_to_work(ThreadPool *pool)
+{
+    QEMU_LOCK_GUARD(&pool->cur_work_lock);
+
+    return thread_pool_set_max_threads(pool, pool->cur_work);
+}


^ permalink raw reply related	[flat|nested] 120+ messages in thread
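The core of the new generic pool above is the "outstanding work counter plus condition variable" pattern that thread_pool_wait() relies on. The following is a standalone sketch of that pattern using POSIX threads rather than QEMU's primitives; the `DemoPool`/`DemoTask` names and structure are hypothetical stand-ins, not the actual QEMU API:

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for the patch's ThreadPool bookkeeping. */
typedef struct {
    size_t cur_work;
    pthread_mutex_t lock;
    pthread_cond_t all_finished;
} DemoPool;

typedef struct {
    DemoPool *pool;
    int *result; /* where the task writes its output */
} DemoTask;

static void *demo_task_fn(void *arg)
{
    DemoTask *t = arg;

    *t->result = 42; /* the actual work happens before the bookkeeping */

    pthread_mutex_lock(&t->pool->lock);
    assert(t->pool->cur_work > 0);
    if (--t->pool->cur_work == 0) {
        /* last task out wakes up the waiter, like thread_pool_func() */
        pthread_cond_signal(&t->pool->all_finished);
    }
    pthread_mutex_unlock(&t->pool->lock);

    free(t);
    return NULL;
}

static void demo_pool_init(DemoPool *pool)
{
    pool->cur_work = 0;
    pthread_mutex_init(&pool->lock, NULL);
    pthread_cond_init(&pool->all_finished, NULL);
}

static void demo_pool_submit(DemoPool *pool, int *result)
{
    DemoTask *t = malloc(sizeof(*t));
    pthread_t tid;

    t->pool = pool;
    t->result = result;

    /* count the work *before* it can possibly finish */
    pthread_mutex_lock(&pool->lock);
    pool->cur_work++;
    pthread_mutex_unlock(&pool->lock);

    pthread_create(&tid, NULL, demo_task_fn, t);
    pthread_detach(tid);
}

/* The barrier: returns only once every submitted task has completed. */
static void demo_pool_wait(DemoPool *pool)
{
    pthread_mutex_lock(&pool->lock);
    while (pool->cur_work > 0) {
        pthread_cond_wait(&pool->all_finished, &pool->lock);
    }
    pthread_mutex_unlock(&pool->lock);
}
```

Because the counter is incremented before the thread is created and decremented only after the work is done, the wait cannot return early even if tasks finish before the waiter first checks.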

* [PATCH v5 05/36] migration: Add MIG_CMD_SWITCHOVER_START and its load handler
  2025-02-19 20:33 [PATCH v5 00/36] Multifd 🔀 device state transfer support with VFIO consumer Maciej S. Szmigiero
                   ` (3 preceding siblings ...)
  2025-02-19 20:33 ` [PATCH v5 04/36] thread-pool: Implement generic (non-AIO) pool support Maciej S. Szmigiero
@ 2025-02-19 20:33 ` Maciej S. Szmigiero
  2025-02-19 20:33 ` [PATCH v5 06/36] migration: Add qemu_loadvm_load_state_buffer() and its handler Maciej S. Szmigiero
                   ` (30 subsequent siblings)
  35 siblings, 0 replies; 120+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-19 20:33 UTC (permalink / raw)
  To: Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

This QEMU_VM_COMMAND sub-command and its switchover_start SaveVMHandler are
used to mark the switchover point in the main migration stream.

It can be used to inform the destination that all pre-switchover main
migration stream data has been sent/received so it can start to process
post-switchover data that it might have received via other migration
channels like the multifd ones.

Also add the relevant MigrationState bit stream compatibility property and
its hw_compat entry.

Reviewed-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Zhang Chen <zhangckid@gmail.com> # for the COLO part
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
 hw/core/machine.c                  |  1 +
 include/migration/client-options.h |  4 +++
 include/migration/register.h       | 12 +++++++++
 migration/colo.c                   |  3 +++
 migration/migration-hmp-cmds.c     |  2 ++
 migration/migration.c              |  2 ++
 migration/migration.h              |  2 ++
 migration/options.c                |  9 +++++++
 migration/savevm.c                 | 39 ++++++++++++++++++++++++++++++
 migration/savevm.h                 |  1 +
 migration/trace-events             |  1 +
 scripts/analyze-migration.py       | 11 +++++++++
 12 files changed, 87 insertions(+)

diff --git a/hw/core/machine.c b/hw/core/machine.c
index 02cff735b3fb..21c3bde92f08 100644
--- a/hw/core/machine.c
+++ b/hw/core/machine.c
@@ -43,6 +43,7 @@ GlobalProperty hw_compat_9_2[] = {
     { "virtio-balloon-pci-non-transitional", "vectors", "0" },
     { "virtio-mem-pci", "vectors", "0" },
     { "migration", "multifd-clean-tls-termination", "false" },
+    { "migration", "send-switchover-start", "off"},
 };
 const size_t hw_compat_9_2_len = G_N_ELEMENTS(hw_compat_9_2);
 
diff --git a/include/migration/client-options.h b/include/migration/client-options.h
index 59f4b55cf4f7..289c9d776221 100644
--- a/include/migration/client-options.h
+++ b/include/migration/client-options.h
@@ -10,6 +10,10 @@
 #ifndef QEMU_MIGRATION_CLIENT_OPTIONS_H
 #define QEMU_MIGRATION_CLIENT_OPTIONS_H
 
+
+/* properties */
+bool migrate_send_switchover_start(void);
+
 /* capabilities */
 
 bool migrate_background_snapshot(void);
diff --git a/include/migration/register.h b/include/migration/register.h
index 0b0292738320..ff0faf5f68c8 100644
--- a/include/migration/register.h
+++ b/include/migration/register.h
@@ -279,6 +279,18 @@ typedef struct SaveVMHandlers {
      * otherwise
      */
     bool (*switchover_ack_needed)(void *opaque);
+
+    /**
+     * @switchover_start
+     *
+     * Notifies that the switchover has started. Called only on
+     * the destination.
+     *
+     * @opaque: data pointer passed to register_savevm_live()
+     *
+     * Returns zero to indicate success and negative for error
+     */
+    int (*switchover_start)(void *opaque);
 } SaveVMHandlers;
 
 /**
diff --git a/migration/colo.c b/migration/colo.c
index 9a8e5fbe9b94..c976b3ff344d 100644
--- a/migration/colo.c
+++ b/migration/colo.c
@@ -452,6 +452,9 @@ static int colo_do_checkpoint_transaction(MigrationState *s,
         bql_unlock();
         goto out;
     }
+
+    qemu_savevm_maybe_send_switchover_start(s->to_dst_file);
+
     /* Note: device state is saved into buffer */
     ret = qemu_save_device_state(fb);
 
diff --git a/migration/migration-hmp-cmds.c b/migration/migration-hmp-cmds.c
index 3347e34c4891..49c26daed359 100644
--- a/migration/migration-hmp-cmds.c
+++ b/migration/migration-hmp-cmds.c
@@ -46,6 +46,8 @@ static void migration_global_dump(Monitor *mon)
                    ms->send_configuration ? "on" : "off");
     monitor_printf(mon, "send-section-footer: %s\n",
                    ms->send_section_footer ? "on" : "off");
+    monitor_printf(mon, "send-switchover-start: %s\n",
+                   ms->send_switchover_start ? "on" : "off");
     monitor_printf(mon, "clear-bitmap-shift: %u\n",
                    ms->clear_bitmap_shift);
 }
diff --git a/migration/migration.c b/migration/migration.c
index c597aa707e57..9e9db26667f1 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2891,6 +2891,8 @@ static bool migration_switchover_start(MigrationState *s, Error **errp)
 
     precopy_notify_complete();
 
+    qemu_savevm_maybe_send_switchover_start(s->to_dst_file);
+
     return true;
 }
 
diff --git a/migration/migration.h b/migration/migration.h
index 4639e2a7e42f..7b4278e2a32b 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -400,6 +400,8 @@ struct MigrationState {
     bool send_configuration;
     /* Whether we send section footer during migration */
     bool send_section_footer;
+    /* Whether we send switchover start notification during migration */
+    bool send_switchover_start;
 
     /* Needed by postcopy-pause state */
     QemuSemaphore postcopy_pause_sem;
diff --git a/migration/options.c b/migration/options.c
index bb259d192a93..b0ac2ea4083f 100644
--- a/migration/options.c
+++ b/migration/options.c
@@ -93,6 +93,8 @@ const Property migration_properties[] = {
                      send_configuration, true),
     DEFINE_PROP_BOOL("send-section-footer", MigrationState,
                      send_section_footer, true),
+    DEFINE_PROP_BOOL("send-switchover-start", MigrationState,
+                     send_switchover_start, true),
     DEFINE_PROP_BOOL("multifd-flush-after-each-section", MigrationState,
                       multifd_flush_after_each_section, false),
     DEFINE_PROP_UINT8("x-clear-bitmap-shift", MigrationState,
@@ -209,6 +211,13 @@ bool migrate_auto_converge(void)
     return s->capabilities[MIGRATION_CAPABILITY_AUTO_CONVERGE];
 }
 
+bool migrate_send_switchover_start(void)
+{
+    MigrationState *s = migrate_get_current();
+
+    return s->send_switchover_start;
+}
+
 bool migrate_background_snapshot(void)
 {
     MigrationState *s = migrate_get_current();
diff --git a/migration/savevm.c b/migration/savevm.c
index 4046faf0091e..faebf47ef51f 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -90,6 +90,7 @@ enum qemu_vm_cmd {
     MIG_CMD_ENABLE_COLO,       /* Enable COLO */
     MIG_CMD_POSTCOPY_RESUME,   /* resume postcopy on dest */
     MIG_CMD_RECV_BITMAP,       /* Request for recved bitmap on dst */
+    MIG_CMD_SWITCHOVER_START,  /* Switchover start notification */
     MIG_CMD_MAX
 };
 
@@ -109,6 +110,7 @@ static struct mig_cmd_args {
     [MIG_CMD_POSTCOPY_RESUME]  = { .len =  0, .name = "POSTCOPY_RESUME" },
     [MIG_CMD_PACKAGED]         = { .len =  4, .name = "PACKAGED" },
     [MIG_CMD_RECV_BITMAP]      = { .len = -1, .name = "RECV_BITMAP" },
+    [MIG_CMD_SWITCHOVER_START] = { .len =  0, .name = "SWITCHOVER_START" },
     [MIG_CMD_MAX]              = { .len = -1, .name = "MAX" },
 };
 
@@ -1201,6 +1203,19 @@ void qemu_savevm_send_recv_bitmap(QEMUFile *f, char *block_name)
     qemu_savevm_command_send(f, MIG_CMD_RECV_BITMAP, len + 1, (uint8_t *)buf);
 }
 
+static void qemu_savevm_send_switchover_start(QEMUFile *f)
+{
+    trace_savevm_send_switchover_start();
+    qemu_savevm_command_send(f, MIG_CMD_SWITCHOVER_START, 0, NULL);
+}
+
+void qemu_savevm_maybe_send_switchover_start(QEMUFile *f)
+{
+    if (migrate_send_switchover_start()) {
+        qemu_savevm_send_switchover_start(f);
+    }
+}
+
 bool qemu_savevm_state_blocked(Error **errp)
 {
     SaveStateEntry *se;
@@ -1687,6 +1702,7 @@ static int qemu_savevm_state(QEMUFile *f, Error **errp)
 
     ret = qemu_file_get_error(f);
     if (ret == 0) {
+        qemu_savevm_maybe_send_switchover_start(f);
         qemu_savevm_state_complete_precopy(f, false);
         ret = qemu_file_get_error(f);
     }
@@ -2383,6 +2399,26 @@ static int loadvm_process_enable_colo(MigrationIncomingState *mis)
     return ret;
 }
 
+static int loadvm_postcopy_handle_switchover_start(void)
+{
+    SaveStateEntry *se;
+
+    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+        int ret;
+
+        if (!se->ops || !se->ops->switchover_start) {
+            continue;
+        }
+
+        ret = se->ops->switchover_start(se->opaque);
+        if (ret < 0) {
+            return ret;
+        }
+    }
+
+    return 0;
+}
+
 /*
  * Process an incoming 'QEMU_VM_COMMAND'
  * 0           just a normal return
@@ -2481,6 +2517,9 @@ static int loadvm_process_command(QEMUFile *f)
 
     case MIG_CMD_ENABLE_COLO:
         return loadvm_process_enable_colo(mis);
+
+    case MIG_CMD_SWITCHOVER_START:
+        return loadvm_postcopy_handle_switchover_start();
     }
 
     return 0;
diff --git a/migration/savevm.h b/migration/savevm.h
index 7957460062ca..58f871a7ed9c 100644
--- a/migration/savevm.h
+++ b/migration/savevm.h
@@ -53,6 +53,7 @@ void qemu_savevm_send_postcopy_listen(QEMUFile *f);
 void qemu_savevm_send_postcopy_run(QEMUFile *f);
 void qemu_savevm_send_postcopy_resume(QEMUFile *f);
 void qemu_savevm_send_recv_bitmap(QEMUFile *f, char *block_name);
+void qemu_savevm_maybe_send_switchover_start(QEMUFile *f);
 
 void qemu_savevm_send_postcopy_ram_discard(QEMUFile *f, const char *name,
                                            uint16_t len,
diff --git a/migration/trace-events b/migration/trace-events
index 58c0f07f5b2d..c506e11a2e1d 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -39,6 +39,7 @@ savevm_send_postcopy_run(void) ""
 savevm_send_postcopy_resume(void) ""
 savevm_send_colo_enable(void) ""
 savevm_send_recv_bitmap(char *name) "%s"
+savevm_send_switchover_start(void) ""
 savevm_state_setup(void) ""
 savevm_state_resume_prepare(void) ""
 savevm_state_header(void) ""
diff --git a/scripts/analyze-migration.py b/scripts/analyze-migration.py
index 8e1fbf4c9d9f..67631ac43e9f 100755
--- a/scripts/analyze-migration.py
+++ b/scripts/analyze-migration.py
@@ -620,7 +620,9 @@ class MigrationDump(object):
     QEMU_VM_SUBSECTION    = 0x05
     QEMU_VM_VMDESCRIPTION = 0x06
     QEMU_VM_CONFIGURATION = 0x07
+    QEMU_VM_COMMAND       = 0x08
     QEMU_VM_SECTION_FOOTER= 0x7e
+    QEMU_MIG_CMD_SWITCHOVER_START = 0x0b
 
     def __init__(self, filename):
         self.section_classes = {
@@ -685,6 +687,15 @@ def read(self, desc_only = False, dump_memory = False,
             elif section_type == self.QEMU_VM_SECTION_PART or section_type == self.QEMU_VM_SECTION_END:
                 section_id = file.read32()
                 self.sections[section_id].read()
+            elif section_type == self.QEMU_VM_COMMAND:
+                command_type = file.read16()
+                command_data_len = file.read16()
+                if command_type != self.QEMU_MIG_CMD_SWITCHOVER_START:
+                    raise Exception("Unknown QEMU_VM_COMMAND: %x" %
+                                    (command_type))
+                if command_data_len != 0:
+                    raise Exception("Invalid SWITCHOVER_START length: %x" %
+                                    (command_data_len))
             elif section_type == self.QEMU_VM_SECTION_FOOTER:
                 read_section_id = file.read32()
                 if read_section_id != section_id:


^ permalink raw reply related	[flat|nested] 120+ messages in thread
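The destination-side handling added in this patch is a simple "call every registered optional hook, stop on the first failure" loop over savevm handlers. A standalone sketch of that shape, with a hypothetical `DemoEntry` type standing in for SaveStateEntry:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mock of a SaveStateEntry with an optional hook. */
typedef struct {
    int (*switchover_start)(void *opaque);
    void *opaque;
} DemoEntry;

/*
 * Mirrors loadvm_postcopy_handle_switchover_start(): skip entries
 * without the hook, propagate the first negative return value.
 */
static int demo_handle_switchover_start(DemoEntry *entries, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int ret;

        if (!entries[i].switchover_start) {
            continue; /* the handler is optional */
        }

        ret = entries[i].switchover_start(entries[i].opaque);
        if (ret < 0) {
            return ret;
        }
    }

    return 0;
}

static int demo_ok(void *opaque)
{
    (*(int *)opaque)++; /* count successful notifications */
    return 0;
}

static int demo_fail(void *opaque)
{
    (void)opaque;
    return -1;
}
```

Note that entries after a failing one are not notified, matching the early return in the patch.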

* [PATCH v5 06/36] migration: Add qemu_loadvm_load_state_buffer() and its handler
  2025-02-19 20:33 [PATCH v5 00/36] Multifd 🔀 device state transfer support with VFIO consumer Maciej S. Szmigiero
                   ` (4 preceding siblings ...)
  2025-02-19 20:33 ` [PATCH v5 05/36] migration: Add MIG_CMD_SWITCHOVER_START and its load handler Maciej S. Szmigiero
@ 2025-02-19 20:33 ` Maciej S. Szmigiero
  2025-02-19 20:33 ` [PATCH v5 07/36] migration: postcopy_ram_listen_thread() should take BQL for some calls Maciej S. Szmigiero
                   ` (29 subsequent siblings)
  35 siblings, 0 replies; 120+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-19 20:33 UTC (permalink / raw)
  To: Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

qemu_loadvm_load_state_buffer() and its load_state_buffer
SaveVMHandler allow providing a device state buffer to an explicitly
specified device via its idstr and instance id.

Reviewed-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
 include/migration/register.h | 15 +++++++++++++++
 migration/savevm.c           | 23 +++++++++++++++++++++++
 migration/savevm.h           |  3 +++
 3 files changed, 41 insertions(+)

diff --git a/include/migration/register.h b/include/migration/register.h
index ff0faf5f68c8..58891aa54b76 100644
--- a/include/migration/register.h
+++ b/include/migration/register.h
@@ -229,6 +229,21 @@ typedef struct SaveVMHandlers {
      */
     int (*load_state)(QEMUFile *f, void *opaque, int version_id);
 
+    /**
+     * @load_state_buffer (invoked outside the BQL)
+     *
+     * Load device state buffer provided to qemu_loadvm_load_state_buffer().
+     *
+     * @opaque: data pointer passed to register_savevm_live()
+     * @buf: the data buffer to load
+     * @len: the data length in buffer
+     * @errp: pointer to Error*, to store an error if it happens.
+     *
+     * Returns true to indicate success and false for errors.
+     */
+    bool (*load_state_buffer)(void *opaque, char *buf, size_t len,
+                              Error **errp);
+
     /**
      * @load_setup
      *
diff --git a/migration/savevm.c b/migration/savevm.c
index faebf47ef51f..7c1aa8ad7b9d 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -3060,6 +3060,29 @@ int qemu_loadvm_approve_switchover(void)
     return migrate_send_rp_switchover_ack(mis);
 }
 
+bool qemu_loadvm_load_state_buffer(const char *idstr, uint32_t instance_id,
+                                   char *buf, size_t len, Error **errp)
+{
+    SaveStateEntry *se;
+
+    se = find_se(idstr, instance_id);
+    if (!se) {
+        error_setg(errp,
+                   "Unknown idstr %s or instance id %u for load state buffer",
+                   idstr, instance_id);
+        return false;
+    }
+
+    if (!se->ops || !se->ops->load_state_buffer) {
+        error_setg(errp,
+                   "idstr %s / instance %u has no load state buffer operation",
+                   idstr, instance_id);
+        return false;
+    }
+
+    return se->ops->load_state_buffer(se->opaque, buf, len, errp);
+}
+
 bool save_snapshot(const char *name, bool overwrite, const char *vmstate,
                   bool has_devices, strList *devices, Error **errp)
 {
diff --git a/migration/savevm.h b/migration/savevm.h
index 58f871a7ed9c..cb58434a9437 100644
--- a/migration/savevm.h
+++ b/migration/savevm.h
@@ -71,4 +71,7 @@ int qemu_loadvm_approve_switchover(void);
 int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
         bool in_postcopy);
 
+bool qemu_loadvm_load_state_buffer(const char *idstr, uint32_t instance_id,
+                                   char *buf, size_t len, Error **errp);
+
 #endif


^ permalink raw reply related	[flat|nested] 120+ messages in thread
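The lookup-then-dispatch logic of qemu_loadvm_load_state_buffer() can be sketched standalone: find the entry by idstr and instance id, verify it implements the handler, then delegate. The `DemoEntry` type and `demo_*` names below are hypothetical stand-ins, not QEMU's actual structures:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical mock of a SaveStateEntry keyed by idstr/instance_id. */
typedef struct {
    const char *idstr;
    unsigned instance_id;
    bool (*load_state_buffer)(void *opaque, char *buf, size_t len);
    void *opaque;
} DemoEntry;

/*
 * Mirrors qemu_loadvm_load_state_buffer(): two distinct failure modes,
 * "no such device" and "device has no such handler", before dispatch.
 */
static bool demo_load_state_buffer(DemoEntry *entries, size_t n,
                                   const char *idstr, unsigned instance_id,
                                   char *buf, size_t len)
{
    for (size_t i = 0; i < n; i++) {
        DemoEntry *se = &entries[i];

        if (strcmp(se->idstr, idstr) != 0 ||
            se->instance_id != instance_id) {
            continue;
        }
        if (!se->load_state_buffer) {
            return false; /* device exists but lacks the handler */
        }
        return se->load_state_buffer(se->opaque, buf, len);
    }

    return false; /* unknown idstr / instance id */
}

static bool demo_sink(void *opaque, char *buf, size_t len)
{
    (void)buf;
    *(size_t *)opaque = len; /* record how much data we were handed */
    return true;
}
```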

* [PATCH v5 07/36] migration: postcopy_ram_listen_thread() should take BQL for some calls
  2025-02-19 20:33 [PATCH v5 00/36] Multifd 🔀 device state transfer support with VFIO consumer Maciej S. Szmigiero
                   ` (5 preceding siblings ...)
  2025-02-19 20:33 ` [PATCH v5 06/36] migration: Add qemu_loadvm_load_state_buffer() and its handler Maciej S. Szmigiero
@ 2025-02-19 20:33 ` Maciej S. Szmigiero
  2025-02-25 17:16   ` Peter Xu
  2025-02-19 20:33 ` [PATCH v5 08/36] error: define g_autoptr() cleanup function for the Error type Maciej S. Szmigiero
                   ` (28 subsequent siblings)
  35 siblings, 1 reply; 120+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-19 20:33 UTC (permalink / raw)
  To: Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

All callers to migration_incoming_state_destroy() other than
postcopy_ram_listen_thread() do this call with BQL held.

Since migration_incoming_state_destroy() ultimately calls the "load_cleanup"
SaveVMHandlers, and these will soon call BQL-sensitive code, it makes sense
to always call that function with the BQL held rather than have it deal
with both cases (BQL held and BQL not held).
Add the necessary bql_lock() and bql_unlock() calls to
postcopy_ram_listen_thread().

qemu_loadvm_state_main() in postcopy_ram_listen_thread() could call
"load_state" SaveVMHandlers that expect the BQL to be held.

In principle, the only devices that should be arriving on the migration
channel serviced by postcopy_ram_listen_thread() are those that are
postcopiable and whose load handlers are safe to call without the BQL
held.

But nothing currently prevents the source from sending data for "unsafe"
devices, which would cause trouble there.
Add a TODO comment so it's clear that handling of such an (erroneous) case
should be improved in the future.

Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
 migration/savevm.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/migration/savevm.c b/migration/savevm.c
index 7c1aa8ad7b9d..3e86b572cfa8 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -1986,6 +1986,8 @@ static void *postcopy_ram_listen_thread(void *opaque)
      * in qemu_file, and thus we must be blocking now.
      */
     qemu_file_set_blocking(f, true);
+
+    /* TODO: sanity check that only postcopiable data will be loaded here */
     load_res = qemu_loadvm_state_main(f, mis);
 
     /*
@@ -2046,7 +2048,9 @@ static void *postcopy_ram_listen_thread(void *opaque)
      * (If something broke then qemu will have to exit anyway since it's
      * got a bad migration state).
      */
+    bql_lock();
     migration_incoming_state_destroy();
+    bql_unlock();
 
     rcu_unregister_thread();
     mis->have_listen_thread = false;


^ permalink raw reply related	[flat|nested] 120+ messages in thread
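The locking convention this patch establishes, "the cleanup function always runs with the BQL held, so it never has to handle both cases", can be modeled standalone. Everything below (`demo_bql`, the held flag, the thread body) is a hypothetical sketch using a POSIX mutex, not QEMU's actual BQL implementation:

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the BQL and the cleanup function. */
static pthread_mutex_t demo_bql = PTHREAD_MUTEX_INITIALIZER;
static bool demo_bql_held;
static int demo_cleanups_run;

static void demo_bql_lock(void)
{
    pthread_mutex_lock(&demo_bql);
    demo_bql_held = true;
}

static void demo_bql_unlock(void)
{
    demo_bql_held = false;
    pthread_mutex_unlock(&demo_bql);
}

/* The cleanup path assumes one locking state instead of handling both. */
static void demo_state_destroy(void)
{
    assert(demo_bql_held);
    demo_cleanups_run++;
}

/*
 * What the listen thread does after the patch: take the lock just
 * around the cleanup call, then release it.
 */
static void *demo_listen_thread(void *opaque)
{
    (void)opaque;
    demo_bql_lock();
    demo_state_destroy();
    demo_bql_unlock();
    return NULL;
}
```

The benefit is that the assertion inside the cleanup function documents and enforces the invariant for every caller, rather than each caller silently differing.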

* [PATCH v5 08/36] error: define g_autoptr() cleanup function for the Error type
  2025-02-19 20:33 [PATCH v5 00/36] Multifd 🔀 device state transfer support with VFIO consumer Maciej S. Szmigiero
                   ` (6 preceding siblings ...)
  2025-02-19 20:33 ` [PATCH v5 07/36] migration: postcopy_ram_listen_thread() should take BQL for some calls Maciej S. Szmigiero
@ 2025-02-19 20:33 ` Maciej S. Szmigiero
  2025-02-19 20:33 ` [PATCH v5 09/36] migration: Add thread pool of optional load threads Maciej S. Szmigiero
                   ` (27 subsequent siblings)
  35 siblings, 0 replies; 120+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-19 20:33 UTC (permalink / raw)
  To: Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

Automatic memory management helps avoid memory safety issues.

Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
 include/qapi/error.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/qapi/error.h b/include/qapi/error.h
index f5fe2162623e..41e381638049 100644
--- a/include/qapi/error.h
+++ b/include/qapi/error.h
@@ -437,6 +437,8 @@ Error *error_copy(const Error *err);
  */
 void error_free(Error *err);
 
+G_DEFINE_AUTOPTR_CLEANUP_FUNC(Error, error_free)
+
 /*
  * Convenience function to assert that *@errp is set, then silently free it.
  */


^ permalink raw reply related	[flat|nested] 120+ messages in thread
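G_DEFINE_AUTOPTR_CLEANUP_FUNC() is ultimately built on the compiler's cleanup attribute (GCC/Clang): when the annotated pointer goes out of scope, a free function is invoked on its address, on every return path. A standalone sketch of the same mechanism with a hypothetical `DemoError` type (this is an illustration of the underlying attribute, not GLib's actual macro expansion):

```c
#include <stdlib.h>

/* Hypothetical error type standing in for Error. */
typedef struct {
    char *msg;
} DemoError;

static int demo_errors_freed;

static void demo_error_free(DemoError *err)
{
    if (err) {
        free(err->msg);
        free(err);
        demo_errors_freed++;
    }
}

/*
 * The cleanup attribute hands the function a pointer to the variable,
 * hence the extra level of indirection, as in g_autoptr's machinery.
 */
static void demo_error_free_indirect(DemoError **errp)
{
    demo_error_free(*errp);
}

#define demo_autoptr __attribute__((cleanup(demo_error_free_indirect)))

static void demo_failing_operation(void)
{
    demo_autoptr DemoError *local_err = malloc(sizeof(DemoError));

    local_err->msg = NULL;
    /* local_err is freed automatically when it goes out of scope. */
}
```

This is exactly why the patch's one-line G_DEFINE_AUTOPTR_CLEANUP_FUNC(Error, error_free) addition suffices to enable g_autoptr(Error) everywhere.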

* [PATCH v5 09/36] migration: Add thread pool of optional load threads
  2025-02-19 20:33 [PATCH v5 00/36] Multifd 🔀 device state transfer support with VFIO consumer Maciej S. Szmigiero
                   ` (7 preceding siblings ...)
  2025-02-19 20:33 ` [PATCH v5 08/36] error: define g_autoptr() cleanup function for the Error type Maciej S. Szmigiero
@ 2025-02-19 20:33 ` Maciej S. Szmigiero
  2025-02-19 20:33 ` [PATCH v5 10/36] migration/multifd: Split packet into header and RAM data Maciej S. Szmigiero
                   ` (26 subsequent siblings)
  35 siblings, 0 replies; 120+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-19 20:33 UTC (permalink / raw)
  To: Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

Some drivers might want to make use of auxiliary helper threads during VM
state loading, for example to make sure that their blocking (sync) I/O
operations don't block the rest of the migration process.

Add a migration core managed thread pool to facilitate this use case.

The migration core will wait for these threads to finish before
(re)starting the VM at destination.

Reviewed-by: Fabiano Rosas <farosas@suse.de>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
 include/migration/misc.h |  3 ++
 include/qemu/typedefs.h  |  2 +
 migration/migration.c    |  2 +-
 migration/migration.h    |  5 +++
 migration/savevm.c       | 89 +++++++++++++++++++++++++++++++++++++++-
 migration/savevm.h       |  2 +-
 6 files changed, 99 insertions(+), 4 deletions(-)

diff --git a/include/migration/misc.h b/include/migration/misc.h
index c660be80954a..4c171f4e897e 100644
--- a/include/migration/misc.h
+++ b/include/migration/misc.h
@@ -45,9 +45,12 @@ bool migrate_ram_is_ignored(RAMBlock *block);
 /* migration/block.c */
 
 AnnounceParameters *migrate_announce_params(void);
+
 /* migration/savevm.c */
 
 void dump_vmstate_json_to_file(FILE *out_fp);
+void qemu_loadvm_start_load_thread(MigrationLoadThread function,
+                                   void *opaque);
 
 /* migration/migration.c */
 void migration_object_init(void);
diff --git a/include/qemu/typedefs.h b/include/qemu/typedefs.h
index 3d84efcac47a..fd23ff7771b1 100644
--- a/include/qemu/typedefs.h
+++ b/include/qemu/typedefs.h
@@ -131,5 +131,7 @@ typedef struct IRQState *qemu_irq;
  * Function types
  */
 typedef void (*qemu_irq_handler)(void *opaque, int n, int level);
+typedef bool (*MigrationLoadThread)(void *opaque, bool *should_quit,
+                                    Error **errp);
 
 #endif /* QEMU_TYPEDEFS_H */
diff --git a/migration/migration.c b/migration/migration.c
index 9e9db26667f1..40e1c92a6c30 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -406,7 +406,7 @@ void migration_incoming_state_destroy(void)
      * RAM state cleanup needs to happen after multifd cleanup, because
      * multifd threads can use some of its states (receivedmap).
      */
-    qemu_loadvm_state_cleanup();
+    qemu_loadvm_state_cleanup(mis);
 
     if (mis->to_src_file) {
         /* Tell source that we are done */
diff --git a/migration/migration.h b/migration/migration.h
index 7b4278e2a32b..d53f7cad84d8 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -43,6 +43,7 @@
 #define  MIGRATION_THREAD_DST_PREEMPT       "mig/dst/preempt"
 
 struct PostcopyBlocktimeContext;
+typedef struct ThreadPool ThreadPool;
 
 #define  MIGRATION_RESUME_ACK_VALUE  (1)
 
@@ -187,6 +188,10 @@ struct MigrationIncomingState {
     Coroutine *colo_incoming_co;
     QemuSemaphore colo_incoming_sem;
 
+    /* Optional load threads pool and its thread exit request flag */
+    ThreadPool *load_threads;
+    bool load_threads_abort;
+
     /*
      * PostcopyBlocktimeContext to keep information for postcopy
      * live migration, to calculate vCPU block time
diff --git a/migration/savevm.c b/migration/savevm.c
index 3e86b572cfa8..e412d05657a1 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -54,6 +54,7 @@
 #include "qemu/job.h"
 #include "qemu/main-loop.h"
 #include "block/snapshot.h"
+#include "block/thread-pool.h"
 #include "qemu/cutils.h"
 #include "io/channel-buffer.h"
 #include "io/channel-file.h"
@@ -131,6 +132,35 @@ static struct mig_cmd_args {
  * generic extendable format with an exception for two old entities.
  */
 
+/***********************************************************/
+/* Optional load threads pool support */
+
+static void qemu_loadvm_thread_pool_create(MigrationIncomingState *mis)
+{
+    assert(!mis->load_threads);
+    mis->load_threads = thread_pool_new();
+    mis->load_threads_abort = false;
+}
+
+static void qemu_loadvm_thread_pool_destroy(MigrationIncomingState *mis)
+{
+    qatomic_set(&mis->load_threads_abort, true);
+
+    bql_unlock(); /* Load threads might be waiting for BQL */
+    g_clear_pointer(&mis->load_threads, thread_pool_free);
+    bql_lock();
+}
+
+static bool qemu_loadvm_thread_pool_wait(MigrationState *s,
+                                         MigrationIncomingState *mis)
+{
+    bql_unlock(); /* Let load threads do work requiring BQL */
+    thread_pool_wait(mis->load_threads);
+    bql_lock();
+
+    return !migrate_has_error(s);
+}
+
 /***********************************************************/
 /* savevm/loadvm support */
 
@@ -2783,16 +2813,62 @@ static int qemu_loadvm_state_setup(QEMUFile *f, Error **errp)
     return 0;
 }
 
-void qemu_loadvm_state_cleanup(void)
+struct LoadThreadData {
+    MigrationLoadThread function;
+    void *opaque;
+};
+
+static int qemu_loadvm_load_thread(void *thread_opaque)
+{
+    struct LoadThreadData *data = thread_opaque;
+    MigrationIncomingState *mis = migration_incoming_get_current();
+    g_autoptr(Error) local_err = NULL;
+
+    if (!data->function(data->opaque, &mis->load_threads_abort, &local_err)) {
+        MigrationState *s = migrate_get_current();
+
+        assert(local_err);
+
+        /*
+         * In case multiple load threads fail, which thread's error we
+         * end up reporting is arbitrary.
+         */
+        migrate_set_error(s, local_err);
+    }
+
+    return 0;
+}
+
+void qemu_loadvm_start_load_thread(MigrationLoadThread function,
+                                   void *opaque)
+{
+    MigrationIncomingState *mis = migration_incoming_get_current();
+    struct LoadThreadData *data;
+
+    /* We only set it from this thread so it's okay to read it directly */
+    assert(!mis->load_threads_abort);
+
+    data = g_new(struct LoadThreadData, 1);
+    data->function = function;
+    data->opaque = opaque;
+
+    thread_pool_submit_immediate(mis->load_threads, qemu_loadvm_load_thread,
+                                 data, g_free);
+}
+
+void qemu_loadvm_state_cleanup(MigrationIncomingState *mis)
 {
     SaveStateEntry *se;
 
     trace_loadvm_state_cleanup();
+
     QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
         if (se->ops && se->ops->load_cleanup) {
             se->ops->load_cleanup(se->opaque);
         }
     }
+
+    qemu_loadvm_thread_pool_destroy(mis);
 }
 
 /* Return true if we should continue the migration, or false. */
@@ -2943,6 +3019,7 @@ out:
 
 int qemu_loadvm_state(QEMUFile *f)
 {
+    MigrationState *s = migrate_get_current();
     MigrationIncomingState *mis = migration_incoming_get_current();
     Error *local_err = NULL;
     int ret;
@@ -2952,6 +3029,8 @@ int qemu_loadvm_state(QEMUFile *f)
         return -EINVAL;
     }
 
+    qemu_loadvm_thread_pool_create(mis);
+
     ret = qemu_loadvm_state_header(f);
     if (ret) {
         return ret;
@@ -2983,12 +3062,18 @@ int qemu_loadvm_state(QEMUFile *f)
 
     /* When reaching here, it must be precopy */
     if (ret == 0) {
-        if (migrate_has_error(migrate_get_current())) {
+        if (migrate_has_error(migrate_get_current()) ||
+            !qemu_loadvm_thread_pool_wait(s, mis)) {
             ret = -EINVAL;
         } else {
             ret = qemu_file_get_error(f);
         }
     }
+    /*
+     * Set this flag unconditionally so we'll catch further attempts to
+     * start additional threads via an appropriate assert()
+     */
+    qatomic_set(&mis->load_threads_abort, true);
 
     /*
      * Try to read in the VMDESC section as well, so that dumping tools that
diff --git a/migration/savevm.h b/migration/savevm.h
index cb58434a9437..138c39a7f9f9 100644
--- a/migration/savevm.h
+++ b/migration/savevm.h
@@ -64,7 +64,7 @@ void qemu_savevm_live_state(QEMUFile *f);
 int qemu_save_device_state(QEMUFile *f);
 
 int qemu_loadvm_state(QEMUFile *f);
-void qemu_loadvm_state_cleanup(void);
+void qemu_loadvm_state_cleanup(MigrationIncomingState *mis);
 int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis);
 int qemu_load_device_state(QEMUFile *f);
 int qemu_loadvm_approve_switchover(void);


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v5 10/36] migration/multifd: Split packet into header and RAM data
  2025-02-19 20:33 [PATCH v5 00/36] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
                   ` (8 preceding siblings ...)
  2025-02-19 20:33 ` [PATCH v5 09/36] migration: Add thread pool of optional load threads Maciej S. Szmigiero
@ 2025-02-19 20:33 ` Maciej S. Szmigiero
  2025-02-19 20:33 ` [PATCH v5 11/36] migration/multifd: Device state transfer support - receive side Maciej S. Szmigiero
                   ` (25 subsequent siblings)
  35 siblings, 0 replies; 120+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-19 20:33 UTC (permalink / raw)
  To: Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

Read the packet header first so that in the future we will be able to
differentiate between a RAM multifd packet and a device state multifd
packet.

Since these two packet types differ in size, we can't read the packet body
until we know which type it is.
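The header-first read can be sketched outside QEMU as a standalone program. The struct layout and the flag value mirror the patch; the endian helper and function names are illustrative, not QEMU's:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fixed-size header shared by both packet types, mirroring
 * MultiFDPacketHdr_t from the patch. */
typedef struct {
    uint32_t magic;
    uint32_t version;
    uint32_t flags;
} __attribute__((packed)) PacketHdr;

/* Flag value matching MULTIFD_FLAG_DEVICE_STATE from a later patch. */
#define FLAG_DEVICE_STATE (32u << 1)

/* Portable big-endian -> host conversion (illustrative helper). */
static uint32_t from_be32(uint32_t v)
{
    const union { uint32_t u; uint8_t b[4]; } probe = { .u = 1 };
    return probe.b[0] ? __builtin_bswap32(v) : v;
}

/* Once the header has been read, its flags tell us how many body bytes
 * follow: the two packet types have different sizes, so this cannot be
 * decided before the header arrives. */
static size_t body_len_for(const PacketHdr *hdr,
                           size_t ram_packet_len, size_t dev_packet_len)
{
    size_t total = (from_be32(hdr->flags) & FLAG_DEVICE_STATE)
                       ? dev_packet_len : ram_packet_len;
    return total - sizeof(*hdr);
}
```

In the actual patch this corresponds to reading `sizeof(hdr)` bytes first, then issuing a second `qio_channel_read_all_eof()` for the remaining `pkt_len` bytes.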

Reviewed-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
 migration/multifd.c | 55 ++++++++++++++++++++++++++++++++++++---------
 migration/multifd.h |  5 +++++
 2 files changed, 49 insertions(+), 11 deletions(-)

diff --git a/migration/multifd.c b/migration/multifd.c
index 215ad0414a79..3b47e63c2c4a 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -209,10 +209,10 @@ void multifd_send_fill_packet(MultiFDSendParams *p)
 
     memset(packet, 0, p->packet_len);
 
-    packet->magic = cpu_to_be32(MULTIFD_MAGIC);
-    packet->version = cpu_to_be32(MULTIFD_VERSION);
+    packet->hdr.magic = cpu_to_be32(MULTIFD_MAGIC);
+    packet->hdr.version = cpu_to_be32(MULTIFD_VERSION);
 
-    packet->flags = cpu_to_be32(p->flags);
+    packet->hdr.flags = cpu_to_be32(p->flags);
     packet->next_packet_size = cpu_to_be32(p->next_packet_size);
 
     packet_num = qatomic_fetch_inc(&multifd_send_state->packet_num);
@@ -228,12 +228,12 @@ void multifd_send_fill_packet(MultiFDSendParams *p)
                             p->flags, p->next_packet_size);
 }
 
-static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error **errp)
+static int multifd_recv_unfill_packet_header(MultiFDRecvParams *p,
+                                             const MultiFDPacketHdr_t *hdr,
+                                             Error **errp)
 {
-    const MultiFDPacket_t *packet = p->packet;
-    uint32_t magic = be32_to_cpu(packet->magic);
-    uint32_t version = be32_to_cpu(packet->version);
-    int ret = 0;
+    uint32_t magic = be32_to_cpu(hdr->magic);
+    uint32_t version = be32_to_cpu(hdr->version);
 
     if (magic != MULTIFD_MAGIC) {
         error_setg(errp, "multifd: received packet magic %x, expected %x",
@@ -247,7 +247,16 @@ static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error **errp)
         return -1;
     }
 
-    p->flags = be32_to_cpu(packet->flags);
+    p->flags = be32_to_cpu(hdr->flags);
+
+    return 0;
+}
+
+static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error **errp)
+{
+    const MultiFDPacket_t *packet = p->packet;
+    int ret = 0;
+
     p->next_packet_size = be32_to_cpu(packet->next_packet_size);
     p->packet_num = be64_to_cpu(packet->packet_num);
     p->packets_recved++;
@@ -1165,14 +1174,18 @@ static void *multifd_recv_thread(void *opaque)
     }
 
     while (true) {
+        MultiFDPacketHdr_t hdr;
         uint32_t flags = 0;
         bool has_data = false;
+        uint8_t *pkt_buf;
+        size_t pkt_len;
+
         p->normal_num = 0;
 
         if (use_packets) {
             struct iovec iov = {
-                .iov_base = (void *)p->packet,
-                .iov_len = p->packet_len
+                .iov_base = (void *)&hdr,
+                .iov_len = sizeof(hdr)
             };
 
             if (multifd_recv_should_exit()) {
@@ -1191,6 +1204,26 @@ static void *multifd_recv_thread(void *opaque)
                 break;
             }
 
+            ret = multifd_recv_unfill_packet_header(p, &hdr, &local_err);
+            if (ret) {
+                break;
+            }
+
+            pkt_buf = (uint8_t *)p->packet + sizeof(hdr);
+            pkt_len = p->packet_len - sizeof(hdr);
+
+            ret = qio_channel_read_all_eof(p->c, (char *)pkt_buf, pkt_len,
+                                           &local_err);
+            if (!ret) {
+                /* EOF */
+                error_setg(&local_err, "multifd: unexpected EOF after packet header");
+                break;
+            }
+
+            if (ret == -1) {
+                break;
+            }
+
             qemu_mutex_lock(&p->mutex);
             ret = multifd_recv_unfill_packet(p, &local_err);
             if (ret) {
diff --git a/migration/multifd.h b/migration/multifd.h
index cf408ff72140..f7156f66c0f6 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -69,6 +69,11 @@ typedef struct {
     uint32_t magic;
     uint32_t version;
     uint32_t flags;
+} __attribute__((packed)) MultiFDPacketHdr_t;
+
+typedef struct {
+    MultiFDPacketHdr_t hdr;
+
     /* maximum number of allocated pages */
     uint32_t pages_alloc;
     /* non zero pages */



* [PATCH v5 11/36] migration/multifd: Device state transfer support - receive side
  2025-02-19 20:33 [PATCH v5 00/36] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
                   ` (9 preceding siblings ...)
  2025-02-19 20:33 ` [PATCH v5 10/36] migration/multifd: Split packet into header and RAM data Maciej S. Szmigiero
@ 2025-02-19 20:33 ` Maciej S. Szmigiero
  2025-03-02 12:42   ` Avihai Horon
  2025-02-19 20:33 ` [PATCH v5 12/36] migration/multifd: Make multifd_send() thread safe Maciej S. Szmigiero
                   ` (24 subsequent siblings)
  35 siblings, 1 reply; 120+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-19 20:33 UTC (permalink / raw)
  To: Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

Add basic support for receiving device state via multifd channels -
channels that are shared with RAM transfers.

Depending on whether the MULTIFD_FLAG_DEVICE_STATE flag is present in the
packet header, either device state (MultiFDPacketDeviceState_t) or RAM
data (the existing MultiFDPacket_t) is read.

The received device state data is passed to the
qemu_loadvm_load_state_buffer() function for processing in the
device's load_state_buffer handler.
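One detail worth calling out: the on-wire idstr field is a fixed 256-byte array that is not guaranteed to be NUL-terminated (it is declared QEMU_NONSTRING), which is why the receive path duplicates it with g_strndup() before use. A plain-C sketch of that defensive copy, with an illustrative function name:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Copy at most 'cap' bytes of a possibly non-NUL-terminated buffer into a
 * freshly allocated, always-NUL-terminated string, like g_strndup(). */
static char *dup_bounded(const char *src, size_t cap)
{
    size_t n = 0;

    while (n < cap && src[n] != '\0') {
        n++;
    }

    char *out = malloc(n + 1);
    memcpy(out, src, n);
    out[n] = '\0';
    return out;
}
```

Without the bounded copy, passing the raw packet field to any function expecting a C string would read past the end of the 256-byte array whenever the sender filled it completely.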

Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
 migration/multifd.c | 99 ++++++++++++++++++++++++++++++++++++++++-----
 migration/multifd.h | 26 +++++++++++-
 2 files changed, 113 insertions(+), 12 deletions(-)

diff --git a/migration/multifd.c b/migration/multifd.c
index 3b47e63c2c4a..700a385447c7 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -21,6 +21,7 @@
 #include "file.h"
 #include "migration.h"
 #include "migration-stats.h"
+#include "savevm.h"
 #include "socket.h"
 #include "tls.h"
 #include "qemu-file.h"
@@ -252,14 +253,24 @@ static int multifd_recv_unfill_packet_header(MultiFDRecvParams *p,
     return 0;
 }
 
-static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error **errp)
+static int multifd_recv_unfill_packet_device_state(MultiFDRecvParams *p,
+                                                   Error **errp)
+{
+    MultiFDPacketDeviceState_t *packet = p->packet_dev_state;
+
+    packet->instance_id = be32_to_cpu(packet->instance_id);
+    p->next_packet_size = be32_to_cpu(packet->next_packet_size);
+
+    return 0;
+}
+
+static int multifd_recv_unfill_packet_ram(MultiFDRecvParams *p, Error **errp)
 {
     const MultiFDPacket_t *packet = p->packet;
     int ret = 0;
 
     p->next_packet_size = be32_to_cpu(packet->next_packet_size);
     p->packet_num = be64_to_cpu(packet->packet_num);
-    p->packets_recved++;
 
     /* Always unfill, old QEMUs (<9.0) send data along with SYNC */
     ret = multifd_ram_unfill_packet(p, errp);
@@ -270,6 +281,17 @@ static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error **errp)
     return ret;
 }
 
+static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error **errp)
+{
+    p->packets_recved++;
+
+    if (p->flags & MULTIFD_FLAG_DEVICE_STATE) {
+        return multifd_recv_unfill_packet_device_state(p, errp);
+    }
+
+    return multifd_recv_unfill_packet_ram(p, errp);
+}
+
 static bool multifd_send_should_exit(void)
 {
     return qatomic_read(&multifd_send_state->exiting);
@@ -1057,6 +1079,7 @@ static void multifd_recv_cleanup_channel(MultiFDRecvParams *p)
     p->packet_len = 0;
     g_free(p->packet);
     p->packet = NULL;
+    g_clear_pointer(&p->packet_dev_state, g_free);
     g_free(p->normal);
     p->normal = NULL;
     g_free(p->zero);
@@ -1158,6 +1181,32 @@ void multifd_recv_sync_main(void)
     trace_multifd_recv_sync_main(multifd_recv_state->packet_num);
 }
 
+static int multifd_device_state_recv(MultiFDRecvParams *p, Error **errp)
+{
+    g_autofree char *idstr = NULL;
+    g_autofree char *dev_state_buf = NULL;
+    int ret;
+
+    dev_state_buf = g_malloc(p->next_packet_size);
+
+    ret = qio_channel_read_all(p->c, dev_state_buf, p->next_packet_size, errp);
+    if (ret != 0) {
+        return ret;
+    }
+
+    idstr = g_strndup(p->packet_dev_state->idstr,
+                      sizeof(p->packet_dev_state->idstr));
+
+    if (!qemu_loadvm_load_state_buffer(idstr,
+                                       p->packet_dev_state->instance_id,
+                                       dev_state_buf, p->next_packet_size,
+                                       errp)) {
+        ret = -1;
+    }
+
+    return ret;
+}
+
 static void *multifd_recv_thread(void *opaque)
 {
     MigrationState *s = migrate_get_current();
@@ -1176,6 +1225,7 @@ static void *multifd_recv_thread(void *opaque)
     while (true) {
         MultiFDPacketHdr_t hdr;
         uint32_t flags = 0;
+        bool is_device_state = false;
         bool has_data = false;
         uint8_t *pkt_buf;
         size_t pkt_len;
@@ -1209,8 +1259,14 @@ static void *multifd_recv_thread(void *opaque)
                 break;
             }
 
-            pkt_buf = (uint8_t *)p->packet + sizeof(hdr);
-            pkt_len = p->packet_len - sizeof(hdr);
+            is_device_state = p->flags & MULTIFD_FLAG_DEVICE_STATE;
+            if (is_device_state) {
+                pkt_buf = (uint8_t *)p->packet_dev_state + sizeof(hdr);
+                pkt_len = sizeof(*p->packet_dev_state) - sizeof(hdr);
+            } else {
+                pkt_buf = (uint8_t *)p->packet + sizeof(hdr);
+                pkt_len = p->packet_len - sizeof(hdr);
+            }
 
             ret = qio_channel_read_all_eof(p->c, (char *)pkt_buf, pkt_len,
                                            &local_err);
@@ -1235,12 +1291,17 @@ static void *multifd_recv_thread(void *opaque)
             /* recv methods don't know how to handle the SYNC flag */
             p->flags &= ~MULTIFD_FLAG_SYNC;
 
-            /*
-             * Even if it's a SYNC packet, this needs to be set
-             * because older QEMUs (<9.0) still send data along with
-             * the SYNC packet.
-             */
-            has_data = p->normal_num || p->zero_num;
+            if (is_device_state) {
+                has_data = p->next_packet_size > 0;
+            } else {
+                /*
+                 * Even if it's a SYNC packet, this needs to be set
+                 * because older QEMUs (<9.0) still send data along with
+                 * the SYNC packet.
+                 */
+                has_data = p->normal_num || p->zero_num;
+            }
+
             qemu_mutex_unlock(&p->mutex);
         } else {
             /*
@@ -1269,14 +1330,29 @@ static void *multifd_recv_thread(void *opaque)
         }
 
         if (has_data) {
-            ret = multifd_recv_state->ops->recv(p, &local_err);
+            if (is_device_state) {
+                assert(use_packets);
+                ret = multifd_device_state_recv(p, &local_err);
+            } else {
+                ret = multifd_recv_state->ops->recv(p, &local_err);
+            }
             if (ret != 0) {
                 break;
             }
+        } else if (is_device_state) {
+            error_setg(&local_err,
+                       "multifd: received empty device state packet");
+            break;
         }
 
         if (use_packets) {
             if (flags & MULTIFD_FLAG_SYNC) {
+                if (is_device_state) {
+                    error_setg(&local_err,
+                               "multifd: received SYNC device state packet");
+                    break;
+                }
+
                 qemu_sem_post(&multifd_recv_state->sem_sync);
                 qemu_sem_wait(&p->sem_sync);
             }
@@ -1345,6 +1421,7 @@ int multifd_recv_setup(Error **errp)
             p->packet_len = sizeof(MultiFDPacket_t)
                 + sizeof(uint64_t) * page_count;
             p->packet = g_malloc0(p->packet_len);
+            p->packet_dev_state = g_malloc0(sizeof(*p->packet_dev_state));
         }
         p->name = g_strdup_printf(MIGRATION_THREAD_DST_MULTIFD, i);
         p->normal = g_new0(ram_addr_t, page_count);
diff --git a/migration/multifd.h b/migration/multifd.h
index f7156f66c0f6..c2ebef2d319e 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -62,6 +62,12 @@ MultiFDRecvData *multifd_get_recv_data(void);
 #define MULTIFD_FLAG_UADK (8 << 1)
 #define MULTIFD_FLAG_QATZIP (16 << 1)
 
+/*
+ * If set it means that this packet contains device state
+ * (MultiFDPacketDeviceState_t), not RAM data (MultiFDPacket_t).
+ */
+#define MULTIFD_FLAG_DEVICE_STATE (32 << 1)
+
 /* This value needs to be a multiple of qemu_target_page_size() */
 #define MULTIFD_PACKET_SIZE (512 * 1024)
 
@@ -94,6 +100,16 @@ typedef struct {
     uint64_t offset[];
 } __attribute__((packed)) MultiFDPacket_t;
 
+typedef struct {
+    MultiFDPacketHdr_t hdr;
+
+    char idstr[256] QEMU_NONSTRING;
+    uint32_t instance_id;
+
+    /* size of the next packet that contains the actual data */
+    uint32_t next_packet_size;
+} __attribute__((packed)) MultiFDPacketDeviceState_t;
+
 typedef struct {
     /* number of used pages */
     uint32_t num;
@@ -111,6 +127,13 @@ struct MultiFDRecvData {
     off_t file_offset;
 };
 
+typedef struct {
+    char *idstr;
+    uint32_t instance_id;
+    char *buf;
+    size_t buf_len;
+} MultiFDDeviceState_t;
+
 typedef enum {
     MULTIFD_PAYLOAD_NONE,
     MULTIFD_PAYLOAD_RAM,
@@ -227,8 +250,9 @@ typedef struct {
 
     /* thread local variables. No locking required */
 
-    /* pointer to the packet */
+    /* pointers to the possible packet types */
     MultiFDPacket_t *packet;
+    MultiFDPacketDeviceState_t *packet_dev_state;
     /* size of the next packet that contains pages */
     uint32_t next_packet_size;
     /* packets received through this channel */



* [PATCH v5 12/36] migration/multifd: Make multifd_send() thread safe
  2025-02-19 20:33 [PATCH v5 00/36] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
                   ` (10 preceding siblings ...)
  2025-02-19 20:33 ` [PATCH v5 11/36] migration/multifd: Device state transfer support - receive side Maciej S. Szmigiero
@ 2025-02-19 20:33 ` Maciej S. Szmigiero
  2025-02-19 20:33 ` [PATCH v5 13/36] migration/multifd: Add an explicit MultiFDSendData destructor Maciej S. Szmigiero
                   ` (23 subsequent siblings)
  35 siblings, 0 replies; 120+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-19 20:33 UTC (permalink / raw)
  To: Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

The multifd_send() function is currently not thread safe; make it thread
safe by holding a lock during its execution.

This way it will be possible to safely call it concurrently from multiple
threads.
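The pattern is simply one mutex held across the whole function body so concurrent callers take turns; a minimal pthread sketch of it (the names and the counter are illustrative, not QEMU's):

```c
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t send_mutex = PTHREAD_MUTEX_INITIALIZER;
static unsigned long packets_queued;

/* The whole body runs under the lock, so concurrent callers are
 * serialized and the shared state below needs no further
 * synchronization. */
static int serialized_send(void)
{
    pthread_mutex_lock(&send_mutex);
    /* ... wait for a free channel and hand over the payload ... */
    packets_queued++;
    pthread_mutex_unlock(&send_mutex);
    return 0;
}

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000; i++) {
        serialized_send();
    }
    return NULL;
}
```

The patch uses QEMU's QEMU_LOCK_GUARD() instead of explicit lock/unlock calls, which also guarantees the mutex is released on every early-return path.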

Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
 migration/multifd.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/migration/multifd.c b/migration/multifd.c
index 700a385447c7..66ae77fbe4f1 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -50,6 +50,10 @@ typedef struct {
 
 struct {
     MultiFDSendParams *params;
+
+    /* multifd_send() body is not thread safe, needs serialization */
+    QemuMutex multifd_send_mutex;
+
     /*
      * Global number of generated multifd packets.
      *
@@ -339,6 +343,8 @@ bool multifd_send(MultiFDSendData **send_data)
         return false;
     }
 
+    QEMU_LOCK_GUARD(&multifd_send_state->multifd_send_mutex);
+
     /* We wait here, until at least one channel is ready */
     qemu_sem_wait(&multifd_send_state->channels_ready);
 
@@ -507,6 +513,7 @@ static void multifd_send_cleanup_state(void)
     socket_cleanup_outgoing_migration();
     qemu_sem_destroy(&multifd_send_state->channels_created);
     qemu_sem_destroy(&multifd_send_state->channels_ready);
+    qemu_mutex_destroy(&multifd_send_state->multifd_send_mutex);
     g_free(multifd_send_state->params);
     multifd_send_state->params = NULL;
     g_free(multifd_send_state);
@@ -887,6 +894,7 @@ bool multifd_send_setup(void)
     thread_count = migrate_multifd_channels();
     multifd_send_state = g_malloc0(sizeof(*multifd_send_state));
     multifd_send_state->params = g_new0(MultiFDSendParams, thread_count);
+    qemu_mutex_init(&multifd_send_state->multifd_send_mutex);
     qemu_sem_init(&multifd_send_state->channels_created, 0);
     qemu_sem_init(&multifd_send_state->channels_ready, 0);
     qatomic_set(&multifd_send_state->exiting, 0);



* [PATCH v5 13/36] migration/multifd: Add an explicit MultiFDSendData destructor
  2025-02-19 20:33 [PATCH v5 00/36] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
                   ` (11 preceding siblings ...)
  2025-02-19 20:33 ` [PATCH v5 12/36] migration/multifd: Make multifd_send() thread safe Maciej S. Szmigiero
@ 2025-02-19 20:33 ` Maciej S. Szmigiero
  2025-02-19 20:33 ` [PATCH v5 14/36] migration/multifd: Device state transfer support - send side Maciej S. Szmigiero
                   ` (22 subsequent siblings)
  35 siblings, 0 replies; 120+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-19 20:33 UTC (permalink / raw)
  To: Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

This way, if there are fields that need explicit disposal (for example,
some attached buffers), they will be handled appropriately.

Add a related assert to multifd_set_payload_type() to make sure that this
function is only used to fill a previously empty MultiFDSendData with a
payload, not the other way around.
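The ownership rules can be sketched standalone: clearing dispatches on the payload type, and the type may only go from empty to non-empty via the setter. The type and function names below are illustrative stand-ins for MultiFDSendData and its helpers:

```c
#include <assert.h>
#include <stdlib.h>

typedef enum { PAYLOAD_NONE, PAYLOAD_RAM, PAYLOAD_DEVICE_STATE } PayloadType;

typedef struct {
    PayloadType type;
    char *device_buf;   /* owned when type == PAYLOAD_DEVICE_STATE */
} SendData;

/* Explicit "clear" that disposes per-payload resources, mirroring
 * multifd_send_data_clear(). */
static void send_data_clear(SendData *d)
{
    if (d->type == PAYLOAD_NONE) {
        return;
    }
    if (d->type == PAYLOAD_DEVICE_STATE) {
        free(d->device_buf);
        d->device_buf = NULL;
    }
    d->type = PAYLOAD_NONE;
}

/* Only an empty container may be given a (non-empty) payload type,
 * mirroring the assert added to multifd_set_payload_type(). */
static void set_payload_type(SendData *d, PayloadType t)
{
    assert(d->type == PAYLOAD_NONE);
    assert(t != PAYLOAD_NONE);
    d->type = t;
}

/* Destructor: clear first so attached buffers are released, mirroring
 * multifd_send_data_free(). */
static void send_data_free(SendData *d)
{
    if (!d) {
        return;
    }
    send_data_clear(d);
    free(d);
}
```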

Reviewed-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
 migration/multifd-nocomp.c |  3 +--
 migration/multifd.c        | 31 ++++++++++++++++++++++++++++---
 migration/multifd.h        |  5 +++++
 3 files changed, 34 insertions(+), 5 deletions(-)

diff --git a/migration/multifd-nocomp.c b/migration/multifd-nocomp.c
index 1325dba97cea..e46e79d8b272 100644
--- a/migration/multifd-nocomp.c
+++ b/migration/multifd-nocomp.c
@@ -42,8 +42,7 @@ void multifd_ram_save_setup(void)
 
 void multifd_ram_save_cleanup(void)
 {
-    g_free(multifd_ram_send);
-    multifd_ram_send = NULL;
+    g_clear_pointer(&multifd_ram_send, multifd_send_data_free);
 }
 
 static void multifd_set_file_bitmap(MultiFDSendParams *p)
diff --git a/migration/multifd.c b/migration/multifd.c
index 66ae77fbe4f1..0092547a4f97 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -123,6 +123,32 @@ MultiFDSendData *multifd_send_data_alloc(void)
     return g_malloc0(size_minus_payload + max_payload_size);
 }
 
+void multifd_send_data_clear(MultiFDSendData *data)
+{
+    if (multifd_payload_empty(data)) {
+        return;
+    }
+
+    switch (data->type) {
+    default:
+        /* Nothing to do */
+        break;
+    }
+
+    data->type = MULTIFD_PAYLOAD_NONE;
+}
+
+void multifd_send_data_free(MultiFDSendData *data)
+{
+    if (!data) {
+        return;
+    }
+
+    multifd_send_data_clear(data);
+
+    g_free(data);
+}
+
 static bool multifd_use_packets(void)
 {
     return !migrate_mapped_ram();
@@ -496,8 +522,7 @@ static bool multifd_send_cleanup_channel(MultiFDSendParams *p, Error **errp)
     qemu_sem_destroy(&p->sem_sync);
     g_free(p->name);
     p->name = NULL;
-    g_free(p->data);
-    p->data = NULL;
+    g_clear_pointer(&p->data, multifd_send_data_free);
     p->packet_len = 0;
     g_free(p->packet);
     p->packet = NULL;
@@ -695,7 +720,7 @@ static void *multifd_send_thread(void *opaque)
                        (uint64_t)p->next_packet_size + p->packet_len);
 
             p->next_packet_size = 0;
-            multifd_set_payload_type(p->data, MULTIFD_PAYLOAD_NONE);
+            multifd_send_data_clear(p->data);
 
             /*
              * Making sure p->data is published before saying "we're
diff --git a/migration/multifd.h b/migration/multifd.h
index c2ebef2d319e..20a4bba58ef4 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -156,6 +156,9 @@ static inline bool multifd_payload_empty(MultiFDSendData *data)
 static inline void multifd_set_payload_type(MultiFDSendData *data,
                                             MultiFDPayloadType type)
 {
+    assert(multifd_payload_empty(data));
+    assert(type != MULTIFD_PAYLOAD_NONE);
+
     data->type = type;
 }
 
@@ -372,6 +375,8 @@ static inline void multifd_send_prepare_header(MultiFDSendParams *p)
 void multifd_channel_connect(MultiFDSendParams *p, QIOChannel *ioc);
 bool multifd_send(MultiFDSendData **send_data);
 MultiFDSendData *multifd_send_data_alloc(void);
+void multifd_send_data_clear(MultiFDSendData *data);
+void multifd_send_data_free(MultiFDSendData *data);
 
 static inline uint32_t multifd_ram_page_size(void)
 {



* [PATCH v5 14/36] migration/multifd: Device state transfer support - send side
  2025-02-19 20:33 [PATCH v5 00/36] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
                   ` (12 preceding siblings ...)
  2025-02-19 20:33 ` [PATCH v5 13/36] migration/multifd: Add an explicit MultiFDSendData destructor Maciej S. Szmigiero
@ 2025-02-19 20:33 ` Maciej S. Szmigiero
  2025-03-02 12:46   ` Avihai Horon
  2025-02-19 20:33 ` [PATCH v5 15/36] migration/multifd: Make MultiFDSendData a struct Maciej S. Szmigiero
                   ` (21 subsequent siblings)
  35 siblings, 1 reply; 120+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-19 20:33 UTC (permalink / raw)
  To: Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

A new function, multifd_queue_device_state(), is provided for a device to
queue its state for transmission via a multifd channel.

Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
 include/migration/misc.h         |   4 ++
 migration/meson.build            |   1 +
 migration/multifd-device-state.c | 115 +++++++++++++++++++++++++++++++
 migration/multifd-nocomp.c       |  14 +++-
 migration/multifd.c              |  42 +++++++++--
 migration/multifd.h              |  27 +++++---
 6 files changed, 187 insertions(+), 16 deletions(-)
 create mode 100644 migration/multifd-device-state.c

diff --git a/include/migration/misc.h b/include/migration/misc.h
index 4c171f4e897e..bd3b725fa0b7 100644
--- a/include/migration/misc.h
+++ b/include/migration/misc.h
@@ -118,4 +118,8 @@ bool migrate_is_uri(const char *uri);
 bool migrate_uri_parse(const char *uri, MigrationChannel **channel,
                        Error **errp);
 
+/* migration/multifd-device-state.c */
+bool multifd_queue_device_state(char *idstr, uint32_t instance_id,
+                                char *data, size_t len);
+
 #endif
diff --git a/migration/meson.build b/migration/meson.build
index d3bfe84d6204..9aa48b290e2a 100644
--- a/migration/meson.build
+++ b/migration/meson.build
@@ -25,6 +25,7 @@ system_ss.add(files(
   'migration-hmp-cmds.c',
   'migration.c',
   'multifd.c',
+  'multifd-device-state.c',
   'multifd-nocomp.c',
   'multifd-zlib.c',
   'multifd-zero-page.c',
diff --git a/migration/multifd-device-state.c b/migration/multifd-device-state.c
new file mode 100644
index 000000000000..ab83773e2d62
--- /dev/null
+++ b/migration/multifd-device-state.c
@@ -0,0 +1,115 @@
+/*
+ * Multifd device state migration
+ *
+ * Copyright (C) 2024,2025 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/lockable.h"
+#include "migration/misc.h"
+#include "multifd.h"
+
+static struct {
+    QemuMutex queue_job_mutex;
+
+    MultiFDSendData *send_data;
+} *multifd_send_device_state;
+
+size_t multifd_device_state_payload_size(void)
+{
+    return sizeof(MultiFDDeviceState_t);
+}
+
+void multifd_device_state_send_setup(void)
+{
+    assert(!multifd_send_device_state);
+    multifd_send_device_state = g_malloc(sizeof(*multifd_send_device_state));
+
+    qemu_mutex_init(&multifd_send_device_state->queue_job_mutex);
+
+    multifd_send_device_state->send_data = multifd_send_data_alloc();
+}
+
+void multifd_device_state_send_cleanup(void)
+{
+    g_clear_pointer(&multifd_send_device_state->send_data,
+                    multifd_send_data_free);
+
+    qemu_mutex_destroy(&multifd_send_device_state->queue_job_mutex);
+
+    g_clear_pointer(&multifd_send_device_state, g_free);
+}
+
+void multifd_send_data_clear_device_state(MultiFDDeviceState_t *device_state)
+{
+    g_clear_pointer(&device_state->idstr, g_free);
+    g_clear_pointer(&device_state->buf, g_free);
+}
+
+static void multifd_device_state_fill_packet(MultiFDSendParams *p)
+{
+    MultiFDDeviceState_t *device_state = &p->data->u.device_state;
+    MultiFDPacketDeviceState_t *packet = p->packet_device_state;
+
+    packet->hdr.flags = cpu_to_be32(p->flags);
+    strncpy(packet->idstr, device_state->idstr, sizeof(packet->idstr));
+    packet->instance_id = cpu_to_be32(device_state->instance_id);
+    packet->next_packet_size = cpu_to_be32(p->next_packet_size);
+}
+
+static void multifd_prepare_header_device_state(MultiFDSendParams *p)
+{
+    p->iov[0].iov_len = sizeof(*p->packet_device_state);
+    p->iov[0].iov_base = p->packet_device_state;
+    p->iovs_num++;
+}
+
+void multifd_device_state_send_prepare(MultiFDSendParams *p)
+{
+    MultiFDDeviceState_t *device_state = &p->data->u.device_state;
+
+    assert(multifd_payload_device_state(p->data));
+
+    multifd_prepare_header_device_state(p);
+
+    assert(!(p->flags & MULTIFD_FLAG_SYNC));
+
+    p->next_packet_size = device_state->buf_len;
+    if (p->next_packet_size > 0) {
+        p->iov[p->iovs_num].iov_base = device_state->buf;
+        p->iov[p->iovs_num].iov_len = p->next_packet_size;
+        p->iovs_num++;
+    }
+
+    p->flags |= MULTIFD_FLAG_NOCOMP | MULTIFD_FLAG_DEVICE_STATE;
+
+    multifd_device_state_fill_packet(p);
+}
+
+bool multifd_queue_device_state(char *idstr, uint32_t instance_id,
+                                char *data, size_t len)
+{
+    /* Device state submissions can come from multiple threads */
+    QEMU_LOCK_GUARD(&multifd_send_device_state->queue_job_mutex);
+    MultiFDDeviceState_t *device_state;
+
+    assert(multifd_payload_empty(multifd_send_device_state->send_data));
+
+    multifd_set_payload_type(multifd_send_device_state->send_data,
+                             MULTIFD_PAYLOAD_DEVICE_STATE);
+    device_state = &multifd_send_device_state->send_data->u.device_state;
+    device_state->idstr = g_strdup(idstr);
+    device_state->instance_id = instance_id;
+    device_state->buf = g_memdup2(data, len);
+    device_state->buf_len = len;
+
+    if (!multifd_send(&multifd_send_device_state->send_data)) {
+        multifd_send_data_clear(multifd_send_device_state->send_data);
+        return false;
+    }
+
+    return true;
+}
diff --git a/migration/multifd-nocomp.c b/migration/multifd-nocomp.c
index e46e79d8b272..c00804652383 100644
--- a/migration/multifd-nocomp.c
+++ b/migration/multifd-nocomp.c
@@ -14,6 +14,7 @@
 #include "exec/ramblock.h"
 #include "exec/target_page.h"
 #include "file.h"
+#include "migration-stats.h"
 #include "multifd.h"
 #include "options.h"
 #include "qapi/error.h"
@@ -85,6 +86,13 @@ static void multifd_nocomp_send_cleanup(MultiFDSendParams *p, Error **errp)
     return;
 }
 
+static void multifd_ram_prepare_header(MultiFDSendParams *p)
+{
+    p->iov[0].iov_len = p->packet_len;
+    p->iov[0].iov_base = p->packet;
+    p->iovs_num++;
+}
+
 static void multifd_send_prepare_iovs(MultiFDSendParams *p)
 {
     MultiFDPages_t *pages = &p->data->u.ram;
@@ -118,7 +126,7 @@ static int multifd_nocomp_send_prepare(MultiFDSendParams *p, Error **errp)
          * Only !zerocopy needs the header in IOV; zerocopy will
          * send it separately.
          */
-        multifd_send_prepare_header(p);
+        multifd_ram_prepare_header(p);
     }
 
     multifd_send_prepare_iovs(p);
@@ -133,6 +141,8 @@ static int multifd_nocomp_send_prepare(MultiFDSendParams *p, Error **errp)
         if (ret != 0) {
             return -1;
         }
+
+        stat64_add(&mig_stats.multifd_bytes, p->packet_len);
     }
 
     return 0;
@@ -431,7 +441,7 @@ int multifd_ram_flush_and_sync(QEMUFile *f)
 bool multifd_send_prepare_common(MultiFDSendParams *p)
 {
     MultiFDPages_t *pages = &p->data->u.ram;
-    multifd_send_prepare_header(p);
+    multifd_ram_prepare_header(p);
     multifd_send_zero_page_detect(p);
 
     if (!pages->normal_num) {
diff --git a/migration/multifd.c b/migration/multifd.c
index 0092547a4f97..3394c2ae12fd 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -12,6 +12,7 @@
 
 #include "qemu/osdep.h"
 #include "qemu/cutils.h"
+#include "qemu/iov.h"
 #include "qemu/rcu.h"
 #include "exec/target_page.h"
 #include "system/system.h"
@@ -19,6 +20,7 @@
 #include "qemu/error-report.h"
 #include "qapi/error.h"
 #include "file.h"
+#include "migration/misc.h"
 #include "migration.h"
 #include "migration-stats.h"
 #include "savevm.h"
@@ -111,7 +113,9 @@ MultiFDSendData *multifd_send_data_alloc(void)
      * added to the union in the future are larger than
      * (MultiFDPages_t + flex array).
      */
-    max_payload_size = MAX(multifd_ram_payload_size(), sizeof(MultiFDPayload));
+    max_payload_size = MAX(multifd_ram_payload_size(),
+                           multifd_device_state_payload_size());
+    max_payload_size = MAX(max_payload_size, sizeof(MultiFDPayload));
 
     /*
      * Account for any holes the compiler might insert. We can't pack
@@ -130,6 +134,9 @@ void multifd_send_data_clear(MultiFDSendData *data)
     }
 
     switch (data->type) {
+    case MULTIFD_PAYLOAD_DEVICE_STATE:
+        multifd_send_data_clear_device_state(&data->u.device_state);
+        break;
     default:
         /* Nothing to do */
         break;
@@ -232,6 +239,7 @@ static int multifd_recv_initial_packet(QIOChannel *c, Error **errp)
     return msg.id;
 }
 
+/* Fills a RAM multifd packet */
 void multifd_send_fill_packet(MultiFDSendParams *p)
 {
     MultiFDPacket_t *packet = p->packet;
@@ -524,6 +532,7 @@ static bool multifd_send_cleanup_channel(MultiFDSendParams *p, Error **errp)
     p->name = NULL;
     g_clear_pointer(&p->data, multifd_send_data_free);
     p->packet_len = 0;
+    g_clear_pointer(&p->packet_device_state, g_free);
     g_free(p->packet);
     p->packet = NULL;
     multifd_send_state->ops->send_cleanup(p, errp);
@@ -536,6 +545,7 @@ static void multifd_send_cleanup_state(void)
 {
     file_cleanup_outgoing_migration();
     socket_cleanup_outgoing_migration();
+    multifd_device_state_send_cleanup();
     qemu_sem_destroy(&multifd_send_state->channels_created);
     qemu_sem_destroy(&multifd_send_state->channels_ready);
     qemu_mutex_destroy(&multifd_send_state->multifd_send_mutex);
@@ -694,16 +704,32 @@ static void *multifd_send_thread(void *opaque)
          * qatomic_store_release() in multifd_send().
          */
         if (qatomic_load_acquire(&p->pending_job)) {
+            bool is_device_state = multifd_payload_device_state(p->data);
+            size_t total_size;
+
             p->flags = 0;
             p->iovs_num = 0;
             assert(!multifd_payload_empty(p->data));
 
-            ret = multifd_send_state->ops->send_prepare(p, &local_err);
-            if (ret != 0) {
-                break;
+            if (is_device_state) {
+                multifd_device_state_send_prepare(p);
+            } else {
+                ret = multifd_send_state->ops->send_prepare(p, &local_err);
+                if (ret != 0) {
+                    break;
+                }
             }
 
+            /*
+             * The packet header in the zerocopy RAM case is accounted for
+             * in multifd_nocomp_send_prepare() - where it is actually
+             * being sent.
+             */
+            total_size = iov_size(p->iov, p->iovs_num);
+
             if (migrate_mapped_ram()) {
+                assert(!is_device_state);
+
                 ret = file_write_ramblock_iov(p->c, p->iov, p->iovs_num,
                                               &p->data->u.ram, &local_err);
             } else {
@@ -716,8 +742,7 @@ static void *multifd_send_thread(void *opaque)
                 break;
             }
 
-            stat64_add(&mig_stats.multifd_bytes,
-                       (uint64_t)p->next_packet_size + p->packet_len);
+            stat64_add(&mig_stats.multifd_bytes, total_size);
 
             p->next_packet_size = 0;
             multifd_send_data_clear(p->data);
@@ -938,6 +963,9 @@ bool multifd_send_setup(void)
             p->packet_len = sizeof(MultiFDPacket_t)
                           + sizeof(uint64_t) * page_count;
             p->packet = g_malloc0(p->packet_len);
+            p->packet_device_state = g_malloc0(sizeof(*p->packet_device_state));
+            p->packet_device_state->hdr.magic = cpu_to_be32(MULTIFD_MAGIC);
+            p->packet_device_state->hdr.version = cpu_to_be32(MULTIFD_VERSION);
         }
         p->name = g_strdup_printf(MIGRATION_THREAD_SRC_MULTIFD, i);
         p->write_flags = 0;
@@ -973,6 +1001,8 @@ bool multifd_send_setup(void)
         assert(p->iov);
     }
 
+    multifd_device_state_send_setup();
+
     return true;
 
 err:
diff --git a/migration/multifd.h b/migration/multifd.h
index 20a4bba58ef4..883a43c1d79e 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -137,10 +137,12 @@ typedef struct {
 typedef enum {
     MULTIFD_PAYLOAD_NONE,
     MULTIFD_PAYLOAD_RAM,
+    MULTIFD_PAYLOAD_DEVICE_STATE,
 } MultiFDPayloadType;
 
 typedef union MultiFDPayload {
     MultiFDPages_t ram;
+    MultiFDDeviceState_t device_state;
 } MultiFDPayload;
 
 struct MultiFDSendData {
@@ -153,6 +155,11 @@ static inline bool multifd_payload_empty(MultiFDSendData *data)
     return data->type == MULTIFD_PAYLOAD_NONE;
 }
 
+static inline bool multifd_payload_device_state(MultiFDSendData *data)
+{
+    return data->type == MULTIFD_PAYLOAD_DEVICE_STATE;
+}
+
 static inline void multifd_set_payload_type(MultiFDSendData *data,
                                             MultiFDPayloadType type)
 {
@@ -205,8 +212,9 @@ typedef struct {
 
     /* thread local variables. No locking required */
 
-    /* pointer to the packet */
+    /* pointers to the possible packet types */
     MultiFDPacket_t *packet;
+    MultiFDPacketDeviceState_t *packet_device_state;
     /* size of the next packet that contains pages */
     uint32_t next_packet_size;
     /* packets sent through this channel */
@@ -365,13 +373,6 @@ bool multifd_send_prepare_common(MultiFDSendParams *p);
 void multifd_send_zero_page_detect(MultiFDSendParams *p);
 void multifd_recv_zero_page_process(MultiFDRecvParams *p);
 
-static inline void multifd_send_prepare_header(MultiFDSendParams *p)
-{
-    p->iov[0].iov_len = p->packet_len;
-    p->iov[0].iov_base = p->packet;
-    p->iovs_num++;
-}
-
 void multifd_channel_connect(MultiFDSendParams *p, QIOChannel *ioc);
 bool multifd_send(MultiFDSendData **send_data);
 MultiFDSendData *multifd_send_data_alloc(void);
@@ -396,4 +397,14 @@ bool multifd_ram_sync_per_section(void);
 size_t multifd_ram_payload_size(void);
 void multifd_ram_fill_packet(MultiFDSendParams *p);
 int multifd_ram_unfill_packet(MultiFDRecvParams *p, Error **errp);
+
+size_t multifd_device_state_payload_size(void);
+
+void multifd_send_data_clear_device_state(MultiFDDeviceState_t *device_state);
+
+void multifd_device_state_send_setup(void);
+void multifd_device_state_send_cleanup(void);
+
+void multifd_device_state_send_prepare(MultiFDSendParams *p);
+
 #endif


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v5 15/36] migration/multifd: Make MultiFDSendData a struct
  2025-02-19 20:33 [PATCH v5 00/36] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
                   ` (13 preceding siblings ...)
  2025-02-19 20:33 ` [PATCH v5 14/36] migration/multifd: Device state transfer support - send side Maciej S. Szmigiero
@ 2025-02-19 20:33 ` Maciej S. Szmigiero
  2025-02-19 20:33 ` [PATCH v5 16/36] migration/multifd: Add multifd_device_state_supported() Maciej S. Szmigiero
                   ` (20 subsequent siblings)
  35 siblings, 0 replies; 120+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-19 20:33 UTC (permalink / raw)
  To: Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

From: Peter Xu <peterx@redhat.com>

The newly introduced device state buffer can be used for storing VFIO's
read() raw data, but it is already also possible to store generic device
states.  After noticing that device states may not easily provide a
maximum buffer size (and that the RAM MultiFDPages_t also wants
flexibility in managing its offset[] array), it may not be a good idea to
stick with a union in MultiFDSendData, as a union won't play well with
such flexibility.

Switch MultiFDSendData to a struct.

It won't consume much more space in reality: the real buffers were
already dynamically allocated, so only the two small structs (pages,
device_state) end up duplicated.

With this, we can remove the hard-to-understand allocation size logic,
because now we can allocate offset[] together with the SendData and
properly free it when the SendData is freed.
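The before/after allocation strategies can be sketched in isolation as
follows.  The types below are simplified stand-ins (only a subset of
fields), and PAGE_COUNT is an assumed constant for illustration; this is
not QEMU's actual definition, just a sketch of the scheme this patch
switches to:

```c
#include <assert.h>
#include <stdlib.h>

typedef unsigned long ram_addr_t;

/* Simplified stand-ins for the QEMU types (field subset only). */
typedef struct {
    unsigned int num;
    ram_addr_t *offset;   /* was: ram_addr_t offset[]; (flex array) */
} MultiFDPages_t;

typedef struct {
    char opaque[64];      /* placeholder for MultiFDDeviceState_t */
} MultiFDDeviceState_t;

/* A struct now, so both members coexist and neither needs size math. */
typedef struct MultiFDPayload {
    MultiFDPages_t ram;
    MultiFDDeviceState_t device_state;
} MultiFDPayload;

typedef struct {
    int type;
    MultiFDPayload u;
} MultiFDSendData;

enum { PAGE_COUNT = 128 };  /* assumed page count for the sketch */

MultiFDSendData *send_data_alloc(void)
{
    MultiFDSendData *new = calloc(1, sizeof(*new));

    /* offset[] is allocated together with the SendData, no MAX() logic */
    new->u.ram.offset = calloc(PAGE_COUNT, sizeof(ram_addr_t));
    return new;
}

void send_data_free(MultiFDSendData *data)
{
    free(data->u.ram.offset);
    free(data);
}
```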

Signed-off-by: Peter Xu <peterx@redhat.com>
[MSS: Make sure to clear possible device state payload before freeing
MultiFDSendData, remove placeholders for other patches not included]
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
 migration/multifd-device-state.c |  5 -----
 migration/multifd-nocomp.c       | 13 ++++++-------
 migration/multifd.c              | 25 +++++++------------------
 migration/multifd.h              | 15 +++++++++------
 4 files changed, 22 insertions(+), 36 deletions(-)

diff --git a/migration/multifd-device-state.c b/migration/multifd-device-state.c
index ab83773e2d62..ad631a776da9 100644
--- a/migration/multifd-device-state.c
+++ b/migration/multifd-device-state.c
@@ -18,11 +18,6 @@ static struct {
     MultiFDSendData *send_data;
 } *multifd_send_device_state;
 
-size_t multifd_device_state_payload_size(void)
-{
-    return sizeof(MultiFDDeviceState_t);
-}
-
 void multifd_device_state_send_setup(void)
 {
     assert(!multifd_send_device_state);
diff --git a/migration/multifd-nocomp.c b/migration/multifd-nocomp.c
index c00804652383..ffe75256c9fb 100644
--- a/migration/multifd-nocomp.c
+++ b/migration/multifd-nocomp.c
@@ -25,15 +25,14 @@
 
 static MultiFDSendData *multifd_ram_send;
 
-size_t multifd_ram_payload_size(void)
+void multifd_ram_payload_alloc(MultiFDPages_t *pages)
 {
-    uint32_t n = multifd_ram_page_count();
+    pages->offset = g_new0(ram_addr_t, multifd_ram_page_count());
+}
 
-    /*
-     * We keep an array of page offsets at the end of MultiFDPages_t,
-     * add space for it in the allocation.
-     */
-    return sizeof(MultiFDPages_t) + n * sizeof(ram_addr_t);
+void multifd_ram_payload_free(MultiFDPages_t *pages)
+{
+    g_clear_pointer(&pages->offset, g_free);
 }
 
 void multifd_ram_save_setup(void)
diff --git a/migration/multifd.c b/migration/multifd.c
index 3394c2ae12fd..b20a61d42c26 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -105,26 +105,12 @@ struct {
 
 MultiFDSendData *multifd_send_data_alloc(void)
 {
-    size_t max_payload_size, size_minus_payload;
+    MultiFDSendData *new = g_new0(MultiFDSendData, 1);
 
-    /*
-     * MultiFDPages_t has a flexible array at the end, account for it
-     * when allocating MultiFDSendData. Use max() in case other types
-     * added to the union in the future are larger than
-     * (MultiFDPages_t + flex array).
-     */
-    max_payload_size = MAX(multifd_ram_payload_size(),
-                           multifd_device_state_payload_size());
-    max_payload_size = MAX(max_payload_size, sizeof(MultiFDPayload));
-
-    /*
-     * Account for any holes the compiler might insert. We can't pack
-     * the structure because that misaligns the members and triggers
-     * Waddress-of-packed-member.
-     */
-    size_minus_payload = sizeof(MultiFDSendData) - sizeof(MultiFDPayload);
+    multifd_ram_payload_alloc(&new->u.ram);
+    /* Device state allocates its payload on-demand */
 
-    return g_malloc0(size_minus_payload + max_payload_size);
+    return new;
 }
 
 void multifd_send_data_clear(MultiFDSendData *data)
@@ -151,8 +137,11 @@ void multifd_send_data_free(MultiFDSendData *data)
         return;
     }
 
+    /* This also frees the device state payload */
     multifd_send_data_clear(data);
 
+    multifd_ram_payload_free(&data->u.ram);
+
     g_free(data);
 }
 
diff --git a/migration/multifd.h b/migration/multifd.h
index 883a43c1d79e..81c19a591f3e 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -115,9 +115,13 @@ typedef struct {
     uint32_t num;
     /* number of normal pages */
     uint32_t normal_num;
+    /*
+     * Pointer to the ramblock.  NOTE: it's caller's responsibility to make
+     * sure the pointer is always valid!
+     */
     RAMBlock *block;
-    /* offset of each page */
-    ram_addr_t offset[];
+    /* offset array of each page, managed by multifd */
+    ram_addr_t *offset;
 } MultiFDPages_t;
 
 struct MultiFDRecvData {
@@ -140,7 +144,7 @@ typedef enum {
     MULTIFD_PAYLOAD_DEVICE_STATE,
 } MultiFDPayloadType;
 
-typedef union MultiFDPayload {
+typedef struct MultiFDPayload {
     MultiFDPages_t ram;
     MultiFDDeviceState_t device_state;
 } MultiFDPayload;
@@ -394,12 +398,11 @@ void multifd_ram_save_cleanup(void);
 int multifd_ram_flush_and_sync(QEMUFile *f);
 bool multifd_ram_sync_per_round(void);
 bool multifd_ram_sync_per_section(void);
-size_t multifd_ram_payload_size(void);
+void multifd_ram_payload_alloc(MultiFDPages_t *pages);
+void multifd_ram_payload_free(MultiFDPages_t *pages);
 void multifd_ram_fill_packet(MultiFDSendParams *p);
 int multifd_ram_unfill_packet(MultiFDRecvParams *p, Error **errp);
 
-size_t multifd_device_state_payload_size(void);
-
 void multifd_send_data_clear_device_state(MultiFDDeviceState_t *device_state);
 
 void multifd_device_state_send_setup(void);



* [PATCH v5 16/36] migration/multifd: Add multifd_device_state_supported()
  2025-02-19 20:33 [PATCH v5 00/36] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
                   ` (14 preceding siblings ...)
  2025-02-19 20:33 ` [PATCH v5 15/36] migration/multifd: Make MultiFDSendData a struct Maciej S. Szmigiero
@ 2025-02-19 20:33 ` Maciej S. Szmigiero
  2025-02-19 20:33 ` [PATCH v5 17/36] migration: Add save_live_complete_precopy_thread handler Maciej S. Szmigiero
                   ` (19 subsequent siblings)
  35 siblings, 0 replies; 120+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-19 20:33 UTC (permalink / raw)
  To: Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

Since device state transfer via multifd channels requires multifd
channels with packets and is currently not compatible with multifd
compression, add an appropriate query function so a device can learn
whether it can actually make use of this functionality.

Reviewed-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
 include/migration/misc.h         | 1 +
 migration/multifd-device-state.c | 7 +++++++
 2 files changed, 8 insertions(+)

diff --git a/include/migration/misc.h b/include/migration/misc.h
index bd3b725fa0b7..273ebfca6256 100644
--- a/include/migration/misc.h
+++ b/include/migration/misc.h
@@ -121,5 +121,6 @@ bool migrate_uri_parse(const char *uri, MigrationChannel **channel,
 /* migration/multifd-device-state.c */
 bool multifd_queue_device_state(char *idstr, uint32_t instance_id,
                                 char *data, size_t len);
+bool multifd_device_state_supported(void);
 
 #endif
diff --git a/migration/multifd-device-state.c b/migration/multifd-device-state.c
index ad631a776da9..5de3cf27d6e8 100644
--- a/migration/multifd-device-state.c
+++ b/migration/multifd-device-state.c
@@ -11,6 +11,7 @@
 #include "qemu/lockable.h"
 #include "migration/misc.h"
 #include "multifd.h"
+#include "options.h"
 
 static struct {
     QemuMutex queue_job_mutex;
@@ -108,3 +109,9 @@ bool multifd_queue_device_state(char *idstr, uint32_t instance_id,
 
     return true;
 }
+
+bool multifd_device_state_supported(void)
+{
+    return migrate_multifd() && !migrate_mapped_ram() &&
+        migrate_multifd_compression() == MULTIFD_COMPRESSION_NONE;
+}



* [PATCH v5 17/36] migration: Add save_live_complete_precopy_thread handler
  2025-02-19 20:33 [PATCH v5 00/36] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
                   ` (15 preceding siblings ...)
  2025-02-19 20:33 ` [PATCH v5 16/36] migration/multifd: Add multifd_device_state_supported() Maciej S. Szmigiero
@ 2025-02-19 20:33 ` Maciej S. Szmigiero
  2025-02-26 16:43   ` Peter Xu
  2025-02-19 20:34 ` [PATCH v5 18/36] vfio/migration: Add load_device_config_state_start trace event Maciej S. Szmigiero
                   ` (18 subsequent siblings)
  35 siblings, 1 reply; 120+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-19 20:33 UTC (permalink / raw)
  To: Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

This SaveVMHandler lets a device provide its own asynchronous
transmission of the remaining data at the end of a precopy phase via
multifd channels, in parallel with the transfer done by the
save_live_complete_precopy handlers.

These threads are launched only when multifd device state transfer is
supported.

Management of these threads is done in the multifd migration code,
which wraps them in the generic thread pool.
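The abort/join contract the save threads follow can be sketched with C11
atomics.  This is an illustrative simplification: the helper names mirror
the patch, but a direct function call stands in for QEMU's generic thread
pool, and the chunk loop is a hypothetical device handler, not VFIO's
actual one:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative abort flag, like multifd_send_device_state->threads_abort */
static atomic_bool threads_abort;

/* Mirrors multifd_device_state_save_thread_should_exit() */
static bool save_thread_should_exit(void)
{
    return atomic_load(&threads_abort);
}

/*
 * Stand-in for a device's save_live_complete_precopy_thread handler:
 * transmit chunks until done, polling the abort flag between chunks.
 */
static void *save_thread(void *opaque)
{
    int *chunks_sent = opaque;

    while (!save_thread_should_exit()) {
        if (++(*chunks_sent) >= 1000) {
            break;              /* all device state transmitted */
        }
    }
    return NULL;
}
```

In the real code, multifd_abort_device_state_save_threads() sets the flag
and multifd_join_device_state_save_threads() waits on the thread pool and
then checks for a stored migration error.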

Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
 include/migration/misc.h         | 17 +++++++
 include/migration/register.h     | 19 +++++++
 include/qemu/typedefs.h          |  3 ++
 migration/multifd-device-state.c | 85 ++++++++++++++++++++++++++++++++
 migration/savevm.c               | 35 ++++++++++++-
 5 files changed, 158 insertions(+), 1 deletion(-)

diff --git a/include/migration/misc.h b/include/migration/misc.h
index 273ebfca6256..8fd36eba1da7 100644
--- a/include/migration/misc.h
+++ b/include/migration/misc.h
@@ -119,8 +119,25 @@ bool migrate_uri_parse(const char *uri, MigrationChannel **channel,
                        Error **errp);
 
 /* migration/multifd-device-state.c */
+typedef struct SaveLiveCompletePrecopyThreadData {
+    SaveLiveCompletePrecopyThreadHandler hdlr;
+    char *idstr;
+    uint32_t instance_id;
+    void *handler_opaque;
+} SaveLiveCompletePrecopyThreadData;
+
 bool multifd_queue_device_state(char *idstr, uint32_t instance_id,
                                 char *data, size_t len);
 bool multifd_device_state_supported(void);
 
+void
+multifd_spawn_device_state_save_thread(SaveLiveCompletePrecopyThreadHandler hdlr,
+                                       char *idstr, uint32_t instance_id,
+                                       void *opaque);
+
+bool multifd_device_state_save_thread_should_exit(void);
+
+void multifd_abort_device_state_save_threads(void);
+bool multifd_join_device_state_save_threads(void);
+
 #endif
diff --git a/include/migration/register.h b/include/migration/register.h
index 58891aa54b76..c041ce32f2fc 100644
--- a/include/migration/register.h
+++ b/include/migration/register.h
@@ -105,6 +105,25 @@ typedef struct SaveVMHandlers {
      */
     int (*save_live_complete_precopy)(QEMUFile *f, void *opaque);
 
+    /**
+     * @save_live_complete_precopy_thread (invoked in a separate thread)
+     *
+     * Called at the end of a precopy phase from a separate worker thread
+     * in configurations where multifd device state transfer is supported
+     * in order to perform asynchronous transmission of the remaining data in
+     * parallel with @save_live_complete_precopy handlers.
+     * When postcopy is enabled, devices that support postcopy will skip this
+     * step.
+     *
+     * @d: a #SaveLiveCompletePrecopyThreadData containing parameters that the
+     * handler may need, including this device section idstr and instance_id,
+     * and opaque data pointer passed to register_savevm_live().
+     * @errp: pointer to Error*, to store an error if it happens.
+     *
+     * Returns true to indicate success and false for errors.
+     */
+    SaveLiveCompletePrecopyThreadHandler save_live_complete_precopy_thread;
+
     /* This runs both outside and inside the BQL.  */
 
     /**
diff --git a/include/qemu/typedefs.h b/include/qemu/typedefs.h
index fd23ff7771b1..42ed4e6be150 100644
--- a/include/qemu/typedefs.h
+++ b/include/qemu/typedefs.h
@@ -108,6 +108,7 @@ typedef struct QString QString;
 typedef struct RAMBlock RAMBlock;
 typedef struct Range Range;
 typedef struct ReservedRegion ReservedRegion;
+typedef struct SaveLiveCompletePrecopyThreadData SaveLiveCompletePrecopyThreadData;
 typedef struct SHPCDevice SHPCDevice;
 typedef struct SSIBus SSIBus;
 typedef struct TCGCPUOps TCGCPUOps;
@@ -133,5 +134,7 @@ typedef struct IRQState *qemu_irq;
 typedef void (*qemu_irq_handler)(void *opaque, int n, int level);
 typedef bool (*MigrationLoadThread)(void *opaque, bool *should_quit,
                                     Error **errp);
+typedef bool (*SaveLiveCompletePrecopyThreadHandler)(SaveLiveCompletePrecopyThreadData *d,
+                                                     Error **errp);
 
 #endif /* QEMU_TYPEDEFS_H */
diff --git a/migration/multifd-device-state.c b/migration/multifd-device-state.c
index 5de3cf27d6e8..63f021fb8dad 100644
--- a/migration/multifd-device-state.c
+++ b/migration/multifd-device-state.c
@@ -8,7 +8,10 @@
  */
 
 #include "qemu/osdep.h"
+#include "qapi/error.h"
 #include "qemu/lockable.h"
+#include "block/thread-pool.h"
+#include "migration.h"
 #include "migration/misc.h"
 #include "multifd.h"
 #include "options.h"
@@ -17,6 +20,9 @@ static struct {
     QemuMutex queue_job_mutex;
 
     MultiFDSendData *send_data;
+
+    ThreadPool *threads;
+    bool threads_abort;
 } *multifd_send_device_state;
 
 void multifd_device_state_send_setup(void)
@@ -27,10 +33,14 @@ void multifd_device_state_send_setup(void)
     qemu_mutex_init(&multifd_send_device_state->queue_job_mutex);
 
     multifd_send_device_state->send_data = multifd_send_data_alloc();
+
+    multifd_send_device_state->threads = thread_pool_new();
+    multifd_send_device_state->threads_abort = false;
 }
 
 void multifd_device_state_send_cleanup(void)
 {
+    g_clear_pointer(&multifd_send_device_state->threads, thread_pool_free);
     g_clear_pointer(&multifd_send_device_state->send_data,
                     multifd_send_data_free);
 
@@ -115,3 +125,78 @@ bool multifd_device_state_supported(void)
     return migrate_multifd() && !migrate_mapped_ram() &&
         migrate_multifd_compression() == MULTIFD_COMPRESSION_NONE;
 }
+
+static void multifd_device_state_save_thread_data_free(void *opaque)
+{
+    SaveLiveCompletePrecopyThreadData *data = opaque;
+
+    g_clear_pointer(&data->idstr, g_free);
+    g_free(data);
+}
+
+static int multifd_device_state_save_thread(void *opaque)
+{
+    SaveLiveCompletePrecopyThreadData *data = opaque;
+    g_autoptr(Error) local_err = NULL;
+
+    if (!data->hdlr(data, &local_err)) {
+        MigrationState *s = migrate_get_current();
+
+        assert(local_err);
+
+        /*
+         * In case of multiple save threads failing, which thread's
+         * error we end up setting is purely arbitrary.
+         */
+        migrate_set_error(s, local_err);
+    }
+
+    return 0;
+}
+
+bool multifd_device_state_save_thread_should_exit(void)
+{
+    return qatomic_read(&multifd_send_device_state->threads_abort);
+}
+
+void
+multifd_spawn_device_state_save_thread(SaveLiveCompletePrecopyThreadHandler hdlr,
+                                       char *idstr, uint32_t instance_id,
+                                       void *opaque)
+{
+    SaveLiveCompletePrecopyThreadData *data;
+
+    assert(multifd_device_state_supported());
+    assert(multifd_send_device_state);
+
+    assert(!qatomic_read(&multifd_send_device_state->threads_abort));
+
+    data = g_new(SaveLiveCompletePrecopyThreadData, 1);
+    data->hdlr = hdlr;
+    data->idstr = g_strdup(idstr);
+    data->instance_id = instance_id;
+    data->handler_opaque = opaque;
+
+    thread_pool_submit_immediate(multifd_send_device_state->threads,
+                                 multifd_device_state_save_thread,
+                                 data,
+                                 multifd_device_state_save_thread_data_free);
+}
+
+void multifd_abort_device_state_save_threads(void)
+{
+    assert(multifd_device_state_supported());
+
+    qatomic_set(&multifd_send_device_state->threads_abort, true);
+}
+
+bool multifd_join_device_state_save_threads(void)
+{
+    MigrationState *s = migrate_get_current();
+
+    assert(multifd_device_state_supported());
+
+    thread_pool_wait(multifd_send_device_state->threads);
+
+    return !migrate_has_error(s);
+}
diff --git a/migration/savevm.c b/migration/savevm.c
index e412d05657a1..9a1e0ac807a0 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -37,6 +37,7 @@
 #include "migration/register.h"
 #include "migration/global_state.h"
 #include "migration/channel-block.h"
+#include "multifd.h"
 #include "ram.h"
 #include "qemu-file.h"
 #include "savevm.h"
@@ -1527,6 +1528,24 @@ int qemu_savevm_state_complete_precopy_iterable(QEMUFile *f, bool in_postcopy)
     int64_t start_ts_each, end_ts_each;
     SaveStateEntry *se;
     int ret;
+    bool multifd_device_state = multifd_device_state_supported();
+
+    if (multifd_device_state) {
+        QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+            SaveLiveCompletePrecopyThreadHandler hdlr;
+
+            if (!se->ops || (in_postcopy && se->ops->has_postcopy &&
+                             se->ops->has_postcopy(se->opaque)) ||
+                !se->ops->save_live_complete_precopy_thread) {
+                continue;
+            }
+
+            hdlr = se->ops->save_live_complete_precopy_thread;
+            multifd_spawn_device_state_save_thread(hdlr,
+                                                   se->idstr, se->instance_id,
+                                                   se->opaque);
+        }
+    }
 
     QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
         if (!se->ops ||
@@ -1552,16 +1571,30 @@ int qemu_savevm_state_complete_precopy_iterable(QEMUFile *f, bool in_postcopy)
         save_section_footer(f, se);
         if (ret < 0) {
             qemu_file_set_error(f, ret);
-            return -1;
+            goto ret_fail_abort_threads;
         }
         end_ts_each = qemu_clock_get_us(QEMU_CLOCK_REALTIME);
         trace_vmstate_downtime_save("iterable", se->idstr, se->instance_id,
                                     end_ts_each - start_ts_each);
     }
 
+    if (multifd_device_state &&
+        !multifd_join_device_state_save_threads()) {
+        qemu_file_set_error(f, -EINVAL);
+        return -1;
+    }
+
     trace_vmstate_downtime_checkpoint("src-iterable-saved");
 
     return 0;
+
+ret_fail_abort_threads:
+    if (multifd_device_state) {
+        multifd_abort_device_state_save_threads();
+        multifd_join_device_state_save_threads();
+    }
+
+    return -1;
 }
 
 int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,



* [PATCH v5 18/36] vfio/migration: Add load_device_config_state_start trace event
  2025-02-19 20:33 [PATCH v5 00/36] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
                   ` (16 preceding siblings ...)
  2025-02-19 20:33 ` [PATCH v5 17/36] migration: Add save_live_complete_precopy_thread handler Maciej S. Szmigiero
@ 2025-02-19 20:34 ` Maciej S. Szmigiero
  2025-02-19 20:34 ` [PATCH v5 19/36] vfio/migration: Convert bytes_transferred counter to atomic Maciej S. Szmigiero
                   ` (17 subsequent siblings)
  35 siblings, 0 replies; 120+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-19 20:34 UTC (permalink / raw)
  To: Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

And rename existing load_device_config_state trace event to
load_device_config_state_end for consistency since it is triggered at the
end of loading of the VFIO device config state.

This way both the start and end points of particular device config
loading operation (a long, BQL-serialized operation) are known.

Reviewed-by: Cédric Le Goater <clg@redhat.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
 hw/vfio/migration.c  | 4 +++-
 hw/vfio/trace-events | 3 ++-
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index adfa752db527..03890eaa48a9 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -285,6 +285,8 @@ static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
     VFIODevice *vbasedev = opaque;
     uint64_t data;
 
+    trace_vfio_load_device_config_state_start(vbasedev->name);
+
     if (vbasedev->ops && vbasedev->ops->vfio_load_config) {
         int ret;
 
@@ -303,7 +305,7 @@ static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
         return -EINVAL;
     }
 
-    trace_vfio_load_device_config_state(vbasedev->name);
+    trace_vfio_load_device_config_state_end(vbasedev->name);
     return qemu_file_get_error(f);
 }
 
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index cab1cf1de0a2..1bebe9877d88 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -149,7 +149,8 @@ vfio_display_edid_write_error(void) ""
 
 # migration.c
 vfio_load_cleanup(const char *name) " (%s)"
-vfio_load_device_config_state(const char *name) " (%s)"
+vfio_load_device_config_state_start(const char *name) " (%s)"
+vfio_load_device_config_state_end(const char *name) " (%s)"
 vfio_load_state(const char *name, uint64_t data) " (%s) data 0x%"PRIx64
 vfio_load_state_device_data(const char *name, uint64_t data_size, int ret) " (%s) size %"PRIu64" ret %d"
 vfio_migration_realize(const char *name) " (%s)"



* [PATCH v5 19/36] vfio/migration: Convert bytes_transferred counter to atomic
  2025-02-19 20:33 [PATCH v5 00/36] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
                   ` (17 preceding siblings ...)
  2025-02-19 20:34 ` [PATCH v5 18/36] vfio/migration: Add load_device_config_state_start trace event Maciej S. Szmigiero
@ 2025-02-19 20:34 ` Maciej S. Szmigiero
  2025-02-26  7:52   ` Cédric Le Goater
  2025-02-26 16:20   ` Cédric Le Goater
  2025-02-19 20:34 ` [PATCH v5 20/36] vfio/migration: Add vfio_add_bytes_transferred() Maciej S. Szmigiero
                   ` (16 subsequent siblings)
  35 siblings, 2 replies; 120+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-19 20:34 UTC (permalink / raw)
  To: Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

So it can be safely accessed from multiple threads.

This variable type needs to be changed to unsigned long since
32-bit host platforms lack the necessary addition atomics on 64-bit
variables.

Using 32-bit counters on 32-bit host platforms should not be a problem
in practice since they can't realistically address more memory anyway.
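The conversion follows a standard word-sized atomic counter pattern.  A
sketch using C11 atomics (QEMU's qatomic_add()/qatomic_read() map onto
these; the function names below are simplified, not the VFIO ones):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/*
 * "unsigned long" is word-sized, so the fetch-add is available as a
 * native atomic on 32-bit hosts too, unlike a 64-bit counter.
 */
static _Atomic unsigned long bytes_transferred;

static void add_bytes_transferred(unsigned long val)
{
    atomic_fetch_add(&bytes_transferred, val);
}

static int64_t mig_bytes_transferred(void)
{
    unsigned long v = atomic_load(&bytes_transferred);

    /* Clamp: the external API keeps returning int64_t. */
    return v > INT64_MAX ? INT64_MAX : (int64_t)v;
}

static void reset_bytes_transferred(void)
{
    atomic_store(&bytes_transferred, 0);
}
```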

Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
 hw/vfio/migration.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 03890eaa48a9..5532787be63b 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -55,7 +55,7 @@
  */
 #define VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE (1 * MiB)
 
-static int64_t bytes_transferred;
+static unsigned long bytes_transferred;
 
 static const char *mig_state_to_str(enum vfio_device_mig_state state)
 {
@@ -391,7 +391,7 @@ static ssize_t vfio_save_block(QEMUFile *f, VFIOMigration *migration)
     qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
     qemu_put_be64(f, data_size);
     qemu_put_buffer(f, migration->data_buffer, data_size);
-    bytes_transferred += data_size;
+    qatomic_add(&bytes_transferred, data_size);
 
     trace_vfio_save_block(migration->vbasedev->name, data_size);
 
@@ -1013,12 +1013,12 @@ static int vfio_block_migration(VFIODevice *vbasedev, Error *err, Error **errp)
 
 int64_t vfio_mig_bytes_transferred(void)
 {
-    return bytes_transferred;
+    return MIN(qatomic_read(&bytes_transferred), INT64_MAX);
 }
 
 void vfio_reset_bytes_transferred(void)
 {
-    bytes_transferred = 0;
+    qatomic_set(&bytes_transferred, 0);
 }
 
 /*


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v5 20/36] vfio/migration: Add vfio_add_bytes_transferred()
  2025-02-19 20:33 [PATCH v5 00/36] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
                   ` (18 preceding siblings ...)
  2025-02-19 20:34 ` [PATCH v5 19/36] vfio/migration: Convert bytes_transferred counter to atomic Maciej S. Szmigiero
@ 2025-02-19 20:34 ` Maciej S. Szmigiero
  2025-02-26  8:06   ` Cédric Le Goater
  2025-02-19 20:34 ` [PATCH v5 21/36] vfio/migration: Move migration channel flags to vfio-common.h header file Maciej S. Szmigiero
                   ` (15 subsequent siblings)
  35 siblings, 1 reply; 120+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-19 20:34 UTC (permalink / raw)
  To: Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

This way bytes_transferred can also be incremented in translation units
other than migration.c.

Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
 hw/vfio/migration.c           | 7 ++++++-
 include/hw/vfio/vfio-common.h | 1 +
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 5532787be63b..e9645cb9d088 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -391,7 +391,7 @@ static ssize_t vfio_save_block(QEMUFile *f, VFIOMigration *migration)
     qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
     qemu_put_be64(f, data_size);
     qemu_put_buffer(f, migration->data_buffer, data_size);
-    qatomic_add(&bytes_transferred, data_size);
+    vfio_add_bytes_transferred(data_size);
 
     trace_vfio_save_block(migration->vbasedev->name, data_size);
 
@@ -1021,6 +1021,11 @@ void vfio_reset_bytes_transferred(void)
     qatomic_set(&bytes_transferred, 0);
 }
 
+void vfio_add_bytes_transferred(unsigned long val)
+{
+    qatomic_add(&bytes_transferred, val);
+}
+
 /*
  * Return true when either migration initialized or blocker registered.
  * Currently only return false when adding blocker fails which will
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index ac35136a1105..70f2a1891ed1 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -274,6 +274,7 @@ void vfio_unblock_multiple_devices_migration(void);
 bool vfio_viommu_preset(VFIODevice *vbasedev);
 int64_t vfio_mig_bytes_transferred(void);
 void vfio_reset_bytes_transferred(void);
+void vfio_add_bytes_transferred(unsigned long val);
 bool vfio_device_state_is_running(VFIODevice *vbasedev);
 bool vfio_device_state_is_precopy(VFIODevice *vbasedev);
 


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v5 21/36] vfio/migration: Move migration channel flags to vfio-common.h header file
  2025-02-19 20:33 [PATCH v5 00/36] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
                   ` (19 preceding siblings ...)
  2025-02-19 20:34 ` [PATCH v5 20/36] vfio/migration: Add vfio_add_bytes_transferred() Maciej S. Szmigiero
@ 2025-02-19 20:34 ` Maciej S. Szmigiero
  2025-02-26  8:19   ` Cédric Le Goater
  2025-02-19 20:34 ` [PATCH v5 22/36] vfio/migration: Multifd device state transfer support - basic types Maciej S. Szmigiero
                   ` (14 subsequent siblings)
  35 siblings, 1 reply; 120+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-19 20:34 UTC (permalink / raw)
  To: Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

This way they can also be referenced in translation units other
than migration.c.

Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
 hw/vfio/migration.c           | 17 -----------------
 include/hw/vfio/vfio-common.h | 17 +++++++++++++++++
 2 files changed, 17 insertions(+), 17 deletions(-)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index e9645cb9d088..46adb798352f 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -31,23 +31,6 @@
 #include "trace.h"
 #include "hw/hw.h"
 
-/*
- * Flags to be used as unique delimiters for VFIO devices in the migration
- * stream. These flags are composed as:
- * 0xffffffff => MSB 32-bit all 1s
- * 0xef10     => Magic ID, represents emulated (virtual) function IO
- * 0x0000     => 16-bits reserved for flags
- *
- * The beginning of state information is marked by _DEV_CONFIG_STATE,
- * _DEV_SETUP_STATE, or _DEV_DATA_STATE, respectively. The end of a
- * certain state information is marked by _END_OF_STATE.
- */
-#define VFIO_MIG_FLAG_END_OF_STATE      (0xffffffffef100001ULL)
-#define VFIO_MIG_FLAG_DEV_CONFIG_STATE  (0xffffffffef100002ULL)
-#define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xffffffffef100003ULL)
-#define VFIO_MIG_FLAG_DEV_DATA_STATE    (0xffffffffef100004ULL)
-#define VFIO_MIG_FLAG_DEV_INIT_DATA_SENT (0xffffffffef100005ULL)
-
 /*
  * This is an arbitrary size based on migration of mlx5 devices, where typically
  * total device migration size is on the order of 100s of MB. Testing with
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 70f2a1891ed1..64ee3b1a2547 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -36,6 +36,23 @@
 
 #define VFIO_MSG_PREFIX "vfio %s: "
 
+/*
+ * Flags to be used as unique delimiters for VFIO devices in the migration
+ * stream. These flags are composed as:
+ * 0xffffffff => MSB 32-bit all 1s
+ * 0xef10     => Magic ID, represents emulated (virtual) function IO
+ * 0x0000     => 16-bits reserved for flags
+ *
+ * The beginning of state information is marked by _DEV_CONFIG_STATE,
+ * _DEV_SETUP_STATE, or _DEV_DATA_STATE, respectively. The end of a
+ * certain state information is marked by _END_OF_STATE.
+ */
+#define VFIO_MIG_FLAG_END_OF_STATE      (0xffffffffef100001ULL)
+#define VFIO_MIG_FLAG_DEV_CONFIG_STATE  (0xffffffffef100002ULL)
+#define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xffffffffef100003ULL)
+#define VFIO_MIG_FLAG_DEV_DATA_STATE    (0xffffffffef100004ULL)
+#define VFIO_MIG_FLAG_DEV_INIT_DATA_SENT (0xffffffffef100005ULL)
+
 enum {
     VFIO_DEVICE_TYPE_PCI = 0,
     VFIO_DEVICE_TYPE_PLATFORM = 1,


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v5 22/36] vfio/migration: Multifd device state transfer support - basic types
  2025-02-19 20:33 [PATCH v5 00/36] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
                   ` (20 preceding siblings ...)
  2025-02-19 20:34 ` [PATCH v5 21/36] vfio/migration: Move migration channel flags to vfio-common.h header file Maciej S. Szmigiero
@ 2025-02-19 20:34 ` Maciej S. Szmigiero
  2025-02-26  8:52   ` Cédric Le Goater
  2025-02-19 20:34 ` [PATCH v5 23/36] vfio/migration: Multifd device state transfer support - VFIOStateBuffer(s) Maciej S. Szmigiero
                   ` (13 subsequent siblings)
  35 siblings, 1 reply; 120+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-19 20:34 UTC (permalink / raw)
  To: Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

Add basic types and flags used by VFIO multifd device state transfer
support.

Since we'll be introducing a lot of multifd transfer specific code,
add a new file, migration-multifd.c, to house it, wired into the main VFIO
migration code (migration.c) via the migration-multifd.h header file.

Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
 hw/vfio/meson.build         |  1 +
 hw/vfio/migration-multifd.c | 31 +++++++++++++++++++++++++++++++
 hw/vfio/migration-multifd.h | 15 +++++++++++++++
 hw/vfio/migration.c         |  1 +
 4 files changed, 48 insertions(+)
 create mode 100644 hw/vfio/migration-multifd.c
 create mode 100644 hw/vfio/migration-multifd.h

diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
index bba776f75cc7..260d65febd6b 100644
--- a/hw/vfio/meson.build
+++ b/hw/vfio/meson.build
@@ -5,6 +5,7 @@ vfio_ss.add(files(
   'container-base.c',
   'container.c',
   'migration.c',
+  'migration-multifd.c',
   'cpr.c',
 ))
 vfio_ss.add(when: 'CONFIG_PSERIES', if_true: files('spapr.c'))
diff --git a/hw/vfio/migration-multifd.c b/hw/vfio/migration-multifd.c
new file mode 100644
index 000000000000..0c3185a26242
--- /dev/null
+++ b/hw/vfio/migration-multifd.c
@@ -0,0 +1,31 @@
+/*
+ * Multifd VFIO migration
+ *
+ * Copyright (C) 2024,2025 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "hw/vfio/vfio-common.h"
+#include "migration/misc.h"
+#include "qapi/error.h"
+#include "qemu/error-report.h"
+#include "qemu/lockable.h"
+#include "qemu/main-loop.h"
+#include "qemu/thread.h"
+#include "migration/qemu-file.h"
+#include "migration-multifd.h"
+#include "trace.h"
+
+#define VFIO_DEVICE_STATE_CONFIG_STATE (1)
+
+#define VFIO_DEVICE_STATE_PACKET_VER_CURRENT (0)
+
+typedef struct VFIODeviceStatePacket {
+    uint32_t version;
+    uint32_t idx;
+    uint32_t flags;
+    uint8_t data[0];
+} QEMU_PACKED VFIODeviceStatePacket;
diff --git a/hw/vfio/migration-multifd.h b/hw/vfio/migration-multifd.h
new file mode 100644
index 000000000000..64d117b27210
--- /dev/null
+++ b/hw/vfio/migration-multifd.h
@@ -0,0 +1,15 @@
+/*
+ * Multifd VFIO migration
+ *
+ * Copyright (C) 2024,2025 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#ifndef HW_VFIO_MIGRATION_MULTIFD_H
+#define HW_VFIO_MIGRATION_MULTIFD_H
+
+#include "hw/vfio/vfio-common.h"
+
+#endif
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 46adb798352f..7b79be6ad293 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -23,6 +23,7 @@
 #include "migration/qemu-file.h"
 #include "migration/register.h"
 #include "migration/blocker.h"
+#include "migration-multifd.h"
 #include "qapi/error.h"
 #include "qapi/qapi-events-vfio.h"
 #include "exec/ramlist.h"


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v5 23/36] vfio/migration: Multifd device state transfer support - VFIOStateBuffer(s)
  2025-02-19 20:33 [PATCH v5 00/36] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
                   ` (21 preceding siblings ...)
  2025-02-19 20:34 ` [PATCH v5 22/36] vfio/migration: Multifd device state transfer support - basic types Maciej S. Szmigiero
@ 2025-02-19 20:34 ` Maciej S. Szmigiero
  2025-02-26  8:54   ` Cédric Le Goater
  2025-03-02 13:00   ` Avihai Horon
  2025-02-19 20:34 ` [PATCH v5 24/36] vfio/migration: Multifd device state transfer - add support checking function Maciej S. Szmigiero
                   ` (12 subsequent siblings)
  35 siblings, 2 replies; 120+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-19 20:34 UTC (permalink / raw)
  To: Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

Add VFIOStateBuffer(s) types and the associated methods.

These store received device state buffers and config state waiting to get
loaded into the device.

Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
 hw/vfio/migration-multifd.c | 54 +++++++++++++++++++++++++++++++++++++
 1 file changed, 54 insertions(+)

diff --git a/hw/vfio/migration-multifd.c b/hw/vfio/migration-multifd.c
index 0c3185a26242..760b110a39b9 100644
--- a/hw/vfio/migration-multifd.c
+++ b/hw/vfio/migration-multifd.c
@@ -29,3 +29,57 @@ typedef struct VFIODeviceStatePacket {
     uint32_t flags;
     uint8_t data[0];
 } QEMU_PACKED VFIODeviceStatePacket;
+
+/* type safety */
+typedef struct VFIOStateBuffers {
+    GArray *array;
+} VFIOStateBuffers;
+
+typedef struct VFIOStateBuffer {
+    bool is_present;
+    char *data;
+    size_t len;
+} VFIOStateBuffer;
+
+static void vfio_state_buffer_clear(gpointer data)
+{
+    VFIOStateBuffer *lb = data;
+
+    if (!lb->is_present) {
+        return;
+    }
+
+    g_clear_pointer(&lb->data, g_free);
+    lb->is_present = false;
+}
+
+static void vfio_state_buffers_init(VFIOStateBuffers *bufs)
+{
+    bufs->array = g_array_new(FALSE, TRUE, sizeof(VFIOStateBuffer));
+    g_array_set_clear_func(bufs->array, vfio_state_buffer_clear);
+}
+
+static void vfio_state_buffers_destroy(VFIOStateBuffers *bufs)
+{
+    g_clear_pointer(&bufs->array, g_array_unref);
+}
+
+static void vfio_state_buffers_assert_init(VFIOStateBuffers *bufs)
+{
+    assert(bufs->array);
+}
+
+static guint vfio_state_buffers_size_get(VFIOStateBuffers *bufs)
+{
+    return bufs->array->len;
+}
+
+static void vfio_state_buffers_size_set(VFIOStateBuffers *bufs, guint size)
+{
+    g_array_set_size(bufs->array, size);
+}
+
+static VFIOStateBuffer *vfio_state_buffers_at(VFIOStateBuffers *bufs, guint idx)
+{
+    return &g_array_index(bufs->array, VFIOStateBuffer, idx);
+}


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v5 24/36] vfio/migration: Multifd device state transfer - add support checking function
  2025-02-19 20:33 [PATCH v5 00/36] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
                   ` (22 preceding siblings ...)
  2025-02-19 20:34 ` [PATCH v5 23/36] vfio/migration: Multifd device state transfer support - VFIOStateBuffer(s) Maciej S. Szmigiero
@ 2025-02-19 20:34 ` Maciej S. Szmigiero
  2025-02-26  8:54   ` Cédric Le Goater
  2025-02-19 20:34 ` [PATCH v5 25/36] vfio/migration: Multifd device state transfer support - receive init/cleanup Maciej S. Szmigiero
                   ` (11 subsequent siblings)
  35 siblings, 1 reply; 120+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-19 20:34 UTC (permalink / raw)
  To: Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

Add a vfio_multifd_transfer_supported() function that tells whether
multifd device state transfer is supported.

Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
 hw/vfio/migration-multifd.c | 6 ++++++
 hw/vfio/migration-multifd.h | 2 ++
 2 files changed, 8 insertions(+)

diff --git a/hw/vfio/migration-multifd.c b/hw/vfio/migration-multifd.c
index 760b110a39b9..7328ad8e925c 100644
--- a/hw/vfio/migration-multifd.c
+++ b/hw/vfio/migration-multifd.c
@@ -83,3 +83,9 @@ static VFIOStateBuffer *vfio_state_buffers_at(VFIOStateBuffers *bufs, guint idx)
 {
     return &g_array_index(bufs->array, VFIOStateBuffer, idx);
 }
+
+bool vfio_multifd_transfer_supported(void)
+{
+    return multifd_device_state_supported() &&
+        migrate_send_switchover_start();
+}
diff --git a/hw/vfio/migration-multifd.h b/hw/vfio/migration-multifd.h
index 64d117b27210..8fe004c1da81 100644
--- a/hw/vfio/migration-multifd.h
+++ b/hw/vfio/migration-multifd.h
@@ -12,4 +12,6 @@
 
 #include "hw/vfio/vfio-common.h"
 
+bool vfio_multifd_transfer_supported(void);
+
 #endif


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v5 25/36] vfio/migration: Multifd device state transfer support - receive init/cleanup
  2025-02-19 20:33 [PATCH v5 00/36] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
                   ` (23 preceding siblings ...)
  2025-02-19 20:34 ` [PATCH v5 24/36] vfio/migration: Multifd device state transfer - add support checking function Maciej S. Szmigiero
@ 2025-02-19 20:34 ` Maciej S. Szmigiero
  2025-02-26 10:14   ` Cédric Le Goater
                     ` (2 more replies)
  2025-02-19 20:34 ` [PATCH v5 26/36] vfio/migration: Multifd device state transfer support - received buffers queuing Maciej S. Szmigiero
                   ` (10 subsequent siblings)
  35 siblings, 3 replies; 120+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-19 20:34 UTC (permalink / raw)
  To: Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

Add the VFIOMultifd data structure, which will contain most of the
receive-side data, together with its init/cleanup methods.

Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
 hw/vfio/migration-multifd.c   | 33 +++++++++++++++++++++++++++++++++
 hw/vfio/migration-multifd.h   |  8 ++++++++
 hw/vfio/migration.c           | 29 +++++++++++++++++++++++++++--
 include/hw/vfio/vfio-common.h |  3 +++
 4 files changed, 71 insertions(+), 2 deletions(-)

diff --git a/hw/vfio/migration-multifd.c b/hw/vfio/migration-multifd.c
index 7328ad8e925c..c2defc0efef0 100644
--- a/hw/vfio/migration-multifd.c
+++ b/hw/vfio/migration-multifd.c
@@ -41,6 +41,9 @@ typedef struct VFIOStateBuffer {
     size_t len;
 } VFIOStateBuffer;
 
+typedef struct VFIOMultifd {
+} VFIOMultifd;
+
 static void vfio_state_buffer_clear(gpointer data)
 {
     VFIOStateBuffer *lb = data;
@@ -84,8 +87,38 @@ static VFIOStateBuffer *vfio_state_buffers_at(VFIOStateBuffers *bufs, guint idx)
     return &g_array_index(bufs->array, VFIOStateBuffer, idx);
 }
 
+VFIOMultifd *vfio_multifd_new(void)
+{
+    VFIOMultifd *multifd = g_new(VFIOMultifd, 1);
+
+    return multifd;
+}
+
+void vfio_multifd_free(VFIOMultifd *multifd)
+{
+    g_free(multifd);
+}
+
 bool vfio_multifd_transfer_supported(void)
 {
     return multifd_device_state_supported() &&
         migrate_send_switchover_start();
 }
+
+bool vfio_multifd_transfer_enabled(VFIODevice *vbasedev)
+{
+    return false;
+}
+
+bool vfio_multifd_transfer_setup(VFIODevice *vbasedev, Error **errp)
+{
+    if (vfio_multifd_transfer_enabled(vbasedev) &&
+        !vfio_multifd_transfer_supported()) {
+        error_setg(errp,
+                   "%s: Multifd device transfer requested but unsupported in the current config",
+                   vbasedev->name);
+        return false;
+    }
+
+    return true;
+}
diff --git a/hw/vfio/migration-multifd.h b/hw/vfio/migration-multifd.h
index 8fe004c1da81..1eefba3b2eed 100644
--- a/hw/vfio/migration-multifd.h
+++ b/hw/vfio/migration-multifd.h
@@ -12,6 +12,14 @@
 
 #include "hw/vfio/vfio-common.h"
 
+typedef struct VFIOMultifd VFIOMultifd;
+
+VFIOMultifd *vfio_multifd_new(void);
+void vfio_multifd_free(VFIOMultifd *multifd);
+
 bool vfio_multifd_transfer_supported(void);
+bool vfio_multifd_transfer_enabled(VFIODevice *vbasedev);
+
+bool vfio_multifd_transfer_setup(VFIODevice *vbasedev, Error **errp);
 
 #endif
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 7b79be6ad293..4311de763885 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -674,15 +674,40 @@ static void vfio_save_state(QEMUFile *f, void *opaque)
 static int vfio_load_setup(QEMUFile *f, void *opaque, Error **errp)
 {
     VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+    int ret;
+
+    if (!vfio_multifd_transfer_setup(vbasedev, errp)) {
+        return -EINVAL;
+    }
+
+    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RESUMING,
+                                   migration->device_state, errp);
+    if (ret) {
+        return ret;
+    }
 
-    return vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RESUMING,
-                                    vbasedev->migration->device_state, errp);
+    if (vfio_multifd_transfer_enabled(vbasedev)) {
+        assert(!migration->multifd);
+        migration->multifd = vfio_multifd_new();
+    }
+
+    return 0;
+}
+
+static void vfio_multifd_cleanup(VFIODevice *vbasedev)
+{
+    VFIOMigration *migration = vbasedev->migration;
+
+    g_clear_pointer(&migration->multifd, vfio_multifd_free);
 }
 
 static int vfio_load_cleanup(void *opaque)
 {
     VFIODevice *vbasedev = opaque;
 
+    vfio_multifd_cleanup(vbasedev);
+
     vfio_migration_cleanup(vbasedev);
     trace_vfio_load_cleanup(vbasedev->name);
 
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 64ee3b1a2547..ab110198bd6b 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -78,6 +78,8 @@ typedef struct VFIORegion {
     uint8_t nr; /* cache the region number for debug */
 } VFIORegion;
 
+typedef struct VFIOMultifd VFIOMultifd;
+
 typedef struct VFIOMigration {
     struct VFIODevice *vbasedev;
     VMChangeStateEntry *vm_state;
@@ -89,6 +91,7 @@ typedef struct VFIOMigration {
     uint64_t mig_flags;
     uint64_t precopy_init_size;
     uint64_t precopy_dirty_size;
+    VFIOMultifd *multifd;
     bool initial_data_sent;
 
     bool event_save_iterate_started;


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v5 26/36] vfio/migration: Multifd device state transfer support - received buffers queuing
  2025-02-19 20:33 [PATCH v5 00/36] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
                   ` (24 preceding siblings ...)
  2025-02-19 20:34 ` [PATCH v5 25/36] vfio/migration: Multifd device state transfer support - receive init/cleanup Maciej S. Szmigiero
@ 2025-02-19 20:34 ` Maciej S. Szmigiero
  2025-02-26 10:43   ` Cédric Le Goater
  2025-03-02 13:12   ` Avihai Horon
  2025-02-19 20:34 ` [PATCH v5 27/36] vfio/migration: Multifd device state transfer support - load thread Maciej S. Szmigiero
                   ` (9 subsequent siblings)
  35 siblings, 2 replies; 120+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-19 20:34 UTC (permalink / raw)
  To: Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

The data received via multifd needs to be reassembled, since device state
packets sent over different multifd channels can arrive out of order.

Therefore, each VFIO device state packet carries a header indicating its
position in the stream.
The raw device state data is saved into a VFIOStateBuffer for later
in-order loading into the device.

The last such VFIO device state packet should have
VFIO_DEVICE_STATE_CONFIG_STATE flag set and carry the device config state.

Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
 hw/vfio/migration-multifd.c | 103 ++++++++++++++++++++++++++++++++++++
 hw/vfio/migration-multifd.h |   3 ++
 hw/vfio/migration.c         |   1 +
 hw/vfio/trace-events        |   1 +
 4 files changed, 108 insertions(+)

diff --git a/hw/vfio/migration-multifd.c b/hw/vfio/migration-multifd.c
index c2defc0efef0..5d5ee1393674 100644
--- a/hw/vfio/migration-multifd.c
+++ b/hw/vfio/migration-multifd.c
@@ -42,6 +42,11 @@ typedef struct VFIOStateBuffer {
 } VFIOStateBuffer;
 
 typedef struct VFIOMultifd {
+    VFIOStateBuffers load_bufs;
+    QemuCond load_bufs_buffer_ready_cond;
+    QemuMutex load_bufs_mutex; /* Lock order: this lock -> BQL */
+    uint32_t load_buf_idx;
+    uint32_t load_buf_idx_last;
 } VFIOMultifd;
 
 static void vfio_state_buffer_clear(gpointer data)
@@ -87,15 +92,113 @@ static VFIOStateBuffer *vfio_state_buffers_at(VFIOStateBuffers *bufs, guint idx)
     return &g_array_index(bufs->array, VFIOStateBuffer, idx);
 }
 
+static bool vfio_load_state_buffer_insert(VFIODevice *vbasedev,
+                                          VFIODeviceStatePacket *packet,
+                                          size_t packet_total_size,
+                                          Error **errp)
+{
+    VFIOMigration *migration = vbasedev->migration;
+    VFIOMultifd *multifd = migration->multifd;
+    VFIOStateBuffer *lb;
+
+    vfio_state_buffers_assert_init(&multifd->load_bufs);
+    if (packet->idx >= vfio_state_buffers_size_get(&multifd->load_bufs)) {
+        vfio_state_buffers_size_set(&multifd->load_bufs, packet->idx + 1);
+    }
+
+    lb = vfio_state_buffers_at(&multifd->load_bufs, packet->idx);
+    if (lb->is_present) {
+        error_setg(errp, "state buffer %" PRIu32 " already filled",
+                   packet->idx);
+        return false;
+    }
+
+    assert(packet->idx >= multifd->load_buf_idx);
+
+    lb->data = g_memdup2(&packet->data, packet_total_size - sizeof(*packet));
+    lb->len = packet_total_size - sizeof(*packet);
+    lb->is_present = true;
+
+    return true;
+}
+
+bool vfio_load_state_buffer(void *opaque, char *data, size_t data_size,
+                            Error **errp)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+    VFIOMultifd *multifd = migration->multifd;
+    VFIODeviceStatePacket *packet = (VFIODeviceStatePacket *)data;
+
+    /*
+     * Holding BQL here would violate the lock order and can cause
+     * a deadlock once we attempt to lock load_bufs_mutex below.
+     */
+    assert(!bql_locked());
+
+    if (!vfio_multifd_transfer_enabled(vbasedev)) {
+        error_setg(errp,
+                   "got device state packet but not doing multifd transfer");
+        return false;
+    }
+
+    assert(multifd);
+
+    if (data_size < sizeof(*packet)) {
+        error_setg(errp, "packet too short at %zu (min is %zu)",
+                   data_size, sizeof(*packet));
+        return false;
+    }
+
+    if (packet->version != VFIO_DEVICE_STATE_PACKET_VER_CURRENT) {
+        error_setg(errp, "packet has unknown version %" PRIu32,
+                   packet->version);
+        return false;
+    }
+
+    if (packet->idx == UINT32_MAX) {
+        error_setg(errp, "packet has too high idx");
+        return false;
+    }
+
+    trace_vfio_load_state_device_buffer_incoming(vbasedev->name, packet->idx);
+
+    QEMU_LOCK_GUARD(&multifd->load_bufs_mutex);
+
+    /* config state packet should be the last one in the stream */
+    if (packet->flags & VFIO_DEVICE_STATE_CONFIG_STATE) {
+        multifd->load_buf_idx_last = packet->idx;
+    }
+
+    if (!vfio_load_state_buffer_insert(vbasedev, packet, data_size, errp)) {
+        return false;
+    }
+
+    qemu_cond_signal(&multifd->load_bufs_buffer_ready_cond);
+
+    return true;
+}
+
 VFIOMultifd *vfio_multifd_new(void)
 {
     VFIOMultifd *multifd = g_new(VFIOMultifd, 1);
 
+    vfio_state_buffers_init(&multifd->load_bufs);
+
+    qemu_mutex_init(&multifd->load_bufs_mutex);
+
+    multifd->load_buf_idx = 0;
+    multifd->load_buf_idx_last = UINT32_MAX;
+    qemu_cond_init(&multifd->load_bufs_buffer_ready_cond);
+
     return multifd;
 }
 
 void vfio_multifd_free(VFIOMultifd *multifd)
 {
+    qemu_cond_destroy(&multifd->load_bufs_buffer_ready_cond);
+    qemu_mutex_destroy(&multifd->load_bufs_mutex);
+
     g_free(multifd);
 }
 
diff --git a/hw/vfio/migration-multifd.h b/hw/vfio/migration-multifd.h
index 1eefba3b2eed..d5ab7d6f85f5 100644
--- a/hw/vfio/migration-multifd.h
+++ b/hw/vfio/migration-multifd.h
@@ -22,4 +22,7 @@ bool vfio_multifd_transfer_enabled(VFIODevice *vbasedev);
 
 bool vfio_multifd_transfer_setup(VFIODevice *vbasedev, Error **errp);
 
+bool vfio_load_state_buffer(void *opaque, char *data, size_t data_size,
+                            Error **errp);
+
 #endif
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 4311de763885..abaf4d08d4a9 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -806,6 +806,7 @@ static const SaveVMHandlers savevm_vfio_handlers = {
     .load_setup = vfio_load_setup,
     .load_cleanup = vfio_load_cleanup,
     .load_state = vfio_load_state,
+    .load_state_buffer = vfio_load_state_buffer,
     .switchover_ack_needed = vfio_switchover_ack_needed,
 };
 
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 1bebe9877d88..042a3dc54a33 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -153,6 +153,7 @@ vfio_load_device_config_state_start(const char *name) " (%s)"
 vfio_load_device_config_state_end(const char *name) " (%s)"
 vfio_load_state(const char *name, uint64_t data) " (%s) data 0x%"PRIx64
 vfio_load_state_device_data(const char *name, uint64_t data_size, int ret) " (%s) size %"PRIu64" ret %d"
+vfio_load_state_device_buffer_incoming(const char *name, uint32_t idx) " (%s) idx %"PRIu32
 vfio_migration_realize(const char *name) " (%s)"
 vfio_migration_set_device_state(const char *name, const char *state) " (%s) state %s"
 vfio_migration_set_state(const char *name, const char *new_state, const char *recover_state) " (%s) new state %s, recover state %s"


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v5 27/36] vfio/migration: Multifd device state transfer support - load thread
  2025-02-19 20:33 [PATCH v5 00/36] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
                   ` (25 preceding siblings ...)
  2025-02-19 20:34 ` [PATCH v5 26/36] vfio/migration: Multifd device state transfer support - received buffers queuing Maciej S. Szmigiero
@ 2025-02-19 20:34 ` Maciej S. Szmigiero
  2025-02-26 13:49   ` Cédric Le Goater
  2025-03-02 14:15   ` Avihai Horon
  2025-02-19 20:34 ` [PATCH v5 28/36] vfio/migration: Multifd device state transfer support - config loading support Maciej S. Szmigiero
                   ` (8 subsequent siblings)
  35 siblings, 2 replies; 120+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-19 20:34 UTC (permalink / raw)
  To: Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

Since it's important to finish loading the device state transferred via the
main migration channel (via the save_live_iterate SaveVMHandler) before
starting to load the data asynchronously transferred via multifd, the
thread doing the actual loading of the multifd-transferred data is only
started from the switchover_start SaveVMHandler.

The switchover_start handler is called when the MIG_CMD_SWITCHOVER_START
sub-command of QEMU_VM_COMMAND is received via the main migration channel.

This sub-command is only sent after all save_live_iterate data have already
been posted, so it is safe to commence loading of the multifd-transferred
device state upon receiving it. Loading of save_live_iterate data happens
synchronously in the main migration thread (much like the processing of
MIG_CMD_SWITCHOVER_START), so by the time MIG_CMD_SWITCHOVER_START is
processed, all the preceding data must have already been loaded.

Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
 hw/vfio/migration-multifd.c | 225 ++++++++++++++++++++++++++++++++++++
 hw/vfio/migration-multifd.h |   2 +
 hw/vfio/migration.c         |  12 ++
 hw/vfio/trace-events        |   5 +
 4 files changed, 244 insertions(+)

diff --git a/hw/vfio/migration-multifd.c b/hw/vfio/migration-multifd.c
index 5d5ee1393674..b3a88c062769 100644
--- a/hw/vfio/migration-multifd.c
+++ b/hw/vfio/migration-multifd.c
@@ -42,8 +42,13 @@ typedef struct VFIOStateBuffer {
 } VFIOStateBuffer;
 
 typedef struct VFIOMultifd {
+    QemuThread load_bufs_thread;
+    bool load_bufs_thread_running;
+    bool load_bufs_thread_want_exit;
+
     VFIOStateBuffers load_bufs;
     QemuCond load_bufs_buffer_ready_cond;
+    QemuCond load_bufs_thread_finished_cond;
     QemuMutex load_bufs_mutex; /* Lock order: this lock -> BQL */
     uint32_t load_buf_idx;
     uint32_t load_buf_idx_last;
@@ -179,6 +184,175 @@ bool vfio_load_state_buffer(void *opaque, char *data, size_t data_size,
     return true;
 }
 
+static int vfio_load_bufs_thread_load_config(VFIODevice *vbasedev)
+{
+    return -EINVAL;
+}
+
+static VFIOStateBuffer *vfio_load_state_buffer_get(VFIOMultifd *multifd)
+{
+    VFIOStateBuffer *lb;
+    guint bufs_len;
+
+    bufs_len = vfio_state_buffers_size_get(&multifd->load_bufs);
+    if (multifd->load_buf_idx >= bufs_len) {
+        assert(multifd->load_buf_idx == bufs_len);
+        return NULL;
+    }
+
+    lb = vfio_state_buffers_at(&multifd->load_bufs,
+                               multifd->load_buf_idx);
+    if (!lb->is_present) {
+        return NULL;
+    }
+
+    return lb;
+}
+
+static bool vfio_load_state_buffer_write(VFIODevice *vbasedev,
+                                         VFIOStateBuffer *lb,
+                                         Error **errp)
+{
+    VFIOMigration *migration = vbasedev->migration;
+    VFIOMultifd *multifd = migration->multifd;
+    g_autofree char *buf = NULL;
+    char *buf_cur;
+    size_t buf_len;
+
+    if (!lb->len) {
+        return true;
+    }
+
+    trace_vfio_load_state_device_buffer_load_start(vbasedev->name,
+                                                   multifd->load_buf_idx);
+
+    /* lb might become re-allocated when we drop the lock */
+    buf = g_steal_pointer(&lb->data);
+    buf_cur = buf;
+    buf_len = lb->len;
+    while (buf_len > 0) {
+        ssize_t wr_ret;
+        int errno_save;
+
+        /*
+         * Loading data to the device takes a while,
+         * drop the lock during this process.
+         */
+        qemu_mutex_unlock(&multifd->load_bufs_mutex);
+        wr_ret = write(migration->data_fd, buf_cur, buf_len);
+        errno_save = errno;
+        qemu_mutex_lock(&multifd->load_bufs_mutex);
+
+        if (wr_ret < 0) {
+            error_setg(errp,
+                       "writing state buffer %" PRIu32 " failed: %d",
+                       multifd->load_buf_idx, errno_save);
+            return false;
+        }
+
+        assert(wr_ret <= buf_len);
+        buf_len -= wr_ret;
+        buf_cur += wr_ret;
+    }
+
+    trace_vfio_load_state_device_buffer_load_end(vbasedev->name,
+                                                 multifd->load_buf_idx);
+
+    return true;
+}
+
+static bool vfio_load_bufs_thread_want_exit(VFIOMultifd *multifd,
+                                            bool *should_quit)
+{
+    return multifd->load_bufs_thread_want_exit || qatomic_read(should_quit);
+}
+
+/*
+ * This thread is spawned by vfio_multifd_switchover_start() which gets
+ * called upon encountering the switchover point marker in main migration
+ * stream.
+ *
+ * It exits after either:
+ * * completing loading the remaining device state and device config, OR:
+ * * encountering some error while doing the above, OR:
+ * * being forcefully aborted by the migration core by it setting should_quit
+ *   or by vfio_load_cleanup_load_bufs_thread() setting
+ *   multifd->load_bufs_thread_want_exit.
+ */
+static bool vfio_load_bufs_thread(void *opaque, bool *should_quit, Error **errp)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+    VFIOMultifd *multifd = migration->multifd;
+    bool ret = true;
+    int config_ret;
+
+    assert(multifd);
+    QEMU_LOCK_GUARD(&multifd->load_bufs_mutex);
+
+    assert(multifd->load_bufs_thread_running);
+
+    while (true) {
+        VFIOStateBuffer *lb;
+
+        /*
+         * Always check cancellation first after the buffer_ready wait below in
+         * case that cond was signalled by vfio_load_cleanup_load_bufs_thread().
+         */
+        if (vfio_load_bufs_thread_want_exit(multifd, should_quit)) {
+            error_setg(errp, "operation cancelled");
+            ret = false;
+            goto ret_signal;
+        }
+
+        assert(multifd->load_buf_idx <= multifd->load_buf_idx_last);
+
+        lb = vfio_load_state_buffer_get(multifd);
+        if (!lb) {
+            trace_vfio_load_state_device_buffer_starved(vbasedev->name,
+                                                        multifd->load_buf_idx);
+            qemu_cond_wait(&multifd->load_bufs_buffer_ready_cond,
+                           &multifd->load_bufs_mutex);
+            continue;
+        }
+
+        if (multifd->load_buf_idx == multifd->load_buf_idx_last) {
+            break;
+        }
+
+        if (multifd->load_buf_idx == 0) {
+            trace_vfio_load_state_device_buffer_start(vbasedev->name);
+        }
+
+        if (!vfio_load_state_buffer_write(vbasedev, lb, errp)) {
+            ret = false;
+            goto ret_signal;
+        }
+
+        if (multifd->load_buf_idx == multifd->load_buf_idx_last - 1) {
+            trace_vfio_load_state_device_buffer_end(vbasedev->name);
+        }
+
+        multifd->load_buf_idx++;
+    }
+
+    config_ret = vfio_load_bufs_thread_load_config(vbasedev);
+    if (config_ret) {
+        error_setg(errp, "load config state failed: %d", config_ret);
+        ret = false;
+    }
+
+ret_signal:
+    /*
+     * Notify possibly waiting vfio_load_cleanup_load_bufs_thread() that
+     * this thread is exiting.
+     */
+    multifd->load_bufs_thread_running = false;
+    qemu_cond_signal(&multifd->load_bufs_thread_finished_cond);
+
+    return ret;
+}
+
 VFIOMultifd *vfio_multifd_new(void)
 {
     VFIOMultifd *multifd = g_new(VFIOMultifd, 1);
@@ -191,11 +365,42 @@ VFIOMultifd *vfio_multifd_new(void)
     multifd->load_buf_idx_last = UINT32_MAX;
     qemu_cond_init(&multifd->load_bufs_buffer_ready_cond);
 
+    multifd->load_bufs_thread_running = false;
+    multifd->load_bufs_thread_want_exit = false;
+    qemu_cond_init(&multifd->load_bufs_thread_finished_cond);
+
     return multifd;
 }
 
+/*
+ * Terminates vfio_load_bufs_thread by setting
+ * multifd->load_bufs_thread_want_exit and signalling all the conditions
+ * the thread could be blocked on.
+ *
+ * Waits for the thread to signal that it had finished.
+ */
+static void vfio_load_cleanup_load_bufs_thread(VFIOMultifd *multifd)
+{
+    /* The lock order is load_bufs_mutex -> BQL so unlock BQL here first */
+    bql_unlock();
+    WITH_QEMU_LOCK_GUARD(&multifd->load_bufs_mutex) {
+        while (multifd->load_bufs_thread_running) {
+            multifd->load_bufs_thread_want_exit = true;
+
+            qemu_cond_signal(&multifd->load_bufs_buffer_ready_cond);
+            qemu_cond_wait(&multifd->load_bufs_thread_finished_cond,
+                           &multifd->load_bufs_mutex);
+        }
+    }
+    bql_lock();
+}
+
 void vfio_multifd_free(VFIOMultifd *multifd)
 {
+    vfio_load_cleanup_load_bufs_thread(multifd);
+
+    qemu_cond_destroy(&multifd->load_bufs_thread_finished_cond);
+    vfio_state_buffers_destroy(&multifd->load_bufs);
     qemu_cond_destroy(&multifd->load_bufs_buffer_ready_cond);
     qemu_mutex_destroy(&multifd->load_bufs_mutex);
 
@@ -225,3 +430,23 @@ bool vfio_multifd_transfer_setup(VFIODevice *vbasedev, Error **errp)
 
     return true;
 }
+
+int vfio_multifd_switchover_start(VFIODevice *vbasedev)
+{
+    VFIOMigration *migration = vbasedev->migration;
+    VFIOMultifd *multifd = migration->multifd;
+
+    assert(multifd);
+
+    /* The lock order is load_bufs_mutex -> BQL so unlock BQL here first */
+    bql_unlock();
+    WITH_QEMU_LOCK_GUARD(&multifd->load_bufs_mutex) {
+        assert(!multifd->load_bufs_thread_running);
+        multifd->load_bufs_thread_running = true;
+    }
+    bql_lock();
+
+    qemu_loadvm_start_load_thread(vfio_load_bufs_thread, vbasedev);
+
+    return 0;
+}
diff --git a/hw/vfio/migration-multifd.h b/hw/vfio/migration-multifd.h
index d5ab7d6f85f5..09cbb437d9d1 100644
--- a/hw/vfio/migration-multifd.h
+++ b/hw/vfio/migration-multifd.h
@@ -25,4 +25,6 @@ bool vfio_multifd_transfer_setup(VFIODevice *vbasedev, Error **errp);
 bool vfio_load_state_buffer(void *opaque, char *data, size_t data_size,
                             Error **errp);
 
+int vfio_multifd_switchover_start(VFIODevice *vbasedev);
+
 #endif
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index abaf4d08d4a9..85f54cb22df2 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -793,6 +793,17 @@ static bool vfio_switchover_ack_needed(void *opaque)
     return vfio_precopy_supported(vbasedev);
 }
 
+static int vfio_switchover_start(void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+
+    if (vfio_multifd_transfer_enabled(vbasedev)) {
+        return vfio_multifd_switchover_start(vbasedev);
+    }
+
+    return 0;
+}
+
 static const SaveVMHandlers savevm_vfio_handlers = {
     .save_prepare = vfio_save_prepare,
     .save_setup = vfio_save_setup,
@@ -808,6 +819,7 @@ static const SaveVMHandlers savevm_vfio_handlers = {
     .load_state = vfio_load_state,
     .load_state_buffer = vfio_load_state_buffer,
     .switchover_ack_needed = vfio_switchover_ack_needed,
+    .switchover_start = vfio_switchover_start,
 };
 
 /* ---------------------------------------------------------------------- */
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 042a3dc54a33..418b378ebd29 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -154,6 +154,11 @@ vfio_load_device_config_state_end(const char *name) " (%s)"
 vfio_load_state(const char *name, uint64_t data) " (%s) data 0x%"PRIx64
 vfio_load_state_device_data(const char *name, uint64_t data_size, int ret) " (%s) size %"PRIu64" ret %d"
 vfio_load_state_device_buffer_incoming(const char *name, uint32_t idx) " (%s) idx %"PRIu32
+vfio_load_state_device_buffer_start(const char *name) " (%s)"
+vfio_load_state_device_buffer_starved(const char *name, uint32_t idx) " (%s) idx %"PRIu32
+vfio_load_state_device_buffer_load_start(const char *name, uint32_t idx) " (%s) idx %"PRIu32
+vfio_load_state_device_buffer_load_end(const char *name, uint32_t idx) " (%s) idx %"PRIu32
+vfio_load_state_device_buffer_end(const char *name) " (%s)"
 vfio_migration_realize(const char *name) " (%s)"
 vfio_migration_set_device_state(const char *name, const char *state) " (%s) state %s"
 vfio_migration_set_state(const char *name, const char *new_state, const char *recover_state) " (%s) new state %s, recover state %s"


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v5 28/36] vfio/migration: Multifd device state transfer support - config loading support
  2025-02-19 20:33 [PATCH v5 00/36] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
                   ` (26 preceding siblings ...)
  2025-02-19 20:34 ` [PATCH v5 27/36] vfio/migration: Multifd device state transfer support - load thread Maciej S. Szmigiero
@ 2025-02-19 20:34 ` Maciej S. Szmigiero
  2025-02-26 13:52   ` Cédric Le Goater
  2025-03-02 14:25   ` Avihai Horon
  2025-02-19 20:34 ` [PATCH v5 29/36] migration/qemu-file: Define g_autoptr() cleanup function for QEMUFile Maciej S. Szmigiero
                   ` (7 subsequent siblings)
  35 siblings, 2 replies; 120+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-19 20:34 UTC (permalink / raw)
  To: Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

Load the device config received via multifd using the existing machinery
behind vfio_load_device_config_state().

Also, make sure to process the relevant main migration channel flags.

Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
 hw/vfio/migration-multifd.c   | 47 ++++++++++++++++++++++++++++++++++-
 hw/vfio/migration.c           |  8 +++++-
 include/hw/vfio/vfio-common.h |  2 ++
 3 files changed, 55 insertions(+), 2 deletions(-)

diff --git a/hw/vfio/migration-multifd.c b/hw/vfio/migration-multifd.c
index b3a88c062769..7200f6f1c2a2 100644
--- a/hw/vfio/migration-multifd.c
+++ b/hw/vfio/migration-multifd.c
@@ -15,6 +15,7 @@
 #include "qemu/lockable.h"
 #include "qemu/main-loop.h"
 #include "qemu/thread.h"
+#include "io/channel-buffer.h"
 #include "migration/qemu-file.h"
 #include "migration-multifd.h"
 #include "trace.h"
@@ -186,7 +187,51 @@ bool vfio_load_state_buffer(void *opaque, char *data, size_t data_size,
 
 static int vfio_load_bufs_thread_load_config(VFIODevice *vbasedev)
 {
-    return -EINVAL;
+    VFIOMigration *migration = vbasedev->migration;
+    VFIOMultifd *multifd = migration->multifd;
+    VFIOStateBuffer *lb;
+    g_autoptr(QIOChannelBuffer) bioc = NULL;
+    QEMUFile *f_out = NULL, *f_in = NULL;
+    uint64_t mig_header;
+    int ret;
+
+    assert(multifd->load_buf_idx == multifd->load_buf_idx_last);
+    lb = vfio_state_buffers_at(&multifd->load_bufs, multifd->load_buf_idx);
+    assert(lb->is_present);
+
+    bioc = qio_channel_buffer_new(lb->len);
+    qio_channel_set_name(QIO_CHANNEL(bioc), "vfio-device-config-load");
+
+    f_out = qemu_file_new_output(QIO_CHANNEL(bioc));
+    qemu_put_buffer(f_out, (uint8_t *)lb->data, lb->len);
+
+    ret = qemu_fflush(f_out);
+    if (ret) {
+        g_clear_pointer(&f_out, qemu_fclose);
+        return ret;
+    }
+
+    qio_channel_io_seek(QIO_CHANNEL(bioc), 0, 0, NULL);
+    f_in = qemu_file_new_input(QIO_CHANNEL(bioc));
+
+    mig_header = qemu_get_be64(f_in);
+    if (mig_header != VFIO_MIG_FLAG_DEV_CONFIG_STATE) {
+        g_clear_pointer(&f_out, qemu_fclose);
+        g_clear_pointer(&f_in, qemu_fclose);
+        return -EINVAL;
+    }
+
+    bql_lock();
+    ret = vfio_load_device_config_state(f_in, vbasedev);
+    bql_unlock();
+
+    g_clear_pointer(&f_out, qemu_fclose);
+    g_clear_pointer(&f_in, qemu_fclose);
+    if (ret < 0) {
+        return ret;
+    }
+
+    return 0;
 }
 
 static VFIOStateBuffer *vfio_load_state_buffer_get(VFIOMultifd *multifd)
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 85f54cb22df2..b962309f7c27 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -264,7 +264,7 @@ static int vfio_save_device_config_state(QEMUFile *f, void *opaque,
     return ret;
 }
 
-static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
+int vfio_load_device_config_state(QEMUFile *f, void *opaque)
 {
     VFIODevice *vbasedev = opaque;
     uint64_t data;
@@ -728,6 +728,12 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
         switch (data) {
         case VFIO_MIG_FLAG_DEV_CONFIG_STATE:
         {
+            if (vfio_multifd_transfer_enabled(vbasedev)) {
+                error_report("%s: got DEV_CONFIG_STATE but doing multifd transfer",
+                             vbasedev->name);
+                return -EINVAL;
+            }
+
             return vfio_load_device_config_state(f, opaque);
         }
         case VFIO_MIG_FLAG_DEV_SETUP_STATE:
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index ab110198bd6b..ce2bdea8a2c2 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -298,6 +298,8 @@ void vfio_add_bytes_transferred(unsigned long val);
 bool vfio_device_state_is_running(VFIODevice *vbasedev);
 bool vfio_device_state_is_precopy(VFIODevice *vbasedev);
 
+int vfio_load_device_config_state(QEMUFile *f, void *opaque);
+
 #ifdef CONFIG_LINUX
 int vfio_get_region_info(VFIODevice *vbasedev, int index,
                          struct vfio_region_info **info);



* [PATCH v5 29/36] migration/qemu-file: Define g_autoptr() cleanup function for QEMUFile
  2025-02-19 20:33 [PATCH v5 00/36] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
                   ` (27 preceding siblings ...)
  2025-02-19 20:34 ` [PATCH v5 28/36] vfio/migration: Multifd device state transfer support - config loading support Maciej S. Szmigiero
@ 2025-02-19 20:34 ` Maciej S. Szmigiero
  2025-02-19 20:34 ` [PATCH v5 30/36] vfio/migration: Multifd device state transfer support - send side Maciej S. Szmigiero
                   ` (6 subsequent siblings)
  35 siblings, 0 replies; 120+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-19 20:34 UTC (permalink / raw)
  To: Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

Automatic memory management helps avoid memory safety issues.

Reviewed-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
 migration/qemu-file.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/migration/qemu-file.h b/migration/qemu-file.h
index 3e47a20621a7..f5b9f430e04b 100644
--- a/migration/qemu-file.h
+++ b/migration/qemu-file.h
@@ -33,6 +33,8 @@ QEMUFile *qemu_file_new_input(QIOChannel *ioc);
 QEMUFile *qemu_file_new_output(QIOChannel *ioc);
 int qemu_fclose(QEMUFile *f);
 
+G_DEFINE_AUTOPTR_CLEANUP_FUNC(QEMUFile, qemu_fclose)
+
 /*
  * qemu_file_transferred:
  *



* [PATCH v5 30/36] vfio/migration: Multifd device state transfer support - send side
  2025-02-19 20:33 [PATCH v5 00/36] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
                   ` (28 preceding siblings ...)
  2025-02-19 20:34 ` [PATCH v5 29/36] migration/qemu-file: Define g_autoptr() cleanup function for QEMUFile Maciej S. Szmigiero
@ 2025-02-19 20:34 ` Maciej S. Szmigiero
  2025-02-26 16:43   ` Cédric Le Goater
  2025-03-02 14:41   ` Avihai Horon
  2025-02-19 20:34 ` [PATCH v5 31/36] vfio/migration: Add x-migration-multifd-transfer VFIO property Maciej S. Szmigiero
                   ` (5 subsequent siblings)
  35 siblings, 2 replies; 120+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-19 20:34 UTC (permalink / raw)
  To: Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

Implement the multifd device state transfer via an additional per-device
thread inside the save_live_complete_precopy_thread handler.

Switch between doing the data transfer in the new handler and doing it
in the old save_state handler depending on the
x-migration-multifd-transfer device property value.

Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
 hw/vfio/migration-multifd.c   | 139 ++++++++++++++++++++++++++++++++++
 hw/vfio/migration-multifd.h   |   5 ++
 hw/vfio/migration.c           |  26 +++++--
 hw/vfio/trace-events          |   2 +
 include/hw/vfio/vfio-common.h |   8 ++
 5 files changed, 174 insertions(+), 6 deletions(-)

diff --git a/hw/vfio/migration-multifd.c b/hw/vfio/migration-multifd.c
index 7200f6f1c2a2..0cfa9d31732a 100644
--- a/hw/vfio/migration-multifd.c
+++ b/hw/vfio/migration-multifd.c
@@ -476,6 +476,145 @@ bool vfio_multifd_transfer_setup(VFIODevice *vbasedev, Error **errp)
     return true;
 }
 
+void vfio_multifd_emit_dummy_eos(VFIODevice *vbasedev, QEMUFile *f)
+{
+    assert(vfio_multifd_transfer_enabled(vbasedev));
+
+    /*
+     * Emit dummy NOP data on the main migration channel since the actual
+     * device state transfer is done via multifd channels.
+     */
+    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
+}
+
+static bool
+vfio_save_complete_precopy_thread_config_state(VFIODevice *vbasedev,
+                                               char *idstr,
+                                               uint32_t instance_id,
+                                               uint32_t idx,
+                                               Error **errp)
+{
+    g_autoptr(QIOChannelBuffer) bioc = NULL;
+    g_autoptr(QEMUFile) f = NULL;
+    int ret;
+    g_autofree VFIODeviceStatePacket *packet = NULL;
+    size_t packet_len;
+
+    bioc = qio_channel_buffer_new(0);
+    qio_channel_set_name(QIO_CHANNEL(bioc), "vfio-device-config-save");
+
+    f = qemu_file_new_output(QIO_CHANNEL(bioc));
+
+    if (vfio_save_device_config_state(f, vbasedev, errp)) {
+        return false;
+    }
+
+    ret = qemu_fflush(f);
+    if (ret) {
+        error_setg(errp, "save config state flush failed: %d", ret);
+        return false;
+    }
+
+    packet_len = sizeof(*packet) + bioc->usage;
+    packet = g_malloc0(packet_len);
+    packet->version = VFIO_DEVICE_STATE_PACKET_VER_CURRENT;
+    packet->idx = idx;
+    packet->flags = VFIO_DEVICE_STATE_CONFIG_STATE;
+    memcpy(&packet->data, bioc->data, bioc->usage);
+
+    if (!multifd_queue_device_state(idstr, instance_id,
+                                    (char *)packet, packet_len)) {
+        error_setg(errp, "multifd config data queuing failed");
+        return false;
+    }
+
+    vfio_add_bytes_transferred(packet_len);
+
+    return true;
+}
+
+/*
+ * This thread is spawned by the migration core directly via
+ * .save_live_complete_precopy_thread SaveVMHandler.
+ *
+ * It exits after either:
+ * * completing saving the remaining device state and device config, OR:
+ * * encountering some error while doing the above, OR:
+ * * being forcefully aborted by the migration core by
+ *   multifd_device_state_save_thread_should_exit() returning true.
+ */
+bool vfio_save_complete_precopy_thread(SaveLiveCompletePrecopyThreadData *d,
+                                       Error **errp)
+{
+    VFIODevice *vbasedev = d->handler_opaque;
+    VFIOMigration *migration = vbasedev->migration;
+    bool ret;
+    g_autofree VFIODeviceStatePacket *packet = NULL;
+    uint32_t idx;
+
+    if (!vfio_multifd_transfer_enabled(vbasedev)) {
+        /* Nothing to do, vfio_save_complete_precopy() does the transfer. */
+        return true;
+    }
+
+    trace_vfio_save_complete_precopy_thread_start(vbasedev->name,
+                                                  d->idstr, d->instance_id);
+
+    /* We reach here with device state STOP or STOP_COPY only */
+    if (vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP_COPY,
+                                 VFIO_DEVICE_STATE_STOP, errp)) {
+        ret = false;
+        goto ret_finish;
+    }
+
+    packet = g_malloc0(sizeof(*packet) + migration->data_buffer_size);
+    packet->version = VFIO_DEVICE_STATE_PACKET_VER_CURRENT;
+
+    for (idx = 0; ; idx++) {
+        ssize_t data_size;
+        size_t packet_size;
+
+        if (multifd_device_state_save_thread_should_exit()) {
+            error_setg(errp, "operation cancelled");
+            ret = false;
+            goto ret_finish;
+        }
+
+        data_size = read(migration->data_fd, &packet->data,
+                         migration->data_buffer_size);
+        if (data_size < 0) {
+            error_setg(errp, "reading state buffer %" PRIu32 " failed: %d",
+                       idx, errno);
+            ret = false;
+            goto ret_finish;
+        } else if (data_size == 0) {
+            break;
+        }
+
+        packet->idx = idx;
+        packet_size = sizeof(*packet) + data_size;
+
+        if (!multifd_queue_device_state(d->idstr, d->instance_id,
+                                        (char *)packet, packet_size)) {
+            error_setg(errp, "multifd data queuing failed");
+            ret = false;
+            goto ret_finish;
+        }
+
+        vfio_add_bytes_transferred(packet_size);
+    }
+
+    ret = vfio_save_complete_precopy_thread_config_state(vbasedev,
+                                                         d->idstr,
+                                                         d->instance_id,
+                                                         idx, errp);
+
+ret_finish:
+    trace_vfio_save_complete_precopy_thread_end(vbasedev->name, ret);
+
+    return ret;
+}
+
 int vfio_multifd_switchover_start(VFIODevice *vbasedev)
 {
     VFIOMigration *migration = vbasedev->migration;
diff --git a/hw/vfio/migration-multifd.h b/hw/vfio/migration-multifd.h
index 09cbb437d9d1..79780d7b5392 100644
--- a/hw/vfio/migration-multifd.h
+++ b/hw/vfio/migration-multifd.h
@@ -25,6 +25,11 @@ bool vfio_multifd_transfer_setup(VFIODevice *vbasedev, Error **errp);
 bool vfio_load_state_buffer(void *opaque, char *data, size_t data_size,
                             Error **errp);
 
+void vfio_multifd_emit_dummy_eos(VFIODevice *vbasedev, QEMUFile *f);
+
+bool vfio_save_complete_precopy_thread(SaveLiveCompletePrecopyThreadData *d,
+                                       Error **errp);
+
 int vfio_multifd_switchover_start(VFIODevice *vbasedev);
 
 #endif
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index b962309f7c27..69dcf2dac2fa 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -120,10 +120,10 @@ static void vfio_migration_set_device_state(VFIODevice *vbasedev,
     vfio_migration_send_event(vbasedev);
 }
 
-static int vfio_migration_set_state(VFIODevice *vbasedev,
-                                    enum vfio_device_mig_state new_state,
-                                    enum vfio_device_mig_state recover_state,
-                                    Error **errp)
+int vfio_migration_set_state(VFIODevice *vbasedev,
+                             enum vfio_device_mig_state new_state,
+                             enum vfio_device_mig_state recover_state,
+                             Error **errp)
 {
     VFIOMigration *migration = vbasedev->migration;
     uint64_t buf[DIV_ROUND_UP(sizeof(struct vfio_device_feature) +
@@ -238,8 +238,7 @@ static int vfio_load_buffer(QEMUFile *f, VFIODevice *vbasedev,
     return ret;
 }
 
-static int vfio_save_device_config_state(QEMUFile *f, void *opaque,
-                                         Error **errp)
+int vfio_save_device_config_state(QEMUFile *f, void *opaque, Error **errp)
 {
     VFIODevice *vbasedev = opaque;
     int ret;
@@ -453,6 +452,10 @@ static int vfio_save_setup(QEMUFile *f, void *opaque, Error **errp)
     uint64_t stop_copy_size = VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE;
     int ret;
 
+    if (!vfio_multifd_transfer_setup(vbasedev, errp)) {
+        return -EINVAL;
+    }
+
     qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
 
     vfio_query_stop_copy_size(vbasedev, &stop_copy_size);
@@ -631,6 +634,11 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
     int ret;
     Error *local_err = NULL;
 
+    if (vfio_multifd_transfer_enabled(vbasedev)) {
+        vfio_multifd_emit_dummy_eos(vbasedev, f);
+        return 0;
+    }
+
     trace_vfio_save_complete_precopy_start(vbasedev->name);
 
     /* We reach here with device state STOP or STOP_COPY only */
@@ -662,6 +670,11 @@ static void vfio_save_state(QEMUFile *f, void *opaque)
     Error *local_err = NULL;
     int ret;
 
+    if (vfio_multifd_transfer_enabled(vbasedev)) {
+        vfio_multifd_emit_dummy_eos(vbasedev, f);
+        return;
+    }
+
     ret = vfio_save_device_config_state(f, opaque, &local_err);
     if (ret) {
         error_prepend(&local_err,
@@ -819,6 +832,7 @@ static const SaveVMHandlers savevm_vfio_handlers = {
     .is_active_iterate = vfio_is_active_iterate,
     .save_live_iterate = vfio_save_iterate,
     .save_live_complete_precopy = vfio_save_complete_precopy,
+    .save_live_complete_precopy_thread = vfio_save_complete_precopy_thread,
     .save_state = vfio_save_state,
     .load_setup = vfio_load_setup,
     .load_cleanup = vfio_load_cleanup,
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 418b378ebd29..039979bdd98f 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -168,6 +168,8 @@ vfio_save_block_precopy_empty_hit(const char *name) " (%s)"
 vfio_save_cleanup(const char *name) " (%s)"
 vfio_save_complete_precopy(const char *name, int ret) " (%s) ret %d"
 vfio_save_complete_precopy_start(const char *name) " (%s)"
+vfio_save_complete_precopy_thread_start(const char *name, const char *idstr, uint32_t instance_id) " (%s) idstr %s instance %"PRIu32
+vfio_save_complete_precopy_thread_end(const char *name, int ret) " (%s) ret %d"
 vfio_save_device_config_state(const char *name) " (%s)"
 vfio_save_iterate(const char *name, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy initial size %"PRIu64" precopy dirty size %"PRIu64
 vfio_save_iterate_start(const char *name) " (%s)"
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index ce2bdea8a2c2..ba851917f9fc 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -298,6 +298,14 @@ void vfio_add_bytes_transferred(unsigned long val);
 bool vfio_device_state_is_running(VFIODevice *vbasedev);
 bool vfio_device_state_is_precopy(VFIODevice *vbasedev);
 
+#ifdef CONFIG_LINUX
+int vfio_migration_set_state(VFIODevice *vbasedev,
+                             enum vfio_device_mig_state new_state,
+                             enum vfio_device_mig_state recover_state,
+                             Error **errp);
+#endif
+
+int vfio_save_device_config_state(QEMUFile *f, void *opaque, Error **errp);
 int vfio_load_device_config_state(QEMUFile *f, void *opaque);
 
 #ifdef CONFIG_LINUX



* [PATCH v5 31/36] vfio/migration: Add x-migration-multifd-transfer VFIO property
  2025-02-19 20:33 [PATCH v5 00/36] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
                   ` (29 preceding siblings ...)
  2025-02-19 20:34 ` [PATCH v5 30/36] vfio/migration: Multifd device state transfer support - send side Maciej S. Szmigiero
@ 2025-02-19 20:34 ` Maciej S. Szmigiero
  2025-02-27  6:45   ` Cédric Le Goater
  2025-03-02 14:48   ` Avihai Horon
  2025-02-19 20:34 ` [PATCH v5 32/36] vfio/migration: Make x-migration-multifd-transfer VFIO property mutable Maciej S. Szmigiero
                   ` (4 subsequent siblings)
  35 siblings, 2 replies; 120+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-19 20:34 UTC (permalink / raw)
  To: Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

This property allows configuring, at runtime, whether to transfer the
particular device's state via multifd channels when live migrating that
device.

It defaults to AUTO, which means that VFIO device state transfer via
multifd channels is attempted in configurations that otherwise support it.

Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
 hw/vfio/migration-multifd.c   | 17 ++++++++++++++++-
 hw/vfio/pci.c                 |  3 +++
 include/hw/vfio/vfio-common.h |  2 ++
 3 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/hw/vfio/migration-multifd.c b/hw/vfio/migration-multifd.c
index 0cfa9d31732a..18a5ff964a37 100644
--- a/hw/vfio/migration-multifd.c
+++ b/hw/vfio/migration-multifd.c
@@ -460,11 +460,26 @@ bool vfio_multifd_transfer_supported(void)
 
 bool vfio_multifd_transfer_enabled(VFIODevice *vbasedev)
 {
-    return false;
+    VFIOMigration *migration = vbasedev->migration;
+
+    return migration->multifd_transfer;
 }
 
 bool vfio_multifd_transfer_setup(VFIODevice *vbasedev, Error **errp)
 {
+    VFIOMigration *migration = vbasedev->migration;
+
+    /*
+     * Make a copy of this setting at the start in case it is changed
+     * mid-migration.
+     */
+    if (vbasedev->migration_multifd_transfer == ON_OFF_AUTO_AUTO) {
+        migration->multifd_transfer = vfio_multifd_transfer_supported();
+    } else {
+        migration->multifd_transfer =
+            vbasedev->migration_multifd_transfer == ON_OFF_AUTO_ON;
+    }
+
     if (vfio_multifd_transfer_enabled(vbasedev) &&
         !vfio_multifd_transfer_supported()) {
         error_setg(errp,
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 89d900e9cf0c..184ff882f9d1 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -3377,6 +3377,9 @@ static const Property vfio_pci_dev_properties[] = {
                     VFIO_FEATURE_ENABLE_IGD_OPREGION_BIT, false),
     DEFINE_PROP_ON_OFF_AUTO("enable-migration", VFIOPCIDevice,
                             vbasedev.enable_migration, ON_OFF_AUTO_AUTO),
+    DEFINE_PROP_ON_OFF_AUTO("x-migration-multifd-transfer", VFIOPCIDevice,
+                            vbasedev.migration_multifd_transfer,
+                            ON_OFF_AUTO_AUTO),
     DEFINE_PROP_BOOL("migration-events", VFIOPCIDevice,
                      vbasedev.migration_events, false),
     DEFINE_PROP_BOOL("x-no-mmap", VFIOPCIDevice, vbasedev.no_mmap, false),
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index ba851917f9fc..3006931accf6 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -91,6 +91,7 @@ typedef struct VFIOMigration {
     uint64_t mig_flags;
     uint64_t precopy_init_size;
     uint64_t precopy_dirty_size;
+    bool multifd_transfer;
     VFIOMultifd *multifd;
     bool initial_data_sent;
 
@@ -153,6 +154,7 @@ typedef struct VFIODevice {
     bool no_mmap;
     bool ram_block_discard_allowed;
     OnOffAuto enable_migration;
+    OnOffAuto migration_multifd_transfer;
     bool migration_events;
     VFIODeviceOps *ops;
     unsigned int num_irqs;


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v5 32/36] vfio/migration: Make x-migration-multifd-transfer VFIO property mutable
  2025-02-19 20:33 [PATCH v5 00/36] Multifd 🔀 device state transfer support with VFIO consumer Maciej S. Szmigiero
                   ` (30 preceding siblings ...)
  2025-02-19 20:34 ` [PATCH v5 31/36] vfio/migration: Add x-migration-multifd-transfer VFIO property Maciej S. Szmigiero
@ 2025-02-19 20:34 ` Maciej S. Szmigiero
  2025-02-26 17:59   ` Cédric Le Goater
  2025-02-19 20:34 ` [PATCH v5 33/36] hw/core/machine: Add compat for x-migration-multifd-transfer VFIO property Maciej S. Szmigiero
                   ` (3 subsequent siblings)
  35 siblings, 1 reply; 120+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-19 20:34 UTC (permalink / raw)
  To: Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

The DEFINE_PROP_ON_OFF_AUTO() property isn't runtime-mutable, so using it
would mean that the source VM would need to decide upfront, at startup
time, whether it wants to do a multifd device state transfer at some
point.

The source VM can run for a long time before being migrated, so it is
desirable to have a fallback mechanism to the old way of transferring
VFIO device state if that turns out to be necessary.

This brings the property to the same mutability level as ordinary
migration parameters, which can also be adjusted at run time.

Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
 hw/vfio/pci.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 184ff882f9d1..9111805ae06c 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -3353,6 +3353,8 @@ static void vfio_instance_init(Object *obj)
     pci_dev->cap_present |= QEMU_PCI_CAP_EXPRESS;
 }
 
+static PropertyInfo qdev_prop_on_off_auto_mutable;
+
 static const Property vfio_pci_dev_properties[] = {
     DEFINE_PROP_PCI_HOST_DEVADDR("host", VFIOPCIDevice, host),
     DEFINE_PROP_UUID_NODEFAULT("vf-token", VFIOPCIDevice, vf_token),
@@ -3377,9 +3379,10 @@ static const Property vfio_pci_dev_properties[] = {
                     VFIO_FEATURE_ENABLE_IGD_OPREGION_BIT, false),
     DEFINE_PROP_ON_OFF_AUTO("enable-migration", VFIOPCIDevice,
                             vbasedev.enable_migration, ON_OFF_AUTO_AUTO),
-    DEFINE_PROP_ON_OFF_AUTO("x-migration-multifd-transfer", VFIOPCIDevice,
-                            vbasedev.migration_multifd_transfer,
-                            ON_OFF_AUTO_AUTO),
+    DEFINE_PROP("x-migration-multifd-transfer", VFIOPCIDevice,
+                vbasedev.migration_multifd_transfer,
+                qdev_prop_on_off_auto_mutable, OnOffAuto,
+                .set_default = true, .defval.i = ON_OFF_AUTO_AUTO),
     DEFINE_PROP_BOOL("migration-events", VFIOPCIDevice,
                      vbasedev.migration_events, false),
     DEFINE_PROP_BOOL("x-no-mmap", VFIOPCIDevice, vbasedev.no_mmap, false),
@@ -3475,6 +3478,9 @@ static const TypeInfo vfio_pci_nohotplug_dev_info = {
 
 static void register_vfio_pci_dev_type(void)
 {
+    qdev_prop_on_off_auto_mutable = qdev_prop_on_off_auto;
+    qdev_prop_on_off_auto_mutable.realized_set_allowed = true;
+
     type_register_static(&vfio_pci_dev_info);
     type_register_static(&vfio_pci_nohotplug_dev_info);
 }


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v5 33/36] hw/core/machine: Add compat for x-migration-multifd-transfer VFIO property
  2025-02-19 20:33 [PATCH v5 00/36] Multifd 🔀 device state transfer support with VFIO consumer Maciej S. Szmigiero
                   ` (31 preceding siblings ...)
  2025-02-19 20:34 ` [PATCH v5 32/36] vfio/migration: Make x-migration-multifd-transfer VFIO property mutable Maciej S. Szmigiero
@ 2025-02-19 20:34 ` Maciej S. Szmigiero
  2025-02-26 17:59   ` Cédric Le Goater
  2025-02-19 20:34 ` [PATCH v5 34/36] vfio/migration: Max in-flight VFIO device state buffer count limit Maciej S. Szmigiero
                   ` (2 subsequent siblings)
  35 siblings, 1 reply; 120+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-19 20:34 UTC (permalink / raw)
  To: Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

Add a hw_compat entry for recently added x-migration-multifd-transfer VFIO
property.

Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
 hw/core/machine.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/hw/core/machine.c b/hw/core/machine.c
index 21c3bde92f08..d0a87f5ccbaa 100644
--- a/hw/core/machine.c
+++ b/hw/core/machine.c
@@ -44,6 +44,7 @@ GlobalProperty hw_compat_9_2[] = {
     { "virtio-mem-pci", "vectors", "0" },
     { "migration", "multifd-clean-tls-termination", "false" },
     { "migration", "send-switchover-start", "off"},
+    { "vfio-pci", "x-migration-multifd-transfer", "off" },
 };
 const size_t hw_compat_9_2_len = G_N_ELEMENTS(hw_compat_9_2);
 


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v5 34/36] vfio/migration: Max in-flight VFIO device state buffer count limit
  2025-02-19 20:33 [PATCH v5 00/36] Multifd 🔀 device state transfer support with VFIO consumer Maciej S. Szmigiero
                   ` (32 preceding siblings ...)
  2025-02-19 20:34 ` [PATCH v5 33/36] hw/core/machine: Add compat for x-migration-multifd-transfer VFIO property Maciej S. Szmigiero
@ 2025-02-19 20:34 ` Maciej S. Szmigiero
  2025-02-27  6:48   ` Cédric Le Goater
  2025-03-02 14:53   ` Avihai Horon
  2025-02-19 20:34 ` [PATCH v5 35/36] vfio/migration: Add x-migration-load-config-after-iter VFIO property Maciej S. Szmigiero
  2025-02-19 20:34 ` [PATCH v5 36/36] vfio/migration: Update VFIO migration documentation Maciej S. Szmigiero
  35 siblings, 2 replies; 120+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-19 20:34 UTC (permalink / raw)
  To: Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

Allow capping the maximum count of in-flight VFIO device state buffers
queued at the destination; otherwise a malicious QEMU source could
theoretically cause the target QEMU to allocate unlimited amounts of
memory for buffers in flight.

Since this is not expected to be a realistic threat in most VFIO live
migration use cases, and the right value depends on the particular setup,
the limit is disabled by default by setting it to UINT64_MAX.

Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
 hw/vfio/migration-multifd.c   | 14 ++++++++++++++
 hw/vfio/pci.c                 |  2 ++
 include/hw/vfio/vfio-common.h |  1 +
 3 files changed, 17 insertions(+)

diff --git a/hw/vfio/migration-multifd.c b/hw/vfio/migration-multifd.c
index 18a5ff964a37..04aa3f4a6596 100644
--- a/hw/vfio/migration-multifd.c
+++ b/hw/vfio/migration-multifd.c
@@ -53,6 +53,7 @@ typedef struct VFIOMultifd {
     QemuMutex load_bufs_mutex; /* Lock order: this lock -> BQL */
     uint32_t load_buf_idx;
     uint32_t load_buf_idx_last;
+    uint32_t load_buf_queued_pending_buffers;
 } VFIOMultifd;
 
 static void vfio_state_buffer_clear(gpointer data)
@@ -121,6 +122,15 @@ static bool vfio_load_state_buffer_insert(VFIODevice *vbasedev,
 
     assert(packet->idx >= multifd->load_buf_idx);
 
+    multifd->load_buf_queued_pending_buffers++;
+    if (multifd->load_buf_queued_pending_buffers >
+        vbasedev->migration_max_queued_buffers) {
+        error_setg(errp,
+                   "queuing state buffer %" PRIu32 " would exceed the max of %" PRIu64,
+                   packet->idx, vbasedev->migration_max_queued_buffers);
+        return false;
+    }
+
     lb->data = g_memdup2(&packet->data, packet_total_size - sizeof(*packet));
     lb->len = packet_total_size - sizeof(*packet);
     lb->is_present = true;
@@ -374,6 +384,9 @@ static bool vfio_load_bufs_thread(void *opaque, bool *should_quit, Error **errp)
             goto ret_signal;
         }
 
+        assert(multifd->load_buf_queued_pending_buffers > 0);
+        multifd->load_buf_queued_pending_buffers--;
+
         if (multifd->load_buf_idx == multifd->load_buf_idx_last - 1) {
             trace_vfio_load_state_device_buffer_end(vbasedev->name);
         }
@@ -408,6 +421,7 @@ VFIOMultifd *vfio_multifd_new(void)
 
     multifd->load_buf_idx = 0;
     multifd->load_buf_idx_last = UINT32_MAX;
+    multifd->load_buf_queued_pending_buffers = 0;
     qemu_cond_init(&multifd->load_bufs_buffer_ready_cond);
 
     multifd->load_bufs_thread_running = false;
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 9111805ae06c..247418f0fce2 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -3383,6 +3383,8 @@ static const Property vfio_pci_dev_properties[] = {
                 vbasedev.migration_multifd_transfer,
                 qdev_prop_on_off_auto_mutable, OnOffAuto,
                 .set_default = true, .defval.i = ON_OFF_AUTO_AUTO),
+    DEFINE_PROP_UINT64("x-migration-max-queued-buffers", VFIOPCIDevice,
+                       vbasedev.migration_max_queued_buffers, UINT64_MAX),
     DEFINE_PROP_BOOL("migration-events", VFIOPCIDevice,
                      vbasedev.migration_events, false),
     DEFINE_PROP_BOOL("x-no-mmap", VFIOPCIDevice, vbasedev.no_mmap, false),
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 3006931accf6..30a5bb9af61b 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -155,6 +155,7 @@ typedef struct VFIODevice {
     bool ram_block_discard_allowed;
     OnOffAuto enable_migration;
     OnOffAuto migration_multifd_transfer;
+    uint64_t migration_max_queued_buffers;
     bool migration_events;
     VFIODeviceOps *ops;
     unsigned int num_irqs;


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v5 35/36] vfio/migration: Add x-migration-load-config-after-iter VFIO property
  2025-02-19 20:33 [PATCH v5 00/36] Multifd 🔀 device state transfer support with VFIO consumer Maciej S. Szmigiero
                   ` (33 preceding siblings ...)
  2025-02-19 20:34 ` [PATCH v5 34/36] vfio/migration: Max in-flight VFIO device state buffer count limit Maciej S. Szmigiero
@ 2025-02-19 20:34 ` Maciej S. Szmigiero
  2025-02-19 20:34 ` [PATCH v5 36/36] vfio/migration: Update VFIO migration documentation Maciej S. Szmigiero
  35 siblings, 0 replies; 120+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-19 20:34 UTC (permalink / raw)
  To: Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

This property allows configuring whether to start loading the device
config only after all iterables have been loaded.
Such interlocking is required on ARM64 due to that platform's VFIO
dependency on the interrupt controller being loaded first.

The property defaults to AUTO, which means ON for ARM, OFF for other
platforms.

Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
 hw/vfio/migration-multifd.c   | 92 +++++++++++++++++++++++++++++++++++
 hw/vfio/migration-multifd.h   |  3 ++
 hw/vfio/migration.c           | 10 +++-
 hw/vfio/pci.c                 |  3 ++
 include/hw/vfio/vfio-common.h |  2 +
 5 files changed, 109 insertions(+), 1 deletion(-)

diff --git a/hw/vfio/migration-multifd.c b/hw/vfio/migration-multifd.c
index 04aa3f4a6596..ebb19f746f27 100644
--- a/hw/vfio/migration-multifd.c
+++ b/hw/vfio/migration-multifd.c
@@ -31,6 +31,31 @@ typedef struct VFIODeviceStatePacket {
     uint8_t data[0];
 } QEMU_PACKED VFIODeviceStatePacket;
 
+bool vfio_load_config_after_iter(VFIODevice *vbasedev)
+{
+    if (vbasedev->migration_load_config_after_iter == ON_OFF_AUTO_ON) {
+        return true;
+    } else if (vbasedev->migration_load_config_after_iter == ON_OFF_AUTO_OFF) {
+        return false;
+    }
+
+    assert(vbasedev->migration_load_config_after_iter == ON_OFF_AUTO_AUTO);
+
+    /*
+     * Starting the config load only after all iterables were loaded is required
+     * for ARM64 due to this platform VFIO dependency on interrupt controller
+     * being loaded first.
+     *
+     * See commit d329f5032e17 ("vfio: Move the saving of the config space to
+     * the right place in VFIO migration").
+     */
+#if defined(TARGET_ARM)
+    return true;
+#else
+    return false;
+#endif
+}
+
 /* type safety */
 typedef struct VFIOStateBuffers {
     GArray *array;
@@ -47,6 +72,9 @@ typedef struct VFIOMultifd {
     bool load_bufs_thread_running;
     bool load_bufs_thread_want_exit;
 
+    bool load_bufs_iter_done;
+    QemuCond load_bufs_iter_done_cond;
+
     VFIOStateBuffers load_bufs;
     QemuCond load_bufs_buffer_ready_cond;
     QemuCond load_bufs_thread_finished_cond;
@@ -394,6 +422,23 @@ static bool vfio_load_bufs_thread(void *opaque, bool *should_quit, Error **errp)
         multifd->load_buf_idx++;
     }
 
+    if (vfio_load_config_after_iter(vbasedev)) {
+        while (!multifd->load_bufs_iter_done) {
+            qemu_cond_wait(&multifd->load_bufs_iter_done_cond,
+                           &multifd->load_bufs_mutex);
+
+            /*
+             * Need to re-check cancellation immediately after wait in case
+             * cond was signalled by vfio_load_cleanup_load_bufs_thread().
+             */
+            if (vfio_load_bufs_thread_want_exit(multifd, should_quit)) {
+                error_setg(errp, "operation cancelled");
+                ret = false;
+                goto ret_signal;
+            }
+        }
+    }
+
     config_ret = vfio_load_bufs_thread_load_config(vbasedev);
     if (config_ret) {
         error_setg(errp, "load config state failed: %d", config_ret);
@@ -411,6 +456,48 @@ ret_signal:
     return ret;
 }
 
+int vfio_load_state_config_load_ready(VFIODevice *vbasedev)
+{
+    VFIOMigration *migration = vbasedev->migration;
+    VFIOMultifd *multifd = migration->multifd;
+    int ret = 0;
+
+    if (!vfio_multifd_transfer_enabled(vbasedev)) {
+        error_report("%s: got DEV_CONFIG_LOAD_READY outside multifd transfer",
+                     vbasedev->name);
+        return -EINVAL;
+    }
+
+    if (!vfio_load_config_after_iter(vbasedev)) {
+        error_report("%s: got DEV_CONFIG_LOAD_READY but was disabled",
+                     vbasedev->name);
+        return -EINVAL;
+    }
+
+    assert(multifd);
+
+    /* The lock order is load_bufs_mutex -> BQL so unlock BQL here first */
+    bql_unlock();
+    WITH_QEMU_LOCK_GUARD(&multifd->load_bufs_mutex) {
+        if (multifd->load_bufs_iter_done) {
+            /* Can't print error here as we're outside BQL */
+            ret = -EINVAL;
+            break;
+        }
+
+        multifd->load_bufs_iter_done = true;
+        qemu_cond_signal(&multifd->load_bufs_iter_done_cond);
+    }
+    bql_lock();
+
+    if (ret) {
+        error_report("%s: duplicate DEV_CONFIG_LOAD_READY",
+                     vbasedev->name);
+    }
+
+    return ret;
+}
+
 VFIOMultifd *vfio_multifd_new(void)
 {
     VFIOMultifd *multifd = g_new(VFIOMultifd, 1);
@@ -424,6 +511,9 @@ VFIOMultifd *vfio_multifd_new(void)
     multifd->load_buf_queued_pending_buffers = 0;
     qemu_cond_init(&multifd->load_bufs_buffer_ready_cond);
 
+    multifd->load_bufs_iter_done = false;
+    qemu_cond_init(&multifd->load_bufs_iter_done_cond);
+
     multifd->load_bufs_thread_running = false;
     multifd->load_bufs_thread_want_exit = false;
     qemu_cond_init(&multifd->load_bufs_thread_finished_cond);
@@ -447,6 +537,7 @@ static void vfio_load_cleanup_load_bufs_thread(VFIOMultifd *multifd)
             multifd->load_bufs_thread_want_exit = true;
 
             qemu_cond_signal(&multifd->load_bufs_buffer_ready_cond);
+            qemu_cond_signal(&multifd->load_bufs_iter_done_cond);
             qemu_cond_wait(&multifd->load_bufs_thread_finished_cond,
                            &multifd->load_bufs_mutex);
         }
@@ -459,6 +550,7 @@ void vfio_multifd_free(VFIOMultifd *multifd)
     vfio_load_cleanup_load_bufs_thread(multifd);
 
     qemu_cond_destroy(&multifd->load_bufs_thread_finished_cond);
+    qemu_cond_destroy(&multifd->load_bufs_iter_done_cond);
     vfio_state_buffers_destroy(&multifd->load_bufs);
     qemu_cond_destroy(&multifd->load_bufs_buffer_ready_cond);
     qemu_mutex_destroy(&multifd->load_bufs_mutex);
diff --git a/hw/vfio/migration-multifd.h b/hw/vfio/migration-multifd.h
index 79780d7b5392..414f2bc2ece9 100644
--- a/hw/vfio/migration-multifd.h
+++ b/hw/vfio/migration-multifd.h
@@ -22,9 +22,12 @@ bool vfio_multifd_transfer_enabled(VFIODevice *vbasedev);
 
 bool vfio_multifd_transfer_setup(VFIODevice *vbasedev, Error **errp);
 
+bool vfio_load_config_after_iter(VFIODevice *vbasedev);
 bool vfio_load_state_buffer(void *opaque, char *data, size_t data_size,
                             Error **errp);
 
+int vfio_load_state_config_load_ready(VFIODevice *vbasedev);
+
 void vfio_multifd_emit_dummy_eos(VFIODevice *vbasedev, QEMUFile *f);
 
 bool vfio_save_complete_precopy_thread(SaveLiveCompletePrecopyThreadData *d,
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 69dcf2dac2fa..c6f04f9756aa 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -671,7 +671,11 @@ static void vfio_save_state(QEMUFile *f, void *opaque)
     int ret;
 
     if (vfio_multifd_transfer_enabled(vbasedev)) {
-        vfio_multifd_emit_dummy_eos(vbasedev, f);
+        if (vfio_load_config_after_iter(vbasedev)) {
+            qemu_put_be64(f, VFIO_MIG_FLAG_DEV_CONFIG_LOAD_READY);
+        } else {
+            vfio_multifd_emit_dummy_eos(vbasedev, f);
+        }
         return;
     }
 
@@ -791,6 +795,10 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
 
             return ret;
         }
+        case VFIO_MIG_FLAG_DEV_CONFIG_LOAD_READY:
+        {
+            return vfio_load_state_config_load_ready(vbasedev);
+        }
         default:
             error_report("%s: Unknown tag 0x%"PRIx64, vbasedev->name, data);
             return -EINVAL;
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 247418f0fce2..9ca33b49421c 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -3383,6 +3383,9 @@ static const Property vfio_pci_dev_properties[] = {
                 vbasedev.migration_multifd_transfer,
                 qdev_prop_on_off_auto_mutable, OnOffAuto,
                 .set_default = true, .defval.i = ON_OFF_AUTO_AUTO),
+    DEFINE_PROP_ON_OFF_AUTO("x-migration-load-config-after-iter", VFIOPCIDevice,
+                            vbasedev.migration_load_config_after_iter,
+                            ON_OFF_AUTO_AUTO),
     DEFINE_PROP_UINT64("x-migration-max-queued-buffers", VFIOPCIDevice,
                        vbasedev.migration_max_queued_buffers, UINT64_MAX),
     DEFINE_PROP_BOOL("migration-events", VFIOPCIDevice,
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 30a5bb9af61b..bd3b0a29ecf2 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -52,6 +52,7 @@
 #define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xffffffffef100003ULL)
 #define VFIO_MIG_FLAG_DEV_DATA_STATE    (0xffffffffef100004ULL)
 #define VFIO_MIG_FLAG_DEV_INIT_DATA_SENT (0xffffffffef100005ULL)
+#define VFIO_MIG_FLAG_DEV_CONFIG_LOAD_READY (0xffffffffef100006ULL)
 
 enum {
     VFIO_DEVICE_TYPE_PCI = 0,
@@ -155,6 +156,7 @@ typedef struct VFIODevice {
     bool ram_block_discard_allowed;
     OnOffAuto enable_migration;
     OnOffAuto migration_multifd_transfer;
+    OnOffAuto migration_load_config_after_iter;
     uint64_t migration_max_queued_buffers;
     bool migration_events;
     VFIODeviceOps *ops;


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v5 36/36] vfio/migration: Update VFIO migration documentation
  2025-02-19 20:33 [PATCH v5 00/36] Multifd 🔀 device state transfer support with VFIO consumer Maciej S. Szmigiero
                   ` (34 preceding siblings ...)
  2025-02-19 20:34 ` [PATCH v5 35/36] vfio/migration: Add x-migration-load-config-after-iter VFIO property Maciej S. Szmigiero
@ 2025-02-19 20:34 ` Maciej S. Szmigiero
  2025-02-27  6:59   ` Cédric Le Goater
  35 siblings, 1 reply; 120+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-19 20:34 UTC (permalink / raw)
  To: Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

Update the VFIO documentation at docs/devel/migration to describe the
changes brought by multifd device state transfer.

Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
 docs/devel/migration/vfio.rst | 80 +++++++++++++++++++++++++++++++----
 1 file changed, 71 insertions(+), 9 deletions(-)

diff --git a/docs/devel/migration/vfio.rst b/docs/devel/migration/vfio.rst
index c49482eab66d..d9b169d29921 100644
--- a/docs/devel/migration/vfio.rst
+++ b/docs/devel/migration/vfio.rst
@@ -16,6 +16,37 @@ helps to reduce the total downtime of the VM. VFIO devices opt-in to pre-copy
 support by reporting the VFIO_MIGRATION_PRE_COPY flag in the
 VFIO_DEVICE_FEATURE_MIGRATION ioctl.
 
+Starting from QEMU version 10.0 there's a possibility to transfer VFIO device
+_STOP_COPY state via multifd channels. This helps reduce downtime - especially
+with multiple VFIO devices or with devices having a large migration state.
+As an additional benefit, setting the VFIO device to _STOP_COPY state and
+saving its config space is also parallelized (run in a separate thread) in
+such migration mode.
+
+The multifd VFIO device state transfer is controlled by
+"x-migration-multifd-transfer" VFIO device property. This property defaults to
+AUTO, which means that VFIO device state transfer via multifd channels is
+attempted in configurations that otherwise support it.
+
+Since the target QEMU needs to load device state buffers in-order it needs to
+queue incoming buffers until they can be loaded into the device.
+This means that a malicious QEMU source could theoretically cause the target
+QEMU to allocate unlimited amounts of memory for such buffers-in-flight.
+
+The "x-migration-max-queued-buffers" property allows capping the maximum count
+of these VFIO device state buffers queued at the destination.
+
+Because a malicious QEMU source causing OOM on the target is not expected to be
+a realistic threat in most of VFIO live migration use cases and the right value
+depends on the particular setup by default this queued buffers limit is
+disabled by setting it to UINT64_MAX.
+
+Some host platforms (like ARM64) require that VFIO device config is loaded only
+after all iterables were loaded.
+Such interlocking is controlled by "x-migration-load-config-after-iter" VFIO
+device property, which in its default setting (AUTO) does so only on platforms
+that actually require it.
+
 When pre-copy is supported, it's possible to further reduce downtime by
 enabling "switchover-ack" migration capability.
 VFIO migration uAPI defines "initial bytes" as part of its pre-copy data stream
@@ -67,14 +98,39 @@ VFIO implements the device hooks for the iterative approach as follows:
 * A ``switchover_ack_needed`` function that checks if the VFIO device uses
   "switchover-ack" migration capability when this capability is enabled.
 
-* A ``save_state`` function to save the device config space if it is present.
-
-* A ``save_live_complete_precopy`` function that sets the VFIO device in
-  _STOP_COPY state and iteratively copies the data for the VFIO device until
-  the vendor driver indicates that no data remains.
-
-* A ``load_state`` function that loads the config section and the data
-  sections that are generated by the save functions above.
+* A ``switchover_start`` function that in the multifd mode starts a thread that
+  reassembles the multifd received data and loads it in-order into the device.
+  In the non-multifd mode this function is a NOP.
+
+* A ``save_state`` function to save the device config space if it is present
+  in the non-multifd mode.
+  In the multifd mode it just emits either a dummy EOS marker or
+  "all iterables were loaded" flag for configurations that need to defer
+  loading device config space after them.
+
+* A ``save_live_complete_precopy`` function that in the non-multifd mode sets
+  the VFIO device in _STOP_COPY state and iteratively copies the data for the
+  VFIO device until the vendor driver indicates that no data remains.
+  In the multifd mode it just emits a dummy EOS marker.
+
+* A ``save_live_complete_precopy_thread`` function that in the multifd mode
+  provides thread handler performing multifd device state transfer.
+  It sets the VFIO device to _STOP_COPY state, iteratively reads the data
+  from the VFIO device and queues it for multifd transmission until the vendor
+  driver indicates that no data remains.
+  After that, it saves the device config space and queues it for multifd
+  transfer too.
+  In the non-multifd mode this thread is a NOP.
+
+* A ``load_state`` function that loads the data sections that are generated
+  by the main migration channel save functions above.
+  In the non-multifd mode it also loads the config section, while in the
+  multifd mode it handles the optional "all iterables were loaded" flag if
+  it is in use.
+
+* A ``load_state_buffer`` function that loads the device state and the device
+  config that arrived via multifd channels.
+  It's used only in the multifd mode.
 
 * ``cleanup`` functions for both save and load that perform any migration
   related cleanup.
@@ -176,8 +232,11 @@ Live migration save path
                 Then the VFIO device is put in _STOP_COPY state
                      (FINISH_MIGRATE, _ACTIVE, _STOP_COPY)
          .save_live_complete_precopy() is called for each active device
-      For the VFIO device, iterate in .save_live_complete_precopy() until
+              For the VFIO device: in the non-multifd mode iterate in
+                        .save_live_complete_precopy() until
                                pending data is 0
+	          In the multifd mode this iteration is done in
+	          .save_live_complete_precopy_thread() instead.
                                       |
                      (POSTMIGRATE, _COMPLETED, _STOP_COPY)
             Migraton thread schedules cleanup bottom half and exits
@@ -194,6 +253,9 @@ Live migration resume path
                           (RESTORE_VM, _ACTIVE, _STOP)
                                       |
      For each device, .load_state() is called for that device section data
+                 transmitted via the main migration channel.
+     For data transmitted via multifd channels .load_state_buffer() is called
+                                   instead.
                         (RESTORE_VM, _ACTIVE, _RESUMING)
                                       |
   At the end, .load_cleanup() is called for each device and vCPUs are started


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* Re: [PATCH v5 07/36] migration: postcopy_ram_listen_thread() should take BQL for some calls
  2025-02-19 20:33 ` [PATCH v5 07/36] migration: postcopy_ram_listen_thread() should take BQL for some calls Maciej S. Szmigiero
@ 2025-02-25 17:16   ` Peter Xu
  2025-02-25 21:08     ` Maciej S. Szmigiero
  0 siblings, 1 reply; 120+ messages in thread
From: Peter Xu @ 2025-02-25 17:16 UTC (permalink / raw)
  To: Maciej S. Szmigiero
  Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

On Wed, Feb 19, 2025 at 09:33:49PM +0100, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> 
> All callers to migration_incoming_state_destroy() other than
> postcopy_ram_listen_thread() do this call with BQL held.
> 
> Since migration_incoming_state_destroy() ultimately calls "load_cleanup"
> SaveVMHandlers and it will soon call BQL-sensitive code it makes sense
> to always call that function under BQL rather than to have it deal with
> both cases (with BQL and without BQL).
> Add the necessary bql_lock() and bql_unlock() to
> postcopy_ram_listen_thread().

We can do that, but let's be explicit on what needs BQL to be taken.

Could you add an assertion in migration_incoming_state_destroy() on
bql_locked(), then add a rich comment above it listing what needs the BQL?
We may consider dropping it some day when it's not needed.

Thanks,

> 
> qemu_loadvm_state_main() in postcopy_ram_listen_thread() could call
> "load_state" SaveVMHandlers that are expecting BQL to be held.
> 
> In principle, the only devices that should be arriving on migration
> channel serviced by postcopy_ram_listen_thread() are those that are
> postcopiable and whose load handlers are safe to be called without BQL
> being held.
> 
> But nothing currently prevents the source from sending data for "unsafe"
> devices which would cause trouble there.
> Add a TODO comment there so it's clear that it would be good to improve
> handling of such (erroneous) case in the future.
> 
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> ---
>  migration/savevm.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/migration/savevm.c b/migration/savevm.c
> index 7c1aa8ad7b9d..3e86b572cfa8 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -1986,6 +1986,8 @@ static void *postcopy_ram_listen_thread(void *opaque)
>       * in qemu_file, and thus we must be blocking now.
>       */
>      qemu_file_set_blocking(f, true);
> +
> +    /* TODO: sanity check that only postcopiable data will be loaded here */
>      load_res = qemu_loadvm_state_main(f, mis);
>  
>      /*
> @@ -2046,7 +2048,9 @@ static void *postcopy_ram_listen_thread(void *opaque)
>       * (If something broke then qemu will have to exit anyway since it's
>       * got a bad migration state).
>       */
> +    bql_lock();
>      migration_incoming_state_destroy();
> +    bql_unlock();
>  
>      rcu_unregister_thread();
>      mis->have_listen_thread = false;
> 

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v5 07/36] migration: postcopy_ram_listen_thread() should take BQL for some calls
  2025-02-25 17:16   ` Peter Xu
@ 2025-02-25 21:08     ` Maciej S. Szmigiero
  0 siblings, 0 replies; 120+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-25 21:08 UTC (permalink / raw)
  To: Peter Xu
  Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P. Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

On 25.02.2025 18:16, Peter Xu wrote:
> On Wed, Feb 19, 2025 at 09:33:49PM +0100, Maciej S. Szmigiero wrote:
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> All callers to migration_incoming_state_destroy() other than
>> postcopy_ram_listen_thread() do this call with BQL held.
>>
>> Since migration_incoming_state_destroy() ultimately calls "load_cleanup"
>> SaveVMHandlers, and since it will soon call BQL-sensitive code, it makes
>> sense to always call that function under the BQL rather than have it deal
>> with both cases (with and without the BQL).
>> Add the necessary bql_lock() and bql_unlock() to
>> postcopy_ram_listen_thread().
> 
> We can do that, but let's be explicit about what needs the BQL to be taken.
> 
> Could you add an assertion in migration_incoming_state_destroy() on
> bql_locked(), then add a rich comment above it listing what needs the BQL?
> We may consider dropping it some day when it's not needed.

Sure, good idea.

Updated this commit now, and tests (make check) still pass.

> Thanks,

Thanks,
Maciej



^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v5 19/36] vfio/migration: Convert bytes_transferred counter to atomic
  2025-02-19 20:34 ` [PATCH v5 19/36] vfio/migration: Convert bytes_transferred counter to atomic Maciej S. Szmigiero
@ 2025-02-26  7:52   ` Cédric Le Goater
  2025-02-26 13:55     ` Maciej S. Szmigiero
  2025-02-26 16:20   ` Cédric Le Goater
  1 sibling, 1 reply; 120+ messages in thread
From: Cédric Le Goater @ 2025-02-26  7:52 UTC (permalink / raw)
  To: Maciej S. Szmigiero, Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Eric Blake, Markus Armbruster,
	Daniel P . Berrangé, Avihai Horon, Joao Martins, qemu-devel

On 2/19/25 21:34, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> 
> So it can be safely accessed from multiple threads.
> 
> The variable's type needs to be changed to unsigned long since
> 32-bit host platforms lack the necessary atomic addition operations
> on 64-bit variables.
> 
> Using 32-bit counters on 32-bit host platforms should not be a problem
> in practice since they can't realistically address more memory anyway.

Is it useful to have VFIO on 32-bit host platforms?

If not, VFIO PCI should depend on (AARCH64 || PPC64 || X86_64) and we
could drop this patch. Let's address that independently.

Thanks,

C.






> 
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> ---
>   hw/vfio/migration.c | 8 ++++----
>   1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 03890eaa48a9..5532787be63b 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -55,7 +55,7 @@
>    */
>   #define VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE (1 * MiB)
>   
> -static int64_t bytes_transferred;
> +static unsigned long bytes_transferred;
>   
>   static const char *mig_state_to_str(enum vfio_device_mig_state state)
>   {
> @@ -391,7 +391,7 @@ static ssize_t vfio_save_block(QEMUFile *f, VFIOMigration *migration)
>       qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
>       qemu_put_be64(f, data_size);
>       qemu_put_buffer(f, migration->data_buffer, data_size);
> -    bytes_transferred += data_size;
> +    qatomic_add(&bytes_transferred, data_size);
>   
>       trace_vfio_save_block(migration->vbasedev->name, data_size);
>   
> @@ -1013,12 +1013,12 @@ static int vfio_block_migration(VFIODevice *vbasedev, Error *err, Error **errp)
>   
>   int64_t vfio_mig_bytes_transferred(void)
>   {
> -    return bytes_transferred;
> +    return MIN(qatomic_read(&bytes_transferred), INT64_MAX);
>   }
>   
>   void vfio_reset_bytes_transferred(void)
>   {
> -    bytes_transferred = 0;
> +    qatomic_set(&bytes_transferred, 0);
>   }
>   
>   /*
> 



^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v5 20/36] vfio/migration: Add vfio_add_bytes_transferred()
  2025-02-19 20:34 ` [PATCH v5 20/36] vfio/migration: Add vfio_add_bytes_transferred() Maciej S. Szmigiero
@ 2025-02-26  8:06   ` Cédric Le Goater
  2025-02-26 15:45     ` Maciej S. Szmigiero
  0 siblings, 1 reply; 120+ messages in thread
From: Cédric Le Goater @ 2025-02-26  8:06 UTC (permalink / raw)
  To: Maciej S. Szmigiero, Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Eric Blake, Markus Armbruster,
	Daniel P . Berrangé, Avihai Horon, Joao Martins, qemu-devel

On 2/19/25 21:34, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> 
> This way bytes_transferred can also be incremented in translation
> units other than migration.c.
> 
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>

Looks good. Just a small aesthetic issue.

> ---
>   hw/vfio/migration.c           | 7 ++++++-
>   include/hw/vfio/vfio-common.h | 1 +
>   2 files changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 5532787be63b..e9645cb9d088 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -391,7 +391,7 @@ static ssize_t vfio_save_block(QEMUFile *f, VFIOMigration *migration)
>       qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
>       qemu_put_be64(f, data_size);
>       qemu_put_buffer(f, migration->data_buffer, data_size);
> -    qatomic_add(&bytes_transferred, data_size);
> +    vfio_add_bytes_transferred(data_size);
>   
>       trace_vfio_save_block(migration->vbasedev->name, data_size);
>   
> @@ -1021,6 +1021,11 @@ void vfio_reset_bytes_transferred(void)
>       qatomic_set(&bytes_transferred, 0);
>   }
>   
> +void vfio_add_bytes_transferred(unsigned long val)

vfio_migration_add_bytes_transferred()


Thanks,

C.



> +{
> +    qatomic_add(&bytes_transferred, val);
> +}
> +
>   /*
>    * Return true when either migration initialized or blocker registered.
>    * Currently only return false when adding blocker fails which will
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index ac35136a1105..70f2a1891ed1 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -274,6 +274,7 @@ void vfio_unblock_multiple_devices_migration(void);
>   bool vfio_viommu_preset(VFIODevice *vbasedev);
>   int64_t vfio_mig_bytes_transferred(void);
>   void vfio_reset_bytes_transferred(void);
> +void vfio_add_bytes_transferred(unsigned long val);
>   bool vfio_device_state_is_running(VFIODevice *vbasedev);
>   bool vfio_device_state_is_precopy(VFIODevice *vbasedev);
>   
> 



^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v5 21/36] vfio/migration: Move migration channel flags to vfio-common.h header file
  2025-02-19 20:34 ` [PATCH v5 21/36] vfio/migration: Move migration channel flags to vfio-common.h header file Maciej S. Szmigiero
@ 2025-02-26  8:19   ` Cédric Le Goater
  0 siblings, 0 replies; 120+ messages in thread
From: Cédric Le Goater @ 2025-02-26  8:19 UTC (permalink / raw)
  To: Maciej S. Szmigiero, Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Eric Blake, Markus Armbruster,
	Daniel P . Berrangé, Avihai Horon, Joao Martins, qemu-devel

On 2/19/25 21:34, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> 
> This way they can also be referenced in translation
> units other than migration.c.
> 
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>


Reviewed-by: Cédric Le Goater <clg@redhat.com>

Thanks,

C.


> ---
>   hw/vfio/migration.c           | 17 -----------------
>   include/hw/vfio/vfio-common.h | 17 +++++++++++++++++
>   2 files changed, 17 insertions(+), 17 deletions(-)
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index e9645cb9d088..46adb798352f 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -31,23 +31,6 @@
>   #include "trace.h"
>   #include "hw/hw.h"
>   
> -/*
> - * Flags to be used as unique delimiters for VFIO devices in the migration
> - * stream. These flags are composed as:
> - * 0xffffffff => MSB 32-bit all 1s
> - * 0xef10     => Magic ID, represents emulated (virtual) function IO
> - * 0x0000     => 16-bits reserved for flags
> - *
> - * The beginning of state information is marked by _DEV_CONFIG_STATE,
> - * _DEV_SETUP_STATE, or _DEV_DATA_STATE, respectively. The end of a
> - * certain state information is marked by _END_OF_STATE.
> - */
> -#define VFIO_MIG_FLAG_END_OF_STATE      (0xffffffffef100001ULL)
> -#define VFIO_MIG_FLAG_DEV_CONFIG_STATE  (0xffffffffef100002ULL)
> -#define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xffffffffef100003ULL)
> -#define VFIO_MIG_FLAG_DEV_DATA_STATE    (0xffffffffef100004ULL)
> -#define VFIO_MIG_FLAG_DEV_INIT_DATA_SENT (0xffffffffef100005ULL)
> -
>   /*
>    * This is an arbitrary size based on migration of mlx5 devices, where typically
>    * total device migration size is on the order of 100s of MB. Testing with
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 70f2a1891ed1..64ee3b1a2547 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -36,6 +36,23 @@
>   
>   #define VFIO_MSG_PREFIX "vfio %s: "
>   
> +/*
> + * Flags to be used as unique delimiters for VFIO devices in the migration
> + * stream. These flags are composed as:
> + * 0xffffffff => MSB 32-bit all 1s
> + * 0xef10     => Magic ID, represents emulated (virtual) function IO
> + * 0x0000     => 16-bits reserved for flags
> + *
> + * The beginning of state information is marked by _DEV_CONFIG_STATE,
> + * _DEV_SETUP_STATE, or _DEV_DATA_STATE, respectively. The end of a
> + * certain state information is marked by _END_OF_STATE.
> + */
> +#define VFIO_MIG_FLAG_END_OF_STATE      (0xffffffffef100001ULL)
> +#define VFIO_MIG_FLAG_DEV_CONFIG_STATE  (0xffffffffef100002ULL)
> +#define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xffffffffef100003ULL)
> +#define VFIO_MIG_FLAG_DEV_DATA_STATE    (0xffffffffef100004ULL)
> +#define VFIO_MIG_FLAG_DEV_INIT_DATA_SENT (0xffffffffef100005ULL)
> +
>   enum {
>       VFIO_DEVICE_TYPE_PCI = 0,
>       VFIO_DEVICE_TYPE_PLATFORM = 1,
> 



^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v5 22/36] vfio/migration: Multifd device state transfer support - basic types
  2025-02-19 20:34 ` [PATCH v5 22/36] vfio/migration: Multifd device state transfer support - basic types Maciej S. Szmigiero
@ 2025-02-26  8:52   ` Cédric Le Goater
  2025-02-26 16:06     ` Maciej S. Szmigiero
  0 siblings, 1 reply; 120+ messages in thread
From: Cédric Le Goater @ 2025-02-26  8:52 UTC (permalink / raw)
  To: Maciej S. Szmigiero, Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Eric Blake, Markus Armbruster,
	Daniel P . Berrangé, Avihai Horon, Joao Martins, qemu-devel

On 2/19/25 21:34, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> 
> Add basic types and flags used by VFIO multifd device state transfer
> support.
> 
> Since we'll be introducing a lot of multifd-transfer-specific code,
> add a new file, migration-multifd.c, to house it, wired into the main VFIO
> migration code (migration.c) via the migration-multifd.h header file.
> 
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> ---
>   hw/vfio/meson.build         |  1 +
>   hw/vfio/migration-multifd.c | 31 +++++++++++++++++++++++++++++++
>   hw/vfio/migration-multifd.h | 15 +++++++++++++++
>   hw/vfio/migration.c         |  1 +
>   4 files changed, 48 insertions(+)
>   create mode 100644 hw/vfio/migration-multifd.c
>   create mode 100644 hw/vfio/migration-multifd.h
> 
> diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
> index bba776f75cc7..260d65febd6b 100644
> --- a/hw/vfio/meson.build
> +++ b/hw/vfio/meson.build
> @@ -5,6 +5,7 @@ vfio_ss.add(files(
>     'container-base.c',
>     'container.c',
>     'migration.c',
> +  'migration-multifd.c',
>     'cpr.c',
>   ))
>   vfio_ss.add(when: 'CONFIG_PSERIES', if_true: files('spapr.c'))
> diff --git a/hw/vfio/migration-multifd.c b/hw/vfio/migration-multifd.c
> new file mode 100644
> index 000000000000..0c3185a26242
> --- /dev/null
> +++ b/hw/vfio/migration-multifd.c
> @@ -0,0 +1,31 @@
> +/*

Please add :

   SPDX-License-Identifier: GPL-2.0-or-later

in new files.


Thanks,

C.




> + * Multifd VFIO migration
> + *
> + * Copyright (C) 2024,2025 Oracle and/or its affiliates.
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + */
> +
> +#include "qemu/osdep.h"
> +#include "hw/vfio/vfio-common.h"
> +#include "migration/misc.h"
> +#include "qapi/error.h"
> +#include "qemu/error-report.h"
> +#include "qemu/lockable.h"
> +#include "qemu/main-loop.h"
> +#include "qemu/thread.h"
> +#include "migration/qemu-file.h"
> +#include "migration-multifd.h"
> +#include "trace.h"
> +
> +#define VFIO_DEVICE_STATE_CONFIG_STATE (1)
> +
> +#define VFIO_DEVICE_STATE_PACKET_VER_CURRENT (0)
> +
> +typedef struct VFIODeviceStatePacket {
> +    uint32_t version;
> +    uint32_t idx;
> +    uint32_t flags;
> +    uint8_t data[0];
> +} QEMU_PACKED VFIODeviceStatePacket;
> diff --git a/hw/vfio/migration-multifd.h b/hw/vfio/migration-multifd.h
> new file mode 100644
> index 000000000000..64d117b27210
> --- /dev/null
> +++ b/hw/vfio/migration-multifd.h
> @@ -0,0 +1,15 @@
> +/*
> + * Multifd VFIO migration
> + *
> + * Copyright (C) 2024,2025 Oracle and/or its affiliates.
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + */
> +
> +#ifndef HW_VFIO_MIGRATION_MULTIFD_H
> +#define HW_VFIO_MIGRATION_MULTIFD_H
> +
> +#include "hw/vfio/vfio-common.h"
> +
> +#endif
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 46adb798352f..7b79be6ad293 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -23,6 +23,7 @@
>   #include "migration/qemu-file.h"
>   #include "migration/register.h"
>   #include "migration/blocker.h"
> +#include "migration-multifd.h"
>   #include "qapi/error.h"
>   #include "qapi/qapi-events-vfio.h"
>   #include "exec/ramlist.h"
> 



^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v5 23/36] vfio/migration: Multifd device state transfer support - VFIOStateBuffer(s)
  2025-02-19 20:34 ` [PATCH v5 23/36] vfio/migration: Multifd device state transfer support - VFIOStateBuffer(s) Maciej S. Szmigiero
@ 2025-02-26  8:54   ` Cédric Le Goater
  2025-03-02 13:00   ` Avihai Horon
  1 sibling, 0 replies; 120+ messages in thread
From: Cédric Le Goater @ 2025-02-26  8:54 UTC (permalink / raw)
  To: Maciej S. Szmigiero, Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Eric Blake, Markus Armbruster,
	Daniel P . Berrangé, Avihai Horon, Joao Martins, qemu-devel

On 2/19/25 21:34, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> 
> Add VFIOStateBuffer(s) types and the associated methods.
> 
> These store received device state buffers and config state waiting to be
> loaded into the device.
> 
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>


Reviewed-by: Cédric Le Goater <clg@redhat.com>

Thanks,

C.


> ---
>   hw/vfio/migration-multifd.c | 54 +++++++++++++++++++++++++++++++++++++
>   1 file changed, 54 insertions(+)
> 
> diff --git a/hw/vfio/migration-multifd.c b/hw/vfio/migration-multifd.c
> index 0c3185a26242..760b110a39b9 100644
> --- a/hw/vfio/migration-multifd.c
> +++ b/hw/vfio/migration-multifd.c
> @@ -29,3 +29,57 @@ typedef struct VFIODeviceStatePacket {
>       uint32_t flags;
>       uint8_t data[0];
>   } QEMU_PACKED VFIODeviceStatePacket;
> +
> +/* type safety */
> +typedef struct VFIOStateBuffers {
> +    GArray *array;
> +} VFIOStateBuffers;
> +
> +typedef struct VFIOStateBuffer {
> +    bool is_present;
> +    char *data;
> +    size_t len;
> +} VFIOStateBuffer;
> +
> +static void vfio_state_buffer_clear(gpointer data)
> +{
> +    VFIOStateBuffer *lb = data;
> +
> +    if (!lb->is_present) {
> +        return;
> +    }
> +
> +    g_clear_pointer(&lb->data, g_free);
> +    lb->is_present = false;
> +}
> +
> +static void vfio_state_buffers_init(VFIOStateBuffers *bufs)
> +{
> +    bufs->array = g_array_new(FALSE, TRUE, sizeof(VFIOStateBuffer));
> +    g_array_set_clear_func(bufs->array, vfio_state_buffer_clear);
> +}
> +
> +static void vfio_state_buffers_destroy(VFIOStateBuffers *bufs)
> +{
> +    g_clear_pointer(&bufs->array, g_array_unref);
> +}
> +
> +static void vfio_state_buffers_assert_init(VFIOStateBuffers *bufs)
> +{
> +    assert(bufs->array);
> +}
> +
> +static guint vfio_state_buffers_size_get(VFIOStateBuffers *bufs)
> +{
> +    return bufs->array->len;
> +}
> +
> +static void vfio_state_buffers_size_set(VFIOStateBuffers *bufs, guint size)
> +{
> +    g_array_set_size(bufs->array, size);
> +}
> +
> +static VFIOStateBuffer *vfio_state_buffers_at(VFIOStateBuffers *bufs, guint idx)
> +{
> +    return &g_array_index(bufs->array, VFIOStateBuffer, idx);
> +}
> 



^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v5 24/36] vfio/migration: Multifd device state transfer - add support checking function
  2025-02-19 20:34 ` [PATCH v5 24/36] vfio/migration: Multifd device state transfer - add support checking function Maciej S. Szmigiero
@ 2025-02-26  8:54   ` Cédric Le Goater
  0 siblings, 0 replies; 120+ messages in thread
From: Cédric Le Goater @ 2025-02-26  8:54 UTC (permalink / raw)
  To: Maciej S. Szmigiero, Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Eric Blake, Markus Armbruster,
	Daniel P . Berrangé, Avihai Horon, Joao Martins, qemu-devel

On 2/19/25 21:34, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> 
> Add a vfio_multifd_transfer_supported() function that tells whether
> multifd device state transfer is supported.
> 
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>


Reviewed-by: Cédric Le Goater <clg@redhat.com>

Thanks,

C.


> ---
>   hw/vfio/migration-multifd.c | 6 ++++++
>   hw/vfio/migration-multifd.h | 2 ++
>   2 files changed, 8 insertions(+)
> 
> diff --git a/hw/vfio/migration-multifd.c b/hw/vfio/migration-multifd.c
> index 760b110a39b9..7328ad8e925c 100644
> --- a/hw/vfio/migration-multifd.c
> +++ b/hw/vfio/migration-multifd.c
> @@ -83,3 +83,9 @@ static VFIOStateBuffer *vfio_state_buffers_at(VFIOStateBuffers *bufs, guint idx)
>   {
>       return &g_array_index(bufs->array, VFIOStateBuffer, idx);
>   }
> +
> +bool vfio_multifd_transfer_supported(void)
> +{
> +    return multifd_device_state_supported() &&
> +        migrate_send_switchover_start();
> +}
> diff --git a/hw/vfio/migration-multifd.h b/hw/vfio/migration-multifd.h
> index 64d117b27210..8fe004c1da81 100644
> --- a/hw/vfio/migration-multifd.h
> +++ b/hw/vfio/migration-multifd.h
> @@ -12,4 +12,6 @@
>   
>   #include "hw/vfio/vfio-common.h"
>   
> +bool vfio_multifd_transfer_supported(void);
> +
>   #endif
> 



^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v5 25/36] vfio/migration: Multifd device state transfer support - receive init/cleanup
  2025-02-19 20:34 ` [PATCH v5 25/36] vfio/migration: Multifd device state transfer support - receive init/cleanup Maciej S. Szmigiero
@ 2025-02-26 10:14   ` Cédric Le Goater
  2025-02-26 17:22     ` Cédric Le Goater
  2025-02-26 17:28   ` Cédric Le Goater
  2025-02-26 17:46   ` Cédric Le Goater
  2 siblings, 1 reply; 120+ messages in thread
From: Cédric Le Goater @ 2025-02-26 10:14 UTC (permalink / raw)
  To: Maciej S. Szmigiero, Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Eric Blake, Markus Armbruster,
	Daniel P . Berrangé, Avihai Horon, Joao Martins, qemu-devel

On 2/19/25 21:34, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> 
> Add the VFIOMultifd data structure that will contain most of the
> receive-side data, together with its init/cleanup methods.
> 
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> ---
>   hw/vfio/migration-multifd.c   | 33 +++++++++++++++++++++++++++++++++
>   hw/vfio/migration-multifd.h   |  8 ++++++++
>   hw/vfio/migration.c           | 29 +++++++++++++++++++++++++++--
>   include/hw/vfio/vfio-common.h |  3 +++
>   4 files changed, 71 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/vfio/migration-multifd.c b/hw/vfio/migration-multifd.c
> index 7328ad8e925c..c2defc0efef0 100644
> --- a/hw/vfio/migration-multifd.c
> +++ b/hw/vfio/migration-multifd.c
> @@ -41,6 +41,9 @@ typedef struct VFIOStateBuffer {
>       size_t len;
>   } VFIOStateBuffer;
>   
> +typedef struct VFIOMultifd {
> +} VFIOMultifd;
> +
>   static void vfio_state_buffer_clear(gpointer data)
>   {
>       VFIOStateBuffer *lb = data;
> @@ -84,8 +87,38 @@ static VFIOStateBuffer *vfio_state_buffers_at(VFIOStateBuffers *bufs, guint idx)
>       return &g_array_index(bufs->array, VFIOStateBuffer, idx);
>   }
>   
> +VFIOMultifd *vfio_multifd_new(void)
> +{
> +    VFIOMultifd *multifd = g_new(VFIOMultifd, 1);
> +
> +    return multifd;
> +}
> +
> +void vfio_multifd_free(VFIOMultifd *multifd)
> +{
> +    g_free(multifd);
> +}
> +
>   bool vfio_multifd_transfer_supported(void)
>   {
>       return multifd_device_state_supported() &&
>           migrate_send_switchover_start();
>   }
> +
> +bool vfio_multifd_transfer_enabled(VFIODevice *vbasedev)
> +{
> +    return false;
> +}
> +
> +bool vfio_multifd_transfer_setup(VFIODevice *vbasedev, Error **errp)
> +{
> +    if (vfio_multifd_transfer_enabled(vbasedev) &&
> +        !vfio_multifd_transfer_supported()) {
> +        error_setg(errp,
> +                   "%s: Multifd device transfer requested but unsupported in the current config",
> +                   vbasedev->name);
> +        return false;
> +    }
> +
> +    return true;
> +}
> diff --git a/hw/vfio/migration-multifd.h b/hw/vfio/migration-multifd.h
> index 8fe004c1da81..1eefba3b2eed 100644
> --- a/hw/vfio/migration-multifd.h
> +++ b/hw/vfio/migration-multifd.h
> @@ -12,6 +12,14 @@
>   
>   #include "hw/vfio/vfio-common.h"
>   
> +typedef struct VFIOMultifd VFIOMultifd;
> +
> +VFIOMultifd *vfio_multifd_new(void);
> +void vfio_multifd_free(VFIOMultifd *multifd);
> +
>   bool vfio_multifd_transfer_supported(void);
> +bool vfio_multifd_transfer_enabled(VFIODevice *vbasedev);
> +
> +bool vfio_multifd_transfer_setup(VFIODevice *vbasedev, Error **errp);
>   
>   #endif
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 7b79be6ad293..4311de763885 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -674,15 +674,40 @@ static void vfio_save_state(QEMUFile *f, void *opaque)
>   static int vfio_load_setup(QEMUFile *f, void *opaque, Error **errp)
>   {
>       VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +    int ret;
> +
> +    if (!vfio_multifd_transfer_setup(vbasedev, errp)) {
> +        return -EINVAL;
> +    }

This check on the consistency of the settings confused me a little. Even
though it is simple, I would have put it in a separate patch for clarity.

The rest looks good.


Thanks,

C.



> +    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RESUMING,
> +                                   migration->device_state, errp);
> +    if (ret) {
> +        return ret;
> +    }
>   
> -    return vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RESUMING,
> -                                    vbasedev->migration->device_state, errp);
> +    if (vfio_multifd_transfer_enabled(vbasedev)) {
> +        assert(!migration->multifd);
> +        migration->multifd = vfio_multifd_new();
> +    }
> +
> +    return 0;
> +}
> +
> +static void vfio_multifd_cleanup(VFIODevice *vbasedev)
> +{
> +    VFIOMigration *migration = vbasedev->migration;
> +
> +    g_clear_pointer(&migration->multifd, vfio_multifd_free);
>   }
>   
>   static int vfio_load_cleanup(void *opaque)
>   {
>       VFIODevice *vbasedev = opaque;
>   
> +    vfio_multifd_cleanup(vbasedev);
> +
>       vfio_migration_cleanup(vbasedev);
>       trace_vfio_load_cleanup(vbasedev->name);
>   
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 64ee3b1a2547..ab110198bd6b 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -78,6 +78,8 @@ typedef struct VFIORegion {
>       uint8_t nr; /* cache the region number for debug */
>   } VFIORegion;
>   
> +typedef struct VFIOMultifd VFIOMultifd;
> +
>   typedef struct VFIOMigration {
>       struct VFIODevice *vbasedev;
>       VMChangeStateEntry *vm_state;
> @@ -89,6 +91,7 @@ typedef struct VFIOMigration {
>       uint64_t mig_flags;
>       uint64_t precopy_init_size;
>       uint64_t precopy_dirty_size;
> +    VFIOMultifd *multifd;
>       bool initial_data_sent;
>   
>       bool event_save_iterate_started;
> 



^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v5 26/36] vfio/migration: Multifd device state transfer support - received buffers queuing
  2025-02-19 20:34 ` [PATCH v5 26/36] vfio/migration: Multifd device state transfer support - received buffers queuing Maciej S. Szmigiero
@ 2025-02-26 10:43   ` Cédric Le Goater
  2025-02-26 21:04     ` Maciej S. Szmigiero
  2025-03-02 13:12   ` Avihai Horon
  1 sibling, 1 reply; 120+ messages in thread
From: Cédric Le Goater @ 2025-02-26 10:43 UTC (permalink / raw)
  To: Maciej S. Szmigiero, Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Eric Blake, Markus Armbruster,
	Daniel P . Berrangé, Avihai Horon, Joao Martins, qemu-devel

On 2/19/25 21:34, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> 
> The data received via multifd needs to be reassembled since device state
> packets sent via different multifd channels can arrive out of order.
> 
> Therefore, each VFIO device state packet carries a header indicating its
> position in the stream.
> The raw device state data is saved into a VFIOStateBuffer for later
> in-order loading into the device.
> 
> The last such VFIO device state packet should have
> VFIO_DEVICE_STATE_CONFIG_STATE flag set and carry the device config state.
> 
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> ---
>   hw/vfio/migration-multifd.c | 103 ++++++++++++++++++++++++++++++++++++
>   hw/vfio/migration-multifd.h |   3 ++
>   hw/vfio/migration.c         |   1 +
>   hw/vfio/trace-events        |   1 +
>   4 files changed, 108 insertions(+)
> 
> diff --git a/hw/vfio/migration-multifd.c b/hw/vfio/migration-multifd.c
> index c2defc0efef0..5d5ee1393674 100644
> --- a/hw/vfio/migration-multifd.c
> +++ b/hw/vfio/migration-multifd.c
> @@ -42,6 +42,11 @@ typedef struct VFIOStateBuffer {
>   } VFIOStateBuffer;
>   
>   typedef struct VFIOMultifd {
> +    VFIOStateBuffers load_bufs;
> +    QemuCond load_bufs_buffer_ready_cond;
> +    QemuMutex load_bufs_mutex; /* Lock order: this lock -> BQL */
> +    uint32_t load_buf_idx;
> +    uint32_t load_buf_idx_last;
>   } VFIOMultifd;
>   
>   static void vfio_state_buffer_clear(gpointer data)
> @@ -87,15 +92,113 @@ static VFIOStateBuffer *vfio_state_buffers_at(VFIOStateBuffers *bufs, guint idx)
>       return &g_array_index(bufs->array, VFIOStateBuffer, idx);
>   }
>   

This routine expects load_bufs_mutex to be locked? Maybe say so.

> +static bool vfio_load_state_buffer_insert(VFIODevice *vbasedev,

Could you pass a VFIOMultifd * instead?

> +                                          VFIODeviceStatePacket *packet,
> +                                          size_t packet_total_size,
> +                                          Error **errp)
> +{
> +    VFIOMigration *migration = vbasedev->migration;
> +    VFIOMultifd *multifd = migration->multifd;
> +    VFIOStateBuffer *lb;
> +
> +    vfio_state_buffers_assert_init(&multifd->load_bufs);
> +    if (packet->idx >= vfio_state_buffers_size_get(&multifd->load_bufs)) {
> +        vfio_state_buffers_size_set(&multifd->load_bufs, packet->idx + 1);
> +    }
> +
> +    lb = vfio_state_buffers_at(&multifd->load_bufs, packet->idx);
> +    if (lb->is_present) {
> +        error_setg(errp, "state buffer %" PRIu32 " already filled",
> +                   packet->idx);
> +        return false;
> +    }
> +
> +    assert(packet->idx >= multifd->load_buf_idx);
> +
> +    lb->data = g_memdup2(&packet->data, packet_total_size - sizeof(*packet));
> +    lb->len = packet_total_size - sizeof(*packet);
> +    lb->is_present = true;
> +
> +    return true;
> +}
> +
> +bool vfio_load_state_buffer(void *opaque, char *data, size_t data_size,
> +                            Error **errp)
  

AFAICS, the only user of the .load_state_buffer() handler is
multifd_device_state_recv().

Please rename to vfio_multifd_load_state_buffer().

> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +    VFIOMultifd *multifd = migration->multifd;
> +    VFIODeviceStatePacket *packet = (VFIODeviceStatePacket *)data;
> +
> +    /*
> +     * Holding BQL here would violate the lock order and can cause
> +     * a deadlock once we attempt to lock load_bufs_mutex below.
> +     */
> +    assert(!bql_locked());
> +
> +    if (!vfio_multifd_transfer_enabled(vbasedev)) {
> +        error_setg(errp,
> +                   "got device state packet but not doing multifd transfer");
> +        return false;
> +    }
> +
> +    assert(multifd);
> +
> +    if (data_size < sizeof(*packet)) {
> +        error_setg(errp, "packet too short at %zu (min is %zu)",
> +                   data_size, sizeof(*packet));
> +        return false;
> +    }
> +
> +    if (packet->version != VFIO_DEVICE_STATE_PACKET_VER_CURRENT) {
> +        error_setg(errp, "packet has unknown version %" PRIu32,
> +                   packet->version);
> +        return false;
> +    }
> +
> +    if (packet->idx == UINT32_MAX) {
> +        error_setg(errp, "packet has too high idx");

or "packet index is invalid"?

> +        return false;
> +    }
> +
> +    trace_vfio_load_state_device_buffer_incoming(vbasedev->name, packet->idx);
> +
> +    QEMU_LOCK_GUARD(&multifd->load_bufs_mutex);

Using WITH_QEMU_LOCK_GUARD() would be cleaner I think.

> +
> +    /* config state packet should be the last one in the stream */
> +    if (packet->flags & VFIO_DEVICE_STATE_CONFIG_STATE) {
> +        multifd->load_buf_idx_last = packet->idx;
> +    }
> +
> +    if (!vfio_load_state_buffer_insert(vbasedev, packet, data_size, errp)) {
> +        return false;
> +    }
> +
> +    qemu_cond_signal(&multifd->load_bufs_buffer_ready_cond);
> +
> +    return true;
> +}
> +
>   VFIOMultifd *vfio_multifd_new(void)
>   {
>       VFIOMultifd *multifd = g_new(VFIOMultifd, 1);
>   
> +    vfio_state_buffers_init(&multifd->load_bufs);
> +
> +    qemu_mutex_init(&multifd->load_bufs_mutex);
> +
> +    multifd->load_buf_idx = 0;
> +    multifd->load_buf_idx_last = UINT32_MAX;
> +    qemu_cond_init(&multifd->load_bufs_buffer_ready_cond);
> +
>       return multifd;
>   }
>   
>   void vfio_multifd_free(VFIOMultifd *multifd)
>   {
> +    qemu_cond_destroy(&multifd->load_bufs_buffer_ready_cond);
> +    qemu_mutex_destroy(&multifd->load_bufs_mutex);
> +
>       g_free(multifd);
>   }
>   
> diff --git a/hw/vfio/migration-multifd.h b/hw/vfio/migration-multifd.h
> index 1eefba3b2eed..d5ab7d6f85f5 100644
> --- a/hw/vfio/migration-multifd.h
> +++ b/hw/vfio/migration-multifd.h
> @@ -22,4 +22,7 @@ bool vfio_multifd_transfer_enabled(VFIODevice *vbasedev);
>   
>   bool vfio_multifd_transfer_setup(VFIODevice *vbasedev, Error **errp);
>   
> +bool vfio_load_state_buffer(void *opaque, char *data, size_t data_size,
> +                            Error **errp);
> +
>   #endif
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 4311de763885..abaf4d08d4a9 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -806,6 +806,7 @@ static const SaveVMHandlers savevm_vfio_handlers = {
>       .load_setup = vfio_load_setup,
>       .load_cleanup = vfio_load_cleanup,
>       .load_state = vfio_load_state,
> +    .load_state_buffer = vfio_load_state_buffer,
>       .switchover_ack_needed = vfio_switchover_ack_needed,
>   };
>   
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index 1bebe9877d88..042a3dc54a33 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -153,6 +153,7 @@ vfio_load_device_config_state_start(const char *name) " (%s)"
>   vfio_load_device_config_state_end(const char *name) " (%s)"
>   vfio_load_state(const char *name, uint64_t data) " (%s) data 0x%"PRIx64
>   vfio_load_state_device_data(const char *name, uint64_t data_size, int ret) " (%s) size %"PRIu64" ret %d"
> +vfio_load_state_device_buffer_incoming(const char *name, uint32_t idx) " (%s) idx %"PRIu32
>   vfio_migration_realize(const char *name) " (%s)"
>   vfio_migration_set_device_state(const char *name, const char *state) " (%s) state %s"
>   vfio_migration_set_state(const char *name, const char *new_state, const char *recover_state) " (%s) new state %s, recover state %s"
> 


Thanks,

C.





^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v5 27/36] vfio/migration: Multifd device state transfer support - load thread
  2025-02-19 20:34 ` [PATCH v5 27/36] vfio/migration: Multifd device state transfer support - load thread Maciej S. Szmigiero
@ 2025-02-26 13:49   ` Cédric Le Goater
  2025-02-26 21:05     ` Maciej S. Szmigiero
  2025-03-02 14:19     ` Avihai Horon
  2025-03-02 14:15   ` Avihai Horon
  1 sibling, 2 replies; 120+ messages in thread
From: Cédric Le Goater @ 2025-02-26 13:49 UTC (permalink / raw)
  To: Maciej S. Szmigiero, Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Eric Blake, Markus Armbruster,
	Daniel P . Berrangé, Avihai Horon, Joao Martins, qemu-devel

On 2/19/25 21:34, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> 
> Since it's important to finish loading device state transferred via the
> main migration channel (via save_live_iterate SaveVMHandler) before
> starting loading the data asynchronously transferred via multifd, the thread
> doing the actual loading of the multifd transferred data is only started
> from switchover_start SaveVMHandler.
> 
> switchover_start handler is called when MIG_CMD_SWITCHOVER_START
> sub-command of QEMU_VM_COMMAND is received via the main migration channel.
> 
> This sub-command is only sent after all save_live_iterate data have already
> been posted so it is safe to commence loading of the multifd-transferred
> device state upon receiving it - loading of save_live_iterate data happens
> synchronously in the main migration thread (much like the processing of
> MIG_CMD_SWITCHOVER_START) so by the time MIG_CMD_SWITCHOVER_START is
> processed all the preceding data must have already been loaded.
> 
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> ---
>   hw/vfio/migration-multifd.c | 225 ++++++++++++++++++++++++++++++++++++
>   hw/vfio/migration-multifd.h |   2 +
>   hw/vfio/migration.c         |  12 ++
>   hw/vfio/trace-events        |   5 +
>   4 files changed, 244 insertions(+)
> 
> diff --git a/hw/vfio/migration-multifd.c b/hw/vfio/migration-multifd.c
> index 5d5ee1393674..b3a88c062769 100644
> --- a/hw/vfio/migration-multifd.c
> +++ b/hw/vfio/migration-multifd.c
> @@ -42,8 +42,13 @@ typedef struct VFIOStateBuffer {
>   } VFIOStateBuffer;
>   
>   typedef struct VFIOMultifd {
> +    QemuThread load_bufs_thread;
> +    bool load_bufs_thread_running;
> +    bool load_bufs_thread_want_exit;
> +
>       VFIOStateBuffers load_bufs;
>       QemuCond load_bufs_buffer_ready_cond;
> +    QemuCond load_bufs_thread_finished_cond;
>       QemuMutex load_bufs_mutex; /* Lock order: this lock -> BQL */
>       uint32_t load_buf_idx;
>       uint32_t load_buf_idx_last;
> @@ -179,6 +184,175 @@ bool vfio_load_state_buffer(void *opaque, char *data, size_t data_size,
>       return true;
>   }
>   
> +static int vfio_load_bufs_thread_load_config(VFIODevice *vbasedev)
> +{
> +    return -EINVAL;
> +}


please move to next patch.

> +static VFIOStateBuffer *vfio_load_state_buffer_get(VFIOMultifd *multifd)
> +{
> +    VFIOStateBuffer *lb;
> +    guint bufs_len;

guint:  I guess it's ok to use here. It is not common practice in VFIO.

> +
> +    bufs_len = vfio_state_buffers_size_get(&multifd->load_bufs);
> +    if (multifd->load_buf_idx >= bufs_len) {
> +        assert(multifd->load_buf_idx == bufs_len);
> +        return NULL;
> +    }
> +
> +    lb = vfio_state_buffers_at(&multifd->load_bufs,
> +                               multifd->load_buf_idx);

Could be one line. minor.

> +    if (!lb->is_present) {
> +        return NULL;
> +    }
> +
> +    return lb;
> +}
> +
> +static bool vfio_load_state_buffer_write(VFIODevice *vbasedev,
> +                                         VFIOStateBuffer *lb,
> +                                         Error **errp)
> +{
> +    VFIOMigration *migration = vbasedev->migration;
> +    VFIOMultifd *multifd = migration->multifd;
> +    g_autofree char *buf = NULL;
> +    char *buf_cur;
> +    size_t buf_len;
> +
> +    if (!lb->len) {
> +        return true;
> +    }
> +
> +    trace_vfio_load_state_device_buffer_load_start(vbasedev->name,
> +                                                   multifd->load_buf_idx);

I think we can move this trace event to vfio_load_bufs_thread()

> +    /* lb might become re-allocated when we drop the lock */
> +    buf = g_steal_pointer(&lb->data);
> +    buf_cur = buf;
> +    buf_len = lb->len;
> +    while (buf_len > 0) {
> +        ssize_t wr_ret;
> +        int errno_save;
> +
> +        /*
> +         * Loading data to the device takes a while,
> +         * drop the lock during this process.
> +         */
> +        qemu_mutex_unlock(&multifd->load_bufs_mutex);
> +        wr_ret = write(migration->data_fd, buf_cur, buf_len);
> +        errno_save = errno;
> +        qemu_mutex_lock(&multifd->load_bufs_mutex);
> +
> +        if (wr_ret < 0) {
> +            error_setg(errp,
> +                       "writing state buffer %" PRIu32 " failed: %d",
> +                       multifd->load_buf_idx, errno_save);
> +            return false;
> +        }
> +
> +        assert(wr_ret <= buf_len);
> +        buf_len -= wr_ret;
> +        buf_cur += wr_ret;
> +    }
> +
> +    trace_vfio_load_state_device_buffer_load_end(vbasedev->name,
> +                                                 multifd->load_buf_idx);

and drop this trace event.

In which case, we can modify the parameters of vfio_load_state_buffer_write()
to directly take a 'VFIOMultifd *multifd' and an fd instead of "migration->data_fd".

> +
> +    return true;
> +}
> +
> +static bool vfio_load_bufs_thread_want_exit(VFIOMultifd *multifd,
> +                                            bool *should_quit)
> +{
> +    return multifd->load_bufs_thread_want_exit || qatomic_read(should_quit);
> +}
> +
> +/*
> + * This thread is spawned by vfio_multifd_switchover_start() which gets
> + * called upon encountering the switchover point marker in main migration
> + * stream.
> + *
> + * It exits after either:
> + * * completing loading the remaining device state and device config, OR:
> + * * encountering some error while doing the above, OR:
> + * * being forcefully aborted by the migration core by it setting should_quit
> + *   or by vfio_load_cleanup_load_bufs_thread() setting
> + *   multifd->load_bufs_thread_want_exit.
> + */
> +static bool vfio_load_bufs_thread(void *opaque, bool *should_quit, Error **errp)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +    VFIOMultifd *multifd = migration->multifd;
> +    bool ret = true;
> +    int config_ret;

Not needed IMO. See below.

> +
> +    assert(multifd);
> +    QEMU_LOCK_GUARD(&multifd->load_bufs_mutex);
> +
> +    assert(multifd->load_bufs_thread_running);

We could add a trace event for the start and the end of the thread.
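Something along these lines in hw/vfio/trace-events would do (the names are only a suggestion):

```
vfio_load_bufs_thread_start(const char *name) " (%s)"
vfio_load_bufs_thread_end(const char *name) " (%s)"
```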

> +    while (true) {
> +        VFIOStateBuffer *lb;
> +
> +        /*
> +         * Always check cancellation first after the buffer_ready wait below in
> +         * case that cond was signalled by vfio_load_cleanup_load_bufs_thread().
> +         */
> +        if (vfio_load_bufs_thread_want_exit(multifd, should_quit)) {
> +            error_setg(errp, "operation cancelled");
> +            ret = false;
> +            goto ret_signal;

goto thread_exit ?

> +        }
> +
> +        assert(multifd->load_buf_idx <= multifd->load_buf_idx_last);
> +
> +        lb = vfio_load_state_buffer_get(multifd);
> +        if (!lb) {
> +            trace_vfio_load_state_device_buffer_starved(vbasedev->name,
> +                                                        multifd->load_buf_idx);
> +            qemu_cond_wait(&multifd->load_bufs_buffer_ready_cond,
> +                           &multifd->load_bufs_mutex);
> +            continue;
> +        }
> +
> +        if (multifd->load_buf_idx == multifd->load_buf_idx_last) {
> +            break;
> +        }
> +
> +        if (multifd->load_buf_idx == 0) {
> +            trace_vfio_load_state_device_buffer_start(vbasedev->name);
> +        }
> +
> +        if (!vfio_load_state_buffer_write(vbasedev, lb, errp)) {
> +            ret = false;
> +            goto ret_signal;
> +        }
> +
> +        if (multifd->load_buf_idx == multifd->load_buf_idx_last - 1) {
> +            trace_vfio_load_state_device_buffer_end(vbasedev->name);
> +        }
> +
> +        multifd->load_buf_idx++;
> +    }

if ret is assigned true here, the "ret = false" can be dropped

> +    config_ret = vfio_load_bufs_thread_load_config(vbasedev);
> +    if (config_ret) {
> +        error_setg(errp, "load config state failed: %d", config_ret);
> +        ret = false;
> +    }

please move to next patch. This is adding nothing to this patch
since it's returning -EINVAL.


Thanks,

C.



> +ret_signal:
> +    /*
> +     * Notify possibly waiting vfio_load_cleanup_load_bufs_thread() that
> +     * this thread is exiting.
> +     */
> +    multifd->load_bufs_thread_running = false;
> +    qemu_cond_signal(&multifd->load_bufs_thread_finished_cond);
> +
> +    return ret;
> +}
> +
>   VFIOMultifd *vfio_multifd_new(void)
>   {
>       VFIOMultifd *multifd = g_new(VFIOMultifd, 1);
> @@ -191,11 +365,42 @@ VFIOMultifd *vfio_multifd_new(void)
>       multifd->load_buf_idx_last = UINT32_MAX;
>       qemu_cond_init(&multifd->load_bufs_buffer_ready_cond);
>   
> +    multifd->load_bufs_thread_running = false;
> +    multifd->load_bufs_thread_want_exit = false;
> +    qemu_cond_init(&multifd->load_bufs_thread_finished_cond);
> +
>       return multifd;
>   }
>   
> +/*
> + * Terminates vfio_load_bufs_thread by setting
> + * multifd->load_bufs_thread_want_exit and signalling all the conditions
> + * the thread could be blocked on.
> + *
> + * Waits for the thread to signal that it had finished.
> + */
> +static void vfio_load_cleanup_load_bufs_thread(VFIOMultifd *multifd)
> +{
> +    /* The lock order is load_bufs_mutex -> BQL so unlock BQL here first */
> +    bql_unlock();
> +    WITH_QEMU_LOCK_GUARD(&multifd->load_bufs_mutex) {
> +        while (multifd->load_bufs_thread_running) {
> +            multifd->load_bufs_thread_want_exit = true;
> +
> +            qemu_cond_signal(&multifd->load_bufs_buffer_ready_cond);
> +            qemu_cond_wait(&multifd->load_bufs_thread_finished_cond,
> +                           &multifd->load_bufs_mutex);
> +        }
> +    }
> +    bql_lock();
> +}
> +
>   void vfio_multifd_free(VFIOMultifd *multifd)
>   {
> +    vfio_load_cleanup_load_bufs_thread(multifd);
> +
> +    qemu_cond_destroy(&multifd->load_bufs_thread_finished_cond);
> +    vfio_state_buffers_destroy(&multifd->load_bufs);
>       qemu_cond_destroy(&multifd->load_bufs_buffer_ready_cond);
>       qemu_mutex_destroy(&multifd->load_bufs_mutex);
>   
> @@ -225,3 +430,23 @@ bool vfio_multifd_transfer_setup(VFIODevice *vbasedev, Error **errp)
>   
>       return true;
>   }
> +
> +int vfio_multifd_switchover_start(VFIODevice *vbasedev)
> +{
> +    VFIOMigration *migration = vbasedev->migration;
> +    VFIOMultifd *multifd = migration->multifd;
> +
> +    assert(multifd);
> +
> +    /* The lock order is load_bufs_mutex -> BQL so unlock BQL here first */
> +    bql_unlock();
> +    WITH_QEMU_LOCK_GUARD(&multifd->load_bufs_mutex) {
> +        assert(!multifd->load_bufs_thread_running);
> +        multifd->load_bufs_thread_running = true;
> +    }
> +    bql_lock();
> +
> +    qemu_loadvm_start_load_thread(vfio_load_bufs_thread, vbasedev);
> +
> +    return 0;
> +}
> diff --git a/hw/vfio/migration-multifd.h b/hw/vfio/migration-multifd.h
> index d5ab7d6f85f5..09cbb437d9d1 100644
> --- a/hw/vfio/migration-multifd.h
> +++ b/hw/vfio/migration-multifd.h
> @@ -25,4 +25,6 @@ bool vfio_multifd_transfer_setup(VFIODevice *vbasedev, Error **errp);
>   bool vfio_load_state_buffer(void *opaque, char *data, size_t data_size,
>                               Error **errp);
>   
> +int vfio_multifd_switchover_start(VFIODevice *vbasedev);
> +
>   #endif
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index abaf4d08d4a9..85f54cb22df2 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -793,6 +793,17 @@ static bool vfio_switchover_ack_needed(void *opaque)
>       return vfio_precopy_supported(vbasedev);
>   }
>   
> +static int vfio_switchover_start(void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +
> +    if (vfio_multifd_transfer_enabled(vbasedev)) {
> +        return vfio_multifd_switchover_start(vbasedev);
> +    }
> +
> +    return 0;
> +}
> +
>   static const SaveVMHandlers savevm_vfio_handlers = {
>       .save_prepare = vfio_save_prepare,
>       .save_setup = vfio_save_setup,
> @@ -808,6 +819,7 @@ static const SaveVMHandlers savevm_vfio_handlers = {
>       .load_state = vfio_load_state,
>       .load_state_buffer = vfio_load_state_buffer,
>       .switchover_ack_needed = vfio_switchover_ack_needed,
> +    .switchover_start = vfio_switchover_start,
>   };
>   
>   /* ---------------------------------------------------------------------- */
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index 042a3dc54a33..418b378ebd29 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -154,6 +154,11 @@ vfio_load_device_config_state_end(const char *name) " (%s)"
>   vfio_load_state(const char *name, uint64_t data) " (%s) data 0x%"PRIx64
>   vfio_load_state_device_data(const char *name, uint64_t data_size, int ret) " (%s) size %"PRIu64" ret %d"
>   vfio_load_state_device_buffer_incoming(const char *name, uint32_t idx) " (%s) idx %"PRIu32
> +vfio_load_state_device_buffer_start(const char *name) " (%s)"
> +vfio_load_state_device_buffer_starved(const char *name, uint32_t idx) " (%s) idx %"PRIu32
> +vfio_load_state_device_buffer_load_start(const char *name, uint32_t idx) " (%s) idx %"PRIu32
> +vfio_load_state_device_buffer_load_end(const char *name, uint32_t idx) " (%s) idx %"PRIu32
> +vfio_load_state_device_buffer_end(const char *name) " (%s)"
>   vfio_migration_realize(const char *name) " (%s)"
>   vfio_migration_set_device_state(const char *name, const char *state) " (%s) state %s"
>   vfio_migration_set_state(const char *name, const char *new_state, const char *recover_state) " (%s) new state %s, recover state %s"
> 




* Re: [PATCH v5 28/36] vfio/migration: Multifd device state transfer support - config loading support
  2025-02-19 20:34 ` [PATCH v5 28/36] vfio/migration: Multifd device state transfer support - config loading support Maciej S. Szmigiero
@ 2025-02-26 13:52   ` Cédric Le Goater
  2025-02-26 21:05     ` Maciej S. Szmigiero
  2025-03-02 14:25   ` Avihai Horon
  1 sibling, 1 reply; 120+ messages in thread
From: Cédric Le Goater @ 2025-02-26 13:52 UTC (permalink / raw)
  To: Maciej S. Szmigiero, Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Eric Blake, Markus Armbruster,
	Daniel P . Berrangé, Avihai Horon, Joao Martins, qemu-devel

On 2/19/25 21:34, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> 
> Load device config received via multifd using the existing machinery
> behind vfio_load_device_config_state().
> 
> Also, make sure to process the relevant main migration channel flags.
> 
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> ---
>   hw/vfio/migration-multifd.c   | 47 ++++++++++++++++++++++++++++++++++-
>   hw/vfio/migration.c           |  8 +++++-
>   include/hw/vfio/vfio-common.h |  2 ++
>   3 files changed, 55 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/vfio/migration-multifd.c b/hw/vfio/migration-multifd.c
> index b3a88c062769..7200f6f1c2a2 100644
> --- a/hw/vfio/migration-multifd.c
> +++ b/hw/vfio/migration-multifd.c
> @@ -15,6 +15,7 @@
>   #include "qemu/lockable.h"
>   #include "qemu/main-loop.h"
>   #include "qemu/thread.h"
> +#include "io/channel-buffer.h"
>   #include "migration/qemu-file.h"
>   #include "migration-multifd.h"
>   #include "trace.h"
> @@ -186,7 +187,51 @@ bool vfio_load_state_buffer(void *opaque, char *data, size_t data_size,
>   
>   static int vfio_load_bufs_thread_load_config(VFIODevice *vbasedev)

please modify to return a bool and take an "Error **errp" parameter.

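The usual shape of that conversion, as a sketch; the Error type and error_setg() below are minimal stand-ins for QEMU's "qapi/error.h" (the real error_setg() is a macro that also records file/line), and the function names are invented just to show the control flow:

```c
#include <assert.h>
#include <stdarg.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Minimal stand-in for QEMU's Error/error_setg(), illustration only */
typedef struct Error { char msg[256]; } Error;

static void error_setg(Error **errp, const char *fmt, ...)
{
    va_list ap;

    if (!errp) {
        return;
    }
    *errp = calloc(1, sizeof(**errp));
    va_start(ap, fmt);
    vsnprintf((*errp)->msg, sizeof((*errp)->msg), fmt, ap);
    va_end(ap);
}

/* Before: int-returning, the caller has to turn a -errno into a message */
static int vfio_load_config_old(int simulated_ret)
{
    return simulated_ret;   /* e.g. -EINVAL */
}

/* After: bool-returning, the error text is built at the failure site */
static bool vfio_load_config_new(int simulated_ret, Error **errp)
{
    int ret = vfio_load_config_old(simulated_ret);

    if (ret < 0) {
        error_setg(errp, "load config state failed: %d", ret);
        return false;
    }
    return true;
}
```

The caller then only checks the bool and propagates errp, instead of re-encoding the errno value into a message itself.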

Thanks,

C.


>   {
> -    return -EINVAL;
> +    VFIOMigration *migration = vbasedev->migration;
> +    VFIOMultifd *multifd = migration->multifd;
> +    VFIOStateBuffer *lb;
> +    g_autoptr(QIOChannelBuffer) bioc = NULL;
> +    QEMUFile *f_out = NULL, *f_in = NULL;
> +    uint64_t mig_header;
> +    int ret;
> +
> +    assert(multifd->load_buf_idx == multifd->load_buf_idx_last);
> +    lb = vfio_state_buffers_at(&multifd->load_bufs, multifd->load_buf_idx);
> +    assert(lb->is_present);
> +
> +    bioc = qio_channel_buffer_new(lb->len);
> +    qio_channel_set_name(QIO_CHANNEL(bioc), "vfio-device-config-load");
> +
> +    f_out = qemu_file_new_output(QIO_CHANNEL(bioc));
> +    qemu_put_buffer(f_out, (uint8_t *)lb->data, lb->len);
> +
> +    ret = qemu_fflush(f_out);
> +    if (ret) {
> +        g_clear_pointer(&f_out, qemu_fclose);
> +        return ret;
> +    }
> +
> +    qio_channel_io_seek(QIO_CHANNEL(bioc), 0, 0, NULL);
> +    f_in = qemu_file_new_input(QIO_CHANNEL(bioc));
> +
> +    mig_header = qemu_get_be64(f_in);
> +    if (mig_header != VFIO_MIG_FLAG_DEV_CONFIG_STATE) {
> +        g_clear_pointer(&f_out, qemu_fclose);
> +        g_clear_pointer(&f_in, qemu_fclose);
> +        return -EINVAL;
> +    }
> +
> +    bql_lock();
> +    ret = vfio_load_device_config_state(f_in, vbasedev);
> +    bql_unlock();
> +
> +    g_clear_pointer(&f_out, qemu_fclose);
> +    g_clear_pointer(&f_in, qemu_fclose);
> +    if (ret < 0) {
> +        return ret;
> +    }
> +
> +    return 0;
>   }
>   
>   static VFIOStateBuffer *vfio_load_state_buffer_get(VFIOMultifd *multifd)
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 85f54cb22df2..b962309f7c27 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -264,7 +264,7 @@ static int vfio_save_device_config_state(QEMUFile *f, void *opaque,
>       return ret;
>   }
>   
> -static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
> +int vfio_load_device_config_state(QEMUFile *f, void *opaque)
>   {
>       VFIODevice *vbasedev = opaque;
>       uint64_t data;
> @@ -728,6 +728,12 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
>           switch (data) {
>           case VFIO_MIG_FLAG_DEV_CONFIG_STATE:
>           {
> +            if (vfio_multifd_transfer_enabled(vbasedev)) {
> +                error_report("%s: got DEV_CONFIG_STATE but doing multifd transfer",
> +                             vbasedev->name);
> +                return -EINVAL;
> +            }
> +
>               return vfio_load_device_config_state(f, opaque);
>           }
>           case VFIO_MIG_FLAG_DEV_SETUP_STATE:
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index ab110198bd6b..ce2bdea8a2c2 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -298,6 +298,8 @@ void vfio_add_bytes_transferred(unsigned long val);
>   bool vfio_device_state_is_running(VFIODevice *vbasedev);
>   bool vfio_device_state_is_precopy(VFIODevice *vbasedev);
>   
> +int vfio_load_device_config_state(QEMUFile *f, void *opaque);
> +
>   #ifdef CONFIG_LINUX
>   int vfio_get_region_info(VFIODevice *vbasedev, int index,
>                            struct vfio_region_info **info);
> 




* Re: [PATCH v5 19/36] vfio/migration: Convert bytes_transferred counter to atomic
  2025-02-26  7:52   ` Cédric Le Goater
@ 2025-02-26 13:55     ` Maciej S. Szmigiero
  2025-02-26 15:56       ` Cédric Le Goater
  0 siblings, 1 reply; 120+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-26 13:55 UTC (permalink / raw)
  To: Cédric Le Goater
  Cc: Alex Williamson, Peter Xu, Fabiano Rosas, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

On 26.02.2025 08:52, Cédric Le Goater wrote:
> On 2/19/25 21:34, Maciej S. Szmigiero wrote:
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> So it can be safely accessed from multiple threads.
>>
>> This variable type needs to be changed to unsigned long since
>> 32-bit host platforms lack the necessary addition atomics on 64-bit
>> variables.
>>
>> Using 32-bit counters on 32-bit host platforms should not be a problem
>> in practice since they can't realistically address more memory anyway.
> 
> Is it useful to have VFIO on 32-bit host platforms ?
> 
> If not, VFIO PCI should depend on (AARCH64 || PPC64 || X86_64) and we
> could drop this patch. Let's address that independently.

Not sure how much use VFIO gets on 32-bit host platforms,
however totally disabling it on these would be a major functional regression -
at least if taken at its face value.

Especially considering that making it work on 32-bit platforms requires
just this tiny variable type change here.

> Thanks,
> 
> C.

Thanks,
Maciej




* Re: [PATCH v5 20/36] vfio/migration: Add vfio_add_bytes_transferred()
  2025-02-26  8:06   ` Cédric Le Goater
@ 2025-02-26 15:45     ` Maciej S. Szmigiero
  0 siblings, 0 replies; 120+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-26 15:45 UTC (permalink / raw)
  To: Cédric Le Goater
  Cc: Alex Williamson, Eric Blake, Markus Armbruster,
	Daniel P . Berrangé, Avihai Horon, Joao Martins, qemu-devel,
	Peter Xu, Fabiano Rosas

On 26.02.2025 09:06, Cédric Le Goater wrote:
> On 2/19/25 21:34, Maciej S. Szmigiero wrote:
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> This way bytes_transferred can also be incremented in translation
>> units other than migration.c.
>>
>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> 
> Looks good. Just a small aesthetic issue.
> 
>> ---
>>   hw/vfio/migration.c           | 7 ++++++-
>>   include/hw/vfio/vfio-common.h | 1 +
>>   2 files changed, 7 insertions(+), 1 deletion(-)
>>
>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>> index 5532787be63b..e9645cb9d088 100644
>> --- a/hw/vfio/migration.c
>> +++ b/hw/vfio/migration.c
>> @@ -391,7 +391,7 @@ static ssize_t vfio_save_block(QEMUFile *f, VFIOMigration *migration)
>>       qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
>>       qemu_put_be64(f, data_size);
>>       qemu_put_buffer(f, migration->data_buffer, data_size);
>> -    qatomic_add(&bytes_transferred, data_size);
>> +    vfio_add_bytes_transferred(data_size);
>>       trace_vfio_save_block(migration->vbasedev->name, data_size);
>> @@ -1021,6 +1021,11 @@ void vfio_reset_bytes_transferred(void)
>>       qatomic_set(&bytes_transferred, 0);
>>   }
>> +void vfio_add_bytes_transferred(unsigned long val)
> 
> vfio_migration_add_bytes_transferred()
> 

Renamed to vfio_mig_add_bytes_transferred() for consistency with
vfio_mig_bytes_transferred().
  
> Thanks,
> 
> C.

Thanks,
Maciej




* Re: [PATCH v5 19/36] vfio/migration: Convert bytes_transferred counter to atomic
  2025-02-26 13:55     ` Maciej S. Szmigiero
@ 2025-02-26 15:56       ` Cédric Le Goater
  0 siblings, 0 replies; 120+ messages in thread
From: Cédric Le Goater @ 2025-02-26 15:56 UTC (permalink / raw)
  To: Maciej S. Szmigiero
  Cc: Alex Williamson, Peter Xu, Fabiano Rosas, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

On 2/26/25 14:55, Maciej S. Szmigiero wrote:
> On 26.02.2025 08:52, Cédric Le Goater wrote:
>> On 2/19/25 21:34, Maciej S. Szmigiero wrote:
>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>
>>> So it can be safely accessed from multiple threads.
>>>
>>> This variable type needs to be changed to unsigned long since
>>> 32-bit host platforms lack the necessary addition atomics on 64-bit
>>> variables.
>>>
>>> Using 32-bit counters on 32-bit host platforms should not be a problem
>>> in practice since they can't realistically address more memory anyway.
>>
>> Is it useful to have VFIO on 32-bit host platforms ?
>>
>> If not, VFIO PCI should depend on (AARCH64 || PPC64 || X86_64) and we
>> could drop this patch. Let's address that independently.
> 
> Not sure how much use VFIO gets on 32-bit host platforms,
> however totally disabling it on these would be a major functional regression -
> at least if taken at its face value.

32-bit host platform support is being deprecated in QEMU 10.0 and should
be removed in QEMU 10.2.

> Especially considering that making it work on 32-bit platform requires
> just this tiny variable type change here.

yes. It caught my attention because x86 32-bit was the only host platform
I was not sure about, and Alex confirmed it worked. We should simply wait
for removal.


Thanks,

C.





* Re: [PATCH v5 22/36] vfio/migration: Multifd device state transfer support - basic types
  2025-02-26  8:52   ` Cédric Le Goater
@ 2025-02-26 16:06     ` Maciej S. Szmigiero
  0 siblings, 0 replies; 120+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-26 16:06 UTC (permalink / raw)
  To: Cédric Le Goater, Peter Xu
  Cc: Alex Williamson, Eric Blake, Fabiano Rosas, Markus Armbruster,
	Daniel P . Berrangé, Avihai Horon, Joao Martins, qemu-devel

On 26.02.2025 09:52, Cédric Le Goater wrote:
> On 2/19/25 21:34, Maciej S. Szmigiero wrote:
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> Add basic types and flags used by VFIO multifd device state transfer
>> support.
>>
>> Since we'll be introducing a lot of multifd transfer specific code,
>> add a new file migration-multifd.c to home it, wired into main VFIO
>> migration code (migration.c) via migration-multifd.h header file.
>>
>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>> ---
>>   hw/vfio/meson.build         |  1 +
>>   hw/vfio/migration-multifd.c | 31 +++++++++++++++++++++++++++++++
>>   hw/vfio/migration-multifd.h | 15 +++++++++++++++
>>   hw/vfio/migration.c         |  1 +
>>   4 files changed, 48 insertions(+)
>>   create mode 100644 hw/vfio/migration-multifd.c
>>   create mode 100644 hw/vfio/migration-multifd.h
>>
>> diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
>> index bba776f75cc7..260d65febd6b 100644
>> --- a/hw/vfio/meson.build
>> +++ b/hw/vfio/meson.build
>> @@ -5,6 +5,7 @@ vfio_ss.add(files(
>>     'container-base.c',
>>     'container.c',
>>     'migration.c',
>> +  'migration-multifd.c',
>>     'cpr.c',
>>   ))
>>   vfio_ss.add(when: 'CONFIG_PSERIES', if_true: files('spapr.c'))
>> diff --git a/hw/vfio/migration-multifd.c b/hw/vfio/migration-multifd.c
>> new file mode 100644
>> index 000000000000..0c3185a26242
>> --- /dev/null
>> +++ b/hw/vfio/migration-multifd.c
>> @@ -0,0 +1,31 @@
>> +/*
> 
> Please add :
> 
>    SPDX-License-Identifier: GPL-2.0-or-later
> 
> in new files.

Done, also to migration/multifd-device-state.c
outside VFIO.

> Thanks,
> 
> C.

Thanks,
Maciej




* Re: [PATCH v5 19/36] vfio/migration: Convert bytes_transferred counter to atomic
  2025-02-19 20:34 ` [PATCH v5 19/36] vfio/migration: Convert bytes_transferred counter to atomic Maciej S. Szmigiero
  2025-02-26  7:52   ` Cédric Le Goater
@ 2025-02-26 16:20   ` Cédric Le Goater
  1 sibling, 0 replies; 120+ messages in thread
From: Cédric Le Goater @ 2025-02-26 16:20 UTC (permalink / raw)
  To: Maciej S. Szmigiero, Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Eric Blake, Markus Armbruster,
	Daniel P . Berrangé, Avihai Horon, Joao Martins, qemu-devel

On 2/19/25 21:34, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> 
> So it can be safely accessed from multiple threads.
> 
> This variable type needs to be changed to unsigned long since
> 32-bit host platforms lack the necessary addition atomics on 64-bit
> variables.
> 
> Using 32-bit counters on 32-bit host platforms should not be a problem
> in practice since they can't realistically address more memory anyway.
> 
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>


Reviewed-by: Cédric Le Goater <clg@redhat.com>

Thanks,

C.


> ---
>   hw/vfio/migration.c | 8 ++++----
>   1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 03890eaa48a9..5532787be63b 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -55,7 +55,7 @@
>    */
>   #define VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE (1 * MiB)
>   
> -static int64_t bytes_transferred;
> +static unsigned long bytes_transferred;
>   
>   static const char *mig_state_to_str(enum vfio_device_mig_state state)
>   {
> @@ -391,7 +391,7 @@ static ssize_t vfio_save_block(QEMUFile *f, VFIOMigration *migration)
>       qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
>       qemu_put_be64(f, data_size);
>       qemu_put_buffer(f, migration->data_buffer, data_size);
> -    bytes_transferred += data_size;
> +    qatomic_add(&bytes_transferred, data_size);
>   
>       trace_vfio_save_block(migration->vbasedev->name, data_size);
>   
> @@ -1013,12 +1013,12 @@ static int vfio_block_migration(VFIODevice *vbasedev, Error *err, Error **errp)
>   
>   int64_t vfio_mig_bytes_transferred(void)
>   {
> -    return bytes_transferred;
> +    return MIN(qatomic_read(&bytes_transferred), INT64_MAX);
>   }
>   
>   void vfio_reset_bytes_transferred(void)
>   {
> -    bytes_transferred = 0;
> +    qatomic_set(&bytes_transferred, 0);
>   }
>   
>   /*
> 




* Re: [PATCH v5 30/36] vfio/migration: Multifd device state transfer support - send side
  2025-02-19 20:34 ` [PATCH v5 30/36] vfio/migration: Multifd device state transfer support - send side Maciej S. Szmigiero
@ 2025-02-26 16:43   ` Cédric Le Goater
  2025-02-26 21:05     ` Maciej S. Szmigiero
  2025-03-02 14:41   ` Avihai Horon
  1 sibling, 1 reply; 120+ messages in thread
From: Cédric Le Goater @ 2025-02-26 16:43 UTC (permalink / raw)
  To: Maciej S. Szmigiero, Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Eric Blake, Markus Armbruster,
	Daniel P . Berrangé, Avihai Horon, Joao Martins, qemu-devel

On 2/19/25 21:34, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> 
> Implement the multifd device state transfer via additional per-device
> thread inside save_live_complete_precopy_thread handler.
> 
> Switch between doing the data transfer in the new handler and doing it
> in the old save_state handler depending on the
> x-migration-multifd-transfer device property value.
> 
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> ---
>   hw/vfio/migration-multifd.c   | 139 ++++++++++++++++++++++++++++++++++
>   hw/vfio/migration-multifd.h   |   5 ++
>   hw/vfio/migration.c           |  26 +++++--
>   hw/vfio/trace-events          |   2 +
>   include/hw/vfio/vfio-common.h |   8 ++
>   5 files changed, 174 insertions(+), 6 deletions(-)
> 
> diff --git a/hw/vfio/migration-multifd.c b/hw/vfio/migration-multifd.c
> index 7200f6f1c2a2..0cfa9d31732a 100644
> --- a/hw/vfio/migration-multifd.c
> +++ b/hw/vfio/migration-multifd.c
> @@ -476,6 +476,145 @@ bool vfio_multifd_transfer_setup(VFIODevice *vbasedev, Error **errp)
>       return true;
>   }
>   
> +void vfio_multifd_emit_dummy_eos(VFIODevice *vbasedev, QEMUFile *f)
> +{
> +    assert(vfio_multifd_transfer_enabled(vbasedev));
> +
> +    /*
> +     * Emit dummy NOP data on the main migration channel since the actual
> +     * device state transfer is done via multifd channels.
> +     */
> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> +}
> +
> +static bool
> +vfio_save_complete_precopy_thread_config_state(VFIODevice *vbasedev,
> +                                               char *idstr,
> +                                               uint32_t instance_id,
> +                                               uint32_t idx,
> +                                               Error **errp)
> +{
> +    g_autoptr(QIOChannelBuffer) bioc = NULL;
> +    g_autoptr(QEMUFile) f = NULL;
> +    int ret;
> +    g_autofree VFIODeviceStatePacket *packet = NULL;
> +    size_t packet_len;
> +
> +    bioc = qio_channel_buffer_new(0);
> +    qio_channel_set_name(QIO_CHANNEL(bioc), "vfio-device-config-save");
> +
> +    f = qemu_file_new_output(QIO_CHANNEL(bioc));
> +
> +    if (vfio_save_device_config_state(f, vbasedev, errp)) {
> +        return false;
> +    }
> +
> +    ret = qemu_fflush(f);
> +    if (ret) {
> +        error_setg(errp, "save config state flush failed: %d", ret);
> +        return false;
> +    }
> +
> +    packet_len = sizeof(*packet) + bioc->usage;
> +    packet = g_malloc0(packet_len);
> +    packet->version = VFIO_DEVICE_STATE_PACKET_VER_CURRENT;
> +    packet->idx = idx;
> +    packet->flags = VFIO_DEVICE_STATE_CONFIG_STATE;
> +    memcpy(&packet->data, bioc->data, bioc->usage);
> +
> +    if (!multifd_queue_device_state(idstr, instance_id,
> +                                    (char *)packet, packet_len)) {
> +        error_setg(errp, "multifd config data queuing failed");
> +        return false;
> +    }
> +
> +    vfio_add_bytes_transferred(packet_len);
> +
> +    return true;
> +}
> +
> +/*
> + * This thread is spawned by the migration core directly via
> + * .save_live_complete_precopy_thread SaveVMHandler.
> + *
> + * It exits after either:
> + * * completing saving the remaining device state and device config, OR:
> + * * encountering some error while doing the above, OR:
> + * * being forcefully aborted by the migration core by
> + *   multifd_device_state_save_thread_should_exit() returning true.
> + */
> +bool vfio_save_complete_precopy_thread(SaveLiveCompletePrecopyThreadData *d,
> +                                       Error **errp)

In qemu_savevm_state_complete_precopy_iterable(), this handler is
called :

     ....
     if (multifd_device_state) {
         QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
             SaveLiveCompletePrecopyThreadHandler hdlr;

             if (!se->ops || (in_postcopy && se->ops->has_postcopy &&
                              se->ops->has_postcopy(se->opaque)) ||
                 !se->ops->save_live_complete_precopy_thread) {
                 continue;
             }

             hdlr = se->ops->save_live_complete_precopy_thread;
             multifd_spawn_device_state_save_thread(hdlr,
                                                    se->idstr, se->instance_id,
                                                    se->opaque);
         }
     }


I suggest naming it : vfio_multifd_save_complete_precopy_thread()

> +{
> +    VFIODevice *vbasedev = d->handler_opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +    bool ret;
> +    g_autofree VFIODeviceStatePacket *packet = NULL;
> +    uint32_t idx;
> +
> +    if (!vfio_multifd_transfer_enabled(vbasedev)) {
> +        /* Nothing to do, vfio_save_complete_precopy() does the transfer. */
> +        return true;
> +    }
> +
> +    trace_vfio_save_complete_precopy_thread_start(vbasedev->name,
> +                                                  d->idstr, d->instance_id);
> +
> +    /* We reach here with device state STOP or STOP_COPY only */
> +    if (vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP_COPY,
> +                                 VFIO_DEVICE_STATE_STOP, errp)) {
> +        ret = false;

These "ret = false" can be avoided if the variable is set at the
top of the function.

> +        goto ret_finish;


goto thread_exit ?
> +    }
> +
> +    packet = g_malloc0(sizeof(*packet) + migration->data_buffer_size);
> +    packet->version = VFIO_DEVICE_STATE_PACKET_VER_CURRENT;
> +
> +    for (idx = 0; ; idx++) {
> +        ssize_t data_size;
> +        size_t packet_size;
> +
> +        if (multifd_device_state_save_thread_should_exit()) {
> +            error_setg(errp, "operation cancelled");
> +            ret = false;
> +            goto ret_finish;
> +        }
> +
> +        data_size = read(migration->data_fd, &packet->data,
> +                         migration->data_buffer_size);
> +        if (data_size < 0) {
> +            error_setg(errp, "reading state buffer %" PRIu32 " failed: %d",
> +                       idx, errno);
> +            ret = false;
> +            goto ret_finish;
> +        } else if (data_size == 0) {
> +            break;
> +        }
> +
> +        packet->idx = idx;
> +        packet_size = sizeof(*packet) + data_size;
> +
> +        if (!multifd_queue_device_state(d->idstr, d->instance_id,
> +                                        (char *)packet, packet_size)) {
> +            error_setg(errp, "multifd data queuing failed");
> +            ret = false;
> +            goto ret_finish;
> +        }
> +
> +        vfio_add_bytes_transferred(packet_size);
> +    }
> +
> +    ret = vfio_save_complete_precopy_thread_config_state(vbasedev,
> +                                                         d->idstr,
> +                                                         d->instance_id,
> +                                                         idx, errp);
> +
> +ret_finish:
> +    trace_vfio_save_complete_precopy_thread_end(vbasedev->name, ret);
> +
> +    return ret;
> +}
> +
>   int vfio_multifd_switchover_start(VFIODevice *vbasedev)
>   {
>       VFIOMigration *migration = vbasedev->migration;
> diff --git a/hw/vfio/migration-multifd.h b/hw/vfio/migration-multifd.h
> index 09cbb437d9d1..79780d7b5392 100644
> --- a/hw/vfio/migration-multifd.h
> +++ b/hw/vfio/migration-multifd.h
> @@ -25,6 +25,11 @@ bool vfio_multifd_transfer_setup(VFIODevice *vbasedev, Error **errp);
>   bool vfio_load_state_buffer(void *opaque, char *data, size_t data_size,
>                               Error **errp);
>   
> +void vfio_multifd_emit_dummy_eos(VFIODevice *vbasedev, QEMUFile *f);
> +
> +bool vfio_save_complete_precopy_thread(SaveLiveCompletePrecopyThreadData *d,
> +                                       Error **errp);
> +
>   int vfio_multifd_switchover_start(VFIODevice *vbasedev);
>   
>   #endif
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index b962309f7c27..69dcf2dac2fa 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -120,10 +120,10 @@ static void vfio_migration_set_device_state(VFIODevice *vbasedev,
>       vfio_migration_send_event(vbasedev);
>   }
>   
> -static int vfio_migration_set_state(VFIODevice *vbasedev,
> -                                    enum vfio_device_mig_state new_state,
> -                                    enum vfio_device_mig_state recover_state,
> -                                    Error **errp)
> +int vfio_migration_set_state(VFIODevice *vbasedev,
> +                             enum vfio_device_mig_state new_state,
> +                             enum vfio_device_mig_state recover_state,
> +                             Error **errp)
>   {
>       VFIOMigration *migration = vbasedev->migration;
>       uint64_t buf[DIV_ROUND_UP(sizeof(struct vfio_device_feature) +
> @@ -238,8 +238,7 @@ static int vfio_load_buffer(QEMUFile *f, VFIODevice *vbasedev,
>       return ret;
>   }
>   
> -static int vfio_save_device_config_state(QEMUFile *f, void *opaque,
> -                                         Error **errp)
> +int vfio_save_device_config_state(QEMUFile *f, void *opaque, Error **errp)
>   {
>       VFIODevice *vbasedev = opaque;
>       int ret;
> @@ -453,6 +452,10 @@ static int vfio_save_setup(QEMUFile *f, void *opaque, Error **errp)
>       uint64_t stop_copy_size = VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE;
>       int ret;
>   
> +    if (!vfio_multifd_transfer_setup(vbasedev, errp)) {
> +        return -EINVAL;
> +    }
> +

please move to another patch with the similar change of patch 25.


>       qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
>   
>       vfio_query_stop_copy_size(vbasedev, &stop_copy_size);
> @@ -631,6 +634,11 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
>       int ret;
>       Error *local_err = NULL;
>   
> +    if (vfio_multifd_transfer_enabled(vbasedev)) {
> +        vfio_multifd_emit_dummy_eos(vbasedev, f);
> +        return 0;
> +    }
> +
>       trace_vfio_save_complete_precopy_start(vbasedev->name);
>   
>       /* We reach here with device state STOP or STOP_COPY only */
> @@ -662,6 +670,11 @@ static void vfio_save_state(QEMUFile *f, void *opaque)
>       Error *local_err = NULL;
>       int ret;
>   
> +    if (vfio_multifd_transfer_enabled(vbasedev)) {
> +        vfio_multifd_emit_dummy_eos(vbasedev, f);
> +        return;
> +    }
> +
>       ret = vfio_save_device_config_state(f, opaque, &local_err);
>       if (ret) {
>           error_prepend(&local_err,
> @@ -819,6 +832,7 @@ static const SaveVMHandlers savevm_vfio_handlers = {
>       .is_active_iterate = vfio_is_active_iterate,
>       .save_live_iterate = vfio_save_iterate,
>       .save_live_complete_precopy = vfio_save_complete_precopy,
> +    .save_live_complete_precopy_thread = vfio_save_complete_precopy_thread,
>       .save_state = vfio_save_state,
>       .load_setup = vfio_load_setup,
>       .load_cleanup = vfio_load_cleanup,
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index 418b378ebd29..039979bdd98f 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -168,6 +168,8 @@ vfio_save_block_precopy_empty_hit(const char *name) " (%s)"
>   vfio_save_cleanup(const char *name) " (%s)"
>   vfio_save_complete_precopy(const char *name, int ret) " (%s) ret %d"
>   vfio_save_complete_precopy_start(const char *name) " (%s)"
> +vfio_save_complete_precopy_thread_start(const char *name, const char *idstr, uint32_t instance_id) " (%s) idstr %s instance %"PRIu32
> +vfio_save_complete_precopy_thread_end(const char *name, int ret) " (%s) ret %d"
>   vfio_save_device_config_state(const char *name) " (%s)"
>   vfio_save_iterate(const char *name, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy initial size %"PRIu64" precopy dirty size %"PRIu64
>   vfio_save_iterate_start(const char *name) " (%s)"
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index ce2bdea8a2c2..ba851917f9fc 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -298,6 +298,14 @@ void vfio_add_bytes_transferred(unsigned long val);
>   bool vfio_device_state_is_running(VFIODevice *vbasedev);
>   bool vfio_device_state_is_precopy(VFIODevice *vbasedev);
>   
> +#ifdef CONFIG_LINUX
> +int vfio_migration_set_state(VFIODevice *vbasedev,
> +                             enum vfio_device_mig_state new_state,
> +                             enum vfio_device_mig_state recover_state,
> +                             Error **errp);

please move below with the other declarations under #ifdef CONFIG_LINUX.

> +#endif
> +
> +int vfio_save_device_config_state(QEMUFile *f, void *opaque, Error **errp);
>   int vfio_load_device_config_state(QEMUFile *f, void *opaque);
>   
>   #ifdef CONFIG_LINUX
> 



Thanks,

C.





^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v5 17/36] migration: Add save_live_complete_precopy_thread handler
  2025-02-19 20:33 ` [PATCH v5 17/36] migration: Add save_live_complete_precopy_thread handler Maciej S. Szmigiero
@ 2025-02-26 16:43   ` Peter Xu
  2025-03-04 21:50     ` Maciej S. Szmigiero
  0 siblings, 1 reply; 120+ messages in thread
From: Peter Xu @ 2025-02-26 16:43 UTC (permalink / raw)
  To: Maciej S. Szmigiero
  Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

On Wed, Feb 19, 2025 at 09:33:59PM +0100, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> 
> This SaveVMHandler helps device provide its own asynchronous transmission
> of the remaining data at the end of a precopy phase via multifd channels,
> in parallel with the transfer done by save_live_complete_precopy handlers.
> 
> These threads are launched only when multifd device state transfer is
> supported.
> 
> Management of these threads is done in the multifd migration code,
> wrapping them in the generic thread pool.
> 
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> ---
>  include/migration/misc.h         | 17 +++++++
>  include/migration/register.h     | 19 +++++++
>  include/qemu/typedefs.h          |  3 ++
>  migration/multifd-device-state.c | 85 ++++++++++++++++++++++++++++++++
>  migration/savevm.c               | 35 ++++++++++++-
>  5 files changed, 158 insertions(+), 1 deletion(-)
> 
> diff --git a/include/migration/misc.h b/include/migration/misc.h
> index 273ebfca6256..8fd36eba1da7 100644
> --- a/include/migration/misc.h
> +++ b/include/migration/misc.h
> @@ -119,8 +119,25 @@ bool migrate_uri_parse(const char *uri, MigrationChannel **channel,
>                         Error **errp);
>  
>  /* migration/multifd-device-state.c */
> +typedef struct SaveLiveCompletePrecopyThreadData {
> +    SaveLiveCompletePrecopyThreadHandler hdlr;
> +    char *idstr;
> +    uint32_t instance_id;
> +    void *handler_opaque;
> +} SaveLiveCompletePrecopyThreadData;
> +
>  bool multifd_queue_device_state(char *idstr, uint32_t instance_id,
>                                  char *data, size_t len);
>  bool multifd_device_state_supported(void);
>  
> +void
> +multifd_spawn_device_state_save_thread(SaveLiveCompletePrecopyThreadHandler hdlr,
> +                                       char *idstr, uint32_t instance_id,
> +                                       void *opaque);
> +
> +bool multifd_device_state_save_thread_should_exit(void);
> +
> +void multifd_abort_device_state_save_threads(void);
> +bool multifd_join_device_state_save_threads(void);
> +
>  #endif
> diff --git a/include/migration/register.h b/include/migration/register.h
> index 58891aa54b76..c041ce32f2fc 100644
> --- a/include/migration/register.h
> +++ b/include/migration/register.h
> @@ -105,6 +105,25 @@ typedef struct SaveVMHandlers {
>       */
>      int (*save_live_complete_precopy)(QEMUFile *f, void *opaque);
>  
> +    /**
> +     * @save_live_complete_precopy_thread (invoked in a separate thread)
> +     *
> +     * Called at the end of a precopy phase from a separate worker thread
> +     * in configurations where multifd device state transfer is supported
> +     * in order to perform asynchronous transmission of the remaining data in
> +     * parallel with @save_live_complete_precopy handlers.
> +     * When postcopy is enabled, devices that support postcopy will skip this
> +     * step.
> +     *
> +     * @d: a #SaveLiveCompletePrecopyThreadData containing parameters that the
> +     * handler may need, including this device section idstr and instance_id,
> +     * and opaque data pointer passed to register_savevm_live().
> +     * @errp: pointer to Error*, to store an error if it happens.
> +     *
> +     * Returns true to indicate success and false for errors.
> +     */
> +    SaveLiveCompletePrecopyThreadHandler save_live_complete_precopy_thread;
> +
>      /* This runs both outside and inside the BQL.  */
>  
>      /**
> diff --git a/include/qemu/typedefs.h b/include/qemu/typedefs.h
> index fd23ff7771b1..42ed4e6be150 100644
> --- a/include/qemu/typedefs.h
> +++ b/include/qemu/typedefs.h
> @@ -108,6 +108,7 @@ typedef struct QString QString;
>  typedef struct RAMBlock RAMBlock;
>  typedef struct Range Range;
>  typedef struct ReservedRegion ReservedRegion;
> +typedef struct SaveLiveCompletePrecopyThreadData SaveLiveCompletePrecopyThreadData;
>  typedef struct SHPCDevice SHPCDevice;
>  typedef struct SSIBus SSIBus;
>  typedef struct TCGCPUOps TCGCPUOps;
> @@ -133,5 +134,7 @@ typedef struct IRQState *qemu_irq;
>  typedef void (*qemu_irq_handler)(void *opaque, int n, int level);
>  typedef bool (*MigrationLoadThread)(void *opaque, bool *should_quit,
>                                      Error **errp);
> +typedef bool (*SaveLiveCompletePrecopyThreadHandler)(SaveLiveCompletePrecopyThreadData *d,
> +                                                     Error **errp);
>  
>  #endif /* QEMU_TYPEDEFS_H */
> diff --git a/migration/multifd-device-state.c b/migration/multifd-device-state.c
> index 5de3cf27d6e8..63f021fb8dad 100644
> --- a/migration/multifd-device-state.c
> +++ b/migration/multifd-device-state.c
> @@ -8,7 +8,10 @@
>   */
>  
>  #include "qemu/osdep.h"
> +#include "qapi/error.h"
>  #include "qemu/lockable.h"
> +#include "block/thread-pool.h"
> +#include "migration.h"
>  #include "migration/misc.h"
>  #include "multifd.h"
>  #include "options.h"
> @@ -17,6 +20,9 @@ static struct {
>      QemuMutex queue_job_mutex;
>  
>      MultiFDSendData *send_data;
> +
> +    ThreadPool *threads;
> +    bool threads_abort;
>  } *multifd_send_device_state;
>  
>  void multifd_device_state_send_setup(void)
> @@ -27,10 +33,14 @@ void multifd_device_state_send_setup(void)
>      qemu_mutex_init(&multifd_send_device_state->queue_job_mutex);
>  
>      multifd_send_device_state->send_data = multifd_send_data_alloc();
> +
> +    multifd_send_device_state->threads = thread_pool_new();
> +    multifd_send_device_state->threads_abort = false;
>  }
>  
>  void multifd_device_state_send_cleanup(void)
>  {
> +    g_clear_pointer(&multifd_send_device_state->threads, thread_pool_free);
>      g_clear_pointer(&multifd_send_device_state->send_data,
>                      multifd_send_data_free);
>  
> @@ -115,3 +125,78 @@ bool multifd_device_state_supported(void)
>      return migrate_multifd() && !migrate_mapped_ram() &&
>          migrate_multifd_compression() == MULTIFD_COMPRESSION_NONE;
>  }
> +
> +static void multifd_device_state_save_thread_data_free(void *opaque)
> +{
> +    SaveLiveCompletePrecopyThreadData *data = opaque;
> +
> +    g_clear_pointer(&data->idstr, g_free);
> +    g_free(data);
> +}
> +
> +static int multifd_device_state_save_thread(void *opaque)
> +{
> +    SaveLiveCompletePrecopyThreadData *data = opaque;
> +    g_autoptr(Error) local_err = NULL;
> +
> +    if (!data->hdlr(data, &local_err)) {
> +        MigrationState *s = migrate_get_current();
> +
> +        assert(local_err);
> +
> +        /*
> +         * In case of multiple save threads failing, which thread's
> +         * error we end up setting is purely arbitrary.
> +         */
> +        migrate_set_error(s, local_err);

Where did you kick off all the threads when one hit error?  I wonder if
migrate_set_error() should just set quit flag for everything, but for this
series it might be easier to use multifd_abort_device_state_save_threads().

Other than that, looks good to me, thanks.

> +    }
> +
> +    return 0;
> +}
> +
> +bool multifd_device_state_save_thread_should_exit(void)
> +{
> +    return qatomic_read(&multifd_send_device_state->threads_abort);
> +}
> +
> +void
> +multifd_spawn_device_state_save_thread(SaveLiveCompletePrecopyThreadHandler hdlr,
> +                                       char *idstr, uint32_t instance_id,
> +                                       void *opaque)
> +{
> +    SaveLiveCompletePrecopyThreadData *data;
> +
> +    assert(multifd_device_state_supported());
> +    assert(multifd_send_device_state);
> +
> +    assert(!qatomic_read(&multifd_send_device_state->threads_abort));
> +
> +    data = g_new(SaveLiveCompletePrecopyThreadData, 1);
> +    data->hdlr = hdlr;
> +    data->idstr = g_strdup(idstr);
> +    data->instance_id = instance_id;
> +    data->handler_opaque = opaque;
> +
> +    thread_pool_submit_immediate(multifd_send_device_state->threads,
> +                                 multifd_device_state_save_thread,
> +                                 data,
> +                                 multifd_device_state_save_thread_data_free);
> +}
> +
> +void multifd_abort_device_state_save_threads(void)
> +{
> +    assert(multifd_device_state_supported());
> +
> +    qatomic_set(&multifd_send_device_state->threads_abort, true);
> +}
> +
> +bool multifd_join_device_state_save_threads(void)
> +{
> +    MigrationState *s = migrate_get_current();
> +
> +    assert(multifd_device_state_supported());
> +
> +    thread_pool_wait(multifd_send_device_state->threads);
> +
> +    return !migrate_has_error(s);
> +}
> diff --git a/migration/savevm.c b/migration/savevm.c
> index e412d05657a1..9a1e0ac807a0 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -37,6 +37,7 @@
>  #include "migration/register.h"
>  #include "migration/global_state.h"
>  #include "migration/channel-block.h"
> +#include "multifd.h"
>  #include "ram.h"
>  #include "qemu-file.h"
>  #include "savevm.h"
> @@ -1527,6 +1528,24 @@ int qemu_savevm_state_complete_precopy_iterable(QEMUFile *f, bool in_postcopy)
>      int64_t start_ts_each, end_ts_each;
>      SaveStateEntry *se;
>      int ret;
> +    bool multifd_device_state = multifd_device_state_supported();
> +
> +    if (multifd_device_state) {
> +        QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> +            SaveLiveCompletePrecopyThreadHandler hdlr;
> +
> +            if (!se->ops || (in_postcopy && se->ops->has_postcopy &&
> +                             se->ops->has_postcopy(se->opaque)) ||
> +                !se->ops->save_live_complete_precopy_thread) {
> +                continue;
> +            }
> +
> +            hdlr = se->ops->save_live_complete_precopy_thread;
> +            multifd_spawn_device_state_save_thread(hdlr,
> +                                                   se->idstr, se->instance_id,
> +                                                   se->opaque);
> +        }
> +    }
>  
>      QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
>          if (!se->ops ||
> @@ -1552,16 +1571,30 @@ int qemu_savevm_state_complete_precopy_iterable(QEMUFile *f, bool in_postcopy)
>          save_section_footer(f, se);
>          if (ret < 0) {
>              qemu_file_set_error(f, ret);
> -            return -1;
> +            goto ret_fail_abort_threads;
>          }
>          end_ts_each = qemu_clock_get_us(QEMU_CLOCK_REALTIME);
>          trace_vmstate_downtime_save("iterable", se->idstr, se->instance_id,
>                                      end_ts_each - start_ts_each);
>      }
>  
> +    if (multifd_device_state &&
> +        !multifd_join_device_state_save_threads()) {
> +        qemu_file_set_error(f, -EINVAL);
> +        return -1;
> +    }
> +
>      trace_vmstate_downtime_checkpoint("src-iterable-saved");
>  
>      return 0;
> +
> +ret_fail_abort_threads:
> +    if (multifd_device_state) {
> +        multifd_abort_device_state_save_threads();
> +        multifd_join_device_state_save_threads();
> +    }
> +
> +    return -1;
>  }
>  
>  int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
> 

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v5 25/36] vfio/migration: Multifd device state transfer support - receive init/cleanup
  2025-02-26 10:14   ` Cédric Le Goater
@ 2025-02-26 17:22     ` Cédric Le Goater
  2025-02-26 17:28       ` Maciej S. Szmigiero
  0 siblings, 1 reply; 120+ messages in thread
From: Cédric Le Goater @ 2025-02-26 17:22 UTC (permalink / raw)
  To: Maciej S. Szmigiero, Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Eric Blake, Markus Armbruster,
	Daniel P . Berrangé, Avihai Horon, Joao Martins, qemu-devel

On 2/26/25 11:14, Cédric Le Goater wrote:
> On 2/19/25 21:34, Maciej S. Szmigiero wrote:
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> Add support for VFIOMultifd data structure that will contain most of the
>> receive-side data together with its init/cleanup methods.
>>
>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>> ---
>>   hw/vfio/migration-multifd.c   | 33 +++++++++++++++++++++++++++++++++
>>   hw/vfio/migration-multifd.h   |  8 ++++++++
>>   hw/vfio/migration.c           | 29 +++++++++++++++++++++++++++--
>>   include/hw/vfio/vfio-common.h |  3 +++
>>   4 files changed, 71 insertions(+), 2 deletions(-)
>>
>> diff --git a/hw/vfio/migration-multifd.c b/hw/vfio/migration-multifd.c
>> index 7328ad8e925c..c2defc0efef0 100644
>> --- a/hw/vfio/migration-multifd.c
>> +++ b/hw/vfio/migration-multifd.c
>> @@ -41,6 +41,9 @@ typedef struct VFIOStateBuffer {
>>       size_t len;
>>   } VFIOStateBuffer;
>> +typedef struct VFIOMultifd {
>> +} VFIOMultifd;
>> +
>>   static void vfio_state_buffer_clear(gpointer data)
>>   {
>>       VFIOStateBuffer *lb = data;
>> @@ -84,8 +87,38 @@ static VFIOStateBuffer *vfio_state_buffers_at(VFIOStateBuffers *bufs, guint idx)
>>       return &g_array_index(bufs->array, VFIOStateBuffer, idx);
>>   }
>> +VFIOMultifd *vfio_multifd_new(void)
>> +{
>> +    VFIOMultifd *multifd = g_new(VFIOMultifd, 1);
>> +
>> +    return multifd;
>> +}
>> +
>> +void vfio_multifd_free(VFIOMultifd *multifd)
>> +{
>> +    g_free(multifd);
>> +}
>> +
>>   bool vfio_multifd_transfer_supported(void)
>>   {
>>       return multifd_device_state_supported() &&
>>           migrate_send_switchover_start();
>>   }
>> +
>> +bool vfio_multifd_transfer_enabled(VFIODevice *vbasedev)
>> +{
>> +    return false;
>> +}
>> +
>> +bool vfio_multifd_transfer_setup(VFIODevice *vbasedev, Error **errp)
>> +{
>> +    if (vfio_multifd_transfer_enabled(vbasedev) &&
>> +        !vfio_multifd_transfer_supported()) {
>> +        error_setg(errp,
>> +                   "%s: Multifd device transfer requested but unsupported in the current config",
>> +                   vbasedev->name);
>> +        return false;
>> +    }


now that I have reached patch 31 I understand better. I would put the
check above in patch 31 and simply return true for now.

Thanks,

C.


>> +    return true;
>> +}
>> diff --git a/hw/vfio/migration-multifd.h b/hw/vfio/migration-multifd.h
>> index 8fe004c1da81..1eefba3b2eed 100644
>> --- a/hw/vfio/migration-multifd.h
>> +++ b/hw/vfio/migration-multifd.h
>> @@ -12,6 +12,14 @@
>>   #include "hw/vfio/vfio-common.h"
>> +typedef struct VFIOMultifd VFIOMultifd;
>> +
>> +VFIOMultifd *vfio_multifd_new(void);
>> +void vfio_multifd_free(VFIOMultifd *multifd);
>> +
>>   bool vfio_multifd_transfer_supported(void);
>> +bool vfio_multifd_transfer_enabled(VFIODevice *vbasedev);
>> +
>> +bool vfio_multifd_transfer_setup(VFIODevice *vbasedev, Error **errp);
>>   #endif
>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>> index 7b79be6ad293..4311de763885 100644
>> --- a/hw/vfio/migration.c
>> +++ b/hw/vfio/migration.c
>> @@ -674,15 +674,40 @@ static void vfio_save_state(QEMUFile *f, void *opaque)
>>   static int vfio_load_setup(QEMUFile *f, void *opaque, Error **errp)
>>   {
>>       VFIODevice *vbasedev = opaque;
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    int ret;
>> +
>> +    if (!vfio_multifd_transfer_setup(vbasedev, errp)) {
>> +        return -EINVAL;
>> +    }
> 
> This check on the consistency of the settings confused me a little. Even if
> simple, I would have put it in a separate patch for better understanding.
> The rest looks good.
> 
> 
> Thanks,
> 
> C.
> 
> 
> 
>> +    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RESUMING,
>> +                                   migration->device_state, errp);
>> +    if (ret) {
>> +        return ret;
>> +    }
>> -    return vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RESUMING,
>> -                                    vbasedev->migration->device_state, errp);
>> +    if (vfio_multifd_transfer_enabled(vbasedev)) {
>> +        assert(!migration->multifd);
>> +        migration->multifd = vfio_multifd_new();
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +static void vfio_multifd_cleanup(VFIODevice *vbasedev)
>> +{
>> +    VFIOMigration *migration = vbasedev->migration;
>> +
>> +    g_clear_pointer(&migration->multifd, vfio_multifd_free);
>>   }
>>   static int vfio_load_cleanup(void *opaque)
>>   {
>>       VFIODevice *vbasedev = opaque;
>> +    vfio_multifd_cleanup(vbasedev);
>> +
>>       vfio_migration_cleanup(vbasedev);
>>       trace_vfio_load_cleanup(vbasedev->name);
>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>> index 64ee3b1a2547..ab110198bd6b 100644
>> --- a/include/hw/vfio/vfio-common.h
>> +++ b/include/hw/vfio/vfio-common.h
>> @@ -78,6 +78,8 @@ typedef struct VFIORegion {
>>       uint8_t nr; /* cache the region number for debug */
>>   } VFIORegion;
>> +typedef struct VFIOMultifd VFIOMultifd;
>> +
>>   typedef struct VFIOMigration {
>>       struct VFIODevice *vbasedev;
>>       VMChangeStateEntry *vm_state;
>> @@ -89,6 +91,7 @@ typedef struct VFIOMigration {
>>       uint64_t mig_flags;
>>       uint64_t precopy_init_size;
>>       uint64_t precopy_dirty_size;
>> +    VFIOMultifd *multifd;
>>       bool initial_data_sent;
>>       bool event_save_iterate_started;
>>
> 



^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v5 25/36] vfio/migration: Multifd device state transfer support - receive init/cleanup
  2025-02-26 17:22     ` Cédric Le Goater
@ 2025-02-26 17:28       ` Maciej S. Szmigiero
  0 siblings, 0 replies; 120+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-26 17:28 UTC (permalink / raw)
  To: Cédric Le Goater
  Cc: Alex Williamson, Eric Blake, Peter Xu, Fabiano Rosas,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

On 26.02.2025 18:22, Cédric Le Goater wrote:
> On 2/26/25 11:14, Cédric Le Goater wrote:
>> On 2/19/25 21:34, Maciej S. Szmigiero wrote:
>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>
>>> Add support for VFIOMultifd data structure that will contain most of the
>>> receive-side data together with its init/cleanup methods.
>>>
>>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>>> ---
>>>   hw/vfio/migration-multifd.c   | 33 +++++++++++++++++++++++++++++++++
>>>   hw/vfio/migration-multifd.h   |  8 ++++++++
>>>   hw/vfio/migration.c           | 29 +++++++++++++++++++++++++++--
>>>   include/hw/vfio/vfio-common.h |  3 +++
>>>   4 files changed, 71 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/hw/vfio/migration-multifd.c b/hw/vfio/migration-multifd.c
>>> index 7328ad8e925c..c2defc0efef0 100644
>>> --- a/hw/vfio/migration-multifd.c
>>> +++ b/hw/vfio/migration-multifd.c
>>> @@ -41,6 +41,9 @@ typedef struct VFIOStateBuffer {
>>>       size_t len;
>>>   } VFIOStateBuffer;
>>> +typedef struct VFIOMultifd {
>>> +} VFIOMultifd;
>>> +
>>>   static void vfio_state_buffer_clear(gpointer data)
>>>   {
>>>       VFIOStateBuffer *lb = data;
>>> @@ -84,8 +87,38 @@ static VFIOStateBuffer *vfio_state_buffers_at(VFIOStateBuffers *bufs, guint idx)
>>>       return &g_array_index(bufs->array, VFIOStateBuffer, idx);
>>>   }
>>> +VFIOMultifd *vfio_multifd_new(void)
>>> +{
>>> +    VFIOMultifd *multifd = g_new(VFIOMultifd, 1);
>>> +
>>> +    return multifd;
>>> +}
>>> +
>>> +void vfio_multifd_free(VFIOMultifd *multifd)
>>> +{
>>> +    g_free(multifd);
>>> +}
>>> +
>>>   bool vfio_multifd_transfer_supported(void)
>>>   {
>>>       return multifd_device_state_supported() &&
>>>           migrate_send_switchover_start();
>>>   }
>>> +
>>> +bool vfio_multifd_transfer_enabled(VFIODevice *vbasedev)
>>> +{
>>> +    return false;
>>> +}
>>> +
>>> +bool vfio_multifd_transfer_setup(VFIODevice *vbasedev, Error **errp)
>>> +{
>>> +    if (vfio_multifd_transfer_enabled(vbasedev) &&
>>> +        !vfio_multifd_transfer_supported()) {
>>> +        error_setg(errp,
>>> +                   "%s: Multifd device transfer requested but unsupported in the current config",
>>> +                   vbasedev->name);
>>> +        return false;
>>> +    }
>> This check on the consistency of the settings confused me a little. Even if
>> simple, I would have put it in a separate patch for better understanding.
>> The rest looks good.
> 
> now that I have reached patch 31 I understand better. I would put the
> check above in patch 31 and simply return true for now.
> 

Moved it to that patch then ("Add x-migration-multifd-transfer VFIO property").

> Thanks,
> 
> C.
> 

Thanks,
Maciej



^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v5 25/36] vfio/migration: Multifd device state transfer support - receive init/cleanup
  2025-02-19 20:34 ` [PATCH v5 25/36] vfio/migration: Multifd device state transfer support - receive init/cleanup Maciej S. Szmigiero
  2025-02-26 10:14   ` Cédric Le Goater
@ 2025-02-26 17:28   ` Cédric Le Goater
  2025-02-27 22:00     ` Maciej S. Szmigiero
  2025-02-26 17:46   ` Cédric Le Goater
  2 siblings, 1 reply; 120+ messages in thread
From: Cédric Le Goater @ 2025-02-26 17:28 UTC (permalink / raw)
  To: Maciej S. Szmigiero, Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Eric Blake, Markus Armbruster,
	Daniel P . Berrangé, Avihai Horon, Joao Martins, qemu-devel

On 2/19/25 21:34, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> 
> Add support for VFIOMultifd data structure that will contain most of the
> receive-side data together with its init/cleanup methods.
> 
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> ---
>   hw/vfio/migration-multifd.c   | 33 +++++++++++++++++++++++++++++++++
>   hw/vfio/migration-multifd.h   |  8 ++++++++
>   hw/vfio/migration.c           | 29 +++++++++++++++++++++++++++--
>   include/hw/vfio/vfio-common.h |  3 +++
>   4 files changed, 71 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/vfio/migration-multifd.c b/hw/vfio/migration-multifd.c
> index 7328ad8e925c..c2defc0efef0 100644
> --- a/hw/vfio/migration-multifd.c
> +++ b/hw/vfio/migration-multifd.c
> @@ -41,6 +41,9 @@ typedef struct VFIOStateBuffer {
>       size_t len;
>   } VFIOStateBuffer;
>   
> +typedef struct VFIOMultifd {
> +} VFIOMultifd;
> +
>   static void vfio_state_buffer_clear(gpointer data)
>   {
>       VFIOStateBuffer *lb = data;
> @@ -84,8 +87,38 @@ static VFIOStateBuffer *vfio_state_buffers_at(VFIOStateBuffers *bufs, guint idx)
>       return &g_array_index(bufs->array, VFIOStateBuffer, idx);
>   }
>   
> +VFIOMultifd *vfio_multifd_new(void)
> +{
> +    VFIOMultifd *multifd = g_new(VFIOMultifd, 1);
> +
> +    return multifd;
> +}
> +
> +void vfio_multifd_free(VFIOMultifd *multifd)
> +{
> +    g_free(multifd);
> +}
> +
>   bool vfio_multifd_transfer_supported(void)
>   {
>       return multifd_device_state_supported() &&
>           migrate_send_switchover_start();
>   }
> +
> +bool vfio_multifd_transfer_enabled(VFIODevice *vbasedev)
> +{
> +    return false;
> +}
> +
> +bool vfio_multifd_transfer_setup(VFIODevice *vbasedev, Error **errp)
> +{
> +    if (vfio_multifd_transfer_enabled(vbasedev) &&
> +        !vfio_multifd_transfer_supported()) {
> +        error_setg(errp,
> +                   "%s: Multifd device transfer requested but unsupported in the current config",
> +                   vbasedev->name);
> +        return false;
> +    }
> +
> +    return true;
> +}
> diff --git a/hw/vfio/migration-multifd.h b/hw/vfio/migration-multifd.h
> index 8fe004c1da81..1eefba3b2eed 100644
> --- a/hw/vfio/migration-multifd.h
> +++ b/hw/vfio/migration-multifd.h
> @@ -12,6 +12,14 @@
>   
>   #include "hw/vfio/vfio-common.h"
>   
> +typedef struct VFIOMultifd VFIOMultifd;
> +
> +VFIOMultifd *vfio_multifd_new(void);
> +void vfio_multifd_free(VFIOMultifd *multifd);
> +
>   bool vfio_multifd_transfer_supported(void);
> +bool vfio_multifd_transfer_enabled(VFIODevice *vbasedev);
> +
> +bool vfio_multifd_transfer_setup(VFIODevice *vbasedev, Error **errp);
>   
>   #endif
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 7b79be6ad293..4311de763885 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -674,15 +674,40 @@ static void vfio_save_state(QEMUFile *f, void *opaque)
>   static int vfio_load_setup(QEMUFile *f, void *opaque, Error **errp)
>   {
>       VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +    int ret;
> +
> +    if (!vfio_multifd_transfer_setup(vbasedev, errp)) {
> +        return -EINVAL;
> +    }
> +
> +    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RESUMING,
> +                                   migration->device_state, errp);
> +    if (ret) {
> +        return ret;
> +    }
>   
> -    return vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RESUMING,
> -                                    vbasedev->migration->device_state, errp);
> +    if (vfio_multifd_transfer_enabled(vbasedev)) {
> +        assert(!migration->multifd);
> +        migration->multifd = vfio_multifd_new();
> +    }
> +
> +    return 0;
> +}
> +
> +static void vfio_multifd_cleanup(VFIODevice *vbasedev)
> +{
> +    VFIOMigration *migration = vbasedev->migration;
> +
> +    g_clear_pointer(&migration->multifd, vfio_multifd_free);
>   }

Please move vfio_multifd_cleanup() to migration-multifd.c.

Thanks,

C.



>   static int vfio_load_cleanup(void *opaque)
>   {
>       VFIODevice *vbasedev = opaque;
>   
> +    vfio_multifd_cleanup(vbasedev);
> +
>       vfio_migration_cleanup(vbasedev);
>       trace_vfio_load_cleanup(vbasedev->name);
>   
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 64ee3b1a2547..ab110198bd6b 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -78,6 +78,8 @@ typedef struct VFIORegion {
>       uint8_t nr; /* cache the region number for debug */
>   } VFIORegion;
>   
> +typedef struct VFIOMultifd VFIOMultifd;
> +
>   typedef struct VFIOMigration {
>       struct VFIODevice *vbasedev;
>       VMChangeStateEntry *vm_state;
> @@ -89,6 +91,7 @@ typedef struct VFIOMigration {
>       uint64_t mig_flags;
>       uint64_t precopy_init_size;
>       uint64_t precopy_dirty_size;
> +    VFIOMultifd *multifd;
>       bool initial_data_sent;
>   
>       bool event_save_iterate_started;
> 



^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v5 25/36] vfio/migration: Multifd device state transfer support - receive init/cleanup
  2025-02-19 20:34 ` [PATCH v5 25/36] vfio/migration: Multifd device state transfer support - receive init/cleanup Maciej S. Szmigiero
  2025-02-26 10:14   ` Cédric Le Goater
  2025-02-26 17:28   ` Cédric Le Goater
@ 2025-02-26 17:46   ` Cédric Le Goater
  2025-02-27 22:00     ` Maciej S. Szmigiero
  2 siblings, 1 reply; 120+ messages in thread
From: Cédric Le Goater @ 2025-02-26 17:46 UTC (permalink / raw)
  To: Maciej S. Szmigiero, Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Eric Blake, Markus Armbruster,
	Daniel P . Berrangé, Avihai Horon, Joao Martins, qemu-devel

On 2/19/25 21:34, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> 
> Add support for VFIOMultifd data structure that will contain most of the
> receive-side data together with its init/cleanup methods.
> 
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> ---
>   hw/vfio/migration-multifd.c   | 33 +++++++++++++++++++++++++++++++++
>   hw/vfio/migration-multifd.h   |  8 ++++++++
>   hw/vfio/migration.c           | 29 +++++++++++++++++++++++++++--
>   include/hw/vfio/vfio-common.h |  3 +++
>   4 files changed, 71 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/vfio/migration-multifd.c b/hw/vfio/migration-multifd.c
> index 7328ad8e925c..c2defc0efef0 100644
> --- a/hw/vfio/migration-multifd.c
> +++ b/hw/vfio/migration-multifd.c
> @@ -41,6 +41,9 @@ typedef struct VFIOStateBuffer {
>       size_t len;
>   } VFIOStateBuffer;
>   
> +typedef struct VFIOMultifd {
> +} VFIOMultifd;
> +
>   static void vfio_state_buffer_clear(gpointer data)
>   {
>       VFIOStateBuffer *lb = data;
> @@ -84,8 +87,38 @@ static VFIOStateBuffer *vfio_state_buffers_at(VFIOStateBuffers *bufs, guint idx)
>       return &g_array_index(bufs->array, VFIOStateBuffer, idx);
>   }
>   
> +VFIOMultifd *vfio_multifd_new(void)
> +{
> +    VFIOMultifd *multifd = g_new(VFIOMultifd, 1);
> +
> +    return multifd;
> +}
> +
> +void vfio_multifd_free(VFIOMultifd *multifd)
> +{
> +    g_free(multifd);
> +}
> +
>   bool vfio_multifd_transfer_supported(void)
>   {
>       return multifd_device_state_supported() &&
>           migrate_send_switchover_start();
>   }
> +
> +bool vfio_multifd_transfer_enabled(VFIODevice *vbasedev)
> +{
> +    return false;
> +}
> +
> +bool vfio_multifd_transfer_setup(VFIODevice *vbasedev, Error **errp)
> +{
> +    if (vfio_multifd_transfer_enabled(vbasedev) &&
> +        !vfio_multifd_transfer_supported()) {
> +        error_setg(errp,
> +                   "%s: Multifd device transfer requested but unsupported in the current config",
> +                   vbasedev->name);
> +        return false;
> +    }
> +
> +    return true;
> +}
> diff --git a/hw/vfio/migration-multifd.h b/hw/vfio/migration-multifd.h
> index 8fe004c1da81..1eefba3b2eed 100644
> --- a/hw/vfio/migration-multifd.h
> +++ b/hw/vfio/migration-multifd.h
> @@ -12,6 +12,14 @@
>   
>   #include "hw/vfio/vfio-common.h"
>   
> +typedef struct VFIOMultifd VFIOMultifd;
> +
> +VFIOMultifd *vfio_multifd_new(void);
> +void vfio_multifd_free(VFIOMultifd *multifd);
> +
>   bool vfio_multifd_transfer_supported(void);
> +bool vfio_multifd_transfer_enabled(VFIODevice *vbasedev);
> +
> +bool vfio_multifd_transfer_setup(VFIODevice *vbasedev, Error **errp);
>   
>   #endif
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 7b79be6ad293..4311de763885 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -674,15 +674,40 @@ static void vfio_save_state(QEMUFile *f, void *opaque)
>   static int vfio_load_setup(QEMUFile *f, void *opaque, Error **errp)
>   {
>       VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +    int ret;
> +
> +    if (!vfio_multifd_transfer_setup(vbasedev, errp)) {
> +        return -EINVAL;
> +    }
> +
> +    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RESUMING,
> +                                   migration->device_state, errp);
> +    if (ret) {
> +        return ret;
> +    }
>   
> -    return vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RESUMING,
> -                                    vbasedev->migration->device_state, errp);
> +    if (vfio_multifd_transfer_enabled(vbasedev)) {
> +        assert(!migration->multifd);
> +        migration->multifd = vfio_multifd_new();

When called from vfio_load_setup(), I think vfio_multifd_transfer_setup()
should allocate migration->multifd at the same time. It would simplify
the setup to one step. Maybe we could add a bool parameter? Because,
IIRC, you didn't like the idea of always allocating it, that is, in
vfio_save_setup() too.

For symmetry, could vfio_save_cleanup() call vfio_multifd_cleanup() too ?
a setup implies a cleanup.

Thanks,

C.


> +    }
> +
> +    return 0;
> +}
> +
> +static void vfio_multifd_cleanup(VFIODevice *vbasedev)
> +{
> +    VFIOMigration *migration = vbasedev->migration;
> +
> +    g_clear_pointer(&migration->multifd, vfio_multifd_free);
>   }
>   
>   static int vfio_load_cleanup(void *opaque)
>   {
>       VFIODevice *vbasedev = opaque;
>   
> +    vfio_multifd_cleanup(vbasedev);
> +
>       vfio_migration_cleanup(vbasedev);
>       trace_vfio_load_cleanup(vbasedev->name);
>   
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 64ee3b1a2547..ab110198bd6b 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -78,6 +78,8 @@ typedef struct VFIORegion {
>       uint8_t nr; /* cache the region number for debug */
>   } VFIORegion;
>   
> +typedef struct VFIOMultifd VFIOMultifd;
> +
>   typedef struct VFIOMigration {
>       struct VFIODevice *vbasedev;
>       VMChangeStateEntry *vm_state;
> @@ -89,6 +91,7 @@ typedef struct VFIOMigration {
>       uint64_t mig_flags;
>       uint64_t precopy_init_size;
>       uint64_t precopy_dirty_size;
> +    VFIOMultifd *multifd;
>       bool initial_data_sent;
>   
>       bool event_save_iterate_started;
> 



^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v5 32/36] vfio/migration: Make x-migration-multifd-transfer VFIO property mutable
  2025-02-19 20:34 ` [PATCH v5 32/36] vfio/migration: Make x-migration-multifd-transfer VFIO property mutable Maciej S. Szmigiero
@ 2025-02-26 17:59   ` Cédric Le Goater
  2025-02-26 21:05     ` Maciej S. Szmigiero
  0 siblings, 1 reply; 120+ messages in thread
From: Cédric Le Goater @ 2025-02-26 17:59 UTC (permalink / raw)
  To: Maciej S. Szmigiero, Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Eric Blake, Markus Armbruster,
	Daniel P . Berrangé, Avihai Horon, Joao Martins, qemu-devel

On 2/19/25 21:34, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> 
> DEFINE_PROP_ON_OFF_AUTO() property isn't runtime-mutable so using it
> would mean that the source VM would need to decide upfront at startup
> time whether it wants to do a multifd device state transfer at some
> point.
> 
> Source VM can run for a long time before being migrated so it is
> desirable to have a fallback mechanism to the old way of transferring
> VFIO device state if it turns out to be necessary.
> 
> This brings this property to the same mutability level as ordinary
> migration parameters, which too can be adjusted at the run time.
> 
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> ---
>   hw/vfio/pci.c | 12 +++++++++---
>   1 file changed, 9 insertions(+), 3 deletions(-)
> 
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 184ff882f9d1..9111805ae06c 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -3353,6 +3353,8 @@ static void vfio_instance_init(Object *obj)
>       pci_dev->cap_present |= QEMU_PCI_CAP_EXPRESS;
>   }
>   
> +static PropertyInfo qdev_prop_on_off_auto_mutable;

please use another name, like vfio_pci_migration_multifd_transfer_prop.
I wish we could define the property info all at once.

Thanks,

C.


> +
>   static const Property vfio_pci_dev_properties[] = {
>       DEFINE_PROP_PCI_HOST_DEVADDR("host", VFIOPCIDevice, host),
>       DEFINE_PROP_UUID_NODEFAULT("vf-token", VFIOPCIDevice, vf_token),
> @@ -3377,9 +3379,10 @@ static const Property vfio_pci_dev_properties[] = {
>                       VFIO_FEATURE_ENABLE_IGD_OPREGION_BIT, false),
>       DEFINE_PROP_ON_OFF_AUTO("enable-migration", VFIOPCIDevice,
>                               vbasedev.enable_migration, ON_OFF_AUTO_AUTO),
> -    DEFINE_PROP_ON_OFF_AUTO("x-migration-multifd-transfer", VFIOPCIDevice,
> -                            vbasedev.migration_multifd_transfer,
> -                            ON_OFF_AUTO_AUTO),
> +    DEFINE_PROP("x-migration-multifd-transfer", VFIOPCIDevice,
> +                vbasedev.migration_multifd_transfer,
> +                qdev_prop_on_off_auto_mutable, OnOffAuto,
> +                .set_default = true, .defval.i = ON_OFF_AUTO_AUTO),
>       DEFINE_PROP_BOOL("migration-events", VFIOPCIDevice,
>                        vbasedev.migration_events, false),
>       DEFINE_PROP_BOOL("x-no-mmap", VFIOPCIDevice, vbasedev.no_mmap, false),
> @@ -3475,6 +3478,9 @@ static const TypeInfo vfio_pci_nohotplug_dev_info = {
>   
>   static void register_vfio_pci_dev_type(void)
>   {
> +    qdev_prop_on_off_auto_mutable = qdev_prop_on_off_auto;
> +    qdev_prop_on_off_auto_mutable.realized_set_allowed = true;
> +
>       type_register_static(&vfio_pci_dev_info);
>       type_register_static(&vfio_pci_nohotplug_dev_info);
>   }
> 



^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v5 33/36] hw/core/machine: Add compat for x-migration-multifd-transfer VFIO property
  2025-02-19 20:34 ` [PATCH v5 33/36] hw/core/machine: Add compat for x-migration-multifd-transfer VFIO property Maciej S. Szmigiero
@ 2025-02-26 17:59   ` Cédric Le Goater
  0 siblings, 0 replies; 120+ messages in thread
From: Cédric Le Goater @ 2025-02-26 17:59 UTC (permalink / raw)
  To: Maciej S. Szmigiero, Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Eric Blake, Markus Armbruster,
	Daniel P . Berrangé, Avihai Horon, Joao Martins, qemu-devel

On 2/19/25 21:34, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> 
> Add a hw_compat entry for recently added x-migration-multifd-transfer VFIO
> property.
> 
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>


Reviewed-by: Cédric Le Goater <clg@redhat.com>

Thanks,

C.


> ---
>   hw/core/machine.c | 1 +
>   1 file changed, 1 insertion(+)
> 
> diff --git a/hw/core/machine.c b/hw/core/machine.c
> index 21c3bde92f08..d0a87f5ccbaa 100644
> --- a/hw/core/machine.c
> +++ b/hw/core/machine.c
> @@ -44,6 +44,7 @@ GlobalProperty hw_compat_9_2[] = {
>       { "virtio-mem-pci", "vectors", "0" },
>       { "migration", "multifd-clean-tls-termination", "false" },
>       { "migration", "send-switchover-start", "off"},
> +    { "vfio-pci", "x-migration-multifd-transfer", "off" },
>   };
>   const size_t hw_compat_9_2_len = G_N_ELEMENTS(hw_compat_9_2);
>   
> 



^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v5 26/36] vfio/migration: Multifd device state transfer support - received buffers queuing
  2025-02-26 10:43   ` Cédric Le Goater
@ 2025-02-26 21:04     ` Maciej S. Szmigiero
  2025-02-28  8:09       ` Cédric Le Goater
  0 siblings, 1 reply; 120+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-26 21:04 UTC (permalink / raw)
  To: Cédric Le Goater
  Cc: Alex Williamson, Eric Blake, Peter Xu, Fabiano Rosas,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

On 26.02.2025 11:43, Cédric Le Goater wrote:
> On 2/19/25 21:34, Maciej S. Szmigiero wrote:
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> The multifd received data needs to be reassembled since device state
>> packets sent via different multifd channels can arrive out-of-order.
>>
>> Therefore, each VFIO device state packet carries a header indicating its
>> position in the stream.
>> The raw device state data is saved into a VFIOStateBuffer for later
>> in-order loading into the device.
>>
>> The last such VFIO device state packet should have
>> VFIO_DEVICE_STATE_CONFIG_STATE flag set and carry the device config state.
>>
>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>> ---
>>   hw/vfio/migration-multifd.c | 103 ++++++++++++++++++++++++++++++++++++
>>   hw/vfio/migration-multifd.h |   3 ++
>>   hw/vfio/migration.c         |   1 +
>>   hw/vfio/trace-events        |   1 +
>>   4 files changed, 108 insertions(+)
>>
>> diff --git a/hw/vfio/migration-multifd.c b/hw/vfio/migration-multifd.c
>> index c2defc0efef0..5d5ee1393674 100644
>> --- a/hw/vfio/migration-multifd.c
>> +++ b/hw/vfio/migration-multifd.c
>> @@ -42,6 +42,11 @@ typedef struct VFIOStateBuffer {
>>   } VFIOStateBuffer;
>>   typedef struct VFIOMultifd {
>> +    VFIOStateBuffers load_bufs;
>> +    QemuCond load_bufs_buffer_ready_cond;
>> +    QemuMutex load_bufs_mutex; /* Lock order: this lock -> BQL */
>> +    uint32_t load_buf_idx;
>> +    uint32_t load_buf_idx_last;
>>   } VFIOMultifd;
>>   static void vfio_state_buffer_clear(gpointer data)
>> @@ -87,15 +92,113 @@ static VFIOStateBuffer *vfio_state_buffers_at(VFIOStateBuffers *bufs, guint idx)
>>       return &g_array_index(bufs->array, VFIOStateBuffer, idx);
>>   }
> 
> this routine expects load_bufs_mutex to be locked? Maybe say so.

I guess the comment above pertains to the vfio_load_state_buffer_insert()
below.

Do you mean it should have a comment that it expects to be called
under load_bufs_mutex?

>> +static bool vfio_load_state_buffer_insert(VFIODevice *vbasedev,
> 
> could you pass VFIOMultifd* instead  ?

No, it needs vbasedev->migration_max_queued_buffers too (introduced
in a later patch).

Also, most of VFIO routines (besides very small helpers/wrappers)
take VFIODevice *.

>> +                                          VFIODeviceStatePacket *packet,
>> +                                          size_t packet_total_size,
>> +                                          Error **errp)
>> +{
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    VFIOMultifd *multifd = migration->multifd;
>> +    VFIOStateBuffer *lb;
>> +
>> +    vfio_state_buffers_assert_init(&multifd->load_bufs);
>> +    if (packet->idx >= vfio_state_buffers_size_get(&multifd->load_bufs)) {
>> +        vfio_state_buffers_size_set(&multifd->load_bufs, packet->idx + 1);
>> +    }
>> +
>> +    lb = vfio_state_buffers_at(&multifd->load_bufs, packet->idx);
>> +    if (lb->is_present) {
>> +        error_setg(errp, "state buffer %" PRIu32 " already filled",
>> +                   packet->idx);
>> +        return false;
>> +    }
>> +
>> +    assert(packet->idx >= multifd->load_buf_idx);
>> +
>> +    lb->data = g_memdup2(&packet->data, packet_total_size - sizeof(*packet));
>> +    lb->len = packet_total_size - sizeof(*packet);
>> +    lb->is_present = true;
>> +
>> +    return true;
>> +}
>> +
>> +bool vfio_load_state_buffer(void *opaque, char *data, size_t data_size,
>> +                            Error **errp)
> 
> 
> AFAICS, the only users of the .load_state_buffer() handlers is
> multifd_device_state_recv().
> 
> Please rename to vfio_multifd_load_state_buffer().

Renamed it accordingly.

>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    VFIOMultifd *multifd = migration->multifd;
>> +    VFIODeviceStatePacket *packet = (VFIODeviceStatePacket *)data;
>> +
>> +    /*
>> +     * Holding BQL here would violate the lock order and can cause
>> +     * a deadlock once we attempt to lock load_bufs_mutex below.
>> +     */
>> +    assert(!bql_locked());
>> +
>> +    if (!vfio_multifd_transfer_enabled(vbasedev)) {
>> +        error_setg(errp,
>> +                   "got device state packet but not doing multifd transfer");
>> +        return false;
>> +    }
>> +
>> +    assert(multifd);
>> +
>> +    if (data_size < sizeof(*packet)) {
>> +        error_setg(errp, "packet too short at %zu (min is %zu)",
>> +                   data_size, sizeof(*packet));
>> +        return false;
>> +    }
>> +
>> +    if (packet->version != VFIO_DEVICE_STATE_PACKET_VER_CURRENT) {
>> +        error_setg(errp, "packet has unknown version %" PRIu32,
>> +                   packet->version);
>> +        return false;
>> +    }
>> +
>> +    if (packet->idx == UINT32_MAX) {
>> +        error_setg(errp, "packet has too high idx");
> 
> or "packet index is invalid" ?

Changed the error message.

>> +        return false;
>> +    }
>> +
>> +    trace_vfio_load_state_device_buffer_incoming(vbasedev->name, packet->idx);
>> +
>> +    QEMU_LOCK_GUARD(&multifd->load_bufs_mutex);
> 
> Using WITH_QEMU_LOCK_GUARD() would be cleaner I think.

Changed into a WITH_QEMU_LOCK_GUARD() block.
  
> 
> 
> Thanks,
> 
> C.

Thanks,
Maciej



^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v5 27/36] vfio/migration: Multifd device state transfer support - load thread
  2025-02-26 13:49   ` Cédric Le Goater
@ 2025-02-26 21:05     ` Maciej S. Szmigiero
  2025-02-28  9:11       ` Cédric Le Goater
  2025-03-02 14:19     ` Avihai Horon
  1 sibling, 1 reply; 120+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-26 21:05 UTC (permalink / raw)
  To: Cédric Le Goater
  Cc: Alex Williamson, Peter Xu, Fabiano Rosas, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

On 26.02.2025 14:49, Cédric Le Goater wrote:
> On 2/19/25 21:34, Maciej S. Szmigiero wrote:
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> Since it's important to finish loading device state transferred via the
>> main migration channel (via save_live_iterate SaveVMHandler) before
>> starting loading the data asynchronously transferred via multifd the thread
>> doing the actual loading of the multifd transferred data is only started
>> from switchover_start SaveVMHandler.
>>
>> switchover_start handler is called when MIG_CMD_SWITCHOVER_START
>> sub-command of QEMU_VM_COMMAND is received via the main migration channel.
>>
>> This sub-command is only sent after all save_live_iterate data have already
>> been posted so it is safe to commence loading of the multifd-transferred
>> device state upon receiving it - loading of save_live_iterate data happens
>> synchronously in the main migration thread (much like the processing of
>> MIG_CMD_SWITCHOVER_START) so by the time MIG_CMD_SWITCHOVER_START is
>> processed all the proceeding data must have already been loaded.
>>
>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>> ---
>>   hw/vfio/migration-multifd.c | 225 ++++++++++++++++++++++++++++++++++++
>>   hw/vfio/migration-multifd.h |   2 +
>>   hw/vfio/migration.c         |  12 ++
>>   hw/vfio/trace-events        |   5 +
>>   4 files changed, 244 insertions(+)
>>
>> diff --git a/hw/vfio/migration-multifd.c b/hw/vfio/migration-multifd.c
>> index 5d5ee1393674..b3a88c062769 100644
>> --- a/hw/vfio/migration-multifd.c
>> +++ b/hw/vfio/migration-multifd.c
>> @@ -42,8 +42,13 @@ typedef struct VFIOStateBuffer {
>>   } VFIOStateBuffer;
>>   typedef struct VFIOMultifd {
>> +    QemuThread load_bufs_thread;
>> +    bool load_bufs_thread_running;
>> +    bool load_bufs_thread_want_exit;
>> +
>>       VFIOStateBuffers load_bufs;
>>       QemuCond load_bufs_buffer_ready_cond;
>> +    QemuCond load_bufs_thread_finished_cond;
>>       QemuMutex load_bufs_mutex; /* Lock order: this lock -> BQL */
>>       uint32_t load_buf_idx;
>>       uint32_t load_buf_idx_last;
>> @@ -179,6 +184,175 @@ bool vfio_load_state_buffer(void *opaque, char *data, size_t data_size,
>>       return true;
>>   }
>> +static int vfio_load_bufs_thread_load_config(VFIODevice *vbasedev)
>> +{
>> +    return -EINVAL;
>> +}
> 
> 
> please move to next patch.

As I wrote on the previous version of the patch set at
https://lore.kernel.org/qemu-devel/4f335de0-ba9f-4537-b230-2cf8af1c160b@maciej.szmigiero.name/:
> The dummy call has to be there, otherwise the code at the
> previous commit time wouldn't compile since that
> vfio_load_bufs_thread_load_config() call is a part of
> vfio_load_bufs_thread().
> 
> This is an artifact of splitting the whole load operation in
> multiple commits.

I think adding empty dummy implementations is the typical way
to do this - much like you asked today to leave
vfio_multifd_transfer_setup() returning true unconditionally
before being filled with true implementation in later patch.

See also my response at the end of this e-mail message, below
the call to vfio_load_bufs_thread_load_config().

>> +static VFIOStateBuffer *vfio_load_state_buffer_get(VFIOMultifd *multifd)
>> +{
>> +    VFIOStateBuffer *lb;
>> +    guint bufs_len;
> 
> guint:  I guess it's ok to use here. It is not common practice in VFIO.
> 
>> +
>> +    bufs_len = vfio_state_buffers_size_get(&multifd->load_bufs);
>> +    if (multifd->load_buf_idx >= bufs_len) {
>> +        assert(multifd->load_buf_idx == bufs_len);
>> +        return NULL;
>> +    }
>> +
>> +    lb = vfio_state_buffers_at(&multifd->load_bufs,
>> +                               multifd->load_buf_idx);
> 
> Could be one line. minor.
> 
>> +    if (!lb->is_present) {
>> +        return NULL;
>> +    }
>> +
>> +    return lb;
>> +}
>> +
>> +static bool vfio_load_state_buffer_write(VFIODevice *vbasedev,
>> +                                         VFIOStateBuffer *lb,
>> +                                         Error **errp)
>> +{
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    VFIOMultifd *multifd = migration->multifd;
>> +    g_autofree char *buf = NULL;
>> +    char *buf_cur;
>> +    size_t buf_len;
>> +
>> +    if (!lb->len) {
>> +        return true;
>> +    }
>> +
>> +    trace_vfio_load_state_device_buffer_load_start(vbasedev->name,
>> +                                                   multifd->load_buf_idx);
> 
> I think we can move this trace event to vfio_load_bufs_thread()

It would get messy since we don't load empty buffers,
so we don't print this trace point (and its _end sibling)
for empty buffers.

If we print this in vfio_load_bufs_thread() then it would
need to duplicate that !lb->len check.

>> +    /* lb might become re-allocated when we drop the lock */
>> +    buf = g_steal_pointer(&lb->data);
>> +    buf_cur = buf;
>> +    buf_len = lb->len;
>> +    while (buf_len > 0) {
>> +        ssize_t wr_ret;
>> +        int errno_save;
>> +
>> +        /*
>> +         * Loading data to the device takes a while,
>> +         * drop the lock during this process.
>> +         */
>> +        qemu_mutex_unlock(&multifd->load_bufs_mutex);
>> +        wr_ret = write(migration->data_fd, buf_cur, buf_len);
>> +        errno_save = errno;
>> +        qemu_mutex_lock(&multifd->load_bufs_mutex);
>> +
>> +        if (wr_ret < 0) {
>> +            error_setg(errp,
>> +                       "writing state buffer %" PRIu32 " failed: %d",
>> +                       multifd->load_buf_idx, errno_save);
>> +            return false;
>> +        }
>> +
>> +        assert(wr_ret <= buf_len);
>> +        buf_len -= wr_ret;
>> +        buf_cur += wr_ret;
>> +    }
>> +
>> +    trace_vfio_load_state_device_buffer_load_end(vbasedev->name,
>> +                                                 multifd->load_buf_idx);
> 
> and drop this trace event.

That's important data since it shows how long it took to load that
buffer (_end - _start).

It's not the same information as _start(next buffer) - _start(current buffer)
since the next buffer might not have arrived yet so its loading won't
start immediately after the end of loading of the previous one.

> In which case, we can modify the parameters of vfio_load_state_buffer_write()
> to use directly a 'VFIOMultifd *multifd'and an fd instead of "migration->data_fd".
> 
>> +
>> +    return true;
>> +}
>> +
>> +static bool vfio_load_bufs_thread_want_exit(VFIOMultifd *multifd,
>> +                                            bool *should_quit)
>> +{
>> +    return multifd->load_bufs_thread_want_exit || qatomic_read(should_quit);
>> +}
>> +
>> +/*
>> + * This thread is spawned by vfio_multifd_switchover_start() which gets
>> + * called upon encountering the switchover point marker in main migration
>> + * stream.
>> + *
>> + * It exits after either:
>> + * * completing loading the remaining device state and device config, OR:
>> + * * encountering some error while doing the above, OR:
>> + * * being forcefully aborted by the migration core by it setting should_quit
>> + *   or by vfio_load_cleanup_load_bufs_thread() setting
>> + *   multifd->load_bufs_thread_want_exit.
>> + */
>> +static bool vfio_load_bufs_thread(void *opaque, bool *should_quit, Error **errp)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    VFIOMultifd *multifd = migration->multifd;
>> +    bool ret = true;
>> +    int config_ret;
> 
> No needed IMO. see below.
> 
>> +
>> +    assert(multifd);
>> +    QEMU_LOCK_GUARD(&multifd->load_bufs_mutex);
>> +
>> +    assert(multifd->load_bufs_thread_running);
> 
> We could add a trace event for the start and the end of the thread.

Added vfio_load_bufs_thread_{start,end} trace events now.

>> +    while (true) {
>> +        VFIOStateBuffer *lb;
>> +
>> +        /*
>> +         * Always check cancellation first after the buffer_ready wait below in
>> +         * case that cond was signalled by vfio_load_cleanup_load_bufs_thread().
>> +         */
>> +        if (vfio_load_bufs_thread_want_exit(multifd, should_quit)) {
>> +            error_setg(errp, "operation cancelled");
>> +            ret = false;
>> +            goto ret_signal;
> 
> goto thread_exit ?

I'm not sure that I fully understand this comment.
Do you mean to rename the ret_signal label to thread_exit?

>> +        }
>> +
>> +        assert(multifd->load_buf_idx <= multifd->load_buf_idx_last);
>> +
>> +        lb = vfio_load_state_buffer_get(multifd);
>> +        if (!lb) {
>> +            trace_vfio_load_state_device_buffer_starved(vbasedev->name,
>> +                                                        multifd->load_buf_idx);
>> +            qemu_cond_wait(&multifd->load_bufs_buffer_ready_cond,
>> +                           &multifd->load_bufs_mutex);
>> +            continue;
>> +        }
>> +
>> +        if (multifd->load_buf_idx == multifd->load_buf_idx_last) {
>> +            break;
>> +        }
>> +
>> +        if (multifd->load_buf_idx == 0) {
>> +            trace_vfio_load_state_device_buffer_start(vbasedev->name);
>> +        }
>> +
>> +        if (!vfio_load_state_buffer_write(vbasedev, lb, errp)) {
>> +            ret = false;
>> +            goto ret_signal;
>> +        }
>> +
>> +        if (multifd->load_buf_idx == multifd->load_buf_idx_last - 1) {
>> +            trace_vfio_load_state_device_buffer_end(vbasedev->name);
>> +        }
>> +
>> +        multifd->load_buf_idx++;
>> +    }
> 
> if ret is assigned to true here, the "ret = false" can dropped

I inverted the "ret" logic here now - initialized ret to false
at definition, removed "ret = false" at every failure/early exit block
and added "ret = true" just before the "ret_signal" label.

>> +    config_ret = vfio_load_bufs_thread_load_config(vbasedev);
>> +    if (config_ret) {
>> +        error_setg(errp, "load config state failed: %d", config_ret);
>> +        ret = false;
>> +    }
> 
> please move to next patch. This is adding nothing to this patch
> since it's returning -EINVAL.
> 

That's the whole point - if someone were to accidentally enable this
(for example by forgetting to apply the next patch when backporting
the series) it would fail safely with -EINVAL instead of leaving a
half-broken implementation.

Another option would be to simply integrate the next patch into this
one, since these are two parts of the same single operation and I think
splitting them in two brings little value in the end.

> Thanks,
> 
> C.

Thanks,
Maciej



^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v5 28/36] vfio/migration: Multifd device state transfer support - config loading support
  2025-02-26 13:52   ` Cédric Le Goater
@ 2025-02-26 21:05     ` Maciej S. Szmigiero
  0 siblings, 0 replies; 120+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-26 21:05 UTC (permalink / raw)
  To: Cédric Le Goater
  Cc: Alex Williamson, Eric Blake, Peter Xu, Fabiano Rosas,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

On 26.02.2025 14:52, Cédric Le Goater wrote:
> On 2/19/25 21:34, Maciej S. Szmigiero wrote:
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> Load device config received via multifd using the existing machinery
>> behind vfio_load_device_config_state().
>>
>> Also, make sure to process the relevant main migration channel flags.
>>
>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>> ---
>>   hw/vfio/migration-multifd.c   | 47 ++++++++++++++++++++++++++++++++++-
>>   hw/vfio/migration.c           |  8 +++++-
>>   include/hw/vfio/vfio-common.h |  2 ++
>>   3 files changed, 55 insertions(+), 2 deletions(-)
>>
>> diff --git a/hw/vfio/migration-multifd.c b/hw/vfio/migration-multifd.c
>> index b3a88c062769..7200f6f1c2a2 100644
>> --- a/hw/vfio/migration-multifd.c
>> +++ b/hw/vfio/migration-multifd.c
>> @@ -15,6 +15,7 @@
>>   #include "qemu/lockable.h"
>>   #include "qemu/main-loop.h"
>>   #include "qemu/thread.h"
>> +#include "io/channel-buffer.h"
>>   #include "migration/qemu-file.h"
>>   #include "migration-multifd.h"
>>   #include "trace.h"
>> @@ -186,7 +187,51 @@ bool vfio_load_state_buffer(void *opaque, char *data, size_t data_size,
>>   static int vfio_load_bufs_thread_load_config(VFIODevice *vbasedev)
> 
> please modify to return a bool and take a "Error **errp" parameter.
> 

Done.
  
> Thanks,
> 
> C.

Thanks,
Maciej




* Re: [PATCH v5 30/36] vfio/migration: Multifd device state transfer support - send side
  2025-02-26 16:43   ` Cédric Le Goater
@ 2025-02-26 21:05     ` Maciej S. Szmigiero
  2025-02-28  9:13       ` Cédric Le Goater
  0 siblings, 1 reply; 120+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-26 21:05 UTC (permalink / raw)
  To: Cédric Le Goater
  Cc: Alex Williamson, Eric Blake, Peter Xu, Fabiano Rosas,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

On 26.02.2025 17:43, Cédric Le Goater wrote:
> On 2/19/25 21:34, Maciej S. Szmigiero wrote:
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> Implement the multifd device state transfer via additional per-device
>> thread inside save_live_complete_precopy_thread handler.
>>
>> Switch between doing the data transfer in the new handler and doing it
>> in the old save_state handler depending on the
>> x-migration-multifd-transfer device property value.
>>
>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>> ---
>>   hw/vfio/migration-multifd.c   | 139 ++++++++++++++++++++++++++++++++++
>>   hw/vfio/migration-multifd.h   |   5 ++
>>   hw/vfio/migration.c           |  26 +++++--
>>   hw/vfio/trace-events          |   2 +
>>   include/hw/vfio/vfio-common.h |   8 ++
>>   5 files changed, 174 insertions(+), 6 deletions(-)
>>
>> diff --git a/hw/vfio/migration-multifd.c b/hw/vfio/migration-multifd.c
>> index 7200f6f1c2a2..0cfa9d31732a 100644
>> --- a/hw/vfio/migration-multifd.c
>> +++ b/hw/vfio/migration-multifd.c
>> @@ -476,6 +476,145 @@ bool vfio_multifd_transfer_setup(VFIODevice *vbasedev, Error **errp)
>>       return true;
>>   }
>> +void vfio_multifd_emit_dummy_eos(VFIODevice *vbasedev, QEMUFile *f)
>> +{
>> +    assert(vfio_multifd_transfer_enabled(vbasedev));
>> +
>> +    /*
>> +     * Emit dummy NOP data on the main migration channel since the actual
>> +     * device state transfer is done via multifd channels.
>> +     */
>> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>> +}
>> +
>> +static bool
>> +vfio_save_complete_precopy_thread_config_state(VFIODevice *vbasedev,
>> +                                               char *idstr,
>> +                                               uint32_t instance_id,
>> +                                               uint32_t idx,
>> +                                               Error **errp)
>> +{
>> +    g_autoptr(QIOChannelBuffer) bioc = NULL;
>> +    g_autoptr(QEMUFile) f = NULL;
>> +    int ret;
>> +    g_autofree VFIODeviceStatePacket *packet = NULL;
>> +    size_t packet_len;
>> +
>> +    bioc = qio_channel_buffer_new(0);
>> +    qio_channel_set_name(QIO_CHANNEL(bioc), "vfio-device-config-save");
>> +
>> +    f = qemu_file_new_output(QIO_CHANNEL(bioc));
>> +
>> +    if (vfio_save_device_config_state(f, vbasedev, errp)) {
>> +        return false;
>> +    }
>> +
>> +    ret = qemu_fflush(f);
>> +    if (ret) {
>> +        error_setg(errp, "save config state flush failed: %d", ret);
>> +        return false;
>> +    }
>> +
>> +    packet_len = sizeof(*packet) + bioc->usage;
>> +    packet = g_malloc0(packet_len);
>> +    packet->version = VFIO_DEVICE_STATE_PACKET_VER_CURRENT;
>> +    packet->idx = idx;
>> +    packet->flags = VFIO_DEVICE_STATE_CONFIG_STATE;
>> +    memcpy(&packet->data, bioc->data, bioc->usage);
>> +
>> +    if (!multifd_queue_device_state(idstr, instance_id,
>> +                                    (char *)packet, packet_len)) {
>> +        error_setg(errp, "multifd config data queuing failed");
>> +        return false;
>> +    }
>> +
>> +    vfio_add_bytes_transferred(packet_len);
>> +
>> +    return true;
>> +}
>> +
>> +/*
>> + * This thread is spawned by the migration core directly via
>> + * .save_live_complete_precopy_thread SaveVMHandler.
>> + *
>> + * It exits after either:
>> + * * completing saving the remaining device state and device config, OR:
>> + * * encountering some error while doing the above, OR:
>> + * * being forcefully aborted by the migration core by
>> + *   multifd_device_state_save_thread_should_exit() returning true.
>> + */
>> +bool vfio_save_complete_precopy_thread(SaveLiveCompletePrecopyThreadData *d,
>> +                                       Error **errp)
> 
> In qemu_savevm_state_complete_precopy_iterable(), this handler is
> called :
> 
>      ....
>      if (multifd_device_state) {
>          QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
>              SaveLiveCompletePrecopyThreadHandler hdlr;
> 
>              if (!se->ops || (in_postcopy && se->ops->has_postcopy &&
>                               se->ops->has_postcopy(se->opaque)) ||
>                  !se->ops->save_live_complete_precopy_thread) {
>                  continue;
>              }
> 
>              hdlr = se->ops->save_live_complete_precopy_thread;
>              multifd_spawn_device_state_save_thread(hdlr,
>                                                     se->idstr, se->instance_id,
>                                                     se->opaque);
>          }
>      }
> 
> 
> I suggest naming it : vfio_multifd_save_complete_precopy_thread()

Renamed accordingly.

>> +{
>> +    VFIODevice *vbasedev = d->handler_opaque;
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    bool ret;
>> +    g_autofree VFIODeviceStatePacket *packet = NULL;
>> +    uint32_t idx;
>> +
>> +    if (!vfio_multifd_transfer_enabled(vbasedev)) {
>> +        /* Nothing to do, vfio_save_complete_precopy() does the transfer. */
>> +        return true;
>> +    }
>> +
>> +    trace_vfio_save_complete_precopy_thread_start(vbasedev->name,
>> +                                                  d->idstr, d->instance_id);
>> +
>> +    /* We reach here with device state STOP or STOP_COPY only */
>> +    if (vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP_COPY,
>> +                                 VFIO_DEVICE_STATE_STOP, errp)) {
>> +        ret = false;
> 
> These "ret = false" can be avoided if the variable is set at the
> top of the function.

I inverted the "ret" logic here as in vfio_load_bufs_thread()
to make it false by default and set to true just before early
exit label.

>> +        goto ret_finish;
> 
> 
> goto thread_exit ?

As I asked in one of the previous patches, does this comment mean
that you want to rename the ret_finish label to thread_exit?

>> +    }
>> +
>> +    packet = g_malloc0(sizeof(*packet) + migration->data_buffer_size);
>> +    packet->version = VFIO_DEVICE_STATE_PACKET_VER_CURRENT;
>> +
>> +    for (idx = 0; ; idx++) {
>> +        ssize_t data_size;
>> +        size_t packet_size;
>> +
>> +        if (multifd_device_state_save_thread_should_exit()) {
>> +            error_setg(errp, "operation cancelled");
>> +            ret = false;
>> +            goto ret_finish;
>> +        }
>> +
>> +        data_size = read(migration->data_fd, &packet->data,
>> +                         migration->data_buffer_size);
>> +        if (data_size < 0) {
>> +            error_setg(errp, "reading state buffer %" PRIu32 " failed: %d",
>> +                       idx, errno);
>> +            ret = false;
>> +            goto ret_finish;
>> +        } else if (data_size == 0) {
>> +            break;
>> +        }
>> +
>> +        packet->idx = idx;
>> +        packet_size = sizeof(*packet) + data_size;
>> +
>> +        if (!multifd_queue_device_state(d->idstr, d->instance_id,
>> +                                        (char *)packet, packet_size)) {
>> +            error_setg(errp, "multifd data queuing failed");
>> +            ret = false;
>> +            goto ret_finish;
>> +        }
>> +
>> +        vfio_add_bytes_transferred(packet_size);
>> +    }
>> +
>> +    ret = vfio_save_complete_precopy_thread_config_state(vbasedev,
>> +                                                         d->idstr,
>> +                                                         d->instance_id,
>> +                                                         idx, errp);
>> +
>> +ret_finish:
>> +    trace_vfio_save_complete_precopy_thread_end(vbasedev->name, ret);
>> +
>> +    return ret;
>> +}
>> +
>>   int vfio_multifd_switchover_start(VFIODevice *vbasedev)
>>   {
>>       VFIOMigration *migration = vbasedev->migration;
>> diff --git a/hw/vfio/migration-multifd.h b/hw/vfio/migration-multifd.h
>> index 09cbb437d9d1..79780d7b5392 100644
>> --- a/hw/vfio/migration-multifd.h
>> +++ b/hw/vfio/migration-multifd.h
>> @@ -25,6 +25,11 @@ bool vfio_multifd_transfer_setup(VFIODevice *vbasedev, Error **errp);
>>   bool vfio_load_state_buffer(void *opaque, char *data, size_t data_size,
>>                               Error **errp);
>> +void vfio_multifd_emit_dummy_eos(VFIODevice *vbasedev, QEMUFile *f);
>> +
>> +bool vfio_save_complete_precopy_thread(SaveLiveCompletePrecopyThreadData *d,
>> +                                       Error **errp);
>> +
>>   int vfio_multifd_switchover_start(VFIODevice *vbasedev);
>>   #endif
>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>> index b962309f7c27..69dcf2dac2fa 100644
>> --- a/hw/vfio/migration.c
>> +++ b/hw/vfio/migration.c
(..)
>> @@ -238,8 +238,7 @@ static int vfio_load_buffer(QEMUFile *f, VFIODevice *vbasedev,
>>       return ret;
>>   }
>> -static int vfio_save_device_config_state(QEMUFile *f, void *opaque,
>> -                                         Error **errp)
>> +int vfio_save_device_config_state(QEMUFile *f, void *opaque, Error **errp)
>>   {
>>       VFIODevice *vbasedev = opaque;
>>       int ret;
>> @@ -453,6 +452,10 @@ static int vfio_save_setup(QEMUFile *f, void *opaque, Error **errp)
>>       uint64_t stop_copy_size = VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE;
>>       int ret;
>> +    if (!vfio_multifd_transfer_setup(vbasedev, errp)) {
>> +        return -EINVAL;
>> +    }
>> +
> 
> please move to another patch with the similar change of patch 25.
> 

This patch is about the send/save side, while patch 25
is called "*receive* init/cleanup".

So adding save setup code to a patch called "receive init" wouldn't be
consistent with that patch's subject.

>>       qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
>>       vfio_query_stop_copy_size(vbasedev, &stop_copy_size);

(..)
>> index ce2bdea8a2c2..ba851917f9fc 100644
>> --- a/include/hw/vfio/vfio-common.h
>> +++ b/include/hw/vfio/vfio-common.h
>> @@ -298,6 +298,14 @@ void vfio_add_bytes_transferred(unsigned long val);
>>   bool vfio_device_state_is_running(VFIODevice *vbasedev);
>>   bool vfio_device_state_is_precopy(VFIODevice *vbasedev);
>> +#ifdef CONFIG_LINUX
>> +int vfio_migration_set_state(VFIODevice *vbasedev,
>> +                             enum vfio_device_mig_state new_state,
>> +                             enum vfio_device_mig_state recover_state,
>> +                             Error **errp);
> 
> please move below with the other declarations under #ifdef CONFIG_LINUX.
> 
>> +#endif
>> +
>> +int vfio_save_device_config_state(QEMUFile *f, void *opaque, Error **errp);
>>   int vfio_load_device_config_state(QEMUFile *f, void *opaque);
>>   #ifdef CONFIG_LINUX
>>
> 

Done.

> 
> Thanks,
> 
> C.

Thanks,
Maciej




* Re: [PATCH v5 32/36] vfio/migration: Make x-migration-multifd-transfer VFIO property mutable
  2025-02-26 17:59   ` Cédric Le Goater
@ 2025-02-26 21:05     ` Maciej S. Szmigiero
  2025-02-28  8:44       ` Cédric Le Goater
  0 siblings, 1 reply; 120+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-26 21:05 UTC (permalink / raw)
  To: Cédric Le Goater
  Cc: Alex Williamson, Peter Xu, Fabiano Rosas, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

On 26.02.2025 18:59, Cédric Le Goater wrote:
> On 2/19/25 21:34, Maciej S. Szmigiero wrote:
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> DEFINE_PROP_ON_OFF_AUTO() property isn't runtime-mutable so using it
>> would mean that the source VM would need to decide upfront at startup
>> time whether it wants to do a multifd device state transfer at some
>> point.
>>
>> Source VM can run for a long time before being migrated so it is
>> desirable to have a fallback mechanism to the old way of transferring
>> VFIO device state if it turns to be necessary.
>>
>> This brings this property to the same mutability level as ordinary
>> migration parameters, which too can be adjusted at the run time.
>>
>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>> ---
>>   hw/vfio/pci.c | 12 +++++++++---
>>   1 file changed, 9 insertions(+), 3 deletions(-)
>>
>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>> index 184ff882f9d1..9111805ae06c 100644
>> --- a/hw/vfio/pci.c
>> +++ b/hw/vfio/pci.c
>> @@ -3353,6 +3353,8 @@ static void vfio_instance_init(Object *obj)
>>       pci_dev->cap_present |= QEMU_PCI_CAP_EXPRESS;
>>   }
>> +static PropertyInfo qdev_prop_on_off_auto_mutable;
> 
> please use another name, like vfio_pci_migration_multifd_transfer_prop.

Done.

> I wish we could define the property info all at once.

I'm not sure what you mean here - could you please elaborate a bit more?

This property mutability patch was split out from the previous patch
adding the actual x-migration-multifd-transfer VFIO property upon
your request.

> Thanks,
> 
> C.

Thanks,
Maciej




* Re: [PATCH v5 31/36] vfio/migration: Add x-migration-multifd-transfer VFIO property
  2025-02-19 20:34 ` [PATCH v5 31/36] vfio/migration: Add x-migration-multifd-transfer VFIO property Maciej S. Szmigiero
@ 2025-02-27  6:45   ` Cédric Le Goater
  2025-03-02 14:48   ` Avihai Horon
  1 sibling, 0 replies; 120+ messages in thread
From: Cédric Le Goater @ 2025-02-27  6:45 UTC (permalink / raw)
  To: Maciej S. Szmigiero, Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Eric Blake, Markus Armbruster,
	Daniel P . Berrangé, Avihai Horon, Joao Martins, qemu-devel

On 2/19/25 21:34, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> 
> This property allows configuring at runtime whether to transfer the
> particular device state via multifd channels when live migrating that
> device.
> 
> It defaults to AUTO, which means that VFIO device state transfer via
> multifd channels is attempted in configurations that otherwise support it.
> 
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> ---
>   hw/vfio/migration-multifd.c   | 17 ++++++++++++++++-
>   hw/vfio/pci.c                 |  3 +++
>   include/hw/vfio/vfio-common.h |  2 ++
>   3 files changed, 21 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/vfio/migration-multifd.c b/hw/vfio/migration-multifd.c
> index 0cfa9d31732a..18a5ff964a37 100644
> --- a/hw/vfio/migration-multifd.c
> +++ b/hw/vfio/migration-multifd.c
> @@ -460,11 +460,26 @@ bool vfio_multifd_transfer_supported(void)
>   
>   bool vfio_multifd_transfer_enabled(VFIODevice *vbasedev)
>   {
> -    return false;
> +    VFIOMigration *migration = vbasedev->migration;
> +
> +    return migration->multifd_transfer;
>   }
>   
>   bool vfio_multifd_transfer_setup(VFIODevice *vbasedev, Error **errp)
>   {
> +    VFIOMigration *migration = vbasedev->migration;
> +
> +    /*
> +     * Make a copy of this setting at the start in case it is changed
> +     * mid-migration.
> +     */
> +    if (vbasedev->migration_multifd_transfer == ON_OFF_AUTO_AUTO) {
> +        migration->multifd_transfer = vfio_multifd_transfer_supported();
> +    } else {
> +        migration->multifd_transfer =
> +            vbasedev->migration_multifd_transfer == ON_OFF_AUTO_ON;
> +    }
> +
>       if (vfio_multifd_transfer_enabled(vbasedev) &&
>           !vfio_multifd_transfer_supported()) {
>           error_setg(errp,
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 89d900e9cf0c..184ff882f9d1 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -3377,6 +3377,9 @@ static const Property vfio_pci_dev_properties[] = {
>                       VFIO_FEATURE_ENABLE_IGD_OPREGION_BIT, false),
>       DEFINE_PROP_ON_OFF_AUTO("enable-migration", VFIOPCIDevice,
>                               vbasedev.enable_migration, ON_OFF_AUTO_AUTO),
> +    DEFINE_PROP_ON_OFF_AUTO("x-migration-multifd-transfer", VFIOPCIDevice,
> +                            vbasedev.migration_multifd_transfer,
> +                            ON_OFF_AUTO_AUTO),
>       DEFINE_PROP_BOOL("migration-events", VFIOPCIDevice,
>                        vbasedev.migration_events, false),
>       DEFINE_PROP_BOOL("x-no-mmap", VFIOPCIDevice, vbasedev.no_mmap, false),

Please add property documentation in vfio_pci_dev_class_init()


Thanks,

C.



> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index ba851917f9fc..3006931accf6 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -91,6 +91,7 @@ typedef struct VFIOMigration {
>       uint64_t mig_flags;
>       uint64_t precopy_init_size;
>       uint64_t precopy_dirty_size;
> +    bool multifd_transfer;
>       VFIOMultifd *multifd;
>       bool initial_data_sent;
>   
> @@ -153,6 +154,7 @@ typedef struct VFIODevice {
>       bool no_mmap;
>       bool ram_block_discard_allowed;
>       OnOffAuto enable_migration;
> +    OnOffAuto migration_multifd_transfer;
>       bool migration_events;
>       VFIODeviceOps *ops;
>       unsigned int num_irqs;
> 




* Re: [PATCH v5 34/36] vfio/migration: Max in-flight VFIO device state buffer count limit
  2025-02-19 20:34 ` [PATCH v5 34/36] vfio/migration: Max in-flight VFIO device state buffer count limit Maciej S. Szmigiero
@ 2025-02-27  6:48   ` Cédric Le Goater
  2025-02-27 22:01     ` Maciej S. Szmigiero
  2025-03-02 14:53   ` Avihai Horon
  1 sibling, 1 reply; 120+ messages in thread
From: Cédric Le Goater @ 2025-02-27  6:48 UTC (permalink / raw)
  To: Maciej S. Szmigiero, Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Eric Blake, Markus Armbruster,
	Daniel P . Berrangé, Avihai Horon, Joao Martins, qemu-devel

On 2/19/25 21:34, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> 
> Allow capping the maximum count of in-flight VFIO device state buffers
> queued at the destination, otherwise a malicious QEMU source could
> theoretically cause the target QEMU to allocate unlimited amounts of memory
> for buffers-in-flight.
> 
> Since this is not expected to be a realistic threat in most of VFIO live
> migration use cases and the right value depends on the particular setup
> disable the limit by default by setting it to UINT64_MAX.
> 
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> ---
>   hw/vfio/migration-multifd.c   | 14 ++++++++++++++
>   hw/vfio/pci.c                 |  2 ++
>   include/hw/vfio/vfio-common.h |  1 +
>   3 files changed, 17 insertions(+)
> 
> diff --git a/hw/vfio/migration-multifd.c b/hw/vfio/migration-multifd.c
> index 18a5ff964a37..04aa3f4a6596 100644
> --- a/hw/vfio/migration-multifd.c
> +++ b/hw/vfio/migration-multifd.c
> @@ -53,6 +53,7 @@ typedef struct VFIOMultifd {
>       QemuMutex load_bufs_mutex; /* Lock order: this lock -> BQL */
>       uint32_t load_buf_idx;
>       uint32_t load_buf_idx_last;
> +    uint32_t load_buf_queued_pending_buffers;
>   } VFIOMultifd;
>   
>   static void vfio_state_buffer_clear(gpointer data)
> @@ -121,6 +122,15 @@ static bool vfio_load_state_buffer_insert(VFIODevice *vbasedev,
>   
>       assert(packet->idx >= multifd->load_buf_idx);
>   
> +    multifd->load_buf_queued_pending_buffers++;
> +    if (multifd->load_buf_queued_pending_buffers >
> +        vbasedev->migration_max_queued_buffers) {
> +        error_setg(errp,
> +                   "queuing state buffer %" PRIu32 " would exceed the max of %" PRIu64,
> +                   packet->idx, vbasedev->migration_max_queued_buffers);
> +        return false;
> +    }
> +
>       lb->data = g_memdup2(&packet->data, packet_total_size - sizeof(*packet));
>       lb->len = packet_total_size - sizeof(*packet);
>       lb->is_present = true;
> @@ -374,6 +384,9 @@ static bool vfio_load_bufs_thread(void *opaque, bool *should_quit, Error **errp)
>               goto ret_signal;
>           }
>   
> +        assert(multifd->load_buf_queued_pending_buffers > 0);
> +        multifd->load_buf_queued_pending_buffers--;
> +
>           if (multifd->load_buf_idx == multifd->load_buf_idx_last - 1) {
>               trace_vfio_load_state_device_buffer_end(vbasedev->name);
>           }
> @@ -408,6 +421,7 @@ VFIOMultifd *vfio_multifd_new(void)
>   
>       multifd->load_buf_idx = 0;
>       multifd->load_buf_idx_last = UINT32_MAX;
> +    multifd->load_buf_queued_pending_buffers = 0;
>       qemu_cond_init(&multifd->load_bufs_buffer_ready_cond);
>   
>       multifd->load_bufs_thread_running = false;
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 9111805ae06c..247418f0fce2 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -3383,6 +3383,8 @@ static const Property vfio_pci_dev_properties[] = {
>                   vbasedev.migration_multifd_transfer,
>                   qdev_prop_on_off_auto_mutable, OnOffAuto,
>                   .set_default = true, .defval.i = ON_OFF_AUTO_AUTO),
> +    DEFINE_PROP_UINT64("x-migration-max-queued-buffers", VFIOPCIDevice,
> +                       vbasedev.migration_max_queued_buffers, UINT64_MAX),

UINT64_MAX doesn't make sense to me. What would be a reasonable value ?

Have you monitored the max ? Should we collect some statistics on this
value and raise a warning if a high water mark is reached ? I think
this would more useful.

>       DEFINE_PROP_BOOL("migration-events", VFIOPCIDevice,
>                        vbasedev.migration_events, false),
>       DEFINE_PROP_BOOL("x-no-mmap", VFIOPCIDevice, vbasedev.no_mmap, false),


Please add property documentation in vfio_pci_dev_class_init()


Thanks,

C.


> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 3006931accf6..30a5bb9af61b 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -155,6 +155,7 @@ typedef struct VFIODevice {
>       bool ram_block_discard_allowed;
>       OnOffAuto enable_migration;
>       OnOffAuto migration_multifd_transfer;
> +    uint64_t migration_max_queued_buffers;
>       bool migration_events;
>       VFIODeviceOps *ops;
>       unsigned int num_irqs;
> 




* Re: [PATCH v5 36/36] vfio/migration: Update VFIO migration documentation
  2025-02-19 20:34 ` [PATCH v5 36/36] vfio/migration: Update VFIO migration documentation Maciej S. Szmigiero
@ 2025-02-27  6:59   ` Cédric Le Goater
  2025-02-27 22:01     ` Maciej S. Szmigiero
  0 siblings, 1 reply; 120+ messages in thread
From: Cédric Le Goater @ 2025-02-27  6:59 UTC (permalink / raw)
  To: Maciej S. Szmigiero, Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Eric Blake, Markus Armbruster,
	Daniel P . Berrangé, Avihai Horon, Joao Martins, qemu-devel

On 2/19/25 21:34, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> 
> Update the VFIO documentation at docs/devel/migration describing the
> changes brought by the multifd device state transfer.
> 
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> ---
>   docs/devel/migration/vfio.rst | 80 +++++++++++++++++++++++++++++++----
>   1 file changed, 71 insertions(+), 9 deletions(-)
> 
> diff --git a/docs/devel/migration/vfio.rst b/docs/devel/migration/vfio.rst
> index c49482eab66d..d9b169d29921 100644
> --- a/docs/devel/migration/vfio.rst
> +++ b/docs/devel/migration/vfio.rst
> @@ -16,6 +16,37 @@ helps to reduce the total downtime of the VM. VFIO devices opt-in to pre-copy
>   support by reporting the VFIO_MIGRATION_PRE_COPY flag in the
>   VFIO_DEVICE_FEATURE_MIGRATION ioctl.

Please add a new "multifd" documentation subsection at the end of the file
with this part :

> +Starting from QEMU version 10.0 there's a possibility to transfer VFIO device
> +_STOP_COPY state via multifd channels. This helps reduce downtime - especially
> +with multiple VFIO devices or with devices having a large migration state.
> +As an additional benefit, setting the VFIO device to _STOP_COPY state and
> +saving its config space is also parallelized (run in a separate thread) in
> +such migration mode.
> +
> +The multifd VFIO device state transfer is controlled by
> +"x-migration-multifd-transfer" VFIO device property. This property defaults to
> +AUTO, which means that VFIO device state transfer via multifd channels is
> +attempted in configurations that otherwise support it.
> +

I was expecting a much more detailed explanation on the design too  :

  * in the cover letter
  * in the hw/vfio/migration-multifd.c
  * in some new file under docs/devel/migration/



This section :

> +Since the target QEMU needs to load device state buffers in-order it needs to
> +queue incoming buffers until they can be loaded into the device.
> +This means that a malicious QEMU source could theoretically cause the target
> +QEMU to allocate unlimited amounts of memory for such buffers-in-flight.
> +
> +The "x-migration-max-queued-buffers" property allows capping the maximum count
> +of these VFIO device state buffers queued at the destination.
> +
> +Because a malicious QEMU source causing OOM on the target is not expected to be
> +a realistic threat in most of VFIO live migration use cases and the right value
> +depends on the particular setup by default this queued buffers limit is
> +disabled by setting it to UINT64_MAX.

should be in patch 34. It is not obvious it will be merged.


This section :

> +Some host platforms (like ARM64) require that VFIO device config is loaded only
> +after all iterables were loaded.
> +Such interlocking is controlled by "x-migration-load-config-after-iter" VFIO
> +device property, which in its default setting (AUTO) does so only on platforms
> +that actually require it.

Should be in 35. Same reason.


>   When pre-copy is supported, it's possible to further reduce downtime by
>   enabling "switchover-ack" migration capability.
>   VFIO migration uAPI defines "initial bytes" as part of its pre-copy data stream
> @@ -67,14 +98,39 @@ VFIO implements the device hooks for the iterative approach as follows:
>   * A ``switchover_ack_needed`` function that checks if the VFIO device uses
>     "switchover-ack" migration capability when this capability is enabled.
>   
> -* A ``save_state`` function to save the device config space if it is present.
> -
> -* A ``save_live_complete_precopy`` function that sets the VFIO device in
> -  _STOP_COPY state and iteratively copies the data for the VFIO device until
> -  the vendor driver indicates that no data remains.
> -
> -* A ``load_state`` function that loads the config section and the data
> -  sections that are generated by the save functions above.
> +* A ``switchover_start`` function that in the multifd mode starts a thread that
> +  reassembles the multifd received data and loads it in-order into the device.
> +  In the non-multifd mode this function is a NOP.
> +
> +* A ``save_state`` function to save the device config space if it is present
> +  in the non-multifd mode.
> +  In the multifd mode it just emits either a dummy EOS marker or
> +  "all iterables were loaded" flag for configurations that need to defer
> +  loading device config space after them.
> +
> +* A ``save_live_complete_precopy`` function that in the non-multifd mode sets
> +  the VFIO device in _STOP_COPY state and iteratively copies the data for the
> +  VFIO device until the vendor driver indicates that no data remains.
> +  In the multifd mode it just emits a dummy EOS marker.
> +
> +* A ``save_live_complete_precopy_thread`` function that in the multifd mode
> +  provides thread handler performing multifd device state transfer.
> +  It sets the VFIO device to _STOP_COPY state, iteratively reads the data
> +  from the VFIO device and queues it for multifd transmission until the vendor
> +  driver indicates that no data remains.
> +  After that, it saves the device config space and queues it for multifd
> +  transfer too.
> +  In the non-multifd mode this thread is a NOP.
> +
> +* A ``load_state`` function that loads the data sections that are generated
> +  by the main migration channel save functions above.
> +  In the non-multifd mode it also loads the config section, while in the
> +  multifd mode it handles the optional "all iterables were loaded" flag if
> +  it is in use.
> +
> +* A ``load_state_buffer`` function that loads the device state and the device
> +  config that arrived via multifd channels.
> +  It's used only in the multifd mode.

Please move the documentation of the new migration handlers in the
patch introducing them.


Thanks,

C.



>   
>   * ``cleanup`` functions for both save and load that perform any migration
>     related cleanup.
> @@ -176,8 +232,11 @@ Live migration save path
>                   Then the VFIO device is put in _STOP_COPY state
>                        (FINISH_MIGRATE, _ACTIVE, _STOP_COPY)
>            .save_live_complete_precopy() is called for each active device
> -      For the VFIO device, iterate in .save_live_complete_precopy() until
> +              For the VFIO device: in the non-multifd mode iterate in
> +                        .save_live_complete_precopy() until
>                                  pending data is 0
> +	          In the multifd mode this iteration is done in
> +	          .save_live_complete_precopy_thread() instead.
>                                         |
>                        (POSTMIGRATE, _COMPLETED, _STOP_COPY)
>              Migration thread schedules cleanup bottom half and exits
> @@ -194,6 +253,9 @@ Live migration resume path
>                             (RESTORE_VM, _ACTIVE, _STOP)
>                                         |
>        For each device, .load_state() is called for that device section data
> +                 transmitted via the main migration channel.
> +     For data transmitted via multifd channels .load_state_buffer() is called
> +                                   instead.
>                           (RESTORE_VM, _ACTIVE, _RESUMING)
>                                         |
>     At the end, .load_cleanup() is called for each device and vCPUs are started
> 



^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v5 25/36] vfio/migration: Multifd device state transfer support - receive init/cleanup
  2025-02-26 17:46   ` Cédric Le Goater
@ 2025-02-27 22:00     ` Maciej S. Szmigiero
  0 siblings, 0 replies; 120+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-27 22:00 UTC (permalink / raw)
  To: Cédric Le Goater
  Cc: Alex Williamson, Peter Xu, Fabiano Rosas, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

On 26.02.2025 18:46, Cédric Le Goater wrote:
> On 2/19/25 21:34, Maciej S. Szmigiero wrote:
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> Add support for VFIOMultifd data structure that will contain most of the
>> receive-side data together with its init/cleanup methods.
>>
>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>> ---
>>   hw/vfio/migration-multifd.c   | 33 +++++++++++++++++++++++++++++++++
>>   hw/vfio/migration-multifd.h   |  8 ++++++++
>>   hw/vfio/migration.c           | 29 +++++++++++++++++++++++++++--
>>   include/hw/vfio/vfio-common.h |  3 +++
>>   4 files changed, 71 insertions(+), 2 deletions(-)
>>
>> diff --git a/hw/vfio/migration-multifd.c b/hw/vfio/migration-multifd.c
>> index 7328ad8e925c..c2defc0efef0 100644
>> --- a/hw/vfio/migration-multifd.c
>> +++ b/hw/vfio/migration-multifd.c
>> @@ -41,6 +41,9 @@ typedef struct VFIOStateBuffer {
>>       size_t len;
>>   } VFIOStateBuffer;
>> +typedef struct VFIOMultifd {
>> +} VFIOMultifd;
>> +
>>   static void vfio_state_buffer_clear(gpointer data)
>>   {
>>       VFIOStateBuffer *lb = data;
>> @@ -84,8 +87,38 @@ static VFIOStateBuffer *vfio_state_buffers_at(VFIOStateBuffers *bufs, guint idx)
>>       return &g_array_index(bufs->array, VFIOStateBuffer, idx);
>>   }
>> +VFIOMultifd *vfio_multifd_new(void)
>> +{
>> +    VFIOMultifd *multifd = g_new(VFIOMultifd, 1);
>> +
>> +    return multifd;
>> +}
>> +
>> +void vfio_multifd_free(VFIOMultifd *multifd)
>> +{
>> +    g_free(multifd);
>> +}
>> +
>>   bool vfio_multifd_transfer_supported(void)
>>   {
>>       return multifd_device_state_supported() &&
>>           migrate_send_switchover_start();
>>   }
>> +
>> +bool vfio_multifd_transfer_enabled(VFIODevice *vbasedev)
>> +{
>> +    return false;
>> +}
>> +
>> +bool vfio_multifd_transfer_setup(VFIODevice *vbasedev, Error **errp)
>> +{
>> +    if (vfio_multifd_transfer_enabled(vbasedev) &&
>> +        !vfio_multifd_transfer_supported()) {
>> +        error_setg(errp,
>> +                   "%s: Multifd device transfer requested but unsupported in the current config",
>> +                   vbasedev->name);
>> +        return false;
>> +    }
>> +
>> +    return true;
>> +}
>> diff --git a/hw/vfio/migration-multifd.h b/hw/vfio/migration-multifd.h
>> index 8fe004c1da81..1eefba3b2eed 100644
>> --- a/hw/vfio/migration-multifd.h
>> +++ b/hw/vfio/migration-multifd.h
>> @@ -12,6 +12,14 @@
>>   #include "hw/vfio/vfio-common.h"
>> +typedef struct VFIOMultifd VFIOMultifd;
>> +
>> +VFIOMultifd *vfio_multifd_new(void);
>> +void vfio_multifd_free(VFIOMultifd *multifd);
>> +
>>   bool vfio_multifd_transfer_supported(void);
>> +bool vfio_multifd_transfer_enabled(VFIODevice *vbasedev);
>> +
>> +bool vfio_multifd_transfer_setup(VFIODevice *vbasedev, Error **errp);
>>   #endif
>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>> index 7b79be6ad293..4311de763885 100644
>> --- a/hw/vfio/migration.c
>> +++ b/hw/vfio/migration.c
>> @@ -674,15 +674,40 @@ static void vfio_save_state(QEMUFile *f, void *opaque)
>>   static int vfio_load_setup(QEMUFile *f, void *opaque, Error **errp)
>>   {
>>       VFIODevice *vbasedev = opaque;
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    int ret;
>> +
>> +    if (!vfio_multifd_transfer_setup(vbasedev, errp)) {
>> +        return -EINVAL;
>> +    }
>> +
>> +    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RESUMING,
>> +                                   migration->device_state, errp);
>> +    if (ret) {
>> +        return ret;
>> +    }
>> -    return vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RESUMING,
>> -                                    vbasedev->migration->device_state, errp);
>> +    if (vfio_multifd_transfer_enabled(vbasedev)) {
>> +        assert(!migration->multifd);
>> +        migration->multifd = vfio_multifd_new();
> 
> When called from vfio_load_setup(), I think vfio_multifd_transfer_setup()
> should allocate migration->multifd at the same time. It would simplify
> the setup to one step. Maybe we could add a bool parameter ? because,
> IIRC, you didn't like the idea of allocating it always, that is in
> vfio_save_setup() too.

I have added a "bool alloc_multifd" parameter to
vfio_multifd_transfer_setup() and renamed it to vfio_multifd_setup() for
consistency with vfio_multifd_cleanup().

Unexported vfio_multifd_new() now that it is called only from
vfio_multifd_setup() in the same translation unit.

> 
> For symmetry, could vfio_save_cleanup() call vfio_multifd_cleanup() too ?
> a setup implies a cleanup.

Added vfio_multifd_cleanup() call to vfio_save_cleanup() with a comment
describing that it is currently a NOP.

> Thanks,
> 
> C.

Thanks,
Maciej



^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v5 25/36] vfio/migration: Multifd device state transfer support - receive init/cleanup
  2025-02-26 17:28   ` Cédric Le Goater
@ 2025-02-27 22:00     ` Maciej S. Szmigiero
  0 siblings, 0 replies; 120+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-27 22:00 UTC (permalink / raw)
  To: Cédric Le Goater
  Cc: Alex Williamson, Eric Blake, Peter Xu, Fabiano Rosas,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

On 26.02.2025 18:28, Cédric Le Goater wrote:
> On 2/19/25 21:34, Maciej S. Szmigiero wrote:
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> Add support for VFIOMultifd data structure that will contain most of the
>> receive-side data together with its init/cleanup methods.
>>
>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>> ---
>>   hw/vfio/migration-multifd.c   | 33 +++++++++++++++++++++++++++++++++
>>   hw/vfio/migration-multifd.h   |  8 ++++++++
>>   hw/vfio/migration.c           | 29 +++++++++++++++++++++++++++--
>>   include/hw/vfio/vfio-common.h |  3 +++
>>   4 files changed, 71 insertions(+), 2 deletions(-)
>>
>> diff --git a/hw/vfio/migration-multifd.c b/hw/vfio/migration-multifd.c
>> index 7328ad8e925c..c2defc0efef0 100644
>> --- a/hw/vfio/migration-multifd.c
>> +++ b/hw/vfio/migration-multifd.c
>> @@ -41,6 +41,9 @@ typedef struct VFIOStateBuffer {
>>       size_t len;
>>   } VFIOStateBuffer;
>> +typedef struct VFIOMultifd {
>> +} VFIOMultifd;
>> +
>>   static void vfio_state_buffer_clear(gpointer data)
>>   {
>>       VFIOStateBuffer *lb = data;
>> @@ -84,8 +87,38 @@ static VFIOStateBuffer *vfio_state_buffers_at(VFIOStateBuffers *bufs, guint idx)
>>       return &g_array_index(bufs->array, VFIOStateBuffer, idx);
>>   }
>> +VFIOMultifd *vfio_multifd_new(void)
>> +{
>> +    VFIOMultifd *multifd = g_new(VFIOMultifd, 1);
>> +
>> +    return multifd;
>> +}
>> +
>> +void vfio_multifd_free(VFIOMultifd *multifd)
>> +{
>> +    g_free(multifd);
>> +}
>> +
>>   bool vfio_multifd_transfer_supported(void)
>>   {
>>       return multifd_device_state_supported() &&
>>           migrate_send_switchover_start();
>>   }
>> +
>> +bool vfio_multifd_transfer_enabled(VFIODevice *vbasedev)
>> +{
>> +    return false;
>> +}
>> +
>> +bool vfio_multifd_transfer_setup(VFIODevice *vbasedev, Error **errp)
>> +{
>> +    if (vfio_multifd_transfer_enabled(vbasedev) &&
>> +        !vfio_multifd_transfer_supported()) {
>> +        error_setg(errp,
>> +                   "%s: Multifd device transfer requested but unsupported in the current config",
>> +                   vbasedev->name);
>> +        return false;
>> +    }
>> +
>> +    return true;
>> +}
>> diff --git a/hw/vfio/migration-multifd.h b/hw/vfio/migration-multifd.h
>> index 8fe004c1da81..1eefba3b2eed 100644
>> --- a/hw/vfio/migration-multifd.h
>> +++ b/hw/vfio/migration-multifd.h
>> @@ -12,6 +12,14 @@
>>   #include "hw/vfio/vfio-common.h"
>> +typedef struct VFIOMultifd VFIOMultifd;
>> +
>> +VFIOMultifd *vfio_multifd_new(void);
>> +void vfio_multifd_free(VFIOMultifd *multifd);
>> +
>>   bool vfio_multifd_transfer_supported(void);
>> +bool vfio_multifd_transfer_enabled(VFIODevice *vbasedev);
>> +
>> +bool vfio_multifd_transfer_setup(VFIODevice *vbasedev, Error **errp);
>>   #endif
>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>> index 7b79be6ad293..4311de763885 100644
>> --- a/hw/vfio/migration.c
>> +++ b/hw/vfio/migration.c
>> @@ -674,15 +674,40 @@ static void vfio_save_state(QEMUFile *f, void *opaque)
>>   static int vfio_load_setup(QEMUFile *f, void *opaque, Error **errp)
>>   {
>>       VFIODevice *vbasedev = opaque;
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    int ret;
>> +
>> +    if (!vfio_multifd_transfer_setup(vbasedev, errp)) {
>> +        return -EINVAL;
>> +    }
>> +
>> +    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RESUMING,
>> +                                   migration->device_state, errp);
>> +    if (ret) {
>> +        return ret;
>> +    }
>> -    return vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RESUMING,
>> -                                    vbasedev->migration->device_state, errp);
>> +    if (vfio_multifd_transfer_enabled(vbasedev)) {
>> +        assert(!migration->multifd);
>> +        migration->multifd = vfio_multifd_new();
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +static void vfio_multifd_cleanup(VFIODevice *vbasedev)
>> +{
>> +    VFIOMigration *migration = vbasedev->migration;
>> +
>> +    g_clear_pointer(&migration->multifd, vfio_multifd_free);
>>   }
> 
> Please move vfio_multifd_cleanup() to migration-multifd.c.

Done now.

> Thanks,
> 
> C.
>

Thanks,
Maciej

  



^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v5 34/36] vfio/migration: Max in-flight VFIO device state buffer count limit
  2025-02-27  6:48   ` Cédric Le Goater
@ 2025-02-27 22:01     ` Maciej S. Szmigiero
  2025-02-28  8:53       ` Cédric Le Goater
  0 siblings, 1 reply; 120+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-27 22:01 UTC (permalink / raw)
  To: Cédric Le Goater
  Cc: Alex Williamson, Eric Blake, Peter Xu, Fabiano Rosas,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

On 27.02.2025 07:48, Cédric Le Goater wrote:
> On 2/19/25 21:34, Maciej S. Szmigiero wrote:
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> Allow capping the maximum count of in-flight VFIO device state buffers
>> queued at the destination, otherwise a malicious QEMU source could
>> theoretically cause the target QEMU to allocate unlimited amounts of memory
>> for buffers-in-flight.
>>
>> Since this is not expected to be a realistic threat in most of VFIO live
>> migration use cases and the right value depends on the particular setup,
>> disable the limit by default by setting it to UINT64_MAX.
>>
>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>> ---
>>   hw/vfio/migration-multifd.c   | 14 ++++++++++++++
>>   hw/vfio/pci.c                 |  2 ++
>>   include/hw/vfio/vfio-common.h |  1 +
>>   3 files changed, 17 insertions(+)
>>
>> diff --git a/hw/vfio/migration-multifd.c b/hw/vfio/migration-multifd.c
>> index 18a5ff964a37..04aa3f4a6596 100644
>> --- a/hw/vfio/migration-multifd.c
>> +++ b/hw/vfio/migration-multifd.c
>> @@ -53,6 +53,7 @@ typedef struct VFIOMultifd {
>>       QemuMutex load_bufs_mutex; /* Lock order: this lock -> BQL */
>>       uint32_t load_buf_idx;
>>       uint32_t load_buf_idx_last;
>> +    uint32_t load_buf_queued_pending_buffers;
>>   } VFIOMultifd;
>>   static void vfio_state_buffer_clear(gpointer data)
>> @@ -121,6 +122,15 @@ static bool vfio_load_state_buffer_insert(VFIODevice *vbasedev,
>>       assert(packet->idx >= multifd->load_buf_idx);
>> +    multifd->load_buf_queued_pending_buffers++;
>> +    if (multifd->load_buf_queued_pending_buffers >
>> +        vbasedev->migration_max_queued_buffers) {
>> +        error_setg(errp,
>> +                   "queuing state buffer %" PRIu32 " would exceed the max of %" PRIu64,
>> +                   packet->idx, vbasedev->migration_max_queued_buffers);
>> +        return false;
>> +    }
>> +
>>       lb->data = g_memdup2(&packet->data, packet_total_size - sizeof(*packet));
>>       lb->len = packet_total_size - sizeof(*packet);
>>       lb->is_present = true;
>> @@ -374,6 +384,9 @@ static bool vfio_load_bufs_thread(void *opaque, bool *should_quit, Error **errp)
>>               goto ret_signal;
>>           }
>> +        assert(multifd->load_buf_queued_pending_buffers > 0);
>> +        multifd->load_buf_queued_pending_buffers--;
>> +
>>           if (multifd->load_buf_idx == multifd->load_buf_idx_last - 1) {
>>               trace_vfio_load_state_device_buffer_end(vbasedev->name);
>>           }
>> @@ -408,6 +421,7 @@ VFIOMultifd *vfio_multifd_new(void)
>>       multifd->load_buf_idx = 0;
>>       multifd->load_buf_idx_last = UINT32_MAX;
>> +    multifd->load_buf_queued_pending_buffers = 0;
>>       qemu_cond_init(&multifd->load_bufs_buffer_ready_cond);
>>       multifd->load_bufs_thread_running = false;
>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>> index 9111805ae06c..247418f0fce2 100644
>> --- a/hw/vfio/pci.c
>> +++ b/hw/vfio/pci.c
>> @@ -3383,6 +3383,8 @@ static const Property vfio_pci_dev_properties[] = {
>>                   vbasedev.migration_multifd_transfer,
>>                   qdev_prop_on_off_auto_mutable, OnOffAuto,
>>                   .set_default = true, .defval.i = ON_OFF_AUTO_AUTO),
>> +    DEFINE_PROP_UINT64("x-migration-max-queued-buffers", VFIOPCIDevice,
>> +                       vbasedev.migration_max_queued_buffers, UINT64_MAX),
> 
> UINT64_MAX doesn't make sense to me. What would be a reasonable value ?

It's the value that effectively disables this limit.

> Have you monitored the max ? Should we collect some statistics on this
> value and raise a warning if a high water mark is reached ? I think
> this would more useful.

It's an additional mechanism which is not expected to be necessary
in most real-world setups, hence it's disabled by default:
> Since this is not expected to be a realistic threat in most of VFIO live
> migration use cases and the right value depends on the particular setup,
> disable the limit by default by setting it to UINT64_MAX.

The minimum value that works with a particular setup depends on the number
of multifd channels, probably also the number of NIC queues, etc., so it's
not something we should propose a hard default for - unless it's a very
high default like 100 buffers, but then why have it set by default?

IMHO setting it to UINT64_MAX clearly shows that it is disabled by
default since it obviously couldn't be set higher.
  
>>       DEFINE_PROP_BOOL("migration-events", VFIOPCIDevice,
>>                        vbasedev.migration_events, false),
>>       DEFINE_PROP_BOOL("x-no-mmap", VFIOPCIDevice, vbasedev.no_mmap, false),
> 
> 
> Please add property documentation in vfio_pci_dev_class_init()
> 

I'm not sure what you mean by that; vfio_pci_dev_class_init() doesn't
contain any documentation or even references to either
x-migration-max-queued-buffers or x-migration-multifd-transfer:
> static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
> {
>     DeviceClass *dc = DEVICE_CLASS(klass);
>     PCIDeviceClass *pdc = PCI_DEVICE_CLASS(klass);
> 
>     device_class_set_legacy_reset(dc, vfio_pci_reset);
>     device_class_set_props(dc, vfio_pci_dev_properties);
> #ifdef CONFIG_IOMMUFD
>     object_class_property_add_str(klass, "fd", NULL, vfio_pci_set_fd);
> #endif
>     dc->desc = "VFIO-based PCI device assignment";
>     set_bit(DEVICE_CATEGORY_MISC, dc->categories);
>     pdc->realize = vfio_realize;
>     pdc->exit = vfio_exitfn;
>     pdc->config_read = vfio_pci_read_config;
>     pdc->config_write = vfio_pci_write_config;
> }


> Thanks,
> 
> C.

Thanks,
Maciej



^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v5 36/36] vfio/migration: Update VFIO migration documentation
  2025-02-27  6:59   ` Cédric Le Goater
@ 2025-02-27 22:01     ` Maciej S. Szmigiero
  2025-02-28 10:05       ` Cédric Le Goater
  0 siblings, 1 reply; 120+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-27 22:01 UTC (permalink / raw)
  To: Cédric Le Goater
  Cc: Alex Williamson, Eric Blake, Peter Xu, Fabiano Rosas,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

On 27.02.2025 07:59, Cédric Le Goater wrote:
> On 2/19/25 21:34, Maciej S. Szmigiero wrote:
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> Update the VFIO documentation at docs/devel/migration describing the
>> changes brought by the multifd device state transfer.
>>
>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>> ---
>>   docs/devel/migration/vfio.rst | 80 +++++++++++++++++++++++++++++++----
>>   1 file changed, 71 insertions(+), 9 deletions(-)
>>
>> diff --git a/docs/devel/migration/vfio.rst b/docs/devel/migration/vfio.rst
>> index c49482eab66d..d9b169d29921 100644
>> --- a/docs/devel/migration/vfio.rst
>> +++ b/docs/devel/migration/vfio.rst
>> @@ -16,6 +16,37 @@ helps to reduce the total downtime of the VM. VFIO devices opt-in to pre-copy
>>   support by reporting the VFIO_MIGRATION_PRE_COPY flag in the
>>   VFIO_DEVICE_FEATURE_MIGRATION ioctl.
> 
> Please add a new "multifd" documentation subsection at the end of the file
> with this part :
> 
>> +Starting from QEMU version 10.0 there's a possibility to transfer VFIO device
>> +_STOP_COPY state via multifd channels. This helps reduce downtime - especially
>> +with multiple VFIO devices or with devices having a large migration state.
>> +As an additional benefit, setting the VFIO device to _STOP_COPY state and
>> +saving its config space is also parallelized (run in a separate thread) in
>> +such migration mode.
>> +
>> +The multifd VFIO device state transfer is controlled by
>> +"x-migration-multifd-transfer" VFIO device property. This property defaults to
>> +AUTO, which means that VFIO device state transfer via multifd channels is
>> +attempted in configurations that otherwise support it.
>> +

Done - I also moved the parts about x-migration-max-queued-buffers
and x-migration-load-config-after-iter description there since
obviously they wouldn't make sense being left alone in the top section.

> I was expecting a much more detailed explanation on the design too  :
> 
>   * in the cover letter
>   * in the hw/vfio/migration-multifd.c
>   * in some new file under docs/devel/migration/
> 

I'm not sure what descriptions you exactly want in these places, but since
that's just documentation (not code) it could be added after the code freeze...

> 
> This section :
> 
>> +Since the target QEMU needs to load device state buffers in-order it needs to
>> +queue incoming buffers until they can be loaded into the device.
>> +This means that a malicious QEMU source could theoretically cause the target
>> +QEMU to allocate unlimited amounts of memory for such buffers-in-flight.
>> +
>> +The "x-migration-max-queued-buffers" property allows capping the maximum count
>> +of these VFIO device state buffers queued at the destination.
>> +
> +Because a malicious QEMU source causing OOM on the target is not expected to
> +be a realistic threat in most VFIO live migration use cases and the right
> +value depends on the particular setup, this queued buffers limit is disabled
> +by default by setting it to UINT64_MAX.
> 
> should be in patch 34. It is not obvious it will be merged.
> 

...which brings us to this point.

I think by this point in time (less than 2 weeks to code freeze) we should
finally decide what is going to be included in the patch set.

This way this patch set could be well tested in its final form rather than
having significant parts taken out of it at the eleventh hour.

If the final form is known, the documentation can also be adjusted
accordingly, and user/admin documentation eventually written once the code
is considered okay.

I thought we discussed the rationale behind both
x-migration-max-queued-buffers and x-migration-load-config-after-iter
properties a few times, but if you still have some concerns there please
let me know before I prepare the next version of this patch set so I know
whether to include these.

> This section :
> 
>> +Some host platforms (like ARM64) require that VFIO device config is loaded only
>> +after all iterables were loaded.
>> +Such interlocking is controlled by "x-migration-load-config-after-iter" VFIO
>> +device property, which in its default setting (AUTO) does so only on platforms
>> +that actually require it.
> 
> Should be in 35. Same reason.
> 
> 
>>   When pre-copy is supported, it's possible to further reduce downtime by
>>   enabling "switchover-ack" migration capability.
>>   VFIO migration uAPI defines "initial bytes" as part of its pre-copy data stream
>> @@ -67,14 +98,39 @@ VFIO implements the device hooks for the iterative approach as follows:
>>   * A ``switchover_ack_needed`` function that checks if the VFIO device uses
>>     "switchover-ack" migration capability when this capability is enabled.
>> -* A ``save_state`` function to save the device config space if it is present.
>> -
>> -* A ``save_live_complete_precopy`` function that sets the VFIO device in
>> -  _STOP_COPY state and iteratively copies the data for the VFIO device until
>> -  the vendor driver indicates that no data remains.
>> -
>> -* A ``load_state`` function that loads the config section and the data
>> -  sections that are generated by the save functions above.
>> +* A ``switchover_start`` function that in the multifd mode starts a thread that
>> +  reassembles the multifd received data and loads it in-order into the device.
>> +  In the non-multifd mode this function is a NOP.
>> +
>> +* A ``save_state`` function to save the device config space if it is present
>> +  in the non-multifd mode.
>> +  In the multifd mode it just emits either a dummy EOS marker or
>> +  "all iterables were loaded" flag for configurations that need to defer
>> +  loading device config space after them.
>> +
>> +* A ``save_live_complete_precopy`` function that in the non-multifd mode sets
>> +  the VFIO device in _STOP_COPY state and iteratively copies the data for the
>> +  VFIO device until the vendor driver indicates that no data remains.
>> +  In the multifd mode it just emits a dummy EOS marker.
>> +
>> +* A ``save_live_complete_precopy_thread`` function that in the multifd mode
>> +  provides thread handler performing multifd device state transfer.
>> +  It sets the VFIO device to _STOP_COPY state, iteratively reads the data
>> +  from the VFIO device and queues it for multifd transmission until the vendor
>> +  driver indicates that no data remains.
>> +  After that, it saves the device config space and queues it for multifd
>> +  transfer too.
>> +  In the non-multifd mode this thread is a NOP.
>> +
>> +* A ``load_state`` function that loads the data sections that are generated
>> +  by the main migration channel save functions above.
>> +  In the non-multifd mode it also loads the config section, while in the
>> +  multifd mode it handles the optional "all iterables were loaded" flag if
>> +  it is in use.
>> +
>> +* A ``load_state_buffer`` function that loads the device state and the device
>> +  config that arrived via multifd channels.
>> +  It's used only in the multifd mode.
> 
> Please move the documentation of the new migration handlers in the
> patch introducing them.
> 
>
> Thanks,
> 
> C.
> 

Thanks,
Maciej



^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v5 26/36] vfio/migration: Multifd device state transfer support - received buffers queuing
  2025-02-26 21:04     ` Maciej S. Szmigiero
@ 2025-02-28  8:09       ` Cédric Le Goater
  2025-02-28 20:47         ` Maciej S. Szmigiero
  0 siblings, 1 reply; 120+ messages in thread
From: Cédric Le Goater @ 2025-02-28  8:09 UTC (permalink / raw)
  To: Maciej S. Szmigiero
  Cc: Alex Williamson, Eric Blake, Peter Xu, Fabiano Rosas,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

On 2/26/25 22:04, Maciej S. Szmigiero wrote:
> On 26.02.2025 11:43, Cédric Le Goater wrote:
>> On 2/19/25 21:34, Maciej S. Szmigiero wrote:
>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>
>>> The multifd received data needs to be reassembled since device state
>>> packets sent via different multifd channels can arrive out-of-order.
>>>
>>> Therefore, each VFIO device state packet carries a header indicating its
>>> position in the stream.
>>> The raw device state data is saved into a VFIOStateBuffer for later
>>> in-order loading into the device.
>>>
>>> The last such VFIO device state packet should have
>>> VFIO_DEVICE_STATE_CONFIG_STATE flag set and carry the device config state.
>>>
>>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>>> ---
>>>   hw/vfio/migration-multifd.c | 103 ++++++++++++++++++++++++++++++++++++
>>>   hw/vfio/migration-multifd.h |   3 ++
>>>   hw/vfio/migration.c         |   1 +
>>>   hw/vfio/trace-events        |   1 +
>>>   4 files changed, 108 insertions(+)
>>>
>>> diff --git a/hw/vfio/migration-multifd.c b/hw/vfio/migration-multifd.c
>>> index c2defc0efef0..5d5ee1393674 100644
>>> --- a/hw/vfio/migration-multifd.c
>>> +++ b/hw/vfio/migration-multifd.c
>>> @@ -42,6 +42,11 @@ typedef struct VFIOStateBuffer {
>>>   } VFIOStateBuffer;
>>>   typedef struct VFIOMultifd {
>>> +    VFIOStateBuffers load_bufs;
>>> +    QemuCond load_bufs_buffer_ready_cond;
>>> +    QemuMutex load_bufs_mutex; /* Lock order: this lock -> BQL */
>>> +    uint32_t load_buf_idx;
>>> +    uint32_t load_buf_idx_last;
>>>   } VFIOMultifd;
>>>   static void vfio_state_buffer_clear(gpointer data)
>>> @@ -87,15 +92,113 @@ static VFIOStateBuffer *vfio_state_buffers_at(VFIOStateBuffers *bufs, guint idx)
>>>       return &g_array_index(bufs->array, VFIOStateBuffer, idx);
>>>   }
>>
>> this routine expects load_bufs_mutex to be locked? Maybe say so.
> 
> I guess the comment above pertains to the vfio_load_state_buffer_insert()
> below.
> 
> Do you mean it should have a comment that it expects to be called
> under load_bufs_mutex?

Just a one liner like :

/* called with load_bufs_mutex locked */

?

It's good to have for the future generations.

> 
>>> +static bool vfio_load_state_buffer_insert(VFIODevice *vbasedev,
>>
>> could you pass VFIOMultifd* instead  ?
> 
> No, it needs vbasedev->migration_max_queued_buffers too (introduced
> in a later patch).
>
> Also, most VFIO routines (besides very small helpers/wrappers)
> take VFIODevice *.

OK. It's minor but I prefer when parameters are limited to the minimum.
Having 'VFIODevice *' opens doors to a lot of state.


Thanks,

C.



> 
>>> +                                          VFIODeviceStatePacket *packet,
>>> +                                          size_t packet_total_size,
>>> +                                          Error **errp)
>>> +{
>>> +    VFIOMigration *migration = vbasedev->migration;
>>> +    VFIOMultifd *multifd = migration->multifd;
>>> +    VFIOStateBuffer *lb;
>>> +
>>> +    vfio_state_buffers_assert_init(&multifd->load_bufs);
>>> +    if (packet->idx >= vfio_state_buffers_size_get(&multifd->load_bufs)) {
>>> +        vfio_state_buffers_size_set(&multifd->load_bufs, packet->idx + 1);
>>> +    }
>>> +
>>> +    lb = vfio_state_buffers_at(&multifd->load_bufs, packet->idx);
>>> +    if (lb->is_present) {
>>> +        error_setg(errp, "state buffer %" PRIu32 " already filled",
>>> +                   packet->idx);
>>> +        return false;
>>> +    }
>>> +
>>> +    assert(packet->idx >= multifd->load_buf_idx);
>>> +
>>> +    lb->data = g_memdup2(&packet->data, packet_total_size - sizeof(*packet));
>>> +    lb->len = packet_total_size - sizeof(*packet);
>>> +    lb->is_present = true;
>>> +
>>> +    return true;
>>> +}
>>> +
>>> +bool vfio_load_state_buffer(void *opaque, char *data, size_t data_size,
>>> +                            Error **errp)
>>
>>
>> AFAICS, the only users of the .load_state_buffer() handlers is
>> multifd_device_state_recv().
>>
>> Please rename to vfio_multifd_load_state_buffer().
> 
> Renamed it accordingly.
> 
>>> +{
>>> +    VFIODevice *vbasedev = opaque;
>>> +    VFIOMigration *migration = vbasedev->migration;
>>> +    VFIOMultifd *multifd = migration->multifd;
>>> +    VFIODeviceStatePacket *packet = (VFIODeviceStatePacket *)data;
>>> +
>>> +    /*
>>> +     * Holding BQL here would violate the lock order and can cause
>>> +     * a deadlock once we attempt to lock load_bufs_mutex below.
>>> +     */
>>> +    assert(!bql_locked());
>>> +
>>> +    if (!vfio_multifd_transfer_enabled(vbasedev)) {
>>> +        error_setg(errp,
>>> +                   "got device state packet but not doing multifd transfer");
>>> +        return false;
>>> +    }
>>> +
>>> +    assert(multifd);
>>> +
>>> +    if (data_size < sizeof(*packet)) {
>>> +        error_setg(errp, "packet too short at %zu (min is %zu)",
>>> +                   data_size, sizeof(*packet));
>>> +        return false;
>>> +    }
>>> +
>>> +    if (packet->version != VFIO_DEVICE_STATE_PACKET_VER_CURRENT) {
>>> +        error_setg(errp, "packet has unknown version %" PRIu32,
>>> +                   packet->version);
>>> +        return false;
>>> +    }
>>> +
>>> +    if (packet->idx == UINT32_MAX) {
>>> +        error_setg(errp, "packet has too high idx");
>>
>> or "packet index is invalid" ?
> 
> Changed the error message.
> 
>>> +        return false;
>>> +    }
>>> +
>>> +    trace_vfio_load_state_device_buffer_incoming(vbasedev->name, packet->idx);
>>> +
>>> +    QEMU_LOCK_GUARD(&multifd->load_bufs_mutex);
>>
>> Using WITH_QEMU_LOCK_GUARD() would be cleaner I think.
> 
> Changed into a WITH_QEMU_LOCK_GUARD() block.
> 
>>
>>
>> Thanks,
>>
>> C.
> 
> Thanks,
> Maciej
> 



^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v5 32/36] vfio/migration: Make x-migration-multifd-transfer VFIO property mutable
  2025-02-26 21:05     ` Maciej S. Szmigiero
@ 2025-02-28  8:44       ` Cédric Le Goater
  2025-02-28 20:47         ` Maciej S. Szmigiero
  0 siblings, 1 reply; 120+ messages in thread
From: Cédric Le Goater @ 2025-02-28  8:44 UTC (permalink / raw)
  To: Maciej S. Szmigiero
  Cc: Alex Williamson, Peter Xu, Fabiano Rosas, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

On 2/26/25 22:05, Maciej S. Szmigiero wrote:
> On 26.02.2025 18:59, Cédric Le Goater wrote:
>> On 2/19/25 21:34, Maciej S. Szmigiero wrote:
>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>
>>> DEFINE_PROP_ON_OFF_AUTO() property isn't runtime-mutable so using it
>>> would mean that the source VM would need to decide upfront at startup
>>> time whether it wants to do a multifd device state transfer at some
>>> point.
>>>
>>> The source VM can run for a long time before being migrated, so it is
>>> desirable to have a fallback mechanism to the old way of transferring
>>> VFIO device state if it turns out to be necessary.
>>>
>>> This brings this property to the same mutability level as ordinary
>>> migration parameters, which can also be adjusted at runtime.
>>>
>>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>>> ---
>>>   hw/vfio/pci.c | 12 +++++++++---
>>>   1 file changed, 9 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>>> index 184ff882f9d1..9111805ae06c 100644
>>> --- a/hw/vfio/pci.c
>>> +++ b/hw/vfio/pci.c
>>> @@ -3353,6 +3353,8 @@ static void vfio_instance_init(Object *obj)
>>>       pci_dev->cap_present |= QEMU_PCI_CAP_EXPRESS;
>>>   }
>>> +static PropertyInfo qdev_prop_on_off_auto_mutable;
>>
>> please use another name, like vfio_pci_migration_multifd_transfer_prop.
> 
> Done.
> 
>> I wish we could define the property info all at once.
> 
> I'm not sure what you mean here, could you please elaborate a bit more?

I meant :

     static const PropertyInfo vfio_pci_migration_multifd_transfer_prop = {
         .name = "OnOffAuto",
         .description = "on/off/auto",
         .enum_table = &OnOffAuto_lookup,
         .get = qdev_propinfo_get_enum,
         .set = qdev_propinfo_set_enum,
         .set_default_value = qdev_propinfo_set_default_value_enum,
         .realized_set_allowed = true,
     };

which requires including "hw/core/qdev-prop-internal.h".

I think your method is preferable. Please add a little comment
before :

     qdev_prop_on_off_auto_mutable = qdev_prop_on_off_auto;
     qdev_prop_on_off_auto_mutable.realized_set_allowed = true;

Thanks,

C.





* Re: [PATCH v5 34/36] vfio/migration: Max in-flight VFIO device state buffer count limit
  2025-02-27 22:01     ` Maciej S. Szmigiero
@ 2025-02-28  8:53       ` Cédric Le Goater
  2025-02-28 20:48         ` Maciej S. Szmigiero
  0 siblings, 1 reply; 120+ messages in thread
From: Cédric Le Goater @ 2025-02-28  8:53 UTC (permalink / raw)
  To: Maciej S. Szmigiero
  Cc: Alex Williamson, Eric Blake, Peter Xu, Fabiano Rosas,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

On 2/27/25 23:01, Maciej S. Szmigiero wrote:
> On 27.02.2025 07:48, Cédric Le Goater wrote:
>> On 2/19/25 21:34, Maciej S. Szmigiero wrote:
>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>
>>> Allow capping the maximum count of in-flight VFIO device state buffers
>>> queued at the destination, otherwise a malicious QEMU source could
>>> theoretically cause the target QEMU to allocate unlimited amounts of memory
>>> for buffers-in-flight.
>>>
>>> Since this is not expected to be a realistic threat in most VFIO live
>>> migration use cases, and the right value depends on the particular
>>> setup, disable the limit by default by setting it to UINT64_MAX.
>>>
>>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>>> ---
>>>   hw/vfio/migration-multifd.c   | 14 ++++++++++++++
>>>   hw/vfio/pci.c                 |  2 ++
>>>   include/hw/vfio/vfio-common.h |  1 +
>>>   3 files changed, 17 insertions(+)
>>>
>>> diff --git a/hw/vfio/migration-multifd.c b/hw/vfio/migration-multifd.c
>>> index 18a5ff964a37..04aa3f4a6596 100644
>>> --- a/hw/vfio/migration-multifd.c
>>> +++ b/hw/vfio/migration-multifd.c
>>> @@ -53,6 +53,7 @@ typedef struct VFIOMultifd {
>>>       QemuMutex load_bufs_mutex; /* Lock order: this lock -> BQL */
>>>       uint32_t load_buf_idx;
>>>       uint32_t load_buf_idx_last;
>>> +    uint32_t load_buf_queued_pending_buffers;
>>>   } VFIOMultifd;
>>>   static void vfio_state_buffer_clear(gpointer data)
>>> @@ -121,6 +122,15 @@ static bool vfio_load_state_buffer_insert(VFIODevice *vbasedev,
>>>       assert(packet->idx >= multifd->load_buf_idx);
>>> +    multifd->load_buf_queued_pending_buffers++;
>>> +    if (multifd->load_buf_queued_pending_buffers >
>>> +        vbasedev->migration_max_queued_buffers) {
>>> +        error_setg(errp,
>>> +                   "queuing state buffer %" PRIu32 " would exceed the max of %" PRIu64,
>>> +                   packet->idx, vbasedev->migration_max_queued_buffers);
>>> +        return false;
>>> +    }
>>> +
>>>       lb->data = g_memdup2(&packet->data, packet_total_size - sizeof(*packet));
>>>       lb->len = packet_total_size - sizeof(*packet);
>>>       lb->is_present = true;
>>> @@ -374,6 +384,9 @@ static bool vfio_load_bufs_thread(void *opaque, bool *should_quit, Error **errp)
>>>               goto ret_signal;
>>>           }
>>> +        assert(multifd->load_buf_queued_pending_buffers > 0);
>>> +        multifd->load_buf_queued_pending_buffers--;
>>> +
>>>           if (multifd->load_buf_idx == multifd->load_buf_idx_last - 1) {
>>>               trace_vfio_load_state_device_buffer_end(vbasedev->name);
>>>           }
>>> @@ -408,6 +421,7 @@ VFIOMultifd *vfio_multifd_new(void)
>>>       multifd->load_buf_idx = 0;
>>>       multifd->load_buf_idx_last = UINT32_MAX;
>>> +    multifd->load_buf_queued_pending_buffers = 0;
>>>       qemu_cond_init(&multifd->load_bufs_buffer_ready_cond);
>>>       multifd->load_bufs_thread_running = false;
>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>>> index 9111805ae06c..247418f0fce2 100644
>>> --- a/hw/vfio/pci.c
>>> +++ b/hw/vfio/pci.c
>>> @@ -3383,6 +3383,8 @@ static const Property vfio_pci_dev_properties[] = {
>>>                   vbasedev.migration_multifd_transfer,
>>>                   qdev_prop_on_off_auto_mutable, OnOffAuto,
>>>                   .set_default = true, .defval.i = ON_OFF_AUTO_AUTO),
>>> +    DEFINE_PROP_UINT64("x-migration-max-queued-buffers", VFIOPCIDevice,
>>> +                       vbasedev.migration_max_queued_buffers, UINT64_MAX),
>>
>> UINT64_MAX doesn't make sense to me. What would be a reasonable value ?
> 
> It's the value that effectively disables this limit.
> 
>> Have you monitored the max ? Should we collect some statistics on this
>> value and raise a warning if a high water mark is reached ? I think
>> this would more useful.
> 
> It's an additional mechanism, which is not expected to be necessary
> in most real-world setups, hence it's disabled by default:
>> Since this is not expected to be a realistic threat in most VFIO live
>> migration use cases, and the right value depends on the particular
>> setup, disable the limit by default by setting it to UINT64_MAX.
> 
> The minimum value that works with a particular setup depends on the
> number of multifd channels, probably also the number of NIC queues,
> etc., so it's not something we should propose a hard default for -
> unless it's a very high default like 100 buffers, but then why have it
> set by default at all?
> 
> IMHO setting it to UINT64_MAX clearly shows that it is disabled by
> default since it obviously couldn't be set higher.

This doesn't convince me that we should take this patch in QEMU 10.0.
Please keep it for now. We will decide in v6.
  
>>>       DEFINE_PROP_BOOL("migration-events", VFIOPCIDevice,
>>>                        vbasedev.migration_events, false),
>>>       DEFINE_PROP_BOOL("x-no-mmap", VFIOPCIDevice, vbasedev.no_mmap, false),
>>
>>
>> Please add property documentation in vfio_pci_dev_class_init()
>>
> 
> I'm not sure what you mean by that, vfio_pci_dev_class_init() doesn't
> contain any documentation or even references to either
> x-migration-max-queued-buffers or x-migration-multifd-transfer:

Indeed :/ I am trying to fix documentation here :

   https://lore.kernel.org/qemu-devel/20250217173455.449983-1-clg@redhat.com/

Please do something similar. I will polish the edges when merging
if necessary.

Overall, we should improve VFIO documentation; migration is just one
sub-feature among many.

Thanks,

C.



>> static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
>> {
>>     DeviceClass *dc = DEVICE_CLASS(klass);
>>     PCIDeviceClass *pdc = PCI_DEVICE_CLASS(klass);
>>
>>     device_class_set_legacy_reset(dc, vfio_pci_reset);
>>     device_class_set_props(dc, vfio_pci_dev_properties);
>> #ifdef CONFIG_IOMMUFD
>>     object_class_property_add_str(klass, "fd", NULL, vfio_pci_set_fd);
>> #endif
>>     dc->desc = "VFIO-based PCI device assignment";
>>     set_bit(DEVICE_CATEGORY_MISC, dc->categories);
>>     pdc->realize = vfio_realize;
>>     pdc->exit = vfio_exitfn;
>>     pdc->config_read = vfio_pci_read_config;
>>     pdc->config_write = vfio_pci_write_config;
>> }
> 
> 
>> Thanks,
>>
>> C.
> 
> Thanks,
> Maciej
> 




* Re: [PATCH v5 27/36] vfio/migration: Multifd device state transfer support - load thread
  2025-02-26 21:05     ` Maciej S. Szmigiero
@ 2025-02-28  9:11       ` Cédric Le Goater
  2025-02-28 20:48         ` Maciej S. Szmigiero
  0 siblings, 1 reply; 120+ messages in thread
From: Cédric Le Goater @ 2025-02-28  9:11 UTC (permalink / raw)
  To: Maciej S. Szmigiero
  Cc: Alex Williamson, Peter Xu, Fabiano Rosas, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

On 2/26/25 22:05, Maciej S. Szmigiero wrote:
> On 26.02.2025 14:49, Cédric Le Goater wrote:
>> On 2/19/25 21:34, Maciej S. Szmigiero wrote:
>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>
>>> Since it's important to finish loading device state transferred via the
>>> main migration channel (via save_live_iterate SaveVMHandler) before
>>> starting loading the data asynchronously transferred via multifd the thread
>>> doing the actual loading of the multifd transferred data is only started
>>> from switchover_start SaveVMHandler.
>>>
>>> switchover_start handler is called when MIG_CMD_SWITCHOVER_START
>>> sub-command of QEMU_VM_COMMAND is received via the main migration channel.
>>>
>>> This sub-command is only sent after all save_live_iterate data have already
>>> been posted so it is safe to commence loading of the multifd-transferred
>>> device state upon receiving it - loading of save_live_iterate data happens
>>> synchronously in the main migration thread (much like the processing of
>>> MIG_CMD_SWITCHOVER_START) so by the time MIG_CMD_SWITCHOVER_START is
>>> processed all the preceding data must have already been loaded.
>>>
>>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>>> ---
>>>   hw/vfio/migration-multifd.c | 225 ++++++++++++++++++++++++++++++++++++
>>>   hw/vfio/migration-multifd.h |   2 +
>>>   hw/vfio/migration.c         |  12 ++
>>>   hw/vfio/trace-events        |   5 +
>>>   4 files changed, 244 insertions(+)
>>>
>>> diff --git a/hw/vfio/migration-multifd.c b/hw/vfio/migration-multifd.c
>>> index 5d5ee1393674..b3a88c062769 100644
>>> --- a/hw/vfio/migration-multifd.c
>>> +++ b/hw/vfio/migration-multifd.c
>>> @@ -42,8 +42,13 @@ typedef struct VFIOStateBuffer {
>>>   } VFIOStateBuffer;
>>>   typedef struct VFIOMultifd {
>>> +    QemuThread load_bufs_thread;
>>> +    bool load_bufs_thread_running;
>>> +    bool load_bufs_thread_want_exit;
>>> +
>>>       VFIOStateBuffers load_bufs;
>>>       QemuCond load_bufs_buffer_ready_cond;
>>> +    QemuCond load_bufs_thread_finished_cond;
>>>       QemuMutex load_bufs_mutex; /* Lock order: this lock -> BQL */
>>>       uint32_t load_buf_idx;
>>>       uint32_t load_buf_idx_last;
>>> @@ -179,6 +184,175 @@ bool vfio_load_state_buffer(void *opaque, char *data, size_t data_size,
>>>       return true;
>>>   }
>>> +static int vfio_load_bufs_thread_load_config(VFIODevice *vbasedev)
>>> +{
>>> +    return -EINVAL;
>>> +}
>>
>>
>> please move to next patch.
> 
> As I wrote on the previous version of the patch set at
> https://lore.kernel.org/qemu-devel/4f335de0-ba9f-4537-b230-2cf8af1c160b@maciej.szmigiero.name/:
>> The dummy call has to be there, otherwise the code at the
>> previous commit time wouldn't compile since that
>> vfio_load_bufs_thread_load_config() call is a part of
>> vfio_load_bufs_thread().
>>
>> This is an artifact of splitting the whole load operation in
>> multiple commits.
> 
> I think adding empty dummy implementations is the typical way
> to do this - much like you asked today to leave
> vfio_multifd_transfer_setup() returning true unconditionally
> before being filled with true implementation in later patch.
> 
> See also my response at the end of this e-mail message, below
> the call to vfio_load_bufs_thread_load_config().
> 
>>> +static VFIOStateBuffer *vfio_load_state_buffer_get(VFIOMultifd *multifd)
>>> +{
>>> +    VFIOStateBuffer *lb;
>>> +    guint bufs_len;
>>
>> guint:  I guess it's ok to use here. It is not common practice in VFIO.
>>
>>> +
>>> +    bufs_len = vfio_state_buffers_size_get(&multifd->load_bufs);
>>> +    if (multifd->load_buf_idx >= bufs_len) {
>>> +        assert(multifd->load_buf_idx == bufs_len);
>>> +        return NULL;
>>> +    }
>>> +
>>> +    lb = vfio_state_buffers_at(&multifd->load_bufs,
>>> +                               multifd->load_buf_idx);
>>
>> Could be one line. minor.
>>
>>> +    if (!lb->is_present) {
>>> +        return NULL;
>>> +    }
>>> +
>>> +    return lb;
>>> +}
>>> +
>>> +static bool vfio_load_state_buffer_write(VFIODevice *vbasedev,
>>> +                                         VFIOStateBuffer *lb,
>>> +                                         Error **errp)
>>> +{
>>> +    VFIOMigration *migration = vbasedev->migration;
>>> +    VFIOMultifd *multifd = migration->multifd;
>>> +    g_autofree char *buf = NULL;
>>> +    char *buf_cur;
>>> +    size_t buf_len;
>>> +
>>> +    if (!lb->len) {
>>> +        return true;
>>> +    }
>>> +
>>> +    trace_vfio_load_state_device_buffer_load_start(vbasedev->name,
>>> +                                                   multifd->load_buf_idx);
>>
>> I think we can move this trace event to vfio_load_bufs_thread()
> 
> It would get messy since we don't load empty buffers,
> so we don't print this trace point (and its _end sibling)
> for empty buffers.
> 
> If we print this in vfio_load_bufs_thread() then it would
> need to duplicate that !lb->len check.
> 
>>> +    /* lb might become re-allocated when we drop the lock */
>>> +    buf = g_steal_pointer(&lb->data);
>>> +    buf_cur = buf;
>>> +    buf_len = lb->len;
>>> +    while (buf_len > 0) {
>>> +        ssize_t wr_ret;
>>> +        int errno_save;
>>> +
>>> +        /*
>>> +         * Loading data to the device takes a while,
>>> +         * drop the lock during this process.
>>> +         */
>>> +        qemu_mutex_unlock(&multifd->load_bufs_mutex);
>>> +        wr_ret = write(migration->data_fd, buf_cur, buf_len);
>>> +        errno_save = errno;
>>> +        qemu_mutex_lock(&multifd->load_bufs_mutex);
>>> +
>>> +        if (wr_ret < 0) {
>>> +            error_setg(errp,
>>> +                       "writing state buffer %" PRIu32 " failed: %d",
>>> +                       multifd->load_buf_idx, errno_save);
>>> +            return false;
>>> +        }
>>> +
>>> +        assert(wr_ret <= buf_len);
>>> +        buf_len -= wr_ret;
>>> +        buf_cur += wr_ret;
>>> +    }
>>> +
>>> +    trace_vfio_load_state_device_buffer_load_end(vbasedev->name,
>>> +                                                 multifd->load_buf_idx);
>>
>> and drop this trace event.
> 
> That's important data since it shows how long it took to load that
> buffer (_end - _start).
> 
> It's not the same information as _start(next buffer) - _start(current buffer)
> since the next buffer might not have arrived yet so its loading won't
> start immediately after the end of loading of the previous one.
> 
>> In which case, we can modify the parameters of vfio_load_state_buffer_write()
>> to use directly a 'VFIOMultifd *multifd'and an fd instead of "migration->data_fd".
>>
>>> +
>>> +    return true;
>>> +}
>>> +
>>> +static bool vfio_load_bufs_thread_want_exit(VFIOMultifd *multifd,
>>> +                                            bool *should_quit)
>>> +{
>>> +    return multifd->load_bufs_thread_want_exit || qatomic_read(should_quit);
>>> +}
>>> +
>>> +/*
>>> + * This thread is spawned by vfio_multifd_switchover_start() which gets
>>> + * called upon encountering the switchover point marker in main migration
>>> + * stream.
>>> + *
>>> + * It exits after either:
>>> + * * completing loading the remaining device state and device config, OR:
>>> + * * encountering some error while doing the above, OR:
>>> + * * being forcefully aborted by the migration core by it setting should_quit
>>> + *   or by vfio_load_cleanup_load_bufs_thread() setting
>>> + *   multifd->load_bufs_thread_want_exit.
>>> + */
>>> +static bool vfio_load_bufs_thread(void *opaque, bool *should_quit, Error **errp)
>>> +{
>>> +    VFIODevice *vbasedev = opaque;
>>> +    VFIOMigration *migration = vbasedev->migration;
>>> +    VFIOMultifd *multifd = migration->multifd;
>>> +    bool ret = true;
>>> +    int config_ret;
>>
>> No needed IMO. see below.
>>
>>> +
>>> +    assert(multifd);
>>> +    QEMU_LOCK_GUARD(&multifd->load_bufs_mutex);
>>> +
>>> +    assert(multifd->load_bufs_thread_running);
>>
>> We could add a trace event for the start and the end of the thread.
> 
> Added vfio_load_bufs_thread_{start,end} trace events now.
> 
>>> +    while (true) {
>>> +        VFIOStateBuffer *lb;
>>> +
>>> +        /*
>>> +         * Always check cancellation first after the buffer_ready wait below in
>>> +         * case that cond was signalled by vfio_load_cleanup_load_bufs_thread().
>>> +         */
>>> +        if (vfio_load_bufs_thread_want_exit(multifd, should_quit)) {
>>> +            error_setg(errp, "operation cancelled");
>>> +            ret = false;
>>> +            goto ret_signal;
>>
>> goto thread_exit ?
> 
> I'm not sure that I fully understand this comment.
> Do you mean to rename ret_signal label to thread_exit?


Yes. I find the label 'thread_exit' more meaningful. This is minor since
there is only one 'exit' label.

> 
>>> +        }
>>> +
>>> +        assert(multifd->load_buf_idx <= multifd->load_buf_idx_last);
>>> +
>>> +        lb = vfio_load_state_buffer_get(multifd);
>>> +        if (!lb) {
>>> +            trace_vfio_load_state_device_buffer_starved(vbasedev->name,
>>> +                                                        multifd->load_buf_idx);
>>> +            qemu_cond_wait(&multifd->load_bufs_buffer_ready_cond,
>>> +                           &multifd->load_bufs_mutex);
>>> +            continue;
>>> +        }
>>> +
>>> +        if (multifd->load_buf_idx == multifd->load_buf_idx_last) {
>>> +            break;
>>> +        }
>>> +
>>> +        if (multifd->load_buf_idx == 0) {
>>> +            trace_vfio_load_state_device_buffer_start(vbasedev->name);
>>> +        }
>>> +
>>> +        if (!vfio_load_state_buffer_write(vbasedev, lb, errp)) {
>>> +            ret = false;
>>> +            goto ret_signal;
>>> +        }
>>> +
>>> +        if (multifd->load_buf_idx == multifd->load_buf_idx_last - 1) {
>>> +            trace_vfio_load_state_device_buffer_end(vbasedev->name);
>>> +        }
>>> +
>>> +        multifd->load_buf_idx++;
>>> +    }
>>
>> if ret is assigned to true here, the "ret = false" can dropped
> 
> I inverted the "ret" logic here now - initialized ret to false
> at definition, removed "ret = false" at every failure/early exit block
> and added "ret = true" just before the "ret_signal" label.
> 
>>> +    config_ret = vfio_load_bufs_thread_load_config(vbasedev);
>>> +    if (config_ret) {
>>> +        error_setg(errp, "load config state failed: %d", config_ret);
>>> +        ret = false;
>>> +    }
>>
>> please move to next patch. This is adding nothing to this patch
>> since it's returning -EINVAL.
>>
> 
> That's the whole point - if someone were to accidentally enable this
> (for example by forgetting to apply the next patch when backporting
> the series) it would fail safely with EINVAL instead of having a
> half-broken implementation.

OK. Let's keep it that way.


Thanks,

C.


> 
> Another option would be to simply integrate the next patch into this
> one as these are two parts of the same single operation and I think
> splitting them in two in the end brings little value.
> 
>> Thanks,
>>
>> C.
> 
> Thanks,
> Maciej
> 




* Re: [PATCH v5 30/36] vfio/migration: Multifd device state transfer support - send side
  2025-02-26 21:05     ` Maciej S. Szmigiero
@ 2025-02-28  9:13       ` Cédric Le Goater
  2025-02-28 20:49         ` Maciej S. Szmigiero
  0 siblings, 1 reply; 120+ messages in thread
From: Cédric Le Goater @ 2025-02-28  9:13 UTC (permalink / raw)
  To: Maciej S. Szmigiero
  Cc: Alex Williamson, Eric Blake, Peter Xu, Fabiano Rosas,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

On 2/26/25 22:05, Maciej S. Szmigiero wrote:
> On 26.02.2025 17:43, Cédric Le Goater wrote:
>> On 2/19/25 21:34, Maciej S. Szmigiero wrote:
>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>
>>> Implement the multifd device state transfer via additional per-device
>>> thread inside save_live_complete_precopy_thread handler.
>>>
>>> Switch between doing the data transfer in the new handler and doing it
>>> in the old save_state handler depending on the
>>> x-migration-multifd-transfer device property value.
>>>
>>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>>> ---
>>>   hw/vfio/migration-multifd.c   | 139 ++++++++++++++++++++++++++++++++++
>>>   hw/vfio/migration-multifd.h   |   5 ++
>>>   hw/vfio/migration.c           |  26 +++++--
>>>   hw/vfio/trace-events          |   2 +
>>>   include/hw/vfio/vfio-common.h |   8 ++
>>>   5 files changed, 174 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/hw/vfio/migration-multifd.c b/hw/vfio/migration-multifd.c
>>> index 7200f6f1c2a2..0cfa9d31732a 100644
>>> --- a/hw/vfio/migration-multifd.c
>>> +++ b/hw/vfio/migration-multifd.c
>>> @@ -476,6 +476,145 @@ bool vfio_multifd_transfer_setup(VFIODevice *vbasedev, Error **errp)
>>>       return true;
>>>   }
>>> +void vfio_multifd_emit_dummy_eos(VFIODevice *vbasedev, QEMUFile *f)
>>> +{
>>> +    assert(vfio_multifd_transfer_enabled(vbasedev));
>>> +
>>> +    /*
>>> +     * Emit dummy NOP data on the main migration channel since the actual
>>> +     * device state transfer is done via multifd channels.
>>> +     */
>>> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>>> +}
>>> +
>>> +static bool
>>> +vfio_save_complete_precopy_thread_config_state(VFIODevice *vbasedev,
>>> +                                               char *idstr,
>>> +                                               uint32_t instance_id,
>>> +                                               uint32_t idx,
>>> +                                               Error **errp)
>>> +{
>>> +    g_autoptr(QIOChannelBuffer) bioc = NULL;
>>> +    g_autoptr(QEMUFile) f = NULL;
>>> +    int ret;
>>> +    g_autofree VFIODeviceStatePacket *packet = NULL;
>>> +    size_t packet_len;
>>> +
>>> +    bioc = qio_channel_buffer_new(0);
>>> +    qio_channel_set_name(QIO_CHANNEL(bioc), "vfio-device-config-save");
>>> +
>>> +    f = qemu_file_new_output(QIO_CHANNEL(bioc));
>>> +
>>> +    if (vfio_save_device_config_state(f, vbasedev, errp)) {
>>> +        return false;
>>> +    }
>>> +
>>> +    ret = qemu_fflush(f);
>>> +    if (ret) {
>>> +        error_setg(errp, "save config state flush failed: %d", ret);
>>> +        return false;
>>> +    }
>>> +
>>> +    packet_len = sizeof(*packet) + bioc->usage;
>>> +    packet = g_malloc0(packet_len);
>>> +    packet->version = VFIO_DEVICE_STATE_PACKET_VER_CURRENT;
>>> +    packet->idx = idx;
>>> +    packet->flags = VFIO_DEVICE_STATE_CONFIG_STATE;
>>> +    memcpy(&packet->data, bioc->data, bioc->usage);
>>> +
>>> +    if (!multifd_queue_device_state(idstr, instance_id,
>>> +                                    (char *)packet, packet_len)) {
>>> +        error_setg(errp, "multifd config data queuing failed");
>>> +        return false;
>>> +    }
>>> +
>>> +    vfio_add_bytes_transferred(packet_len);
>>> +
>>> +    return true;
>>> +}
>>> +
>>> +/*
>>> + * This thread is spawned by the migration core directly via
>>> + * .save_live_complete_precopy_thread SaveVMHandler.
>>> + *
>>> + * It exits after either:
>>> + * * completing saving the remaining device state and device config, OR:
>>> + * * encountering some error while doing the above, OR:
>>> + * * being forcefully aborted by the migration core by
>>> + *   multifd_device_state_save_thread_should_exit() returning true.
>>> + */
>>> +bool vfio_save_complete_precopy_thread(SaveLiveCompletePrecopyThreadData *d,
>>> +                                       Error **errp)
>>
>> In qemu_savevm_state_complete_precopy_iterable(), this handler is
>> called :
>>
>>      ....
>>      if (multifd_device_state) {
>>          QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
>>              SaveLiveCompletePrecopyThreadHandler hdlr;
>>
>>              if (!se->ops || (in_postcopy && se->ops->has_postcopy &&
>>                               se->ops->has_postcopy(se->opaque)) ||
>>                  !se->ops->save_live_complete_precopy_thread) {
>>                  continue;
>>              }
>>
>>              hdlr = se->ops->save_live_complete_precopy_thread;
>>              multifd_spawn_device_state_save_thread(hdlr,
>>                                                     se->idstr, se->instance_id,
>>                                                     se->opaque);
>>          }
>>      }
>>
>>
>> I suggest naming it : vfio_multifd_save_complete_precopy_thread()
> 
> Renamed accordingly.
> 
>>> +{
>>> +    VFIODevice *vbasedev = d->handler_opaque;
>>> +    VFIOMigration *migration = vbasedev->migration;
>>> +    bool ret;
>>> +    g_autofree VFIODeviceStatePacket *packet = NULL;
>>> +    uint32_t idx;
>>> +
>>> +    if (!vfio_multifd_transfer_enabled(vbasedev)) {
>>> +        /* Nothing to do, vfio_save_complete_precopy() does the transfer. */
>>> +        return true;
>>> +    }
>>> +
>>> +    trace_vfio_save_complete_precopy_thread_start(vbasedev->name,
>>> +                                                  d->idstr, d->instance_id);
>>> +
>>> +    /* We reach here with device state STOP or STOP_COPY only */
>>> +    if (vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP_COPY,
>>> +                                 VFIO_DEVICE_STATE_STOP, errp)) {
>>> +        ret = false;
>>
>> These "ret = false" can be avoided if the variable is set at the
>> top of the function.
> 
> I inverted the "ret" logic here as in vfio_load_bufs_thread()
> to make it false by default and set to true just before early
> exit label.

ok. Let's see what it looks like in v6.

>>> +        goto ret_finish;
>>
>>
>> goto thread_exit ?
> 
> As I asked in one of the previous patches,
> do this comment mean that your want to rename ret_finish label to
> thread_exit?

Yes. I find the label 'thread_exit' more meaningful. This is minor since
there is only one 'exit' label.


> 
>>> +    }
>>> +
>>> +    packet = g_malloc0(sizeof(*packet) + migration->data_buffer_size);
>>> +    packet->version = VFIO_DEVICE_STATE_PACKET_VER_CURRENT;
>>> +
>>> +    for (idx = 0; ; idx++) {
>>> +        ssize_t data_size;
>>> +        size_t packet_size;
>>> +
>>> +        if (multifd_device_state_save_thread_should_exit()) {
>>> +            error_setg(errp, "operation cancelled");
>>> +            ret = false;
>>> +            goto ret_finish;
>>> +        }
>>> +
>>> +        data_size = read(migration->data_fd, &packet->data,
>>> +                         migration->data_buffer_size);
>>> +        if (data_size < 0) {
>>> +            error_setg(errp, "reading state buffer %" PRIu32 " failed: %d",
>>> +                       idx, errno);
>>> +            ret = false;
>>> +            goto ret_finish;
>>> +        } else if (data_size == 0) {
>>> +            break;
>>> +        }
>>> +
>>> +        packet->idx = idx;
>>> +        packet_size = sizeof(*packet) + data_size;
>>> +
>>> +        if (!multifd_queue_device_state(d->idstr, d->instance_id,
>>> +                                        (char *)packet, packet_size)) {
>>> +            error_setg(errp, "multifd data queuing failed");
>>> +            ret = false;
>>> +            goto ret_finish;
>>> +        }
>>> +
>>> +        vfio_add_bytes_transferred(packet_size);
>>> +    }
>>> +
>>> +    ret = vfio_save_complete_precopy_thread_config_state(vbasedev,
>>> +                                                         d->idstr,
>>> +                                                         d->instance_id,
>>> +                                                         idx, errp);
>>> +
>>> +ret_finish:
>>> +    trace_vfio_save_complete_precopy_thread_end(vbasedev->name, ret);
>>> +
>>> +    return ret;
>>> +}
>>> +
>>>   int vfio_multifd_switchover_start(VFIODevice *vbasedev)
>>>   {
>>>       VFIOMigration *migration = vbasedev->migration;
>>> diff --git a/hw/vfio/migration-multifd.h b/hw/vfio/migration-multifd.h
>>> index 09cbb437d9d1..79780d7b5392 100644
>>> --- a/hw/vfio/migration-multifd.h
>>> +++ b/hw/vfio/migration-multifd.h
>>> @@ -25,6 +25,11 @@ bool vfio_multifd_transfer_setup(VFIODevice *vbasedev, Error **errp);
>>>   bool vfio_load_state_buffer(void *opaque, char *data, size_t data_size,
>>>                               Error **errp);
>>> +void vfio_multifd_emit_dummy_eos(VFIODevice *vbasedev, QEMUFile *f);
>>> +
>>> +bool vfio_save_complete_precopy_thread(SaveLiveCompletePrecopyThreadData *d,
>>> +                                       Error **errp);
>>> +
>>>   int vfio_multifd_switchover_start(VFIODevice *vbasedev);
>>>   #endif
>>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>>> index b962309f7c27..69dcf2dac2fa 100644
>>> --- a/hw/vfio/migration.c
>>> +++ b/hw/vfio/migration.c
> (..)
>>> @@ -238,8 +238,7 @@ static int vfio_load_buffer(QEMUFile *f, VFIODevice *vbasedev,
>>>       return ret;
>>>   }
>>> -static int vfio_save_device_config_state(QEMUFile *f, void *opaque,
>>> -                                         Error **errp)
>>> +int vfio_save_device_config_state(QEMUFile *f, void *opaque, Error **errp)
>>>   {
>>>       VFIODevice *vbasedev = opaque;
>>>       int ret;
>>> @@ -453,6 +452,10 @@ static int vfio_save_setup(QEMUFile *f, void *opaque, Error **errp)
>>>       uint64_t stop_copy_size = VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE;
>>>       int ret;
>>> +    if (!vfio_multifd_transfer_setup(vbasedev, errp)) {
>>> +        return -EINVAL;
>>> +    }
>>> +
>>
>> please move to another patch with the similar change of patch 25.
>>
> 
> This patch is about the send/save side while patch 25
> is called "*receive* init/cleanup".
> 
> So adding save setup to a patch called "receive init" wouldn't be
> consistent with that patch subject.

In that case, could you please add an extra patch checking for the consistency
of the settings ?


Thanks,

C.



> 
>>>       qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
>>>       vfio_query_stop_copy_size(vbasedev, &stop_copy_size);
> 
> (..)
>>> index ce2bdea8a2c2..ba851917f9fc 100644
>>> --- a/include/hw/vfio/vfio-common.h
>>> +++ b/include/hw/vfio/vfio-common.h
>>> @@ -298,6 +298,14 @@ void vfio_add_bytes_transferred(unsigned long val);
>>>   bool vfio_device_state_is_running(VFIODevice *vbasedev);
>>>   bool vfio_device_state_is_precopy(VFIODevice *vbasedev);
>>> +#ifdef CONFIG_LINUX
>>> +int vfio_migration_set_state(VFIODevice *vbasedev,
>>> +                             enum vfio_device_mig_state new_state,
>>> +                             enum vfio_device_mig_state recover_state,
>>> +                             Error **errp);
>>
>> please move below with the other declarations under #ifdef CONFIG_LINUX.
>>
>>> +#endif
>>> +
>>> +int vfio_save_device_config_state(QEMUFile *f, void *opaque, Error **errp);
>>>   int vfio_load_device_config_state(QEMUFile *f, void *opaque);
>>>   #ifdef CONFIG_LINUX
>>>
>>
> 
> Done.
> 
>>
>> Thanks,
>>
>> C.
> 
> Thanks,
> Maciej
> 
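For a reader new to the series: the save-thread loop quoted at the top of this message frames each chunk of device state as a small header followed by the raw data before queuing it for multifd transmission. Below is a minimal standalone sketch of that framing; the type and function names here are illustrative only, not the actual QEMU structures.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Illustrative model of a device-state packet: a header carrying the
 * stream position (idx), followed by the state bytes read from the
 * device. Not the actual QEMU struct layout.
 */
typedef struct {
    uint32_t version;
    uint32_t idx;      /* position in the device-state stream */
    uint32_t flags;
    uint8_t data[];    /* flexible array: data_size bytes follow */
} state_packet;

/* Build a packet around one chunk of state; returns the total size
 * (header plus payload) through *packet_size, as in the quoted loop. */
static state_packet *packet_build(uint32_t idx, const void *buf,
                                  size_t data_size, size_t *packet_size)
{
    state_packet *p = calloc(1, sizeof(*p) + data_size);
    p->idx = idx;
    memcpy(p->data, buf, data_size);
    *packet_size = sizeof(*p) + data_size;
    return p;
}
```

The key invariant mirrored from the patch is that the queued size is always `sizeof(header) + data_size`, so the receiver can recover the payload length from the total packet size.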



^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v5 36/36] vfio/migration: Update VFIO migration documentation
  2025-02-27 22:01     ` Maciej S. Szmigiero
@ 2025-02-28 10:05       ` Cédric Le Goater
  2025-02-28 20:49         ` Maciej S. Szmigiero
  2025-02-28 23:38         ` Fabiano Rosas
  0 siblings, 2 replies; 120+ messages in thread
From: Cédric Le Goater @ 2025-02-28 10:05 UTC (permalink / raw)
  To: Maciej S. Szmigiero
  Cc: Alex Williamson, Eric Blake, Peter Xu, Fabiano Rosas,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

On 2/27/25 23:01, Maciej S. Szmigiero wrote:
> On 27.02.2025 07:59, Cédric Le Goater wrote:
>> On 2/19/25 21:34, Maciej S. Szmigiero wrote:
>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>
>>> Update the VFIO documentation at docs/devel/migration describing the
>>> changes brought by the multifd device state transfer.
>>>
>>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>>> ---
>>>   docs/devel/migration/vfio.rst | 80 +++++++++++++++++++++++++++++++----
>>>   1 file changed, 71 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/docs/devel/migration/vfio.rst b/docs/devel/migration/vfio.rst
>>> index c49482eab66d..d9b169d29921 100644
>>> --- a/docs/devel/migration/vfio.rst
>>> +++ b/docs/devel/migration/vfio.rst
>>> @@ -16,6 +16,37 @@ helps to reduce the total downtime of the VM. VFIO devices opt-in to pre-copy
>>>   support by reporting the VFIO_MIGRATION_PRE_COPY flag in the
>>>   VFIO_DEVICE_FEATURE_MIGRATION ioctl.
>>
>> Please add a new "multifd" documentation subsection at the end of the file
>> with this part :
>>
>>> +Starting from QEMU version 10.0 there's a possibility to transfer VFIO device
>>> +_STOP_COPY state via multifd channels. This helps reduce downtime - especially
>>> +with multiple VFIO devices or with devices having a large migration state.
>>> +As an additional benefit, setting the VFIO device to _STOP_COPY state and
>>> +saving its config space is also parallelized (run in a separate thread) in
>>> +such migration mode.
>>> +
>>> +The multifd VFIO device state transfer is controlled by
>>> +"x-migration-multifd-transfer" VFIO device property. This property defaults to
>>> +AUTO, which means that VFIO device state transfer via multifd channels is
>>> +attempted in configurations that otherwise support it.
>>> +
> 
> Done - I also moved the parts about x-migration-max-queued-buffers
> and x-migration-load-config-after-iter description there since
> obviously they wouldn't make sense being left alone in the top section.
> 
>> I was expecting a much more detailed explanation on the design too  :
>>
>>   * in the cover letter
>>   * in the hw/vfio/migration-multifd.c
>>   * in some new file under docs/devel/migration/

I forgot to add  :

      * guide on how to use this new feature from QEMU and libvirt.
        something we can refer to for tests. That's a must have.
      * usage scenarios
        There are some benefits but it is not obvious a user would
        like to use multiple VFs in one VM, please explain.
        This is a major addition which needs justification anyhow
      * pros and cons

> I'm not sure what descriptions you exactly want in these places, 

Looking from the VFIO subsystem, the way this series works is very opaque.
There are a couple of a new migration handlers, new threads, new channels,
etc. It has been discussed several times with migration folks, please provide
a summary for a new reader as ignorant as everyone would be when looking at
a new file.


> but since
> that's just documentation (not code) it could be added after the code freeze...

That's the risk of not getting any! And the initial proposal should be
discussed before code freeze.

For the general framework, I was expecting an extension of a "multifd"
subsection under :

   https://qemu.readthedocs.io/en/v9.2.0/devel/migration/features.html

but it doesn't exist :/

So, for now, let's use the new "multifd" subsection of

   https://qemu.readthedocs.io/en/v9.2.0/devel/migration/vfio.html

> 
>>
>> This section :
>>
>>> +Since the target QEMU needs to load device state buffers in-order it needs to
>>> +queue incoming buffers until they can be loaded into the device.
>>> +This means that a malicious QEMU source could theoretically cause the target
>>> +QEMU to allocate unlimited amounts of memory for such buffers-in-flight.
>>> +
>>> +The "x-migration-max-queued-buffers" property allows capping the maximum count
>>> +of these VFIO device state buffers queued at the destination.
>>> +
>>> +Because a malicious QEMU source causing OOM on the target is not expected to be
>>> +a realistic threat in most of VFIO live migration use cases and the right value
>>> +depends on the particular setup by default this queued buffers limit is
>>> +disabled by setting it to UINT64_MAX.
>>
>> should be in patch 34. It is not obvious it will be merged.
>>
> 
> ...which brings us to this point.
> 
> I think by this point in time (less then 2 weeks to code freeze) we should
> finally decide what is going to be included in the patch set.
> 
> This way this patch set could be well tested in its final form rather than
> having significant parts taken out of it at the eleventh hour.
> 
> If the final form is known also the documentation can be adjusted accordingly
> and user/admin documentation eventually written once the code is considered
> okay.
> 
> I thought we discussed a few times the rationale behind both
> x-migration-max-queued-buffers and x-migration-load-config-after-iter properties
> but if you still have some concerns there please let me know before I prepare
> the next version of this patch set so I know whether to include these.

Patch 34, not sure yet.

Patch 35 is for next cycle IMO.

For QEMU 10.0, let's focus on x86 first and see how it goes. We can add
ARM support in QEMU 10.1 if nothing new arises. We will need the virt-arm
folks in cc: then.

Please keep patch 35 in v6 nevertheless, it is good for reference if
> someone wants to apply it to an out-of-tree QEMU.


Thanks,

C.


> 
>> This section :
>>
>>> +Some host platforms (like ARM64) require that VFIO device config is loaded only
>>> +after all iterables were loaded.
>>> +Such interlocking is controlled by "x-migration-load-config-after-iter" VFIO
>>> +device property, which in its default setting (AUTO) does so only on platforms
>>> +that actually require it.
>>
>> Should be in 35. Same reason.
>>
>>
>>>   When pre-copy is supported, it's possible to further reduce downtime by
>>>   enabling "switchover-ack" migration capability.
>>>   VFIO migration uAPI defines "initial bytes" as part of its pre-copy data stream
>>> @@ -67,14 +98,39 @@ VFIO implements the device hooks for the iterative approach as follows:
>>>   * A ``switchover_ack_needed`` function that checks if the VFIO device uses
>>>     "switchover-ack" migration capability when this capability is enabled.
>>> -* A ``save_state`` function to save the device config space if it is present.
>>> -
>>> -* A ``save_live_complete_precopy`` function that sets the VFIO device in
>>> -  _STOP_COPY state and iteratively copies the data for the VFIO device until
>>> -  the vendor driver indicates that no data remains.
>>> -
>>> -* A ``load_state`` function that loads the config section and the data
>>> -  sections that are generated by the save functions above.
>>> +* A ``switchover_start`` function that in the multifd mode starts a thread that
>>> +  reassembles the multifd received data and loads it in-order into the device.
>>> +  In the non-multifd mode this function is a NOP.
>>> +
>>> +* A ``save_state`` function to save the device config space if it is present
>>> +  in the non-multifd mode.
>>> +  In the multifd mode it just emits either a dummy EOS marker or
>>> +  "all iterables were loaded" flag for configurations that need to defer
>>> +  loading device config space after them.
>>> +
>>> +* A ``save_live_complete_precopy`` function that in the non-multifd mode sets
>>> +  the VFIO device in _STOP_COPY state and iteratively copies the data for the
>>> +  VFIO device until the vendor driver indicates that no data remains.
>>> +  In the multifd mode it just emits a dummy EOS marker.
>>> +
>>> +* A ``save_live_complete_precopy_thread`` function that in the multifd mode
>>> +  provides thread handler performing multifd device state transfer.
>>> +  It sets the VFIO device to _STOP_COPY state, iteratively reads the data
>>> +  from the VFIO device and queues it for multifd transmission until the vendor
>>> +  driver indicates that no data remains.
>>> +  After that, it saves the device config space and queues it for multifd
>>> +  transfer too.
>>> +  In the non-multifd mode this thread is a NOP.
>>> +
>>> +* A ``load_state`` function that loads the data sections that are generated
>>> +  by the main migration channel save functions above.
>>> +  In the non-multifd mode it also loads the config section, while in the
>>> +  multifd mode it handles the optional "all iterables were loaded" flag if
>>> +  it is in use.
>>> +
>>> +* A ``load_state_buffer`` function that loads the device state and the device
>>> +  config that arrived via multifd channels.
>>> +  It's used only in the multifd mode.
>>
>> Please move the documentation of the new migration handlers in the
>> patch introducing them.
>>
>>
>> Thanks,
>>
>> C.
>>
> 
> Thanks,
> Maciej
> 




* Re: [PATCH v5 26/36] vfio/migration: Multifd device state transfer support - received buffers queuing
  2025-02-28  8:09       ` Cédric Le Goater
@ 2025-02-28 20:47         ` Maciej S. Szmigiero
  0 siblings, 0 replies; 120+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-28 20:47 UTC (permalink / raw)
  To: Cédric Le Goater
  Cc: Alex Williamson, Eric Blake, Peter Xu, Fabiano Rosas,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

On 28.02.2025 09:09, Cédric Le Goater wrote:
> On 2/26/25 22:04, Maciej S. Szmigiero wrote:
>> On 26.02.2025 11:43, Cédric Le Goater wrote:
>>> On 2/19/25 21:34, Maciej S. Szmigiero wrote:
>>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>>
>>>> The multifd received data needs to be reassembled since device state
>>>> packets sent via different multifd channels can arrive out-of-order.
>>>>
>>>> Therefore, each VFIO device state packet carries a header indicating its
>>>> position in the stream.
>>>> The raw device state data is saved into a VFIOStateBuffer for later
>>>> in-order loading into the device.
>>>>
>>>> The last such VFIO device state packet should have
>>>> VFIO_DEVICE_STATE_CONFIG_STATE flag set and carry the device config state.
>>>>
>>>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>>>> ---
>>>>   hw/vfio/migration-multifd.c | 103 ++++++++++++++++++++++++++++++++++++
>>>>   hw/vfio/migration-multifd.h |   3 ++
>>>>   hw/vfio/migration.c         |   1 +
>>>>   hw/vfio/trace-events        |   1 +
>>>>   4 files changed, 108 insertions(+)
>>>>
>>>> diff --git a/hw/vfio/migration-multifd.c b/hw/vfio/migration-multifd.c
>>>> index c2defc0efef0..5d5ee1393674 100644
>>>> --- a/hw/vfio/migration-multifd.c
>>>> +++ b/hw/vfio/migration-multifd.c
>>>> @@ -42,6 +42,11 @@ typedef struct VFIOStateBuffer {
>>>>   } VFIOStateBuffer;
>>>>   typedef struct VFIOMultifd {
>>>> +    VFIOStateBuffers load_bufs;
>>>> +    QemuCond load_bufs_buffer_ready_cond;
>>>> +    QemuMutex load_bufs_mutex; /* Lock order: this lock -> BQL */
>>>> +    uint32_t load_buf_idx;
>>>> +    uint32_t load_buf_idx_last;
>>>>   } VFIOMultifd;
>>>>   static void vfio_state_buffer_clear(gpointer data)
>>>> @@ -87,15 +92,113 @@ static VFIOStateBuffer *vfio_state_buffers_at(VFIOStateBuffers *bufs, guint idx)
>>>>       return &g_array_index(bufs->array, VFIOStateBuffer, idx);
>>>>   }
>>>
>>> this routine expects load_bufs_mutex to be locked ? May be say so.
>>
>> I guess the comment above pertains to the vfio_load_state_buffer_insert()
>> below.
>>
>> Do you mean it should have a comment that it expects to be called
>> under load_bufs_mutex?
> 
> Just a one liner like :
> 
> /* called with load_bufs_mutex locked */
> 
> ?
> 
> It's good to have for the future generations.

Okay, done.

>>
>>>> +static bool vfio_load_state_buffer_insert(VFIODevice *vbasedev,
>>>
>>> could you pass VFIOMultifd* instead  ?
>>
>> No, it needs vbasedev->migration_max_queued_buffers too (introduced
>> in later patch).
>> 
>> Also, most of VFIO routines (besides very small helpers/wrappers)
>> take VFIODevice *.
> 
> OK. It's minor but I prefer when parameters are limited to the minimum.
> Having 'VFIODevice *' opens doors to a lot of state.
> 
> 
> Thanks,
> 
> C.
> 

Thanks.
Maciej
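The queuing scheme under discussion, inserting out-of-order packets keyed by their idx and consuming them strictly in order, can be sketched standalone as follows. All names are illustrative rather than the QEMU API (the patch uses a growable GArray, not a fixed array), and locking is assumed to be handled by the caller, as the "called with load_bufs_mutex locked" comment above notes.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative in-order reassembly store (fixed-size for simplicity). */
typedef struct {
    void *data;
    size_t len;
    int is_present;
} state_buffer;

enum { MAX_BUFS = 64 };

typedef struct {
    state_buffer bufs[MAX_BUFS];
    uint32_t next_load_idx;   /* lowest index not yet loaded */
} buffer_store;

/* Queue a packet that may arrive out of order; the caller is assumed
 * to hold the store's mutex. Returns -1 for a stale or bad index. */
static int store_insert(buffer_store *s, uint32_t idx,
                        const void *data, size_t len)
{
    if (idx >= MAX_BUFS || idx < s->next_load_idx) {
        return -1;
    }
    s->bufs[idx].data = malloc(len);
    memcpy(s->bufs[idx].data, data, len);
    s->bufs[idx].len = len;
    s->bufs[idx].is_present = 1;
    return 0;
}

/* Pop the next in-order buffer, or NULL if it has not arrived yet;
 * the load thread would wait on a condvar in the NULL case. */
static state_buffer *store_next(buffer_store *s)
{
    state_buffer *lb = &s->bufs[s->next_load_idx];
    if (!lb->is_present) {
        return NULL;
    }
    s->next_load_idx++;
    return lb;
}
```

This shows why buffers must be retained until their turn comes: packet 1 arriving before packet 0 is stored but cannot be consumed until packet 0 fills the gap.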




* Re: [PATCH v5 32/36] vfio/migration: Make x-migration-multifd-transfer VFIO property mutable
  2025-02-28  8:44       ` Cédric Le Goater
@ 2025-02-28 20:47         ` Maciej S. Szmigiero
  0 siblings, 0 replies; 120+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-28 20:47 UTC (permalink / raw)
  To: Cédric Le Goater
  Cc: Alex Williamson, Peter Xu, Fabiano Rosas, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

On 28.02.2025 09:44, Cédric Le Goater wrote:
> On 2/26/25 22:05, Maciej S. Szmigiero wrote:
>> On 26.02.2025 18:59, Cédric Le Goater wrote:
>>> On 2/19/25 21:34, Maciej S. Szmigiero wrote:
>>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>>
>>>> DEFINE_PROP_ON_OFF_AUTO() property isn't runtime-mutable so using it
>>>> would mean that the source VM would need to decide upfront at startup
>>>> time whether it wants to do a multifd device state transfer at some
>>>> point.
>>>>
>>>> Source VM can run for a long time before being migrated so it is
>>>> desirable to have a fallback mechanism to the old way of transferring
>>>> VFIO device state if it turns to be necessary.
>>>>
>>>> This brings this property to the same mutability level as ordinary
>>>> migration parameters, which too can be adjusted at the run time.
>>>>
>>>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>>>> ---
>>>>   hw/vfio/pci.c | 12 +++++++++---
>>>>   1 file changed, 9 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>>>> index 184ff882f9d1..9111805ae06c 100644
>>>> --- a/hw/vfio/pci.c
>>>> +++ b/hw/vfio/pci.c
>>>> @@ -3353,6 +3353,8 @@ static void vfio_instance_init(Object *obj)
>>>>       pci_dev->cap_present |= QEMU_PCI_CAP_EXPRESS;
>>>>   }
>>>> +static PropertyInfo qdev_prop_on_off_auto_mutable;
>>>
>>> please use another name, like vfio_pci_migration_multifd_transfer_prop.
>>
>> Done.
>>
>>> I wish we could define the property info all at once.
>>
>> I'm not sure what you mean here, could you please elaborate a bit more?
> 
> I meant :
> 
>      static const PropertyInfo vfio_pci_migration_multifd_transfer_prop = {
>          .name = "OnOffAuto",
>          .description = "on/off/auto",
>          .enum_table = &OnOffAuto_lookup,
>          .get = qdev_propinfo_get_enum,
>          .set = qdev_propinfo_set_enum,
>          .set_default_value = qdev_propinfo_set_default_value_enum,
>          .realized_set_allowed = true,
>      };
> 
> which requires including "hw/core/qdev-prop-internal.h".
> 
> I think your method is preferable. Please add a little comment
> before :
> 
>      qdev_prop_on_off_auto_mutable = qdev_prop_on_off_auto;
>      qdev_prop_on_off_auto_mutable.realized_set_allowed = true;

Added a comment above these code lines describing why a custom
property type is justified in this case.

> Thanks,
> 
> C.
> 

Thanks,
Maciej




* Re: [PATCH v5 34/36] vfio/migration: Max in-flight VFIO device state buffer count limit
  2025-02-28  8:53       ` Cédric Le Goater
@ 2025-02-28 20:48         ` Maciej S. Szmigiero
  0 siblings, 0 replies; 120+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-28 20:48 UTC (permalink / raw)
  To: Cédric Le Goater
  Cc: Alex Williamson, Eric Blake, Peter Xu, Fabiano Rosas,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

On 28.02.2025 09:53, Cédric Le Goater wrote:
> On 2/27/25 23:01, Maciej S. Szmigiero wrote:
>> On 27.02.2025 07:48, Cédric Le Goater wrote:
>>> On 2/19/25 21:34, Maciej S. Szmigiero wrote:
>>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>>
>>>> Allow capping the maximum count of in-flight VFIO device state buffers
>>>> queued at the destination, otherwise a malicious QEMU source could
>>>> theoretically cause the target QEMU to allocate unlimited amounts of memory
>>>> for buffers-in-flight.
>>>>
>>>> Since this is not expected to be a realistic threat in most of VFIO live
>>>> migration use cases and the right value depends on the particular setup
>>>> disable the limit by default by setting it to UINT64_MAX.
>>>>
>>>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>>>> ---
>>>>   hw/vfio/migration-multifd.c   | 14 ++++++++++++++
>>>>   hw/vfio/pci.c                 |  2 ++
>>>>   include/hw/vfio/vfio-common.h |  1 +
>>>>   3 files changed, 17 insertions(+)
>>>>
>>>> diff --git a/hw/vfio/migration-multifd.c b/hw/vfio/migration-multifd.c
>>>> index 18a5ff964a37..04aa3f4a6596 100644
>>>> --- a/hw/vfio/migration-multifd.c
>>>> +++ b/hw/vfio/migration-multifd.c
>>>> @@ -53,6 +53,7 @@ typedef struct VFIOMultifd {
>>>>       QemuMutex load_bufs_mutex; /* Lock order: this lock -> BQL */
>>>>       uint32_t load_buf_idx;
>>>>       uint32_t load_buf_idx_last;
>>>> +    uint32_t load_buf_queued_pending_buffers;
>>>>   } VFIOMultifd;
>>>>   static void vfio_state_buffer_clear(gpointer data)
>>>> @@ -121,6 +122,15 @@ static bool vfio_load_state_buffer_insert(VFIODevice *vbasedev,
>>>>       assert(packet->idx >= multifd->load_buf_idx);
>>>> +    multifd->load_buf_queued_pending_buffers++;
>>>> +    if (multifd->load_buf_queued_pending_buffers >
>>>> +        vbasedev->migration_max_queued_buffers) {
>>>> +        error_setg(errp,
>>>> +                   "queuing state buffer %" PRIu32 " would exceed the max of %" PRIu64,
>>>> +                   packet->idx, vbasedev->migration_max_queued_buffers);
>>>> +        return false;
>>>> +    }
>>>> +
>>>>       lb->data = g_memdup2(&packet->data, packet_total_size - sizeof(*packet));
>>>>       lb->len = packet_total_size - sizeof(*packet);
>>>>       lb->is_present = true;
>>>> @@ -374,6 +384,9 @@ static bool vfio_load_bufs_thread(void *opaque, bool *should_quit, Error **errp)
>>>>               goto ret_signal;
>>>>           }
>>>> +        assert(multifd->load_buf_queued_pending_buffers > 0);
>>>> +        multifd->load_buf_queued_pending_buffers--;
>>>> +
>>>>           if (multifd->load_buf_idx == multifd->load_buf_idx_last - 1) {
>>>>               trace_vfio_load_state_device_buffer_end(vbasedev->name);
>>>>           }
>>>> @@ -408,6 +421,7 @@ VFIOMultifd *vfio_multifd_new(void)
>>>>       multifd->load_buf_idx = 0;
>>>>       multifd->load_buf_idx_last = UINT32_MAX;
>>>> +    multifd->load_buf_queued_pending_buffers = 0;
>>>>       qemu_cond_init(&multifd->load_bufs_buffer_ready_cond);
>>>>       multifd->load_bufs_thread_running = false;
>>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>>>> index 9111805ae06c..247418f0fce2 100644
>>>> --- a/hw/vfio/pci.c
>>>> +++ b/hw/vfio/pci.c
>>>> @@ -3383,6 +3383,8 @@ static const Property vfio_pci_dev_properties[] = {
>>>>                   vbasedev.migration_multifd_transfer,
>>>>                   qdev_prop_on_off_auto_mutable, OnOffAuto,
>>>>                   .set_default = true, .defval.i = ON_OFF_AUTO_AUTO),
>>>> +    DEFINE_PROP_UINT64("x-migration-max-queued-buffers", VFIOPCIDevice,
>>>> +                       vbasedev.migration_max_queued_buffers, UINT64_MAX),
>>>
>>> UINT64_MAX doesn't make sense to me. What would be a reasonable value ?
>>
>> It's the value that effectively disables this limit.
>>
>>> Have you monitored the max ? Should we collect some statistics on this
>>> value and raise a warning if a high water mark is reached ? I think
>>> this would more useful.
>>
>> It's an additional mechanism, which is not expected to be necessary
>> in most of real-world setups, hence it's disabled by default:
>>> Since this is not expected to be a realistic threat in most of VFIO live
>>> migration use cases and the right value depends on the particular setup
>>> disable the limit by default by setting it to UINT64_MAX.
>>
>> The minimum value that works with particular setup depends on number of
>> multifd channels, probably also the number of NIC queues, etc. so it's
>> not something we should propose a hard default for - unless it's a very
>> high default like 100 buffers, but then why have it set by default?
>>
>> IMHO setting it to UINT64_MAX clearly shows that it is disabled by
>> default since it obviously couldn't be set higher.
> 
> This doesn't convince me that we should take this patch in QEMU 10.0.
> Please keep for now. We will decide in v6.

Okay, let's decide at the v6 time then.

>>>>       DEFINE_PROP_BOOL("migration-events", VFIOPCIDevice,
>>>>                        vbasedev.migration_events, false),
>>>>       DEFINE_PROP_BOOL("x-no-mmap", VFIOPCIDevice, vbasedev.no_mmap, false),
>>>
>>>
>>> Please add property documentation in vfio_pci_dev_class_init()
>>>
>>
>> I'm not sure what you mean by that, vfio_pci_dev_class_init() doesn't
>> contain any documentation or even references to either
>> x-migration-max-queued-buffers or x-migration-multifd-transfer:
> 
> Indeed :/ I am trying to fix documentation here :
> 
>    https://lore.kernel.org/qemu-devel/20250217173455.449983-1-clg@redhat.com/
> 
> Please do something similar. I will polish the edges when merging
> if necessary.

Ahh, I see now - that patch set of yours isn't merged upstream yet so
that's why I did not know what you had in mind.

> Overall, we should improve VFIO documentation, migration is one sub-feature
> among many.

Sure - I've now added object_class_property_set_description() description
for all 3 newly added parameters:
x-migration-multifd-transfer, x-migration-load-config-after-iter and
x-migration-max-queued-buffers.

> Thanks,
> 
> C.
> 

Thanks,
Maciej
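The cap debated in this thread reduces to a counter check on the receive path: account a buffer when it is queued, fail when the configured maximum would be exceeded, and release when the load thread consumes it. A standalone sketch, with illustrative names, where a limit of UINT64_MAX effectively disables the check (matching the patch's default):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative queued-buffer accounting, not the QEMU implementation. */
typedef struct {
    uint64_t queued_pending;  /* buffers received but not yet loaded */
    uint64_t max_queued;      /* UINT64_MAX means unlimited */
} queue_limit;

/* Called when a state buffer arrives; -1 means the source is trying
 * to queue more buffers than the destination is willing to hold. */
static int queue_limit_account(queue_limit *q)
{
    q->queued_pending++;
    if (q->queued_pending > q->max_queued) {
        return -1;
    }
    return 0;
}

/* Called by the load thread once a buffer has been loaded and freed. */
static void queue_limit_release(queue_limit *q)
{
    assert(q->queued_pending > 0);
    q->queued_pending--;
}
```

The right ceiling depends on the number of multifd channels and the device state size, which is why the thread above argues against picking a hard default.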




* Re: [PATCH v5 27/36] vfio/migration: Multifd device state transfer support - load thread
  2025-02-28  9:11       ` Cédric Le Goater
@ 2025-02-28 20:48         ` Maciej S. Szmigiero
  0 siblings, 0 replies; 120+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-28 20:48 UTC (permalink / raw)
  To: Cédric Le Goater
  Cc: Alex Williamson, Peter Xu, Fabiano Rosas, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

On 28.02.2025 10:11, Cédric Le Goater wrote:
> On 2/26/25 22:05, Maciej S. Szmigiero wrote:
>> On 26.02.2025 14:49, Cédric Le Goater wrote:
>>> On 2/19/25 21:34, Maciej S. Szmigiero wrote:
>>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>>
>>>> Since it's important to finish loading device state transferred via the
>>>> main migration channel (via save_live_iterate SaveVMHandler) before
>>>> starting loading the data asynchronously transferred via multifd the thread
>>>> doing the actual loading of the multifd transferred data is only started
>>>> from switchover_start SaveVMHandler.
>>>>
>>>> switchover_start handler is called when MIG_CMD_SWITCHOVER_START
>>>> sub-command of QEMU_VM_COMMAND is received via the main migration channel.
>>>>
>>>> This sub-command is only sent after all save_live_iterate data have already
>>>> been posted so it is safe to commence loading of the multifd-transferred
>>>> device state upon receiving it - loading of save_live_iterate data happens
>>>> synchronously in the main migration thread (much like the processing of
>>>> MIG_CMD_SWITCHOVER_START) so by the time MIG_CMD_SWITCHOVER_START is
>>>> processed all the proceeding data must have already been loaded.
>>>>
>>>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>>>> ---
>>>>   hw/vfio/migration-multifd.c | 225 ++++++++++++++++++++++++++++++++++++
>>>>   hw/vfio/migration-multifd.h |   2 +
>>>>   hw/vfio/migration.c         |  12 ++
>>>>   hw/vfio/trace-events        |   5 +
>>>>   4 files changed, 244 insertions(+)
>>>>
>>>> diff --git a/hw/vfio/migration-multifd.c b/hw/vfio/migration-multifd.c
>>>> index 5d5ee1393674..b3a88c062769 100644
>>>> --- a/hw/vfio/migration-multifd.c
>>>> +++ b/hw/vfio/migration-multifd.c
(..)
>>>> +    while (true) {
>>>> +        VFIOStateBuffer *lb;
>>>> +
>>>> +        /*
>>>> +         * Always check cancellation first after the buffer_ready wait below in
>>>> +         * case that cond was signalled by vfio_load_cleanup_load_bufs_thread().
>>>> +         */
>>>> +        if (vfio_load_bufs_thread_want_exit(multifd, should_quit)) {
>>>> +            error_setg(errp, "operation cancelled");
>>>> +            ret = false;
>>>> +            goto ret_signal;
>>>
>>> goto thread_exit ?
>>
>> I'm not sure that I fully understand this comment.
>> Do you mean to rename ret_signal label to thread_exit?
> 
> 
> Yes. I find label 'thread_exit' more meaningful. This is minor since
> there is only one 'exit' label.
> 

Renamed ret_signal to thread_exit then.

(..)
>>>> +    config_ret = vfio_load_bufs_thread_load_config(vbasedev);
>>>> +    if (config_ret) {
>>>> +        error_setg(errp, "load config state failed: %d", config_ret);
>>>> +        ret = false;
>>>> +    }
>>>
>>> please move to next patch. This is adding nothing to this patch
>>> since it's returning -EINVAL.
>>>
>>
>> That's the whole point - if someone were to accidentally enable this
>> (for example by forgetting to apply the next patch when backporting
>> the series) it would fail safely with EINVAL instead of having a
>> half-broken implementation.
> 
> OK. Let's keep it that way.
> 
> 
> Thanks,
> 
> C.

Thanks,
Maciej
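The cancellation ordering discussed above (re-checking the exit flag after every wakeup from the buffer-ready cond wait, since the cleanup path may be the one signalling it) can be sketched with plain pthreads. Names are illustrative, not the QEMU implementation:

```c
#include <pthread.h>
#include <stdbool.h>

/* Illustrative load-thread state: exit flag and ready-buffer count
 * are both protected by the same mutex the condvar uses. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t buffer_ready;
    bool should_quit;
    int buffers_ready;
} load_ctx;

/* Wait for one buffer; returns 1 when a buffer was consumed and -1 on
 * cancellation. The quit flag is re-checked before touching buffers
 * because the cond may have been signalled by the cleanup path. */
static int load_wait_one(load_ctx *c)
{
    int ret;

    pthread_mutex_lock(&c->lock);
    while (!c->should_quit && c->buffers_ready == 0) {
        pthread_cond_wait(&c->buffer_ready, &c->lock);
    }
    if (c->should_quit) {
        ret = -1;
    } else {
        c->buffers_ready--;
        ret = 1;
    }
    pthread_mutex_unlock(&c->lock);
    return ret;
}
```

Checking the predicate in a loop around `pthread_cond_wait()` also covers spurious wakeups, which is why the "check cancellation first after the wait" comment in the patch matters.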




* Re: [PATCH v5 30/36] vfio/migration: Multifd device state transfer support - send side
  2025-02-28  9:13       ` Cédric Le Goater
@ 2025-02-28 20:49         ` Maciej S. Szmigiero
  0 siblings, 0 replies; 120+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-28 20:49 UTC (permalink / raw)
  To: Cédric Le Goater
  Cc: Alex Williamson, Eric Blake, Peter Xu, Fabiano Rosas,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

On 28.02.2025 10:13, Cédric Le Goater wrote:
> On 2/26/25 22:05, Maciej S. Szmigiero wrote:
>> On 26.02.2025 17:43, Cédric Le Goater wrote:
>>> On 2/19/25 21:34, Maciej S. Szmigiero wrote:
>>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>>
>>>> Implement the multifd device state transfer via additional per-device
>>>> thread inside save_live_complete_precopy_thread handler.
>>>>
>>>> Switch between doing the data transfer in the new handler and doing it
>>>> in the old save_state handler depending on the
>>>> x-migration-multifd-transfer device property value.
>>>>
>>>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>>>> ---
>>>>   hw/vfio/migration-multifd.c   | 139 ++++++++++++++++++++++++++++++++++
>>>>   hw/vfio/migration-multifd.h   |   5 ++
>>>>   hw/vfio/migration.c           |  26 +++++--
>>>>   hw/vfio/trace-events          |   2 +
>>>>   include/hw/vfio/vfio-common.h |   8 ++
>>>>   5 files changed, 174 insertions(+), 6 deletions(-)
>>>>
>>>> diff --git a/hw/vfio/migration-multifd.c b/hw/vfio/migration-multifd.c
>>>> index 7200f6f1c2a2..0cfa9d31732a 100644
>>>> --- a/hw/vfio/migration-multifd.c
>>>> +++ b/hw/vfio/migration-multifd.c
(..)
>>>> +{
>>>> +    VFIODevice *vbasedev = d->handler_opaque;
>>>> +    VFIOMigration *migration = vbasedev->migration;
>>>> +    bool ret;
>>>> +    g_autofree VFIODeviceStatePacket *packet = NULL;
>>>> +    uint32_t idx;
>>>> +
>>>> +    if (!vfio_multifd_transfer_enabled(vbasedev)) {
>>>> +        /* Nothing to do, vfio_save_complete_precopy() does the transfer. */
>>>> +        return true;
>>>> +    }
>>>> +
>>>> +    trace_vfio_save_complete_precopy_thread_start(vbasedev->name,
>>>> +                                                  d->idstr, d->instance_id);
>>>> +
>>>> +    /* We reach here with device state STOP or STOP_COPY only */
>>>> +    if (vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP_COPY,
>>>> +                                 VFIO_DEVICE_STATE_STOP, errp)) {
>>>> +        ret = false;
>>>
>>> These "ret = false" can be avoided if the variable is set at the
>>> top of the function.
>>
>> I inverted the "ret" logic here as in vfio_load_bufs_thread()
>> to make it false by default and set to true just before early
>> exit label.
> 
> ok. Let's see what it looks like in v6.
> 
>>>> +        goto ret_finish;
>>>
>>>
>>> goto thread_exit ?
>>
>> As I asked in one of the previous patches,
>> do this comment mean that your want to rename ret_finish label to
>> thread_exit?
> 
> Yes. I find label 'thread_exit' more meaningful. This is minor since
> there is only one 'exit' label.

Renamed ret_finish to thread_exit then.

> 
>>
(..)
>>>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>>>> index b962309f7c27..69dcf2dac2fa 100644
>>>> --- a/hw/vfio/migration.c
>>>> +++ b/hw/vfio/migration.c
>> (..)
>>>> @@ -238,8 +238,7 @@ static int vfio_load_buffer(QEMUFile *f, VFIODevice *vbasedev,
>>>>       return ret;
>>>>   }
>>>> -static int vfio_save_device_config_state(QEMUFile *f, void *opaque,
>>>> -                                         Error **errp)
>>>> +int vfio_save_device_config_state(QEMUFile *f, void *opaque, Error **errp)
>>>>   {
>>>>       VFIODevice *vbasedev = opaque;
>>>>       int ret;
>>>> @@ -453,6 +452,10 @@ static int vfio_save_setup(QEMUFile *f, void *opaque, Error **errp)
>>>>       uint64_t stop_copy_size = VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE;
>>>>       int ret;
>>>> +    if (!vfio_multifd_transfer_setup(vbasedev, errp)) {
>>>> +        return -EINVAL;
>>>> +    }
>>>> +
>>>
>>> please move to another patch with the similar change of patch 25.
>>>
>>
>> This patch is about the send/save side while patch 25
>> is called "*receive* init/cleanup".
>>
>> So adding save setup to a patch called "receive init" wouldn't be
>> consistent with that patch subject.
> 
> In that case, could you please add an extra patch checking for the consistency
> of the settings ?

I split out wiring vfio_multifd_setup() and vfio_multifd_cleanup() into
general VFIO load/save setup and cleanup methods from this patch and
patch "Multifd device state transfer support - receive init/cleanup"
into a brand new patch/commit.

By the way, due to changes discussed over the last two days
vfio_multifd_setup() (aka vfio_multifd_transfer_setup()) not only
does consistency checking but also allocates VFIOMultifd:
https://lore.kernel.org/qemu-devel/6546c3a4-bd81-42ea-88a2-b2f88ec2fbb3@maciej.szmigiero.name/

> 
> Thanks,
> 
> C.
> 
> 
> 

Thanks,
Maciej



^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v5 36/36] vfio/migration: Update VFIO migration documentation
  2025-02-28 10:05       ` Cédric Le Goater
@ 2025-02-28 20:49         ` Maciej S. Szmigiero
  2025-02-28 23:38         ` Fabiano Rosas
  1 sibling, 0 replies; 120+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-28 20:49 UTC (permalink / raw)
  To: Cédric Le Goater
  Cc: Alex Williamson, Eric Blake, Peter Xu, Fabiano Rosas,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

On 28.02.2025 11:05, Cédric Le Goater wrote:
> On 2/27/25 23:01, Maciej S. Szmigiero wrote:
>> On 27.02.2025 07:59, Cédric Le Goater wrote:
>>> On 2/19/25 21:34, Maciej S. Szmigiero wrote:
>>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>>
>>>> Update the VFIO documentation at docs/devel/migration describing the
>>>> changes brought by the multifd device state transfer.
>>>>
>>>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>>>> ---
>>>>   docs/devel/migration/vfio.rst | 80 +++++++++++++++++++++++++++++++----
>>>>   1 file changed, 71 insertions(+), 9 deletions(-)
>>>>
>>>> diff --git a/docs/devel/migration/vfio.rst b/docs/devel/migration/vfio.rst
>>>> index c49482eab66d..d9b169d29921 100644
>>>> --- a/docs/devel/migration/vfio.rst
>>>> +++ b/docs/devel/migration/vfio.rst
>>>> @@ -16,6 +16,37 @@ helps to reduce the total downtime of the VM. VFIO devices opt-in to pre-copy
>>>>   support by reporting the VFIO_MIGRATION_PRE_COPY flag in the
>>>>   VFIO_DEVICE_FEATURE_MIGRATION ioctl.
>>>
>>> Please add a new "multifd" documentation subsection at the end of the file
>>> with this part :
>>>
>>>> +Starting from QEMU version 10.0 there's a possibility to transfer VFIO device
>>>> +_STOP_COPY state via multifd channels. This helps reduce downtime - especially
>>>> +with multiple VFIO devices or with devices having a large migration state.
>>>> +As an additional benefit, setting the VFIO device to _STOP_COPY state and
>>>> +saving its config space is also parallelized (run in a separate thread) in
>>>> +such migration mode.
>>>> +
>>>> +The multifd VFIO device state transfer is controlled by
>>>> +"x-migration-multifd-transfer" VFIO device property. This property defaults to
>>>> +AUTO, which means that VFIO device state transfer via multifd channels is
>>>> +attempted in configurations that otherwise support it.
>>>> +
>>
>> Done - I also moved the parts about x-migration-max-queued-buffers
>> and x-migration-load-config-after-iter description there since
>> obviously they wouldn't make sense being left alone in the top section.
>>
>>> I was expecting a much more detailed explanation on the design too  :
>>>
>>>   * in the cover letter
>>>   * in the hw/vfio/migration-multifd.c
>>>   * in some new file under docs/devel/migration/
> 
> I forgot to add  :
> 
>       * guide on how to use this new feature from QEMU and libvirt.
>         something we can refer to for tests. That's a must have.

So basically a user's guide.

That's something I plan to write after the code is ready.

>       * usage scenarios
>         There are some benefits but it is not obvious a user would
>         like to use multiple VFs in one VM, please explain.

Hmm, this patch set does not bring ability to use multiple VFs
in a single VM - that ability is already in QEMU even without this
patch set.

As Yanghang has measured, the downtime improvement happens even
with a single VF, although with more VFs one can additionally see
the scalability benefits of this patch set.

>         This is a major addition which needs justification anyhow
>       * pros and cons

The biggest advantage is obviously the downtime performance.

I'm not sure if there are any obvious disadvantages (assuming
the setup supports the multifd migration in the first place),
besides maybe slightly bigger memory usage for in-flight buffers?

But we have an option for capping that if someone is concerned
about it.

>> I'm not sure what descriptions you exactly want in these places, 
> 
> Looking from the VFIO subsystem, the way this series works is very opaque.
> There are a couple of new migration handlers,

I've added descriptions of these 3 new migration handlers to
docs/devel/migration/vfio.rst.

They are also described in struct SaveVMHandlers in include/migration/register.h
and also in the commit messages that introduce them.

> new threads,

A total of two of these, their function is described in docs/devel/migration/vfio.rst
and also in the commit messages that introduce them.

> new channels,

I think you meant "new data type for multifd channel" here but that's
in migration core, not VFIO.

> etc. It has been discussed several times with migration folks, please provide
> a summary for a new reader as ignorant as everyone would be when looking at
> a new file.

I can certainly include all these in the new version cover letter if that's
easier for a new reader.

>> but since
>> that's just documentation (not code) it could be added after the code freeze...
> 
> That's the risk of not getting any ! and the initial proposal should be
> discussed before code freeze.
> 
> For the general framework, I was expecting an extension of a "multifd"
> subsection under :
> 
>    https://qemu.readthedocs.io/en/v9.2.0/devel/migration/features.html
> 
> but it doesn't exist :/

Looking at the source file for this page at docs/devel/migration/features.rst
the "multifd" section should appear on this page automatically after
I added it to docs/devel/migration/vfio.rst.

> So, for now, let's use the new "multifd" subsection of
> 
>    https://qemu.readthedocs.io/en/v9.2.0/devel/migration/vfio.html

Okay.

>>
>>>
>>> This section :
>>>
>>>> +Since the target QEMU needs to load device state buffers in-order it needs to
>>>> +queue incoming buffers until they can be loaded into the device.
>>>> +This means that a malicious QEMU source could theoretically cause the target
>>>> +QEMU to allocate unlimited amounts of memory for such buffers-in-flight.
>>>> +
>>>> +The "x-migration-max-queued-buffers" property allows capping the maximum count
>>>> +of these VFIO device state buffers queued at the destination.
>>>> +
>>>> +Because a malicious QEMU source causing OOM on the target is not expected to
>>>> +be a realistic threat in most VFIO live migration use cases, and because the
>>>> +right value depends on the particular setup, this queued buffers limit is
>>>> +disabled by default by setting it to UINT64_MAX.
>>>
>>> should be in patch 34. It is not obvious it will be merged.
>>>
>>
>> ...which brings us to this point.
>>
>> I think by this point in time (less then 2 weeks to code freeze) we should
>> finally decide what is going to be included in the patch set.
>> This way this patch set could be well tested in its final form rather than
>> having significant parts taken out of it at the eleventh hour.
>>
>> If the final form is known also the documentation can be adjusted accordingly
>> and user/admin documentation eventually written once the code is considered
>> okay.
>>
>> I thought we discussed a few times the rationale behind both
>> x-migration-max-queued-buffers and x-migration-load-config-after-iter properties
>> but if you still have some concerns there please let me know before I prepare
>> the next version of this patch set so I know whether to include these.
> 
> Patch 34, not sure yet.
> 
> Patch 35 is for next cycle IMO.
> 
> For QEMU 10.0, let's focus on x86 first and see how it goes. We can add
> ARM support in QEMU 10.1 if nothing new arises. We will need the virt-arm
> folks in cc: then.
> 
> Please keep patch 35 in v6 nevertheless, it is good for reference if
> someone wants to apply on an out of tree QEMU.

If we are to drop/skip adding the "x-migration-load-config-after-iter"
option for now then let's do it now so the next version could be already
tested in its target shape.

After this "x-migration-load-config-after-iter" option is proposed
once again in QEMU 10.1 cycle then it obviously will be forward ported
to whatever the code looks at that point and tested again.

The patch itself is not going to suddenly disappear :) - it's on the
mailing list and in my repository here:
https://gitlab.com/maciejsszmigiero/qemu/-/commit/6582ac5ac338c40ad74ec60820e85b06c4509a2a

> 
> Thanks,
> 
> C.

Thanks,
Maciej



^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v5 36/36] vfio/migration: Update VFIO migration documentation
  2025-02-28 10:05       ` Cédric Le Goater
  2025-02-28 20:49         ` Maciej S. Szmigiero
@ 2025-02-28 23:38         ` Fabiano Rosas
  2025-03-03  9:34           ` Cédric Le Goater
  2025-03-03 22:14           ` Maciej S. Szmigiero
  1 sibling, 2 replies; 120+ messages in thread
From: Fabiano Rosas @ 2025-02-28 23:38 UTC (permalink / raw)
  To: Cédric Le Goater, Maciej S. Szmigiero
  Cc: Alex Williamson, Eric Blake, Peter Xu, Markus Armbruster,
	Daniel P . Berrangé, Avihai Horon, Joao Martins, qemu-devel

Cédric Le Goater <clg@redhat.com> writes:

> On 2/27/25 23:01, Maciej S. Szmigiero wrote:
>> On 27.02.2025 07:59, Cédric Le Goater wrote:
>>> On 2/19/25 21:34, Maciej S. Szmigiero wrote:
>>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>>
>>>> Update the VFIO documentation at docs/devel/migration describing the
>>>> changes brought by the multifd device state transfer.
>>>>
>>>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>>>> ---
>>>>   docs/devel/migration/vfio.rst | 80 +++++++++++++++++++++++++++++++----
>>>>   1 file changed, 71 insertions(+), 9 deletions(-)
>>>>
>>>> diff --git a/docs/devel/migration/vfio.rst b/docs/devel/migration/vfio.rst
>>>> index c49482eab66d..d9b169d29921 100644
>>>> --- a/docs/devel/migration/vfio.rst
>>>> +++ b/docs/devel/migration/vfio.rst
>>>> @@ -16,6 +16,37 @@ helps to reduce the total downtime of the VM. VFIO devices opt-in to pre-copy
>>>>   support by reporting the VFIO_MIGRATION_PRE_COPY flag in the
>>>>   VFIO_DEVICE_FEATURE_MIGRATION ioctl.
>>>
>>> Please add a new "multifd" documentation subsection at the end of the file
>>> with this part :
>>>
>>>> +Starting from QEMU version 10.0 there's a possibility to transfer VFIO device
>>>> +_STOP_COPY state via multifd channels. This helps reduce downtime - especially
>>>> +with multiple VFIO devices or with devices having a large migration state.
>>>> +As an additional benefit, setting the VFIO device to _STOP_COPY state and
>>>> +saving its config space is also parallelized (run in a separate thread) in
>>>> +such migration mode.
>>>> +
>>>> +The multifd VFIO device state transfer is controlled by
>>>> +"x-migration-multifd-transfer" VFIO device property. This property defaults to
>>>> +AUTO, which means that VFIO device state transfer via multifd channels is
>>>> +attempted in configurations that otherwise support it.
>>>> +
>> 
>> Done - I also moved the parts about x-migration-max-queued-buffers
>> and x-migration-load-config-after-iter description there since
>> obviously they wouldn't make sense being left alone in the top section.
>> 
>>> I was expecting a much more detailed explanation on the design too  :
>>>
>>>   * in the cover letter
>>>   * in the hw/vfio/migration-multifd.c
>>>   * in some new file under docs/devel/migration/
>
> I forgot to add  :
>
>       * guide on how to use this new feature from QEMU and libvirt.
>         something we can refer to for tests. That's a must have.
>       * usage scenarios
>         There are some benefits but it is not obvious a user would
>         like to use multiple VFs in one VM, please explain.
>         This is a major addition which needs justification anyhow
>       * pros and cons
>
>> I'm not sure what descriptions you exactly want in these places, 
>
> Looking from the VFIO subsystem, the way this series works is very opaque.
> There are a couple of new migration handlers, new threads, new channels,
> etc. It has been discussed several times with migration folks, please provide
> a summary for a new reader as ignorant as everyone would be when looking at
> a new file.
>
>
>> but since
>> that's just documentation (not code) it could be added after the code freeze...
>
> That's the risk of not getting any ! and the initial proposal should be
> discussed before code freeze.
>
> For the general framework, I was expecting an extension of a "multifd"
> subsection under :
>
>    https://qemu.readthedocs.io/en/v9.2.0/devel/migration/features.html
>
> but it doesn't exist :/

Hi, see if this helps. Let me know what can be improved and if something
needs to be more detailed. Please ignore the formatting, I'll send a
proper patch after the carnaval.

@Maciej, it's probably better if you keep your docs separate anyway so
we don't add another dependency. I can merge them later.

multifd.rst:

Multifd
=======

Multifd is the name given to the migration capability that enables
data transfer using multiple threads. Multifd supports all the
transport types currently in use with migration (inet, unix, vsock,
fd, file).

Restrictions
------------

For migration to a file, support is conditional on the presence of the
mapped-ram capability, see #mapped-ram.

Snapshots are currently not supported.

Postcopy migration is currently not supported.

Usage
-----

On both source and destination, enable the ``multifd`` capability:

    ``migrate_set_capability multifd on``

Define a number of channels to use (default is 2, but 8 usually
provides best performance).

    ``migrate_set_parameter multifd-channels 8``

Components
----------

Multifd consists of:

- A client that produces the data on the migration source side and
  consumes it on the destination. Currently the main client code is
  ram.c, which selects the RAM pages for migration;

- A shared data structure (MultiFDSendData), used to transfer data
  between multifd and the client. On the source side, this structure
  is further subdivided into payload types (MultiFDPayload);

- An API operating on the shared data structure to allow the client
  code to interact with multifd;

  - multifd_send/recv(): A dispatcher that transfers work to/from the
    channels.

  - multifd_*payload_* and MultiFDPayloadType: Support defining an
    opaque payload. The payload is always wrapped by
    MultiFDSend|RecvData.

  - multifd_send_data_*: Used to manage the memory for the shared data
    structure.

- The threads that process the data (aka channels, due to a 1:1
  mapping to QIOChannels). Each multifd channel supports callbacks
  that can be used for fine-grained processing of the payload, such as
  compression and zero page detection.

- A packet which is the final result of all the data aggregation
  and/or transformation. The packet contains a header, a
  payload-specific header and a variable-size data portion.

   - The packet header: contains a magic number, a version number and
     flags that inform of special processing needed on the
     destination.

   - The payload-specific header: contains metadata referent to the
     packet's data portion, such as page counts.

   - The data portion: contains the actual opaque payload data.

  Note that due to historical reasons, the terminology around multifd
  packets is inconsistent.

  The mapped-ram feature ignores packets entirely.

Theory of operation
-------------------

The multifd channels operate in parallel with the main migration
thread. The transfer of data from a client code into multifd happens
from the main migration thread using the multifd API.

The interaction between the client code and the multifd channels
happens in the multifd_send() and multifd_recv() methods. These are
responsible for selecting the next idle channel and making the shared
data structure containing the payload accessible to that channel. The
client code receives back an empty object which it then uses for the
next iteration of data transfer.

The selection of an idle channel is a simple round-robin over the
channels that have no pending job (!p->pending_job). Channels wait at a
semaphore; once a channel is released, it starts operating on the data
immediately.

Aside from eventually transmitting the data over the underlying
QIOChannel, a channel's operation also includes calling back to the
client code at pre-determined points to allow for client-specific
handling such as data transformation (e.g. compression), creation of
the packet header and arranging the data into iovs (struct
iovec). Iovs are the type of data on which the QIOChannel operates.

Client code (migration thread):
1. Populate shared structure with opaque data (ram pages, device state)
2. Call multifd_send()
   2a. Loop over the channels until one is idle
   2b. Switch pointers between client data and channel data
   2c. Release channel semaphore
3. Receive back empty object
4. Repeat

Multifd channel (multifd thread):
1. Channel idle
2. Gets released by multifd_send()
3. Call multifd_ops methods to fill iov
   3a. Compression may happen
   3b. Zero page detection may happen
   3c. Packet is written
   3d. iov is written
4. Pass iov into QIOChannel for transferring
5. Repeat

The destination side operates similarly but with multifd_recv(),
decompression instead of compression, etc. One important aspect is
that when receiving the data, the iov will contain host virtual
addresses, so guest memory is written to directly from multifd
threads.

About flags
-----------
The main thread orchestrates the migration by issuing control flags on
the migration stream (QEMU_VM_*).

The main memory is migrated by ram.c and includes specific control
flags that are also put on the main migration stream
(RAM_SAVE_FLAG_*).

Multifd has its own set of MULTIFD_FLAGs that are included into each
packet. These may inform about properties such as the compression
algorithm used if the data is compressed.

Synchronization
---------------

Since the migration process is iterative due to RAM dirty tracking, it
is necessary to invalidate data that is no longer current (e.g. due to
the source VM touching the page). This is done by having a
synchronization point triggered by the migration thread at key points
during the migration. Data that's received after the synchronization
point is allowed to overwrite data received prior to that point.

To perform the synchronization, multifd provides the
multifd_send_sync_main() and multifd_recv_sync_main() helpers. These
are called whenever the client code wishes to ensure that all data
sent previously has now been received by the destination.

The synchronization process involves performing a flush of the
remaining client data still left to be transmitted and issuing a
multifd packet containing the MULTIFD_FLAG_SYNC flag. This flag
informs the receiving end that it should finish reading the data and
wait for a synchronization point.

To complete the sync, the main migration stream issues a
RAM_SAVE_FLAG_MULTIFD_FLUSH flag. When that flag is received by the
destination, it ensures all of its channels have seen the
MULTIFD_FLAG_SYNC and moves them to an idle state.

The client code can then continue with a second round of data by
issuing multifd_send() once again.

The synchronization process also ensures that internal synchronization
happens, i.e. between the threads themselves. This is necessary to
avoid threads lagging behind in sending or receiving when the migration
approaches
completion.

The mapped-ram feature has different synchronization requirements
because it's an asynchronous migration (source and destination not
migrating at the same time). For that feature, only the internal sync
is relevant.

Data transformation
-------------------

Each multifd channel executes a set of callbacks before transmitting
the data. These callbacks allow the client code to alter the data
format right before sending and after receiving.

Since the object of the RAM migration is always the memory page, and
the only processing done for memory pages is zero page detection,
which is in a sense already part of compression, the multifd_ops
functions are divided into two mutually exclusive sets: compression
and no-compression.

The migration without compression (i.e. regular ram migration) has, as
mentioned, the further specificity of possibly doing zero page
detection (see the zero-page-detection migration parameter). This
consists of sending all pages to multifd and letting the detection of a
zero page happen in the multifd channels instead of doing it beforehand
on the main migration thread as it was done in the past.

Code structure
--------------

Multifd code is divided into:

The main file containing the core routines

- multifd.c

RAM migration

- multifd-nocomp.c (nocomp, for "no compression")
- multifd-zero-page.c
- ram.c (also involved in non-multifd migrations + snapshots)

Compressors

- multifd-uadk.c
- multifd-qatzip.c
- multifd-zlib.c
- multifd-qpl.c
- multifd-zstd.c


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v5 11/36] migration/multifd: Device state transfer support - receive side
  2025-02-19 20:33 ` [PATCH v5 11/36] migration/multifd: Device state transfer support - receive side Maciej S. Szmigiero
@ 2025-03-02 12:42   ` Avihai Horon
  2025-03-03 22:14     ` Maciej S. Szmigiero
  0 siblings, 1 reply; 120+ messages in thread
From: Avihai Horon @ 2025-03-02 12:42 UTC (permalink / raw)
  To: Maciej S. Szmigiero, Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Joao Martins,
	qemu-devel

Hi Maciej,

Sorry for the long delay, I have been busy with other tasks.
I got some small comments for the series.

On 19/02/2025 22:33, Maciej S. Szmigiero wrote:
> External email: Use caution opening links or attachments
>
>
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>
> Add a basic support for receiving device state via multifd channels -
> channels that are shared with RAM transfers.
>
> Depending whether MULTIFD_FLAG_DEVICE_STATE flag is present or not in the
> packet header either device state (MultiFDPacketDeviceState_t) or RAM
> data (existing MultiFDPacket_t) is read.
>
> The received device state data is provided to
> qemu_loadvm_load_state_buffer() function for processing in the
> device's load_state_buffer handler.
>
> Reviewed-by: Peter Xu <peterx@redhat.com>
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> ---
>   migration/multifd.c | 99 ++++++++++++++++++++++++++++++++++++++++-----
>   migration/multifd.h | 26 +++++++++++-
>   2 files changed, 113 insertions(+), 12 deletions(-)
>
> diff --git a/migration/multifd.c b/migration/multifd.c
> index 3b47e63c2c4a..700a385447c7 100644
> --- a/migration/multifd.c
> +++ b/migration/multifd.c
> @@ -21,6 +21,7 @@
>   #include "file.h"
>   #include "migration.h"
>   #include "migration-stats.h"
> +#include "savevm.h"
>   #include "socket.h"
>   #include "tls.h"
>   #include "qemu-file.h"
> @@ -252,14 +253,24 @@ static int multifd_recv_unfill_packet_header(MultiFDRecvParams *p,
>       return 0;
>   }
>
> -static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error **errp)
> +static int multifd_recv_unfill_packet_device_state(MultiFDRecvParams *p,
> +                                                   Error **errp)
> +{
> +    MultiFDPacketDeviceState_t *packet = p->packet_dev_state;
> +
> +    packet->instance_id = be32_to_cpu(packet->instance_id);
> +    p->next_packet_size = be32_to_cpu(packet->next_packet_size);
> +
> +    return 0;
> +}
> +
> +static int multifd_recv_unfill_packet_ram(MultiFDRecvParams *p, Error **errp)
>   {
>       const MultiFDPacket_t *packet = p->packet;
>       int ret = 0;
>
>       p->next_packet_size = be32_to_cpu(packet->next_packet_size);
>       p->packet_num = be64_to_cpu(packet->packet_num);
> -    p->packets_recved++;
>
>       /* Always unfill, old QEMUs (<9.0) send data along with SYNC */
>       ret = multifd_ram_unfill_packet(p, errp);
> @@ -270,6 +281,17 @@ static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error **errp)
>       return ret;
>   }
>
> +static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error **errp)
> +{
> +    p->packets_recved++;
> +
> +    if (p->flags & MULTIFD_FLAG_DEVICE_STATE) {
> +        return multifd_recv_unfill_packet_device_state(p, errp);
> +    }
> +
> +    return multifd_recv_unfill_packet_ram(p, errp);
> +}
> +
>   static bool multifd_send_should_exit(void)
>   {
>       return qatomic_read(&multifd_send_state->exiting);
> @@ -1057,6 +1079,7 @@ static void multifd_recv_cleanup_channel(MultiFDRecvParams *p)
>       p->packet_len = 0;
>       g_free(p->packet);
>       p->packet = NULL;
> +    g_clear_pointer(&p->packet_dev_state, g_free);
>       g_free(p->normal);
>       p->normal = NULL;
>       g_free(p->zero);
> @@ -1158,6 +1181,32 @@ void multifd_recv_sync_main(void)
>       trace_multifd_recv_sync_main(multifd_recv_state->packet_num);
>   }
>
> +static int multifd_device_state_recv(MultiFDRecvParams *p, Error **errp)
> +{
> +    g_autofree char *idstr = NULL;
> +    g_autofree char *dev_state_buf = NULL;
> +    int ret;
> +
> +    dev_state_buf = g_malloc(p->next_packet_size);
> +
> +    ret = qio_channel_read_all(p->c, dev_state_buf, p->next_packet_size, errp);
> +    if (ret != 0) {
> +        return ret;
> +    }
> +
> +    idstr = g_strndup(p->packet_dev_state->idstr,
> +                      sizeof(p->packet_dev_state->idstr));
> +
> +    if (!qemu_loadvm_load_state_buffer(idstr,
> +                                       p->packet_dev_state->instance_id,
> +                                       dev_state_buf, p->next_packet_size,
> +                                       errp)) {
> +        ret = -1;
> +    }
> +
> +    return ret;
> +}
> +
>   static void *multifd_recv_thread(void *opaque)
>   {
>       MigrationState *s = migrate_get_current();
> @@ -1176,6 +1225,7 @@ static void *multifd_recv_thread(void *opaque)
>       while (true) {
>           MultiFDPacketHdr_t hdr;
>           uint32_t flags = 0;
> +        bool is_device_state = false;
>           bool has_data = false;
>           uint8_t *pkt_buf;
>           size_t pkt_len;
> @@ -1209,8 +1259,14 @@ static void *multifd_recv_thread(void *opaque)
>                   break;
>               }
>
> -            pkt_buf = (uint8_t *)p->packet + sizeof(hdr);
> -            pkt_len = p->packet_len - sizeof(hdr);
> +            is_device_state = p->flags & MULTIFD_FLAG_DEVICE_STATE;
> +            if (is_device_state) {
> +                pkt_buf = (uint8_t *)p->packet_dev_state + sizeof(hdr);
> +                pkt_len = sizeof(*p->packet_dev_state) - sizeof(hdr);
> +            } else {
> +                pkt_buf = (uint8_t *)p->packet + sizeof(hdr);
> +                pkt_len = p->packet_len - sizeof(hdr);
> +            }
>
>               ret = qio_channel_read_all_eof(p->c, (char *)pkt_buf, pkt_len,
>                                              &local_err);
> @@ -1235,12 +1291,17 @@ static void *multifd_recv_thread(void *opaque)
>               /* recv methods don't know how to handle the SYNC flag */
>               p->flags &= ~MULTIFD_FLAG_SYNC;
>
> -            /*
> -             * Even if it's a SYNC packet, this needs to be set
> -             * because older QEMUs (<9.0) still send data along with
> -             * the SYNC packet.
> -             */
> -            has_data = p->normal_num || p->zero_num;
> +            if (is_device_state) {
> +                has_data = p->next_packet_size > 0;
> +            } else {
> +                /*
> +                 * Even if it's a SYNC packet, this needs to be set
> +                 * because older QEMUs (<9.0) still send data along with
> +                 * the SYNC packet.
> +                 */
> +                has_data = p->normal_num || p->zero_num;
> +            }
> +
>               qemu_mutex_unlock(&p->mutex);
>           } else {
>               /*
> @@ -1269,14 +1330,29 @@ static void *multifd_recv_thread(void *opaque)
>           }
>
>           if (has_data) {
> -            ret = multifd_recv_state->ops->recv(p, &local_err);
> +            if (is_device_state) {
> +                assert(use_packets);
> +                ret = multifd_device_state_recv(p, &local_err);
> +            } else {
> +                ret = multifd_recv_state->ops->recv(p, &local_err);
> +            }
>               if (ret != 0) {
>                   break;
>               }
> +        } else if (is_device_state) {
> +            error_setg(&local_err,
> +                       "multifd: received empty device state packet");
> +            break;
>           }
>
>           if (use_packets) {
>               if (flags & MULTIFD_FLAG_SYNC) {
> +                if (is_device_state) {
> +                    error_setg(&local_err,
> +                               "multifd: received SYNC device state packet");
> +                    break;
> +                }
> +
>                   qemu_sem_post(&multifd_recv_state->sem_sync);
>                   qemu_sem_wait(&p->sem_sync);
>               }
> @@ -1345,6 +1421,7 @@ int multifd_recv_setup(Error **errp)
>               p->packet_len = sizeof(MultiFDPacket_t)
>                   + sizeof(uint64_t) * page_count;
>               p->packet = g_malloc0(p->packet_len);
> +            p->packet_dev_state = g_malloc0(sizeof(*p->packet_dev_state));
>           }
>           p->name = g_strdup_printf(MIGRATION_THREAD_DST_MULTIFD, i);
>           p->normal = g_new0(ram_addr_t, page_count);
> diff --git a/migration/multifd.h b/migration/multifd.h
> index f7156f66c0f6..c2ebef2d319e 100644
> --- a/migration/multifd.h
> +++ b/migration/multifd.h
> @@ -62,6 +62,12 @@ MultiFDRecvData *multifd_get_recv_data(void);
>   #define MULTIFD_FLAG_UADK (8 << 1)
>   #define MULTIFD_FLAG_QATZIP (16 << 1)
>
> +/*
> + * If set it means that this packet contains device state
> + * (MultiFDPacketDeviceState_t), not RAM data (MultiFDPacket_t).
> + */
> +#define MULTIFD_FLAG_DEVICE_STATE (32 << 1)
> +
>   /* This value needs to be a multiple of qemu_target_page_size() */
>   #define MULTIFD_PACKET_SIZE (512 * 1024)
>
> @@ -94,6 +100,16 @@ typedef struct {
>       uint64_t offset[];
>   } __attribute__((packed)) MultiFDPacket_t;
>
> +typedef struct {
> +    MultiFDPacketHdr_t hdr;
> +
> +    char idstr[256] QEMU_NONSTRING;
> +    uint32_t instance_id;
> +
> +    /* size of the next packet that contains the actual data */
> +    uint32_t next_packet_size;
> +} __attribute__((packed)) MultiFDPacketDeviceState_t;
> +
>   typedef struct {
>       /* number of used pages */
>       uint32_t num;
> @@ -111,6 +127,13 @@ struct MultiFDRecvData {
>       off_t file_offset;
>   };
>
> +typedef struct {
> +    char *idstr;
> +    uint32_t instance_id;
> +    char *buf;
> +    size_t buf_len;
> +} MultiFDDeviceState_t;

This is only used in patch #14. Maybe move it there?

Thanks.

> +
>   typedef enum {
>       MULTIFD_PAYLOAD_NONE,
>       MULTIFD_PAYLOAD_RAM,
> @@ -227,8 +250,9 @@ typedef struct {
>
>       /* thread local variables. No locking required */
>
> -    /* pointer to the packet */
> +    /* pointers to the possible packet types */
>       MultiFDPacket_t *packet;
> +    MultiFDPacketDeviceState_t *packet_dev_state;
>       /* size of the next packet that contains pages */
>       uint32_t next_packet_size;
>       /* packets received through this channel */


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v5 14/36] migration/multifd: Device state transfer support - send side
  2025-02-19 20:33 ` [PATCH v5 14/36] migration/multifd: Device state transfer support - send side Maciej S. Szmigiero
@ 2025-03-02 12:46   ` Avihai Horon
  2025-03-03 22:15     ` Maciej S. Szmigiero
  0 siblings, 1 reply; 120+ messages in thread
From: Avihai Horon @ 2025-03-02 12:46 UTC (permalink / raw)
  To: Maciej S. Szmigiero, Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Joao Martins,
	qemu-devel


On 19/02/2025 22:33, Maciej S. Szmigiero wrote:
> External email: Use caution opening links or attachments
>
>
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>
> A new function multifd_queue_device_state() is provided for device to queue
> its state for transmission via a multifd channel.
>
> Reviewed-by: Peter Xu <peterx@redhat.com>
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> ---
>   include/migration/misc.h         |   4 ++
>   migration/meson.build            |   1 +
>   migration/multifd-device-state.c | 115 +++++++++++++++++++++++++++++++
>   migration/multifd-nocomp.c       |  14 +++-
>   migration/multifd.c              |  42 +++++++++--
>   migration/multifd.h              |  27 +++++---
>   6 files changed, 187 insertions(+), 16 deletions(-)
>   create mode 100644 migration/multifd-device-state.c
>
> diff --git a/include/migration/misc.h b/include/migration/misc.h
> index 4c171f4e897e..bd3b725fa0b7 100644
> --- a/include/migration/misc.h
> +++ b/include/migration/misc.h
> @@ -118,4 +118,8 @@ bool migrate_is_uri(const char *uri);
>   bool migrate_uri_parse(const char *uri, MigrationChannel **channel,
>                          Error **errp);
>
> +/* migration/multifd-device-state.c */
> +bool multifd_queue_device_state(char *idstr, uint32_t instance_id,
> +                                char *data, size_t len);
> +
>   #endif
> diff --git a/migration/meson.build b/migration/meson.build
> index d3bfe84d6204..9aa48b290e2a 100644
> --- a/migration/meson.build
> +++ b/migration/meson.build
> @@ -25,6 +25,7 @@ system_ss.add(files(
>     'migration-hmp-cmds.c',
>     'migration.c',
>     'multifd.c',
> +  'multifd-device-state.c',
>     'multifd-nocomp.c',
>     'multifd-zlib.c',
>     'multifd-zero-page.c',
> diff --git a/migration/multifd-device-state.c b/migration/multifd-device-state.c
> new file mode 100644
> index 000000000000..ab83773e2d62
> --- /dev/null
> +++ b/migration/multifd-device-state.c
> @@ -0,0 +1,115 @@
> +/*
> + * Multifd device state migration
> + *
> + * Copyright (C) 2024,2025 Oracle and/or its affiliates.
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + */
> +
> +#include "qemu/osdep.h"
> +#include "qemu/lockable.h"
> +#include "migration/misc.h"
> +#include "multifd.h"
> +
> +static struct {
> +    QemuMutex queue_job_mutex;
> +
> +    MultiFDSendData *send_data;
> +} *multifd_send_device_state;
> +
> +size_t multifd_device_state_payload_size(void)
> +{
> +    return sizeof(MultiFDDeviceState_t);
> +}
> +
> +void multifd_device_state_send_setup(void)
> +{
> +    assert(!multifd_send_device_state);
> +    multifd_send_device_state = g_malloc(sizeof(*multifd_send_device_state));
> +
> +    qemu_mutex_init(&multifd_send_device_state->queue_job_mutex);
> +
> +    multifd_send_device_state->send_data = multifd_send_data_alloc();
> +}
> +
> +void multifd_device_state_send_cleanup(void)
> +{
> +    g_clear_pointer(&multifd_send_device_state->send_data,
> +                    multifd_send_data_free);
> +
> +    qemu_mutex_destroy(&multifd_send_device_state->queue_job_mutex);
> +
> +    g_clear_pointer(&multifd_send_device_state, g_free);
> +}
> +
> +void multifd_send_data_clear_device_state(MultiFDDeviceState_t *device_state)
> +{
> +    g_clear_pointer(&device_state->idstr, g_free);
> +    g_clear_pointer(&device_state->buf, g_free);
> +}
> +
> +static void multifd_device_state_fill_packet(MultiFDSendParams *p)
> +{
> +    MultiFDDeviceState_t *device_state = &p->data->u.device_state;
> +    MultiFDPacketDeviceState_t *packet = p->packet_device_state;
> +
> +    packet->hdr.flags = cpu_to_be32(p->flags);
> +    strncpy(packet->idstr, device_state->idstr, sizeof(packet->idstr));

(I think we talked about this in v2):
Looking at the idstr creation code, idstr is always NULL-terminated. It's
also treated everywhere as a NULL-terminated string.
For consistency and to avoid confusion, I'd treat it as a NULL-terminated
string here too (use strcpy(), remove the QEMU_NONSTRING from its
definition, etc.).
This will also avoid strncpy()'s unnecessary zeroing of the extra bytes.

Thanks.

> +    packet->instance_id = cpu_to_be32(device_state->instance_id);
> +    packet->next_packet_size = cpu_to_be32(p->next_packet_size);
> +}
> +
> +static void multifd_prepare_header_device_state(MultiFDSendParams *p)
> +{
> +    p->iov[0].iov_len = sizeof(*p->packet_device_state);
> +    p->iov[0].iov_base = p->packet_device_state;
> +    p->iovs_num++;
> +}
> +
> +void multifd_device_state_send_prepare(MultiFDSendParams *p)
> +{
> +    MultiFDDeviceState_t *device_state = &p->data->u.device_state;
> +
> +    assert(multifd_payload_device_state(p->data));
> +
> +    multifd_prepare_header_device_state(p);
> +
> +    assert(!(p->flags & MULTIFD_FLAG_SYNC));
> +
> +    p->next_packet_size = device_state->buf_len;
> +    if (p->next_packet_size > 0) {
> +        p->iov[p->iovs_num].iov_base = device_state->buf;
> +        p->iov[p->iovs_num].iov_len = p->next_packet_size;
> +        p->iovs_num++;
> +    }
> +
> +    p->flags |= MULTIFD_FLAG_NOCOMP | MULTIFD_FLAG_DEVICE_STATE;
> +
> +    multifd_device_state_fill_packet(p);
> +}
> +
> +bool multifd_queue_device_state(char *idstr, uint32_t instance_id,
> +                                char *data, size_t len)
> +{
> +    /* Device state submissions can come from multiple threads */
> +    QEMU_LOCK_GUARD(&multifd_send_device_state->queue_job_mutex);
> +    MultiFDDeviceState_t *device_state;
> +
> +    assert(multifd_payload_empty(multifd_send_device_state->send_data));
> +
> +    multifd_set_payload_type(multifd_send_device_state->send_data,
> +                             MULTIFD_PAYLOAD_DEVICE_STATE);
> +    device_state = &multifd_send_device_state->send_data->u.device_state;
> +    device_state->idstr = g_strdup(idstr);
> +    device_state->instance_id = instance_id;
> +    device_state->buf = g_memdup2(data, len);
> +    device_state->buf_len = len;
> +
> +    if (!multifd_send(&multifd_send_device_state->send_data)) {
> +        multifd_send_data_clear(multifd_send_device_state->send_data);
> +        return false;
> +    }
> +
> +    return true;
> +}
> diff --git a/migration/multifd-nocomp.c b/migration/multifd-nocomp.c
> index e46e79d8b272..c00804652383 100644
> --- a/migration/multifd-nocomp.c
> +++ b/migration/multifd-nocomp.c
> @@ -14,6 +14,7 @@
>   #include "exec/ramblock.h"
>   #include "exec/target_page.h"
>   #include "file.h"
> +#include "migration-stats.h"
>   #include "multifd.h"
>   #include "options.h"
>   #include "qapi/error.h"
> @@ -85,6 +86,13 @@ static void multifd_nocomp_send_cleanup(MultiFDSendParams *p, Error **errp)
>       return;
>   }
>
> +static void multifd_ram_prepare_header(MultiFDSendParams *p)
> +{
> +    p->iov[0].iov_len = p->packet_len;
> +    p->iov[0].iov_base = p->packet;
> +    p->iovs_num++;
> +}
> +
>   static void multifd_send_prepare_iovs(MultiFDSendParams *p)
>   {
>       MultiFDPages_t *pages = &p->data->u.ram;
> @@ -118,7 +126,7 @@ static int multifd_nocomp_send_prepare(MultiFDSendParams *p, Error **errp)
>            * Only !zerocopy needs the header in IOV; zerocopy will
>            * send it separately.
>            */
> -        multifd_send_prepare_header(p);
> +        multifd_ram_prepare_header(p);
>       }
>
>       multifd_send_prepare_iovs(p);
> @@ -133,6 +141,8 @@ static int multifd_nocomp_send_prepare(MultiFDSendParams *p, Error **errp)
>           if (ret != 0) {
>               return -1;
>           }
> +
> +        stat64_add(&mig_stats.multifd_bytes, p->packet_len);
>       }
>
>       return 0;
> @@ -431,7 +441,7 @@ int multifd_ram_flush_and_sync(QEMUFile *f)
>   bool multifd_send_prepare_common(MultiFDSendParams *p)
>   {
>       MultiFDPages_t *pages = &p->data->u.ram;
> -    multifd_send_prepare_header(p);
> +    multifd_ram_prepare_header(p);
>       multifd_send_zero_page_detect(p);
>
>       if (!pages->normal_num) {
> diff --git a/migration/multifd.c b/migration/multifd.c
> index 0092547a4f97..3394c2ae12fd 100644
> --- a/migration/multifd.c
> +++ b/migration/multifd.c
> @@ -12,6 +12,7 @@
>
>   #include "qemu/osdep.h"
>   #include "qemu/cutils.h"
> +#include "qemu/iov.h"
>   #include "qemu/rcu.h"
>   #include "exec/target_page.h"
>   #include "system/system.h"
> @@ -19,6 +20,7 @@
>   #include "qemu/error-report.h"
>   #include "qapi/error.h"
>   #include "file.h"
> +#include "migration/misc.h"
>   #include "migration.h"
>   #include "migration-stats.h"
>   #include "savevm.h"
> @@ -111,7 +113,9 @@ MultiFDSendData *multifd_send_data_alloc(void)
>        * added to the union in the future are larger than
>        * (MultiFDPages_t + flex array).
>        */
> -    max_payload_size = MAX(multifd_ram_payload_size(), sizeof(MultiFDPayload));
> +    max_payload_size = MAX(multifd_ram_payload_size(),
> +                           multifd_device_state_payload_size());
> +    max_payload_size = MAX(max_payload_size, sizeof(MultiFDPayload));
>
>       /*
>        * Account for any holes the compiler might insert. We can't pack
> @@ -130,6 +134,9 @@ void multifd_send_data_clear(MultiFDSendData *data)
>       }
>
>       switch (data->type) {
> +    case MULTIFD_PAYLOAD_DEVICE_STATE:
> +        multifd_send_data_clear_device_state(&data->u.device_state);
> +        break;
>       default:
>           /* Nothing to do */
>           break;
> @@ -232,6 +239,7 @@ static int multifd_recv_initial_packet(QIOChannel *c, Error **errp)
>       return msg.id;
>   }
>
> +/* Fills a RAM multifd packet */
>   void multifd_send_fill_packet(MultiFDSendParams *p)
>   {
>       MultiFDPacket_t *packet = p->packet;
> @@ -524,6 +532,7 @@ static bool multifd_send_cleanup_channel(MultiFDSendParams *p, Error **errp)
>       p->name = NULL;
>       g_clear_pointer(&p->data, multifd_send_data_free);
>       p->packet_len = 0;
> +    g_clear_pointer(&p->packet_device_state, g_free);
>       g_free(p->packet);
>       p->packet = NULL;
>       multifd_send_state->ops->send_cleanup(p, errp);
> @@ -536,6 +545,7 @@ static void multifd_send_cleanup_state(void)
>   {
>       file_cleanup_outgoing_migration();
>       socket_cleanup_outgoing_migration();
> +    multifd_device_state_send_cleanup();
>       qemu_sem_destroy(&multifd_send_state->channels_created);
>       qemu_sem_destroy(&multifd_send_state->channels_ready);
>       qemu_mutex_destroy(&multifd_send_state->multifd_send_mutex);
> @@ -694,16 +704,32 @@ static void *multifd_send_thread(void *opaque)
>            * qatomic_store_release() in multifd_send().
>            */
>           if (qatomic_load_acquire(&p->pending_job)) {
> +            bool is_device_state = multifd_payload_device_state(p->data);
> +            size_t total_size;
> +
>               p->flags = 0;
>               p->iovs_num = 0;
>               assert(!multifd_payload_empty(p->data));
>
> -            ret = multifd_send_state->ops->send_prepare(p, &local_err);
> -            if (ret != 0) {
> -                break;
> +            if (is_device_state) {
> +                multifd_device_state_send_prepare(p);
> +            } else {
> +                ret = multifd_send_state->ops->send_prepare(p, &local_err);
> +                if (ret != 0) {
> +                    break;
> +                }
>               }
>
> +            /*
> +             * The packet header in the zerocopy RAM case is accounted for
> +             * in multifd_nocomp_send_prepare() - where it is actually
> +             * being sent.
> +             */
> +            total_size = iov_size(p->iov, p->iovs_num);
> +
>               if (migrate_mapped_ram()) {
> +                assert(!is_device_state);
> +
>                   ret = file_write_ramblock_iov(p->c, p->iov, p->iovs_num,
>                                                 &p->data->u.ram, &local_err);
>               } else {
> @@ -716,8 +742,7 @@ static void *multifd_send_thread(void *opaque)
>                   break;
>               }
>
> -            stat64_add(&mig_stats.multifd_bytes,
> -                       (uint64_t)p->next_packet_size + p->packet_len);
> +            stat64_add(&mig_stats.multifd_bytes, total_size);
>
>               p->next_packet_size = 0;
>               multifd_send_data_clear(p->data);
> @@ -938,6 +963,9 @@ bool multifd_send_setup(void)
>               p->packet_len = sizeof(MultiFDPacket_t)
>                             + sizeof(uint64_t) * page_count;
>               p->packet = g_malloc0(p->packet_len);
> +            p->packet_device_state = g_malloc0(sizeof(*p->packet_device_state));
> +            p->packet_device_state->hdr.magic = cpu_to_be32(MULTIFD_MAGIC);
> +            p->packet_device_state->hdr.version = cpu_to_be32(MULTIFD_VERSION);
>           }
>           p->name = g_strdup_printf(MIGRATION_THREAD_SRC_MULTIFD, i);
>           p->write_flags = 0;
> @@ -973,6 +1001,8 @@ bool multifd_send_setup(void)
>           assert(p->iov);
>       }
>
> +    multifd_device_state_send_setup();
> +
>       return true;
>
>   err:
> diff --git a/migration/multifd.h b/migration/multifd.h
> index 20a4bba58ef4..883a43c1d79e 100644
> --- a/migration/multifd.h
> +++ b/migration/multifd.h
> @@ -137,10 +137,12 @@ typedef struct {
>   typedef enum {
>       MULTIFD_PAYLOAD_NONE,
>       MULTIFD_PAYLOAD_RAM,
> +    MULTIFD_PAYLOAD_DEVICE_STATE,
>   } MultiFDPayloadType;
>
>   typedef union MultiFDPayload {
>       MultiFDPages_t ram;
> +    MultiFDDeviceState_t device_state;
>   } MultiFDPayload;
>
>   struct MultiFDSendData {
> @@ -153,6 +155,11 @@ static inline bool multifd_payload_empty(MultiFDSendData *data)
>       return data->type == MULTIFD_PAYLOAD_NONE;
>   }
>
> +static inline bool multifd_payload_device_state(MultiFDSendData *data)
> +{
> +    return data->type == MULTIFD_PAYLOAD_DEVICE_STATE;
> +}
> +
>   static inline void multifd_set_payload_type(MultiFDSendData *data,
>                                               MultiFDPayloadType type)
>   {
> @@ -205,8 +212,9 @@ typedef struct {
>
>       /* thread local variables. No locking required */
>
> -    /* pointer to the packet */
> +    /* pointers to the possible packet types */
>       MultiFDPacket_t *packet;
> +    MultiFDPacketDeviceState_t *packet_device_state;
>       /* size of the next packet that contains pages */
>       uint32_t next_packet_size;
>       /* packets sent through this channel */
> @@ -365,13 +373,6 @@ bool multifd_send_prepare_common(MultiFDSendParams *p);
>   void multifd_send_zero_page_detect(MultiFDSendParams *p);
>   void multifd_recv_zero_page_process(MultiFDRecvParams *p);
>
> -static inline void multifd_send_prepare_header(MultiFDSendParams *p)
> -{
> -    p->iov[0].iov_len = p->packet_len;
> -    p->iov[0].iov_base = p->packet;
> -    p->iovs_num++;
> -}
> -
>   void multifd_channel_connect(MultiFDSendParams *p, QIOChannel *ioc);
>   bool multifd_send(MultiFDSendData **send_data);
>   MultiFDSendData *multifd_send_data_alloc(void);
> @@ -396,4 +397,14 @@ bool multifd_ram_sync_per_section(void);
>   size_t multifd_ram_payload_size(void);
>   void multifd_ram_fill_packet(MultiFDSendParams *p);
>   int multifd_ram_unfill_packet(MultiFDRecvParams *p, Error **errp);
> +
> +size_t multifd_device_state_payload_size(void);
> +
> +void multifd_send_data_clear_device_state(MultiFDDeviceState_t *device_state);
> +
> +void multifd_device_state_send_setup(void);
> +void multifd_device_state_send_cleanup(void);
> +
> +void multifd_device_state_send_prepare(MultiFDSendParams *p);
> +
>   #endif



* Re: [PATCH v5 23/36] vfio/migration: Multifd device state transfer support - VFIOStateBuffer(s)
  2025-02-19 20:34 ` [PATCH v5 23/36] vfio/migration: Multifd device state transfer support - VFIOStateBuffer(s) Maciej S. Szmigiero
  2025-02-26  8:54   ` Cédric Le Goater
@ 2025-03-02 13:00   ` Avihai Horon
  2025-03-02 15:14     ` Maciej S. Szmigiero
  2025-03-03  6:42     ` Cédric Le Goater
  1 sibling, 2 replies; 120+ messages in thread
From: Avihai Horon @ 2025-03-02 13:00 UTC (permalink / raw)
  To: Maciej S. Szmigiero, Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Joao Martins,
	qemu-devel


On 19/02/2025 22:34, Maciej S. Szmigiero wrote:
>
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>
> Add VFIOStateBuffer(s) types and the associated methods.
>
> These store received device state buffers and config state waiting to get
> loaded into the device.
>
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> ---
>   hw/vfio/migration-multifd.c | 54 +++++++++++++++++++++++++++++++++++++
>   1 file changed, 54 insertions(+)
>
> diff --git a/hw/vfio/migration-multifd.c b/hw/vfio/migration-multifd.c
> index 0c3185a26242..760b110a39b9 100644
> --- a/hw/vfio/migration-multifd.c
> +++ b/hw/vfio/migration-multifd.c
> @@ -29,3 +29,57 @@ typedef struct VFIODeviceStatePacket {
>       uint32_t flags;
>       uint8_t data[0];
>   } QEMU_PACKED VFIODeviceStatePacket;
> +
> +/* type safety */
> +typedef struct VFIOStateBuffers {
> +    GArray *array;
> +} VFIOStateBuffers;
> +
> +typedef struct VFIOStateBuffer {
> +    bool is_present;
> +    char *data;
> +    size_t len;
> +} VFIOStateBuffer;
> +
> +static void vfio_state_buffer_clear(gpointer data)
> +{
> +    VFIOStateBuffer *lb = data;
> +
> +    if (!lb->is_present) {
> +        return;
> +    }
> +
> +    g_clear_pointer(&lb->data, g_free);
> +    lb->is_present = false;
> +}
> +
> +static void vfio_state_buffers_init(VFIOStateBuffers *bufs)
> +{
> +    bufs->array = g_array_new(FALSE, TRUE, sizeof(VFIOStateBuffer));
> +    g_array_set_clear_func(bufs->array, vfio_state_buffer_clear);
> +}
> +
> +static void vfio_state_buffers_destroy(VFIOStateBuffers *bufs)
> +{
> +    g_clear_pointer(&bufs->array, g_array_unref);
> +}
> +
> +static void vfio_state_buffers_assert_init(VFIOStateBuffers *bufs)
> +{
> +    assert(bufs->array);
> +}
> +
> +static guint vfio_state_buffers_size_get(VFIOStateBuffers *bufs)
> +{
> +    return bufs->array->len;
> +}
> +
> +static void vfio_state_buffers_size_set(VFIOStateBuffers *bufs, guint size)
> +{
> +    g_array_set_size(bufs->array, size);
> +}
> +
> +static VFIOStateBuffer *vfio_state_buffers_at(VFIOStateBuffers *bufs, guint idx)
> +{
> +    return &g_array_index(bufs->array, VFIOStateBuffer, idx);
> +}

This patch breaks compilation as none of the functions are used, e.g.:
error: ‘vfio_state_buffers_init’ defined but not used

I can think of three options to solve it:
1. Move these functions to their own file and export them, e.g.,
   hw/vfio/state-buffer.{c,h}. But this seems like overkill for such a
   small API.
2. Add __attribute__((unused)) tags and remove them in patch #26 where
   the functions are actually used. A bit ugly.
3. Squash this patch into patch #26.

I prefer option 3, as this is a small API closely related to patch #26
(and patch #26 will still remain rather small).

Thanks.



* Re: [PATCH v5 26/36] vfio/migration: Multifd device state transfer support - received buffers queuing
  2025-02-19 20:34 ` [PATCH v5 26/36] vfio/migration: Multifd device state transfer support - received buffers queuing Maciej S. Szmigiero
  2025-02-26 10:43   ` Cédric Le Goater
@ 2025-03-02 13:12   ` Avihai Horon
  2025-03-03 22:15     ` Maciej S. Szmigiero
  1 sibling, 1 reply; 120+ messages in thread
From: Avihai Horon @ 2025-03-02 13:12 UTC (permalink / raw)
  To: Maciej S. Szmigiero, Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Joao Martins,
	qemu-devel


On 19/02/2025 22:34, Maciej S. Szmigiero wrote:
>
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>
> The multifd received data needs to be reassembled since device state
> packets sent via different multifd channels can arrive out-of-order.
>
> Therefore, each VFIO device state packet carries a header indicating its
> position in the stream.
> The raw device state data is saved into a VFIOStateBuffer for later
> in-order loading into the device.
>
> The last such VFIO device state packet should have
> VFIO_DEVICE_STATE_CONFIG_STATE flag set and carry the device config state.
>
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> ---
>   hw/vfio/migration-multifd.c | 103 ++++++++++++++++++++++++++++++++++++
>   hw/vfio/migration-multifd.h |   3 ++
>   hw/vfio/migration.c         |   1 +
>   hw/vfio/trace-events        |   1 +
>   4 files changed, 108 insertions(+)
>
> diff --git a/hw/vfio/migration-multifd.c b/hw/vfio/migration-multifd.c
> index c2defc0efef0..5d5ee1393674 100644
> --- a/hw/vfio/migration-multifd.c
> +++ b/hw/vfio/migration-multifd.c
> @@ -42,6 +42,11 @@ typedef struct VFIOStateBuffer {
>   } VFIOStateBuffer;
>
>   typedef struct VFIOMultifd {
> +    VFIOStateBuffers load_bufs;
> +    QemuCond load_bufs_buffer_ready_cond;
> +    QemuMutex load_bufs_mutex; /* Lock order: this lock -> BQL */
> +    uint32_t load_buf_idx;
> +    uint32_t load_buf_idx_last;
>   } VFIOMultifd;
>
>   static void vfio_state_buffer_clear(gpointer data)
> @@ -87,15 +92,113 @@ static VFIOStateBuffer *vfio_state_buffers_at(VFIOStateBuffers *bufs, guint idx)
>       return &g_array_index(bufs->array, VFIOStateBuffer, idx);
>   }
>
> +static bool vfio_load_state_buffer_insert(VFIODevice *vbasedev,
> +                                          VFIODeviceStatePacket *packet,
> +                                          size_t packet_total_size,
> +                                          Error **errp)
> +{
> +    VFIOMigration *migration = vbasedev->migration;
> +    VFIOMultifd *multifd = migration->multifd;
> +    VFIOStateBuffer *lb;
> +
> +    vfio_state_buffers_assert_init(&multifd->load_bufs);
> +    if (packet->idx >= vfio_state_buffers_size_get(&multifd->load_bufs)) {
> +        vfio_state_buffers_size_set(&multifd->load_bufs, packet->idx + 1);
> +    }
> +
> +    lb = vfio_state_buffers_at(&multifd->load_bufs, packet->idx);
> +    if (lb->is_present) {
> +        error_setg(errp, "state buffer %" PRIu32 " already filled",
> +                   packet->idx);

Let's add vbasedev->name to the error message so we know which device 
caused the error.

> +        return false;
> +    }
> +
> +    assert(packet->idx >= multifd->load_buf_idx);
> +
> +    lb->data = g_memdup2(&packet->data, packet_total_size - sizeof(*packet));
> +    lb->len = packet_total_size - sizeof(*packet);
> +    lb->is_present = true;
> +
> +    return true;
> +}
> +
> +bool vfio_load_state_buffer(void *opaque, char *data, size_t data_size,
> +                            Error **errp)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +    VFIOMultifd *multifd = migration->multifd;
> +    VFIODeviceStatePacket *packet = (VFIODeviceStatePacket *)data;
> +
> +    /*
> +     * Holding BQL here would violate the lock order and can cause
> +     * a deadlock once we attempt to lock load_bufs_mutex below.
> +     */
> +    assert(!bql_locked());

To be clearer, I'd move the assert down to be just above 
"QEMU_LOCK_GUARD(&multifd->load_bufs_mutex);".

> +
> +    if (!vfio_multifd_transfer_enabled(vbasedev)) {
> +        error_setg(errp,
> +                   "got device state packet but not doing multifd transfer");

Let's add vbasedev->name to the error message so we know which device 
caused the error.

> +        return false;
> +    }
> +
> +    assert(multifd);
> +
> +    if (data_size < sizeof(*packet)) {
> +        error_setg(errp, "packet too short at %zu (min is %zu)",
> +                   data_size, sizeof(*packet));

Ditto.

> +        return false;
> +    }
> +
> +    if (packet->version != VFIO_DEVICE_STATE_PACKET_VER_CURRENT) {
> +        error_setg(errp, "packet has unknown version %" PRIu32,
> +                   packet->version);

Ditto.

> +        return false;
> +    }
> +
> +    if (packet->idx == UINT32_MAX) {
> +        error_setg(errp, "packet has too high idx");

Ditto.

> +        return false;
> +    }
> +
> +    trace_vfio_load_state_device_buffer_incoming(vbasedev->name, packet->idx);
> +
> +    QEMU_LOCK_GUARD(&multifd->load_bufs_mutex);
> +
> +    /* config state packet should be the last one in the stream */
> +    if (packet->flags & VFIO_DEVICE_STATE_CONFIG_STATE) {
> +        multifd->load_buf_idx_last = packet->idx;
> +    }
> +
> +    if (!vfio_load_state_buffer_insert(vbasedev, packet, data_size, errp)) {
> +        return false;
> +    }
> +
> +    qemu_cond_signal(&multifd->load_bufs_buffer_ready_cond);
> +
> +    return true;
> +}
> +
>   VFIOMultifd *vfio_multifd_new(void)
>   {
>       VFIOMultifd *multifd = g_new(VFIOMultifd, 1);
>
> +    vfio_state_buffers_init(&multifd->load_bufs);
> +
> +    qemu_mutex_init(&multifd->load_bufs_mutex);

Nit: move qemu_mutex_init() just above qemu_cond_init()?

Thanks.

> +
> +    multifd->load_buf_idx = 0;
> +    multifd->load_buf_idx_last = UINT32_MAX;
> +    qemu_cond_init(&multifd->load_bufs_buffer_ready_cond);
> +
>       return multifd;
>   }
>
>   void vfio_multifd_free(VFIOMultifd *multifd)
>   {
> +    qemu_cond_destroy(&multifd->load_bufs_buffer_ready_cond);
> +    qemu_mutex_destroy(&multifd->load_bufs_mutex);
> +
>       g_free(multifd);
>   }
>
> diff --git a/hw/vfio/migration-multifd.h b/hw/vfio/migration-multifd.h
> index 1eefba3b2eed..d5ab7d6f85f5 100644
> --- a/hw/vfio/migration-multifd.h
> +++ b/hw/vfio/migration-multifd.h
> @@ -22,4 +22,7 @@ bool vfio_multifd_transfer_enabled(VFIODevice *vbasedev);
>
>   bool vfio_multifd_transfer_setup(VFIODevice *vbasedev, Error **errp);
>
> +bool vfio_load_state_buffer(void *opaque, char *data, size_t data_size,
> +                            Error **errp);
> +
>   #endif
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 4311de763885..abaf4d08d4a9 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -806,6 +806,7 @@ static const SaveVMHandlers savevm_vfio_handlers = {
>       .load_setup = vfio_load_setup,
>       .load_cleanup = vfio_load_cleanup,
>       .load_state = vfio_load_state,
> +    .load_state_buffer = vfio_load_state_buffer,
>       .switchover_ack_needed = vfio_switchover_ack_needed,
>   };
>
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index 1bebe9877d88..042a3dc54a33 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -153,6 +153,7 @@ vfio_load_device_config_state_start(const char *name) " (%s)"
>   vfio_load_device_config_state_end(const char *name) " (%s)"
>   vfio_load_state(const char *name, uint64_t data) " (%s) data 0x%"PRIx64
>   vfio_load_state_device_data(const char *name, uint64_t data_size, int ret) " (%s) size %"PRIu64" ret %d"
> +vfio_load_state_device_buffer_incoming(const char *name, uint32_t idx) " (%s) idx %"PRIu32
>   vfio_migration_realize(const char *name) " (%s)"
>   vfio_migration_set_device_state(const char *name, const char *state) " (%s) state %s"
>   vfio_migration_set_state(const char *name, const char *new_state, const char *recover_state) " (%s) new state %s, recover state %s"


* Re: [PATCH v5 27/36] vfio/migration: Multifd device state transfer support - load thread
  2025-02-19 20:34 ` [PATCH v5 27/36] vfio/migration: Multifd device state transfer support - load thread Maciej S. Szmigiero
  2025-02-26 13:49   ` Cédric Le Goater
@ 2025-03-02 14:15   ` Avihai Horon
  2025-03-03 22:16     ` Maciej S. Szmigiero
  1 sibling, 1 reply; 120+ messages in thread
From: Avihai Horon @ 2025-03-02 14:15 UTC (permalink / raw)
  To: Maciej S. Szmigiero, Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Joao Martins,
	qemu-devel


On 19/02/2025 22:34, Maciej S. Szmigiero wrote:
>
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

Maybe add a sentence talking about the load thread itself first? E.g.:

Add a thread which loads the VFIO device state buffers that were
received via multifd.
Each VFIO device that has multifd device state transfer enabled has one
such thread, which is created using the migration core API
qemu_loadvm_start_load_thread().

Since it's important to finish...

> Since it's important to finish loading device state transferred via the
> main migration channel (via save_live_iterate SaveVMHandler) before
> starting loading the data asynchronously transferred via multifd the thread
> doing the actual loading of the multifd transferred data is only started
> from switchover_start SaveVMHandler.
>
> switchover_start handler is called when MIG_CMD_SWITCHOVER_START
> sub-command of QEMU_VM_COMMAND is received via the main migration channel.
>
> This sub-command is only sent after all save_live_iterate data have already
> been posted so it is safe to commence loading of the multifd-transferred
> device state upon receiving it - loading of save_live_iterate data happens
> synchronously in the main migration thread (much like the processing of
> MIG_CMD_SWITCHOVER_START) so by the time MIG_CMD_SWITCHOVER_START is
> processed all the proceeding data must have already been loaded.
>
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> ---
>   hw/vfio/migration-multifd.c | 225 ++++++++++++++++++++++++++++++++++++
>   hw/vfio/migration-multifd.h |   2 +
>   hw/vfio/migration.c         |  12 ++
>   hw/vfio/trace-events        |   5 +
>   4 files changed, 244 insertions(+)
>
> diff --git a/hw/vfio/migration-multifd.c b/hw/vfio/migration-multifd.c
> index 5d5ee1393674..b3a88c062769 100644
> --- a/hw/vfio/migration-multifd.c
> +++ b/hw/vfio/migration-multifd.c
> @@ -42,8 +42,13 @@ typedef struct VFIOStateBuffer {
>   } VFIOStateBuffer;
>
>   typedef struct VFIOMultifd {
> +    QemuThread load_bufs_thread;

This can be dropped.

> +    bool load_bufs_thread_running;
> +    bool load_bufs_thread_want_exit;
> +
>       VFIOStateBuffers load_bufs;
>       QemuCond load_bufs_buffer_ready_cond;
> +    QemuCond load_bufs_thread_finished_cond;
>       QemuMutex load_bufs_mutex; /* Lock order: this lock -> BQL */
>       uint32_t load_buf_idx;
>       uint32_t load_buf_idx_last;
> @@ -179,6 +184,175 @@ bool vfio_load_state_buffer(void *opaque, char *data, size_t data_size,
>       return true;
>   }
>
> +static int vfio_load_bufs_thread_load_config(VFIODevice *vbasedev)
> +{
> +    return -EINVAL;
> +}
> +
> +static VFIOStateBuffer *vfio_load_state_buffer_get(VFIOMultifd *multifd)
> +{
> +    VFIOStateBuffer *lb;
> +    guint bufs_len;
> +
> +    bufs_len = vfio_state_buffers_size_get(&multifd->load_bufs);
> +    if (multifd->load_buf_idx >= bufs_len) {
> +        assert(multifd->load_buf_idx == bufs_len);
> +        return NULL;
> +    }
> +
> +    lb = vfio_state_buffers_at(&multifd->load_bufs,
> +                               multifd->load_buf_idx);
> +    if (!lb->is_present) {
> +        return NULL;
> +    }
> +
> +    return lb;
> +}
> +
> +static bool vfio_load_state_buffer_write(VFIODevice *vbasedev,
> +                                         VFIOStateBuffer *lb,
> +                                         Error **errp)
> +{
> +    VFIOMigration *migration = vbasedev->migration;
> +    VFIOMultifd *multifd = migration->multifd;
> +    g_autofree char *buf = NULL;
> +    char *buf_cur;
> +    size_t buf_len;
> +
> +    if (!lb->len) {
> +        return true;
> +    }
> +
> +    trace_vfio_load_state_device_buffer_load_start(vbasedev->name,
> +                                                   multifd->load_buf_idx);
> +
> +    /* lb might become re-allocated when we drop the lock */
> +    buf = g_steal_pointer(&lb->data);
> +    buf_cur = buf;
> +    buf_len = lb->len;
> +    while (buf_len > 0) {
> +        ssize_t wr_ret;
> +        int errno_save;
> +
> +        /*
> +         * Loading data to the device takes a while,
> +         * drop the lock during this process.
> +         */
> +        qemu_mutex_unlock(&multifd->load_bufs_mutex);
> +        wr_ret = write(migration->data_fd, buf_cur, buf_len);
> +        errno_save = errno;
> +        qemu_mutex_lock(&multifd->load_bufs_mutex);
> +
> +        if (wr_ret < 0) {
> +            error_setg(errp,
> +                       "writing state buffer %" PRIu32 " failed: %d",
> +                       multifd->load_buf_idx, errno_save);

Let's add vbasedev->name to the error message so we know which device 
caused the error.

> +            return false;
> +        }
> +
> +        assert(wr_ret <= buf_len);

I think this assert is redundant: we write at most buf_len bytes, and by
the definition of write(), wr_ret will be <= buf_len.

> +        buf_len -= wr_ret;
> +        buf_cur += wr_ret;
> +    }
> +
> +    trace_vfio_load_state_device_buffer_load_end(vbasedev->name,
> +                                                 multifd->load_buf_idx);
> +
> +    return true;
> +}
> +
> +static bool vfio_load_bufs_thread_want_exit(VFIOMultifd *multifd,
> +                                            bool *should_quit)
> +{
> +    return multifd->load_bufs_thread_want_exit || qatomic_read(should_quit);
> +}
> +
> +/*
> + * This thread is spawned by vfio_multifd_switchover_start() which gets
> + * called upon encountering the switchover point marker in main migration
> + * stream.
> + *
> + * It exits after either:
> + * * completing loading the remaining device state and device config, OR:
> + * * encountering some error while doing the above, OR:
> + * * being forcefully aborted by the migration core by it setting should_quit
> + *   or by vfio_load_cleanup_load_bufs_thread() setting
> + *   multifd->load_bufs_thread_want_exit.
> + */
> +static bool vfio_load_bufs_thread(void *opaque, bool *should_quit, Error **errp)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +    VFIOMultifd *multifd = migration->multifd;
> +    bool ret = true;
> +    int config_ret;
> +
> +    assert(multifd);
> +    QEMU_LOCK_GUARD(&multifd->load_bufs_mutex);
> +
> +    assert(multifd->load_bufs_thread_running);
> +
> +    while (true) {
> +        VFIOStateBuffer *lb;
> +
> +        /*
> +         * Always check cancellation first after the buffer_ready wait below in
> +         * case that cond was signalled by vfio_load_cleanup_load_bufs_thread().
> +         */
> +        if (vfio_load_bufs_thread_want_exit(multifd, should_quit)) {
> +            error_setg(errp, "operation cancelled");
> +            ret = false;
> +            goto ret_signal;

IIUC, if vfio_load_bufs_thread_want_exit() returns true, it means that
some other part of the code has already failed and set the migration
error, no?
If so, shouldn't we return true here? After all, vfio_load_bufs_thread
didn't really fail, it just got a signal to terminate itself.

> +        }
> +
> +        assert(multifd->load_buf_idx <= multifd->load_buf_idx_last);
> +
> +        lb = vfio_load_state_buffer_get(multifd);
> +        if (!lb) {
> +            trace_vfio_load_state_device_buffer_starved(vbasedev->name,
> +                                                        multifd->load_buf_idx);
> +            qemu_cond_wait(&multifd->load_bufs_buffer_ready_cond,
> +                           &multifd->load_bufs_mutex);
> +            continue;
> +        }
> +
> +        if (multifd->load_buf_idx == multifd->load_buf_idx_last) {
> +            break;
> +        }
> +
> +        if (multifd->load_buf_idx == 0) {
> +            trace_vfio_load_state_device_buffer_start(vbasedev->name);
> +        }
> +
> +        if (!vfio_load_state_buffer_write(vbasedev, lb, errp)) {
> +            ret = false;
> +            goto ret_signal;
> +        }
> +
> +        if (multifd->load_buf_idx == multifd->load_buf_idx_last - 1) {
> +            trace_vfio_load_state_device_buffer_end(vbasedev->name);
> +        }
> +
> +        multifd->load_buf_idx++;
> +    }
> +
> +    config_ret = vfio_load_bufs_thread_load_config(vbasedev);
> +    if (config_ret) {
> +        error_setg(errp, "load config state failed: %d", config_ret);

Let's add vbasedev->name to the error message so we know which device 
caused the error.

> +        ret = false;
> +    }
> +
> +ret_signal:
> +    /*
> +     * Notify possibly waiting vfio_load_cleanup_load_bufs_thread() that
> +     * this thread is exiting.
> +     */
> +    multifd->load_bufs_thread_running = false;
> +    qemu_cond_signal(&multifd->load_bufs_thread_finished_cond);
> +
> +    return ret;
> +}
> +
>   VFIOMultifd *vfio_multifd_new(void)
>   {
>       VFIOMultifd *multifd = g_new(VFIOMultifd, 1);
> @@ -191,11 +365,42 @@ VFIOMultifd *vfio_multifd_new(void)
>       multifd->load_buf_idx_last = UINT32_MAX;
>       qemu_cond_init(&multifd->load_bufs_buffer_ready_cond);
>
> +    multifd->load_bufs_thread_running = false;
> +    multifd->load_bufs_thread_want_exit = false;
> +    qemu_cond_init(&multifd->load_bufs_thread_finished_cond);
> +
>       return multifd;
>   }
>
> +/*
> + * Terminates vfio_load_bufs_thread by setting
> + * multifd->load_bufs_thread_want_exit and signalling all the conditions
> + * the thread could be blocked on.
> + *
> + * Waits for the thread to signal that it had finished.
> + */
> +static void vfio_load_cleanup_load_bufs_thread(VFIOMultifd *multifd)
> +{
> +    /* The lock order is load_bufs_mutex -> BQL so unlock BQL here first */
> +    bql_unlock();
> +    WITH_QEMU_LOCK_GUARD(&multifd->load_bufs_mutex) {
> +        while (multifd->load_bufs_thread_running) {
> +            multifd->load_bufs_thread_want_exit = true;
> +
> +            qemu_cond_signal(&multifd->load_bufs_buffer_ready_cond);
> +            qemu_cond_wait(&multifd->load_bufs_thread_finished_cond,
> +                           &multifd->load_bufs_mutex);
> +        }
> +    }
> +    bql_lock();
> +}
> +
>   void vfio_multifd_free(VFIOMultifd *multifd)
>   {
> +    vfio_load_cleanup_load_bufs_thread(multifd);
> +
> +    qemu_cond_destroy(&multifd->load_bufs_thread_finished_cond);
> +    vfio_state_buffers_destroy(&multifd->load_bufs);

vfio_state_buffers_destroy(&multifd->load_bufs); belongs to patch #26, no?

Thanks.

>       qemu_cond_destroy(&multifd->load_bufs_buffer_ready_cond);
>       qemu_mutex_destroy(&multifd->load_bufs_mutex);
>
> @@ -225,3 +430,23 @@ bool vfio_multifd_transfer_setup(VFIODevice *vbasedev, Error **errp)
>
>       return true;
>   }
> +
> +int vfio_multifd_switchover_start(VFIODevice *vbasedev)
> +{
> +    VFIOMigration *migration = vbasedev->migration;
> +    VFIOMultifd *multifd = migration->multifd;
> +
> +    assert(multifd);
> +
> +    /* The lock order is load_bufs_mutex -> BQL so unlock BQL here first */
> +    bql_unlock();
> +    WITH_QEMU_LOCK_GUARD(&multifd->load_bufs_mutex) {
> +        assert(!multifd->load_bufs_thread_running);
> +        multifd->load_bufs_thread_running = true;
> +    }
> +    bql_lock();
> +
> +    qemu_loadvm_start_load_thread(vfio_load_bufs_thread, vbasedev);
> +
> +    return 0;
> +}
> diff --git a/hw/vfio/migration-multifd.h b/hw/vfio/migration-multifd.h
> index d5ab7d6f85f5..09cbb437d9d1 100644
> --- a/hw/vfio/migration-multifd.h
> +++ b/hw/vfio/migration-multifd.h
> @@ -25,4 +25,6 @@ bool vfio_multifd_transfer_setup(VFIODevice *vbasedev, Error **errp);
>   bool vfio_load_state_buffer(void *opaque, char *data, size_t data_size,
>                               Error **errp);
>
> +int vfio_multifd_switchover_start(VFIODevice *vbasedev);
> +
>   #endif
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index abaf4d08d4a9..85f54cb22df2 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -793,6 +793,17 @@ static bool vfio_switchover_ack_needed(void *opaque)
>       return vfio_precopy_supported(vbasedev);
>   }
>
> +static int vfio_switchover_start(void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +
> +    if (vfio_multifd_transfer_enabled(vbasedev)) {
> +        return vfio_multifd_switchover_start(vbasedev);
> +    }
> +
> +    return 0;
> +}
> +
>   static const SaveVMHandlers savevm_vfio_handlers = {
>       .save_prepare = vfio_save_prepare,
>       .save_setup = vfio_save_setup,
> @@ -808,6 +819,7 @@ static const SaveVMHandlers savevm_vfio_handlers = {
>       .load_state = vfio_load_state,
>       .load_state_buffer = vfio_load_state_buffer,
>       .switchover_ack_needed = vfio_switchover_ack_needed,
> +    .switchover_start = vfio_switchover_start,
>   };
>
>   /* ---------------------------------------------------------------------- */
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index 042a3dc54a33..418b378ebd29 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -154,6 +154,11 @@ vfio_load_device_config_state_end(const char *name) " (%s)"
>   vfio_load_state(const char *name, uint64_t data) " (%s) data 0x%"PRIx64
>   vfio_load_state_device_data(const char *name, uint64_t data_size, int ret) " (%s) size %"PRIu64" ret %d"
>   vfio_load_state_device_buffer_incoming(const char *name, uint32_t idx) " (%s) idx %"PRIu32
> +vfio_load_state_device_buffer_start(const char *name) " (%s)"
> +vfio_load_state_device_buffer_starved(const char *name, uint32_t idx) " (%s) idx %"PRIu32
> +vfio_load_state_device_buffer_load_start(const char *name, uint32_t idx) " (%s) idx %"PRIu32
> +vfio_load_state_device_buffer_load_end(const char *name, uint32_t idx) " (%s) idx %"PRIu32
> +vfio_load_state_device_buffer_end(const char *name) " (%s)"
>   vfio_migration_realize(const char *name) " (%s)"
>   vfio_migration_set_device_state(const char *name, const char *state) " (%s) state %s"
>   vfio_migration_set_state(const char *name, const char *new_state, const char *recover_state) " (%s) new state %s, recover state %s"


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v5 27/36] vfio/migration: Multifd device state transfer support - load thread
  2025-02-26 13:49   ` Cédric Le Goater
  2025-02-26 21:05     ` Maciej S. Szmigiero
@ 2025-03-02 14:19     ` Avihai Horon
  2025-03-03 22:16       ` Maciej S. Szmigiero
  1 sibling, 1 reply; 120+ messages in thread
From: Avihai Horon @ 2025-03-02 14:19 UTC (permalink / raw)
  To: Cédric Le Goater, Maciej S. Szmigiero, Peter Xu,
	Fabiano Rosas
  Cc: Alex Williamson, Eric Blake, Markus Armbruster,
	Daniel P . Berrangé, Joao Martins, qemu-devel


On 26/02/2025 15:49, Cédric Le Goater wrote:
> External email: Use caution opening links or attachments
>
>
> On 2/19/25 21:34, Maciej S. Szmigiero wrote:
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> Since it's important to finish loading device state transferred via the
>> main migration channel (via save_live_iterate SaveVMHandler) before
>> starting loading the data asynchronously transferred via multifd, the 
>> thread
>> doing the actual loading of the multifd transferred data is only started
>> from switchover_start SaveVMHandler.
>>
>> switchover_start handler is called when MIG_CMD_SWITCHOVER_START
>> sub-command of QEMU_VM_COMMAND is received via the main migration 
>> channel.
>>
>> This sub-command is only sent after all save_live_iterate data have 
>> already
>> been posted so it is safe to commence loading of the multifd-transferred
>> device state upon receiving it - loading of save_live_iterate data 
>> happens
>> synchronously in the main migration thread (much like the processing of
>> MIG_CMD_SWITCHOVER_START) so by the time MIG_CMD_SWITCHOVER_START is
>> processed all the preceding data must have already been loaded.
>>
>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>> ---
>>   hw/vfio/migration-multifd.c | 225 ++++++++++++++++++++++++++++++++++++
>>   hw/vfio/migration-multifd.h |   2 +
>>   hw/vfio/migration.c         |  12 ++
>>   hw/vfio/trace-events        |   5 +
>>   4 files changed, 244 insertions(+)
>>
>> diff --git a/hw/vfio/migration-multifd.c b/hw/vfio/migration-multifd.c
>> index 5d5ee1393674..b3a88c062769 100644
>> --- a/hw/vfio/migration-multifd.c
>> +++ b/hw/vfio/migration-multifd.c
>> @@ -42,8 +42,13 @@ typedef struct VFIOStateBuffer {
>>   } VFIOStateBuffer;
>>
>>   typedef struct VFIOMultifd {
>> +    QemuThread load_bufs_thread;
>> +    bool load_bufs_thread_running;
>> +    bool load_bufs_thread_want_exit;
>> +
>>       VFIOStateBuffers load_bufs;
>>       QemuCond load_bufs_buffer_ready_cond;
>> +    QemuCond load_bufs_thread_finished_cond;
>>       QemuMutex load_bufs_mutex; /* Lock order: this lock -> BQL */
>>       uint32_t load_buf_idx;
>>       uint32_t load_buf_idx_last;
>> @@ -179,6 +184,175 @@ bool vfio_load_state_buffer(void *opaque, char 
>> *data, size_t data_size,
>>       return true;
>>   }
>>
>> +static int vfio_load_bufs_thread_load_config(VFIODevice *vbasedev)
>> +{
>> +    return -EINVAL;
>> +}
>
>
> please move to next patch.
>
>> +static VFIOStateBuffer *vfio_load_state_buffer_get(VFIOMultifd 
>> *multifd)
>> +{
>> +    VFIOStateBuffer *lb;
>> +    guint bufs_len;
>
> guint:  I guess it's ok to use here. It is not common practice in VFIO.

GLib documentation says that in new code unsigned int is preferred over
guint [1].

Thanks.

[1] https://docs.gtk.org/glib/types.html#guint



^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v5 28/36] vfio/migration: Multifd device state transfer support - config loading support
  2025-02-19 20:34 ` [PATCH v5 28/36] vfio/migration: Multifd device state transfer support - config loading support Maciej S. Szmigiero
  2025-02-26 13:52   ` Cédric Le Goater
@ 2025-03-02 14:25   ` Avihai Horon
  2025-03-03 22:17     ` Maciej S. Szmigiero
  1 sibling, 1 reply; 120+ messages in thread
From: Avihai Horon @ 2025-03-02 14:25 UTC (permalink / raw)
  To: Maciej S. Szmigiero, Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Joao Martins,
	qemu-devel


On 19/02/2025 22:34, Maciej S. Szmigiero wrote:
>
>
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>
> Load device config received via multifd using the existing machinery
> behind vfio_load_device_config_state().
>
> Also, make sure to process the relevant main migration channel flags.
>
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> ---
>   hw/vfio/migration-multifd.c   | 47 ++++++++++++++++++++++++++++++++++-
>   hw/vfio/migration.c           |  8 +++++-
>   include/hw/vfio/vfio-common.h |  2 ++
>   3 files changed, 55 insertions(+), 2 deletions(-)
>
> diff --git a/hw/vfio/migration-multifd.c b/hw/vfio/migration-multifd.c
> index b3a88c062769..7200f6f1c2a2 100644
> --- a/hw/vfio/migration-multifd.c
> +++ b/hw/vfio/migration-multifd.c
> @@ -15,6 +15,7 @@
>   #include "qemu/lockable.h"
>   #include "qemu/main-loop.h"
>   #include "qemu/thread.h"
> +#include "io/channel-buffer.h"
>   #include "migration/qemu-file.h"
>   #include "migration-multifd.h"
>   #include "trace.h"
> @@ -186,7 +187,51 @@ bool vfio_load_state_buffer(void *opaque, char *data, size_t data_size,
>
>   static int vfio_load_bufs_thread_load_config(VFIODevice *vbasedev)
>   {
> -    return -EINVAL;
> +    VFIOMigration *migration = vbasedev->migration;
> +    VFIOMultifd *multifd = migration->multifd;
> +    VFIOStateBuffer *lb;
> +    g_autoptr(QIOChannelBuffer) bioc = NULL;
> +    QEMUFile *f_out = NULL, *f_in = NULL;

Can we move patch #29 before this one and use g_autoptr() for f_out and f_in?

> +    uint64_t mig_header;
> +    int ret;
> +
> +    assert(multifd->load_buf_idx == multifd->load_buf_idx_last);
> +    lb = vfio_state_buffers_at(&multifd->load_bufs, multifd->load_buf_idx);
> +    assert(lb->is_present);
> +
> +    bioc = qio_channel_buffer_new(lb->len);
> +    qio_channel_set_name(QIO_CHANNEL(bioc), "vfio-device-config-load");
> +
> +    f_out = qemu_file_new_output(QIO_CHANNEL(bioc));
> +    qemu_put_buffer(f_out, (uint8_t *)lb->data, lb->len);
> +
> +    ret = qemu_fflush(f_out);
> +    if (ret) {
> +        g_clear_pointer(&f_out, qemu_fclose);
> +        return ret;
> +    }
> +
> +    qio_channel_io_seek(QIO_CHANNEL(bioc), 0, 0, NULL);
> +    f_in = qemu_file_new_input(QIO_CHANNEL(bioc));
> +
> +    mig_header = qemu_get_be64(f_in);
> +    if (mig_header != VFIO_MIG_FLAG_DEV_CONFIG_STATE) {
> +        g_clear_pointer(&f_out, qemu_fclose);
> +        g_clear_pointer(&f_in, qemu_fclose);
> +        return -EINVAL;
> +    }
> +
> +    bql_lock();
> +    ret = vfio_load_device_config_state(f_in, vbasedev);
> +    bql_unlock();
> +
> +    g_clear_pointer(&f_out, qemu_fclose);
> +    g_clear_pointer(&f_in, qemu_fclose);
> +    if (ret < 0) {
> +        return ret;
> +    }
> +
> +    return 0;
>   }
>
>   static VFIOStateBuffer *vfio_load_state_buffer_get(VFIOMultifd *multifd)
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 85f54cb22df2..b962309f7c27 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -264,7 +264,7 @@ static int vfio_save_device_config_state(QEMUFile *f, void *opaque,
>       return ret;
>   }
>
> -static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
> +int vfio_load_device_config_state(QEMUFile *f, void *opaque)
>   {
>       VFIODevice *vbasedev = opaque;
>       uint64_t data;
> @@ -728,6 +728,12 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
>           switch (data) {
>           case VFIO_MIG_FLAG_DEV_CONFIG_STATE:
>           {
> +            if (vfio_multifd_transfer_enabled(vbasedev)) {
> +                error_report("%s: got DEV_CONFIG_STATE but doing multifd transfer",
> +                             vbasedev->name);

To make it clearer, maybe change to:
"%s: got DEV_CONFIG_STATE in main migration channel but doing multifd 
transfer"

Thanks.

> +                return -EINVAL;
> +            }
> +
>               return vfio_load_device_config_state(f, opaque);
>           }
>           case VFIO_MIG_FLAG_DEV_SETUP_STATE:
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index ab110198bd6b..ce2bdea8a2c2 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -298,6 +298,8 @@ void vfio_add_bytes_transferred(unsigned long val);
>   bool vfio_device_state_is_running(VFIODevice *vbasedev);
>   bool vfio_device_state_is_precopy(VFIODevice *vbasedev);
>
> +int vfio_load_device_config_state(QEMUFile *f, void *opaque);
> +
>   #ifdef CONFIG_LINUX
>   int vfio_get_region_info(VFIODevice *vbasedev, int index,
>                            struct vfio_region_info **info);


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v5 30/36] vfio/migration: Multifd device state transfer support - send side
  2025-02-19 20:34 ` [PATCH v5 30/36] vfio/migration: Multifd device state transfer support - send side Maciej S. Szmigiero
  2025-02-26 16:43   ` Cédric Le Goater
@ 2025-03-02 14:41   ` Avihai Horon
  2025-03-03 22:17     ` Maciej S. Szmigiero
  1 sibling, 1 reply; 120+ messages in thread
From: Avihai Horon @ 2025-03-02 14:41 UTC (permalink / raw)
  To: Maciej S. Szmigiero, Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Joao Martins,
	qemu-devel


On 19/02/2025 22:34, Maciej S. Szmigiero wrote:
>
>
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>
> Implement the multifd device state transfer via additional per-device
> thread inside save_live_complete_precopy_thread handler.
>
> Switch between doing the data transfer in the new handler and doing it
> in the old save_state handler depending on the
> x-migration-multifd-transfer device property value.

x-migration-multifd-transfer is not yet introduced. Maybe rephrase to:

... depending on whether VFIO multifd transfer is enabled or not.

>
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> ---
>   hw/vfio/migration-multifd.c   | 139 ++++++++++++++++++++++++++++++++++
>   hw/vfio/migration-multifd.h   |   5 ++
>   hw/vfio/migration.c           |  26 +++++--
>   hw/vfio/trace-events          |   2 +
>   include/hw/vfio/vfio-common.h |   8 ++
>   5 files changed, 174 insertions(+), 6 deletions(-)
>
> diff --git a/hw/vfio/migration-multifd.c b/hw/vfio/migration-multifd.c
> index 7200f6f1c2a2..0cfa9d31732a 100644
> --- a/hw/vfio/migration-multifd.c
> +++ b/hw/vfio/migration-multifd.c
> @@ -476,6 +476,145 @@ bool vfio_multifd_transfer_setup(VFIODevice *vbasedev, Error **errp)
>       return true;
>   }
>
> +void vfio_multifd_emit_dummy_eos(VFIODevice *vbasedev, QEMUFile *f)
> +{
> +    assert(vfio_multifd_transfer_enabled(vbasedev));
> +
> +    /*
> +     * Emit dummy NOP data on the main migration channel since the actual
> +     * device state transfer is done via multifd channels.
> +     */
> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> +}
> +
> +static bool
> +vfio_save_complete_precopy_thread_config_state(VFIODevice *vbasedev,
> +                                               char *idstr,
> +                                               uint32_t instance_id,
> +                                               uint32_t idx,
> +                                               Error **errp)
> +{
> +    g_autoptr(QIOChannelBuffer) bioc = NULL;
> +    g_autoptr(QEMUFile) f = NULL;
> +    int ret;
> +    g_autofree VFIODeviceStatePacket *packet = NULL;
> +    size_t packet_len;
> +
> +    bioc = qio_channel_buffer_new(0);
> +    qio_channel_set_name(QIO_CHANNEL(bioc), "vfio-device-config-save");
> +
> +    f = qemu_file_new_output(QIO_CHANNEL(bioc));
> +
> +    if (vfio_save_device_config_state(f, vbasedev, errp)) {
> +        return false;
> +    }
> +
> +    ret = qemu_fflush(f);
> +    if (ret) {
> +        error_setg(errp, "save config state flush failed: %d", ret);

Let's add vbasedev->name to the error message so we know which device 
caused the error.

> +        return false;
> +    }
> +
> +    packet_len = sizeof(*packet) + bioc->usage;
> +    packet = g_malloc0(packet_len);
> +    packet->version = VFIO_DEVICE_STATE_PACKET_VER_CURRENT;
> +    packet->idx = idx;
> +    packet->flags = VFIO_DEVICE_STATE_CONFIG_STATE;
> +    memcpy(&packet->data, bioc->data, bioc->usage);
> +
> +    if (!multifd_queue_device_state(idstr, instance_id,
> +                                    (char *)packet, packet_len)) {
> +        error_setg(errp, "multifd config data queuing failed");

Ditto.

> +        return false;
> +    }
> +
> +    vfio_add_bytes_transferred(packet_len);
> +
> +    return true;
> +}
> +
> +/*
> + * This thread is spawned by the migration core directly via
> + * .save_live_complete_precopy_thread SaveVMHandler.
> + *
> + * It exits after either:
> + * * completing saving the remaining device state and device config, OR:
> + * * encountering some error while doing the above, OR:
> + * * being forcefully aborted by the migration core by
> + *   multifd_device_state_save_thread_should_exit() returning true.
> + */
> +bool vfio_save_complete_precopy_thread(SaveLiveCompletePrecopyThreadData *d,
> +                                       Error **errp)
> +{
> +    VFIODevice *vbasedev = d->handler_opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +    bool ret;
> +    g_autofree VFIODeviceStatePacket *packet = NULL;
> +    uint32_t idx;
> +
> +    if (!vfio_multifd_transfer_enabled(vbasedev)) {
> +        /* Nothing to do, vfio_save_complete_precopy() does the transfer. */
> +        return true;
> +    }
> +
> +    trace_vfio_save_complete_precopy_thread_start(vbasedev->name,
> +                                                  d->idstr, d->instance_id);
> +
> +    /* We reach here with device state STOP or STOP_COPY only */
> +    if (vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP_COPY,
> +                                 VFIO_DEVICE_STATE_STOP, errp)) {
> +        ret = false;
> +        goto ret_finish;
> +    }
> +
> +    packet = g_malloc0(sizeof(*packet) + migration->data_buffer_size);
> +    packet->version = VFIO_DEVICE_STATE_PACKET_VER_CURRENT;
> +
> +    for (idx = 0; ; idx++) {
> +        ssize_t data_size;
> +        size_t packet_size;
> +
> +        if (multifd_device_state_save_thread_should_exit()) {
> +            error_setg(errp, "operation cancelled");

Same comment as in patch #27:

IIUC, if multifd_device_state_save_thread_should_exit() returns true, it
means that some other part of the code has already failed and set the
migration error, no?
If so, shouldn't we return true here? After all,
vfio_save_complete_precopy_thread didn't really fail, it just got a
signal to terminate itself.

> +            ret = false;
> +            goto ret_finish;
> +        }
> +
> +        data_size = read(migration->data_fd, &packet->data,
> +                         migration->data_buffer_size);
> +        if (data_size < 0) {
> +            error_setg(errp, "reading state buffer %" PRIu32 " failed: %d",
> +                       idx, errno);

Let's add vbasedev->name to the error message so we know which device 
caused the error.

> +            ret = false;
> +            goto ret_finish;
> +        } else if (data_size == 0) {
> +            break;
> +        }
> +
> +        packet->idx = idx;
> +        packet_size = sizeof(*packet) + data_size;
> +
> +        if (!multifd_queue_device_state(d->idstr, d->instance_id,
> +                                        (char *)packet, packet_size)) {
> +            error_setg(errp, "multifd data queuing failed");

Ditto.

Thanks.

> +            ret = false;
> +            goto ret_finish;
> +        }
> +
> +        vfio_add_bytes_transferred(packet_size);
> +    }
> +
> +    ret = vfio_save_complete_precopy_thread_config_state(vbasedev,
> +                                                         d->idstr,
> +                                                         d->instance_id,
> +                                                         idx, errp);
> +
> +ret_finish:
> +    trace_vfio_save_complete_precopy_thread_end(vbasedev->name, ret);
> +
> +    return ret;
> +}
> +
>   int vfio_multifd_switchover_start(VFIODevice *vbasedev)
>   {
>       VFIOMigration *migration = vbasedev->migration;
> diff --git a/hw/vfio/migration-multifd.h b/hw/vfio/migration-multifd.h
> index 09cbb437d9d1..79780d7b5392 100644
> --- a/hw/vfio/migration-multifd.h
> +++ b/hw/vfio/migration-multifd.h
> @@ -25,6 +25,11 @@ bool vfio_multifd_transfer_setup(VFIODevice *vbasedev, Error **errp);
>   bool vfio_load_state_buffer(void *opaque, char *data, size_t data_size,
>                               Error **errp);
>
> +void vfio_multifd_emit_dummy_eos(VFIODevice *vbasedev, QEMUFile *f);
> +
> +bool vfio_save_complete_precopy_thread(SaveLiveCompletePrecopyThreadData *d,
> +                                       Error **errp);
> +
>   int vfio_multifd_switchover_start(VFIODevice *vbasedev);
>
>   #endif
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index b962309f7c27..69dcf2dac2fa 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -120,10 +120,10 @@ static void vfio_migration_set_device_state(VFIODevice *vbasedev,
>       vfio_migration_send_event(vbasedev);
>   }
>
> -static int vfio_migration_set_state(VFIODevice *vbasedev,
> -                                    enum vfio_device_mig_state new_state,
> -                                    enum vfio_device_mig_state recover_state,
> -                                    Error **errp)
> +int vfio_migration_set_state(VFIODevice *vbasedev,
> +                             enum vfio_device_mig_state new_state,
> +                             enum vfio_device_mig_state recover_state,
> +                             Error **errp)
>   {
>       VFIOMigration *migration = vbasedev->migration;
>       uint64_t buf[DIV_ROUND_UP(sizeof(struct vfio_device_feature) +
> @@ -238,8 +238,7 @@ static int vfio_load_buffer(QEMUFile *f, VFIODevice *vbasedev,
>       return ret;
>   }
>
> -static int vfio_save_device_config_state(QEMUFile *f, void *opaque,
> -                                         Error **errp)
> +int vfio_save_device_config_state(QEMUFile *f, void *opaque, Error **errp)
>   {
>       VFIODevice *vbasedev = opaque;
>       int ret;
> @@ -453,6 +452,10 @@ static int vfio_save_setup(QEMUFile *f, void *opaque, Error **errp)
>       uint64_t stop_copy_size = VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE;
>       int ret;
>
> +    if (!vfio_multifd_transfer_setup(vbasedev, errp)) {
> +        return -EINVAL;
> +    }
> +
>       qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
>
>       vfio_query_stop_copy_size(vbasedev, &stop_copy_size);
> @@ -631,6 +634,11 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
>       int ret;
>       Error *local_err = NULL;
>
> +    if (vfio_multifd_transfer_enabled(vbasedev)) {
> +        vfio_multifd_emit_dummy_eos(vbasedev, f);
> +        return 0;
> +    }
> +
>       trace_vfio_save_complete_precopy_start(vbasedev->name);
>
>       /* We reach here with device state STOP or STOP_COPY only */
> @@ -662,6 +670,11 @@ static void vfio_save_state(QEMUFile *f, void *opaque)
>       Error *local_err = NULL;
>       int ret;
>
> +    if (vfio_multifd_transfer_enabled(vbasedev)) {
> +        vfio_multifd_emit_dummy_eos(vbasedev, f);
> +        return;
> +    }
> +
>       ret = vfio_save_device_config_state(f, opaque, &local_err);
>       if (ret) {
>           error_prepend(&local_err,
> @@ -819,6 +832,7 @@ static const SaveVMHandlers savevm_vfio_handlers = {
>       .is_active_iterate = vfio_is_active_iterate,
>       .save_live_iterate = vfio_save_iterate,
>       .save_live_complete_precopy = vfio_save_complete_precopy,
> +    .save_live_complete_precopy_thread = vfio_save_complete_precopy_thread,
>       .save_state = vfio_save_state,
>       .load_setup = vfio_load_setup,
>       .load_cleanup = vfio_load_cleanup,
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index 418b378ebd29..039979bdd98f 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -168,6 +168,8 @@ vfio_save_block_precopy_empty_hit(const char *name) " (%s)"
>   vfio_save_cleanup(const char *name) " (%s)"
>   vfio_save_complete_precopy(const char *name, int ret) " (%s) ret %d"
>   vfio_save_complete_precopy_start(const char *name) " (%s)"
> +vfio_save_complete_precopy_thread_start(const char *name, const char *idstr, uint32_t instance_id) " (%s) idstr %s instance %"PRIu32
> +vfio_save_complete_precopy_thread_end(const char *name, int ret) " (%s) ret %d"
>   vfio_save_device_config_state(const char *name) " (%s)"
>   vfio_save_iterate(const char *name, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy initial size %"PRIu64" precopy dirty size %"PRIu64
>   vfio_save_iterate_start(const char *name) " (%s)"
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index ce2bdea8a2c2..ba851917f9fc 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -298,6 +298,14 @@ void vfio_add_bytes_transferred(unsigned long val);
>   bool vfio_device_state_is_running(VFIODevice *vbasedev);
>   bool vfio_device_state_is_precopy(VFIODevice *vbasedev);
>
> +#ifdef CONFIG_LINUX
> +int vfio_migration_set_state(VFIODevice *vbasedev,
> +                             enum vfio_device_mig_state new_state,
> +                             enum vfio_device_mig_state recover_state,
> +                             Error **errp);
> +#endif
> +
> +int vfio_save_device_config_state(QEMUFile *f, void *opaque, Error **errp);
>   int vfio_load_device_config_state(QEMUFile *f, void *opaque);
>
>   #ifdef CONFIG_LINUX


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v5 31/36] vfio/migration: Add x-migration-multifd-transfer VFIO property
  2025-02-19 20:34 ` [PATCH v5 31/36] vfio/migration: Add x-migration-multifd-transfer VFIO property Maciej S. Szmigiero
  2025-02-27  6:45   ` Cédric Le Goater
@ 2025-03-02 14:48   ` Avihai Horon
  2025-03-03 22:17     ` Maciej S. Szmigiero
  1 sibling, 1 reply; 120+ messages in thread
From: Avihai Horon @ 2025-03-02 14:48 UTC (permalink / raw)
  To: Maciej S. Szmigiero, Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Joao Martins,
	qemu-devel


On 19/02/2025 22:34, Maciej S. Szmigiero wrote:
> External email: Use caution opening links or attachments
>
>
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>
> This property allows configuring at runtime whether to transfer the

IIUC, in this patch it's not configurable at runtime, so let's drop "at 
runtime".

> particular device state via multifd channels when live migrating that
> device.
>
> It defaults to AUTO, which means that VFIO device state transfer via
> multifd channels is attempted in configurations that otherwise support it.
>
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> ---
>   hw/vfio/migration-multifd.c   | 17 ++++++++++++++++-
>   hw/vfio/pci.c                 |  3 +++
>   include/hw/vfio/vfio-common.h |  2 ++
>   3 files changed, 21 insertions(+), 1 deletion(-)
>
> diff --git a/hw/vfio/migration-multifd.c b/hw/vfio/migration-multifd.c
> index 0cfa9d31732a..18a5ff964a37 100644
> --- a/hw/vfio/migration-multifd.c
> +++ b/hw/vfio/migration-multifd.c
> @@ -460,11 +460,26 @@ bool vfio_multifd_transfer_supported(void)
>
>   bool vfio_multifd_transfer_enabled(VFIODevice *vbasedev)
>   {
> -    return false;
> +    VFIOMigration *migration = vbasedev->migration;
> +
> +    return migration->multifd_transfer;
>   }
>
>   bool vfio_multifd_transfer_setup(VFIODevice *vbasedev, Error **errp)
>   {
> +    VFIOMigration *migration = vbasedev->migration;
> +
> +    /*
> +     * Make a copy of this setting at the start in case it is changed
> +     * mid-migration.
> +     */
> +    if (vbasedev->migration_multifd_transfer == ON_OFF_AUTO_AUTO) {
> +        migration->multifd_transfer = vfio_multifd_transfer_supported();
> +    } else {
> +        migration->multifd_transfer =
> +            vbasedev->migration_multifd_transfer == ON_OFF_AUTO_ON;
> +    }

Making a copy of this value is only relevant for the next patch where 
it's turned mutable, so let's move this code to patch #32.

Thanks.

> +
>       if (vfio_multifd_transfer_enabled(vbasedev) &&
>           !vfio_multifd_transfer_supported()) {
>           error_setg(errp,
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 89d900e9cf0c..184ff882f9d1 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -3377,6 +3377,9 @@ static const Property vfio_pci_dev_properties[] = {
>                       VFIO_FEATURE_ENABLE_IGD_OPREGION_BIT, false),
>       DEFINE_PROP_ON_OFF_AUTO("enable-migration", VFIOPCIDevice,
>                               vbasedev.enable_migration, ON_OFF_AUTO_AUTO),
> +    DEFINE_PROP_ON_OFF_AUTO("x-migration-multifd-transfer", VFIOPCIDevice,
> +                            vbasedev.migration_multifd_transfer,
> +                            ON_OFF_AUTO_AUTO),
>       DEFINE_PROP_BOOL("migration-events", VFIOPCIDevice,
>                        vbasedev.migration_events, false),
>       DEFINE_PROP_BOOL("x-no-mmap", VFIOPCIDevice, vbasedev.no_mmap, false),
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index ba851917f9fc..3006931accf6 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -91,6 +91,7 @@ typedef struct VFIOMigration {
>       uint64_t mig_flags;
>       uint64_t precopy_init_size;
>       uint64_t precopy_dirty_size;
> +    bool multifd_transfer;
>       VFIOMultifd *multifd;
>       bool initial_data_sent;
>
> @@ -153,6 +154,7 @@ typedef struct VFIODevice {
>       bool no_mmap;
>       bool ram_block_discard_allowed;
>       OnOffAuto enable_migration;
> +    OnOffAuto migration_multifd_transfer;
>       bool migration_events;
>       VFIODeviceOps *ops;
>       unsigned int num_irqs;


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v5 34/36] vfio/migration: Max in-flight VFIO device state buffer count limit
  2025-02-19 20:34 ` [PATCH v5 34/36] vfio/migration: Max in-flight VFIO device state buffer count limit Maciej S. Szmigiero
  2025-02-27  6:48   ` Cédric Le Goater
@ 2025-03-02 14:53   ` Avihai Horon
  2025-03-02 14:54     ` Maciej S. Szmigiero
  1 sibling, 1 reply; 120+ messages in thread
From: Avihai Horon @ 2025-03-02 14:53 UTC (permalink / raw)
  To: Maciej S. Szmigiero, Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Joao Martins,
	qemu-devel


On 19/02/2025 22:34, Maciej S. Szmigiero wrote:
> External email: Use caution opening links or attachments
>
>
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>
> Allow capping the maximum count of in-flight VFIO device state buffers
> queued at the destination, otherwise a malicious QEMU source could
> theoretically cause the target QEMU to allocate unlimited amounts of memory
> for buffers-in-flight.

I still think it's better to limit the number of bytes rather than 
number of buffers:
1. To the average user the number of buffers doesn't really mean 
anything. They have to open the code and see what is the size of a 
single buffer and then choose their value.
2. Currently VFIO migration buffer size is 1MB. If later it's changed to 
2MB for example, users will have to adjust their configuration 
accordingly. With number of bytes, the configuration remains the same no 
matter what is the VFIO migration buffer size.

>
> Since this is not expected to be a realistic threat in most of VFIO live
> migration use cases and the right value depends on the particular setup
> disable the limit by default by setting it to UINT64_MAX.
>
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> ---
>   hw/vfio/migration-multifd.c   | 14 ++++++++++++++
>   hw/vfio/pci.c                 |  2 ++
>   include/hw/vfio/vfio-common.h |  1 +
>   3 files changed, 17 insertions(+)
>
> diff --git a/hw/vfio/migration-multifd.c b/hw/vfio/migration-multifd.c
> index 18a5ff964a37..04aa3f4a6596 100644
> --- a/hw/vfio/migration-multifd.c
> +++ b/hw/vfio/migration-multifd.c
> @@ -53,6 +53,7 @@ typedef struct VFIOMultifd {
>       QemuMutex load_bufs_mutex; /* Lock order: this lock -> BQL */
>       uint32_t load_buf_idx;
>       uint32_t load_buf_idx_last;
> +    uint32_t load_buf_queued_pending_buffers;
>   } VFIOMultifd;
>
>   static void vfio_state_buffer_clear(gpointer data)
> @@ -121,6 +122,15 @@ static bool vfio_load_state_buffer_insert(VFIODevice *vbasedev,
>
>       assert(packet->idx >= multifd->load_buf_idx);
>
> +    multifd->load_buf_queued_pending_buffers++;
> +    if (multifd->load_buf_queued_pending_buffers >
> +        vbasedev->migration_max_queued_buffers) {
> +        error_setg(errp,
> +                   "queuing state buffer %" PRIu32 " would exceed the max of %" PRIu64,
> +                   packet->idx, vbasedev->migration_max_queued_buffers);

Let's add vbasedev->name to the error message so we know which device 
caused the error.

Thanks.

> +        return false;
> +    }
> +
>       lb->data = g_memdup2(&packet->data, packet_total_size - sizeof(*packet));
>       lb->len = packet_total_size - sizeof(*packet);
>       lb->is_present = true;
> @@ -374,6 +384,9 @@ static bool vfio_load_bufs_thread(void *opaque, bool *should_quit, Error **errp)
>               goto ret_signal;
>           }
>
> +        assert(multifd->load_buf_queued_pending_buffers > 0);
> +        multifd->load_buf_queued_pending_buffers--;
> +
>           if (multifd->load_buf_idx == multifd->load_buf_idx_last - 1) {
>               trace_vfio_load_state_device_buffer_end(vbasedev->name);
>           }
> @@ -408,6 +421,7 @@ VFIOMultifd *vfio_multifd_new(void)
>
>       multifd->load_buf_idx = 0;
>       multifd->load_buf_idx_last = UINT32_MAX;
> +    multifd->load_buf_queued_pending_buffers = 0;
>       qemu_cond_init(&multifd->load_bufs_buffer_ready_cond);
>
>       multifd->load_bufs_thread_running = false;
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 9111805ae06c..247418f0fce2 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -3383,6 +3383,8 @@ static const Property vfio_pci_dev_properties[] = {
>                   vbasedev.migration_multifd_transfer,
>                   qdev_prop_on_off_auto_mutable, OnOffAuto,
>                   .set_default = true, .defval.i = ON_OFF_AUTO_AUTO),
> +    DEFINE_PROP_UINT64("x-migration-max-queued-buffers", VFIOPCIDevice,
> +                       vbasedev.migration_max_queued_buffers, UINT64_MAX),
>       DEFINE_PROP_BOOL("migration-events", VFIOPCIDevice,
>                        vbasedev.migration_events, false),
>       DEFINE_PROP_BOOL("x-no-mmap", VFIOPCIDevice, vbasedev.no_mmap, false),
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 3006931accf6..30a5bb9af61b 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -155,6 +155,7 @@ typedef struct VFIODevice {
>       bool ram_block_discard_allowed;
>       OnOffAuto enable_migration;
>       OnOffAuto migration_multifd_transfer;
> +    uint64_t migration_max_queued_buffers;
>       bool migration_events;
>       VFIODeviceOps *ops;
>       unsigned int num_irqs;


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v5 34/36] vfio/migration: Max in-flight VFIO device state buffer count limit
  2025-03-02 14:53   ` Avihai Horon
@ 2025-03-02 14:54     ` Maciej S. Szmigiero
  2025-03-02 14:59       ` Maciej S. Szmigiero
  0 siblings, 1 reply; 120+ messages in thread
From: Maciej S. Szmigiero @ 2025-03-02 14:54 UTC (permalink / raw)
  To: Avihai Horon
  Cc: Alex Williamson, Peter Xu, Fabiano Rosas, Cédric Le Goater,
	Eric Blake, Markus Armbruster, Daniel P . Berrangé,
	Joao Martins, qemu-devel

On 2.03.2025 15:53, Avihai Horon wrote:
> 
> On 19/02/2025 22:34, Maciej S. Szmigiero wrote:
>> External email: Use caution opening links or attachments
>>
>>
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> Allow capping the maximum count of in-flight VFIO device state buffers
>> queued at the destination, otherwise a malicious QEMU source could
>> theoretically cause the target QEMU to allocate unlimited amounts of memory
>> for buffers-in-flight.
> 
> I still think it's better to limit the number of bytes rather than number of buffers:
> 1. To the average user the number of buffers doesn't really mean anything. They have to open the code and see what is the size of a single buffer and then choose their value.
> 2. Currently VFIO migration buffer size is 1MB. If later it's changed to 2MB for example, users will have to adjust their configuration accordingly. With number of bytes, the configuration remains the same no matter what is the VFIO migration buffer size.

Sorry Avihai, but we're a little more than a week from code freeze
so it's really not the time for more than cosmetic changes.

Thanks,
Maciej



^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v5 34/36] vfio/migration: Max in-flight VFIO device state buffer count limit
  2025-03-02 14:54     ` Maciej S. Szmigiero
@ 2025-03-02 14:59       ` Maciej S. Szmigiero
  2025-03-02 16:28         ` Avihai Horon
  0 siblings, 1 reply; 120+ messages in thread
From: Maciej S. Szmigiero @ 2025-03-02 14:59 UTC (permalink / raw)
  To: Avihai Horon
  Cc: Alex Williamson, Peter Xu, Fabiano Rosas, Cédric Le Goater,
	Eric Blake, Markus Armbruster, Daniel P . Berrangé,
	Joao Martins, qemu-devel

On 2.03.2025 15:54, Maciej S. Szmigiero wrote:
> On 2.03.2025 15:53, Avihai Horon wrote:
>>
>> On 19/02/2025 22:34, Maciej S. Szmigiero wrote:
>>> External email: Use caution opening links or attachments
>>>
>>>
>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>
>>> Allow capping the maximum count of in-flight VFIO device state buffers
>>> queued at the destination, otherwise a malicious QEMU source could
>>> theoretically cause the target QEMU to allocate unlimited amounts of memory
>>> for buffers-in-flight.
>>
>> I still think it's better to limit the number of bytes rather than number of buffers:
>> 1. To the average user the number of buffers doesn't really mean anything. They have to open the code and see what is the size of a single buffer and then choose their value.
>> 2. Currently VFIO migration buffer size is 1MB. If later it's changed to 2MB for example, users will have to adjust their configuration accordingly. With number of bytes, the configuration remains the same no matter what is the VFIO migration buffer size.
> 
> Sorry Avihai, but we're a little more than a week from code freeze
> so it's really not the time for more than cosmetic changes.

And if you really, really want to have a queued buffers size limit,
that's something that could be added later as an additional
x-migration-max-queued-buffers-size (or similar) property,
since these limits aren't exclusive.

Thanks,
Maciej



^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v5 23/36] vfio/migration: Multifd device state transfer support - VFIOStateBuffer(s)
  2025-03-02 13:00   ` Avihai Horon
@ 2025-03-02 15:14     ` Maciej S. Szmigiero
  2025-03-03  6:42     ` Cédric Le Goater
  1 sibling, 0 replies; 120+ messages in thread
From: Maciej S. Szmigiero @ 2025-03-02 15:14 UTC (permalink / raw)
  To: Avihai Horon
  Cc: Alex Williamson, Peter Xu, Fabiano Rosas, Cédric Le Goater,
	Eric Blake, Markus Armbruster, Daniel P . Berrangé,
	Joao Martins, qemu-devel

On 2.03.2025 14:00, Avihai Horon wrote:
> 
> On 19/02/2025 22:34, Maciej S. Szmigiero wrote:
>> External email: Use caution opening links or attachments
>>
>>
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> Add VFIOStateBuffer(s) types and the associated methods.
>>
>> These store received device state buffers and config state waiting to get
>> loaded into the device.
>>
>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>> ---
>>   hw/vfio/migration-multifd.c | 54 +++++++++++++++++++++++++++++++++++++
>>   1 file changed, 54 insertions(+)
>>
>> diff --git a/hw/vfio/migration-multifd.c b/hw/vfio/migration-multifd.c
>> index 0c3185a26242..760b110a39b9 100644
>> --- a/hw/vfio/migration-multifd.c
>> +++ b/hw/vfio/migration-multifd.c
>> @@ -29,3 +29,57 @@ typedef struct VFIODeviceStatePacket {
>>       uint32_t flags;
>>       uint8_t data[0];
>>   } QEMU_PACKED VFIODeviceStatePacket;
>> +
>> +/* type safety */
>> +typedef struct VFIOStateBuffers {
>> +    GArray *array;
>> +} VFIOStateBuffers;
>> +
>> +typedef struct VFIOStateBuffer {
>> +    bool is_present;
>> +    char *data;
>> +    size_t len;
>> +} VFIOStateBuffer;
>> +
>> +static void vfio_state_buffer_clear(gpointer data)
>> +{
>> +    VFIOStateBuffer *lb = data;
>> +
>> +    if (!lb->is_present) {
>> +        return;
>> +    }
>> +
>> +    g_clear_pointer(&lb->data, g_free);
>> +    lb->is_present = false;
>> +}
>> +
>> +static void vfio_state_buffers_init(VFIOStateBuffers *bufs)
>> +{
>> +    bufs->array = g_array_new(FALSE, TRUE, sizeof(VFIOStateBuffer));
>> +    g_array_set_clear_func(bufs->array, vfio_state_buffer_clear);
>> +}
>> +
>> +static void vfio_state_buffers_destroy(VFIOStateBuffers *bufs)
>> +{
>> +    g_clear_pointer(&bufs->array, g_array_unref);
>> +}
>> +
>> +static void vfio_state_buffers_assert_init(VFIOStateBuffers *bufs)
>> +{
>> +    assert(bufs->array);
>> +}
>> +
>> +static guint vfio_state_buffers_size_get(VFIOStateBuffers *bufs)
>> +{
>> +    return bufs->array->len;
>> +}
>> +
>> +static void vfio_state_buffers_size_set(VFIOStateBuffers *bufs, guint size)
>> +{
>> +    g_array_set_size(bufs->array, size);
>> +}
>> +
>> +static VFIOStateBuffer *vfio_state_buffers_at(VFIOStateBuffers *bufs, guint idx)
>> +{
>> +    return &g_array_index(bufs->array, VFIOStateBuffer, idx);
>> +}
> 
> This patch breaks compilation as none of the functions are used, e.g.:
> error: ‘vfio_state_buffers_init’ defined but not used
>
> I can think of three options to solve it:
> 1. Move these functions to their own file and export them, e.g.,
>    hw/vfio/state-buffer.{c,h}. But this seems like overkill for such
>    a small API.
> 2. Add __attribute__((unused)) tags and remove them in patch #26 where
>    the functions are actually used. A bit ugly.
> 3. Squash this patch into patch #26. I prefer option 3 as this is a
>    small API closely related to patch #26 (and patch #26 will still
>    remain rather small).

Looks like some build configs use -Werror, as unused functions aren't normally
an error.

Will have a look at this tomorrow.

> Thanks.
> 

Thanks,
Maciej



^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v5 34/36] vfio/migration: Max in-flight VFIO device state buffer count limit
  2025-03-02 14:59       ` Maciej S. Szmigiero
@ 2025-03-02 16:28         ` Avihai Horon
  0 siblings, 0 replies; 120+ messages in thread
From: Avihai Horon @ 2025-03-02 16:28 UTC (permalink / raw)
  To: Maciej S. Szmigiero
  Cc: Alex Williamson, Peter Xu, Fabiano Rosas, Cédric Le Goater,
	Eric Blake, Markus Armbruster, Daniel P . Berrangé,
	Joao Martins, qemu-devel


On 02/03/2025 16:59, Maciej S. Szmigiero wrote:
> External email: Use caution opening links or attachments
>
>
> On 2.03.2025 15:54, Maciej S. Szmigiero wrote:
>> On 2.03.2025 15:53, Avihai Horon wrote:
>>>
>>> On 19/02/2025 22:34, Maciej S. Szmigiero wrote:
>>>> External email: Use caution opening links or attachments
>>>>
>>>>
>>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>>
>>>> Allow capping the maximum count of in-flight VFIO device state buffers
>>>> queued at the destination, otherwise a malicious QEMU source could
>>>> theoretically cause the target QEMU to allocate unlimited amounts 
>>>> of memory
>>>> for buffers-in-flight.
>>>
>>> I still think it's better to limit the number of bytes rather than 
>>> number of buffers:
>>> 1. To the average user the number of buffers doesn't really mean 
>>> anything. They have to open the code and see what is the size of a 
>>> single buffer and then choose their value.
>>> 2. Currently VFIO migration buffer size is 1MB. If later it's 
>>> changed to 2MB for example, users will have to adjust their 
>>> configuration accordingly. With number of bytes, the configuration 
>>> remains the same no matter what is the VFIO migration buffer size.
>>
>> Sorry Avihai, but we're a little more than a week from code freeze
>> so it's really not the time for more than cosmetic changes.
>
> And if you really, really want to have a queued buffers size limit,
> that's something that could be added later as an additional
> x-migration-max-queued-buffers-size (or similar) property,
> since these limits aren't exclusive.
>
Sure, I agree.
It's not urgent nor mandatory for now, I just wanted to express my 
opinion :)

Thanks.

> Thanks,
> Maciej
>


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v5 23/36] vfio/migration: Multifd device state transfer support - VFIOStateBuffer(s)
  2025-03-02 13:00   ` Avihai Horon
  2025-03-02 15:14     ` Maciej S. Szmigiero
@ 2025-03-03  6:42     ` Cédric Le Goater
  2025-03-03 22:14       ` Maciej S. Szmigiero
  1 sibling, 1 reply; 120+ messages in thread
From: Cédric Le Goater @ 2025-03-03  6:42 UTC (permalink / raw)
  To: Avihai Horon, Maciej S. Szmigiero, Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Eric Blake, Markus Armbruster,
	Daniel P . Berrangé, Joao Martins, qemu-devel

On 3/2/25 14:00, Avihai Horon wrote:
> 
> On 19/02/2025 22:34, Maciej S. Szmigiero wrote:
>> External email: Use caution opening links or attachments
>>
>>
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> Add VFIOStateBuffer(s) types and the associated methods.
>>
>> These store received device state buffers and config state waiting to get
>> loaded into the device.
>>
>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>> ---
>>   hw/vfio/migration-multifd.c | 54 +++++++++++++++++++++++++++++++++++++
>>   1 file changed, 54 insertions(+)
>>
>> diff --git a/hw/vfio/migration-multifd.c b/hw/vfio/migration-multifd.c
>> index 0c3185a26242..760b110a39b9 100644
>> --- a/hw/vfio/migration-multifd.c
>> +++ b/hw/vfio/migration-multifd.c
>> @@ -29,3 +29,57 @@ typedef struct VFIODeviceStatePacket {
>>       uint32_t flags;
>>       uint8_t data[0];
>>   } QEMU_PACKED VFIODeviceStatePacket;
>> +
>> +/* type safety */
>> +typedef struct VFIOStateBuffers {
>> +    GArray *array;
>> +} VFIOStateBuffers;
>> +
>> +typedef struct VFIOStateBuffer {
>> +    bool is_present;
>> +    char *data;
>> +    size_t len;
>> +} VFIOStateBuffer;
>> +
>> +static void vfio_state_buffer_clear(gpointer data)
>> +{
>> +    VFIOStateBuffer *lb = data;
>> +
>> +    if (!lb->is_present) {
>> +        return;
>> +    }
>> +
>> +    g_clear_pointer(&lb->data, g_free);
>> +    lb->is_present = false;
>> +}
>> +
>> +static void vfio_state_buffers_init(VFIOStateBuffers *bufs)
>> +{
>> +    bufs->array = g_array_new(FALSE, TRUE, sizeof(VFIOStateBuffer));
>> +    g_array_set_clear_func(bufs->array, vfio_state_buffer_clear);
>> +}
>> +
>> +static void vfio_state_buffers_destroy(VFIOStateBuffers *bufs)
>> +{
>> +    g_clear_pointer(&bufs->array, g_array_unref);
>> +}
>> +
>> +static void vfio_state_buffers_assert_init(VFIOStateBuffers *bufs)
>> +{
>> +    assert(bufs->array);
>> +}
>> +
>> +static guint vfio_state_buffers_size_get(VFIOStateBuffers *bufs)
>> +{
>> +    return bufs->array->len;
>> +}
>> +
>> +static void vfio_state_buffers_size_set(VFIOStateBuffers *bufs, guint size)
>> +{
>> +    g_array_set_size(bufs->array, size);
>> +}
>> +
>> +static VFIOStateBuffer *vfio_state_buffers_at(VFIOStateBuffers *bufs, guint idx)
>> +{
>> +    return &g_array_index(bufs->array, VFIOStateBuffer, idx);
>> +}
> 
> This patch breaks compilation as none of the functions are used, e.g.:
> error: ‘vfio_state_buffers_init’ defined but not used
>
> I can think of three options to solve it:
> 1. Move these functions to their own file and export them, e.g.,
>    hw/vfio/state-buffer.{c,h}. But this seems like overkill for such
>    a small API.
> 2. Add __attribute__((unused)) tags and remove them in patch #26 where
>    the functions are actually used. A bit ugly.
> 3. Squash this patch into patch #26. I prefer option 3 as this is a
>    small API closely related to patch #26 (and patch #26 will still
>    remain rather small).

I vote for option 3 too.

vfio_state_buffers_init() is only called once and is just two lines;
it could be merged into vfio_multifd_new() too.


Thanks,

C.



^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v5 36/36] vfio/migration: Update VFIO migration documentation
  2025-02-28 23:38         ` Fabiano Rosas
@ 2025-03-03  9:34           ` Cédric Le Goater
  2025-03-03 22:14           ` Maciej S. Szmigiero
  1 sibling, 0 replies; 120+ messages in thread
From: Cédric Le Goater @ 2025-03-03  9:34 UTC (permalink / raw)
  To: Fabiano Rosas, Maciej S. Szmigiero
  Cc: Alex Williamson, Eric Blake, Peter Xu, Markus Armbruster,
	Daniel P . Berrangé, Avihai Horon, Joao Martins, qemu-devel

On 3/1/25 00:38, Fabiano Rosas wrote:
> Cédric Le Goater <clg@redhat.com> writes:
> 
>> On 2/27/25 23:01, Maciej S. Szmigiero wrote:
>>> On 27.02.2025 07:59, Cédric Le Goater wrote:
>>>> On 2/19/25 21:34, Maciej S. Szmigiero wrote:
>>>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>>>
>>>>> Update the VFIO documentation at docs/devel/migration describing the
>>>>> changes brought by the multifd device state transfer.
>>>>>
>>>>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>>>>> ---
>>>>>    docs/devel/migration/vfio.rst | 80 +++++++++++++++++++++++++++++++----
>>>>>    1 file changed, 71 insertions(+), 9 deletions(-)
>>>>>
>>>>> diff --git a/docs/devel/migration/vfio.rst b/docs/devel/migration/vfio.rst
>>>>> index c49482eab66d..d9b169d29921 100644
>>>>> --- a/docs/devel/migration/vfio.rst
>>>>> +++ b/docs/devel/migration/vfio.rst
>>>>> @@ -16,6 +16,37 @@ helps to reduce the total downtime of the VM. VFIO devices opt-in to pre-copy
>>>>>    support by reporting the VFIO_MIGRATION_PRE_COPY flag in the
>>>>>    VFIO_DEVICE_FEATURE_MIGRATION ioctl.
>>>>
>>>> Please add a new "multifd" documentation subsection at the end of the file
>>>> with this part :
>>>>
>>>>> +Starting from QEMU version 10.0 there's a possibility to transfer VFIO device
>>>>> +_STOP_COPY state via multifd channels. This helps reduce downtime - especially
>>>>> +with multiple VFIO devices or with devices having a large migration state.
>>>>> +As an additional benefit, setting the VFIO device to _STOP_COPY state and
>>>>> +saving its config space is also parallelized (run in a separate thread) in
>>>>> +such migration mode.
>>>>> +
>>>>> +The multifd VFIO device state transfer is controlled by
>>>>> +"x-migration-multifd-transfer" VFIO device property. This property defaults to
>>>>> +AUTO, which means that VFIO device state transfer via multifd channels is
>>>>> +attempted in configurations that otherwise support it.
>>>>> +
>>>
>>> Done - I also moved the parts about x-migration-max-queued-buffers
>>> and x-migration-load-config-after-iter description there since
>>> obviously they wouldn't make sense being left alone in the top section.
>>>
>>>> I was expecting a much more detailed explanation on the design too  :
>>>>
>>>>    * in the cover letter
>>>>    * in the hw/vfio/migration-multifd.c
>>>>    * in some new file under docs/devel/migration/
>>
>> I forgot to add  :
>>
>>        * guide on how to use this new feature from QEMU and libvirt.
>>          something we can refer to for tests. That's a must have.
>>        * usage scenarios
>>          There are some benefits but it is not obvious a user would
>>          like to use multiple VFs in one VM, please explain.
>>          This is a major addition which needs justification anyhow
>>        * pros and cons
>>
>>> I'm not sure what descriptions you exactly want in these places,
>>
>> Looking from the VFIO subsystem, the way this series works is very opaque.
>> There are a couple of a new migration handlers, new threads, new channels,
>> etc. It has been discussed several times with migration folks, please provide
>> a summary for a new reader as ignorant as everyone would be when looking at
>> a new file.
>>
>>
>>> but since
>>> that's just documentation (not code) it could be added after the code freeze...
>>
>> That's the risk of not getting any ! and the initial proposal should be
>> discussed before code freeze.
>>
>> For the general framework, I was expecting an extension of a "multifd"
>> subsection under :
>>
>>     https://qemu.readthedocs.io/en/v9.2.0/devel/migration/features.html
>>
>> but it doesn't exist :/
> 
> Hi, see if this helps. Let me know what can be improved and if something
> needs to be more detailed. Please ignore the formatting, I'll send a
> proper patch after the carnaval.

This is very good !  Thanks a lot Fabiano for providing this input.

> @Maciej, it's probably better if you keep your docs separate anyway so
> we don't add another dependency. I can merge them later.

Perfect. Maciej, We will adjust the file to apply it to before merging.


Thanks,

C.



> 
> multifd.rst:
> 
> Multifd
> =======
> 
> Multifd is the name given for the migration capability that enables
> data transfer using multiple threads. Multifd supports all the
> transport types currently in use with migration (inet, unix, vsock,
> fd, file).
> 
> Restrictions
> ------------
> 
> For migration to a file, support is conditional on the presence of the
> mapped-ram capability, see #mapped-ram.
> 
> Snapshots are currently not supported.
> 
> Postcopy migration is currently not supported.
> 
> Usage
> -----
> 
> On both source and destination, enable the ``multifd`` capability:
> 
>      ``migrate_set_capability multifd on``
> 
> Define a number of channels to use (default is 2, but 8 usually
> provides best performance).
> 
>      ``migrate_set_parameter multifd-channels 8``
> 
> Components
> ----------
> 
> Multifd consists of:
> 
> - A client that produces the data on the migration source side and
>    consumes it on the destination. Currently the main client code is
>    ram.c, which selects the RAM pages for migration;
> 
> - A shared data structure (MultiFDSendData), used to transfer data
>    between multifd and the client. On the source side, this structure
>    is further subdivided into payload types (MultiFDPayload);
> 
> - An API operating on the shared data structure to allow the client
>    code to interact with multifd;
> 
>    - multifd_send/recv(): A dispatcher that transfers work to/from the
>      channels.
> 
>    - multifd_*payload_* and MultiFDPayloadType: Support defining an
>      opaque payload. The payload is always wrapped by
>      MultiFDSend|RecvData.
> 
>    - multifd_send_data_*: Used to manage the memory for the shared data
>      structure.
> 
> - The threads that process the data (aka channels, due to a 1:1
>    mapping to QIOChannels). Each multifd channel supports callbacks
>    that can be used for fine-grained processing of the payload, such as
>    compression and zero page detection.
> 
> - A packet which is the final result of all the data aggregation
>    and/or transformation. The packet contains a header, a
>    payload-specific header and a variable-size data portion.
> 
>     - The packet header: contains a magic number, a version number and
>       flags that inform of special processing needed on the
>       destination.
> 
>     - The payload-specific header: contains metadata referring to the
>       packet's data portion, such as page counts.
> 
>     - The data portion: contains the actual opaque payload data.
> 
>    Note that due to historical reasons, the terminology around multifd
>    packets is inconsistent.
> 
>    The mapped-ram feature ignores packets entirely.
> 
> Theory of operation
> -------------------
> 
> The multifd channels operate in parallel with the main migration
> thread. The transfer of data from a client code into multifd happens
> from the main migration thread using the multifd API.
> 
> The interaction between the client code and the multifd channels
> happens in the multifd_send() and multifd_recv() methods. These are
> responsible for selecting the next idle channel and making the shared
> data structure containing the payload accessible to that channel. The
> client code receives back an empty object which it then uses for the
> next iteration of data transfer.
> 
> Idle channels (!p->pending_job) are selected in a simple round-robin
> fashion. Channels wait at a semaphore; once a channel is released, it
> starts operating on the data immediately.
> 
> Aside from eventually transmitting the data over the underlying
> QIOChannel, a channel's operation also includes calling back to the
> client code at pre-determined points to allow for client-specific
> handling such as data transformation (e.g. compression), creation of
> the packet header and arranging the data into iovs (struct
> iovec). Iovs are the type of data on which the QIOChannel operates.
> 
> Client code (migration thread):
> 1. Populate shared structure with opaque data (ram pages, device state)
> 2. Call multifd_send()
>     2a. Loop over the channels until one is idle
>     2b. Switch pointers between client data and channel data
>     2c. Release channel semaphore
> 3. Receive back empty object
> 4. Repeat
> 
> Multifd channel (multifd thread):
> 1. Channel idle
> 2. Gets released by multifd_send()
> 3. Call multifd_ops methods to fill iov
>     3a. Compression may happen
>     3b. Zero page detection may happen
>     3c. Packet is written
>     3d. iov is written
> 4. Pass iov into QIOChannel for transferring
> 5. Repeat
> 
> The destination side operates similarly but with multifd_recv(),
> decompression instead of compression, etc. One important aspect is
> that when receiving the data, the iov will contain host virtual
> addresses, so guest memory is written to directly from multifd
> threads.
> 
> About flags
> -----------
> The main thread orchestrates the migration by issuing control flags on
> the migration stream (QEMU_VM_*).
> 
> The main memory is migrated by ram.c and includes specific control
> flags that are also put on the main migration stream
> (RAM_SAVE_FLAG_*).
> 
> Multifd has its own set of MULTIFD_FLAGs that are included into each
> packet. These may inform about properties such as the compression
> algorithm used if the data is compressed.
> 
> Synchronization
> ---------------
> 
> Since the migration process is iterative due to RAM dirty tracking, it
> is necessary to invalidate data that is no longer current (e.g. due to
> the source VM touching the page). This is done by having a
> synchronization point triggered by the migration thread at key points
> during the migration. Data that's received after the synchronization
> point is allowed to overwrite data received prior to that point.
> 
> To perform the synchronization, multifd provides the
> multifd_send_sync_main() and multifd_recv_sync_main() helpers. These
> are called whenever the client code wishes to ensure that all data
> sent previously has now been received by the destination.
> 
> The synchronization process involves performing a flush of the
> remaining client data still to be transmitted and issuing a
> multifd packet containing the MULTIFD_FLAG_SYNC flag. This flag
> informs the receiving end that it should finish reading the data and
> wait for a synchronization point.
> 
> To complete the sync, the main migration stream issues a
> RAM_SAVE_FLAG_MULTIFD_FLUSH flag. When that flag is received by the
> destination, it ensures all of its channels have seen the
> MULTIFD_FLAG_SYNC and moves them to an idle state.
> 
> The client code can then continue with a second round of data by
> issuing multifd_send() once again.
> 
> The synchronization process also ensures that internal
> synchronization happens, i.e. between the multifd threads
> themselves. This is necessary to avoid threads lagging behind in
> sending or receiving when the migration approaches completion.
> 
> The mapped-ram feature has different synchronization requirements
> because it's an asynchronous migration (source and destination not
> migrating at the same time). For that feature, only the internal sync
> is relevant.
> 
> Data transformation
> -------------------
> 
> Each multifd channel executes a set of callbacks before transmitting
> the data. These callbacks allow the client code to alter the data
> format right before sending and after receiving.
> 
> Since the object of the RAM migration is always the memory page, and
> the only processing done for memory pages is zero page detection,
> which is in a sense already a form of compression, the multifd_ops
> functions are divided into two mutually exclusive sets: compression
> and no-compression.
> 
> Migration without compression (i.e. regular ram migration) has, as
> mentioned, the further particularity of possibly doing zero page
> detection (see the zero-page-detection migration parameter). This
> consists of sending all pages to multifd and letting the detection
> of a zero page happen in the multifd channels, instead of doing it
> beforehand on the main migration thread as was done in the past.
> 
> Code structure
> --------------
> 
> Multifd code is divided into:
> 
> The main file containing the core routines
> 
> - multifd.c
> 
> RAM migration
> 
> - multifd-nocomp.c (nocomp, for "no compression")
> - multifd-zero-page.c
> - ram.c (also involved in non-multifd migrations + snapshots)
> 
> Compressors
> 
> - multifd-uadk.c
> - multifd-qatzip.c
> - multifd-zlib.c
> - multifd-qpl.c
> - multifd-zstd.c
> 



^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v5 36/36] vfio/migration: Update VFIO migration documentation
  2025-02-28 23:38         ` Fabiano Rosas
  2025-03-03  9:34           ` Cédric Le Goater
@ 2025-03-03 22:14           ` Maciej S. Szmigiero
  1 sibling, 0 replies; 120+ messages in thread
From: Maciej S. Szmigiero @ 2025-03-03 22:14 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: Alex Williamson, Cédric Le Goater, Eric Blake, Peter Xu,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

On 1.03.2025 00:38, Fabiano Rosas wrote:
> Cédric Le Goater <clg@redhat.com> writes:
> 
>> On 2/27/25 23:01, Maciej S. Szmigiero wrote:
>>> On 27.02.2025 07:59, Cédric Le Goater wrote:
>>>> On 2/19/25 21:34, Maciej S. Szmigiero wrote:
>>>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>>>
>>>>> Update the VFIO documentation at docs/devel/migration describing the
>>>>> changes brought by the multifd device state transfer.
>>>>>
>>>>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>>>>> ---
>>>>>    docs/devel/migration/vfio.rst | 80 +++++++++++++++++++++++++++++++----
>>>>>    1 file changed, 71 insertions(+), 9 deletions(-)
>>>>>
>>>>> diff --git a/docs/devel/migration/vfio.rst b/docs/devel/migration/vfio.rst
>>>>> index c49482eab66d..d9b169d29921 100644
>>>>> --- a/docs/devel/migration/vfio.rst
>>>>> +++ b/docs/devel/migration/vfio.rst
>>>>> @@ -16,6 +16,37 @@ helps to reduce the total downtime of the VM. VFIO devices opt-in to pre-copy
>>>>>    support by reporting the VFIO_MIGRATION_PRE_COPY flag in the
>>>>>    VFIO_DEVICE_FEATURE_MIGRATION ioctl.
>>>>
>>>> Please add a new "multifd" documentation subsection at the end of the file
>>>> with this part :
>>>>
>>>>> +Starting from QEMU version 10.0 there's a possibility to transfer VFIO device
>>>>> +_STOP_COPY state via multifd channels. This helps reduce downtime - especially
>>>>> +with multiple VFIO devices or with devices having a large migration state.
>>>>> +As an additional benefit, setting the VFIO device to _STOP_COPY state and
>>>>> +saving its config space is also parallelized (run in a separate thread) in
>>>>> +such migration mode.
>>>>> +
>>>>> +The multifd VFIO device state transfer is controlled by
>>>>> +"x-migration-multifd-transfer" VFIO device property. This property defaults to
>>>>> +AUTO, which means that VFIO device state transfer via multifd channels is
>>>>> +attempted in configurations that otherwise support it.
>>>>> +
>>>
>>> Done - I also moved the parts about x-migration-max-queued-buffers
>>> and x-migration-load-config-after-iter description there since
>>> obviously they wouldn't make sense being left alone in the top section.
>>>
>>>> I was expecting a much more detailed explanation on the design too  :
>>>>
>>>>    * in the cover letter
>>>>    * in the hw/vfio/migration-multifd.c
>>>>    * in some new file under docs/devel/migration/
>>
>> I forgot to add  :
>>
>>        * guide on how to use this new feature from QEMU and libvirt.
>>          something we can refer to for tests. That's a must have.
>>        * usage scenarios
>>          There are some benefits but it is not obvious a user would
>>          like to use multiple VFs in one VM, please explain.
>>          This is a major addition which needs justification anyhow
>>        * pros and cons
>>
>>> I'm not sure what descriptions you exactly want in these places,
>>
>> Looking from the VFIO subsystem, the way this series works is very opaque.
>> There are a couple of a new migration handlers, new threads, new channels,
>> etc. It has been discussed several times with migration folks, please provide
>> a summary for a new reader as ignorant as everyone would be when looking at
>> a new file.
>>
>>
>>> but since
>>> that's just documentation (not code) it could be added after the code freeze...
>>
>> That's the risk of not getting any ! and the initial proposal should be
>> discussed before code freeze.
>>
>> For the general framework, I was expecting an extension of a "multifd"
>> subsection under :
>>
>>     https://qemu.readthedocs.io/en/v9.2.0/devel/migration/features.html
>>
>> but it doesn't exist :/
> 
> Hi, see if this helps. Let me know what can be improved and if something
> needs to be more detailed. Please ignore the formatting, I'll send a
> proper patch after the carnaval.
> 
> @Maciej, it's probably better if you keep your docs separate anyway so
> we don't add another dependency. I can merge them later.

That's a very good idea, thanks for writing this multifd doc Fabiano!

> multifd.rst:
> 
> Multifd
> =======
> 
> Multifd is the name given for the migration capability that enables
> data transfer using multiple threads. Multifd supports all the
> transport types currently in use with migration (inet, unix, vsock,
> fd, file).
(..)

Thanks,
Maciej




* Re: [PATCH v5 23/36] vfio/migration: Multifd device state transfer support - VFIOStateBuffer(s)
  2025-03-03  6:42     ` Cédric Le Goater
@ 2025-03-03 22:14       ` Maciej S. Szmigiero
  0 siblings, 0 replies; 120+ messages in thread
From: Maciej S. Szmigiero @ 2025-03-03 22:14 UTC (permalink / raw)
  To: Cédric Le Goater, Avihai Horon
  Cc: Alex Williamson, Eric Blake, Markus Armbruster, Peter Xu,
	Fabiano Rosas, Daniel P . Berrangé, Joao Martins, qemu-devel

On 3.03.2025 07:42, Cédric Le Goater wrote:
> On 3/2/25 14:00, Avihai Horon wrote:
>>
>> On 19/02/2025 22:34, Maciej S. Szmigiero wrote:
>>> External email: Use caution opening links or attachments
>>>
>>>
>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>
>>> Add VFIOStateBuffer(s) types and the associated methods.
>>>
>>> These store received device state buffers and config state waiting to get
>>> loaded into the device.
>>>
>>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>>> ---
>>>   hw/vfio/migration-multifd.c | 54 +++++++++++++++++++++++++++++++++++++
>>>   1 file changed, 54 insertions(+)
>>>
>>> diff --git a/hw/vfio/migration-multifd.c b/hw/vfio/migration-multifd.c
>>> index 0c3185a26242..760b110a39b9 100644
>>> --- a/hw/vfio/migration-multifd.c
>>> +++ b/hw/vfio/migration-multifd.c
>>> @@ -29,3 +29,57 @@ typedef struct VFIODeviceStatePacket {
>>>       uint32_t flags;
>>>       uint8_t data[0];
>>>   } QEMU_PACKED VFIODeviceStatePacket;
>>> +
>>> +/* type safety */
>>> +typedef struct VFIOStateBuffers {
>>> +    GArray *array;
>>> +} VFIOStateBuffers;
>>> +
>>> +typedef struct VFIOStateBuffer {
>>> +    bool is_present;
>>> +    char *data;
>>> +    size_t len;
>>> +} VFIOStateBuffer;
>>> +
>>> +static void vfio_state_buffer_clear(gpointer data)
>>> +{
>>> +    VFIOStateBuffer *lb = data;
>>> +
>>> +    if (!lb->is_present) {
>>> +        return;
>>> +    }
>>> +
>>> +    g_clear_pointer(&lb->data, g_free);
>>> +    lb->is_present = false;
>>> +}
>>> +
>>> +static void vfio_state_buffers_init(VFIOStateBuffers *bufs)
>>> +{
>>> +    bufs->array = g_array_new(FALSE, TRUE, sizeof(VFIOStateBuffer));
>>> +    g_array_set_clear_func(bufs->array, vfio_state_buffer_clear);
>>> +}
>>> +
>>> +static void vfio_state_buffers_destroy(VFIOStateBuffers *bufs)
>>> +{
>>> +    g_clear_pointer(&bufs->array, g_array_unref);
>>> +}
>>> +
>>> +static void vfio_state_buffers_assert_init(VFIOStateBuffers *bufs)
>>> +{
>>> +    assert(bufs->array);
>>> +}
>>> +
>>> +static guint vfio_state_buffers_size_get(VFIOStateBuffers *bufs)
>>> +{
>>> +    return bufs->array->len;
>>> +}
>>> +
>>> +static void vfio_state_buffers_size_set(VFIOStateBuffers *bufs, guint size)
>>> +{
>>> +    g_array_set_size(bufs->array, size);
>>> +}
>>> +
>>> +static VFIOStateBuffer *vfio_state_buffers_at(VFIOStateBuffers *bufs, guint idx)
>>> +{
>>> +    return &g_array_index(bufs->array, VFIOStateBuffer, idx);
>>> +}
>>
>> This patch breaks compilation as none of the functions are used, e.g.:
>> error: ‘vfio_state_buffers_init’ defined but not used
>>
>> I can think of three options to solve it:
>>
>> 1. Move these functions to their own file and export them, e.g.,
>>    hw/vfio/state-buffer.{c,h}. But this seems like an overkill for
>>    such a small API.
>>
>> 2. Add __attribute__((unused)) tags and remove them in patch #26
>>    where the functions are actually used. A bit ugly.
> 
>>
>> 3. Squash this patch into patch #26.
>>
>> I prefer option 3 as this is a small API closely related to patch #26
>> (and patch #26 will still remain rather small).
> 
> I vote for option 3 too.

Merged this patch into the "received buffers queuing" one (#26) now.

> vfio_state_buffers_init is only called once, it's 2 lines,
> it could be merged in vfio_multifd_new() too.

Most of these helpers are even shorter (1 line), but the whole
point of them is to abstract the GArray rather than open-code
these accesses.

This was discussed two versions ago:
https://lore.kernel.org/qemu-devel/9106d15e-3ff5-4d42-880d-0de70a4caa1c@maciej.szmigiero.name/

> 
> Thanks,
> 
> C.
> 

Thanks,
Maciej




* Re: [PATCH v5 11/36] migration/multifd: Device state transfer support - receive side
  2025-03-02 12:42   ` Avihai Horon
@ 2025-03-03 22:14     ` Maciej S. Szmigiero
  0 siblings, 0 replies; 120+ messages in thread
From: Maciej S. Szmigiero @ 2025-03-03 22:14 UTC (permalink / raw)
  To: Avihai Horon
  Cc: Alex Williamson, Cédric Le Goater, Peter Xu, Fabiano Rosas,
	Eric Blake, Markus Armbruster, Daniel P . Berrangé,
	Joao Martins, qemu-devel

On 2.03.2025 13:42, Avihai Horon wrote:
> Hi Maciej,
> 
> Sorry for the long delay, I have been busy with other tasks.
> I got some small comments for the series.
> 
> On 19/02/2025 22:33, Maciej S. Szmigiero wrote:
>> External email: Use caution opening links or attachments
>>
>>
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> Add a basic support for receiving device state via multifd channels -
>> channels that are shared with RAM transfers.
>>
>> Depending whether MULTIFD_FLAG_DEVICE_STATE flag is present or not in the
>> packet header either device state (MultiFDPacketDeviceState_t) or RAM
>> data (existing MultiFDPacket_t) is read.
>>
>> The received device state data is provided to
>> qemu_loadvm_load_state_buffer() function for processing in the
>> device's load_state_buffer handler.
>>
>> Reviewed-by: Peter Xu <peterx@redhat.com>
>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>> ---
>>   migration/multifd.c | 99 ++++++++++++++++++++++++++++++++++++++++-----
>>   migration/multifd.h | 26 +++++++++++-
>>   2 files changed, 113 insertions(+), 12 deletions(-)
>>
(..)
>> index f7156f66c0f6..c2ebef2d319e 100644
>> --- a/migration/multifd.h
>> +++ b/migration/multifd.h
>> @@ -62,6 +62,12 @@ MultiFDRecvData *multifd_get_recv_data(void);
>>   #define MULTIFD_FLAG_UADK (8 << 1)
>>   #define MULTIFD_FLAG_QATZIP (16 << 1)
>>
>> +/*
>> + * If set it means that this packet contains device state
>> + * (MultiFDPacketDeviceState_t), not RAM data (MultiFDPacket_t).
>> + */
>> +#define MULTIFD_FLAG_DEVICE_STATE (32 << 1)
>> +
>>   /* This value needs to be a multiple of qemu_target_page_size() */
>>   #define MULTIFD_PACKET_SIZE (512 * 1024)
>>
>> @@ -94,6 +100,16 @@ typedef struct {
>>       uint64_t offset[];
>>   } __attribute__((packed)) MultiFDPacket_t;
>>
>> +typedef struct {
>> +    MultiFDPacketHdr_t hdr;
>> +
>> +    char idstr[256] QEMU_NONSTRING;
>> +    uint32_t instance_id;
>> +
>> +    /* size of the next packet that contains the actual data */
>> +    uint32_t next_packet_size;
>> +} __attribute__((packed)) MultiFDPacketDeviceState_t;
>> +
>>   typedef struct {
>>       /* number of used pages */
>>       uint32_t num;
>> @@ -111,6 +127,13 @@ struct MultiFDRecvData {
>>       off_t file_offset;
>>   };
>>
>> +typedef struct {
>> +    char *idstr;
>> +    uint32_t instance_id;
>> +    char *buf;
>> +    size_t buf_len;
>> +} MultiFDDeviceState_t;
> 
> This is only used in patch #14. Maybe move it there?

Moved it to "send side" patch.

> Thanks.
> 

Thanks,
Maciej




* Re: [PATCH v5 14/36] migration/multifd: Device state transfer support - send side
  2025-03-02 12:46   ` Avihai Horon
@ 2025-03-03 22:15     ` Maciej S. Szmigiero
  0 siblings, 0 replies; 120+ messages in thread
From: Maciej S. Szmigiero @ 2025-03-03 22:15 UTC (permalink / raw)
  To: Avihai Horon
  Cc: Alex Williamson, Peter Xu, Fabiano Rosas, Cédric Le Goater,
	Eric Blake, Markus Armbruster, Daniel P . Berrangé,
	Joao Martins, qemu-devel

On 2.03.2025 13:46, Avihai Horon wrote:
> 
> On 19/02/2025 22:33, Maciej S. Szmigiero wrote:
>> External email: Use caution opening links or attachments
>>
>>
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> A new function multifd_queue_device_state() is provided for device to queue
>> its state for transmission via a multifd channel.
>>
>> Reviewed-by: Peter Xu <peterx@redhat.com>
>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>> ---
>>   include/migration/misc.h         |   4 ++
>>   migration/meson.build            |   1 +
>>   migration/multifd-device-state.c | 115 +++++++++++++++++++++++++++++++
>>   migration/multifd-nocomp.c       |  14 +++-
>>   migration/multifd.c              |  42 +++++++++--
>>   migration/multifd.h              |  27 +++++---
>>   6 files changed, 187 insertions(+), 16 deletions(-)
>>   create mode 100644 migration/multifd-device-state.c
>>
>> diff --git a/include/migration/misc.h b/include/migration/misc.h
>> index 4c171f4e897e..bd3b725fa0b7 100644
>> --- a/include/migration/misc.h
>> +++ b/include/migration/misc.h
>> @@ -118,4 +118,8 @@ bool migrate_is_uri(const char *uri);
>>   bool migrate_uri_parse(const char *uri, MigrationChannel **channel,
>>                          Error **errp);
>>
>> +/* migration/multifd-device-state.c */
>> +bool multifd_queue_device_state(char *idstr, uint32_t instance_id,
>> +                                char *data, size_t len);
>> +
>>   #endif
>> diff --git a/migration/meson.build b/migration/meson.build
>> index d3bfe84d6204..9aa48b290e2a 100644
>> --- a/migration/meson.build
>> +++ b/migration/meson.build
>> @@ -25,6 +25,7 @@ system_ss.add(files(
>>     'migration-hmp-cmds.c',
>>     'migration.c',
>>     'multifd.c',
>> +  'multifd-device-state.c',
>>     'multifd-nocomp.c',
>>     'multifd-zlib.c',
>>     'multifd-zero-page.c',
>> diff --git a/migration/multifd-device-state.c b/migration/multifd-device-state.c
>> new file mode 100644
>> index 000000000000..ab83773e2d62
>> --- /dev/null
>> +++ b/migration/multifd-device-state.c
>> @@ -0,0 +1,115 @@
>> +/*
>> + * Multifd device state migration
>> + *
>> + * Copyright (C) 2024,2025 Oracle and/or its affiliates.
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
>> + * See the COPYING file in the top-level directory.
>> + */
>> +
>> +#include "qemu/osdep.h"
>> +#include "qemu/lockable.h"
>> +#include "migration/misc.h"
>> +#include "multifd.h"
>> +
>> +static struct {
>> +    QemuMutex queue_job_mutex;
>> +
>> +    MultiFDSendData *send_data;
>> +} *multifd_send_device_state;
>> +
>> +size_t multifd_device_state_payload_size(void)
>> +{
>> +    return sizeof(MultiFDDeviceState_t);
>> +}
>> +
>> +void multifd_device_state_send_setup(void)
>> +{
>> +    assert(!multifd_send_device_state);
>> +    multifd_send_device_state = g_malloc(sizeof(*multifd_send_device_state));
>> +
>> +    qemu_mutex_init(&multifd_send_device_state->queue_job_mutex);
>> +
>> +    multifd_send_device_state->send_data = multifd_send_data_alloc();
>> +}
>> +
>> +void multifd_device_state_send_cleanup(void)
>> +{
>> +    g_clear_pointer(&multifd_send_device_state->send_data,
>> +                    multifd_send_data_free);
>> +
>> +    qemu_mutex_destroy(&multifd_send_device_state->queue_job_mutex);
>> +
>> +    g_clear_pointer(&multifd_send_device_state, g_free);
>> +}
>> +
>> +void multifd_send_data_clear_device_state(MultiFDDeviceState_t *device_state)
>> +{
>> +    g_clear_pointer(&device_state->idstr, g_free);
>> +    g_clear_pointer(&device_state->buf, g_free);
>> +}
>> +
>> +static void multifd_device_state_fill_packet(MultiFDSendParams *p)
>> +{
>> +    MultiFDDeviceState_t *device_state = &p->data->u.device_state;
>> +    MultiFDPacketDeviceState_t *packet = p->packet_device_state;
>> +
>> +    packet->hdr.flags = cpu_to_be32(p->flags);
>> +    strncpy(packet->idstr, device_state->idstr, sizeof(packet->idstr));
> 
> (I think we talked about this in v2):
> Looking at idstr creation code, idstr is always NULL terminated. It's also treated everywhere as a NULL terminated string.
> For consistency and to avoid confusion, I'd treat it as a NULL terminated string here too (use strcpy, remove the QEMU_NONSTRING from its definition, etc.).

Changed to NULL-terminated since AFAIK RAM idstr was also changed to such
in the meantime.

> This will also avoid strncpy() unnecessary zeroing of the extra bytes.

Zeroing of the remaining space is still necessary since it's a wire
packet data structure that's re-used between packets, so it could
still contain the remainder of a previous, longer idstr.
  
> Thanks.
> 

Thanks,
Maciej




* Re: [PATCH v5 26/36] vfio/migration: Multifd device state transfer support - received buffers queuing
  2025-03-02 13:12   ` Avihai Horon
@ 2025-03-03 22:15     ` Maciej S. Szmigiero
  0 siblings, 0 replies; 120+ messages in thread
From: Maciej S. Szmigiero @ 2025-03-03 22:15 UTC (permalink / raw)
  To: Avihai Horon
  Cc: Alex Williamson, Peter Xu, Fabiano Rosas, Cédric Le Goater,
	Eric Blake, Markus Armbruster, Daniel P . Berrangé,
	Joao Martins, qemu-devel

On 2.03.2025 14:12, Avihai Horon wrote:
> 
> On 19/02/2025 22:34, Maciej S. Szmigiero wrote:
>> External email: Use caution opening links or attachments
>>
>>
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> The multifd received data needs to be reassembled since device state
>> packets sent via different multifd channels can arrive out-of-order.
>>
>> Therefore, each VFIO device state packet carries a header indicating its
>> position in the stream.
>> The raw device state data is saved into a VFIOStateBuffer for later
>> in-order loading into the device.
>>
>> The last such VFIO device state packet should have
>> VFIO_DEVICE_STATE_CONFIG_STATE flag set and carry the device config state.
>>
>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>> ---
>>   hw/vfio/migration-multifd.c | 103 ++++++++++++++++++++++++++++++++++++
>>   hw/vfio/migration-multifd.h |   3 ++
>>   hw/vfio/migration.c         |   1 +
>>   hw/vfio/trace-events        |   1 +
>>   4 files changed, 108 insertions(+)
>>
>> diff --git a/hw/vfio/migration-multifd.c b/hw/vfio/migration-multifd.c
>> index c2defc0efef0..5d5ee1393674 100644
>> --- a/hw/vfio/migration-multifd.c
>> +++ b/hw/vfio/migration-multifd.c
>> @@ -42,6 +42,11 @@ typedef struct VFIOStateBuffer {
>>   } VFIOStateBuffer;
>>
>>   typedef struct VFIOMultifd {
>> +    VFIOStateBuffers load_bufs;
>> +    QemuCond load_bufs_buffer_ready_cond;
>> +    QemuMutex load_bufs_mutex; /* Lock order: this lock -> BQL */
>> +    uint32_t load_buf_idx;
>> +    uint32_t load_buf_idx_last;
>>   } VFIOMultifd;
>>
>>   static void vfio_state_buffer_clear(gpointer data)
>> @@ -87,15 +92,113 @@ static VFIOStateBuffer *vfio_state_buffers_at(VFIOStateBuffers *bufs, guint idx)
>>       return &g_array_index(bufs->array, VFIOStateBuffer, idx);
>>   }
>>
>> +static bool vfio_load_state_buffer_insert(VFIODevice *vbasedev,
>> +                                          VFIODeviceStatePacket *packet,
>> +                                          size_t packet_total_size,
>> +                                          Error **errp)
>> +{
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    VFIOMultifd *multifd = migration->multifd;
>> +    VFIOStateBuffer *lb;
>> +
>> +    vfio_state_buffers_assert_init(&multifd->load_bufs);
>> +    if (packet->idx >= vfio_state_buffers_size_get(&multifd->load_bufs)) {
>> +        vfio_state_buffers_size_set(&multifd->load_bufs, packet->idx + 1);
>> +    }
>> +
>> +    lb = vfio_state_buffers_at(&multifd->load_bufs, packet->idx);
>> +    if (lb->is_present) {
>> +        error_setg(errp, "state buffer %" PRIu32 " already filled",
>> +                   packet->idx);
> 
> Let's add vbasedev->name to the error message so we know which device caused the error.

Done.

>> +        return false;
>> +    }
>> +
>> +    assert(packet->idx >= multifd->load_buf_idx);
>> +
>> +    lb->data = g_memdup2(&packet->data, packet_total_size - sizeof(*packet));
>> +    lb->len = packet_total_size - sizeof(*packet);
>> +    lb->is_present = true;
>> +
>> +    return true;
>> +}
>> +
>> +bool vfio_load_state_buffer(void *opaque, char *data, size_t data_size,
>> +                            Error **errp)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    VFIOMultifd *multifd = migration->multifd;
>> +    VFIODeviceStatePacket *packet = (VFIODeviceStatePacket *)data;
>> +
>> +    /*
>> +     * Holding BQL here would violate the lock order and can cause
>> +     * a deadlock once we attempt to lock load_bufs_mutex below.
>> +     */
>> +    assert(!bql_locked());
> 
> To be clearer, I'd move the assert down to be just above "QEMU_LOCK_GUARD(&multifd->load_bufs_mutex);".

Moved there.
  
>> +
>> +    if (!vfio_multifd_transfer_enabled(vbasedev)) {
>> +        error_setg(errp,
>> +                   "got device state packet but not doing multifd transfer");
> 
> Let's add vbasedev->name to the error message so we know which device caused the error.

Done.

>> +        return false;
>> +    }
>> +
>> +    assert(multifd);
>> +
>> +    if (data_size < sizeof(*packet)) {
>> +        error_setg(errp, "packet too short at %zu (min is %zu)",
>> +                   data_size, sizeof(*packet));
> 
> Ditto.

Done.

>> +        return false;
>> +    }
>> +
>> +    if (packet->version != VFIO_DEVICE_STATE_PACKET_VER_CURRENT) {
>> +        error_setg(errp, "packet has unknown version %" PRIu32,
>> +                   packet->version);
> 
> Ditto.

Done.

>> +        return false;
>> +    }
>> +
>> +    if (packet->idx == UINT32_MAX) {
>> +        error_setg(errp, "packet has too high idx");
> 
> Ditto.

Done.

>> +        return false;
>> +    }
>> +
>> +    trace_vfio_load_state_device_buffer_incoming(vbasedev->name, packet->idx);
>> +
>> +    QEMU_LOCK_GUARD(&multifd->load_bufs_mutex);
>> +
>> +    /* config state packet should be the last one in the stream */
>> +    if (packet->flags & VFIO_DEVICE_STATE_CONFIG_STATE) {
>> +        multifd->load_buf_idx_last = packet->idx;
>> +    }
>> +
>> +    if (!vfio_load_state_buffer_insert(vbasedev, packet, data_size, errp)) {
>> +        return false;
>> +    }
>> +
>> +    qemu_cond_signal(&multifd->load_bufs_buffer_ready_cond);
>> +
>> +    return true;
>> +}
>> +
>>   VFIOMultifd *vfio_multifd_new(void)
>>   {
>>       VFIOMultifd *multifd = g_new(VFIOMultifd, 1);
>>
>> +    vfio_state_buffers_init(&multifd->load_bufs);
>> +
>> +    qemu_mutex_init(&multifd->load_bufs_mutex);
> 
> Nit: move qemu_mutex_init() just above qemu_cond_init()?

It's in a separate "block" because it is common to all 3
conditions in the ultimate form of the code (and to most of the
later variables too), rather than just this single condition.

> Thanks.

Thanks,
Maciej

>> +
>> +    multifd->load_buf_idx = 0;
>> +    multifd->load_buf_idx_last = UINT32_MAX;
>> +    qemu_cond_init(&multifd->load_bufs_buffer_ready_cond);
>> +
>>       return multifd;
>>   }
>>





* Re: [PATCH v5 27/36] vfio/migration: Multifd device state transfer support - load thread
  2025-03-02 14:15   ` Avihai Horon
@ 2025-03-03 22:16     ` Maciej S. Szmigiero
  2025-03-04 11:21       ` Avihai Horon
  0 siblings, 1 reply; 120+ messages in thread
From: Maciej S. Szmigiero @ 2025-03-03 22:16 UTC (permalink / raw)
  To: Avihai Horon
  Cc: Alex Williamson, Cédric Le Goater, Peter Xu, Fabiano Rosas,
	Eric Blake, Markus Armbruster, Daniel P . Berrangé,
	Joao Martins, qemu-devel

On 2.03.2025 15:15, Avihai Horon wrote:
> 
> On 19/02/2025 22:34, Maciej S. Szmigiero wrote:
>> External email: Use caution opening links or attachments
>>
>>
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> 
> Maybe add a sentence talking about the load thread itself first? E.g.:
> 
> Add a thread which loads the VFIO device state buffers that were received via multifd.
> Each VFIO device that has multifd device state transfer enabled has one such thread, which is created using migration core API qemu_loadvm_start_load_thread().
> 
> Since it's important to finish...

Added such leading text to the commit message for this patch.

>> Since it's important to finish loading device state transferred via the
>> main migration channel (via save_live_iterate SaveVMHandler) before
>> starting loading the data asynchronously transferred via multifd the thread
>> doing the actual loading of the multifd transferred data is only started
>> from switchover_start SaveVMHandler.
>>
>> switchover_start handler is called when MIG_CMD_SWITCHOVER_START
>> sub-command of QEMU_VM_COMMAND is received via the main migration channel.
>>
>> This sub-command is only sent after all save_live_iterate data have already
>> been posted so it is safe to commence loading of the multifd-transferred
>> device state upon receiving it - loading of save_live_iterate data happens
>> synchronously in the main migration thread (much like the processing of
>> MIG_CMD_SWITCHOVER_START) so by the time MIG_CMD_SWITCHOVER_START is
>> processed all the preceding data must have already been loaded.
>>
>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>> ---
>>   hw/vfio/migration-multifd.c | 225 ++++++++++++++++++++++++++++++++++++
>>   hw/vfio/migration-multifd.h |   2 +
>>   hw/vfio/migration.c         |  12 ++
>>   hw/vfio/trace-events        |   5 +
>>   4 files changed, 244 insertions(+)
>>
>> diff --git a/hw/vfio/migration-multifd.c b/hw/vfio/migration-multifd.c
>> index 5d5ee1393674..b3a88c062769 100644
>> --- a/hw/vfio/migration-multifd.c
>> +++ b/hw/vfio/migration-multifd.c
>> @@ -42,8 +42,13 @@ typedef struct VFIOStateBuffer {
>>   } VFIOStateBuffer;
>>
>>   typedef struct VFIOMultifd {
>> +    QemuThread load_bufs_thread;
> 
> This can be dropped.

Yeah - it was a remainder from pre-load-thread days of v2.

Dropped now.

>> +    bool load_bufs_thread_running;
>> +    bool load_bufs_thread_want_exit;
>> +
>>       VFIOStateBuffers load_bufs;
>>       QemuCond load_bufs_buffer_ready_cond;
>> +    QemuCond load_bufs_thread_finished_cond;
>>       QemuMutex load_bufs_mutex; /* Lock order: this lock -> BQL */
>>       uint32_t load_buf_idx;
>>       uint32_t load_buf_idx_last;
>> @@ -179,6 +184,175 @@ bool vfio_load_state_buffer(void *opaque, char *data, size_t data_size,
>>       return true;
>>   }
>>
>> +static int vfio_load_bufs_thread_load_config(VFIODevice *vbasedev)
>> +{
>> +    return -EINVAL;
>> +}
>> +
>> +static VFIOStateBuffer *vfio_load_state_buffer_get(VFIOMultifd *multifd)
>> +{
>> +    VFIOStateBuffer *lb;
>> +    guint bufs_len;
>> +
>> +    bufs_len = vfio_state_buffers_size_get(&multifd->load_bufs);
>> +    if (multifd->load_buf_idx >= bufs_len) {
>> +        assert(multifd->load_buf_idx == bufs_len);
>> +        return NULL;
>> +    }
>> +
>> +    lb = vfio_state_buffers_at(&multifd->load_bufs,
>> +                               multifd->load_buf_idx);
>> +    if (!lb->is_present) {
>> +        return NULL;
>> +    }
>> +
>> +    return lb;
>> +}
>> +
>> +static bool vfio_load_state_buffer_write(VFIODevice *vbasedev,
>> +                                         VFIOStateBuffer *lb,
>> +                                         Error **errp)
>> +{
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    VFIOMultifd *multifd = migration->multifd;
>> +    g_autofree char *buf = NULL;
>> +    char *buf_cur;
>> +    size_t buf_len;
>> +
>> +    if (!lb->len) {
>> +        return true;
>> +    }
>> +
>> +    trace_vfio_load_state_device_buffer_load_start(vbasedev->name,
>> +                                                   multifd->load_buf_idx);
>> +
>> +    /* lb might become re-allocated when we drop the lock */
>> +    buf = g_steal_pointer(&lb->data);
>> +    buf_cur = buf;
>> +    buf_len = lb->len;
>> +    while (buf_len > 0) {
>> +        ssize_t wr_ret;
>> +        int errno_save;
>> +
>> +        /*
>> +         * Loading data to the device takes a while,
>> +         * drop the lock during this process.
>> +         */
>> +        qemu_mutex_unlock(&multifd->load_bufs_mutex);
>> +        wr_ret = write(migration->data_fd, buf_cur, buf_len);
>> +        errno_save = errno;
>> +        qemu_mutex_lock(&multifd->load_bufs_mutex);
>> +
>> +        if (wr_ret < 0) {
>> +            error_setg(errp,
>> +                       "writing state buffer %" PRIu32 " failed: %d",
>> +                       multifd->load_buf_idx, errno_save);
> 
> Let's add vbasedev->name to the error message so we know which device caused the error.

Done.

>> +            return false;
>> +        }
>> +
>> +        assert(wr_ret <= buf_len);
> 
> I think this assert is redundant: we write buf_len bytes and by definition of write() wr_ret will be <= buf_len.

It's for catching the case where the "definition" for some reason does not match reality,
since that would result in reading well past the buffer.

That's why it's an assert, not an error return.
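For illustration, here is a minimal stand-alone sketch of the pattern under discussion (hypothetical helper name, not the actual QEMU code): write() may perform a short write, so the caller loops, and the defensive assert() guards against a misbehaving fd implementation returning more bytes than requested, which would otherwise advance the cursor past the end of the buffer:

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>

/* Write exactly len bytes to fd, looping over short writes. */
static int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t wr_ret = write(fd, buf, len);

        if (wr_ret < 0) {
            if (errno == EINTR) {
                continue; /* interrupted before any byte was written */
            }
            return -errno;
        }

        /* Catch a broken fd implementation before reading past buf. */
        assert((size_t)wr_ret <= len);

        buf += wr_ret;
        len -= wr_ret;
    }
    return 0;
}
```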

>> +        buf_len -= wr_ret;
>> +        buf_cur += wr_ret;
>> +    }
>> +
>> +    trace_vfio_load_state_device_buffer_load_end(vbasedev->name,
>> +                                                 multifd->load_buf_idx);
>> +
>> +    return true;
>> +}
>> +
>> +static bool vfio_load_bufs_thread_want_exit(VFIOMultifd *multifd,
>> +                                            bool *should_quit)
>> +{
>> +    return multifd->load_bufs_thread_want_exit || qatomic_read(should_quit);
>> +}
>> +
>> +/*
>> + * This thread is spawned by vfio_multifd_switchover_start() which gets
>> + * called upon encountering the switchover point marker in main migration
>> + * stream.
>> + *
>> + * It exits after either:
>> + * * completing loading the remaining device state and device config, OR:
>> + * * encountering some error while doing the above, OR:
>> + * * being forcefully aborted by the migration core by it setting should_quit
>> + *   or by vfio_load_cleanup_load_bufs_thread() setting
>> + *   multifd->load_bufs_thread_want_exit.
>> + */
>> +static bool vfio_load_bufs_thread(void *opaque, bool *should_quit, Error **errp)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    VFIOMultifd *multifd = migration->multifd;
>> +    bool ret = true;
>> +    int config_ret;
>> +
>> +    assert(multifd);
>> +    QEMU_LOCK_GUARD(&multifd->load_bufs_mutex);
>> +
>> +    assert(multifd->load_bufs_thread_running);
>> +
>> +    while (true) {
>> +        VFIOStateBuffer *lb;
>> +
>> +        /*
>> +         * Always check cancellation first after the buffer_ready wait below in
>> +         * case that cond was signalled by vfio_load_cleanup_load_bufs_thread().
>> +         */
>> +        if (vfio_load_bufs_thread_want_exit(multifd, should_quit)) {
>> +            error_setg(errp, "operation cancelled");
>> +            ret = false;
>> +            goto ret_signal;
> 
> IIUC, if vfio_load_bufs_thread_want_exit() returns true, it means that some other code part already failed and set migration error, no?
> If so, shouldn't we return true here? After all, vfio_load_bufs_thread didn't really fail, it just got signal to terminate itself.

The thread didn't succeed in loading all the data either, but got cancelled.

It's a similar logic as Glib's GIO returning G_IO_ERROR_CANCELLED if the operation
got cancelled.

In a GTask a pending cancellation will even overwrite any other error or value
that the task tried to return (at least by default).

>> +        }
>> +
>> +        assert(multifd->load_buf_idx <= multifd->load_buf_idx_last);
>> +
>> +        lb = vfio_load_state_buffer_get(multifd);
>> +        if (!lb) {
>> +            trace_vfio_load_state_device_buffer_starved(vbasedev->name,
>> +                                                        multifd->load_buf_idx);
>> +            qemu_cond_wait(&multifd->load_bufs_buffer_ready_cond,
>> +                           &multifd->load_bufs_mutex);
>> +            continue;
>> +        }
>> +
>> +        if (multifd->load_buf_idx == multifd->load_buf_idx_last) {
>> +            break;
>> +        }
>> +
>> +        if (multifd->load_buf_idx == 0) {
>> +            trace_vfio_load_state_device_buffer_start(vbasedev->name);
>> +        }
>> +
>> +        if (!vfio_load_state_buffer_write(vbasedev, lb, errp)) {
>> +            ret = false;
>> +            goto ret_signal;
>> +        }
>> +
>> +        if (multifd->load_buf_idx == multifd->load_buf_idx_last - 1) {
>> +            trace_vfio_load_state_device_buffer_end(vbasedev->name);
>> +        }
>> +
>> +        multifd->load_buf_idx++;
>> +    }
>> +
>> +    config_ret = vfio_load_bufs_thread_load_config(vbasedev);
>> +    if (config_ret) {
>> +        error_setg(errp, "load config state failed: %d", config_ret);
> 
> Let's add vbasedev->name to the error message so we know which device caused the error.

This line is not present anymore in the current version of the code,
but applied such change to all error_setg() calls in vfio_load_bufs_thread_load_config()
instead.

>> +        ret = false;
>> +    }
>> +
>> +ret_signal:
>> +    /*
>> +     * Notify possibly waiting vfio_load_cleanup_load_bufs_thread() that
>> +     * this thread is exiting.
>> +     */
>> +    multifd->load_bufs_thread_running = false;
>> +    qemu_cond_signal(&multifd->load_bufs_thread_finished_cond);
>> +
>> +    return ret;
>> +}
>> +
>>   VFIOMultifd *vfio_multifd_new(void)
>>   {
>>       VFIOMultifd *multifd = g_new(VFIOMultifd, 1);
>> @@ -191,11 +365,42 @@ VFIOMultifd *vfio_multifd_new(void)
>>       multifd->load_buf_idx_last = UINT32_MAX;
>>       qemu_cond_init(&multifd->load_bufs_buffer_ready_cond);
>>
>> +    multifd->load_bufs_thread_running = false;
>> +    multifd->load_bufs_thread_want_exit = false;
>> +    qemu_cond_init(&multifd->load_bufs_thread_finished_cond);
>> +
>>       return multifd;
>>   }
>>
>> +/*
>> + * Terminates vfio_load_bufs_thread by setting
>> + * multifd->load_bufs_thread_want_exit and signalling all the conditions
>> + * the thread could be blocked on.
>> + *
>> + * Waits for the thread to signal that it had finished.
>> + */
>> +static void vfio_load_cleanup_load_bufs_thread(VFIOMultifd *multifd)
>> +{
>> +    /* The lock order is load_bufs_mutex -> BQL so unlock BQL here first */
>> +    bql_unlock();
>> +    WITH_QEMU_LOCK_GUARD(&multifd->load_bufs_mutex) {
>> +        while (multifd->load_bufs_thread_running) {
>> +            multifd->load_bufs_thread_want_exit = true;
>> +
>> +            qemu_cond_signal(&multifd->load_bufs_buffer_ready_cond);
>> +            qemu_cond_wait(&multifd->load_bufs_thread_finished_cond,
>> +                           &multifd->load_bufs_mutex);
>> +        }
>> +    }
>> +    bql_lock();
>> +}
>> +
>>   void vfio_multifd_free(VFIOMultifd *multifd)
>>   {
>> +    vfio_load_cleanup_load_bufs_thread(multifd);
>> +
>> +    qemu_cond_destroy(&multifd->load_bufs_thread_finished_cond);
>> +    vfio_state_buffers_destroy(&multifd->load_bufs);
> 
> vfio_state_buffers_destroy(&multifd->load_bufs); belongs to patch #26, no?

Yeah - moved it there.
  
> Thanks.

Thanks,
Maciej



^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v5 27/36] vfio/migration: Multifd device state transfer support - load thread
  2025-03-02 14:19     ` Avihai Horon
@ 2025-03-03 22:16       ` Maciej S. Szmigiero
  0 siblings, 0 replies; 120+ messages in thread
From: Maciej S. Szmigiero @ 2025-03-03 22:16 UTC (permalink / raw)
  To: Avihai Horon
  Cc: Alex Williamson, Eric Blake, Cédric Le Goater, Peter Xu,
	Fabiano Rosas, Markus Armbruster, Daniel P . Berrangé,
	Joao Martins, qemu-devel

On 2.03.2025 15:19, Avihai Horon wrote:
> 
> On 26/02/2025 15:49, Cédric Le Goater wrote:
>> External email: Use caution opening links or attachments
>>
>>
>> On 2/19/25 21:34, Maciej S. Szmigiero wrote:
>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>
>>> Since it's important to finish loading device state transferred via the
>>> main migration channel (via save_live_iterate SaveVMHandler) before
>>> starting loading the data asynchronously transferred via multifd the thread
>>> doing the actual loading of the multifd transferred data is only started
>>> from switchover_start SaveVMHandler.
>>>
>>> switchover_start handler is called when MIG_CMD_SWITCHOVER_START
>>> sub-command of QEMU_VM_COMMAND is received via the main migration channel.
>>>
>>> This sub-command is only sent after all save_live_iterate data have already
>>> been posted so it is safe to commence loading of the multifd-transferred
>>> device state upon receiving it - loading of save_live_iterate data happens
>>> synchronously in the main migration thread (much like the processing of
>>> MIG_CMD_SWITCHOVER_START) so by the time MIG_CMD_SWITCHOVER_START is
>>> processed all the preceding data must have already been loaded.
>>>
>>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>>> ---
>>>   hw/vfio/migration-multifd.c | 225 ++++++++++++++++++++++++++++++++++++
>>>   hw/vfio/migration-multifd.h |   2 +
>>>   hw/vfio/migration.c         |  12 ++
>>>   hw/vfio/trace-events        |   5 +
>>>   4 files changed, 244 insertions(+)
>>>
>>> diff --git a/hw/vfio/migration-multifd.c b/hw/vfio/migration-multifd.c
>>> index 5d5ee1393674..b3a88c062769 100644
>>> --- a/hw/vfio/migration-multifd.c
>>> +++ b/hw/vfio/migration-multifd.c
>>> @@ -42,8 +42,13 @@ typedef struct VFIOStateBuffer {
>>>   } VFIOStateBuffer;
>>>
>>>   typedef struct VFIOMultifd {
>>> +    QemuThread load_bufs_thread;
>>> +    bool load_bufs_thread_running;
>>> +    bool load_bufs_thread_want_exit;
>>> +
>>>       VFIOStateBuffers load_bufs;
>>>       QemuCond load_bufs_buffer_ready_cond;
>>> +    QemuCond load_bufs_thread_finished_cond;
>>>       QemuMutex load_bufs_mutex; /* Lock order: this lock -> BQL */
>>>       uint32_t load_buf_idx;
>>>       uint32_t load_buf_idx_last;
>>> @@ -179,6 +184,175 @@ bool vfio_load_state_buffer(void *opaque, char *data, size_t data_size,
>>>       return true;
>>>   }
>>>
>>> +static int vfio_load_bufs_thread_load_config(VFIODevice *vbasedev)
>>> +{
>>> +    return -EINVAL;
>>> +}
>>
>>
>> please move to next patch.
>>
>>> +static VFIOStateBuffer *vfio_load_state_buffer_get(VFIOMultifd *multifd)
>>> +{
>>> +    VFIOStateBuffer *lb;
>>> +    guint bufs_len;
>>
>> guint:  I guess it's ok to use here. It is not common practice in VFIO.
> 
> Glib documentation says that in new code unsigned int is preferred over guint [1].

I turned guints into unsigned ints where I spotted them in this patch set.
  
> Thanks.

Thanks,
Maciej

> [1] https://docs.gtk.org/glib/types.html#guint
> 



^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v5 28/36] vfio/migration: Multifd device state transfer support - config loading support
  2025-03-02 14:25   ` Avihai Horon
@ 2025-03-03 22:17     ` Maciej S. Szmigiero
  2025-03-04  7:41       ` Cédric Le Goater
  0 siblings, 1 reply; 120+ messages in thread
From: Maciej S. Szmigiero @ 2025-03-03 22:17 UTC (permalink / raw)
  To: Avihai Horon
  Cc: Alex Williamson, Peter Xu, Fabiano Rosas, Cédric Le Goater,
	Eric Blake, Markus Armbruster, Daniel P . Berrangé,
	Joao Martins, qemu-devel

On 2.03.2025 15:25, Avihai Horon wrote:
> 
> On 19/02/2025 22:34, Maciej S. Szmigiero wrote:
>> External email: Use caution opening links or attachments
>>
>>
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> Load device config received via multifd using the existing machinery
>> behind vfio_load_device_config_state().
>>
>> Also, make sure to process the relevant main migration channel flags.
>>
>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>> ---
>>   hw/vfio/migration-multifd.c   | 47 ++++++++++++++++++++++++++++++++++-
>>   hw/vfio/migration.c           |  8 +++++-
>>   include/hw/vfio/vfio-common.h |  2 ++
>>   3 files changed, 55 insertions(+), 2 deletions(-)
>>
>> diff --git a/hw/vfio/migration-multifd.c b/hw/vfio/migration-multifd.c
>> index b3a88c062769..7200f6f1c2a2 100644
>> --- a/hw/vfio/migration-multifd.c
>> +++ b/hw/vfio/migration-multifd.c
>> @@ -15,6 +15,7 @@
>>   #include "qemu/lockable.h"
>>   #include "qemu/main-loop.h"
>>   #include "qemu/thread.h"
>> +#include "io/channel-buffer.h"
>>   #include "migration/qemu-file.h"
>>   #include "migration-multifd.h"
>>   #include "trace.h"
>> @@ -186,7 +187,51 @@ bool vfio_load_state_buffer(void *opaque, char *data, size_t data_size,
>>
>>   static int vfio_load_bufs_thread_load_config(VFIODevice *vbasedev)
>>   {
>> -    return -EINVAL;
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    VFIOMultifd *multifd = migration->multifd;
>> +    VFIOStateBuffer *lb;
>> +    g_autoptr(QIOChannelBuffer) bioc = NULL;
>> +    QEMUFile *f_out = NULL, *f_in = NULL;
> 
> Can we move patch #29 before this one and use g_autoptr() for f_out an f_in?

Sure, that's a good idea - done now.

>> +    uint64_t mig_header;
>> +    int ret;
>> +
>> +    assert(multifd->load_buf_idx == multifd->load_buf_idx_last);
>> +    lb = vfio_state_buffers_at(&multifd->load_bufs, multifd->load_buf_idx);
>> +    assert(lb->is_present);
>> +
>> +    bioc = qio_channel_buffer_new(lb->len);
>> +    qio_channel_set_name(QIO_CHANNEL(bioc), "vfio-device-config-load");
>> +
>> +    f_out = qemu_file_new_output(QIO_CHANNEL(bioc));
>> +    qemu_put_buffer(f_out, (uint8_t *)lb->data, lb->len);
>> +
>> +    ret = qemu_fflush(f_out);
>> +    if (ret) {
>> +        g_clear_pointer(&f_out, qemu_fclose);
>> +        return ret;
>> +    }
>> +
>> +    qio_channel_io_seek(QIO_CHANNEL(bioc), 0, 0, NULL);
>> +    f_in = qemu_file_new_input(QIO_CHANNEL(bioc));
>> +
>> +    mig_header = qemu_get_be64(f_in);
>> +    if (mig_header != VFIO_MIG_FLAG_DEV_CONFIG_STATE) {
>> +        g_clear_pointer(&f_out, qemu_fclose);
>> +        g_clear_pointer(&f_in, qemu_fclose);
>> +        return -EINVAL;
>> +    }
>> +
>> +    bql_lock();
>> +    ret = vfio_load_device_config_state(f_in, vbasedev);
>> +    bql_unlock();
>> +
>> +    g_clear_pointer(&f_out, qemu_fclose);
>> +    g_clear_pointer(&f_in, qemu_fclose);
>> +    if (ret < 0) {
>> +        return ret;
>> +    }
>> +
>> +    return 0;
>>   }
>>
>>   static VFIOStateBuffer *vfio_load_state_buffer_get(VFIOMultifd *multifd)
>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>> index 85f54cb22df2..b962309f7c27 100644
>> --- a/hw/vfio/migration.c
>> +++ b/hw/vfio/migration.c
>> @@ -264,7 +264,7 @@ static int vfio_save_device_config_state(QEMUFile *f, void *opaque,
>>       return ret;
>>   }
>>
>> -static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
>> +int vfio_load_device_config_state(QEMUFile *f, void *opaque)
>>   {
>>       VFIODevice *vbasedev = opaque;
>>       uint64_t data;
>> @@ -728,6 +728,12 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
>>           switch (data) {
>>           case VFIO_MIG_FLAG_DEV_CONFIG_STATE:
>>           {
>> +            if (vfio_multifd_transfer_enabled(vbasedev)) {
>> +                error_report("%s: got DEV_CONFIG_STATE but doing multifd transfer",
>> +                             vbasedev->name);
> 
> To make clearer, maybe change to:
> "%s: got DEV_CONFIG_STATE in main migration channel but doing multifd transfer"

That normally would be a good idea, however we are already at 83 characters on this
line and won't fit that many more words into this string.
  
> Thanks.

Thanks,
Maciej



^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v5 30/36] vfio/migration: Multifd device state transfer support - send side
  2025-03-02 14:41   ` Avihai Horon
@ 2025-03-03 22:17     ` Maciej S. Szmigiero
  0 siblings, 0 replies; 120+ messages in thread
From: Maciej S. Szmigiero @ 2025-03-03 22:17 UTC (permalink / raw)
  To: Avihai Horon
  Cc: Alex Williamson, Cédric Le Goater, Peter Xu, Fabiano Rosas,
	Eric Blake, Markus Armbruster, Daniel P . Berrangé,
	Joao Martins, qemu-devel

On 2.03.2025 15:41, Avihai Horon wrote:
> 
> On 19/02/2025 22:34, Maciej S. Szmigiero wrote:
>> External email: Use caution opening links or attachments
>>
>>
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> Implement the multifd device state transfer via additional per-device
>> thread inside save_live_complete_precopy_thread handler.
>>
>> Switch between doing the data transfer in the new handler and doing it
>> in the old save_state handler depending on the
>> x-migration-multifd-transfer device property value.
> 
> x-migration-multifd-transfer is not yet introduced. Maybe rephrase to:
> 
> ... depending if VFIO multifd transfer is enabled or not.

Changed accordingly.

>>
>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>> ---
>>   hw/vfio/migration-multifd.c   | 139 ++++++++++++++++++++++++++++++++++
>>   hw/vfio/migration-multifd.h   |   5 ++
>>   hw/vfio/migration.c           |  26 +++++--
>>   hw/vfio/trace-events          |   2 +
>>   include/hw/vfio/vfio-common.h |   8 ++
>>   5 files changed, 174 insertions(+), 6 deletions(-)
>>
>> diff --git a/hw/vfio/migration-multifd.c b/hw/vfio/migration-multifd.c
>> index 7200f6f1c2a2..0cfa9d31732a 100644
>> --- a/hw/vfio/migration-multifd.c
>> +++ b/hw/vfio/migration-multifd.c
>> @@ -476,6 +476,145 @@ bool vfio_multifd_transfer_setup(VFIODevice *vbasedev, Error **errp)
>>       return true;
>>   }
>>
>> +void vfio_multifd_emit_dummy_eos(VFIODevice *vbasedev, QEMUFile *f)
>> +{
>> +    assert(vfio_multifd_transfer_enabled(vbasedev));
>> +
>> +    /*
>> +     * Emit dummy NOP data on the main migration channel since the actual
>> +     * device state transfer is done via multifd channels.
>> +     */
>> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>> +}
>> +
>> +static bool
>> +vfio_save_complete_precopy_thread_config_state(VFIODevice *vbasedev,
>> +                                               char *idstr,
>> +                                               uint32_t instance_id,
>> +                                               uint32_t idx,
>> +                                               Error **errp)
>> +{
>> +    g_autoptr(QIOChannelBuffer) bioc = NULL;
>> +    g_autoptr(QEMUFile) f = NULL;
>> +    int ret;
>> +    g_autofree VFIODeviceStatePacket *packet = NULL;
>> +    size_t packet_len;
>> +
>> +    bioc = qio_channel_buffer_new(0);
>> +    qio_channel_set_name(QIO_CHANNEL(bioc), "vfio-device-config-save");
>> +
>> +    f = qemu_file_new_output(QIO_CHANNEL(bioc));
>> +
>> +    if (vfio_save_device_config_state(f, vbasedev, errp)) {
>> +        return false;
>> +    }
>> +
>> +    ret = qemu_fflush(f);
>> +    if (ret) {
>> +        error_setg(errp, "save config state flush failed: %d", ret);
> 
> Let's add vbasedev->name to the error message so we know which device caused the error.

Done.

>> +        return false;
>> +    }
>> +
>> +    packet_len = sizeof(*packet) + bioc->usage;
>> +    packet = g_malloc0(packet_len);
>> +    packet->version = VFIO_DEVICE_STATE_PACKET_VER_CURRENT;
>> +    packet->idx = idx;
>> +    packet->flags = VFIO_DEVICE_STATE_CONFIG_STATE;
>> +    memcpy(&packet->data, bioc->data, bioc->usage);
>> +
>> +    if (!multifd_queue_device_state(idstr, instance_id,
>> +                                    (char *)packet, packet_len)) {
>> +        error_setg(errp, "multifd config data queuing failed");
> 
> Ditto.

Done.

>> +        return false;
>> +    }
>> +
>> +    vfio_add_bytes_transferred(packet_len);
>> +
>> +    return true;
>> +}
>> +
>> +/*
>> + * This thread is spawned by the migration core directly via
>> + * .save_live_complete_precopy_thread SaveVMHandler.
>> + *
>> + * It exits after either:
>> + * * completing saving the remaining device state and device config, OR:
>> + * * encountering some error while doing the above, OR:
>> + * * being forcefully aborted by the migration core by
>> + *   multifd_device_state_save_thread_should_exit() returning true.
>> + */
>> +bool vfio_save_complete_precopy_thread(SaveLiveCompletePrecopyThreadData *d,
>> +                                       Error **errp)
>> +{
>> +    VFIODevice *vbasedev = d->handler_opaque;
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    bool ret;
>> +    g_autofree VFIODeviceStatePacket *packet = NULL;
>> +    uint32_t idx;
>> +
>> +    if (!vfio_multifd_transfer_enabled(vbasedev)) {
>> +        /* Nothing to do, vfio_save_complete_precopy() does the transfer. */
>> +        return true;
>> +    }
>> +
>> +    trace_vfio_save_complete_precopy_thread_start(vbasedev->name,
>> +                                                  d->idstr, d->instance_id);
>> +
>> +    /* We reach here with device state STOP or STOP_COPY only */
>> +    if (vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP_COPY,
>> +                                 VFIO_DEVICE_STATE_STOP, errp)) {
>> +        ret = false;
>> +        goto ret_finish;
>> +    }
>> +
>> +    packet = g_malloc0(sizeof(*packet) + migration->data_buffer_size);
>> +    packet->version = VFIO_DEVICE_STATE_PACKET_VER_CURRENT;
>> +
>> +    for (idx = 0; ; idx++) {
>> +        ssize_t data_size;
>> +        size_t packet_size;
>> +
>> +        if (multifd_device_state_save_thread_should_exit()) {
>> +            error_setg(errp, "operation cancelled");
> 
> Same comment as in patch #27:
> 
> IIUC, if multifd_device_state_save_thread_should_exit() returns true, it means that some other code part already failed and set migration error, no?
> If so, shouldn't we return true here? After all, vfio_save_complete_precopy_thread didn't really fail, it just got signal to terminate itself

Same as in the "load thread" case - the thread didn't succeed in saving all the data either,
but got cancelled.

> 
>> +            ret = false;
>> +            goto ret_finish;
>> +        }
>> +
>> +        data_size = read(migration->data_fd, &packet->data,
>> +                         migration->data_buffer_size);
>> +        if (data_size < 0) {
>> +            error_setg(errp, "reading state buffer %" PRIu32 " failed: %d",
>> +                       idx, errno);
> 
> Let's add vbasedev->name to the error message so we know which device caused the error.

Done.

>> +            ret = false;
>> +            goto ret_finish;
>> +        } else if (data_size == 0) {
>> +            break;
>> +        }
>> +
>> +        packet->idx = idx;
>> +        packet_size = sizeof(*packet) + data_size;
>> +
>> +        if (!multifd_queue_device_state(d->idstr, d->instance_id,
>> +                                        (char *)packet, packet_size)) {
>> +            error_setg(errp, "multifd data queuing failed");
> 
> Ditto.

Done.

> Thanks.
> 

Thanks,
Maciej



^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v5 31/36] vfio/migration: Add x-migration-multifd-transfer VFIO property
  2025-03-02 14:48   ` Avihai Horon
@ 2025-03-03 22:17     ` Maciej S. Szmigiero
  2025-03-04 11:29       ` Avihai Horon
  0 siblings, 1 reply; 120+ messages in thread
From: Maciej S. Szmigiero @ 2025-03-03 22:17 UTC (permalink / raw)
  To: Avihai Horon
  Cc: Alex Williamson, Peter Xu, Fabiano Rosas, Cédric Le Goater,
	Eric Blake, Markus Armbruster, Daniel P . Berrangé,
	Joao Martins, qemu-devel

On 2.03.2025 15:48, Avihai Horon wrote:
> 
> On 19/02/2025 22:34, Maciej S. Szmigiero wrote:
>> External email: Use caution opening links or attachments
>>
>>
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> This property allows configuring at runtime whether to transfer the
> 
> IIUC, in this patch it's not configurable at runtime, so let's drop "at runtime".

Dropped this expression from this patch description.

>> particular device state via multifd channels when live migrating that
>> device.
>>
>> It defaults to AUTO, which means that VFIO device state transfer via
>> multifd channels is attempted in configurations that otherwise support it.
>>
>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>> ---
>>   hw/vfio/migration-multifd.c   | 17 ++++++++++++++++-
>>   hw/vfio/pci.c                 |  3 +++
>>   include/hw/vfio/vfio-common.h |  2 ++
>>   3 files changed, 21 insertions(+), 1 deletion(-)
>>
>> diff --git a/hw/vfio/migration-multifd.c b/hw/vfio/migration-multifd.c
>> index 0cfa9d31732a..18a5ff964a37 100644
>> --- a/hw/vfio/migration-multifd.c
>> +++ b/hw/vfio/migration-multifd.c
>> @@ -460,11 +460,26 @@ bool vfio_multifd_transfer_supported(void)
>>
>>   bool vfio_multifd_transfer_enabled(VFIODevice *vbasedev)
>>   {
>> -    return false;
>> +    VFIOMigration *migration = vbasedev->migration;
>> +
>> +    return migration->multifd_transfer;
>>   }
>>
>>   bool vfio_multifd_transfer_setup(VFIODevice *vbasedev, Error **errp)
>>   {
>> +    VFIOMigration *migration = vbasedev->migration;
>> +
>> +    /*
>> +     * Make a copy of this setting at the start in case it is changed
>> +     * mid-migration.
>> +     */
>> +    if (vbasedev->migration_multifd_transfer == ON_OFF_AUTO_AUTO) {
>> +        migration->multifd_transfer = vfio_multifd_transfer_supported();
>> +    } else {
>> +        migration->multifd_transfer =
>> +            vbasedev->migration_multifd_transfer == ON_OFF_AUTO_ON;
>> +    }
> 
> Making a copy of this value is only relevant for the next patch where it's turned mutable, so let's move this code to patch #32.

But we still need to handle the "AUTO" condition here, so the next
patch would need very similar code that just gets reworked into the
above.
I think that's just not worth the code churn between patches.
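In other words, the resolution latched at setup boils down to this (a sketch with a stand-in enum mirroring QEMU's OnOffAuto, not the actual code):

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for QEMU's qapi OnOffAuto tri-state. */
typedef enum { ON_OFF_AUTO_AUTO, ON_OFF_AUTO_ON, ON_OFF_AUTO_OFF } OnOffAuto;

/*
 * Resolve the property once at migration setup; the result is copied
 * into the migration state so a mid-migration property change cannot
 * flip it.
 */
static bool resolve_multifd_transfer(OnOffAuto prop, bool supported)
{
    if (prop == ON_OFF_AUTO_AUTO) {
        /* AUTO resolves to whatever the current setup supports. */
        return supported;
    }
    return prop == ON_OFF_AUTO_ON;
}
```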

> Thanks.

Thanks,
Maciej



^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v5 28/36] vfio/migration: Multifd device state transfer support - config loading support
  2025-03-03 22:17     ` Maciej S. Szmigiero
@ 2025-03-04  7:41       ` Cédric Le Goater
  2025-03-04 21:50         ` Maciej S. Szmigiero
  0 siblings, 1 reply; 120+ messages in thread
From: Cédric Le Goater @ 2025-03-04  7:41 UTC (permalink / raw)
  To: Maciej S. Szmigiero, Avihai Horon
  Cc: Alex Williamson, Peter Xu, Fabiano Rosas, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Joao Martins,
	qemu-devel

>>> @@ -728,6 +728,12 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
>>>           switch (data) {
>>>           case VFIO_MIG_FLAG_DEV_CONFIG_STATE:
>>>           {
>>> +            if (vfio_multifd_transfer_enabled(vbasedev)) {
>>> +                error_report("%s: got DEV_CONFIG_STATE but doing multifd transfer",
>>> +                             vbasedev->name);
>>
>> To make clearer, maybe change to:
>> "%s: got DEV_CONFIG_STATE in main migration channel but doing multifd transfer"
> 
> That normally would be good idea, however we are already at 83 characters in this
> line here and will not fit that many more words to this string.

The 80 characters "rule" is not strict. A clear error report is
always good to have!

Thanks,

C.




^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v5 27/36] vfio/migration: Multifd device state transfer support - load thread
  2025-03-03 22:16     ` Maciej S. Szmigiero
@ 2025-03-04 11:21       ` Avihai Horon
  0 siblings, 0 replies; 120+ messages in thread
From: Avihai Horon @ 2025-03-04 11:21 UTC (permalink / raw)
  To: Maciej S. Szmigiero
  Cc: Alex Williamson, Cédric Le Goater, Peter Xu, Fabiano Rosas,
	Eric Blake, Markus Armbruster, Daniel P . Berrangé,
	Joao Martins, qemu-devel


On 04/03/2025 0:16, Maciej S. Szmigiero wrote:
> External email: Use caution opening links or attachments
>
>
> On 2.03.2025 15:15, Avihai Horon wrote:
>>
>> On 19/02/2025 22:34, Maciej S. Szmigiero wrote:
>>> External email: Use caution opening links or attachments
>>>
>>>
>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> Maybe add a sentence talking about the load thread itself first? E.g.:
>>
>> Add a thread which loads the VFIO device state buffers that were 
>> received via multifd.
>> Each VFIO device that has multifd device state transfer enabled has 
>> one such thread, which is created using migration core API 
>> qemu_loadvm_start_load_thread().
>>
>> Since it's important to finish...
>
> Added such leading text to the commit message for this patch.
>
>>> Since it's important to finish loading device state transferred via the
>>> main migration channel (via save_live_iterate SaveVMHandler) before
>>> starting loading the data asynchronously transferred via multifd the 
>>> thread
>>> doing the actual loading of the multifd transferred data is only 
>>> started
>>> from switchover_start SaveVMHandler.
>>>
>>> switchover_start handler is called when MIG_CMD_SWITCHOVER_START
>>> sub-command of QEMU_VM_COMMAND is received via the main migration 
>>> channel.
>>>
>>> This sub-command is only sent after all save_live_iterate data have 
>>> already
>>> been posted so it is safe to commence loading of the 
>>> multifd-transferred
>>> device state upon receiving it - loading of save_live_iterate data 
>>> happens
>>> synchronously in the main migration thread (much like the processing of
>>> MIG_CMD_SWITCHOVER_START) so by the time MIG_CMD_SWITCHOVER_START is
>>> processed all the preceding data must have already been loaded.
>>>
>>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>>> ---
>>>   hw/vfio/migration-multifd.c | 225 ++++++++++++++++++++++++++++++++++++
>>>   hw/vfio/migration-multifd.h |   2 +
>>>   hw/vfio/migration.c         |  12 ++
>>>   hw/vfio/trace-events        |   5 +
>>>   4 files changed, 244 insertions(+)
>>>
>>> diff --git a/hw/vfio/migration-multifd.c b/hw/vfio/migration-multifd.c
>>> index 5d5ee1393674..b3a88c062769 100644
>>> --- a/hw/vfio/migration-multifd.c
>>> +++ b/hw/vfio/migration-multifd.c
>>> @@ -42,8 +42,13 @@ typedef struct VFIOStateBuffer {
>>>   } VFIOStateBuffer;
>>>
>>>   typedef struct VFIOMultifd {
>>> +    QemuThread load_bufs_thread;
>>
>> This can be dropped.
>
> Yeah - it was a leftover from the pre-load-thread days of v2.
>
> Dropped now.
>
>>> +    bool load_bufs_thread_running;
>>> +    bool load_bufs_thread_want_exit;
>>> +
>>>       VFIOStateBuffers load_bufs;
>>>       QemuCond load_bufs_buffer_ready_cond;
>>> +    QemuCond load_bufs_thread_finished_cond;
>>>       QemuMutex load_bufs_mutex; /* Lock order: this lock -> BQL */
>>>       uint32_t load_buf_idx;
>>>       uint32_t load_buf_idx_last;
>>> @@ -179,6 +184,175 @@ bool vfio_load_state_buffer(void *opaque, char 
>>> *data, size_t data_size,
>>>       return true;
>>>   }
>>>
>>> +static int vfio_load_bufs_thread_load_config(VFIODevice *vbasedev)
>>> +{
>>> +    return -EINVAL;
>>> +}
>>> +
>>> +static VFIOStateBuffer *vfio_load_state_buffer_get(VFIOMultifd 
>>> *multifd)
>>> +{
>>> +    VFIOStateBuffer *lb;
>>> +    guint bufs_len;
>>> +
>>> +    bufs_len = vfio_state_buffers_size_get(&multifd->load_bufs);
>>> +    if (multifd->load_buf_idx >= bufs_len) {
>>> +        assert(multifd->load_buf_idx == bufs_len);
>>> +        return NULL;
>>> +    }
>>> +
>>> +    lb = vfio_state_buffers_at(&multifd->load_bufs,
>>> +                               multifd->load_buf_idx);
>>> +    if (!lb->is_present) {
>>> +        return NULL;
>>> +    }
>>> +
>>> +    return lb;
>>> +}
>>> +
>>> +static bool vfio_load_state_buffer_write(VFIODevice *vbasedev,
>>> +                                         VFIOStateBuffer *lb,
>>> +                                         Error **errp)
>>> +{
>>> +    VFIOMigration *migration = vbasedev->migration;
>>> +    VFIOMultifd *multifd = migration->multifd;
>>> +    g_autofree char *buf = NULL;
>>> +    char *buf_cur;
>>> +    size_t buf_len;
>>> +
>>> +    if (!lb->len) {
>>> +        return true;
>>> +    }
>>> +
>>> +    trace_vfio_load_state_device_buffer_load_start(vbasedev->name,
>>> +                                                   multifd->load_buf_idx);
>>> +
>>> +    /* lb might become re-allocated when we drop the lock */
>>> +    buf = g_steal_pointer(&lb->data);
>>> +    buf_cur = buf;
>>> +    buf_len = lb->len;
>>> +    while (buf_len > 0) {
>>> +        ssize_t wr_ret;
>>> +        int errno_save;
>>> +
>>> +        /*
>>> +         * Loading data to the device takes a while,
>>> +         * drop the lock during this process.
>>> +         */
>>> +        qemu_mutex_unlock(&multifd->load_bufs_mutex);
>>> +        wr_ret = write(migration->data_fd, buf_cur, buf_len);
>>> +        errno_save = errno;
>>> +        qemu_mutex_lock(&multifd->load_bufs_mutex);
>>> +
>>> +        if (wr_ret < 0) {
>>> +            error_setg(errp,
>>> +                       "writing state buffer %" PRIu32 " failed: %d",
>>> +                       multifd->load_buf_idx, errno_save);
>>
>> Let's add vbasedev->name to the error message so we know which device 
>> caused the error.
>
> Done.
>
>>> +            return false;
>>> +        }
>>> +
>>> +        assert(wr_ret <= buf_len);
>>
>> I think this assert is redundant: we write buf_len bytes and by 
>> definition of write() wr_ret will be <= buf_len.
>
> It's for catching when the "definition" for some reason does not match
> reality, since that would result in reading well past the buffer.
>
> That's why it's an assert, not an error return.

Yes, but it's highly unlikely that write() will not match reality.
But that's a minor point, so whatever you prefer.

>
>>> +        buf_len -= wr_ret;
>>> +        buf_cur += wr_ret;
>>> +    }
>>> +
>>> +    trace_vfio_load_state_device_buffer_load_end(vbasedev->name,
>>> +                                                 multifd->load_buf_idx);
>>> +
>>> +    return true;
>>> +}
>>> +
>>> +static bool vfio_load_bufs_thread_want_exit(VFIOMultifd *multifd,
>>> +                                            bool *should_quit)
>>> +{
>>> +    return multifd->load_bufs_thread_want_exit ||
>>> +           qatomic_read(should_quit);
>>> +}
>>> +
>>> +/*
>>> + * This thread is spawned by vfio_multifd_switchover_start() which 
>>> gets
>>> + * called upon encountering the switchover point marker in main 
>>> migration
>>> + * stream.
>>> + *
>>> + * It exits after either:
>>> + * * completing loading the remaining device state and device 
>>> config, OR:
>>> + * * encountering some error while doing the above, OR:
>>> + * * being forcefully aborted by the migration core by it setting 
>>> should_quit
>>> + *   or by vfio_load_cleanup_load_bufs_thread() setting
>>> + *   multifd->load_bufs_thread_want_exit.
>>> + */
>>> +static bool vfio_load_bufs_thread(void *opaque, bool *should_quit, 
>>> Error **errp)
>>> +{
>>> +    VFIODevice *vbasedev = opaque;
>>> +    VFIOMigration *migration = vbasedev->migration;
>>> +    VFIOMultifd *multifd = migration->multifd;
>>> +    bool ret = true;
>>> +    int config_ret;
>>> +
>>> +    assert(multifd);
>>> +    QEMU_LOCK_GUARD(&multifd->load_bufs_mutex);
>>> +
>>> +    assert(multifd->load_bufs_thread_running);
>>> +
>>> +    while (true) {
>>> +        VFIOStateBuffer *lb;
>>> +
>>> +        /*
>>> +         * Always check cancellation first after the buffer_ready 
>>> wait below in
>>> +         * case that cond was signalled by 
>>> vfio_load_cleanup_load_bufs_thread().
>>> +         */
>>> +        if (vfio_load_bufs_thread_want_exit(multifd, should_quit)) {
>>> +            error_setg(errp, "operation cancelled");
>>> +            ret = false;
>>> +            goto ret_signal;
>>
>> IIUC, if vfio_load_bufs_thread_want_exit() returns true, it means 
>> that some other code part already failed and set migration error, no?
>> If so, shouldn't we return true here? After all, 
>> vfio_load_bufs_thread didn't really fail, it just got signal to 
>> terminate itself.
>
> The thread didn't succeed with loading all the data either, but got 
> cancelled.
>
> It's a similar logic as Glib's GIO returning G_IO_ERROR_CANCELLED if 
> the operation
> got cancelled.
>
> In a GTask a pending cancellation will even overwrite any other error 
> or value
> that the task tried to return (at least by default).

Ah I see.
I was looking at multifd_{send,recv}_thread, and there they don't set an
error if cancelled.

Anyway, what confused me is that we set an error here only so that
qemu_loadvm_load_thread() will try to migrate_set_error(), but that
won't work because a migration error is already expected to be present.
If that's indeed so, then setting this error here looks a bit redundant
to me.

Thanks.



^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v5 31/36] vfio/migration: Add x-migration-multifd-transfer VFIO property
  2025-03-03 22:17     ` Maciej S. Szmigiero
@ 2025-03-04 11:29       ` Avihai Horon
  2025-03-04 21:50         ` Maciej S. Szmigiero
  0 siblings, 1 reply; 120+ messages in thread
From: Avihai Horon @ 2025-03-04 11:29 UTC (permalink / raw)
  To: Maciej S. Szmigiero
  Cc: Alex Williamson, Peter Xu, Fabiano Rosas, Cédric Le Goater,
	Eric Blake, Markus Armbruster, Daniel P . Berrangé,
	Joao Martins, qemu-devel


On 04/03/2025 0:17, Maciej S. Szmigiero wrote:
> External email: Use caution opening links or attachments
>
>
> On 2.03.2025 15:48, Avihai Horon wrote:
>>
>> On 19/02/2025 22:34, Maciej S. Szmigiero wrote:
>>> External email: Use caution opening links or attachments
>>>
>>>
>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>
>>> This property allows configuring at runtime whether to transfer the
>>
>> IIUC, in this patch it's not configurable at runtime, so let's drop 
>> "at runtime".
>
> Dropped this expression from this patch description.
>
>>> particular device state via multifd channels when live migrating that
>>> device.
>>>
>>> It defaults to AUTO, which means that VFIO device state transfer via
>>> multifd channels is attempted in configurations that otherwise 
>>> support it.
>>>
>>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>>> ---
>>>   hw/vfio/migration-multifd.c   | 17 ++++++++++++++++-
>>>   hw/vfio/pci.c                 |  3 +++
>>>   include/hw/vfio/vfio-common.h |  2 ++
>>>   3 files changed, 21 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/hw/vfio/migration-multifd.c b/hw/vfio/migration-multifd.c
>>> index 0cfa9d31732a..18a5ff964a37 100644
>>> --- a/hw/vfio/migration-multifd.c
>>> +++ b/hw/vfio/migration-multifd.c
>>> @@ -460,11 +460,26 @@ bool vfio_multifd_transfer_supported(void)
>>>
>>>   bool vfio_multifd_transfer_enabled(VFIODevice *vbasedev)
>>>   {
>>> -    return false;
>>> +    VFIOMigration *migration = vbasedev->migration;
>>> +
>>> +    return migration->multifd_transfer;
>>>   }
>>>
>>>   bool vfio_multifd_transfer_setup(VFIODevice *vbasedev, Error **errp)
>>>   {
>>> +    VFIOMigration *migration = vbasedev->migration;
>>> +
>>> +    /*
>>> +     * Make a copy of this setting at the start in case it is changed
>>> +     * mid-migration.
>>> +     */
>>> +    if (vbasedev->migration_multifd_transfer == ON_OFF_AUTO_AUTO) {
>>> +        migration->multifd_transfer = 
>>> vfio_multifd_transfer_supported();
>>> +    } else {
>>> +        migration->multifd_transfer =
>>> +            vbasedev->migration_multifd_transfer == ON_OFF_AUTO_ON;
>>> +    }
>>
>> Making a copy of this value is only relevant for the next patch where 
>> it's turned mutable, so let's move this code to patch #32.
>
> But we still need to handle the "AUTO" condition so it would need
> very similar code just to get reworked into the above in the next
> patch.
> I think that's just not worth code churn between patches.

Ah, I understand.
In that case, we can move only the comment "Make a copy of this setting 
..." to patch #32.

Thanks.



^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v5 17/36] migration: Add save_live_complete_precopy_thread handler
  2025-02-26 16:43   ` Peter Xu
@ 2025-03-04 21:50     ` Maciej S. Szmigiero
  2025-03-04 22:03       ` Peter Xu
  0 siblings, 1 reply; 120+ messages in thread
From: Maciej S. Szmigiero @ 2025-03-04 21:50 UTC (permalink / raw)
  To: Peter Xu
  Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P. Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

On 26.02.2025 17:43, Peter Xu wrote:
> On Wed, Feb 19, 2025 at 09:33:59PM +0100, Maciej S. Szmigiero wrote:
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> This SaveVMHandler helps a device provide its own asynchronous transmission
>> of the remaining data at the end of a precopy phase via multifd channels,
>> in parallel with the transfer done by save_live_complete_precopy handlers.
>>
>> These threads are launched only when multifd device state transfer is
>> supported.
>>
>> Management of these threads is done in the multifd migration code,
>> wrapping them in the generic thread pool.
>>
>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>> ---
>>   include/migration/misc.h         | 17 +++++++
>>   include/migration/register.h     | 19 +++++++
>>   include/qemu/typedefs.h          |  3 ++
>>   migration/multifd-device-state.c | 85 ++++++++++++++++++++++++++++++++
>>   migration/savevm.c               | 35 ++++++++++++-
>>   5 files changed, 158 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/migration/misc.h b/include/migration/misc.h
>> index 273ebfca6256..8fd36eba1da7 100644
>> --- a/include/migration/misc.h
>> +++ b/include/migration/misc.h
>> @@ -119,8 +119,25 @@ bool migrate_uri_parse(const char *uri, MigrationChannel **channel,
>>                          Error **errp);
>>   
>>   /* migration/multifd-device-state.c */
>> +typedef struct SaveLiveCompletePrecopyThreadData {
>> +    SaveLiveCompletePrecopyThreadHandler hdlr;
>> +    char *idstr;
>> +    uint32_t instance_id;
>> +    void *handler_opaque;
>> +} SaveLiveCompletePrecopyThreadData;
>> +
>>   bool multifd_queue_device_state(char *idstr, uint32_t instance_id,
>>                                   char *data, size_t len);
>>   bool multifd_device_state_supported(void);
>>   
>> +void
>> +multifd_spawn_device_state_save_thread(SaveLiveCompletePrecopyThreadHandler hdlr,
>> +                                       char *idstr, uint32_t instance_id,
>> +                                       void *opaque);
>> +
>> +bool multifd_device_state_save_thread_should_exit(void);
>> +
>> +void multifd_abort_device_state_save_threads(void);
>> +bool multifd_join_device_state_save_threads(void);
>> +
>>   #endif
>> diff --git a/include/migration/register.h b/include/migration/register.h
>> index 58891aa54b76..c041ce32f2fc 100644
>> --- a/include/migration/register.h
>> +++ b/include/migration/register.h
>> @@ -105,6 +105,25 @@ typedef struct SaveVMHandlers {
>>        */
>>       int (*save_live_complete_precopy)(QEMUFile *f, void *opaque);
>>   
>> +    /**
>> +     * @save_live_complete_precopy_thread (invoked in a separate thread)
>> +     *
>> +     * Called at the end of a precopy phase from a separate worker thread
>> +     * in configurations where multifd device state transfer is supported
>> +     * in order to perform asynchronous transmission of the remaining data in
>> +     * parallel with @save_live_complete_precopy handlers.
>> +     * When postcopy is enabled, devices that support postcopy will skip this
>> +     * step.
>> +     *
>> +     * @d: a #SaveLiveCompletePrecopyThreadData containing parameters that the
>> +     * handler may need, including this device section idstr and instance_id,
>> +     * and opaque data pointer passed to register_savevm_live().
>> +     * @errp: pointer to Error*, to store an error if it happens.
>> +     *
>> +     * Returns true to indicate success and false for errors.
>> +     */
>> +    SaveLiveCompletePrecopyThreadHandler save_live_complete_precopy_thread;
>> +
>>       /* This runs both outside and inside the BQL.  */
>>   
>>       /**
>> diff --git a/include/qemu/typedefs.h b/include/qemu/typedefs.h
>> index fd23ff7771b1..42ed4e6be150 100644
>> --- a/include/qemu/typedefs.h
>> +++ b/include/qemu/typedefs.h
>> @@ -108,6 +108,7 @@ typedef struct QString QString;
>>   typedef struct RAMBlock RAMBlock;
>>   typedef struct Range Range;
>>   typedef struct ReservedRegion ReservedRegion;
>> +typedef struct SaveLiveCompletePrecopyThreadData SaveLiveCompletePrecopyThreadData;
>>   typedef struct SHPCDevice SHPCDevice;
>>   typedef struct SSIBus SSIBus;
>>   typedef struct TCGCPUOps TCGCPUOps;
>> @@ -133,5 +134,7 @@ typedef struct IRQState *qemu_irq;
>>   typedef void (*qemu_irq_handler)(void *opaque, int n, int level);
>>   typedef bool (*MigrationLoadThread)(void *opaque, bool *should_quit,
>>                                       Error **errp);
>> +typedef bool (*SaveLiveCompletePrecopyThreadHandler)(SaveLiveCompletePrecopyThreadData *d,
>> +                                                     Error **errp);
>>   
>>   #endif /* QEMU_TYPEDEFS_H */
>> diff --git a/migration/multifd-device-state.c b/migration/multifd-device-state.c
>> index 5de3cf27d6e8..63f021fb8dad 100644
>> --- a/migration/multifd-device-state.c
>> +++ b/migration/multifd-device-state.c
>> @@ -8,7 +8,10 @@
>>    */
>>   
>>   #include "qemu/osdep.h"
>> +#include "qapi/error.h"
>>   #include "qemu/lockable.h"
>> +#include "block/thread-pool.h"
>> +#include "migration.h"
>>   #include "migration/misc.h"
>>   #include "multifd.h"
>>   #include "options.h"
>> @@ -17,6 +20,9 @@ static struct {
>>       QemuMutex queue_job_mutex;
>>   
>>       MultiFDSendData *send_data;
>> +
>> +    ThreadPool *threads;
>> +    bool threads_abort;
>>   } *multifd_send_device_state;
>>   
>>   void multifd_device_state_send_setup(void)
>> @@ -27,10 +33,14 @@ void multifd_device_state_send_setup(void)
>>       qemu_mutex_init(&multifd_send_device_state->queue_job_mutex);
>>   
>>       multifd_send_device_state->send_data = multifd_send_data_alloc();
>> +
>> +    multifd_send_device_state->threads = thread_pool_new();
>> +    multifd_send_device_state->threads_abort = false;
>>   }
>>   
>>   void multifd_device_state_send_cleanup(void)
>>   {
>> +    g_clear_pointer(&multifd_send_device_state->threads, thread_pool_free);
>>       g_clear_pointer(&multifd_send_device_state->send_data,
>>                       multifd_send_data_free);
>>   
>> @@ -115,3 +125,78 @@ bool multifd_device_state_supported(void)
>>       return migrate_multifd() && !migrate_mapped_ram() &&
>>           migrate_multifd_compression() == MULTIFD_COMPRESSION_NONE;
>>   }
>> +
>> +static void multifd_device_state_save_thread_data_free(void *opaque)
>> +{
>> +    SaveLiveCompletePrecopyThreadData *data = opaque;
>> +
>> +    g_clear_pointer(&data->idstr, g_free);
>> +    g_free(data);
>> +}
>> +
>> +static int multifd_device_state_save_thread(void *opaque)
>> +{
>> +    SaveLiveCompletePrecopyThreadData *data = opaque;
>> +    g_autoptr(Error) local_err = NULL;
>> +
>> +    if (!data->hdlr(data, &local_err)) {
>> +        MigrationState *s = migrate_get_current();
>> +
>> +        assert(local_err);
>> +
>> +        /*
>> +         * In case of multiple save threads failing, which thread's
>> +         * error we end up setting is purely arbitrary.
>> +         */
>> +        migrate_set_error(s, local_err);
> 
> Where did you kick off all the threads when one hit error?  I wonder if
> migrate_set_error() should just set quit flag for everything, but for this
> series it might be easier to use multifd_abort_device_state_save_threads().

I've now added a call to multifd_abort_device_state_save_threads() if a migration
error is already set, to avoid needlessly waiting for the remaining threads to
do all of their work.

> Other than that, looks good to me, thanks.
> 

Thanks,
Maciej



^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v5 28/36] vfio/migration: Multifd device state transfer support - config loading support
  2025-03-04  7:41       ` Cédric Le Goater
@ 2025-03-04 21:50         ` Maciej S. Szmigiero
  0 siblings, 0 replies; 120+ messages in thread
From: Maciej S. Szmigiero @ 2025-03-04 21:50 UTC (permalink / raw)
  To: Cédric Le Goater, Avihai Horon
  Cc: Alex Williamson, Peter Xu, Fabiano Rosas, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Joao Martins,
	qemu-devel

On 4.03.2025 08:41, Cédric Le Goater wrote:
>>>> @@ -728,6 +728,12 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
>>>>           switch (data) {
>>>>           case VFIO_MIG_FLAG_DEV_CONFIG_STATE:
>>>>           {
>>>> +            if (vfio_multifd_transfer_enabled(vbasedev)) {
>>>> +                error_report("%s: got DEV_CONFIG_STATE but doing multifd transfer",
>>>> +                             vbasedev->name);
>>>
>>> To make clearer, maybe change to:
>>> "%s: got DEV_CONFIG_STATE in main migration channel but doing multifd transfer"
>>
>> That normally would be good idea, however we are already at 83 characters in this
>> line here and will not fit that many more words to this string.
> 
> The 80 characters "rule" is not strict. A clear error report is
> always good to have!

Changed the message to the suggested one by splitting that string in two
to avoid going above 100 characters per line.

> Thanks,
> 
> C.
> 
> 

Thanks,
Maciej



^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v5 31/36] vfio/migration: Add x-migration-multifd-transfer VFIO property
  2025-03-04 11:29       ` Avihai Horon
@ 2025-03-04 21:50         ` Maciej S. Szmigiero
  0 siblings, 0 replies; 120+ messages in thread
From: Maciej S. Szmigiero @ 2025-03-04 21:50 UTC (permalink / raw)
  To: Avihai Horon
  Cc: Alex Williamson, Peter Xu, Fabiano Rosas, Cédric Le Goater,
	Eric Blake, Markus Armbruster, Daniel P . Berrangé,
	Joao Martins, qemu-devel

On 4.03.2025 12:29, Avihai Horon wrote:
> 
> On 04/03/2025 0:17, Maciej S. Szmigiero wrote:
>> External email: Use caution opening links or attachments
>>
>>
>> On 2.03.2025 15:48, Avihai Horon wrote:
>>>
>>> On 19/02/2025 22:34, Maciej S. Szmigiero wrote:
>>>> External email: Use caution opening links or attachments
>>>>
>>>>
>>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>>
>>>> This property allows configuring at runtime whether to transfer the
>>>
>>> IIUC, in this patch it's not configurable at runtime, so let's drop "at runtime".
>>
>> Dropped this expression from this patch description.
>>
>>>> particular device state via multifd channels when live migrating that
>>>> device.
>>>>
>>>> It defaults to AUTO, which means that VFIO device state transfer via
>>>> multifd channels is attempted in configurations that otherwise support it.
>>>>
>>>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>>>> ---
>>>>   hw/vfio/migration-multifd.c   | 17 ++++++++++++++++-
>>>>   hw/vfio/pci.c                 |  3 +++
>>>>   include/hw/vfio/vfio-common.h |  2 ++
>>>>   3 files changed, 21 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/hw/vfio/migration-multifd.c b/hw/vfio/migration-multifd.c
>>>> index 0cfa9d31732a..18a5ff964a37 100644
>>>> --- a/hw/vfio/migration-multifd.c
>>>> +++ b/hw/vfio/migration-multifd.c
>>>> @@ -460,11 +460,26 @@ bool vfio_multifd_transfer_supported(void)
>>>>
>>>>   bool vfio_multifd_transfer_enabled(VFIODevice *vbasedev)
>>>>   {
>>>> -    return false;
>>>> +    VFIOMigration *migration = vbasedev->migration;
>>>> +
>>>> +    return migration->multifd_transfer;
>>>>   }
>>>>
>>>>   bool vfio_multifd_transfer_setup(VFIODevice *vbasedev, Error **errp)
>>>>   {
>>>> +    VFIOMigration *migration = vbasedev->migration;
>>>> +
>>>> +    /*
>>>> +     * Make a copy of this setting at the start in case it is changed
>>>> +     * mid-migration.
>>>> +     */
>>>> +    if (vbasedev->migration_multifd_transfer == ON_OFF_AUTO_AUTO) {
>>>> +        migration->multifd_transfer = vfio_multifd_transfer_supported();
>>>> +    } else {
>>>> +        migration->multifd_transfer =
>>>> +            vbasedev->migration_multifd_transfer == ON_OFF_AUTO_ON;
>>>> +    }
>>>
>>> Making a copy of this value is only relevant for the next patch where it's turned mutable, so let's move this code to patch #32.
>>
>> But we still need to handle the "AUTO" condition so it would need
>> very similar code just to get reworked into the above in the next
>> patch.
>> I think that's just not worth code churn between patches.
> 
> Ah, I understand.
> In that case, we can move only the comment "Make a copy of this setting ..." to patch #32.

All right, comment moved.

> Thanks.
> 

Thanks,
Maciej



^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v5 17/36] migration: Add save_live_complete_precopy_thread handler
  2025-03-04 21:50     ` Maciej S. Szmigiero
@ 2025-03-04 22:03       ` Peter Xu
  0 siblings, 0 replies; 120+ messages in thread
From: Peter Xu @ 2025-03-04 22:03 UTC (permalink / raw)
  To: Maciej S. Szmigiero
  Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P. Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

On Tue, Mar 04, 2025 at 10:50:29PM +0100, Maciej S. Szmigiero wrote:
> On 26.02.2025 17:43, Peter Xu wrote:
> > On Wed, Feb 19, 2025 at 09:33:59PM +0100, Maciej S. Szmigiero wrote:
> > > From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> > > 
> > > This SaveVMHandler helps a device provide its own asynchronous transmission
> > > of the remaining data at the end of a precopy phase via multifd channels,
> > > in parallel with the transfer done by save_live_complete_precopy handlers.
> > > 
> > > These threads are launched only when multifd device state transfer is
> > > supported.
> > > 
> > > Management of these threads is done in the multifd migration code,
> > > wrapping them in the generic thread pool.
> > > 
> > > Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> > > ---
> > >   include/migration/misc.h         | 17 +++++++
> > >   include/migration/register.h     | 19 +++++++
> > >   include/qemu/typedefs.h          |  3 ++
> > >   migration/multifd-device-state.c | 85 ++++++++++++++++++++++++++++++++
> > >   migration/savevm.c               | 35 ++++++++++++-
> > >   5 files changed, 158 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/include/migration/misc.h b/include/migration/misc.h
> > > index 273ebfca6256..8fd36eba1da7 100644
> > > --- a/include/migration/misc.h
> > > +++ b/include/migration/misc.h
> > > @@ -119,8 +119,25 @@ bool migrate_uri_parse(const char *uri, MigrationChannel **channel,
> > >                          Error **errp);
> > >   /* migration/multifd-device-state.c */
> > > +typedef struct SaveLiveCompletePrecopyThreadData {
> > > +    SaveLiveCompletePrecopyThreadHandler hdlr;
> > > +    char *idstr;
> > > +    uint32_t instance_id;
> > > +    void *handler_opaque;
> > > +} SaveLiveCompletePrecopyThreadData;
> > > +
> > >   bool multifd_queue_device_state(char *idstr, uint32_t instance_id,
> > >                                   char *data, size_t len);
> > >   bool multifd_device_state_supported(void);
> > > +void
> > > +multifd_spawn_device_state_save_thread(SaveLiveCompletePrecopyThreadHandler hdlr,
> > > +                                       char *idstr, uint32_t instance_id,
> > > +                                       void *opaque);
> > > +
> > > +bool multifd_device_state_save_thread_should_exit(void);
> > > +
> > > +void multifd_abort_device_state_save_threads(void);
> > > +bool multifd_join_device_state_save_threads(void);
> > > +
> > >   #endif
> > > diff --git a/include/migration/register.h b/include/migration/register.h
> > > index 58891aa54b76..c041ce32f2fc 100644
> > > --- a/include/migration/register.h
> > > +++ b/include/migration/register.h
> > > @@ -105,6 +105,25 @@ typedef struct SaveVMHandlers {
> > >        */
> > >       int (*save_live_complete_precopy)(QEMUFile *f, void *opaque);
> > > +    /**
> > > +     * @save_live_complete_precopy_thread (invoked in a separate thread)
> > > +     *
> > > +     * Called at the end of a precopy phase from a separate worker thread
> > > +     * in configurations where multifd device state transfer is supported
> > > +     * in order to perform asynchronous transmission of the remaining data in
> > > +     * parallel with @save_live_complete_precopy handlers.
> > > +     * When postcopy is enabled, devices that support postcopy will skip this
> > > +     * step.
> > > +     *
> > > +     * @d: a #SaveLiveCompletePrecopyThreadData containing parameters that the
> > > +     * handler may need, including this device section idstr and instance_id,
> > > +     * and opaque data pointer passed to register_savevm_live().
> > > +     * @errp: pointer to Error*, to store an error if it happens.
> > > +     *
> > > +     * Returns true to indicate success and false for errors.
> > > +     */
> > > +    SaveLiveCompletePrecopyThreadHandler save_live_complete_precopy_thread;
> > > +
> > >       /* This runs both outside and inside the BQL.  */
> > >       /**
> > > diff --git a/include/qemu/typedefs.h b/include/qemu/typedefs.h
> > > index fd23ff7771b1..42ed4e6be150 100644
> > > --- a/include/qemu/typedefs.h
> > > +++ b/include/qemu/typedefs.h
> > > @@ -108,6 +108,7 @@ typedef struct QString QString;
> > >   typedef struct RAMBlock RAMBlock;
> > >   typedef struct Range Range;
> > >   typedef struct ReservedRegion ReservedRegion;
> > > +typedef struct SaveLiveCompletePrecopyThreadData SaveLiveCompletePrecopyThreadData;
> > >   typedef struct SHPCDevice SHPCDevice;
> > >   typedef struct SSIBus SSIBus;
> > >   typedef struct TCGCPUOps TCGCPUOps;
> > > @@ -133,5 +134,7 @@ typedef struct IRQState *qemu_irq;
> > >   typedef void (*qemu_irq_handler)(void *opaque, int n, int level);
> > >   typedef bool (*MigrationLoadThread)(void *opaque, bool *should_quit,
> > >                                       Error **errp);
> > > +typedef bool (*SaveLiveCompletePrecopyThreadHandler)(SaveLiveCompletePrecopyThreadData *d,
> > > +                                                     Error **errp);
> > >   #endif /* QEMU_TYPEDEFS_H */
> > > diff --git a/migration/multifd-device-state.c b/migration/multifd-device-state.c
> > > index 5de3cf27d6e8..63f021fb8dad 100644
> > > --- a/migration/multifd-device-state.c
> > > +++ b/migration/multifd-device-state.c
> > > @@ -8,7 +8,10 @@
> > >    */
> > >   #include "qemu/osdep.h"
> > > +#include "qapi/error.h"
> > >   #include "qemu/lockable.h"
> > > +#include "block/thread-pool.h"
> > > +#include "migration.h"
> > >   #include "migration/misc.h"
> > >   #include "multifd.h"
> > >   #include "options.h"
> > > @@ -17,6 +20,9 @@ static struct {
> > >       QemuMutex queue_job_mutex;
> > >       MultiFDSendData *send_data;
> > > +
> > > +    ThreadPool *threads;
> > > +    bool threads_abort;
> > >   } *multifd_send_device_state;
> > >   void multifd_device_state_send_setup(void)
> > > @@ -27,10 +33,14 @@ void multifd_device_state_send_setup(void)
> > >       qemu_mutex_init(&multifd_send_device_state->queue_job_mutex);
> > >       multifd_send_device_state->send_data = multifd_send_data_alloc();
> > > +
> > > +    multifd_send_device_state->threads = thread_pool_new();
> > > +    multifd_send_device_state->threads_abort = false;
> > >   }
> > >   void multifd_device_state_send_cleanup(void)
> > >   {
> > > +    g_clear_pointer(&multifd_send_device_state->threads, thread_pool_free);
> > >       g_clear_pointer(&multifd_send_device_state->send_data,
> > >                       multifd_send_data_free);
> > > @@ -115,3 +125,78 @@ bool multifd_device_state_supported(void)
> > >       return migrate_multifd() && !migrate_mapped_ram() &&
> > >           migrate_multifd_compression() == MULTIFD_COMPRESSION_NONE;
> > >   }
> > > +
> > > +static void multifd_device_state_save_thread_data_free(void *opaque)
> > > +{
> > > +    SaveLiveCompletePrecopyThreadData *data = opaque;
> > > +
> > > +    g_clear_pointer(&data->idstr, g_free);
> > > +    g_free(data);
> > > +}
> > > +
> > > +static int multifd_device_state_save_thread(void *opaque)
> > > +{
> > > +    SaveLiveCompletePrecopyThreadData *data = opaque;
> > > +    g_autoptr(Error) local_err = NULL;
> > > +
> > > +    if (!data->hdlr(data, &local_err)) {
> > > +        MigrationState *s = migrate_get_current();
> > > +
> > > +        assert(local_err);
> > > +
> > > +        /*
> > > +         * In case of multiple save threads failing which thread error
> > > +         * return we end setting is purely arbitrary.
> > > +         */
> > > +        migrate_set_error(s, local_err);
> > 
> > Where did you kick off all the threads when one hit error?  I wonder if
> > migrate_set_error() should just set quit flag for everything, but for this
> > series it might be easier to use multifd_abort_device_state_save_threads().
> 
> I've now added a call to multifd_abort_device_state_save_threads() if a migration
> error is already set, to avoid needlessly waiting for the remaining threads to
> do all of their work.

With that, feel free to take:

Reviewed-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 120+ messages in thread

end of thread, other threads:[~2025-03-04 22:04 UTC | newest]

Thread overview: 120+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-02-19 20:33 [PATCH v5 00/36] Multifd 🔀 device state transfer support with VFIO consumer Maciej S. Szmigiero
2025-02-19 20:33 ` [PATCH v5 01/36] migration: Clarify that {load, save}_cleanup handlers can run without setup Maciej S. Szmigiero
2025-02-19 20:33 ` [PATCH v5 02/36] thread-pool: Remove thread_pool_submit() function Maciej S. Szmigiero
2025-02-19 20:33 ` [PATCH v5 03/36] thread-pool: Rename AIO pool functions to *_aio() and data types to *Aio Maciej S. Szmigiero
2025-02-19 20:33 ` [PATCH v5 04/36] thread-pool: Implement generic (non-AIO) pool support Maciej S. Szmigiero
2025-02-19 20:33 ` [PATCH v5 05/36] migration: Add MIG_CMD_SWITCHOVER_START and its load handler Maciej S. Szmigiero
2025-02-19 20:33 ` [PATCH v5 06/36] migration: Add qemu_loadvm_load_state_buffer() and its handler Maciej S. Szmigiero
2025-02-19 20:33 ` [PATCH v5 07/36] migration: postcopy_ram_listen_thread() should take BQL for some calls Maciej S. Szmigiero
2025-02-25 17:16   ` Peter Xu
2025-02-25 21:08     ` Maciej S. Szmigiero
2025-02-19 20:33 ` [PATCH v5 08/36] error: define g_autoptr() cleanup function for the Error type Maciej S. Szmigiero
2025-02-19 20:33 ` [PATCH v5 09/36] migration: Add thread pool of optional load threads Maciej S. Szmigiero
2025-02-19 20:33 ` [PATCH v5 10/36] migration/multifd: Split packet into header and RAM data Maciej S. Szmigiero
2025-02-19 20:33 ` [PATCH v5 11/36] migration/multifd: Device state transfer support - receive side Maciej S. Szmigiero
2025-03-02 12:42   ` Avihai Horon
2025-03-03 22:14     ` Maciej S. Szmigiero
2025-02-19 20:33 ` [PATCH v5 12/36] migration/multifd: Make multifd_send() thread safe Maciej S. Szmigiero
2025-02-19 20:33 ` [PATCH v5 13/36] migration/multifd: Add an explicit MultiFDSendData destructor Maciej S. Szmigiero
2025-02-19 20:33 ` [PATCH v5 14/36] migration/multifd: Device state transfer support - send side Maciej S. Szmigiero
2025-03-02 12:46   ` Avihai Horon
2025-03-03 22:15     ` Maciej S. Szmigiero
2025-02-19 20:33 ` [PATCH v5 15/36] migration/multifd: Make MultiFDSendData a struct Maciej S. Szmigiero
2025-02-19 20:33 ` [PATCH v5 16/36] migration/multifd: Add multifd_device_state_supported() Maciej S. Szmigiero
2025-02-19 20:33 ` [PATCH v5 17/36] migration: Add save_live_complete_precopy_thread handler Maciej S. Szmigiero
2025-02-26 16:43   ` Peter Xu
2025-03-04 21:50     ` Maciej S. Szmigiero
2025-03-04 22:03       ` Peter Xu
2025-02-19 20:34 ` [PATCH v5 18/36] vfio/migration: Add load_device_config_state_start trace event Maciej S. Szmigiero
2025-02-19 20:34 ` [PATCH v5 19/36] vfio/migration: Convert bytes_transferred counter to atomic Maciej S. Szmigiero
2025-02-26  7:52   ` Cédric Le Goater
2025-02-26 13:55     ` Maciej S. Szmigiero
2025-02-26 15:56       ` Cédric Le Goater
2025-02-26 16:20   ` Cédric Le Goater
2025-02-19 20:34 ` [PATCH v5 20/36] vfio/migration: Add vfio_add_bytes_transferred() Maciej S. Szmigiero
2025-02-26  8:06   ` Cédric Le Goater
2025-02-26 15:45     ` Maciej S. Szmigiero
2025-02-19 20:34 ` [PATCH v5 21/36] vfio/migration: Move migration channel flags to vfio-common.h header file Maciej S. Szmigiero
2025-02-26  8:19   ` Cédric Le Goater
2025-02-19 20:34 ` [PATCH v5 22/36] vfio/migration: Multifd device state transfer support - basic types Maciej S. Szmigiero
2025-02-26  8:52   ` Cédric Le Goater
2025-02-26 16:06     ` Maciej S. Szmigiero
2025-02-19 20:34 ` [PATCH v5 23/36] vfio/migration: Multifd device state transfer support - VFIOStateBuffer(s) Maciej S. Szmigiero
2025-02-26  8:54   ` Cédric Le Goater
2025-03-02 13:00   ` Avihai Horon
2025-03-02 15:14     ` Maciej S. Szmigiero
2025-03-03  6:42     ` Cédric Le Goater
2025-03-03 22:14       ` Maciej S. Szmigiero
2025-02-19 20:34 ` [PATCH v5 24/36] vfio/migration: Multifd device state transfer - add support checking function Maciej S. Szmigiero
2025-02-26  8:54   ` Cédric Le Goater
2025-02-19 20:34 ` [PATCH v5 25/36] vfio/migration: Multifd device state transfer support - receive init/cleanup Maciej S. Szmigiero
2025-02-26 10:14   ` Cédric Le Goater
2025-02-26 17:22     ` Cédric Le Goater
2025-02-26 17:28       ` Maciej S. Szmigiero
2025-02-26 17:28   ` Cédric Le Goater
2025-02-27 22:00     ` Maciej S. Szmigiero
2025-02-26 17:46   ` Cédric Le Goater
2025-02-27 22:00     ` Maciej S. Szmigiero
2025-02-19 20:34 ` [PATCH v5 26/36] vfio/migration: Multifd device state transfer support - received buffers queuing Maciej S. Szmigiero
2025-02-26 10:43   ` Cédric Le Goater
2025-02-26 21:04     ` Maciej S. Szmigiero
2025-02-28  8:09       ` Cédric Le Goater
2025-02-28 20:47         ` Maciej S. Szmigiero
2025-03-02 13:12   ` Avihai Horon
2025-03-03 22:15     ` Maciej S. Szmigiero
2025-02-19 20:34 ` [PATCH v5 27/36] vfio/migration: Multifd device state transfer support - load thread Maciej S. Szmigiero
2025-02-26 13:49   ` Cédric Le Goater
2025-02-26 21:05     ` Maciej S. Szmigiero
2025-02-28  9:11       ` Cédric Le Goater
2025-02-28 20:48         ` Maciej S. Szmigiero
2025-03-02 14:19     ` Avihai Horon
2025-03-03 22:16       ` Maciej S. Szmigiero
2025-03-02 14:15   ` Avihai Horon
2025-03-03 22:16     ` Maciej S. Szmigiero
2025-03-04 11:21       ` Avihai Horon
2025-02-19 20:34 ` [PATCH v5 28/36] vfio/migration: Multifd device state transfer support - config loading support Maciej S. Szmigiero
2025-02-26 13:52   ` Cédric Le Goater
2025-02-26 21:05     ` Maciej S. Szmigiero
2025-03-02 14:25   ` Avihai Horon
2025-03-03 22:17     ` Maciej S. Szmigiero
2025-03-04  7:41       ` Cédric Le Goater
2025-03-04 21:50         ` Maciej S. Szmigiero
2025-02-19 20:34 ` [PATCH v5 29/36] migration/qemu-file: Define g_autoptr() cleanup function for QEMUFile Maciej S. Szmigiero
2025-02-19 20:34 ` [PATCH v5 30/36] vfio/migration: Multifd device state transfer support - send side Maciej S. Szmigiero
2025-02-26 16:43   ` Cédric Le Goater
2025-02-26 21:05     ` Maciej S. Szmigiero
2025-02-28  9:13       ` Cédric Le Goater
2025-02-28 20:49         ` Maciej S. Szmigiero
2025-03-02 14:41   ` Avihai Horon
2025-03-03 22:17     ` Maciej S. Szmigiero
2025-02-19 20:34 ` [PATCH v5 31/36] vfio/migration: Add x-migration-multifd-transfer VFIO property Maciej S. Szmigiero
2025-02-27  6:45   ` Cédric Le Goater
2025-03-02 14:48   ` Avihai Horon
2025-03-03 22:17     ` Maciej S. Szmigiero
2025-03-04 11:29       ` Avihai Horon
2025-03-04 21:50         ` Maciej S. Szmigiero
2025-02-19 20:34 ` [PATCH v5 32/36] vfio/migration: Make x-migration-multifd-transfer VFIO property mutable Maciej S. Szmigiero
2025-02-26 17:59   ` Cédric Le Goater
2025-02-26 21:05     ` Maciej S. Szmigiero
2025-02-28  8:44       ` Cédric Le Goater
2025-02-28 20:47         ` Maciej S. Szmigiero
2025-02-19 20:34 ` [PATCH v5 33/36] hw/core/machine: Add compat for x-migration-multifd-transfer VFIO property Maciej S. Szmigiero
2025-02-26 17:59   ` Cédric Le Goater
2025-02-19 20:34 ` [PATCH v5 34/36] vfio/migration: Max in-flight VFIO device state buffer count limit Maciej S. Szmigiero
2025-02-27  6:48   ` Cédric Le Goater
2025-02-27 22:01     ` Maciej S. Szmigiero
2025-02-28  8:53       ` Cédric Le Goater
2025-02-28 20:48         ` Maciej S. Szmigiero
2025-03-02 14:53   ` Avihai Horon
2025-03-02 14:54     ` Maciej S. Szmigiero
2025-03-02 14:59       ` Maciej S. Szmigiero
2025-03-02 16:28         ` Avihai Horon
2025-02-19 20:34 ` [PATCH v5 35/36] vfio/migration: Add x-migration-load-config-after-iter VFIO property Maciej S. Szmigiero
2025-02-19 20:34 ` [PATCH v5 36/36] vfio/migration: Update VFIO migration documentation Maciej S. Szmigiero
2025-02-27  6:59   ` Cédric Le Goater
2025-02-27 22:01     ` Maciej S. Szmigiero
2025-02-28 10:05       ` Cédric Le Goater
2025-02-28 20:49         ` Maciej S. Szmigiero
2025-02-28 23:38         ` Fabiano Rosas
2025-03-03  9:34           ` Cédric Le Goater
2025-03-03 22:14           ` Maciej S. Szmigiero

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).