qemu-devel.nongnu.org archive mirror
* [PATCH v4 00/33] Multifd device state transfer support with VFIO consumer
@ 2025-01-30 10:08 Maciej S. Szmigiero
  2025-01-30 10:08 ` [PATCH v4 01/33] migration: Clarify that {load, save}_cleanup handlers can run without setup Maciej S. Szmigiero
                   ` (34 more replies)
  0 siblings, 35 replies; 137+ messages in thread
From: Maciej S. Szmigiero @ 2025-01-30 10:08 UTC (permalink / raw)
  To: Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

This is an updated v4 of the patch series; its v3 is located here:
https://lore.kernel.org/qemu-devel/cover.1731773021.git.maciej.szmigiero@oracle.com/

Changes from v3:
* MigrationLoadThread now returns a bool and a complex Error type
instead of just an int.

* qemu_loadvm_load_thread_pool now reports errors via migrate_set_error()
instead of via a dedicated load_threads_ret variable.

* Since the change above uncovered an issue with multifd send channels
not terminating their TLS session properly, QIOChannelTLS now allows
handling this situation gracefully.

* qemu_loadvm_load_thread_pool state is now part of MigrationIncomingState
instead of being stored in global variables.
This state now also has its own init/cleanup helpers.

* qemu_loadvm_load_thread_pool code is now moved into a separate section
of the savevm.c file, marked by an appropriate comment.

* thread_pool_free() is now documented to have wait-before-free semantics,
which allowed removing explicit waits from thread pool cleanup paths.

* A thread_pool_submit_immediate() method was added, since this
functionality is used by both generic thread pool users in this patch set.

* postcopy_ram_listen_thread() now takes the BQL around function calls that
ultimately call migration methods requiring the BQL.
This fixes a QEMU test failure that occurs when explicitly BQL-sensitive
code is later added to these methods.

* qemu_loadvm_load_state_buffer() now returns a bool value instead of an int.

* "Send final SYNC only after device state is complete" patch was
dropped since Peter implemented equivalent functionality upstream.

* "Document the BQL behavior of load SaveVMHandlers" patch was dropped
since that's something better done later, separately from this patch set.

* The header size is now added to mig_stats.multifd_bytes at the point where
the header is actually sent in the zero copy case: in
multifd_nocomp_send_prepare().

* Spurious wakeups from qemu_cond_wait() are now handled properly as
pointed out by Avihai.

* VFIO migration FD now allows partial write() completion as pointed out
by Avihai.

* Patch "vfio/migration: Don't run load cleanup if load setup didn't run"
was dropped; instead, all objects related to multifd load are now located in
their own VFIOMultifd struct, which is allocated only if multifd device state
transfer is actually in use.

* An intermediate VFIOStateBuffers API, as suggested by Avihai, is now
introduced to simplify vfio_load_state_buffer() and vfio_load_bufs_thread().

* Loading the VFIO device config state can now optionally be interlocked
with loading other iterables, since on the ARM64 platform VFIO depends on
the interrupt controller being loaded first, as pointed out by Avihai.

* Patch "Multifd device state transfer support - receive side" was split
into a few smaller patches as suggested by Cédric.

* x-migration-multifd-transfer VFIO property compat changes were moved
into a separate patch as suggested by Cédric.

* Other small changes, like renamed functions and variables/members, added
review tags, code formatting, moved QEMU_LOCK_GUARD() instances closer to
actual protected blocks, etc.

========================================================================

This patch set is targeting QEMU 10.0.

What's not yet present is a documentation update under docs/devel/migration,
but I didn't want to delay posting the code any longer.
That documentation can still be merged later, once the design is 100%
finalized.

========================================================================

Maciej S. Szmigiero (32):
  migration: Clarify that {load,save}_cleanup handlers can run without
    setup
  thread-pool: Remove thread_pool_submit() function
  thread-pool: Rename AIO pool functions to *_aio() and data types to
    *Aio
  thread-pool: Implement generic (non-AIO) pool support
  migration: Add MIG_CMD_SWITCHOVER_START and its load handler
  migration: Add qemu_loadvm_load_state_buffer() and its handler
  io: tls: Allow terminating the TLS session gracefully with EOF
  migration/multifd: Allow premature EOF on TLS incoming channels
  migration: postcopy_ram_listen_thread() needs to take BQL for some
    calls
  error: define g_autoptr() cleanup function for the Error type
  migration: Add thread pool of optional load threads
  migration/multifd: Split packet into header and RAM data
  migration/multifd: Device state transfer support - receive side
  migration/multifd: Make multifd_send() thread safe
  migration/multifd: Add an explicit MultiFDSendData destructor
  migration/multifd: Device state transfer support - send side
  migration/multifd: Add multifd_device_state_supported()
  migration: Add save_live_complete_precopy_thread handler
  vfio/migration: Add x-migration-load-config-after-iter VFIO property
  vfio/migration: Add load_device_config_state_start trace event
  vfio/migration: Convert bytes_transferred counter to atomic
  vfio/migration: Multifd device state transfer support - basic types
  vfio/migration: Multifd device state transfer support -
    VFIOStateBuffer(s)
  vfio/migration: Multifd device state transfer - add support checking
    function
  vfio/migration: Multifd device state transfer support - receive
    init/cleanup
  vfio/migration: Multifd device state transfer support - received
    buffers queuing
  vfio/migration: Multifd device state transfer support - load thread
  vfio/migration: Multifd device state transfer support - config loading
    support
  migration/qemu-file: Define g_autoptr() cleanup function for QEMUFile
  vfio/migration: Multifd device state transfer support - send side
  vfio/migration: Add x-migration-multifd-transfer VFIO property
  hw/core/machine: Add compat for x-migration-multifd-transfer VFIO
    property

Peter Xu (1):
  migration/multifd: Make MultiFDSendData a struct

 hw/core/machine.c                  |   2 +
 hw/vfio/migration.c                | 754 ++++++++++++++++++++++++++++-
 hw/vfio/pci.c                      |  14 +
 hw/vfio/trace-events               |  11 +-
 include/block/aio.h                |   8 +-
 include/block/thread-pool.h        |  62 ++-
 include/hw/vfio/vfio-common.h      |   7 +
 include/io/channel-tls.h           |  11 +
 include/migration/client-options.h |   4 +
 include/migration/misc.h           |  16 +
 include/migration/register.h       |  54 ++-
 include/qapi/error.h               |   2 +
 include/qemu/typedefs.h            |   6 +
 io/channel-tls.c                   |   6 +
 migration/colo.c                   |   3 +
 migration/meson.build              |   1 +
 migration/migration-hmp-cmds.c     |   2 +
 migration/migration.c              |   6 +-
 migration/migration.h              |   7 +
 migration/multifd-device-state.c   | 192 ++++++++
 migration/multifd-nocomp.c         |  30 +-
 migration/multifd.c                | 248 ++++++++--
 migration/multifd.h                |  74 ++-
 migration/options.c                |   9 +
 migration/qemu-file.h              |   2 +
 migration/savevm.c                 | 195 +++++++-
 migration/savevm.h                 |   6 +-
 migration/trace-events             |   1 +
 scripts/analyze-migration.py       |  11 +
 tests/unit/test-thread-pool.c      |   6 +-
 util/async.c                       |   6 +-
 util/thread-pool.c                 | 184 +++++--
 util/trace-events                  |   6 +-
 33 files changed, 1814 insertions(+), 132 deletions(-)
 create mode 100644 migration/multifd-device-state.c




* [PATCH v4 01/33] migration: Clarify that {load, save}_cleanup handlers can run without setup
  2025-01-30 10:08 [PATCH v4 00/33] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
@ 2025-01-30 10:08 ` Maciej S. Szmigiero
  2025-01-30 10:08 ` [PATCH v4 02/33] thread-pool: Remove thread_pool_submit() function Maciej S. Szmigiero
                   ` (33 subsequent siblings)
  34 siblings, 0 replies; 137+ messages in thread
From: Maciej S. Szmigiero @ 2025-01-30 10:08 UTC (permalink / raw)
  To: Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

It's possible for {load,save}_cleanup SaveVMHandlers to get called without
the corresponding {load,save}_setup handler being called first.

One such example is when the {load,save}_setup handler of a preceding device
returns an error.
In this case the migration core cleanup code will call all corresponding
cleanup handlers, even for those devices whose setup handler hasn't been
called.

Since this behavior can generate some surprises, let's clearly document this
in the description of these SaveVMHandlers.
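
As a hedged illustration of the pattern a handler pair has to cope with
(DevState and the dev_save_* names below are hypothetical stand-ins, not the
actual migration or VFIO handlers), a cleanup handler simply records and
checks whether setup ever ran:

```c
#include <assert.h>
#include <stdlib.h>

typedef struct {
    void *buf;        /* allocated by setup; NULL until then */
    int setup_done;   /* records whether setup actually ran */
} DevState;

int dev_save_setup(DevState *s)
{
    s->buf = malloc(4096);
    if (!s->buf) {
        return -1;    /* setup failed; cleanup may still be called */
    }
    s->setup_done = 1;
    return 0;
}

void dev_save_cleanup(DevState *s)
{
    /*
     * The migration core may invoke this even though dev_save_setup()
     * never ran (e.g. a preceding device's setup failed), so bail out
     * early instead of touching uninitialized state.
     */
    if (!s->setup_done) {
        return;
    }
    free(s->buf);
    s->buf = NULL;
    s->setup_done = 0;
}
```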

Reviewed-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
 include/migration/register.h | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/include/migration/register.h b/include/migration/register.h
index f60e797894e5..0b0292738320 100644
--- a/include/migration/register.h
+++ b/include/migration/register.h
@@ -69,7 +69,9 @@ typedef struct SaveVMHandlers {
     /**
      * @save_cleanup
      *
-     * Uninitializes the data structures on the source
+     * Uninitializes the data structures on the source.
+     * Note that this handler can be called even if save_setup
+     * wasn't called earlier.
      *
      * @opaque: data pointer passed to register_savevm_live()
      */
@@ -244,6 +246,8 @@ typedef struct SaveVMHandlers {
      * @load_cleanup
      *
      * Uninitializes the data structures on the destination.
+     * Note that this handler can be called even if load_setup
+     * wasn't called earlier.
      *
      * @opaque: data pointer passed to register_savevm_live()
      *



* [PATCH v4 02/33] thread-pool: Remove thread_pool_submit() function
  2025-01-30 10:08 [PATCH v4 00/33] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
  2025-01-30 10:08 ` [PATCH v4 01/33] migration: Clarify that {load, save}_cleanup handlers can run without setup Maciej S. Szmigiero
@ 2025-01-30 10:08 ` Maciej S. Szmigiero
  2025-01-30 10:08 ` [PATCH v4 03/33] thread-pool: Rename AIO pool functions to *_aio() and data types to *Aio Maciej S. Szmigiero
                   ` (32 subsequent siblings)
  34 siblings, 0 replies; 137+ messages in thread
From: Maciej S. Szmigiero @ 2025-01-30 10:08 UTC (permalink / raw)
  To: Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

This function name conflicts with one used by a future generic thread pool
function, and it was only used by one test anyway.

Update the trace event name in thread_pool_submit_aio() accordingly.

Acked-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
 include/block/thread-pool.h   | 3 +--
 tests/unit/test-thread-pool.c | 6 +++---
 util/thread-pool.c            | 7 +------
 util/trace-events             | 2 +-
 4 files changed, 6 insertions(+), 12 deletions(-)

diff --git a/include/block/thread-pool.h b/include/block/thread-pool.h
index 948ff5f30c31..4f6694026123 100644
--- a/include/block/thread-pool.h
+++ b/include/block/thread-pool.h
@@ -30,13 +30,12 @@ ThreadPool *thread_pool_new(struct AioContext *ctx);
 void thread_pool_free(ThreadPool *pool);
 
 /*
- * thread_pool_submit* API: submit I/O requests in the thread's
+ * thread_pool_submit_{aio,co} API: submit I/O requests in the thread's
  * current AioContext.
  */
 BlockAIOCB *thread_pool_submit_aio(ThreadPoolFunc *func, void *arg,
                                    BlockCompletionFunc *cb, void *opaque);
 int coroutine_fn thread_pool_submit_co(ThreadPoolFunc *func, void *arg);
-void thread_pool_submit(ThreadPoolFunc *func, void *arg);
 
 void thread_pool_update_params(ThreadPool *pool, struct AioContext *ctx);
 
diff --git a/tests/unit/test-thread-pool.c b/tests/unit/test-thread-pool.c
index 1483e53473db..33407b595d35 100644
--- a/tests/unit/test-thread-pool.c
+++ b/tests/unit/test-thread-pool.c
@@ -43,10 +43,10 @@ static void done_cb(void *opaque, int ret)
     active--;
 }
 
-static void test_submit(void)
+static void test_submit_no_complete(void)
 {
     WorkerTestData data = { .n = 0 };
-    thread_pool_submit(worker_cb, &data);
+    thread_pool_submit_aio(worker_cb, &data, NULL, NULL);
     while (data.n == 0) {
         aio_poll(ctx, true);
     }
@@ -236,7 +236,7 @@ int main(int argc, char **argv)
     ctx = qemu_get_current_aio_context();
 
     g_test_init(&argc, &argv, NULL);
-    g_test_add_func("/thread-pool/submit", test_submit);
+    g_test_add_func("/thread-pool/submit-no-complete", test_submit_no_complete);
     g_test_add_func("/thread-pool/submit-aio", test_submit_aio);
     g_test_add_func("/thread-pool/submit-co", test_submit_co);
     g_test_add_func("/thread-pool/submit-many", test_submit_many);
diff --git a/util/thread-pool.c b/util/thread-pool.c
index 27eb777e855b..2f751d55b33f 100644
--- a/util/thread-pool.c
+++ b/util/thread-pool.c
@@ -256,7 +256,7 @@ BlockAIOCB *thread_pool_submit_aio(ThreadPoolFunc *func, void *arg,
 
     QLIST_INSERT_HEAD(&pool->head, req, all);
 
-    trace_thread_pool_submit(pool, req, arg);
+    trace_thread_pool_submit_aio(pool, req, arg);
 
     qemu_mutex_lock(&pool->lock);
     if (pool->idle_threads == 0 && pool->cur_threads < pool->max_threads) {
@@ -290,11 +290,6 @@ int coroutine_fn thread_pool_submit_co(ThreadPoolFunc *func, void *arg)
     return tpc.ret;
 }
 
-void thread_pool_submit(ThreadPoolFunc *func, void *arg)
-{
-    thread_pool_submit_aio(func, arg, NULL, NULL);
-}
-
 void thread_pool_update_params(ThreadPool *pool, AioContext *ctx)
 {
     qemu_mutex_lock(&pool->lock);
diff --git a/util/trace-events b/util/trace-events
index 49a4962e1886..5be12d7fab89 100644
--- a/util/trace-events
+++ b/util/trace-events
@@ -14,7 +14,7 @@ aio_co_schedule_bh_cb(void *ctx, void *co) "ctx %p co %p"
 reentrant_aio(void *ctx, const char *name) "ctx %p name %s"
 
 # thread-pool.c
-thread_pool_submit(void *pool, void *req, void *opaque) "pool %p req %p opaque %p"
+thread_pool_submit_aio(void *pool, void *req, void *opaque) "pool %p req %p opaque %p"
 thread_pool_complete(void *pool, void *req, void *opaque, int ret) "pool %p req %p opaque %p ret %d"
 thread_pool_cancel(void *req, void *opaque) "req %p opaque %p"
 



* [PATCH v4 03/33] thread-pool: Rename AIO pool functions to *_aio() and data types to *Aio
  2025-01-30 10:08 [PATCH v4 00/33] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
  2025-01-30 10:08 ` [PATCH v4 01/33] migration: Clarify that {load, save}_cleanup handlers can run without setup Maciej S. Szmigiero
  2025-01-30 10:08 ` [PATCH v4 02/33] thread-pool: Remove thread_pool_submit() function Maciej S. Szmigiero
@ 2025-01-30 10:08 ` Maciej S. Szmigiero
  2025-01-30 10:08 ` [PATCH v4 04/33] thread-pool: Implement generic (non-AIO) pool support Maciej S. Szmigiero
                   ` (31 subsequent siblings)
  34 siblings, 0 replies; 137+ messages in thread
From: Maciej S. Szmigiero @ 2025-01-30 10:08 UTC (permalink / raw)
  To: Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

These names conflict with ones used by future generic thread pool
equivalents.
Generic names should belong to the generic pool type, not to the specific
(AIO) type.

Acked-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
 include/block/aio.h         |  8 ++---
 include/block/thread-pool.h |  8 ++---
 util/async.c                |  6 ++--
 util/thread-pool.c          | 58 ++++++++++++++++++-------------------
 util/trace-events           |  4 +--
 5 files changed, 42 insertions(+), 42 deletions(-)

diff --git a/include/block/aio.h b/include/block/aio.h
index 43883a8a33a8..b2ab3514de23 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -54,7 +54,7 @@ typedef void QEMUBHFunc(void *opaque);
 typedef bool AioPollFn(void *opaque);
 typedef void IOHandler(void *opaque);
 
-struct ThreadPool;
+struct ThreadPoolAio;
 struct LinuxAioState;
 typedef struct LuringState LuringState;
 
@@ -207,7 +207,7 @@ struct AioContext {
     /* Thread pool for performing work and receiving completion callbacks.
      * Has its own locking.
      */
-    struct ThreadPool *thread_pool;
+    struct ThreadPoolAio *thread_pool;
 
 #ifdef CONFIG_LINUX_AIO
     struct LinuxAioState *linux_aio;
@@ -500,8 +500,8 @@ void aio_set_event_notifier_poll(AioContext *ctx,
  */
 GSource *aio_get_g_source(AioContext *ctx);
 
-/* Return the ThreadPool bound to this AioContext */
-struct ThreadPool *aio_get_thread_pool(AioContext *ctx);
+/* Return the ThreadPoolAio bound to this AioContext */
+struct ThreadPoolAio *aio_get_thread_pool(AioContext *ctx);
 
 /* Setup the LinuxAioState bound to this AioContext */
 struct LinuxAioState *aio_setup_linux_aio(AioContext *ctx, Error **errp);
diff --git a/include/block/thread-pool.h b/include/block/thread-pool.h
index 4f6694026123..6f27eb085b45 100644
--- a/include/block/thread-pool.h
+++ b/include/block/thread-pool.h
@@ -24,10 +24,10 @@
 
 typedef int ThreadPoolFunc(void *opaque);
 
-typedef struct ThreadPool ThreadPool;
+typedef struct ThreadPoolAio ThreadPoolAio;
 
-ThreadPool *thread_pool_new(struct AioContext *ctx);
-void thread_pool_free(ThreadPool *pool);
+ThreadPoolAio *thread_pool_new_aio(struct AioContext *ctx);
+void thread_pool_free_aio(ThreadPoolAio *pool);
 
 /*
  * thread_pool_submit_{aio,co} API: submit I/O requests in the thread's
@@ -36,7 +36,7 @@ void thread_pool_free(ThreadPool *pool);
 BlockAIOCB *thread_pool_submit_aio(ThreadPoolFunc *func, void *arg,
                                    BlockCompletionFunc *cb, void *opaque);
 int coroutine_fn thread_pool_submit_co(ThreadPoolFunc *func, void *arg);
+void thread_pool_update_params(ThreadPoolAio *pool, struct AioContext *ctx);
 
-void thread_pool_update_params(ThreadPool *pool, struct AioContext *ctx);
 
 #endif
diff --git a/util/async.c b/util/async.c
index 0fe29436090d..47e3d35a263f 100644
--- a/util/async.c
+++ b/util/async.c
@@ -369,7 +369,7 @@ aio_ctx_finalize(GSource     *source)
     QEMUBH *bh;
     unsigned flags;
 
-    thread_pool_free(ctx->thread_pool);
+    thread_pool_free_aio(ctx->thread_pool);
 
 #ifdef CONFIG_LINUX_AIO
     if (ctx->linux_aio) {
@@ -435,10 +435,10 @@ GSource *aio_get_g_source(AioContext *ctx)
     return &ctx->source;
 }
 
-ThreadPool *aio_get_thread_pool(AioContext *ctx)
+ThreadPoolAio *aio_get_thread_pool(AioContext *ctx)
 {
     if (!ctx->thread_pool) {
-        ctx->thread_pool = thread_pool_new(ctx);
+        ctx->thread_pool = thread_pool_new_aio(ctx);
     }
     return ctx->thread_pool;
 }
diff --git a/util/thread-pool.c b/util/thread-pool.c
index 2f751d55b33f..908194dc070f 100644
--- a/util/thread-pool.c
+++ b/util/thread-pool.c
@@ -23,9 +23,9 @@
 #include "block/thread-pool.h"
 #include "qemu/main-loop.h"
 
-static void do_spawn_thread(ThreadPool *pool);
+static void do_spawn_thread(ThreadPoolAio *pool);
 
-typedef struct ThreadPoolElement ThreadPoolElement;
+typedef struct ThreadPoolElementAio ThreadPoolElementAio;
 
 enum ThreadState {
     THREAD_QUEUED,
@@ -33,9 +33,9 @@ enum ThreadState {
     THREAD_DONE,
 };
 
-struct ThreadPoolElement {
+struct ThreadPoolElementAio {
     BlockAIOCB common;
-    ThreadPool *pool;
+    ThreadPoolAio *pool;
     ThreadPoolFunc *func;
     void *arg;
 
@@ -47,13 +47,13 @@ struct ThreadPoolElement {
     int ret;
 
     /* Access to this list is protected by lock.  */
-    QTAILQ_ENTRY(ThreadPoolElement) reqs;
+    QTAILQ_ENTRY(ThreadPoolElementAio) reqs;
 
     /* This list is only written by the thread pool's mother thread.  */
-    QLIST_ENTRY(ThreadPoolElement) all;
+    QLIST_ENTRY(ThreadPoolElementAio) all;
 };
 
-struct ThreadPool {
+struct ThreadPoolAio {
     AioContext *ctx;
     QEMUBH *completion_bh;
     QemuMutex lock;
@@ -62,10 +62,10 @@ struct ThreadPool {
     QEMUBH *new_thread_bh;
 
     /* The following variables are only accessed from one AioContext. */
-    QLIST_HEAD(, ThreadPoolElement) head;
+    QLIST_HEAD(, ThreadPoolElementAio) head;
 
     /* The following variables are protected by lock.  */
-    QTAILQ_HEAD(, ThreadPoolElement) request_list;
+    QTAILQ_HEAD(, ThreadPoolElementAio) request_list;
     int cur_threads;
     int idle_threads;
     int new_threads;     /* backlog of threads we need to create */
@@ -76,14 +76,14 @@ struct ThreadPool {
 
 static void *worker_thread(void *opaque)
 {
-    ThreadPool *pool = opaque;
+    ThreadPoolAio *pool = opaque;
 
     qemu_mutex_lock(&pool->lock);
     pool->pending_threads--;
     do_spawn_thread(pool);
 
     while (pool->cur_threads <= pool->max_threads) {
-        ThreadPoolElement *req;
+        ThreadPoolElementAio *req;
         int ret;
 
         if (QTAILQ_EMPTY(&pool->request_list)) {
@@ -131,7 +131,7 @@ static void *worker_thread(void *opaque)
     return NULL;
 }
 
-static void do_spawn_thread(ThreadPool *pool)
+static void do_spawn_thread(ThreadPoolAio *pool)
 {
     QemuThread t;
 
@@ -148,14 +148,14 @@ static void do_spawn_thread(ThreadPool *pool)
 
 static void spawn_thread_bh_fn(void *opaque)
 {
-    ThreadPool *pool = opaque;
+    ThreadPoolAio *pool = opaque;
 
     qemu_mutex_lock(&pool->lock);
     do_spawn_thread(pool);
     qemu_mutex_unlock(&pool->lock);
 }
 
-static void spawn_thread(ThreadPool *pool)
+static void spawn_thread(ThreadPoolAio *pool)
 {
     pool->cur_threads++;
     pool->new_threads++;
@@ -173,8 +173,8 @@ static void spawn_thread(ThreadPool *pool)
 
 static void thread_pool_completion_bh(void *opaque)
 {
-    ThreadPool *pool = opaque;
-    ThreadPoolElement *elem, *next;
+    ThreadPoolAio *pool = opaque;
+    ThreadPoolElementAio *elem, *next;
 
     defer_call_begin(); /* cb() may use defer_call() to coalesce work */
 
@@ -184,8 +184,8 @@ restart:
             continue;
         }
 
-        trace_thread_pool_complete(pool, elem, elem->common.opaque,
-                                   elem->ret);
+        trace_thread_pool_complete_aio(pool, elem, elem->common.opaque,
+                                       elem->ret);
         QLIST_REMOVE(elem, all);
 
         if (elem->common.cb) {
@@ -217,10 +217,10 @@ restart:
 
 static void thread_pool_cancel(BlockAIOCB *acb)
 {
-    ThreadPoolElement *elem = (ThreadPoolElement *)acb;
-    ThreadPool *pool = elem->pool;
+    ThreadPoolElementAio *elem = (ThreadPoolElementAio *)acb;
+    ThreadPoolAio *pool = elem->pool;
 
-    trace_thread_pool_cancel(elem, elem->common.opaque);
+    trace_thread_pool_cancel_aio(elem, elem->common.opaque);
 
     QEMU_LOCK_GUARD(&pool->lock);
     if (elem->state == THREAD_QUEUED) {
@@ -234,16 +234,16 @@ static void thread_pool_cancel(BlockAIOCB *acb)
 }
 
 static const AIOCBInfo thread_pool_aiocb_info = {
-    .aiocb_size         = sizeof(ThreadPoolElement),
+    .aiocb_size         = sizeof(ThreadPoolElementAio),
     .cancel_async       = thread_pool_cancel,
 };
 
 BlockAIOCB *thread_pool_submit_aio(ThreadPoolFunc *func, void *arg,
                                    BlockCompletionFunc *cb, void *opaque)
 {
-    ThreadPoolElement *req;
+    ThreadPoolElementAio *req;
     AioContext *ctx = qemu_get_current_aio_context();
-    ThreadPool *pool = aio_get_thread_pool(ctx);
+    ThreadPoolAio *pool = aio_get_thread_pool(ctx);
 
     /* Assert that the thread submitting work is the same running the pool */
     assert(pool->ctx == qemu_get_current_aio_context());
@@ -290,7 +290,7 @@ int coroutine_fn thread_pool_submit_co(ThreadPoolFunc *func, void *arg)
     return tpc.ret;
 }
 
-void thread_pool_update_params(ThreadPool *pool, AioContext *ctx)
+void thread_pool_update_params(ThreadPoolAio *pool, AioContext *ctx)
 {
     qemu_mutex_lock(&pool->lock);
 
@@ -317,7 +317,7 @@ void thread_pool_update_params(ThreadPool *pool, AioContext *ctx)
     qemu_mutex_unlock(&pool->lock);
 }
 
-static void thread_pool_init_one(ThreadPool *pool, AioContext *ctx)
+static void thread_pool_init_one(ThreadPoolAio *pool, AioContext *ctx)
 {
     if (!ctx) {
         ctx = qemu_get_aio_context();
@@ -337,14 +337,14 @@ static void thread_pool_init_one(ThreadPool *pool, AioContext *ctx)
     thread_pool_update_params(pool, ctx);
 }
 
-ThreadPool *thread_pool_new(AioContext *ctx)
+ThreadPoolAio *thread_pool_new_aio(AioContext *ctx)
 {
-    ThreadPool *pool = g_new(ThreadPool, 1);
+    ThreadPoolAio *pool = g_new(ThreadPoolAio, 1);
     thread_pool_init_one(pool, ctx);
     return pool;
 }
 
-void thread_pool_free(ThreadPool *pool)
+void thread_pool_free_aio(ThreadPoolAio *pool)
 {
     if (!pool) {
         return;
diff --git a/util/trace-events b/util/trace-events
index 5be12d7fab89..bd8f25fb5920 100644
--- a/util/trace-events
+++ b/util/trace-events
@@ -15,8 +15,8 @@ reentrant_aio(void *ctx, const char *name) "ctx %p name %s"
 
 # thread-pool.c
 thread_pool_submit_aio(void *pool, void *req, void *opaque) "pool %p req %p opaque %p"
-thread_pool_complete(void *pool, void *req, void *opaque, int ret) "pool %p req %p opaque %p ret %d"
-thread_pool_cancel(void *req, void *opaque) "req %p opaque %p"
+thread_pool_complete_aio(void *pool, void *req, void *opaque, int ret) "pool %p req %p opaque %p ret %d"
+thread_pool_cancel_aio(void *req, void *opaque) "req %p opaque %p"
 
 # buffer.c
 buffer_resize(const char *buf, size_t olen, size_t len) "%s: old %zd, new %zd"



* [PATCH v4 04/33] thread-pool: Implement generic (non-AIO) pool support
  2025-01-30 10:08 [PATCH v4 00/33] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
                   ` (2 preceding siblings ...)
  2025-01-30 10:08 ` [PATCH v4 03/33] thread-pool: Rename AIO pool functions to *_aio() and data types to *Aio Maciej S. Szmigiero
@ 2025-01-30 10:08 ` Maciej S. Szmigiero
  2025-01-30 10:08 ` [PATCH v4 05/33] migration: Add MIG_CMD_SWITCHOVER_START and its load handler Maciej S. Szmigiero
                   ` (30 subsequent siblings)
  34 siblings, 0 replies; 137+ messages in thread
From: Maciej S. Szmigiero @ 2025-01-30 10:08 UTC (permalink / raw)
  To: Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

Migration code wants to manage device data sending threads in one place.

QEMU has an existing thread pool implementation, however it is limited
to queuing AIO operations only and essentially has a 1:1 mapping between
the current AioContext and the AIO ThreadPool in use.

Implement a generic (non-AIO) ThreadPool by essentially wrapping GLib's
GThreadPool.

This brings a few new operations on a pool:
* thread_pool_wait() operation waits until all the submitted work requests
have finished.

* thread_pool_set_max_threads() explicitly sets the maximum thread count
in the pool.

* thread_pool_adjust_max_threads_to_work() adjusts the maximum thread count
in the pool to equal the number of work items still waiting in the queue or
unfinished.
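
The wait/signal bookkeeping behind thread_pool_wait() can be sketched in a
self-contained way with plain pthreads (an illustrative model only, not the
actual GThreadPool-based implementation; MiniPool, mini_submit() and
mini_pool_demo() are made-up names). Each submitted task increments a
cur_work counter, the last finishing task signals a condition variable, and
the waiter loops to guard against spurious wakeups:

```c
#include <assert.h>
#include <pthread.h>

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t all_finished;
    unsigned cur_work;   /* submitted but not yet finished */
    int done;            /* total work performed, for the demo */
} MiniPool;

void *mini_work(void *opaque)
{
    MiniPool *p = opaque;

    pthread_mutex_lock(&p->lock);
    p->done++;                        /* the "task" itself */
    if (--p->cur_work == 0) {
        /* Last task out wakes up any thread_pool_wait()-style waiter */
        pthread_cond_signal(&p->all_finished);
    }
    pthread_mutex_unlock(&p->lock);
    return NULL;
}

void mini_submit(MiniPool *p)
{
    pthread_t t;

    pthread_mutex_lock(&p->lock);
    p->cur_work++;                    /* count before the task can finish */
    pthread_mutex_unlock(&p->lock);

    /* One thread per task; error handling omitted in this sketch */
    pthread_create(&t, NULL, mini_work, p);
    pthread_detach(t);
}

void mini_wait(MiniPool *p)
{
    pthread_mutex_lock(&p->lock);
    while (p->cur_work > 0) {         /* loop handles spurious wakeups */
        pthread_cond_wait(&p->all_finished, &p->lock);
    }
    pthread_mutex_unlock(&p->lock);
}

int mini_pool_demo(void)
{
    MiniPool p = {
        PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0, 0
    };
    int i;

    for (i = 0; i < 16; i++) {
        mini_submit(&p);
    }
    mini_wait(&p);                    /* barrier: all 16 tasks have run */
    return p.done;
}
```

Compile with -lpthread. The one-thread-per-task submission here also mirrors
the spirit of thread_pool_adjust_max_threads_to_work(), which gives each
outstanding work item its own thread.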

Reviewed-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
 include/block/thread-pool.h |  51 ++++++++++++++++
 util/thread-pool.c          | 119 ++++++++++++++++++++++++++++++++++++
 2 files changed, 170 insertions(+)

diff --git a/include/block/thread-pool.h b/include/block/thread-pool.h
index 6f27eb085b45..dd48cf07e85f 100644
--- a/include/block/thread-pool.h
+++ b/include/block/thread-pool.h
@@ -38,5 +38,56 @@ BlockAIOCB *thread_pool_submit_aio(ThreadPoolFunc *func, void *arg,
 int coroutine_fn thread_pool_submit_co(ThreadPoolFunc *func, void *arg);
 void thread_pool_update_params(ThreadPoolAio *pool, struct AioContext *ctx);
 
+/* ------------------------------------------- */
+/* Generic thread pool types and methods below */
+typedef struct ThreadPool ThreadPool;
+
+/* Create a new thread pool. Never returns NULL. */
+ThreadPool *thread_pool_new(void);
+
+/*
+ * Free the thread pool.
+ * Waits for all the previously submitted work to complete before performing
+ * the actual freeing operation.
+ */
+void thread_pool_free(ThreadPool *pool);
+
+/*
+ * Submit a new work (task) for the pool.
+ *
+ * @opaque_destroy is an optional GDestroyNotify for the @opaque argument
+ * to the work function at @func.
+ */
+void thread_pool_submit(ThreadPool *pool, ThreadPoolFunc *func,
+                        void *opaque, GDestroyNotify opaque_destroy);
+
+/*
+ * Submit a new work (task) for the pool, making sure it starts getting
+ * processed immediately, launching a new thread for it if necessary.
+ *
+ * @opaque_destroy is an optional GDestroyNotify for the @opaque argument
+ * to the work function at @func.
+ */
+void thread_pool_submit_immediate(ThreadPool *pool, ThreadPoolFunc *func,
+                                  void *opaque, GDestroyNotify opaque_destroy);
+
+/*
+ * Wait for all previously submitted work to complete before returning.
+ *
+ * Can be used as a barrier between two sets of tasks executed on a thread
+ * pool without destroying it or in a performance sensitive path where the
+ * caller just wants to wait for all tasks to complete while deferring the
+ * pool free operation for later, less performance sensitive time.
+ */
+void thread_pool_wait(ThreadPool *pool);
+
+/* Set the maximum number of threads in the pool. */
+bool thread_pool_set_max_threads(ThreadPool *pool, int max_threads);
+
+/*
+ * Adjust the maximum number of threads in the pool to give each task its
+ * own thread (exactly one thread per task).
+ */
+bool thread_pool_adjust_max_threads_to_work(ThreadPool *pool);
 
 #endif
diff --git a/util/thread-pool.c b/util/thread-pool.c
index 908194dc070f..d2ead6b72857 100644
--- a/util/thread-pool.c
+++ b/util/thread-pool.c
@@ -374,3 +374,122 @@ void thread_pool_free_aio(ThreadPoolAio *pool)
     qemu_mutex_destroy(&pool->lock);
     g_free(pool);
 }
+
+struct ThreadPool {
+    GThreadPool *t;
+    size_t cur_work;
+    QemuMutex cur_work_lock;
+    QemuCond all_finished_cond;
+};
+
+typedef struct {
+    ThreadPoolFunc *func;
+    void *opaque;
+    GDestroyNotify opaque_destroy;
+} ThreadPoolElement;
+
+static void thread_pool_func(gpointer data, gpointer user_data)
+{
+    ThreadPool *pool = user_data;
+    g_autofree ThreadPoolElement *el = data;
+
+    el->func(el->opaque);
+
+    if (el->opaque_destroy) {
+        el->opaque_destroy(el->opaque);
+    }
+
+    QEMU_LOCK_GUARD(&pool->cur_work_lock);
+
+    assert(pool->cur_work > 0);
+    pool->cur_work--;
+
+    if (pool->cur_work == 0) {
+        qemu_cond_signal(&pool->all_finished_cond);
+    }
+}
+
+ThreadPool *thread_pool_new(void)
+{
+    ThreadPool *pool = g_new(ThreadPool, 1);
+
+    pool->cur_work = 0;
+    qemu_mutex_init(&pool->cur_work_lock);
+    qemu_cond_init(&pool->all_finished_cond);
+
+    pool->t = g_thread_pool_new(thread_pool_func, pool, 0, TRUE, NULL);
+    /*
+     * g_thread_pool_new() can only return errors if initial thread(s)
+     * creation fails, but we ask for 0 initial threads above.
+     */
+    assert(pool->t);
+
+    return pool;
+}
+
+void thread_pool_free(ThreadPool *pool)
+{
+    /*
+     * With _wait = TRUE this effectively waits for all
+     * previously submitted work to complete first.
+     */
+    g_thread_pool_free(pool->t, FALSE, TRUE);
+
+    qemu_cond_destroy(&pool->all_finished_cond);
+    qemu_mutex_destroy(&pool->cur_work_lock);
+
+    g_free(pool);
+}
+
+void thread_pool_submit(ThreadPool *pool, ThreadPoolFunc *func,
+                        void *opaque, GDestroyNotify opaque_destroy)
+{
+    ThreadPoolElement *el = g_new(ThreadPoolElement, 1);
+
+    el->func = func;
+    el->opaque = opaque;
+    el->opaque_destroy = opaque_destroy;
+
+    WITH_QEMU_LOCK_GUARD(&pool->cur_work_lock) {
+        pool->cur_work++;
+    }
+
+    /*
+     * Ignore the return value since this function can only return errors
+     * if creation of an additional thread fails, but even in that case the
+     * provided work still gets queued (just for the existing threads).
+     */
+    g_thread_pool_push(pool->t, el, NULL);
+}
+
+void thread_pool_submit_immediate(ThreadPool *pool, ThreadPoolFunc *func,
+                                  void *opaque, GDestroyNotify opaque_destroy)
+{
+    thread_pool_submit(pool, func, opaque, opaque_destroy);
+    thread_pool_adjust_max_threads_to_work(pool);
+}
+
+void thread_pool_wait(ThreadPool *pool)
+{
+    QEMU_LOCK_GUARD(&pool->cur_work_lock);
+
+    while (pool->cur_work > 0) {
+        qemu_cond_wait(&pool->all_finished_cond,
+                       &pool->cur_work_lock);
+    }
+}
+
+bool thread_pool_set_max_threads(ThreadPool *pool,
+                                 int max_threads)
+{
+    assert(max_threads > 0);
+
+    return g_thread_pool_set_max_threads(pool->t, max_threads, NULL);
+}
+
+bool thread_pool_adjust_max_threads_to_work(ThreadPool *pool)
+{
+    QEMU_LOCK_GUARD(&pool->cur_work_lock);
+
+    return thread_pool_set_max_threads(pool, pool->cur_work);
+}


^ permalink raw reply related	[flat|nested] 137+ messages in thread

* [PATCH v4 05/33] migration: Add MIG_CMD_SWITCHOVER_START and its load handler
  2025-01-30 10:08 [PATCH v4 00/33] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
                   ` (3 preceding siblings ...)
  2025-01-30 10:08 ` [PATCH v4 04/33] thread-pool: Implement generic (non-AIO) pool support Maciej S. Szmigiero
@ 2025-01-30 10:08 ` Maciej S. Szmigiero
  2025-01-30 10:08 ` [PATCH v4 06/33] migration: Add qemu_loadvm_load_state_buffer() and its handler Maciej S. Szmigiero
                   ` (29 subsequent siblings)
  34 siblings, 0 replies; 137+ messages in thread
From: Maciej S. Szmigiero @ 2025-01-30 10:08 UTC (permalink / raw)
  To: Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

This QEMU_VM_COMMAND sub-command and its switchover_start SaveVMHandler are
used to mark the switchover point in the main migration stream.

It can be used to inform the destination that all pre-switchover main
migration stream data has been sent/received, so it can start processing
post-switchover data that it might have received via other migration
channels, like the multifd ones.

Also add the relevant MigrationState bit stream compatibility property and
its hw_compat entry.

Reviewed-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Zhang Chen <zhangckid@gmail.com> # for the COLO part
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
 hw/core/machine.c                  |  1 +
 include/migration/client-options.h |  4 +++
 include/migration/register.h       | 12 +++++++++
 migration/colo.c                   |  3 +++
 migration/migration-hmp-cmds.c     |  2 ++
 migration/migration.c              |  4 +++
 migration/migration.h              |  2 ++
 migration/options.c                |  9 +++++++
 migration/savevm.c                 | 39 ++++++++++++++++++++++++++++++
 migration/savevm.h                 |  1 +
 migration/trace-events             |  1 +
 scripts/analyze-migration.py       | 11 +++++++++
 12 files changed, 89 insertions(+)

diff --git a/hw/core/machine.c b/hw/core/machine.c
index c23b39949649..c2964503c5bd 100644
--- a/hw/core/machine.c
+++ b/hw/core/machine.c
@@ -38,6 +38,7 @@
 
 GlobalProperty hw_compat_9_2[] = {
     {"arm-cpu", "backcompat-pauth-default-use-qarma5", "true"},
+    { "migration", "send-switchover-start", "off"},
 };
 const size_t hw_compat_9_2_len = G_N_ELEMENTS(hw_compat_9_2);
 
diff --git a/include/migration/client-options.h b/include/migration/client-options.h
index 59f4b55cf4f7..289c9d776221 100644
--- a/include/migration/client-options.h
+++ b/include/migration/client-options.h
@@ -10,6 +10,10 @@
 #ifndef QEMU_MIGRATION_CLIENT_OPTIONS_H
 #define QEMU_MIGRATION_CLIENT_OPTIONS_H
 
+
+/* properties */
+bool migrate_send_switchover_start(void);
+
 /* capabilities */
 
 bool migrate_background_snapshot(void);
diff --git a/include/migration/register.h b/include/migration/register.h
index 0b0292738320..ff0faf5f68c8 100644
--- a/include/migration/register.h
+++ b/include/migration/register.h
@@ -279,6 +279,18 @@ typedef struct SaveVMHandlers {
      * otherwise
      */
     bool (*switchover_ack_needed)(void *opaque);
+
+    /**
+     * @switchover_start
+     *
+     * Notifies that the switchover has started. Called only on
+     * the destination.
+     *
+     * @opaque: data pointer passed to register_savevm_live()
+     *
+     * Returns zero to indicate success and negative for error
+     */
+    int (*switchover_start)(void *opaque);
 } SaveVMHandlers;
 
 /**
diff --git a/migration/colo.c b/migration/colo.c
index 9a8e5fbe9b94..c976b3ff344d 100644
--- a/migration/colo.c
+++ b/migration/colo.c
@@ -452,6 +452,9 @@ static int colo_do_checkpoint_transaction(MigrationState *s,
         bql_unlock();
         goto out;
     }
+
+    qemu_savevm_maybe_send_switchover_start(s->to_dst_file);
+
     /* Note: device state is saved into buffer */
     ret = qemu_save_device_state(fb);
 
diff --git a/migration/migration-hmp-cmds.c b/migration/migration-hmp-cmds.c
index e8527bef801c..c84e7914ccfb 100644
--- a/migration/migration-hmp-cmds.c
+++ b/migration/migration-hmp-cmds.c
@@ -46,6 +46,8 @@ static void migration_global_dump(Monitor *mon)
                    ms->send_configuration ? "on" : "off");
     monitor_printf(mon, "send-section-footer: %s\n",
                    ms->send_section_footer ? "on" : "off");
+    monitor_printf(mon, "send-switchover-start: %s\n",
+                   ms->send_switchover_start ? "on" : "off");
     monitor_printf(mon, "clear-bitmap-shift: %u\n",
                    ms->clear_bitmap_shift);
 }
diff --git a/migration/migration.c b/migration/migration.c
index 2d1da917c7b1..65b51d360896 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2536,6 +2536,8 @@ static int postcopy_start(MigrationState *ms, Error **errp)
         goto fail;
     }
 
+    qemu_savevm_maybe_send_switchover_start(ms->to_dst_file);
+
     /*
      * Cause any non-postcopiable, but iterative devices to
      * send out their final data.
@@ -2719,6 +2721,8 @@ static int migration_completion_precopy(MigrationState *s,
 
     migration_rate_set(RATE_LIMIT_DISABLED);
 
+    qemu_savevm_maybe_send_switchover_start(s->to_dst_file);
+
     /* Inactivate disks except in COLO */
     ret = qemu_savevm_state_complete_precopy(s->to_dst_file, false,
                                              !migrate_colo());
diff --git a/migration/migration.h b/migration/migration.h
index 0df2a187afef..c5731626bbfb 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -400,6 +400,8 @@ struct MigrationState {
     bool send_configuration;
     /* Whether we send section footer during migration */
     bool send_section_footer;
+    /* Whether we send switchover start notification during migration */
+    bool send_switchover_start;
 
     /* Needed by postcopy-pause state */
     QemuSemaphore postcopy_pause_sem;
diff --git a/migration/options.c b/migration/options.c
index b8d530032698..466a45e7f2ba 100644
--- a/migration/options.c
+++ b/migration/options.c
@@ -92,6 +92,8 @@ const Property migration_properties[] = {
                      send_configuration, true),
     DEFINE_PROP_BOOL("send-section-footer", MigrationState,
                      send_section_footer, true),
+    DEFINE_PROP_BOOL("send-switchover-start", MigrationState,
+                     send_switchover_start, true),
     DEFINE_PROP_BOOL("multifd-flush-after-each-section", MigrationState,
                       multifd_flush_after_each_section, false),
     DEFINE_PROP_UINT8("x-clear-bitmap-shift", MigrationState,
@@ -206,6 +208,13 @@ bool migrate_auto_converge(void)
     return s->capabilities[MIGRATION_CAPABILITY_AUTO_CONVERGE];
 }
 
+bool migrate_send_switchover_start(void)
+{
+    MigrationState *s = migrate_get_current();
+
+    return s->send_switchover_start;
+}
+
 bool migrate_background_snapshot(void)
 {
     MigrationState *s = migrate_get_current();
diff --git a/migration/savevm.c b/migration/savevm.c
index c929da1ca5a5..5a3be7a06b6f 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -90,6 +90,7 @@ enum qemu_vm_cmd {
     MIG_CMD_ENABLE_COLO,       /* Enable COLO */
     MIG_CMD_POSTCOPY_RESUME,   /* resume postcopy on dest */
     MIG_CMD_RECV_BITMAP,       /* Request for recved bitmap on dst */
+    MIG_CMD_SWITCHOVER_START,  /* Switchover start notification */
     MIG_CMD_MAX
 };
 
@@ -109,6 +110,7 @@ static struct mig_cmd_args {
     [MIG_CMD_POSTCOPY_RESUME]  = { .len =  0, .name = "POSTCOPY_RESUME" },
     [MIG_CMD_PACKAGED]         = { .len =  4, .name = "PACKAGED" },
     [MIG_CMD_RECV_BITMAP]      = { .len = -1, .name = "RECV_BITMAP" },
+    [MIG_CMD_SWITCHOVER_START] = { .len =  0, .name = "SWITCHOVER_START" },
     [MIG_CMD_MAX]              = { .len = -1, .name = "MAX" },
 };
 
@@ -1201,6 +1203,19 @@ void qemu_savevm_send_recv_bitmap(QEMUFile *f, char *block_name)
     qemu_savevm_command_send(f, MIG_CMD_RECV_BITMAP, len + 1, (uint8_t *)buf);
 }
 
+static void qemu_savevm_send_switchover_start(QEMUFile *f)
+{
+    trace_savevm_send_switchover_start();
+    qemu_savevm_command_send(f, MIG_CMD_SWITCHOVER_START, 0, NULL);
+}
+
+void qemu_savevm_maybe_send_switchover_start(QEMUFile *f)
+{
+    if (migrate_send_switchover_start()) {
+        qemu_savevm_send_switchover_start(f);
+    }
+}
+
 bool qemu_savevm_state_blocked(Error **errp)
 {
     SaveStateEntry *se;
@@ -1714,6 +1729,7 @@ static int qemu_savevm_state(QEMUFile *f, Error **errp)
 
     ret = qemu_file_get_error(f);
     if (ret == 0) {
+        qemu_savevm_maybe_send_switchover_start(f);
         qemu_savevm_state_complete_precopy(f, false, false);
         ret = qemu_file_get_error(f);
     }
@@ -2410,6 +2426,26 @@ static int loadvm_process_enable_colo(MigrationIncomingState *mis)
     return ret;
 }
 
+static int loadvm_postcopy_handle_switchover_start(void)
+{
+    SaveStateEntry *se;
+
+    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+        int ret;
+
+        if (!se->ops || !se->ops->switchover_start) {
+            continue;
+        }
+
+        ret = se->ops->switchover_start(se->opaque);
+        if (ret < 0) {
+            return ret;
+        }
+    }
+
+    return 0;
+}
+
 /*
  * Process an incoming 'QEMU_VM_COMMAND'
  * 0           just a normal return
@@ -2508,6 +2544,9 @@ static int loadvm_process_command(QEMUFile *f)
 
     case MIG_CMD_ENABLE_COLO:
         return loadvm_process_enable_colo(mis);
+
+    case MIG_CMD_SWITCHOVER_START:
+        return loadvm_postcopy_handle_switchover_start();
     }
 
     return 0;
diff --git a/migration/savevm.h b/migration/savevm.h
index 9ec96a995c93..4d402723bc3c 100644
--- a/migration/savevm.h
+++ b/migration/savevm.h
@@ -53,6 +53,7 @@ void qemu_savevm_send_postcopy_listen(QEMUFile *f);
 void qemu_savevm_send_postcopy_run(QEMUFile *f);
 void qemu_savevm_send_postcopy_resume(QEMUFile *f);
 void qemu_savevm_send_recv_bitmap(QEMUFile *f, char *block_name);
+void qemu_savevm_maybe_send_switchover_start(QEMUFile *f);
 
 void qemu_savevm_send_postcopy_ram_discard(QEMUFile *f, const char *name,
                                            uint16_t len,
diff --git a/migration/trace-events b/migration/trace-events
index b82a1c5e40ff..5a73af532f1c 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -39,6 +39,7 @@ savevm_send_postcopy_run(void) ""
 savevm_send_postcopy_resume(void) ""
 savevm_send_colo_enable(void) ""
 savevm_send_recv_bitmap(char *name) "%s"
+savevm_send_switchover_start(void) ""
 savevm_state_setup(void) ""
 savevm_state_resume_prepare(void) ""
 savevm_state_header(void) ""
diff --git a/scripts/analyze-migration.py b/scripts/analyze-migration.py
index 8e1fbf4c9d9f..67631ac43e9f 100755
--- a/scripts/analyze-migration.py
+++ b/scripts/analyze-migration.py
@@ -620,7 +620,9 @@ class MigrationDump(object):
     QEMU_VM_SUBSECTION    = 0x05
     QEMU_VM_VMDESCRIPTION = 0x06
     QEMU_VM_CONFIGURATION = 0x07
+    QEMU_VM_COMMAND       = 0x08
     QEMU_VM_SECTION_FOOTER= 0x7e
+    QEMU_MIG_CMD_SWITCHOVER_START = 0x0b
 
     def __init__(self, filename):
         self.section_classes = {
@@ -685,6 +687,15 @@ def read(self, desc_only = False, dump_memory = False,
             elif section_type == self.QEMU_VM_SECTION_PART or section_type == self.QEMU_VM_SECTION_END:
                 section_id = file.read32()
                 self.sections[section_id].read()
+            elif section_type == self.QEMU_VM_COMMAND:
+                command_type = file.read16()
+                command_data_len = file.read16()
+                if command_type != self.QEMU_MIG_CMD_SWITCHOVER_START:
+                    raise Exception("Unknown QEMU_VM_COMMAND: %x" %
+                                    (command_type))
+                if command_data_len != 0:
+                    raise Exception("Invalid SWITCHOVER_START length: %x" %
+                                    (command_data_len))
             elif section_type == self.QEMU_VM_SECTION_FOOTER:
                 read_section_id = file.read32()
                 if read_section_id != section_id:



* [PATCH v4 06/33] migration: Add qemu_loadvm_load_state_buffer() and its handler
  2025-01-30 10:08 [PATCH v4 00/33] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
                   ` (4 preceding siblings ...)
  2025-01-30 10:08 ` [PATCH v4 05/33] migration: Add MIG_CMD_SWITCHOVER_START and its load handler Maciej S. Szmigiero
@ 2025-01-30 10:08 ` Maciej S. Szmigiero
  2025-01-30 10:08 ` [PATCH v4 07/33] io: tls: Allow terminating the TLS session gracefully with EOF Maciej S. Szmigiero
                   ` (28 subsequent siblings)
  34 siblings, 0 replies; 137+ messages in thread
From: Maciej S. Szmigiero @ 2025-01-30 10:08 UTC (permalink / raw)
  To: Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

qemu_loadvm_load_state_buffer() and its load_state_buffer
SaveVMHandler allow providing a device state buffer to an explicitly
specified device via its idstr and instance id.

Reviewed-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
 include/migration/register.h | 15 +++++++++++++++
 migration/savevm.c           | 23 +++++++++++++++++++++++
 migration/savevm.h           |  3 +++
 3 files changed, 41 insertions(+)

diff --git a/include/migration/register.h b/include/migration/register.h
index ff0faf5f68c8..58891aa54b76 100644
--- a/include/migration/register.h
+++ b/include/migration/register.h
@@ -229,6 +229,21 @@ typedef struct SaveVMHandlers {
      */
     int (*load_state)(QEMUFile *f, void *opaque, int version_id);
 
+    /**
+     * @load_state_buffer (invoked outside the BQL)
+     *
+     * Load device state buffer provided to qemu_loadvm_load_state_buffer().
+     *
+     * @opaque: data pointer passed to register_savevm_live()
+     * @buf: the data buffer to load
+     * @len: the data length in buffer
+     * @errp: pointer to Error*, to store an error if it happens.
+     *
+     * Returns true to indicate success and false for errors.
+     */
+    bool (*load_state_buffer)(void *opaque, char *buf, size_t len,
+                              Error **errp);
+
     /**
      * @load_setup
      *
diff --git a/migration/savevm.c b/migration/savevm.c
index 5a3be7a06b6f..b0b74140daea 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -3082,6 +3082,29 @@ int qemu_loadvm_approve_switchover(void)
     return migrate_send_rp_switchover_ack(mis);
 }
 
+bool qemu_loadvm_load_state_buffer(const char *idstr, uint32_t instance_id,
+                                   char *buf, size_t len, Error **errp)
+{
+    SaveStateEntry *se;
+
+    se = find_se(idstr, instance_id);
+    if (!se) {
+        error_setg(errp,
+                   "Unknown idstr %s or instance id %u for load state buffer",
+                   idstr, instance_id);
+        return false;
+    }
+
+    if (!se->ops || !se->ops->load_state_buffer) {
+        error_setg(errp,
+                   "idstr %s / instance %u has no load state buffer operation",
+                   idstr, instance_id);
+        return false;
+    }
+
+    return se->ops->load_state_buffer(se->opaque, buf, len, errp);
+}
+
 bool save_snapshot(const char *name, bool overwrite, const char *vmstate,
                   bool has_devices, strList *devices, Error **errp)
 {
diff --git a/migration/savevm.h b/migration/savevm.h
index 4d402723bc3c..8b78493dbc0e 100644
--- a/migration/savevm.h
+++ b/migration/savevm.h
@@ -71,4 +71,7 @@ int qemu_loadvm_approve_switchover(void);
 int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
         bool in_postcopy, bool inactivate_disks);
 
+bool qemu_loadvm_load_state_buffer(const char *idstr, uint32_t instance_id,
+                                   char *buf, size_t len, Error **errp);
+
 #endif



* [PATCH v4 07/33] io: tls: Allow terminating the TLS session gracefully with EOF
  2025-01-30 10:08 [PATCH v4 00/33] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
                   ` (5 preceding siblings ...)
  2025-01-30 10:08 ` [PATCH v4 06/33] migration: Add qemu_loadvm_load_state_buffer() and its handler Maciej S. Szmigiero
@ 2025-01-30 10:08 ` Maciej S. Szmigiero
  2025-02-04 15:15   ` Daniel P. Berrangé
  2025-01-30 10:08 ` [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels Maciej S. Szmigiero
                   ` (27 subsequent siblings)
  34 siblings, 1 reply; 137+ messages in thread
From: Maciej S. Szmigiero @ 2025-01-30 10:08 UTC (permalink / raw)
  To: Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

Currently, hitting EOF on receive without the sender terminating the TLS
session properly causes the TLS channel to return an error (unless the
channel was already shut down for read).

Add an optional setting that makes the channel instead just return EOF in
that case.

This possibility will be soon used by the migration multifd code.

Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
 include/io/channel-tls.h | 11 +++++++++++
 io/channel-tls.c         |  6 ++++++
 2 files changed, 17 insertions(+)

diff --git a/include/io/channel-tls.h b/include/io/channel-tls.h
index 26c67f17e2d3..8552c0d0266e 100644
--- a/include/io/channel-tls.h
+++ b/include/io/channel-tls.h
@@ -49,6 +49,7 @@ struct QIOChannelTLS {
     QCryptoTLSSession *session;
     QIOChannelShutdown shutdown;
     guint hs_ioc_tag;
+    bool premature_eof_okay;
 };
 
 /**
@@ -143,4 +144,14 @@ void qio_channel_tls_handshake(QIOChannelTLS *ioc,
 QCryptoTLSSession *
 qio_channel_tls_get_session(QIOChannelTLS *ioc);
 
+/**
+ * qio_channel_tls_set_premature_eof_okay:
+ * @ioc: the TLS channel object
+ *
+ * Sets whether receiving an EOF without the other side properly
+ * terminating the TLS session is considered okay or an error (the
+ * default behaviour).
+ */
+void qio_channel_tls_set_premature_eof_okay(QIOChannelTLS *ioc, bool enabled);
+
 #endif /* QIO_CHANNEL_TLS_H */
diff --git a/io/channel-tls.c b/io/channel-tls.c
index aab630e5ae32..1079d6d10de1 100644
--- a/io/channel-tls.c
+++ b/io/channel-tls.c
@@ -147,6 +147,11 @@ qio_channel_tls_new_client(QIOChannel *master,
     return NULL;
 }
 
+void qio_channel_tls_set_premature_eof_okay(QIOChannelTLS *ioc, bool enabled)
+{
+    ioc->premature_eof_okay = enabled;
+}
+
 struct QIOChannelTLSData {
     QIOTask *task;
     GMainContext *context;
@@ -279,6 +284,7 @@ static ssize_t qio_channel_tls_readv(QIOChannel *ioc,
             tioc->session,
             iov[i].iov_base,
             iov[i].iov_len,
+            tioc->premature_eof_okay ||
             qatomic_load_acquire(&tioc->shutdown) & QIO_CHANNEL_SHUTDOWN_READ,
             errp);
         if (ret == QCRYPTO_TLS_SESSION_ERR_BLOCK) {



* [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels
  2025-01-30 10:08 [PATCH v4 00/33] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
                   ` (6 preceding siblings ...)
  2025-01-30 10:08 ` [PATCH v4 07/33] io: tls: Allow terminating the TLS session gracefully with EOF Maciej S. Szmigiero
@ 2025-01-30 10:08 ` Maciej S. Szmigiero
  2025-02-03 18:20   ` Peter Xu
  2025-01-30 10:08 ` [PATCH v4 09/33] migration: postcopy_ram_listen_thread() needs to take BQL for some calls Maciej S. Szmigiero
                   ` (26 subsequent siblings)
  34 siblings, 1 reply; 137+ messages in thread
From: Maciej S. Szmigiero @ 2025-01-30 10:08 UTC (permalink / raw)
  To: Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

Multifd send channels are terminated by calling
qio_channel_shutdown(QIO_CHANNEL_SHUTDOWN_BOTH) in
multifd_send_terminate_threads(), which in the TLS case essentially
calls shutdown(SHUT_RDWR) on the underlying raw socket.

Unfortunately, this does not terminate the TLS session properly and
the receive side sees this as a GNUTLS_E_PREMATURE_TERMINATION error.

The only reason this wasn't causing migration failures is that the
current migration code apparently does not check whether a migration error
was set after the end of the multifd receive process.

However, this will change soon, so the multifd receive code has to be
prepared not to return an error on such a premature TLS session EOF.
Use the newly introduced QIOChannelTLS method for that.

It's worth noting that even if the sender were changed to terminate the
TLS connection properly, the receive side would still need to remain
compatible with the older QEMU bit stream, which does not do this.

Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
 migration/multifd.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/migration/multifd.c b/migration/multifd.c
index ab73d6d984cf..ceaad930e141 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -1310,6 +1310,7 @@ void multifd_recv_new_channel(QIOChannel *ioc, Error **errp)
     Error *local_err = NULL;
     bool use_packets = multifd_use_packets();
     int id;
+    QIOChannelTLS *ioc_tls;
 
     if (use_packets) {
         id = multifd_recv_initial_packet(ioc, &local_err);
@@ -1337,6 +1338,13 @@ void multifd_recv_new_channel(QIOChannel *ioc, Error **errp)
     p->c = ioc;
     object_ref(OBJECT(ioc));
 
+    ioc_tls = QIO_CHANNEL_TLS(object_dynamic_cast(OBJECT(ioc),
+                                                  TYPE_QIO_CHANNEL_TLS));
+    if (ioc_tls) {
+        /* Multifd send channels do not terminate the TLS session properly */
+        qio_channel_tls_set_premature_eof_okay(ioc_tls, true);
+    }
+
     p->thread_created = true;
     qemu_thread_create(&p->thread, p->name, multifd_recv_thread, p,
                        QEMU_THREAD_JOINABLE);



* [PATCH v4 09/33] migration: postcopy_ram_listen_thread() needs to take BQL for some calls
  2025-01-30 10:08 [PATCH v4 00/33] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
                   ` (7 preceding siblings ...)
  2025-01-30 10:08 ` [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels Maciej S. Szmigiero
@ 2025-01-30 10:08 ` Maciej S. Szmigiero
  2025-02-02  2:06   ` Dr. David Alan Gilbert
  2025-01-30 10:08 ` [PATCH v4 10/33] error: define g_autoptr() cleanup function for the Error type Maciej S. Szmigiero
                   ` (25 subsequent siblings)
  34 siblings, 1 reply; 137+ messages in thread
From: Maciej S. Szmigiero @ 2025-01-30 10:08 UTC (permalink / raw)
  To: Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

postcopy_ram_listen_thread() is a free-running thread, so it needs to
take the BQL around function calls to migration methods requiring it.

qemu_loadvm_state_main() needs BQL held since it ultimately calls
"load_state" SaveVMHandlers.

migration_incoming_state_destroy() needs BQL held since it ultimately calls
"load_cleanup" SaveVMHandlers.

Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
 migration/savevm.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/migration/savevm.c b/migration/savevm.c
index b0b74140daea..0ceea9638cc1 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -2013,7 +2013,9 @@ static void *postcopy_ram_listen_thread(void *opaque)
      * in qemu_file, and thus we must be blocking now.
      */
     qemu_file_set_blocking(f, true);
+    bql_lock();
     load_res = qemu_loadvm_state_main(f, mis);
+    bql_unlock();
 
     /*
      * This is tricky, but, mis->from_src_file can change after it
@@ -2073,7 +2075,9 @@ static void *postcopy_ram_listen_thread(void *opaque)
      * (If something broke then qemu will have to exit anyway since it's
      * got a bad migration state).
      */
+    bql_lock();
     migration_incoming_state_destroy();
+    bql_unlock();
 
     rcu_unregister_thread();
     mis->have_listen_thread = false;



* [PATCH v4 10/33] error: define g_autoptr() cleanup function for the Error type
  2025-01-30 10:08 [PATCH v4 00/33] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
                   ` (8 preceding siblings ...)
  2025-01-30 10:08 ` [PATCH v4 09/33] migration: postcopy_ram_listen_thread() needs to take BQL for some calls Maciej S. Szmigiero
@ 2025-01-30 10:08 ` Maciej S. Szmigiero
  2025-02-03 20:53   ` Peter Xu
  2025-02-03 21:13   ` Daniel P. Berrangé
  2025-01-30 10:08 ` [PATCH v4 11/33] migration: Add thread pool of optional load threads Maciej S. Szmigiero
                   ` (24 subsequent siblings)
  34 siblings, 2 replies; 137+ messages in thread
From: Maciej S. Szmigiero @ 2025-01-30 10:08 UTC (permalink / raw)
  To: Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

Automatic memory management helps avoid memory safety issues.

Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
 include/qapi/error.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/qapi/error.h b/include/qapi/error.h
index 71f8fb2c50ee..649ec8f1b6a2 100644
--- a/include/qapi/error.h
+++ b/include/qapi/error.h
@@ -437,6 +437,8 @@ Error *error_copy(const Error *err);
  */
 void error_free(Error *err);
 
+G_DEFINE_AUTOPTR_CLEANUP_FUNC(Error, error_free)
+
 /*
  * Convenience function to assert that *@errp is set, then silently free it.
  */



* [PATCH v4 11/33] migration: Add thread pool of optional load threads
  2025-01-30 10:08 [PATCH v4 00/33] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
                   ` (9 preceding siblings ...)
  2025-01-30 10:08 ` [PATCH v4 10/33] error: define g_autoptr() cleanup function for the Error type Maciej S. Szmigiero
@ 2025-01-30 10:08 ` Maciej S. Szmigiero
  2025-01-30 10:08 ` [PATCH v4 12/33] migration/multifd: Split packet into header and RAM data Maciej S. Szmigiero
                   ` (23 subsequent siblings)
  34 siblings, 0 replies; 137+ messages in thread
From: Maciej S. Szmigiero @ 2025-01-30 10:08 UTC (permalink / raw)
  To: Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

Some drivers might want to make use of auxiliary helper threads during VM
state loading, for example to make sure that their blocking (sync) I/O
operations don't block the rest of the migration process.

Add a migration core managed thread pool to facilitate this use case.

The migration core will wait for these threads to finish before
(re)starting the VM at destination.

Reviewed-by: Fabiano Rosas <farosas@suse.de>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
 include/migration/misc.h |  3 ++
 include/qemu/typedefs.h  |  2 +
 migration/migration.c    |  2 +-
 migration/migration.h    |  5 +++
 migration/savevm.c       | 92 +++++++++++++++++++++++++++++++++++++++-
 migration/savevm.h       |  2 +-
 6 files changed, 103 insertions(+), 3 deletions(-)

diff --git a/include/migration/misc.h b/include/migration/misc.h
index 67f7ef7a0e5c..bc5ce31b52e0 100644
--- a/include/migration/misc.h
+++ b/include/migration/misc.h
@@ -45,9 +45,12 @@ bool migrate_ram_is_ignored(RAMBlock *block);
 /* migration/block.c */
 
 AnnounceParameters *migrate_announce_params(void);
+
 /* migration/savevm.c */
 
 void dump_vmstate_json_to_file(FILE *out_fp);
+void qemu_loadvm_start_load_thread(MigrationLoadThread function,
+                                   void *opaque);
 
 /* migration/migration.c */
 void migration_object_init(void);
diff --git a/include/qemu/typedefs.h b/include/qemu/typedefs.h
index 3d84efcac47a..fd23ff7771b1 100644
--- a/include/qemu/typedefs.h
+++ b/include/qemu/typedefs.h
@@ -131,5 +131,7 @@ typedef struct IRQState *qemu_irq;
  * Function types
  */
 typedef void (*qemu_irq_handler)(void *opaque, int n, int level);
+typedef bool (*MigrationLoadThread)(void *opaque, bool *should_quit,
+                                    Error **errp);
 
 #endif /* QEMU_TYPEDEFS_H */
diff --git a/migration/migration.c b/migration/migration.c
index 65b51d360896..0f29188499e4 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -400,7 +400,7 @@ void migration_incoming_state_destroy(void)
      * RAM state cleanup needs to happen after multifd cleanup, because
      * multifd threads can use some of its states (receivedmap).
      */
-    qemu_loadvm_state_cleanup();
+    qemu_loadvm_state_cleanup(mis);
 
     if (mis->to_src_file) {
         /* Tell source that we are done */
diff --git a/migration/migration.h b/migration/migration.h
index c5731626bbfb..1699fe7d91cc 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -43,6 +43,7 @@
 #define  MIGRATION_THREAD_DST_PREEMPT       "mig/dst/preempt"
 
 struct PostcopyBlocktimeContext;
+typedef struct ThreadPool ThreadPool;
 
 #define  MIGRATION_RESUME_ACK_VALUE  (1)
 
@@ -187,6 +188,10 @@ struct MigrationIncomingState {
     Coroutine *colo_incoming_co;
     QemuSemaphore colo_incoming_sem;
 
+    /* Optional load threads pool and its thread exit request flag */
+    ThreadPool *load_threads;
+    bool load_threads_abort;
+
     /*
      * PostcopyBlocktimeContext to keep information for postcopy
      * live migration, to calculate vCPU block time
diff --git a/migration/savevm.c b/migration/savevm.c
index 0ceea9638cc1..74d1960de3c6 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -54,6 +54,7 @@
 #include "qemu/job.h"
 #include "qemu/main-loop.h"
 #include "block/snapshot.h"
+#include "block/thread-pool.h"
 #include "qemu/cutils.h"
 #include "io/channel-buffer.h"
 #include "io/channel-file.h"
@@ -131,6 +132,35 @@ static struct mig_cmd_args {
  * generic extendable format with an exception for two old entities.
  */
 
+/***********************************************************/
+/* Optional load threads pool support */
+
+static void qemu_loadvm_thread_pool_create(MigrationIncomingState *mis)
+{
+    assert(!mis->load_threads);
+    mis->load_threads = thread_pool_new();
+    mis->load_threads_abort = false;
+}
+
+static void qemu_loadvm_thread_pool_destroy(MigrationIncomingState *mis)
+{
+    qatomic_set(&mis->load_threads_abort, true);
+
+    bql_unlock(); /* Load threads might be waiting for BQL */
+    g_clear_pointer(&mis->load_threads, thread_pool_free);
+    bql_lock();
+}
+
+static bool qemu_loadvm_thread_pool_wait(MigrationState *s,
+                                         MigrationIncomingState *mis)
+{
+    bql_unlock(); /* Let load threads do work requiring BQL */
+    thread_pool_wait(mis->load_threads);
+    bql_lock();
+
+    return !migrate_has_error(s);
+}
+
 /***********************************************************/
 /* savevm/loadvm support */
 
@@ -2810,16 +2840,62 @@ static int qemu_loadvm_state_setup(QEMUFile *f, Error **errp)
     return 0;
 }
 
-void qemu_loadvm_state_cleanup(void)
+struct LoadThreadData {
+    MigrationLoadThread function;
+    void *opaque;
+};
+
+static int qemu_loadvm_load_thread(void *thread_opaque)
+{
+    struct LoadThreadData *data = thread_opaque;
+    MigrationIncomingState *mis = migration_incoming_get_current();
+    g_autoptr(Error) local_err = NULL;
+
+    if (!data->function(data->opaque, &mis->load_threads_abort, &local_err)) {
+        MigrationState *s = migrate_get_current();
+
+        assert(local_err);
+
+        /*
+         * In case of multiple load threads failing, which thread's
+         * error we end up reporting is purely arbitrary.
+         */
+        migrate_set_error(s, local_err);
+    }
+
+    return 0;
+}
+
+void qemu_loadvm_start_load_thread(MigrationLoadThread function,
+                                   void *opaque)
+{
+    MigrationIncomingState *mis = migration_incoming_get_current();
+    struct LoadThreadData *data;
+
+    /* We only set it from this thread so it's okay to read it directly */
+    assert(!mis->load_threads_abort);
+
+    data = g_new(struct LoadThreadData, 1);
+    data->function = function;
+    data->opaque = opaque;
+
+    thread_pool_submit_immediate(mis->load_threads, qemu_loadvm_load_thread,
+                                 data, g_free);
+}
+
+void qemu_loadvm_state_cleanup(MigrationIncomingState *mis)
 {
     SaveStateEntry *se;
 
     trace_loadvm_state_cleanup();
+
     QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
         if (se->ops && se->ops->load_cleanup) {
             se->ops->load_cleanup(se->opaque);
         }
     }
+
+    qemu_loadvm_thread_pool_destroy(mis);
 }
 
 /* Return true if we should continue the migration, or false. */
@@ -2970,6 +3046,7 @@ out:
 
 int qemu_loadvm_state(QEMUFile *f)
 {
+    MigrationState *s = migrate_get_current();
     MigrationIncomingState *mis = migration_incoming_get_current();
     Error *local_err = NULL;
     int ret;
@@ -2979,6 +3056,8 @@ int qemu_loadvm_state(QEMUFile *f)
         return -EINVAL;
     }
 
+    qemu_loadvm_thread_pool_create(mis);
+
     ret = qemu_loadvm_state_header(f);
     if (ret) {
         return ret;
@@ -3008,6 +3087,17 @@ int qemu_loadvm_state(QEMUFile *f)
         return ret;
     }
 
+    if (ret == 0) {
+        if (!qemu_loadvm_thread_pool_wait(s, mis)) {
+            ret = -EINVAL;
+        }
+    }
+    /*
+     * Set this flag unconditionally so we'll catch further attempts to
+     * start additional threads via an appropriate assert()
+     */
+    qatomic_set(&mis->load_threads_abort, true);
+
     if (ret == 0) {
         ret = qemu_file_get_error(f);
     }
diff --git a/migration/savevm.h b/migration/savevm.h
index 8b78493dbc0e..3fa06574e632 100644
--- a/migration/savevm.h
+++ b/migration/savevm.h
@@ -64,7 +64,7 @@ void qemu_savevm_live_state(QEMUFile *f);
 int qemu_save_device_state(QEMUFile *f);
 
 int qemu_loadvm_state(QEMUFile *f);
-void qemu_loadvm_state_cleanup(void);
+void qemu_loadvm_state_cleanup(MigrationIncomingState *mis);
 int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis);
 int qemu_load_device_state(QEMUFile *f);
 int qemu_loadvm_approve_switchover(void);


^ permalink raw reply related	[flat|nested] 137+ messages in thread

* [PATCH v4 12/33] migration/multifd: Split packet into header and RAM data
  2025-01-30 10:08 [PATCH v4 00/33] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
                   ` (10 preceding siblings ...)
  2025-01-30 10:08 ` [PATCH v4 11/33] migration: Add thread pool of optional load threads Maciej S. Szmigiero
@ 2025-01-30 10:08 ` Maciej S. Szmigiero
  2025-01-30 10:08 ` [PATCH v4 13/33] migration/multifd: Device state transfer support - receive side Maciej S. Szmigiero
                   ` (22 subsequent siblings)
  34 siblings, 0 replies; 137+ messages in thread
From: Maciej S. Szmigiero @ 2025-01-30 10:08 UTC (permalink / raw)
  To: Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

Read the packet header first so that in the future we will be able to
differentiate between a RAM multifd packet and a device state multifd
packet.

Since these two are of different sizes, we can't read the packet body
until we know which packet type it is.

Reviewed-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
 migration/multifd.c | 49 +++++++++++++++++++++++++++++++++++----------
 migration/multifd.h |  5 +++++
 2 files changed, 43 insertions(+), 11 deletions(-)

diff --git a/migration/multifd.c b/migration/multifd.c
index ceaad930e141..53493676012e 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -209,10 +209,10 @@ void multifd_send_fill_packet(MultiFDSendParams *p)
 
     memset(packet, 0, p->packet_len);
 
-    packet->magic = cpu_to_be32(MULTIFD_MAGIC);
-    packet->version = cpu_to_be32(MULTIFD_VERSION);
+    packet->hdr.magic = cpu_to_be32(MULTIFD_MAGIC);
+    packet->hdr.version = cpu_to_be32(MULTIFD_VERSION);
 
-    packet->flags = cpu_to_be32(p->flags);
+    packet->hdr.flags = cpu_to_be32(p->flags);
     packet->next_packet_size = cpu_to_be32(p->next_packet_size);
 
     packet_num = qatomic_fetch_inc(&multifd_send_state->packet_num);
@@ -228,12 +228,12 @@ void multifd_send_fill_packet(MultiFDSendParams *p)
                             p->flags, p->next_packet_size);
 }
 
-static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error **errp)
+static int multifd_recv_unfill_packet_header(MultiFDRecvParams *p,
+                                             const MultiFDPacketHdr_t *hdr,
+                                             Error **errp)
 {
-    const MultiFDPacket_t *packet = p->packet;
-    uint32_t magic = be32_to_cpu(packet->magic);
-    uint32_t version = be32_to_cpu(packet->version);
-    int ret = 0;
+    uint32_t magic = be32_to_cpu(hdr->magic);
+    uint32_t version = be32_to_cpu(hdr->version);
 
     if (magic != MULTIFD_MAGIC) {
         error_setg(errp, "multifd: received packet magic %x, expected %x",
@@ -247,7 +247,16 @@ static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error **errp)
         return -1;
     }
 
-    p->flags = be32_to_cpu(packet->flags);
+    p->flags = be32_to_cpu(hdr->flags);
+
+    return 0;
+}
+
+static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error **errp)
+{
+    const MultiFDPacket_t *packet = p->packet;
+    int ret = 0;
+
     p->next_packet_size = be32_to_cpu(packet->next_packet_size);
     p->packet_num = be64_to_cpu(packet->packet_num);
     p->packets_recved++;
@@ -1130,8 +1139,12 @@ static void *multifd_recv_thread(void *opaque)
     rcu_register_thread();
 
     while (true) {
+        MultiFDPacketHdr_t hdr;
         uint32_t flags = 0;
         bool has_data = false;
+        uint8_t *pkt_buf;
+        size_t pkt_len;
+
         p->normal_num = 0;
 
         if (use_packets) {
@@ -1139,8 +1152,22 @@ static void *multifd_recv_thread(void *opaque)
                 break;
             }
 
-            ret = qio_channel_read_all_eof(p->c, (void *)p->packet,
-                                           p->packet_len, &local_err);
+            ret = qio_channel_read_all_eof(p->c, (void *)&hdr,
+                                           sizeof(hdr), &local_err);
+            if (ret == 0 || ret == -1) {   /* 0: EOF  -1: Error */
+                break;
+            }
+
+            ret = multifd_recv_unfill_packet_header(p, &hdr, &local_err);
+            if (ret) {
+                break;
+            }
+
+            pkt_buf = (uint8_t *)p->packet + sizeof(hdr);
+            pkt_len = p->packet_len - sizeof(hdr);
+
+            ret = qio_channel_read_all_eof(p->c, (char *)pkt_buf, pkt_len,
+                                           &local_err);
             if (ret == 0 || ret == -1) {   /* 0: EOF  -1: Error */
                 break;
             }
diff --git a/migration/multifd.h b/migration/multifd.h
index bd785b987315..9e4baa066312 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -69,6 +69,11 @@ typedef struct {
     uint32_t magic;
     uint32_t version;
     uint32_t flags;
+} __attribute__((packed)) MultiFDPacketHdr_t;
+
+typedef struct {
+    MultiFDPacketHdr_t hdr;
+
     /* maximum number of allocated pages */
     uint32_t pages_alloc;
     /* non zero pages */


^ permalink raw reply related	[flat|nested] 137+ messages in thread

* [PATCH v4 13/33] migration/multifd: Device state transfer support - receive side
  2025-01-30 10:08 [PATCH v4 00/33] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
                   ` (11 preceding siblings ...)
  2025-01-30 10:08 ` [PATCH v4 12/33] migration/multifd: Split packet into header and RAM data Maciej S. Szmigiero
@ 2025-01-30 10:08 ` Maciej S. Szmigiero
  2025-02-03 21:27   ` Peter Xu
  2025-01-30 10:08 ` [PATCH v4 14/33] migration/multifd: Make multifd_send() thread safe Maciej S. Szmigiero
                   ` (21 subsequent siblings)
  34 siblings, 1 reply; 137+ messages in thread
From: Maciej S. Szmigiero @ 2025-01-30 10:08 UTC (permalink / raw)
  To: Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

Add basic support for receiving device state via multifd channels -
channels that are shared with RAM transfers.

Depending on whether the MULTIFD_FLAG_DEVICE_STATE flag is present in the
packet header, either device state (MultiFDPacketDeviceState_t) or RAM
data (the existing MultiFDPacket_t) is read.

The received device state data is passed to the
qemu_loadvm_load_state_buffer() function for processing in the
device's load_state_buffer handler.

Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
 migration/multifd.c | 99 ++++++++++++++++++++++++++++++++++++++++-----
 migration/multifd.h | 26 +++++++++++-
 2 files changed, 113 insertions(+), 12 deletions(-)

diff --git a/migration/multifd.c b/migration/multifd.c
index 53493676012e..810e7b1fb340 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -21,6 +21,7 @@
 #include "file.h"
 #include "migration.h"
 #include "migration-stats.h"
+#include "savevm.h"
 #include "socket.h"
 #include "tls.h"
 #include "qemu-file.h"
@@ -252,14 +253,24 @@ static int multifd_recv_unfill_packet_header(MultiFDRecvParams *p,
     return 0;
 }
 
-static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error **errp)
+static int multifd_recv_unfill_packet_device_state(MultiFDRecvParams *p,
+                                                   Error **errp)
+{
+    MultiFDPacketDeviceState_t *packet = p->packet_dev_state;
+
+    packet->instance_id = be32_to_cpu(packet->instance_id);
+    p->next_packet_size = be32_to_cpu(packet->next_packet_size);
+
+    return 0;
+}
+
+static int multifd_recv_unfill_packet_ram(MultiFDRecvParams *p, Error **errp)
 {
     const MultiFDPacket_t *packet = p->packet;
     int ret = 0;
 
     p->next_packet_size = be32_to_cpu(packet->next_packet_size);
     p->packet_num = be64_to_cpu(packet->packet_num);
-    p->packets_recved++;
 
     /* Always unfill, old QEMUs (<9.0) send data along with SYNC */
     ret = multifd_ram_unfill_packet(p, errp);
@@ -270,6 +281,17 @@ static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error **errp)
     return ret;
 }
 
+static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error **errp)
+{
+    p->packets_recved++;
+
+    if (p->flags & MULTIFD_FLAG_DEVICE_STATE) {
+        return multifd_recv_unfill_packet_device_state(p, errp);
+    }
+
+    return multifd_recv_unfill_packet_ram(p, errp);
+}
+
 static bool multifd_send_should_exit(void)
 {
     return qatomic_read(&multifd_send_state->exiting);
@@ -1027,6 +1049,7 @@ static void multifd_recv_cleanup_channel(MultiFDRecvParams *p)
     p->packet_len = 0;
     g_free(p->packet);
     p->packet = NULL;
+    g_clear_pointer(&p->packet_dev_state, g_free);
     g_free(p->normal);
     p->normal = NULL;
     g_free(p->zero);
@@ -1128,6 +1151,32 @@ void multifd_recv_sync_main(void)
     trace_multifd_recv_sync_main(multifd_recv_state->packet_num);
 }
 
+static int multifd_device_state_recv(MultiFDRecvParams *p, Error **errp)
+{
+    g_autofree char *idstr = NULL;
+    g_autofree char *dev_state_buf = NULL;
+    int ret;
+
+    dev_state_buf = g_malloc(p->next_packet_size);
+
+    ret = qio_channel_read_all(p->c, dev_state_buf, p->next_packet_size, errp);
+    if (ret != 0) {
+        return ret;
+    }
+
+    idstr = g_strndup(p->packet_dev_state->idstr,
+                      sizeof(p->packet_dev_state->idstr));
+
+    if (!qemu_loadvm_load_state_buffer(idstr,
+                                       p->packet_dev_state->instance_id,
+                                       dev_state_buf, p->next_packet_size,
+                                       errp)) {
+        ret = -1;
+    }
+
+    return ret;
+}
+
 static void *multifd_recv_thread(void *opaque)
 {
     MultiFDRecvParams *p = opaque;
@@ -1141,6 +1190,7 @@ static void *multifd_recv_thread(void *opaque)
     while (true) {
         MultiFDPacketHdr_t hdr;
         uint32_t flags = 0;
+        bool is_device_state = false;
         bool has_data = false;
         uint8_t *pkt_buf;
         size_t pkt_len;
@@ -1163,8 +1213,14 @@ static void *multifd_recv_thread(void *opaque)
                 break;
             }
 
-            pkt_buf = (uint8_t *)p->packet + sizeof(hdr);
-            pkt_len = p->packet_len - sizeof(hdr);
+            is_device_state = p->flags & MULTIFD_FLAG_DEVICE_STATE;
+            if (is_device_state) {
+                pkt_buf = (uint8_t *)p->packet_dev_state + sizeof(hdr);
+                pkt_len = sizeof(*p->packet_dev_state) - sizeof(hdr);
+            } else {
+                pkt_buf = (uint8_t *)p->packet + sizeof(hdr);
+                pkt_len = p->packet_len - sizeof(hdr);
+            }
 
             ret = qio_channel_read_all_eof(p->c, (char *)pkt_buf, pkt_len,
                                            &local_err);
@@ -1183,12 +1239,17 @@ static void *multifd_recv_thread(void *opaque)
             /* recv methods don't know how to handle the SYNC flag */
             p->flags &= ~MULTIFD_FLAG_SYNC;
 
-            /*
-             * Even if it's a SYNC packet, this needs to be set
-             * because older QEMUs (<9.0) still send data along with
-             * the SYNC packet.
-             */
-            has_data = p->normal_num || p->zero_num;
+            if (is_device_state) {
+                has_data = p->next_packet_size > 0;
+            } else {
+                /*
+                 * Even if it's a SYNC packet, this needs to be set
+                 * because older QEMUs (<9.0) still send data along with
+                 * the SYNC packet.
+                 */
+                has_data = p->normal_num || p->zero_num;
+            }
+
             qemu_mutex_unlock(&p->mutex);
         } else {
             /*
@@ -1217,14 +1278,29 @@ static void *multifd_recv_thread(void *opaque)
         }
 
         if (has_data) {
-            ret = multifd_recv_state->ops->recv(p, &local_err);
+            if (is_device_state) {
+                assert(use_packets);
+                ret = multifd_device_state_recv(p, &local_err);
+            } else {
+                ret = multifd_recv_state->ops->recv(p, &local_err);
+            }
             if (ret != 0) {
                 break;
             }
+        } else if (is_device_state) {
+            error_setg(&local_err,
+                       "multifd: received empty device state packet");
+            break;
         }
 
         if (use_packets) {
             if (flags & MULTIFD_FLAG_SYNC) {
+                if (is_device_state) {
+                    error_setg(&local_err,
+                               "multifd: received SYNC device state packet");
+                    break;
+                }
+
                 qemu_sem_post(&multifd_recv_state->sem_sync);
                 qemu_sem_wait(&p->sem_sync);
             }
@@ -1293,6 +1369,7 @@ int multifd_recv_setup(Error **errp)
             p->packet_len = sizeof(MultiFDPacket_t)
                 + sizeof(uint64_t) * page_count;
             p->packet = g_malloc0(p->packet_len);
+            p->packet_dev_state = g_malloc0(sizeof(*p->packet_dev_state));
         }
         p->name = g_strdup_printf(MIGRATION_THREAD_DST_MULTIFD, i);
         p->normal = g_new0(ram_addr_t, page_count);
diff --git a/migration/multifd.h b/migration/multifd.h
index 9e4baa066312..abf3acdcee40 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -62,6 +62,12 @@ MultiFDRecvData *multifd_get_recv_data(void);
 #define MULTIFD_FLAG_UADK (8 << 1)
 #define MULTIFD_FLAG_QATZIP (16 << 1)
 
+/*
+ * If set it means that this packet contains device state
+ * (MultiFDPacketDeviceState_t), not RAM data (MultiFDPacket_t).
+ */
+#define MULTIFD_FLAG_DEVICE_STATE (1 << 6)
+
 /* This value needs to be a multiple of qemu_target_page_size() */
 #define MULTIFD_PACKET_SIZE (512 * 1024)
 
@@ -94,6 +100,16 @@ typedef struct {
     uint64_t offset[];
 } __attribute__((packed)) MultiFDPacket_t;
 
+typedef struct {
+    MultiFDPacketHdr_t hdr;
+
+    char idstr[256] QEMU_NONSTRING;
+    uint32_t instance_id;
+
+    /* size of the next packet that contains the actual data */
+    uint32_t next_packet_size;
+} __attribute__((packed)) MultiFDPacketDeviceState_t;
+
 typedef struct {
     /* number of used pages */
     uint32_t num;
@@ -111,6 +127,13 @@ struct MultiFDRecvData {
     off_t file_offset;
 };
 
+typedef struct {
+    char *idstr;
+    uint32_t instance_id;
+    char *buf;
+    size_t buf_len;
+} MultiFDDeviceState_t;
+
 typedef enum {
     MULTIFD_PAYLOAD_NONE,
     MULTIFD_PAYLOAD_RAM,
@@ -227,8 +250,9 @@ typedef struct {
 
     /* thread local variables. No locking required */
 
-    /* pointer to the packet */
+    /* pointers to the possible packet types */
     MultiFDPacket_t *packet;
+    MultiFDPacketDeviceState_t *packet_dev_state;
     /* size of the next packet that contains pages */
     uint32_t next_packet_size;
     /* packets received through this channel */


^ permalink raw reply related	[flat|nested] 137+ messages in thread

* [PATCH v4 14/33] migration/multifd: Make multifd_send() thread safe
  2025-01-30 10:08 [PATCH v4 00/33] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
                   ` (12 preceding siblings ...)
  2025-01-30 10:08 ` [PATCH v4 13/33] migration/multifd: Device state transfer support - receive side Maciej S. Szmigiero
@ 2025-01-30 10:08 ` Maciej S. Szmigiero
  2025-01-30 10:08 ` [PATCH v4 15/33] migration/multifd: Add an explicit MultiFDSendData destructor Maciej S. Szmigiero
                   ` (20 subsequent siblings)
  34 siblings, 0 replies; 137+ messages in thread
From: Maciej S. Szmigiero @ 2025-01-30 10:08 UTC (permalink / raw)
  To: Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

The multifd_send() function is currently not thread safe; make it thread
safe by holding a lock during its execution.

This way it will be possible to safely call it concurrently from multiple
threads.

Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
 migration/multifd.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/migration/multifd.c b/migration/multifd.c
index 810e7b1fb340..02d163fe292d 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -50,6 +50,10 @@ typedef struct {
 
 struct {
     MultiFDSendParams *params;
+
+    /* multifd_send() body is not thread safe, needs serialization */
+    QemuMutex multifd_send_mutex;
+
     /*
      * Global number of generated multifd packets.
      *
@@ -339,6 +343,8 @@ bool multifd_send(MultiFDSendData **send_data)
         return false;
     }
 
+    QEMU_LOCK_GUARD(&multifd_send_state->multifd_send_mutex);
+
     /* We wait here, until at least one channel is ready */
     qemu_sem_wait(&multifd_send_state->channels_ready);
 
@@ -507,6 +513,7 @@ static void multifd_send_cleanup_state(void)
     socket_cleanup_outgoing_migration();
     qemu_sem_destroy(&multifd_send_state->channels_created);
     qemu_sem_destroy(&multifd_send_state->channels_ready);
+    qemu_mutex_destroy(&multifd_send_state->multifd_send_mutex);
     g_free(multifd_send_state->params);
     multifd_send_state->params = NULL;
     g_free(multifd_send_state);
@@ -857,6 +864,7 @@ bool multifd_send_setup(void)
     thread_count = migrate_multifd_channels();
     multifd_send_state = g_malloc0(sizeof(*multifd_send_state));
     multifd_send_state->params = g_new0(MultiFDSendParams, thread_count);
+    qemu_mutex_init(&multifd_send_state->multifd_send_mutex);
     qemu_sem_init(&multifd_send_state->channels_created, 0);
     qemu_sem_init(&multifd_send_state->channels_ready, 0);
     qatomic_set(&multifd_send_state->exiting, 0);


^ permalink raw reply related	[flat|nested] 137+ messages in thread

* [PATCH v4 15/33] migration/multifd: Add an explicit MultiFDSendData destructor
  2025-01-30 10:08 [PATCH v4 00/33] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
                   ` (13 preceding siblings ...)
  2025-01-30 10:08 ` [PATCH v4 14/33] migration/multifd: Make multifd_send() thread safe Maciej S. Szmigiero
@ 2025-01-30 10:08 ` Maciej S. Szmigiero
  2025-01-30 10:08 ` [PATCH v4 16/33] migration/multifd: Device state transfer support - send side Maciej S. Szmigiero
                   ` (19 subsequent siblings)
  34 siblings, 0 replies; 137+ messages in thread
From: Maciej S. Szmigiero @ 2025-01-30 10:08 UTC (permalink / raw)
  To: Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

This way, if there are fields that need explicit disposal (for example,
attached buffers), they will be handled appropriately.

Add a related assert to multifd_set_payload_type() in order to make sure
that this function is only used to fill a previously empty MultiFDSendData
with some payload, not the other way around.

Reviewed-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
 migration/multifd-nocomp.c |  3 +--
 migration/multifd.c        | 31 ++++++++++++++++++++++++++++---
 migration/multifd.h        |  5 +++++
 3 files changed, 34 insertions(+), 5 deletions(-)

diff --git a/migration/multifd-nocomp.c b/migration/multifd-nocomp.c
index 1325dba97cea..e46e79d8b272 100644
--- a/migration/multifd-nocomp.c
+++ b/migration/multifd-nocomp.c
@@ -42,8 +42,7 @@ void multifd_ram_save_setup(void)
 
 void multifd_ram_save_cleanup(void)
 {
-    g_free(multifd_ram_send);
-    multifd_ram_send = NULL;
+    g_clear_pointer(&multifd_ram_send, multifd_send_data_free);
 }
 
 static void multifd_set_file_bitmap(MultiFDSendParams *p)
diff --git a/migration/multifd.c b/migration/multifd.c
index 02d163fe292d..3353a5da7593 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -123,6 +123,32 @@ MultiFDSendData *multifd_send_data_alloc(void)
     return g_malloc0(size_minus_payload + max_payload_size);
 }
 
+void multifd_send_data_clear(MultiFDSendData *data)
+{
+    if (multifd_payload_empty(data)) {
+        return;
+    }
+
+    switch (data->type) {
+    default:
+        /* Nothing to do */
+        break;
+    }
+
+    data->type = MULTIFD_PAYLOAD_NONE;
+}
+
+void multifd_send_data_free(MultiFDSendData *data)
+{
+    if (!data) {
+        return;
+    }
+
+    multifd_send_data_clear(data);
+
+    g_free(data);
+}
+
 static bool multifd_use_packets(void)
 {
     return !migrate_mapped_ram();
@@ -496,8 +522,7 @@ static bool multifd_send_cleanup_channel(MultiFDSendParams *p, Error **errp)
     qemu_sem_destroy(&p->sem_sync);
     g_free(p->name);
     p->name = NULL;
-    g_free(p->data);
-    p->data = NULL;
+    g_clear_pointer(&p->data, multifd_send_data_free);
     p->packet_len = 0;
     g_free(p->packet);
     p->packet = NULL;
@@ -665,7 +690,7 @@ static void *multifd_send_thread(void *opaque)
                        (uint64_t)p->next_packet_size + p->packet_len);
 
             p->next_packet_size = 0;
-            multifd_set_payload_type(p->data, MULTIFD_PAYLOAD_NONE);
+            multifd_send_data_clear(p->data);
 
             /*
              * Making sure p->data is published before saying "we're
diff --git a/migration/multifd.h b/migration/multifd.h
index abf3acdcee40..436598999e6f 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -156,6 +156,9 @@ static inline bool multifd_payload_empty(MultiFDSendData *data)
 static inline void multifd_set_payload_type(MultiFDSendData *data,
                                             MultiFDPayloadType type)
 {
+    assert(multifd_payload_empty(data));
+    assert(type != MULTIFD_PAYLOAD_NONE);
+
     data->type = type;
 }
 
@@ -370,6 +373,8 @@ static inline void multifd_send_prepare_header(MultiFDSendParams *p)
 void multifd_channel_connect(MultiFDSendParams *p, QIOChannel *ioc);
 bool multifd_send(MultiFDSendData **send_data);
 MultiFDSendData *multifd_send_data_alloc(void);
+void multifd_send_data_clear(MultiFDSendData *data);
+void multifd_send_data_free(MultiFDSendData *data);
 
 static inline uint32_t multifd_ram_page_size(void)
 {


^ permalink raw reply related	[flat|nested] 137+ messages in thread

* [PATCH v4 16/33] migration/multifd: Device state transfer support - send side
  2025-01-30 10:08 [PATCH v4 00/33] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
                   ` (14 preceding siblings ...)
  2025-01-30 10:08 ` [PATCH v4 15/33] migration/multifd: Add an explicit MultiFDSendData destructor Maciej S. Szmigiero
@ 2025-01-30 10:08 ` Maciej S. Szmigiero
  2025-02-03 21:47   ` Peter Xu
  2025-01-30 10:08 ` [PATCH v4 17/33] migration/multifd: Make MultiFDSendData a struct Maciej S. Szmigiero
                   ` (18 subsequent siblings)
  34 siblings, 1 reply; 137+ messages in thread
From: Maciej S. Szmigiero @ 2025-01-30 10:08 UTC (permalink / raw)
  To: Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

A new function, multifd_queue_device_state(), is provided for a device to
queue its state for transmission via a multifd channel.

Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
 include/migration/misc.h         |   4 ++
 migration/meson.build            |   1 +
 migration/multifd-device-state.c | 107 +++++++++++++++++++++++++++++++
 migration/multifd-nocomp.c       |  14 +++-
 migration/multifd.c              |  42 ++++++++++--
 migration/multifd.h              |  27 +++++---
 6 files changed, 179 insertions(+), 16 deletions(-)
 create mode 100644 migration/multifd-device-state.c

diff --git a/include/migration/misc.h b/include/migration/misc.h
index bc5ce31b52e0..885022d21a0c 100644
--- a/include/migration/misc.h
+++ b/include/migration/misc.h
@@ -111,4 +111,8 @@ bool migration_in_bg_snapshot(void);
 bool migration_block_activate(Error **errp);
 bool migration_block_inactivate(void);
 
+/* migration/multifd-device-state.c */
+bool multifd_queue_device_state(char *idstr, uint32_t instance_id,
+                                char *data, size_t len);
+
 #endif
diff --git a/migration/meson.build b/migration/meson.build
index dac687ee3acb..50263fa741e7 100644
--- a/migration/meson.build
+++ b/migration/meson.build
@@ -23,6 +23,7 @@ system_ss.add(files(
   'migration-hmp-cmds.c',
   'migration.c',
   'multifd.c',
+  'multifd-device-state.c',
   'multifd-nocomp.c',
   'multifd-zlib.c',
   'multifd-zero-page.c',
diff --git a/migration/multifd-device-state.c b/migration/multifd-device-state.c
new file mode 100644
index 000000000000..2207bea9bf8a
--- /dev/null
+++ b/migration/multifd-device-state.c
@@ -0,0 +1,107 @@
+/*
+ * Multifd device state migration
+ *
+ * Copyright (C) 2024,2025 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/lockable.h"
+#include "migration/misc.h"
+#include "multifd.h"
+
+static QemuMutex queue_job_mutex;
+
+static MultiFDSendData *device_state_send;
+
+size_t multifd_device_state_payload_size(void)
+{
+    return sizeof(MultiFDDeviceState_t);
+}
+
+void multifd_device_state_send_setup(void)
+{
+    qemu_mutex_init(&queue_job_mutex);
+
+    assert(!device_state_send);
+    device_state_send = multifd_send_data_alloc();
+}
+
+void multifd_device_state_send_cleanup(void)
+{
+    g_clear_pointer(&device_state_send, multifd_send_data_free);
+
+    qemu_mutex_destroy(&queue_job_mutex);
+}
+
+void multifd_send_data_clear_device_state(MultiFDDeviceState_t *device_state)
+{
+    g_clear_pointer(&device_state->idstr, g_free);
+    g_clear_pointer(&device_state->buf, g_free);
+}
+
+static void multifd_device_state_fill_packet(MultiFDSendParams *p)
+{
+    MultiFDDeviceState_t *device_state = &p->data->u.device_state;
+    MultiFDPacketDeviceState_t *packet = p->packet_device_state;
+
+    packet->hdr.flags = cpu_to_be32(p->flags);
+    strncpy(packet->idstr, device_state->idstr, sizeof(packet->idstr));
+    packet->instance_id = cpu_to_be32(device_state->instance_id);
+    packet->next_packet_size = cpu_to_be32(p->next_packet_size);
+}
+
+static void multifd_prepare_header_device_state(MultiFDSendParams *p)
+{
+    p->iov[0].iov_len = sizeof(*p->packet_device_state);
+    p->iov[0].iov_base = p->packet_device_state;
+    p->iovs_num++;
+}
+
+void multifd_device_state_send_prepare(MultiFDSendParams *p)
+{
+    MultiFDDeviceState_t *device_state = &p->data->u.device_state;
+
+    assert(multifd_payload_device_state(p->data));
+
+    multifd_prepare_header_device_state(p);
+
+    assert(!(p->flags & MULTIFD_FLAG_SYNC));
+
+    p->next_packet_size = device_state->buf_len;
+    if (p->next_packet_size > 0) {
+        p->iov[p->iovs_num].iov_base = device_state->buf;
+        p->iov[p->iovs_num].iov_len = p->next_packet_size;
+        p->iovs_num++;
+    }
+
+    p->flags |= MULTIFD_FLAG_NOCOMP | MULTIFD_FLAG_DEVICE_STATE;
+
+    multifd_device_state_fill_packet(p);
+}
+
+bool multifd_queue_device_state(char *idstr, uint32_t instance_id,
+                                char *data, size_t len)
+{
+    /* Device state submissions can come from multiple threads */
+    QEMU_LOCK_GUARD(&queue_job_mutex);
+    MultiFDDeviceState_t *device_state;
+
+    assert(multifd_payload_empty(device_state_send));
+
+    multifd_set_payload_type(device_state_send, MULTIFD_PAYLOAD_DEVICE_STATE);
+    device_state = &device_state_send->u.device_state;
+    device_state->idstr = g_strdup(idstr);
+    device_state->instance_id = instance_id;
+    device_state->buf = g_memdup2(data, len);
+    device_state->buf_len = len;
+
+    if (!multifd_send(&device_state_send)) {
+        multifd_send_data_clear(device_state_send);
+        return false;
+    }
+
+    return true;
+}
diff --git a/migration/multifd-nocomp.c b/migration/multifd-nocomp.c
index e46e79d8b272..c00804652383 100644
--- a/migration/multifd-nocomp.c
+++ b/migration/multifd-nocomp.c
@@ -14,6 +14,7 @@
 #include "exec/ramblock.h"
 #include "exec/target_page.h"
 #include "file.h"
+#include "migration-stats.h"
 #include "multifd.h"
 #include "options.h"
 #include "qapi/error.h"
@@ -85,6 +86,13 @@ static void multifd_nocomp_send_cleanup(MultiFDSendParams *p, Error **errp)
     return;
 }
 
+static void multifd_ram_prepare_header(MultiFDSendParams *p)
+{
+    p->iov[0].iov_len = p->packet_len;
+    p->iov[0].iov_base = p->packet;
+    p->iovs_num++;
+}
+
 static void multifd_send_prepare_iovs(MultiFDSendParams *p)
 {
     MultiFDPages_t *pages = &p->data->u.ram;
@@ -118,7 +126,7 @@ static int multifd_nocomp_send_prepare(MultiFDSendParams *p, Error **errp)
          * Only !zerocopy needs the header in IOV; zerocopy will
          * send it separately.
          */
-        multifd_send_prepare_header(p);
+        multifd_ram_prepare_header(p);
     }
 
     multifd_send_prepare_iovs(p);
@@ -133,6 +141,8 @@ static int multifd_nocomp_send_prepare(MultiFDSendParams *p, Error **errp)
         if (ret != 0) {
             return -1;
         }
+
+        stat64_add(&mig_stats.multifd_bytes, p->packet_len);
     }
 
     return 0;
@@ -431,7 +441,7 @@ int multifd_ram_flush_and_sync(QEMUFile *f)
 bool multifd_send_prepare_common(MultiFDSendParams *p)
 {
     MultiFDPages_t *pages = &p->data->u.ram;
-    multifd_send_prepare_header(p);
+    multifd_ram_prepare_header(p);
     multifd_send_zero_page_detect(p);
 
     if (!pages->normal_num) {
diff --git a/migration/multifd.c b/migration/multifd.c
index 3353a5da7593..61b061a33d35 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -12,6 +12,7 @@
 
 #include "qemu/osdep.h"
 #include "qemu/cutils.h"
+#include "qemu/iov.h"
 #include "qemu/rcu.h"
 #include "exec/target_page.h"
 #include "system/system.h"
@@ -19,6 +20,7 @@
 #include "qemu/error-report.h"
 #include "qapi/error.h"
 #include "file.h"
+#include "migration/misc.h"
 #include "migration.h"
 #include "migration-stats.h"
 #include "savevm.h"
@@ -111,7 +113,9 @@ MultiFDSendData *multifd_send_data_alloc(void)
      * added to the union in the future are larger than
      * (MultiFDPages_t + flex array).
      */
-    max_payload_size = MAX(multifd_ram_payload_size(), sizeof(MultiFDPayload));
+    max_payload_size = MAX(multifd_ram_payload_size(),
+                           multifd_device_state_payload_size());
+    max_payload_size = MAX(max_payload_size, sizeof(MultiFDPayload));
 
     /*
      * Account for any holes the compiler might insert. We can't pack
@@ -130,6 +134,9 @@ void multifd_send_data_clear(MultiFDSendData *data)
     }
 
     switch (data->type) {
+    case MULTIFD_PAYLOAD_DEVICE_STATE:
+        multifd_send_data_clear_device_state(&data->u.device_state);
+        break;
     default:
         /* Nothing to do */
         break;
@@ -232,6 +239,7 @@ static int multifd_recv_initial_packet(QIOChannel *c, Error **errp)
     return msg.id;
 }
 
+/* Fills a RAM multifd packet */
 void multifd_send_fill_packet(MultiFDSendParams *p)
 {
     MultiFDPacket_t *packet = p->packet;
@@ -524,6 +532,7 @@ static bool multifd_send_cleanup_channel(MultiFDSendParams *p, Error **errp)
     p->name = NULL;
     g_clear_pointer(&p->data, multifd_send_data_free);
     p->packet_len = 0;
+    g_clear_pointer(&p->packet_device_state, g_free);
     g_free(p->packet);
     p->packet = NULL;
     multifd_send_state->ops->send_cleanup(p, errp);
@@ -536,6 +545,7 @@ static void multifd_send_cleanup_state(void)
 {
     file_cleanup_outgoing_migration();
     socket_cleanup_outgoing_migration();
+    multifd_device_state_send_cleanup();
     qemu_sem_destroy(&multifd_send_state->channels_created);
     qemu_sem_destroy(&multifd_send_state->channels_ready);
     qemu_mutex_destroy(&multifd_send_state->multifd_send_mutex);
@@ -664,16 +674,32 @@ static void *multifd_send_thread(void *opaque)
          * qatomic_store_release() in multifd_send().
          */
         if (qatomic_load_acquire(&p->pending_job)) {
+            bool is_device_state = multifd_payload_device_state(p->data);
+            size_t total_size;
+
             p->flags = 0;
             p->iovs_num = 0;
             assert(!multifd_payload_empty(p->data));
 
-            ret = multifd_send_state->ops->send_prepare(p, &local_err);
-            if (ret != 0) {
-                break;
+            if (is_device_state) {
+                multifd_device_state_send_prepare(p);
+            } else {
+                ret = multifd_send_state->ops->send_prepare(p, &local_err);
+                if (ret != 0) {
+                    break;
+                }
             }
 
+            /*
+             * The packet header in the zerocopy RAM case is accounted for
+             * in multifd_nocomp_send_prepare() - where it is actually
+             * being sent.
+             */
+            total_size = iov_size(p->iov, p->iovs_num);
+
             if (migrate_mapped_ram()) {
+                assert(!is_device_state);
+
                 ret = file_write_ramblock_iov(p->c, p->iov, p->iovs_num,
                                               &p->data->u.ram, &local_err);
             } else {
@@ -686,8 +712,7 @@ static void *multifd_send_thread(void *opaque)
                 break;
             }
 
-            stat64_add(&mig_stats.multifd_bytes,
-                       (uint64_t)p->next_packet_size + p->packet_len);
+            stat64_add(&mig_stats.multifd_bytes, total_size);
 
             p->next_packet_size = 0;
             multifd_send_data_clear(p->data);
@@ -908,6 +933,9 @@ bool multifd_send_setup(void)
             p->packet_len = sizeof(MultiFDPacket_t)
                           + sizeof(uint64_t) * page_count;
             p->packet = g_malloc0(p->packet_len);
+            p->packet_device_state = g_malloc0(sizeof(*p->packet_device_state));
+            p->packet_device_state->hdr.magic = cpu_to_be32(MULTIFD_MAGIC);
+            p->packet_device_state->hdr.version = cpu_to_be32(MULTIFD_VERSION);
         }
         p->name = g_strdup_printf(MIGRATION_THREAD_SRC_MULTIFD, i);
         p->write_flags = 0;
@@ -943,6 +971,8 @@ bool multifd_send_setup(void)
         assert(p->iov);
     }
 
+    multifd_device_state_send_setup();
+
     return true;
 
 err:
diff --git a/migration/multifd.h b/migration/multifd.h
index 436598999e6f..ddc617db9acb 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -137,10 +137,12 @@ typedef struct {
 typedef enum {
     MULTIFD_PAYLOAD_NONE,
     MULTIFD_PAYLOAD_RAM,
+    MULTIFD_PAYLOAD_DEVICE_STATE,
 } MultiFDPayloadType;
 
 typedef union MultiFDPayload {
     MultiFDPages_t ram;
+    MultiFDDeviceState_t device_state;
 } MultiFDPayload;
 
 struct MultiFDSendData {
@@ -153,6 +155,11 @@ static inline bool multifd_payload_empty(MultiFDSendData *data)
     return data->type == MULTIFD_PAYLOAD_NONE;
 }
 
+static inline bool multifd_payload_device_state(MultiFDSendData *data)
+{
+    return data->type == MULTIFD_PAYLOAD_DEVICE_STATE;
+}
+
 static inline void multifd_set_payload_type(MultiFDSendData *data,
                                             MultiFDPayloadType type)
 {
@@ -205,8 +212,9 @@ typedef struct {
 
     /* thread local variables. No locking required */
 
-    /* pointer to the packet */
+    /* pointers to the possible packet types */
     MultiFDPacket_t *packet;
+    MultiFDPacketDeviceState_t *packet_device_state;
     /* size of the next packet that contains pages */
     uint32_t next_packet_size;
     /* packets sent through this channel */
@@ -363,13 +371,6 @@ bool multifd_send_prepare_common(MultiFDSendParams *p);
 void multifd_send_zero_page_detect(MultiFDSendParams *p);
 void multifd_recv_zero_page_process(MultiFDRecvParams *p);
 
-static inline void multifd_send_prepare_header(MultiFDSendParams *p)
-{
-    p->iov[0].iov_len = p->packet_len;
-    p->iov[0].iov_base = p->packet;
-    p->iovs_num++;
-}
-
 void multifd_channel_connect(MultiFDSendParams *p, QIOChannel *ioc);
 bool multifd_send(MultiFDSendData **send_data);
 MultiFDSendData *multifd_send_data_alloc(void);
@@ -394,4 +395,14 @@ bool multifd_ram_sync_per_section(void);
 size_t multifd_ram_payload_size(void);
 void multifd_ram_fill_packet(MultiFDSendParams *p);
 int multifd_ram_unfill_packet(MultiFDRecvParams *p, Error **errp);
+
+size_t multifd_device_state_payload_size(void);
+
+void multifd_send_data_clear_device_state(MultiFDDeviceState_t *device_state);
+
+void multifd_device_state_send_setup(void);
+void multifd_device_state_send_cleanup(void);
+
+void multifd_device_state_send_prepare(MultiFDSendParams *p);
+
 #endif



* [PATCH v4 17/33] migration/multifd: Make MultiFDSendData a struct
  2025-01-30 10:08 [PATCH v4 00/33] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
                   ` (15 preceding siblings ...)
  2025-01-30 10:08 ` [PATCH v4 16/33] migration/multifd: Device state transfer support - send side Maciej S. Szmigiero
@ 2025-01-30 10:08 ` Maciej S. Szmigiero
  2025-02-07 14:36   ` Fabiano Rosas
  2025-01-30 10:08 ` [PATCH v4 18/33] migration/multifd: Add multifd_device_state_supported() Maciej S. Szmigiero
                   ` (17 subsequent siblings)
  34 siblings, 1 reply; 137+ messages in thread
From: Maciej S. Szmigiero @ 2025-01-30 10:08 UTC (permalink / raw)
  To: Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

From: Peter Xu <peterx@redhat.com>

The newly introduced device state buffer can be used for storing VFIO's
read() raw data, but it is also possible to use it for generic device
state.  Since device states may not easily provide a maximum buffer size
(and the RAM MultiFDPages_t also wants flexibility in managing its
offset[] array), sticking with a union on MultiFDSendData is not a good
idea, as a union won't play well with such flexibility.

Switch MultiFDSendData to a struct.

It won't consume much more space in reality: the real buffers were
already dynamically allocated, so only the two structs (pages,
device_state) get duplicated, and they're small.

With this, we can remove the hard-to-understand allocation size logic,
because now we can allocate offset[] together with the SendData and
properly free it when the SendData is freed.
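The gist of the change can be sketched like this (a heavily simplified,
hypothetical stand-in for the real QEMU types, not the actual patch
code):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint64_t ram_addr_t;

typedef struct {
    uint32_t num;
    ram_addr_t *offset;     /* was a flexible array member, now a pointer */
} MultiFDPages_t;

typedef struct {
    char *idstr;
    uint8_t *buf;
    size_t buf_len;
} MultiFDDeviceState_t;

/*
 * union -> struct: both payloads coexist, so no oversized-allocation
 * trickery is needed to fit whichever member happens to be largest.
 */
typedef struct {
    int type;
    struct {
        MultiFDPages_t ram;
        MultiFDDeviceState_t device_state;
    } u;
} MultiFDSendData;

enum { PAGE_COUNT = 128 };  /* stands in for multifd_ram_page_count() */

static MultiFDSendData *send_data_alloc(void)
{
    MultiFDSendData *new_data = calloc(1, sizeof(*new_data));

    /* offset[] is now allocated together with the SendData... */
    new_data->u.ram.offset = calloc(PAGE_COUNT, sizeof(ram_addr_t));
    return new_data;
}

static void send_data_free(MultiFDSendData *data)
{
    free(data->u.device_state.idstr);
    free(data->u.device_state.buf);
    /* ...and freed together with it, no flex-array sizing involved. */
    free(data->u.ram.offset);
    free(data);
}
```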

Signed-off-by: Peter Xu <peterx@redhat.com>
[MSS: Make sure to clear possible device state payload before freeing
MultiFDSendData, remove placeholders for other patches not included]
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
 migration/multifd-device-state.c |  5 -----
 migration/multifd-nocomp.c       | 13 ++++++-------
 migration/multifd.c              | 25 +++++++------------------
 migration/multifd.h              | 15 +++++++++------
 4 files changed, 22 insertions(+), 36 deletions(-)

diff --git a/migration/multifd-device-state.c b/migration/multifd-device-state.c
index 2207bea9bf8a..d1674b432ff2 100644
--- a/migration/multifd-device-state.c
+++ b/migration/multifd-device-state.c
@@ -16,11 +16,6 @@ static QemuMutex queue_job_mutex;
 
 static MultiFDSendData *device_state_send;
 
-size_t multifd_device_state_payload_size(void)
-{
-    return sizeof(MultiFDDeviceState_t);
-}
-
 void multifd_device_state_send_setup(void)
 {
     qemu_mutex_init(&queue_job_mutex);
diff --git a/migration/multifd-nocomp.c b/migration/multifd-nocomp.c
index c00804652383..ffe75256c9fb 100644
--- a/migration/multifd-nocomp.c
+++ b/migration/multifd-nocomp.c
@@ -25,15 +25,14 @@
 
 static MultiFDSendData *multifd_ram_send;
 
-size_t multifd_ram_payload_size(void)
+void multifd_ram_payload_alloc(MultiFDPages_t *pages)
 {
-    uint32_t n = multifd_ram_page_count();
+    pages->offset = g_new0(ram_addr_t, multifd_ram_page_count());
+}
 
-    /*
-     * We keep an array of page offsets at the end of MultiFDPages_t,
-     * add space for it in the allocation.
-     */
-    return sizeof(MultiFDPages_t) + n * sizeof(ram_addr_t);
+void multifd_ram_payload_free(MultiFDPages_t *pages)
+{
+    g_clear_pointer(&pages->offset, g_free);
 }
 
 void multifd_ram_save_setup(void)
diff --git a/migration/multifd.c b/migration/multifd.c
index 61b061a33d35..0b61b8192231 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -105,26 +105,12 @@ struct {
 
 MultiFDSendData *multifd_send_data_alloc(void)
 {
-    size_t max_payload_size, size_minus_payload;
+    MultiFDSendData *new = g_new0(MultiFDSendData, 1);
 
-    /*
-     * MultiFDPages_t has a flexible array at the end, account for it
-     * when allocating MultiFDSendData. Use max() in case other types
-     * added to the union in the future are larger than
-     * (MultiFDPages_t + flex array).
-     */
-    max_payload_size = MAX(multifd_ram_payload_size(),
-                           multifd_device_state_payload_size());
-    max_payload_size = MAX(max_payload_size, sizeof(MultiFDPayload));
-
-    /*
-     * Account for any holes the compiler might insert. We can't pack
-     * the structure because that misaligns the members and triggers
-     * Waddress-of-packed-member.
-     */
-    size_minus_payload = sizeof(MultiFDSendData) - sizeof(MultiFDPayload);
+    multifd_ram_payload_alloc(&new->u.ram);
+    /* Device state allocates its payload on-demand */
 
-    return g_malloc0(size_minus_payload + max_payload_size);
+    return new;
 }
 
 void multifd_send_data_clear(MultiFDSendData *data)
@@ -151,8 +137,11 @@ void multifd_send_data_free(MultiFDSendData *data)
         return;
     }
 
+    /* This also frees the device state payload */
     multifd_send_data_clear(data);
 
+    multifd_ram_payload_free(&data->u.ram);
+
     g_free(data);
 }
 
diff --git a/migration/multifd.h b/migration/multifd.h
index ddc617db9acb..f7811cc0d0cb 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -115,9 +115,13 @@ typedef struct {
     uint32_t num;
     /* number of normal pages */
     uint32_t normal_num;
+    /*
+     * Pointer to the ramblock.  NOTE: it's caller's responsibility to make
+     * sure the pointer is always valid!
+     */
     RAMBlock *block;
-    /* offset of each page */
-    ram_addr_t offset[];
+    /* offset array of each page, managed by multifd */
+    ram_addr_t *offset;
 } MultiFDPages_t;
 
 struct MultiFDRecvData {
@@ -140,7 +144,7 @@ typedef enum {
     MULTIFD_PAYLOAD_DEVICE_STATE,
 } MultiFDPayloadType;
 
-typedef union MultiFDPayload {
+typedef struct MultiFDPayload {
     MultiFDPages_t ram;
     MultiFDDeviceState_t device_state;
 } MultiFDPayload;
@@ -392,12 +396,11 @@ void multifd_ram_save_cleanup(void);
 int multifd_ram_flush_and_sync(QEMUFile *f);
 bool multifd_ram_sync_per_round(void);
 bool multifd_ram_sync_per_section(void);
-size_t multifd_ram_payload_size(void);
+void multifd_ram_payload_alloc(MultiFDPages_t *pages);
+void multifd_ram_payload_free(MultiFDPages_t *pages);
 void multifd_ram_fill_packet(MultiFDSendParams *p);
 int multifd_ram_unfill_packet(MultiFDRecvParams *p, Error **errp);
 
-size_t multifd_device_state_payload_size(void);
-
 void multifd_send_data_clear_device_state(MultiFDDeviceState_t *device_state);
 
 void multifd_device_state_send_setup(void);



* [PATCH v4 18/33] migration/multifd: Add multifd_device_state_supported()
  2025-01-30 10:08 [PATCH v4 00/33] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
                   ` (16 preceding siblings ...)
  2025-01-30 10:08 ` [PATCH v4 17/33] migration/multifd: Make MultiFDSendData a struct Maciej S. Szmigiero
@ 2025-01-30 10:08 ` Maciej S. Szmigiero
  2025-01-30 10:08 ` [PATCH v4 19/33] migration: Add save_live_complete_precopy_thread handler Maciej S. Szmigiero
                   ` (16 subsequent siblings)
  34 siblings, 0 replies; 137+ messages in thread
From: Maciej S. Szmigiero @ 2025-01-30 10:08 UTC (permalink / raw)
  To: Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

Device state transfer via multifd channels requires multifd channels
with packets and is currently not compatible with multifd compression,
so add an appropriate query function letting a device learn whether it
can actually make use of this feature.

Reviewed-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
 include/migration/misc.h         | 1 +
 migration/multifd-device-state.c | 7 +++++++
 2 files changed, 8 insertions(+)

diff --git a/include/migration/misc.h b/include/migration/misc.h
index 885022d21a0c..cc987e6e97af 100644
--- a/include/migration/misc.h
+++ b/include/migration/misc.h
@@ -114,5 +114,6 @@ bool migration_block_inactivate(void);
 /* migration/multifd-device-state.c */
 bool multifd_queue_device_state(char *idstr, uint32_t instance_id,
                                 char *data, size_t len);
+bool multifd_device_state_supported(void);
 
 #endif
diff --git a/migration/multifd-device-state.c b/migration/multifd-device-state.c
index d1674b432ff2..cee3c44bcf2a 100644
--- a/migration/multifd-device-state.c
+++ b/migration/multifd-device-state.c
@@ -11,6 +11,7 @@
 #include "qemu/lockable.h"
 #include "migration/misc.h"
 #include "multifd.h"
+#include "options.h"
 
 static QemuMutex queue_job_mutex;
 
@@ -100,3 +101,9 @@ bool multifd_queue_device_state(char *idstr, uint32_t instance_id,
 
     return true;
 }
+
+bool multifd_device_state_supported(void)
+{
+    return migrate_multifd() && !migrate_mapped_ram() &&
+        migrate_multifd_compression() == MULTIFD_COMPRESSION_NONE;
+}



* [PATCH v4 19/33] migration: Add save_live_complete_precopy_thread handler
  2025-01-30 10:08 [PATCH v4 00/33] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
                   ` (17 preceding siblings ...)
  2025-01-30 10:08 ` [PATCH v4 18/33] migration/multifd: Add multifd_device_state_supported() Maciej S. Szmigiero
@ 2025-01-30 10:08 ` Maciej S. Szmigiero
  2025-02-04 17:54   ` Peter Xu
  2025-01-30 10:08 ` [PATCH v4 20/33] vfio/migration: Add x-migration-load-config-after-iter VFIO property Maciej S. Szmigiero
                   ` (15 subsequent siblings)
  34 siblings, 1 reply; 137+ messages in thread
From: Maciej S. Szmigiero @ 2025-01-30 10:08 UTC (permalink / raw)
  To: Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

This SaveVMHandler lets a device provide its own asynchronous
transmission of the remaining data at the end of a precopy phase via
multifd channels, in parallel with the transfer done by the
save_live_complete_precopy handlers.

These threads are launched only when multifd device state transfer is
supported.

Management of these threads is done in the multifd migration code,
which wraps them in the generic thread pool.
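The error-reporting convention used by these worker threads — only the
first non-zero handler result is recorded, later errors are dropped —
can be illustrated roughly as below.  This is a simplified,
single-threaded stand-in whose names merely mirror the patch; it is not
the actual pool code:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Simplified stand-ins for send_threads_ret / send_threads_abort. */
static atomic_int send_threads_ret;
static atomic_bool send_threads_abort;

/*
 * What each pool worker does with its handler's return value: record
 * only the first error seen.  The read-then-set is racy between
 * threads, but that's fine - which thread's error gets reported is
 * arbitrary anyway.
 */
static void record_result(int ret)
{
    if (ret && !atomic_load(&send_threads_ret)) {
        atomic_store(&send_threads_ret, ret);
    }
}

/* The abort path just raises a flag that handlers poll via abort_flag. */
static void abort_save_threads(void)
{
    atomic_store(&send_threads_abort, true);
}
```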

Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
 include/migration/misc.h         |  8 +++
 include/migration/register.h     | 21 ++++++++
 include/qemu/typedefs.h          |  4 ++
 migration/multifd-device-state.c | 83 ++++++++++++++++++++++++++++++++
 migration/savevm.c               | 37 +++++++++++++-
 5 files changed, 152 insertions(+), 1 deletion(-)

diff --git a/include/migration/misc.h b/include/migration/misc.h
index cc987e6e97af..008d22df8e72 100644
--- a/include/migration/misc.h
+++ b/include/migration/misc.h
@@ -116,4 +116,12 @@ bool multifd_queue_device_state(char *idstr, uint32_t instance_id,
                                 char *data, size_t len);
 bool multifd_device_state_supported(void);
 
+void
+multifd_spawn_device_state_save_thread(SaveLiveCompletePrecopyThreadHandler hdlr,
+                                       char *idstr, uint32_t instance_id,
+                                       void *opaque);
+
+void multifd_abort_device_state_save_threads(void);
+int multifd_join_device_state_save_threads(void);
+
 #endif
diff --git a/include/migration/register.h b/include/migration/register.h
index 58891aa54b76..f63b3ca3fd44 100644
--- a/include/migration/register.h
+++ b/include/migration/register.h
@@ -105,6 +105,27 @@ typedef struct SaveVMHandlers {
      */
     int (*save_live_complete_precopy)(QEMUFile *f, void *opaque);
 
+    /**
+     * @save_live_complete_precopy_thread (invoked in a separate thread)
+     *
+     * Called at the end of a precopy phase from a separate worker thread
+     * in configurations where multifd device state transfer is supported
+     * in order to perform asynchronous transmission of the remaining data in
+     * parallel with @save_live_complete_precopy handlers.
+     * When postcopy is enabled, devices that support postcopy will skip this
+     * step.
+     *
+     * @idstr: this device section idstr
+     * @instance_id: this device section instance_id
+     * @abort_flag: flag indicating that the migration core wants to abort
+     * the transmission and so the handler should exit ASAP. To be read by
+     * qatomic_read() or similar.
+     * @opaque: data pointer passed to register_savevm_live()
+     *
+     * Returns zero to indicate success and negative for error
+     */
+    SaveLiveCompletePrecopyThreadHandler save_live_complete_precopy_thread;
+
     /* This runs both outside and inside the BQL.  */
 
     /**
diff --git a/include/qemu/typedefs.h b/include/qemu/typedefs.h
index fd23ff7771b1..76578dee8fd3 100644
--- a/include/qemu/typedefs.h
+++ b/include/qemu/typedefs.h
@@ -133,5 +133,9 @@ typedef struct IRQState *qemu_irq;
 typedef void (*qemu_irq_handler)(void *opaque, int n, int level);
 typedef bool (*MigrationLoadThread)(void *opaque, bool *should_quit,
                                     Error **errp);
+typedef int (*SaveLiveCompletePrecopyThreadHandler)(char *idstr,
+                                                    uint32_t instance_id,
+                                                    bool *abort_flag,
+                                                    void *opaque);
 
 #endif /* QEMU_TYPEDEFS_H */
diff --git a/migration/multifd-device-state.c b/migration/multifd-device-state.c
index cee3c44bcf2a..fb09e6a48df9 100644
--- a/migration/multifd-device-state.c
+++ b/migration/multifd-device-state.c
@@ -9,12 +9,17 @@
 
 #include "qemu/osdep.h"
 #include "qemu/lockable.h"
+#include "block/thread-pool.h"
 #include "migration/misc.h"
 #include "multifd.h"
 #include "options.h"
 
 static QemuMutex queue_job_mutex;
 
+static ThreadPool *send_threads;
+static int send_threads_ret;
+static bool send_threads_abort;
+
 static MultiFDSendData *device_state_send;
 
 void multifd_device_state_send_setup(void)
@@ -23,10 +28,16 @@ void multifd_device_state_send_setup(void)
 
     assert(!device_state_send);
     device_state_send = multifd_send_data_alloc();
+
+    assert(!send_threads);
+    send_threads = thread_pool_new();
+    send_threads_ret = 0;
+    send_threads_abort = false;
 }
 
 void multifd_device_state_send_cleanup(void)
 {
+    g_clear_pointer(&send_threads, thread_pool_free);
     g_clear_pointer(&device_state_send, multifd_send_data_free);
 
     qemu_mutex_destroy(&queue_job_mutex);
@@ -107,3 +118,75 @@ bool multifd_device_state_supported(void)
     return migrate_multifd() && !migrate_mapped_ram() &&
         migrate_multifd_compression() == MULTIFD_COMPRESSION_NONE;
 }
+
+struct MultiFDDSSaveThreadData {
+    SaveLiveCompletePrecopyThreadHandler hdlr;
+    char *idstr;
+    uint32_t instance_id;
+    void *handler_opaque;
+};
+
+static void multifd_device_state_save_thread_data_free(void *opaque)
+{
+    struct MultiFDDSSaveThreadData *data = opaque;
+
+    g_clear_pointer(&data->idstr, g_free);
+    g_free(data);
+}
+
+static int multifd_device_state_save_thread(void *opaque)
+{
+    struct MultiFDDSSaveThreadData *data = opaque;
+    int ret;
+
+    ret = data->hdlr(data->idstr, data->instance_id, &send_threads_abort,
+                     data->handler_opaque);
+    if (ret && !qatomic_read(&send_threads_ret)) {
+        /*
+         * Racy with the above read but that's okay - which thread error
+         * return we report is purely arbitrary anyway.
+         */
+        qatomic_set(&send_threads_ret, ret);
+    }
+
+    return 0;
+}
+
+void
+multifd_spawn_device_state_save_thread(SaveLiveCompletePrecopyThreadHandler hdlr,
+                                       char *idstr, uint32_t instance_id,
+                                       void *opaque)
+{
+    struct MultiFDDSSaveThreadData *data;
+
+    assert(multifd_device_state_supported());
+
+    assert(!qatomic_read(&send_threads_abort));
+
+    data = g_new(struct MultiFDDSSaveThreadData, 1);
+    data->hdlr = hdlr;
+    data->idstr = g_strdup(idstr);
+    data->instance_id = instance_id;
+    data->handler_opaque = opaque;
+
+    thread_pool_submit_immediate(send_threads,
+                                 multifd_device_state_save_thread,
+                                 data,
+                                 multifd_device_state_save_thread_data_free);
+}
+
+void multifd_abort_device_state_save_threads(void)
+{
+    assert(multifd_device_state_supported());
+
+    qatomic_set(&send_threads_abort, true);
+}
+
+int multifd_join_device_state_save_threads(void)
+{
+    assert(multifd_device_state_supported());
+
+    thread_pool_wait(send_threads);
+
+    return send_threads_ret;
+}
diff --git a/migration/savevm.c b/migration/savevm.c
index 74d1960de3c6..e47c6c92fe50 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -37,6 +37,7 @@
 #include "migration/register.h"
 #include "migration/global_state.h"
 #include "migration/channel-block.h"
+#include "multifd.h"
 #include "ram.h"
 #include "qemu-file.h"
 #include "savevm.h"
@@ -1521,6 +1522,24 @@ int qemu_savevm_state_complete_precopy_iterable(QEMUFile *f, bool in_postcopy)
     int64_t start_ts_each, end_ts_each;
     SaveStateEntry *se;
     int ret;
+    bool multifd_device_state = multifd_device_state_supported();
+
+    if (multifd_device_state) {
+        QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+            SaveLiveCompletePrecopyThreadHandler hdlr;
+
+            if (!se->ops || (in_postcopy && se->ops->has_postcopy &&
+                             se->ops->has_postcopy(se->opaque)) ||
+                !se->ops->save_live_complete_precopy_thread) {
+                continue;
+            }
+
+            hdlr = se->ops->save_live_complete_precopy_thread;
+            multifd_spawn_device_state_save_thread(hdlr,
+                                                   se->idstr, se->instance_id,
+                                                   se->opaque);
+        }
+    }
 
     QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
         if (!se->ops ||
@@ -1546,16 +1565,32 @@ int qemu_savevm_state_complete_precopy_iterable(QEMUFile *f, bool in_postcopy)
         save_section_footer(f, se);
         if (ret < 0) {
             qemu_file_set_error(f, ret);
-            return -1;
+            goto ret_fail_abort_threads;
         }
         end_ts_each = qemu_clock_get_us(QEMU_CLOCK_REALTIME);
         trace_vmstate_downtime_save("iterable", se->idstr, se->instance_id,
                                     end_ts_each - start_ts_each);
     }
 
+    if (multifd_device_state) {
+        ret = multifd_join_device_state_save_threads();
+        if (ret) {
+            qemu_file_set_error(f, ret);
+            return -1;
+        }
+    }
+
     trace_vmstate_downtime_checkpoint("src-iterable-saved");
 
     return 0;
+
+ret_fail_abort_threads:
+    if (multifd_device_state) {
+        multifd_abort_device_state_save_threads();
+        multifd_join_device_state_save_threads();
+    }
+
+    return -1;
 }
 
 int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,



* [PATCH v4 20/33] vfio/migration: Add x-migration-load-config-after-iter VFIO property
  2025-01-30 10:08 [PATCH v4 00/33] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
                   ` (18 preceding siblings ...)
  2025-01-30 10:08 ` [PATCH v4 19/33] migration: Add save_live_complete_precopy_thread handler Maciej S. Szmigiero
@ 2025-01-30 10:08 ` Maciej S. Szmigiero
  2025-02-10 17:24   ` Cédric Le Goater
  2025-01-30 10:08 ` [PATCH v4 21/33] vfio/migration: Add load_device_config_state_start trace event Maciej S. Szmigiero
                   ` (14 subsequent siblings)
  34 siblings, 1 reply; 137+ messages in thread
From: Maciej S. Szmigiero @ 2025-01-30 10:08 UTC (permalink / raw)
  To: Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

This property allows configuring whether to start the config load only
after all iterable sections have been loaded.
Such interlocking is required on ARM64 due to this platform's VFIO
dependency on the interrupt controller being loaded first.

The property defaults to AUTO, which means ON for ARM, OFF for other
platforms.

Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
 hw/vfio/migration.c           | 25 +++++++++++++++++++++++++
 hw/vfio/pci.c                 |  3 +++
 include/hw/vfio/vfio-common.h |  1 +
 3 files changed, 29 insertions(+)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index adfa752db527..d801c861d202 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -254,6 +254,31 @@ static int vfio_load_buffer(QEMUFile *f, VFIODevice *vbasedev,
     return ret;
 }
 
+static bool vfio_load_config_after_iter(VFIODevice *vbasedev)
+{
+    if (vbasedev->migration_load_config_after_iter == ON_OFF_AUTO_ON) {
+        return true;
+    } else if (vbasedev->migration_load_config_after_iter == ON_OFF_AUTO_OFF) {
+        return false;
+    }
+
+    assert(vbasedev->migration_load_config_after_iter == ON_OFF_AUTO_AUTO);
+
+    /*
+     * Starting the config load only after all iterables were loaded is required
+     * for ARM64 due to this platform's VFIO dependency on the interrupt
+     * controller being loaded first.
+     *
+     * See commit d329f5032e17 ("vfio: Move the saving of the config space to
+     * the right place in VFIO migration").
+     */
+#if defined(TARGET_ARM)
+    return true;
+#else
+    return false;
+#endif
+}
+
 static int vfio_save_device_config_state(QEMUFile *f, void *opaque,
                                          Error **errp)
 {
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index ab17a98ee5b6..83090c544d95 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -3377,6 +3377,9 @@ static const Property vfio_pci_dev_properties[] = {
                     VFIO_FEATURE_ENABLE_IGD_OPREGION_BIT, false),
     DEFINE_PROP_ON_OFF_AUTO("enable-migration", VFIOPCIDevice,
                             vbasedev.enable_migration, ON_OFF_AUTO_AUTO),
+    DEFINE_PROP_ON_OFF_AUTO("x-migration-load-config-after-iter", VFIOPCIDevice,
+                            vbasedev.migration_load_config_after_iter,
+                            ON_OFF_AUTO_AUTO),
     DEFINE_PROP_BOOL("migration-events", VFIOPCIDevice,
                      vbasedev.migration_events, false),
     DEFINE_PROP_BOOL("x-no-mmap", VFIOPCIDevice, vbasedev.no_mmap, false),
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 0c60be5b15c7..153d03745dc7 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -133,6 +133,7 @@ typedef struct VFIODevice {
     bool no_mmap;
     bool ram_block_discard_allowed;
     OnOffAuto enable_migration;
+    OnOffAuto migration_load_config_after_iter;
     bool migration_events;
     VFIODeviceOps *ops;
     unsigned int num_irqs;



* [PATCH v4 21/33] vfio/migration: Add load_device_config_state_start trace event
  2025-01-30 10:08 [PATCH v4 00/33] Multifd 🔀 device state transfer support with VFIO consumer Maciej S. Szmigiero
                   ` (19 preceding siblings ...)
  2025-01-30 10:08 ` [PATCH v4 20/33] vfio/migration: Add x-migration-load-config-after-iter VFIO property Maciej S. Szmigiero
@ 2025-01-30 10:08 ` Maciej S. Szmigiero
  2025-01-30 10:08 ` [PATCH v4 22/33] vfio/migration: Convert bytes_transferred counter to atomic Maciej S. Szmigiero
                   ` (13 subsequent siblings)
  34 siblings, 0 replies; 137+ messages in thread
From: Maciej S. Szmigiero @ 2025-01-30 10:08 UTC (permalink / raw)
  To: Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

Also rename the existing load_device_config_state trace event to
load_device_config_state_end for consistency, since it is triggered at the
end of loading the VFIO device config state.

This way both the start and end points of a particular device config
loading operation (a long, BQL-serialized operation) are known.

Reviewed-by: Cédric Le Goater <clg@redhat.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
 hw/vfio/migration.c  | 4 +++-
 hw/vfio/trace-events | 3 ++-
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index d801c861d202..f5df5ef17080 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -310,6 +310,8 @@ static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
     VFIODevice *vbasedev = opaque;
     uint64_t data;
 
+    trace_vfio_load_device_config_state_start(vbasedev->name);
+
     if (vbasedev->ops && vbasedev->ops->vfio_load_config) {
         int ret;
 
@@ -328,7 +330,7 @@ static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
         return -EINVAL;
     }
 
-    trace_vfio_load_device_config_state(vbasedev->name);
+    trace_vfio_load_device_config_state_end(vbasedev->name);
     return qemu_file_get_error(f);
 }
 
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index cab1cf1de0a2..1bebe9877d88 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -149,7 +149,8 @@ vfio_display_edid_write_error(void) ""
 
 # migration.c
 vfio_load_cleanup(const char *name) " (%s)"
-vfio_load_device_config_state(const char *name) " (%s)"
+vfio_load_device_config_state_start(const char *name) " (%s)"
+vfio_load_device_config_state_end(const char *name) " (%s)"
 vfio_load_state(const char *name, uint64_t data) " (%s) data 0x%"PRIx64
 vfio_load_state_device_data(const char *name, uint64_t data_size, int ret) " (%s) size %"PRIu64" ret %d"
 vfio_migration_realize(const char *name) " (%s)"



* [PATCH v4 22/33] vfio/migration: Convert bytes_transferred counter to atomic
  2025-01-30 10:08 [PATCH v4 00/33] Multifd 🔀 device state transfer support with VFIO consumer Maciej S. Szmigiero
                   ` (20 preceding siblings ...)
  2025-01-30 10:08 ` [PATCH v4 21/33] vfio/migration: Add load_device_config_state_start trace event Maciej S. Szmigiero
@ 2025-01-30 10:08 ` Maciej S. Szmigiero
  2025-01-30 21:35   ` Cédric Le Goater
  2025-01-30 10:08 ` [PATCH v4 23/33] vfio/migration: Multifd device state transfer support - basic types Maciej S. Szmigiero
                   ` (12 subsequent siblings)
  34 siblings, 1 reply; 137+ messages in thread
From: Maciej S. Szmigiero @ 2025-01-30 10:08 UTC (permalink / raw)
  To: Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

So it can be safely accessed from multiple threads.

Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
 hw/vfio/migration.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index f5df5ef17080..cbb1e0b6f852 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -416,7 +416,7 @@ static ssize_t vfio_save_block(QEMUFile *f, VFIOMigration *migration)
     qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
     qemu_put_be64(f, data_size);
     qemu_put_buffer(f, migration->data_buffer, data_size);
-    bytes_transferred += data_size;
+    qatomic_add(&bytes_transferred, data_size);
 
     trace_vfio_save_block(migration->vbasedev->name, data_size);
 
@@ -1038,12 +1038,12 @@ static int vfio_block_migration(VFIODevice *vbasedev, Error *err, Error **errp)
 
 int64_t vfio_mig_bytes_transferred(void)
 {
-    return bytes_transferred;
+    return qatomic_read(&bytes_transferred);
 }
 
 void vfio_reset_bytes_transferred(void)
 {
-    bytes_transferred = 0;
+    qatomic_set(&bytes_transferred, 0);
 }
 
 /*



* [PATCH v4 23/33] vfio/migration: Multifd device state transfer support - basic types
  2025-01-30 10:08 [PATCH v4 00/33] Multifd 🔀 device state transfer support with VFIO consumer Maciej S. Szmigiero
                   ` (21 preceding siblings ...)
  2025-01-30 10:08 ` [PATCH v4 22/33] vfio/migration: Convert bytes_transferred counter to atomic Maciej S. Szmigiero
@ 2025-01-30 10:08 ` Maciej S. Szmigiero
  2025-02-10 17:17   ` Cédric Le Goater
  2025-01-30 10:08 ` [PATCH v4 24/33] vfio/migration: Multifd device state transfer support - VFIOStateBuffer(s) Maciej S. Szmigiero
                   ` (11 subsequent siblings)
  34 siblings, 1 reply; 137+ messages in thread
From: Maciej S. Szmigiero @ 2025-01-30 10:08 UTC (permalink / raw)
  To: Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

Add basic types and flags used by VFIO multifd device state transfer
support.

Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
 hw/vfio/migration.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index cbb1e0b6f852..715182c4f810 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -47,6 +47,7 @@
 #define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xffffffffef100003ULL)
 #define VFIO_MIG_FLAG_DEV_DATA_STATE    (0xffffffffef100004ULL)
 #define VFIO_MIG_FLAG_DEV_INIT_DATA_SENT (0xffffffffef100005ULL)
+#define VFIO_MIG_FLAG_DEV_CONFIG_LOAD_READY (0xffffffffef100006ULL)
 
 /*
  * This is an arbitrary size based on migration of mlx5 devices, where typically
@@ -55,6 +56,15 @@
  */
 #define VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE (1 * MiB)
 
+#define VFIO_DEVICE_STATE_CONFIG_STATE (1)
+
+typedef struct VFIODeviceStatePacket {
+    uint32_t version;
+    uint32_t idx;
+    uint32_t flags;
+    uint8_t data[0];
+} QEMU_PACKED VFIODeviceStatePacket;
+
 static int64_t bytes_transferred;
 
 static const char *mig_state_to_str(enum vfio_device_mig_state state)



* [PATCH v4 24/33] vfio/migration: Multifd device state transfer support - VFIOStateBuffer(s)
  2025-01-30 10:08 [PATCH v4 00/33] Multifd 🔀 device state transfer support with VFIO consumer Maciej S. Szmigiero
                   ` (22 preceding siblings ...)
  2025-01-30 10:08 ` [PATCH v4 23/33] vfio/migration: Multifd device state transfer support - basic types Maciej S. Szmigiero
@ 2025-01-30 10:08 ` Maciej S. Szmigiero
  2025-01-30 10:08 ` [PATCH v4 25/33] vfio/migration: Multifd device state transfer - add support checking function Maciej S. Szmigiero
                   ` (10 subsequent siblings)
  34 siblings, 0 replies; 137+ messages in thread
From: Maciej S. Szmigiero @ 2025-01-30 10:08 UTC (permalink / raw)
  To: Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

Add VFIOStateBuffer(s) types and the associated methods.

These store received device state buffers and config state waiting to get
loaded into the device.

Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
 hw/vfio/migration.c | 54 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 54 insertions(+)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 715182c4f810..40cbe1be687d 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -289,6 +289,60 @@ static bool vfio_load_config_after_iter(VFIODevice *vbasedev)
 #endif
 }
 
+/* type safety */
+typedef struct VFIOStateBuffers {
+    GArray *array;
+} VFIOStateBuffers;
+
+typedef struct VFIOStateBuffer {
+    bool is_present;
+    char *data;
+    size_t len;
+} VFIOStateBuffer;
+
+static void vfio_state_buffer_clear(gpointer data)
+{
+    VFIOStateBuffer *lb = data;
+
+    if (!lb->is_present) {
+        return;
+    }
+
+    g_clear_pointer(&lb->data, g_free);
+    lb->is_present = false;
+}
+
+static void vfio_state_buffers_init(VFIOStateBuffers *bufs)
+{
+    bufs->array = g_array_new(FALSE, TRUE, sizeof(VFIOStateBuffer));
+    g_array_set_clear_func(bufs->array, vfio_state_buffer_clear);
+}
+
+static void vfio_state_buffers_destroy(VFIOStateBuffers *bufs)
+{
+    g_clear_pointer(&bufs->array, g_array_unref);
+}
+
+static void vfio_state_buffers_assert_init(VFIOStateBuffers *bufs)
+{
+    assert(bufs->array);
+}
+
+static guint vfio_state_buffers_size_get(VFIOStateBuffers *bufs)
+{
+    return bufs->array->len;
+}
+
+static void vfio_state_buffers_size_set(VFIOStateBuffers *bufs, guint size)
+{
+    g_array_set_size(bufs->array, size);
+}
+
+static VFIOStateBuffer *vfio_state_buffers_at(VFIOStateBuffers *bufs, guint idx)
+{
+    return &g_array_index(bufs->array, VFIOStateBuffer, idx);
+}
+
 static int vfio_save_device_config_state(QEMUFile *f, void *opaque,
                                          Error **errp)
 {



* [PATCH v4 25/33] vfio/migration: Multifd device state transfer - add support checking function
  2025-01-30 10:08 [PATCH v4 00/33] Multifd 🔀 device state transfer support with VFIO consumer Maciej S. Szmigiero
                   ` (23 preceding siblings ...)
  2025-01-30 10:08 ` [PATCH v4 24/33] vfio/migration: Multifd device state transfer support - VFIOStateBuffer(s) Maciej S. Szmigiero
@ 2025-01-30 10:08 ` Maciej S. Szmigiero
  2025-01-30 10:08 ` [PATCH v4 26/33] vfio/migration: Multifd device state transfer support - receive init/cleanup Maciej S. Szmigiero
                   ` (9 subsequent siblings)
  34 siblings, 0 replies; 137+ messages in thread
From: Maciej S. Szmigiero @ 2025-01-30 10:08 UTC (permalink / raw)
  To: Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

Add a vfio_multifd_transfer_supported() function that tells whether
multifd device state transfer is supported.

Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
 hw/vfio/migration.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 40cbe1be687d..3211041939c6 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -519,6 +519,12 @@ static bool vfio_precopy_supported(VFIODevice *vbasedev)
     return migration->mig_flags & VFIO_MIGRATION_PRE_COPY;
 }
 
+static bool vfio_multifd_transfer_supported(void)
+{
+    return multifd_device_state_supported() &&
+        migrate_send_switchover_start();
+}
+
 /* ---------------------------------------------------------------------- */
 
 static int vfio_save_prepare(void *opaque, Error **errp)



* [PATCH v4 26/33] vfio/migration: Multifd device state transfer support - receive init/cleanup
  2025-01-30 10:08 [PATCH v4 00/33] Multifd 🔀 device state transfer support with VFIO consumer Maciej S. Szmigiero
                   ` (24 preceding siblings ...)
  2025-01-30 10:08 ` [PATCH v4 25/33] vfio/migration: Multifd device state transfer - add support checking function Maciej S. Szmigiero
@ 2025-01-30 10:08 ` Maciej S. Szmigiero
  2025-02-12 10:55   ` Cédric Le Goater
  2025-01-30 10:08 ` [PATCH v4 27/33] vfio/migration: Multifd device state transfer support - received buffers queuing Maciej S. Szmigiero
                   ` (8 subsequent siblings)
  34 siblings, 1 reply; 137+ messages in thread
From: Maciej S. Szmigiero @ 2025-01-30 10:08 UTC (permalink / raw)
  To: Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

Add the VFIOMultifd data structure that will contain most of the
receive-side data, together with its init/cleanup methods.

Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
 hw/vfio/migration.c           | 52 +++++++++++++++++++++++++++++++++--
 include/hw/vfio/vfio-common.h |  5 ++++
 2 files changed, 55 insertions(+), 2 deletions(-)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 3211041939c6..bcdf204d5cf4 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -300,6 +300,9 @@ typedef struct VFIOStateBuffer {
     size_t len;
 } VFIOStateBuffer;
 
+typedef struct VFIOMultifd {
+} VFIOMultifd;
+
 static void vfio_state_buffer_clear(gpointer data)
 {
     VFIOStateBuffer *lb = data;
@@ -398,6 +401,18 @@ static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
     return qemu_file_get_error(f);
 }
 
+static VFIOMultifd *vfio_multifd_new(void)
+{
+    VFIOMultifd *multifd = g_new(VFIOMultifd, 1);
+
+    return multifd;
+}
+
+static void vfio_multifd_free(VFIOMultifd *multifd)
+{
+    g_free(multifd);
+}
+
 static void vfio_migration_cleanup(VFIODevice *vbasedev)
 {
     VFIOMigration *migration = vbasedev->migration;
@@ -785,14 +800,47 @@ static void vfio_save_state(QEMUFile *f, void *opaque)
 static int vfio_load_setup(QEMUFile *f, void *opaque, Error **errp)
 {
     VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+    int ret;
+
+    /*
+     * Make a copy of this setting at the start in case it is changed
+     * mid-migration.
+     */
+    if (vbasedev->migration_multifd_transfer == ON_OFF_AUTO_AUTO) {
+        migration->multifd_transfer = vfio_multifd_transfer_supported();
+    } else {
+        migration->multifd_transfer =
+            vbasedev->migration_multifd_transfer == ON_OFF_AUTO_ON;
+    }
+
+    if (migration->multifd_transfer && !vfio_multifd_transfer_supported()) {
+        error_setg(errp,
+                   "%s: Multifd device transfer requested but unsupported in the current config",
+                   vbasedev->name);
+        return -EINVAL;
+    }
+
+    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RESUMING,
+                                   migration->device_state, errp);
+    if (ret) {
+        return ret;
+    }
+
+    if (migration->multifd_transfer) {
+        assert(!migration->multifd);
+        migration->multifd = vfio_multifd_new();
+    }
 
-    return vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RESUMING,
-                                    vbasedev->migration->device_state, errp);
+    return 0;
 }
 
 static int vfio_load_cleanup(void *opaque)
 {
     VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+
+    g_clear_pointer(&migration->multifd, vfio_multifd_free);
 
     vfio_migration_cleanup(vbasedev);
     trace_vfio_load_cleanup(vbasedev->name);
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 153d03745dc7..c0c9c0b1b263 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -61,6 +61,8 @@ typedef struct VFIORegion {
     uint8_t nr; /* cache the region number for debug */
 } VFIORegion;
 
+typedef struct VFIOMultifd VFIOMultifd;
+
 typedef struct VFIOMigration {
     struct VFIODevice *vbasedev;
     VMChangeStateEntry *vm_state;
@@ -72,6 +74,8 @@ typedef struct VFIOMigration {
     uint64_t mig_flags;
     uint64_t precopy_init_size;
     uint64_t precopy_dirty_size;
+    bool multifd_transfer;
+    VFIOMultifd *multifd;
     bool initial_data_sent;
 
     bool event_save_iterate_started;
@@ -133,6 +137,7 @@ typedef struct VFIODevice {
     bool no_mmap;
     bool ram_block_discard_allowed;
     OnOffAuto enable_migration;
+    OnOffAuto migration_multifd_transfer;
     OnOffAuto migration_load_config_after_iter;
     bool migration_events;
     VFIODeviceOps *ops;



* [PATCH v4 27/33] vfio/migration: Multifd device state transfer support - received buffers queuing
  2025-01-30 10:08 [PATCH v4 00/33] Multifd 🔀 device state transfer support with VFIO consumer Maciej S. Szmigiero
                   ` (25 preceding siblings ...)
  2025-01-30 10:08 ` [PATCH v4 26/33] vfio/migration: Multifd device state transfer support - receive init/cleanup Maciej S. Szmigiero
@ 2025-01-30 10:08 ` Maciej S. Szmigiero
  2025-02-12 13:47   ` Cédric Le Goater
  2025-01-30 10:08 ` [PATCH v4 28/33] vfio/migration: Multifd device state transfer support - load thread Maciej S. Szmigiero
                   ` (7 subsequent siblings)
  34 siblings, 1 reply; 137+ messages in thread
From: Maciej S. Szmigiero @ 2025-01-30 10:08 UTC (permalink / raw)
  To: Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

The data received via multifd needs to be reassembled since device state
packets sent via different multifd channels can arrive out of order.

Therefore, each VFIO device state packet carries a header indicating its
position in the stream.
The raw device state data is saved into a VFIOStateBuffer for later
in-order loading into the device.

The last such VFIO device state packet should have
VFIO_DEVICE_STATE_CONFIG_STATE flag set and carry the device config state.

Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
 hw/vfio/migration.c           | 116 ++++++++++++++++++++++++++++++++++
 hw/vfio/pci.c                 |   2 +
 hw/vfio/trace-events          |   1 +
 include/hw/vfio/vfio-common.h |   1 +
 4 files changed, 120 insertions(+)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index bcdf204d5cf4..0c0caec1bd64 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -301,6 +301,12 @@ typedef struct VFIOStateBuffer {
 } VFIOStateBuffer;
 
 typedef struct VFIOMultifd {
+    VFIOStateBuffers load_bufs;
+    QemuCond load_bufs_buffer_ready_cond;
+    QemuMutex load_bufs_mutex; /* Lock order: this lock -> BQL */
+    uint32_t load_buf_idx;
+    uint32_t load_buf_idx_last;
+    uint32_t load_buf_queued_pending_buffers;
 } VFIOMultifd;
 
 static void vfio_state_buffer_clear(gpointer data)
@@ -346,6 +352,103 @@ static VFIOStateBuffer *vfio_state_buffers_at(VFIOStateBuffers *bufs, guint idx)
     return &g_array_index(bufs->array, VFIOStateBuffer, idx);
 }
 
+static bool vfio_load_state_buffer_insert(VFIODevice *vbasedev,
+                                          VFIODeviceStatePacket *packet,
+                                          size_t packet_total_size,
+                                          Error **errp)
+{
+    VFIOMigration *migration = vbasedev->migration;
+    VFIOMultifd *multifd = migration->multifd;
+    VFIOStateBuffer *lb;
+
+    vfio_state_buffers_assert_init(&multifd->load_bufs);
+    if (packet->idx >= vfio_state_buffers_size_get(&multifd->load_bufs)) {
+        vfio_state_buffers_size_set(&multifd->load_bufs, packet->idx + 1);
+    }
+
+    lb = vfio_state_buffers_at(&multifd->load_bufs, packet->idx);
+    if (lb->is_present) {
+        error_setg(errp, "state buffer %" PRIu32 " already filled",
+                   packet->idx);
+        return false;
+    }
+
+    assert(packet->idx >= multifd->load_buf_idx);
+
+    multifd->load_buf_queued_pending_buffers++;
+    if (multifd->load_buf_queued_pending_buffers >
+        vbasedev->migration_max_queued_buffers) {
+        error_setg(errp,
+                   "queuing state buffer %" PRIu32 " would exceed the max of %" PRIu64,
+                   packet->idx, vbasedev->migration_max_queued_buffers);
+        return false;
+    }
+
+    lb->data = g_memdup2(&packet->data, packet_total_size - sizeof(*packet));
+    lb->len = packet_total_size - sizeof(*packet);
+    lb->is_present = true;
+
+    return true;
+}
+
+static bool vfio_load_state_buffer(void *opaque, char *data, size_t data_size,
+                                   Error **errp)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+    VFIOMultifd *multifd = migration->multifd;
+    VFIODeviceStatePacket *packet = (VFIODeviceStatePacket *)data;
+
+    /*
+     * Holding BQL here would violate the lock order and can cause
+     * a deadlock once we attempt to lock load_bufs_mutex below.
+     */
+    assert(!bql_locked());
+
+    if (!migration->multifd_transfer) {
+        error_setg(errp,
+                   "got device state packet but not doing multifd transfer");
+        return false;
+    }
+
+    assert(multifd);
+
+    if (data_size < sizeof(*packet)) {
+        error_setg(errp, "packet too short at %zu (min is %zu)",
+                   data_size, sizeof(*packet));
+        return false;
+    }
+
+    if (packet->version != 0) {
+        error_setg(errp, "packet has unknown version %" PRIu32,
+                   packet->version);
+        return false;
+    }
+
+    if (packet->idx == UINT32_MAX) {
+        error_setg(errp, "packet has too high idx %" PRIu32,
+                   packet->idx);
+        return false;
+    }
+
+    trace_vfio_load_state_device_buffer_incoming(vbasedev->name, packet->idx);
+
+    QEMU_LOCK_GUARD(&multifd->load_bufs_mutex);
+
+    /* config state packet should be the last one in the stream */
+    if (packet->flags & VFIO_DEVICE_STATE_CONFIG_STATE) {
+        multifd->load_buf_idx_last = packet->idx;
+    }
+
+    if (!vfio_load_state_buffer_insert(vbasedev, packet, data_size, errp)) {
+        return false;
+    }
+
+    qemu_cond_signal(&multifd->load_bufs_buffer_ready_cond);
+
+    return true;
+}
+
 static int vfio_save_device_config_state(QEMUFile *f, void *opaque,
                                          Error **errp)
 {
@@ -405,11 +508,23 @@ static VFIOMultifd *vfio_multifd_new(void)
 {
     VFIOMultifd *multifd = g_new(VFIOMultifd, 1);
 
+    vfio_state_buffers_init(&multifd->load_bufs);
+
+    qemu_mutex_init(&multifd->load_bufs_mutex);
+
+    multifd->load_buf_idx = 0;
+    multifd->load_buf_idx_last = UINT32_MAX;
+    multifd->load_buf_queued_pending_buffers = 0;
+    qemu_cond_init(&multifd->load_bufs_buffer_ready_cond);
+
     return multifd;
 }
 
 static void vfio_multifd_free(VFIOMultifd *multifd)
 {
+    qemu_cond_destroy(&multifd->load_bufs_buffer_ready_cond);
+    qemu_mutex_destroy(&multifd->load_bufs_mutex);
+
     g_free(multifd);
 }
 
@@ -940,6 +1055,7 @@ static const SaveVMHandlers savevm_vfio_handlers = {
     .load_setup = vfio_load_setup,
     .load_cleanup = vfio_load_cleanup,
     .load_state = vfio_load_state,
+    .load_state_buffer = vfio_load_state_buffer,
     .switchover_ack_needed = vfio_switchover_ack_needed,
 };
 
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 83090c544d95..2700b355ecf1 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -3380,6 +3380,8 @@ static const Property vfio_pci_dev_properties[] = {
     DEFINE_PROP_ON_OFF_AUTO("x-migration-load-config-after-iter", VFIOPCIDevice,
                             vbasedev.migration_load_config_after_iter,
                             ON_OFF_AUTO_AUTO),
+    DEFINE_PROP_UINT64("x-migration-max-queued-buffers", VFIOPCIDevice,
+                       vbasedev.migration_max_queued_buffers, UINT64_MAX),
     DEFINE_PROP_BOOL("migration-events", VFIOPCIDevice,
                      vbasedev.migration_events, false),
     DEFINE_PROP_BOOL("x-no-mmap", VFIOPCIDevice, vbasedev.no_mmap, false),
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 1bebe9877d88..042a3dc54a33 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -153,6 +153,7 @@ vfio_load_device_config_state_start(const char *name) " (%s)"
 vfio_load_device_config_state_end(const char *name) " (%s)"
 vfio_load_state(const char *name, uint64_t data) " (%s) data 0x%"PRIx64
 vfio_load_state_device_data(const char *name, uint64_t data_size, int ret) " (%s) size %"PRIu64" ret %d"
+vfio_load_state_device_buffer_incoming(const char *name, uint32_t idx) " (%s) idx %"PRIu32
 vfio_migration_realize(const char *name) " (%s)"
 vfio_migration_set_device_state(const char *name, const char *state) " (%s) state %s"
 vfio_migration_set_state(const char *name, const char *new_state, const char *recover_state) " (%s) new state %s, recover state %s"
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index c0c9c0b1b263..0e8b0848882e 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -139,6 +139,7 @@ typedef struct VFIODevice {
     OnOffAuto enable_migration;
     OnOffAuto migration_multifd_transfer;
     OnOffAuto migration_load_config_after_iter;
+    uint64_t migration_max_queued_buffers;
     bool migration_events;
     VFIODeviceOps *ops;
     unsigned int num_irqs;



* [PATCH v4 28/33] vfio/migration: Multifd device state transfer support - load thread
  2025-01-30 10:08 [PATCH v4 00/33] Multifd 🔀 device state transfer support with VFIO consumer Maciej S. Szmigiero
                   ` (26 preceding siblings ...)
  2025-01-30 10:08 ` [PATCH v4 27/33] vfio/migration: Multifd device state transfer support - received buffers queuing Maciej S. Szmigiero
@ 2025-01-30 10:08 ` Maciej S. Szmigiero
  2025-02-12 15:48   ` Cédric Le Goater
  2025-01-30 10:08 ` [PATCH v4 29/33] vfio/migration: Multifd device state transfer support - config loading support Maciej S. Szmigiero
                   ` (6 subsequent siblings)
  34 siblings, 1 reply; 137+ messages in thread
From: Maciej S. Szmigiero @ 2025-01-30 10:08 UTC (permalink / raw)
  To: Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

Since it's important to finish loading the device state transferred via the
main migration channel (via the save_live_iterate SaveVMHandler) before
starting to load the data asynchronously transferred via multifd, the
thread doing the actual loading of the multifd-transferred data is only
started from the switchover_start SaveVMHandler.

The switchover_start handler is called when the MIG_CMD_SWITCHOVER_START
sub-command of QEMU_VM_COMMAND is received via the main migration channel.

This sub-command is only sent after all save_live_iterate data have already
been posted, so it is safe to commence loading of the multifd-transferred
device state upon receiving it. Loading of save_live_iterate data happens
synchronously in the main migration thread (much like the processing of
MIG_CMD_SWITCHOVER_START), so by the time MIG_CMD_SWITCHOVER_START is
processed all the preceding data must have already been loaded.

Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
 hw/vfio/migration.c  | 229 +++++++++++++++++++++++++++++++++++++++++++
 hw/vfio/trace-events |   5 +
 2 files changed, 234 insertions(+)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 0c0caec1bd64..ab5b097f59c9 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -301,8 +301,16 @@ typedef struct VFIOStateBuffer {
 } VFIOStateBuffer;
 
 typedef struct VFIOMultifd {
+    QemuThread load_bufs_thread;
+    bool load_bufs_thread_running;
+    bool load_bufs_thread_want_exit;
+
+    bool load_bufs_iter_done;
+    QemuCond load_bufs_iter_done_cond;
+
     VFIOStateBuffers load_bufs;
     QemuCond load_bufs_buffer_ready_cond;
+    QemuCond load_bufs_thread_finished_cond;
     QemuMutex load_bufs_mutex; /* Lock order: this lock -> BQL */
     uint32_t load_buf_idx;
     uint32_t load_buf_idx_last;
@@ -449,6 +457,171 @@ static bool vfio_load_state_buffer(void *opaque, char *data, size_t data_size,
     return true;
 }
 
+static VFIOStateBuffer *vfio_load_state_buffer_get(VFIOMultifd *multifd)
+{
+    VFIOStateBuffer *lb;
+    guint bufs_len;
+
+    bufs_len = vfio_state_buffers_size_get(&multifd->load_bufs);
+    if (multifd->load_buf_idx >= bufs_len) {
+        assert(multifd->load_buf_idx == bufs_len);
+        return NULL;
+    }
+
+    lb = vfio_state_buffers_at(&multifd->load_bufs,
+                               multifd->load_buf_idx);
+    if (!lb->is_present) {
+        return NULL;
+    }
+
+    return lb;
+}
+
+static int vfio_load_bufs_thread_load_config(VFIODevice *vbasedev)
+{
+    return -EINVAL;
+}
+
+static bool vfio_load_state_buffer_write(VFIODevice *vbasedev,
+                                         VFIOStateBuffer *lb,
+                                         Error **errp)
+{
+    VFIOMigration *migration = vbasedev->migration;
+    VFIOMultifd *multifd = migration->multifd;
+    g_autofree char *buf = NULL;
+    char *buf_cur;
+    size_t buf_len;
+
+    if (!lb->len) {
+        return true;
+    }
+
+    trace_vfio_load_state_device_buffer_load_start(vbasedev->name,
+                                                   multifd->load_buf_idx);
+
+    /* lb might become re-allocated when we drop the lock */
+    buf = g_steal_pointer(&lb->data);
+    buf_cur = buf;
+    buf_len = lb->len;
+    while (buf_len > 0) {
+        ssize_t wr_ret;
+        int errno_save;
+
+        /*
+         * Loading data to the device takes a while,
+         * drop the lock during this process.
+         */
+        qemu_mutex_unlock(&multifd->load_bufs_mutex);
+        wr_ret = write(migration->data_fd, buf_cur, buf_len);
+        errno_save = errno;
+        qemu_mutex_lock(&multifd->load_bufs_mutex);
+
+        if (wr_ret < 0) {
+            error_setg(errp,
+                       "writing state buffer %" PRIu32 " failed: %d",
+                       multifd->load_buf_idx, errno_save);
+            return false;
+        }
+
+        assert(wr_ret <= buf_len);
+        buf_len -= wr_ret;
+        buf_cur += wr_ret;
+    }
+
+    trace_vfio_load_state_device_buffer_load_end(vbasedev->name,
+                                                 multifd->load_buf_idx);
+
+    return true;
+}
+
+static bool vfio_load_bufs_thread_want_abort(VFIOMultifd *multifd,
+                                             bool *should_quit)
+{
+    return multifd->load_bufs_thread_want_exit || qatomic_read(should_quit);
+}
+
+static bool vfio_load_bufs_thread(void *opaque, bool *should_quit, Error **errp)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+    VFIOMultifd *multifd = migration->multifd;
+    bool ret = true;
+    int config_ret;
+
+    assert(multifd);
+    QEMU_LOCK_GUARD(&multifd->load_bufs_mutex);
+
+    assert(multifd->load_bufs_thread_running);
+
+    while (!vfio_load_bufs_thread_want_abort(multifd, should_quit)) {
+        VFIOStateBuffer *lb;
+
+        assert(multifd->load_buf_idx <= multifd->load_buf_idx_last);
+
+        lb = vfio_load_state_buffer_get(multifd);
+        if (!lb) {
+            trace_vfio_load_state_device_buffer_starved(vbasedev->name,
+                                                        multifd->load_buf_idx);
+            qemu_cond_wait(&multifd->load_bufs_buffer_ready_cond,
+                           &multifd->load_bufs_mutex);
+            continue;
+        }
+
+        if (multifd->load_buf_idx == multifd->load_buf_idx_last) {
+            break;
+        }
+
+        if (multifd->load_buf_idx == 0) {
+            trace_vfio_load_state_device_buffer_start(vbasedev->name);
+        }
+
+        if (!vfio_load_state_buffer_write(vbasedev, lb, errp)) {
+            ret = false;
+            goto ret_signal;
+        }
+
+        assert(multifd->load_buf_queued_pending_buffers > 0);
+        multifd->load_buf_queued_pending_buffers--;
+
+        if (multifd->load_buf_idx == multifd->load_buf_idx_last - 1) {
+            trace_vfio_load_state_device_buffer_end(vbasedev->name);
+        }
+
+        multifd->load_buf_idx++;
+    }
+
+    if (vfio_load_bufs_thread_want_abort(multifd, should_quit)) {
+        error_setg(errp, "operation cancelled");
+        ret = false;
+        goto ret_signal;
+    }
+
+    if (vfio_load_config_after_iter(vbasedev)) {
+        while (!multifd->load_bufs_iter_done) {
+            qemu_cond_wait(&multifd->load_bufs_iter_done_cond,
+                           &multifd->load_bufs_mutex);
+
+            if (vfio_load_bufs_thread_want_abort(multifd, should_quit)) {
+                error_setg(errp, "operation cancelled");
+                ret = false;
+                goto ret_signal;
+            }
+        }
+    }
+
+    config_ret = vfio_load_bufs_thread_load_config(vbasedev);
+    if (config_ret) {
+        error_setg(errp, "load config state failed: %d", config_ret);
+        ret = false;
+    }
+
+ret_signal:
+    multifd->load_bufs_thread_running = false;
+    qemu_cond_signal(&multifd->load_bufs_thread_finished_cond);
+
+    return ret;
+}
+
 static int vfio_save_device_config_state(QEMUFile *f, void *opaque,
                                          Error **errp)
 {
@@ -517,11 +690,40 @@ static VFIOMultifd *vfio_multifd_new(void)
     multifd->load_buf_queued_pending_buffers = 0;
     qemu_cond_init(&multifd->load_bufs_buffer_ready_cond);
 
+    multifd->load_bufs_iter_done = false;
+    qemu_cond_init(&multifd->load_bufs_iter_done_cond);
+
+    multifd->load_bufs_thread_running = false;
+    multifd->load_bufs_thread_want_exit = false;
+    qemu_cond_init(&multifd->load_bufs_thread_finished_cond);
+
     return multifd;
 }
 
+static void vfio_load_cleanup_load_bufs_thread(VFIOMultifd *multifd)
+{
+    /* The lock order is load_bufs_mutex -> BQL so unlock BQL here first */
+    bql_unlock();
+    WITH_QEMU_LOCK_GUARD(&multifd->load_bufs_mutex) {
+        while (multifd->load_bufs_thread_running) {
+            multifd->load_bufs_thread_want_exit = true;
+
+            qemu_cond_signal(&multifd->load_bufs_buffer_ready_cond);
+            qemu_cond_signal(&multifd->load_bufs_iter_done_cond);
+            qemu_cond_wait(&multifd->load_bufs_thread_finished_cond,
+                           &multifd->load_bufs_mutex);
+        }
+    }
+    bql_lock();
+}
+
 static void vfio_multifd_free(VFIOMultifd *multifd)
 {
+    vfio_load_cleanup_load_bufs_thread(multifd);
+
+    qemu_cond_destroy(&multifd->load_bufs_thread_finished_cond);
+    qemu_cond_destroy(&multifd->load_bufs_iter_done_cond);
+    vfio_state_buffers_destroy(&multifd->load_bufs);
     qemu_cond_destroy(&multifd->load_bufs_buffer_ready_cond);
     qemu_mutex_destroy(&multifd->load_bufs_mutex);
 
@@ -1042,6 +1244,32 @@ static bool vfio_switchover_ack_needed(void *opaque)
     return vfio_precopy_supported(vbasedev);
 }
 
+static int vfio_switchover_start(void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+    VFIOMultifd *multifd = migration->multifd;
+
+    if (!migration->multifd_transfer) {
+        /* Load thread is only used for multifd transfer */
+        return 0;
+    }
+
+    assert(multifd);
+
+    /* The lock order is load_bufs_mutex -> BQL so unlock BQL here first */
+    bql_unlock();
+    WITH_QEMU_LOCK_GUARD(&multifd->load_bufs_mutex) {
+        assert(!multifd->load_bufs_thread_running);
+        multifd->load_bufs_thread_running = true;
+    }
+    bql_lock();
+
+    qemu_loadvm_start_load_thread(vfio_load_bufs_thread, vbasedev);
+
+    return 0;
+}
+
 static const SaveVMHandlers savevm_vfio_handlers = {
     .save_prepare = vfio_save_prepare,
     .save_setup = vfio_save_setup,
@@ -1057,6 +1285,7 @@ static const SaveVMHandlers savevm_vfio_handlers = {
     .load_state = vfio_load_state,
     .load_state_buffer = vfio_load_state_buffer,
     .switchover_ack_needed = vfio_switchover_ack_needed,
+    .switchover_start = vfio_switchover_start,
 };
 
 /* ---------------------------------------------------------------------- */
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 042a3dc54a33..418b378ebd29 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -154,6 +154,11 @@ vfio_load_device_config_state_end(const char *name) " (%s)"
 vfio_load_state(const char *name, uint64_t data) " (%s) data 0x%"PRIx64
 vfio_load_state_device_data(const char *name, uint64_t data_size, int ret) " (%s) size %"PRIu64" ret %d"
 vfio_load_state_device_buffer_incoming(const char *name, uint32_t idx) " (%s) idx %"PRIu32
+vfio_load_state_device_buffer_start(const char *name) " (%s)"
+vfio_load_state_device_buffer_starved(const char *name, uint32_t idx) " (%s) idx %"PRIu32
+vfio_load_state_device_buffer_load_start(const char *name, uint32_t idx) " (%s) idx %"PRIu32
+vfio_load_state_device_buffer_load_end(const char *name, uint32_t idx) " (%s) idx %"PRIu32
+vfio_load_state_device_buffer_end(const char *name) " (%s)"
 vfio_migration_realize(const char *name) " (%s)"
 vfio_migration_set_device_state(const char *name, const char *state) " (%s) state %s"
 vfio_migration_set_state(const char *name, const char *new_state, const char *recover_state) " (%s) new state %s, recover state %s"


^ permalink raw reply related	[flat|nested] 137+ messages in thread

* [PATCH v4 29/33] vfio/migration: Multifd device state transfer support - config loading support
  2025-01-30 10:08 [PATCH v4 00/33] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
                   ` (27 preceding siblings ...)
  2025-01-30 10:08 ` [PATCH v4 28/33] vfio/migration: Multifd device state transfer support - load thread Maciej S. Szmigiero
@ 2025-01-30 10:08 ` Maciej S. Szmigiero
  2025-02-12 16:21   ` Cédric Le Goater
  2025-01-30 10:08 ` [PATCH v4 30/33] migration/qemu-file: Define g_autoptr() cleanup function for QEMUFile Maciej S. Szmigiero
                   ` (5 subsequent siblings)
  34 siblings, 1 reply; 137+ messages in thread
From: Maciej S. Szmigiero @ 2025-01-30 10:08 UTC (permalink / raw)
  To: Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

Load the device config received via multifd using the existing machinery
behind vfio_load_device_config_state().

Also, make sure to process the relevant main migration channel flags.

Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
 hw/vfio/migration.c | 103 +++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 98 insertions(+), 5 deletions(-)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index ab5b097f59c9..31f651ffee85 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -15,6 +15,7 @@
 #include <linux/vfio.h>
 #include <sys/ioctl.h>
 
+#include "io/channel-buffer.h"
 #include "system/runstate.h"
 #include "hw/vfio/vfio-common.h"
 #include "migration/misc.h"
@@ -457,6 +458,57 @@ static bool vfio_load_state_buffer(void *opaque, char *data, size_t data_size,
     return true;
 }
 
+static int vfio_load_device_config_state(QEMUFile *f, void *opaque);
+
+static int vfio_load_bufs_thread_load_config(VFIODevice *vbasedev)
+{
+    VFIOMigration *migration = vbasedev->migration;
+    VFIOMultifd *multifd = migration->multifd;
+    VFIOStateBuffer *lb;
+    g_autoptr(QIOChannelBuffer) bioc = NULL;
+    QEMUFile *f_out = NULL, *f_in = NULL;
+    uint64_t mig_header;
+    int ret;
+
+    assert(multifd->load_buf_idx == multifd->load_buf_idx_last);
+    lb = vfio_state_buffers_at(&multifd->load_bufs, multifd->load_buf_idx);
+    assert(lb->is_present);
+
+    bioc = qio_channel_buffer_new(lb->len);
+    qio_channel_set_name(QIO_CHANNEL(bioc), "vfio-device-config-load");
+
+    f_out = qemu_file_new_output(QIO_CHANNEL(bioc));
+    qemu_put_buffer(f_out, (uint8_t *)lb->data, lb->len);
+
+    ret = qemu_fflush(f_out);
+    if (ret) {
+        g_clear_pointer(&f_out, qemu_fclose);
+        return ret;
+    }
+
+    qio_channel_io_seek(QIO_CHANNEL(bioc), 0, 0, NULL);
+    f_in = qemu_file_new_input(QIO_CHANNEL(bioc));
+
+    mig_header = qemu_get_be64(f_in);
+    if (mig_header != VFIO_MIG_FLAG_DEV_CONFIG_STATE) {
+        g_clear_pointer(&f_out, qemu_fclose);
+        g_clear_pointer(&f_in, qemu_fclose);
+        return -EINVAL;
+    }
+
+    bql_lock();
+    ret = vfio_load_device_config_state(f_in, vbasedev);
+    bql_unlock();
+
+    g_clear_pointer(&f_out, qemu_fclose);
+    g_clear_pointer(&f_in, qemu_fclose);
+    if (ret < 0) {
+        return ret;
+    }
+
+    return 0;
+}
+
 static VFIOStateBuffer *vfio_load_state_buffer_get(VFIOMultifd *multifd)
 {
     VFIOStateBuffer *lb;
@@ -477,11 +529,6 @@ static VFIOStateBuffer *vfio_load_state_buffer_get(VFIOMultifd *multifd)
     return lb;
 }
 
-static int vfio_load_bufs_thread_load_config(VFIODevice *vbasedev)
-{
-    return -EINVAL;
-}
-
 static bool vfio_load_state_buffer_write(VFIODevice *vbasedev,
                                          VFIOStateBuffer *lb,
                                          Error **errp)
@@ -1168,6 +1215,8 @@ static int vfio_load_cleanup(void *opaque)
 static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
 {
     VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+    VFIOMultifd *multifd = migration->multifd;
     int ret = 0;
     uint64_t data;
 
@@ -1179,6 +1228,12 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
         switch (data) {
         case VFIO_MIG_FLAG_DEV_CONFIG_STATE:
         {
+            if (migration->multifd_transfer) {
+                error_report("%s: got DEV_CONFIG_STATE but doing multifd transfer",
+                             vbasedev->name);
+                return -EINVAL;
+            }
+
             return vfio_load_device_config_state(f, opaque);
         }
         case VFIO_MIG_FLAG_DEV_SETUP_STATE:
@@ -1223,6 +1278,44 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
 
             return ret;
         }
+        case VFIO_MIG_FLAG_DEV_CONFIG_LOAD_READY:
+        {
+            if (!migration->multifd_transfer) {
+                error_report("%s: got DEV_CONFIG_LOAD_READY outside multifd transfer",
+                             vbasedev->name);
+                return -EINVAL;
+            }
+
+            if (!vfio_load_config_after_iter(vbasedev)) {
+                error_report("%s: got DEV_CONFIG_LOAD_READY but was disabled",
+                             vbasedev->name);
+                return -EINVAL;
+            }
+
+            assert(multifd);
+
+            /* The lock order is load_bufs_mutex -> BQL so unlock BQL here first */
+            bql_unlock();
+            WITH_QEMU_LOCK_GUARD(&multifd->load_bufs_mutex) {
+                if (multifd->load_bufs_iter_done) {
+                    /* Can't print error here as we're outside BQL */
+                    ret = -EINVAL;
+                    break;
+                }
+
+                multifd->load_bufs_iter_done = true;
+                qemu_cond_signal(&multifd->load_bufs_iter_done_cond);
+
+                ret = 0;
+            }
+            bql_lock();
+
+            if (ret) {
+                error_report("%s: duplicate DEV_CONFIG_LOAD_READY",
+                             vbasedev->name);
+            }
+            return ret;
+        }
         default:
             error_report("%s: Unknown tag 0x%"PRIx64, vbasedev->name, data);
             return -EINVAL;



* [PATCH v4 30/33] migration/qemu-file: Define g_autoptr() cleanup function for QEMUFile
  2025-01-30 10:08 [PATCH v4 00/33] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
                   ` (28 preceding siblings ...)
  2025-01-30 10:08 ` [PATCH v4 29/33] vfio/migration: Multifd device state transfer support - config loading support Maciej S. Szmigiero
@ 2025-01-30 10:08 ` Maciej S. Szmigiero
  2025-01-30 10:08 ` [PATCH v4 31/33] vfio/migration: Multifd device state transfer support - send side Maciej S. Szmigiero
                   ` (4 subsequent siblings)
  34 siblings, 0 replies; 137+ messages in thread
From: Maciej S. Szmigiero @ 2025-01-30 10:08 UTC (permalink / raw)
  To: Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

Automatic memory management helps avoid memory safety issues.

Reviewed-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
 migration/qemu-file.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/migration/qemu-file.h b/migration/qemu-file.h
index 11c2120edd72..fdf21324df07 100644
--- a/migration/qemu-file.h
+++ b/migration/qemu-file.h
@@ -33,6 +33,8 @@ QEMUFile *qemu_file_new_input(QIOChannel *ioc);
 QEMUFile *qemu_file_new_output(QIOChannel *ioc);
 int qemu_fclose(QEMUFile *f);
 
+G_DEFINE_AUTOPTR_CLEANUP_FUNC(QEMUFile, qemu_fclose)
+
 /*
  * qemu_file_transferred:
  *



* [PATCH v4 31/33] vfio/migration: Multifd device state transfer support - send side
  2025-01-30 10:08 [PATCH v4 00/33] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
                   ` (29 preceding siblings ...)
  2025-01-30 10:08 ` [PATCH v4 30/33] migration/qemu-file: Define g_autoptr() cleanup function for QEMUFile Maciej S. Szmigiero
@ 2025-01-30 10:08 ` Maciej S. Szmigiero
  2025-02-12 17:03   ` Cédric Le Goater
  2025-01-30 10:08 ` [PATCH v4 32/33] vfio/migration: Add x-migration-multifd-transfer VFIO property Maciej S. Szmigiero
                   ` (3 subsequent siblings)
  34 siblings, 1 reply; 137+ messages in thread
From: Maciej S. Szmigiero @ 2025-01-30 10:08 UTC (permalink / raw)
  To: Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

Implement the multifd device state transfer via an additional per-device
thread inside the save_live_complete_precopy_thread handler.
Switch between doing the data transfer in the new handler and doing it
in the old save_state handler depending on the
x-migration-multifd-transfer device property value.

Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
 hw/vfio/migration.c  | 159 +++++++++++++++++++++++++++++++++++++++++++
 hw/vfio/trace-events |   2 +
 2 files changed, 161 insertions(+)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 31f651ffee85..37d1c0f3d32f 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -943,6 +943,24 @@ static int vfio_save_setup(QEMUFile *f, void *opaque, Error **errp)
     uint64_t stop_copy_size = VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE;
     int ret;
 
+    /*
+     * Make a copy of this setting at the start in case it is changed
+     * mid-migration.
+     */
+    if (vbasedev->migration_multifd_transfer == ON_OFF_AUTO_AUTO) {
+        migration->multifd_transfer = vfio_multifd_transfer_supported();
+    } else {
+        migration->multifd_transfer =
+            vbasedev->migration_multifd_transfer == ON_OFF_AUTO_ON;
+    }
+
+    if (migration->multifd_transfer && !vfio_multifd_transfer_supported()) {
+        error_setg(errp,
+                   "%s: Multifd device transfer requested but unsupported in the current config",
+                   vbasedev->name);
+        return -EINVAL;
+    }
+
     qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
 
     vfio_query_stop_copy_size(vbasedev, &stop_copy_size);
@@ -1114,13 +1132,32 @@ static int vfio_save_iterate(QEMUFile *f, void *opaque)
     return !migration->precopy_init_size && !migration->precopy_dirty_size;
 }
 
+static void vfio_save_multifd_emit_dummy_eos(VFIODevice *vbasedev, QEMUFile *f)
+{
+    VFIOMigration *migration = vbasedev->migration;
+
+    assert(migration->multifd_transfer);
+
+    /*
+     * Emit dummy NOP data on the main migration channel since the actual
+     * device state transfer is done via multifd channels.
+     */
+    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
+}
+
 static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
 {
     VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
     ssize_t data_size;
     int ret;
     Error *local_err = NULL;
 
+    if (migration->multifd_transfer) {
+        vfio_save_multifd_emit_dummy_eos(vbasedev, f);
+        return 0;
+    }
+
     trace_vfio_save_complete_precopy_start(vbasedev->name);
 
     /* We reach here with device state STOP or STOP_COPY only */
@@ -1146,12 +1183,133 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
     return ret;
 }
 
+static int
+vfio_save_complete_precopy_async_thread_config_state(VFIODevice *vbasedev,
+                                                     char *idstr,
+                                                     uint32_t instance_id,
+                                                     uint32_t idx)
+{
+    g_autoptr(QIOChannelBuffer) bioc = NULL;
+    g_autoptr(QEMUFile) f = NULL;
+    int ret;
+    g_autofree VFIODeviceStatePacket *packet = NULL;
+    size_t packet_len;
+
+    bioc = qio_channel_buffer_new(0);
+    qio_channel_set_name(QIO_CHANNEL(bioc), "vfio-device-config-save");
+
+    f = qemu_file_new_output(QIO_CHANNEL(bioc));
+
+    ret = vfio_save_device_config_state(f, vbasedev, NULL);
+    if (ret) {
+        return ret;
+    }
+
+    ret = qemu_fflush(f);
+    if (ret) {
+        return ret;
+    }
+
+    packet_len = sizeof(*packet) + bioc->usage;
+    packet = g_malloc0(packet_len);
+    packet->idx = idx;
+    packet->flags = VFIO_DEVICE_STATE_CONFIG_STATE;
+    memcpy(&packet->data, bioc->data, bioc->usage);
+
+    if (!multifd_queue_device_state(idstr, instance_id,
+                                    (char *)packet, packet_len)) {
+        return -1;
+    }
+
+    qatomic_add(&bytes_transferred, packet_len);
+
+    return 0;
+}
+
+static int vfio_save_complete_precopy_thread(char *idstr,
+                                             uint32_t instance_id,
+                                             bool *abort_flag,
+                                             void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+    int ret;
+    g_autofree VFIODeviceStatePacket *packet = NULL;
+    uint32_t idx;
+
+    if (!migration->multifd_transfer) {
+        /* Nothing to do, vfio_save_complete_precopy() does the transfer. */
+        return 0;
+    }
+
+    trace_vfio_save_complete_precopy_thread_start(vbasedev->name,
+                                                  idstr, instance_id);
+
+    /* We reach here with device state STOP or STOP_COPY only */
+    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP_COPY,
+                                   VFIO_DEVICE_STATE_STOP, NULL);
+    if (ret) {
+        goto ret_finish;
+    }
+
+    packet = g_malloc0(sizeof(*packet) + migration->data_buffer_size);
+
+    for (idx = 0; ; idx++) {
+        ssize_t data_size;
+        size_t packet_size;
+
+        if (qatomic_read(abort_flag)) {
+            ret = -ECANCELED;
+            goto ret_finish;
+        }
+
+        data_size = read(migration->data_fd, &packet->data,
+                         migration->data_buffer_size);
+        if (data_size < 0) {
+            ret = -errno;
+            goto ret_finish;
+        } else if (data_size == 0) {
+            break;
+        }
+
+        packet->idx = idx;
+        packet_size = sizeof(*packet) + data_size;
+
+        if (!multifd_queue_device_state(idstr, instance_id,
+                                        (char *)packet, packet_size)) {
+            ret = -1;
+            goto ret_finish;
+        }
+
+        qatomic_add(&bytes_transferred, packet_size);
+    }
+
+    ret = vfio_save_complete_precopy_async_thread_config_state(vbasedev, idstr,
+                                                               instance_id,
+                                                               idx);
+
+ret_finish:
+    trace_vfio_save_complete_precopy_thread_end(vbasedev->name, ret);
+
+    return ret;
+}
+
 static void vfio_save_state(QEMUFile *f, void *opaque)
 {
     VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
     Error *local_err = NULL;
     int ret;
 
+    if (migration->multifd_transfer) {
+        if (vfio_load_config_after_iter(vbasedev)) {
+            qemu_put_be64(f, VFIO_MIG_FLAG_DEV_CONFIG_LOAD_READY);
+        } else {
+            vfio_save_multifd_emit_dummy_eos(vbasedev, f);
+        }
+        return;
+    }
+
     ret = vfio_save_device_config_state(f, opaque, &local_err);
     if (ret) {
         error_prepend(&local_err,
@@ -1372,6 +1530,7 @@ static const SaveVMHandlers savevm_vfio_handlers = {
     .is_active_iterate = vfio_is_active_iterate,
     .save_live_iterate = vfio_save_iterate,
     .save_live_complete_precopy = vfio_save_complete_precopy,
+    .save_live_complete_precopy_thread = vfio_save_complete_precopy_thread,
     .save_state = vfio_save_state,
     .load_setup = vfio_load_setup,
     .load_cleanup = vfio_load_cleanup,
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 418b378ebd29..039979bdd98f 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -168,6 +168,8 @@ vfio_save_block_precopy_empty_hit(const char *name) " (%s)"
 vfio_save_cleanup(const char *name) " (%s)"
 vfio_save_complete_precopy(const char *name, int ret) " (%s) ret %d"
 vfio_save_complete_precopy_start(const char *name) " (%s)"
+vfio_save_complete_precopy_thread_start(const char *name, const char *idstr, uint32_t instance_id) " (%s) idstr %s instance %"PRIu32
+vfio_save_complete_precopy_thread_end(const char *name, int ret) " (%s) ret %d"
 vfio_save_device_config_state(const char *name) " (%s)"
 vfio_save_iterate(const char *name, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy initial size %"PRIu64" precopy dirty size %"PRIu64
 vfio_save_iterate_start(const char *name) " (%s)"



* [PATCH v4 32/33] vfio/migration: Add x-migration-multifd-transfer VFIO property
  2025-01-30 10:08 [PATCH v4 00/33] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
                   ` (30 preceding siblings ...)
  2025-01-30 10:08 ` [PATCH v4 31/33] vfio/migration: Multifd device state transfer support - send side Maciej S. Szmigiero
@ 2025-01-30 10:08 ` Maciej S. Szmigiero
  2025-02-12 17:10   ` Cédric Le Goater
  2025-01-30 10:08 ` [PATCH v4 33/33] hw/core/machine: Add compat for " Maciej S. Szmigiero
                   ` (2 subsequent siblings)
  34 siblings, 1 reply; 137+ messages in thread
From: Maciej S. Szmigiero @ 2025-01-30 10:08 UTC (permalink / raw)
  To: Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

This property allows configuring at runtime whether to transfer a
particular device's state via multifd channels when live migrating that
device.

It defaults to AUTO, which means that VFIO device state transfer via
multifd channels is attempted in configurations that otherwise support it.

Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
 hw/vfio/pci.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 2700b355ecf1..cd24f386aaf9 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -3353,6 +3353,8 @@ static void vfio_instance_init(Object *obj)
     pci_dev->cap_present |= QEMU_PCI_CAP_EXPRESS;
 }
 
+static PropertyInfo qdev_prop_on_off_auto_mutable;
+
 static const Property vfio_pci_dev_properties[] = {
     DEFINE_PROP_PCI_HOST_DEVADDR("host", VFIOPCIDevice, host),
     DEFINE_PROP_UUID_NODEFAULT("vf-token", VFIOPCIDevice, vf_token),
@@ -3377,6 +3379,10 @@ static const Property vfio_pci_dev_properties[] = {
                     VFIO_FEATURE_ENABLE_IGD_OPREGION_BIT, false),
     DEFINE_PROP_ON_OFF_AUTO("enable-migration", VFIOPCIDevice,
                             vbasedev.enable_migration, ON_OFF_AUTO_AUTO),
+    DEFINE_PROP("x-migration-multifd-transfer", VFIOPCIDevice,
+                vbasedev.migration_multifd_transfer,
+                qdev_prop_on_off_auto_mutable, OnOffAuto,
+                .set_default = true, .defval.i = ON_OFF_AUTO_AUTO),
     DEFINE_PROP_ON_OFF_AUTO("x-migration-load-config-after-iter", VFIOPCIDevice,
                             vbasedev.migration_load_config_after_iter,
                             ON_OFF_AUTO_AUTO),
@@ -3477,6 +3483,9 @@ static const TypeInfo vfio_pci_nohotplug_dev_info = {
 
 static void register_vfio_pci_dev_type(void)
 {
+    qdev_prop_on_off_auto_mutable = qdev_prop_on_off_auto;
+    qdev_prop_on_off_auto_mutable.realized_set_allowed = true;
+
     type_register_static(&vfio_pci_dev_info);
     type_register_static(&vfio_pci_nohotplug_dev_info);
 }



* [PATCH v4 33/33] hw/core/machine: Add compat for x-migration-multifd-transfer VFIO property
  2025-01-30 10:08 [PATCH v4 00/33] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
                   ` (31 preceding siblings ...)
  2025-01-30 10:08 ` [PATCH v4 32/33] vfio/migration: Add x-migration-multifd-transfer VFIO property Maciej S. Szmigiero
@ 2025-01-30 10:08 ` Maciej S. Szmigiero
  2025-01-30 20:19 ` [PATCH v4 00/33] Multifd 🔀 device state transfer support with VFIO consumer Fabiano Rosas
  2025-02-03 14:19 ` Cédric Le Goater
  34 siblings, 0 replies; 137+ messages in thread
From: Maciej S. Szmigiero @ 2025-01-30 10:08 UTC (permalink / raw)
  To: Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

Add a hw_compat entry for the recently added x-migration-multifd-transfer
VFIO property.

Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
 hw/core/machine.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/hw/core/machine.c b/hw/core/machine.c
index c2964503c5bd..3f06ea945859 100644
--- a/hw/core/machine.c
+++ b/hw/core/machine.c
@@ -39,6 +39,7 @@
 GlobalProperty hw_compat_9_2[] = {
     {"arm-cpu", "backcompat-pauth-default-use-qarma5", "true"},
     { "migration", "send-switchover-start", "off"},
+    { "vfio-pci", "x-migration-multifd-transfer", "off" },
 };
 const size_t hw_compat_9_2_len = G_N_ELEMENTS(hw_compat_9_2);
 



* Re: [PATCH v4 00/33] Multifd device state transfer support with VFIO consumer
  2025-01-30 10:08 [PATCH v4 00/33] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
                   ` (32 preceding siblings ...)
  2025-01-30 10:08 ` [PATCH v4 33/33] hw/core/machine: Add compat for " Maciej S. Szmigiero
@ 2025-01-30 20:19 ` Fabiano Rosas
  2025-01-30 20:27   ` Maciej S. Szmigiero
  2025-02-03 14:19 ` Cédric Le Goater
  34 siblings, 1 reply; 137+ messages in thread
From: Fabiano Rosas @ 2025-01-30 20:19 UTC (permalink / raw)
  To: Maciej S. Szmigiero, Peter Xu
  Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

"Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:

> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>
> This is an updated v4 patch series of the v3 series located here:
> https://lore.kernel.org/qemu-devel/cover.1731773021.git.maciej.szmigiero@oracle.com/
>
> Changes from v3:
> * MigrationLoadThread now returns bool and an Error complex error type
> instead of just an int.
>
> * qemu_loadvm_load_thread_pool now reports error via migrate_set_error()
> instead of dedicated load_threads_ret variable.
>
> * Since the change above uncovered an issue with respect to multifd send
> channels not terminating TLS session properly QIOChannelTLS now allows
> gracefully handling this situation.
>
> * qemu_loadvm_load_thread_pool state is now part of MigrationIncomingState
> instead of being stored in global variables.
> This state now also has its own init/cleanup helpers.
>
> * qemu_loadvm_load_thread_pool code is now moved into a separate section
> of the savevm.c file, marked by an appropriate comment.
>
> * thread_pool_free() is now documented to have wait-before-free semantic,
> which allowed removal of explicit waits from thread pool cleanup paths.
>
> * thread_pool_submit_immediate() method was added since this functionality
> is used by both generic thread pool users in this patch set.
>
> * postcopy_ram_listen_thread() now takes BQL around function calls that
> ultimately call migration methods requiring BQL.
> This fixes one of QEMU tests failing when explicitly BQL-sensitive code
> is added later to these methods.
>
> * qemu_loadvm_load_state_buffer() now returns a bool value instead of int.
>
> * "Send final SYNC only after device state is complete" patch was
> dropped since Peter implemented equivalent functionality upstream.
>
> * "Document the BQL behavior of load SaveVMHandlers" patch was dropped
> since that's something better done later, separately from this patch set.
>
> * Header size is now added to mig_stats.multifd_bytes where it is actually
> sent in the zero copy case - in multifd_nocomp_send_prepare().
>
> * Spurious wakeups from qemu_cond_wait() are now handled properly as
> pointed out by Avihai.
>
> * VFIO migration FD now allows partial write() completion as pointed out
> by Avihai.
>
> * Patch "vfio/migration: Don't run load cleanup if load setup didn't run"
> was dropped, instead all objects related to multifd load are now located in
> their own VFIOMultifd struct which is allocated only if multifd device state
> transfer is actually in use.
>
> * Intermediate VFIOStateBuffers API as suggested by Avihai is now introduced
> to simplify vfio_load_state_buffer() and vfio_load_bufs_thread().
>
> * Optional VFIO device config state loading interlocking with loading
> other iterables is now possible due to ARM64 platform VFIO dependency on
> interrupt controller being loaded first as pointed out by Avihai.
>
> * Patch "Multifd device state transfer support - receive side" was split
> into a few smaller patches as suggested by Cédric.
>
> * x-migration-multifd-transfer VFIO property compat changes were moved
> into a separate patch as suggested by Cédric.
>
> * Other small changes, like renamed functions and variables/members, added
> review tags, code formatting, moved QEMU_LOCK_GUARD() instances closer to
> actual protected blocks, etc.
>
> ========================================================================
>
> This patch set is targeting QEMU 10.0.
>
> What's not yet present is documentation update under docs/devel/migration
> but I didn't want to delay posting the code any longer.
> Such doc can still be merged later when the design is 100% finalized.
>
> ========================================================================
>
> Maciej S. Szmigiero (32):
>   migration: Clarify that {load,save}_cleanup handlers can run without
>     setup
>   thread-pool: Remove thread_pool_submit() function
>   thread-pool: Rename AIO pool functions to *_aio() and data types to
>     *Aio
>   thread-pool: Implement generic (non-AIO) pool support
>   migration: Add MIG_CMD_SWITCHOVER_START and its load handler
>   migration: Add qemu_loadvm_load_state_buffer() and its handler
>   io: tls: Allow terminating the TLS session gracefully with EOF
>   migration/multifd: Allow premature EOF on TLS incoming channels
>   migration: postcopy_ram_listen_thread() needs to take BQL for some
>     calls
>   error: define g_autoptr() cleanup function for the Error type
>   migration: Add thread pool of optional load threads
>   migration/multifd: Split packet into header and RAM data
>   migration/multifd: Device state transfer support - receive side
>   migration/multifd: Make multifd_send() thread safe
>   migration/multifd: Add an explicit MultiFDSendData destructor
>   migration/multifd: Device state transfer support - send side
>   migration/multifd: Add multifd_device_state_supported()
>   migration: Add save_live_complete_precopy_thread handler
>   vfio/migration: Add x-migration-load-config-after-iter VFIO property
>   vfio/migration: Add load_device_config_state_start trace event
>   vfio/migration: Convert bytes_transferred counter to atomic
>   vfio/migration: Multifd device state transfer support - basic types
>   vfio/migration: Multifd device state transfer support -
>     VFIOStateBuffer(s)
>   vfio/migration: Multifd device state transfer - add support checking
>     function
>   vfio/migration: Multifd device state transfer support - receive
>     init/cleanup
>   vfio/migration: Multifd device state transfer support - received
>     buffers queuing
>   vfio/migration: Multifd device state transfer support - load thread
>   vfio/migration: Multifd device state transfer support - config loading
>     support
>   migration/qemu-file: Define g_autoptr() cleanup function for QEMUFile
>   vfio/migration: Multifd device state transfer support - send side
>   vfio/migration: Add x-migration-multifd-transfer VFIO property
>   hw/core/machine: Add compat for x-migration-multifd-transfer VFIO
>     property
>
> Peter Xu (1):
>   migration/multifd: Make MultiFDSendData a struct
>
>  hw/core/machine.c                  |   2 +
>  hw/vfio/migration.c                | 754 ++++++++++++++++++++++++++++-
>  hw/vfio/pci.c                      |  14 +
>  hw/vfio/trace-events               |  11 +-
>  include/block/aio.h                |   8 +-
>  include/block/thread-pool.h        |  62 ++-
>  include/hw/vfio/vfio-common.h      |   7 +
>  include/io/channel-tls.h           |  11 +
>  include/migration/client-options.h |   4 +
>  include/migration/misc.h           |  16 +
>  include/migration/register.h       |  54 ++-
>  include/qapi/error.h               |   2 +
>  include/qemu/typedefs.h            |   6 +
>  io/channel-tls.c                   |   6 +
>  migration/colo.c                   |   3 +
>  migration/meson.build              |   1 +
>  migration/migration-hmp-cmds.c     |   2 +
>  migration/migration.c              |   6 +-
>  migration/migration.h              |   7 +
>  migration/multifd-device-state.c   | 192 ++++++++
>  migration/multifd-nocomp.c         |  30 +-
>  migration/multifd.c                | 248 ++++++++--
>  migration/multifd.h                |  74 ++-
>  migration/options.c                |   9 +
>  migration/qemu-file.h              |   2 +
>  migration/savevm.c                 | 195 +++++++-
>  migration/savevm.h                 |   6 +-
>  migration/trace-events             |   1 +
>  scripts/analyze-migration.py       |  11 +
>  tests/unit/test-thread-pool.c      |   6 +-
>  util/async.c                       |   6 +-
>  util/thread-pool.c                 | 184 +++++--
>  util/trace-events                  |   6 +-
>  33 files changed, 1814 insertions(+), 132 deletions(-)
>  create mode 100644 migration/multifd-device-state.c

Hi!

We have build issues:

https://gitlab.com/farosas/qemu/-/pipelines/1649146958

And the postcopy/recovery test is failing. It seems the migration
finishes before the test can issue migrate-pause:

QTEST_QEMU_BINARY=./qemu-system-x86_64  ./tests/qtest/migration-test -p
/x86_64/migration/postcopy/recovery/plain
...
{"execute": "migrate-start-postcopy"}
{"return": {}}
{"secs": 1738267018, "usecs": 860991}, "event": "MIGRATION", "data": {"status": "postcopy-active"}
{"secs": 1738267018, "usecs": 861284}, "event": "STOP"
{"secs": 1738267017, "usecs": 960322}, "event": "MIGRATION", "data": {"status": "active"}
{"secs": 1738267018, "usecs": 865589}, "event": "MIGRATION", "data": {"status": "postcopy-active"}
{"secs": 1738267099, "usecs": 120971}, "event": "MIGRATION", "data": {"status": "completed"}
{"secs": 1738267099, "usecs": 121154}, "event": "RESUME"
{"execute": "query-migrate"}

ERROR:../tests/qtest/migration/migration-qmp.c:172:check_migration_status:
assertion failed (current_status != "completed"): ("completed" !=
"completed")




* Re: [PATCH v4 00/33] Multifd 🔀 device state transfer support with VFIO consumer
  2025-01-30 20:19 ` [PATCH v4 00/33] Multifd 🔀 device state transfer support with VFIO consumer Fabiano Rosas
@ 2025-01-30 20:27   ` Maciej S. Szmigiero
  2025-01-30 20:46     ` Fabiano Rosas
  2025-01-31 18:16     ` Maciej S. Szmigiero
  0 siblings, 2 replies; 137+ messages in thread
From: Maciej S. Szmigiero @ 2025-01-30 20:27 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: Alex Williamson, Peter Xu, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

On 30.01.2025 21:19, Fabiano Rosas wrote:
> "Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:
> 
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> This is an updated v4 patch series of the v3 series located here:
>> https://lore.kernel.org/qemu-devel/cover.1731773021.git.maciej.szmigiero@oracle.com/
>>
>> Changes from v3:
>> * MigrationLoadThread now returns bool and an Error complex error type
>> instead of just an int.
>>
>> * qemu_loadvm_load_thread_pool now reports error via migrate_set_error()
>> instead of dedicated load_threads_ret variable.
>>
>> * Since the change above uncovered an issue with respect to multifd send
>> channels not terminating TLS session properly QIOChannelTLS now allows
>> gracefully handling this situation.
>>
>> * qemu_loadvm_load_thread_pool state is now part of MigrationIncomingState
>> instead of being stored in global variables.
>> This state now also has its own init/cleanup helpers.
>>
>> * qemu_loadvm_load_thread_pool code is now moved into a separate section
>> of the savevm.c file, marked by an appropriate comment.
>>
>> * thread_pool_free() is now documented to have wait-before-free semantic,
>> which allowed removal of explicit waits from thread pool cleanup paths.
>>
>> * thread_pool_submit_immediate() method was added since this functionality
>> is used by both generic thread pool users in this patch set.
>>
>> * postcopy_ram_listen_thread() now takes BQL around function calls that
>> ultimately call migration methods requiring BQL.
>> This fixes one of QEMU tests failing when explicitly BQL-sensitive code
>> is added later to these methods.
>>
>> * qemu_loadvm_load_state_buffer() now returns a bool value instead of int.
>>
>> * "Send final SYNC only after device state is complete" patch was
>> dropped since Peter implemented equivalent functionality upstream.
>>
>> * "Document the BQL behavior of load SaveVMHandlers" patch was dropped
>> since that's something better done later, separately from this patch set.
>>
>> * Header size is now added to mig_stats.multifd_bytes where it is actually
>> sent in the zero copy case - in multifd_nocomp_send_prepare().
>>
>> * Spurious wakeups from qemu_cond_wait() are now handled properly as
>> pointed out by Avihai.
>>
>> * VFIO migration FD now allows partial write() completion as pointed out
>> by Avihai.
>>
>> * Patch "vfio/migration: Don't run load cleanup if load setup didn't run"
>> was dropped, instead all objects related to multifd load are now located in
>> their own VFIOMultifd struct which is allocated only if multifd device state
>> transfer is actually in use.
>>
>> * Intermediate VFIOStateBuffers API as suggested by Avihai is now introduced
>> to simplify vfio_load_state_buffer() and vfio_load_bufs_thread().
>>
>> * Optional VFIO device config state loading interlocking with loading
>> other iterables is now possible due to ARM64 platform VFIO dependency on
>> interrupt controller being loaded first as pointed out by Avihai.
>>
>> * Patch "Multifd device state transfer support - receive side" was split
>> into a few smaller patches as suggested by Cédric.
>>
>> * x-migration-multifd-transfer VFIO property compat changes were moved
>> into a separate patch as suggested by Cédric.
>>
>> * Other small changes, like renamed functions and variables/members, added
>> review tags, code formatting, moved QEMU_LOCK_GUARD() instances closer to
>> actual protected blocks, etc.
>>
>> ========================================================================
>>
>> This patch set is targeting QEMU 10.0.
>>
>> What's not yet present is documentation update under docs/devel/migration
>> but I didn't want to delay posting the code any longer.
>> Such doc can still be merged later when the design is 100% finalized.
>>
>> ========================================================================
>>
>> Maciej S. Szmigiero (32):
>>    migration: Clarify that {load,save}_cleanup handlers can run without
>>      setup
>>    thread-pool: Remove thread_pool_submit() function
>>    thread-pool: Rename AIO pool functions to *_aio() and data types to
>>      *Aio
>>    thread-pool: Implement generic (non-AIO) pool support
>>    migration: Add MIG_CMD_SWITCHOVER_START and its load handler
>>    migration: Add qemu_loadvm_load_state_buffer() and its handler
>>    io: tls: Allow terminating the TLS session gracefully with EOF
>>    migration/multifd: Allow premature EOF on TLS incoming channels
>>    migration: postcopy_ram_listen_thread() needs to take BQL for some
>>      calls
>>    error: define g_autoptr() cleanup function for the Error type
>>    migration: Add thread pool of optional load threads
>>    migration/multifd: Split packet into header and RAM data
>>    migration/multifd: Device state transfer support - receive side
>>    migration/multifd: Make multifd_send() thread safe
>>    migration/multifd: Add an explicit MultiFDSendData destructor
>>    migration/multifd: Device state transfer support - send side
>>    migration/multifd: Add multifd_device_state_supported()
>>    migration: Add save_live_complete_precopy_thread handler
>>    vfio/migration: Add x-migration-load-config-after-iter VFIO property
>>    vfio/migration: Add load_device_config_state_start trace event
>>    vfio/migration: Convert bytes_transferred counter to atomic
>>    vfio/migration: Multifd device state transfer support - basic types
>>    vfio/migration: Multifd device state transfer support -
>>      VFIOStateBuffer(s)
>>    vfio/migration: Multifd device state transfer - add support checking
>>      function
>>    vfio/migration: Multifd device state transfer support - receive
>>      init/cleanup
>>    vfio/migration: Multifd device state transfer support - received
>>      buffers queuing
>>    vfio/migration: Multifd device state transfer support - load thread
>>    vfio/migration: Multifd device state transfer support - config loading
>>      support
>>    migration/qemu-file: Define g_autoptr() cleanup function for QEMUFile
>>    vfio/migration: Multifd device state transfer support - send side
>>    vfio/migration: Add x-migration-multifd-transfer VFIO property
>>    hw/core/machine: Add compat for x-migration-multifd-transfer VFIO
>>      property
>>
>> Peter Xu (1):
>>    migration/multifd: Make MultiFDSendData a struct
>>
>>   hw/core/machine.c                  |   2 +
>>   hw/vfio/migration.c                | 754 ++++++++++++++++++++++++++++-
>>   hw/vfio/pci.c                      |  14 +
>>   hw/vfio/trace-events               |  11 +-
>>   include/block/aio.h                |   8 +-
>>   include/block/thread-pool.h        |  62 ++-
>>   include/hw/vfio/vfio-common.h      |   7 +
>>   include/io/channel-tls.h           |  11 +
>>   include/migration/client-options.h |   4 +
>>   include/migration/misc.h           |  16 +
>>   include/migration/register.h       |  54 ++-
>>   include/qapi/error.h               |   2 +
>>   include/qemu/typedefs.h            |   6 +
>>   io/channel-tls.c                   |   6 +
>>   migration/colo.c                   |   3 +
>>   migration/meson.build              |   1 +
>>   migration/migration-hmp-cmds.c     |   2 +
>>   migration/migration.c              |   6 +-
>>   migration/migration.h              |   7 +
>>   migration/multifd-device-state.c   | 192 ++++++++
>>   migration/multifd-nocomp.c         |  30 +-
>>   migration/multifd.c                | 248 ++++++++--
>>   migration/multifd.h                |  74 ++-
>>   migration/options.c                |   9 +
>>   migration/qemu-file.h              |   2 +
>>   migration/savevm.c                 | 195 +++++++-
>>   migration/savevm.h                 |   6 +-
>>   migration/trace-events             |   1 +
>>   scripts/analyze-migration.py       |  11 +
>>   tests/unit/test-thread-pool.c      |   6 +-
>>   util/async.c                       |   6 +-
>>   util/thread-pool.c                 | 184 +++++--
>>   util/trace-events                  |   6 +-
>>   33 files changed, 1814 insertions(+), 132 deletions(-)
>>   create mode 100644 migration/multifd-device-state.c
> 
> Hi!
> 
> We have build issues:
> 
> https://gitlab.com/farosas/qemu/-/pipelines/1649146958
> 

Looks like the issue is that qatomic operations on the
64-bit VFIO bytes-transferred counter aren't available
on 32-bit host platforms.

The easiest fix would probably be to change these to
32-bit counters on 32-bit platforms, since such hosts
can't realistically address more memory anyway.

> And the postcopy/recovery test is failing. It seems the migration
> finishes before the test can issue migrate-pause:
> 
> QTEST_QEMU_BINARY=./qemu-system-x86_64  ./tests/qtest/migration-test -p
> /x86_64/migration/postcopy/recovery/plain
> ...
> {"execute": "migrate-start-postcopy"}
> {"return": {}}
> {"secs": 1738267018, "usecs": 860991}, "event": "MIGRATION", "data": {"status": "postcopy-active"}
> {"secs": 1738267018, "usecs": 861284}, "event": "STOP"
> {"secs": 1738267017, "usecs": 960322}, "event": "MIGRATION", "data": {"status": "active"}
> {"secs": 1738267018, "usecs": 865589}, "event": "MIGRATION", "data": {"status": "postcopy-active"}
> {"secs": 1738267099, "usecs": 120971}, "event": "MIGRATION", "data": {"status": "completed"}
> {"secs": 1738267099, "usecs": 121154}, "event": "RESUME"
> {"execute": "query-migrate"}
> 
> ERROR:../tests/qtest/migration/migration-qmp.c:172:check_migration_status:
> assertion failed (current_status != "completed"): ("completed" !=
> "completed")
> 

Hmm, it looks like this failure wasn't showing
in my tests because the test was skipped due to
missing userfaultfd support:

$ QTEST_QEMU_BINARY=./qemu-system-x86_64 ./tests/qtest/migration-test -p /x86_64/migration/postcopy/recovery/plain
TAP version 14
# random seed: R02Sc99a7d93274064bb87f3e0789fbf8326
# Skipping test: userfaultfd not available
# Start of x86_64 tests
# Start of migration tests
# End of migration tests
# End of x86_64 tests
1..0

Will try to make this test run and investigate the reason for
failure.

Thanks,
Maciej




* Re: [PATCH v4 00/33] Multifd 🔀 device state transfer support with VFIO consumer
  2025-01-30 20:27   ` Maciej S. Szmigiero
@ 2025-01-30 20:46     ` Fabiano Rosas
  2025-01-31 18:16     ` Maciej S. Szmigiero
  1 sibling, 0 replies; 137+ messages in thread
From: Fabiano Rosas @ 2025-01-30 20:46 UTC (permalink / raw)
  To: Maciej S. Szmigiero
  Cc: Alex Williamson, Peter Xu, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

"Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:

> On 30.01.2025 21:19, Fabiano Rosas wrote:
>> "Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:
>> 
>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>
>>> This is an updated v4 patch series of the v3 series located here:
>>> https://lore.kernel.org/qemu-devel/cover.1731773021.git.maciej.szmigiero@oracle.com/
>>>
>>> Changes from v3:
>>> * MigrationLoadThread now returns bool and an Error complex error type
>>> instead of just an int.
>>>
>>> * qemu_loadvm_load_thread_pool now reports error via migrate_set_error()
>>> instead of dedicated load_threads_ret variable.
>>>
>>> * Since the change above uncovered an issue with respect to multifd send
>>> channels not terminating TLS session properly QIOChannelTLS now allows
>>> gracefully handling this situation.
>>>
>>> * qemu_loadvm_load_thread_pool state is now part of MigrationIncomingState
>>> instead of being stored in global variables.
>>> This state now also has its own init/cleanup helpers.
>>>
>>> * qemu_loadvm_load_thread_pool code is now moved into a separate section
>>> of the savevm.c file, marked by an appropriate comment.
>>>
>>> * thread_pool_free() is now documented to have wait-before-free semantic,
>>> which allowed removal of explicit waits from thread pool cleanup paths.
>>>
>>> * thread_pool_submit_immediate() method was added since this functionality
>>> is used by both generic thread pool users in this patch set.
>>>
>>> * postcopy_ram_listen_thread() now takes BQL around function calls that
>>> ultimately call migration methods requiring BQL.
>>> This fixes one of QEMU tests failing when explicitly BQL-sensitive code
>>> is added later to these methods.
>>>
>>> * qemu_loadvm_load_state_buffer() now returns a bool value instead of int.
>>>
>>> * "Send final SYNC only after device state is complete" patch was
>>> dropped since Peter implemented equivalent functionality upstream.
>>>
>>> * "Document the BQL behavior of load SaveVMHandlers" patch was dropped
>>> since that's something better done later, separately from this patch set.
>>>
>>> * Header size is now added to mig_stats.multifd_bytes where it is actually
>>> sent in the zero copy case - in multifd_nocomp_send_prepare().
>>>
>>> * Spurious wakeups from qemu_cond_wait() are now handled properly as
>>> pointed out by Avihai.
>>>
>>> * VFIO migration FD now allows partial write() completion as pointed out
>>> by Avihai.
>>>
>>> * Patch "vfio/migration: Don't run load cleanup if load setup didn't run"
>>> was dropped, instead all objects related to multifd load are now located in
>>> their own VFIOMultifd struct which is allocated only if multifd device state
>>> transfer is actually in use.
>>>
>>> * Intermediate VFIOStateBuffers API as suggested by Avihai is now introduced
>>> to simplify vfio_load_state_buffer() and vfio_load_bufs_thread().
>>>
>>> * Optional VFIO device config state loading interlocking with loading
>>> other iterables is now possible due to ARM64 platform VFIO dependency on
>>> interrupt controller being loaded first as pointed out by Avihai.
>>>
>>> * Patch "Multifd device state transfer support - receive side" was split
>>> into a few smaller patches as suggested by Cédric.
>>>
>>> * x-migration-multifd-transfer VFIO property compat changes were moved
>>> into a separate patch as suggested by Cédric.
>>>
>>> * Other small changes, like renamed functions and variables/members, added
>>> review tags, code formatting, moved QEMU_LOCK_GUARD() instances closer to
>>> actual protected blocks, etc.
>>>
>>> ========================================================================
>>>
>>> This patch set is targeting QEMU 10.0.
>>>
>>> What's not yet present is documentation update under docs/devel/migration
>>> but I didn't want to delay posting the code any longer.
>>> Such doc can still be merged later when the design is 100% finalized.
>>>
>>> ========================================================================
>>>
>>> Maciej S. Szmigiero (32):
>>>    migration: Clarify that {load,save}_cleanup handlers can run without
>>>      setup
>>>    thread-pool: Remove thread_pool_submit() function
>>>    thread-pool: Rename AIO pool functions to *_aio() and data types to
>>>      *Aio
>>>    thread-pool: Implement generic (non-AIO) pool support
>>>    migration: Add MIG_CMD_SWITCHOVER_START and its load handler
>>>    migration: Add qemu_loadvm_load_state_buffer() and its handler
>>>    io: tls: Allow terminating the TLS session gracefully with EOF
>>>    migration/multifd: Allow premature EOF on TLS incoming channels
>>>    migration: postcopy_ram_listen_thread() needs to take BQL for some
>>>      calls
>>>    error: define g_autoptr() cleanup function for the Error type
>>>    migration: Add thread pool of optional load threads
>>>    migration/multifd: Split packet into header and RAM data
>>>    migration/multifd: Device state transfer support - receive side
>>>    migration/multifd: Make multifd_send() thread safe
>>>    migration/multifd: Add an explicit MultiFDSendData destructor
>>>    migration/multifd: Device state transfer support - send side
>>>    migration/multifd: Add multifd_device_state_supported()
>>>    migration: Add save_live_complete_precopy_thread handler
>>>    vfio/migration: Add x-migration-load-config-after-iter VFIO property
>>>    vfio/migration: Add load_device_config_state_start trace event
>>>    vfio/migration: Convert bytes_transferred counter to atomic
>>>    vfio/migration: Multifd device state transfer support - basic types
>>>    vfio/migration: Multifd device state transfer support -
>>>      VFIOStateBuffer(s)
>>>    vfio/migration: Multifd device state transfer - add support checking
>>>      function
>>>    vfio/migration: Multifd device state transfer support - receive
>>>      init/cleanup
>>>    vfio/migration: Multifd device state transfer support - received
>>>      buffers queuing
>>>    vfio/migration: Multifd device state transfer support - load thread
>>>    vfio/migration: Multifd device state transfer support - config loading
>>>      support
>>>    migration/qemu-file: Define g_autoptr() cleanup function for QEMUFile
>>>    vfio/migration: Multifd device state transfer support - send side
>>>    vfio/migration: Add x-migration-multifd-transfer VFIO property
>>>    hw/core/machine: Add compat for x-migration-multifd-transfer VFIO
>>>      property
>>>
>>> Peter Xu (1):
>>>    migration/multifd: Make MultiFDSendData a struct
>>>
>>>   hw/core/machine.c                  |   2 +
>>>   hw/vfio/migration.c                | 754 ++++++++++++++++++++++++++++-
>>>   hw/vfio/pci.c                      |  14 +
>>>   hw/vfio/trace-events               |  11 +-
>>>   include/block/aio.h                |   8 +-
>>>   include/block/thread-pool.h        |  62 ++-
>>>   include/hw/vfio/vfio-common.h      |   7 +
>>>   include/io/channel-tls.h           |  11 +
>>>   include/migration/client-options.h |   4 +
>>>   include/migration/misc.h           |  16 +
>>>   include/migration/register.h       |  54 ++-
>>>   include/qapi/error.h               |   2 +
>>>   include/qemu/typedefs.h            |   6 +
>>>   io/channel-tls.c                   |   6 +
>>>   migration/colo.c                   |   3 +
>>>   migration/meson.build              |   1 +
>>>   migration/migration-hmp-cmds.c     |   2 +
>>>   migration/migration.c              |   6 +-
>>>   migration/migration.h              |   7 +
>>>   migration/multifd-device-state.c   | 192 ++++++++
>>>   migration/multifd-nocomp.c         |  30 +-
>>>   migration/multifd.c                | 248 ++++++++--
>>>   migration/multifd.h                |  74 ++-
>>>   migration/options.c                |   9 +
>>>   migration/qemu-file.h              |   2 +
>>>   migration/savevm.c                 | 195 +++++++-
>>>   migration/savevm.h                 |   6 +-
>>>   migration/trace-events             |   1 +
>>>   scripts/analyze-migration.py       |  11 +
>>>   tests/unit/test-thread-pool.c      |   6 +-
>>>   util/async.c                       |   6 +-
>>>   util/thread-pool.c                 | 184 +++++--
>>>   util/trace-events                  |   6 +-
>>>   33 files changed, 1814 insertions(+), 132 deletions(-)
>>>   create mode 100644 migration/multifd-device-state.c
>> 
>> Hi!
>> 
>> We have build issues:
>> 
>> https://gitlab.com/farosas/qemu/-/pipelines/1649146958
>> 
>
> Looks like the issue is that qatomic operations on the
> 64-bit VFIO bytes-transferred counter aren't available
> on 32-bit host platforms.
>
> The easiest fix would probably be to change these to
> 32-bit counters on 32-bit platforms, since such hosts
> can't realistically address more memory anyway.
>
>> And the postcopy/recovery test is failing. It seems the migration
>> finishes before the test can issue migrate-pause:
>> 
>> QTEST_QEMU_BINARY=./qemu-system-x86_64  ./tests/qtest/migration-test -p
>> /x86_64/migration/postcopy/recovery/plain
>> ...
>> {"execute": "migrate-start-postcopy"}
>> {"return": {}}
>> {"secs": 1738267018, "usecs": 860991}, "event": "MIGRATION", "data": {"status": "postcopy-active"}
>> {"secs": 1738267018, "usecs": 861284}, "event": "STOP"
>> {"secs": 1738267017, "usecs": 960322}, "event": "MIGRATION", "data": {"status": "active"}
>> {"secs": 1738267018, "usecs": 865589}, "event": "MIGRATION", "data": {"status": "postcopy-active"}
>> {"secs": 1738267099, "usecs": 120971}, "event": "MIGRATION", "data": {"status": "completed"}
>> {"secs": 1738267099, "usecs": 121154}, "event": "RESUME"
>> {"execute": "query-migrate"}
>> 
>> ERROR:../tests/qtest/migration/migration-qmp.c:172:check_migration_status:
>> assertion failed (current_status != "completed"): ("completed" !=
>> "completed")
>> 
>
> Hmm, it looks like this failure wasn't showing
> in my tests because the test was skipped due to
> missing userfaultfd support:
>
> $ QTEST_QEMU_BINARY=./qemu-system-x86_64 ./tests/qtest/migration-test -p /x86_64/migration/postcopy/recovery/plain
> TAP version 14
> # random seed: R02Sc99a7d93274064bb87f3e0789fbf8326
> # Skipping test: userfaultfd not available
> # Start of x86_64 tests
> # Start of migration tests
> # End of migration tests
> # End of x86_64 tests
> 1..0
>
> Will try to make this test run and investigate the reason for
> failure.

This will probably help:

sysctl -w vm.unprivileged_userfaultfd=1

I also had broken userfaultfd detection in the tests a while back. But
it's fixed now.

>
> Thanks,
> Maciej



* Re: [PATCH v4 22/33] vfio/migration: Convert bytes_transferred counter to atomic
  2025-01-30 10:08 ` [PATCH v4 22/33] vfio/migration: Convert bytes_transferred counter to atomic Maciej S. Szmigiero
@ 2025-01-30 21:35   ` Cédric Le Goater
  2025-01-31  9:47     ` Maciej S. Szmigiero
  0 siblings, 1 reply; 137+ messages in thread
From: Cédric Le Goater @ 2025-01-30 21:35 UTC (permalink / raw)
  To: Maciej S. Szmigiero, Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Eric Blake, Markus Armbruster,
	Daniel P . Berrangé, Avihai Horon, Joao Martins, qemu-devel

On 1/30/25 11:08, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> 
> So it can be safely accessed from multiple threads.
> 
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> ---
>   hw/vfio/migration.c | 6 +++---
>   1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index f5df5ef17080..cbb1e0b6f852 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -416,7 +416,7 @@ static ssize_t vfio_save_block(QEMUFile *f, VFIOMigration *migration)
>       qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
>       qemu_put_be64(f, data_size);
>       qemu_put_buffer(f, migration->data_buffer, data_size);
> -    bytes_transferred += data_size;
> +    qatomic_add(&bytes_transferred, data_size);

bytes_transferred should be of type aligned_uint64_t

>   
>       trace_vfio_save_block(migration->vbasedev->name, data_size);
>   
> @@ -1038,12 +1038,12 @@ static int vfio_block_migration(VFIODevice *vbasedev, Error *err, Error **errp)
>   
>   int64_t vfio_mig_bytes_transferred(void)
>   {
> -    return bytes_transferred;
> +    return qatomic_read(&bytes_transferred);

please use qatomic_read_u64()

>   }
>   
>   void vfio_reset_bytes_transferred(void)
>   {
> -    bytes_transferred = 0;
> +    qatomic_set(&bytes_transferred, 0);

and qatomic_set_u64().

No need to resend for that (yet) but it might explain the test issues.

Thanks,

C.


>   }
>   
>   /*
> 



^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH v4 22/33] vfio/migration: Convert bytes_transferred counter to atomic
  2025-01-30 21:35   ` Cédric Le Goater
@ 2025-01-31  9:47     ` Maciej S. Szmigiero
  0 siblings, 0 replies; 137+ messages in thread
From: Maciej S. Szmigiero @ 2025-01-31  9:47 UTC (permalink / raw)
  To: Cédric Le Goater, Fabiano Rosas
  Cc: Alex Williamson, Peter Xu, Eric Blake, Markus Armbruster,
	Daniel P . Berrangé, Avihai Horon, Joao Martins, qemu-devel

On 30.01.2025 22:35, Cédric Le Goater wrote:
> On 1/30/25 11:08, Maciej S. Szmigiero wrote:
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> So it can be safely accessed from multiple threads.
>>
>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>> ---
>>   hw/vfio/migration.c | 6 +++---
>>   1 file changed, 3 insertions(+), 3 deletions(-)
>>
>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>> index f5df5ef17080..cbb1e0b6f852 100644
>> --- a/hw/vfio/migration.c
>> +++ b/hw/vfio/migration.c
>> @@ -416,7 +416,7 @@ static ssize_t vfio_save_block(QEMUFile *f, VFIOMigration *migration)
>>       qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
>>       qemu_put_be64(f, data_size);
>>       qemu_put_buffer(f, migration->data_buffer, data_size);
>> -    bytes_transferred += data_size;
>> +    qatomic_add(&bytes_transferred, data_size);
> 
> bytes_transferred should be of type aligned_uint64_t
> 
>>       trace_vfio_save_block(migration->vbasedev->name, data_size);
>> @@ -1038,12 +1038,12 @@ static int vfio_block_migration(VFIODevice *vbasedev, Error *err, Error **errp)
>>   int64_t vfio_mig_bytes_transferred(void)
>>   {
>> -    return bytes_transferred;
>> +    return qatomic_read(&bytes_transferred);
> 
> please use qatomic_read_u64()
> 
>>   }
>>   void vfio_reset_bytes_transferred(void)
>>   {
>> -    bytes_transferred = 0;
>> +    qatomic_set(&bytes_transferred, 0);
> 
> and qatomic_set_u64().
> 

Unfortunately, using aligned_uint64_t does not work since there's
no qatomic_add_u64().

I think the best fix here is to use an "unsigned long" - it's 64-bit on
64-bit Unix systems and 32-bit on 32-bit ones.

On the other hand it's 32-bit on 64-bit Windows but Windows does not
use VFIO anyway.

> No need to resend for that (yet) but it might explain the test issues.
> 
> Thanks,
> 
> C.

Thanks,
Maciej



^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH v4 00/33] Multifd device state transfer support with VFIO consumer
  2025-01-30 20:27   ` Maciej S. Szmigiero
  2025-01-30 20:46     ` Fabiano Rosas
@ 2025-01-31 18:16     ` Maciej S. Szmigiero
  1 sibling, 0 replies; 137+ messages in thread
From: Maciej S. Szmigiero @ 2025-01-31 18:16 UTC (permalink / raw)
  To: Fabiano Rosas, Peter Xu
  Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

On 30.01.2025 21:27, Maciej S. Szmigiero wrote:
> On 30.01.2025 21:19, Fabiano Rosas wrote:
>> "Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:
>>
>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>
>>> This is an updated v4 patch series of the v3 series located here:
>>> https://lore.kernel.org/qemu-devel/cover.1731773021.git.maciej.szmigiero@oracle.com/
>>>
>>> Changes from v3:
>>> * MigrationLoadThread now returns bool and an Error complex error type
>>> instead of just an int.
>>>
>>> * qemu_loadvm_load_thread_pool now reports error via migrate_set_error()
>>> instead of dedicated load_threads_ret variable.
>>>
>>> * Since the change above uncovered an issue with respect to multifd send
>>> channels not terminating TLS session properly QIOChannelTLS now allows
>>> gracefully handling this situation.
>>>
>>> * qemu_loadvm_load_thread_pool state is now part of MigrationIncomingState
>>> instead of being stored in global variables.
>>> This state now also has its own init/cleanup helpers.
>>>
>>> * qemu_loadvm_load_thread_pool code is now moved into a separate section
>>> of the savevm.c file, marked by an appropriate comment.
>>>
>>> * thread_pool_free() is now documented to have wait-before-free semantic,
>>> which allowed removal of explicit waits from thread pool cleanup paths.
>>>
>>> * thread_pool_submit_immediate() method was added since this functionality
>>> is used by both generic thread pool users in this patch set.
>>>
>>> * postcopy_ram_listen_thread() now takes BQL around function calls that
>>> ultimately call migration methods requiring BQL.
>>> This fixes one of QEMU tests failing when explicitly BQL-sensitive code
>>> is added later to these methods.
>>>
>>> * qemu_loadvm_load_state_buffer() now returns a bool value instead of int.
>>>
>>> * "Send final SYNC only after device state is complete" patch was
>>> dropped since Peter implemented equivalent functionality upstream.
>>>
>>> * "Document the BQL behavior of load SaveVMHandlers" patch was dropped
>>> since that's something better done later, separately from this patch set.
>>>
>>> * Header size is now added to mig_stats.multifd_bytes where it is actually
>>> sent in the zero copy case - in multifd_nocomp_send_prepare().
>>>
>>> * Spurious wakeups from qemu_cond_wait() are now handled properly as
>>> pointed out by Avihai.
>>>
>>> * VFIO migration FD now allows partial write() completion as pointed out
>>> by Avihai.
>>>
>>> * Patch "vfio/migration: Don't run load cleanup if load setup didn't run"
>>> was dropped, instead all objects related to multifd load are now located in
>>> their own VFIOMultifd struct which is allocated only if multifd device state
>>> transfer is actually in use.
>>>
>>> * Intermediate VFIOStateBuffers API as suggested by Avihai is now introduced
>>> to simplify vfio_load_state_buffer() and vfio_load_bufs_thread().
>>>
>>> * Optional VFIO device config state loading interlocking with loading
>>> other iterables is now possible due to ARM64 platform VFIO dependency on
>>> interrupt controller being loaded first as pointed out by Avihai.
>>>
>>> * Patch "Multifd device state transfer support - receive side" was split
>>> into a few smaller patches as suggested by Cédric.
>>>
>>> * x-migration-multifd-transfer VFIO property compat changes were moved
>>> into a separate patch as suggested by Cédric.
>>>
>>> * Other small changes, like renamed functions and variables/members, added
>>> review tags, code formatting, moved QEMU_LOCK_GUARD() instances closer to
>>> actual protected blocks, etc.
>>>
>>> ========================================================================
>>>
>>> This patch set is targeting QEMU 10.0.
>>>
>>> What's not yet present is documentation update under docs/devel/migration
>>> but I didn't want to delay posting the code any longer.
>>> Such doc can still be merged later when the design is 100% finalized.
>>>
>>> ========================================================================
>>>
>>> Maciej S. Szmigiero (32):
>>>    migration: Clarify that {load,save}_cleanup handlers can run without
>>>      setup
>>>    thread-pool: Remove thread_pool_submit() function
>>>    thread-pool: Rename AIO pool functions to *_aio() and data types to
>>>      *Aio
>>>    thread-pool: Implement generic (non-AIO) pool support
>>>    migration: Add MIG_CMD_SWITCHOVER_START and its load handler
>>>    migration: Add qemu_loadvm_load_state_buffer() and its handler
>>>    io: tls: Allow terminating the TLS session gracefully with EOF
>>>    migration/multifd: Allow premature EOF on TLS incoming channels
>>>    migration: postcopy_ram_listen_thread() needs to take BQL for some
>>>      calls
>>>    error: define g_autoptr() cleanup function for the Error type
>>>    migration: Add thread pool of optional load threads
>>>    migration/multifd: Split packet into header and RAM data
>>>    migration/multifd: Device state transfer support - receive side
>>>    migration/multifd: Make multifd_send() thread safe
>>>    migration/multifd: Add an explicit MultiFDSendData destructor
>>>    migration/multifd: Device state transfer support - send side
>>>    migration/multifd: Add multifd_device_state_supported()
>>>    migration: Add save_live_complete_precopy_thread handler
>>>    vfio/migration: Add x-migration-load-config-after-iter VFIO property
>>>    vfio/migration: Add load_device_config_state_start trace event
>>>    vfio/migration: Convert bytes_transferred counter to atomic
>>>    vfio/migration: Multifd device state transfer support - basic types
>>>    vfio/migration: Multifd device state transfer support -
>>>      VFIOStateBuffer(s)
>>>    vfio/migration: Multifd device state transfer - add support checking
>>>      function
>>>    vfio/migration: Multifd device state transfer support - receive
>>>      init/cleanup
>>>    vfio/migration: Multifd device state transfer support - received
>>>      buffers queuing
>>>    vfio/migration: Multifd device state transfer support - load thread
>>>    vfio/migration: Multifd device state transfer support - config loading
>>>      support
>>>    migration/qemu-file: Define g_autoptr() cleanup function for QEMUFile
>>>    vfio/migration: Multifd device state transfer support - send side
>>>    vfio/migration: Add x-migration-multifd-transfer VFIO property
>>>    hw/core/machine: Add compat for x-migration-multifd-transfer VFIO
>>>      property
>>>
>>> Peter Xu (1):
>>>    migration/multifd: Make MultiFDSendData a struct
>>>
>>>   hw/core/machine.c                  |   2 +
>>>   hw/vfio/migration.c                | 754 ++++++++++++++++++++++++++++-
>>>   hw/vfio/pci.c                      |  14 +
>>>   hw/vfio/trace-events               |  11 +-
>>>   include/block/aio.h                |   8 +-
>>>   include/block/thread-pool.h        |  62 ++-
>>>   include/hw/vfio/vfio-common.h      |   7 +
>>>   include/io/channel-tls.h           |  11 +
>>>   include/migration/client-options.h |   4 +
>>>   include/migration/misc.h           |  16 +
>>>   include/migration/register.h       |  54 ++-
>>>   include/qapi/error.h               |   2 +
>>>   include/qemu/typedefs.h            |   6 +
>>>   io/channel-tls.c                   |   6 +
>>>   migration/colo.c                   |   3 +
>>>   migration/meson.build              |   1 +
>>>   migration/migration-hmp-cmds.c     |   2 +
>>>   migration/migration.c              |   6 +-
>>>   migration/migration.h              |   7 +
>>>   migration/multifd-device-state.c   | 192 ++++++++
>>>   migration/multifd-nocomp.c         |  30 +-
>>>   migration/multifd.c                | 248 ++++++++--
>>>   migration/multifd.h                |  74 ++-
>>>   migration/options.c                |   9 +
>>>   migration/qemu-file.h              |   2 +
>>>   migration/savevm.c                 | 195 +++++++-
>>>   migration/savevm.h                 |   6 +-
>>>   migration/trace-events             |   1 +
>>>   scripts/analyze-migration.py       |  11 +
>>>   tests/unit/test-thread-pool.c      |   6 +-
>>>   util/async.c                       |   6 +-
>>>   util/thread-pool.c                 | 184 +++++--
>>>   util/trace-events                  |   6 +-
>>>   33 files changed, 1814 insertions(+), 132 deletions(-)
>>>   create mode 100644 migration/multifd-device-state.c
>>
>> Hi!
>>
>> We have build issues:
>>
>> https://gitlab.com/farosas/qemu/-/pipelines/1649146958
>>
> 
> Looks like the issue is that the qatomics used on the 64-bit
> VFIO bytes-transferred counters aren't available on
> 32-bit host platforms.
> 
> The easiest fix would probably be to change these to
> 32-bit counters on 32-bit platforms since such hosts can't
> realistically address more memory anyway.

Updated the patch to use "unsigned long" counter instead:
https://gitlab.com/maciejsszmigiero/qemu/-/commit/e42b16d2009067bff5a11936aece8a7af2436dc4

>> And the postcopy/recovery test is failing. It seems the migration
>> finishes before the test can issue migrate-pause:
>>
>> QTEST_QEMU_BINARY=./qemu-system-x86_64  ./tests/qtest/migration-test -p
>> /x86_64/migration/postcopy/recovery/plain
>> ...
>> {"execute": "migrate-start-postcopy"}
>> {"return": {}}
>> {"secs": 1738267018, "usecs": 860991}, "event": "MIGRATION", "data": {"status": "postcopy-active"}
>> {"secs": 1738267018, "usecs": 861284}, "event": "STOP"
>> {"secs": 1738267017, "usecs": 960322}, "event": "MIGRATION", "data": {"status": "active"}
>> {"secs": 1738267018, "usecs": 865589}, "event": "MIGRATION", "data": {"status": "postcopy-active"}
>> {"secs": 1738267099, "usecs": 120971}, "event": "MIGRATION", "data": {"status": "completed"}
>> {"secs": 1738267099, "usecs": 121154}, "event": "RESUME"
>> {"execute": "query-migrate"}
>>
>> ERROR:../tests/qtest/migration/migration-qmp.c:172:check_migration_status:
>> assertion failed (current_status != "completed"): ("completed" !=
>> "completed")
>>
> 
> Hmm, it looks like this failure wasn't showing
> in my tests because the test was skipped due to
> missing userfaultfd support:
> 
> $ QTEST_QEMU_BINARY=./qemu-system-x86_64 ./tests/qtest/migration-test -p /x86_64/migration/postcopy/recovery/plain
> TAP version 14
> # random seed: R02Sc99a7d93274064bb87f3e0789fbf8326
> # Skipping test: userfaultfd not available
> # Start of x86_64 tests
> # Start of migration tests
> # End of migration tests
> # End of x86_64 tests
> 1..0
> 
> Will try to make this test run and investigate the reason for
> failure.

It looks like the issue here is that holding BQL around
qemu_loadvm_state_main() in postcopy_ram_listen_thread() causes
blocking QEMUFile operations in qemu_loadvm_state_main()
(and its children) to effectively block the whole QEMU while they
are waiting for I/O.

This causes the test to fail: when that qemu_loadvm_state_main()
call finishes and the postcopy thread relinquishes the BQL, the migration
state immediately reaches "completed" without giving the test a chance
to abort the migration attempt.

I still think that that qemu_loadvm_state_main() call needs the BQL,
since every other caller of it seems to hold it, and
qemu_loadvm_state_main() ultimately calls the "load_state" SaveVMHandlers,
which would otherwise have inconsistent BQL semantics.

Since only the second BQL lock in postcopy_ram_listen_thread()
(the one around migration_incoming_state_destroy()) is technically
necessary for other parts of this patch set I have "downgraded"
that other BQL around qemu_loadvm_state_main() to a TODO remark
for now so postcopy tests now pass:
https://gitlab.com/maciejsszmigiero/qemu/-/commit/005a79953aaa75cad160b95252ba421122d5a6a4

Thanks,
Maciej



^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH v4 09/33] migration: postcopy_ram_listen_thread() needs to take BQL for some calls
  2025-01-30 10:08 ` [PATCH v4 09/33] migration: postcopy_ram_listen_thread() needs to take BQL for some calls Maciej S. Szmigiero
@ 2025-02-02  2:06   ` Dr. David Alan Gilbert
  2025-02-02 11:55     ` Maciej S. Szmigiero
  0 siblings, 1 reply; 137+ messages in thread
From: Dr. David Alan Gilbert @ 2025-02-02  2:06 UTC (permalink / raw)
  To: Maciej S. Szmigiero
  Cc: Peter Xu, Fabiano Rosas, Alex Williamson, Cédric Le Goater,
	Eric Blake, Markus Armbruster, Daniel P . Berrangé,
	Avihai Horon, Joao Martins, qemu-devel

* Maciej S. Szmigiero (mail@maciej.szmigiero.name) wrote:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> 
> postcopy_ram_listen_thread() is a free running thread, so it needs to
> take BQL around function calls to migration methods requiring BQL.
> 
> qemu_loadvm_state_main() needs BQL held since it ultimately calls
> "load_state" SaveVMHandlers.
> 
> migration_incoming_state_destroy() needs BQL held since it ultimately calls
> "load_cleanup" SaveVMHandlers.
> 
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> ---
>  migration/savevm.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/migration/savevm.c b/migration/savevm.c
> index b0b74140daea..0ceea9638cc1 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -2013,7 +2013,9 @@ static void *postcopy_ram_listen_thread(void *opaque)
>       * in qemu_file, and thus we must be blocking now.
>       */
>      qemu_file_set_blocking(f, true);
> +    bql_lock();
>      load_res = qemu_loadvm_state_main(f, mis);
> +    bql_unlock();

Doesn't that leave that held for a heck of a long time?
That RAM loading has to happen in parallel with the loading of
devices doesn't it - especially if one of the devices
being loaded touches RAM.

(I wish this series had a description in the cover letter!)

Dave


>      /*
>       * This is tricky, but, mis->from_src_file can change after it
> @@ -2073,7 +2075,9 @@ static void *postcopy_ram_listen_thread(void *opaque)
>       * (If something broke then qemu will have to exit anyway since it's
>       * got a bad migration state).
>       */
> +    bql_lock();
>      migration_incoming_state_destroy();
> +    bql_unlock();
>  
>      rcu_unregister_thread();
>      mis->have_listen_thread = false;
> 
-- 
 -----Open up your eyes, open up your mind, open up your code -------   
/ Dr. David Alan Gilbert    |       Running GNU/Linux       | Happy  \ 
\        dave @ treblig.org |                               | In Hex /
 \ _________________________|_____ http://www.treblig.org   |_______/


^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH v4 09/33] migration: postcopy_ram_listen_thread() needs to take BQL for some calls
  2025-02-02  2:06   ` Dr. David Alan Gilbert
@ 2025-02-02 11:55     ` Maciej S. Szmigiero
  2025-02-02 12:45       ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 137+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-02 11:55 UTC (permalink / raw)
  To: Dr. David Alan Gilbert, Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P. Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

On 2.02.2025 03:06, Dr. David Alan Gilbert wrote:
> * Maciej S. Szmigiero (mail@maciej.szmigiero.name) wrote:
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> postcopy_ram_listen_thread() is a free running thread, so it needs to
>> take BQL around function calls to migration methods requiring BQL.
>>
>> qemu_loadvm_state_main() needs BQL held since it ultimately calls
>> "load_state" SaveVMHandlers.
>>
>> migration_incoming_state_destroy() needs BQL held since it ultimately calls
>> "load_cleanup" SaveVMHandlers.
>>
>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>> ---
>>   migration/savevm.c | 4 ++++
>>   1 file changed, 4 insertions(+)
>>
>> diff --git a/migration/savevm.c b/migration/savevm.c
>> index b0b74140daea..0ceea9638cc1 100644
>> --- a/migration/savevm.c
>> +++ b/migration/savevm.c
>> @@ -2013,7 +2013,9 @@ static void *postcopy_ram_listen_thread(void *opaque)
>>        * in qemu_file, and thus we must be blocking now.
>>        */
>>       qemu_file_set_blocking(f, true);
>> +    bql_lock();
>>       load_res = qemu_loadvm_state_main(f, mis);
>> +    bql_unlock();
> 
> Doesn't that leave that held for a heck of a long time?

Yes, and it effectively broke the "postcopy recover" test, but I
think the reason for that is that qemu_loadvm_state_main() and
its children don't drop the BQL while waiting for I/O.

I've described this case in more detail in my reply to Fabiano here:
https://lore.kernel.org/qemu-devel/0a09e627-955e-4f26-8d08-0192ecd250a8@maciej.szmigiero.name/

I still think that "load_state" SaveVMHandlers need to be called
with BQL held since implementations apparently expect it that way:
for example, I think PCI device configuration restore calls
address space manipulation methods which abort() if called
without BQL held.

I have previously even submitted a patch to explicitly document
"load_state" SaveVMHandler as requiring BQL (which was also
included in the previous version of this patch set) and it
received a "Reviewed-by:" tag:
https://lore.kernel.org/qemu-devel/6976f129df610c8207da4e531c8c0475ec204fa4.1730203967.git.maciej.szmigiero@oracle.com/
https://lore.kernel.org/qemu-devel/e1949839932efaa531e2fe63ac13324e5787439c.1731773021.git.maciej.szmigiero@oracle.com/
https://lore.kernel.org/qemu-devel/87o732bti7.fsf@suse.de/

It's also worth noting that COLO equivalent of postcopy
incoming thread (colo_process_incoming_thread()) explicitly
takes BQL around qemu_loadvm_state_main():
>     bql_lock();
>     cpu_synchronize_all_states();
>     ret = qemu_loadvm_state_main(mis->from_src_file, mis);
>     bql_unlock();


> That RAM loading has to happen in parallel with the loading of
> devices doesn't it - especially if one of the devices
> being loaded touches RAM.
> 
> (I wish this series had a description in the cover letter!)

I guess you mean "more detailed description" since there's
a paragraph about this patch in this series' cover letter change log:
> * postcopy_ram_listen_thread() now takes BQL around function calls that
> ultimately call migration methods requiring BQL.
> This fixes one of QEMU tests failing when explicitly BQL-sensitive code
> is added later to these methods.

> Dave

Thanks,
Maciej



^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH v4 09/33] migration: postcopy_ram_listen_thread() needs to take BQL for some calls
  2025-02-02 11:55     ` Maciej S. Szmigiero
@ 2025-02-02 12:45       ` Dr. David Alan Gilbert
  2025-02-03 13:57         ` Maciej S. Szmigiero
  0 siblings, 1 reply; 137+ messages in thread
From: Dr. David Alan Gilbert @ 2025-02-02 12:45 UTC (permalink / raw)
  To: Maciej S. Szmigiero
  Cc: Peter Xu, Fabiano Rosas, Alex Williamson, Cédric Le Goater,
	Eric Blake, Markus Armbruster, Daniel P. Berrangé,
	Avihai Horon, Joao Martins, qemu-devel

* Maciej S. Szmigiero (mail@maciej.szmigiero.name) wrote:
> On 2.02.2025 03:06, Dr. David Alan Gilbert wrote:
> > * Maciej S. Szmigiero (mail@maciej.szmigiero.name) wrote:
> > > From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> > > 
> > > postcopy_ram_listen_thread() is a free running thread, so it needs to
> > > take BQL around function calls to migration methods requiring BQL.
> > > 
> > > qemu_loadvm_state_main() needs BQL held since it ultimately calls
> > > "load_state" SaveVMHandlers.
> > > 
> > > migration_incoming_state_destroy() needs BQL held since it ultimately calls
> > > "load_cleanup" SaveVMHandlers.
> > > 
> > > Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> > > ---
> > >   migration/savevm.c | 4 ++++
> > >   1 file changed, 4 insertions(+)
> > > 
> > > diff --git a/migration/savevm.c b/migration/savevm.c
> > > index b0b74140daea..0ceea9638cc1 100644
> > > --- a/migration/savevm.c
> > > +++ b/migration/savevm.c
> > > @@ -2013,7 +2013,9 @@ static void *postcopy_ram_listen_thread(void *opaque)
> > >        * in qemu_file, and thus we must be blocking now.
> > >        */
> > >       qemu_file_set_blocking(f, true);
> > > +    bql_lock();
> > >       load_res = qemu_loadvm_state_main(f, mis);
> > > +    bql_unlock();
> > 
> > Doesn't that leave that held for a heck of a long time?
> 
> Yes, and it effectively broke "postcopy recover" test but I
> think the reason for that is qemu_loadvm_state_main() and
> its children don't drop BQL while waiting for I/O.
> 
> I've described this case in more detail in my reply to Fabiano here:
> https://lore.kernel.org/qemu-devel/0a09e627-955e-4f26-8d08-0192ecd250a8@maciej.szmigiero.name/

While it might be the cause in this case, my feeling is it's more fundamental
here - it's the whole reason that postcopy has a separate ram listen
thread.  As the destination is running, after it loads its devices
and starts up it will still be loading RAM
(and other postcopiable devices) potentially for quite a while.
Holding the bql around the ram listen thread means that the
execution of the destination won't be able to take that lock
until the postcopy load has finished; so while that might apparently
complete, it'll lead to the destination stalling until that's finished
which defeats the whole point of postcopy.
That last one probably won't fail a test but it will lead to a long stall
if you give it a nice big guest with lots of RAM that it's rapidly
changing.

> I still think that "load_state" SaveVMHandlers need to be called
> with BQL held since implementations apparently expect it that way:
> for example, I think PCI device configuration restore calls
> address space manipulation methods which abort() if called
> without BQL held.

However, the only devices that *should* be arriving on the channel
that the postcopy_ram_listen_thread is reading from are those
that are postcopiable (i.e. RAM and hmm block's dirty_bitmap).
Those load handlers are safe to be run while the other devices
are being changed.   Note the *should* - you could add a check
to fail if any other device arrives on that channel.

> I have previously even submitted a patch to explicitly document
> "load_state" SaveVMHandler as requiring BQL (which was also
> included in the previous version of this patch set) and it
> received a "Reviewed-by:" tag:
> https://lore.kernel.org/qemu-devel/6976f129df610c8207da4e531c8c0475ec204fa4.1730203967.git.maciej.szmigiero@oracle.com/
> https://lore.kernel.org/qemu-devel/e1949839932efaa531e2fe63ac13324e5787439c.1731773021.git.maciej.szmigiero@oracle.com/
> https://lore.kernel.org/qemu-devel/87o732bti7.fsf@suse.de/

It happens!
You could make this safer by having a load_state and a load_state_postcopy
member, and only mark the load_state as requiring the lock.

> It's also worth noting that COLO equivalent of postcopy
> incoming thread (colo_process_incoming_thread()) explicitly
> takes BQL around qemu_loadvm_state_main():
> >     bql_lock();
> >     cpu_synchronize_all_states();
> >     ret = qemu_loadvm_state_main(mis->from_src_file, mis);
> >     bql_unlock();
> 

It's not a straight equivalent; it's about a decade since I've
thought about COLO, so I can't quite remember when that thread
runs.

> > That RAM loading has to happen in parallel with the loading of
> > devices doesn't it - especially if one of the devices
> > being loaded touches RAM.
> > 
> > (I wish this series had a description in the cover letter!)
> 
> I guess you mean "more detailed description" since there's
> a paragraph about this patch in this series' cover letter change log:
> > * postcopy_ram_listen_thread() now takes BQL around function calls that
> > ultimately call migration methods requiring BQL.
> > This fixes one of QEMU tests failing when explicitly BQL-sensitive code
> > is added later to these methods.

I meant a higher level description of what the series is doing.

Dave

> 
> > Dave
> 
> Thanks,
> Maciej
> 
> 
-- 
 -----Open up your eyes, open up your mind, open up your code -------   
/ Dr. David Alan Gilbert    |       Running GNU/Linux       | Happy  \ 
\        dave @ treblig.org |                               | In Hex /
 \ _________________________|_____ http://www.treblig.org   |_______/


^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH v4 09/33] migration: postcopy_ram_listen_thread() needs to take BQL for some calls
  2025-02-02 12:45       ` Dr. David Alan Gilbert
@ 2025-02-03 13:57         ` Maciej S. Szmigiero
  2025-02-03 19:58           ` Peter Xu
  0 siblings, 1 reply; 137+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-03 13:57 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Peter Xu, Fabiano Rosas, Alex Williamson, Cédric Le Goater,
	Eric Blake, Markus Armbruster, Daniel P. Berrangé,
	Avihai Horon, Joao Martins, qemu-devel

On 2.02.2025 13:45, Dr. David Alan Gilbert wrote:
> * Maciej S. Szmigiero (mail@maciej.szmigiero.name) wrote:
>> On 2.02.2025 03:06, Dr. David Alan Gilbert wrote:
>>> * Maciej S. Szmigiero (mail@maciej.szmigiero.name) wrote:
>>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>>
>>>> postcopy_ram_listen_thread() is a free running thread, so it needs to
>>>> take BQL around function calls to migration methods requiring BQL.
>>>>
>>>> qemu_loadvm_state_main() needs BQL held since it ultimately calls
>>>> "load_state" SaveVMHandlers.
>>>>
>>>> migration_incoming_state_destroy() needs BQL held since it ultimately calls
>>>> "load_cleanup" SaveVMHandlers.
>>>>
>>>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>>>> ---
>>>>    migration/savevm.c | 4 ++++
>>>>    1 file changed, 4 insertions(+)
>>>>
>>>> diff --git a/migration/savevm.c b/migration/savevm.c
>>>> index b0b74140daea..0ceea9638cc1 100644
>>>> --- a/migration/savevm.c
>>>> +++ b/migration/savevm.c
>>>> @@ -2013,7 +2013,9 @@ static void *postcopy_ram_listen_thread(void *opaque)
>>>>         * in qemu_file, and thus we must be blocking now.
>>>>         */
>>>>        qemu_file_set_blocking(f, true);
>>>> +    bql_lock();
>>>>        load_res = qemu_loadvm_state_main(f, mis);
>>>> +    bql_unlock();
>>>
>>> Doesn't that leave that held for a heck of a long time?
>>
>> Yes, and it effectively broke "postcopy recover" test but I
>> think the reason for that is qemu_loadvm_state_main() and
>> its children don't drop BQL while waiting for I/O.
>>
>> I've described this case in more detail in my reply to Fabiano here:
>> https://lore.kernel.org/qemu-devel/0a09e627-955e-4f26-8d08-0192ecd250a8@maciej.szmigiero.name/
> 
> While it might be the cause in this case, my feeling is it's more fundamental
> here - it's the whole reason that postcopy has a separate ram listen
> thread.  As the destination is running, after it loads its devices
> and starts up it will still be loading RAM
> (and other postcopiable devices) potentially for quite a while.
> Holding the bql around the ram listen thread means that the
> execution of the destination won't be able to take that lock
> until the postcopy load has finished; so while that might apparently
> complete, it'll lead to the destination stalling until that's finished
> which defeats the whole point of postcopy.
> That last one probably won't fail a test but it will lead to a long stall
> if you give it a nice big guest with lots of RAM that it's rapidly
> changing.

Okay, I understand the postcopy case/flow now.
Thanks for explaining it clearly.

>> I still think that "load_state" SaveVMHandlers need to be called
>> with BQL held since implementations apparently expect it that way:
>> for example, I think PCI device configuration restore calls
>> address space manipulation methods which abort() if called
>> without BQL held.
> 
> However, the only devices that *should* be arriving on the channel
> that the postcopy_ram_listen_thread is reading from are those
> that are postcopiable (i.e. RAM and hmm block's dirty_bitmap).
> Those load handlers are safe to be run while the other devices
> are being changed.   Note the *should* - you could add a check
> to fail if any other device arrives on that channel.

I think ultimately there should be either an explicit check or,
as you suggest in the paragraph below, a separate SaveVMHandler
that runs without the BQL held, since the current state of just
running these SaveVMHandlers without the BQL and hoping that
nothing breaks is clearly sub-optimal.

>> I have previously even submitted a patch to explicitly document
>> "load_state" SaveVMHandler as requiring BQL (which was also
>> included in the previous version of this patch set) and it
>> received a "Reviewed-by:" tag:
>> https://lore.kernel.org/qemu-devel/6976f129df610c8207da4e531c8c0475ec204fa4.1730203967.git.maciej.szmigiero@oracle.com/
>> https://lore.kernel.org/qemu-devel/e1949839932efaa531e2fe63ac13324e5787439c.1731773021.git.maciej.szmigiero@oracle.com/
>> https://lore.kernel.org/qemu-devel/87o732bti7.fsf@suse.de/
> 
> It happens!
> You could make this safer by having a load_state and a load_state_postcopy
> member, and only mark the load_state as requiring the lock.

To not digress too much from the subject of this patch set
(multifd VFIO device state transfer) for now I've just updated the
TODO comment around that qemu_loadvm_state_main(), so hopefully this
discussion won't get forgotten:
https://gitlab.com/maciejsszmigiero/qemu/-/commit/046e3deac5b1dbc406b3e9571f62468bd6743e79

(..)
>>> That RAM loading has to happen in parallel with the loading of
>>> devices doesn't it - especially if one of the devices
>>> being loaded touches RAM.
>>>
>>> (I wish this series had a description in the cover letter!)
>>
I guess you mean "more detailed description", since there's
a paragraph about this patch in this series' cover letter change log:
>>> * postcopy_ram_listen_thread() now takes BQL around function calls that
>>> ultimately call migration methods requiring BQL.
>>> This fixes one of QEMU tests failing when explicitly BQL-sensitive code
>>> is added later to these methods.
> 
> I meant a higher level description of what the series is doing.

There was a general overview of what this series is trying to achieve
in its RFC (version 0) cover letter:
https://lore.kernel.org/qemu-devel/cover.1713269378.git.maciej.szmigiero@oracle.com/

I also did a presentation about this patch set during last year's KVM Forum;
its slide deck is here:
https://pretalx.com/media/kvm-forum-2024/submissions/ZSYR9Z/resources/kvm-forum-2024-multifd-device-state-transfer_3K5EQIG.pdf

> Dave
> 

Thanks,
Maciej




* Re: [PATCH v4 00/33] Multifd 🔀 device state transfer support with VFIO consumer
  2025-01-30 10:08 [PATCH v4 00/33] Multifd 🔀 device state transfer support with VFIO consumer Maciej S. Szmigiero
                   ` (33 preceding siblings ...)
  2025-01-30 20:19 ` [PATCH v4 00/33] Multifd 🔀 device state transfer support with VFIO consumer Fabiano Rosas
@ 2025-02-03 14:19 ` Cédric Le Goater
  2025-02-21  6:57   ` Yanghang Liu
  34 siblings, 1 reply; 137+ messages in thread
From: Cédric Le Goater @ 2025-02-03 14:19 UTC (permalink / raw)
  To: Maciej S. Szmigiero, Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Eric Blake, Markus Armbruster,
	Daniel P . Berrangé, Avihai Horon, Joao Martins, qemu-devel

Hello Maciej,

> This patch set is targeting QEMU 10.0.
> 
> What's not yet present is documentation update under docs/devel/migration
> but I didn't want to delay posting the code any longer.
> Such doc can still be merged later when the design is 100% finalized.
The changes are quite complex, the design is not trivial, the benefits are
not huge as far as we know. I'd rather have the doc update first please.

Thanks,

C.





* Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels
  2025-01-30 10:08 ` [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels Maciej S. Szmigiero
@ 2025-02-03 18:20   ` Peter Xu
  2025-02-03 18:53     ` Maciej S. Szmigiero
  2025-02-04 15:08     ` Daniel P. Berrangé
  0 siblings, 2 replies; 137+ messages in thread
From: Peter Xu @ 2025-02-03 18:20 UTC (permalink / raw)
  To: Maciej S. Szmigiero
  Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

On Thu, Jan 30, 2025 at 11:08:29AM +0100, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> 
> Multifd send channels are terminated by calling
> qio_channel_shutdown(QIO_CHANNEL_SHUTDOWN_BOTH) in
> multifd_send_terminate_threads(), which in the TLS case essentially
> calls shutdown(SHUT_RDWR) on the underlying raw socket.
> 
> Unfortunately, this does not terminate the TLS session properly and
> the receive side sees this as a GNUTLS_E_PREMATURE_TERMINATION error.
> 
> The only reason why this wasn't causing migration failures is because
> the current migration code apparently does not check for migration
> error being set after the end of the multifd receive process.
> 
> However, this will change soon so the multifd receive code has to be
> prepared to not return an error on such premature TLS session EOF.
> Use the newly introduced QIOChannelTLS method for that.
> 
> It's worth noting that even if the sender were to be changed to terminate
> the TLS connection properly the receive side still needs to remain
> compatible with older QEMU bit stream which does not do this.

If this is an existing bug, we could add a Fixes.

Two pure questions..

  - What is the correct way to terminate the TLS session without this flag?

  - Why this is only needed by multifd sessions?

Thanks,

-- 
Peter Xu




* Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels
  2025-02-03 18:20   ` Peter Xu
@ 2025-02-03 18:53     ` Maciej S. Szmigiero
  2025-02-03 20:20       ` Peter Xu
  2025-02-04 15:08     ` Daniel P. Berrangé
  1 sibling, 1 reply; 137+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-03 18:53 UTC (permalink / raw)
  To: Peter Xu
  Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P. Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

On 3.02.2025 19:20, Peter Xu wrote:
> On Thu, Jan 30, 2025 at 11:08:29AM +0100, Maciej S. Szmigiero wrote:
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> Multifd send channels are terminated by calling
>> qio_channel_shutdown(QIO_CHANNEL_SHUTDOWN_BOTH) in
>> multifd_send_terminate_threads(), which in the TLS case essentially
>> calls shutdown(SHUT_RDWR) on the underlying raw socket.
>>
>> Unfortunately, this does not terminate the TLS session properly and
>> the receive side sees this as a GNUTLS_E_PREMATURE_TERMINATION error.
>>
>> The only reason why this wasn't causing migration failures is because
>> the current migration code apparently does not check for migration
>> error being set after the end of the multifd receive process.
>>
>> However, this will change soon so the multifd receive code has to be
>> prepared to not return an error on such premature TLS session EOF.
>> Use the newly introduced QIOChannelTLS method for that.
>>
>> It's worth noting that even if the sender were to be changed to terminate
>> the TLS connection properly the receive side still needs to remain
>> compatible with older QEMU bit stream which does not do this.
> 
> If this is an existing bug, we could add a Fixes.

It is an existing issue but only uncovered by this patch set.

As far as I can see it was always there, so it would need some
thought where to point that Fixes tag.
  
> Two pure questions..
> 
>    - What is the correct way to terminate the TLS session without this flag?

I guess one would need to call gnutls_bye() like in this GnuTLS example:
https://gitlab.com/gnutls/gnutls/-/blob/2b8c3e4c71ad380bbbffb32e6003b34ecad596e3/doc/examples/ex-client-anon.c#L102

>    - Why this is only needed by multifd sessions?

What uncovered the issue was switching the load threads to using
migrate_set_error() instead of their own result variable
(load_threads_ret) which you had requested during the previous
patch set version review:
https://lore.kernel.org/qemu-devel/Z1DbH5fwBaxtgrvH@x1n/

It turns out that the multifd receive code always returned an
error in the TLS case; it's just that nothing was previously
checking for that error.

Another option would be to simply return to using
load_threads_ret like the previous versions did and not
experiment with touching global migration state because
as we can see other places can unintentionally break.

If we go this route then these TLS EOF patches could be
dropped.

> Thanks,
> 

Thanks,
Maciej




* Re: [PATCH v4 09/33] migration: postcopy_ram_listen_thread() needs to take BQL for some calls
  2025-02-03 13:57         ` Maciej S. Szmigiero
@ 2025-02-03 19:58           ` Peter Xu
  2025-02-03 20:15             ` Maciej S. Szmigiero
  0 siblings, 1 reply; 137+ messages in thread
From: Peter Xu @ 2025-02-03 19:58 UTC (permalink / raw)
  To: Maciej S. Szmigiero
  Cc: Dr. David Alan Gilbert, Fabiano Rosas, Alex Williamson,
	Cédric Le Goater, Eric Blake, Markus Armbruster,
	Daniel P. Berrangé, Avihai Horon, Joao Martins, qemu-devel

On Mon, Feb 03, 2025 at 02:57:36PM +0100, Maciej S. Szmigiero wrote:
> On 2.02.2025 13:45, Dr. David Alan Gilbert wrote:
> > * Maciej S. Szmigiero (mail@maciej.szmigiero.name) wrote:
> > > On 2.02.2025 03:06, Dr. David Alan Gilbert wrote:
> > > > * Maciej S. Szmigiero (mail@maciej.szmigiero.name) wrote:
> > > > > From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> > > > > 
> > > > > postcopy_ram_listen_thread() is a free running thread, so it needs to
> > > > > take BQL around function calls to migration methods requiring BQL.
> > > > > 
> > > > > qemu_loadvm_state_main() needs BQL held since it ultimately calls
> > > > > "load_state" SaveVMHandlers.
> > > > > 
> > > > > migration_incoming_state_destroy() needs BQL held since it ultimately calls
> > > > > "load_cleanup" SaveVMHandlers.
> > > > > 
> > > > > Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> > > > > ---
> > > > >    migration/savevm.c | 4 ++++
> > > > >    1 file changed, 4 insertions(+)
> > > > > 
> > > > > diff --git a/migration/savevm.c b/migration/savevm.c
> > > > > index b0b74140daea..0ceea9638cc1 100644
> > > > > --- a/migration/savevm.c
> > > > > +++ b/migration/savevm.c
> > > > > @@ -2013,7 +2013,9 @@ static void *postcopy_ram_listen_thread(void *opaque)
> > > > >         * in qemu_file, and thus we must be blocking now.
> > > > >         */
> > > > >        qemu_file_set_blocking(f, true);
> > > > > +    bql_lock();
> > > > >        load_res = qemu_loadvm_state_main(f, mis);
> > > > > +    bql_unlock();
> > > > 
> > > > Doesn't that leave that held for a heck of a long time?
> > > 
> > > Yes, and it effectively broke "postcopy recover" test but I
> > > think the reason for that is qemu_loadvm_state_main() and
> > > its children don't drop BQL while waiting for I/O.
> > > 
> > > I've described this case in more detail in my reply to Fabiano here:
> > > https://lore.kernel.org/qemu-devel/0a09e627-955e-4f26-8d08-0192ecd250a8@maciej.szmigiero.name/
> > 
> > While it might be the cause in this case, my feeling is it's more fundamental
> > here - it's the whole reason that postcopy has a separate ram listen
> > thread.  As the destination is running, after it loads it's devices
> > and as it starts up the destination will be still loading RAM
> > (and other postcopiable devices) potentially for quite a while.
> > Holding the bql around the ram listen thread means that the
> > execution of the destination won't be able to take that lock
> > until the postcopy load has finished; so while that might apparently
> > complete, it'll lead to the destination stalling until that's finished
> > which defeats the whole point of postcopy.
> > That last one probably won't fail a test but it will lead to a long stall
> > if you give it a nice big guest with lots of RAM that it's rapidly
> > changing.
> 
> Okay, I understand the postcopy case/flow now.
> Thanks for explaining it clearly.
> 
> > > I still think that "load_state" SaveVMHandlers need to be called
> > > with BQL held since implementations apparently expect it that way:
> > > for example, I think PCI device configuration restore calls
> > > address space manipulation methods which abort() if called
> > > without BQL held.
> > 
> > However, the only devices that *should* be arriving on the channel
> > that the postcopy_ram_listen_thread is reading from are those
> > that are postcopiable (i.e. RAM and hmm block's dirty_bitmap).
> > Those load handlers are safe to be run while the other devices
> > are being changed.   Note the *should* - you could add a check
> > to fail if any other device arrives on that channel.
> 
> I think ultimately there should be either an explicit check, or,
> as you suggest in the paragraph below, a separate SaveVMHandler
> that runs without BQL held.

To me those are bugs happening during postcopy, so those abort()s in
memory.c are indeed for catching these issues too.

> The current state of just running these SaveVMHandlers
> without BQL in this case and hoping that nothing breaks is
> clearly sub-optimal.
> 
> > > I have previously even submitted a patch to explicitly document
> > > "load_state" SaveVMHandler as requiring BQL (which was also
> > > included in the previous version of this patch set) and it
> > > received a "Reviewed-by:" tag:
> > > https://lore.kernel.org/qemu-devel/6976f129df610c8207da4e531c8c0475ec204fa4.1730203967.git.maciej.szmigiero@oracle.com/
> > > https://lore.kernel.org/qemu-devel/e1949839932efaa531e2fe63ac13324e5787439c.1731773021.git.maciej.szmigiero@oracle.com/
> > > https://lore.kernel.org/qemu-devel/87o732bti7.fsf@suse.de/
> > 
> > It happens!
> > You could make this safer by having a load_state and a load_state_postcopy
> > member, and only mark the load_state as requiring the lock.
> 
> To not digress too much from the subject of this patch set
> (multifd VFIO device state transfer) for now I've just updated the
> TODO comment around that qemu_loadvm_state_main(), so hopefully this
> discussion won't get forgotten:
> https://gitlab.com/maciejsszmigiero/qemu/-/commit/046e3deac5b1dbc406b3e9571f62468bd6743e79

The commit message may still need some touch ups, e.g.:

  postcopy_ram_listen_thread() is a free running thread, so it needs to
  take BQL around function calls to migration methods requiring BQL.

This sentence is still not correct, IMHO. As Dave explained, the ram load
thread is designed to run without BQL at least for the major workloads it
runs.

I don't worry on src sending something that crashes the dest: if that
happens, that's a bug, we need to fix it..  In that case abort() either in
memory.c or migration/ would be the same.  We could add some explicit check
in migration code, but I don't expect it to catch anything real, at least
such never happened since postcopy introduced.. so it's roughly 10 years
without anything like that happens.

Taking BQL for migration_incoming_state_destroy() looks all safe.  There's
one qemu_ram_block_writeback() which made me a bit nervous initially, but
then it looks like for RAM backends it should be almost a no-op (for shmem
and hugetlbfs), except for pmem.

The other alternative is we define load_cleanup() to not rely on BQL (which
I believe is true before this series?), then take it only when VFIO's path
needs it.

Thanks,

-- 
Peter Xu




* Re: [PATCH v4 09/33] migration: postcopy_ram_listen_thread() needs to take BQL for some calls
  2025-02-03 19:58           ` Peter Xu
@ 2025-02-03 20:15             ` Maciej S. Szmigiero
  2025-02-03 20:36               ` Peter Xu
  0 siblings, 1 reply; 137+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-03 20:15 UTC (permalink / raw)
  To: Peter Xu
  Cc: Dr. David Alan Gilbert, Fabiano Rosas, Alex Williamson,
	Cédric Le Goater, Eric Blake, Markus Armbruster,
	Daniel P. Berrangé, Avihai Horon, Joao Martins, qemu-devel

On 3.02.2025 20:58, Peter Xu wrote:
> On Mon, Feb 03, 2025 at 02:57:36PM +0100, Maciej S. Szmigiero wrote:
>> On 2.02.2025 13:45, Dr. David Alan Gilbert wrote:
>>> * Maciej S. Szmigiero (mail@maciej.szmigiero.name) wrote:
>>>> On 2.02.2025 03:06, Dr. David Alan Gilbert wrote:
>>>>> * Maciej S. Szmigiero (mail@maciej.szmigiero.name) wrote:
>>>>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>>>>
>>>>>> postcopy_ram_listen_thread() is a free running thread, so it needs to
>>>>>> take BQL around function calls to migration methods requiring BQL.
>>>>>>
>>>>>> qemu_loadvm_state_main() needs BQL held since it ultimately calls
>>>>>> "load_state" SaveVMHandlers.
>>>>>>
>>>>>> migration_incoming_state_destroy() needs BQL held since it ultimately calls
>>>>>> "load_cleanup" SaveVMHandlers.
>>>>>>
>>>>>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>>>>>> ---
>>>>>>     migration/savevm.c | 4 ++++
>>>>>>     1 file changed, 4 insertions(+)
>>>>>>
>>>>>> diff --git a/migration/savevm.c b/migration/savevm.c
>>>>>> index b0b74140daea..0ceea9638cc1 100644
>>>>>> --- a/migration/savevm.c
>>>>>> +++ b/migration/savevm.c
>>>>>> @@ -2013,7 +2013,9 @@ static void *postcopy_ram_listen_thread(void *opaque)
>>>>>>          * in qemu_file, and thus we must be blocking now.
>>>>>>          */
>>>>>>         qemu_file_set_blocking(f, true);
>>>>>> +    bql_lock();
>>>>>>         load_res = qemu_loadvm_state_main(f, mis);
>>>>>> +    bql_unlock();
>>>>>
>>>>> Doesn't that leave that held for a heck of a long time?
>>>>
>>>> Yes, and it effectively broke "postcopy recover" test but I
>>>> think the reason for that is qemu_loadvm_state_main() and
>>>> its children don't drop BQL while waiting for I/O.
>>>>
>>>> I've described this case in more detail in my reply to Fabiano here:
>>>> https://lore.kernel.org/qemu-devel/0a09e627-955e-4f26-8d08-0192ecd250a8@maciej.szmigiero.name/
>>>
>>> While it might be the cause in this case, my feeling is it's more fundamental
>>> here - it's the whole reason that postcopy has a separate ram listen
>>> thread.  As the destination is running, after it loads it's devices
>>> and as it starts up the destination will be still loading RAM
>>> (and other postcopiable devices) potentially for quite a while.
>>> Holding the bql around the ram listen thread means that the
>>> execution of the destination won't be able to take that lock
>>> until the postcopy load has finished; so while that might apparently
>>> complete, it'll lead to the destination stalling until that's finished
>>> which defeats the whole point of postcopy.
>>> That last one probably won't fail a test but it will lead to a long stall
>>> if you give it a nice big guest with lots of RAM that it's rapidly
>>> changing.
>>
>> Okay, I understand the postcopy case/flow now.
>> Thanks for explaining it clearly.
>>
>>>> I still think that "load_state" SaveVMHandlers need to be called
>>>> with BQL held since implementations apparently expect it that way:
>>>> for example, I think PCI device configuration restore calls
>>>> address space manipulation methods which abort() if called
>>>> without BQL held.
>>>
>>> However, the only devices that *should* be arriving on the channel
>>> that the postcopy_ram_listen_thread is reading from are those
>>> that are postcopiable (i.e. RAM and hmm block's dirty_bitmap).
>>> Those load handlers are safe to be run while the other devices
>>> are being changed.   Note the *should* - you could add a check
>>> to fail if any other device arrives on that channel.
>>
>> I think ultimately there should be either an explicit check, or,
>> as you suggest in the paragraph below, a separate SaveVMHandler
>> that runs without BQL held.
> 
> To me those are bugs happening during postcopy, so those abort()s in
> memory.c are indeed for catching these issues too.
> 
>> Since the current state of just running these SaveVMHandlers
>> without BQL in this case and hoping that nothing breaks is
>> clearly sub-optimal.
>>
>>>> I have previously even submitted a patch to explicitly document
>>>> "load_state" SaveVMHandler as requiring BQL (which was also
>>>> included in the previous version of this patch set) and it
>>>> received a "Reviewed-by:" tag:
>>>> https://lore.kernel.org/qemu-devel/6976f129df610c8207da4e531c8c0475ec204fa4.1730203967.git.maciej.szmigiero@oracle.com/
>>>> https://lore.kernel.org/qemu-devel/e1949839932efaa531e2fe63ac13324e5787439c.1731773021.git.maciej.szmigiero@oracle.com/
>>>> https://lore.kernel.org/qemu-devel/87o732bti7.fsf@suse.de/
>>>
>>> It happens!
>>> You could make this safer by having a load_state and a load_state_postcopy
>>> member, and only mark the load_state as requiring the lock.
>>
>> To not digress too much from the subject of this patch set
>> (multifd VFIO device state transfer) for now I've just updated the
>> TODO comment around that qemu_loadvm_state_main(), so hopefully this
>> discussion won't get forgotten:
>> https://gitlab.com/maciejsszmigiero/qemu/-/commit/046e3deac5b1dbc406b3e9571f62468bd6743e79
> 
> The commit message may still need some touch ups, e.g.:
> 
>    postcopy_ram_listen_thread() is a free running thread, so it needs to
>    take BQL around function calls to migration methods requiring BQL.
>
>
> This sentence is still not correct, IMHO. As Dave explained, the ram load
> thread is designed to run without BQL at least for the major workloads it
> runs.

So what's your proposed wording of this commit then?

> I don't worry on src sending something that crashes the dest: if that
> happens, that's a bug, we need to fix it..  In that case abort() either in
> memory.c or migration/ would be the same.  

Yeah, but it would be a bug in the source (or just bit stream corruption for
any reason), yet it's the destination which would abort() or crash.

I think cases like that in principle should be handled more gracefully,
like exiting the destination QEMU with an error.
But that's something outside of the scope of this patch set.

> We could add some explicit check
> in migration code, but I don't expect it to catch anything real, at least
> such never happened since postcopy introduced.. so it's roughly 10 years
> without anything like that happens.
> 
> Taking BQL for migration_incoming_state_destroy() looks all safe.  There's
> one qemu_ram_block_writeback() which made me a bit nervous initially, but
> then it looks like for RAM backends it should be almost a no-op (for shmem
> and hugetlbfs), except for pmem.

That's the only part where taking BQL is actually necessary for the
functionality of this patch set to work properly, so it's fine to leave
that call to qemu_loadvm_state_main() as-is (without BQL) for time being.

> 
> The other alternative is we define load_cleanup() to not rely on BQL (which
> I believe is true before this series?), then take it only when VFIO's path
> needs it.

I think other paths always call load_cleanup() with BQL held, so it's
probably safer to have consistent semantics here.

> Thanks,
> 

Thanks,
Maciej




* Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels
  2025-02-03 18:53     ` Maciej S. Szmigiero
@ 2025-02-03 20:20       ` Peter Xu
  2025-02-03 21:41         ` Maciej S. Szmigiero
  2025-02-04 15:10         ` Daniel P. Berrangé
  0 siblings, 2 replies; 137+ messages in thread
From: Peter Xu @ 2025-02-03 20:20 UTC (permalink / raw)
  To: Maciej S. Szmigiero
  Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P. Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

On Mon, Feb 03, 2025 at 07:53:00PM +0100, Maciej S. Szmigiero wrote:
> On 3.02.2025 19:20, Peter Xu wrote:
> > On Thu, Jan 30, 2025 at 11:08:29AM +0100, Maciej S. Szmigiero wrote:
> > > From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> > > 
> > > Multifd send channels are terminated by calling
> > > qio_channel_shutdown(QIO_CHANNEL_SHUTDOWN_BOTH) in
> > > multifd_send_terminate_threads(), which in the TLS case essentially
> > > calls shutdown(SHUT_RDWR) on the underlying raw socket.
> > > 
> > > Unfortunately, this does not terminate the TLS session properly and
> > > the receive side sees this as a GNUTLS_E_PREMATURE_TERMINATION error.
> > > 
> > > The only reason why this wasn't causing migration failures is because
> > > the current migration code apparently does not check for migration
> > > error being set after the end of the multifd receive process.
> > > 
> > > However, this will change soon so the multifd receive code has to be
> > > prepared to not return an error on such premature TLS session EOF.
> > > Use the newly introduced QIOChannelTLS method for that.
> > > 
> > > It's worth noting that even if the sender were to be changed to terminate
> > > the TLS connection properly the receive side still needs to remain
> > > compatible with older QEMU bit stream which does not do this.
> > 
> > If this is an existing bug, we could add a Fixes.
> 
> It is an existing issue but only uncovered by this patch set.
> 
> As far as I can see it was always there, so it would need some
> thought where to point that Fixes tag.

If there's no way to trigger a real functional bug anyway, it's also ok we
omit the Fixes.

> > Two pure questions..
> > 
> >    - What is the correct way to terminate the TLS session without this flag?
> 
> I guess one would need to call gnutls_bye() like in this GnuTLS example:
> https://gitlab.com/gnutls/gnutls/-/blob/2b8c3e4c71ad380bbbffb32e6003b34ecad596e3/doc/examples/ex-client-anon.c#L102
> 
> >    - Why this is only needed by multifd sessions?
> 
> What uncovered the issue was switching the load threads to using
> migrate_set_error() instead of their own result variable
> (load_threads_ret) which you had requested during the previous
> patch set version review:
> https://lore.kernel.org/qemu-devel/Z1DbH5fwBaxtgrvH@x1n/
> 
> It turns out that the multifd receive code always returned an
> error in the TLS case; it's just that nothing was previously
> checking for that error.

What I was curious about is whether this issue also exists for the main
migration channel with TLS, especially when e.g. multifd is not enabled at
all, as I don't see anywhere that QEMU uses gnutls_bye() for any TLS session.

I think it's good to find that we overlooked this before, and IMHO it
would always be good to fix this.

Does it mean we need proper gnutls_bye() somewhere?

If we need an explicit gnutls_bye(), then I wonder if that should be done
on the main channel as well.

If we don't need gnutls_bye(), then should we always ignore pre-mature
termination of tls no matter if it's multifd or non-multifd channel (or
even a tls session that is not migration-related)?

Thanks,

> 
> Another option would be to simply return to using
> load_threads_ret like the previous versions did and not
> experiment with touching global migration state because
> as we can see other places can unintentionally break.
> 
> If we go this route then these TLS EOF patches could be
> dropped.
> 
> > Thanks,
> > 
> 
> Thanks,
> Maciej
> 

-- 
Peter Xu




* Re: [PATCH v4 09/33] migration: postcopy_ram_listen_thread() needs to take BQL for some calls
  2025-02-03 20:15             ` Maciej S. Szmigiero
@ 2025-02-03 20:36               ` Peter Xu
  2025-02-03 21:41                 ` Maciej S. Szmigiero
  0 siblings, 1 reply; 137+ messages in thread
From: Peter Xu @ 2025-02-03 20:36 UTC (permalink / raw)
  To: Maciej S. Szmigiero
  Cc: Dr. David Alan Gilbert, Fabiano Rosas, Alex Williamson,
	Cédric Le Goater, Eric Blake, Markus Armbruster,
	Daniel P. Berrangé, Avihai Horon, Joao Martins, qemu-devel

On Mon, Feb 03, 2025 at 09:15:52PM +0100, Maciej S. Szmigiero wrote:
> On 3.02.2025 20:58, Peter Xu wrote:
> > On Mon, Feb 03, 2025 at 02:57:36PM +0100, Maciej S. Szmigiero wrote:
> > > On 2.02.2025 13:45, Dr. David Alan Gilbert wrote:
> > > > * Maciej S. Szmigiero (mail@maciej.szmigiero.name) wrote:
> > > > > On 2.02.2025 03:06, Dr. David Alan Gilbert wrote:
> > > > > > * Maciej S. Szmigiero (mail@maciej.szmigiero.name) wrote:
> > > > > > > From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> > > > > > > 
> > > > > > > postcopy_ram_listen_thread() is a free running thread, so it needs to
> > > > > > > take BQL around function calls to migration methods requiring BQL.
> > > > > > > 
> > > > > > > qemu_loadvm_state_main() needs BQL held since it ultimately calls
> > > > > > > "load_state" SaveVMHandlers.
> > > > > > > 
> > > > > > > migration_incoming_state_destroy() needs BQL held since it ultimately calls
> > > > > > > "load_cleanup" SaveVMHandlers.
> > > > > > > 
> > > > > > > Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> > > > > > > ---
> > > > > > >     migration/savevm.c | 4 ++++
> > > > > > >     1 file changed, 4 insertions(+)
> > > > > > > 
> > > > > > > diff --git a/migration/savevm.c b/migration/savevm.c
> > > > > > > index b0b74140daea..0ceea9638cc1 100644
> > > > > > > --- a/migration/savevm.c
> > > > > > > +++ b/migration/savevm.c
> > > > > > > @@ -2013,7 +2013,9 @@ static void *postcopy_ram_listen_thread(void *opaque)
> > > > > > >          * in qemu_file, and thus we must be blocking now.
> > > > > > >          */
> > > > > > >         qemu_file_set_blocking(f, true);
> > > > > > > +    bql_lock();
> > > > > > >         load_res = qemu_loadvm_state_main(f, mis);
> > > > > > > +    bql_unlock();
> > > > > > 
> > > > > > Doesn't that leave that held for a heck of a long time?
> > > > > 
> > > > > Yes, and it effectively broke "postcopy recover" test but I
> > > > > think the reason for that is qemu_loadvm_state_main() and
> > > > > its children don't drop BQL while waiting for I/O.
> > > > > 
> > > > > I've described this case in more detail in my reply to Fabiano here:
> > > > > https://lore.kernel.org/qemu-devel/0a09e627-955e-4f26-8d08-0192ecd250a8@maciej.szmigiero.name/
> > > > 
> > > > While it might be the cause in this case, my feeling is it's more fundamental
> > > > here - it's the whole reason that postcopy has a separate ram listen
> > > > thread.  As the destination is running, after it loads it's devices
> > > > and as it starts up the destination will be still loading RAM
> > > > (and other postcopiable devices) potentially for quite a while.
> > > > Holding the bql around the ram listen thread means that the
> > > > execution of the destination won't be able to take that lock
> > > > until the postcopy load has finished; so while that might apparently
> > > > complete, it'll lead to the destination stalling until that's finished
> > > > which defeats the whole point of postcopy.
> > > > That last one probably won't fail a test but it will lead to a long stall
> > > > if you give it a nice big guest with lots of RAM that it's rapidly
> > > > changing.
> > > 
> > > Okay, I understand the postcopy case/flow now.
> > > Thanks for explaining it clearly.
> > > 
> > > > > I still think that "load_state" SaveVMHandlers need to be called
> > > > > with BQL held since implementations apparently expect it that way:
> > > > > for example, I think PCI device configuration restore calls
> > > > > address space manipulation methods which abort() if called
> > > > > without BQL held.
> > > > 
> > > > However, the only devices that *should* be arriving on the channel
> > > > that the postcopy_ram_listen_thread is reading from are those
> > > > that are postcopiable (i.e. RAM and hmm block's dirty_bitmap).
> > > > Those load handlers are safe to be run while the other devices
> > > > are being changed.   Note the *should* - you could add a check
> > > > to fail if any other device arrives on that channel.
> > > 
> > > I think ultimately there should be either an explicit check, or,
> > > as you suggest in the paragraph below, a separate SaveVMHandler
> > > that runs without BQL held.
> > 
> > To me those are bugs happening during postcopy, so those abort()s in
> > memory.c are indeed for catching these issues too.
> > 
> > > The current state of just running these SaveVMHandlers
> > > without BQL in this case and hoping that nothing breaks is
> > > clearly sub-optimal.
> > > 
> > > > > I have previously even submitted a patch to explicitly document
> > > > > "load_state" SaveVMHandler as requiring BQL (which was also
> > > > > included in the previous version of this patch set) and it
> > > > > received a "Reviewed-by:" tag:
> > > > > https://lore.kernel.org/qemu-devel/6976f129df610c8207da4e531c8c0475ec204fa4.1730203967.git.maciej.szmigiero@oracle.com/
> > > > > https://lore.kernel.org/qemu-devel/e1949839932efaa531e2fe63ac13324e5787439c.1731773021.git.maciej.szmigiero@oracle.com/
> > > > > https://lore.kernel.org/qemu-devel/87o732bti7.fsf@suse.de/
> > > > 
> > > > It happens!
> > > > You could make this safer by having a load_state and a load_state_postcopy
> > > > member, and only mark the load_state as requiring the lock.
> > > 
> > > To not digress too much from the subject of this patch set
> > > (multifd VFIO device state transfer) for now I've just updated the
> > > TODO comment around that qemu_loadvm_state_main(), so hopefully this
> > > discussion won't get forgotten:
> > > https://gitlab.com/maciejsszmigiero/qemu/-/commit/046e3deac5b1dbc406b3e9571f62468bd6743e79
> > 
> > The commit message may still need some touch ups, e.g.:
> > 
> >    postcopy_ram_listen_thread() is a free running thread, so it needs to
> >    take BQL around function calls to migration methods requiring BQL.
> > 
> > 
> > This sentence is still not correct, IMHO. As Dave explained, the ram load
> > thread is designed to run without BQL at least for the major workloads it
> > runs.
> 
> So what's your proposed wording of this commit then?

Perhaps drop it? Either it implies qemu_loadvm_state_main() needs to
take BQL (which could be wrong in the postcopy case, at least from a
design pov.. though not from a sanity-check pov), or it provides no real
guidance on where to take it..

Personally I would keep the comment as simple as possible - the large
version isn't helping me understand the code, it only makes it slightly
more confusing..

    /*
     * TODO: qemu_loadvm_state_main() could call "load_state" SaveVMHandlers
     * that are expecting BQL to be held, which isn't in this case.
     *
     * In principle, the only devices that should be arriving on this channel
     * now are those that are postcopiable and whose load handlers are safe
     * to be called without BQL being held.
     *
     * But nothing currently prevents the source from sending data for "unsafe"
     * devices which would cause trouble here.
     */

IMHO we could keep it very simple if you think we need such a sanity
check later:

    /* TODO: sanity check that only postcopiable data will be loaded here */

> 
> > I don't worry on src sending something that crashes the dest: if that
> > happens, that's a bug, we need to fix it..  In that case abort() either in
> > memory.c or migration/ would be the same.
> 
> Yeah, but it would be a bug in the source (or just bit stream corruption for
> any reason), yet it's the destination which would abort() or crash.
> 
> I think cases like that in principle should be handled more gracefully,
> like exiting the destination QEMU with an error.
> But that's something outside of the scope of this patch set.

Yes I agree.  It's just that postcopy normally cannot gracefully quit on
the dest anyway.. as the src QEMU cannot continue with a dead dest QEMU. For
obvious programming errors, I think abort() is still ok in this case, on
either src or dest if postcopy has already started.

For this series, we could always stick with precopy; that could help the
series converge.

> 
> > We could add some explicit check
> > in migration code, but I don't expect it to catch anything real - nothing
> > like that has happened in the roughly 10 years since postcopy was
> > introduced..
> > 
> > Taking BQL for migration_incoming_state_destroy() looks all safe.  There's
> > one qemu_ram_block_writeback() which made me a bit nervous initially, but
> > it looks like it should be almost a no-op for RAM backends (shmem and
> > hugetlbfs), except for pmem.
> 
> That's the only part where taking BQL is actually necessary for the
> functionality of this patch set to work properly, so it's fine to leave
> that call to qemu_loadvm_state_main() as-is (without BQL) for time being.
> 
> > 
> > The other alternative is we define load_cleanup() to not rely on BQL (which
> > I believe is true before this series?), then take it only when VFIO's path
> > needs it.
> 
> I think other paths always call load_cleanup() with BQL so it's probably
> safer to have consistent semantics here.

IMHO we don't necessarily need to make it the default that vmstate handler
hooks need BQL - we can always properly define them to best
suit our needs.

For this case I think it's ok either way. But I'm assuming: (1) no serious
users run QEMU RAMs on normal file systems (or RAM's cleanup() can do
msync() on those, which can flush page caches for a long time to disks),
and (2) pmem isn't important.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH v4 10/33] error: define g_autoptr() cleanup function for the Error type
  2025-01-30 10:08 ` [PATCH v4 10/33] error: define g_autoptr() cleanup function for the Error type Maciej S. Szmigiero
@ 2025-02-03 20:53   ` Peter Xu
  2025-02-03 21:13   ` Daniel P. Berrangé
  1 sibling, 0 replies; 137+ messages in thread
From: Peter Xu @ 2025-02-03 20:53 UTC (permalink / raw)
  To: Maciej S. Szmigiero
  Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

On Thu, Jan 30, 2025 at 11:08:31AM +0100, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> 
> Automatic memory management helps avoid memory safety issues.
> 
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>

Reviewed-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu




* Re: [PATCH v4 10/33] error: define g_autoptr() cleanup function for the Error type
  2025-01-30 10:08 ` [PATCH v4 10/33] error: define g_autoptr() cleanup function for the Error type Maciej S. Szmigiero
  2025-02-03 20:53   ` Peter Xu
@ 2025-02-03 21:13   ` Daniel P. Berrangé
  2025-02-03 21:51     ` Maciej S. Szmigiero
  1 sibling, 1 reply; 137+ messages in thread
From: Daniel P. Berrangé @ 2025-02-03 21:13 UTC (permalink / raw)
  To: Maciej S. Szmigiero
  Cc: Peter Xu, Fabiano Rosas, Alex Williamson, Cédric Le Goater,
	Eric Blake, Markus Armbruster, Avihai Horon, Joao Martins,
	qemu-devel

On Thu, Jan 30, 2025 at 11:08:31AM +0100, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> 
> Automatic memory management helps avoid memory safety issues.
> 
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> ---
>  include/qapi/error.h | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/include/qapi/error.h b/include/qapi/error.h
> index 71f8fb2c50ee..649ec8f1b6a2 100644
> --- a/include/qapi/error.h
> +++ b/include/qapi/error.h
> @@ -437,6 +437,8 @@ Error *error_copy(const Error *err);
>   */
>  void error_free(Error *err);
>  
> +G_DEFINE_AUTOPTR_CLEANUP_FUNC(Error, error_free)
> +

This has been rejected by Markus in the past when I proposed. See the
rationale at the time here:

  https://lists.nongnu.org/archive/html/qemu-devel/2024-07/msg05503.html

If you want this, the commit message will need to explain the use
case and justify why the existing error usage patterns are insufficient.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|




* Re: [PATCH v4 13/33] migration/multifd: Device state transfer support - receive side
  2025-01-30 10:08 ` [PATCH v4 13/33] migration/multifd: Device state transfer support - receive side Maciej S. Szmigiero
@ 2025-02-03 21:27   ` Peter Xu
  2025-02-03 22:18     ` Maciej S. Szmigiero
  0 siblings, 1 reply; 137+ messages in thread
From: Peter Xu @ 2025-02-03 21:27 UTC (permalink / raw)
  To: Maciej S. Szmigiero
  Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

On Thu, Jan 30, 2025 at 11:08:34AM +0100, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> 
> Add basic support for receiving device state via multifd channels -
> channels that are shared with RAM transfers.
> 
> Depending on whether the MULTIFD_FLAG_DEVICE_STATE flag is present in the
> packet header, either device state (MultiFDPacketDeviceState_t) or RAM
> data (existing MultiFDPacket_t) is read.
> 
> The received device state data is provided to
> qemu_loadvm_load_state_buffer() function for processing in the
> device's load_state_buffer handler.
> 
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
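The dispatch described in the commit message can be pictured with a short sketch. This is not the actual patch code; the exact signatures of the receive-side helpers (in particular qemu_loadvm_load_state_buffer() and the names of the local variables) are assumptions here:

```c
/* Sketch only - per-packet dispatch on the multifd receive side,
 * as described in the commit message above. */
if (packet_flags & MULTIFD_FLAG_DEVICE_STATE) {
    /* MultiFDPacketDeviceState_t: hand the buffer to the device's
     * load_state_buffer handler (signature assumed). */
    ret = qemu_loadvm_load_state_buffer(idstr, instance_id,
                                        data, data_size, &local_err);
} else {
    /* MultiFDPacket_t: existing RAM data path. */
    ret = multifd_recv_unfill_packet(p, &local_err);
}
```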

I think I acked this one.  You could keep my R-b if...

[...]

> diff --git a/migration/multifd.h b/migration/multifd.h
> index 9e4baa066312..abf3acdcee40 100644
> --- a/migration/multifd.h
> +++ b/migration/multifd.h
> @@ -62,6 +62,12 @@ MultiFDRecvData *multifd_get_recv_data(void);
>  #define MULTIFD_FLAG_UADK (8 << 1)
>  #define MULTIFD_FLAG_QATZIP (16 << 1)
>  
> +/*
> + * If set it means that this packet contains device state
> + * (MultiFDPacketDeviceState_t), not RAM data (MultiFDPacket_t).
> + */
> +#define MULTIFD_FLAG_DEVICE_STATE (1 << 6)

... if this won't conflict with MULTIFD_FLAG_QATZIP.

I think we should stick with one way to write these, so that such conflicts
show up when rebasing - either your patch uses 32 << 1, or perhaps we should
start switching to BIT() for all of the above instead..

-- 
Peter Xu




* Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels
  2025-02-03 20:20       ` Peter Xu
@ 2025-02-03 21:41         ` Maciej S. Szmigiero
  2025-02-03 22:56           ` Peter Xu
  2025-02-04 15:10         ` Daniel P. Berrangé
  1 sibling, 1 reply; 137+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-03 21:41 UTC (permalink / raw)
  To: Peter Xu
  Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P. Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

On 3.02.2025 21:20, Peter Xu wrote:
> On Mon, Feb 03, 2025 at 07:53:00PM +0100, Maciej S. Szmigiero wrote:
>> On 3.02.2025 19:20, Peter Xu wrote:
>>> On Thu, Jan 30, 2025 at 11:08:29AM +0100, Maciej S. Szmigiero wrote:
>>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>>
>>>> Multifd send channels are terminated by calling
>>>> qio_channel_shutdown(QIO_CHANNEL_SHUTDOWN_BOTH) in
>>>> multifd_send_terminate_threads(), which in the TLS case essentially
>>>> calls shutdown(SHUT_RDWR) on the underlying raw socket.
>>>>
>>>> Unfortunately, this does not terminate the TLS session properly and
>>>> the receive side sees this as a GNUTLS_E_PREMATURE_TERMINATION error.
>>>>
>>>> The only reason why this wasn't causing migration failures is because
>>>> the current migration code apparently does not check for migration
>>>> error being set after the end of the multifd receive process.
>>>>
>>>> However, this will change soon so the multifd receive code has to be
>>>> prepared to not return an error on such premature TLS session EOF.
>>>> Use the newly introduced QIOChannelTLS method for that.
>>>>
>>>> It's worth noting that even if the sender were to be changed to terminate
>>>> the TLS connection properly the receive side still needs to remain
>>>> compatible with older QEMU bit stream which does not do this.
>>>
>>> If this is an existing bug, we could add a Fixes.
>>
>> It is an existing issue but only uncovered by this patch set.
>>
>> As far as I can see it was always there, so it would need some
>> thought where to point that Fixes tag.
> 
> If there's no way to trigger a real functional bug anyway, it's also ok we
> omit the Fixes.
> 
>>> Two pure questions..
>>>
>>>     - What is the correct way to terminate the TLS session without this flag?
>>
>> I guess one would need to call gnutls_bye() like in this GnuTLS example:
>> https://gitlab.com/gnutls/gnutls/-/blob/2b8c3e4c71ad380bbbffb32e6003b34ecad596e3/doc/examples/ex-client-anon.c#L102
>>
>>>     - Why this is only needed by multifd sessions?
>>
>> What uncovered the issue was switching the load threads to using
>> migrate_set_error() instead of their own result variable
>> (load_threads_ret) which you had requested during the previous
>> patch set version review:
>> https://lore.kernel.org/qemu-devel/Z1DbH5fwBaxtgrvH@x1n/
>>
>> Turns out that the multifd receive code always returned
>> error in the TLS case, just nothing was previously checking for
>> that error presence.
> 
> What I was curious is whether this issue also exists for the main migration
> channel when with tls, especially when e.g. multifd not enabled at all.  As
> I don't see anywhere that qemu uses gnutls_bye() for any tls session.
> 
> I think it's a good to find that we overlooked this before.. and IMHO it's
> always good we could fix this.
> 
> Does it mean we need proper gnutls_bye() somewhere?
> 
> If we need an explicit gnutls_bye(), then I wonder if that should be done
> on the main channel as well.

That's a good question. Looking at the code, qemu_loadvm_state_main() exits
on receiving a "QEMU_VM_EOF" section (which is different from receiving a
socket EOF), and then optionally a "QEMU_VM_VMDESCRIPTION" section is read
with an explicit size in qemu_loadvm_state() - so still not until channel EOF.

Then I can't see anything else reading the channel until it is closed in
migration_incoming_state_destroy().

So most likely the main migration channel will never read far enough to
reach that GNUTLS_E_PREMATURE_TERMINATION error.
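For illustration, a graceful sender-side termination with GnuTLS would look roughly like the sketch below (not QEMU code - session setup and error handling are elided, and "session"/"fd" are assumed to be an established gnutls_session_t and its underlying socket). gnutls_bye() sends the TLS close_notify alert, so the peer sees a clean EOF instead of GNUTLS_E_PREMATURE_TERMINATION:

```c
/* Sketch only: assumes an established gnutls_session_t "session"
 * on top of socket "fd". */
int ret;
do {
    /* Send close_notify and wait for the peer's close_notify. */
    ret = gnutls_bye(session, GNUTLS_SHUT_RDWR);
} while (ret == GNUTLS_E_AGAIN || ret == GNUTLS_E_INTERRUPTED);

/* Only then tear down the underlying transport. */
shutdown(fd, SHUT_RDWR);
close(fd);
```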

> If we don't need gnutls_bye(), then should we always ignore pre-mature
> termination of tls no matter if it's multifd or non-multifd channel (or
> even a tls session that is not migration-related)?

So basically have this patch extended to calling
qio_channel_tls_set_premature_eof_okay() also on the main migration channel?

> Thanks,

Thanks,
Maciej




* Re: [PATCH v4 09/33] migration: postcopy_ram_listen_thread() needs to take BQL for some calls
  2025-02-03 20:36               ` Peter Xu
@ 2025-02-03 21:41                 ` Maciej S. Szmigiero
  2025-02-03 23:02                   ` Peter Xu
  0 siblings, 1 reply; 137+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-03 21:41 UTC (permalink / raw)
  To: Peter Xu
  Cc: Dr. David Alan Gilbert, Fabiano Rosas, Alex Williamson,
	Cédric Le Goater, Eric Blake, Markus Armbruster,
	Daniel P. Berrangé, Avihai Horon, Joao Martins, qemu-devel

On 3.02.2025 21:36, Peter Xu wrote:
> On Mon, Feb 03, 2025 at 09:15:52PM +0100, Maciej S. Szmigiero wrote:
>> On 3.02.2025 20:58, Peter Xu wrote:
>>> On Mon, Feb 03, 2025 at 02:57:36PM +0100, Maciej S. Szmigiero wrote:
>>>> On 2.02.2025 13:45, Dr. David Alan Gilbert wrote:
>>>>> * Maciej S. Szmigiero (mail@maciej.szmigiero.name) wrote:
>>>>>> On 2.02.2025 03:06, Dr. David Alan Gilbert wrote:
>>>>>>> * Maciej S. Szmigiero (mail@maciej.szmigiero.name) wrote:
>>>>>>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>>>>>>
>>>>>>>> postcopy_ram_listen_thread() is a free running thread, so it needs to
>>>>>>>> take BQL around function calls to migration methods requiring BQL.
>>>>>>>>
>>>>>>>> qemu_loadvm_state_main() needs BQL held since it ultimately calls
>>>>>>>> "load_state" SaveVMHandlers.
>>>>>>>>
>>>>>>>> migration_incoming_state_destroy() needs BQL held since it ultimately calls
>>>>>>>> "load_cleanup" SaveVMHandlers.
>>>>>>>>
>>>>>>>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>>>>>>>> ---
>>>>>>>>      migration/savevm.c | 4 ++++
>>>>>>>>      1 file changed, 4 insertions(+)
>>>>>>>>
>>>>>>>> diff --git a/migration/savevm.c b/migration/savevm.c
>>>>>>>> index b0b74140daea..0ceea9638cc1 100644
>>>>>>>> --- a/migration/savevm.c
>>>>>>>> +++ b/migration/savevm.c
>>>>>>>> @@ -2013,7 +2013,9 @@ static void *postcopy_ram_listen_thread(void *opaque)
>>>>>>>>           * in qemu_file, and thus we must be blocking now.
>>>>>>>>           */
>>>>>>>>          qemu_file_set_blocking(f, true);
>>>>>>>> +    bql_lock();
>>>>>>>>          load_res = qemu_loadvm_state_main(f, mis);
>>>>>>>> +    bql_unlock();
>>>>>>>
>>>>>>> Doesn't that leave that held for a heck of a long time?
>>>>>>
>>>>>> Yes, and it effectively broke "postcopy recover" test but I
>>>>>> think the reason for that is qemu_loadvm_state_main() and
>>>>>> its children don't drop BQL while waiting for I/O.
>>>>>>
>>>>>> I've described this case in more detail in my reply to Fabiano here:
>>>>>> https://lore.kernel.org/qemu-devel/0a09e627-955e-4f26-8d08-0192ecd250a8@maciej.szmigiero.name/
>>>>>
>>>>> While it might be the cause in this case, my feeling is it's more fundamental
>>>>> here - it's the whole reason that postcopy has a separate ram listen
>>>>> thread.  As the destination is running, after it loads it's devices
>>>>> and as it starts up the destination will be still loading RAM
>>>>> (and other postcopiable devices) potentially for quite a while.
>>>>> Holding the bql around the ram listen thread means that the
>>>>> execution of the destination won't be able to take that lock
>>>>> until the postcopy load has finished; so while that might apparently
>>>>> complete, it'll lead to the destination stalling until that's finished
>>>>> which defeats the whole point of postcopy.
>>>>> That last one probably won't fail a test but it will lead to a long stall
>>>>> if you give it a nice big guest with lots of RAM that it's rapidly
>>>>> changing.
>>>>
>>>> Okay, I understand the postcopy case/flow now.
>>>> Thanks for explaining it clearly.
>>>>
>>>>>> I still think that "load_state" SaveVMHandlers need to be called
>>>>>> with BQL held since implementations apparently expect it that way:
>>>>>> for example, I think PCI device configuration restore calls
>>>>>> address space manipulation methods which abort() if called
>>>>>> without BQL held.
>>>>>
>>>>> However, the only devices that *should* be arriving on the channel
>>>>> that the postcopy_ram_listen_thread is reading from are those
>>>>> that are postcopiable (i.e. RAM and hmm block's dirty_bitmap).
>>>>> Those load handlers are safe to be run while the other devices
>>>>> are being changed.   Note the *should* - you could add a check
>>>>> to fail if any other device arrives on that channel.
>>>>
>>>> I think ultimately there should be either an explicit check, or,
>>>> as you suggest in the paragraph below, a separate SaveVMHandler
>>>> that runs without BQL held.
>>>
>>> To me those are bugs happening during postcopy, so those abort()s in
>>> memory.c are indeed for catching these issues too.
>>>
>>>> The current state of just running these SaveVMHandlers
>>>> without BQL in this case and hoping that nothing breaks is
>>>> clearly sub-optimal.
>>>>
>>>>>> I have previously even submitted a patch to explicitly document
>>>>>> "load_state" SaveVMHandler as requiring BQL (which was also
>>>>>> included in the previous version of this patch set) and it
>>>>>> received a "Reviewed-by:" tag:
>>>>>> https://lore.kernel.org/qemu-devel/6976f129df610c8207da4e531c8c0475ec204fa4.1730203967.git.maciej.szmigiero@oracle.com/
>>>>>> https://lore.kernel.org/qemu-devel/e1949839932efaa531e2fe63ac13324e5787439c.1731773021.git.maciej.szmigiero@oracle.com/
>>>>>> https://lore.kernel.org/qemu-devel/87o732bti7.fsf@suse.de/
>>>>>
>>>>> It happens!
>>>>> You could make this safer by having a load_state and a load_state_postcopy
>>>>> member, and only mark the load_state as requiring the lock.
>>>>
>>>> To not digress too much from the subject of this patch set
>>>> (multifd VFIO device state transfer) for now I've just updated the
>>>> TODO comment around that qemu_loadvm_state_main(), so hopefully this
>>>> discussion won't get forgotten:
>>>> https://gitlab.com/maciejsszmigiero/qemu/-/commit/046e3deac5b1dbc406b3e9571f62468bd6743e79
>>>
>>> The commit message may still need some touch ups, e.g.:
>>>
>>>     postcopy_ram_listen_thread() is a free running thread, so it needs to
>>>     take BQL around function calls to migration methods requiring BQL.
>>>
>>>
>>> This sentence is still not correct, IMHO. As Dave explained, the ram load
>>> thread is designed to run without BQL at least for the major workloads it
>>> runs.
>>
>> So what's your proposed wording of this commit then?
> 
> Perhaps drop it? Either it implies qemu_loadvm_state_main() needs to
> take BQL (which could be wrong in the postcopy case, at least from a
> design pov.. though not from a sanity-check pov), or it provides no real
> guidance on where to take it..
> 
> Personally I would keep the comment as simple as possible - the large
> version isn't helping me understand the code, it only makes it slightly
> more confusing..
> 
>      /*
>       * TODO: qemu_loadvm_state_main() could call "load_state" SaveVMHandlers
>       * that are expecting BQL to be held, which isn't in this case.
>       *
>       * In principle, the only devices that should be arriving on this channel
>       * now are those that are postcopiable and whose load handlers are safe
>       * to be called without BQL being held.
>       *
>       * But nothing currently prevents the source from sending data for "unsafe"
>       * devices which would cause trouble here.
>       */
> 
> IMHO we could keep it very simple if you think we need such a sanity
> check later:
> 
>      /* TODO: sanity check that only postcopiable data will be loaded here */

I think I will change that comment wording to the one ^^^^ you suggested above,
since we still need this commit to take BQL around that
migration_incoming_state_destroy() call in postcopy_ram_listen_thread().

>>
>>> I don't worry on src sending something that crashes the dest: if that
>>> happens, that's a bug, we need to fix it..  In that case abort() either in
>>> memory.c or migration/ would be the same.
>>
>> Yeah, but it would be a bug in the source (or just bit stream corruption for
>> any reason), yet it's the destination which would abort() or crash.
>>
>> I think cases like that in principle should be handled more gracefully,
>> like exiting the destination QEMU with an error.
>> But that's something outside of the scope of this patch set.
> 
> Yes I agree.  It's just that postcopy normally cannot gracefully quit on
> the dest anyway.. as the src QEMU cannot continue with a dead dest QEMU. For
> obvious programming errors, I think abort() is still ok in this case, on
> either src or dest if postcopy has already started.
> 
> For this series, we could always stick with precopy; that could help the
> series converge.

To be clear, I'm touching postcopy only because, without taking that BQL
lock around migration_incoming_state_destroy() in
postcopy_ram_listen_thread(), other changes in this patch set would break
postcopy.
And that's obviously not acceptable.
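Concretely, the shape being converged on in this sub-thread can be sketched as follows (a sketch of postcopy_ram_listen_thread(), not the final patch: the load loop stays BQL-free so the running destination is not stalled, while the cleanup - which ends up calling "load_cleanup" SaveVMHandlers - runs under BQL):

```c
/* Sketch of the agreed locking in postcopy_ram_listen_thread(). */
qemu_file_set_blocking(f, true);
load_res = qemu_loadvm_state_main(f, mis);   /* no BQL: free-running load */
...
bql_lock();
migration_incoming_state_destroy();          /* "load_cleanup" needs BQL */
bql_unlock();
```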

>>
>>> We could add some explicit check
>>> in migration code, but I don't expect it to catch anything real - nothing
>>> like that has happened in the roughly 10 years since postcopy was
>>> introduced..
>>>
>>> Taking BQL for migration_incoming_state_destroy() looks all safe.  There's
>>> one qemu_ram_block_writeback() which made me a bit nervous initially, but
>>> it looks like it should be almost a no-op for RAM backends (shmem and
>>> hugetlbfs), except for pmem.
>>
>> That's the only part where taking BQL is actually necessary for the
>> functionality of this patch set to work properly, so it's fine to leave
>> that call to qemu_loadvm_state_main() as-is (without BQL) for time being.
>>
>>>
>>> The other alternative is we define load_cleanup() to not rely on BQL (which
>>> I believe is true before this series?), then take it only when VFIO's path
>>> needs it.
>>
>> I think other paths always call load_cleanup() with BQL so it's probably
>> safer to have consistent semantics here.
> 
> IMHO we don't necessarily need to make it the default that vmstate handler
> hooks need BQL - we can always properly define them to best
> suit our needs.

But I think consistency is important - if other callers take BQL for
load_cleanup() then it makes sense to take it in all places (if only to make
the code simpler).

> For this case I think it's ok either way. But I'm assuming: (1) no serious
> users run QEMU RAMs on normal file systems (or RAM's cleanup() can do
> msync() on those, which can flush page caches for a long time to disks),
> and (2) pmem isn't important.
> 
> Thanks,
> 

Thanks,
Maciej




* Re: [PATCH v4 16/33] migration/multifd: Device state transfer support - send side
  2025-01-30 10:08 ` [PATCH v4 16/33] migration/multifd: Device state transfer support - send side Maciej S. Szmigiero
@ 2025-02-03 21:47   ` Peter Xu
  0 siblings, 0 replies; 137+ messages in thread
From: Peter Xu @ 2025-02-03 21:47 UTC (permalink / raw)
  To: Maciej S. Szmigiero
  Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

On Thu, Jan 30, 2025 at 11:08:37AM +0100, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> 
> A new function multifd_queue_device_state() is provided for device to queue
> its state for transmission via a multifd channel.
> 
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>

Reviewed-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu




* Re: [PATCH v4 10/33] error: define g_autoptr() cleanup function for the Error type
  2025-02-03 21:13   ` Daniel P. Berrangé
@ 2025-02-03 21:51     ` Maciej S. Szmigiero
  0 siblings, 0 replies; 137+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-03 21:51 UTC (permalink / raw)
  To: Daniel P. Berrangé, Markus Armbruster
  Cc: Peter Xu, Fabiano Rosas, Alex Williamson, Cédric Le Goater,
	Eric Blake, Markus Armbruster, Avihai Horon, Joao Martins,
	qemu-devel

On 3.02.2025 22:13, Daniel P. Berrangé wrote:
> On Thu, Jan 30, 2025 at 11:08:31AM +0100, Maciej S. Szmigiero wrote:
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> Automatic memory management helps avoid memory safety issues.
>>
>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>> ---
>>   include/qapi/error.h | 2 ++
>>   1 file changed, 2 insertions(+)
>>
>> diff --git a/include/qapi/error.h b/include/qapi/error.h
>> index 71f8fb2c50ee..649ec8f1b6a2 100644
>> --- a/include/qapi/error.h
>> +++ b/include/qapi/error.h
>> @@ -437,6 +437,8 @@ Error *error_copy(const Error *err);
>>   */
>>   void error_free(Error *err);
>>   
>> +G_DEFINE_AUTOPTR_CLEANUP_FUNC(Error, error_free)
>> +
> 
> This has been rejected by Markus in the past when I proposed. See the
> rationale at the time here:
> 
>    https://lists.nongnu.org/archive/html/qemu-devel/2024-07/msg05503.html

Thanks for the pointer, I wasn't expecting this change to be controversial.
  
> If you want this, the commit message will need to explain the use
> case and justify why the existing error usage patterns are insufficient.

In this case it's about giving the received Error to migrate_set_error(),
which does *not* take ownership of it.

And the reason why migrate_set_error() does not take ownership of the
incoming Error is that an Error might already be set in
MigrationState; in that case it simply ignores the passed Error
(almost acting as a NOP).

I don't know whether this is enough of a justification for introducing
g_autoptr(Error).
I'm happy to drop this commit and change it to manual memory management
instead if it is not.

@Markus, what's your opinion here?

> With regards,
> Daniel

Thanks,
Maciej




* Re: [PATCH v4 13/33] migration/multifd: Device state transfer support - receive side
  2025-02-03 21:27   ` Peter Xu
@ 2025-02-03 22:18     ` Maciej S. Szmigiero
  2025-02-03 22:59       ` Peter Xu
  0 siblings, 1 reply; 137+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-03 22:18 UTC (permalink / raw)
  To: Peter Xu
  Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P. Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

On 3.02.2025 22:27, Peter Xu wrote:
> On Thu, Jan 30, 2025 at 11:08:34AM +0100, Maciej S. Szmigiero wrote:
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> Add basic support for receiving device state via multifd channels -
>> channels that are shared with RAM transfers.
>>
>> Depending on whether the MULTIFD_FLAG_DEVICE_STATE flag is present in the
>> packet header, either device state (MultiFDPacketDeviceState_t) or RAM
>> data (existing MultiFDPacket_t) is read.
>>
>> The received device state data is provided to
>> qemu_loadvm_load_state_buffer() function for processing in the
>> device's load_state_buffer handler.
>>
>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> 
> I think I acked this one.  You could keep my R-b if...
> 
> [...]
> 
>> diff --git a/migration/multifd.h b/migration/multifd.h
>> index 9e4baa066312..abf3acdcee40 100644
>> --- a/migration/multifd.h
>> +++ b/migration/multifd.h
>> @@ -62,6 +62,12 @@ MultiFDRecvData *multifd_get_recv_data(void);
>>   #define MULTIFD_FLAG_UADK (8 << 1)
>>   #define MULTIFD_FLAG_QATZIP (16 << 1)
>>   
>> +/*
>> + * If set it means that this packet contains device state
>> + * (MultiFDPacketDeviceState_t), not RAM data (MultiFDPacket_t).
>> + */
>> +#define MULTIFD_FLAG_DEVICE_STATE (1 << 6)
> 
> ... if this won't conflict with MULTIFD_FLAG_QATZIP.

Hmm, isn't (16 << 1) = 32 while (1 << 6) = 64?
  
> I think we should stick with one way to write these, so that such conflicts
> show up when rebasing - either your patch uses 32 << 1, or perhaps we should
> start switching to BIT() for all of the above instead..
> 

Thanks,
Maciej




* Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels
  2025-02-03 21:41         ` Maciej S. Szmigiero
@ 2025-02-03 22:56           ` Peter Xu
  2025-02-04 13:51             ` Fabiano Rosas
  2025-02-04 14:39             ` Maciej S. Szmigiero
  0 siblings, 2 replies; 137+ messages in thread
From: Peter Xu @ 2025-02-03 22:56 UTC (permalink / raw)
  To: Maciej S. Szmigiero
  Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P. Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

On Mon, Feb 03, 2025 at 10:41:32PM +0100, Maciej S. Szmigiero wrote:
> On 3.02.2025 21:20, Peter Xu wrote:
> > On Mon, Feb 03, 2025 at 07:53:00PM +0100, Maciej S. Szmigiero wrote:
> > > On 3.02.2025 19:20, Peter Xu wrote:
> > > > On Thu, Jan 30, 2025 at 11:08:29AM +0100, Maciej S. Szmigiero wrote:
> > > > > From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> > > > > 
> > > > > Multifd send channels are terminated by calling
> > > > > qio_channel_shutdown(QIO_CHANNEL_SHUTDOWN_BOTH) in
> > > > > multifd_send_terminate_threads(), which in the TLS case essentially
> > > > > calls shutdown(SHUT_RDWR) on the underlying raw socket.
> > > > > 
> > > > > Unfortunately, this does not terminate the TLS session properly and
> > > > > the receive side sees this as a GNUTLS_E_PREMATURE_TERMINATION error.
> > > > > 
> > > > > The only reason why this wasn't causing migration failures is because
> > > > > the current migration code apparently does not check for migration
> > > > > error being set after the end of the multifd receive process.
> > > > > 
> > > > > However, this will change soon so the multifd receive code has to be
> > > > > prepared to not return an error on such premature TLS session EOF.
> > > > > Use the newly introduced QIOChannelTLS method for that.
> > > > > 
> > > > > It's worth noting that even if the sender were to be changed to terminate
> > > > > the TLS connection properly the receive side still needs to remain
> > > > > compatible with older QEMU bit stream which does not do this.
> > > > 
> > > > If this is an existing bug, we could add a Fixes.
> > > 
> > > It is an existing issue but only uncovered by this patch set.
> > > 
> > > As far as I can see it was always there, so it would need some
> > > thought where to point that Fixes tag.
> > 
> > If there's no way to trigger a real functional bug anyway, it's also ok we
> > omit the Fixes.
> > 
> > > > Two pure questions..
> > > > 
> > > >     - What is the correct way to terminate the TLS session without this flag?
> > > 
> > > I guess one would need to call gnutls_bye() like in this GnuTLS example:
> > > https://gitlab.com/gnutls/gnutls/-/blob/2b8c3e4c71ad380bbbffb32e6003b34ecad596e3/doc/examples/ex-client-anon.c#L102
> > > 
> > > >     - Why this is only needed by multifd sessions?
> > > 
> > > What uncovered the issue was switching the load threads to using
> > > migrate_set_error() instead of their own result variable
> > > (load_threads_ret) which you had requested during the previous
> > > patch set version review:
> > > https://lore.kernel.org/qemu-devel/Z1DbH5fwBaxtgrvH@x1n/
> > > 
> > > Turns out that the multifd receive code always returned an
> > > error in the TLS case, just nothing was previously checking for
> > > the presence of that error.
> > 
> > What I was curious about is whether this issue also exists for the main
> > migration channel with TLS, especially when e.g. multifd is not enabled at
> > all.  I don't see anywhere that qemu uses gnutls_bye() for any TLS session.
> > 
> > I think it's good that we found we had overlooked this before.. and IMHO
> > it's always good if we can fix it.
> > 
> > Does it mean we need proper gnutls_bye() somewhere?
> > 
> > If we need an explicit gnutls_bye(), then I wonder if that should be done
> > on the main channel as well.
> 
> That's a good question.  Looking at the code, qemu_loadvm_state_main() exits
> on receiving a "QEMU_VM_EOF" section (which is different from receiving socket EOF),
> and then optionally a "QEMU_VM_VMDESCRIPTION" section is read with an explicit size
> in qemu_loadvm_state() - so still not until channel EOF.

I had a closer look, and I do feel like such premature termination is caused
by the explicit shutdown()s of the iochannels; it looks like that can cause
issues even after everything has been sent.  Then I noticed that multifd
sender iochannels do get explicit shutdown()s since commit 077fbb5942, while
we don't do that for the main channel.  Maybe that is a major difference.

Now I wonder whether we should shutdown() the channel at all if migration
succeeded, because it looks like doing so can interrupt the TLS session even
if the shutdown() happens after everything has been sent, and if so it would
explain why you hit the issue with TLS.

> 
> Then I can't see anything else reading the channel until it is closed in
> migration_incoming_state_destroy().
> 
> So most likely the main migration channel will never read far enough to
> reach that GNUTLS_E_PREMATURE_TERMINATION error.
> 
> > If we don't need gnutls_bye(), then should we always ignore pre-mature
> > termination of tls no matter if it's multifd or non-multifd channel (or
> > even a tls session that is not migration-related)?
> 
> So basically have this patch extended to calling
> qio_channel_tls_set_premature_eof_okay() also on the main migration channel?

If the above theory stands, then eof-okay would be a workaround papering
over the real problem, which is that we shouldn't always shutdown()..

Could you have a look at the patch below and see whether it fixes the
problem you hit too, in place of these two patches (including the previous
iochannel change)?

Thanks,

===8<===
From 3147084174b0e0bda076ad205ae139f8fc433892 Mon Sep 17 00:00:00 2001
From: Peter Xu <peterx@redhat.com>
Date: Mon, 3 Feb 2025 17:27:45 -0500
Subject: [PATCH] migration: Avoid shutting down multifd channels if migration
 succeeded
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Multifd channels behave differently from the main channel when shutting
down: the sender side will always shutdown() the multifd iochannels, no
matter whether migration succeeded or not.  QEMU doesn't do that for the
main channel on the src side.

Such behavior was introduced in commit 077fbb5942 ("multifd: Shut down the
QIO channels to avoid blocking the send threads when they are terminated.")
to fix a hang issue when multifd enabled.

This might be problematic though, especially in a TLS context, because it
looks like such a shutdown() on src (even on a successful migration) could
cause destination multifd iochannels to receive premature terminations of
TLS sessions.

It's debatable whether such a shutdown() should be explicitly done even for
a succeeded migration.  This patch moves the shutdown() from the
finalization phase into qmp_migrate_cancel(), so that we only do the
shutdown() on cancel, and avoid it when migration succeeds.

While at it, keep all the shutdown() code together, moving the return path
shutdown() (which seems a bit redundant, but does no harm) to where the
rest of the channels are shut down.

Cc: Li Zhang <lizhang@suse.de>
Cc: Dr. David Alan Gilbert <dave@treblig.org>
Cc: Daniel P. Berrangé <berrange@redhat.com>
Reported-by: Maciej S. Szmigiero <mail@maciej.szmigiero.name>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/multifd.h   |  1 +
 migration/migration.c | 24 +++++++++++++++++-------
 migration/multifd.c   | 14 +++++++++++---
 3 files changed, 29 insertions(+), 10 deletions(-)

diff --git a/migration/multifd.h b/migration/multifd.h
index bd785b9873..26ef94ac93 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -340,6 +340,7 @@ static inline void multifd_send_prepare_header(MultiFDSendParams *p)
 
 void multifd_channel_connect(MultiFDSendParams *p, QIOChannel *ioc);
 bool multifd_send(MultiFDSendData **send_data);
+void multifd_send_shutdown_iochannels(void);
 MultiFDSendData *multifd_send_data_alloc(void);
 
 static inline uint32_t multifd_ram_page_size(void)
diff --git a/migration/migration.c b/migration/migration.c
index 74c50cc72c..e43f8222dc 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1565,13 +1565,6 @@ static void migrate_fd_cancel(MigrationState *s)
 
     trace_migrate_fd_cancel();
 
-    WITH_QEMU_LOCK_GUARD(&s->qemu_file_lock) {
-        if (s->rp_state.from_dst_file) {
-            /* shutdown the rp socket, so causing the rp thread to shutdown */
-            qemu_file_shutdown(s->rp_state.from_dst_file);
-        }
-    }
-
     do {
         old_state = s->state;
         if (!migration_is_running()) {
@@ -1594,6 +1587,23 @@ static void migrate_fd_cancel(MigrationState *s)
             if (s->to_dst_file) {
                 qemu_file_shutdown(s->to_dst_file);
             }
+            /*
+             * The above should already work, because the iochannel is
+             * shared between the outgoing and return path qemufiles;
+             * however, just to be on the safe side, set the qemufile
+             * error on the return path too if it exists.
+             */
+            if (s->rp_state.from_dst_file) {
+                qemu_file_shutdown(s->rp_state.from_dst_file);
+            }
+        }
+
+        /*
+         * We need to shutdown multifd channels too if they are available,
+         * to make sure no multifd send threads will be stuck at syscalls.
+         */
+        if (migrate_multifd()) {
+            multifd_send_shutdown_iochannels();
         }
     }
 
diff --git a/migration/multifd.c b/migration/multifd.c
index ab73d6d984..96bcbb1e0c 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -384,6 +384,17 @@ static void multifd_send_set_error(Error *err)
     }
 }
 
+void multifd_send_shutdown_iochannels(void)
+{
+    QIOChannel *c;
+    int i;
+
+    for (i = 0; i < migrate_multifd_channels(); i++) {
+        c = multifd_send_state->params[i].c;
+        qio_channel_shutdown(c, QIO_CHANNEL_SHUTDOWN_BOTH, NULL);
+    }
+}
+
 static void multifd_send_terminate_threads(void)
 {
     int i;
@@ -404,9 +415,6 @@ static void multifd_send_terminate_threads(void)
         MultiFDSendParams *p = &multifd_send_state->params[i];
 
         qemu_sem_post(&p->sem);
-        if (p->c) {
-            qio_channel_shutdown(p->c, QIO_CHANNEL_SHUTDOWN_BOTH, NULL);
-        }
     }
 
     /*
-- 
2.47.0


-- 
Peter Xu



^ permalink raw reply related	[flat|nested] 137+ messages in thread

* Re: [PATCH v4 13/33] migration/multifd: Device state transfer support - receive side
  2025-02-03 22:18     ` Maciej S. Szmigiero
@ 2025-02-03 22:59       ` Peter Xu
  2025-02-04 14:40         ` Maciej S. Szmigiero
  0 siblings, 1 reply; 137+ messages in thread
From: Peter Xu @ 2025-02-03 22:59 UTC (permalink / raw)
  To: Maciej S. Szmigiero
  Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P. Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

On Mon, Feb 03, 2025 at 11:18:11PM +0100, Maciej S. Szmigiero wrote:
> On 3.02.2025 22:27, Peter Xu wrote:
> > On Thu, Jan 30, 2025 at 11:08:34AM +0100, Maciej S. Szmigiero wrote:
> > > From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> > > 
> > > Add basic support for receiving device state via multifd channels -
> > > channels that are shared with RAM transfers.
> > > 
> > > Depending on whether the MULTIFD_FLAG_DEVICE_STATE flag is present in the
> > > packet header either device state (MultiFDPacketDeviceState_t) or RAM
> > > data (existing MultiFDPacket_t) is read.
> > > 
> > > The received device state data is provided to
> > > qemu_loadvm_load_state_buffer() function for processing in the
> > > device's load_state_buffer handler.
> > > 
> > > Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> > 
> > I think I acked this one.  You could keep my R-b if...
> > 
> > [...]
> > 
> > > diff --git a/migration/multifd.h b/migration/multifd.h
> > > index 9e4baa066312..abf3acdcee40 100644
> > > --- a/migration/multifd.h
> > > +++ b/migration/multifd.h
> > > @@ -62,6 +62,12 @@ MultiFDRecvData *multifd_get_recv_data(void);
> > >   #define MULTIFD_FLAG_UADK (8 << 1)
> > >   #define MULTIFD_FLAG_QATZIP (16 << 1)
> > > +/*
> > > + * If set it means that this packet contains device state
> > > + * (MultiFDPacketDeviceState_t), not RAM data (MultiFDPacket_t).
> > > + */
> > > +#define MULTIFD_FLAG_DEVICE_STATE (1 << 6)
> > 
> > ... if this won't conflict with MULTIFD_FLAG_QATZIP.
> 
> Hmm, isn't (16 << 1) = 32 while (1 << 6) = 64?

Oops. :)

> > I think we should stick with one way to write it, so that when rebasing you
> > can see such conflicts - either your patch uses 32 << 1, or perhaps we
> > should start switching to BIT() for all of the above instead..

Still, do you mind switching to "32 << 1" (or using BIT())?

With either, feel free to take:

Reviewed-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu




* Re: [PATCH v4 09/33] migration: postcopy_ram_listen_thread() needs to take BQL for some calls
  2025-02-03 21:41                 ` Maciej S. Szmigiero
@ 2025-02-03 23:02                   ` Peter Xu
  2025-02-04 14:57                     ` Maciej S. Szmigiero
  0 siblings, 1 reply; 137+ messages in thread
From: Peter Xu @ 2025-02-03 23:02 UTC (permalink / raw)
  To: Maciej S. Szmigiero
  Cc: Dr. David Alan Gilbert, Fabiano Rosas, Alex Williamson,
	Cédric Le Goater, Eric Blake, Markus Armbruster,
	Daniel P. Berrangé, Avihai Horon, Joao Martins, qemu-devel

On Mon, Feb 03, 2025 at 10:41:43PM +0100, Maciej S. Szmigiero wrote:
> On 3.02.2025 21:36, Peter Xu wrote:
> > On Mon, Feb 03, 2025 at 09:15:52PM +0100, Maciej S. Szmigiero wrote:
> > > On 3.02.2025 20:58, Peter Xu wrote:
> > > > On Mon, Feb 03, 2025 at 02:57:36PM +0100, Maciej S. Szmigiero wrote:
> > > > > On 2.02.2025 13:45, Dr. David Alan Gilbert wrote:
> > > > > > * Maciej S. Szmigiero (mail@maciej.szmigiero.name) wrote:
> > > > > > > On 2.02.2025 03:06, Dr. David Alan Gilbert wrote:
> > > > > > > > * Maciej S. Szmigiero (mail@maciej.szmigiero.name) wrote:
> > > > > > > > > From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> > > > > > > > > 
> > > > > > > > > postcopy_ram_listen_thread() is a free running thread, so it needs to
> > > > > > > > > take BQL around function calls to migration methods requiring BQL.
> > > > > > > > > 
> > > > > > > > > qemu_loadvm_state_main() needs BQL held since it ultimately calls
> > > > > > > > > "load_state" SaveVMHandlers.
> > > > > > > > > 
> > > > > > > > > migration_incoming_state_destroy() needs BQL held since it ultimately calls
> > > > > > > > > "load_cleanup" SaveVMHandlers.
> > > > > > > > > 
> > > > > > > > > Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> > > > > > > > > ---
> > > > > > > > >      migration/savevm.c | 4 ++++
> > > > > > > > >      1 file changed, 4 insertions(+)
> > > > > > > > > 
> > > > > > > > > diff --git a/migration/savevm.c b/migration/savevm.c
> > > > > > > > > index b0b74140daea..0ceea9638cc1 100644
> > > > > > > > > --- a/migration/savevm.c
> > > > > > > > > +++ b/migration/savevm.c
> > > > > > > > > @@ -2013,7 +2013,9 @@ static void *postcopy_ram_listen_thread(void *opaque)
> > > > > > > > >           * in qemu_file, and thus we must be blocking now.
> > > > > > > > >           */
> > > > > > > > >          qemu_file_set_blocking(f, true);
> > > > > > > > > +    bql_lock();
> > > > > > > > >          load_res = qemu_loadvm_state_main(f, mis);
> > > > > > > > > +    bql_unlock();
> > > > > > > > 
> > > > > > > > Doesn't that leave that held for a heck of a long time?
> > > > > > > 
> > > > > > > Yes, and it effectively broke "postcopy recover" test but I
> > > > > > > think the reason for that is qemu_loadvm_state_main() and
> > > > > > > its children don't drop BQL while waiting for I/O.
> > > > > > > 
> > > > > > > I've described this case in more detail in my reply to Fabiano here:
> > > > > > > https://lore.kernel.org/qemu-devel/0a09e627-955e-4f26-8d08-0192ecd250a8@maciej.szmigiero.name/
> > > > > > 
> > > > > > While it might be the cause in this case, my feeling is it's more fundamental
> > > > > > here - it's the whole reason that postcopy has a separate ram listen
> > > > > > thread.  As the destination is running, after it loads its devices
> > > > > > and as it starts up, the destination will still be loading RAM
> > > > > > (and other postcopiable devices), potentially for quite a while.
> > > > > > Holding the BQL around the ram listen thread means that the
> > > > > > destination won't be able to take that lock until the postcopy
> > > > > > load has finished; so while the load might apparently complete,
> > > > > > it'll lead to the destination stalling until then, which defeats
> > > > > > the whole point of postcopy.
> > > > > > That last one probably won't fail a test but it will lead to a long stall
> > > > > > if you give it a nice big guest with lots of RAM that it's rapidly
> > > > > > changing.
> > > > > 
> > > > > Okay, I understand the postcopy case/flow now.
> > > > > Thanks for explaining it clearly.
> > > > > 
> > > > > > > I still think that "load_state" SaveVMHandlers need to be called
> > > > > > > with BQL held since implementations apparently expect it that way:
> > > > > > > for example, I think PCI device configuration restore calls
> > > > > > > address space manipulation methods which abort() if called
> > > > > > > without BQL held.
> > > > > > 
> > > > > > However, the only devices that *should* be arriving on the channel
> > > > > > that the postcopy_ram_listen_thread is reading from are those
> > > > > > that are postcopiable (i.e. RAM and hmm block's dirty_bitmap).
> > > > > > Those load handlers are safe to be run while the other devices
> > > > > > are being changed.   Note the *should* - you could add a check
> > > > > > to fail if any other device arrives on that channel.
> > > > > 
> > > > > I think ultimately there should be either an explicit check, or,
> > > > > as you suggest in the paragraph below, a separate SaveVMHandler
> > > > > that runs without BQL held.
> > > > 
> > > > To me those are bugs happening during postcopy, so those abort()s in
> > > > memory.c are indeed for catching these issues too.
> > > > 
> > > > > The current state of just running these SaveVMHandlers without BQL
> > > > > in this case and hoping that nothing breaks is clearly sub-optimal.
> > > > > 
> > > > > > > I have previously even submitted a patch to explicitly document
> > > > > > > "load_state" SaveVMHandler as requiring BQL (which was also
> > > > > > > included in the previous version of this patch set) and it
> > > > > > > received a "Reviewed-by:" tag:
> > > > > > > https://lore.kernel.org/qemu-devel/6976f129df610c8207da4e531c8c0475ec204fa4.1730203967.git.maciej.szmigiero@oracle.com/
> > > > > > > https://lore.kernel.org/qemu-devel/e1949839932efaa531e2fe63ac13324e5787439c.1731773021.git.maciej.szmigiero@oracle.com/
> > > > > > > https://lore.kernel.org/qemu-devel/87o732bti7.fsf@suse.de/
> > > > > > 
> > > > > > It happens!
> > > > > > You could make this safer by having a load_state and a load_state_postcopy
> > > > > > member, and only mark the load_state as requiring the lock.
> > > > > 
> > > > > To not digress too much from the subject of this patch set
> > > > > (multifd VFIO device state transfer) for now I've just updated the
> > > > > TODO comment around that qemu_loadvm_state_main(), so hopefully this
> > > > > discussion won't get forgotten:
> > > > > https://gitlab.com/maciejsszmigiero/qemu/-/commit/046e3deac5b1dbc406b3e9571f62468bd6743e79
> > > > 
> > > > The commit message may still need some touch ups, e.g.:
> > > > 
> > > >     postcopy_ram_listen_thread() is a free running thread, so it needs to
> > > >     take BQL around function calls to migration methods requiring BQL.
> > > > 
> > > > 
> > > > This sentence is still not correct, IMHO. As Dave explained, the ram load
> > > > thread is designed to run without BQL at least for the major workloads it
> > > > runs.
> > > 
> > > So what's your proposed wording of this commit then?
> > 
> > Perhaps drop it?  Either it implies qemu_loadvm_state_main() needs to
> > take the BQL (which could be wrong in the case of postcopy, at least from
> > a design, not sanity-check, point of view), or it provides no real
> > guidance on where to take it..
> > 
> > Personally I would keep the comment as simple as possible - the large
> > portion isn't helping me understand the code; it only makes it slightly
> > more confusing..
> > 
> >      /*
> >       * TODO: qemu_loadvm_state_main() could call "load_state" SaveVMHandlers
> >       * that are expecting BQL to be held, which isn't in this case.
> >       *
> >       * In principle, the only devices that should be arriving on this channel
> >       * now are those that are postcopiable and whose load handlers are safe
> >       * to be called without BQL being held.
> >       *
> >       * But nothing currently prevents the source from sending data for "unsafe"
> >       * devices which would cause trouble here.
> >       */
> > 
> > IMHO we could put it very simple if you think we need such sanity check
> > later:
> > 
> >      /* TODO: sanity check that only postcopiable data will be loaded here */
> 
> I think I will change that comment wording to the one ^^^^ you suggested above,
> since we still need to have this commit to take BQL around that
> migration_incoming_state_destroy() call in postcopy_ram_listen_thread().
> 
> > > 
> > > > I don't worry about the src sending something that crashes the dest: if that
> > > > happens, that's a bug and we need to fix it..  In that case abort() either in
> > > > memory.c or migration/ would be the same.
> > > 
> > > Yeah, but it would be a bug in the source (or just bit stream corruption for
> > > any reason), yet it's the destination which would abort() or crash.
> > > 
> > > I think cases like that in principle should be handled more gracefully,
> > > like exiting the destination QEMU with an error.
> > > But that's something outside of the scope of this patch set.
> > 
> > Yes, I agree.  It's just that postcopy normally cannot gracefully quit on
> > dest anyway.. as src QEMU cannot continue with a dead dest QEMU.  For
> > obvious programming errors, I think abort() is still ok in this case, on
> > either src or dest if postcopy has already started.
> > 
> > For this series, we could always stick with precopy, it could help converge
> > the series.
> 
> To be clear I'm messing with postcopy only because without adding that
> BQL lock around migration_incoming_state_destroy() in
> postcopy_ram_listen_thread() other changes in this patch set would break
> postcopy.
> And that's obviously not acceptable.

Ah, of course.

> 
> > > 
> > > > We could add some explicit check
> > > > in migration code, but I don't expect it to catch anything real; at least
> > > > nothing like that has happened since postcopy was introduced.. so it's
> > > > roughly 10 years without anything like that happening.
> > > > 
> > > > Taking BQL for migration_incoming_state_destroy() looks all safe.  There's
> > > > one qemu_ram_block_writeback() which made me a bit nervous initially, but
> > > > it looks like for RAM backends it should be almost a noop (for shmem and
> > > > hugetlbfs), except for pmem.
> > > 
> > > That's the only part where taking BQL is actually necessary for the
> > > functionality of this patch set to work properly, so it's fine to leave
> > > that call to qemu_loadvm_state_main() as-is (without BQL) for the time being.
> > > 
> > > > 
> > > > The other alternative is we define load_cleanup() to not rely on BQL (which
> > > > I believe is true before this series?), then take it only when VFIO's path
> > > > needs it.
> > > 
> > > I think other paths always call load_cleanup() with BQL so it's probably
> > > safer to have consistent semantics here.
> > 
> > IMHO we don't necessarily need to make it the default that vmstate handler
> > hooks need the BQL - we can always properly define them to best
> > suit our needs.
> 
> But I think consistency is important - if other callers take BQL for
> load_cleanup() then it makes sense to take it in all places (if only to make
> the code simpler).

I assume the current QEMU master branch doesn't need the BQL for any of the
existing load_cleanup() implementations (only RAM and VFIO have one..), or am
I wrong?

Thanks,

-- 
Peter Xu




* Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels
  2025-02-03 22:56           ` Peter Xu
@ 2025-02-04 13:51             ` Fabiano Rosas
  2025-02-04 14:39             ` Maciej S. Szmigiero
  1 sibling, 0 replies; 137+ messages in thread
From: Fabiano Rosas @ 2025-02-04 13:51 UTC (permalink / raw)
  To: Peter Xu, Maciej S. Szmigiero
  Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P. Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

Peter Xu <peterx@redhat.com> writes:

> On Mon, Feb 03, 2025 at 10:41:32PM +0100, Maciej S. Szmigiero wrote:
>> On 3.02.2025 21:20, Peter Xu wrote:
>> > On Mon, Feb 03, 2025 at 07:53:00PM +0100, Maciej S. Szmigiero wrote:
>> > > On 3.02.2025 19:20, Peter Xu wrote:
>> > > > On Thu, Jan 30, 2025 at 11:08:29AM +0100, Maciej S. Szmigiero wrote:
>> > > > > From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>> > > > > 
>> > > > > Multifd send channels are terminated by calling
>> > > > > qio_channel_shutdown(QIO_CHANNEL_SHUTDOWN_BOTH) in
>> > > > > multifd_send_terminate_threads(), which in the TLS case essentially
>> > > > > calls shutdown(SHUT_RDWR) on the underlying raw socket.
>> > > > > 
>> > > > > Unfortunately, this does not terminate the TLS session properly and
>> > > > > the receive side sees this as a GNUTLS_E_PREMATURE_TERMINATION error.
>> > > > > 
>> > > > > The only reason why this wasn't causing migration failures is because
>> > > > > the current migration code apparently does not check for migration
>> > > > > error being set after the end of the multifd receive process.
>> > > > > 
>> > > > > However, this will change soon so the multifd receive code has to be
>> > > > > prepared to not return an error on such premature TLS session EOF.
>> > > > > Use the newly introduced QIOChannelTLS method for that.
>> > > > > 
>> > > > > It's worth noting that even if the sender were to be changed to terminate
>> > > > > the TLS connection properly the receive side still needs to remain
>> > > > > compatible with older QEMU bit stream which does not do this.
>> > > > 
>> > > > If this is an existing bug, we could add a Fixes.
>> > > 
>> > > It is an existing issue but only uncovered by this patch set.
>> > > 
>> > > As far as I can see it was always there, so it would need some
>> > > thought where to point that Fixes tag.
>> > 
>> > If there's no way to trigger a real functional bug anyway, it's also ok we
>> > omit the Fixes.
>> > 
>> > > > Two pure questions..
>> > > > 
>> > > >     - What is the correct way to terminate the TLS session without this flag?
>> > > 
>> > > I guess one would need to call gnutls_bye() like in this GnuTLS example:
>> > > https://gitlab.com/gnutls/gnutls/-/blob/2b8c3e4c71ad380bbbffb32e6003b34ecad596e3/doc/examples/ex-client-anon.c#L102
>> > > 
>> > > >     - Why this is only needed by multifd sessions?
>> > > 
>> > > What uncovered the issue was switching the load threads to using
>> > > migrate_set_error() instead of their own result variable
>> > > (load_threads_ret) which you had requested during the previous
>> > > patch set version review:
>> > > https://lore.kernel.org/qemu-devel/Z1DbH5fwBaxtgrvH@x1n/
>> > > 
>> > > Turns out that the multifd receive code always returned an
>> > > error in the TLS case, just nothing was previously checking for
>> > > the presence of that error.
>> > 
>> > What I was curious about is whether this issue also exists for the main
>> > migration channel with TLS, especially when e.g. multifd is not enabled at
>> > all.  I don't see anywhere that qemu uses gnutls_bye() for any TLS session.
>> > 
>> > I think it's good that we found we had overlooked this before.. and IMHO
>> > it's always good if we can fix it.
>> > 
>> > Does it mean we need proper gnutls_bye() somewhere?
>> > 
>> > If we need an explicit gnutls_bye(), then I wonder if that should be done
>> > on the main channel as well.
>> 
>> That's a good question.  Looking at the code, qemu_loadvm_state_main() exits
>> on receiving a "QEMU_VM_EOF" section (which is different from receiving socket EOF),
>> and then optionally a "QEMU_VM_VMDESCRIPTION" section is read with an explicit size
>> in qemu_loadvm_state() - so still not until channel EOF.
>
> I had a closer look, and I do feel like such premature termination is caused
> by the explicit shutdown()s of the iochannels; it looks like that can cause
> issues even after everything has been sent.  Then I noticed that multifd
> sender iochannels do get explicit shutdown()s since commit 077fbb5942, while
> we don't do that for the main channel.  Maybe that is a major difference.
>
> Now I wonder whether we should shutdown() the channel at all if migration
> succeeded, because it looks like doing so can interrupt the TLS session even
> if the shutdown() happens after everything has been sent, and if so it would
> explain why you hit the issue with TLS.
>
>> 
>> Then I can't see anything else reading the channel until it is closed in
>> migration_incoming_state_destroy().
>> 
>> So most likely the main migration channel will never read far enough to
>> reach that GNUTLS_E_PREMATURE_TERMINATION error.
>> 
>> > If we don't need gnutls_bye(), then should we always ignore pre-mature
>> > termination of tls no matter if it's multifd or non-multifd channel (or
>> > even a tls session that is not migration-related)?
>> 
>> So basically have this patch extended to calling
>> qio_channel_tls_set_premature_eof_okay() also on the main migration channel?
>
> If the above theory stands, then eof-okay would be a workaround papering
> over the real problem, which is that we shouldn't always shutdown()..
>
> Could you have a look at the patch below and see whether it fixes the
> problem you hit too, in place of these two patches (including the previous
> iochannel change)?
>
> Thanks,
>
> ===8<===
> From 3147084174b0e0bda076ad205ae139f8fc433892 Mon Sep 17 00:00:00 2001
> From: Peter Xu <peterx@redhat.com>
> Date: Mon, 3 Feb 2025 17:27:45 -0500
> Subject: [PATCH] migration: Avoid shutting down multifd channels if migration
>  succeeded
> MIME-Version: 1.0
> Content-Type: text/plain; charset=UTF-8
> Content-Transfer-Encoding: 8bit
>
> Multifd channels behave differently from the main channel when shutting
> down: the sender side will always shutdown() the multifd iochannels, no
> matter whether migration succeeded or not.  QEMU doesn't do that for the
> main channel on the src side.
>
> Such behavior was introduced in commit 077fbb5942 ("multifd: Shut down the
> QIO channels to avoid blocking the send threads when they are terminated.")
> to fix a hang issue when multifd enabled.

I'm always skeptical of "hangs" being fixed with shutdown(), when the
multifd code has had multiple issues with incorrect locking and ordering
of thread creation/deletion.

>
> This might be problematic though, especially in a TLS context, because it
> looks like such a shutdown() on src (even on a successful migration) could
> cause destination multifd iochannels to receive premature terminations of
> TLS sessions.

Speaking of TLS, enabling asan and running the TLS tests crashes QEMU in
a very obscure way. May or may not be related to thread termination
issues. I've been skipping postcopy/recovery/tls/psk when doing asan
builds for a while now. Just FYI, I'm not asking we take action on this
and I'm keeping an eye on it.

>
> It's debatable whether such a shutdown() should be explicitly done even for
> a succeeded migration.  This patch moves the shutdown() from the
> finalization phase into qmp_migrate_cancel(), so that we only do the
> shutdown() on cancel, and avoid it when migration succeeds.
>
> While at it, keep all the shutdown() code together, moving the return path
> shutdown() (which seems a bit redundant, but does no harm) to where the
> rest of the channels are shut down.
>
> Cc: Li Zhang <lizhang@suse.de>

This will bounce, she's no longer at SUSE.

> Cc: Dr. David Alan Gilbert <dave@treblig.org>
> Cc: Daniel P. Berrangé <berrange@redhat.com>
> Reported-by: Maciej S. Szmigiero <mail@maciej.szmigiero.name>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  migration/multifd.h   |  1 +
>  migration/migration.c | 24 +++++++++++++++++-------
>  migration/multifd.c   | 14 +++++++++++---
>  3 files changed, 29 insertions(+), 10 deletions(-)
>
> diff --git a/migration/multifd.h b/migration/multifd.h
> index bd785b9873..26ef94ac93 100644
> --- a/migration/multifd.h
> +++ b/migration/multifd.h
> @@ -340,6 +340,7 @@ static inline void multifd_send_prepare_header(MultiFDSendParams *p)
>  
>  void multifd_channel_connect(MultiFDSendParams *p, QIOChannel *ioc);
>  bool multifd_send(MultiFDSendData **send_data);
> +void multifd_send_shutdown_iochannels(void);
>  MultiFDSendData *multifd_send_data_alloc(void);
>  
>  static inline uint32_t multifd_ram_page_size(void)
> diff --git a/migration/migration.c b/migration/migration.c
> index 74c50cc72c..e43f8222dc 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -1565,13 +1565,6 @@ static void migrate_fd_cancel(MigrationState *s)
>  
>      trace_migrate_fd_cancel();
>  
> -    WITH_QEMU_LOCK_GUARD(&s->qemu_file_lock) {
> -        if (s->rp_state.from_dst_file) {
> -            /* shutdown the rp socket, so causing the rp thread to shutdown */
> -            qemu_file_shutdown(s->rp_state.from_dst_file);
> -        }
> -    }
> -
>      do {
>          old_state = s->state;
>          if (!migration_is_running()) {
> @@ -1594,6 +1587,23 @@ static void migrate_fd_cancel(MigrationState *s)
>              if (s->to_dst_file) {
>                  qemu_file_shutdown(s->to_dst_file);
>              }
> +            /*
> +             * The above should already work, because the iochannel is
> +             * shared between the outgoing and return path qemufiles;
> +             * however, just to be on the safe side, set the qemufile
> +             * error on the return path too if it exists.
> +             */
> +            if (s->rp_state.from_dst_file) {
> +                qemu_file_shutdown(s->rp_state.from_dst_file);
> +            }
> +        }
> +
> +        /*
> +         * We need to shutdown multifd channels too if they are available,
> +         * to make sure no multifd send threads will be stuck at syscalls.
> +         */
> +        if (migrate_multifd()) {
> +            multifd_send_shutdown_iochannels();
>          }
>      }
>  
> diff --git a/migration/multifd.c b/migration/multifd.c
> index ab73d6d984..96bcbb1e0c 100644
> --- a/migration/multifd.c
> +++ b/migration/multifd.c
> @@ -384,6 +384,17 @@ static void multifd_send_set_error(Error *err)
>      }
>  }
>  
> +void multifd_send_shutdown_iochannels(void)
> +{
> +    QIOChannel *c;
> +    int i;
> +
> +    for (i = 0; i < migrate_multifd_channels(); i++) {
> +        c = multifd_send_state->params[i].c;
> +        qio_channel_shutdown(c, QIO_CHANNEL_SHUTDOWN_BOTH, NULL);
> +    }
> +}
> +
>  static void multifd_send_terminate_threads(void)
>  {
>      int i;
> @@ -404,9 +415,6 @@ static void multifd_send_terminate_threads(void)
>          MultiFDSendParams *p = &multifd_send_state->params[i];
>  
>          qemu_sem_post(&p->sem);
> -        if (p->c) {
> -            qio_channel_shutdown(p->c, QIO_CHANNEL_SHUTDOWN_BOTH, NULL);
> -        }
>      }
>  
>      /*
> -- 
> 2.47.0


^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels
  2025-02-03 22:56           ` Peter Xu
  2025-02-04 13:51             ` Fabiano Rosas
@ 2025-02-04 14:39             ` Maciej S. Szmigiero
  2025-02-04 15:00               ` Fabiano Rosas
  2025-02-04 15:31               ` Peter Xu
  1 sibling, 2 replies; 137+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-04 14:39 UTC (permalink / raw)
  To: Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P. Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

On 3.02.2025 23:56, Peter Xu wrote:
> On Mon, Feb 03, 2025 at 10:41:32PM +0100, Maciej S. Szmigiero wrote:
>> On 3.02.2025 21:20, Peter Xu wrote:
>>> On Mon, Feb 03, 2025 at 07:53:00PM +0100, Maciej S. Szmigiero wrote:
>>>> On 3.02.2025 19:20, Peter Xu wrote:
>>>>> On Thu, Jan 30, 2025 at 11:08:29AM +0100, Maciej S. Szmigiero wrote:
>>>>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>>>>
>>>>>> Multifd send channels are terminated by calling
>>>>>> qio_channel_shutdown(QIO_CHANNEL_SHUTDOWN_BOTH) in
>>>>>> multifd_send_terminate_threads(), which in the TLS case essentially
>>>>>> calls shutdown(SHUT_RDWR) on the underlying raw socket.
>>>>>>
>>>>>> Unfortunately, this does not terminate the TLS session properly and
>>>>>> the receive side sees this as a GNUTLS_E_PREMATURE_TERMINATION error.
>>>>>>
>>>>>> The only reason why this wasn't causing migration failures is because
>>>>>> the current migration code apparently does not check for migration
>>>>>> error being set after the end of the multifd receive process.
>>>>>>
>>>>>> However, this will change soon so the multifd receive code has to be
>>>>>> prepared to not return an error on such premature TLS session EOF.
>>>>>> Use the newly introduced QIOChannelTLS method for that.
>>>>>>
>>>>>> It's worth noting that even if the sender were to be changed to terminate
>>>>>> the TLS connection properly the receive side still needs to remain
>>>>>> compatible with older QEMU bit stream which does not do this.
>>>>>
>>>>> If this is an existing bug, we could add a Fixes.
>>>>
>>>> It is an existing issue but only uncovered by this patch set.
>>>>
>>>> As far as I can see it was always there, so it would need some
>>>> thought where to point that Fixes tag.
>>>
>>> If there's no way to trigger a real functional bug anyway, it's also ok we
>>> omit the Fixes.
>>>
>>>>> Two pure questions..
>>>>>
>>>>>      - What is the correct way to terminate the TLS session without this flag?
>>>>
>>>> I guess one would need to call gnutls_bye() like in this GnuTLS example:
>>>> https://gitlab.com/gnutls/gnutls/-/blob/2b8c3e4c71ad380bbbffb32e6003b34ecad596e3/doc/examples/ex-client-anon.c#L102
>>>>
>>>>>      - Why this is only needed by multifd sessions?
>>>>
>>>> What uncovered the issue was switching the load threads to using
>>>> migrate_set_error() instead of their own result variable
>>>> (load_threads_ret) which you had requested during the previous
>>>> patch set version review:
>>>> https://lore.kernel.org/qemu-devel/Z1DbH5fwBaxtgrvH@x1n/
>>>>
>>>> Turns out that the multifd receive code always returned
>>>> error in the TLS case, just nothing was previously checking for
>>>> that error presence.
>>>
>>> What I was curious is whether this issue also exists for the main migration
>>> channel when with tls, especially when e.g. multifd not enabled at all.  As
>>> I don't see anywhere that qemu uses gnutls_bye() for any tls session.
>>>
>>> I think it's good to find that we overlooked this before.. and IMHO it's
>>> always good we could fix this.
>>>
>>> Does it mean we need proper gnutls_bye() somewhere?
>>>
>>> If we need an explicit gnutls_bye(), then I wonder if that should be done
>>> on the main channel as well.
>>
>> That's a good question and looking at the code qemu_loadvm_state_main() exits
>> on receiving "QEMU_VM_EOF" section (that's different from receiving socket EOF)
>> and then optionally "QEMU_VM_VMDESCRIPTION" section is read with explicit size
>> in qemu_loadvm_state() - so still not until channel EOF.
> 
> I had a closer look, I do feel like such pre-mature termination is caused
> by explicit shutdown()s of the iochannels, looks like that can cause issue
> even after everything is sent.  Then I noticed indeed multifd sender
> iochannels will get explicit shutdown()s since commit 077fbb5942, while we
> don't do that for the main channel.  Maybe that is a major difference.
> 
> Now I wonder whether we should shutdown() the channel at all if migration
> succeeded, because looks like it can cause tls session to interrupt even if
> the shutdown() is done after sent everything, and if so it'll explain why
> you hit the issue with tls.
> 
>>
>> Then I can't see anything else reading the channel until it is closed in
>> migration_incoming_state_destroy().
>>
>> So most likely the main migration channel will never read far enough to
>> reach that GNUTLS_E_PREMATURE_TERMINATION error.
>>
>>> If we don't need gnutls_bye(), then should we always ignore pre-mature
>>> termination of tls no matter if it's multifd or non-multifd channel (or
>>> even a tls session that is not migration-related)?
>>
>> So basically have this patch extended to calling
>> qio_channel_tls_set_premature_eof_okay() also on the main migration channel?
> 
> If above theory can stand, then eof-okay could be a workaround papering
> over the real problem that we shouldn't always shutdown()..
> 
> Could you have a look at below patch and see whether it can fix the problem
> you hit too, in replace of these two patches (including the previous
> iochannel change)?
> 

Unfortunately, the patch below does not fix the problem:
> qemu-system-x86_64: Cannot read from TLS channel: The TLS connection was non-properly terminated.
> qemu-system-x86_64: Cannot read from TLS channel: The TLS connection was non-properly terminated.

I think that, even in the absence of shutdown(), if the sender does not
call gnutls_bye() the TLS session is considered improperly terminated.

> Thanks,
> 
Thanks,
Maciej



^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH v4 13/33] migration/multifd: Device state transfer support - receive side
  2025-02-03 22:59       ` Peter Xu
@ 2025-02-04 14:40         ` Maciej S. Szmigiero
  0 siblings, 0 replies; 137+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-04 14:40 UTC (permalink / raw)
  To: Peter Xu
  Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P. Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

On 3.02.2025 23:59, Peter Xu wrote:
> On Mon, Feb 03, 2025 at 11:18:11PM +0100, Maciej S. Szmigiero wrote:
>> On 3.02.2025 22:27, Peter Xu wrote:
>>> On Thu, Jan 30, 2025 at 11:08:34AM +0100, Maciej S. Szmigiero wrote:
>>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>>
>>>> Add a basic support for receiving device state via multifd channels -
>>>> channels that are shared with RAM transfers.
>>>>
>>>> Depending whether MULTIFD_FLAG_DEVICE_STATE flag is present or not in the
>>>> packet header either device state (MultiFDPacketDeviceState_t) or RAM
>>>> data (existing MultiFDPacket_t) is read.
>>>>
>>>> The received device state data is provided to
>>>> qemu_loadvm_load_state_buffer() function for processing in the
>>>> device's load_state_buffer handler.
>>>>
>>>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>>>
>>> I think I acked this one.  You could keep my R-b if...
>>>
>>> [...]
>>>
>>>> diff --git a/migration/multifd.h b/migration/multifd.h
>>>> index 9e4baa066312..abf3acdcee40 100644
>>>> --- a/migration/multifd.h
>>>> +++ b/migration/multifd.h
>>>> @@ -62,6 +62,12 @@ MultiFDRecvData *multifd_get_recv_data(void);
>>>>    #define MULTIFD_FLAG_UADK (8 << 1)
>>>>    #define MULTIFD_FLAG_QATZIP (16 << 1)
>>>> +/*
>>>> + * If set it means that this packet contains device state
>>>> + * (MultiFDPacketDeviceState_t), not RAM data (MultiFDPacket_t).
>>>> + */
>>>> +#define MULTIFD_FLAG_DEVICE_STATE (1 << 6)
>>>
>>> ... if this won't conflict with MULTIFD_FLAG_QATZIP.
>>
>> Hmm, isn't (16 << 1) = 32 while (1 << 6) = 64?
> 
> Oops. :)
> 
>>> I think we should stick with one way to write it, then when rebase you can
>>> see such conflicts - either your patch uses 32 << 1, or perhaps we should
>>> start to switch to BIT() for all above instead..
> 
> Still, do you mind switch to "32 << 1" (or use BIT())?

I will switch to 32 << 1 for consistency with the compression flags
above.

> With either, feel free to take:
> 
> Reviewed-by: Peter Xu <peterx@redhat.com>
> 

Thanks,
Maciej



^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH v4 09/33] migration: postcopy_ram_listen_thread() needs to take BQL for some calls
  2025-02-03 23:02                   ` Peter Xu
@ 2025-02-04 14:57                     ` Maciej S. Szmigiero
  2025-02-04 15:39                       ` Peter Xu
  0 siblings, 1 reply; 137+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-04 14:57 UTC (permalink / raw)
  To: Peter Xu
  Cc: Dr. David Alan Gilbert, Fabiano Rosas, Alex Williamson,
	Cédric Le Goater, Eric Blake, Markus Armbruster,
	Daniel P. Berrangé, Avihai Horon, Joao Martins, qemu-devel

On 4.02.2025 00:02, Peter Xu wrote:
> On Mon, Feb 03, 2025 at 10:41:43PM +0100, Maciej S. Szmigiero wrote:
>> On 3.02.2025 21:36, Peter Xu wrote:
>>> On Mon, Feb 03, 2025 at 09:15:52PM +0100, Maciej S. Szmigiero wrote:
>>>> On 3.02.2025 20:58, Peter Xu wrote:
>>>>> On Mon, Feb 03, 2025 at 02:57:36PM +0100, Maciej S. Szmigiero wrote:
>>>>>> On 2.02.2025 13:45, Dr. David Alan Gilbert wrote:
>>>>>>> * Maciej S. Szmigiero (mail@maciej.szmigiero.name) wrote:
>>>>>>>> On 2.02.2025 03:06, Dr. David Alan Gilbert wrote:
>>>>>>>>> * Maciej S. Szmigiero (mail@maciej.szmigiero.name) wrote:
>>>>>>>>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>>>>>>>>
>>>>>>>>>> postcopy_ram_listen_thread() is a free running thread, so it needs to
>>>>>>>>>> take BQL around function calls to migration methods requiring BQL.
>>>>>>>>>>
>>>>>>>>>> qemu_loadvm_state_main() needs BQL held since it ultimately calls
>>>>>>>>>> "load_state" SaveVMHandlers.
>>>>>>>>>>
>>>>>>>>>> migration_incoming_state_destroy() needs BQL held since it ultimately calls
>>>>>>>>>> "load_cleanup" SaveVMHandlers.
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>>>>>>>>>> ---
>>>>>>>>>>       migration/savevm.c | 4 ++++
>>>>>>>>>>       1 file changed, 4 insertions(+)
>>>>>>>>>>
>>>>>>>>>> diff --git a/migration/savevm.c b/migration/savevm.c
>>>>>>>>>> index b0b74140daea..0ceea9638cc1 100644
>>>>>>>>>> --- a/migration/savevm.c
>>>>>>>>>> +++ b/migration/savevm.c
>>>>>>>>>> @@ -2013,7 +2013,9 @@ static void *postcopy_ram_listen_thread(void *opaque)
>>>>>>>>>>            * in qemu_file, and thus we must be blocking now.
>>>>>>>>>>            */
>>>>>>>>>>           qemu_file_set_blocking(f, true);
>>>>>>>>>> +    bql_lock();
>>>>>>>>>>           load_res = qemu_loadvm_state_main(f, mis);
>>>>>>>>>> +    bql_unlock();
>>>>>>>>>
>>>>>>>>> Doesn't that leave that held for a heck of a long time?
>>>>>>>>
>>>>>>>> Yes, and it effectively broke "postcopy recover" test but I
>>>>>>>> think the reason for that is qemu_loadvm_state_main() and
>>>>>>>> its children don't drop BQL while waiting for I/O.
>>>>>>>>
>>>>>>>> I've described this case in more detail in my reply to Fabiano here:
>>>>>>>> https://lore.kernel.org/qemu-devel/0a09e627-955e-4f26-8d08-0192ecd250a8@maciej.szmigiero.name/
>>>>>>>
>>>>>>> While it might be the cause in this case, my feeling is it's more fundamental
>>>>>>> here - it's the whole reason that postcopy has a separate ram listen
>>>>>>> thread.  As the destination is running, after it loads it's devices
>>>>>>> and as it starts up the destination will be still loading RAM
>>>>>>> (and other postcopiable devices) potentially for quite a while.
>>>>>>> Holding the bql around the ram listen thread means that the
>>>>>>> execution of the destination won't be able to take that lock
>>>>>>> until the postcopy load has finished; so while that might apparently
>>>>>>> complete, it'll lead to the destination stalling until that's finished
>>>>>>> which defeats the whole point of postcopy.
>>>>>>> That last one probably won't fail a test but it will lead to a long stall
>>>>>>> if you give it a nice big guest with lots of RAM that it's rapidly
>>>>>>> changing.
>>>>>>
>>>>>> Okay, I understand the postcopy case/flow now.
>>>>>> Thanks for explaining it clearly.
>>>>>>
>>>>>>>> I still think that "load_state" SaveVMHandlers need to be called
>>>>>>>> with BQL held since implementations apparently expect it that way:
>>>>>>>> for example, I think PCI device configuration restore calls
>>>>>>>> address space manipulation methods which abort() if called
>>>>>>>> without BQL held.
>>>>>>>
>>>>>>> However, the only devices that *should* be arriving on the channel
>>>>>>> that the postcopy_ram_listen_thread is reading from are those
>>>>>>> that are postcopiable (i.e. RAM and hmm block's dirty_bitmap).
>>>>>>> Those load handlers are safe to be run while the other devices
>>>>>>> are being changed.   Note the *should* - you could add a check
>>>>>>> to fail if any other device arrives on that channel.
>>>>>>
>>>>>> I think ultimately there should be either an explicit check, or,
>>>>>> as you suggest in the paragraph below, a separate SaveVMHandler
>>>>>> that runs without BQL held.
>>>>>
>>>>> To me those are bugs happening during postcopy, so those abort()s in
>>>>> memory.c are indeed for catching these issues too.
>>>>>
>>>>>> Since the current state of just running these SaveVMHandlers
>>>>>> without BQL in this case and hoping that nothing breaks is
>>>>>> clearly sub-optimal.
>>>>>>
>>>>>>>> I have previously even submitted a patch to explicitly document
>>>>>>>> "load_state" SaveVMHandler as requiring BQL (which was also
>>>>>>>> included in the previous version of this patch set) and it
>>>>>>>> received a "Reviewed-by:" tag:
>>>>>>>> https://lore.kernel.org/qemu-devel/6976f129df610c8207da4e531c8c0475ec204fa4.1730203967.git.maciej.szmigiero@oracle.com/
>>>>>>>> https://lore.kernel.org/qemu-devel/e1949839932efaa531e2fe63ac13324e5787439c.1731773021.git.maciej.szmigiero@oracle.com/
>>>>>>>> https://lore.kernel.org/qemu-devel/87o732bti7.fsf@suse.de/
>>>>>>>
>>>>>>> It happens!
>>>>>>> You could make this safer by having a load_state and a load_state_postcopy
>>>>>>> member, and only mark the load_state as requiring the lock.
>>>>>>
>>>>>> To not digress too much from the subject of this patch set
>>>>>> (multifd VFIO device state transfer) for now I've just updated the
>>>>>> TODO comment around that qemu_loadvm_state_main(), so hopefully this
>>>>>> discussion won't get forgotten:
>>>>>> https://gitlab.com/maciejsszmigiero/qemu/-/commit/046e3deac5b1dbc406b3e9571f62468bd6743e79
>>>>>
>>>>> The commit message may still need some touch ups, e.g.:
>>>>>
>>>>>      postcopy_ram_listen_thread() is a free running thread, so it needs to
>>>>>      take BQL around function calls to migration methods requiring BQL.
>>>>>
>>>>>
>>>>> This sentence is still not correct, IMHO. As Dave explained, the ram load
>>>>> thread is designed to run without BQL at least for the major workloads it
>>>>> runs.
>>>>
>>>> So what's your proposed wording of this commit then?
>>>
>>> Perhaps dropping it? As either it implies qemu_loadvm_state_main() needs to
>>> take bql (which could be wrong in case of postcopy at least from
>>> design.. not sanity check pov), or it provides no real meaning to suggest
>>> where to take it..
>>>
>>> Personally I would put the comment as easy as possible - the large portion
>>> isn't helping me to understand the code but only made it slightly more
>>> confusing..
>>>
>>>       /*
>>>        * TODO: qemu_loadvm_state_main() could call "load_state" SaveVMHandlers
>>>        * that are expecting BQL to be held, which isn't in this case.
>>>        *
>>>        * In principle, the only devices that should be arriving on this channel
>>>        * now are those that are postcopiable and whose load handlers are safe
>>>        * to be called without BQL being held.
>>>        *
>>>        * But nothing currently prevents the source from sending data for "unsafe"
>>>        * devices which would cause trouble here.
>>>        */
>>>
>>> IMHO we could put it very simple if you think we need such sanity check
>>> later:
>>>
>>>       /* TODO: sanity check that only postcopiable data will be loaded here */
>>
>> I think I will change that comment wording to the one ^^^^ you suggested above,
>> since we still need to have this commit to take BQL around that
>> migration_incoming_state_destroy() call in postcopy_ram_listen_thread().
>>
>>>>
>>>>> I don't worry on src sending something that crashes the dest: if that
>>>>> happens, that's a bug, we need to fix it..  In that case abort() either in
>>>>> memory.c or migration/ would be the same.
>>>>
>>>> Yeah, but it would be a bug in the source (or just bit stream corruption for
>>>> any reason), yet it's the destination which would abort() or crash.
>>>>
>>>> I think cases like that in principle should be handled more gracefully,
>>>> like exiting the destination QEMU with an error.
>>>> But that's something outside of the scope of this patch set.
>>>
>>> Yes I agree.  It's just that postcopy normally cannot gracefully quits on
>>> dest anyway.. as src QEMU cannot continue with a dead dest QEMU. For
>>> obvious programming errors, I think abort() is still ok in this case, on
>>> either src or dest if postcopy already started.
>>>
>>> For this series, we could always stick with precopy, it could help converge
>>> the series.
>>
>> To be clear I'm messing with postcopy only because without adding that
>> BQL lock around migration_incoming_state_destroy() in
>> postcopy_ram_listen_thread() other changes in this patch set would break
>> postcopy.
>> And that's obviously not acceptable.
> 
> Ah, of course.
> 
>>
>>>>
>>>>> We could add some explicit check
>>>>> in migration code, but I don't expect it to catch anything real, at least
>>>>> such never happened since postcopy introduced.. so it's roughly 10 years
>>>>> without anything like that happens.
>>>>>
>>>>> Taking BQL for migration_incoming_state_destroy() looks all safe.  There's
>>>>> one qemu_ram_block_writeback() which made me a bit nervous initially, but
>>>>> then looks like RAM backends should be almost noop (for shmem and
>>>>> hugetlbfs) but except pmem.
>>>>
>>>> That's the only part where taking BQL is actually necessary for the
>>>> functionality of this patch set to work properly, so it's fine to leave
>>>> that call to qemu_loadvm_state_main() as-is (without BQL) for time being.
>>>>
>>>>>
>>>>> The other alternative is we define load_cleanup() to not rely on BQL (which
>>>>> I believe is true before this series?), then take it only when VFIO's path
>>>>> needs it.
>>>>
>>>> I think other paths always call load_cleanup() with BQL so it's probably
>>>> safer to have consistent semantics here.
>>>
>>> IMHO we don't necessarily need to make it the default that vmstate handler
>>> hooks will need BQL by default - we can always properly define them to best
>>> suite our need.
>>
>> But I think consistency is important - if other callers take BQL for
>> load_cleanup() then it makes sense to take it in all places (only if to make
>> the code simpler).
> 
> I assume current QEMU master branch doesn't need bql for all existing (only
> RAM and VFIO..) load_cleanup(), or am I wrong?

The vfio_migration_cleanup() used to just close a migration FD, while
RAM might end up calling qemu_ram_msync(), which sounds like something
that should be called under BQL.

But I am not sure whether that lack of BQL around qemu_ram_msync()
actually causes problems.

> Thanks,
> 

Thanks,
Maciej



^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels
  2025-02-04 14:39             ` Maciej S. Szmigiero
@ 2025-02-04 15:00               ` Fabiano Rosas
  2025-02-04 15:10                 ` Maciej S. Szmigiero
  2025-02-04 15:31               ` Peter Xu
  1 sibling, 1 reply; 137+ messages in thread
From: Fabiano Rosas @ 2025-02-04 15:00 UTC (permalink / raw)
  To: Maciej S. Szmigiero, Peter Xu
  Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P. Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

"Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:

> On 3.02.2025 23:56, Peter Xu wrote:
>> On Mon, Feb 03, 2025 at 10:41:32PM +0100, Maciej S. Szmigiero wrote:
>>> On 3.02.2025 21:20, Peter Xu wrote:
>>>> On Mon, Feb 03, 2025 at 07:53:00PM +0100, Maciej S. Szmigiero wrote:
>>>>> On 3.02.2025 19:20, Peter Xu wrote:
>>>>>> On Thu, Jan 30, 2025 at 11:08:29AM +0100, Maciej S. Szmigiero wrote:
>>>>>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>>>>>
>>>>>>> Multifd send channels are terminated by calling
>>>>>>> qio_channel_shutdown(QIO_CHANNEL_SHUTDOWN_BOTH) in
>>>>>>> multifd_send_terminate_threads(), which in the TLS case essentially
>>>>>>> calls shutdown(SHUT_RDWR) on the underlying raw socket.
>>>>>>>
>>>>>>> Unfortunately, this does not terminate the TLS session properly and
>>>>>>> the receive side sees this as a GNUTLS_E_PREMATURE_TERMINATION error.
>>>>>>>
>>>>>>> The only reason why this wasn't causing migration failures is because
>>>>>>> the current migration code apparently does not check for migration
>>>>>>> error being set after the end of the multifd receive process.
>>>>>>>
>>>>>>> However, this will change soon so the multifd receive code has to be
>>>>>>> prepared to not return an error on such premature TLS session EOF.
>>>>>>> Use the newly introduced QIOChannelTLS method for that.
>>>>>>>
>>>>>>> It's worth noting that even if the sender were to be changed to terminate
>>>>>>> the TLS connection properly the receive side still needs to remain
>>>>>>> compatible with older QEMU bit stream which does not do this.
>>>>>>
>>>>>> If this is an existing bug, we could add a Fixes.
>>>>>
>>>>> It is an existing issue but only uncovered by this patch set.
>>>>>
>>>>> As far as I can see it was always there, so it would need some
>>>>> thought where to point that Fixes tag.
>>>>
>>>> If there's no way to trigger a real functional bug anyway, it's also ok we
>>>> omit the Fixes.
>>>>
>>>>>> Two pure questions..
>>>>>>
>>>>>>      - What is the correct way to terminate the TLS session without this flag?
>>>>>
>>>>> I guess one would need to call gnutls_bye() like in this GnuTLS example:
>>>>> https://gitlab.com/gnutls/gnutls/-/blob/2b8c3e4c71ad380bbbffb32e6003b34ecad596e3/doc/examples/ex-client-anon.c#L102
>>>>>
>>>>>>      - Why this is only needed by multifd sessions?
>>>>>
>>>>> What uncovered the issue was switching the load threads to using
>>>>> migrate_set_error() instead of their own result variable
>>>>> (load_threads_ret) which you had requested during the previous
>>>>> patch set version review:
>>>>> https://lore.kernel.org/qemu-devel/Z1DbH5fwBaxtgrvH@x1n/
>>>>>
>>>>> Turns out that the multifd receive code always returned
>>>>> error in the TLS case, just nothing was previously checking for
>>>>> that error presence.
>>>>
>>>> What I was curious is whether this issue also exists for the main migration
>>>> channel when with tls, especially when e.g. multifd not enabled at all.  As
>>>> I don't see anywhere that qemu uses gnutls_bye() for any tls session.
>>>>
>>>> I think it's good to find that we overlooked this before.. and IMHO it's
>>>> always good we could fix this.
>>>>
>>>> Does it mean we need proper gnutls_bye() somewhere?
>>>>
>>>> If we need an explicit gnutls_bye(), then I wonder if that should be done
>>>> on the main channel as well.
>>>
>>> That's a good question and looking at the code qemu_loadvm_state_main() exits
>>> on receiving "QEMU_VM_EOF" section (that's different from receiving socket EOF)
>>> and then optionally "QEMU_VM_VMDESCRIPTION" section is read with explicit size
>>> in qemu_loadvm_state() - so still not until channel EOF.
>> 
>> I had a closer look, I do feel like such pre-mature termination is caused
>> by explicit shutdown()s of the iochannels, looks like that can cause issue
>> even after everything is sent.  Then I noticed indeed multifd sender
>> iochannels will get explicit shutdown()s since commit 077fbb5942, while we
>> don't do that for the main channel.  Maybe that is a major difference.
>> 
>> Now I wonder whether we should shutdown() the channel at all if migration
>> succeeded, because looks like it can cause tls session to interrupt even if
>> the shutdown() is done after sent everything, and if so it'll explain why
>> you hit the issue with tls.
>> 
>>>
>>> Then I can't see anything else reading the channel until it is closed in
>>> migration_incoming_state_destroy().
>>>
>>> So most likely the main migration channel will never read far enough to
>>> reach that GNUTLS_E_PREMATURE_TERMINATION error.
>>>
>>>> If we don't need gnutls_bye(), then should we always ignore pre-mature
>>>> termination of tls no matter if it's multifd or non-multifd channel (or
>>>> even a tls session that is not migration-related)?
>>>
>>> So basically have this patch extended to calling
>>> qio_channel_tls_set_premature_eof_okay() also on the main migration channel?
>> 
>> If above theory can stand, then eof-okay could be a workaround papering
>> over the real problem that we shouldn't always shutdown()..
>> 
>> Could you have a look at below patch and see whether it can fix the problem
>> you hit too, in replace of these two patches (including the previous
>> iochannel change)?
>> 
>
> Unfortunately, the patch below does not fix the problem:
>> qemu-system-x86_64: Cannot read from TLS channel: The TLS connection was non-properly terminated.
>> qemu-system-x86_64: Cannot read from TLS channel: The TLS connection was non-properly terminated.
>
> I think that, even in the absence of shutdown(), if the sender does not
> call gnutls_bye() the TLS session is considered improperly terminated.
>

I haven't looked much further into this, but can we craft a reproducer
for it with current master code? It would help us take a look at this
problem independently of this series. Even an assert somewhere would
help.

>> Thanks,
>> 
> Thanks,
> Maciej


^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels
  2025-02-03 18:20   ` Peter Xu
  2025-02-03 18:53     ` Maciej S. Szmigiero
@ 2025-02-04 15:08     ` Daniel P. Berrangé
  2025-02-04 16:02       ` Peter Xu
  1 sibling, 1 reply; 137+ messages in thread
From: Daniel P. Berrangé @ 2025-02-04 15:08 UTC (permalink / raw)
  To: Peter Xu
  Cc: Maciej S. Szmigiero, Fabiano Rosas, Alex Williamson,
	Cédric Le Goater, Eric Blake, Markus Armbruster,
	Avihai Horon, Joao Martins, qemu-devel

On Mon, Feb 03, 2025 at 01:20:01PM -0500, Peter Xu wrote:
> On Thu, Jan 30, 2025 at 11:08:29AM +0100, Maciej S. Szmigiero wrote:
> > From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> > 
> > Multifd send channels are terminated by calling
> > qio_channel_shutdown(QIO_CHANNEL_SHUTDOWN_BOTH) in
> > multifd_send_terminate_threads(), which in the TLS case essentially
> > calls shutdown(SHUT_RDWR) on the underlying raw socket.
> > 
> > Unfortunately, this does not terminate the TLS session properly and
> > the receive side sees this as a GNUTLS_E_PREMATURE_TERMINATION error.
> > 
> > The only reason why this wasn't causing migration failures is because
> > the current migration code apparently does not check for migration
> > error being set after the end of the multifd receive process.
> > 
> > However, this will change soon so the multifd receive code has to be
> > prepared to not return an error on such premature TLS session EOF.
> > Use the newly introduced QIOChannelTLS method for that.
> > 
> > It's worth noting that even if the sender were to be changed to terminate
> > the TLS connection properly the receive side still needs to remain
> > compatible with older QEMU bit stream which does not do this.
> 
> If this is an existing bug, we could add a Fixes.
> 
> Two pure questions..
> 
>   - What is the correct way to terminate the TLS session without this flag?
> 
>   - Why this is only needed by multifd sessions?

Graceful TLS termination (via gnutls_bye()) should only be important to
security if the QEMU protocol in question does not know how much data it
is expecting to receive, i.e. it cannot otherwise distinguish between an
expected EOF, and a premature EOF triggered by an attacker.

If the migration protocol has sufficient info to know when a channel is
expected to see EOF, then we should stop trying to read from the TLS
channel before seeing the underlying EOF.

Ignoring GNUTLS_E_PREMATURE_TERMINATION would be valid if we know that
migration will still fail correctly in the case of a malicious attack
causing premature termination.

If there's a risk that migration may succeed, but with incomplete data,
then we would need the full gnutls_bye dance.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels
  2025-02-04 15:00               ` Fabiano Rosas
@ 2025-02-04 15:10                 ` Maciej S. Szmigiero
  0 siblings, 0 replies; 137+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-04 15:10 UTC (permalink / raw)
  To: Fabiano Rosas, Peter Xu
  Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P. Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

On 4.02.2025 16:00, Fabiano Rosas wrote:
> "Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:
> 
>> On 3.02.2025 23:56, Peter Xu wrote:
>>> On Mon, Feb 03, 2025 at 10:41:32PM +0100, Maciej S. Szmigiero wrote:
>>>> On 3.02.2025 21:20, Peter Xu wrote:
>>>>> On Mon, Feb 03, 2025 at 07:53:00PM +0100, Maciej S. Szmigiero wrote:
>>>>>> On 3.02.2025 19:20, Peter Xu wrote:
>>>>>>> On Thu, Jan 30, 2025 at 11:08:29AM +0100, Maciej S. Szmigiero wrote:
>>>>>>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>>>>>>
>>>>>>>> Multifd send channels are terminated by calling
>>>>>>>> qio_channel_shutdown(QIO_CHANNEL_SHUTDOWN_BOTH) in
>>>>>>>> multifd_send_terminate_threads(), which in the TLS case essentially
>>>>>>>> calls shutdown(SHUT_RDWR) on the underlying raw socket.
>>>>>>>>
>>>>>>>> Unfortunately, this does not terminate the TLS session properly and
>>>>>>>> the receive side sees this as a GNUTLS_E_PREMATURE_TERMINATION error.
>>>>>>>>
>>>>>>>> The only reason why this wasn't causing migration failures is because
>>>>>>>> the current migration code apparently does not check for migration
>>>>>>>> error being set after the end of the multifd receive process.
>>>>>>>>
>>>>>>>> However, this will change soon so the multifd receive code has to be
>>>>>>>> prepared to not return an error on such premature TLS session EOF.
>>>>>>>> Use the newly introduced QIOChannelTLS method for that.
>>>>>>>>
>>>>>>>> It's worth noting that even if the sender were to be changed to terminate
>>>>>>>> the TLS connection properly the receive side still needs to remain
>>>>>>>> compatible with older QEMU bit stream which does not do this.
>>>>>>>
>>>>>>> If this is an existing bug, we could add a Fixes.
>>>>>>
>>>>>> It is an existing issue but only uncovered by this patch set.
>>>>>>
>>>>>> As far as I can see it was always there, so it would need some
>>>>>> thought where to point that Fixes tag.
>>>>>
>>>>> If there's no way to trigger a real functional bug anyway, it's also ok we
>>>>> omit the Fixes.
>>>>>
>>>>>>> Two pure questions..
>>>>>>>
>>>>>>>       - What is the correct way to terminate the TLS session without this flag?
>>>>>>
>>>>>> I guess one would need to call gnutls_bye() like in this GnuTLS example:
>>>>>> https://gitlab.com/gnutls/gnutls/-/blob/2b8c3e4c71ad380bbbffb32e6003b34ecad596e3/doc/examples/ex-client-anon.c#L102
>>>>>>
>>>>>>>       - Why this is only needed by multifd sessions?
>>>>>>
>>>>>> What uncovered the issue was switching the load threads to using
>>>>>> migrate_set_error() instead of their own result variable
>>>>>> (load_threads_ret) which you had requested during the previous
>>>>>> patch set version review:
>>>>>> https://lore.kernel.org/qemu-devel/Z1DbH5fwBaxtgrvH@x1n/
>>>>>>
>>>>>> Turns out that the multifd receive code always returned
>>>>>> error in the TLS case, just nothing was previously checking for
>>>>>> that error presence.
>>>>>
>>>>> What I was curious is whether this issue also exists for the main migration
>>>>> channel when with tls, especially when e.g. multifd not enabled at all.  As
>>>>> I don't see anywhere that qemu uses gnutls_bye() for any tls session.
>>>>>
>>>>> I think it's a good to find that we overlooked this before.. and IMHO it's
>>>>> always good we could fix this.
>>>>>
>>>>> Does it mean we need proper gnutls_bye() somewhere?
>>>>>
>>>>> If we need an explicit gnutls_bye(), then I wonder if that should be done
>>>>> on the main channel as well.
>>>>
>>>> That's a good question and looking at the code qemu_loadvm_state_main() exits
>>>> on receiving "QEMU_VM_EOF" section (that's different from receiving socket EOF)
>>>> and then optionally "QEMU_VM_VMDESCRIPTION" section is read with explicit size
>>>> in qemu_loadvm_state() - so still not until channel EOF.
>>>
>>> I had a closer look, I do feel like such pre-mature termination is caused
>>> by explicit shutdown()s of the iochannels, looks like that can cause issue
>>> even after everything is sent.  Then I noticed indeed multifd sender
>>> iochannels will get explicit shutdown()s since commit 077fbb5942, while we
>>> don't do that for the main channel.  Maybe that is a major difference.
>>>
>>> Now I wonder whether we should shutdown() the channel at all if migration
>>> succeeded, because looks like it can cause tls session to interrupt even if
>>> the shutdown() is done after sent everything, and if so it'll explain why
>>> you hit the issue with tls.
>>>
>>>>
>>>> Then I can't see anything else reading the channel until it is closed in
>>>> migration_incoming_state_destroy().
>>>>
>>>> So most likely the main migration channel will never read far enough to
>>>> reach that GNUTLS_E_PREMATURE_TERMINATION error.
>>>>
>>>>> If we don't need gnutls_bye(), then should we always ignore pre-mature
>>>>> termination of tls no matter if it's multifd or non-multifd channel (or
>>>>> even a tls session that is not migration-related)?
>>>>
>>>> So basically have this patch extended to calling
>>>> qio_channel_tls_set_premature_eof_okay() also on the main migration channel?
>>>
>>> If above theory can stand, then eof-okay could be a workaround papering
>>> over the real problem that we shouldn't always shutdown()..
>>>
>>> Could you have a look at below patch and see whether it can fix the problem
>>> you hit too, in replace of these two patches (including the previous
>>> iochannel change)?
>>>
>>
>> Unfortunately, the patch below does not fix the problem:
>>> qemu-system-x86_64: Cannot read from TLS channel: The TLS connection was non-properly terminated.
>>> qemu-system-x86_64: Cannot read from TLS channel: The TLS connection was non-properly terminated.
>>
>> I think that, even in the absence of shutdown(), if the sender does not
>> call gnutls_bye() the TLS session is considered improperly terminated.
>>
> 
> I havent't looked much further into this, but can we craft a reproducer
> for it with current master code? It would help us take a look at this
> problem independently of this series. Even an assert somewhere would
> help.

Sure:
diff --git a/migration/savevm.c b/migration/savevm.c
index bc375db282c2..f1a34b73f507 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -2940,7 +2940,13 @@ int qemu_loadvm_state(QEMUFile *f)
  
      /* When reaching here, it must be precopy */
      if (ret == 0) {
-        ret = qemu_file_get_error(f);
+        MigrationState *s = migrate_get_current();
+
+        if (migrate_has_error(s)) {
+            ret = -EINVAL;
+        } else {
+            ret = qemu_file_get_error(f);
+        }
      }
  
      /*

QTEST_QEMU_BINARY=./qemu-system-x86_64 tests/qtest/migration-test -p '/x86_64/migration/multifd/tcp/tls/psk/match'
> qemu-system-x86_64: Cannot read from TLS channel: The TLS connection was non-properly terminated.
> Broken pipe
> ../tests/qtest/libqtest.c:199: kill_qemu() tried to terminate QEMU process but encountered exit status 1 (expected 0)
> Aborted

But we still need to support existing QEMU versions
that do not properly terminate the TLS stream anyway.

Thanks,
Maciej



^ permalink raw reply related	[flat|nested] 137+ messages in thread

* Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels
  2025-02-03 20:20       ` Peter Xu
  2025-02-03 21:41         ` Maciej S. Szmigiero
@ 2025-02-04 15:10         ` Daniel P. Berrangé
  1 sibling, 0 replies; 137+ messages in thread
From: Daniel P. Berrangé @ 2025-02-04 15:10 UTC (permalink / raw)
  To: Peter Xu
  Cc: Maciej S. Szmigiero, Fabiano Rosas, Alex Williamson,
	Cédric Le Goater, Eric Blake, Markus Armbruster,
	Avihai Horon, Joao Martins, qemu-devel

On Mon, Feb 03, 2025 at 03:20:50PM -0500, Peter Xu wrote:
> On Mon, Feb 03, 2025 at 07:53:00PM +0100, Maciej S. Szmigiero wrote:
> > On 3.02.2025 19:20, Peter Xu wrote:
> > > On Thu, Jan 30, 2025 at 11:08:29AM +0100, Maciej S. Szmigiero wrote:
> > > > From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> > > > 
> > > > Multifd send channels are terminated by calling
> > > > qio_channel_shutdown(QIO_CHANNEL_SHUTDOWN_BOTH) in
> > > > multifd_send_terminate_threads(), which in the TLS case essentially
> > > > calls shutdown(SHUT_RDWR) on the underlying raw socket.
> > > > 
> > > > Unfortunately, this does not terminate the TLS session properly and
> > > > the receive side sees this as a GNUTLS_E_PREMATURE_TERMINATION error.
> > > > 
> > > > The only reason why this wasn't causing migration failures is because
> > > > the current migration code apparently does not check for migration
> > > > error being set after the end of the multifd receive process.
> > > > 
> > > > However, this will change soon so the multifd receive code has to be
> > > > prepared to not return an error on such premature TLS session EOF.
> > > > Use the newly introduced QIOChannelTLS method for that.
> > > > 
> > > > It's worth noting that even if the sender were to be changed to terminate
> > > > the TLS connection properly the receive side still needs to remain
> > > > compatible with older QEMU bit stream which does not do this.
> > > 
> > > If this is an existing bug, we could add a Fixes.
> > 
> > It is an existing issue but only uncovered by this patch set.
> > 
> > As far as I can see it was always there, so it would need some
> > thought where to point that Fixes tag.
> 
> If there's no way to trigger a real functional bug anyway, it's also ok we
> omit the Fixes.
> 
> > > Two pure questions..
> > > 
> > >    - What is the correct way to terminate the TLS session without this flag?
> > 
> > I guess one would need to call gnutls_bye() like in this GnuTLS example:
> > https://gitlab.com/gnutls/gnutls/-/blob/2b8c3e4c71ad380bbbffb32e6003b34ecad596e3/doc/examples/ex-client-anon.c#L102
> > 
> > >    - Why this is only needed by multifd sessions?
> > 
> > What uncovered the issue was switching the load threads to using
> > migrate_set_error() instead of their own result variable
> > (load_threads_ret) which you had requested during the previous
> > patch set version review:
> > https://lore.kernel.org/qemu-devel/Z1DbH5fwBaxtgrvH@x1n/
> > 
> > Turns out that the multifd receive code always returned
> > error in the TLS case, just nothing was previously checking for
> > that error presence.
> 
> What I was curious is whether this issue also exists for the main migration
> channel when with tls, especially when e.g. multifd not enabled at all.  As
> I don't see anywhere that qemu uses gnutls_bye() for any tls session.

We've been lazy and avoided using gnutls_bye because it adds a
bunch more complexity to the shutdown sequence, and premature
shutdown from an malicious attack would be expected to cause
the QEMU protocol in the TLS channel to fail anyway.

> Does it mean we need proper gnutls_bye() somewhere?

Depends if the protocol we run over TLS can identify premature
termination itself, or whether it relies on TLS to identify
it.


With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH v4 07/33] io: tls: Allow terminating the TLS session gracefully with EOF
  2025-01-30 10:08 ` [PATCH v4 07/33] io: tls: Allow terminating the TLS session gracefully with EOF Maciej S. Szmigiero
@ 2025-02-04 15:15   ` Daniel P. Berrangé
  2025-02-04 16:02     ` Maciej S. Szmigiero
  0 siblings, 1 reply; 137+ messages in thread
From: Daniel P. Berrangé @ 2025-02-04 15:15 UTC (permalink / raw)
  To: Maciej S. Szmigiero
  Cc: Peter Xu, Fabiano Rosas, Alex Williamson, Cédric Le Goater,
	Eric Blake, Markus Armbruster, Avihai Horon, Joao Martins,
	qemu-devel

On Thu, Jan 30, 2025 at 11:08:28AM +0100, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> 
> Currently, hitting EOF on receive without sender terminating the TLS
> session properly causes the TLS channel to return an error (unless
> the channel was already shut down for read).
> 
> Add an optional setting whether we instead just return EOF in that
> case.
> 
> This possibility will be soon used by the migration multifd code.
> 
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> ---
>  include/io/channel-tls.h | 11 +++++++++++
>  io/channel-tls.c         |  6 ++++++
>  2 files changed, 17 insertions(+)
> 
> diff --git a/include/io/channel-tls.h b/include/io/channel-tls.h
> index 26c67f17e2d3..8552c0d0266e 100644
> --- a/include/io/channel-tls.h
> +++ b/include/io/channel-tls.h
> @@ -49,6 +49,7 @@ struct QIOChannelTLS {
>      QCryptoTLSSession *session;
>      QIOChannelShutdown shutdown;
>      guint hs_ioc_tag;
> +    bool premature_eof_okay;
>  };
>  
>  /**
> @@ -143,4 +144,14 @@ void qio_channel_tls_handshake(QIOChannelTLS *ioc,
>  QCryptoTLSSession *
>  qio_channel_tls_get_session(QIOChannelTLS *ioc);
>  
> +/**
> + * qio_channel_tls_set_premature_eof_okay:
> + * @ioc: the TLS channel object
> + *
> + * Sets whether receiving an EOF without terminating the TLS session properly
> + * by used the other side is considered okay or an error (the
> + * default behaviour).
> + */
> +void qio_channel_tls_set_premature_eof_okay(QIOChannelTLS *ioc, bool enabled);
> +
>  #endif /* QIO_CHANNEL_TLS_H */
> diff --git a/io/channel-tls.c b/io/channel-tls.c
> index aab630e5ae32..1079d6d10de1 100644
> --- a/io/channel-tls.c
> +++ b/io/channel-tls.c
> @@ -147,6 +147,11 @@ qio_channel_tls_new_client(QIOChannel *master,
>      return NULL;
>  }
>  
> +void qio_channel_tls_set_premature_eof_okay(QIOChannelTLS *ioc, bool enabled)
> +{
> +    ioc->premature_eof_okay = enabled;
> +}
> +
>  struct QIOChannelTLSData {
>      QIOTask *task;
>      GMainContext *context;
> @@ -279,6 +284,7 @@ static ssize_t qio_channel_tls_readv(QIOChannel *ioc,
>              tioc->session,
>              iov[i].iov_base,
>              iov[i].iov_len,
> +            tioc->premature_eof_okay ||
>              qatomic_load_acquire(&tioc->shutdown) & QIO_CHANNEL_SHUTDOWN_READ,
>              errp);
>          if (ret == QCRYPTO_TLS_SESSION_ERR_BLOCK) {

IMHO a better way to do this is by defining an new flag for use with
the qio_channel_readv_full() method. That makes the ignoring of
premature shutdown a contextually scoped behaviour rather than a
global behaviour.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels
  2025-02-04 14:39             ` Maciej S. Szmigiero
  2025-02-04 15:00               ` Fabiano Rosas
@ 2025-02-04 15:31               ` Peter Xu
  2025-02-04 15:39                 ` Daniel P. Berrangé
  1 sibling, 1 reply; 137+ messages in thread
From: Peter Xu @ 2025-02-04 15:31 UTC (permalink / raw)
  To: Maciej S. Szmigiero
  Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P. Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

On Tue, Feb 04, 2025 at 03:39:00PM +0100, Maciej S. Szmigiero wrote:
> On 3.02.2025 23:56, Peter Xu wrote:
> > On Mon, Feb 03, 2025 at 10:41:32PM +0100, Maciej S. Szmigiero wrote:
> > > On 3.02.2025 21:20, Peter Xu wrote:
> > > > On Mon, Feb 03, 2025 at 07:53:00PM +0100, Maciej S. Szmigiero wrote:
> > > > > On 3.02.2025 19:20, Peter Xu wrote:
> > > > > > On Thu, Jan 30, 2025 at 11:08:29AM +0100, Maciej S. Szmigiero wrote:
> > > > > > > From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> > > > > > > 
> > > > > > > Multifd send channels are terminated by calling
> > > > > > > qio_channel_shutdown(QIO_CHANNEL_SHUTDOWN_BOTH) in
> > > > > > > multifd_send_terminate_threads(), which in the TLS case essentially
> > > > > > > calls shutdown(SHUT_RDWR) on the underlying raw socket.
> > > > > > > 
> > > > > > > Unfortunately, this does not terminate the TLS session properly and
> > > > > > > the receive side sees this as a GNUTLS_E_PREMATURE_TERMINATION error.
> > > > > > > 
> > > > > > > The only reason why this wasn't causing migration failures is because
> > > > > > > the current migration code apparently does not check for migration
> > > > > > > error being set after the end of the multifd receive process.
> > > > > > > 
> > > > > > > However, this will change soon so the multifd receive code has to be
> > > > > > > prepared to not return an error on such premature TLS session EOF.
> > > > > > > Use the newly introduced QIOChannelTLS method for that.
> > > > > > > 
> > > > > > > It's worth noting that even if the sender were to be changed to terminate
> > > > > > > the TLS connection properly the receive side still needs to remain
> > > > > > > compatible with older QEMU bit stream which does not do this.
> > > > > > 
> > > > > > If this is an existing bug, we could add a Fixes.
> > > > > 
> > > > > It is an existing issue but only uncovered by this patch set.
> > > > > 
> > > > > As far as I can see it was always there, so it would need some
> > > > > thought where to point that Fixes tag.
> > > > 
> > > > If there's no way to trigger a real functional bug anyway, it's also ok we
> > > > omit the Fixes.
> > > > 
> > > > > > Two pure questions..
> > > > > > 
> > > > > >      - What is the correct way to terminate the TLS session without this flag?
> > > > > 
> > > > > I guess one would need to call gnutls_bye() like in this GnuTLS example:
> > > > > https://gitlab.com/gnutls/gnutls/-/blob/2b8c3e4c71ad380bbbffb32e6003b34ecad596e3/doc/examples/ex-client-anon.c#L102
> > > > > 
> > > > > >      - Why this is only needed by multifd sessions?
> > > > > 
> > > > > What uncovered the issue was switching the load threads to using
> > > > > migrate_set_error() instead of their own result variable
> > > > > (load_threads_ret) which you had requested during the previous
> > > > > patch set version review:
> > > > > https://lore.kernel.org/qemu-devel/Z1DbH5fwBaxtgrvH@x1n/
> > > > > 
> > > > > Turns out that the multifd receive code always returned
> > > > > error in the TLS case, just nothing was previously checking for
> > > > > that error presence.
> > > > 
> > > > What I was curious is whether this issue also exists for the main migration
> > > > channel when with tls, especially when e.g. multifd not enabled at all.  As
> > > > I don't see anywhere that qemu uses gnutls_bye() for any tls session.
> > > > 
> > > > I think it's a good to find that we overlooked this before.. and IMHO it's
> > > > always good we could fix this.
> > > > 
> > > > Does it mean we need proper gnutls_bye() somewhere?
> > > > 
> > > > If we need an explicit gnutls_bye(), then I wonder if that should be done
> > > > on the main channel as well.
> > > 
> > > That's a good question and looking at the code qemu_loadvm_state_main() exits
> > > on receiving "QEMU_VM_EOF" section (that's different from receiving socket EOF)
> > > and then optionally "QEMU_VM_VMDESCRIPTION" section is read with explicit size
> > > in qemu_loadvm_state() - so still not until channel EOF.
> > 
> > I had a closer look, I do feel like such pre-mature termination is caused
> > by explicit shutdown()s of the iochannels, looks like that can cause issue
> > even after everything is sent.  Then I noticed indeed multifd sender
> > iochannels will get explicit shutdown()s since commit 077fbb5942, while we
> > don't do that for the main channel.  Maybe that is a major difference.
> > 
> > Now I wonder whether we should shutdown() the channel at all if migration
> > succeeded, because looks like it can cause tls session to interrupt even if
> > the shutdown() is done after sent everything, and if so it'll explain why
> > you hit the issue with tls.
> > 
> > > 
> > > Then I can't see anything else reading the channel until it is closed in
> > > migration_incoming_state_destroy().
> > > 
> > > So most likely the main migration channel will never read far enough to
> > > reach that GNUTLS_E_PREMATURE_TERMINATION error.
> > > 
> > > > If we don't need gnutls_bye(), then should we always ignore pre-mature
> > > > termination of tls no matter if it's multifd or non-multifd channel (or
> > > > even a tls session that is not migration-related)?
> > > 
> > > So basically have this patch extended to calling
> > > qio_channel_tls_set_premature_eof_okay() also on the main migration channel?
> > 
> > If above theory can stand, then eof-okay could be a workaround papering
> > over the real problem that we shouldn't always shutdown()..
> > 
> > Could you have a look at below patch and see whether it can fix the problem
> > you hit too, in replace of these two patches (including the previous
> > iochannel change)?
> > 
> 
> Unfortunately, the patch below does not fix the problem:
> > qemu-system-x86_64: Cannot read from TLS channel: The TLS connection was non-properly terminated.
> > qemu-system-x86_64: Cannot read from TLS channel: The TLS connection was non-properly terminated.
> 
> I think that, even in the absence of shutdown(), if the sender does not
> call gnutls_bye() the TLS session is considered improperly terminated.

Ah..

How about one more change on top of above change to disconnect properly for
TLS?  Something like gnutls_bye() in qio_channel_tls_close(), would that
make sense to you?

In general, I think it'll be good we fix this from the source rather than
bypassing an error reported by gnutls facilities. We'll see what we can do
to help (this includes Fabiano), but the hope is we can figure out the
right way soon.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels
  2025-02-04 15:31               ` Peter Xu
@ 2025-02-04 15:39                 ` Daniel P. Berrangé
  2025-02-05 19:09                   ` Fabiano Rosas
  0 siblings, 1 reply; 137+ messages in thread
From: Daniel P. Berrangé @ 2025-02-04 15:39 UTC (permalink / raw)
  To: Peter Xu
  Cc: Maciej S. Szmigiero, Fabiano Rosas, Alex Williamson,
	Cédric Le Goater, Eric Blake, Markus Armbruster,
	Avihai Horon, Joao Martins, qemu-devel

On Tue, Feb 04, 2025 at 10:31:31AM -0500, Peter Xu wrote:
> On Tue, Feb 04, 2025 at 03:39:00PM +0100, Maciej S. Szmigiero wrote:
> > On 3.02.2025 23:56, Peter Xu wrote:
> > > On Mon, Feb 03, 2025 at 10:41:32PM +0100, Maciej S. Szmigiero wrote:
> > > > On 3.02.2025 21:20, Peter Xu wrote:
> > > > > On Mon, Feb 03, 2025 at 07:53:00PM +0100, Maciej S. Szmigiero wrote:
> > > > > > On 3.02.2025 19:20, Peter Xu wrote:
> > > > > > > On Thu, Jan 30, 2025 at 11:08:29AM +0100, Maciej S. Szmigiero wrote:
> > > > > > > > From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> > > > > > > > 
> > > > > > > > Multifd send channels are terminated by calling
> > > > > > > > qio_channel_shutdown(QIO_CHANNEL_SHUTDOWN_BOTH) in
> > > > > > > > multifd_send_terminate_threads(), which in the TLS case essentially
> > > > > > > > calls shutdown(SHUT_RDWR) on the underlying raw socket.
> > > > > > > > 
> > > > > > > > Unfortunately, this does not terminate the TLS session properly and
> > > > > > > > the receive side sees this as a GNUTLS_E_PREMATURE_TERMINATION error.
> > > > > > > > 
> > > > > > > > The only reason why this wasn't causing migration failures is because
> > > > > > > > the current migration code apparently does not check for migration
> > > > > > > > error being set after the end of the multifd receive process.
> > > > > > > > 
> > > > > > > > However, this will change soon so the multifd receive code has to be
> > > > > > > > prepared to not return an error on such premature TLS session EOF.
> > > > > > > > Use the newly introduced QIOChannelTLS method for that.
> > > > > > > > 
> > > > > > > > It's worth noting that even if the sender were to be changed to terminate
> > > > > > > > the TLS connection properly the receive side still needs to remain
> > > > > > > > compatible with older QEMU bit stream which does not do this.
> > > > > > > 
> > > > > > > If this is an existing bug, we could add a Fixes.
> > > > > > 
> > > > > > It is an existing issue but only uncovered by this patch set.
> > > > > > 
> > > > > > As far as I can see it was always there, so it would need some
> > > > > > thought where to point that Fixes tag.
> > > > > 
> > > > > If there's no way to trigger a real functional bug anyway, it's also ok we
> > > > > omit the Fixes.
> > > > > 
> > > > > > > Two pure questions..
> > > > > > > 
> > > > > > >      - What is the correct way to terminate the TLS session without this flag?
> > > > > > 
> > > > > > I guess one would need to call gnutls_bye() like in this GnuTLS example:
> > > > > > https://gitlab.com/gnutls/gnutls/-/blob/2b8c3e4c71ad380bbbffb32e6003b34ecad596e3/doc/examples/ex-client-anon.c#L102
> > > > > > 
> > > > > > >      - Why this is only needed by multifd sessions?
> > > > > > 
> > > > > > What uncovered the issue was switching the load threads to using
> > > > > > migrate_set_error() instead of their own result variable
> > > > > > (load_threads_ret) which you had requested during the previous
> > > > > > patch set version review:
> > > > > > https://lore.kernel.org/qemu-devel/Z1DbH5fwBaxtgrvH@x1n/
> > > > > > 
> > > > > > Turns out that the multifd receive code always returned
> > > > > > error in the TLS case, just nothing was previously checking for
> > > > > > that error presence.
> > > > > 
> > > > > What I was curious is whether this issue also exists for the main migration
> > > > > channel when with tls, especially when e.g. multifd not enabled at all.  As
> > > > > I don't see anywhere that qemu uses gnutls_bye() for any tls session.
> > > > > 
> > > > > I think it's a good to find that we overlooked this before.. and IMHO it's
> > > > > always good we could fix this.
> > > > > 
> > > > > Does it mean we need proper gnutls_bye() somewhere?
> > > > > 
> > > > > If we need an explicit gnutls_bye(), then I wonder if that should be done
> > > > > on the main channel as well.
> > > > 
> > > > That's a good question and looking at the code qemu_loadvm_state_main() exits
> > > > on receiving "QEMU_VM_EOF" section (that's different from receiving socket EOF)
> > > > and then optionally "QEMU_VM_VMDESCRIPTION" section is read with explicit size
> > > > in qemu_loadvm_state() - so still not until channel EOF.
> > > 
> > > I had a closer look, I do feel like such pre-mature termination is caused
> > > by explicit shutdown()s of the iochannels, looks like that can cause issue
> > > even after everything is sent.  Then I noticed indeed multifd sender
> > > iochannels will get explicit shutdown()s since commit 077fbb5942, while we
> > > don't do that for the main channel.  Maybe that is a major difference.
> > > 
> > > Now I wonder whether we should shutdown() the channel at all if migration
> > > succeeded, because looks like it can cause tls session to interrupt even if
> > > the shutdown() is done after sent everything, and if so it'll explain why
> > > you hit the issue with tls.
> > > 
> > > > 
> > > > Then I can't see anything else reading the channel until it is closed in
> > > > migration_incoming_state_destroy().
> > > > 
> > > > So most likely the main migration channel will never read far enough to
> > > > reach that GNUTLS_E_PREMATURE_TERMINATION error.
> > > > 
> > > > > If we don't need gnutls_bye(), then should we always ignore pre-mature
> > > > > termination of tls no matter if it's multifd or non-multifd channel (or
> > > > > even a tls session that is not migration-related)?
> > > > 
> > > > So basically have this patch extended to calling
> > > > qio_channel_tls_set_premature_eof_okay() also on the main migration channel?
> > > 
> > > If above theory can stand, then eof-okay could be a workaround papering
> > > over the real problem that we shouldn't always shutdown()..
> > > 
> > > Could you have a look at below patch and see whether it can fix the problem
> > > you hit too, in replace of these two patches (including the previous
> > > iochannel change)?
> > > 
> > 
> > Unfortunately, the patch below does not fix the problem:
> > > qemu-system-x86_64: Cannot read from TLS channel: The TLS connection was non-properly terminated.
> > > qemu-system-x86_64: Cannot read from TLS channel: The TLS connection was non-properly terminated.
> > 
> > I think that, even in the absence of shutdown(), if the sender does not
> > call gnutls_bye() the TLS session is considered improperly terminated.
> 
> Ah..
> 
> How about one more change on top of above change to disconnect properly for
> TLS?  Something like gnutls_bye() in qio_channel_tls_close(), would that
> make sense to you?

Calling gnutls_bye from qio_channel_tls_close is not viable for the
API contract of qio_channel_close. gnutls_bye needs to be able to
perform I/O, which means we need to be able to tell the caller
whether it needs to perform an event loop wait for POLLIN or POLLOUT.

This is the same API design scenario as the gnutls_handshake method.
As such I tdon't think it is practical to abstract it inside any
existing QIOChannel API call, it'll have to be standalone like
qio_channel_tls_handshake() is.


With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH v4 09/33] migration: postcopy_ram_listen_thread() needs to take BQL for some calls
  2025-02-04 14:57                     ` Maciej S. Szmigiero
@ 2025-02-04 15:39                       ` Peter Xu
  2025-02-04 19:32                         ` Maciej S. Szmigiero
  0 siblings, 1 reply; 137+ messages in thread
From: Peter Xu @ 2025-02-04 15:39 UTC (permalink / raw)
  To: Maciej S. Szmigiero
  Cc: Dr. David Alan Gilbert, Fabiano Rosas, Alex Williamson,
	Cédric Le Goater, Eric Blake, Markus Armbruster,
	Daniel P. Berrangé, Avihai Horon, Joao Martins, qemu-devel

On Tue, Feb 04, 2025 at 03:57:37PM +0100, Maciej S. Szmigiero wrote:
> The vfio_migration_cleanup() used to just close a migration FD, while
> RAM might end up calling qemu_ram_msync(), which sounds like something
> that should be called under BQL.
> 
> But I am not sure whether that lack of BQL around qemu_ram_msync()
> actually causes problems.

I believe msync() is thread-safe.  So that doesn't need BQL, AFAICT.

Personally I actually prefer not having the BQL requirement if ever
possible in any vmstate hooks.

I think the only challenge here is if VFIO will start to need BQL for some
specific code path that you added in this series, it means VFIO needs to
detect bql_locked() to make sure it won't deadlock.. and only take BQL if
it's not taken.

From that POV, it might be easier for you to define that hook as "always do
cleanup() with BQL" globally, just to avoid one bql_locked() usage in vfio
specific hook.  We pay that with slow RAM sync in corner cases like pmem
that could potentially block VM from making progress (e.g. vcpu
concurrently accessing MMIO regions).

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH v4 07/33] io: tls: Allow terminating the TLS session gracefully with EOF
  2025-02-04 15:15   ` Daniel P. Berrangé
@ 2025-02-04 16:02     ` Maciej S. Szmigiero
  2025-02-04 16:14       ` Daniel P. Berrangé
  0 siblings, 1 reply; 137+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-04 16:02 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Peter Xu, Fabiano Rosas, Alex Williamson, Cédric Le Goater,
	Eric Blake, Markus Armbruster, Avihai Horon, Joao Martins,
	qemu-devel

On 4.02.2025 16:15, Daniel P. Berrangé wrote:
> On Thu, Jan 30, 2025 at 11:08:28AM +0100, Maciej S. Szmigiero wrote:
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> Currently, hitting EOF on receive without sender terminating the TLS
>> session properly causes the TLS channel to return an error (unless
>> the channel was already shut down for read).
>>
>> Add an optional setting whether we instead just return EOF in that
>> case.
>>
>> This possibility will be soon used by the migration multifd code.
>>
>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>> ---
>>   include/io/channel-tls.h | 11 +++++++++++
>>   io/channel-tls.c         |  6 ++++++
>>   2 files changed, 17 insertions(+)
>>
>> diff --git a/include/io/channel-tls.h b/include/io/channel-tls.h
>> index 26c67f17e2d3..8552c0d0266e 100644
>> --- a/include/io/channel-tls.h
>> +++ b/include/io/channel-tls.h
>> @@ -49,6 +49,7 @@ struct QIOChannelTLS {
>>       QCryptoTLSSession *session;
>>       QIOChannelShutdown shutdown;
>>       guint hs_ioc_tag;
>> +    bool premature_eof_okay;
>>   };
>>   
>>   /**
>> @@ -143,4 +144,14 @@ void qio_channel_tls_handshake(QIOChannelTLS *ioc,
>>   QCryptoTLSSession *
>>   qio_channel_tls_get_session(QIOChannelTLS *ioc);
>>   
>> +/**
>> + * qio_channel_tls_set_premature_eof_okay:
>> + * @ioc: the TLS channel object
>> + *
>> + * Sets whether receiving an EOF without terminating the TLS session properly
>> + * by used the other side is considered okay or an error (the
>> + * default behaviour).
>> + */
>> +void qio_channel_tls_set_premature_eof_okay(QIOChannelTLS *ioc, bool enabled);
>> +
>>   #endif /* QIO_CHANNEL_TLS_H */
>> diff --git a/io/channel-tls.c b/io/channel-tls.c
>> index aab630e5ae32..1079d6d10de1 100644
>> --- a/io/channel-tls.c
>> +++ b/io/channel-tls.c
>> @@ -147,6 +147,11 @@ qio_channel_tls_new_client(QIOChannel *master,
>>       return NULL;
>>   }
>>   
>> +void qio_channel_tls_set_premature_eof_okay(QIOChannelTLS *ioc, bool enabled)
>> +{
>> +    ioc->premature_eof_okay = enabled;
>> +}
>> +
>>   struct QIOChannelTLSData {
>>       QIOTask *task;
>>       GMainContext *context;
>> @@ -279,6 +284,7 @@ static ssize_t qio_channel_tls_readv(QIOChannel *ioc,
>>               tioc->session,
>>               iov[i].iov_base,
>>               iov[i].iov_len,
>> +            tioc->premature_eof_okay ||
>>               qatomic_load_acquire(&tioc->shutdown) & QIO_CHANNEL_SHUTDOWN_READ,
>>               errp);
>>           if (ret == QCRYPTO_TLS_SESSION_ERR_BLOCK) {
> 
> IMHO a better way to do this is by defining an new flag for use with
> the qio_channel_readv_full() method. That makes the ignoring of
> premature shutdown a contextually scoped behaviour rather than a
> global behaviour.

Something named like QIO_CHANNEL_READ_FLAG_TLS_EARLY_EOF_OKAY?

> With regards,
> Daniel

Thanks,
Maciej



^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels
  2025-02-04 15:08     ` Daniel P. Berrangé
@ 2025-02-04 16:02       ` Peter Xu
  2025-02-04 16:12         ` Daniel P. Berrangé
  2025-02-04 18:25         ` Fabiano Rosas
  0 siblings, 2 replies; 137+ messages in thread
From: Peter Xu @ 2025-02-04 16:02 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Maciej S. Szmigiero, Fabiano Rosas, Alex Williamson,
	Cédric Le Goater, Eric Blake, Markus Armbruster,
	Avihai Horon, Joao Martins, qemu-devel

On Tue, Feb 04, 2025 at 03:08:02PM +0000, Daniel P. Berrangé wrote:
> On Mon, Feb 03, 2025 at 01:20:01PM -0500, Peter Xu wrote:
> > On Thu, Jan 30, 2025 at 11:08:29AM +0100, Maciej S. Szmigiero wrote:
> > > From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> > > 
> > > Multifd send channels are terminated by calling
> > > qio_channel_shutdown(QIO_CHANNEL_SHUTDOWN_BOTH) in
> > > multifd_send_terminate_threads(), which in the TLS case essentially
> > > calls shutdown(SHUT_RDWR) on the underlying raw socket.
> > > 
> > > Unfortunately, this does not terminate the TLS session properly and
> > > the receive side sees this as a GNUTLS_E_PREMATURE_TERMINATION error.
> > > 
> > > The only reason why this wasn't causing migration failures is because
> > > the current migration code apparently does not check for migration
> > > error being set after the end of the multifd receive process.
> > > 
> > > However, this will change soon so the multifd receive code has to be
> > > prepared to not return an error on such premature TLS session EOF.
> > > Use the newly introduced QIOChannelTLS method for that.
> > > 
> > > It's worth noting that even if the sender were to be changed to terminate
> > > the TLS connection properly the receive side still needs to remain
> > > compatible with older QEMU bit stream which does not do this.
> > 
> > If this is an existing bug, we could add a Fixes.
> > 
> > Two pure questions..
> > 
> >   - What is the correct way to terminate the TLS session without this flag?
> > 
> >   - Why this is only needed by multifd sessions?
> 
> Graceful TLS termination (via gnutls_bye()) should only be important to
> security if the QEMU protocol in question does not know how much data it
> is expecting to recieve. ie it cannot otherwise distinguish between an
> expected EOF, and a premature EOF triggered by an attacker.
> 
> If the migration protocol has sufficient info to know when a chanel is
> expected to see EOF, then we should stop trying to read from the TLS
> channel before seeing the underlying EOF.
> 
> Ignoring GNUTLS_E_PREMATURE_TERMINATION would be valid if we know that
> migration will still fail corretly in the case of a malicious attack
> causing premature termination.
> 
> If there's a risk that migration may succeed, but with incomplete data,
> then we would need the full gnutls_bye dance.

IIUC that's not required for migration then, because migration should know
exactly how much data to receive, and migration should need to verify that
and fail if the received data didn't match the expectation along the way.
We also have QEMU_VM_EOF as the end mark of stream.

Said that, are we sure any pre-mature termination will only happen after
all data read in the receive buffer that was sent?

To ask in another way: what happens if the source QEMU sends everything and
shutdown()/close() the channel, meanwhile the dest QEMU sees both (1) rest
data to read, and (2) a pre-mature terminatino of TLS session in a read()
syscall.  Would (2) be reported even before (1), or the order guaranteed
that read of the residue data in (1) always happen before (2) (considering
dest QEMU can be slow sometime on consuming the network buffers)?

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels
  2025-02-04 16:02       ` Peter Xu
@ 2025-02-04 16:12         ` Daniel P. Berrangé
  2025-02-04 16:29           ` Peter Xu
  2025-02-04 18:25         ` Fabiano Rosas
  1 sibling, 1 reply; 137+ messages in thread
From: Daniel P. Berrangé @ 2025-02-04 16:12 UTC (permalink / raw)
  To: Peter Xu
  Cc: Maciej S. Szmigiero, Fabiano Rosas, Alex Williamson,
	Cédric Le Goater, Eric Blake, Markus Armbruster,
	Avihai Horon, Joao Martins, qemu-devel

On Tue, Feb 04, 2025 at 11:02:28AM -0500, Peter Xu wrote:
> On Tue, Feb 04, 2025 at 03:08:02PM +0000, Daniel P. Berrangé wrote:
> > On Mon, Feb 03, 2025 at 01:20:01PM -0500, Peter Xu wrote:
> > > On Thu, Jan 30, 2025 at 11:08:29AM +0100, Maciej S. Szmigiero wrote:
> > > > From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> > > > 
> > > > Multifd send channels are terminated by calling
> > > > qio_channel_shutdown(QIO_CHANNEL_SHUTDOWN_BOTH) in
> > > > multifd_send_terminate_threads(), which in the TLS case essentially
> > > > calls shutdown(SHUT_RDWR) on the underlying raw socket.
> > > > 
> > > > Unfortunately, this does not terminate the TLS session properly and
> > > > the receive side sees this as a GNUTLS_E_PREMATURE_TERMINATION error.
> > > > 
> > > > The only reason why this wasn't causing migration failures is because
> > > > the current migration code apparently does not check for migration
> > > > error being set after the end of the multifd receive process.
> > > > 
> > > > However, this will change soon so the multifd receive code has to be
> > > > prepared to not return an error on such premature TLS session EOF.
> > > > Use the newly introduced QIOChannelTLS method for that.
> > > > 
> > > > It's worth noting that even if the sender were to be changed to terminate
> > > > the TLS connection properly the receive side still needs to remain
> > > > compatible with older QEMU bit stream which does not do this.
> > > 
> > > If this is an existing bug, we could add a Fixes.
> > > 
> > > Two pure questions..
> > > 
> > >   - What is the correct way to terminate the TLS session without this flag?
> > > 
> > >   - Why this is only needed by multifd sessions?
> > 
> > Graceful TLS termination (via gnutls_bye()) should only be important to
> > security if the QEMU protocol in question does not know how much data it
> > is expecting to recieve. ie it cannot otherwise distinguish between an
> > expected EOF, and a premature EOF triggered by an attacker.
> > 
> > If the migration protocol has sufficient info to know when a chanel is
> > expected to see EOF, then we should stop trying to read from the TLS
> > channel before seeing the underlying EOF.
> > 
> > Ignoring GNUTLS_E_PREMATURE_TERMINATION would be valid if we know that
> > migration will still fail corretly in the case of a malicious attack
> > causing premature termination.
> > 
> > If there's a risk that migration may succeed, but with incomplete data,
> > then we would need the full gnutls_bye dance.
> 
> IIUC that's not required for migration then, because migration should know
> exactly how much data to receive, and migration should need to verify that
> and fail if the received data didn't match the expectation along the way.
> We also have QEMU_VM_EOF as the end mark of stream.
> 
> Said that, are we sure any pre-mature termination will only happen after
> all data read in the receive buffer that was sent?
> 
> To ask in another way: what happens if the source QEMU sends everything and
> shutdown()/close() the channel, meanwhile the dest QEMU sees both (1) rest
> data to read, and (2) a pre-mature terminatino of TLS session in a read()
> syscall.  Would (2) be reported even before (1), or the order guaranteed
> that read of the residue data in (1) always happen before (2) (considering
> dest QEMU can be slow sometime on consuming the network buffers)?

That's not logically possible.

In both (1) and (2) you are issuing a read() call to the TLS channel.

The first read call(s) consume all incoming data. Only once the underlying
TCP socket read() returns 0, would GNUTLS  see that it hasn't got any
TLS "bye" packet, and thus return GNUTLS_E_PREMATURE_TERMINATION from
the layered TLS read(). IOW, if you see GNUTLS_E_PREMATURE_TERMINATION
you know you have already read all received data off the socket.


With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH v4 07/33] io: tls: Allow terminating the TLS session gracefully with EOF
  2025-02-04 16:02     ` Maciej S. Szmigiero
@ 2025-02-04 16:14       ` Daniel P. Berrangé
  2025-02-04 18:25         ` Maciej S. Szmigiero
  0 siblings, 1 reply; 137+ messages in thread
From: Daniel P. Berrangé @ 2025-02-04 16:14 UTC (permalink / raw)
  To: Maciej S. Szmigiero
  Cc: Peter Xu, Fabiano Rosas, Alex Williamson, Cédric Le Goater,
	Eric Blake, Markus Armbruster, Avihai Horon, Joao Martins,
	qemu-devel

On Tue, Feb 04, 2025 at 05:02:23PM +0100, Maciej S. Szmigiero wrote:
> On 4.02.2025 16:15, Daniel P. Berrangé wrote:
> > On Thu, Jan 30, 2025 at 11:08:28AM +0100, Maciej S. Szmigiero wrote:
> > > From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> > > 
> > > Currently, hitting EOF on receive without sender terminating the TLS
> > > session properly causes the TLS channel to return an error (unless
> > > the channel was already shut down for read).
> > > 
> > > Add an optional setting whether we instead just return EOF in that
> > > case.
> > > 
> > > This possibility will be soon used by the migration multifd code.
> > > 
> > > Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> > > ---
> > >   include/io/channel-tls.h | 11 +++++++++++
> > >   io/channel-tls.c         |  6 ++++++
> > >   2 files changed, 17 insertions(+)
> > > 
> > > diff --git a/include/io/channel-tls.h b/include/io/channel-tls.h
> > > index 26c67f17e2d3..8552c0d0266e 100644
> > > --- a/include/io/channel-tls.h
> > > +++ b/include/io/channel-tls.h
> > > @@ -49,6 +49,7 @@ struct QIOChannelTLS {
> > >       QCryptoTLSSession *session;
> > >       QIOChannelShutdown shutdown;
> > >       guint hs_ioc_tag;
> > > +    bool premature_eof_okay;
> > >   };
> > >   /**
> > > @@ -143,4 +144,14 @@ void qio_channel_tls_handshake(QIOChannelTLS *ioc,
> > >   QCryptoTLSSession *
> > >   qio_channel_tls_get_session(QIOChannelTLS *ioc);
> > > +/**
> > > + * qio_channel_tls_set_premature_eof_okay:
> > > + * @ioc: the TLS channel object
> > > + *
> > > + * Sets whether receiving an EOF without terminating the TLS session properly
> > > + * by used the other side is considered okay or an error (the
> > > + * default behaviour).
> > > + */
> > > +void qio_channel_tls_set_premature_eof_okay(QIOChannelTLS *ioc, bool enabled);
> > > +
> > >   #endif /* QIO_CHANNEL_TLS_H */
> > > diff --git a/io/channel-tls.c b/io/channel-tls.c
> > > index aab630e5ae32..1079d6d10de1 100644
> > > --- a/io/channel-tls.c
> > > +++ b/io/channel-tls.c
> > > @@ -147,6 +147,11 @@ qio_channel_tls_new_client(QIOChannel *master,
> > >       return NULL;
> > >   }
> > > +void qio_channel_tls_set_premature_eof_okay(QIOChannelTLS *ioc, bool enabled)
> > > +{
> > > +    ioc->premature_eof_okay = enabled;
> > > +}
> > > +
> > >   struct QIOChannelTLSData {
> > >       QIOTask *task;
> > >       GMainContext *context;
> > > @@ -279,6 +284,7 @@ static ssize_t qio_channel_tls_readv(QIOChannel *ioc,
> > >               tioc->session,
> > >               iov[i].iov_base,
> > >               iov[i].iov_len,
> > > +            tioc->premature_eof_okay ||
> > >               qatomic_load_acquire(&tioc->shutdown) & QIO_CHANNEL_SHUTDOWN_READ,
> > >               errp);
> > >           if (ret == QCRYPTO_TLS_SESSION_ERR_BLOCK) {
> > 
> > IMHO a better way to do this is by defining an new flag for use with
> > the qio_channel_readv_full() method. That makes the ignoring of
> > premature shutdown a contextually scoped behaviour rather than a
> > global behaviour.
> 
> Something named like QIO_CHANNEL_READ_FLAG_TLS_EARLY_EOF_OKAY?

Since the flags are defined at the non-TLS layer in the API, I would
pick  "QIO_CHANNEL_READ_RELAXED_EOF", as it could conceptually make
sense to other layered channel protocols beyond TLS, even if we only
ever implement it for TLS.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels
  2025-02-04 16:12         ` Daniel P. Berrangé
@ 2025-02-04 16:29           ` Peter Xu
  0 siblings, 0 replies; 137+ messages in thread
From: Peter Xu @ 2025-02-04 16:29 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Maciej S. Szmigiero, Fabiano Rosas, Alex Williamson,
	Cédric Le Goater, Eric Blake, Markus Armbruster,
	Avihai Horon, Joao Martins, qemu-devel

On Tue, Feb 04, 2025 at 04:12:15PM +0000, Daniel P. Berrangé wrote:
> On Tue, Feb 04, 2025 at 11:02:28AM -0500, Peter Xu wrote:
> > On Tue, Feb 04, 2025 at 03:08:02PM +0000, Daniel P. Berrangé wrote:
> > > On Mon, Feb 03, 2025 at 01:20:01PM -0500, Peter Xu wrote:
> > > > On Thu, Jan 30, 2025 at 11:08:29AM +0100, Maciej S. Szmigiero wrote:
> > > > > From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> > > > > 
> > > > > Multifd send channels are terminated by calling
> > > > > qio_channel_shutdown(QIO_CHANNEL_SHUTDOWN_BOTH) in
> > > > > multifd_send_terminate_threads(), which in the TLS case essentially
> > > > > calls shutdown(SHUT_RDWR) on the underlying raw socket.
> > > > > 
> > > > > Unfortunately, this does not terminate the TLS session properly and
> > > > > the receive side sees this as a GNUTLS_E_PREMATURE_TERMINATION error.
> > > > > 
> > > > > The only reason why this wasn't causing migration failures is because
> > > > > the current migration code apparently does not check for migration
> > > > > error being set after the end of the multifd receive process.
> > > > > 
> > > > > However, this will change soon so the multifd receive code has to be
> > > > > prepared to not return an error on such premature TLS session EOF.
> > > > > Use the newly introduced QIOChannelTLS method for that.
> > > > > 
> > > > > It's worth noting that even if the sender were to be changed to terminate
> > > > > the TLS connection properly the receive side still needs to remain
> > > > > compatible with older QEMU bit stream which does not do this.
> > > > 
> > > > If this is an existing bug, we could add a Fixes.
> > > > 
> > > > Two pure questions..
> > > > 
> > > >   - What is the correct way to terminate the TLS session without this flag?
> > > > 
> > > >   - Why this is only needed by multifd sessions?
> > > 
> > > Graceful TLS termination (via gnutls_bye()) should only be important to
> > > security if the QEMU protocol in question does not know how much data it
> > > is expecting to recieve. ie it cannot otherwise distinguish between an
> > > expected EOF, and a premature EOF triggered by an attacker.
> > > 
> > > If the migration protocol has sufficient info to know when a chanel is
> > > expected to see EOF, then we should stop trying to read from the TLS
> > > channel before seeing the underlying EOF.
> > > 
> > > Ignoring GNUTLS_E_PREMATURE_TERMINATION would be valid if we know that
> > > migration will still fail corretly in the case of a malicious attack
> > > causing premature termination.
> > > 
> > > If there's a risk that migration may succeed, but with incomplete data,
> > > then we would need the full gnutls_bye dance.
> > 
> > IIUC that's not required for migration then, because migration should know
> > exactly how much data to receive, and migration should need to verify that
> > and fail if the received data didn't match the expectation along the way.
> > We also have QEMU_VM_EOF as the end mark of stream.
> > 
> > Said that, are we sure any pre-mature termination will only happen after
> > all data read in the receive buffer that was sent?
> > 
> > To ask in another way: what happens if the source QEMU sends everything and
> > shutdown()/close() the channel, meanwhile the dest QEMU sees both (1) rest
> > data to read, and (2) a pre-mature terminatino of TLS session in a read()
> > syscall.  Would (2) be reported even before (1), or the order guaranteed
> > that read of the residue data in (1) always happen before (2) (considering
> > dest QEMU can be slow sometime on consuming the network buffers)?
> 
> That's not logically possible.
> 
> In both (1) and (2) you are issuing a read() call to the TLS channel.
> 
> The first read call(s) consume all incoming data. Only once the underlying
> TCP socket read() returns 0, would GNUTLS  see that it hasn't got any
> TLS "bye" packet, and thus return GNUTLS_E_PREMATURE_TERMINATION from
> the layered TLS read(). IOW, if you see GNUTLS_E_PREMATURE_TERMINATION
> you know you have already read all received data off the socket.

That looks all OK then.  In that case we could set all migration TLS
sessions to ignore premature terminations.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH v4 19/33] migration: Add save_live_complete_precopy_thread handler
  2025-01-30 10:08 ` [PATCH v4 19/33] migration: Add save_live_complete_precopy_thread handler Maciej S. Szmigiero
@ 2025-02-04 17:54   ` Peter Xu
  2025-02-04 19:32     ` Maciej S. Szmigiero
  0 siblings, 1 reply; 137+ messages in thread
From: Peter Xu @ 2025-02-04 17:54 UTC (permalink / raw)
  To: Maciej S. Szmigiero
  Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

On Thu, Jan 30, 2025 at 11:08:40AM +0100, Maciej S. Szmigiero wrote:
> +static int multifd_device_state_save_thread(void *opaque)
> +{
> +    struct MultiFDDSSaveThreadData *data = opaque;
> +    int ret;
> +
> +    ret = data->hdlr(data->idstr, data->instance_id, &send_threads_abort,
> +                     data->handler_opaque);

I thought we discussed somewhere and the plan was we could use Error** here
to report errors.  Would that still make sense, or maybe I lost some
context?

Meanwhile, I still feel uneasy on having these globals (send_threads_abort,
send_threads_ret).  Can we make MultiFDDSSaveThreadData the only interface
between migration and the threads impl?  So I wonder if it can be:

  ret = data->hdlr(data);

With extended struct like this (I added thread_error and thread_quit):

struct MultiFDDSSaveThreadData {
    SaveLiveCompletePrecopyThreadHandler hdlr;
    char *idstr;
    uint32_t instance_id;
    void *handler_opaque;
    /*
     * Should be NULL when struct passed over to thread, the thread should
     * set this if the handler would return false.  It must be kept NULL if
     * the handler returned true / success.
     */
    Error *thread_error;
    /*
     * Migration core would set this when it wants to notify thread to
     * quit, for example, when error occured in other threads, or migration is
     * cancelled by the user.
     */
    bool thread_quit;
};

Then if any multifd_device_state_save_thread() failed, for example, it
should notify all threads to quit by setting thread_quit, instead of
relying on yet another global variable to show migration needs to quit.

Thanks,

> +    if (ret && !qatomic_read(&send_threads_ret)) {
> +        /*
> +         * Racy with the above read but that's okay - which thread error
> +         * return we report is purely arbitrary anyway.
> +         */
> +        qatomic_set(&send_threads_ret, ret);
> +    }
> +
> +    return 0;
> +}

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH v4 07/33] io: tls: Allow terminating the TLS session gracefully with EOF
  2025-02-04 16:14       ` Daniel P. Berrangé
@ 2025-02-04 18:25         ` Maciej S. Szmigiero
  2025-02-06 21:53           ` Peter Xu
  0 siblings, 1 reply; 137+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-04 18:25 UTC (permalink / raw)
  To: Daniel P. Berrangé, Peter Xu
  Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Avihai Horon, Joao Martins, qemu-devel

On 4.02.2025 17:14, Daniel P. Berrangé wrote:
> On Tue, Feb 04, 2025 at 05:02:23PM +0100, Maciej S. Szmigiero wrote:
>> On 4.02.2025 16:15, Daniel P. Berrangé wrote:
>>> On Thu, Jan 30, 2025 at 11:08:28AM +0100, Maciej S. Szmigiero wrote:
>>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>>
>>>> Currently, hitting EOF on receive without sender terminating the TLS
>>>> session properly causes the TLS channel to return an error (unless
>>>> the channel was already shut down for read).
>>>>
>>>> Add an optional setting whether we instead just return EOF in that
>>>> case.
>>>>
>>>> This possibility will be soon used by the migration multifd code.
>>>>
>>>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>>>> ---
>>>>    include/io/channel-tls.h | 11 +++++++++++
>>>>    io/channel-tls.c         |  6 ++++++
>>>>    2 files changed, 17 insertions(+)
>>>>
>>>> diff --git a/include/io/channel-tls.h b/include/io/channel-tls.h
>>>> index 26c67f17e2d3..8552c0d0266e 100644
>>>> --- a/include/io/channel-tls.h
>>>> +++ b/include/io/channel-tls.h
>>>> @@ -49,6 +49,7 @@ struct QIOChannelTLS {
>>>>        QCryptoTLSSession *session;
>>>>        QIOChannelShutdown shutdown;
>>>>        guint hs_ioc_tag;
>>>> +    bool premature_eof_okay;
>>>>    };
>>>>    /**
>>>> @@ -143,4 +144,14 @@ void qio_channel_tls_handshake(QIOChannelTLS *ioc,
>>>>    QCryptoTLSSession *
>>>>    qio_channel_tls_get_session(QIOChannelTLS *ioc);
>>>> +/**
>>>> + * qio_channel_tls_set_premature_eof_okay:
>>>> + * @ioc: the TLS channel object
>>>> + *
>>>> + * Sets whether receiving an EOF without terminating the TLS session properly
>>>> + * by used the other side is considered okay or an error (the
>>>> + * default behaviour).
>>>> + */
>>>> +void qio_channel_tls_set_premature_eof_okay(QIOChannelTLS *ioc, bool enabled);
>>>> +
>>>>    #endif /* QIO_CHANNEL_TLS_H */
>>>> diff --git a/io/channel-tls.c b/io/channel-tls.c
>>>> index aab630e5ae32..1079d6d10de1 100644
>>>> --- a/io/channel-tls.c
>>>> +++ b/io/channel-tls.c
>>>> @@ -147,6 +147,11 @@ qio_channel_tls_new_client(QIOChannel *master,
>>>>        return NULL;
>>>>    }
>>>> +void qio_channel_tls_set_premature_eof_okay(QIOChannelTLS *ioc, bool enabled)
>>>> +{
>>>> +    ioc->premature_eof_okay = enabled;
>>>> +}
>>>> +
>>>>    struct QIOChannelTLSData {
>>>>        QIOTask *task;
>>>>        GMainContext *context;
>>>> @@ -279,6 +284,7 @@ static ssize_t qio_channel_tls_readv(QIOChannel *ioc,
>>>>                tioc->session,
>>>>                iov[i].iov_base,
>>>>                iov[i].iov_len,
>>>> +            tioc->premature_eof_okay ||
>>>>                qatomic_load_acquire(&tioc->shutdown) & QIO_CHANNEL_SHUTDOWN_READ,
>>>>                errp);
>>>>            if (ret == QCRYPTO_TLS_SESSION_ERR_BLOCK) {
>>>
>>> IMHO a better way to do this is by defining an new flag for use with
>>> the qio_channel_readv_full() method. That makes the ignoring of
>>> premature shutdown a contextually scoped behaviour rather than a
>>> global behaviour.
>>
>> Something named like QIO_CHANNEL_READ_FLAG_TLS_EARLY_EOF_OKAY?
> 
> Since the flags are defined at the non-TLS layer in the API, I would
> pick  "QIO_CHANNEL_READ_RELAXED_EOF", as it could conceptually make
> sense to other layered channel protocols beyond TLS, even if we only
> ever implement it for TLS.

This will need extending at least qio_channel_read_all_eof(),
qio_channel_readv_all_eof() and qio_channel_readv_full_all_eof() with
"flags" parameter (and patching their callers accordingly) since they
currently don't take such parameter.

That's for the multifd channel recv thread main loop only, if @Peter
wants to patch also the mid-stream page receive methods and the main
migration channel receive then qio_channel_read(), qio_channel_read_all(),
qio_channel_readv_all() and qio_channel_readv_full_all() would need
such treatment too.
Not sure whether this makes sense since we should never get premature
TLS session termination mid-stream (and if we do get on that would be
a genuine error AFAIK).

> With regards,
> Daniel

Thanks,
Maciej



^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels
  2025-02-04 16:02       ` Peter Xu
  2025-02-04 16:12         ` Daniel P. Berrangé
@ 2025-02-04 18:25         ` Fabiano Rosas
  2025-02-04 19:34           ` Maciej S. Szmigiero
  1 sibling, 1 reply; 137+ messages in thread
From: Fabiano Rosas @ 2025-02-04 18:25 UTC (permalink / raw)
  To: Peter Xu, Daniel P. Berrangé
  Cc: Maciej S. Szmigiero, Alex Williamson, Cédric Le Goater,
	Eric Blake, Markus Armbruster, Avihai Horon, Joao Martins,
	qemu-devel

Peter Xu <peterx@redhat.com> writes:

> On Tue, Feb 04, 2025 at 03:08:02PM +0000, Daniel P. Berrangé wrote:
>> On Mon, Feb 03, 2025 at 01:20:01PM -0500, Peter Xu wrote:
>> > On Thu, Jan 30, 2025 at 11:08:29AM +0100, Maciej S. Szmigiero wrote:
>> > > From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>> > > 
>> > > Multifd send channels are terminated by calling
>> > > qio_channel_shutdown(QIO_CHANNEL_SHUTDOWN_BOTH) in
>> > > multifd_send_terminate_threads(), which in the TLS case essentially
>> > > calls shutdown(SHUT_RDWR) on the underlying raw socket.
>> > > 
>> > > Unfortunately, this does not terminate the TLS session properly and
>> > > the receive side sees this as a GNUTLS_E_PREMATURE_TERMINATION error.
>> > > 
>> > > The only reason why this wasn't causing migration failures is because
>> > > the current migration code apparently does not check for migration
>> > > error being set after the end of the multifd receive process.
>> > > 
>> > > However, this will change soon so the multifd receive code has to be
>> > > prepared to not return an error on such premature TLS session EOF.
>> > > Use the newly introduced QIOChannelTLS method for that.
>> > > 
>> > > It's worth noting that even if the sender were to be changed to terminate
>> > > the TLS connection properly the receive side still needs to remain
>> > > compatible with older QEMU bit stream which does not do this.
>> > 
>> > If this is an existing bug, we could add a Fixes.
>> > 
>> > Two pure questions..
>> > 
>> >   - What is the correct way to terminate the TLS session without this flag?
>> > 
>> >   - Why this is only needed by multifd sessions?
>> 
>> Graceful TLS termination (via gnutls_bye()) should only be important to
>> security if the QEMU protocol in question does not know how much data it
>> is expecting to recieve. ie it cannot otherwise distinguish between an
>> expected EOF, and a premature EOF triggered by an attacker.
>> 
>> If the migration protocol has sufficient info to know when a chanel is
>> expected to see EOF, then we should stop trying to read from the TLS
>> channel before seeing the underlying EOF.
>> 
>> Ignoring GNUTLS_E_PREMATURE_TERMINATION would be valid if we know that
>> migration will still fail corretly in the case of a malicious attack
>> causing premature termination.
>> 
>> If there's a risk that migration may succeed, but with incomplete data,
>> then we would need the full gnutls_bye dance.
>
> IIUC that's not required for migration then, because migration should know
> exactly how much data to receive, and migration should need to verify that
> and fail if the received data didn't match the expectation along the way.
> We also have QEMU_VM_EOF as the end mark of stream.

The migration overall can detect whether EOF should have been reached,
but multifd threads cannot. If one multifd channel experiences an issue
and sees a premature termination, but ignores it, then that's a hang in
QEMU because nothing provided the syncs needed (p->sem_sync, most
likely).

Aren't we just postponing a bug?

>
> Said that, are we sure any pre-mature termination will only happen after
> all data read in the receive buffer that was sent?
>
> To ask in another way: what happens if the source QEMU sends everything and
> shutdown()/close() the channel, meanwhile the dest QEMU sees both (1) rest
> data to read, and (2) a pre-mature terminatino of TLS session in a read()
> syscall.  Would (2) be reported even before (1), or the order guaranteed
> that read of the residue data in (1) always happen before (2) (considering
> dest QEMU can be slow sometime on consuming the network buffers)?
>
> Thanks,


^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH v4 19/33] migration: Add save_live_complete_precopy_thread handler
  2025-02-04 17:54   ` Peter Xu
@ 2025-02-04 19:32     ` Maciej S. Szmigiero
  2025-02-04 20:34       ` Peter Xu
  0 siblings, 1 reply; 137+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-04 19:32 UTC (permalink / raw)
  To: Peter Xu
  Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P. Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

On 4.02.2025 18:54, Peter Xu wrote:
> On Thu, Jan 30, 2025 at 11:08:40AM +0100, Maciej S. Szmigiero wrote:
>> +static int multifd_device_state_save_thread(void *opaque)
>> +{
>> +    struct MultiFDDSSaveThreadData *data = opaque;
>> +    int ret;
>> +
>> +    ret = data->hdlr(data->idstr, data->instance_id, &send_threads_abort,
>> +                     data->handler_opaque);
> 
> I thought we discussed somewhere and the plan was we could use Error** here
> to report errors.  Would that still make sense, or maybe I lost some
> context?

That was about *load* threads; here these are *save* threads.

Save handlers do not return an Error value: neither save_live_iterate,
save_live_complete_precopy, nor save_state does.

> Meanwhile, I still feel uneasy on having these globals (send_threads_abort,
> send_threads_ret).  Can we make MultiFDDSSaveThreadData the only interface
> between migration and the threads impl?  So I wonder if it can be:
> 
>    ret = data->hdlr(data);
> 
> With extended struct like this (I added thread_error and thread_quit):
> 
> struct MultiFDDSSaveThreadData {
>      SaveLiveCompletePrecopyThreadHandler hdlr;
>      char *idstr;
>      uint32_t instance_id;
>      void *handler_opaque;
>      /*
>       * Should be NULL when struct passed over to thread, the thread should
>       * set this if the handler would return false.  It must be kept NULL if
>       * the handler returned true / success.
>       */
>      Error *thread_error;

As I mentioned above, these handlers do not generally return Error type,
so this would need to be an *int;

>      /*
>       * Migration core would set this when it wants to notify thread to
>       * quit, for example, when error occurred in other threads, or migration is
>       * cancelled by the user.
>       */
>      bool thread_quit;

             ^ I guess that was supposed to be a pointer too (*thread_quit).

> };
> 
> Then if any multifd_device_state_save_thread() failed, for example, it
> should notify all threads to quit by setting thread_quit, instead of
> relying on yet another global variable to show migration needs to quit.

multifd_abort_device_state_save_threads() needs to access
send_threads_abort too.

And multifd_join_device_state_save_threads() needs to access
send_threads_ret.

These variables ultimately will have to be stored somewhere since
there can be multiple save threads and so multiple instances of
MultiFDDSSaveThreadData.

So these need to be stored somewhere where
multifd_spawn_device_state_save_thread() can reach them to assign
their addresses to MultiFDDSSaveThreadData members.

However, at that point multifd_device_state_save_thread() can
access them too so it does not need to have them passed via
MultiFDDSSaveThreadData.

However, nothing prevents putting the send_threads* variables
into a global struct (with internal linkage - "static", just as
these separate ones are) if you like such a construct more.

> Thanks,

Thanks,
Maciej



^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH v4 09/33] migration: postcopy_ram_listen_thread() needs to take BQL for some calls
  2025-02-04 15:39                       ` Peter Xu
@ 2025-02-04 19:32                         ` Maciej S. Szmigiero
  0 siblings, 0 replies; 137+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-04 19:32 UTC (permalink / raw)
  To: Peter Xu
  Cc: Dr. David Alan Gilbert, Fabiano Rosas, Alex Williamson,
	Cédric Le Goater, Eric Blake, Markus Armbruster,
	Daniel P. Berrangé, Avihai Horon, Joao Martins, qemu-devel

On 4.02.2025 16:39, Peter Xu wrote:
> On Tue, Feb 04, 2025 at 03:57:37PM +0100, Maciej S. Szmigiero wrote:
>> The vfio_migration_cleanup() used to just close a migration FD, while
>> RAM might end up calling qemu_ram_msync(), which sounds like something
>> that should be called under BQL.
>>
>> But I am not sure whether that lack of BQL around qemu_ram_msync()
>> actually causes problems.
> 
> I believe msync() is thread-safe.  So that doesn't need BQL, AFAICT.
> 
> Personally I actually prefer not having the BQL requirement if ever
> possible in any vmstate hooks.
> 
> I think the only challenge here is if VFIO will start to need BQL for some
> specific code path that you added in this series, it means VFIO needs to
> detect bql_locked() to make sure it won't deadlock.. and only take BQL if
> it's not taken.
> 
>  From that POV, it might be easier for you to define that hook as "always do
> cleanup() with BQL" globally, just to avoid one bql_locked() usage in vfio
> specific hook.  We pay that with slow RAM sync in corner cases like pmem
> that could potentially block VM from making progress (e.g. vcpu
> concurrently accessing MMIO regions).

Not only is the VFIO load_cleanup hook BQL-sensitive in this patch set,
but so is the load threads pool cleanup handler.

And migration_incoming_state_destroy() was already called with BQL everywhere
but in the postcopy thread, so I think any BQL-related performance issue
would already have been uncovered by its other callers.

> Thanks,
> 

Thanks,
Maciej



^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels
  2025-02-04 18:25         ` Fabiano Rosas
@ 2025-02-04 19:34           ` Maciej S. Szmigiero
  0 siblings, 0 replies; 137+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-04 19:34 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: Alex Williamson, Peter Xu, Daniel P. Berrangé,
	Cédric Le Goater, Eric Blake, Markus Armbruster,
	Avihai Horon, Joao Martins, qemu-devel

On 4.02.2025 19:25, Fabiano Rosas wrote:
> Peter Xu <peterx@redhat.com> writes:
> 
>> On Tue, Feb 04, 2025 at 03:08:02PM +0000, Daniel P. Berrangé wrote:
>>> On Mon, Feb 03, 2025 at 01:20:01PM -0500, Peter Xu wrote:
>>>> On Thu, Jan 30, 2025 at 11:08:29AM +0100, Maciej S. Szmigiero wrote:
>>>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>>>
>>>>> Multifd send channels are terminated by calling
>>>>> qio_channel_shutdown(QIO_CHANNEL_SHUTDOWN_BOTH) in
>>>>> multifd_send_terminate_threads(), which in the TLS case essentially
>>>>> calls shutdown(SHUT_RDWR) on the underlying raw socket.
>>>>>
>>>>> Unfortunately, this does not terminate the TLS session properly and
>>>>> the receive side sees this as a GNUTLS_E_PREMATURE_TERMINATION error.
>>>>>
>>>>> The only reason why this wasn't causing migration failures is because
>>>>> the current migration code apparently does not check for migration
>>>>> error being set after the end of the multifd receive process.
>>>>>
>>>>> However, this will change soon so the multifd receive code has to be
>>>>> prepared to not return an error on such premature TLS session EOF.
>>>>> Use the newly introduced QIOChannelTLS method for that.
>>>>>
>>>>> It's worth noting that even if the sender were to be changed to terminate
>>>>> the TLS connection properly the receive side still needs to remain
>>>>> compatible with older QEMU bit stream which does not do this.
>>>>
>>>> If this is an existing bug, we could add a Fixes.
>>>>
>>>> Two pure questions..
>>>>
>>>>    - What is the correct way to terminate the TLS session without this flag?
>>>>
>>>>    - Why this is only needed by multifd sessions?
>>>
>>> Graceful TLS termination (via gnutls_bye()) should only be important to
>>> security if the QEMU protocol in question does not know how much data it
>>> is expecting to receive, i.e. it cannot otherwise distinguish between an
>>> expected EOF, and a premature EOF triggered by an attacker.
>>>
>>> If the migration protocol has sufficient info to know when a channel is
>>> expected to see EOF, then we should stop trying to read from the TLS
>>> channel before seeing the underlying EOF.
>>>
>>> Ignoring GNUTLS_E_PREMATURE_TERMINATION would be valid if we know that
>>> migration will still fail correctly in the case of a malicious attack
>>> causing premature termination.
>>>
>>> If there's a risk that migration may succeed, but with incomplete data,
>>> then we would need the full gnutls_bye dance.
>>
>> IIUC that's not required for migration then, because migration should know
>> exactly how much data to receive, and migration should need to verify that
>> and fail if the received data didn't match the expectation along the way.
>> We also have QEMU_VM_EOF as the end mark of stream.
> 
> The migration overall can detect whether EOF should have been reached,
> but multifd threads cannot. If one multifd channel experiences an issue
> and sees a premature termination, but ignores it, then that's a hang in
> QEMU because nothing provided the syncs needed (p->sem_sync, most
> likely).

I think allowing premature TLS termination simply makes the TLS case
function the same as the non-TLS case here if the source were to close the
multifd channel(s) early.

> Aren't we just postponing a bug?
> 

Thanks,
Maciej



^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH v4 19/33] migration: Add save_live_complete_precopy_thread handler
  2025-02-04 19:32     ` Maciej S. Szmigiero
@ 2025-02-04 20:34       ` Peter Xu
  2025-02-05 11:53         ` Maciej S. Szmigiero
  0 siblings, 1 reply; 137+ messages in thread
From: Peter Xu @ 2025-02-04 20:34 UTC (permalink / raw)
  To: Maciej S. Szmigiero
  Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P. Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

On Tue, Feb 04, 2025 at 08:32:15PM +0100, Maciej S. Szmigiero wrote:
> On 4.02.2025 18:54, Peter Xu wrote:
> > On Thu, Jan 30, 2025 at 11:08:40AM +0100, Maciej S. Szmigiero wrote:
> > > +static int multifd_device_state_save_thread(void *opaque)
> > > +{
> > > +    struct MultiFDDSSaveThreadData *data = opaque;
> > > +    int ret;
> > > +
> > > +    ret = data->hdlr(data->idstr, data->instance_id, &send_threads_abort,
> > > +                     data->handler_opaque);
> > 
> > I thought we discussed somewhere and the plan was we could use Error** here
> > to report errors.  Would that still make sense, or maybe I lost some
> > context?
> 
> That was about *load* threads, here these are *save* threads.

Ah OK.

> 
> Save handlers do not return an Error value, neither save_live_iterate, nor
> save_live_complete_precopy or save_state does so.

Let's try to make new APIs work with Error* if possible.

> 
> > Meanwhile, I still feel uneasy on having these globals (send_threads_abort,
> > send_threads_ret).  Can we make MultiFDDSSaveThreadData the only interface
> > between migration and the threads impl?  So I wonder if it can be:
> > 
> >    ret = data->hdlr(data);
> > 
> > With extended struct like this (I added thread_error and thread_quit):
> > 
> > struct MultiFDDSSaveThreadData {
> >      SaveLiveCompletePrecopyThreadHandler hdlr;
> >      char *idstr;
> >      uint32_t instance_id;
> >      void *handler_opaque;
> >      /*
> >       * Should be NULL when struct passed over to thread, the thread should
> >       * set this if the handler would return false.  It must be kept NULL if
> >       * the handler returned true / success.
> >       */
> >      Error *thread_error;
> 
> As I mentioned above, these handlers do not generally return Error type,
> so this would need to be an *int;
> 
> >      /*
> >       * Migration core would set this when it wants to notify thread to
> >       * quit, for example, when error occurred in other threads, or migration is
> >       * cancelled by the user.
> >       */
> >      bool thread_quit;
> 
>             ^ I guess that was supposed to be a pointer too (*thread_quit).

It's my intention to make this bool, to make everything managed per-thread.

It's actually what we do with multifd; these are a bunch of extra threads,
to differentiate from the "IO threads" / "multifd threads".

> 
> > };
> > 
> > Then if any multifd_device_state_save_thread() failed, for example, it
> > should notify all threads to quit by setting thread_quit, instead of
> > relying on yet another global variable to show migration needs to quit.
> 
> multifd_abort_device_state_save_threads() needs to access
> send_threads_abort too.

This may need to become something like:

  QLIST_FOREACH() {
      MultiFDDSSaveThreadData *data = ...;
      data->thread_quit = true;
  }

We may want to double check qmp 'migrate_cancel' will work when save
threads are running, but this can also be done for later.

> 
> And multifd_join_device_state_save_threads() needs to access
> send_threads_ret.

Then this one becomes:

  thread_pool_wait(send_threads);
  QLIST_FOREACH() {
      MultiFDDSSaveThreadData *data = ...;
      if (data->thread_error) {
         return false;
      }
  }
  return true;

> 
> These variables ultimately will have to be stored somewhere since
> there can be multiple save threads and so multiple instances of
> MultiFDDSSaveThreadData.
> 
> So these need to be stored somewhere where
> multifd_spawn_device_state_save_thread() can reach them to assign
> their addresses to MultiFDDSSaveThreadData members.

Then multifd_spawn_device_state_save_thread() will need to manage the
qlist, making sure migration core remembers what jobs it submitted.  It
sounds good to have that bookkeeping when I think about it, instead of
throwing the job to the thread pool and forgetting it.

> 
> However, at that point multifd_device_state_save_thread() can
> access them too so it does not need to have them passed via
> MultiFDDSSaveThreadData.
> 
> However, nothing prevents putting send_threads* variables
> into a global struct (with internal linkage - "static", just as
> these separate ones are) if you like such construct more.

This should be better than the current global vars indeed, but less
favoured if the per-thread way could work above.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH v4 19/33] migration: Add save_live_complete_precopy_thread handler
  2025-02-04 20:34       ` Peter Xu
@ 2025-02-05 11:53         ` Maciej S. Szmigiero
  2025-02-05 15:55           ` Peter Xu
  0 siblings, 1 reply; 137+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-05 11:53 UTC (permalink / raw)
  To: Peter Xu
  Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P. Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

On 4.02.2025 21:34, Peter Xu wrote:
> On Tue, Feb 04, 2025 at 08:32:15PM +0100, Maciej S. Szmigiero wrote:
>> On 4.02.2025 18:54, Peter Xu wrote:
>>> On Thu, Jan 30, 2025 at 11:08:40AM +0100, Maciej S. Szmigiero wrote:
>>>> +static int multifd_device_state_save_thread(void *opaque)
>>>> +{
>>>> +    struct MultiFDDSSaveThreadData *data = opaque;
>>>> +    int ret;
>>>> +
>>>> +    ret = data->hdlr(data->idstr, data->instance_id, &send_threads_abort,
>>>> +                     data->handler_opaque);
>>>
>>> I thought we discussed somewhere and the plan was we could use Error** here
>>> to report errors.  Would that still make sense, or maybe I lost some
>>> context?
>>
>> That was about *load* threads, here these are *save* threads.
> 
> Ah OK.
> 
>>
>> Save handlers do not return an Error value, neither save_live_iterate, nor
>> save_live_complete_precopy or save_state does so.
> 
> Let's try to make new APIs work with Error* if possible.

Let's assume that these threads return an Error object.

What's qemu_savevm_state_complete_precopy_iterable() supposed to do with it?
Both it and its caller qemu_savevm_state_complete_precopy() only handle int
errors.

qemu_savevm_state_complete_precopy() in turn has 4 callers, half of which (2)
also would need to be enlightened with Error handling somehow.

> 
>>
>>> Meanwhile, I still feel uneasy on having these globals (send_threads_abort,
>>> send_threads_ret).  Can we make MultiFDDSSaveThreadData the only interface
>>> between migration and the threads impl?  So I wonder if it can be:
>>>
>>>     ret = data->hdlr(data);
>>>
>>> With extended struct like this (I added thread_error and thread_quit):
>>>
>>> struct MultiFDDSSaveThreadData {
>>>       SaveLiveCompletePrecopyThreadHandler hdlr;
>>>       char *idstr;
>>>       uint32_t instance_id;
>>>       void *handler_opaque;
>>>       /*
>>>        * Should be NULL when struct passed over to thread, the thread should
>>>        * set this if the handler would return false.  It must be kept NULL if
>>>        * the handler returned true / success.
>>>        */
>>>       Error *thread_error;
>>
>> As I mentioned above, these handlers do not generally return Error type,
>> so this would need to be an *int;
>>
>>>       /*
>>>        * Migration core would set this when it wants to notify thread to
> > >        * quit, for example, when error occurred in other threads, or migration is
>>>        * cancelled by the user.
>>>        */
>>>       bool thread_quit;
>>
>>              ^ I guess that was supposed to be a pointer too (*thread_quit).
> 
> It's my intention to make this bool, to make everything managed per-thread.

But that's unnecessary since this flag is common to all these threads.

> It's actually what we do with multifd, these are a bunch of extra threads
> to differentiate from the "IO threads" / "multifd threads".
> 
>>
>>> };
>>>
>>> Then if any multifd_device_state_save_thread() failed, for example, it
>>> should notify all threads to quit by setting thread_quit, instead of
>>> relying on yet another global variable to show migration needs to quit.
>>
>> multifd_abort_device_state_save_threads() needs to access
>> send_threads_abort too.
> 
> This may need to become something like:
> 
>    QLIST_FOREACH() {
>        MultiFDDSSaveThreadData *data = ...;
>        data->thread_quit = true;
>    }

At the most basic level, that's turning an O(1) operation into an O(n) one.

Besides, it creates the question of who now owns these MultiFDDSSaveThreadData
structures - they could be owned by either the thread pool or the
multifd_device_state code.

Currently the ownership is simple - the multifd_device_state code
allocates such a per-thread structure in multifd_spawn_device_state_save_thread()
and immediately passes its ownership to the thread pool, which
takes care of freeing it once it no longer needs it.

Now, with the list implementation, if the thread pool were to free
that MultiFDDSSaveThreadData it would also need to remove it from
the list.

Which in turn would need appropriate locking around this removal
operation and probably also each time the list is iterated over.

On the other hand if the multifd_device_state code were to own
that MultiFDDSSaveThreadData then it would linger around until
multifd_device_state_send_cleanup() cleans it up even though its
associated thread might be long gone.

> We may want to double check qmp 'migrate_cancel' will work when save
> threads are running, but this can also be done for later.

>>
>> And multifd_join_device_state_save_threads() needs to access
>> send_threads_ret.
> 
> Then this one becomes:
> 
>    thread_pool_wait(send_threads);
>    QLIST_FOREACH() {
>        MultiFDDSSaveThreadData *data = ...;
>        if (data->thread_error) {
>           return false;
>        }
>    }
>    return true;

Same here, having a common error return would save us from having
to iterate over a list (or having a list in the first place).

>>
>> These variables ultimately will have to be stored somewhere since
>> there can be multiple save threads and so multiple instances of
>> MultiFDDSSaveThreadData.
>>
>> So these need to be stored somewhere where
>> multifd_spawn_device_state_save_thread() can reach them to assign
>> their addresses to MultiFDDSSaveThreadData members.
> 
> Then multifd_spawn_device_state_save_thread() will need to manage the
> qlist, making sure migration core remembers what jobs it submitted.  It
> sounds good to have that bookkeeping when I think about it, instead of
> throw the job to the thread pool and forget it..

It's not "forgetting" about the job but rather letting the thread pool
manage it - I think the thread pool was introduced so that these details
(thread management) are abstracted away from the migration code.
Now they would be effectively duplicated in the migration code.

>>
>> However, at that point multifd_device_state_save_thread() can
>> access them too so it does not need to have them passed via
>> MultiFDDSSaveThreadData.
>>
>> However, nothing prevents putting send_threads* variables
>> into a global struct (with internal linkage - "static", just as
>> these separate ones are) if you like such construct more.
> 
> This should be better than the current global vars indeed, but less
> favoured if the per-thread way could work above.

You still need that list to be a global variable,
so it's the same number of global variables as just putting
the existing variables in a struct (which could even be allocated
in multifd_device_state_send_setup() and deallocated in
multifd_device_state_send_cleanup() for extra memory savings).

These variables have internal linkage limited to the (relatively
small) multifd-device-state.c, so it's not like they are polluting
the namespace of some major migration translation unit.

Taking into consideration having to manage an extra data structure
(the list), needing more code to do so, and getting worse algorithms, I don't
really see the point of using that list.

(This is orthogonal to whether the thread return type is changed to
Error which could be easily done on the existing save threads pool
implementation).

> Thanks,
> 

Thanks,
Maciej



^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH v4 19/33] migration: Add save_live_complete_precopy_thread handler
  2025-02-05 11:53         ` Maciej S. Szmigiero
@ 2025-02-05 15:55           ` Peter Xu
  2025-02-06 11:41             ` Maciej S. Szmigiero
  0 siblings, 1 reply; 137+ messages in thread
From: Peter Xu @ 2025-02-05 15:55 UTC (permalink / raw)
  To: Maciej S. Szmigiero
  Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P. Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

On Wed, Feb 05, 2025 at 12:53:21PM +0100, Maciej S. Szmigiero wrote:
> On 4.02.2025 21:34, Peter Xu wrote:
> > On Tue, Feb 04, 2025 at 08:32:15PM +0100, Maciej S. Szmigiero wrote:
> > > On 4.02.2025 18:54, Peter Xu wrote:
> > > > On Thu, Jan 30, 2025 at 11:08:40AM +0100, Maciej S. Szmigiero wrote:
> > > > > +static int multifd_device_state_save_thread(void *opaque)
> > > > > +{
> > > > > +    struct MultiFDDSSaveThreadData *data = opaque;
> > > > > +    int ret;
> > > > > +
> > > > > +    ret = data->hdlr(data->idstr, data->instance_id, &send_threads_abort,
> > > > > +                     data->handler_opaque);
> > > > 
> > > > I thought we discussed somewhere and the plan was we could use Error** here
> > > > to report errors.  Would that still make sense, or maybe I lost some
> > > > context?
> > > 
> > > That was about *load* threads, here these are *save* threads.
> > 
> > Ah OK.
> > 
> > > 
> > > Save handlers do not return an Error value, neither save_live_iterate, nor
> > > save_live_complete_precopy or save_state does so.
> > 
> > Let's try to make new APIs work with Error* if possible.
> 
> Let's assume that these threads return an Error object.
> 
> What's qemu_savevm_state_complete_precopy_iterable() supposed to do with it?

IIUC it's not about qemu_savevm_state_complete_precopy_iterable() in this
context, as the Error* will be used in one of the threads of the pool, not
in the migration thread.

The goal is to be able to set Error* with migrate_set_error(), so that when
migration fails, query-migrate can return the error to libvirt; migration
always tries to remember the first error hit, if ever possible.

It's multifd_device_state_save_thread() to do migrate_set_error(), not in
migration thread.  qemu_savevm_state_complete_*() are indeed not ready to
pass Errors, but it's not in the discussed stack.

> Both it and its caller qemu_savevm_state_complete_precopy() only handle int
> errors.
> 
> qemu_savevm_state_complete_precopy() in turn has 4 callers, half of which (2)
> also would need to be enlightened with Error handling somehow.

Right, we don't need to touch those, as explained above.

Generally speaking, IMHO it's always good to add new code in migration with
Error* reports rather than retvals, even if the new code is in the
migration thread stack.  It makes it easier for future changes to switch to Error*.

> 
> > 
> > > 
> > > > Meanwhile, I still feel uneasy on having these globals (send_threads_abort,
> > > > send_threads_ret).  Can we make MultiFDDSSaveThreadData the only interface
> > > > between migration and the threads impl?  So I wonder if it can be:
> > > > 
> > > >     ret = data->hdlr(data);
> > > > 
> > > > With extended struct like this (I added thread_error and thread_quit):
> > > > 
> > > > struct MultiFDDSSaveThreadData {
> > > >       SaveLiveCompletePrecopyThreadHandler hdlr;
> > > >       char *idstr;
> > > >       uint32_t instance_id;
> > > >       void *handler_opaque;
> > > >       /*
> > > >        * Should be NULL when struct passed over to thread, the thread should
> > > >        * set this if the handler would return false.  It must be kept NULL if
> > > >        * the handler returned true / success.
> > > >        */
> > > >       Error *thread_error;
> > > 
> > > As I mentioned above, these handlers do not generally return Error type,
> > > so this would need to be an *int;
> > > 
> > > >       /*
> > > >        * Migration core would set this when it wants to notify thread to
> > > >        * quit, for example, when error occurred in other threads, or migration is
> > > >        * cancelled by the user.
> > > >        */
> > > >       bool thread_quit;
> > > 
> > >              ^ I guess that was supposed to be a pointer too (*thread_quit).
> > 
> > It's my intention to make this bool, to make everything managed per-thread.
> 
> But that's unnecessary since this flag is common to all these threads.

One bool would be enough, but you'll need to export another API for VFIO to
use otherwise.  I suppose that's ok too.

Some context of multifd threads and how that's done there..

We started with one "quit" per thread struct, but then we switched to one
bool exactly as you said, see commit 15f3f21d598148.

If you want to stick with one bool, it's okay too, you can export something
similar in misc.h, e.g. multifd_device_state_save_thread_quitting(), then
we can avoid passing in the "quit" either as handler parameter, or
per-thread flag.

> 
> > It's actually what we do with multifd, these are a bunch of extra threads
> > to differentiate from the "IO threads" / "multifd threads".
> > 
> > > 
> > > > };
> > > > 
> > > > Then if any multifd_device_state_save_thread() failed, for example, it
> > > > should notify all threads to quit by setting thread_quit, instead of
> > > > relying on yet another global variable to show migration needs to quit.
> > > 
> > > multifd_abort_device_state_save_threads() needs to access
> > > send_threads_abort too.
> > 
> > This may need to become something like:
> > 
> >    QLIST_FOREACH() {
> >        MultiFDDSSaveThreadData *data = ...;
> >        data->thread_quit = true;
> >    }
> 
> At the most basic level that's turning O(1) operation into O(n).
> 
> Besides, it creates a question now who now owns these MultiFDDSSaveThreadData
> structures - they could be owned by either thread pool or the
> multifd_device_state code.

I think it should be owned by migration, and with this idea it will need to
stay around until we have waited for the thread pool to complete its work, so
the migration core needs to free them.

> 
> Currently the ownership is simple - the multifd_device_state code
> allocates such per-thread structure in multifd_spawn_device_state_save_thread()
> and immediately passes its ownership to the thread pool which
> takes care to free it once it no longer needs it.

Right, this is another reason why I think having migration owning these
structs is better.  We used to have dangling-task issues when we shifted
ownership of something to the mainloop and then lost track of it (e.g. on TLS
handshake gsources).  Those are pretty hard to debug when hung, because
migration core no longer has anything linking it back to the hung tasks.

I think we should start from having the migration core able to reach
these thread-based tasks when needed.  Migration also has control of the
thread pool, so that would be easier.  The thread pool is so far simple,
so we may still need to be able to reference per-task info separately.

> 
> Now, with the list implementation if the thread pool were to free
> that MultiFDDSSaveThreadData it would also need to release it from
> the list.
> 
> Which in turn would need appropriate locking around this removal
> operation and probably also each time the list is iterated over.
> 
> On the other hand if the multifd_device_state code were to own
> that MultiFDDSSaveThreadData then it would linger around until
> multifd_device_state_send_cleanup() cleans it up even though its
> associated thread might be long gone.

Do you see a problem with it?  It sounds good to me actually, and pretty
easy to understand.

So migration creates these MultiFDDSSaveThreadData, then creates threads to
enqueue them, then waits for all threads to complete, then frees these
structs.

> 
> > We may want to double check qmp 'migrate_cancel' will work when save
> > threads are running, but this can also be done for later.
> 
> > > 
> > > And multifd_join_device_state_save_threads() needs to access
> > > send_threads_ret.
> > 
> > Then this one becomes:
> > 
> >    thread_pool_wait(send_threads);
> >    QLIST_FOREACH() {
> >        MultiFDDSSaveThreadData *data = ...;
> >        if (data->thread_error) {
> >           return false;
> >        }
> >    }
> >    return true;
> 
> Same here, having a common error return would save us from having
> to iterate over a list (or having a list in the first place).

IMHO perf isn't an issue here.  It's a slow path, the number of threads is
small, and the loop is cheap.  I prefer to prioritize cleanness in this case.

Otherwise, any suggestion on how we could report an Error* from the threads?

> 
> > > 
> > > These variables ultimately will have to be stored somewhere since
> > > there can be multiple save threads and so multiple instances of
> > > MultiFDDSSaveThreadData.
> > > 
> > > So these need to be stored somewhere where
> > > multifd_spawn_device_state_save_thread() can reach them to assign
> > > their addresses to MultiFDDSSaveThreadData members.
> > 
> > Then multifd_spawn_device_state_save_thread() will need to manage the
> > qlist, making sure migration core remembers what jobs it submitted.  It
> > sounds good to have that bookkeeping when I think about it, instead of
> > throw the job to the thread pool and forget it..
> 
> It's not "forgetting" about the job but rather letting thread pool
> manage it - I think thread pool was introduced so these details
> (thread management) are abstracted from the migration code.
> Now they would be effectively duplicated in the migration code.

Migration is still managing those as long as you have send_threads_abort,
isn't it?  The thread pool doesn't yet have an API to say "let's quit all
the tasks"; otherwise I'd also be OK with using the pool API instead of
having thread_quit.

> 
> > > 
> > > However, at that point multifd_device_state_save_thread() can
> > > access them too so it does not need to have them passed via
> > > MultiFDDSSaveThreadData.
> > > 
> > > However, nothing prevents putting send_threads* variables
> > > into a global struct (with internal linkage - "static", just as
> > > these separate ones are) if you like such construct more.
> > 
> > This should be better than the current global vars indeed, but less
> > favoured if the per-thread way could work above.
> 
> You still need that list to be a global variable,
> so it's the same amount of global variables as just putting
> the existing variables in a struct (which could be even allocated
> in multifd_device_state_send_setup() and deallocated in
> multifd_device_state_send_cleanup() for extra memory savings).

Yes this works for me.

I think you got me wrong on "not allowing to introduce global variables".
I'm OK with it, but please still consider..

  - Put it under some existing global object rather than having separate
    global variables all over the place..

  - Having Error reports

And I still think we can change:

typedef int (*SaveLiveCompletePrecopyThreadHandler)(char *idstr,
                                                    uint32_t instance_id,
                                                    bool *abort_flag,
                                                    void *opaque);

To:

typedef int (*SaveLiveCompletePrecopyThreadHandler)(MultiFDDSSaveThreadData*);

No matter what.
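
As a minimal sketch of that single-pointer signature (the struct layout
here is purely illustrative, not the actual MultiFDDSSaveThreadData
definition, and demo_save_thread is a made-up handler):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout; the real struct lives in the migration code. */
typedef struct MultiFDDSSaveThreadData {
    char *idstr;
    uint32_t instance_id;
    bool *abort_flag;
    void *handler_opaque;
} MultiFDDSSaveThreadData;

/* The suggested signature: one pointer instead of four loose arguments. */
typedef int (*SaveLiveCompletePrecopyThreadHandler)(MultiFDDSSaveThreadData *d);

static int demo_save_thread(MultiFDDSSaveThreadData *d)
{
    /* A real handler would stream device state until done or *abort_flag. */
    return *d->abort_flag ? -1 : (int)d->instance_id;
}
```

New fields can then be added to the struct later without touching every
handler's prototype.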

> 
> These variables are having internal linkage limited to (relatively
> small) multifd-device-state.c, so it's not like they are polluting
> namespace in some major migration translation unit.

If someone proposes to introduce 100 global vars in multifd-device-state.c,
I'll strongly object.

If it's one global var, I'm OK.

What if it's 5?

===8<===
static QemuMutex queue_job_mutex;

static ThreadPool *send_threads;
static int send_threads_ret;
static bool send_threads_abort;

static MultiFDSendData *device_state_send;
===8<===

I think I should call a stop at some point, and that's what happened..

Please consider introducing something like multifd_send_device_state so we
can keep anyone in the future from randomly adding static global vars.
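
One possible shape for such a container, as a sketch only: the stand-in
types below are not the real QemuMutex/ThreadPool/MultiFDSendData, and
the setup/cleanup names merely follow the allocation-in-setup idea
mentioned earlier in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Opaque stand-ins for QEMU types, for illustration only. */
typedef struct ThreadPool ThreadPool;
typedef struct MultiFDSendData MultiFDSendData;
typedef int QemuMutex;

/* One allocated object instead of five file-scope statics. */
typedef struct MultiFDSendDeviceState {
    QemuMutex queue_job_mutex;
    ThreadPool *send_threads;
    int send_threads_ret;
    bool send_threads_abort;
    MultiFDSendData *device_state_send;
} MultiFDSendDeviceState;

static MultiFDSendDeviceState *multifd_send_device_state;

static void multifd_device_state_send_setup(void)
{
    /* Zero-initialized, alive only for the duration of the migration. */
    multifd_send_device_state = calloc(1, sizeof(*multifd_send_device_state));
}

static void multifd_device_state_send_cleanup(void)
{
    free(multifd_send_device_state);
    multifd_send_device_state = NULL;
}
```

Future state then has an obvious home, and nothing persists between
migration runs.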

> 
> Taking into consideration having to manage an extra data structure
> (list), needing more code to do so, having worse algorithms I don't
> really see a point of using that list.
> 
> (This is orthogonal to whether the thread return type is changed to
> Error which could be easily done on the existing save threads pool
> implementation).

My bet is changing to a list is just as easy (10-20 LOC?).  If not, I can
try to provide the diff on top of your patch.

I'm also not strictly asking for a list, but anything that makes the API
cleaner (less globals, better error reports, etc.).

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels
  2025-02-04 15:39                 ` Daniel P. Berrangé
@ 2025-02-05 19:09                   ` Fabiano Rosas
  2025-02-05 20:42                     ` Fabiano Rosas
  0 siblings, 1 reply; 137+ messages in thread
From: Fabiano Rosas @ 2025-02-05 19:09 UTC (permalink / raw)
  To: Daniel P. Berrangé, Peter Xu
  Cc: Maciej S. Szmigiero, Alex Williamson, Cédric Le Goater,
	Eric Blake, Markus Armbruster, Avihai Horon, Joao Martins,
	qemu-devel

Daniel P. Berrangé <berrange@redhat.com> writes:

> On Tue, Feb 04, 2025 at 10:31:31AM -0500, Peter Xu wrote:
>> On Tue, Feb 04, 2025 at 03:39:00PM +0100, Maciej S. Szmigiero wrote:
>> > On 3.02.2025 23:56, Peter Xu wrote:
>> > > On Mon, Feb 03, 2025 at 10:41:32PM +0100, Maciej S. Szmigiero wrote:
>> > > > On 3.02.2025 21:20, Peter Xu wrote:
>> > > > > On Mon, Feb 03, 2025 at 07:53:00PM +0100, Maciej S. Szmigiero wrote:
>> > > > > > On 3.02.2025 19:20, Peter Xu wrote:
>> > > > > > > On Thu, Jan 30, 2025 at 11:08:29AM +0100, Maciej S. Szmigiero wrote:
>> > > > > > > > From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>> > > > > > > > 
>> > > > > > > > Multifd send channels are terminated by calling
>> > > > > > > > qio_channel_shutdown(QIO_CHANNEL_SHUTDOWN_BOTH) in
>> > > > > > > > multifd_send_terminate_threads(), which in the TLS case essentially
>> > > > > > > > calls shutdown(SHUT_RDWR) on the underlying raw socket.
>> > > > > > > > 
>> > > > > > > > Unfortunately, this does not terminate the TLS session properly and
>> > > > > > > > the receive side sees this as a GNUTLS_E_PREMATURE_TERMINATION error.
>> > > > > > > > 
>> > > > > > > > The only reason why this wasn't causing migration failures is because
>> > > > > > > > the current migration code apparently does not check for migration
>> > > > > > > > error being set after the end of the multifd receive process.
>> > > > > > > > 
>> > > > > > > > However, this will change soon so the multifd receive code has to be
>> > > > > > > > prepared to not return an error on such premature TLS session EOF.
>> > > > > > > > Use the newly introduced QIOChannelTLS method for that.
>> > > > > > > > 
>> > > > > > > > It's worth noting that even if the sender were to be changed to terminate
>> > > > > > > > the TLS connection properly the receive side still needs to remain
>> > > > > > > > compatible with older QEMU bit stream which does not do this.
>> > > > > > > 
>> > > > > > > If this is an existing bug, we could add a Fixes.
>> > > > > > 
>> > > > > > It is an existing issue but only uncovered by this patch set.
>> > > > > > 
>> > > > > > As far as I can see it was always there, so it would need some
>> > > > > > thought where to point that Fixes tag.
>> > > > > 
>> > > > > If there's no way to trigger a real functional bug anyway, it's also ok we
>> > > > > omit the Fixes.
>> > > > > 
>> > > > > > > Two pure questions..
>> > > > > > > 
>> > > > > > >      - What is the correct way to terminate the TLS session without this flag?
>> > > > > > 
>> > > > > > I guess one would need to call gnutls_bye() like in this GnuTLS example:
>> > > > > > https://gitlab.com/gnutls/gnutls/-/blob/2b8c3e4c71ad380bbbffb32e6003b34ecad596e3/doc/examples/ex-client-anon.c#L102
>> > > > > > 
>> > > > > > >      - Why this is only needed by multifd sessions?
>> > > > > > 
>> > > > > > What uncovered the issue was switching the load threads to using
>> > > > > > migrate_set_error() instead of their own result variable
>> > > > > > (load_threads_ret) which you had requested during the previous
>> > > > > > patch set version review:
>> > > > > > https://lore.kernel.org/qemu-devel/Z1DbH5fwBaxtgrvH@x1n/
>> > > > > > 
>> > > > > > Turns out that the multifd receive code always returned
>> > > > > > error in the TLS case, just nothing was previously checking for
>> > > > > > that error presence.
>> > > > > 
>> > > > > What I was curious about is whether this issue also exists for the main migration
>> > > > > channel when with tls, especially when e.g. multifd not enabled at all.  As
>> > > > > I don't see anywhere that qemu uses gnutls_bye() for any tls session.
>> > > > > 
>> > > > > I think it's good to find that we overlooked this before.. and IMHO it's
>> > > > > always good we could fix this.
>> > > > > 
>> > > > > Does it mean we need proper gnutls_bye() somewhere?
>> > > > > 
>> > > > > If we need an explicit gnutls_bye(), then I wonder if that should be done
>> > > > > on the main channel as well.
>> > > > 
>> > > > That's a good question and looking at the code qemu_loadvm_state_main() exits
>> > > > on receiving "QEMU_VM_EOF" section (that's different from receiving socket EOF)
>> > > > and then optionally "QEMU_VM_VMDESCRIPTION" section is read with explicit size
>> > > > in qemu_loadvm_state() - so still not until channel EOF.
>> > > 
>> > > I had a closer look, I do feel like such pre-mature termination is caused
>> > > by explicit shutdown()s of the iochannels, looks like that can cause issue
>> > > even after everything is sent.  Then I noticed indeed multifd sender
>> > > iochannels will get explicit shutdown()s since commit 077fbb5942, while we
>> > > don't do that for the main channel.  Maybe that is a major difference.
>> > > 
>> > > Now I wonder whether we should shutdown() the channel at all if migration
>> > > succeeded, because looks like it can cause tls session to interrupt even if
>> > > the shutdown() is done after sent everything, and if so it'll explain why
>> > > you hit the issue with tls.
>> > > 
>> > > > 
>> > > > Then I can't see anything else reading the channel until it is closed in
>> > > > migration_incoming_state_destroy().
>> > > > 
>> > > > So most likely the main migration channel will never read far enough to
>> > > > reach that GNUTLS_E_PREMATURE_TERMINATION error.
>> > > > 
>> > > > > If we don't need gnutls_bye(), then should we always ignore pre-mature
>> > > > > termination of tls no matter if it's multifd or non-multifd channel (or
>> > > > > even a tls session that is not migration-related)?
>> > > > 
>> > > > So basically have this patch extended to calling
>> > > > qio_channel_tls_set_premature_eof_okay() also on the main migration channel?
>> > > 
>> > > If the above theory stands, then eof-okay could be a workaround papering
>> > > over the real problem that we shouldn't always shutdown()..
>> > > 
>> > > Could you have a look at below patch and see whether it can fix the problem
>> > > you hit too, in place of these two patches (including the previous
>> > > iochannel change)?
>> > > 
>> > 
>> > Unfortunately, the patch below does not fix the problem:
>> > > qemu-system-x86_64: Cannot read from TLS channel: The TLS connection was non-properly terminated.
>> > > qemu-system-x86_64: Cannot read from TLS channel: The TLS connection was non-properly terminated.
>> > 
>> > I think that, even in the absence of shutdown(), if the sender does not
>> > call gnutls_bye() the TLS session is considered improperly terminated.
>> 
>> Ah..
>> 
>> How about one more change on top of above change to disconnect properly for
>> TLS?  Something like gnutls_bye() in qio_channel_tls_close(), would that
>> make sense to you?
>
> Calling gnutls_bye from qio_channel_tls_close is not viable for the
> API contract of qio_channel_close. gnutls_bye needs to be able to
> perform I/O, which means we need to be able to tell the caller
> whether it needs to perform an event loop wait for POLLIN or POLLOUT.
>
> This is the same API design scenario as the gnutls_handshake method.
> As such I tdon't think it is practical to abstract it inside any
> existing QIOChannel API call, it'll have to be standalone like
> qio_channel_tls_handshake() is.
>

I implemented the call to gnutls_bye:
https://gitlab.com/farosas/qemu/-/commits/migration-tls-bye
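
Roughly, the standalone-call design Daniel describes means the caller
drives the bye from the event loop until it no longer needs I/O. The
sketch below only models that control flow; gnutls itself is mocked, and
all names and return-code values are illustrative, not the contents of
the branch above.

```c
#include <assert.h>
#include <poll.h>

/* Mocked return codes standing in for gnutls_bye()'s; real values differ. */
enum { MOCK_E_SUCCESS = 0, MOCK_E_AGAIN = -1 };

/* A mock "bye" that needs one more write before close_notify is out. */
static int pending_writes = 1;

static int mock_tls_bye(void)
{
    if (pending_writes > 0) {
        pending_writes--;
        return MOCK_E_AGAIN;     /* caller must wait for writability and retry */
    }
    return MOCK_E_SUCCESS;
}

/*
 * Standalone termination helper in the spirit of qio_channel_tls_handshake():
 * returns 0 once close_notify is fully sent, or the poll events the caller
 * must wait for before calling again.
 */
static int tls_bye_step(void)
{
    int ret = mock_tls_bye();

    if (ret == MOCK_E_AGAIN) {
        return POLLOUT;          /* re-enter from the event loop on writability */
    }
    return 0;
}
```

This keeps the I/O-wait decision with the caller, which is exactly why it
cannot hide inside qio_channel_close().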

Then while testing it I realised we actually have a regression from 9.2:

1d457daf86 ("migration/multifd: Further remove the SYNC on complete")

It seems that patch somehow affected the ordering between src shutdown
vs. recv shutdown and now the recv channels are staying around to see
the connection being broken. Or something... I'm still looking into it.

>
> With regards,
> Daniel



* Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels
  2025-02-05 19:09                   ` Fabiano Rosas
@ 2025-02-05 20:42                     ` Fabiano Rosas
  2025-02-05 20:55                       ` Maciej S. Szmigiero
  2025-02-05 21:13                       ` Peter Xu
  0 siblings, 2 replies; 137+ messages in thread
From: Fabiano Rosas @ 2025-02-05 20:42 UTC (permalink / raw)
  To: Daniel P. Berrangé, Peter Xu
  Cc: Maciej S. Szmigiero, Alex Williamson, Cédric Le Goater,
	Eric Blake, Markus Armbruster, Avihai Horon, Joao Martins,
	qemu-devel

Fabiano Rosas <farosas@suse.de> writes:

> Daniel P. Berrangé <berrange@redhat.com> writes:
>
>> On Tue, Feb 04, 2025 at 10:31:31AM -0500, Peter Xu wrote:
>>> On Tue, Feb 04, 2025 at 03:39:00PM +0100, Maciej S. Szmigiero wrote:
>>> > On 3.02.2025 23:56, Peter Xu wrote:
>>> > > On Mon, Feb 03, 2025 at 10:41:32PM +0100, Maciej S. Szmigiero wrote:
>>> > > > On 3.02.2025 21:20, Peter Xu wrote:
>>> > > > > On Mon, Feb 03, 2025 at 07:53:00PM +0100, Maciej S. Szmigiero wrote:
>>> > > > > > On 3.02.2025 19:20, Peter Xu wrote:
>>> > > > > > > On Thu, Jan 30, 2025 at 11:08:29AM +0100, Maciej S. Szmigiero wrote:
>>> > > > > > > > From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>> > > > > > > > 
>>> > > > > > > > Multifd send channels are terminated by calling
>>> > > > > > > > qio_channel_shutdown(QIO_CHANNEL_SHUTDOWN_BOTH) in
>>> > > > > > > > multifd_send_terminate_threads(), which in the TLS case essentially
>>> > > > > > > > calls shutdown(SHUT_RDWR) on the underlying raw socket.
>>> > > > > > > > 
>>> > > > > > > > Unfortunately, this does not terminate the TLS session properly and
>>> > > > > > > > the receive side sees this as a GNUTLS_E_PREMATURE_TERMINATION error.
>>> > > > > > > > 
>>> > > > > > > > The only reason why this wasn't causing migration failures is because
>>> > > > > > > > the current migration code apparently does not check for migration
>>> > > > > > > > error being set after the end of the multifd receive process.
>>> > > > > > > > 
>>> > > > > > > > However, this will change soon so the multifd receive code has to be
>>> > > > > > > > prepared to not return an error on such premature TLS session EOF.
>>> > > > > > > > Use the newly introduced QIOChannelTLS method for that.
>>> > > > > > > > 
>>> > > > > > > > It's worth noting that even if the sender were to be changed to terminate
>>> > > > > > > > the TLS connection properly the receive side still needs to remain
>>> > > > > > > > compatible with older QEMU bit stream which does not do this.
>>> > > > > > > 
>>> > > > > > > If this is an existing bug, we could add a Fixes.
>>> > > > > > 
>>> > > > > > It is an existing issue but only uncovered by this patch set.
>>> > > > > > 
>>> > > > > > As far as I can see it was always there, so it would need some
>>> > > > > > thought where to point that Fixes tag.
>>> > > > > 
>>> > > > > If there's no way to trigger a real functional bug anyway, it's also ok we
>>> > > > > omit the Fixes.
>>> > > > > 
>>> > > > > > > Two pure questions..
>>> > > > > > > 
>>> > > > > > >      - What is the correct way to terminate the TLS session without this flag?
>>> > > > > > 
>>> > > > > > I guess one would need to call gnutls_bye() like in this GnuTLS example:
>>> > > > > > https://gitlab.com/gnutls/gnutls/-/blob/2b8c3e4c71ad380bbbffb32e6003b34ecad596e3/doc/examples/ex-client-anon.c#L102
>>> > > > > > 
>>> > > > > > >      - Why this is only needed by multifd sessions?
>>> > > > > > 
>>> > > > > > What uncovered the issue was switching the load threads to using
>>> > > > > > migrate_set_error() instead of their own result variable
>>> > > > > > (load_threads_ret) which you had requested during the previous
>>> > > > > > patch set version review:
>>> > > > > > https://lore.kernel.org/qemu-devel/Z1DbH5fwBaxtgrvH@x1n/
>>> > > > > > 
>>> > > > > > Turns out that the multifd receive code always returned
>>> > > > > > error in the TLS case, just nothing was previously checking for
>>> > > > > > that error presence.
>>> > > > > 
>>> > > > > What I was curious about is whether this issue also exists for the main migration
>>> > > > > channel when with tls, especially when e.g. multifd not enabled at all.  As
>>> > > > > I don't see anywhere that qemu uses gnutls_bye() for any tls session.
>>> > > > > 
>>> > > > > I think it's good to find that we overlooked this before.. and IMHO it's
>>> > > > > always good we could fix this.
>>> > > > > 
>>> > > > > Does it mean we need proper gnutls_bye() somewhere?
>>> > > > > 
>>> > > > > If we need an explicit gnutls_bye(), then I wonder if that should be done
>>> > > > > on the main channel as well.
>>> > > > 
>>> > > > That's a good question and looking at the code qemu_loadvm_state_main() exits
>>> > > > on receiving "QEMU_VM_EOF" section (that's different from receiving socket EOF)
>>> > > > and then optionally "QEMU_VM_VMDESCRIPTION" section is read with explicit size
>>> > > > in qemu_loadvm_state() - so still not until channel EOF.
>>> > > 
>>> > > I had a closer look, I do feel like such pre-mature termination is caused
>>> > > by explicit shutdown()s of the iochannels, looks like that can cause issue
>>> > > even after everything is sent.  Then I noticed indeed multifd sender
>>> > > iochannels will get explicit shutdown()s since commit 077fbb5942, while we
>>> > > don't do that for the main channel.  Maybe that is a major difference.
>>> > > 
>>> > > Now I wonder whether we should shutdown() the channel at all if migration
>>> > > succeeded, because looks like it can cause tls session to interrupt even if
>>> > > the shutdown() is done after sent everything, and if so it'll explain why
>>> > > you hit the issue with tls.
>>> > > 
>>> > > > 
>>> > > > Then I can't see anything else reading the channel until it is closed in
>>> > > > migration_incoming_state_destroy().
>>> > > > 
>>> > > > So most likely the main migration channel will never read far enough to
>>> > > > reach that GNUTLS_E_PREMATURE_TERMINATION error.
>>> > > > 
>>> > > > > If we don't need gnutls_bye(), then should we always ignore pre-mature
>>> > > > > termination of tls no matter if it's multifd or non-multifd channel (or
>>> > > > > even a tls session that is not migration-related)?
>>> > > > 
>>> > > > So basically have this patch extended to calling
>>> > > > qio_channel_tls_set_premature_eof_okay() also on the main migration channel?
>>> > > 
>>> > > If the above theory stands, then eof-okay could be a workaround papering
>>> > > over the real problem that we shouldn't always shutdown()..
>>> > > 
>>> > > Could you have a look at below patch and see whether it can fix the problem
>>> > > you hit too, in place of these two patches (including the previous
>>> > > iochannel change)?
>>> > > 
>>> > 
>>> > Unfortunately, the patch below does not fix the problem:
>>> > > qemu-system-x86_64: Cannot read from TLS channel: The TLS connection was non-properly terminated.
>>> > > qemu-system-x86_64: Cannot read from TLS channel: The TLS connection was non-properly terminated.
>>> > 
>>> > I think that, even in the absence of shutdown(), if the sender does not
>>> > call gnutls_bye() the TLS session is considered improperly terminated.
>>> 
>>> Ah..
>>> 
>>> How about one more change on top of above change to disconnect properly for
>>> TLS?  Something like gnutls_bye() in qio_channel_tls_close(), would that
>>> make sense to you?
>>
>> Calling gnutls_bye from qio_channel_tls_close is not viable for the
>> API contract of qio_channel_close. gnutls_bye needs to be able to
>> perform I/O, which means we need to be able to tell the caller
>> whether it needs to perform an event loop wait for POLLIN or POLLOUT.
>>
>> This is the same API design scenario as the gnutls_handshake method.
>> As such I don't think it is practical to abstract it inside any
>> existing QIOChannel API call, it'll have to be standalone like
>> qio_channel_tls_handshake() is.
>>
>
> I implemented the call to gnutls_bye:
> https://gitlab.com/farosas/qemu/-/commits/migration-tls-bye
>
> Then while testing it I realised we actually have a regression from 9.2:
>
> 1d457daf86 ("migration/multifd: Further remove the SYNC on complete")
>
> It seems that patch somehow affected the ordering between src shutdown
> vs. recv shutdown and now the recv channels are staying around to see
> the connection being broken. Or something... I'm still looking into it.
>

Ok, so the issue is that the recv side would previously be stuck at the
sync semaphore and multifd_recv_terminate_threads() would kick it only
after 'exiting' was set, so no further recv() would happen.

After the patch, there's no final sync anymore, so the recv thread loops
around and waits at the recv() until multifd_send_terminate_threads()
closes the connection.

Waiting on sem_sync as before would lead to a cleaner termination
process IMO, but I don't think it's worth the extra complexity of
introducing a sync to the device state migration.

So I think we'll have to go with one of the approaches suggested on this
thread (gnutls_bye or premature_ok). I'm fine either way, but let's make
sure we add a reference to the patch above and some words explaining the
situation.

(let me know if anyone prefers the gnutls_bye approach I have implemented
and I can send a proper series)

>>
>> With regards,
>> Daniel



* Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels
  2025-02-05 20:42                     ` Fabiano Rosas
@ 2025-02-05 20:55                       ` Maciej S. Szmigiero
  2025-02-06 14:13                         ` Fabiano Rosas
  2025-02-05 21:13                       ` Peter Xu
  1 sibling, 1 reply; 137+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-05 20:55 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: Alex Williamson, Daniel P. Berrangé, Peter Xu,
	Cédric Le Goater, Eric Blake, Markus Armbruster,
	Avihai Horon, Joao Martins, qemu-devel

On 5.02.2025 21:42, Fabiano Rosas wrote:
> Fabiano Rosas <farosas@suse.de> writes:
> 
>> Daniel P. Berrangé <berrange@redhat.com> writes:
>>
>>> On Tue, Feb 04, 2025 at 10:31:31AM -0500, Peter Xu wrote:
>>>> On Tue, Feb 04, 2025 at 03:39:00PM +0100, Maciej S. Szmigiero wrote:
>>>>> On 3.02.2025 23:56, Peter Xu wrote:
>>>>>> On Mon, Feb 03, 2025 at 10:41:32PM +0100, Maciej S. Szmigiero wrote:
>>>>>>> On 3.02.2025 21:20, Peter Xu wrote:
>>>>>>>> On Mon, Feb 03, 2025 at 07:53:00PM +0100, Maciej S. Szmigiero wrote:
>>>>>>>>> On 3.02.2025 19:20, Peter Xu wrote:
>>>>>>>>>> On Thu, Jan 30, 2025 at 11:08:29AM +0100, Maciej S. Szmigiero wrote:
>>>>>>>>>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>>>>>>>>>
>>>>>>>>>>> Multifd send channels are terminated by calling
>>>>>>>>>>> qio_channel_shutdown(QIO_CHANNEL_SHUTDOWN_BOTH) in
>>>>>>>>>>> multifd_send_terminate_threads(), which in the TLS case essentially
>>>>>>>>>>> calls shutdown(SHUT_RDWR) on the underlying raw socket.
>>>>>>>>>>>
>>>>>>>>>>> Unfortunately, this does not terminate the TLS session properly and
>>>>>>>>>>> the receive side sees this as a GNUTLS_E_PREMATURE_TERMINATION error.
>>>>>>>>>>>
>>>>>>>>>>> The only reason why this wasn't causing migration failures is because
>>>>>>>>>>> the current migration code apparently does not check for migration
>>>>>>>>>>> error being set after the end of the multifd receive process.
>>>>>>>>>>>
>>>>>>>>>>> However, this will change soon so the multifd receive code has to be
>>>>>>>>>>> prepared to not return an error on such premature TLS session EOF.
>>>>>>>>>>> Use the newly introduced QIOChannelTLS method for that.
>>>>>>>>>>>
>>>>>>>>>>> It's worth noting that even if the sender were to be changed to terminate
>>>>>>>>>>> the TLS connection properly the receive side still needs to remain
>>>>>>>>>>> compatible with older QEMU bit stream which does not do this.
>>>>>>>>>>
>>>>>>>>>> If this is an existing bug, we could add a Fixes.
>>>>>>>>>
>>>>>>>>> It is an existing issue but only uncovered by this patch set.
>>>>>>>>>
>>>>>>>>> As far as I can see it was always there, so it would need some
>>>>>>>>> thought where to point that Fixes tag.
>>>>>>>>
>>>>>>>> If there's no way to trigger a real functional bug anyway, it's also ok we
>>>>>>>> omit the Fixes.
>>>>>>>>
>>>>>>>>>> Two pure questions..
>>>>>>>>>>
>>>>>>>>>>       - What is the correct way to terminate the TLS session without this flag?
>>>>>>>>>
>>>>>>>>> I guess one would need to call gnutls_bye() like in this GnuTLS example:
>>>>>>>>> https://gitlab.com/gnutls/gnutls/-/blob/2b8c3e4c71ad380bbbffb32e6003b34ecad596e3/doc/examples/ex-client-anon.c#L102
>>>>>>>>>
>>>>>>>>>>       - Why this is only needed by multifd sessions?
>>>>>>>>>
>>>>>>>>> What uncovered the issue was switching the load threads to using
>>>>>>>>> migrate_set_error() instead of their own result variable
>>>>>>>>> (load_threads_ret) which you had requested during the previous
>>>>>>>>> patch set version review:
>>>>>>>>> https://lore.kernel.org/qemu-devel/Z1DbH5fwBaxtgrvH@x1n/
>>>>>>>>>
>>>>>>>>> Turns out that the multifd receive code always returned
>>>>>>>>> error in the TLS case, just nothing was previously checking for
>>>>>>>>> that error presence.
>>>>>>>>
>>>>>>>> What I was curious about is whether this issue also exists for the main migration
>>>>>>>> channel when with tls, especially when e.g. multifd not enabled at all.  As
>>>>>>>> I don't see anywhere that qemu uses gnutls_bye() for any tls session.
>>>>>>>>
>>>>>>>> I think it's good to find that we overlooked this before.. and IMHO it's
>>>>>>>> always good we could fix this.
>>>>>>>>
>>>>>>>> Does it mean we need proper gnutls_bye() somewhere?
>>>>>>>>
>>>>>>>> If we need an explicit gnutls_bye(), then I wonder if that should be done
>>>>>>>> on the main channel as well.
>>>>>>>
>>>>>>> That's a good question and looking at the code qemu_loadvm_state_main() exits
>>>>>>> on receiving "QEMU_VM_EOF" section (that's different from receiving socket EOF)
>>>>>>> and then optionally "QEMU_VM_VMDESCRIPTION" section is read with explicit size
>>>>>>> in qemu_loadvm_state() - so still not until channel EOF.
>>>>>>
>>>>>> I had a closer look, I do feel like such pre-mature termination is caused
>>>>>> by explicit shutdown()s of the iochannels, looks like that can cause issue
>>>>>> even after everything is sent.  Then I noticed indeed multifd sender
>>>>>> iochannels will get explicit shutdown()s since commit 077fbb5942, while we
>>>>>> don't do that for the main channel.  Maybe that is a major difference.
>>>>>>
>>>>>> Now I wonder whether we should shutdown() the channel at all if migration
>>>>>> succeeded, because looks like it can cause tls session to interrupt even if
>>>>>> the shutdown() is done after sent everything, and if so it'll explain why
>>>>>> you hit the issue with tls.
>>>>>>
>>>>>>>
>>>>>>> Then I can't see anything else reading the channel until it is closed in
>>>>>>> migration_incoming_state_destroy().
>>>>>>>
>>>>>>> So most likely the main migration channel will never read far enough to
>>>>>>> reach that GNUTLS_E_PREMATURE_TERMINATION error.
>>>>>>>
>>>>>>>> If we don't need gnutls_bye(), then should we always ignore pre-mature
>>>>>>>> termination of tls no matter if it's multifd or non-multifd channel (or
>>>>>>>> even a tls session that is not migration-related)?
>>>>>>>
>>>>>>> So basically have this patch extended to calling
>>>>>>> qio_channel_tls_set_premature_eof_okay() also on the main migration channel?
>>>>>>
>>>>>> If the above theory stands, then eof-okay could be a workaround papering
>>>>>> over the real problem that we shouldn't always shutdown()..
>>>>>>
>>>>>> Could you have a look at below patch and see whether it can fix the problem
>>>>>> you hit too, in place of these two patches (including the previous
>>>>>> iochannel change)?
>>>>>>
>>>>>
>>>>> Unfortunately, the patch below does not fix the problem:
>>>>>> qemu-system-x86_64: Cannot read from TLS channel: The TLS connection was non-properly terminated.
>>>>>> qemu-system-x86_64: Cannot read from TLS channel: The TLS connection was non-properly terminated.
>>>>>
>>>>> I think that, even in the absence of shutdown(), if the sender does not
>>>>> call gnutls_bye() the TLS session is considered improperly terminated.
>>>>
>>>> Ah..
>>>>
>>>> How about one more change on top of above change to disconnect properly for
>>>> TLS?  Something like gnutls_bye() in qio_channel_tls_close(), would that
>>>> make sense to you?
>>>
>>> Calling gnutls_bye from qio_channel_tls_close is not viable for the
>>> API contract of qio_channel_close. gnutls_bye needs to be able to
>>> perform I/O, which means we need to be able to tell the caller
>>> whether it needs to perform an event loop wait for POLLIN or POLLOUT.
>>>
>>> This is the same API design scenario as the gnutls_handshake method.
>>> As such I don't think it is practical to abstract it inside any
>>> existing QIOChannel API call, it'll have to be standalone like
>>> qio_channel_tls_handshake() is.
>>>
>>
>> I implemented the call to gnutls_bye:
>> https://gitlab.com/farosas/qemu/-/commits/migration-tls-bye
>>
>> Then while testing it I realised we actually have a regression from 9.2:
>>
>> 1d457daf86 ("migration/multifd: Further remove the SYNC on complete")
>>
>> It seems that patch somehow affected the ordering between src shutdown
>> vs. recv shutdown and now the recv channels are staying around to see
>> the connection being broken. Or something... I'm still looking into it.
>>
> 
> Ok, so the issue is that the recv side would previously be stuck at the
> sync semaphore and multifd_recv_terminate_threads() would kick it only
> after 'exiting' was set, so no further recv() would happen.
> 
> After the patch, there's no final sync anymore, so the recv thread loops
> around and waits at the recv() until multifd_send_terminate_threads()
> closes the connection.
> 
> Waiting on sem_sync as before would lead to a cleaner termination
> process IMO, but I don't think it's worth the extra complexity of
> introducing a sync to the device state migration.
> 
> So I think we'll have to go with one of the approaches suggested on this
> thread (gnutls_bye or premature_ok). I'm fine either way, but let's make
> sure we add a reference to the patch above and some words explaining the
> situation.

We still need premature_ok for handling older QEMU versions that do not
terminate the TLS stream correctly, since the TLS test regression happens
even without device state transfer being enabled.

So I think that's what we should use generally.
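
The premature_ok behaviour boils down to mapping one specific TLS error
to plain EOF when the flag is set. A mocked sketch of that mapping (the
struct, function names, and error value below are stand-ins, not the
real QIOChannelTLS code):

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for GNUTLS_E_PREMATURE_TERMINATION; the real value differs. */
#define MOCK_E_PREMATURE_TERMINATION (-110)

typedef struct MockTLSChannel {
    bool premature_eof_okay;     /* set via the new QIOChannelTLS method */
} MockTLSChannel;

/* Map a raw TLS read result to what the channel reports to its caller. */
static int tls_channel_filter_read(MockTLSChannel *c, int raw_ret)
{
    if (raw_ret == MOCK_E_PREMATURE_TERMINATION && c->premature_eof_okay) {
        return 0;                /* treat unclean termination as plain EOF */
    }
    return raw_ret;              /* everything else propagates unchanged */
}
```

Since only this one error is remapped, genuine mid-stream failures still
surface as errors even with the flag enabled.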
  
> (let me know if anyone prefers the gnutls_bye approach I have implemented
> and I can send a proper series)

Thanks,
Maciej




* Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels
  2025-02-05 20:42                     ` Fabiano Rosas
  2025-02-05 20:55                       ` Maciej S. Szmigiero
@ 2025-02-05 21:13                       ` Peter Xu
  2025-02-06 14:19                         ` Fabiano Rosas
  1 sibling, 1 reply; 137+ messages in thread
From: Peter Xu @ 2025-02-05 21:13 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: Daniel P. Berrangé, Maciej S. Szmigiero, Alex Williamson,
	Cédric Le Goater, Eric Blake, Markus Armbruster,
	Avihai Horon, Joao Martins, qemu-devel

On Wed, Feb 05, 2025 at 05:42:37PM -0300, Fabiano Rosas wrote:
> Fabiano Rosas <farosas@suse.de> writes:
> 
> > Daniel P. Berrangé <berrange@redhat.com> writes:
> >
> >> On Tue, Feb 04, 2025 at 10:31:31AM -0500, Peter Xu wrote:
> >>> On Tue, Feb 04, 2025 at 03:39:00PM +0100, Maciej S. Szmigiero wrote:
> >>> > On 3.02.2025 23:56, Peter Xu wrote:
> >>> > > On Mon, Feb 03, 2025 at 10:41:32PM +0100, Maciej S. Szmigiero wrote:
> >>> > > > On 3.02.2025 21:20, Peter Xu wrote:
> >>> > > > > On Mon, Feb 03, 2025 at 07:53:00PM +0100, Maciej S. Szmigiero wrote:
> >>> > > > > > On 3.02.2025 19:20, Peter Xu wrote:
> >>> > > > > > > On Thu, Jan 30, 2025 at 11:08:29AM +0100, Maciej S. Szmigiero wrote:
> >>> > > > > > > > From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> >>> > > > > > > > 
> >>> > > > > > > > Multifd send channels are terminated by calling
> >>> > > > > > > > qio_channel_shutdown(QIO_CHANNEL_SHUTDOWN_BOTH) in
> >>> > > > > > > > multifd_send_terminate_threads(), which in the TLS case essentially
> >>> > > > > > > > calls shutdown(SHUT_RDWR) on the underlying raw socket.
> >>> > > > > > > > 
> >>> > > > > > > > Unfortunately, this does not terminate the TLS session properly and
> >>> > > > > > > > the receive side sees this as a GNUTLS_E_PREMATURE_TERMINATION error.
> >>> > > > > > > > 
> >>> > > > > > > > The only reason why this wasn't causing migration failures is because
> >>> > > > > > > > the current migration code apparently does not check for migration
> >>> > > > > > > > error being set after the end of the multifd receive process.
> >>> > > > > > > > 
> >>> > > > > > > > However, this will change soon so the multifd receive code has to be
> >>> > > > > > > > prepared to not return an error on such premature TLS session EOF.
> >>> > > > > > > > Use the newly introduced QIOChannelTLS method for that.
> >>> > > > > > > > 
> >>> > > > > > > > It's worth noting that even if the sender were to be changed to terminate
> >>> > > > > > > > the TLS connection properly the receive side still needs to remain
> >>> > > > > > > > compatible with older QEMU bit stream which does not do this.
> >>> > > > > > > 
> >>> > > > > > > If this is an existing bug, we could add a Fixes.
> >>> > > > > > 
> >>> > > > > > It is an existing issue but only uncovered by this patch set.
> >>> > > > > > 
> >>> > > > > > As far as I can see it was always there, so it would need some
> >>> > > > > > thought where to point that Fixes tag.
> >>> > > > > 
> >>> > > > > If there's no way to trigger a real functional bug anyway, it's also ok we
> >>> > > > > omit the Fixes.
> >>> > > > > 
> >>> > > > > > > Two pure questions..
> >>> > > > > > > 
> >>> > > > > > >      - What is the correct way to terminate the TLS session without this flag?
> >>> > > > > > 
> >>> > > > > > I guess one would need to call gnutls_bye() like in this GnuTLS example:
> >>> > > > > > https://gitlab.com/gnutls/gnutls/-/blob/2b8c3e4c71ad380bbbffb32e6003b34ecad596e3/doc/examples/ex-client-anon.c#L102
> >>> > > > > > 
> >>> > > > > > >      - Why this is only needed by multifd sessions?
> >>> > > > > > 
> >>> > > > > > What uncovered the issue was switching the load threads to using
> >>> > > > > > migrate_set_error() instead of their own result variable
> >>> > > > > > (load_threads_ret) which you had requested during the previous
> >>> > > > > > patch set version review:
> >>> > > > > > https://lore.kernel.org/qemu-devel/Z1DbH5fwBaxtgrvH@x1n/
> >>> > > > > > 
> >>> > > > > > Turns out that the multifd receive code always returned
> >>> > > > > > error in the TLS case, just nothing was previously checking for
> >>> > > > > > that error presence.
> >>> > > > > 
> >>> > > > > What I was curious about is whether this issue also exists for the main migration
> >>> > > > > channel when with tls, especially when e.g. multifd not enabled at all.  As
> >>> > > > > I don't see anywhere that qemu uses gnutls_bye() for any tls session.
> >>> > > > > 
> >>> > > > > I think it's good to find that we overlooked this before.. and IMHO it's
> >>> > > > > always good we could fix this.
> >>> > > > > 
> >>> > > > > Does it mean we need proper gnutls_bye() somewhere?
> >>> > > > > 
> >>> > > > > If we need an explicit gnutls_bye(), then I wonder if that should be done
> >>> > > > > on the main channel as well.
> >>> > > > 
> >>> > > > That's a good question and looking at the code qemu_loadvm_state_main() exits
> >>> > > > on receiving "QEMU_VM_EOF" section (that's different from receiving socket EOF)
> >>> > > > and then optionally "QEMU_VM_VMDESCRIPTION" section is read with explicit size
> >>> > > > in qemu_loadvm_state() - so still not until channel EOF.
> >>> > > 
> >>> > > I had a closer look, I do feel like such pre-mature termination is caused
> >>> > > by explicit shutdown()s of the iochannels, looks like that can cause issue
> >>> > > even after everything is sent.  Then I noticed indeed multifd sender
> >>> > > iochannels will get explicit shutdown()s since commit 077fbb5942, while we
> >>> > > don't do that for the main channel.  Maybe that is a major difference.
> >>> > > 
> >>> > > Now I wonder whether we should shutdown() the channel at all if migration
> >>> > > succeeded, because looks like it can cause tls session to interrupt even if
> >>> > > the shutdown() is done after everything was sent, and if so it'll explain why
> >>> > > you hit the issue with tls.
> >>> > > 
> >>> > > > 
> >>> > > > Then I can't see anything else reading the channel until it is closed in
> >>> > > > migration_incoming_state_destroy().
> >>> > > > 
> >>> > > > So most likely the main migration channel will never read far enough to
> >>> > > > reach that GNUTLS_E_PREMATURE_TERMINATION error.
> >>> > > > 
> >>> > > > > If we don't need gnutls_bye(), then should we always ignore pre-mature
> >>> > > > > termination of tls no matter if it's multifd or non-multifd channel (or
> >>> > > > > even a tls session that is not migration-related)?
> >>> > > > 
> >>> > > > So basically have this patch extended to calling
> >>> > > > qio_channel_tls_set_premature_eof_okay() also on the main migration channel?
> >>> > > 
> >>> > > If above theory can stand, then eof-okay could be a workaround papering
> >>> > > over the real problem that we shouldn't always shutdown()..
> >>> > > 
> >>> > > Could you have a look at below patch and see whether it can fix the problem
> >>> > > you hit too, in replace of these two patches (including the previous
> >>> > > iochannel change)?
> >>> > > 
> >>> > 
> >>> > Unfortunately, the patch below does not fix the problem:
> >>> > > qemu-system-x86_64: Cannot read from TLS channel: The TLS connection was non-properly terminated.
> >>> > > qemu-system-x86_64: Cannot read from TLS channel: The TLS connection was non-properly terminated.
> >>> > 
> >>> > I think that, even in the absence of shutdown(), if the sender does not
> >>> > call gnutls_bye() the TLS session is considered improperly terminated.
> >>> 
> >>> Ah..
> >>> 
> >>> How about one more change on top of above change to disconnect properly for
> >>> TLS?  Something like gnutls_bye() in qio_channel_tls_close(), would that
> >>> make sense to you?
> >>
> >> Calling gnutls_bye from qio_channel_tls_close is not viable for the
> >> API contract of qio_channel_close. gnutls_bye needs to be able to
> >> perform I/O, which means we need to be able to tell the caller
> >> whether it needs to perform an event loop wait for POLLIN or POLLOUT.
> >>
> >> This is the same API design scenario as the gnutls_handshake method.
> >> As such I don't think it is practical to abstract it inside any
> >> existing QIOChannel API call, it'll have to be standalone like
> >> qio_channel_tls_handshake() is.
> >>
> >
> > I implemented the call to gnutls_bye:
> > https://gitlab.com/farosas/qemu/-/commits/migration-tls-bye
> >
> > Then while testing it I realised we actually have a regression from 9.2:
> >
> > 1d457daf86 ("migration/multifd: Further remove the SYNC on complete")
> >
> > It seems that patch somehow affected the ordering between src shutdown
> > vs. recv shutdown and now the recv channels are staying around to see
> > the connection being broken. Or something... I'm still looking into it.
> >
> 
> Ok, so the issue is that the recv side would previously be stuck at the
> sync semaphore and multifd_recv_terminate_threads() would kick it only
> after 'exiting' was set, so no further recv() would happen.
> 
> After the patch, there's no final sync anymore, so the recv thread loops
> around and waits at the recv() until multifd_send_terminate_threads()
> closes the connection.
> 
> Waiting on sem_sync as before would lead to a cleaner termination
> process IMO, but I don't think it's worth the extra complexity of
> introducing a sync to the device state migration.
> 
> So I think we'll have to go with one of the approaches suggested on this
> thread (gnutls_bye or premature_ok). I'm fine either way, but let's make
> sure we add a reference to the patch above and some words explaining the
> situation.
> 
> (let me know if anyone prefers the gnutls_bye approach I have implemented
> and I can send a proper series)

Good to know the progress.

If providing a formal patch doesn't take a lot of time, IMO you
should go for it at least as an RFC; an RFC is less likely to be completely
forgotten than a thread discussion in any case.

Migration is not the only one using tls channels, so even if migration can
avoid depending on it, I wonder if gnutls_bye is a must if we want to make
sure QEMU is free from pre-mature termination attacks on other users.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH v4 19/33] migration: Add save_live_complete_precopy_thread handler
  2025-02-05 15:55           ` Peter Xu
@ 2025-02-06 11:41             ` Maciej S. Szmigiero
  2025-02-06 22:16               ` Peter Xu
  0 siblings, 1 reply; 137+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-06 11:41 UTC (permalink / raw)
  To: Peter Xu
  Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P. Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

On 5.02.2025 16:55, Peter Xu wrote:
> On Wed, Feb 05, 2025 at 12:53:21PM +0100, Maciej S. Szmigiero wrote:
>> On 4.02.2025 21:34, Peter Xu wrote:
>>> On Tue, Feb 04, 2025 at 08:32:15PM +0100, Maciej S. Szmigiero wrote:
>>>> On 4.02.2025 18:54, Peter Xu wrote:
>>>>> On Thu, Jan 30, 2025 at 11:08:40AM +0100, Maciej S. Szmigiero wrote:
>>>>>> +static int multifd_device_state_save_thread(void *opaque)
>>>>>> +{
>>>>>> +    struct MultiFDDSSaveThreadData *data = opaque;
>>>>>> +    int ret;
>>>>>> +
>>>>>> +    ret = data->hdlr(data->idstr, data->instance_id, &send_threads_abort,
>>>>>> +                     data->handler_opaque);
>>>>>
>>>>> I thought we discussed somewhere and the plan was we could use Error** here
>>>>> to report errors.  Would that still make sense, or maybe I lost some
>>>>> context?
>>>>
>>>> That was about *load* threads, here these are *save* threads.
>>>
>>> Ah OK.
>>>
>>>>
>>>> Save handlers do not return an Error value, neither save_live_iterate, nor
>>>> save_live_complete_precopy or save_state does so.
>>>
>>> Let's try to make new APIs work with Error* if possible.
>>
>> Let's assume that these threads return an Error object.
>>
>> What's qemu_savevm_state_complete_precopy_iterable() supposed to do with it?
> 
> IIUC it's not about qemu_savevm_state_complete_precopy_iterable() in this
> context, as the Error* will be used in one of the thread of the pool, not
> migration thread.
> 
> The goal is to be able to set Error* with migrate_set_error(), so that when
> migration failed, query-migrate can return the error to libvirt, so
> migration always tries to remember the 1st error hit if ever possible.
>
> It's multifd_device_state_save_thread() to do migrate_set_error(), not in
> migration thread.  qemu_savevm_state_complete_*() are indeed not ready to
> pass Errors, but it's not in the discussed stack.

I understand what you are proposing now - you haven't written about using
migrate_set_error() for save threads earlier, just about returning an Error
object.

While this might work, it has a tendency to uncover errors in other parts of
the migration core - much as using it in the load threads case uncovered
the TLS session error.

(Speaking of which, could you please respond to the issue at the bottom of
this message from 2 days ago?:
https://lore.kernel.org/qemu-devel/150a9741-daab-4724-add0-f35257e862f9@maciej.szmigiero.name/
It is blocking rework of the TLS session EOF handling in this patch set.
Thanks.)

But I can try this migrate_set_error() approach here and see if something
breaks.

(..)
>>>>
>>>>> Meanwhile, I still feel uneasy on having these globals (send_threads_abort,
>>>>> send_threads_ret).  Can we make MultiFDDSSaveThreadData the only interface
>>>>> between migration and the threads impl?  So I wonder if it can be:
>>>>>
>>>>>      ret = data->hdlr(data);
>>>>>
>>>>> With extended struct like this (I added thread_error and thread_quit):
>>>>>
>>>>> struct MultiFDDSSaveThreadData {
>>>>>        SaveLiveCompletePrecopyThreadHandler hdlr;
>>>>>        char *idstr;
>>>>>        uint32_t instance_id;
>>>>>        void *handler_opaque;
>>>>>        /*
>>>>>         * Should be NULL when struct passed over to thread, the thread should
>>>>>         * set this if the handler would return false.  It must be kept NULL if
>>>>>         * the handler returned true / success.
>>>>>         */
>>>>>        Error *thread_error;
>>>>
>>>> As I mentioned above, these handlers do not generally return Error type,
>>>> so this would need to be an *int;
>>>>
>>>>>        /*
>>>>>         * Migration core would set this when it wants to notify thread to
>>>>>         * quit, for example, when an error occurred in other threads, or migration is
>>>>>         * cancelled by the user.
>>>>>         */
>>>>>        bool thread_quit;
>>>>
>>>>               ^ I guess that was supposed to be a pointer too (*thread_quit).
>>>
>>> It's my intention to make this bool, to make everything managed per-thread.
>>
>> But that's unnecessary since this flag is common to all these threads.
> 
> One bool would be enough, but you'll need to export another API for VFIO to
> use otherwise.  I suppose that's ok too.
> 
> Some context of multifd threads and how that's done there..
> 
> We started with one "quit" per thread struct, but then we switched to one
> bool exactly as you said, see commit 15f3f21d598148.
> 
> If you want to stick with one bool, it's okay too, you can export something
> similar in misc.h, e.g. multifd_device_state_save_thread_quitting(), then
> we can avoid passing in the "quit" either as handler parameter, or
> per-thread flag.

Of course I can "export" this flag via a getter function rather than passing
it as a parameter to SaveLiveCompletePrecopyThreadHandler.

>>
>>> It's actually what we do with multifd, these are a bunch of extra threads
>>> to differentiate from the "IO threads" / "multifd threads".
>>>
>>>>
>>>>> };
>>>>>
>>>>> Then if any multifd_device_state_save_thread() failed, for example, it
>>>>> should notify all threads to quit by setting thread_quit, instead of
>>>>> relying on yet another global variable to show migration needs to quit.
>>>>
>>>> multifd_abort_device_state_save_threads() needs to access
>>>> send_threads_abort too.
>>>
>>> This may need to become something like:
>>>
>>>     QLIST_FOREACH() {
>>>         MultiFDDSSaveThreadData *data = ...;
>>>         data->thread_quit = true;
>>>     }
>>
>> At the most basic level that's turning O(1) operation into O(n).
>>
>> Besides, it creates the question of who now owns these MultiFDDSSaveThreadData
>> structures - they could be owned by either thread pool or the
>> multifd_device_state code.
> 
> I think it should be owned by migration, and with this idea it will need to
> be there until waiting thread pool completing their works, so migration
> core needs to free them.
> 
>>
>> Currently the ownership is simple - the multifd_device_state code
>> allocates such per-thread structure in multifd_spawn_device_state_save_thread()
>> and immediately passes its ownership to the thread pool which
>> takes care to free it once it no longer needs it.
> 
> Right, this is another reason why I think having migration owning these
> structs is better.  We used to have dangling task issues when we shifted
> ownership of something to the mainloop and then lost track of it (e.g. on TLS
> handshake gsources).  Those are pretty hard to debug when hung, because
> migration core has nothing left to link to the hung tasks anymore.
> 
> I think we should start from having migration core being able to reach
> these thread-based tasks when needed.  Migration also has control of the
> thread pool, so it would be easier.  The thread pool is so far simple, so we
> may still need to be able to reference per-task info separately.

These are separate threads, so they are pretty easy to identify
in a debugger or a core dump.

Also, one can access them via the thread pool pointer if absolutely
necessary.

If QMP introspection ever becomes necessary then it could be simply
built into the generic thread pool itself.
Then all thread pool consumers will benefit from it.

>>
>> Now, with the list implementation if the thread pool were to free
>> that MultiFDDSSaveThreadData it would also need to release it from
>> the list.
>>
>> Which in turn would need appropriate locking around this removal
>> operation and probably also each time the list is iterated over.
>>
>> On the other hand if the multifd_device_state code were to own
>> that MultiFDDSSaveThreadData then it would linger around until
>> multifd_device_state_send_cleanup() cleans it up even though its
>> associated thread might be long gone.
> 
> Do you see a problem with it?  It sounds good to me actually.. and pretty
> easy to understand.
> 
> So migration creates these MultiFDDSSaveThreadData, then creates threads to
> enqueue them, then waits for all threads to complete, then frees these
> structs.

One of the benefits of using a thread pool is that it can abstract
memory management away by taking ownership of the data pointed to by
the passed thread opaque pointer (via the passed GDestroyNotify).

I don't see a benefit of re-implementing this also in the migration
code (returning an Error object does *not* require such an approach).

>>
>>> We may want to double check qmp 'migrate_cancel' will work when save
>>> threads are running, but this can also be done for later.
>>
>>>>
>>>> And multifd_join_device_state_save_threads() needs to access
>>>> send_threads_ret.
>>>
>>> Then this one becomes:
>>>
>>>     thread_pool_wait(send_threads);
>>>     QLIST_FOREACH() {
>>>         MultiFDDSSaveThreadData *data = ...;
>>>         if (data->thread_error) {
>>>            return false;
>>>         }
>>>     }
>>>     return true;
>>
>> Same here, having a common error return would save us from having
>> to iterate over a list (or having a list in the first place).
> 
> IMHO perf isn't an issue here. It's a slow path, the thread count is small,
> and the loop is cheap.  I prefer prioritizing cleanliness in this case.
> 
> Otherwise any suggestion we could report an Error* in the threads?

Using Error doesn't need a list, load threads return an Error object
just fine without it:
>     if (!data->function(data->opaque, &mis->load_threads_abort, &local_err)) {
>         MigrationState *s = migrate_get_current();
> 
>         assert(local_err);
> 
>         /*
>          * In case of multiple load threads failing, which thread's error
>          * we end up setting is purely arbitrary.
>          */
>         migrate_set_error(s, local_err);
>     }
> 

Same can be done for save threads here (with the caveat of migrate_set_error()
uncovering possible other errors that I mentioned earlier).

>>
>>>>
>>>> These variables ultimately will have to be stored somewhere since
>>>> there can be multiple save threads and so multiple instances of
>>>> MultiFDDSSaveThreadData.
>>>>
>>>> So these need to be stored somewhere where
>>>> multifd_spawn_device_state_save_thread() can reach them to assign
>>>> their addresses to MultiFDDSSaveThreadData members.
>>>
>>> Then multifd_spawn_device_state_save_thread() will need to manage the
>>> qlist, making sure migration core remembers what jobs it submitted.  It
>>> sounds good to have that bookkeeping when I think about it, instead of
>>> throwing the job to the thread pool and forgetting it..
>>
>> It's not "forgetting" about the job but rather letting thread pool
>> manage it - I think thread pool was introduced so these details
>> (thread management) are abstracted from the migration code.
>> Now they would be effectively duplicated in the migration code.
> 
> Migration is still managing those as long as you have send_threads_abort,
> isn't it?  The thread pool doesn't yet have an API to say "let's quit all
> the tasks", otherwise I'm OK too to use the pool API instead of having
> thread_quit.

The migration code does not manage each thread separately.

It manages them as a pool, and does each operation (wait, abort)
on the pool itself (either literally via ThreadPool or by setting
a variable that's shared by all threads).

>>
>>>>
>>>> However, at that point multifd_device_state_save_thread() can
>>>> access them too so it does not need to have them passed via
>>>> MultiFDDSSaveThreadData.
>>>>
>>>> However, nothing prevents putting send_threads* variables
>>>> into a global struct (with internal linkage - "static", just as
>>>> these separate ones are) if you like such construct more.
>>>
>>> This should be better than the current global vars indeed, but less
>>> favoured if the per-thread way could work above.
>>
>> You still need that list to be a global variable,
>> so it's the same amount of global variables as just putting
>> the existing variables in a struct (which could be even allocated
>> in multifd_device_state_send_setup() and deallocated in
>> multifd_device_state_send_cleanup() for extra memory savings).
> 
> Yes this works for me.
> 
> I think you got me wrong on "not allowing to introduce global variables".
> I'm OK with it, but please still consider..
> 
>    - Put it under some existing global object rather than having separate
>      global variables all over the places..
> 
>    - Having Error reports

Ok.

> And I still think we can change:
> 
> typedef int (*SaveLiveCompletePrecopyThreadHandler)(char *idstr,
>                                                      uint32_t instance_id,
>                                                      bool *abort_flag,
>                                                      void *opaque);
> 
> To:
> 
> typedef int (*SaveLiveCompletePrecopyThreadHandler)(MultiFDDSSaveThreadData*);
> 
> No matter what.

We can do that, although this requires "exporting" the MultiFDDSSaveThreadData
type.

>>
>> These variables are having internal linkage limited to (relatively
>> small) multifd-device-state.c, so it's not like they are polluting
>> namespace in some major migration translation unit.
> 
> If someone proposes to introduce 100 global vars in multifd-device-state.c,
> I'll strongly stop that.
> 
> If it's one global var, I'm OK.
> 
> What if it's 5?
> 
> ===8<===
> static QemuMutex queue_job_mutex;
> 
> static ThreadPool *send_threads;
> static int send_threads_ret;
> static bool send_threads_abort;
> 
> static MultiFDSendData *device_state_send;
> ===8<===
> 
> I think I should start calling a stop.  That's what happened..
> 
> Please consider introducing something like multifd_send_device_state so we
> can avoid anyone in the future randomly add static global vars.

As I wrote before, I will pack it all into one global variable,
could be called multifd_send_device_state as you suggest.

>>
>> Taking into consideration having to manage an extra data structure
>> (list), needing more code to do so, having worse algorithms I don't
>> really see a point of using that list.
>>
>> (This is orthogonal to whether the thread return type is changed to
>> Error which could be easily done on the existing save threads pool
>> implementation).
> 
> My bet is changing to a list is as easy (10-20 LOC?).  If not, I can try to
> provide the diff on top of your patch.
> 
> I'm also not strictly asking for a list, but anything that makes the API
> cleaner (less globals, better error reports, etc.).

I just think introducing that list is a step back for the reasons I described
above.

And it's not actually necessary for returning an Error object.

> Thanks,
> 

Thanks,
Maciej




* Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels
  2025-02-05 20:55                       ` Maciej S. Szmigiero
@ 2025-02-06 14:13                         ` Fabiano Rosas
  2025-02-06 14:53                           ` Maciej S. Szmigiero
  0 siblings, 1 reply; 137+ messages in thread
From: Fabiano Rosas @ 2025-02-06 14:13 UTC (permalink / raw)
  To: Maciej S. Szmigiero
  Cc: Alex Williamson, Daniel P. Berrangé, Peter Xu,
	Cédric Le Goater, Eric Blake, Markus Armbruster,
	Avihai Horon, Joao Martins, qemu-devel

"Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:

> On 5.02.2025 21:42, Fabiano Rosas wrote:
>> Fabiano Rosas <farosas@suse.de> writes:
>> 
>>> Daniel P. Berrangé <berrange@redhat.com> writes:
>>>
>>>> On Tue, Feb 04, 2025 at 10:31:31AM -0500, Peter Xu wrote:
>>>>> On Tue, Feb 04, 2025 at 03:39:00PM +0100, Maciej S. Szmigiero wrote:
>>>>>> On 3.02.2025 23:56, Peter Xu wrote:
>>>>>>> On Mon, Feb 03, 2025 at 10:41:32PM +0100, Maciej S. Szmigiero wrote:
>>>>>>>> On 3.02.2025 21:20, Peter Xu wrote:
>>>>>>>>> On Mon, Feb 03, 2025 at 07:53:00PM +0100, Maciej S. Szmigiero wrote:
>>>>>>>>>> On 3.02.2025 19:20, Peter Xu wrote:
>>>>>>>>>>> On Thu, Jan 30, 2025 at 11:08:29AM +0100, Maciej S. Szmigiero wrote:
>>>>>>>>>>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>>>>>>>>>>
>>>>>>>>>>>> Multifd send channels are terminated by calling
>>>>>>>>>>>> qio_channel_shutdown(QIO_CHANNEL_SHUTDOWN_BOTH) in
>>>>>>>>>>>> multifd_send_terminate_threads(), which in the TLS case essentially
>>>>>>>>>>>> calls shutdown(SHUT_RDWR) on the underlying raw socket.
>>>>>>>>>>>>
>>>>>>>>>>>> Unfortunately, this does not terminate the TLS session properly and
>>>>>>>>>>>> the receive side sees this as a GNUTLS_E_PREMATURE_TERMINATION error.
>>>>>>>>>>>>
>>>>>>>>>>>> The only reason why this wasn't causing migration failures is because
>>>>>>>>>>>> the current migration code apparently does not check for migration
>>>>>>>>>>>> error being set after the end of the multifd receive process.
>>>>>>>>>>>>
>>>>>>>>>>>> However, this will change soon so the multifd receive code has to be
>>>>>>>>>>>> prepared to not return an error on such premature TLS session EOF.
>>>>>>>>>>>> Use the newly introduced QIOChannelTLS method for that.
>>>>>>>>>>>>
>>>>>>>>>>>> It's worth noting that even if the sender were to be changed to terminate
>>>>>>>>>>>> the TLS connection properly, the receive side still needs to remain
>>>>>>>>>>>> compatible with older QEMU bit stream which does not do this.
>>>>>>>>>>>
>>>>>>>>>>> If this is an existing bug, we could add a Fixes.
>>>>>>>>>>
>>>>>>>>>> It is an existing issue but only uncovered by this patch set.
>>>>>>>>>>
>>>>>>>>>> As far as I can see it was always there, so it would need some
>>>>>>>>>> thought where to point that Fixes tag.
>>>>>>>>>
>>>>>>>>> If there's no way to trigger a real functional bug anyway, it's also ok we
>>>>>>>>> omit the Fixes.
>>>>>>>>>
>>>>>>>>>>> Two pure questions..
>>>>>>>>>>>
>>>>>>>>>>>       - What is the correct way to terminate the TLS session without this flag?
>>>>>>>>>>
>>>>>>>>>> I guess one would need to call gnutls_bye() like in this GnuTLS example:
>>>>>>>>>> https://gitlab.com/gnutls/gnutls/-/blob/2b8c3e4c71ad380bbbffb32e6003b34ecad596e3/doc/examples/ex-client-anon.c#L102
>>>>>>>>>>
>>>>>>>>>>>       - Why this is only needed by multifd sessions?
>>>>>>>>>>
>>>>>>>>>> What uncovered the issue was switching the load threads to using
>>>>>>>>>> migrate_set_error() instead of their own result variable
>>>>>>>>>> (load_threads_ret) which you had requested during the previous
>>>>>>>>>> patch set version review:
>>>>>>>>>> https://lore.kernel.org/qemu-devel/Z1DbH5fwBaxtgrvH@x1n/
>>>>>>>>>>
>>>>>>>>>> Turns out that the multifd receive code always returned
>>>>>>>>>> error in the TLS case, just nothing was previously checking for
>>>>>>>>>> that error presence.
>>>>>>>>>
>>>>>>>>> What I was curious about is whether this issue also exists for the main migration
>>>>>>>>> channel when with tls, especially when e.g. multifd not enabled at all.  As
>>>>>>>>> I don't see anywhere that qemu uses gnutls_bye() for any tls session.
>>>>>>>>>
>>>>>>>>> I think it's good to find that we overlooked this before.. and IMHO it's
>>>>>>>>> always good we could fix this.
>>>>>>>>>
>>>>>>>>> Does it mean we need proper gnutls_bye() somewhere?
>>>>>>>>>
>>>>>>>>> If we need an explicit gnutls_bye(), then I wonder if that should be done
>>>>>>>>> on the main channel as well.
>>>>>>>>
>>>>>>>> That's a good question and looking at the code qemu_loadvm_state_main() exits
>>>>>>>> on receiving "QEMU_VM_EOF" section (that's different from receiving socket EOF)
>>>>>>>> and then optionally "QEMU_VM_VMDESCRIPTION" section is read with explicit size
>>>>>>>> in qemu_loadvm_state() - so still not until channel EOF.
>>>>>>>
>>>>>>> I had a closer look, I do feel like such pre-mature termination is caused
>>>>>>> by explicit shutdown()s of the iochannels, looks like that can cause issue
>>>>>>> even after everything is sent.  Then I noticed indeed multifd sender
>>>>>>> iochannels will get explicit shutdown()s since commit 077fbb5942, while we
>>>>>>> don't do that for the main channel.  Maybe that is a major difference.
>>>>>>>
>>>>>>> Now I wonder whether we should shutdown() the channel at all if migration
>>>>>>> succeeded, because looks like it can cause tls session to interrupt even if
>>>>>>> the shutdown() is done after everything was sent, and if so it'll explain why
>>>>>>> you hit the issue with tls.
>>>>>>>
>>>>>>>>
>>>>>>>> Then I can't see anything else reading the channel until it is closed in
>>>>>>>> migration_incoming_state_destroy().
>>>>>>>>
>>>>>>>> So most likely the main migration channel will never read far enough to
>>>>>>>> reach that GNUTLS_E_PREMATURE_TERMINATION error.
>>>>>>>>
>>>>>>>>> If we don't need gnutls_bye(), then should we always ignore pre-mature
>>>>>>>>> termination of tls no matter if it's multifd or non-multifd channel (or
>>>>>>>>> even a tls session that is not migration-related)?
>>>>>>>>
>>>>>>>> So basically have this patch extended to calling
>>>>>>>> qio_channel_tls_set_premature_eof_okay() also on the main migration channel?
>>>>>>>
>>>>>>> If above theory can stand, then eof-okay could be a workaround papering
>>>>>>> over the real problem that we shouldn't always shutdown()..
>>>>>>>
>>>>>>> Could you have a look at below patch and see whether it can fix the problem
>>>>>>> you hit too, in replace of these two patches (including the previous
>>>>>>> iochannel change)?
>>>>>>>
>>>>>>
>>>>>> Unfortunately, the patch below does not fix the problem:
>>>>>>> qemu-system-x86_64: Cannot read from TLS channel: The TLS connection was non-properly terminated.
>>>>>>> qemu-system-x86_64: Cannot read from TLS channel: The TLS connection was non-properly terminated.
>>>>>>
>>>>>> I think that, even in the absence of shutdown(), if the sender does not
>>>>>> call gnutls_bye() the TLS session is considered improperly terminated.
>>>>>
>>>>> Ah..
>>>>>
>>>>> How about one more change on top of above change to disconnect properly for
>>>>> TLS?  Something like gnutls_bye() in qio_channel_tls_close(), would that
>>>>> make sense to you?
>>>>
>>>> Calling gnutls_bye from qio_channel_tls_close is not viable for the
>>>> API contract of qio_channel_close. gnutls_bye needs to be able to
>>>> perform I/O, which means we need to be able to tell the caller
>>>> whether it needs to perform an event loop wait for POLLIN or POLLOUT.
>>>>
>>>> This is the same API design scenario as the gnutls_handshake method.
>>>> As such I don't think it is practical to abstract it inside any
>>>> existing QIOChannel API call, it'll have to be standalone like
>>>> qio_channel_tls_handshake() is.
>>>>
>>>
>>> I implemented the call to gnutls_bye:
>>> https://gitlab.com/farosas/qemu/-/commits/migration-tls-bye
>>>
>>> Then while testing it I realised we actually have a regression from 9.2:
>>>
>>> 1d457daf86 ("migration/multifd: Further remove the SYNC on complete")
>>>
>>> It seems that patch somehow affected the ordering between src shutdown
>>> vs. recv shutdown and now the recv channels are staying around to see
>>> the connection being broken. Or something... I'm still looking into it.
>>>
>> 
>> Ok, so the issue is that the recv side would previously be stuck at the
>> sync semaphore and multifd_recv_terminate_threads() would kick it only
>> after 'exiting' was set, so no further recv() would happen.
>> 
>> After the patch, there's no final sync anymore, so the recv thread loops
>> around and waits at the recv() until multifd_send_terminate_threads()
>> closes the connection.
>> 
>> Waiting on sem_sync as before would lead to a cleaner termination
>> process IMO, but I don't think it's worth the extra complexity of
>> introducing a sync to the device state migration.
>> 
>> So I think we'll have to go with one of the approaches suggested on this
>> thread (gnutls_bye or premature_ok). I'm fine either way, but let's make
>> sure we add a reference to the patch above and some words explaining the
>> situation.
>
> We still need premature_ok for handling older QEMU versions that do not
> terminate the TLS stream correctly, since the TLS test regression happens
> even without device state transfer being enabled.

What exactly is the impact of this issue to the device state series?
From the cover letter:

  * qemu_loadvm_load_thread_pool now reports error via migrate_set_error()
  instead of dedicated load_threads_ret variable.

  * Since the change above uncovered an issue with respect to multifd send
  channels not terminating TLS session properly QIOChannelTLS now allows
  gracefully handling this situation.

I understand qemu_loadvm_load_thread_pool() is attempting to use
migrate_set_error() but an error is already set by the recv_thread. Is
that the issue? I wonder if we could somehow isolate this so it doesn't
impact this series.
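For background: migrate_set_error() records only the first error set on the
migration state and silently drops later ones, which is why the recv_thread's
TLS error can mask the load-thread result. A minimal Python sketch of that
first-error-wins behavior (class and method names are illustrative, not QEMU's
actual API):

```python
import threading

class MigrationState:
    """Toy stand-in for QEMU's MigrationState error tracking."""
    def __init__(self):
        self._lock = threading.Lock()
        self._error = None

    def set_error(self, err):
        # Like migrate_set_error(): keep only the first error reported.
        with self._lock:
            if self._error is None:
                self._error = err

    def error(self):
        with self._lock:
            return self._error

s = MigrationState()
s.set_error(OSError("TLS connection was non-properly terminated"))
s.set_error(RuntimeError("load thread failed"))  # dropped: error already set
print(s.error())  # the earlier TLS error, not the load-thread one
```

So any later check of the migration error sees whichever error arrived first,
here the spurious TLS one rather than the load-thread status.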

>
> So I think that's what we should use generally.
>   

For premature_ok, we need to make sure it will not hang QEMU if the
connection gets unexpectedly closed. The current code checks for
shutdown() having already happened, which is fine because it means we're
already on the way out. However, if any ol' recv() can now ignore a
premature termination error, then the recv_thread will not trigger
cleanup of the multifd_recv threads.
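The cleanup path being discussed hinges on the recv() side seeing the
connection close. That interaction is easy to demonstrate with a plain
(non-TLS) socket pair in Python: a blocked recv() returns EOF once the sender
calls shutdown(), which is what lets a multifd-style recv thread break out of
its loop (the thread body here is only a sketch of that pattern, not QEMU
code):

```python
import socket
import threading

# Simulate a recv thread blocked in recv() until the sender side
# shuts the connection down.
send_sock, recv_sock = socket.socketpair()
results = []

def recv_thread():
    while True:
        data = recv_sock.recv(4096)
        if not data:                 # EOF: the peer shut the connection down
            results.append("eof")
            break
        results.append(data)

t = threading.Thread(target=recv_thread)
t.start()

send_sock.sendall(b"payload")
# Roughly what qio_channel_shutdown(QIO_CHANNEL_SHUTDOWN_BOTH) does to a
# raw socket: buffered data is still delivered, then recv() sees EOF.
send_sock.shutdown(socket.SHUT_RDWR)
t.join(timeout=5)
print(results)  # data first, then "eof" once shutdown() unblocks recv()
```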

>> (let me know if anyone prefers the gnutls_bye approach I have implemented
>> and I can send a proper series)
>
> Thanks,
> Maciej


^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels
  2025-02-05 21:13                       ` Peter Xu
@ 2025-02-06 14:19                         ` Fabiano Rosas
  0 siblings, 0 replies; 137+ messages in thread
From: Fabiano Rosas @ 2025-02-06 14:19 UTC (permalink / raw)
  To: Peter Xu
  Cc: Daniel P. Berrangé, Maciej S. Szmigiero, Alex Williamson,
	Cédric Le Goater, Eric Blake, Markus Armbruster,
	Avihai Horon, Joao Martins, qemu-devel

Peter Xu <peterx@redhat.com> writes:

> On Wed, Feb 05, 2025 at 05:42:37PM -0300, Fabiano Rosas wrote:
>> Fabiano Rosas <farosas@suse.de> writes:
>> 
>> > Daniel P. Berrangé <berrange@redhat.com> writes:
>> >
>> >> On Tue, Feb 04, 2025 at 10:31:31AM -0500, Peter Xu wrote:
>> >>> On Tue, Feb 04, 2025 at 03:39:00PM +0100, Maciej S. Szmigiero wrote:
>> >>> > On 3.02.2025 23:56, Peter Xu wrote:
>> >>> > > On Mon, Feb 03, 2025 at 10:41:32PM +0100, Maciej S. Szmigiero wrote:
>> >>> > > > On 3.02.2025 21:20, Peter Xu wrote:
>> >>> > > > > On Mon, Feb 03, 2025 at 07:53:00PM +0100, Maciej S. Szmigiero wrote:
>> >>> > > > > > On 3.02.2025 19:20, Peter Xu wrote:
>> >>> > > > > > > On Thu, Jan 30, 2025 at 11:08:29AM +0100, Maciej S. Szmigiero wrote:
>> >>> > > > > > > > From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>> >>> > > > > > > > 
>> >>> > > > > > > > Multifd send channels are terminated by calling
>> >>> > > > > > > > qio_channel_shutdown(QIO_CHANNEL_SHUTDOWN_BOTH) in
>> >>> > > > > > > > multifd_send_terminate_threads(), which in the TLS case essentially
>> >>> > > > > > > > calls shutdown(SHUT_RDWR) on the underlying raw socket.
>> >>> > > > > > > > 
>> >>> > > > > > > > Unfortunately, this does not terminate the TLS session properly and
>> >>> > > > > > > > the receive side sees this as a GNUTLS_E_PREMATURE_TERMINATION error.
>> >>> > > > > > > > 
>> >>> > > > > > > > The only reason why this wasn't causing migration failures is because
>> >>> > > > > > > > the current migration code apparently does not check for migration
>> >>> > > > > > > > error being set after the end of the multifd receive process.
>> >>> > > > > > > > 
>> >>> > > > > > > > However, this will change soon so the multifd receive code has to be
>> >>> > > > > > > > prepared to not return an error on such premature TLS session EOF.
>> >>> > > > > > > > Use the newly introduced QIOChannelTLS method for that.
>> >>> > > > > > > > 
>> >>> > > > > > > > It's worth noting that even if the sender were to be changed to terminate
>> >>> > > > > > > > the TLS connection properly the receive side still needs to remain
>> >>> > > > > > > > compatible with older QEMU bit stream which does not do this.
>> >>> > > > > > > 
>> >>> > > > > > > If this is an existing bug, we could add a Fixes.
>> >>> > > > > > 
>> >>> > > > > > It is an existing issue but only uncovered by this patch set.
>> >>> > > > > > 
>> >>> > > > > > As far as I can see it was always there, so it would need some
>> >>> > > > > > thought where to point that Fixes tag.
>> >>> > > > > 
>> >>> > > > > If there's no way to trigger a real functional bug anyway, it's also ok we
>> >>> > > > > omit the Fixes.
>> >>> > > > > 
>> >>> > > > > > > Two pure questions..
>> >>> > > > > > > 
>> >>> > > > > > >      - What is the correct way to terminate the TLS session without this flag?
>> >>> > > > > > 
>> >>> > > > > > I guess one would need to call gnutls_bye() like in this GnuTLS example:
>> >>> > > > > > https://gitlab.com/gnutls/gnutls/-/blob/2b8c3e4c71ad380bbbffb32e6003b34ecad596e3/doc/examples/ex-client-anon.c#L102
>> >>> > > > > > 
>> >>> > > > > > >      - Why is this only needed by multifd sessions?
>> >>> > > > > > 
>> >>> > > > > > What uncovered the issue was switching the load threads to using
>> >>> > > > > > migrate_set_error() instead of their own result variable
>> >>> > > > > > (load_threads_ret) which you had requested during the previous
>> >>> > > > > > patch set version review:
>> >>> > > > > > https://lore.kernel.org/qemu-devel/Z1DbH5fwBaxtgrvH@x1n/
>> >>> > > > > > 
>> >>> > > > > > Turns out that the multifd receive code always returned
>> >>> > > > > > error in the TLS case, just nothing was previously checking for
>> >>> > > > > > that error presence.
>> >>> > > > > 
>> >>> > > > > What I was curious about is whether this issue also exists for the main migration
>> >>> > > > > channel when with tls, especially when e.g. multifd not enabled at all.  As
>> >>> > > > > I don't see anywhere that qemu uses gnutls_bye() for any tls session.
>> >>> > > > > 
>> >>> > > > > I think it's a good find that we overlooked this before.. and IMHO it's
>> >>> > > > > always good we could fix this.
>> >>> > > > > 
>> >>> > > > > Does it mean we need proper gnutls_bye() somewhere?
>> >>> > > > > 
>> >>> > > > > If we need an explicit gnutls_bye(), then I wonder if that should be done
>> >>> > > > > on the main channel as well.
>> >>> > > > 
>> >>> > > > That's a good question and looking at the code qemu_loadvm_state_main() exits
>> >>> > > > on receiving "QEMU_VM_EOF" section (that's different from receiving socket EOF)
>> >>> > > > and then optionally "QEMU_VM_VMDESCRIPTION" section is read with explicit size
>> >>> > > > in qemu_loadvm_state() - so still not until channel EOF.
>> >>> > > 
>> >>> > > I had a closer look, I do feel like such pre-mature termination is caused
>> >>> > > by explicit shutdown()s of the iochannels, looks like that can cause issue
>> >>> > > even after everything is sent.  Then I noticed indeed multifd sender
>> >>> > > iochannels will get explicit shutdown()s since commit 077fbb5942, while we
>> >>> > > don't do that for the main channel.  Maybe that is a major difference.
>> >>> > > 
>> >>> > > Now I wonder whether we should shutdown() the channel at all if migration
>> >>> > > succeeded, because looks like it can cause tls session to interrupt even if
>> >>> > > the shutdown() is done after sent everything, and if so it'll explain why
>> >>> > > you hit the issue with tls.
>> >>> > > 
>> >>> > > > 
>> >>> > > > Then I can't see anything else reading the channel until it is closed in
>> >>> > > > migration_incoming_state_destroy().
>> >>> > > > 
>> >>> > > > So most likely the main migration channel will never read far enough to
>> >>> > > > reach that GNUTLS_E_PREMATURE_TERMINATION error.
>> >>> > > > 
>> >>> > > > > If we don't need gnutls_bye(), then should we always ignore pre-mature
>> >>> > > > > termination of tls no matter if it's multifd or non-multifd channel (or
>> >>> > > > > even a tls session that is not migration-related)?
>> >>> > > > 
>> >>> > > > So basically have this patch extended to calling
>> >>> > > > qio_channel_tls_set_premature_eof_okay() also on the main migration channel?
>> >>> > > 
>> >>> > > If above theory can stand, then eof-okay could be a workaround papering
>> >>> > > over the real problem that we shouldn't always shutdown()..
>> >>> > > 
>> >>> > > Could you have a look at below patch and see whether it can fix the problem
>> >>> > > you hit too, in place of these two patches (including the previous
>> >>> > > iochannel change)?
>> >>> > > 
>> >>> > 
>> >>> > Unfortunately, the patch below does not fix the problem:
>> >>> > > qemu-system-x86_64: Cannot read from TLS channel: The TLS connection was non-properly terminated.
>> >>> > > qemu-system-x86_64: Cannot read from TLS channel: The TLS connection was non-properly terminated.
>> >>> > 
>> >>> > I think that, even in the absence of shutdown(), if the sender does not
>> >>> > call gnutls_bye() the TLS session is considered improperly terminated.
>> >>> 
>> >>> Ah..
>> >>> 
>> >>> How about one more change on top of above change to disconnect properly for
>> >>> TLS?  Something like gnutls_bye() in qio_channel_tls_close(), would that
>> >>> make sense to you?
>> >>
>> >> Calling gnutls_bye from qio_channel_tls_close is not viable for the
>> >> API contract of qio_channel_close. gnutls_bye needs to be able to
>> >> perform I/O, which means we need to be able to tell the caller
>> >> whether it needs to perform an event loop wait for POLLIN or POLLOUT.
>> >>
>> >> This is the same API design scenario as the gnutls_handshake method.
>> >> As such I don't think it is practical to abstract it inside any
>> >> existing QIOChannel API call, it'll have to be standalone like
>> >> qio_channel_tls_handshake() is.
>> >>
>> >
>> > I implemented the call to gnutls_bye:
>> > https://gitlab.com/farosas/qemu/-/commits/migration-tls-bye
>> >
>> > Then while testing it I realised we actually have a regression from 9.2:
>> >
>> > 1d457daf86 ("migration/multifd: Further remove the SYNC on complete")
>> >
>> > It seems that patch somehow affected the ordering between src shutdown
>> > vs. recv shutdown and now the recv channels are staying around to see
>> > the connection being broken. Or something... I'm still looking into it.
>> >
>> 
>> Ok, so the issue is that the recv side would previously be stuck at the
>> sync semaphore and multifd_recv_terminate_threads() would kick it only
>> after 'exiting' was set, so no further recv() would happen.
>> 
>> After the patch, there's no final sync anymore, so the recv thread loops
>> around and waits at the recv() until multifd_send_terminate_threads()
>> closes the connection.
>> 
>> Waiting on sem_sync as before would lead to a cleaner termination
>> process IMO, but I don't think it's worth the extra complexity of
>> introducing a sync to the device state migration.
>> 
>> So I think we'll have to go with one of the approaches suggested on this
>> thread (gnutls_bye or premature_ok). I'm fine either way, but let's make
>> sure we add a reference to the patch above and some words explaining the
>> situation.
>> 
>> (let me know if anyone prefers the gnutls_bye approach I have implemented
>> and I can send a proper series)
>
> Good to know the progress.
>
> If that doesn't take a lot of time to provide a formal patch, IMO you
> should go for it at least with an RFC; RFC is less likely to be completely
> forgotten from thread discussions in all cases.
>

I'll send it. But let's try to avoid making it a dependency for this
series like the last time. That didn't work out so well, I think. I'd
suggest we prepare a simple fix to go along with the device state code,
ideally something that doesn't interact with TLS at all.
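For reference, the reason gnutls_bye() has to be a standalone, handshake-style
API is that the close_notify exchange may need further I/O in both directions,
so a caller driven by an event loop must be told what to poll for and when to
retry. A toy Python state machine sketching that control flow (all names here
are invented; a real interface would mirror qio_channel_tls_handshake(), and
gnutls_bye() only waits for the peer's close_notify in the full-duplex mode):

```python
import enum

class ByeStatus(enum.Enum):
    DONE = 0
    WANT_READ = 1    # caller must wait for POLLIN, then call bye() again
    WANT_WRITE = 2   # caller must wait for POLLOUT, then call bye() again

class TLSSession:
    """Toy model of a nonblocking close_notify exchange."""
    def __init__(self):
        self._sent_close_notify = False
        self._got_close_notify = False
        self._write_ready = False
        self._read_ready = False

    def bye(self):
        # Step 1: send our close_notify; may block on the socket buffer.
        if not self._sent_close_notify:
            if not self._write_ready:
                return ByeStatus.WANT_WRITE
            self._sent_close_notify = True
        # Step 2: wait for the peer's close_notify; may block on reading.
        if not self._got_close_notify:
            if not self._read_ready:
                return ByeStatus.WANT_READ
            self._got_close_notify = True
        return ByeStatus.DONE

sess = TLSSession()
assert sess.bye() is ByeStatus.WANT_WRITE   # event loop waits for POLLOUT
sess._write_ready = True
assert sess.bye() is ByeStatus.WANT_READ    # then waits for the peer's reply
sess._read_ready = True
assert sess.bye() is ByeStatus.DONE
```

This is why hiding the call inside qio_channel_close() does not fit: close()
has no way to hand these wait states back to the event loop.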

> Migration is not the only one using tls channels, so even if migration can
> avoid depending on it, I wonder if gnutls_bye is a must if we want to make
> sure QEMU is free from pre-mature termination attacks on other users.
>
> Thanks,



* Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels
  2025-02-06 14:13                         ` Fabiano Rosas
@ 2025-02-06 14:53                           ` Maciej S. Szmigiero
  2025-02-06 15:20                             ` Fabiano Rosas
  0 siblings, 1 reply; 137+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-06 14:53 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: Alex Williamson, Daniel P. Berrangé, Peter Xu,
	Cédric Le Goater, Eric Blake, Markus Armbruster,
	Avihai Horon, Joao Martins, qemu-devel

On 6.02.2025 15:13, Fabiano Rosas wrote:
> "Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:
> 
>> On 5.02.2025 21:42, Fabiano Rosas wrote:
>>> Fabiano Rosas <farosas@suse.de> writes:
>>>
>>>> Daniel P. Berrangé <berrange@redhat.com> writes:
>>>>
>>>>> On Tue, Feb 04, 2025 at 10:31:31AM -0500, Peter Xu wrote:
>>>>>> On Tue, Feb 04, 2025 at 03:39:00PM +0100, Maciej S. Szmigiero wrote:
>>>>>>> On 3.02.2025 23:56, Peter Xu wrote:
>>>>>>>> On Mon, Feb 03, 2025 at 10:41:32PM +0100, Maciej S. Szmigiero wrote:
>>>>>>>>> On 3.02.2025 21:20, Peter Xu wrote:
>>>>>>>>>> On Mon, Feb 03, 2025 at 07:53:00PM +0100, Maciej S. Szmigiero wrote:
>>>>>>>>>>> On 3.02.2025 19:20, Peter Xu wrote:
>>>>>>>>>>>> On Thu, Jan 30, 2025 at 11:08:29AM +0100, Maciej S. Szmigiero wrote:
>>>>>>>>>>>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Multifd send channels are terminated by calling
>>>>>>>>>>>>> qio_channel_shutdown(QIO_CHANNEL_SHUTDOWN_BOTH) in
>>>>>>>>>>>>> multifd_send_terminate_threads(), which in the TLS case essentially
>>>>>>>>>>>>> calls shutdown(SHUT_RDWR) on the underlying raw socket.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Unfortunately, this does not terminate the TLS session properly and
>>>>>>>>>>>>> the receive side sees this as a GNUTLS_E_PREMATURE_TERMINATION error.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The only reason why this wasn't causing migration failures is because
>>>>>>>>>>>>> the current migration code apparently does not check for migration
>>>>>>>>>>>>> error being set after the end of the multifd receive process.
>>>>>>>>>>>>>
>>>>>>>>>>>>> However, this will change soon so the multifd receive code has to be
>>>>>>>>>>>>> prepared to not return an error on such premature TLS session EOF.
>>>>>>>>>>>>> Use the newly introduced QIOChannelTLS method for that.
>>>>>>>>>>>>>
>>>>>>>>>>>>> It's worth noting that even if the sender were to be changed to terminate
>>>>>>>>>>>>> the TLS connection properly the receive side still needs to remain
>>>>>>>>>>>>> compatible with older QEMU bit stream which does not do this.
>>>>>>>>>>>>
>>>>>>>>>>>> If this is an existing bug, we could add a Fixes.
>>>>>>>>>>>
>>>>>>>>>>> It is an existing issue but only uncovered by this patch set.
>>>>>>>>>>>
>>>>>>>>>>> As far as I can see it was always there, so it would need some
>>>>>>>>>>> thought where to point that Fixes tag.
>>>>>>>>>>
>>>>>>>>>> If there's no way to trigger a real functional bug anyway, it's also ok we
>>>>>>>>>> omit the Fixes.
>>>>>>>>>>
>>>>>>>>>>>> Two pure questions..
>>>>>>>>>>>>
>>>>>>>>>>>>        - What is the correct way to terminate the TLS session without this flag?
>>>>>>>>>>>
>>>>>>>>>>> I guess one would need to call gnutls_bye() like in this GnuTLS example:
>>>>>>>>>>> https://gitlab.com/gnutls/gnutls/-/blob/2b8c3e4c71ad380bbbffb32e6003b34ecad596e3/doc/examples/ex-client-anon.c#L102
>>>>>>>>>>>
>>>>>>>>>>>>        - Why is this only needed by multifd sessions?
>>>>>>>>>>>
>>>>>>>>>>> What uncovered the issue was switching the load threads to using
>>>>>>>>>>> migrate_set_error() instead of their own result variable
>>>>>>>>>>> (load_threads_ret) which you had requested during the previous
>>>>>>>>>>> patch set version review:
>>>>>>>>>>> https://lore.kernel.org/qemu-devel/Z1DbH5fwBaxtgrvH@x1n/
>>>>>>>>>>>
>>>>>>>>>>> Turns out that the multifd receive code always returned
>>>>>>>>>>> error in the TLS case, just nothing was previously checking for
>>>>>>>>>>> that error presence.
>>>>>>>>>>
>>>>>>>>>> What I was curious about is whether this issue also exists for the main migration
>>>>>>>>>> channel when with tls, especially when e.g. multifd not enabled at all.  As
>>>>>>>>>> I don't see anywhere that qemu uses gnutls_bye() for any tls session.
>>>>>>>>>>
>>>>>>>>>> I think it's a good find that we overlooked this before.. and IMHO it's
>>>>>>>>>> always good we could fix this.
>>>>>>>>>>
>>>>>>>>>> Does it mean we need proper gnutls_bye() somewhere?
>>>>>>>>>>
>>>>>>>>>> If we need an explicit gnutls_bye(), then I wonder if that should be done
>>>>>>>>>> on the main channel as well.
>>>>>>>>>
>>>>>>>>> That's a good question and looking at the code qemu_loadvm_state_main() exits
>>>>>>>>> on receiving "QEMU_VM_EOF" section (that's different from receiving socket EOF)
>>>>>>>>> and then optionally "QEMU_VM_VMDESCRIPTION" section is read with explicit size
>>>>>>>>> in qemu_loadvm_state() - so still not until channel EOF.
>>>>>>>>
>>>>>>>> I had a closer look, I do feel like such pre-mature termination is caused
>>>>>>>> by explicit shutdown()s of the iochannels, looks like that can cause issue
>>>>>>>> even after everything is sent.  Then I noticed indeed multifd sender
>>>>>>>> iochannels will get explicit shutdown()s since commit 077fbb5942, while we
>>>>>>>> don't do that for the main channel.  Maybe that is a major difference.
>>>>>>>>
>>>>>>>> Now I wonder whether we should shutdown() the channel at all if migration
>>>>>>>> succeeded, because looks like it can cause tls session to interrupt even if
>>>>>>>> the shutdown() is done after sent everything, and if so it'll explain why
>>>>>>>> you hit the issue with tls.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Then I can't see anything else reading the channel until it is closed in
>>>>>>>>> migration_incoming_state_destroy().
>>>>>>>>>
>>>>>>>>> So most likely the main migration channel will never read far enough to
>>>>>>>>> reach that GNUTLS_E_PREMATURE_TERMINATION error.
>>>>>>>>>
>>>>>>>>>> If we don't need gnutls_bye(), then should we always ignore pre-mature
>>>>>>>>>> termination of tls no matter if it's multifd or non-multifd channel (or
>>>>>>>>>> even a tls session that is not migration-related)?
>>>>>>>>>
>>>>>>>>> So basically have this patch extended to calling
>>>>>>>>> qio_channel_tls_set_premature_eof_okay() also on the main migration channel?
>>>>>>>>
>>>>>>>> If above theory can stand, then eof-okay could be a workaround papering
>>>>>>>> over the real problem that we shouldn't always shutdown()..
>>>>>>>>
>>>>>>>> Could you have a look at below patch and see whether it can fix the problem
>>>>>>>> you hit too, in place of these two patches (including the previous
>>>>>>>> iochannel change)?
>>>>>>>>
>>>>>>>
>>>>>>> Unfortunately, the patch below does not fix the problem:
>>>>>>>> qemu-system-x86_64: Cannot read from TLS channel: The TLS connection was non-properly terminated.
>>>>>>>> qemu-system-x86_64: Cannot read from TLS channel: The TLS connection was non-properly terminated.
>>>>>>>
>>>>>>> I think that, even in the absence of shutdown(), if the sender does not
>>>>>>> call gnutls_bye() the TLS session is considered improperly terminated.
>>>>>>
>>>>>> Ah..
>>>>>>
>>>>>> How about one more change on top of above change to disconnect properly for
>>>>>> TLS?  Something like gnutls_bye() in qio_channel_tls_close(), would that
>>>>>> make sense to you?
>>>>>
>>>>> Calling gnutls_bye from qio_channel_tls_close is not viable for the
>>>>> API contract of qio_channel_close. gnutls_bye needs to be able to
>>>>> perform I/O, which means we need to be able to tell the caller
>>>>> whether it needs to perform an event loop wait for POLLIN or POLLOUT.
>>>>>
>>>>> This is the same API design scenario as the gnutls_handshake method.
>>>>> As such I don't think it is practical to abstract it inside any
>>>>> existing QIOChannel API call, it'll have to be standalone like
>>>>> qio_channel_tls_handshake() is.
>>>>>
>>>>
>>>> I implemented the call to gnutls_bye:
>>>> https://gitlab.com/farosas/qemu/-/commits/migration-tls-bye
>>>>
>>>> Then while testing it I realised we actually have a regression from 9.2:
>>>>
>>>> 1d457daf86 ("migration/multifd: Further remove the SYNC on complete")
>>>>
>>>> It seems that patch somehow affected the ordering between src shutdown
>>>> vs. recv shutdown and now the recv channels are staying around to see
>>>> the connection being broken. Or something... I'm still looking into it.
>>>>
>>>
>>> Ok, so the issue is that the recv side would previously be stuck at the
>>> sync semaphore and multifd_recv_terminate_threads() would kick it only
>>> after 'exiting' was set, so no further recv() would happen.
>>>
>>> After the patch, there's no final sync anymore, so the recv thread loops
>>> around and waits at the recv() until multifd_send_terminate_threads()
>>> closes the connection.
>>>
>>> Waiting on sem_sync as before would lead to a cleaner termination
>>> process IMO, but I don't think it's worth the extra complexity of
>>> introducing a sync to the device state migration.
>>>
>>> So I think we'll have to go with one of the approaches suggested on this
>>> thread (gnutls_bye or premature_ok). I'm fine either way, but let's make
>>> sure we add a reference to the patch above and some words explaining the
>>> situation.
>>
>> We still need premature_ok for handling older QEMU versions that do not
>> terminate the TLS stream correctly since the TLS test regression happens
>> even without device state transfer being enabled.
> 
> What exactly is the impact of this issue on the device state series?
>  From the cover letter:
> 
>    * qemu_loadvm_load_thread_pool now reports error via migrate_set_error()
>    instead of dedicated load_threads_ret variable.
> 
>    * Since the change above uncovered an issue with respect to multifd send
>    channels not terminating TLS session properly QIOChannelTLS now allows
>    gracefully handling this situation.
> 
> I understand qemu_loadvm_load_thread_pool() is attempting to use
> migrate_set_error() but an error is already set by the recv_thread. Is
> that the issue? 

Yes, when we check for a load threads error in the TLS case we see that
multifd TLS one instead.
We need to know whether the load threads succeeded so we can either continue
with the migration or abort it.

> I wonder if we could somehow isolate this so it doesn't
> impact this series.

The previous version simply used a dedicated load_threads_ret variable
so it wasn't affected, but Peter prefers migrate_set_error() for
migration thread pools.
  
>>
>> So I think that's what we should use generally.
>>    
> 
> For premature_ok, we need to make sure it will not hang QEMU if the
> connection gets unexpectedly closed. The current code checks for
> shutdown() having already happened, which is fine because it means we're
> already on the way out. However, if any ol' recv() can now ignore a
> premature termination error, then the recv_thread will not trigger
> cleanup of the multifd_recv threads.

Enabling premature_ok just turns the GNUTLS_E_PREMATURE_TERMINATION error
on EOF into a normal EOF.

The receive thread will exit on either one:
>             if (ret == 0 || ret == -1) {   /* 0: EOF  -1: Error */
>                 break;
>             }

It's true that multifd_recv_terminate_threads() will only be called
by multifd_recv_cleanup() or multifd_recv_shutdown() in this case;
however, this is already the case for non-TLS migration.

So if there were a bug with multifd threads shutdown it would have
already been manifesting in non-TLS migration.

Also, to be clear, I'm not advocating for removing that shutdown()
call.
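To make the premature_ok semantics above concrete, here is a toy Python model
(the channel class, exception name, and flag are invented for illustration; in
QEMU the flag would be set via qio_channel_tls_set_premature_eof_okay() and
the underlying error is GNUTLS_E_PREMATURE_TERMINATION):

```python
class PrematureTermination(Exception):
    """Stand-in for GNUTLS_E_PREMATURE_TERMINATION."""

class TLSChannel:
    """Toy model of the proposed premature-EOF handling."""
    def __init__(self, chunks, clean_close=False):
        self._chunks = list(chunks)
        self._clean_close = clean_close   # did the peer send close_notify?
        self.premature_eof_okay = False

    def recv(self):
        if self._chunks:
            return self._chunks.pop(0)
        if self._clean_close or self.premature_eof_okay:
            return b""                    # normal EOF
        raise PrematureTermination()

def recv_loop(chan):
    got = []
    while True:
        data = chan.recv()   # raises on improper termination unless allowed
        if not data:
            break
        got.append(data)
    return got

# Peer closed the raw socket without gnutls_bye(): an error by default...
chan = TLSChannel([b"a", b"b"])
try:
    recv_loop(chan)
except PrematureTermination:
    pass

# ...but a plain EOF once premature EOF is allowed; all data still arrives.
chan = TLSChannel([b"a", b"b"])
chan.premature_eof_okay = True
assert recv_loop(chan) == [b"a", b"b"]
```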

>>> (let me know if anyone prefers the gnutls_bye approach I have implemented
>>> and I can send a proper series)
>>
>> Thanks,
>> Maciej

Thanks,
Maciej




* Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels
  2025-02-06 14:53                           ` Maciej S. Szmigiero
@ 2025-02-06 15:20                             ` Fabiano Rosas
  2025-02-06 16:01                               ` Maciej S. Szmigiero
  0 siblings, 1 reply; 137+ messages in thread
From: Fabiano Rosas @ 2025-02-06 15:20 UTC (permalink / raw)
  To: Maciej S. Szmigiero
  Cc: Alex Williamson, Daniel P. Berrangé, Peter Xu,
	Cédric Le Goater, Eric Blake, Markus Armbruster,
	Avihai Horon, Joao Martins, qemu-devel

"Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:

> On 6.02.2025 15:13, Fabiano Rosas wrote:
>> "Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:
>> 
>>> On 5.02.2025 21:42, Fabiano Rosas wrote:
>>>> Fabiano Rosas <farosas@suse.de> writes:
>>>>
>>>>> Daniel P. Berrangé <berrange@redhat.com> writes:
>>>>>
>>>>>> On Tue, Feb 04, 2025 at 10:31:31AM -0500, Peter Xu wrote:
>>>>>>> On Tue, Feb 04, 2025 at 03:39:00PM +0100, Maciej S. Szmigiero wrote:
>>>>>>>> On 3.02.2025 23:56, Peter Xu wrote:
>>>>>>>>> On Mon, Feb 03, 2025 at 10:41:32PM +0100, Maciej S. Szmigiero wrote:
>>>>>>>>>> On 3.02.2025 21:20, Peter Xu wrote:
>>>>>>>>>>> On Mon, Feb 03, 2025 at 07:53:00PM +0100, Maciej S. Szmigiero wrote:
>>>>>>>>>>>> On 3.02.2025 19:20, Peter Xu wrote:
>>>>>>>>>>>>> On Thu, Jan 30, 2025 at 11:08:29AM +0100, Maciej S. Szmigiero wrote:
>>>>>>>>>>>>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Multifd send channels are terminated by calling
>>>>>>>>>>>>>> qio_channel_shutdown(QIO_CHANNEL_SHUTDOWN_BOTH) in
>>>>>>>>>>>>>> multifd_send_terminate_threads(), which in the TLS case essentially
>>>>>>>>>>>>>> calls shutdown(SHUT_RDWR) on the underlying raw socket.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Unfortunately, this does not terminate the TLS session properly and
>>>>>>>>>>>>>> the receive side sees this as a GNUTLS_E_PREMATURE_TERMINATION error.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The only reason why this wasn't causing migration failures is because
>>>>>>>>>>>>>> the current migration code apparently does not check for migration
>>>>>>>>>>>>>> error being set after the end of the multifd receive process.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> However, this will change soon so the multifd receive code has to be
>>>>>>>>>>>>>> prepared to not return an error on such premature TLS session EOF.
>>>>>>>>>>>>>> Use the newly introduced QIOChannelTLS method for that.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> It's worth noting that even if the sender were to be changed to terminate
>>>>>>>>>>>>>> the TLS connection properly the receive side still needs to remain
>>>>>>>>>>>>>> compatible with older QEMU bit stream which does not do this.
>>>>>>>>>>>>>
>>>>>>>>>>>>> If this is an existing bug, we could add a Fixes.
>>>>>>>>>>>>
>>>>>>>>>>>> It is an existing issue but only uncovered by this patch set.
>>>>>>>>>>>>
>>>>>>>>>>>> As far as I can see it was always there, so it would need some
>>>>>>>>>>>> thought where to point that Fixes tag.
>>>>>>>>>>>
>>>>>>>>>>> If there's no way to trigger a real functional bug anyway, it's also ok we
>>>>>>>>>>> omit the Fixes.
>>>>>>>>>>>
>>>>>>>>>>>>> Two pure questions..
>>>>>>>>>>>>>
>>>>>>>>>>>>>        - What is the correct way to terminate the TLS session without this flag?
>>>>>>>>>>>>
>>>>>>>>>>>> I guess one would need to call gnutls_bye() like in this GnuTLS example:
>>>>>>>>>>>> https://gitlab.com/gnutls/gnutls/-/blob/2b8c3e4c71ad380bbbffb32e6003b34ecad596e3/doc/examples/ex-client-anon.c#L102
>>>>>>>>>>>>
>>>>>>>>>>>>>        - Why is this only needed by multifd sessions?
>>>>>>>>>>>>
>>>>>>>>>>>> What uncovered the issue was switching the load threads to using
>>>>>>>>>>>> migrate_set_error() instead of their own result variable
>>>>>>>>>>>> (load_threads_ret) which you had requested during the previous
>>>>>>>>>>>> patch set version review:
>>>>>>>>>>>> https://lore.kernel.org/qemu-devel/Z1DbH5fwBaxtgrvH@x1n/
>>>>>>>>>>>>
>>>>>>>>>>>> Turns out that the multifd receive code always returned
>>>>>>>>>>>> error in the TLS case, just nothing was previously checking for
>>>>>>>>>>>> that error presence.
>>>>>>>>>>>
>>>>>>>>>>> What I was curious about is whether this issue also exists for the main migration
>>>>>>>>>>> channel when with tls, especially when e.g. multifd not enabled at all.  As
>>>>>>>>>>> I don't see anywhere that qemu uses gnutls_bye() for any tls session.
>>>>>>>>>>>
>>>>>>>>>>> I think it's a good find that we overlooked this before.. and IMHO it's
>>>>>>>>>>> always good we could fix this.
>>>>>>>>>>>
>>>>>>>>>>> Does it mean we need proper gnutls_bye() somewhere?
>>>>>>>>>>>
>>>>>>>>>>> If we need an explicit gnutls_bye(), then I wonder if that should be done
>>>>>>>>>>> on the main channel as well.
>>>>>>>>>>
>>>>>>>>>> That's a good question and looking at the code qemu_loadvm_state_main() exits
>>>>>>>>>> on receiving "QEMU_VM_EOF" section (that's different from receiving socket EOF)
>>>>>>>>>> and then optionally "QEMU_VM_VMDESCRIPTION" section is read with explicit size
>>>>>>>>>> in qemu_loadvm_state() - so still not until channel EOF.
>>>>>>>>>
>>>>>>>>> I had a closer look, I do feel like such pre-mature termination is caused
>>>>>>>>> by explicit shutdown()s of the iochannels, looks like that can cause issue
>>>>>>>>> even after everything is sent.  Then I noticed indeed multifd sender
>>>>>>>>> iochannels will get explicit shutdown()s since commit 077fbb5942, while we
>>>>>>>>> don't do that for the main channel.  Maybe that is a major difference.
>>>>>>>>>
>>>>>>>>> Now I wonder whether we should shutdown() the channel at all if migration
>>>>>>>>> succeeded, because looks like it can cause tls session to interrupt even if
>>>>>>>>> the shutdown() is done after sent everything, and if so it'll explain why
>>>>>>>>> you hit the issue with tls.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Then I can't see anything else reading the channel until it is closed in
>>>>>>>>>> migration_incoming_state_destroy().
>>>>>>>>>>
>>>>>>>>>> So most likely the main migration channel will never read far enough to
>>>>>>>>>> reach that GNUTLS_E_PREMATURE_TERMINATION error.
>>>>>>>>>>
>>>>>>>>>>> If we don't need gnutls_bye(), then should we always ignore pre-mature
>>>>>>>>>>> termination of tls no matter if it's multifd or non-multifd channel (or
>>>>>>>>>>> even a tls session that is not migration-related)?
>>>>>>>>>>
>>>>>>>>>> So basically have this patch extended to calling
>>>>>>>>>> qio_channel_tls_set_premature_eof_okay() also on the main migration channel?
>>>>>>>>>
>>>>>>>>> If above theory can stand, then eof-okay could be a workaround papering
>>>>>>>>> over the real problem that we shouldn't always shutdown()..
>>>>>>>>>
>>>>>>>>> Could you have a look at below patch and see whether it can fix the problem
>>>>>>>>> you hit too, in place of these two patches (including the previous
>>>>>>>>> iochannel change)?
>>>>>>>>>
>>>>>>>>
>>>>>>>> Unfortunately, the patch below does not fix the problem:
>>>>>>>>> qemu-system-x86_64: Cannot read from TLS channel: The TLS connection was non-properly terminated.
>>>>>>>>> qemu-system-x86_64: Cannot read from TLS channel: The TLS connection was non-properly terminated.
>>>>>>>>
>>>>>>>> I think that, even in the absence of shutdown(), if the sender does not
>>>>>>>> call gnutls_bye() the TLS session is considered improperly terminated.
>>>>>>>
>>>>>>> Ah..
>>>>>>>
>>>>>>> How about one more change on top of above change to disconnect properly for
>>>>>>> TLS?  Something like gnutls_bye() in qio_channel_tls_close(), would that
>>>>>>> make sense to you?
>>>>>>
>>>>>> Calling gnutls_bye from qio_channel_tls_close is not viable for the
>>>>>> API contract of qio_channel_close. gnutls_bye needs to be able to
>>>>>> perform I/O, which means we need to be able to tell the caller
>>>>>> whether it needs to perform an event loop wait for POLLIN or POLLOUT.
>>>>>>
>>>>>> This is the same API design scenario as the gnutls_handshake method.
>>>>>> As such I don't think it is practical to abstract it inside any
>>>>>> existing QIOChannel API call, it'll have to be standalone like
>>>>>> qio_channel_tls_handshake() is.
>>>>>>
>>>>>
>>>>> I implemented the call to gnutls_bye:
>>>>> https://gitlab.com/farosas/qemu/-/commits/migration-tls-bye
>>>>>
>>>>> Then while testing it I realised we actually have a regression from 9.2:
>>>>>
>>>>> 1d457daf86 ("migration/multifd: Further remove the SYNC on complete")
>>>>>
>>>>> It seems that patch somehow affected the ordering between src shutdown
>>>>> vs. recv shutdown and now the recv channels are staying around to see
>>>>> the connection being broken. Or something... I'm still looking into it.
>>>>>
>>>>
>>>> Ok, so the issue is that the recv side would previously be stuck at the
>>>> sync semaphore and multifd_recv_terminate_threads() would kick it only
>>>> after 'exiting' was set, so no further recv() would happen.
>>>>
>>>> After the patch, there's no final sync anymore, so the recv thread loops
>>>> around and waits at the recv() until multifd_send_terminate_threads()
>>>> closes the connection.
>>>>
>>>> Waiting on sem_sync as before would lead to a cleaner termination
>>>> process IMO, but I don't think it's worth the extra complexity of
>>>> introducing a sync to the device state migration.
>>>>
>>>> So I think we'll have to go with one of the approaches suggested on this
>>>> thread (gnutls_bye or premature_ok). I'm fine either way, but let's make
>>>> sure we add a reference to the patch above and some words explaining the
>>>> situation.
>>>
>>> We still need premature_ok for handling older QEMU versions that do not
>>> terminate the TLS stream correctly since the TLS test regression happens
>>> even without device state transfer being enabled.
>> 
>> What exactly is the impact of this issue to the device state series?
>>  From the cover letter:
>> 
>>    * qemu_loadvm_load_thread_pool now reports error via migrate_set_error()
>>    instead of dedicated load_threads_ret variable.
>> 
>>    * Since the change above uncovered an issue with respect to multifd send
>>    channels not terminating TLS session properly QIOChannelTLS now allows
>>    gracefully handling this situation.
>> 
>> I understand qemu_loadvm_load_thread_pool() is attempting to use
>> migrate_set_error() but an error is already set by the recv_thread. Is
>> that the issue? 
>
> Yes, when we test for a load threads error in the TLS case we see that
> multifd TLS error instead.
> We need to know whether the load threads succeeded so we can either
> continue with the migration or abort it.
>
>> I wonder if we could somehow isolate this so it doesn't
>> impact this series.
>
> The previous version simply used a dedicated load_threads_ret variable
> so it wasn't affected but Peter likes migrate_set_error() more for
> migration thread pools.
>   
>>>
>>> So I think that's what we should use generally.
>>>    
>> 
>> For premature_ok, we need to make sure it will not hang QEMU if the
>> connection gets unexpectedly closed. The current code checks for
>> shutdown() having already happened, which is fine because it means we're
>> already on the way out. However, if any ol' recv() can now ignore a
>> premature termination error, then the recv_thread will not trigger
>> cleanup of the multifd_recv threads.
>
> Enabling premature_ok just turns the GNUTLS_E_PREMATURE_TERMINATION
> error on EOF into a normal EOF.
>
> The receive thread will exit on either one:
>>           if (ret == 0 || ret == -1) {   /* 0: EOF  -1: Error */
>>                 break;
>>             }
>
> It's true that multifd_recv_terminate_threads() will only be called
> by multifd_recv_cleanup() or multifd_recv_shutdown() in this case,
> however this is already the case for non-TLS migration.
>

My point is that a premature termination sets local_err and a
premature_ok==true doesn't. So it's not the same as non-TLS migration
because there we don't have a way to ignore any errors.

Multifd recv threads can't discern an EOF in the middle of the migration
from an EOF after all data has been received. The former is definitely
an error and should cause migration to abort, multifd threads to
cleanup, etc.

> So if there was a bug with multifd threads shutdown it would have
> already been manifesting on the non-TLS migration.
>

Even if non-TLS behaved the same, why would we choose to port a bug to
the TLS implementation?

We could of course decide at this point to punt the problem forward and
when it shows up, we'd have to go implement gnutls_bye() to allow the
distinction between good-EOF/bad-EOF or maybe add an extra sync at the
end of migration to make sure the last recv() call is only started after
the source has already shutdown() the channel.

> Also, to be clear, I'm not advocating for removing that shutdown()
> call.

Yes, I think we should keep it.



^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels
  2025-02-06 15:20                             ` Fabiano Rosas
@ 2025-02-06 16:01                               ` Maciej S. Szmigiero
  2025-02-06 17:32                                 ` Fabiano Rosas
  0 siblings, 1 reply; 137+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-06 16:01 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: Alex Williamson, Daniel P. Berrangé, Peter Xu,
	Cédric Le Goater, Eric Blake, Markus Armbruster,
	Avihai Horon, Joao Martins, qemu-devel

On 6.02.2025 16:20, Fabiano Rosas wrote:
> "Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:
> 
>> On 6.02.2025 15:13, Fabiano Rosas wrote:
>>> "Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:
>>>
>>>> On 5.02.2025 21:42, Fabiano Rosas wrote:
>>>>> Fabiano Rosas <farosas@suse.de> writes:
>>>>>
>>>>>> Daniel P. Berrangé <berrange@redhat.com> writes:
>>>>>>
>>>>>>> On Tue, Feb 04, 2025 at 10:31:31AM -0500, Peter Xu wrote:
>>>>>>>> On Tue, Feb 04, 2025 at 03:39:00PM +0100, Maciej S. Szmigiero wrote:
>>>>>>>>> On 3.02.2025 23:56, Peter Xu wrote:
>>>>>>>>>> On Mon, Feb 03, 2025 at 10:41:32PM +0100, Maciej S. Szmigiero wrote:
>>>>>>>>>>> On 3.02.2025 21:20, Peter Xu wrote:
>>>>>>>>>>>> On Mon, Feb 03, 2025 at 07:53:00PM +0100, Maciej S. Szmigiero wrote:
>>>>>>>>>>>>> On 3.02.2025 19:20, Peter Xu wrote:
>>>>>>>>>>>>>> On Thu, Jan 30, 2025 at 11:08:29AM +0100, Maciej S. Szmigiero wrote:
>>>>>>>>>>>>>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Multifd send channels are terminated by calling
>>>>>>>>>>>>>>> qio_channel_shutdown(QIO_CHANNEL_SHUTDOWN_BOTH) in
>>>>>>>>>>>>>>> multifd_send_terminate_threads(), which in the TLS case essentially
>>>>>>>>>>>>>>> calls shutdown(SHUT_RDWR) on the underlying raw socket.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Unfortunately, this does not terminate the TLS session properly and
>>>>>>>>>>>>>>> the receive side sees this as a GNUTLS_E_PREMATURE_TERMINATION error.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The only reason why this wasn't causing migration failures is because
>>>>>>>>>>>>>>> the current migration code apparently does not check for migration
>>>>>>>>>>>>>>> error being set after the end of the multifd receive process.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> However, this will change soon so the multifd receive code has to be
>>>>>>>>>>>>>>> prepared to not return an error on such premature TLS session EOF.
>>>>>>>>>>>>>>> Use the newly introduced QIOChannelTLS method for that.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> It's worth noting that even if the sender were to be changed to terminate
>>>>>>>>>>>>>>> the TLS connection properly the receive side still needs to remain
>>>>>>>>>>>>>>> compatible with older QEMU bit stream which does not do this.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> If this is an existing bug, we could add a Fixes.
>>>>>>>>>>>>>
>>>>>>>>>>>>> It is an existing issue but only uncovered by this patch set.
>>>>>>>>>>>>>
>>>>>>>>>>>>> As far as I can see it was always there, so it would need some
>>>>>>>>>>>>> thought where to point that Fixes tag.
>>>>>>>>>>>>
>>>>>>>>>>>> If there's no way to trigger a real functional bug anyway, it's also ok we
>>>>>>>>>>>> omit the Fixes.
>>>>>>>>>>>>
>>>>>>>>>>>>>> Two pure questions..
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>         - What is the correct way to terminate the TLS session without this flag?
>>>>>>>>>>>>>
>>>>>>>>>>>>> I guess one would need to call gnutls_bye() like in this GnuTLS example:
>>>>>>>>>>>>> https://gitlab.com/gnutls/gnutls/-/blob/2b8c3e4c71ad380bbbffb32e6003b34ecad596e3/doc/examples/ex-client-anon.c#L102
>>>>>>>>>>>>>
>>>>>>>>>>>>>>         - Why this is only needed by multifd sessions?
>>>>>>>>>>>>>
>>>>>>>>>>>>> What uncovered the issue was switching the load threads to using
>>>>>>>>>>>>> migrate_set_error() instead of their own result variable
>>>>>>>>>>>>> (load_threads_ret) which you had requested during the previous
>>>>>>>>>>>>> patch set version review:
>>>>>>>>>>>>> https://lore.kernel.org/qemu-devel/Z1DbH5fwBaxtgrvH@x1n/
>>>>>>>>>>>>>
>>>>>>>>>>>>> Turns out that the multifd receive code always returned an
>>>>>>>>>>>>> error in the TLS case, just nothing was previously checking for
>>>>>>>>>>>>> that error presence.
>>>>>>>>>>>>
>>>>>>>>>>>> What I was curious is whether this issue also exists for the main migration
>>>>>>>>>>>> channel when with tls, especially when e.g. multifd not enabled at all.  As
>>>>>>>>>>>> I don't see anywhere that qemu uses gnutls_bye() for any tls session.
>>>>>>>>>>>>
>>>>>>>>>>>> I think it's good to find that we overlooked this before.. and IMHO it's
>>>>>>>>>>>> always good we could fix this.
>>>>>>>>>>>>
>>>>>>>>>>>> Does it mean we need proper gnutls_bye() somewhere?
>>>>>>>>>>>>
>>>>>>>>>>>> If we need an explicit gnutls_bye(), then I wonder if that should be done
>>>>>>>>>>>> on the main channel as well.
>>>>>>>>>>>
>>>>>>>>>>> That's a good question and looking at the code qemu_loadvm_state_main() exits
>>>>>>>>>>> on receiving "QEMU_VM_EOF" section (that's different from receiving socket EOF)
>>>>>>>>>>> and then optionally "QEMU_VM_VMDESCRIPTION" section is read with explicit size
>>>>>>>>>>> in qemu_loadvm_state() - so still not until channel EOF.
>>>>>>>>>>
>>>>>>>>>> I had a closer look, I do feel like such pre-mature termination is caused
>>>>>>>>>> by explicit shutdown()s of the iochannels, looks like that can cause issue
>>>>>>>>>> even after everything is sent.  Then I noticed indeed multifd sender
>>>>>>>>>> iochannels will get explicit shutdown()s since commit 077fbb5942, while we
>>>>>>>>>> don't do that for the main channel.  Maybe that is a major difference.
>>>>>>>>>>
>>>>>>>>>> Now I wonder whether we should shutdown() the channel at all if migration
>>>>>>>>>> succeeded, because looks like it can cause tls session to interrupt even if
>>>>>>>>>> the shutdown() is done after sent everything, and if so it'll explain why
>>>>>>>>>> you hit the issue with tls.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Then I can't see anything else reading the channel until it is closed in
>>>>>>>>>>> migration_incoming_state_destroy().
>>>>>>>>>>>
>>>>>>>>>>> So most likely the main migration channel will never read far enough to
>>>>>>>>>>> reach that GNUTLS_E_PREMATURE_TERMINATION error.
>>>>>>>>>>>
>>>>>>>>>>>> If we don't need gnutls_bye(), then should we always ignore pre-mature
>>>>>>>>>>>> termination of tls no matter if it's multifd or non-multifd channel (or
>>>>>>>>>>>> even a tls session that is not migration-related)?
>>>>>>>>>>>
>>>>>>>>>>> So basically have this patch extended to calling
>>>>>>>>>>> qio_channel_tls_set_premature_eof_okay() also on the main migration channel?
>>>>>>>>>>
>>>>>>>>>> If above theory can stand, then eof-okay could be a workaround papering
>>>>>>>>>> over the real problem that we shouldn't always shutdown()..
>>>>>>>>>>
>>>>>>>>>> Could you have a look at below patch and see whether it can fix the problem
>>>>>>>>>> you hit too, in place of these two patches (including the previous
>>>>>>>>>> iochannel change)?
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Unfortunately, the patch below does not fix the problem:
>>>>>>>>>> qemu-system-x86_64: Cannot read from TLS channel: The TLS connection was non-properly terminated.
>>>>>>>>>> qemu-system-x86_64: Cannot read from TLS channel: The TLS connection was non-properly terminated.
>>>>>>>>>
>>>>>>>>> I think that, even in the absence of shutdown(), if the sender does not
>>>>>>>>> call gnutls_bye() the TLS session is considered improperly terminated.
>>>>>>>>
>>>>>>>> Ah..
>>>>>>>>
>>>>>>>> How about one more change on top of above change to disconnect properly for
>>>>>>>> TLS?  Something like gnutls_bye() in qio_channel_tls_close(), would that
>>>>>>>> make sense to you?
>>>>>>>
>>>>>>> Calling gnutls_bye from qio_channel_tls_close is not viable for the
>>>>>>> API contract of qio_channel_close. gnutls_bye needs to be able to
>>>>>>> perform I/O, which means we need to be able to tell the caller
>>>>>>> whether it needs to perform an event loop wait for POLLIN or POLLOUT.
>>>>>>>
>>>>>>> This is the same API design scenario as the gnutls_handshake method.
>>>>>>> As such I don't think it is practical to abstract it inside any
>>>>>>> existing QIOChannel API call, it'll have to be standalone like
>>>>>>> qio_channel_tls_handshake() is.
>>>>>>>
>>>>>>
>>>>>> I implemented the call to gnutls_bye:
>>>>>> https://gitlab.com/farosas/qemu/-/commits/migration-tls-bye
>>>>>>
>>>>>> Then while testing it I realised we actually have a regression from 9.2:
>>>>>>
>>>>>> 1d457daf86 ("migration/multifd: Further remove the SYNC on complete")
>>>>>>
>>>>>> It seems that patch somehow affected the ordering between src shutdown
>>>>>> vs. recv shutdown and now the recv channels are staying around to see
>>>>>> the connection being broken. Or something... I'm still looking into it.
>>>>>>
>>>>>
>>>>> Ok, so the issue is that the recv side would previously be stuck at the
>>>>> sync semaphore and multifd_recv_terminate_threads() would kick it only
>>>>> after 'exiting' was set, so no further recv() would happen.
>>>>>
>>>>> After the patch, there's no final sync anymore, so the recv thread loops
>>>>> around and waits at the recv() until multifd_send_terminate_threads()
>>>>> closes the connection.
>>>>>
>>>>> Waiting on sem_sync as before would lead to a cleaner termination
>>>>> process IMO, but I don't think it's worth the extra complexity of
>>>>> introducing a sync to the device state migration.
>>>>>
>>>>> So I think we'll have to go with one of the approaches suggested on this
>>>>> thread (gnutls_bye or premature_ok). I'm fine either way, but let's make
>>>>> sure we add a reference to the patch above and some words explaining the
>>>>> situation.
>>>>
>>>> We still need premature_ok for handling older QEMU versions that do not
>>>> terminate the TLS stream correctly since the TLS test regression happens
>>>> even without device state transfer being enabled.
>>>
>>> What exactly is the impact of this issue to the device state series?
>>>   From the cover letter:
>>>
>>>     * qemu_loadvm_load_thread_pool now reports error via migrate_set_error()
>>>     instead of dedicated load_threads_ret variable.
>>>
>>>     * Since the change above uncovered an issue with respect to multifd send
>>>     channels not terminating TLS session properly QIOChannelTLS now allows
>>>     gracefully handling this situation.
>>>
>>> I understand qemu_loadvm_load_thread_pool() is attempting to use
>>> migrate_set_error() but an error is already set by the recv_thread. Is
>>> that the issue?
>>
>> Yes, when we test for a load threads error in the TLS case we see that
>> multifd TLS error instead.
>> We need to know whether the load threads succeeded so we can either
>> continue with the migration or abort it.
>>
>>> I wonder if we could somehow isolate this so it doesn't
>>> impact this series.
>>
>> The previous version simply used a dedicated load_threads_ret variable
>> so it wasn't affected but Peter likes migrate_set_error() more for
>> migration thread pools.
>>    
>>>>
>>>> So I think that's what we should use generally.
>>>>     
>>>
>>> For premature_ok, we need to make sure it will not hang QEMU if the
>>> connection gets unexpectedly closed. The current code checks for
>>> shutdown() having already happened, which is fine because it means we're
>>> already on the way out. However, if any ol' recv() can now ignore a
>>> premature termination error, then the recv_thread will not trigger
>>> cleanup of the multifd_recv threads.
>>
>> Enabling premature_ok just turns the GNUTLS_E_PREMATURE_TERMINATION
>> error on EOF into a normal EOF.
>>
>> The receive thread will exit on either one:
>>>            if (ret == 0 || ret == -1) {   /* 0: EOF  -1: Error */
>>>                  break;
>>>              }
>>
>> It's true that multifd_recv_terminate_threads() will only be called
>> by multifd_recv_cleanup() or multifd_recv_shutdown() in this case,
>> however this is already the case for non-TLS migration.
>>
> 
> My point is that a premature termination sets local_err and a
> premature_ok==true doesn't. 

* Only in the TLS case - the non-TLS case doesn't set any error even with
premature_ok==true (assuming there hasn't been any other error during
receive).

* If there *has* been any other error during receive then it will be set
and the code flow will be the same even with premature_ok==true.

> So it's not the same as non-TLS migration
> because there we don't have a way to ignore any errors.

The GNUTLS_E_PREMATURE_TERMINATION error can't happen in the non-TLS case
so by definition we can't ignore it in the non-TLS case.

And we don't ignore any other error in the TLS/non-TLS case.

> Multifd recv threads can't discern an EOF in the middle of the migration
> from an EOF after all data has been received. The former is definitely
> an error and should cause migration to abort, multifd threads to
> cleanup, etc.

So in this case we should set the QIO_CHANNEL_READ_RELAXED_EOF flag in
the multifd channel recv thread main loop only, and let the
mid-stream page receive methods report GNUTLS_E_PREMATURE_TERMINATION
as usual.

This makes the TLS case work the same with respect to premature
EOF as the non-TLS case since the non-TLS case can't detect premature
EOF in the multifd channel recv thread main loop either.

>> So if there was a bug with multifd threads shutdown it would have
>> already been manifesting on the non-TLS migration.
>>
> 
> Even if non-TLS behaved the same, why would we choose to port a bug to
> the TLS implementation?
> 
> We could of course decide at this point to punt the problem forward and
> when it shows up, we'd have to go implement gnutls_bye() to allow the
> distinction between good-EOF/bad-EOF or maybe add an extra sync at the
> end of migration to make sure the last recv() call is only started after
> the source has already shutdown() the channel.

If we do some kind of a premature EOF detection then it should probably
work for the non-TLS case too (since that's probably by far the most
common use case).
So adding some MULTIFD_FLAG_EOS would make the most sense and would work
even with QIO_CHANNEL_READ_RELAXED_EOF being set.

In any case we'd still need some kind of a compatibility behavior for
the TLS bit stream emitted by older QEMU versions (which is always
improperly terminated).

Thanks,
Maciej



^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels
  2025-02-06 16:01                               ` Maciej S. Szmigiero
@ 2025-02-06 17:32                                 ` Fabiano Rosas
  2025-02-06 17:55                                   ` Maciej S. Szmigiero
  2025-02-06 21:51                                   ` Peter Xu
  0 siblings, 2 replies; 137+ messages in thread
From: Fabiano Rosas @ 2025-02-06 17:32 UTC (permalink / raw)
  To: Maciej S. Szmigiero
  Cc: Alex Williamson, Daniel P. Berrangé, Peter Xu,
	Cédric Le Goater, Eric Blake, Markus Armbruster,
	Avihai Horon, Joao Martins, qemu-devel

"Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:

> On 6.02.2025 16:20, Fabiano Rosas wrote:
>> "Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:
>> 
>>> On 6.02.2025 15:13, Fabiano Rosas wrote:
>>>> "Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:
>>>>
>>>>> On 5.02.2025 21:42, Fabiano Rosas wrote:
>>>>>> Fabiano Rosas <farosas@suse.de> writes:
>>>>>>
>>>>>>> Daniel P. Berrangé <berrange@redhat.com> writes:
>>>>>>>
>>>>>>>> On Tue, Feb 04, 2025 at 10:31:31AM -0500, Peter Xu wrote:
>>>>>>>>> On Tue, Feb 04, 2025 at 03:39:00PM +0100, Maciej S. Szmigiero wrote:
>>>>>>>>>> On 3.02.2025 23:56, Peter Xu wrote:
>>>>>>>>>>> On Mon, Feb 03, 2025 at 10:41:32PM +0100, Maciej S. Szmigiero wrote:
>>>>>>>>>>>> On 3.02.2025 21:20, Peter Xu wrote:
>>>>>>>>>>>>> On Mon, Feb 03, 2025 at 07:53:00PM +0100, Maciej S. Szmigiero wrote:
>>>>>>>>>>>>>> On 3.02.2025 19:20, Peter Xu wrote:
>>>>>>>>>>>>>>> On Thu, Jan 30, 2025 at 11:08:29AM +0100, Maciej S. Szmigiero wrote:
>>>>>>>>>>>>>>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Multifd send channels are terminated by calling
>>>>>>>>>>>>>>>> qio_channel_shutdown(QIO_CHANNEL_SHUTDOWN_BOTH) in
>>>>>>>>>>>>>>>> multifd_send_terminate_threads(), which in the TLS case essentially
>>>>>>>>>>>>>>>> calls shutdown(SHUT_RDWR) on the underlying raw socket.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Unfortunately, this does not terminate the TLS session properly and
>>>>>>>>>>>>>>>> the receive side sees this as a GNUTLS_E_PREMATURE_TERMINATION error.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The only reason why this wasn't causing migration failures is because
>>>>>>>>>>>>>>>> the current migration code apparently does not check for migration
>>>>>>>>>>>>>>>> error being set after the end of the multifd receive process.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> However, this will change soon so the multifd receive code has to be
>>>>>>>>>>>>>>>> prepared to not return an error on such premature TLS session EOF.
>>>>>>>>>>>>>>>> Use the newly introduced QIOChannelTLS method for that.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> It's worth noting that even if the sender were to be changed to terminate
>>>>>>>>>>>>>>>> the TLS connection properly the receive side still needs to remain
>>>>>>>>>>>>>>>> compatible with older QEMU bit stream which does not do this.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> If this is an existing bug, we could add a Fixes.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> It is an existing issue but only uncovered by this patch set.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> As far as I can see it was always there, so it would need some
>>>>>>>>>>>>>> thought where to point that Fixes tag.
>>>>>>>>>>>>>
>>>>>>>>>>>>> If there's no way to trigger a real functional bug anyway, it's also ok we
>>>>>>>>>>>>> omit the Fixes.
>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Two pure questions..
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>         - What is the correct way to terminate the TLS session without this flag?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I guess one would need to call gnutls_bye() like in this GnuTLS example:
>>>>>>>>>>>>>> https://gitlab.com/gnutls/gnutls/-/blob/2b8c3e4c71ad380bbbffb32e6003b34ecad596e3/doc/examples/ex-client-anon.c#L102
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>         - Why this is only needed by multifd sessions?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> What uncovered the issue was switching the load threads to using
>>>>>>>>>>>>>> migrate_set_error() instead of their own result variable
>>>>>>>>>>>>>> (load_threads_ret) which you had requested during the previous
>>>>>>>>>>>>>> patch set version review:
>>>>>>>>>>>>>> https://lore.kernel.org/qemu-devel/Z1DbH5fwBaxtgrvH@x1n/
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Turns out that the multifd receive code always returned an
>>>>>>>>>>>>>> error in the TLS case, just nothing was previously checking for
>>>>>>>>>>>>>> that error presence.
>>>>>>>>>>>>>
>>>>>>>>>>>>> What I was curious is whether this issue also exists for the main migration
>>>>>>>>>>>>> channel when with tls, especially when e.g. multifd not enabled at all.  As
>>>>>>>>>>>>> I don't see anywhere that qemu uses gnutls_bye() for any tls session.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I think it's good to find that we overlooked this before.. and IMHO it's
>>>>>>>>>>>>> always good we could fix this.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Does it mean we need proper gnutls_bye() somewhere?
>>>>>>>>>>>>>
>>>>>>>>>>>>> If we need an explicit gnutls_bye(), then I wonder if that should be done
>>>>>>>>>>>>> on the main channel as well.
>>>>>>>>>>>>
>>>>>>>>>>>> That's a good question and looking at the code qemu_loadvm_state_main() exits
>>>>>>>>>>>> on receiving "QEMU_VM_EOF" section (that's different from receiving socket EOF)
>>>>>>>>>>>> and then optionally "QEMU_VM_VMDESCRIPTION" section is read with explicit size
>>>>>>>>>>>> in qemu_loadvm_state() - so still not until channel EOF.
>>>>>>>>>>>
>>>>>>>>>>> I had a closer look, I do feel like such pre-mature termination is caused
>>>>>>>>>>> by explicit shutdown()s of the iochannels, looks like that can cause issue
>>>>>>>>>>> even after everything is sent.  Then I noticed indeed multifd sender
>>>>>>>>>>> iochannels will get explicit shutdown()s since commit 077fbb5942, while we
>>>>>>>>>>> don't do that for the main channel.  Maybe that is a major difference.
>>>>>>>>>>>
>>>>>>>>>>> Now I wonder whether we should shutdown() the channel at all if migration
>>>>>>>>>>> succeeded, because looks like it can cause tls session to interrupt even if
>>>>>>>>>>> the shutdown() is done after sent everything, and if so it'll explain why
>>>>>>>>>>> you hit the issue with tls.
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Then I can't see anything else reading the channel until it is closed in
>>>>>>>>>>>> migration_incoming_state_destroy().
>>>>>>>>>>>>
>>>>>>>>>>>> So most likely the main migration channel will never read far enough to
>>>>>>>>>>>> reach that GNUTLS_E_PREMATURE_TERMINATION error.
>>>>>>>>>>>>
>>>>>>>>>>>>> If we don't need gnutls_bye(), then should we always ignore pre-mature
>>>>>>>>>>>>> termination of tls no matter if it's multifd or non-multifd channel (or
>>>>>>>>>>>>> even a tls session that is not migration-related)?
>>>>>>>>>>>>
>>>>>>>>>>>> So basically have this patch extended to calling
>>>>>>>>>>>> qio_channel_tls_set_premature_eof_okay() also on the main migration channel?
>>>>>>>>>>>
>>>>>>>>>>> If above theory can stand, then eof-okay could be a workaround papering
>>>>>>>>>>> over the real problem that we shouldn't always shutdown()..
>>>>>>>>>>>
>>>>>>>>>>> Could you have a look at below patch and see whether it can fix the problem
>>>>>>>>>>> you hit too, in place of these two patches (including the previous
>>>>>>>>>>> iochannel change)?
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Unfortunately, the patch below does not fix the problem:
>>>>>>>>>>> qemu-system-x86_64: Cannot read from TLS channel: The TLS connection was non-properly terminated.
>>>>>>>>>>> qemu-system-x86_64: Cannot read from TLS channel: The TLS connection was non-properly terminated.
>>>>>>>>>>
>>>>>>>>>> I think that, even in the absence of shutdown(), if the sender does not
>>>>>>>>>> call gnutls_bye() the TLS session is considered improperly terminated.
>>>>>>>>>
>>>>>>>>> Ah..
>>>>>>>>>
>>>>>>>>> How about one more change on top of above change to disconnect properly for
>>>>>>>>> TLS?  Something like gnutls_bye() in qio_channel_tls_close(), would that
>>>>>>>>> make sense to you?
>>>>>>>>
>>>>>>>> Calling gnutls_bye from qio_channel_tls_close is not viable for the
>>>>>>>> API contract of qio_channel_close. gnutls_bye needs to be able to
>>>>>>>> perform I/O, which means we need to be able to tell the caller
>>>>>>>> whether it needs to perform an event loop wait for POLLIN or POLLOUT.
>>>>>>>>
>>>>>>>> This is the same API design scenario as the gnutls_handshake method.
>>>>>>>> As such I don't think it is practical to abstract it inside any
>>>>>>>> existing QIOChannel API call, it'll have to be standalone like
>>>>>>>> qio_channel_tls_handshake() is.
>>>>>>>>
>>>>>>>
>>>>>>> I implemented the call to gnutls_bye:
>>>>>>> https://gitlab.com/farosas/qemu/-/commits/migration-tls-bye
>>>>>>>
>>>>>>> Then while testing it I realised we actually have a regression from 9.2:
>>>>>>>
>>>>>>> 1d457daf86 ("migration/multifd: Further remove the SYNC on complete")
>>>>>>>
>>>>>>> It seems that patch somehow affected the ordering between src shutdown
>>>>>>> vs. recv shutdown and now the recv channels are staying around to see
>>>>>>> the connection being broken. Or something... I'm still looking into it.
>>>>>>>
>>>>>>
>>>>>> Ok, so the issue is that the recv side would previously be stuck at the
>>>>>> sync semaphore and multifd_recv_terminate_threads() would kick it only
>>>>>> after 'exiting' was set, so no further recv() would happen.
>>>>>>
>>>>>> After the patch, there's no final sync anymore, so the recv thread loops
>>>>>> around and waits at the recv() until multifd_send_terminate_threads()
>>>>>> closes the connection.
>>>>>>
>>>>>> Waiting on sem_sync as before would lead to a cleaner termination
>>>>>> process IMO, but I don't think it's worth the extra complexity of
>>>>>> introducing a sync to the device state migration.
>>>>>>
>>>>>> So I think we'll have to go with one of the approaches suggested on this
>>>>>> thread (gnutls_bye or premature_ok). I'm fine either way, but let's make
>>>>>> sure we add a reference to the patch above and some words explaining the
>>>>>> situation.
>>>>>
>>>>> We still need premature_ok for handling older QEMU versions that do not
>>>>> terminate the TLS stream correctly since the TLS test regression happens
>>>>> even without device state transfer being enabled.
>>>>
>>>> What exactly is the impact of this issue to the device state series?
>>>>   From the cover letter:
>>>>
>>>>     * qemu_loadvm_load_thread_pool now reports error via migrate_set_error()
>>>>     instead of dedicated load_threads_ret variable.
>>>>
>>>>     * Since the change above uncovered an issue with respect to multifd send
>>>>     channels not terminating TLS session properly QIOChannelTLS now allows
>>>>     gracefully handling this situation.
>>>>
>>>> I understand qemu_loadvm_load_thread_pool() is attempting to use
>>>> migrate_set_error() but an error is already set by the recv_thread. Is
>>>> that the issue?
>>>
>>> Yes, when we test for a load threads error in the TLS case we see that
>>> multifd TLS error instead.
>>> We need to know whether the load threads succeeded so we can either
>>> continue with the migration or abort it.
>>>
>>>> I wonder if we could somehow isolate this so it doesn't
>>>> impact this series.
>>>
>>> The previous version simply used a dedicated load_threads_ret variable
>>> so it wasn't affected but Peter likes migrate_set_error() more for
>>> migration thread pools.
>>>    
>>>>>
>>>>> So I think that's what we should use generally.
>>>>>     
>>>>
>>>> For premature_ok, we need to make sure it will not hang QEMU if the
>>>> connection gets unexpectedly closed. The current code checks for
>>>> shutdown() having already happened, which is fine because it means we're
>>>> already on the way out. However, if any ol' recv() can now ignore a
>>>> premature termination error, then the recv_thread will not trigger
>>>> cleanup of the multifd_recv threads.
>>>
>>> Enabling premature_ok just turns the GNUTLS_E_PREMATURE_TERMINATION
>>> error on EOF into a normal EOF.
>>>
>>> The receive thread will exit on either one:
>>>>            if (ret == 0 || ret == -1) {   /* 0: EOF  -1: Error */
>>>>                  break;
>>>>              }
>>>
>>> It's true that multifd_recv_terminate_threads() will only be called
>>> by multifd_recv_cleanup() or multifd_recv_shutdown() in this case,
>>> however this is already the case for non-TLS migration.
>>>
>> 
>> My point is that a premature termination sets local_err and a
>> premature_ok==true doesn't. 
>
> * Only in the TLS case - the non-TLS case doesn't set any error even with
> premature_ok==true (assuming there hasn't been any other error during
> receive).
>
> * If there *has* been any other error during receive then it will be set
> and the code flow will be the same even with premature_ok==true.
>

Sure, I'm not implying the change affects non-TLS. I'm just arguing that
non-TLS behavior should not be taken into consideration because it
doesn't ignore any errors anyway.

The whole (and only) point I'm making is what happens when
e.g. multifd_send shuts down the connection prematurely due to a
bug. IIUC, premature_ok would make that be treated as normal EOF in
multifd_recv and that is a problem because any thread that sees an error
should terminate all others instead of just exiting.

Currently, GNUTLS_E_PREMATURE_TERMINATION doesn't abort migration, but
it does cause multifd_recv_terminate_threads() to be executed. The
change from this series will make it so that
GNUTLS_E_PREMATURE_TERMINATION never leads to
multifd_recv_terminate_threads(), while the correct behavior would be to
trigger cleanup always, except for the very last recv().

>> So it's not the same as non-TLS migration
>> because there we don't have a way to ignore any errors.
>
> The GNUTLS_E_PREMATURE_TERMINATION error can't happen in the non-TLS case
> so by definition we can't ignore it in the non-TLS case.
>
> And we don't ignore any other error in the TLS/non-TLS case.
>
>> Multifd recv threads can't discern an EOF in the middle of the migration
>> from an EOF after all data has been received. The former is definitely
>> an error and should cause migration to abort, multifd threads to
>> cleanup, etc.
>
> So in this case we should set the QIO_CHANNEL_READ_RELAXED_EOF flag on
> the multifd channel recv thread main loop only, and let the
> mid-stream page receive methods report GNUTLS_E_PREMATURE_TERMINATION
> as usual.
>

The multifd recv loop is where I don't want to have RELAXED_EOF, except
for the last recv() call - which of course we can't differentiate unless
we use something like gnutls_bye() or MULTIFD_FLAG_EOS as you suggest
below.

> This makes the TLS case work the same with respect to premature
> EOF as the non-TLS case since the non-TLS case can't detect premature
> EOF in the multifd channel recv thread main loop either.
>
>>> So if there was a bug with multifd threads shutdown it would have
>>> already been manifesting on the non-TLS migration.
>>>
>> 
>> Even if non-TLS behaved the same, why would we choose to port a bug to
>> the TLS implementation?
>> 
>> We could of course decide at this point to punt the problem forward and
>> when it shows up, we'd have to go implement gnutls_bye() to allow the
>> distinction between good-EOF/bad-EOF or maybe add an extra sync at the
>> end of migration to make sure the last recv() call is only started after
>> the source has already shutdown() the channel.
>
> If we do some kind of a premature EOF detection then it should probably
> work for the non-TLS case too (since that's probably by far the most
> common use case).
> So adding some MULTIFD_FLAG_EOS would make the most sense and would work
> even with QIO_CHANNEL_READ_RELAXED_EOF being set.
>

Indeed, if MULTIFD_FLAG_EOS is not seen, the recv thread could treat any
EOF as an error. The question is whether we can add that without
disrupting multifd synchronization too much.

> In any case we'd still need some kind of a compatibility behavior for
> the TLS bit stream emitted by older QEMU versions (which is always
> improperly terminated).
>

There is no compat issue. For <= 9.2, QEMU is still doing an extra
multifd_send_sync_main(), which results in an extra MULTIFD_FLAG_SYNC on
the destination and it gets stuck waiting for the
RAM_SAVE_FLAG_MULTIFD_FLUSH that never comes. Therefore the src always
closes the connection before dst reaches the extra recv().

I tested migration both ways with the 2 previous QEMU versions and the
gnutls_bye() series passes all tests. I also put an assert in
tlssession.c and it never triggers for GNUTLS_E_PREMATURE_TERMINATION.
MULTIFD_FLAG_EOS should behave the same.


^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels
  2025-02-06 17:32                                 ` Fabiano Rosas
@ 2025-02-06 17:55                                   ` Maciej S. Szmigiero
  2025-02-06 21:51                                   ` Peter Xu
  1 sibling, 0 replies; 137+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-06 17:55 UTC (permalink / raw)
  To: Fabiano Rosas, Daniel P. Berrangé, Peter Xu
  Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Avihai Horon, Joao Martins, qemu-devel

On 6.02.2025 18:32, Fabiano Rosas wrote:
> "Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:
> 
>> On 6.02.2025 16:20, Fabiano Rosas wrote:
>>> "Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:
>>>
(..)
>>> Even if non-TLS behaved the same, why would we choose to port a bug to
>>> the TLS implementation?
>>>
>>> We could of course decide at this point to punt the problem forward and
>>> when it shows up, we'd have to go implement gnutls_bye() to allow the
>>> distinction between good-EOF/bad-EOF or maybe add an extra sync at the
>>> end of migration to make sure the last recv() call is only started after
>>> the source has already shutdown() the channel.
>>
>> If we do some kind of a premature EOF detection then it should probably
>> work for the non-TLS case too (since that's probably by far the most
>> common use case).
>> So adding some MULTIFD_FLAG_EOS would make the most sense and would work
>> even with QIO_CHANNEL_READ_RELAXED_EOF being set.
>>
> 
> Indeed, if MULTIFD_FLAG_EOS is not seen, the recv thread could treat any
> EOF as an error. The question is whether we can add that without
> disrupting multifd synchronization too much.
> 
>> In any case we'd still need some kind of a compatibility behavior for
>> the TLS bit stream emitted by older QEMU versions (which is always
>> improperly terminated).
>>
> 
> There is no compat issue. For <= 9.2, QEMU is still doing an extra
> multifd_send_sync_main(), which results in an extra MULTIFD_FLAG_SYNC on
> the destination and it gets stuck waiting for the
> RAM_SAVE_FLAG_MULTIFD_FLUSH that never comes. Therefore the src always
> closes the connection before dst reaches the extra recv().
> 
> I tested migration both ways with the 2 previous QEMU versions and the
> gnutls_bye() series passes all tests. I also put an assert in
> tlssession.c and it never triggers for GNUTLS_E_PREMATURE_TERMINATION.
> MULTIFD_FLAG_EOS should behave the same.

If you are confident that properly terminating the TLS session by adding
gnutls_bye() is the way to go then I'm fine with this - I hope @Peter
and @Daniel are too.

It's now only a matter of how soon you can get these gnutls_bye() patches
posted/merged, since if I drop the premature_ok stuff the updated series
will depend on them to pass the TLS tests.

Thanks,
Maciej




* Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels
  2025-02-06 17:32                                 ` Fabiano Rosas
  2025-02-06 17:55                                   ` Maciej S. Szmigiero
@ 2025-02-06 21:51                                   ` Peter Xu
  2025-02-07 13:17                                     ` Fabiano Rosas
  1 sibling, 1 reply; 137+ messages in thread
From: Peter Xu @ 2025-02-06 21:51 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: Maciej S. Szmigiero, Alex Williamson, Daniel P. Berrangé,
	Cédric Le Goater, Eric Blake, Markus Armbruster,
	Avihai Horon, Joao Martins, qemu-devel

On Thu, Feb 06, 2025 at 02:32:12PM -0300, Fabiano Rosas wrote:
> > In any case we'd still need some kind of a compatibility behavior for
> > the TLS bit stream emitted by older QEMU versions (which is always
> > improperly terminated).
> >
> 
> There is no compat issue. For <= 9.2, QEMU is still doing an extra
> multifd_send_sync_main(), which results in an extra MULTIFD_FLAG_SYNC on
> the destination and it gets stuck waiting for the
> RAM_SAVE_FLAG_MULTIFD_FLUSH that never comes. Therefore the src always
> closes the connection before dst reaches the extra recv().
> 
> I tested migration both ways with the 2 previous QEMU versions and the
> gnutls_bye() series passes all tests. I also put an assert in
> tlssession.c and it never triggers for GNUTLS_E_PREMATURE_TERMINATION.
> MULTIFD_FLAG_EOS should behave the same.

Which versions did you try?  Only 9.1 and 9.2 have 637280aeb2, so I
wonder if the same issue would also hit with 9.0 or older.

I confess relying on the side effect of 637280aeb2 feels unreliable,
because fundamentally it works based on the fact that multifd threads need
to be kicked out by the main load thread's SYNC event on dest QEMU to keep
the readv() from going wrong.

What I'm not sure about here is: is it sheer luck that the main channel SYNC
will always arrive _before_ the premature terminations of the multifd
channels?  It sounds like it could also happen that a multifd channel gets
its premature termination early, before the main thread got the SYNC.

Maybe we still need a compat property at the end..

-- 
Peter Xu




* Re: [PATCH v4 07/33] io: tls: Allow terminating the TLS session gracefully with EOF
  2025-02-04 18:25         ` Maciej S. Szmigiero
@ 2025-02-06 21:53           ` Peter Xu
  0 siblings, 0 replies; 137+ messages in thread
From: Peter Xu @ 2025-02-06 21:53 UTC (permalink / raw)
  To: Maciej S. Szmigiero
  Cc: Daniel P. Berrangé, Fabiano Rosas, Alex Williamson,
	Cédric Le Goater, Eric Blake, Markus Armbruster,
	Avihai Horon, Joao Martins, qemu-devel

On Tue, Feb 04, 2025 at 07:25:12PM +0100, Maciej S. Szmigiero wrote:
> That's for the multifd channel recv thread main loop only, if @Peter
> wants to patch also the mid-stream page receive methods and the main
> migration channel receive then qio_channel_read(), qio_channel_read_all(),
> qio_channel_readv_all() and qio_channel_readv_full_all() would need
> such treatment too.

No matter which way we go with the multifd part - I'm OK with ignoring the
main channel completely if it's not prone to any premature terminations anyway.

Thanks,

-- 
Peter Xu




* Re: [PATCH v4 19/33] migration: Add save_live_complete_precopy_thread handler
  2025-02-06 11:41             ` Maciej S. Szmigiero
@ 2025-02-06 22:16               ` Peter Xu
  0 siblings, 0 replies; 137+ messages in thread
From: Peter Xu @ 2025-02-06 22:16 UTC (permalink / raw)
  To: Maciej S. Szmigiero
  Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P. Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

On Thu, Feb 06, 2025 at 12:41:50PM +0100, Maciej S. Szmigiero wrote:
> On 5.02.2025 16:55, Peter Xu wrote:
> > On Wed, Feb 05, 2025 at 12:53:21PM +0100, Maciej S. Szmigiero wrote:
> > > On 4.02.2025 21:34, Peter Xu wrote:
> > > > On Tue, Feb 04, 2025 at 08:32:15PM +0100, Maciej S. Szmigiero wrote:
> > > > > On 4.02.2025 18:54, Peter Xu wrote:
> > > > > > On Thu, Jan 30, 2025 at 11:08:40AM +0100, Maciej S. Szmigiero wrote:
> > > > > > > +static int multifd_device_state_save_thread(void *opaque)
> > > > > > > +{
> > > > > > > +    struct MultiFDDSSaveThreadData *data = opaque;
> > > > > > > +    int ret;
> > > > > > > +
> > > > > > > +    ret = data->hdlr(data->idstr, data->instance_id, &send_threads_abort,
> > > > > > > +                     data->handler_opaque);
> > > > > > 
> > > > > > I thought we discussed somewhere and the plan was we could use Error** here
> > > > > > to report errors.  Would that still make sense, or maybe I lost some
> > > > > > context?
> > > > > 
> > > > > That was about *load* threads, here these are *save* threads.
> > > > 
> > > > Ah OK.
> > > > 
> > > > > 
> > > > > Save handlers do not return an Error value - neither save_live_iterate, nor
> > > > > save_live_complete_precopy, nor save_state does so.
> > > > 
> > > > Let's try to make new APIs work with Error* if possible.
> > > 
> > > Let's assume that these threads return an Error object.
> > > 
> > > What's qemu_savevm_state_complete_precopy_iterable() supposed to do with it?
> > 
> > IIUC it's not about qemu_savevm_state_complete_precopy_iterable() in this
> > context, as the Error* will be used in one of the thread of the pool, not
> > migration thread.
> > 
> > The goal is to be able to set Error* with migrate_set_error(), so that when
> > migration failed, query-migrate can return the error to libvirt, so
> > migration always tries to remember the 1st error hit if ever possible.
> > 
> > It's multifd_device_state_save_thread() to do migrate_set_error(), not in
> > migration thread.  qemu_savevm_state_complete_*() are indeed not ready to
> > pass Errors, but it's not in the discussed stack.
> 
> I understand what you are proposing now - you haven't written about using
> migrate_set_error() for save threads earlier, just about returning an Error
> object.
> 
> While this might work, it has a tendency to uncover errors in other parts of
> the migration core - much as using it in the load threads case uncovered
> the TLS session error.

Yes, hitting the TLS issue is unfortunate, thanks for finding the bug.  I'm
ok if we have any workaround for that that is easily revertable, then we
fix it later.  Or if Fabiano's series can land earlier.  To me it seems
easier if you stick with Fabiano's series and focus on the rest of the patches.

If we hit another bug, it's more unfortunate, but imho there's not much we
can do but try to fix all of them..

> 
> (Speaking of which, could you please respond to the issue at the bottom of
> this message from 2 days ago?:
> https://lore.kernel.org/qemu-devel/150a9741-daab-4724-add0-f35257e862f9@maciej.szmigiero.name/
> It is blocking rework of the TLS session EOF handling in this patch set.
> Thanks.)
> 
> But I can try this migrate_set_error() approach here and see if something
> breaks.
> 
> (..)
> > > > > 
> > > > > > Meanwhile, I still feel uneasy on having these globals (send_threads_abort,
> > > > > > send_threads_ret).  Can we make MultiFDDSSaveThreadData the only interface
> > > > > > between migration and the threads impl?  So I wonder if it can be:
> > > > > > 
> > > > > >      ret = data->hdlr(data);
> > > > > > 
> > > > > > With extended struct like this (I added thread_error and thread_quit):
> > > > > > 
> > > > > > struct MultiFDDSSaveThreadData {
> > > > > >        SaveLiveCompletePrecopyThreadHandler hdlr;
> > > > > >        char *idstr;
> > > > > >        uint32_t instance_id;
> > > > > >        void *handler_opaque;
> > > > > >        /*
> > > > > >         * Should be NULL when struct passed over to thread, the thread should
> > > > > >         * set this if the handler would return false.  It must be kept NULL if
> > > > > >         * the handler returned true / success.
> > > > > >         */
> > > > > >        Error *thread_error;
> > > > > 
> > > > > As I mentioned above, these handlers do not generally return Error type,
> > > > > so this would need to be an *int;
> > > > > 
> > > > > >        /*
> > > > > >         * Migration core would set this when it wants to notify thread to
> > > > > >         * quit, for example, when an error occurred in other threads, or migration is
> > > > > >         * cancelled by the user.
> > > > > >         */
> > > > > >        bool thread_quit;
> > > > > 
> > > > >               ^ I guess that was supposed to be a pointer too (*thread_quit).
> > > > 
> > > > It's my intention to make this bool, to make everything managed per-thread.
> > > 
> > > But that's unnecessary since this flag is common to all these threads.
> > 
> > One bool would be enough, but you'll need to export another API for VFIO to
> > use otherwise.  I suppose that's ok too.
> > 
> > Some context of multifd threads and how that's done there..
> > 
> > We started with one "quit" per thread struct, but then we switched to one
> > bool exactly as you said, see commit 15f3f21d598148.
> > 
> > If you want to stick with one bool, it's okay too, you can export something
> > similar in misc.h, e.g. multifd_device_state_save_thread_quitting(), then
> > we can avoid passing in the "quit" either as handler parameter, or
> > per-thread flag.
> 
> Of course I can "export" this flag via a getter function rather than passing
> it as a parameter to SaveLiveCompletePrecopyThreadHandler.
> 
> > > 
> > > > It's actually what we do with multifd, these are a bunch of extra threads
> > > > to differentiate from the "IO threads" / "multifd threads".
> > > > 
> > > > > 
> > > > > > };
> > > > > > 
> > > > > > Then if any multifd_device_state_save_thread() failed, for example, it
> > > > > > should notify all threads to quit by setting thread_quit, instead of
> > > > > > relying on yet another global variable to show migration needs to quit.
> > > > > 
> > > > > multifd_abort_device_state_save_threads() needs to access
> > > > > send_threads_abort too.
> > > > 
> > > > This may need to become something like:
> > > > 
> > > >     QLIST_FOREACH() {
> > > >         MultiFDDSSaveThreadData *data = ...;
> > > >         data->thread_quit = true;
> > > >     }
> > > 
> > > At the most basic level that's turning an O(1) operation into O(n).
> > > 
> > > Besides, it creates a question of who now owns these MultiFDDSSaveThreadData
> > > structures - they could be owned by either thread pool or the
> > > multifd_device_state code.
> > 
> > I think it should be owned by migration, and with this idea they will need
> > to be there until the thread pool completes its work, so the migration
> > core needs to free them.
> > 
> > > 
> > > Currently the ownership is simple - the multifd_device_state code
> > > allocates such per-thread structure in multifd_spawn_device_state_save_thread()
> > > and immediately passes its ownership to the thread pool which
> > > takes care to free it once it no longer needs it.
> > 
> > Right, this is another reason why I think having migration owning these
> > structs is better.  We used to have dangling task issues when we shifted
> > ownership of something to the mainloop and then lost track of it (e.g. on TLS
> > handshake gsources).  Those are pretty hard to debug when hung, because the
> > migration core has nothing left linking it to the hung tasks.
> > 
> > I think we should start from having migration core being able to reach
> > these thread-based tasks when needed.  Migration also have control of the
> > thread pool, then it would be easier.  The thread pool is so far simple, so we
> > may still need to be able to reference per-task info separately.
> 
> These are separate threads, so they are are pretty easy to identify
> in a debugger or a core dump.
> 
> Also, one can access them via the thread pool pointer if absolutely
> necessary.
> 
> If QMP introspection ever becomes necessary then it could be simply
> built into the generic thread pool itself.
> Then all thread pool consumers will benefit from it.

Having full control of the assigned tasks makes sure it'll be manageable even
if num_tasks > num_threads.  And it's manageable when another thread wants
to do anything with a task.  I know it's always tasks==threads for now but
I think we shouldn't assume that when designing the API.

If we have one bool showing "we need to quit all tasks", then indeed I
don't see a major concern with not maintaining per-task infos.  In
qmp_cancel logically we should set that bool.

To me, I want to avoid any form of unnecessary struggle when e.g. the main
thread wants to access the tasks, like UAFs racing with the thread pool
freeing a task info struct or things like that.  For now, I'm OK not having it.
The bool should work.  We can leave that for later.

> 
> > > 
> > > Now, with the list implementation if the thread pool were to free
> > > that MultiFDDSSaveThreadData it would also need to release it from
> > > the list.
> > > 
> > > Which in turn would need appropriate locking around this removal
> > > operation and probably also each time the list is iterated over.
> > > 
> > > On the other hand if the multifd_device_state code were to own
> > > that MultiFDDSSaveThreadData then it would linger around until
> > > multifd_device_state_send_cleanup() cleans it up even though its
> > > associated thread might be long gone.
> > 
> > Do you see a problem with it?  It sounds good to me actually.. and pretty
> > easy to understand.
> > 
> > So migration creates these MultiFDDSSaveThreadData, then create threads to
> > enqueue then, then wait for all threads to complete, then free these
> > structs.
> 
> One of the benefits of using a thread pool is that it can abstract
> memory management away by taking ownership of the data pointed to by
> the passed thread opaque pointer (via the passed GDestroyNotify).
> 
> I don't see a benefit of re-implementing this also in the migration
> code (returning an Error object does *not* require such approach).
> 
> > > 
> > > > We may want to double check qmp 'migrate_cancel' will work when save
> > > > threads are running, but this can also be done for later.
> > > 
> > > > > 
> > > > > And multifd_join_device_state_save_threads() needs to access
> > > > > send_threads_ret.
> > > > 
> > > > Then this one becomes:
> > > > 
> > > >     thread_pool_wait(send_threads);
> > > >     QLIST_FOREACH() {
> > > >         MultiFDDSSaveThreadData *data = ...;
> > > >         if (data->thread_error) {
> > > >            return false;
> > > >         }
> > > >     }
> > > >     return true;
> > > 
> > > Same here, having a common error return would save us from having
> > > to iterate over a list (or having a list in the first place).
> > 
> > IMHO perf isn't an issue here. It's a slow path, the thread count is small,
> > and the loop is cheap.  I prefer to prioritize cleanliness in this case.
> > 
> > Otherwise any suggestion we could report an Error* in the threads?
> 
> Using Error doesn't need a list; load threads return an Error object
> just fine without it:
> >     if (!data->function(data->opaque, &mis->load_threads_abort, &local_err)) {
> >         MigrationState *s = migrate_get_current();
> > 
> >         assert(local_err);
> > 
> >         /*
> >          * In case of multiple load threads failing, which thread's error
> >          * we end up setting is purely arbitrary.
> >          */
> >         migrate_set_error(s, local_err);
> >     }
> > 
> 
> Same can be done for save threads here (with the caveat of migrate_set_error()
> uncovering possible other errors that I mentioned earlier).
> 
> > > 
> > > > > 
> > > > > These variables ultimately will have to be stored somewhere since
> > > > > there can be multiple save threads and so multiple instances of
> > > > > MultiFDDSSaveThreadData.
> > > > > 
> > > > > So these need to be stored somewhere where
> > > > > multifd_spawn_device_state_save_thread() can reach them to assign
> > > > > their addresses to MultiFDDSSaveThreadData members.
> > > > 
> > > > Then multifd_spawn_device_state_save_thread() will need to manage the
> > > > qlist, making sure the migration core remembers what jobs it submitted.  It
> > > > sounds good to have that bookkeeping when I think about it, instead of
> > > > throwing the job to the thread pool and forgetting it..
> > > 
> > > It's not "forgetting" about the job but rather letting the thread pool
> > > manage it - I think the thread pool was introduced so that these details
> > > (thread management) are abstracted away from the migration code.
> > > Now they would be effectively duplicated in the migration code.
> > 
> > Migration is still managing those as long as you have send_threads_abort,
> > isn't it?  The thread pool doesn't yet have an API to say "let's quit all
> > the tasks", otherwise I'm OK with using the pool API instead of having
> > thread_quit.
> 
> The migration code does not manage each thread separately.
> 
> It manages them as a pool, and does each operation (wait, abort)
> on the pool itself (either literally via ThreadPool or by setting
> a variable that's shared by all threads).
> 
> > > 
> > > > > 
> > > > > However, at that point multifd_device_state_save_thread() can
> > > > > access them too so it does not need to have them passed via
> > > > > MultiFDDSSaveThreadData.
> > > > > 
> > > > > However, nothing prevents putting send_threads* variables
> > > > > into a global struct (with internal linkage - "static", just as
> > > > > these separate ones are) if you like such a construct more.
> > > > 
> > > > This would indeed be better than the current global vars, but less
> > > > favoured if the per-thread way above could work.
> > > 
> > > You still need that list to be a global variable,
> > > so it's the same amount of global variables as just putting
> > > the existing variables in a struct (which could be even allocated
> > > in multifd_device_state_send_setup() and deallocated in
> > > multifd_device_state_send_cleanup() for extra memory savings).
> > 
> > Yes this works for me.
> > 
> > I think you got me wrong on "not allowing to introduce global variables".
> > I'm OK with it, but please still consider..
> > 
> >    - Put it under some existing global object rather than having separate
> >      global variables all over the places..
> > 
> >    - Having Error reports
> 
> Ok.
> 
> > And I still think we can change:
> > 
> > typedef int (*SaveLiveCompletePrecopyThreadHandler)(char *idstr,
> >                                                      uint32_t instance_id,
> >                                                      bool *abort_flag,
> >                                                      void *opaque);
> > 
> > To:
> > 
> > typedef int (*SaveLiveCompletePrecopyThreadHandler)(MultiFDDSSaveThreadData*);
> > 
> > No matter what.
> 
> We can do that, although this requires "exporting" the MultiFDDSSaveThreadData
> type.

I'm OK with that.

I think it's fair to export it because that's the interface for code
outside of migration.

> 
> > > 
> > > These variables are having internal linkage limited to (relatively
> > > small) multifd-device-state.c, so it's not like they are polluting
> > > namespace in some major migration translation unit.
> > 
> > If someone proposes to introduce 100 global vars in multifd-device-state.c,
> > I'll strongly stop that.
> > 
> > If it's one global var, I'm OK.
> > 
> > What if it's 5?
> > 
> > ===8<===
> > static QemuMutex queue_job_mutex;
> > 
> > static ThreadPool *send_threads;
> > static int send_threads_ret;
> > static bool send_threads_abort;
> > 
> > static MultiFDSendData *device_state_send;
> > ===8<===
> > 
> > I think I should start calling a stop.  That's what happened..
> > 
> > Please consider introducing something like multifd_send_device_state so we
> > can avoid anyone in the future randomly add static global vars.
> 
> As I wrote before, I will pack it all into one global variable,
> could be called multifd_send_device_state as you suggest.
> 
> > > 
> > > Taking into consideration having to manage an extra data structure
> > > (list), needing more code to do so, and having worse algorithms, I don't
> > > really see the point of using that list.
> > > 
> > > (This is orthogonal to whether the thread return type is changed to
> > > Error which could be easily done on the existing save threads pool
> > > implementation).
> > 
> > My bet is changing to a list is easy (10-20 LOC?).  If not, I can try to
> > provide the diff on top of your patch.
> > 
> > I'm also not strictly asking for a list, but anything that makes the API
> > cleaner (less globals, better error reports, etc.).
> 
> I just think introducing that list is a step back due to the reasons I
> described above.
> 
> And it's not actually necessary for returning an Error code.

I'm ok with having no list, as long as the rest are addressed.

Thanks,

-- 
Peter Xu




* Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels
  2025-02-06 21:51                                   ` Peter Xu
@ 2025-02-07 13:17                                     ` Fabiano Rosas
  2025-02-07 14:04                                       ` Peter Xu
  0 siblings, 1 reply; 137+ messages in thread
From: Fabiano Rosas @ 2025-02-07 13:17 UTC (permalink / raw)
  To: Peter Xu
  Cc: Maciej S. Szmigiero, Alex Williamson, Daniel P. Berrangé,
	Cédric Le Goater, Eric Blake, Markus Armbruster,
	Avihai Horon, Joao Martins, qemu-devel

Peter Xu <peterx@redhat.com> writes:

> On Thu, Feb 06, 2025 at 02:32:12PM -0300, Fabiano Rosas wrote:
>> > In any case we'd still need some kind of a compatibility behavior for
>> > the TLS bit stream emitted by older QEMU versions (which is always
>> > improperly terminated).
>> >
>> 
>> There is no compat issue. For <= 9.2, QEMU is still doing an extra
>> multifd_send_sync_main(), which results in an extra MULTIFD_FLAG_SYNC on
>> the destination and it gets stuck waiting for the
>> RAM_SAVE_FLAG_MULTIFD_FLUSH that never comes. Therefore the src always
>> closes the connection before dst reaches the extra recv().
>> 
>> I tested migration both ways with the 2 previous QEMU versions and the
>> gnutls_bye() series passes all tests. I also put an assert in
>> tlssession.c and it never triggers for GNUTLS_E_PREMATURE_TERMINATION.
>> MULTIFD_FLAG_EOS should behave the same.
>
> Which versions did you try?  Only 9.1 and 9.2 have 637280aeb2, so I
> wonder if the same issue would also hit with 9.0 or older.

Good point. 9.0 indeed breaks.

>
> I confess relying on the side effect of 637280aeb2 feels unreliable,
> because fundamentally it works based on the fact that multifd threads need
> to be kicked out by the main load thread's SYNC event on dest QEMU to keep
> the readv() from going wrong.
>

We're relying on the opposite: multifd_recv NOT getting kicked, which is
a bug that 1d457daf86 fixed.

> What I'm not sure about here is: is it sheer luck that the main channel SYNC
> will always arrive _before_ the premature terminations of the multifd
> channels?  It sounds like it could also happen that a multifd channel gets
> its premature termination early, before the main thread got the SYNC.

You lost me here, what main channel sync? It's the MULTIFD_FLAG_SYNC that
puts the recv thread in the "won't see the termination" state and that
is serialized:

   SEND                        RECV
   -------------------------+----------------------------
1  multifd_send_sync_main()
2  pending_sync==true,
3  send thread sends SYNC      recv thread gets SYNC
4  <some work>                 recv gets stuck.
5  multifd_send_shutdown()     <time passes>
6  shutdown()                  multifd_recv_shutdown()
                               recv_terminate_threads()
                               recv exits without recv()

In other words, RECV would need to see the shutdown (6) before the SYNC
(3), which I don't think is possible.

>
> Maybe we still need a compat property at the end..

This is actually similar to preempt_pre_7_2; what about:

    /*
     * This variable only makes sense when set on the machine that is
     * the destination of a multifd migration with TLS enabled. It
     * affects the behavior of the last send->recv iteration with
     * regards to termination of the TLS session. Defaults to true.
     *
     * When set:
     *
     * - the destination QEMU instance can expect to never get a
     *   GNUTLS_E_PREMATURE_TERMINATION error. Manifested as the error
     *   message: "The TLS connection was non-properly terminated".
     *
     * When clear:
     *
     * - the destination QEMU instance can expect to see a
     *   GNUTLS_E_PREMATURE_TERMINATION error in any multifd channel
     *   whenever the last recv() call of that channel happens after
     *   the source QEMU instance has already issued shutdown() on the
     *   channel. This is affected by (at least) commits 637280aeb2
     *   and 1d457daf86.
     *
     * NOTE: Regardless of the state of this option, a premature
     * termination of the TLS connection might happen due to error at
     * any moment prior to the last send->recv iteration.
     */
    bool multifd_clean_tls_termination;

And I think the more straightforward implementation is to incorporate
Maciej's premature_ok patches (in some form), otherwise that option will
have to take effect in the QIOChannel, which is a layering violation.


^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels
  2025-02-07 13:17                                     ` Fabiano Rosas
@ 2025-02-07 14:04                                       ` Peter Xu
  2025-02-07 14:16                                         ` Fabiano Rosas
  0 siblings, 1 reply; 137+ messages in thread
From: Peter Xu @ 2025-02-07 14:04 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: Maciej S. Szmigiero, Alex Williamson, Daniel P. Berrangé,
	Cédric Le Goater, Eric Blake, Markus Armbruster,
	Avihai Horon, Joao Martins, qemu-devel

On Fri, Feb 07, 2025 at 10:17:19AM -0300, Fabiano Rosas wrote:
> Peter Xu <peterx@redhat.com> writes:
> 
> > On Thu, Feb 06, 2025 at 02:32:12PM -0300, Fabiano Rosas wrote:
> >> > In any case we'd still need some kind of a compatibility behavior for
> >> > the TLS bit stream emitted by older QEMU versions (which is always
> >> > improperly terminated).
> >> >
> >> 
> >> There is no compat issue. For <= 9.2, QEMU is still doing an extra
> >> multifd_send_sync_main(), which results in an extra MULTIFD_FLAG_SYNC on
> >> the destination and it gets stuck waiting for the
> >> RAM_SAVE_FLAG_MULTIFD_FLUSH that never comes. Therefore the src always
> >> closes the connection before dst reaches the extra recv().
> >> 
> >> I tested migration both ways with 2 previous QEMU versions and the
> >> gnutls_bye() series passes all tests. I also put an assert at
> >> tlssession.c and it never triggers for GNUTLS_E_PREMATURE_TERMINATION. The
> >> MULTIFD_FLAG_EOS should behave the same.
> >
> > Which are the versions you tried?  As only 9.1 and 9.2 have 637280aeb2, so I
> > wonder if the same issue would hit too with 9.0 or older.
> 
> Good point. 9.0 indeed breaks.
> 
> >
> > I'd confess it feels unreliable to rely on the side effect of 637280aeb2,
> > because fundamentally it works based on the fact that multifd threads need
> > to be kicked out by the main load thread SYNC event on dest QEMU to keep
> > the readv() from going wrong.
> >
> 
> We're relying on the opposite: multifd_recv NOT getting kicked, which is
> a bug that 1d457daf86 fixed.
> 
> > What I'm not sure here is, is it sheer luck that the main channel SYNC will
> > always arrive _before_ pre-mature terminations of the multifd channels?  It
> > sounds like it could also happen when a multifd channel gets its
> > pre-mature termination early, before the main thread got the SYNC.
> 
> You lost me here, what main channel sync? It's the MULTIFD_FLAG_SYNC that
> puts the recv thread in the "won't see the termination" state and that
> is serialized:
> 
>    SEND                        RECV
>    -------------------------+----------------------------
> 1  multifd_send_sync_main()
> 2  pending_sync==true,
> 3  send thread sends SYNC      recv thread gets SYNC
> 4  <some work>                 recv gets stuck.
> 5  multifd_send_shutdown()     <time passes>
> 6  shutdown()                  multifd_recv_shutdown()
>                                recv_terminate_threads()
>                                recv exits without recv()
> 
> In other words, RECV would need to see the shutdown (6) before the SYNC
> (3), which I don't think is possible.

Ah yeah, I somehow remembered we sent a SYNC in the main channel but forgot
to push the per-channel SYNC.  I got it the other way round.  Yeah if data
is always ordered with shutdown() effect on recv then it seems in order.

> 
> >
> > Maybe we still need a compat property at the end..
> 
> This is actually similar to preempt_pre_7_2, what about:
> 
>     /*
>      * This variable only makes sense when set on the machine that is
>      * the destination of a multifd migration with TLS enabled. It
>      * affects the behavior of the last send->recv iteration with
>      * regards to termination of the TLS session. Defaults to true.
>      *
>      * When set:
>      *
>      * - the destination QEMU instance can expect to never get a
>      *   GNUTLS_E_PREMATURE_TERMINATION error. Manifested as the error
>      *   message: "The TLS connection was non-properly terminated".
>      *
>      * When clear:
>      *
>      * - the destination QEMU instance can expect to see a
>      *   GNUTLS_E_PREMATURE_TERMINATION error in any multifd channel
>      *   whenever the last recv() call of that channel happens after
>      *   the source QEMU instance has already issued shutdown() on the
>      *   channel. This is affected by (at least) commits 637280aeb2
>      *   and 1d457daf86.

If we want to reference them after all, we could use another sentence to
describe the effects:

       *   Commit 637280aeb2 (since 9.1) introduced a side effect that
       *   prevents the pre-mature termination from happening, while commit
       *   1d457daf86 (since 10.0) can unexpectedly re-expose the pre-mature
       *   termination issue.

>      *
>      * NOTE: Regardless of the state of this option, a premature
>      * termination of the TLS connection might happen due to error at
>      * any moment prior to the last send->recv iteration.
>      */
>     bool multifd_clean_tls_termination;
> 
> And I think the more straight-forward implementation is to incorporate
> Maciej's premature_ok patches (in some form), otherwise that option will
> have to take effect on the QIOChannel which is a layering violation.

If we take Dan's comment into account:

https://lore.kernel.org/r/Z6I86e-hzJAlxk0r@redhat.com

It means whenever the multifd recv thread invokes the iochannel API it will
use multifd_clean_tls_termination to decide whether to pass in the
QIO_CHANNEL_READ_RELAXED_EOF flag.  I hope this is not a layering violation,
or I could be missing something.

So if we're on the same page we need that knob.  To make this series easier
we could make it two steps:

  - Step 1: introduce the parameter and QIO_CHANNEL_READ_RELAXED_EOF, set
    its default to false.

  - Step 2: Your other RFC series to implement gnutls_bye(), at last make
    it a compat property and switch default true.

Then Maciej only needs step 1, it looks to me.
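To make step 1 concrete, here is a minimal self-contained sketch of the decision the recv thread would make on each read. The helper name and the flag's numeric value are assumptions for illustration, not actual QEMU code:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the proposed QIOChannel read flag. */
#define QIO_CHANNEL_READ_RELAXED_EOF (1 << 0)

/*
 * Sketch of the per-read decision: when clean TLS termination is
 * expected (parameter set), a premature EOF stays an error; when it is
 * not, the relaxed-EOF flag is passed so an unclean EOF is tolerated.
 */
static int multifd_recv_read_flags(bool multifd_clean_tls_termination)
{
    return multifd_clean_tls_termination ? 0 : QIO_CHANNEL_READ_RELAXED_EOF;
}
```

With the parameter defaulting to false in step 1, every multifd read would pass the relaxed-EOF flag; step 2 would flip the default via a compat property once gnutls_bye() lands.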

-- 
Peter Xu




* Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels
  2025-02-07 14:04                                       ` Peter Xu
@ 2025-02-07 14:16                                         ` Fabiano Rosas
  0 siblings, 0 replies; 137+ messages in thread
From: Fabiano Rosas @ 2025-02-07 14:16 UTC (permalink / raw)
  To: Peter Xu
  Cc: Maciej S. Szmigiero, Alex Williamson, Daniel P. Berrangé,
	Cédric Le Goater, Eric Blake, Markus Armbruster,
	Avihai Horon, Joao Martins, qemu-devel

Peter Xu <peterx@redhat.com> writes:

> On Fri, Feb 07, 2025 at 10:17:19AM -0300, Fabiano Rosas wrote:
>> Peter Xu <peterx@redhat.com> writes:
>> 
>> > On Thu, Feb 06, 2025 at 02:32:12PM -0300, Fabiano Rosas wrote:
>> >> > In any case we'd still need some kind of a compatibility behavior for
>> >> > the TLS bit stream emitted by older QEMU versions (which is always
>> >> > improperly terminated).
>> >> >
>> >> 
>> >> There is no compat issue. For <= 9.2, QEMU is still doing an extra
>> >> multifd_send_sync_main(), which results in an extra MULTIFD_FLAG_SYNC on
>> >> the destination and it gets stuck waiting for the
>> >> RAM_SAVE_FLAG_MULTIFD_FLUSH that never comes. Therefore the src always
>> >> closes the connection before dst reaches the extra recv().
>> >> 
>> I tested migration both ways with 2 previous QEMU versions and the
>> gnutls_bye() series passes all tests. I also put an assert at
>> tlssession.c and it never triggers for GNUTLS_E_PREMATURE_TERMINATION. The
>> >> MULTIFD_FLAG_EOS should behave the same.
>> >
>> > Which are the versions you tried?  As only 9.1 and 9.2 have 637280aeb2, so I
>> > wonder if the same issue would hit too with 9.0 or older.
>> 
>> Good point. 9.0 indeed breaks.
>> 
>> >
>> > I'd confess I feel unreliable relying on the side effect of 637280aeb2,
>> > because fundamentally it works based on the fact that multifd threads need
>> > to be kicked out by the main load thread SYNC event on dest QEMU to avoid
>> > the readv() from going wrong.
>> >
>> 
>> We're relying on the opposite: multifd_recv NOT getting kicked, which is
>> a bug that 1d457daf86 fixed.
>> 
>> > What I'm not sure here is, is it sheer luck that the main channel SYNC will
>> > always arrive _before_ pre-mature terminations of the multifd channels?  It
>> > sounds like it could also happen when the multifd channels got its
>> > pre-mature termination early, before the main thread got the SYNC.
>> 
>> You lost me here, what main channel sync? It's the MULTIFD_FLAG_SYNC that
>> puts the recv thread in the "won't see the termination" state and that
>> is serialized:
>> 
>>    SEND                        RECV
>>    -------------------------+----------------------------
>> 1  multifd_send_sync_main()
>> 2  pending_sync==true,
>> 3  send thread sends SYNC      recv thread gets SYNC
>> 4  <some work>                 recv gets stuck.
>> 5  multifd_send_shutdown()     <time passes>
>> 6  shutdown()                  multifd_recv_shutdown()
>>                                recv_terminate_threads()
>>                                recv exits without recv()
>> 
>> In other words, RECV would need to see the shutdown (6) before the SYNC
>> (3), which I don't think is possible.
>
> Ah yeah, I somehow remembered we sent a SYNC in the main channel but forgot
> to push the per-channel SYNC.  I got it the other way round.  Yeah if data
> is always ordered with shutdown() effect on recv then it seems in order.
>
>> 
>> >
>> > Maybe we still need a compat property at the end..
>> 
>> This is actually similar to preempt_pre_7_2, what about:
>> 
>>     /*
>>      * This variable only makes sense when set on the machine that is
>>      * the destination of a multifd migration with TLS enabled. It
>>      * affects the behavior of the last send->recv iteration with
>>      * regards to termination of the TLS session. Defaults to true.
>>      *
>>      * When set:
>>      *
>>      * - the destination QEMU instance can expect to never get a
>>      *   GNUTLS_E_PREMATURE_TERMINATION error. Manifested as the error
>>      *   message: "The TLS connection was non-properly terminated".
>>      *
>>      * When clear:
>>      *
>>      * - the destination QEMU instance can expect to see a
>>      *   GNUTLS_E_PREMATURE_TERMINATION error in any multifd channel
>>      *   whenever the last recv() call of that channel happens after
>>      *   the source QEMU instance has already issued shutdown() on the
>>      *   channel. This is affected by (at least) commits 637280aeb2
>>      *   and 1d457daf86.
>
> If we want to reference them after all, we could use another sentence to
> describe the effects:
>
>        *   Commit 637280aeb2 (since 9.1) introduced a side effect that
>        *   prevents the pre-mature termination from happening, while commit
>        *   1d457daf86 (since 10.0) can unexpectedly re-expose the pre-mature
>        *   termination issue.
>

I'll add this.

>>      *
>>      * NOTE: Regardless of the state of this option, a premature
>>      * termination of the TLS connection might happen due to error at
>>      * any moment prior to the last send->recv iteration.
>>      */
>>     bool multifd_clean_tls_termination;
>> 
>> And I think the more straight-forward implementation is to incorporate
>> Maciej's premature_ok patches (in some form), otherwise that option will
>> have to take effect on the QIOChannel which is a layering violation.
>
> If we take Dan's comment into account:
>
> https://lore.kernel.org/r/Z6I86e-hzJAlxk0r@redhat.com
>
> It means whenever the multifd recv thread invokes the iochannel API it will
> use multifd_clean_tls_termination to decide whether to pass in the
> QIO_CHANNEL_READ_RELAXED_EOF flag.  I hope this is not a layering violation,
> or I could be missing something.

Yes, we need that.

>
> So if we're on the same page we need that knob.  To make this series easier
> we could make it two steps:
>
>   - Step 1: introduce the parameter and QIO_CHANNEL_READ_RELAXED_EOF, set
>     its default to false.
>
>   - Step 2: Your other RFC series to implement gnutls_bye(), at last make
>     it a compat property and switch default true.
>
> Then Maciej only needs step 1, it looks to me.

I'm sending everything as a v2 in a moment. We can cherry-pick from
there.



* Re: [PATCH v4 17/33] migration/multifd: Make MultiFDSendData a struct
  2025-01-30 10:08 ` [PATCH v4 17/33] migration/multifd: Make MultiFDSendData a struct Maciej S. Szmigiero
@ 2025-02-07 14:36   ` Fabiano Rosas
  2025-02-07 19:43     ` Maciej S. Szmigiero
  0 siblings, 1 reply; 137+ messages in thread
From: Fabiano Rosas @ 2025-02-07 14:36 UTC (permalink / raw)
  To: Maciej S. Szmigiero, Peter Xu
  Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

"Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:

> From: Peter Xu <peterx@redhat.com>
>
> The newly introduced device state buffer can be used for storing
> VFIO's read() raw data, but it can already also store generic device
> states.  After noticing that device states may not easily provide a max
> buffer size (also the fact that RAM MultiFDPages_t after all also wants
> flexibility in managing the offset[] array), it may not be a good idea to
> stick with a union on MultiFDSendData, as it won't play well with such
> flexibility.
>
> Switch MultiFDSendData to a struct.
>
> It won't consume a lot more space in reality, after all the real buffers
> were already dynamically allocated, so it's so far only about the two
> structs (pages, device_state) that will be duplicated, but they're small.
>
> With this, we can remove the pretty hard to understand alloc size logic.
> Because now we can allocate offset[] together with the SendData, and
> properly free it when the SendData is freed.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> [MSS: Make sure to clear possible device state payload before freeing
> MultiFDSendData, remove placeholders for other patches not included]
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> ---
>  migration/multifd-device-state.c |  5 -----
>  migration/multifd-nocomp.c       | 13 ++++++-------
>  migration/multifd.c              | 25 +++++++------------------
>  migration/multifd.h              | 15 +++++++++------
>  4 files changed, 22 insertions(+), 36 deletions(-)
>
> diff --git a/migration/multifd-device-state.c b/migration/multifd-device-state.c
> index 2207bea9bf8a..d1674b432ff2 100644
> --- a/migration/multifd-device-state.c
> +++ b/migration/multifd-device-state.c
> @@ -16,11 +16,6 @@ static QemuMutex queue_job_mutex;
>  
>  static MultiFDSendData *device_state_send;
>  
> -size_t multifd_device_state_payload_size(void)
> -{
> -    return sizeof(MultiFDDeviceState_t);
> -}
> -
>  void multifd_device_state_send_setup(void)
>  {
>      qemu_mutex_init(&queue_job_mutex);
> diff --git a/migration/multifd-nocomp.c b/migration/multifd-nocomp.c
> index c00804652383..ffe75256c9fb 100644
> --- a/migration/multifd-nocomp.c
> +++ b/migration/multifd-nocomp.c
> @@ -25,15 +25,14 @@
>  
>  static MultiFDSendData *multifd_ram_send;
>  
> -size_t multifd_ram_payload_size(void)
> +void multifd_ram_payload_alloc(MultiFDPages_t *pages)
>  {
> -    uint32_t n = multifd_ram_page_count();
> +    pages->offset = g_new0(ram_addr_t, multifd_ram_page_count());
> +}
>  
> -    /*
> -     * We keep an array of page offsets at the end of MultiFDPages_t,
> -     * add space for it in the allocation.
> -     */
> -    return sizeof(MultiFDPages_t) + n * sizeof(ram_addr_t);
> +void multifd_ram_payload_free(MultiFDPages_t *pages)
> +{
> +    g_clear_pointer(&pages->offset, g_free);
>  }
>  
>  void multifd_ram_save_setup(void)
> diff --git a/migration/multifd.c b/migration/multifd.c
> index 61b061a33d35..0b61b8192231 100644
> --- a/migration/multifd.c
> +++ b/migration/multifd.c
> @@ -105,26 +105,12 @@ struct {
>  
>  MultiFDSendData *multifd_send_data_alloc(void)
>  {
> -    size_t max_payload_size, size_minus_payload;
> +    MultiFDSendData *new = g_new0(MultiFDSendData, 1);
>  
> -    /*
> -     * MultiFDPages_t has a flexible array at the end, account for it
> -     * when allocating MultiFDSendData. Use max() in case other types
> -     * added to the union in the future are larger than
> -     * (MultiFDPages_t + flex array).
> -     */
> -    max_payload_size = MAX(multifd_ram_payload_size(),
> -                           multifd_device_state_payload_size());
> -    max_payload_size = MAX(max_payload_size, sizeof(MultiFDPayload));
> -
> -    /*
> -     * Account for any holes the compiler might insert. We can't pack
> -     * the structure because that misaligns the members and triggers
> -     * Waddress-of-packed-member.
> -     */
> -    size_minus_payload = sizeof(MultiFDSendData) - sizeof(MultiFDPayload);
> +    multifd_ram_payload_alloc(&new->u.ram);
> +    /* Device state allocates its payload on-demand */
>  
> -    return g_malloc0(size_minus_payload + max_payload_size);
> +    return new;
>  }
>  
>  void multifd_send_data_clear(MultiFDSendData *data)
> @@ -151,8 +137,11 @@ void multifd_send_data_free(MultiFDSendData *data)
>          return;
>      }
>  
> +    /* This also frees the device state payload */
>      multifd_send_data_clear(data);
>  
> +    multifd_ram_payload_free(&data->u.ram);
> +

Shouldn't this be added to the switch statement at
multifd_send_data_clear() instead?

>      g_free(data);
>  }
>  
> diff --git a/migration/multifd.h b/migration/multifd.h
> index ddc617db9acb..f7811cc0d0cb 100644
> --- a/migration/multifd.h
> +++ b/migration/multifd.h
> @@ -115,9 +115,13 @@ typedef struct {
>      uint32_t num;
>      /* number of normal pages */
>      uint32_t normal_num;
> +    /*
> +     * Pointer to the ramblock.  NOTE: it's caller's responsibility to make
> +     * sure the pointer is always valid!
> +     */
>      RAMBlock *block;
> -    /* offset of each page */
> -    ram_addr_t offset[];
> +    /* offset array of each page, managed by multifd */
> +    ram_addr_t *offset;
>  } MultiFDPages_t;
>  
>  struct MultiFDRecvData {
> @@ -140,7 +144,7 @@ typedef enum {
>      MULTIFD_PAYLOAD_DEVICE_STATE,
>  } MultiFDPayloadType;
>  
> -typedef union MultiFDPayload {
> +typedef struct MultiFDPayload {
>      MultiFDPages_t ram;
>      MultiFDDeviceState_t device_state;
>  } MultiFDPayload;
> @@ -392,12 +396,11 @@ void multifd_ram_save_cleanup(void);
>  int multifd_ram_flush_and_sync(QEMUFile *f);
>  bool multifd_ram_sync_per_round(void);
>  bool multifd_ram_sync_per_section(void);
> -size_t multifd_ram_payload_size(void);
> +void multifd_ram_payload_alloc(MultiFDPages_t *pages);
> +void multifd_ram_payload_free(MultiFDPages_t *pages);
>  void multifd_ram_fill_packet(MultiFDSendParams *p);
>  int multifd_ram_unfill_packet(MultiFDRecvParams *p, Error **errp);
>  
> -size_t multifd_device_state_payload_size(void);
> -
>  void multifd_send_data_clear_device_state(MultiFDDeviceState_t *device_state);
>  
>  void multifd_device_state_send_setup(void);



* Re: [PATCH v4 17/33] migration/multifd: Make MultiFDSendData a struct
  2025-02-07 14:36   ` Fabiano Rosas
@ 2025-02-07 19:43     ` Maciej S. Szmigiero
  0 siblings, 0 replies; 137+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-07 19:43 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: Peter Xu, Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

On 7.02.2025 15:36, Fabiano Rosas wrote:
> "Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:
> 
>> From: Peter Xu <peterx@redhat.com>
>>
>> The newly introduced device state buffer can be used for storing
>> VFIO's read() raw data, but it can already also store generic device
>> states.  After noticing that device states may not easily provide a max
>> buffer size (also the fact that RAM MultiFDPages_t after all also wants
>> flexibility in managing the offset[] array), it may not be a good idea to
>> stick with a union on MultiFDSendData, as it won't play well with such
>> flexibility.
>>
>> Switch MultiFDSendData to a struct.
>>
>> It won't consume a lot more space in reality, after all the real buffers
>> were already dynamically allocated, so it's so far only about the two
>> structs (pages, device_state) that will be duplicated, but they're small.
>>
>> With this, we can remove the pretty hard to understand alloc size logic.
>> Because now we can allocate offset[] together with the SendData, and
>> properly free it when the SendData is freed.
>>
>> Signed-off-by: Peter Xu <peterx@redhat.com>
>> [MSS: Make sure to clear possible device state payload before freeing
>> MultiFDSendData, remove placeholders for other patches not included]
>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>> ---
>>   migration/multifd-device-state.c |  5 -----
>>   migration/multifd-nocomp.c       | 13 ++++++-------
>>   migration/multifd.c              | 25 +++++++------------------
>>   migration/multifd.h              | 15 +++++++++------
>>   4 files changed, 22 insertions(+), 36 deletions(-)
>>
>> diff --git a/migration/multifd-device-state.c b/migration/multifd-device-state.c
>> index 2207bea9bf8a..d1674b432ff2 100644
>> --- a/migration/multifd-device-state.c
>> +++ b/migration/multifd-device-state.c
>> @@ -16,11 +16,6 @@ static QemuMutex queue_job_mutex;
>>   
>>   static MultiFDSendData *device_state_send;
>>   
>> -size_t multifd_device_state_payload_size(void)
>> -{
>> -    return sizeof(MultiFDDeviceState_t);
>> -}
>> -
>>   void multifd_device_state_send_setup(void)
>>   {
>>       qemu_mutex_init(&queue_job_mutex);
>> diff --git a/migration/multifd-nocomp.c b/migration/multifd-nocomp.c
>> index c00804652383..ffe75256c9fb 100644
>> --- a/migration/multifd-nocomp.c
>> +++ b/migration/multifd-nocomp.c
>> @@ -25,15 +25,14 @@
>>   
>>   static MultiFDSendData *multifd_ram_send;
>>   
>> -size_t multifd_ram_payload_size(void)
>> +void multifd_ram_payload_alloc(MultiFDPages_t *pages)
>>   {
>> -    uint32_t n = multifd_ram_page_count();
>> +    pages->offset = g_new0(ram_addr_t, multifd_ram_page_count());
>> +}
>>   
>> -    /*
>> -     * We keep an array of page offsets at the end of MultiFDPages_t,
>> -     * add space for it in the allocation.
>> -     */
>> -    return sizeof(MultiFDPages_t) + n * sizeof(ram_addr_t);
>> +void multifd_ram_payload_free(MultiFDPages_t *pages)
>> +{
>> +    g_clear_pointer(&pages->offset, g_free);
>>   }
>>   
>>   void multifd_ram_save_setup(void)
>> diff --git a/migration/multifd.c b/migration/multifd.c
>> index 61b061a33d35..0b61b8192231 100644
>> --- a/migration/multifd.c
>> +++ b/migration/multifd.c
>> @@ -105,26 +105,12 @@ struct {
>>   
>>   MultiFDSendData *multifd_send_data_alloc(void)
>>   {
>> -    size_t max_payload_size, size_minus_payload;
>> +    MultiFDSendData *new = g_new0(MultiFDSendData, 1);
>>   
>> -    /*
>> -     * MultiFDPages_t has a flexible array at the end, account for it
>> -     * when allocating MultiFDSendData. Use max() in case other types
>> -     * added to the union in the future are larger than
>> -     * (MultiFDPages_t + flex array).
>> -     */
>> -    max_payload_size = MAX(multifd_ram_payload_size(),
>> -                           multifd_device_state_payload_size());
>> -    max_payload_size = MAX(max_payload_size, sizeof(MultiFDPayload));
>> -
>> -    /*
>> -     * Account for any holes the compiler might insert. We can't pack
>> -     * the structure because that misaligns the members and triggers
>> -     * Waddress-of-packed-member.
>> -     */
>> -    size_minus_payload = sizeof(MultiFDSendData) - sizeof(MultiFDPayload);
>> +    multifd_ram_payload_alloc(&new->u.ram);
>> +    /* Device state allocates its payload on-demand */
>>   
>> -    return g_malloc0(size_minus_payload + max_payload_size);
>> +    return new;
>>   }
>>   
>>   void multifd_send_data_clear(MultiFDSendData *data)
>> @@ -151,8 +137,11 @@ void multifd_send_data_free(MultiFDSendData *data)
>>           return;
>>       }
>>   
>> +    /* This also frees the device state payload */
>>       multifd_send_data_clear(data);
>>   
>> +    multifd_ram_payload_free(&data->u.ram);
>> +
> 
> Shouldn't this be added to the switch statement at
> multifd_send_data_clear() instead?

I think the intention is that RAM pages are allocated at MultiFDSendData
instance allocation time and stay allocated for its entire lifetime,
because we know the RAM pages packet data size upfront and also that's what
multifd send threads will mostly be sending.

In contrast with RAM, device state allocates its payload on-demand since
its size is unknown and can vary between each multifd_queue_device_state()
invocation. This payload is freed after it gets sent by a multifd send
thread.

There's even a comment about this in multifd_send_data_alloc():
>     multifd_ram_payload_alloc(&new->u.ram);
>     /* Device state allocates its payload on-demand */

Thanks,
Maciej
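The lifecycle described above can be sketched with simplified stand-in types. The struct layout and helper names below are illustrative assumptions, not the actual QEMU definitions: RAM's offset[] lives as long as the SendData itself, while the device state payload comes and goes with each clear.

```c
#include <assert.h>
#include <stdlib.h>

/* Simplified stand-ins for the QEMU types, for illustration only. */
typedef struct {
    unsigned long *offset;   /* allocated once, lives as long as the SendData */
} Pages;

typedef struct {
    void *buf;               /* allocated on demand, freed on every clear */
} DeviceState;

typedef struct {
    Pages ram;
    DeviceState device_state;
} SendData;

static SendData *send_data_alloc(size_t page_count)
{
    SendData *d = calloc(1, sizeof(*d));

    /* RAM payload size is known upfront, so allocate it right here... */
    d->ram.offset = calloc(page_count, sizeof(*d->ram.offset));
    /* ...while the device state payload is allocated on demand, not here */
    return d;
}

static void send_data_clear(SendData *d)
{
    /* the on-demand device state payload goes away on clear... */
    free(d->device_state.buf);
    d->device_state.buf = NULL;
    /* ...but ram.offset deliberately survives a clear */
}

static void send_data_free(SendData *d)
{
    send_data_clear(d);
    free(d->ram.offset);   /* only freed together with the SendData itself */
    free(d);
}
```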




* Re: [PATCH v4 23/33] vfio/migration: Multifd device state transfer support - basic types
  2025-01-30 10:08 ` [PATCH v4 23/33] vfio/migration: Multifd device state transfer support - basic types Maciej S. Szmigiero
@ 2025-02-10 17:17   ` Cédric Le Goater
  0 siblings, 0 replies; 137+ messages in thread
From: Cédric Le Goater @ 2025-02-10 17:17 UTC (permalink / raw)
  To: Maciej S. Szmigiero, Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Eric Blake, Markus Armbruster,
	Daniel P . Berrangé, Avihai Horon, Joao Martins, qemu-devel

On 1/30/25 11:08, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> 
> Add basic types and flags used by VFIO multifd device state transfer
> support.
> 
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> ---
>   hw/vfio/migration.c | 10 ++++++++++
>   1 file changed, 10 insertions(+)
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index cbb1e0b6f852..715182c4f810 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -47,6 +47,7 @@
>   #define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xffffffffef100003ULL)
>   #define VFIO_MIG_FLAG_DEV_DATA_STATE    (0xffffffffef100004ULL)
>   #define VFIO_MIG_FLAG_DEV_INIT_DATA_SENT (0xffffffffef100005ULL)
> +#define VFIO_MIG_FLAG_DEV_CONFIG_LOAD_READY (0xffffffffef100006ULL)
>   
>   /*
>    * This is an arbitrary size based on migration of mlx5 devices, where typically
> @@ -55,6 +56,15 @@
>    */
>   #define VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE (1 * MiB)
>   
> +#define VFIO_DEVICE_STATE_CONFIG_STATE (1)
> +
> +typedef struct VFIODeviceStatePacket {
> +    uint32_t version;
> +    uint32_t idx;
> +    uint32_t flags;
> +    uint8_t data[0];
> +} QEMU_PACKED VFIODeviceStatePacket;
> +

Since this is a rather big change:

  hw/vfio/migration.c                | 754 ++++++++++++++++++++++++++++-

please introduce a new hw/vfio/migration-multifd.c file.

Thanks,

C.



>   static int64_t bytes_transferred;
>
>   static const char *mig_state_to_str(enum vfio_device_mig_state state)
> 




* Re: [PATCH v4 20/33] vfio/migration: Add x-migration-load-config-after-iter VFIO property
  2025-01-30 10:08 ` [PATCH v4 20/33] vfio/migration: Add x-migration-load-config-after-iter VFIO property Maciej S. Szmigiero
@ 2025-02-10 17:24   ` Cédric Le Goater
  2025-02-11 14:37     ` Maciej S. Szmigiero
  0 siblings, 1 reply; 137+ messages in thread
From: Cédric Le Goater @ 2025-02-10 17:24 UTC (permalink / raw)
  To: Maciej S. Szmigiero, Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Eric Blake, Markus Armbruster,
	Daniel P . Berrangé, Avihai Horon, Joao Martins, qemu-devel

On 1/30/25 11:08, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> 
> This property allows configuring whether to start the config load only
> after all iterables were loaded.
> Such interlocking is required for ARM64 due to this platform's VFIO
> dependency on the interrupt controller being loaded first.
> 
> The property defaults to AUTO, which means ON for ARM, OFF for other
> platforms.
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> ---
>   hw/vfio/migration.c           | 25 +++++++++++++++++++++++++
>   hw/vfio/pci.c                 |  3 +++
>   include/hw/vfio/vfio-common.h |  1 +
>   3 files changed, 29 insertions(+)
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index adfa752db527..d801c861d202 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -254,6 +254,31 @@ static int vfio_load_buffer(QEMUFile *f, VFIODevice *vbasedev,
>       return ret;
>   }
>   
> +static bool vfio_load_config_after_iter(VFIODevice *vbasedev)
> +{
> +    if (vbasedev->migration_load_config_after_iter == ON_OFF_AUTO_ON) {
> +        return true;
> +    } else if (vbasedev->migration_load_config_after_iter == ON_OFF_AUTO_OFF) {
> +        return false;
> +    }
> +
> +    assert(vbasedev->migration_load_config_after_iter == ON_OFF_AUTO_AUTO);
> +
> +    /*
> +     * Starting the config load only after all iterables were loaded is required
> +     * for ARM64 due to this platform's VFIO dependency on the interrupt
> +     * controller being loaded first.
> +     *
> +     * See commit d329f5032e17 ("vfio: Move the saving of the config space to
> +     * the right place in VFIO migration").
> +     */
> +#if defined(TARGET_ARM)
> +    return true;
> +#else
> +    return false;
> +#endif

I would rather deactivate support on ARM and avoid workarounds.

This can be done in routine vfio_multifd_transfer_supported() I believe,
at the end of this series. A warning can be added to inform the user.

Thanks,

C.



> +}
> +
>   static int vfio_save_device_config_state(QEMUFile *f, void *opaque,
>                                            Error **errp)
>   {
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index ab17a98ee5b6..83090c544d95 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -3377,6 +3377,9 @@ static const Property vfio_pci_dev_properties[] = {
>                       VFIO_FEATURE_ENABLE_IGD_OPREGION_BIT, false),
>       DEFINE_PROP_ON_OFF_AUTO("enable-migration", VFIOPCIDevice,
>                               vbasedev.enable_migration, ON_OFF_AUTO_AUTO),
> +    DEFINE_PROP_ON_OFF_AUTO("x-migration-load-config-after-iter", VFIOPCIDevice,
> +                            vbasedev.migration_load_config_after_iter,
> +                            ON_OFF_AUTO_AUTO),
>       DEFINE_PROP_BOOL("migration-events", VFIOPCIDevice,
>                        vbasedev.migration_events, false),
>       DEFINE_PROP_BOOL("x-no-mmap", VFIOPCIDevice, vbasedev.no_mmap, false),
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 0c60be5b15c7..153d03745dc7 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -133,6 +133,7 @@ typedef struct VFIODevice {
>       bool no_mmap;
>       bool ram_block_discard_allowed;
>       OnOffAuto enable_migration;
> +    OnOffAuto migration_load_config_after_iter;
>       bool migration_events;
>       VFIODeviceOps *ops;
>       unsigned int num_irqs;
> 



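For illustration, the ON/OFF/AUTO resolution being discussed boils down to a small decision function. The enum and helper below are a simplified self-contained model under assumed names, not the actual QEMU types:

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for QEMU's OnOffAuto, for illustration only. */
typedef enum { AUTO, OFF, ON } OnOffAuto;

/*
 * Model of vfio_load_config_after_iter(): an explicit ON or OFF wins,
 * while AUTO interlocks the config load only on ARM, where the
 * interrupt controller must be loaded before the VFIO config state.
 */
static bool load_config_after_iter(OnOffAuto prop, bool target_is_arm)
{
    if (prop == ON) {
        return true;
    }
    if (prop == OFF) {
        return false;
    }
    return target_is_arm; /* AUTO */
}
```

Cédric's suggestion above would instead resolve this inside vfio_multifd_transfer_supported(), rejecting multifd transfer on ARM rather than keeping the interlock as a workaround.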

* Re: [PATCH v4 20/33] vfio/migration: Add x-migration-load-config-after-iter VFIO property
  2025-02-10 17:24   ` Cédric Le Goater
@ 2025-02-11 14:37     ` Maciej S. Szmigiero
  2025-02-11 15:00       ` Cédric Le Goater
  0 siblings, 1 reply; 137+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-11 14:37 UTC (permalink / raw)
  To: Cédric Le Goater
  Cc: Alex Williamson, Eric Blake, Peter Xu, Fabiano Rosas,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

On 10.02.2025 18:24, Cédric Le Goater wrote:
> On 1/30/25 11:08, Maciej S. Szmigiero wrote:
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> This property allows configuring whether to start the config load only
>> after all iterables were loaded.
>> Such interlocking is required for ARM64 due to this platform VFIO
>> dependency on interrupt controller being loaded first.
>>
>> The property defaults to AUTO, which means ON for ARM, OFF for other
>> platforms.
>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>> ---
>>   hw/vfio/migration.c           | 25 +++++++++++++++++++++++++
>>   hw/vfio/pci.c                 |  3 +++
>>   include/hw/vfio/vfio-common.h |  1 +
>>   3 files changed, 29 insertions(+)
>>
>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>> index adfa752db527..d801c861d202 100644
>> --- a/hw/vfio/migration.c
>> +++ b/hw/vfio/migration.c
>> @@ -254,6 +254,31 @@ static int vfio_load_buffer(QEMUFile *f, VFIODevice *vbasedev,
>>       return ret;
>>   }
>> +static bool vfio_load_config_after_iter(VFIODevice *vbasedev)
>> +{
>> +    if (vbasedev->migration_load_config_after_iter == ON_OFF_AUTO_ON) {
>> +        return true;
>> +    } else if (vbasedev->migration_load_config_after_iter == ON_OFF_AUTO_OFF) {
>> +        return false;
>> +    }
>> +
>> +    assert(vbasedev->migration_load_config_after_iter == ON_OFF_AUTO_AUTO);
>> +
>> +    /*
>> +     * Starting the config load only after all iterables were loaded is required
>> +     * for ARM64 due to this platform VFIO dependency on interrupt controller
>> +     * being loaded first.
>> +     *
>> +     * See commit d329f5032e17 ("vfio: Move the saving of the config space to
>> +     * the right place in VFIO migration").
>> +     */
>> +#if defined(TARGET_ARM)
>> +    return true;
>> +#else
>> +    return false;
>> +#endif
> 
> I would rather deactivate support on ARM and avoid workarounds.
> 
> This can be done in routine vfio_multifd_transfer_supported() I believe,
> at the end of this series. A warning can be added to inform the user.

The reason why this interlocking support (x-migration-load-config-after-iter)
was added is that you said during the review of the previous version of
this patch set that "regarding ARM64, it would be unfortunate to deactivate
the feature since migration works correctly today [..] and this series should
improve also downtime":
https://lore.kernel.org/qemu-devel/59897119-25d7-4a8b-9616-f8ab54e03f65@redhat.com/

My point is that after spending time developing and testing that feature
(or "workaround") it would be a shame to throw it away (with all the benefits
it brings) and completely disable multifd VFIO device state transfer on ARM.

Or am I misunderstanding you right now, and you only mean here to make
x-migration-load-config-after-iter forcibly enabled on ARM?

> Thanks,
> 
> C.

Thanks,
Maciej



^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH v4 20/33] vfio/migration: Add x-migration-load-config-after-iter VFIO property
  2025-02-11 14:37     ` Maciej S. Szmigiero
@ 2025-02-11 15:00       ` Cédric Le Goater
  2025-02-11 15:57         ` Maciej S. Szmigiero
  0 siblings, 1 reply; 137+ messages in thread
From: Cédric Le Goater @ 2025-02-11 15:00 UTC (permalink / raw)
  To: Maciej S. Szmigiero
  Cc: Alex Williamson, Eric Blake, Peter Xu, Fabiano Rosas,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

On 2/11/25 15:37, Maciej S. Szmigiero wrote:
> On 10.02.2025 18:24, Cédric Le Goater wrote:
>> On 1/30/25 11:08, Maciej S. Szmigiero wrote:
>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>
>>> This property allows configuring whether to start the config load only
>>> after all iterables were loaded.
>>> Such interlocking is required for ARM64 due to this platform VFIO
>>> dependency on interrupt controller being loaded first.
>>>
>>> The property defaults to AUTO, which means ON for ARM, OFF for other
>>> platforms.
>>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>>> ---
>>>   hw/vfio/migration.c           | 25 +++++++++++++++++++++++++
>>>   hw/vfio/pci.c                 |  3 +++
>>>   include/hw/vfio/vfio-common.h |  1 +
>>>   3 files changed, 29 insertions(+)
>>>
>>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>>> index adfa752db527..d801c861d202 100644
>>> --- a/hw/vfio/migration.c
>>> +++ b/hw/vfio/migration.c
>>> @@ -254,6 +254,31 @@ static int vfio_load_buffer(QEMUFile *f, VFIODevice *vbasedev,
>>>       return ret;
>>>   }
>>> +static bool vfio_load_config_after_iter(VFIODevice *vbasedev)
>>> +{
>>> +    if (vbasedev->migration_load_config_after_iter == ON_OFF_AUTO_ON) {
>>> +        return true;
>>> +    } else if (vbasedev->migration_load_config_after_iter == ON_OFF_AUTO_OFF) {
>>> +        return false;
>>> +    }
>>> +
>>> +    assert(vbasedev->migration_load_config_after_iter == ON_OFF_AUTO_AUTO);
>>> +
>>> +    /*
>>> +     * Starting the config load only after all iterables were loaded is required
>>> +     * for ARM64 due to this platform VFIO dependency on interrupt controller
>>> +     * being loaded first.
>>> +     *
>>> +     * See commit d329f5032e17 ("vfio: Move the saving of the config space to
>>> +     * the right place in VFIO migration").
>>> +     */
>>> +#if defined(TARGET_ARM)
>>> +    return true;
>>> +#else
>>> +    return false;
>>> +#endif
>>
>> I would rather deactivate support on ARM and avoid workarounds.
>>
>> This can be done in routine vfio_multifd_transfer_supported() I believe,
>> at the end of this series. A warning can be added to inform the user.
> 
> The reason why this interlocking support (x-migration-load-config-after-iter)
> was added because you said during the review of the previous version of
> this patch set that "regarding ARM64, it would be unfortunate to deactivate
> the feature since migration works correctly today [..] and this series should
> improve also downtime":
> https://lore.kernel.org/qemu-devel/59897119-25d7-4a8b-9616-f8ab54e03f65@redhat.com/

So much happened since ... my bad. I think this patch is not well
placed in the series; it should be at the end.

The series should present first the feature in a perfect world
and introduce at the end the toggles to handle the corner cases.
It helps the reader to focus on the good side of the proposal
and better understand the more unpleasant/ugly part.

> My point is that after spending time developing and testing that feature
> (or "workaround") it would be a shame to throw it away (with all the benefits
> it brings) and completely disable multifd VFIO device state transfer on ARM.

Well, if you take the approach described above, this patch would
be proposed after merge as a fix/workaround for ARM or we would
fix the ARM platform.

> Or am I misunderstanding you right now and you only mean here to make
> x-migration-load-config-after-iter forcefully enabled on ARM?

If we only need this toggle for ARM, and this seems to be the case,
let's take a more direct path and avoid a property.

I haven't read all your series and the comments yet.

Thanks,

C.


  
>> Thanks,
>>
>> C.
> 
> Thanks,
> Maciej
> 



^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH v4 20/33] vfio/migration: Add x-migration-load-config-after-iter VFIO property
  2025-02-11 15:00       ` Cédric Le Goater
@ 2025-02-11 15:57         ` Maciej S. Szmigiero
  2025-02-11 16:28           ` Cédric Le Goater
  0 siblings, 1 reply; 137+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-11 15:57 UTC (permalink / raw)
  To: Cédric Le Goater, Fabiano Rosas
  Cc: Alex Williamson, Eric Blake, Peter Xu, Markus Armbruster,
	Daniel P . Berrangé, Avihai Horon, Joao Martins, qemu-devel

On 11.02.2025 16:00, Cédric Le Goater wrote:
> On 2/11/25 15:37, Maciej S. Szmigiero wrote:
>> On 10.02.2025 18:24, Cédric Le Goater wrote:
>>> On 1/30/25 11:08, Maciej S. Szmigiero wrote:
>>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>>
>>>> This property allows configuring whether to start the config load only
>>>> after all iterables were loaded.
>>>> Such interlocking is required for ARM64 due to this platform VFIO
>>>> dependency on interrupt controller being loaded first.
>>>>
>>>> The property defaults to AUTO, which means ON for ARM, OFF for other
>>>> platforms.
>>>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>>>> ---
>>>>   hw/vfio/migration.c           | 25 +++++++++++++++++++++++++
>>>>   hw/vfio/pci.c                 |  3 +++
>>>>   include/hw/vfio/vfio-common.h |  1 +
>>>>   3 files changed, 29 insertions(+)
>>>>
>>>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>>>> index adfa752db527..d801c861d202 100644
>>>> --- a/hw/vfio/migration.c
>>>> +++ b/hw/vfio/migration.c
>>>> @@ -254,6 +254,31 @@ static int vfio_load_buffer(QEMUFile *f, VFIODevice *vbasedev,
>>>>       return ret;
>>>>   }
>>>> +static bool vfio_load_config_after_iter(VFIODevice *vbasedev)
>>>> +{
>>>> +    if (vbasedev->migration_load_config_after_iter == ON_OFF_AUTO_ON) {
>>>> +        return true;
>>>> +    } else if (vbasedev->migration_load_config_after_iter == ON_OFF_AUTO_OFF) {
>>>> +        return false;
>>>> +    }
>>>> +
>>>> +    assert(vbasedev->migration_load_config_after_iter == ON_OFF_AUTO_AUTO);
>>>> +
>>>> +    /*
>>>> +     * Starting the config load only after all iterables were loaded is required
>>>> +     * for ARM64 due to this platform VFIO dependency on interrupt controller
>>>> +     * being loaded first.
>>>> +     *
>>>> +     * See commit d329f5032e17 ("vfio: Move the saving of the config space to
>>>> +     * the right place in VFIO migration").
>>>> +     */
>>>> +#if defined(TARGET_ARM)
>>>> +    return true;
>>>> +#else
>>>> +    return false;
>>>> +#endif
>>>
>>> I would rather deactivate support on ARM and avoid workarounds.
>>>
>>> This can be done in routine vfio_multifd_transfer_supported() I believe,
>>> at the end of this series. A warning can be added to inform the user.
>>
>> The reason why this interlocking support (x-migration-load-config-after-iter)
>> was added because you said during the review of the previous version of
>> this patch set that "regarding ARM64, it would be unfortunate to deactivate
>> the feature since migration works correctly today [..] and this series should
>> improve also downtime":
>> https://lore.kernel.org/qemu-devel/59897119-25d7-4a8b-9616-f8ab54e03f65@redhat.com/
> 
> So much happened since ... my bad. I think this patch is not well
> placed in the series, it should be at the end.
> 
> The series should present first the feature in a perfect world
> and introduce at the end the toggles to handle the corner cases.
> It helps the reader to focus on the good side of the proposal
> and better understand the more unpleasant/ugly part.
> 
>> My point is that after spending time developing and testing that feature
>> (or "workaround") it would be a shame to throw it away (with all the benefits
>> it brings) and completely disable multifd VFIO device state transfer on ARM.
> 
> Well, if you take the approach described above, this patch would
> be proposed after merge as a fix/workaround for ARM or we would
> fix the ARM platform.

Looks like there should be no problems moving this x-migration-load-config-after-iter
feature to a separate patch near the end of the series - I will try
to do this.

>> Or am I misunderstanding you right now and you only mean here to make
>> x-migration-load-config-after-iter forcefully enabled on ARM?
> 
> If we only need this toggle for ARM, and this seems to be the case,
> let's take a more direct path and avoid a property.

The reason why we likely want some kind of switch even on
non-ARM platforms is the ability to test this functionality there.

Most VFIO setups are probably x86, so people working on this code
will benefit from an easy way to check that they haven't accidentally
broken this interlocking.

> I haven't read all your series and the comments yet.

I keep the series updated with received review comments and re-based
on top of the latest Fabiano's TLS session termination patches here:
https://gitlab.com/maciejsszmigiero/qemu/-/commits/multifd-device-state-transfer-vfio

The changes up to this point have been mostly limited to the migration
core code, but it would be nice to get review comments for the VFIO
parts soon too so I can post a complete new version.

> Thanks,
> 
> C.

Thanks,
Maciej



^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH v4 20/33] vfio/migration: Add x-migration-load-config-after-iter VFIO property
  2025-02-11 15:57         ` Maciej S. Szmigiero
@ 2025-02-11 16:28           ` Cédric Le Goater
  0 siblings, 0 replies; 137+ messages in thread
From: Cédric Le Goater @ 2025-02-11 16:28 UTC (permalink / raw)
  To: Maciej S. Szmigiero, Fabiano Rosas
  Cc: Alex Williamson, Eric Blake, Peter Xu, Markus Armbruster,
	Daniel P . Berrangé, Avihai Horon, Joao Martins, qemu-devel

On 2/11/25 16:57, Maciej S. Szmigiero wrote:
> On 11.02.2025 16:00, Cédric Le Goater wrote:
>> On 2/11/25 15:37, Maciej S. Szmigiero wrote:
>>> On 10.02.2025 18:24, Cédric Le Goater wrote:
>>>> On 1/30/25 11:08, Maciej S. Szmigiero wrote:
>>>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>>>
>>>>> This property allows configuring whether to start the config load only
>>>>> after all iterables were loaded.
>>>>> Such interlocking is required for ARM64 due to this platform VFIO
>>>>> dependency on interrupt controller being loaded first.
>>>>>
>>>>> The property defaults to AUTO, which means ON for ARM, OFF for other
>>>>> platforms.
>>>>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>>>>> ---
>>>>>   hw/vfio/migration.c           | 25 +++++++++++++++++++++++++
>>>>>   hw/vfio/pci.c                 |  3 +++
>>>>>   include/hw/vfio/vfio-common.h |  1 +
>>>>>   3 files changed, 29 insertions(+)
>>>>>
>>>>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>>>>> index adfa752db527..d801c861d202 100644
>>>>> --- a/hw/vfio/migration.c
>>>>> +++ b/hw/vfio/migration.c
>>>>> @@ -254,6 +254,31 @@ static int vfio_load_buffer(QEMUFile *f, VFIODevice *vbasedev,
>>>>>       return ret;
>>>>>   }
>>>>> +static bool vfio_load_config_after_iter(VFIODevice *vbasedev)
>>>>> +{
>>>>> +    if (vbasedev->migration_load_config_after_iter == ON_OFF_AUTO_ON) {
>>>>> +        return true;
>>>>> +    } else if (vbasedev->migration_load_config_after_iter == ON_OFF_AUTO_OFF) {
>>>>> +        return false;
>>>>> +    }
>>>>> +
>>>>> +    assert(vbasedev->migration_load_config_after_iter == ON_OFF_AUTO_AUTO);
>>>>> +
>>>>> +    /*
>>>>> +     * Starting the config load only after all iterables were loaded is required
>>>>> +     * for ARM64 due to this platform VFIO dependency on interrupt controller
>>>>> +     * being loaded first.
>>>>> +     *
>>>>> +     * See commit d329f5032e17 ("vfio: Move the saving of the config space to
>>>>> +     * the right place in VFIO migration").
>>>>> +     */
>>>>> +#if defined(TARGET_ARM)
>>>>> +    return true;
>>>>> +#else
>>>>> +    return false;
>>>>> +#endif
>>>>
>>>> I would rather deactivate support on ARM and avoid workarounds.
>>>>
>>>> This can be done in routine vfio_multifd_transfer_supported() I believe,
>>>> at the end of this series. A warning can be added to inform the user.
>>>
>>> The reason why this interlocking support (x-migration-load-config-after-iter)
>>> was added because you said during the review of the previous version of
>>> this patch set that "regarding ARM64, it would be unfortunate to deactivate
>>> the feature since migration works correctly today [..] and this series should
>>> improve also downtime":
>>> https://lore.kernel.org/qemu-devel/59897119-25d7-4a8b-9616-f8ab54e03f65@redhat.com/
>>
>> So much happened since ... my bad. I think this patch is not well
>> placed in the series, it should be at the end.
>>
>> The series should present first the feature in a perfect world
>> and introduce at the end the toggles to handle the corner cases.
>> It helps the reader to focus on the good side of the proposal
>> and better understand the more unpleasant/ugly part.
>>
>>> My point is that after spending time developing and testing that feature
>>> (or "workaround") it would be a shame to throw it away (with all the benefits
>>> it brings) and completely disable multifd VFIO device state transfer on ARM.
>>
>> Well, if you take the approach described above, this patch would
>> be proposed after merge as a fix/workaround for ARM or we would
>> fix the ARM platform.
> 
> Looks like there should be no problems moving this x-migration-load-config-after-iter
> feature to a separate patch near the end of the series - I will try
> to do this.

Thanks. The patch is broken anyhow, since vfio_load_config_after_iter
is unused.
  
>>> Or am I misunderstanding you right now and you only mean here to make
>>> x-migration-load-config-after-iter forcefully enabled on ARM?
>>
>> If we only need this toggle for ARM, and this seems to be the case,
>> let's take a more direct path and avoid a property.
> 
> The reason why we likely want a some kind of switch even on
> a non-ARM platform is ability to test this functionality there.
> 
> Most VFIO setups are probably x86 so people working on this code
> will benefit from easy ability to check if they haven't accidentally
> broke this interlocking.

It's a valid request. We might want to find a better name for the
property.
  
>> I haven't read all your series and the comments yet.
> 
> I keep the series updated with received review comments and re-based
> on top of the latest Fabiano's TLS session termination patches here:
> https://gitlab.com/maciejsszmigiero/qemu/-/commits/multifd-device-state-transfer-vfio
> 
> The changes up to this point has been mostly limited to migration
> core code but it would be nice to get review comments for the VFIO
> parts soon too so I can post a complete new version.

yes. Please introduce vfio/migration-multifd.{c,h} files. I would like
the interface to be clear and avoid bits and pieces of multifd support
slipping into the VFIO migration component.

Also, please provide documentation on :

  - the design principles and its limitations (the mlx5 kernel driver
    has some outcomes)
  - a guide on how to use VFIO migration with multifd
  - how the properties (migration and VFIO) can be set to improve
    performance.

It's not trivial; we will need some guidance for people wanting
to use it.

I have asked our QE to rerun benchmarks on RHEL. I think we could
run it on ARM too if needed.

Thanks,

C.



^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH v4 26/33] vfio/migration: Multifd device state transfer support - receive init/cleanup
  2025-01-30 10:08 ` [PATCH v4 26/33] vfio/migration: Multifd device state transfer support - receive init/cleanup Maciej S. Szmigiero
@ 2025-02-12 10:55   ` Cédric Le Goater
  2025-02-14 20:55     ` Maciej S. Szmigiero
  0 siblings, 1 reply; 137+ messages in thread
From: Cédric Le Goater @ 2025-02-12 10:55 UTC (permalink / raw)
  To: Maciej S. Szmigiero, Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Eric Blake, Markus Armbruster,
	Daniel P . Berrangé, Avihai Horon, Joao Martins, qemu-devel

On 1/30/25 11:08, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> 
> Add support for VFIOMultifd data structure that will contain most of the
> receive-side data together with its init/cleanup methods.
> 
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> ---
>   hw/vfio/migration.c           | 52 +++++++++++++++++++++++++++++++++--
>   include/hw/vfio/vfio-common.h |  5 ++++
>   2 files changed, 55 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 3211041939c6..bcdf204d5cf4 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -300,6 +300,9 @@ typedef struct VFIOStateBuffer {
>       size_t len;
>   } VFIOStateBuffer;
>   
> +typedef struct VFIOMultifd {
> +} VFIOMultifd;
> +
>   static void vfio_state_buffer_clear(gpointer data)
>   {
>       VFIOStateBuffer *lb = data;
> @@ -398,6 +401,18 @@ static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
>       return qemu_file_get_error(f);
>   }
>   
> +static VFIOMultifd *vfio_multifd_new(void)
> +{
> +    VFIOMultifd *multifd = g_new(VFIOMultifd, 1);
> +
> +    return multifd;
> +}
> +
> +static void vfio_multifd_free(VFIOMultifd *multifd)
> +{
> +    g_free(multifd);
> +}
> +
>   static void vfio_migration_cleanup(VFIODevice *vbasedev)
>   {
>       VFIOMigration *migration = vbasedev->migration;
> @@ -785,14 +800,47 @@ static void vfio_save_state(QEMUFile *f, void *opaque)
>   static int vfio_load_setup(QEMUFile *f, void *opaque, Error **errp)
>   {
>       VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +    int ret;
> +
> +    /*
> +     * Make a copy of this setting at the start in case it is changed
> +     * mid-migration.
> +     */
> +    if (vbasedev->migration_multifd_transfer == ON_OFF_AUTO_AUTO) {
> +        migration->multifd_transfer = vfio_multifd_transfer_supported();

The attribute "migration->multifd_transfer" is not necessary. It can be
replaced by a small inline helper testing the pointer migration->multifd,
and this routine can use a local variable instead.

I don't think the '_transfer' suffix adds much to the understanding.

> +    } else {
> +        migration->multifd_transfer =
> +            vbasedev->migration_multifd_transfer == ON_OFF_AUTO_ON;
> +    }
> +
> +    if (migration->multifd_transfer && !vfio_multifd_transfer_supported()) {
> +        error_setg(errp,
> +                   "%s: Multifd device transfer requested but unsupported in the current config",
> +                   vbasedev->name);
> +        return -EINVAL;
> +    }

The above checks are also introduced in vfio_save_setup(). Please
implement a common routine vfio_multifd_is_enabled() or some other
name.

> +    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RESUMING,
> +                                   migration->device_state, errp);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    if (migration->multifd_transfer) {
> +        assert(!migration->multifd);
> +        migration->multifd = vfio_multifd_new();
> +    }
>   
> -    return vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RESUMING,
> -                                    vbasedev->migration->device_state, errp);
> +    return 0;
>   }
>   
>   static int vfio_load_cleanup(void *opaque)
>   {
>       VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +
> +    g_clear_pointer(&migration->multifd, vfio_multifd_free);

please add a vfio_multifd_cleanup() routine.


>       vfio_migration_cleanup(vbasedev);
>       trace_vfio_load_cleanup(vbasedev->name);
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 153d03745dc7..c0c9c0b1b263 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -61,6 +61,8 @@ typedef struct VFIORegion {
>       uint8_t nr; /* cache the region number for debug */
>   } VFIORegion;
>   
> +typedef struct VFIOMultifd VFIOMultifd;
> +
>   typedef struct VFIOMigration {
>       struct VFIODevice *vbasedev;
>       VMChangeStateEntry *vm_state;
> @@ -72,6 +74,8 @@ typedef struct VFIOMigration {
>       uint64_t mig_flags;
>       uint64_t precopy_init_size;
>       uint64_t precopy_dirty_size;
> +    bool multifd_transfer;
> +    VFIOMultifd *multifd;
>       bool initial_data_sent;
>   
>       bool event_save_iterate_started;
> @@ -133,6 +137,7 @@ typedef struct VFIODevice {
>       bool no_mmap;
>       bool ram_block_discard_allowed;
>       OnOffAuto enable_migration;
> +    OnOffAuto migration_multifd_transfer;

This property should be added at the end of the series, with documentation,
and used in the vfio_multifd_some_name() routine I mentioned above.


Thanks,

C.



>       OnOffAuto migration_load_config_after_iter;
>       bool migration_events;
>       VFIODeviceOps *ops;
> 



^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH v4 27/33] vfio/migration: Multifd device state transfer support - received buffers queuing
  2025-01-30 10:08 ` [PATCH v4 27/33] vfio/migration: Multifd device state transfer support - received buffers queuing Maciej S. Szmigiero
@ 2025-02-12 13:47   ` Cédric Le Goater
  2025-02-14 20:58     ` Maciej S. Szmigiero
  0 siblings, 1 reply; 137+ messages in thread
From: Cédric Le Goater @ 2025-02-12 13:47 UTC (permalink / raw)
  To: Maciej S. Szmigiero, Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Eric Blake, Markus Armbruster,
	Daniel P . Berrangé, Avihai Horon, Joao Martins, qemu-devel

On 1/30/25 11:08, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> 
> The multifd received data needs to be reassembled since device state
> packets sent via different multifd channels can arrive out-of-order.
> 
> Therefore, each VFIO device state packet carries a header indicating its
> position in the stream.
> The raw device state data is saved into a VFIOStateBuffer for later
> in-order loading into the device.
> 
> The last such VFIO device state packet should have
> VFIO_DEVICE_STATE_CONFIG_STATE flag set and carry the device config state.
> 
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> ---
>   hw/vfio/migration.c           | 116 ++++++++++++++++++++++++++++++++++
>   hw/vfio/pci.c                 |   2 +
>   hw/vfio/trace-events          |   1 +
>   include/hw/vfio/vfio-common.h |   1 +
>   4 files changed, 120 insertions(+)
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index bcdf204d5cf4..0c0caec1bd64 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -301,6 +301,12 @@ typedef struct VFIOStateBuffer {
>   } VFIOStateBuffer;
>   
>   typedef struct VFIOMultifd {
> +    VFIOStateBuffers load_bufs;
> +    QemuCond load_bufs_buffer_ready_cond;
> +    QemuMutex load_bufs_mutex; /* Lock order: this lock -> BQL */
> +    uint32_t load_buf_idx;
> +    uint32_t load_buf_idx_last;
> +    uint32_t load_buf_queued_pending_buffers;
>   } VFIOMultifd;
>   
>   static void vfio_state_buffer_clear(gpointer data)
> @@ -346,6 +352,103 @@ static VFIOStateBuffer *vfio_state_buffers_at(VFIOStateBuffers *bufs, guint idx)
>       return &g_array_index(bufs->array, VFIOStateBuffer, idx);
>   }
>   
Each routine executed from a migration thread should have a preliminary
comment saying from which context it is called: migration or VFIO.

> +static bool vfio_load_state_buffer_insert(VFIODevice *vbasedev,
> +                                          VFIODeviceStatePacket *packet,
> +                                          size_t packet_total_size,
> +                                          Error **errp)
> +{
> +    VFIOMigration *migration = vbasedev->migration;
> +    VFIOMultifd *multifd = migration->multifd;
> +    VFIOStateBuffer *lb;
> +
> +    vfio_state_buffers_assert_init(&multifd->load_bufs);
> +    if (packet->idx >= vfio_state_buffers_size_get(&multifd->load_bufs)) {
> +        vfio_state_buffers_size_set(&multifd->load_bufs, packet->idx + 1);
> +    }
> +
> +    lb = vfio_state_buffers_at(&multifd->load_bufs, packet->idx);
> +    if (lb->is_present) {
> +        error_setg(errp, "state buffer %" PRIu32 " already filled",
> +                   packet->idx);
> +        return false;
> +    }
> +
> +    assert(packet->idx >= multifd->load_buf_idx);
> +
> +    multifd->load_buf_queued_pending_buffers++;
> +    if (multifd->load_buf_queued_pending_buffers >
> +        vbasedev->migration_max_queued_buffers) {
> +        error_setg(errp,
> +                   "queuing state buffer %" PRIu32 " would exceed the max of %" PRIu64,
> +                   packet->idx, vbasedev->migration_max_queued_buffers);
> +        return false;
> +    }

AFAICT, attributes multifd->load_buf_queued_pending_buffers and
vbasedev->migration_max_queued_buffers are not strictly necessary.
They allow counting buffers and checking them against an arbitrary
limit, which is UINT64_MAX today. It makes me wonder how useful they are.

Please introduce them in a separate patch at the end of the series,
adding documentation on the "x-migration-max-queued-buffers" property
and also general documentation on why and how to use it.

> +
> +    lb->data = g_memdup2(&packet->data, packet_total_size - sizeof(*packet));
> +    lb->len = packet_total_size - sizeof(*packet);
> +    lb->is_present = true;
> +
> +    return true;
> +}
> +
> +static bool vfio_load_state_buffer(void *opaque, char *data, size_t data_size,
> +                                   Error **errp)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +    VFIOMultifd *multifd = migration->multifd;
> +    VFIODeviceStatePacket *packet = (VFIODeviceStatePacket *)data;
> +
> +    /*
> +     * Holding BQL here would violate the lock order and can cause
> +     * a deadlock once we attempt to lock load_bufs_mutex below.
> +     */
> +    assert(!bql_locked());
> +
> +    if (!migration->multifd_transfer) {
> +        error_setg(errp,
> +                   "got device state packet but not doing multifd transfer");
> +        return false;
> +    }
> +
> +    assert(multifd);
> +
> +    if (data_size < sizeof(*packet)) {
> +        error_setg(errp, "packet too short at %zu (min is %zu)",
> +                   data_size, sizeof(*packet));
> +        return false;
> +    }
> +
> +    if (packet->version != 0) {

Please add a define for version, even if 0.

> +        error_setg(errp, "packet has unknown version %" PRIu32,
> +                   packet->version);
> +        return false;
> +    }
> +
> +    if (packet->idx == UINT32_MAX) {
> +        error_setg(errp, "packet has too high idx %" PRIu32,
> +                   packet->idx);

I don't think printing out packet->idx is useful here.

> +        return false;
> +    }
> +
> +    trace_vfio_load_state_device_buffer_incoming(vbasedev->name, packet->idx);

I wonder if we can add thread ids to trace events. It would be useful.

> +
> +    QEMU_LOCK_GUARD(&multifd->load_bufs_mutex);
> +
> +    /* config state packet should be the last one in the stream */
> +    if (packet->flags & VFIO_DEVICE_STATE_CONFIG_STATE) {
> +        multifd->load_buf_idx_last = packet->idx;
> +    }
> +
> +    if (!vfio_load_state_buffer_insert(vbasedev, packet, data_size, errp)) {

So the migration thread calling multifd_device_state_recv() will
exit, and the VFIO thread loading the state into the device will
hang until it's aborted?

This sequence is expected to be called to release the vfio thread

        while (multifd->load_bufs_thread_running) {
             multifd->load_bufs_thread_want_exit = true;

             qemu_cond_signal(&multifd->load_bufs_buffer_ready_cond);
	    ...
        }

right ?


The way the series is presented makes it a bit complex to follow the
proposition, especially regarding the creation and termination of
threads, something the reader should be aware of.

As an initial step in clarifying the design, I would have preferred
a series of patches introducing the various threads, migration threads
and VFIO threads, without any workload. Once the creation and termination
points are established I would then introduce the work load for each
thread.


Thanks,

C.




> +        return false;
> +    }
> +
> +    qemu_cond_signal(&multifd->load_bufs_buffer_ready_cond);
> +
> +    return true;
> +}
> +
>   static int vfio_save_device_config_state(QEMUFile *f, void *opaque,
>                                            Error **errp)
>   {
> @@ -405,11 +508,23 @@ static VFIOMultifd *vfio_multifd_new(void)
>   {
>       VFIOMultifd *multifd = g_new(VFIOMultifd, 1);
>   
> +    vfio_state_buffers_init(&multifd->load_bufs);
> +
> +    qemu_mutex_init(&multifd->load_bufs_mutex);
> +
> +    multifd->load_buf_idx = 0;
> +    multifd->load_buf_idx_last = UINT32_MAX;
> +    multifd->load_buf_queued_pending_buffers = 0;
> +    qemu_cond_init(&multifd->load_bufs_buffer_ready_cond);
> +
>       return multifd;
>   }
>   
>   static void vfio_multifd_free(VFIOMultifd *multifd)
>   {
> +    qemu_cond_destroy(&multifd->load_bufs_buffer_ready_cond);
> +    qemu_mutex_destroy(&multifd->load_bufs_mutex);
> +
>       g_free(multifd);
>   }
>   
> @@ -940,6 +1055,7 @@ static const SaveVMHandlers savevm_vfio_handlers = {
>       .load_setup = vfio_load_setup,
>       .load_cleanup = vfio_load_cleanup,
>       .load_state = vfio_load_state,
> +    .load_state_buffer = vfio_load_state_buffer,
>       .switchover_ack_needed = vfio_switchover_ack_needed,
>   };
>   
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 83090c544d95..2700b355ecf1 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -3380,6 +3380,8 @@ static const Property vfio_pci_dev_properties[] = {
>       DEFINE_PROP_ON_OFF_AUTO("x-migration-load-config-after-iter", VFIOPCIDevice,
>                               vbasedev.migration_load_config_after_iter,
>                               ON_OFF_AUTO_AUTO),
> +    DEFINE_PROP_UINT64("x-migration-max-queued-buffers", VFIOPCIDevice,
> +                       vbasedev.migration_max_queued_buffers, UINT64_MAX),
>       DEFINE_PROP_BOOL("migration-events", VFIOPCIDevice,
>                        vbasedev.migration_events, false),
>       DEFINE_PROP_BOOL("x-no-mmap", VFIOPCIDevice, vbasedev.no_mmap, false),
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index 1bebe9877d88..042a3dc54a33 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -153,6 +153,7 @@ vfio_load_device_config_state_start(const char *name) " (%s)"
>   vfio_load_device_config_state_end(const char *name) " (%s)"
>   vfio_load_state(const char *name, uint64_t data) " (%s) data 0x%"PRIx64
>   vfio_load_state_device_data(const char *name, uint64_t data_size, int ret) " (%s) size %"PRIu64" ret %d"
> +vfio_load_state_device_buffer_incoming(const char *name, uint32_t idx) " (%s) idx %"PRIu32
>   vfio_migration_realize(const char *name) " (%s)"
>   vfio_migration_set_device_state(const char *name, const char *state) " (%s) state %s"
>   vfio_migration_set_state(const char *name, const char *new_state, const char *recover_state) " (%s) new state %s, recover state %s"
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index c0c9c0b1b263..0e8b0848882e 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -139,6 +139,7 @@ typedef struct VFIODevice {
>       OnOffAuto enable_migration;
>       OnOffAuto migration_multifd_transfer;
>       OnOffAuto migration_load_config_after_iter;
> +    uint64_t migration_max_queued_buffers;
>       bool migration_events;
>       VFIODeviceOps *ops;
>       unsigned int num_irqs;
> 



^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH v4 28/33] vfio/migration: Multifd device state transfer support - load thread
  2025-01-30 10:08 ` [PATCH v4 28/33] vfio/migration: Multifd device state transfer support - load thread Maciej S. Szmigiero
@ 2025-02-12 15:48   ` Cédric Le Goater
  2025-02-12 16:19     ` Cédric Le Goater
  2025-02-17 22:09     ` Maciej S. Szmigiero
  0 siblings, 2 replies; 137+ messages in thread
From: Cédric Le Goater @ 2025-02-12 15:48 UTC (permalink / raw)
  To: Maciej S. Szmigiero, Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Eric Blake, Markus Armbruster,
	Daniel P . Berrangé, Avihai Horon, Joao Martins, qemu-devel

On 1/30/25 11:08, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> 
> Since it's important to finish loading device state transferred via the
> main migration channel (via save_live_iterate SaveVMHandler) before
> starting loading the data asynchronously transferred via multifd the thread
> doing the actual loading of the multifd transferred data is only started
> from switchover_start SaveVMHandler.
> 
> switchover_start handler is called when MIG_CMD_SWITCHOVER_START
> sub-command of QEMU_VM_COMMAND is received via the main migration channel.
> 
> This sub-command is only sent after all save_live_iterate data have already
> been posted so it is safe to commence loading of the multifd-transferred
> device state upon receiving it - loading of save_live_iterate data happens
> synchronously in the main migration thread (much like the processing of
> MIG_CMD_SWITCHOVER_START) so by the time MIG_CMD_SWITCHOVER_START is
> processed all the preceding data must have already been loaded.
> 
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> ---
>   hw/vfio/migration.c  | 229 +++++++++++++++++++++++++++++++++++++++++++
>   hw/vfio/trace-events |   5 +
>   2 files changed, 234 insertions(+)
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 0c0caec1bd64..ab5b097f59c9 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -301,8 +301,16 @@ typedef struct VFIOStateBuffer {
>   } VFIOStateBuffer;
>   
>   typedef struct VFIOMultifd {
> +    QemuThread load_bufs_thread;
> +    bool load_bufs_thread_running;
> +    bool load_bufs_thread_want_exit;
> +
> +    bool load_bufs_iter_done;
> +    QemuCond load_bufs_iter_done_cond;
> +
>       VFIOStateBuffers load_bufs;
>       QemuCond load_bufs_buffer_ready_cond;
> +    QemuCond load_bufs_thread_finished_cond;
>       QemuMutex load_bufs_mutex; /* Lock order: this lock -> BQL */
>       uint32_t load_buf_idx;
>       uint32_t load_buf_idx_last;
> @@ -449,6 +457,171 @@ static bool vfio_load_state_buffer(void *opaque, char *data, size_t data_size,
>       return true;
>   }
>   
> +static VFIOStateBuffer *vfio_load_state_buffer_get(VFIOMultifd *multifd)
> +{
> +    VFIOStateBuffer *lb;
> +    guint bufs_len;
> +
> +    bufs_len = vfio_state_buffers_size_get(&multifd->load_bufs);
> +    if (multifd->load_buf_idx >= bufs_len) {
> +        assert(multifd->load_buf_idx == bufs_len);
> +        return NULL;
> +    }
> +
> +    lb = vfio_state_buffers_at(&multifd->load_bufs,
> +                               multifd->load_buf_idx);
> +    if (!lb->is_present) {
> +        return NULL;
> +    }
> +
> +    return lb;
> +}
> +
> +static int vfio_load_bufs_thread_load_config(VFIODevice *vbasedev)
> +{
> +    return -EINVAL;
> +}
> +
> +static bool vfio_load_state_buffer_write(VFIODevice *vbasedev,
> +                                         VFIOStateBuffer *lb,
> +                                         Error **errp)
> +{
> +    VFIOMigration *migration = vbasedev->migration;
> +    VFIOMultifd *multifd = migration->multifd;
> +    g_autofree char *buf = NULL;
> +    char *buf_cur;
> +    size_t buf_len;
> +
> +    if (!lb->len) {
> +        return true;
> +    }
> +
> +    trace_vfio_load_state_device_buffer_load_start(vbasedev->name,
> +                                                   multifd->load_buf_idx);
> +
> +    /* lb might become re-allocated when we drop the lock */
> +    buf = g_steal_pointer(&lb->data);
> +    buf_cur = buf;
> +    buf_len = lb->len;
> +    while (buf_len > 0) {
> +        ssize_t wr_ret;
> +        int errno_save;
> +
> +        /*
> +         * Loading data to the device takes a while,
> +         * drop the lock during this process.
> +         */
> +        qemu_mutex_unlock(&multifd->load_bufs_mutex);
> +        wr_ret = write(migration->data_fd, buf_cur, buf_len);
> +        errno_save = errno;
> +        qemu_mutex_lock(&multifd->load_bufs_mutex);
> +
> +        if (wr_ret < 0) {
> +            error_setg(errp,
> +                       "writing state buffer %" PRIu32 " failed: %d",
> +                       multifd->load_buf_idx, errno_save);
> +            return false;
> +        }
> +
> +        assert(wr_ret <= buf_len);
> +        buf_len -= wr_ret;
> +        buf_cur += wr_ret;
> +    }
> +
> +    trace_vfio_load_state_device_buffer_load_end(vbasedev->name,
> +                                                 multifd->load_buf_idx);
> +
> +    return true;
> +}
> +
> +static bool vfio_load_bufs_thread_want_abort(VFIOMultifd *multifd,
> +                                             bool *should_quit)
> +{
> +    return multifd->load_bufs_thread_want_exit || qatomic_read(should_quit);
> +}

_abort or _exit or _quit ? I would opt for vfio_load_bufs_thread_want_exit()
to match multifd->load_bufs_thread_want_exit.


> +static bool vfio_load_bufs_thread(void *opaque, bool *should_quit, Error **errp)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +    VFIOMultifd *multifd = migration->multifd;
> +    bool ret = true;
> +    int config_ret;
> +
> +    assert(multifd);
> +    QEMU_LOCK_GUARD(&multifd->load_bufs_mutex);
> +
> +    assert(multifd->load_bufs_thread_running);
> +
> +    while (!vfio_load_bufs_thread_want_abort(multifd, should_quit)) {
> +        VFIOStateBuffer *lb;
> +
> +        assert(multifd->load_buf_idx <= multifd->load_buf_idx_last);
> +
> +        lb = vfio_load_state_buffer_get(multifd);
> +        if (!lb) {
> +            trace_vfio_load_state_device_buffer_starved(vbasedev->name,
> +                                                        multifd->load_buf_idx);
> +            qemu_cond_wait(&multifd->load_bufs_buffer_ready_cond,
> +                           &multifd->load_bufs_mutex);
> +            continue;
> +        }
> +
> +        if (multifd->load_buf_idx == multifd->load_buf_idx_last) {
> +            break;
> +        }
> +
> +        if (multifd->load_buf_idx == 0) {
> +            trace_vfio_load_state_device_buffer_start(vbasedev->name);
> +        }
> +
> +        if (!vfio_load_state_buffer_write(vbasedev, lb, errp)) {
> +            ret = false;
> +            goto ret_signal;
> +        }
> +
> +        assert(multifd->load_buf_queued_pending_buffers > 0);
> +        multifd->load_buf_queued_pending_buffers--;
> +
> +        if (multifd->load_buf_idx == multifd->load_buf_idx_last - 1) {
> +            trace_vfio_load_state_device_buffer_end(vbasedev->name);
> +        }
> +
> +        multifd->load_buf_idx++;
> +    }
> +
> +    if (vfio_load_bufs_thread_want_abort(multifd, should_quit)) {
> +        error_setg(errp, "operation cancelled");
> +        ret = false;
> +        goto ret_signal;
> +    }
> +
> +    if (vfio_load_config_after_iter(vbasedev)) {
> +        while (!multifd->load_bufs_iter_done) {
> +            qemu_cond_wait(&multifd->load_bufs_iter_done_cond,
> +                           &multifd->load_bufs_mutex);
> +
> +            if (vfio_load_bufs_thread_want_abort(multifd, should_quit)) {
> +                error_setg(errp, "operation cancelled");
> +                ret = false;
> +                goto ret_signal;
> +            }
> +        }
> +    }

Please put the above chunk at the end of the series with the patch
adding ARM support. I think load_bufs_iter_done_cond should be moved
out of this patch too.



Thanks,

C.



> +    config_ret = vfio_load_bufs_thread_load_config(vbasedev);
> +    if (config_ret) {
> +        error_setg(errp, "load config state failed: %d", config_ret);
> +        ret = false;
> +    }
> +
> +ret_signal:
> +    multifd->load_bufs_thread_running = false;
> +    qemu_cond_signal(&multifd->load_bufs_thread_finished_cond);
> +
> +    return ret;
> +}
> +
>   static int vfio_save_device_config_state(QEMUFile *f, void *opaque,
>                                            Error **errp)
>   {
> @@ -517,11 +690,40 @@ static VFIOMultifd *vfio_multifd_new(void)
>       multifd->load_buf_queued_pending_buffers = 0;
>       qemu_cond_init(&multifd->load_bufs_buffer_ready_cond);
>   
> +    multifd->load_bufs_iter_done = false;
> +    qemu_cond_init(&multifd->load_bufs_iter_done_cond);
> +
> +    multifd->load_bufs_thread_running = false;
> +    multifd->load_bufs_thread_want_exit = false;
> +    qemu_cond_init(&multifd->load_bufs_thread_finished_cond);
> +
>       return multifd;
>   }
>   
> +static void vfio_load_cleanup_load_bufs_thread(VFIOMultifd *multifd)
> +{
> +    /* The lock order is load_bufs_mutex -> BQL so unlock BQL here first */
> +    bql_unlock();
> +    WITH_QEMU_LOCK_GUARD(&multifd->load_bufs_mutex) {
> +        while (multifd->load_bufs_thread_running) {
> +            multifd->load_bufs_thread_want_exit = true;
> +
> +            qemu_cond_signal(&multifd->load_bufs_buffer_ready_cond);
> +            qemu_cond_signal(&multifd->load_bufs_iter_done_cond);
> +            qemu_cond_wait(&multifd->load_bufs_thread_finished_cond,
> +                           &multifd->load_bufs_mutex);
> +        }
> +    }
> +    bql_lock();
> +}
> +
>   static void vfio_multifd_free(VFIOMultifd *multifd)
>   {
> +    vfio_load_cleanup_load_bufs_thread(multifd);
> +
> +    qemu_cond_destroy(&multifd->load_bufs_thread_finished_cond);
> +    qemu_cond_destroy(&multifd->load_bufs_iter_done_cond);
> +    vfio_state_buffers_destroy(&multifd->load_bufs);
>       qemu_cond_destroy(&multifd->load_bufs_buffer_ready_cond);
>       qemu_mutex_destroy(&multifd->load_bufs_mutex);
>   
> @@ -1042,6 +1244,32 @@ static bool vfio_switchover_ack_needed(void *opaque)
>       return vfio_precopy_supported(vbasedev);
>   }
>   
> +static int vfio_switchover_start(void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +    VFIOMultifd *multifd = migration->multifd;
> +
> +    if (!migration->multifd_transfer) {
> +        /* Load thread is only used for multifd transfer */
> +        return 0;
> +    }
> +
> +    assert(multifd);
> +
> +    /* The lock order is load_bufs_mutex -> BQL so unlock BQL here first */
> +    bql_unlock();
> +    WITH_QEMU_LOCK_GUARD(&multifd->load_bufs_mutex) {
> +        assert(!multifd->load_bufs_thread_running);
> +        multifd->load_bufs_thread_running = true;
> +    }
> +    bql_lock();
> +
> +    qemu_loadvm_start_load_thread(vfio_load_bufs_thread, vbasedev);
> +
> +    return 0;
> +}
> +
>   static const SaveVMHandlers savevm_vfio_handlers = {
>       .save_prepare = vfio_save_prepare,
>       .save_setup = vfio_save_setup,
> @@ -1057,6 +1285,7 @@ static const SaveVMHandlers savevm_vfio_handlers = {
>       .load_state = vfio_load_state,
>       .load_state_buffer = vfio_load_state_buffer,
>       .switchover_ack_needed = vfio_switchover_ack_needed,
> +    .switchover_start = vfio_switchover_start,
>   };
>   
>   /* ---------------------------------------------------------------------- */
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index 042a3dc54a33..418b378ebd29 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -154,6 +154,11 @@ vfio_load_device_config_state_end(const char *name) " (%s)"
>   vfio_load_state(const char *name, uint64_t data) " (%s) data 0x%"PRIx64
>   vfio_load_state_device_data(const char *name, uint64_t data_size, int ret) " (%s) size %"PRIu64" ret %d"
>   vfio_load_state_device_buffer_incoming(const char *name, uint32_t idx) " (%s) idx %"PRIu32
> +vfio_load_state_device_buffer_start(const char *name) " (%s)"
> +vfio_load_state_device_buffer_starved(const char *name, uint32_t idx) " (%s) idx %"PRIu32
> +vfio_load_state_device_buffer_load_start(const char *name, uint32_t idx) " (%s) idx %"PRIu32
> +vfio_load_state_device_buffer_load_end(const char *name, uint32_t idx) " (%s) idx %"PRIu32
> +vfio_load_state_device_buffer_end(const char *name) " (%s)"
>   vfio_migration_realize(const char *name) " (%s)"
>   vfio_migration_set_device_state(const char *name, const char *state) " (%s) state %s"
>   vfio_migration_set_state(const char *name, const char *new_state, const char *recover_state) " (%s) new state %s, recover state %s"
> 



^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH v4 28/33] vfio/migration: Multifd device state transfer support - load thread
  2025-02-12 15:48   ` Cédric Le Goater
@ 2025-02-12 16:19     ` Cédric Le Goater
  2025-02-17 22:09       ` Maciej S. Szmigiero
  2025-02-17 22:09     ` Maciej S. Szmigiero
  1 sibling, 1 reply; 137+ messages in thread
From: Cédric Le Goater @ 2025-02-12 16:19 UTC (permalink / raw)
  To: Maciej S. Szmigiero, Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Eric Blake, Markus Armbruster,
	Daniel P . Berrangé, Avihai Horon, Joao Martins, qemu-devel

On 2/12/25 16:48, Cédric Le Goater wrote:
> On 1/30/25 11:08, Maciej S. Szmigiero wrote:
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> Since it's important to finish loading device state transferred via the
>> main migration channel (via save_live_iterate SaveVMHandler) before
>> starting loading the data asynchronously transferred via multifd the thread
>> doing the actual loading of the multifd transferred data is only started
>> from switchover_start SaveVMHandler.
>>
>> switchover_start handler is called when MIG_CMD_SWITCHOVER_START
>> sub-command of QEMU_VM_COMMAND is received via the main migration channel.
>>
>> This sub-command is only sent after all save_live_iterate data have already
>> been posted so it is safe to commence loading of the multifd-transferred
>> device state upon receiving it - loading of save_live_iterate data happens
>> synchronously in the main migration thread (much like the processing of
>> MIG_CMD_SWITCHOVER_START) so by the time MIG_CMD_SWITCHOVER_START is
>> processed all the preceding data must have already been loaded.
>>
>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>> ---
>>   hw/vfio/migration.c  | 229 +++++++++++++++++++++++++++++++++++++++++++
>>   hw/vfio/trace-events |   5 +
>>   2 files changed, 234 insertions(+)
>>
>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>> index 0c0caec1bd64..ab5b097f59c9 100644
>> --- a/hw/vfio/migration.c
>> +++ b/hw/vfio/migration.c
>> @@ -301,8 +301,16 @@ typedef struct VFIOStateBuffer {
>>   } VFIOStateBuffer;
>>   typedef struct VFIOMultifd {
>> +    QemuThread load_bufs_thread;
>> +    bool load_bufs_thread_running;
>> +    bool load_bufs_thread_want_exit;
>> +
>> +    bool load_bufs_iter_done;
>> +    QemuCond load_bufs_iter_done_cond;
>> +
>>       VFIOStateBuffers load_bufs;
>>       QemuCond load_bufs_buffer_ready_cond;
>> +    QemuCond load_bufs_thread_finished_cond;
>>       QemuMutex load_bufs_mutex; /* Lock order: this lock -> BQL */
>>       uint32_t load_buf_idx;
>>       uint32_t load_buf_idx_last;
>> @@ -449,6 +457,171 @@ static bool vfio_load_state_buffer(void *opaque, char *data, size_t data_size,
>>       return true;
>>   }
>> +static VFIOStateBuffer *vfio_load_state_buffer_get(VFIOMultifd *multifd)
>> +{
>> +    VFIOStateBuffer *lb;
>> +    guint bufs_len;
>> +
>> +    bufs_len = vfio_state_buffers_size_get(&multifd->load_bufs);
>> +    if (multifd->load_buf_idx >= bufs_len) {
>> +        assert(multifd->load_buf_idx == bufs_len);
>> +        return NULL;
>> +    }
>> +
>> +    lb = vfio_state_buffers_at(&multifd->load_bufs,
>> +                               multifd->load_buf_idx);
>> +    if (!lb->is_present) {
>> +        return NULL;
>> +    }
>> +
>> +    return lb;
>> +}
>> +
>> +static int vfio_load_bufs_thread_load_config(VFIODevice *vbasedev)
>> +{
>> +    return -EINVAL;
>> +}
>> +
>> +static bool vfio_load_state_buffer_write(VFIODevice *vbasedev,
>> +                                         VFIOStateBuffer *lb,
>> +                                         Error **errp)
>> +{
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    VFIOMultifd *multifd = migration->multifd;
>> +    g_autofree char *buf = NULL;
>> +    char *buf_cur;
>> +    size_t buf_len;
>> +
>> +    if (!lb->len) {
>> +        return true;
>> +    }
>> +
>> +    trace_vfio_load_state_device_buffer_load_start(vbasedev->name,
>> +                                                   multifd->load_buf_idx);
>> +
>> +    /* lb might become re-allocated when we drop the lock */
>> +    buf = g_steal_pointer(&lb->data);
>> +    buf_cur = buf;
>> +    buf_len = lb->len;
>> +    while (buf_len > 0) {
>> +        ssize_t wr_ret;
>> +        int errno_save;
>> +
>> +        /*
>> +         * Loading data to the device takes a while,
>> +         * drop the lock during this process.
>> +         */
>> +        qemu_mutex_unlock(&multifd->load_bufs_mutex);
>> +        wr_ret = write(migration->data_fd, buf_cur, buf_len);
>> +        errno_save = errno;
>> +        qemu_mutex_lock(&multifd->load_bufs_mutex);
>> +
>> +        if (wr_ret < 0) {
>> +            error_setg(errp,
>> +                       "writing state buffer %" PRIu32 " failed: %d",
>> +                       multifd->load_buf_idx, errno_save);
>> +            return false;
>> +        }
>> +
>> +        assert(wr_ret <= buf_len);
>> +        buf_len -= wr_ret;
>> +        buf_cur += wr_ret;
>> +    }
>> +
>> +    trace_vfio_load_state_device_buffer_load_end(vbasedev->name,
>> +                                                 multifd->load_buf_idx);
>> +
>> +    return true;
>> +}
>> +
>> +static bool vfio_load_bufs_thread_want_abort(VFIOMultifd *multifd,
>> +                                             bool *should_quit)
>> +{
>> +    return multifd->load_bufs_thread_want_exit || qatomic_read(should_quit);
>> +}
> 
> _abort or _exit or _quit ? I would opt for vfio_load_bufs_thread_want_exit()
> to match multifd->load_bufs_thread_want_exit.
> 
> 
>> +static bool vfio_load_bufs_thread(void *opaque, bool *should_quit, Error **errp)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    VFIOMultifd *multifd = migration->multifd;
>> +    bool ret = true;
>> +    int config_ret;
>> +
>> +    assert(multifd);
>> +    QEMU_LOCK_GUARD(&multifd->load_bufs_mutex);
>> +
>> +    assert(multifd->load_bufs_thread_running);
>> +
>> +    while (!vfio_load_bufs_thread_want_abort(multifd, should_quit)) {
>> +        VFIOStateBuffer *lb;
>> +
>> +        assert(multifd->load_buf_idx <= multifd->load_buf_idx_last);
>> +
>> +        lb = vfio_load_state_buffer_get(multifd);
>> +        if (!lb) {
>> +            trace_vfio_load_state_device_buffer_starved(vbasedev->name,
>> +                                                        multifd->load_buf_idx);
>> +            qemu_cond_wait(&multifd->load_bufs_buffer_ready_cond,
>> +                           &multifd->load_bufs_mutex);
>> +            continue;
>> +        }
>> +
>> +        if (multifd->load_buf_idx == multifd->load_buf_idx_last) {
>> +            break;
>> +        }
>> +
>> +        if (multifd->load_buf_idx == 0) {
>> +            trace_vfio_load_state_device_buffer_start(vbasedev->name);
>> +        }
>> +
>> +        if (!vfio_load_state_buffer_write(vbasedev, lb, errp)) {
>> +            ret = false;
>> +            goto ret_signal;
>> +        }
>> +
>> +        assert(multifd->load_buf_queued_pending_buffers > 0);
>> +        multifd->load_buf_queued_pending_buffers--;
>> +
>> +        if (multifd->load_buf_idx == multifd->load_buf_idx_last - 1) {
>> +            trace_vfio_load_state_device_buffer_end(vbasedev->name);
>> +        }
>> +
>> +        multifd->load_buf_idx++;
>> +    }
>> +
>> +    if (vfio_load_bufs_thread_want_abort(multifd, should_quit)) {
>> +        error_setg(errp, "operation cancelled");
>> +        ret = false;
>> +        goto ret_signal;
>> +    }
>> +
>> +    if (vfio_load_config_after_iter(vbasedev)) {
>> +        while (!multifd->load_bufs_iter_done) {
>> +            qemu_cond_wait(&multifd->load_bufs_iter_done_cond,
>> +                           &multifd->load_bufs_mutex);
>> +
>> +            if (vfio_load_bufs_thread_want_abort(multifd, should_quit)) {
>> +                error_setg(errp, "operation cancelled");
>> +                ret = false;
>> +                goto ret_signal;
>> +            }
>> +        }
>> +    }
> 
> Please put the above chunk at the end of the series with the patch
> adding ARM support. I think load_bufs_iter_done_cond should be moved
> out of this patch too.
> 
> 
> 
> Thanks,
> 
> C.
> 
> 
> 
>> +    config_ret = vfio_load_bufs_thread_load_config(vbasedev);
>> +    if (config_ret) {
>> +        error_setg(errp, "load config state failed: %d", config_ret);
>> +        ret = false;
>> +    }
>> +
>> +ret_signal:
>> +    multifd->load_bufs_thread_running = false;
>> +    qemu_cond_signal(&multifd->load_bufs_thread_finished_cond);
>> +
>> +    return ret;
>> +}
>> +
>>   static int vfio_save_device_config_state(QEMUFile *f, void *opaque,
>>                                            Error **errp)
>>   {
>> @@ -517,11 +690,40 @@ static VFIOMultifd *vfio_multifd_new(void)
>>       multifd->load_buf_queued_pending_buffers = 0;
>>       qemu_cond_init(&multifd->load_bufs_buffer_ready_cond);
>> +    multifd->load_bufs_iter_done = false;
>> +    qemu_cond_init(&multifd->load_bufs_iter_done_cond);
>> +
>> +    multifd->load_bufs_thread_running = false;
>> +    multifd->load_bufs_thread_want_exit = false;
>> +    qemu_cond_init(&multifd->load_bufs_thread_finished_cond);
>> +
>>       return multifd;
>>   }
>> +static void vfio_load_cleanup_load_bufs_thread(VFIOMultifd *multifd)
>> +{
>> +    /* The lock order is load_bufs_mutex -> BQL so unlock BQL here first */
>> +    bql_unlock();
>> +    WITH_QEMU_LOCK_GUARD(&multifd->load_bufs_mutex) {
>> +        while (multifd->load_bufs_thread_running) {
>> +            multifd->load_bufs_thread_want_exit = true;
>> +
>> +            qemu_cond_signal(&multifd->load_bufs_buffer_ready_cond);
>> +            qemu_cond_signal(&multifd->load_bufs_iter_done_cond);
>> +            qemu_cond_wait(&multifd->load_bufs_thread_finished_cond,
>> +                           &multifd->load_bufs_mutex);
>> +        }
>> +    }
>> +    bql_lock();
>> +}
>> +
>>   static void vfio_multifd_free(VFIOMultifd *multifd)
>>   {
>> +    vfio_load_cleanup_load_bufs_thread(multifd);
>> +
>> +    qemu_cond_destroy(&multifd->load_bufs_thread_finished_cond);
>> +    qemu_cond_destroy(&multifd->load_bufs_iter_done_cond);
>> +    vfio_state_buffers_destroy(&multifd->load_bufs);
>>       qemu_cond_destroy(&multifd->load_bufs_buffer_ready_cond);
>>       qemu_mutex_destroy(&multifd->load_bufs_mutex);
>> @@ -1042,6 +1244,32 @@ static bool vfio_switchover_ack_needed(void *opaque)
>>       return vfio_precopy_supported(vbasedev);
>>   }
>> +static int vfio_switchover_start(void *opaque)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    VFIOMultifd *multifd = migration->multifd;
>> +
>> +    if (!migration->multifd_transfer) {
>> +        /* Load thread is only used for multifd transfer */
>> +        return 0;
>> +    }
>> +
>> +    assert(multifd);
>> +
>> +    /* The lock order is load_bufs_mutex -> BQL so unlock BQL here first */
>> +    bql_unlock();
>> +    WITH_QEMU_LOCK_GUARD(&multifd->load_bufs_mutex) {
>> +        assert(!multifd->load_bufs_thread_running);
>> +        multifd->load_bufs_thread_running = true;
>> +    }
>> +    bql_lock();
>> +
>> +    qemu_loadvm_start_load_thread(vfio_load_bufs_thread, vbasedev);

and please move these changes under a vfio_multifd_switchover_start()
routine.


Thanks,

C.



>> +    return 0;
>> +}
>> +
>>   static const SaveVMHandlers savevm_vfio_handlers = {
>>       .save_prepare = vfio_save_prepare,
>>       .save_setup = vfio_save_setup,
>> @@ -1057,6 +1285,7 @@ static const SaveVMHandlers savevm_vfio_handlers = {
>>       .load_state = vfio_load_state,
>>       .load_state_buffer = vfio_load_state_buffer,
>>       .switchover_ack_needed = vfio_switchover_ack_needed,
>> +    .switchover_start = vfio_switchover_start,
>>   };
>>   /* ---------------------------------------------------------------------- */
>> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
>> index 042a3dc54a33..418b378ebd29 100644
>> --- a/hw/vfio/trace-events
>> +++ b/hw/vfio/trace-events
>> @@ -154,6 +154,11 @@ vfio_load_device_config_state_end(const char *name) " (%s)"
>>   vfio_load_state(const char *name, uint64_t data) " (%s) data 0x%"PRIx64
>>   vfio_load_state_device_data(const char *name, uint64_t data_size, int ret) " (%s) size %"PRIu64" ret %d"
>>   vfio_load_state_device_buffer_incoming(const char *name, uint32_t idx) " (%s) idx %"PRIu32
>> +vfio_load_state_device_buffer_start(const char *name) " (%s)"
>> +vfio_load_state_device_buffer_starved(const char *name, uint32_t idx) " (%s) idx %"PRIu32
>> +vfio_load_state_device_buffer_load_start(const char *name, uint32_t idx) " (%s) idx %"PRIu32
>> +vfio_load_state_device_buffer_load_end(const char *name, uint32_t idx) " (%s) idx %"PRIu32
>> +vfio_load_state_device_buffer_end(const char *name) " (%s)"
>>   vfio_migration_realize(const char *name) " (%s)"
>>   vfio_migration_set_device_state(const char *name, const char *state) " (%s) state %s"
>>   vfio_migration_set_state(const char *name, const char *new_state, const char *recover_state) " (%s) new state %s, recover state %s"
>>
> 



^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH v4 29/33] vfio/migration: Multifd device state transfer support - config loading support
  2025-01-30 10:08 ` [PATCH v4 29/33] vfio/migration: Multifd device state transfer support - config loading support Maciej S. Szmigiero
@ 2025-02-12 16:21   ` Cédric Le Goater
  2025-02-17 22:09     ` Maciej S. Szmigiero
  0 siblings, 1 reply; 137+ messages in thread
From: Cédric Le Goater @ 2025-02-12 16:21 UTC (permalink / raw)
  To: Maciej S. Szmigiero, Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Eric Blake, Markus Armbruster,
	Daniel P . Berrangé, Avihai Horon, Joao Martins, qemu-devel

On 1/30/25 11:08, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> 
> Load device config received via multifd using the existing machinery
> behind vfio_load_device_config_state().
> 
> Also, make sure to process the relevant main migration channel flags.
> 
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> ---
>   hw/vfio/migration.c | 103 +++++++++++++++++++++++++++++++++++++++++---
>   1 file changed, 98 insertions(+), 5 deletions(-)
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index ab5b097f59c9..31f651ffee85 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -15,6 +15,7 @@
>   #include <linux/vfio.h>
>   #include <sys/ioctl.h>
>   
> +#include "io/channel-buffer.h"
>   #include "system/runstate.h"
>   #include "hw/vfio/vfio-common.h"
>   #include "migration/misc.h"
> @@ -457,6 +458,57 @@ static bool vfio_load_state_buffer(void *opaque, char *data, size_t data_size,
>       return true;
>   }
>   
> +static int vfio_load_device_config_state(QEMUFile *f, void *opaque);
> +
> +static int vfio_load_bufs_thread_load_config(VFIODevice *vbasedev)
> +{
> +    VFIOMigration *migration = vbasedev->migration;
> +    VFIOMultifd *multifd = migration->multifd;
> +    VFIOStateBuffer *lb;
> +    g_autoptr(QIOChannelBuffer) bioc = NULL;
> +    QEMUFile *f_out = NULL, *f_in = NULL;
> +    uint64_t mig_header;
> +    int ret;
> +
> +    assert(multifd->load_buf_idx == multifd->load_buf_idx_last);
> +    lb = vfio_state_buffers_at(&multifd->load_bufs, multifd->load_buf_idx);
> +    assert(lb->is_present);
> +
> +    bioc = qio_channel_buffer_new(lb->len);
> +    qio_channel_set_name(QIO_CHANNEL(bioc), "vfio-device-config-load");
> +
> +    f_out = qemu_file_new_output(QIO_CHANNEL(bioc));
> +    qemu_put_buffer(f_out, (uint8_t *)lb->data, lb->len);
> +
> +    ret = qemu_fflush(f_out);
> +    if (ret) {
> +        g_clear_pointer(&f_out, qemu_fclose);
> +        return ret;
> +    }
> +
> +    qio_channel_io_seek(QIO_CHANNEL(bioc), 0, 0, NULL);
> +    f_in = qemu_file_new_input(QIO_CHANNEL(bioc));
> +
> +    mig_header = qemu_get_be64(f_in);
> +    if (mig_header != VFIO_MIG_FLAG_DEV_CONFIG_STATE) {
> +        g_clear_pointer(&f_out, qemu_fclose);
> +        g_clear_pointer(&f_in, qemu_fclose);
> +        return -EINVAL;
> +    }
> +
> +    bql_lock();
> +    ret = vfio_load_device_config_state(f_in, vbasedev);
> +    bql_unlock();
> +
> +    g_clear_pointer(&f_out, qemu_fclose);
> +    g_clear_pointer(&f_in, qemu_fclose);
> +    if (ret < 0) {
> +        return ret;
> +    }
> +
> +    return 0;
> +}
> +
>   static VFIOStateBuffer *vfio_load_state_buffer_get(VFIOMultifd *multifd)
>   {
>       VFIOStateBuffer *lb;
> @@ -477,11 +529,6 @@ static VFIOStateBuffer *vfio_load_state_buffer_get(VFIOMultifd *multifd)
>       return lb;
>   }
>   
> -static int vfio_load_bufs_thread_load_config(VFIODevice *vbasedev)
> -{
> -    return -EINVAL;
> -}

Please remove this change from this patch and from patch 28.

>   static bool vfio_load_state_buffer_write(VFIODevice *vbasedev,
>                                            VFIOStateBuffer *lb,
>                                            Error **errp)
> @@ -1168,6 +1215,8 @@ static int vfio_load_cleanup(void *opaque)
>   static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
>   {
>       VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +    VFIOMultifd *multifd = migration->multifd;
>       int ret = 0;
>       uint64_t data;
>   
> @@ -1179,6 +1228,12 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
>           switch (data) {
>           case VFIO_MIG_FLAG_DEV_CONFIG_STATE:
>           {
> +            if (migration->multifd_transfer) {
> +                error_report("%s: got DEV_CONFIG_STATE but doing multifd transfer",
> +                             vbasedev->name);
> +                return -EINVAL;
> +            }
> +
>               return vfio_load_device_config_state(f, opaque);
>           }
>           case VFIO_MIG_FLAG_DEV_SETUP_STATE:
> @@ -1223,6 +1278,44 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
>   
>               return ret;
>           }
> +        case VFIO_MIG_FLAG_DEV_CONFIG_LOAD_READY:
> +        {
> +            if (!migration->multifd_transfer) {
> +                error_report("%s: got DEV_CONFIG_LOAD_READY outside multifd transfer",
> +                             vbasedev->name);
> +                return -EINVAL;
> +            }
> +
> +            if (!vfio_load_config_after_iter(vbasedev)) {
> +                error_report("%s: got DEV_CONFIG_LOAD_READY but was disabled",
> +                             vbasedev->name);
> +                return -EINVAL;
> +            }

Please put the above chunk at the end of the series with the patch
adding ARM support.


> +            assert(multifd);
> +
> +            /* The lock order is load_bufs_mutex -> BQL so unlock BQL here first */
> +            bql_unlock();
> +            WITH_QEMU_LOCK_GUARD(&multifd->load_bufs_mutex) {
> +                if (multifd->load_bufs_iter_done) {
> +                    /* Can't print error here as we're outside BQL */
> +                    ret = -EINVAL;
> +                    break;
> +                }
> +
> +                multifd->load_bufs_iter_done = true;
> +                qemu_cond_signal(&multifd->load_bufs_iter_done_cond);
> +                ret = 0;
> +            }
> +            bql_lock();

Please introduce a vfio_multifd routine for the code above.



Thanks,

C.


> +
> +            if (ret) {
> +                error_report("%s: duplicate DEV_CONFIG_LOAD_READY",
> +                             vbasedev->name);
> +            }
> +            return ret;
> +        }
>           default:
>               error_report("%s: Unknown tag 0x%"PRIx64, vbasedev->name, data);
>               return -EINVAL;
> 



^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH v4 31/33] vfio/migration: Multifd device state transfer support - send side
  2025-01-30 10:08 ` [PATCH v4 31/33] vfio/migration: Multifd device state transfer support - send side Maciej S. Szmigiero
@ 2025-02-12 17:03   ` Cédric Le Goater
  2025-02-17 22:12     ` Maciej S. Szmigiero
  0 siblings, 1 reply; 137+ messages in thread
From: Cédric Le Goater @ 2025-02-12 17:03 UTC (permalink / raw)
  To: Maciej S. Szmigiero, Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Eric Blake, Markus Armbruster,
	Daniel P . Berrangé, Avihai Horon, Joao Martins, qemu-devel

On 1/30/25 11:08, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> 
> Implement the multifd device state transfer via additional per-device
> thread inside save_live_complete_precopy_thread handler.
> 
> Switch between doing the data transfer in the new handler and doing it
> in the old save_state handler depending on the
> x-migration-multifd-transfer device property value.
> 
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> ---
>   hw/vfio/migration.c  | 159 +++++++++++++++++++++++++++++++++++++++++++
>   hw/vfio/trace-events |   2 +
>   2 files changed, 161 insertions(+)
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 31f651ffee85..37d1c0f3d32f 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -943,6 +943,24 @@ static int vfio_save_setup(QEMUFile *f, void *opaque, Error **errp)
>       uint64_t stop_copy_size = VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE;
>       int ret;
>   
> +    /*
> +     * Make a copy of this setting at the start in case it is changed
> +     * mid-migration.
> +     */
> +    if (vbasedev->migration_multifd_transfer == ON_OFF_AUTO_AUTO) {
> +        migration->multifd_transfer = vfio_multifd_transfer_supported();
> +    } else {
> +        migration->multifd_transfer =
> +            vbasedev->migration_multifd_transfer == ON_OFF_AUTO_ON;
> +    }
> +
> +    if (migration->multifd_transfer && !vfio_multifd_transfer_supported()) {
> +        error_setg(errp,
> +                   "%s: Multifd device transfer requested but unsupported in the current config",
> +                   vbasedev->name);
> +        return -EINVAL;
> +    }

Please implement a common routine vfio_multifd_is_enabled() that can be
shared with vfio_load_setup().

> +
>       qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
>   
>       vfio_query_stop_copy_size(vbasedev, &stop_copy_size);
> @@ -1114,13 +1132,32 @@ static int vfio_save_iterate(QEMUFile *f, void *opaque)
>       return !migration->precopy_init_size && !migration->precopy_dirty_size;
>   }
>   
> +static void vfio_save_multifd_emit_dummy_eos(VFIODevice *vbasedev, QEMUFile *f)

I would prefer naming it vfio_multifd_emit_dummy_eos().

> +{
> +    VFIOMigration *migration = vbasedev->migration;
> +
> +    assert(migration->multifd_transfer);
> +
> +    /*
> +     * Emit dummy NOP data on the main migration channel since the actual
> +     * device state transfer is done via multifd channels.
> +     */
> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> +}
> +
>   static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
>   {
>       VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
>       ssize_t data_size;
>       int ret;
>       Error *local_err = NULL;
>   
> +    if (migration->multifd_transfer) {
> +        vfio_save_multifd_emit_dummy_eos(vbasedev, f);
> +        return 0;
> +    }
> +
>       trace_vfio_save_complete_precopy_start(vbasedev->name);
>   
>       /* We reach here with device state STOP or STOP_COPY only */
> @@ -1146,12 +1183,133 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
>       return ret;
>   }
>   
> +static int
> +vfio_save_complete_precopy_async_thread_config_state(VFIODevice *vbasedev,
> +                                                     char *idstr,
> +                                                     uint32_t instance_id,
> +                                                     uint32_t idx)

Why use 'async_thread' in the name?

vfio_save_complete_precopy_config_state() should be enough to refer
to its caller vfio_save_complete_precopy_thread(). Please add
an 'Error **' argument too.


> +{
> +    g_autoptr(QIOChannelBuffer) bioc = NULL;
> +    g_autoptr(QEMUFile) f = NULL;
> +    int ret;
> +    g_autofree VFIODeviceStatePacket *packet = NULL;
> +    size_t packet_len;
> +
> +    bioc = qio_channel_buffer_new(0);
> +    qio_channel_set_name(QIO_CHANNEL(bioc), "vfio-device-config-save");
> +
> +    f = qemu_file_new_output(QIO_CHANNEL(bioc));
> +
> +    ret = vfio_save_device_config_state(f, vbasedev, NULL);

I would prefer that we catch the error and propagate it to the caller.

> +    if (ret) {
> +        return ret;
> +    }
> +
> +    ret = qemu_fflush(f);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    packet_len = sizeof(*packet) + bioc->usage;
> +    packet = g_malloc0(packet_len);
> +    packet->idx = idx;
> +    packet->flags = VFIO_DEVICE_STATE_CONFIG_STATE;
> +    memcpy(&packet->data, bioc->data, bioc->usage);
> +
> +    if (!multifd_queue_device_state(idstr, instance_id,
> +                                    (char *)packet, packet_len)) {
> +        return -1;
> +    }
> +
> +    qatomic_add(&bytes_transferred, packet_len);
> +
> +    return 0;
> +}
> +
> +static int vfio_save_complete_precopy_thread(char *idstr,
> +                                             uint32_t instance_id,
> +                                             bool *abort_flag,
> +                                             void *opaque)

This lacks an "Error **" argument. I am not sure what was decided
in patch 19 "migration: Add save_live_complete_precopy_thread
handler".

We should do our best to collect and propagate errors and avoid
error_report() calls. With VFIO involved, the reasons why errors
can occur are increasingly numerous, as hardware is exposed and
host drivers are involved.

I understand this is a complex request for code when this code
relies on a framework using callbacks, even more with threads.

> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +    int ret;
> +    g_autofree VFIODeviceStatePacket *packet = NULL;
> +    uint32_t idx;
> +
> +    if (!migration->multifd_transfer) {
> +        /* Nothing to do, vfio_save_complete_precopy() does the transfer. */

Why would vfio_save_complete_precopy_thread() be called then? It looks
like an error to me; maybe not fatal, but an error report would be
good to have, no?

> +        return 0;
> +    }
> +
> +    trace_vfio_save_complete_precopy_thread_start(vbasedev->name,
> +                                                  idstr, instance_id);
> +
> +    /* We reach here with device state STOP or STOP_COPY only */
> +    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP_COPY,
> +                                   VFIO_DEVICE_STATE_STOP, NULL);

Error missing.

> +    if (ret) {
> +        goto ret_finish;
> +    }
> +
> +    packet = g_malloc0(sizeof(*packet) + migration->data_buffer_size);
> +
> +    for (idx = 0; ; idx++) {
> +        ssize_t data_size;
> +        size_t packet_size;
> +
> +        if (qatomic_read(abort_flag)) {
> +            ret = -ECANCELED;
> +            goto ret_finish;
> +        }
> +
> +        data_size = read(migration->data_fd, &packet->data,
> +                         migration->data_buffer_size);
> +        if (data_size < 0) {
> +            ret = -errno;
> +            goto ret_finish;
> +        } else if (data_size == 0) {
> +            break;
> +        }
> +
> +        packet->idx = idx;
> +        packet_size = sizeof(*packet) + data_size;
> +
> +        if (!multifd_queue_device_state(idstr, instance_id,
> +                                        (char *)packet, packet_size)) {
> +            ret = -1;
> +            goto ret_finish;
> +        }
> +
> +        qatomic_add(&bytes_transferred, packet_size);
> +    }
> +
> +    ret = vfio_save_complete_precopy_async_thread_config_state(vbasedev, idstr,
> +                                                               instance_id,
> +                                                               idx);
> +
> +ret_finish:
> +    trace_vfio_save_complete_precopy_thread_end(vbasedev->name, ret);
> +
> +    return ret;
> +}
> +
>   static void vfio_save_state(QEMUFile *f, void *opaque)
>   {
>       VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
>       Error *local_err = NULL;
>       int ret;
>   
> +    if (migration->multifd_transfer) {
> +        if (vfio_load_config_after_iter(vbasedev)) {
> +            qemu_put_be64(f, VFIO_MIG_FLAG_DEV_CONFIG_LOAD_READY);


Please put the above chunk at the end of the series with the patch
adding ARM support.

> +        } else {
> +            vfio_save_multifd_emit_dummy_eos(vbasedev, f);
> +        }

Please introduce a vfio_multifd_save_state() routine and a
vfio_"normal"_save_state() routine and change vfio_save_state()
to call one or the other.


Thanks,

C.



> +        return;
> +    }
> +
>       ret = vfio_save_device_config_state(f, opaque, &local_err);
>       if (ret) {
>           error_prepend(&local_err,
> @@ -1372,6 +1530,7 @@ static const SaveVMHandlers savevm_vfio_handlers = {
>       .is_active_iterate = vfio_is_active_iterate,
>       .save_live_iterate = vfio_save_iterate,
>       .save_live_complete_precopy = vfio_save_complete_precopy,
> +    .save_live_complete_precopy_thread = vfio_save_complete_precopy_thread,
>       .save_state = vfio_save_state,
>       .load_setup = vfio_load_setup,
>       .load_cleanup = vfio_load_cleanup,
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index 418b378ebd29..039979bdd98f 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -168,6 +168,8 @@ vfio_save_block_precopy_empty_hit(const char *name) " (%s)"
>   vfio_save_cleanup(const char *name) " (%s)"
>   vfio_save_complete_precopy(const char *name, int ret) " (%s) ret %d"
>   vfio_save_complete_precopy_start(const char *name) " (%s)"
> +vfio_save_complete_precopy_thread_start(const char *name, const char *idstr, uint32_t instance_id) " (%s) idstr %s instance %"PRIu32
> +vfio_save_complete_precopy_thread_end(const char *name, int ret) " (%s) ret %d"
>   vfio_save_device_config_state(const char *name) " (%s)"
>   vfio_save_iterate(const char *name, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy initial size %"PRIu64" precopy dirty size %"PRIu64
>   vfio_save_iterate_start(const char *name) " (%s)"
> 



^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH v4 32/33] vfio/migration: Add x-migration-multifd-transfer VFIO property
  2025-01-30 10:08 ` [PATCH v4 32/33] vfio/migration: Add x-migration-multifd-transfer VFIO property Maciej S. Szmigiero
@ 2025-02-12 17:10   ` Cédric Le Goater
  2025-02-14 20:56     ` Maciej S. Szmigiero
  0 siblings, 1 reply; 137+ messages in thread
From: Cédric Le Goater @ 2025-02-12 17:10 UTC (permalink / raw)
  To: Maciej S. Szmigiero, Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Eric Blake, Markus Armbruster,
	Daniel P . Berrangé, Avihai Horon, Joao Martins, qemu-devel

On 1/30/25 11:08, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> 
> This property allows configuring at runtime whether to transfer the
> particular device state via multifd channels when live migrating that
> device.
> 
> It defaults to AUTO, which means that VFIO device state transfer via
> multifd channels is attempted in configurations that otherwise support it.
> 
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> ---
>   hw/vfio/pci.c | 9 +++++++++
>   1 file changed, 9 insertions(+)
> 
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 2700b355ecf1..cd24f386aaf9 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -3353,6 +3353,8 @@ static void vfio_instance_init(Object *obj)
>       pci_dev->cap_present |= QEMU_PCI_CAP_EXPRESS;
>   }
>   
> +static PropertyInfo qdev_prop_on_off_auto_mutable;
> +
>   static const Property vfio_pci_dev_properties[] = {
>       DEFINE_PROP_PCI_HOST_DEVADDR("host", VFIOPCIDevice, host),
>       DEFINE_PROP_UUID_NODEFAULT("vf-token", VFIOPCIDevice, vf_token),
> @@ -3377,6 +3379,10 @@ static const Property vfio_pci_dev_properties[] = {
>                       VFIO_FEATURE_ENABLE_IGD_OPREGION_BIT, false),
>       DEFINE_PROP_ON_OFF_AUTO("enable-migration", VFIOPCIDevice,
>                               vbasedev.enable_migration, ON_OFF_AUTO_AUTO),
> +    DEFINE_PROP("x-migration-multifd-transfer", VFIOPCIDevice,
> +                vbasedev.migration_multifd_transfer,
> +                qdev_prop_on_off_auto_mutable, OnOffAuto,
> +                .set_default = true, .defval.i = ON_OFF_AUTO_AUTO),
>       DEFINE_PROP_ON_OFF_AUTO("x-migration-load-config-after-iter", VFIOPCIDevice,
>                               vbasedev.migration_load_config_after_iter,
>                               ON_OFF_AUTO_AUTO),
> @@ -3477,6 +3483,9 @@ static const TypeInfo vfio_pci_nohotplug_dev_info = {
>   
>   static void register_vfio_pci_dev_type(void)
>   {
> +    qdev_prop_on_off_auto_mutable = qdev_prop_on_off_auto;
> +    qdev_prop_on_off_auto_mutable.realized_set_allowed = true;
> +
>       type_register_static(&vfio_pci_dev_info);
>       type_register_static(&vfio_pci_nohotplug_dev_info);
>   }
> 

This looks wrong. Why not define the property simply with

    DEFINE_PROP("x-migration-multifd-transfer", VFIOPCIDevice,
                 vbasedev.migration_multifd_transfer, ON_OFF_AUTO_AUTO)

? Also "x-migration-multifd" should be enough.


Thanks,

C.



^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH v4 26/33] vfio/migration: Multifd device state transfer support - receive init/cleanup
  2025-02-12 10:55   ` Cédric Le Goater
@ 2025-02-14 20:55     ` Maciej S. Szmigiero
  2025-02-17  9:38       ` Cédric Le Goater
  0 siblings, 1 reply; 137+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-14 20:55 UTC (permalink / raw)
  To: Cédric Le Goater
  Cc: Alex Williamson, Fabiano Rosas, Peter Xu, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

On 12.02.2025 11:55, Cédric Le Goater wrote:
> On 1/30/25 11:08, Maciej S. Szmigiero wrote:
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> Add support for VFIOMultifd data structure that will contain most of the
>> receive-side data together with its init/cleanup methods.
>>
>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>> ---
>>   hw/vfio/migration.c           | 52 +++++++++++++++++++++++++++++++++--
>>   include/hw/vfio/vfio-common.h |  5 ++++
>>   2 files changed, 55 insertions(+), 2 deletions(-)
>>
>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>> index 3211041939c6..bcdf204d5cf4 100644
>> --- a/hw/vfio/migration.c
>> +++ b/hw/vfio/migration.c
>> @@ -300,6 +300,9 @@ typedef struct VFIOStateBuffer {
>>       size_t len;
>>   } VFIOStateBuffer;
>> +typedef struct VFIOMultifd {
>> +} VFIOMultifd;
>> +
>>   static void vfio_state_buffer_clear(gpointer data)
>>   {
>>       VFIOStateBuffer *lb = data;
>> @@ -398,6 +401,18 @@ static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
>>       return qemu_file_get_error(f);
>>   }
>> +static VFIOMultifd *vfio_multifd_new(void)
>> +{
>> +    VFIOMultifd *multifd = g_new(VFIOMultifd, 1);
>> +
>> +    return multifd;
>> +}
>> +
>> +static void vfio_multifd_free(VFIOMultifd *multifd)
>> +{
>> +    g_free(multifd);
>> +}
>> +
>>   static void vfio_migration_cleanup(VFIODevice *vbasedev)
>>   {
>>       VFIOMigration *migration = vbasedev->migration;
>> @@ -785,14 +800,47 @@ static void vfio_save_state(QEMUFile *f, void *opaque)
>>   static int vfio_load_setup(QEMUFile *f, void *opaque, Error **errp)
>>   {
>>       VFIODevice *vbasedev = opaque;
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    int ret;
>> +
>> +    /*
>> +     * Make a copy of this setting at the start in case it is changed
>> +     * mid-migration.
>> +     */
>> +    if (vbasedev->migration_multifd_transfer == ON_OFF_AUTO_AUTO) {
>> +        migration->multifd_transfer = vfio_multifd_transfer_supported();
> 
> Attribute "migration->multifd_transfer" is not necessary. It can be
> replaced by a small inline helper testing pointer migration->multifd
> and this routine can use a local variable instead.

It's necessary for the send side, since that side does not need to allocate VFIOMultifd
at migration->multifd; this (receive) side can then use the same flag for commonality.

> I don't think the '_transfer' suffix adds much to the understanding.

The migration->multifd name was already taken by the VFIOMultifd struct, but
this flag could use another name (migration->multifd_switch? migration->multifd_on?).

>> +    } else {
>> +        migration->multifd_transfer =
>> +            vbasedev->migration_multifd_transfer == ON_OFF_AUTO_ON;
>> +    }
>> +
>> +    if (migration->multifd_transfer && !vfio_multifd_transfer_supported()) {
>> +        error_setg(errp,
>> +                   "%s: Multifd device transfer requested but unsupported in the current config",
>> +                   vbasedev->name);
>> +        return -EINVAL;
>> +    }
> 
> The above checks are also introduced in vfio_save_setup(). Please
> implement a common routine vfio_multifd_is_enabled() or some other
> name.

Done (as common vfio_multifd_transfer_setup()).

>> +    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RESUMING,
>> +                                   migration->device_state, errp);
>> +    if (ret) {
>> +        return ret;
>> +    }
>> +
>> +    if (migration->multifd_transfer) {
>> +        assert(!migration->multifd);
>> +        migration->multifd = vfio_multifd_new();
>> +    }
>> -    return vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RESUMING,
>> -                                    vbasedev->migration->device_state, errp);
>> +    return 0;
>>   }
>>   static int vfio_load_cleanup(void *opaque)
>>   {
>>       VFIODevice *vbasedev = opaque;
>> +    VFIOMigration *migration = vbasedev->migration;
>> +
>> +    g_clear_pointer(&migration->multifd, vfio_multifd_free);
> 
> please add a vfio_multifd_cleanup() routine.
> 

Done.

>>       vfio_migration_cleanup(vbasedev);
>>       trace_vfio_load_cleanup(vbasedev->name);
>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>> index 153d03745dc7..c0c9c0b1b263 100644
>> --- a/include/hw/vfio/vfio-common.h
>> +++ b/include/hw/vfio/vfio-common.h
>> @@ -61,6 +61,8 @@ typedef struct VFIORegion {
>>       uint8_t nr; /* cache the region number for debug */
>>   } VFIORegion;
>> +typedef struct VFIOMultifd VFIOMultifd;
>> +
>>   typedef struct VFIOMigration {
>>       struct VFIODevice *vbasedev;
>>       VMChangeStateEntry *vm_state;
>> @@ -72,6 +74,8 @@ typedef struct VFIOMigration {
>>       uint64_t mig_flags;
>>       uint64_t precopy_init_size;
>>       uint64_t precopy_dirty_size;
>> +    bool multifd_transfer;
>> +    VFIOMultifd *multifd;
>>       bool initial_data_sent;
>>       bool event_save_iterate_started;
>> @@ -133,6 +137,7 @@ typedef struct VFIODevice {
>>       bool no_mmap;
>>       bool ram_block_discard_allowed;
>>       OnOffAuto enable_migration;
>> +    OnOffAuto migration_multifd_transfer;
> 
> This property should be added at the end of the series, with documentation,
> and used in the vfio_multifd_some_name() routine I mentioned above.
> 

The property behind this variable *is* in fact introduced at the end of the series -
in a commit called "vfio/migration: Add x-migration-multifd-transfer VFIO property"
after which there are only commits adding the related compat entry and a VFIO
developer doc update.

The variable itself needs to be introduced earlier since various newly
introduced code blocks depend on its value to only get activated when multifd
transfer is enabled.

> Thanks,
> 
> C.

Thanks,
Maciej




* Re: [PATCH v4 32/33] vfio/migration: Add x-migration-multifd-transfer VFIO property
  2025-02-12 17:10   ` Cédric Le Goater
@ 2025-02-14 20:56     ` Maciej S. Szmigiero
  2025-02-17 13:57       ` Cédric Le Goater
  0 siblings, 1 reply; 137+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-14 20:56 UTC (permalink / raw)
  To: Cédric Le Goater
  Cc: Alex Williamson, Eric Blake, Peter Xu, Fabiano Rosas,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

On 12.02.2025 18:10, Cédric Le Goater wrote:
> On 1/30/25 11:08, Maciej S. Szmigiero wrote:
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> This property allows configuring at runtime whether to transfer the
>> particular device state via multifd channels when live migrating that
>> device.
>>
>> It defaults to AUTO, which means that VFIO device state transfer via
>> multifd channels is attempted in configurations that otherwise support it.
>>
>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>> ---
>>   hw/vfio/pci.c | 9 +++++++++
>>   1 file changed, 9 insertions(+)
>>
>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>> index 2700b355ecf1..cd24f386aaf9 100644
>> --- a/hw/vfio/pci.c
>> +++ b/hw/vfio/pci.c
>> @@ -3353,6 +3353,8 @@ static void vfio_instance_init(Object *obj)
>>       pci_dev->cap_present |= QEMU_PCI_CAP_EXPRESS;
>>   }
>> +static PropertyInfo qdev_prop_on_off_auto_mutable;
>> +
>>   static const Property vfio_pci_dev_properties[] = {
>>       DEFINE_PROP_PCI_HOST_DEVADDR("host", VFIOPCIDevice, host),
>>       DEFINE_PROP_UUID_NODEFAULT("vf-token", VFIOPCIDevice, vf_token),
>> @@ -3377,6 +3379,10 @@ static const Property vfio_pci_dev_properties[] = {
>>                       VFIO_FEATURE_ENABLE_IGD_OPREGION_BIT, false),
>>       DEFINE_PROP_ON_OFF_AUTO("enable-migration", VFIOPCIDevice,
>>                               vbasedev.enable_migration, ON_OFF_AUTO_AUTO),
>> +    DEFINE_PROP("x-migration-multifd-transfer", VFIOPCIDevice,
>> +                vbasedev.migration_multifd_transfer,
>> +                qdev_prop_on_off_auto_mutable, OnOffAuto,
>> +                .set_default = true, .defval.i = ON_OFF_AUTO_AUTO),
>>       DEFINE_PROP_ON_OFF_AUTO("x-migration-load-config-after-iter", VFIOPCIDevice,
>>                               vbasedev.migration_load_config_after_iter,
>>                               ON_OFF_AUTO_AUTO),
>> @@ -3477,6 +3483,9 @@ static const TypeInfo vfio_pci_nohotplug_dev_info = {
>>   static void register_vfio_pci_dev_type(void)
>>   {
>> +    qdev_prop_on_off_auto_mutable = qdev_prop_on_off_auto;
>> +    qdev_prop_on_off_auto_mutable.realized_set_allowed = true;
>> +
>>       type_register_static(&vfio_pci_dev_info);
>>       type_register_static(&vfio_pci_nohotplug_dev_info);
>>   }
>>
> 
> This looks wrong. Why not define the property simply with
> 
>     DEFINE_PROP("x-migration-multifd-transfer", VFIOPCIDevice,
>                  vbasedev.migration_multifd_transfer, ON_OFF_AUTO_AUTO)
> ?

I already explained the reason why I'm not using DEFINE_PROP_ON_OFF_AUTO()
here during the previous version review:
https://lore.kernel.org/qemu-devel/3ba62755-6f36-4707-8c18-8803dbd4f55b@maciej.szmigiero.name/

> Also "x-migration-multifd" should be enough.

I can change it to this shorter name if that's preferred.

> Thanks,
> 
> C.
> 

Thanks,
Maciej




* Re: [PATCH v4 27/33] vfio/migration: Multifd device state transfer support - received buffers queuing
  2025-02-12 13:47   ` Cédric Le Goater
@ 2025-02-14 20:58     ` Maciej S. Szmigiero
  2025-02-17 13:48       ` Cédric Le Goater
  0 siblings, 1 reply; 137+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-14 20:58 UTC (permalink / raw)
  To: Cédric Le Goater
  Cc: Alex Williamson, Eric Blake, Peter Xu, Fabiano Rosas,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

On 12.02.2025 14:47, Cédric Le Goater wrote:
> On 1/30/25 11:08, Maciej S. Szmigiero wrote:
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> The multifd received data needs to be reassembled since device state
>> packets sent via different multifd channels can arrive out-of-order.
>>
>> Therefore, each VFIO device state packet carries a header indicating its
>> position in the stream.
>> The raw device state data is saved into a VFIOStateBuffer for later
>> in-order loading into the device.
>>
>> The last such VFIO device state packet should have
>> VFIO_DEVICE_STATE_CONFIG_STATE flag set and carry the device config state.
>>
>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>> ---
>>   hw/vfio/migration.c           | 116 ++++++++++++++++++++++++++++++++++
>>   hw/vfio/pci.c                 |   2 +
>>   hw/vfio/trace-events          |   1 +
>>   include/hw/vfio/vfio-common.h |   1 +
>>   4 files changed, 120 insertions(+)
>>
>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>> index bcdf204d5cf4..0c0caec1bd64 100644
>> --- a/hw/vfio/migration.c
>> +++ b/hw/vfio/migration.c
>> @@ -301,6 +301,12 @@ typedef struct VFIOStateBuffer {
>>   } VFIOStateBuffer;
>>   typedef struct VFIOMultifd {
>> +    VFIOStateBuffers load_bufs;
>> +    QemuCond load_bufs_buffer_ready_cond;
>> +    QemuMutex load_bufs_mutex; /* Lock order: this lock -> BQL */
>> +    uint32_t load_buf_idx;
>> +    uint32_t load_buf_idx_last;
>> +    uint32_t load_buf_queued_pending_buffers;
>>   } VFIOMultifd;
>>   static void vfio_state_buffer_clear(gpointer data)
>> @@ -346,6 +352,103 @@ static VFIOStateBuffer *vfio_state_buffers_at(VFIOStateBuffers *bufs, guint idx)
>>       return &g_array_index(bufs->array, VFIOStateBuffer, idx);
>>   }
> Each routine executed from a migration thread should have a preliminary
> comment saying from which context it is called: migration or VFIO.

Do you mean like whether it is called from the code in qemu/migration/
directory or the code in hw/vfio/ directory?

What about internal linkage ("static") functions?
Do they need such a comment too? That would actually decrease the readability
of these one-or-two line helpers due to high comment-to-code ratio.

As far as I can see, pretty much no existing VFIO migration function
has such comment.

>> +static bool vfio_load_state_buffer_insert(VFIODevice *vbasedev,
>> +                                          VFIODeviceStatePacket *packet,
>> +                                          size_t packet_total_size,
>> +                                          Error **errp)
>> +{
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    VFIOMultifd *multifd = migration->multifd;
>> +    VFIOStateBuffer *lb;
>> +
>> +    vfio_state_buffers_assert_init(&multifd->load_bufs);
>> +    if (packet->idx >= vfio_state_buffers_size_get(&multifd->load_bufs)) {
>> +        vfio_state_buffers_size_set(&multifd->load_bufs, packet->idx + 1);
>> +    }
>> +
>> +    lb = vfio_state_buffers_at(&multifd->load_bufs, packet->idx);
>> +    if (lb->is_present) {
>> +        error_setg(errp, "state buffer %" PRIu32 " already filled",
>> +                   packet->idx);
>> +        return false;
>> +    }
>> +
>> +    assert(packet->idx >= multifd->load_buf_idx);
>> +
>> +    multifd->load_buf_queued_pending_buffers++;
>> +    if (multifd->load_buf_queued_pending_buffers >
>> +        vbasedev->migration_max_queued_buffers) {
>> +        error_setg(errp,
>> +                   "queuing state buffer %" PRIu32 " would exceed the max of %" PRIu64,
>> +                   packet->idx, vbasedev->migration_max_queued_buffers);
>> +        return false;
>> +    }
> 
> AFAICT, attributes multifd->load_buf_queued_pending_buffers and
> vbasedev->migration_max_queued_buffers are not strictly necessary.
> They allow counting buffers and checking against an arbitrary limit, which
> is UINT64_MAX today. It makes me wonder how useful they are.

You are right they aren't strictly necessary and in fact they weren't
there in early versions of this patch set.

It was introduced upon Peter's request since otherwise the source
could theoretically cause the target QEMU to allocate unlimited
amounts of memory for buffers-in-flight:
https://lore.kernel.org/qemu-devel/9e85016e-ac72-4207-8e69-8cba054cefb7@maciej.szmigiero.name/
(scroll to the "Risk of OOM on unlimited VFIO buffering" section).

If that's an actual risk in someone's use case then that person
could lower that limit from UINT64_MAX to, for example, 10 buffers.

> Please introduce them in a separate patch at the end of the series,
> adding documentation on the "x-migration-max-queued-buffers" property
> and also general documentation on why and how to use it.

I can certainly move it to the end of the series - done now.

>> +
>> +    lb->data = g_memdup2(&packet->data, packet_total_size - sizeof(*packet));
>> +    lb->len = packet_total_size - sizeof(*packet);
>> +    lb->is_present = true;
>> +
>> +    return true;
>> +}
>> +
>> +static bool vfio_load_state_buffer(void *opaque, char *data, size_t data_size,
>> +                                   Error **errp)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    VFIOMultifd *multifd = migration->multifd;
>> +    VFIODeviceStatePacket *packet = (VFIODeviceStatePacket *)data;
>> +
>> +    /*
>> +     * Holding BQL here would violate the lock order and can cause
>> +     * a deadlock once we attempt to lock load_bufs_mutex below.
>> +     */
>> +    assert(!bql_locked());
>> +
>> +    if (!migration->multifd_transfer) {
>> +        error_setg(errp,
>> +                   "got device state packet but not doing multifd transfer");
>> +        return false;
>> +    }
>> +
>> +    assert(multifd);
>> +
>> +    if (data_size < sizeof(*packet)) {
>> +        error_setg(errp, "packet too short at %zu (min is %zu)",
>> +                   data_size, sizeof(*packet));
>> +        return false;
>> +    }
>> +
>> +    if (packet->version != 0) {
> 
> Please add a define for version, even if 0.

I've introduced a new define VFIO_DEVICE_STATE_PACKET_VER_CURRENT.

>> +        error_setg(errp, "packet has unknown version %" PRIu32,
>> +                   packet->version);
>> +        return false;
>> +    }
>> +
>> +    if (packet->idx == UINT32_MAX) {
>> +        error_setg(errp, "packet has too high idx %" PRIu32,
>> +                   packet->idx);
> 
> I don't think printing out packet->idx is useful here.

Yeah, it's unlikely that the value of UINT32_MAX will ever change :)

Removed now.

>> +        return false;
>> +    }
>> +
>> +    trace_vfio_load_state_device_buffer_incoming(vbasedev->name, packet->idx);
> 
> I wonder if we can add thread ids to trace events. It would be useful.

load_state_buffer is called from multifd channel receive threads
so passing multifd channel id there would require adding this multifd-specific
parameter to qemu_loadvm_load_state_buffer() and load_state_buffer
SaveVMHandler.

>> +
>> +    QEMU_LOCK_GUARD(&multifd->load_bufs_mutex);
>> +
>> +    /* config state packet should be the last one in the stream */
>> +    if (packet->flags & VFIO_DEVICE_STATE_CONFIG_STATE) {
>> +        multifd->load_buf_idx_last = packet->idx;
>> +    }
>> +
>> +    if (!vfio_load_state_buffer_insert(vbasedev, packet, data_size, errp)) {
> 
> So the migration thread calling multifd_device_state_recv() will
> exit 

The thread calling multifd_device_state_recv() is a multifd
channel receive thread.

> and the vfio thread loading the state into the device will
> hang until its aborted ?

In the normal (successful) migration flow the vfio_load_bufs_thread()
will exit after loading (write()'ing) all buffers into the device
and then loading its config state.

In the aborted/error/unsuccessful migration flow it will get
terminated from vfio_load_cleanup() -> vfio_multifd_free() ->
vfio_load_cleanup_load_bufs_thread().

vfio_load_cleanup_load_bufs_thread() will signal
load_bufs_buffer_ready_cond and load_bufs_iter_done_cond since
the load thread indeed could be waiting on them.

> 
> This sequence is expected to be called to release the vfio thread
> 
>         while (multifd->load_bufs_thread_running) {
>              multifd->load_bufs_thread_want_exit = true;
> 
>              qemu_cond_signal(&multifd->load_bufs_buffer_ready_cond);
>          ...
>         }
> 
> right ?

Right, that's a part of the code in vfio_load_cleanup_load_bufs_thread().

> 
> The way the series is presented makes it a bit complex to follow the
> proposition, especially regarding the creation and termination of
> threads, something the reader should be aware of.
> 
> As an initial step in clarifying the design, I would have preferred
> a series of patches introducing the various threads, migration threads
> and VFIO threads, without any workload. Once the creation and termination
> points are established I would then introduce the work load for each
> thread.

When I am doing review of anything more complex (though it's not usually
in QEMU) I mainly follow the final code flow as an operation is handled
since looking just top-down at individual commits rarely gives
enough context to see how every part interacts together.

But for this the reviewer needs to see the whole code for the logical
operation, rather than just a part of it.

I think that adding the load operation in parts doesn't really
help since the reasons why things are done a particular way in earlier
patches only become apparent in later patches and the earlier parts
don't really make much sense on their own.
Not to mention extra code churn when rebasing/reworking that increases
chance of a typo or a copy-paste mistake happening at some point.

I also see that in comments to a later patch you dislike that
a dummy vfio_load_bufs_thread_load_config() gets added in one patch
then immediately replaced by the real implementation in the next patch.
Previously, you also said that vfio_load_config_after_iter() seems
to be unused in the patch that adds it - that's exactly the kind of
issues that bringing the complete operation in one patch avoids.

I agree that, for example, x-migration-load-config-after-iter feature
could be a separate patch as it is a relatively simple change.

Same goes for x-migration-max-queued-buffers checking/enforcement,
compat changes, exporting existing settings (variables) as properties
or adding a g_autoptr() cleanup function for an existing type.

That's why originally the VFIO part of the series was divided into two
parts - receive and send, since these are two separate, yet internally
complete operations.

I also export the whole series (including the current WiP state, with
code moved to migration-multifd.{c,h} files, etc.) as a git tree at
https://gitlab.com/maciejsszmigiero/qemu/-/commits/multifd-device-state-transfer-vfio
since this way it can be easily seen how the QEMU code currently
looks after the whole patch set or set of patches there.

> 
> Thanks,
> 
> C.

Thanks,
Maciej



^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH v4 26/33] vfio/migration: Multifd device state transfer support - receive init/cleanup
  2025-02-14 20:55     ` Maciej S. Szmigiero
@ 2025-02-17  9:38       ` Cédric Le Goater
  2025-02-17 22:13         ` Maciej S. Szmigiero
  0 siblings, 1 reply; 137+ messages in thread
From: Cédric Le Goater @ 2025-02-17  9:38 UTC (permalink / raw)
  To: Maciej S. Szmigiero
  Cc: Alex Williamson, Fabiano Rosas, Peter Xu, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

On 2/14/25 21:55, Maciej S. Szmigiero wrote:
> On 12.02.2025 11:55, Cédric Le Goater wrote:
>> On 1/30/25 11:08, Maciej S. Szmigiero wrote:
>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>
>>> Add support for VFIOMultifd data structure that will contain most of the
>>> receive-side data together with its init/cleanup methods.
>>>
>>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>>> ---
>>>   hw/vfio/migration.c           | 52 +++++++++++++++++++++++++++++++++--
>>>   include/hw/vfio/vfio-common.h |  5 ++++
>>>   2 files changed, 55 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>>> index 3211041939c6..bcdf204d5cf4 100644
>>> --- a/hw/vfio/migration.c
>>> +++ b/hw/vfio/migration.c
>>> @@ -300,6 +300,9 @@ typedef struct VFIOStateBuffer {
>>>       size_t len;
>>>   } VFIOStateBuffer;
>>> +typedef struct VFIOMultifd {
>>> +} VFIOMultifd;
>>> +
>>>   static void vfio_state_buffer_clear(gpointer data)
>>>   {
>>>       VFIOStateBuffer *lb = data;
>>> @@ -398,6 +401,18 @@ static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
>>>       return qemu_file_get_error(f);
>>>   }
>>> +static VFIOMultifd *vfio_multifd_new(void)
>>> +{
>>> +    VFIOMultifd *multifd = g_new(VFIOMultifd, 1);
>>> +
>>> +    return multifd;
>>> +}
>>> +
>>> +static void vfio_multifd_free(VFIOMultifd *multifd)
>>> +{
>>> +    g_free(multifd);
>>> +}
>>> +
>>>   static void vfio_migration_cleanup(VFIODevice *vbasedev)
>>>   {
>>>       VFIOMigration *migration = vbasedev->migration;
>>> @@ -785,14 +800,47 @@ static void vfio_save_state(QEMUFile *f, void *opaque)
>>>   static int vfio_load_setup(QEMUFile *f, void *opaque, Error **errp)
>>>   {
>>>       VFIODevice *vbasedev = opaque;
>>> +    VFIOMigration *migration = vbasedev->migration;
>>> +    int ret;
>>> +
>>> +    /*
>>> +     * Make a copy of this setting at the start in case it is changed
>>> +     * mid-migration.
>>> +     */
>>> +    if (vbasedev->migration_multifd_transfer == ON_OFF_AUTO_AUTO) {
>>> +        migration->multifd_transfer = vfio_multifd_transfer_supported();
>>
>> Attribute "migration->multifd_transfer" is not necessary. It can be
>> replaced by a small inline helper testing pointer migration->multifd
>> and this routine can use a local variable instead.
> 
> It's necessary for the send side since it does not need/allocate VFIOMultifd
> at migration->multifd, so this (receive) side can use it for commonality too.

Hmm, we can allocate migration->multifd on the send side too, even
if the attributes are unused and it is up to vfio_multifd_free() to
make the difference between the send/recv side.


Something that is bothering me is the lack of introspection tools
and statistics. What could be possibly added under VFIOMultifd and
VfioStats ?

>> I don't think the '_transfer' suffix adds much to the understanding.
> 
> The migration->multifd was already taken by VFIOMultifd struct, but
> it could use other name (migration->multifd_switch? migration->multifd_on?).

yeah. Let's try to get rid of it first.
  
>>> +    } else {
>>> +        migration->multifd_transfer =
>>> +            vbasedev->migration_multifd_transfer == ON_OFF_AUTO_ON;
>>> +    }
>>> +
>>> +    if (migration->multifd_transfer && !vfio_multifd_transfer_supported()) {
>>> +        error_setg(errp,
>>> +                   "%s: Multifd device transfer requested but unsupported in the current config",
>>> +                   vbasedev->name);
>>> +        return -EINVAL;
>>> +    }
>>
>> The above checks are also introduced in vfio_save_setup(). Please
>> implement a common routine vfio_multifd_is_enabled() or some other
>> name.
> 
> Done (as common vfio_multifd_transfer_setup()).

vfio_multifd_is_enabled() please, returning a bool.

> 
>>> +    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RESUMING,
>>> +                                   migration->device_state, errp);
>>> +    if (ret) {
>>> +        return ret;
>>> +    }
>>> +
>>> +    if (migration->multifd_transfer) {
>>> +        assert(!migration->multifd);
>>> +        migration->multifd = vfio_multifd_new();
>>> +    }
>>> -    return vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RESUMING,
>>> -                                    vbasedev->migration->device_state, errp);
>>> +    return 0;
>>>   }
>>>   static int vfio_load_cleanup(void *opaque)
>>>   {
>>>       VFIODevice *vbasedev = opaque;
>>> +    VFIOMigration *migration = vbasedev->migration;
>>> +
>>> +    g_clear_pointer(&migration->multifd, vfio_multifd_free);
>>
>> please add a vfio_multifd_cleanup() routine.
>>
> 
> Done.
> 
>>>       vfio_migration_cleanup(vbasedev);
>>>       trace_vfio_load_cleanup(vbasedev->name);
>>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>>> index 153d03745dc7..c0c9c0b1b263 100644
>>> --- a/include/hw/vfio/vfio-common.h
>>> +++ b/include/hw/vfio/vfio-common.h
>>> @@ -61,6 +61,8 @@ typedef struct VFIORegion {
>>>       uint8_t nr; /* cache the region number for debug */
>>>   } VFIORegion;
>>> +typedef struct VFIOMultifd VFIOMultifd;
>>> +
>>>   typedef struct VFIOMigration {
>>>       struct VFIODevice *vbasedev;
>>>       VMChangeStateEntry *vm_state;
>>> @@ -72,6 +74,8 @@ typedef struct VFIOMigration {
>>>       uint64_t mig_flags;
>>>       uint64_t precopy_init_size;
>>>       uint64_t precopy_dirty_size;
>>> +    bool multifd_transfer;
>>> +    VFIOMultifd *multifd;
>>>       bool initial_data_sent;
>>>       bool event_save_iterate_started;
>>> @@ -133,6 +137,7 @@ typedef struct VFIODevice {
>>>       bool no_mmap;
>>>       bool ram_block_discard_allowed;
>>>       OnOffAuto enable_migration;
>>> +    OnOffAuto migration_multifd_transfer;
>>
>> This property should be added at the end of the series, with documentation,
>> and used in the vfio_multifd_some_name() routine I mentioned above.
>>
> 
> The property behind this variable *is* in fact introduced at the end of the series -
> in a commit called "vfio/migration: Add x-migration-multifd-transfer VFIO property"
> after which there are only commits adding the related compat entry and a VFIO
> developer doc update.
> 
> The variable itself needs to be introduced earlier since various newly
> introduced code blocks depend on its value to only get activated when multifd
> transfer is enabled.

Not if you introduce a vfio_multifd_is_enabled() routine hiding
the details. In that case, the property and attribute can be added
at the end of the series and you don't need to add the attribute
earlier.


Thanks,

C.






* Re: [PATCH v4 27/33] vfio/migration: Multifd device state transfer support - received buffers queuing
  2025-02-14 20:58     ` Maciej S. Szmigiero
@ 2025-02-17 13:48       ` Cédric Le Goater
  2025-02-17 22:15         ` Maciej S. Szmigiero
  0 siblings, 1 reply; 137+ messages in thread
From: Cédric Le Goater @ 2025-02-17 13:48 UTC (permalink / raw)
  To: Maciej S. Szmigiero
  Cc: Alex Williamson, Eric Blake, Peter Xu, Fabiano Rosas,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

On 2/14/25 21:58, Maciej S. Szmigiero wrote:
> On 12.02.2025 14:47, Cédric Le Goater wrote:
>> On 1/30/25 11:08, Maciej S. Szmigiero wrote:
>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>
>>> The multifd received data needs to be reassembled since device state
>>> packets sent via different multifd channels can arrive out-of-order.
>>>
>>> Therefore, each VFIO device state packet carries a header indicating its
>>> position in the stream.
>>> The raw device state data is saved into a VFIOStateBuffer for later
>>> in-order loading into the device.
>>>
>>> The last such VFIO device state packet should have
>>> VFIO_DEVICE_STATE_CONFIG_STATE flag set and carry the device config state.
>>>
>>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>>> ---
>>>   hw/vfio/migration.c           | 116 ++++++++++++++++++++++++++++++++++
>>>   hw/vfio/pci.c                 |   2 +
>>>   hw/vfio/trace-events          |   1 +
>>>   include/hw/vfio/vfio-common.h |   1 +
>>>   4 files changed, 120 insertions(+)
>>>
>>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>>> index bcdf204d5cf4..0c0caec1bd64 100644
>>> --- a/hw/vfio/migration.c
>>> +++ b/hw/vfio/migration.c
>>> @@ -301,6 +301,12 @@ typedef struct VFIOStateBuffer {
>>>   } VFIOStateBuffer;
>>>   typedef struct VFIOMultifd {
>>> +    VFIOStateBuffers load_bufs;
>>> +    QemuCond load_bufs_buffer_ready_cond;
>>> +    QemuMutex load_bufs_mutex; /* Lock order: this lock -> BQL */
>>> +    uint32_t load_buf_idx;
>>> +    uint32_t load_buf_idx_last;
>>> +    uint32_t load_buf_queued_pending_buffers;
>>>   } VFIOMultifd;
>>>   static void vfio_state_buffer_clear(gpointer data)
>>> @@ -346,6 +352,103 @@ static VFIOStateBuffer *vfio_state_buffers_at(VFIOStateBuffers *bufs, guint idx)
>>>       return &g_array_index(bufs->array, VFIOStateBuffer, idx);
>>>   }
>> Each routine executed from a migration thread should have a preliminary
>> comment saying from which context it is called: migration or VFIO
> 
> Do you mean like whether it is called from the code in qemu/migration/
> directory or the code in hw/vfio/ directory?

Threads are spawned from different subsystems: migration callbacks
(save), and from VFIO (load, well, not load phase, switchover phase).
It would be good to provide hints to the reader.

I am struggling to understand how this works. Imagine a new comer
looking at the code and at the git history in 2y time ... Check
vfio in QEMU 1.3 (one small file) and see what it has become today.

> What about internal linkage ("static") functions?

There shouldn't be any static left when all multifd code is moved
to its own hw/vfio/migration-multifd.c file.

> Do they need such comment too? That would actually decrease the readability
> of these one-or-two line helpers due to high comment-to-code ratio.

I meant the higher level routines.

Tbh, this lacks tons of documentation, under docs, under each file,
for the properties, etc. This should be addressed before resend.

> As far as I can see, pretty much no existing VFIO migration function
> has such comment.

>>> +static bool vfio_load_state_buffer_insert(VFIODevice *vbasedev,
>>> +                                          VFIODeviceStatePacket *packet,
>>> +                                          size_t packet_total_size,
>>> +                                          Error **errp)
>>> +{
>>> +    VFIOMigration *migration = vbasedev->migration;
>>> +    VFIOMultifd *multifd = migration->multifd;
>>> +    VFIOStateBuffer *lb;
>>> +
>>> +    vfio_state_buffers_assert_init(&multifd->load_bufs);
>>> +    if (packet->idx >= vfio_state_buffers_size_get(&multifd->load_bufs)) {
>>> +        vfio_state_buffers_size_set(&multifd->load_bufs, packet->idx + 1);
>>> +    }
>>> +
>>> +    lb = vfio_state_buffers_at(&multifd->load_bufs, packet->idx);
>>> +    if (lb->is_present) {
>>> +        error_setg(errp, "state buffer %" PRIu32 " already filled",
>>> +                   packet->idx);
>>> +        return false;
>>> +    }
>>> +
>>> +    assert(packet->idx >= multifd->load_buf_idx);
>>> +
>>> +    multifd->load_buf_queued_pending_buffers++;
>>> +    if (multifd->load_buf_queued_pending_buffers >
>>> +        vbasedev->migration_max_queued_buffers) {
>>> +        error_setg(errp,
>>> +                   "queuing state buffer %" PRIu32 " would exceed the max of %" PRIu64,
>>> +                   packet->idx, vbasedev->migration_max_queued_buffers);
>>> +        return false;
>>> +    }
>>
>> AFAICT, attributes multifd->load_buf_queued_pending_buffers and
>> vbasedev->migration_max_queued_buffers are not strictly necessary.
>> They allow to count buffers and check an arbitrary limit, which
>> is UINT64_MAX today. It makes me wonder how useful they are.
> 
> You are right they aren't strictly necessary and in fact they weren't
> there in early versions of this patch set.
>
> It was introduced upon Peter's request since otherwise the source
> could theoretically cause the target QEMU to allocate unlimited
> amounts of memory for buffers-in-flight:
> https://lore.kernel.org/qemu-devel/9e85016e-ac72-4207-8e69-8cba054cefb7@maciej.szmigiero.name/
> (scroll to the "Risk of OOM on unlimited VFIO buffering" section).
> 
> If that's an actual risk in someone's use case then that person
> could lower that limit from UINT64_MAX to, for example, 10 buffers.
> >> Please introduce them in a separate patch at the end of the series,
>> adding documentation on the "x-migration-max-queued-buffers" property
>> and also general documentation on why and how to use it.
> 
> I can certainly move it to the end of the series - done now.

Great. Please add the comment above in the commit log. We will decide
if this is experimental or not.

Also, I wonder if this should be a global migration property.
  
>>> +
>>> +    lb->data = g_memdup2(&packet->data, packet_total_size - sizeof(*packet));
>>> +    lb->len = packet_total_size - sizeof(*packet);
>>> +    lb->is_present = true;
>>> +
>>> +    return true;
>>> +}
>>> +
>>> +static bool vfio_load_state_buffer(void *opaque, char *data, size_t data_size,
>>> +                                   Error **errp)
>>> +{
>>> +    VFIODevice *vbasedev = opaque;
>>> +    VFIOMigration *migration = vbasedev->migration;
>>> +    VFIOMultifd *multifd = migration->multifd;
>>> +    VFIODeviceStatePacket *packet = (VFIODeviceStatePacket *)data;
>>> +
>>> +    /*
>>> +     * Holding BQL here would violate the lock order and can cause
>>> +     * a deadlock once we attempt to lock load_bufs_mutex below.
>>> +     */
>>> +    assert(!bql_locked());
>>> +
>>> +    if (!migration->multifd_transfer) {
>>> +        error_setg(errp,
>>> +                   "got device state packet but not doing multifd transfer");
>>> +        return false;
>>> +    }
>>> +
>>> +    assert(multifd);
>>> +
>>> +    if (data_size < sizeof(*packet)) {
>>> +        error_setg(errp, "packet too short at %zu (min is %zu)",
>>> +                   data_size, sizeof(*packet));
>>> +        return false;
>>> +    }
>>> +
>>> +    if (packet->version != 0) {
>>
>> Please add a define for version, even if 0.
> 
> I've introduced a new define VFIO_DEVICE_STATE_PACKET_VER_CURRENT.
> 
>>> +        error_setg(errp, "packet has unknown version %" PRIu32,
>>> +                   packet->version);
>>> +        return false;
>>> +    }
>>> +
>>> +    if (packet->idx == UINT32_MAX) {
>>> +        error_setg(errp, "packet has too high idx %" PRIu32,
>>> +                   packet->idx);
>>
>> I don't think printing out packet->idx is useful here.
> 
> Yeah, it's unlikely that the value of UINT32_MAX will ever change :)
> 
> Removed now.
> 
>>> +        return false;
>>> +    }
>>> +
>>> +    trace_vfio_load_state_device_buffer_incoming(vbasedev->name, packet->idx);
>>
>> I wonder if we can add thread ids to trace events. It would be useful.
> 
> load_state_buffer is called from multifd channel receive threads
> so passing multifd channel id there would require adding this multifd-specific
> parameter to qemu_loadvm_load_state_buffer() and load_state_buffer
> SaveVMHandler.

'-msg timestamp=on' should be enough. Having logical thread names would
be nice. It's another topic.

  
>>> +
>>> +    QEMU_LOCK_GUARD(&multifd->load_bufs_mutex);
>>> +
>>> +    /* config state packet should be the last one in the stream */
>>> +    if (packet->flags & VFIO_DEVICE_STATE_CONFIG_STATE) {
>>> +        multifd->load_buf_idx_last = packet->idx;
>>> +    }
>>> +
>>> +    if (!vfio_load_state_buffer_insert(vbasedev, packet, data_size, errp)) {
>>
>> So the migration thread calling multifd_device_state_recv() will
>> exit 
> 
> The thread calling multifd_device_state_recv() is a multifd
> channel receive thread.
> 
>> and the vfio thread loading the state into the device will
>> hang until its aborted ?
> 
> In the normal (successful) migration flow the vfio_load_bufs_thread()
> will exit after loading (write()'ing) all buffers into the device
> and then loading its config state.
> 
> In the aborted/error/unsuccessful migration flow it will get
> terminated from vfio_load_cleanup() -> vfio_multifd_free() ->
> vfio_load_cleanup_load_bufs_thread().
> 
> vfio_load_cleanup_load_bufs_thread() will signal
> load_bufs_buffer_ready_cond and load_bufs_iter_done_cond since
> the load thread indeed could be waiting on them.
>
> 
>>
>> This sequence is expected to be called to release the vfio thread
>>
>>         while (multifd->load_bufs_thread_running) {
>>              multifd->load_bufs_thread_want_exit = true;
>>
>>              qemu_cond_signal(&multifd->load_bufs_buffer_ready_cond);
>>          ...
>>         }
>>
>> right ?
> 
> Right, that's a part of the code in vfio_load_cleanup_load_bufs_thread().

ok. So I think this lacks comments on thread termination points.
Please try to comment a bit more these areas in the code. I will
check next version more closely.

>> The way the series is presented makes it a bit complex to follow the
>> proposition, especially regarding the creation and termination of
>> threads, something the reader should be aware of.
>>
>> As an initial step in clarifying the design, I would have preferred
>> a series of patches introducing the various threads, migration threads
>> and VFIO threads, without any workload. Once the creation and termination
>> points are established I would then introduce the work load for each
>> thread.
> 
> When I am doing review of anything more complex (though it's not usually
> in QEMU) I mainly follow the final code flow as an operation is handled
> since looking just top-down at individual commits rarely gives
> enough context to see how every part interacts together.
> 
> But for this the reviewer needs to see the whole code for the logical
> operation, rather than just a part of it.

and this is very problematic :/ Very, very hard to maintain in the
long run. I also don't have *time* to dig into all the context. So please
try to keep it as simple as possible.

> I think that adding the load operation in parts doesn't really
> help since the reasons why things are done a particular way in earlier
> patches only become apparent in later patches and the earlier parts
> don't really make much sense on their own.
> Not to mention extra code churn when rebasing/reworking that increases
> chance of a typo or a copy-paste mistake happening at some point.
>
> I also see that in comments to a later patch you dislike that
> a dummy vfio_load_bufs_thread_load_config() gets added in one patch
> then immediately replaced by the real implementation in the next patch.
> Previously, you also said that vfio_load_config_after_iter() seems
> to be unused in the patch that adds it - that's exactly the kind of
> issues that bringing the complete operation in one patch avoids.

Maybe I did. Sorry, I switched context many times already and this
was lost in oblivion. Again, please help the reviewer. Changes
should be made obvious.

> I agree that, for example, x-migration-load-config-after-iter feature
> could be a separate patch as it is a relatively simple change.
> 
> Same goes for x-migration-max-queued-buffers checking/enforcement,
> compat changes, exporting existing settings (variables) as properties
> or adding a g_autoptr() cleanup function for an existing type.
> 
> That's why originally the VFIO part of the series was divided into two
> parts - receive and send, since these are two separate, yet internally
> complete operations.

I am now asking to have a better understanding of how threads are
created/terminated. It's another sub split of the load part AFAICT.
If you prefer we can forget about the load thread first, like I
asked initially iirc. I would very much prefer that for QEMU 10.0.


> I also export the whole series (including the current WiP state, with
> code moved to migration-multifd.{c,h} files, etc.) as a git tree at
> https://gitlab.com/maciejsszmigiero/qemu/-/commits/multifd-device-state-transfer-vfio
> since this way it can be easily seen how the QEMU code currently
> looks after the whole patch set or set of patches there.

Overall, I think this is making great progress. For such a complex
work, I would imagine a couple of RFCs first and half dozen normal
series. So ~10 iterations. We are only at v4. At least two more are
expected.


Thanks,

C.





* Re: [PATCH v4 32/33] vfio/migration: Add x-migration-multifd-transfer VFIO property
  2025-02-14 20:56     ` Maciej S. Szmigiero
@ 2025-02-17 13:57       ` Cédric Le Goater
  2025-02-17 14:16         ` Maciej S. Szmigiero
  0 siblings, 1 reply; 137+ messages in thread
From: Cédric Le Goater @ 2025-02-17 13:57 UTC (permalink / raw)
  To: Maciej S. Szmigiero
  Cc: Alex Williamson, Eric Blake, Peter Xu, Fabiano Rosas,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

On 2/14/25 21:56, Maciej S. Szmigiero wrote:
> On 12.02.2025 18:10, Cédric Le Goater wrote:
>> On 1/30/25 11:08, Maciej S. Szmigiero wrote:
>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>
>>> This property allows configuring at runtime whether to transfer the
>>> particular device state via multifd channels when live migrating that
>>> device.
>>>
>>> It defaults to AUTO, which means that VFIO device state transfer via
>>> multifd channels is attempted in configurations that otherwise support it.
>>>
>>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>>> ---
>>>   hw/vfio/pci.c | 9 +++++++++
>>>   1 file changed, 9 insertions(+)
>>>
>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>>> index 2700b355ecf1..cd24f386aaf9 100644
>>> --- a/hw/vfio/pci.c
>>> +++ b/hw/vfio/pci.c
>>> @@ -3353,6 +3353,8 @@ static void vfio_instance_init(Object *obj)
>>>       pci_dev->cap_present |= QEMU_PCI_CAP_EXPRESS;
>>>   }
>>> +static PropertyInfo qdev_prop_on_off_auto_mutable;
>>> +
>>>   static const Property vfio_pci_dev_properties[] = {
>>>       DEFINE_PROP_PCI_HOST_DEVADDR("host", VFIOPCIDevice, host),
>>>       DEFINE_PROP_UUID_NODEFAULT("vf-token", VFIOPCIDevice, vf_token),
>>> @@ -3377,6 +3379,10 @@ static const Property vfio_pci_dev_properties[] = {
>>>                       VFIO_FEATURE_ENABLE_IGD_OPREGION_BIT, false),
>>>       DEFINE_PROP_ON_OFF_AUTO("enable-migration", VFIOPCIDevice,
>>>                               vbasedev.enable_migration, ON_OFF_AUTO_AUTO),
>>> +    DEFINE_PROP("x-migration-multifd-transfer", VFIOPCIDevice,
>>> +                vbasedev.migration_multifd_transfer,
>>> +                qdev_prop_on_off_auto_mutable, OnOffAuto,
>>> +                .set_default = true, .defval.i = ON_OFF_AUTO_AUTO),
>>>       DEFINE_PROP_ON_OFF_AUTO("x-migration-load-config-after-iter", VFIOPCIDevice,
>>>                               vbasedev.migration_load_config_after_iter,
>>>                               ON_OFF_AUTO_AUTO),
>>> @@ -3477,6 +3483,9 @@ static const TypeInfo vfio_pci_nohotplug_dev_info = {
>>>   static void register_vfio_pci_dev_type(void)
>>>   {
>>> +    qdev_prop_on_off_auto_mutable = qdev_prop_on_off_auto;
>>> +    qdev_prop_on_off_auto_mutable.realized_set_allowed = true;
>>> +
>>>       type_register_static(&vfio_pci_dev_info);
>>>       type_register_static(&vfio_pci_nohotplug_dev_info);
>>>   }
>>>
>>
>> This looks wrong. Why not define the property simply with
>>
>>     DEFINE_PROP("x-migration-multifd-transfer", VFIOPCIDevice,
>>                  vbasedev.migration_multifd_transfer, ON_OFF_AUTO_AUTO)
>> ?
> 
> I already explained the reason why I'm not using DEFINE_PROP_ON_OFF_AUTO()
> here during the previous version review:
> https://lore.kernel.org/qemu-devel/3ba62755-6f36-4707-8c18-8803dbd4f55b@maciej.szmigiero.name/

Ah yes, thanks for the reminder. I will repeat "make it simpler first".
Please simply use DEFINE_PROP_ON_OFF_AUTO() first.

Thanks,

C.





* Re: [PATCH v4 32/33] vfio/migration: Add x-migration-multifd-transfer VFIO property
  2025-02-17 13:57       ` Cédric Le Goater
@ 2025-02-17 14:16         ` Maciej S. Szmigiero
  0 siblings, 0 replies; 137+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-17 14:16 UTC (permalink / raw)
  To: Cédric Le Goater
  Cc: Alex Williamson, Eric Blake, Peter Xu, Fabiano Rosas,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

On 17.02.2025 14:57, Cédric Le Goater wrote:
> On 2/14/25 21:56, Maciej S. Szmigiero wrote:
>> On 12.02.2025 18:10, Cédric Le Goater wrote:
>>> On 1/30/25 11:08, Maciej S. Szmigiero wrote:
>>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>>
>>>> This property allows configuring at runtime whether to transfer the
>>>> particular device state via multifd channels when live migrating that
>>>> device.
>>>>
>>>> It defaults to AUTO, which means that VFIO device state transfer via
>>>> multifd channels is attempted in configurations that otherwise support it.
>>>>
>>>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>>>> ---
>>>>   hw/vfio/pci.c | 9 +++++++++
>>>>   1 file changed, 9 insertions(+)
>>>>
>>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>>>> index 2700b355ecf1..cd24f386aaf9 100644
>>>> --- a/hw/vfio/pci.c
>>>> +++ b/hw/vfio/pci.c
>>>> @@ -3353,6 +3353,8 @@ static void vfio_instance_init(Object *obj)
>>>>       pci_dev->cap_present |= QEMU_PCI_CAP_EXPRESS;
>>>>   }
>>>> +static PropertyInfo qdev_prop_on_off_auto_mutable;
>>>> +
>>>>   static const Property vfio_pci_dev_properties[] = {
>>>>       DEFINE_PROP_PCI_HOST_DEVADDR("host", VFIOPCIDevice, host),
>>>>       DEFINE_PROP_UUID_NODEFAULT("vf-token", VFIOPCIDevice, vf_token),
>>>> @@ -3377,6 +3379,10 @@ static const Property vfio_pci_dev_properties[] = {
>>>>                       VFIO_FEATURE_ENABLE_IGD_OPREGION_BIT, false),
>>>>       DEFINE_PROP_ON_OFF_AUTO("enable-migration", VFIOPCIDevice,
>>>>                               vbasedev.enable_migration, ON_OFF_AUTO_AUTO),
>>>> +    DEFINE_PROP("x-migration-multifd-transfer", VFIOPCIDevice,
>>>> +                vbasedev.migration_multifd_transfer,
>>>> +                qdev_prop_on_off_auto_mutable, OnOffAuto,
>>>> +                .set_default = true, .defval.i = ON_OFF_AUTO_AUTO),
>>>>       DEFINE_PROP_ON_OFF_AUTO("x-migration-load-config-after-iter", VFIOPCIDevice,
>>>>                               vbasedev.migration_load_config_after_iter,
>>>>                               ON_OFF_AUTO_AUTO),
>>>> @@ -3477,6 +3483,9 @@ static const TypeInfo vfio_pci_nohotplug_dev_info = {
>>>>   static void register_vfio_pci_dev_type(void)
>>>>   {
>>>> +    qdev_prop_on_off_auto_mutable = qdev_prop_on_off_auto;
>>>> +    qdev_prop_on_off_auto_mutable.realized_set_allowed = true;
>>>> +
>>>>       type_register_static(&vfio_pci_dev_info);
>>>>       type_register_static(&vfio_pci_nohotplug_dev_info);
>>>>   }
>>>>
>>>
>>> This looks wrong. Why not define the property simply with
>>>
>>>     DEFINE_PROP("x-migration-multifd-transfer", VFIOPCIDevice,
>>>                  vbasedev.migration_multifd_transfer, ON_OFF_AUTO_AUTO)
>>> ?
>>
>> I already explained the reason why I'm not using DEFINE_PROP_ON_OFF_AUTO()
>> here during the previous version review:
>> https://lore.kernel.org/qemu-devel/3ba62755-6f36-4707-8c18-8803dbd4f55b@maciej.szmigiero.name/
> 
> Ah yes, thanks for the reminder. I will repeat "make it simpler first".
> Please simply use DEFINE_PROP_ON_OFF_AUTO() first.

"use DEFINE_PROP_ON_OFF_AUTO() first" - by that do you mean moving that
custom property type to a separate patch?
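For context, the idiom under discussion copies a stock (const) property descriptor into a module-local mutable variable at type-registration time and flips a single flag. A standalone sketch with simplified, hypothetical types (not QEMU's real PropertyInfo definition):

```c
#include <stdbool.h>

/* Simplified stand-in for QEMU's PropertyInfo (hypothetical fields). */
typedef struct PropertyInfo {
    const char *name;
    bool realized_set_allowed; /* may the property be set after realize? */
} PropertyInfo;

/* Stock descriptor, analogous to the global qdev_prop_on_off_auto. */
static const PropertyInfo prop_on_off_auto = {
    .name = "OnOffAuto",
    .realized_set_allowed = false,
};

/* Module-local mutable copy, patched once during type registration. */
static PropertyInfo prop_on_off_auto_mutable;

static void register_dev_type(void)
{
    /* Copy the template, then permit post-realize writes for this user only. */
    prop_on_off_auto_mutable = prop_on_off_auto;
    prop_on_off_auto_mutable.realized_set_allowed = true;
}
```

The alternative the reviewer suggests, plain DEFINE_PROP_ON_OFF_AUTO(), would use the unmodified descriptor and therefore reject writes once the device is realized, which is what the custom copy works around.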

> Thanks,
> 
> C.

Thanks,
Maciej




* Re: [PATCH v4 28/33] vfio/migration: Multifd device state transfer support - load thread
  2025-02-12 15:48   ` Cédric Le Goater
  2025-02-12 16:19     ` Cédric Le Goater
@ 2025-02-17 22:09     ` Maciej S. Szmigiero
  1 sibling, 0 replies; 137+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-17 22:09 UTC (permalink / raw)
  To: Cédric Le Goater
  Cc: Alex Williamson, Eric Blake, Peter Xu, Fabiano Rosas,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

On 12.02.2025 16:48, Cédric Le Goater wrote:
> On 1/30/25 11:08, Maciej S. Szmigiero wrote:
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> Since it's important to finish loading device state transferred via the
>> main migration channel (via save_live_iterate SaveVMHandler) before
>> starting loading the data asynchronously transferred via multifd the thread
>> doing the actual loading of the multifd transferred data is only started
>> from switchover_start SaveVMHandler.
>>
>> switchover_start handler is called when MIG_CMD_SWITCHOVER_START
>> sub-command of QEMU_VM_COMMAND is received via the main migration channel.
>>
>> This sub-command is only sent after all save_live_iterate data have already
>> been posted so it is safe to commence loading of the multifd-transferred
>> device state upon receiving it - loading of save_live_iterate data happens
>> synchronously in the main migration thread (much like the processing of
>> MIG_CMD_SWITCHOVER_START) so by the time MIG_CMD_SWITCHOVER_START is
>> processed all the preceding data must have already been loaded.
>>
>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>> ---
>>   hw/vfio/migration.c  | 229 +++++++++++++++++++++++++++++++++++++++++++
>>   hw/vfio/trace-events |   5 +
>>   2 files changed, 234 insertions(+)
>>
>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>> index 0c0caec1bd64..ab5b097f59c9 100644
>> --- a/hw/vfio/migration.c
>> +++ b/hw/vfio/migration.c
>> @@ -301,8 +301,16 @@ typedef struct VFIOStateBuffer {
>>   } VFIOStateBuffer;
>>   typedef struct VFIOMultifd {
>> +    QemuThread load_bufs_thread;
>> +    bool load_bufs_thread_running;
>> +    bool load_bufs_thread_want_exit;
>> +
>> +    bool load_bufs_iter_done;
>> +    QemuCond load_bufs_iter_done_cond;
>> +
>>       VFIOStateBuffers load_bufs;
>>       QemuCond load_bufs_buffer_ready_cond;
>> +    QemuCond load_bufs_thread_finished_cond;
>>       QemuMutex load_bufs_mutex; /* Lock order: this lock -> BQL */
>>       uint32_t load_buf_idx;
>>       uint32_t load_buf_idx_last;
>> @@ -449,6 +457,171 @@ static bool vfio_load_state_buffer(void *opaque, char *data, size_t data_size,
>>       return true;
>>   }
>> +static VFIOStateBuffer *vfio_load_state_buffer_get(VFIOMultifd *multifd)
>> +{
>> +    VFIOStateBuffer *lb;
>> +    guint bufs_len;
>> +
>> +    bufs_len = vfio_state_buffers_size_get(&multifd->load_bufs);
>> +    if (multifd->load_buf_idx >= bufs_len) {
>> +        assert(multifd->load_buf_idx == bufs_len);
>> +        return NULL;
>> +    }
>> +
>> +    lb = vfio_state_buffers_at(&multifd->load_bufs,
>> +                               multifd->load_buf_idx);
>> +    if (!lb->is_present) {
>> +        return NULL;
>> +    }
>> +
>> +    return lb;
>> +}
>> +
>> +static int vfio_load_bufs_thread_load_config(VFIODevice *vbasedev)
>> +{
>> +    return -EINVAL;
>> +}
>> +
>> +static bool vfio_load_state_buffer_write(VFIODevice *vbasedev,
>> +                                         VFIOStateBuffer *lb,
>> +                                         Error **errp)
>> +{
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    VFIOMultifd *multifd = migration->multifd;
>> +    g_autofree char *buf = NULL;
>> +    char *buf_cur;
>> +    size_t buf_len;
>> +
>> +    if (!lb->len) {
>> +        return true;
>> +    }
>> +
>> +    trace_vfio_load_state_device_buffer_load_start(vbasedev->name,
>> +                                                   multifd->load_buf_idx);
>> +
>> +    /* lb might become re-allocated when we drop the lock */
>> +    buf = g_steal_pointer(&lb->data);
>> +    buf_cur = buf;
>> +    buf_len = lb->len;
>> +    while (buf_len > 0) {
>> +        ssize_t wr_ret;
>> +        int errno_save;
>> +
>> +        /*
>> +         * Loading data to the device takes a while,
>> +         * drop the lock during this process.
>> +         */
>> +        qemu_mutex_unlock(&multifd->load_bufs_mutex);
>> +        wr_ret = write(migration->data_fd, buf_cur, buf_len);
>> +        errno_save = errno;
>> +        qemu_mutex_lock(&multifd->load_bufs_mutex);
>> +
>> +        if (wr_ret < 0) {
>> +            error_setg(errp,
>> +                       "writing state buffer %" PRIu32 " failed: %d",
>> +                       multifd->load_buf_idx, errno_save);
>> +            return false;
>> +        }
>> +
>> +        assert(wr_ret <= buf_len);
>> +        buf_len -= wr_ret;
>> +        buf_cur += wr_ret;
>> +    }
>> +
>> +    trace_vfio_load_state_device_buffer_load_end(vbasedev->name,
>> +                                                 multifd->load_buf_idx);
>> +
>> +    return true;
>> +}
>> +
>> +static bool vfio_load_bufs_thread_want_abort(VFIOMultifd *multifd,
>> +                                             bool *should_quit)
>> +{
>> +    return multifd->load_bufs_thread_want_exit || qatomic_read(should_quit);
>> +}
> 
> _abort or _exit or _quit ? I would opt for vfio_load_bufs_thread_want_exit()
> to match multifd->load_bufs_thread_want_exit.
> 

Will rename to vfio_load_bufs_thread_want_exit().

> 
>> +static bool vfio_load_bufs_thread(void *opaque, bool *should_quit, Error **errp)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    VFIOMultifd *multifd = migration->multifd;
>> +    bool ret = true;
>> +    int config_ret;
>> +
>> +    assert(multifd);
>> +    QEMU_LOCK_GUARD(&multifd->load_bufs_mutex);
>> +
>> +    assert(multifd->load_bufs_thread_running);
>> +
>> +    while (!vfio_load_bufs_thread_want_abort(multifd, should_quit)) {
>> +        VFIOStateBuffer *lb;
>> +
>> +        assert(multifd->load_buf_idx <= multifd->load_buf_idx_last);
>> +
>> +        lb = vfio_load_state_buffer_get(multifd);
>> +        if (!lb) {
>> +            trace_vfio_load_state_device_buffer_starved(vbasedev->name,
>> +                                                        multifd->load_buf_idx);
>> +            qemu_cond_wait(&multifd->load_bufs_buffer_ready_cond,
>> +                           &multifd->load_bufs_mutex);
>> +            continue;
>> +        }
>> +
>> +        if (multifd->load_buf_idx == multifd->load_buf_idx_last) {
>> +            break;
>> +        }
>> +
>> +        if (multifd->load_buf_idx == 0) {
>> +            trace_vfio_load_state_device_buffer_start(vbasedev->name);
>> +        }
>> +
>> +        if (!vfio_load_state_buffer_write(vbasedev, lb, errp)) {
>> +            ret = false;
>> +            goto ret_signal;
>> +        }
>> +
>> +        assert(multifd->load_buf_queued_pending_buffers > 0);
>> +        multifd->load_buf_queued_pending_buffers--;
>> +
>> +        if (multifd->load_buf_idx == multifd->load_buf_idx_last - 1) {
>> +            trace_vfio_load_state_device_buffer_end(vbasedev->name);
>> +        }
>> +
>> +        multifd->load_buf_idx++;
>> +    }
>> +
>> +    if (vfio_load_bufs_thread_want_abort(multifd, should_quit)) {
>> +        error_setg(errp, "operation cancelled");
>> +        ret = false;
>> +        goto ret_signal;
>> +    }
>> +
>> +    if (vfio_load_config_after_iter(vbasedev)) {
>> +        while (!multifd->load_bufs_iter_done) {
>> +            qemu_cond_wait(&multifd->load_bufs_iter_done_cond,
>> +                           &multifd->load_bufs_mutex);
>> +
>> +            if (vfio_load_bufs_thread_want_abort(multifd, should_quit)) {
>> +                error_setg(errp, "operation cancelled");
>> +                ret = false;
>> +                goto ret_signal;
>> +            }
>> +        }
>> +    }
> 
> Please put the above chunk at the end of the series with the patch
> adding ARM support. I think load_bufs_iter_done_cond should be moved
> out of this patch too.

Done, including moving multifd->load_bufs_iter_done_cond:
https://gitlab.com/maciejsszmigiero/qemu/-/commit/3739872954d79373d4d99c8ff9ed50709e84a9c5

> 
> 
> Thanks,
> 
> C.

Thanks,
Maciej




* Re: [PATCH v4 28/33] vfio/migration: Multifd device state transfer support - load thread
  2025-02-12 16:19     ` Cédric Le Goater
@ 2025-02-17 22:09       ` Maciej S. Szmigiero
  0 siblings, 0 replies; 137+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-17 22:09 UTC (permalink / raw)
  To: Cédric Le Goater
  Cc: Alex Williamson, Eric Blake, Peter Xu, Fabiano Rosas,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

On 12.02.2025 17:19, Cédric Le Goater wrote:
> On 2/12/25 16:48, Cédric Le Goater wrote:
>> On 1/30/25 11:08, Maciej S. Szmigiero wrote:
>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>
>>> Since it's important to finish loading device state transferred via the
>>> main migration channel (via save_live_iterate SaveVMHandler) before
>>> starting loading the data asynchronously transferred via multifd the thread
>>> doing the actual loading of the multifd transferred data is only started
>>> from switchover_start SaveVMHandler.
>>>
>>> switchover_start handler is called when MIG_CMD_SWITCHOVER_START
>>> sub-command of QEMU_VM_COMMAND is received via the main migration channel.
>>>
>>> This sub-command is only sent after all save_live_iterate data have already
>>> been posted so it is safe to commence loading of the multifd-transferred
>>> device state upon receiving it - loading of save_live_iterate data happens
>>> synchronously in the main migration thread (much like the processing of
>>> MIG_CMD_SWITCHOVER_START) so by the time MIG_CMD_SWITCHOVER_START is
>>> processed all the preceding data must have already been loaded.
>>>
>>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>>> ---
>>>   hw/vfio/migration.c  | 229 +++++++++++++++++++++++++++++++++++++++++++
>>>   hw/vfio/trace-events |   5 +
>>>   2 files changed, 234 insertions(+)
>>>
>>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>>> index 0c0caec1bd64..ab5b097f59c9 100644
>>> --- a/hw/vfio/migration.c
>>> +++ b/hw/vfio/migration.c
>>> @@ -301,8 +301,16 @@ typedef struct VFIOStateBuffer {
>>>   } VFIOStateBuffer;
>>>   typedef struct VFIOMultifd {
>>> +    QemuThread load_bufs_thread;
>>> +    bool load_bufs_thread_running;
>>> +    bool load_bufs_thread_want_exit;
>>> +
>>> +    bool load_bufs_iter_done;
>>> +    QemuCond load_bufs_iter_done_cond;
>>> +
>>>       VFIOStateBuffers load_bufs;
>>>       QemuCond load_bufs_buffer_ready_cond;
>>> +    QemuCond load_bufs_thread_finished_cond;
>>>       QemuMutex load_bufs_mutex; /* Lock order: this lock -> BQL */
>>>       uint32_t load_buf_idx;
>>>       uint32_t load_buf_idx_last;
>>> @@ -449,6 +457,171 @@ static bool vfio_load_state_buffer(void *opaque, char *data, size_t data_size,
>>>       return true;
>>>   }
>>> +static VFIOStateBuffer *vfio_load_state_buffer_get(VFIOMultifd *multifd)
>>> +{
>>> +    VFIOStateBuffer *lb;
>>> +    guint bufs_len;
>>> +
>>> +    bufs_len = vfio_state_buffers_size_get(&multifd->load_bufs);
>>> +    if (multifd->load_buf_idx >= bufs_len) {
>>> +        assert(multifd->load_buf_idx == bufs_len);
>>> +        return NULL;
>>> +    }
>>> +
>>> +    lb = vfio_state_buffers_at(&multifd->load_bufs,
>>> +                               multifd->load_buf_idx);
>>> +    if (!lb->is_present) {
>>> +        return NULL;
>>> +    }
>>> +
>>> +    return lb;
>>> +}
>>> +
>>> +static int vfio_load_bufs_thread_load_config(VFIODevice *vbasedev)
>>> +{
>>> +    return -EINVAL;
>>> +}
>>> +
>>> +static bool vfio_load_state_buffer_write(VFIODevice *vbasedev,
>>> +                                         VFIOStateBuffer *lb,
>>> +                                         Error **errp)
>>> +{
>>> +    VFIOMigration *migration = vbasedev->migration;
>>> +    VFIOMultifd *multifd = migration->multifd;
>>> +    g_autofree char *buf = NULL;
>>> +    char *buf_cur;
>>> +    size_t buf_len;
>>> +
>>> +    if (!lb->len) {
>>> +        return true;
>>> +    }
>>> +
>>> +    trace_vfio_load_state_device_buffer_load_start(vbasedev->name,
>>> +                                                   multifd->load_buf_idx);
>>> +
>>> +    /* lb might become re-allocated when we drop the lock */
>>> +    buf = g_steal_pointer(&lb->data);
>>> +    buf_cur = buf;
>>> +    buf_len = lb->len;
>>> +    while (buf_len > 0) {
>>> +        ssize_t wr_ret;
>>> +        int errno_save;
>>> +
>>> +        /*
>>> +         * Loading data to the device takes a while,
>>> +         * drop the lock during this process.
>>> +         */
>>> +        qemu_mutex_unlock(&multifd->load_bufs_mutex);
>>> +        wr_ret = write(migration->data_fd, buf_cur, buf_len);
>>> +        errno_save = errno;
>>> +        qemu_mutex_lock(&multifd->load_bufs_mutex);
>>> +
>>> +        if (wr_ret < 0) {
>>> +            error_setg(errp,
>>> +                       "writing state buffer %" PRIu32 " failed: %d",
>>> +                       multifd->load_buf_idx, errno_save);
>>> +            return false;
>>> +        }
>>> +
>>> +        assert(wr_ret <= buf_len);
>>> +        buf_len -= wr_ret;
>>> +        buf_cur += wr_ret;
>>> +    }
>>> +
>>> +    trace_vfio_load_state_device_buffer_load_end(vbasedev->name,
>>> +                                                 multifd->load_buf_idx);
>>> +
>>> +    return true;
>>> +}
>>> +
>>> +static bool vfio_load_bufs_thread_want_abort(VFIOMultifd *multifd,
>>> +                                             bool *should_quit)
>>> +{
>>> +    return multifd->load_bufs_thread_want_exit || qatomic_read(should_quit);
>>> +}
>>
>> _abort or _exit or _quit ? I would opt for vfio_load_bufs_thread_want_exit()
>> to match multifd->load_bufs_thread_want_exit.
>>
>>
>>> +static bool vfio_load_bufs_thread(void *opaque, bool *should_quit, Error **errp)
>>> +{
>>> +    VFIODevice *vbasedev = opaque;
>>> +    VFIOMigration *migration = vbasedev->migration;
>>> +    VFIOMultifd *multifd = migration->multifd;
>>> +    bool ret = true;
>>> +    int config_ret;
>>> +
>>> +    assert(multifd);
>>> +    QEMU_LOCK_GUARD(&multifd->load_bufs_mutex);
>>> +
>>> +    assert(multifd->load_bufs_thread_running);
>>> +
>>> +    while (!vfio_load_bufs_thread_want_abort(multifd, should_quit)) {
>>> +        VFIOStateBuffer *lb;
>>> +
>>> +        assert(multifd->load_buf_idx <= multifd->load_buf_idx_last);
>>> +
>>> +        lb = vfio_load_state_buffer_get(multifd);
>>> +        if (!lb) {
>>> +            trace_vfio_load_state_device_buffer_starved(vbasedev->name,
>>> +                                                        multifd->load_buf_idx);
>>> +            qemu_cond_wait(&multifd->load_bufs_buffer_ready_cond,
>>> +                           &multifd->load_bufs_mutex);
>>> +            continue;
>>> +        }
>>> +
>>> +        if (multifd->load_buf_idx == multifd->load_buf_idx_last) {
>>> +            break;
>>> +        }
>>> +
>>> +        if (multifd->load_buf_idx == 0) {
>>> +            trace_vfio_load_state_device_buffer_start(vbasedev->name);
>>> +        }
>>> +
>>> +        if (!vfio_load_state_buffer_write(vbasedev, lb, errp)) {
>>> +            ret = false;
>>> +            goto ret_signal;
>>> +        }
>>> +
>>> +        assert(multifd->load_buf_queued_pending_buffers > 0);
>>> +        multifd->load_buf_queued_pending_buffers--;
>>> +
>>> +        if (multifd->load_buf_idx == multifd->load_buf_idx_last - 1) {
>>> +            trace_vfio_load_state_device_buffer_end(vbasedev->name);
>>> +        }
>>> +
>>> +        multifd->load_buf_idx++;
>>> +    }
>>> +
>>> +    if (vfio_load_bufs_thread_want_abort(multifd, should_quit)) {
>>> +        error_setg(errp, "operation cancelled");
>>> +        ret = false;
>>> +        goto ret_signal;
>>> +    }
>>> +
>>> +    if (vfio_load_config_after_iter(vbasedev)) {
>>> +        while (!multifd->load_bufs_iter_done) {
>>> +            qemu_cond_wait(&multifd->load_bufs_iter_done_cond,
>>> +                           &multifd->load_bufs_mutex);
>>> +
>>> +            if (vfio_load_bufs_thread_want_abort(multifd, should_quit)) {
>>> +                error_setg(errp, "operation cancelled");
>>> +                ret = false;
>>> +                goto ret_signal;
>>> +            }
>>> +        }
>>> +    }
>>
>> Please put the above chunk at the end of the series with the patch
>> adding ARM support. I think load_bufs_iter_done_cond should be moved
>> out of this patch too.
>>
>>
>>
>> Thanks,
>>
>> C.
>>
>>
>>
>>> +    config_ret = vfio_load_bufs_thread_load_config(vbasedev);
>>> +    if (config_ret) {
>>> +        error_setg(errp, "load config state failed: %d", config_ret);
>>> +        ret = false;
>>> +    }
>>> +
>>> +ret_signal:
>>> +    multifd->load_bufs_thread_running = false;
>>> +    qemu_cond_signal(&multifd->load_bufs_thread_finished_cond);
>>> +
>>> +    return ret;
>>> +}
>>> +
>>>   static int vfio_save_device_config_state(QEMUFile *f, void *opaque,
>>>                                            Error **errp)
>>>   {
>>> @@ -517,11 +690,40 @@ static VFIOMultifd *vfio_multifd_new(void)
>>>       multifd->load_buf_queued_pending_buffers = 0;
>>>       qemu_cond_init(&multifd->load_bufs_buffer_ready_cond);
>>> +    multifd->load_bufs_iter_done = false;
>>> +    qemu_cond_init(&multifd->load_bufs_iter_done_cond);
>>> +
>>> +    multifd->load_bufs_thread_running = false;
>>> +    multifd->load_bufs_thread_want_exit = false;
>>> +    qemu_cond_init(&multifd->load_bufs_thread_finished_cond);
>>> +
>>>       return multifd;
>>>   }
>>> +static void vfio_load_cleanup_load_bufs_thread(VFIOMultifd *multifd)
>>> +{
>>> +    /* The lock order is load_bufs_mutex -> BQL so unlock BQL here first */
>>> +    bql_unlock();
>>> +    WITH_QEMU_LOCK_GUARD(&multifd->load_bufs_mutex) {
>>> +        while (multifd->load_bufs_thread_running) {
>>> +            multifd->load_bufs_thread_want_exit = true;
>>> +
>>> +            qemu_cond_signal(&multifd->load_bufs_buffer_ready_cond);
>>> +            qemu_cond_signal(&multifd->load_bufs_iter_done_cond);
>>> +            qemu_cond_wait(&multifd->load_bufs_thread_finished_cond,
>>> +                           &multifd->load_bufs_mutex);
>>> +        }
>>> +    }
>>> +    bql_lock();
>>> +}
>>> +
>>>   static void vfio_multifd_free(VFIOMultifd *multifd)
>>>   {
>>> +    vfio_load_cleanup_load_bufs_thread(multifd);
>>> +
>>> +    qemu_cond_destroy(&multifd->load_bufs_thread_finished_cond);
>>> +    qemu_cond_destroy(&multifd->load_bufs_iter_done_cond);
>>> +    vfio_state_buffers_destroy(&multifd->load_bufs);
>>>       qemu_cond_destroy(&multifd->load_bufs_buffer_ready_cond);
>>>       qemu_mutex_destroy(&multifd->load_bufs_mutex);
>>> @@ -1042,6 +1244,32 @@ static bool vfio_switchover_ack_needed(void *opaque)
>>>       return vfio_precopy_supported(vbasedev);
>>>   }
>>> +static int vfio_switchover_start(void *opaque)
>>> +{
>>> +    VFIODevice *vbasedev = opaque;
>>> +    VFIOMigration *migration = vbasedev->migration;
>>> +    VFIOMultifd *multifd = migration->multifd;
>>> +
>>> +    if (!migration->multifd_transfer) {
>>> +        /* Load thread is only used for multifd transfer */
>>> +        return 0;
>>> +    }
>>> +
>>> +    assert(multifd);
>>> +
>>> +    /* The lock order is load_bufs_mutex -> BQL so unlock BQL here first */
>>> +    bql_unlock();
>>> +    WITH_QEMU_LOCK_GUARD(&multifd->load_bufs_mutex) {
>>> +        assert(!multifd->load_bufs_thread_running);
>>> +        multifd->load_bufs_thread_running = true;
>>> +    }
>>> +    bql_lock();
>>> +
>>> +    qemu_loadvm_start_load_thread(vfio_load_bufs_thread, vbasedev);
> 
> and please move these changes under a vfio_multifd_switchover_start()
> routine.
> 

So you want to rename this function (now moved to migration-multifd.c)
to vfio_multifd_switchover_start(), add a new
vfio_switchover_start() in migration.c, and have it call
vfio_multifd_switchover_start(), correct?
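For readers following the thread: the split being discussed is a thin generic handler dispatching to a multifd-specific routine. A standalone sketch under simplified, hypothetical type and function names (not the actual QEMU structs):

```c
#include <stdbool.h>

/* Hypothetical, trimmed-down shapes -- not QEMU's real structs. */
typedef struct VFIOMigration {
    bool multifd_transfer;
} VFIOMigration;

typedef struct VFIODevice {
    VFIOMigration *migration;
} VFIODevice;

static bool load_thread_started;

/* migration-multifd.c: the multifd-specific part, under its own name. */
static int vfio_multifd_switchover_start(VFIODevice *vbasedev)
{
    /* The real code would start the load-bufs thread here; just record it. */
    load_thread_started = true;
    return 0;
}

/* migration.c: the generic SaveVMHandler keeps only the dispatch logic. */
static int vfio_switchover_start(void *opaque)
{
    VFIODevice *vbasedev = opaque;

    if (!vbasedev->migration->multifd_transfer) {
        /* Load thread is only used for multifd transfer. */
        return 0;
    }

    return vfio_multifd_switchover_start(vbasedev);
}
```

The benefit of this shape is that migration.c never needs to see the multifd internals, only the boolean dispatch.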
  
> Thanks,
> 
> C.

Thanks,
Maciej




* Re: [PATCH v4 29/33] vfio/migration: Multifd device state transfer support - config loading support
  2025-02-12 16:21   ` Cédric Le Goater
@ 2025-02-17 22:09     ` Maciej S. Szmigiero
  0 siblings, 0 replies; 137+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-17 22:09 UTC (permalink / raw)
  To: Cédric Le Goater
  Cc: Alex Williamson, Eric Blake, Peter Xu, Fabiano Rosas,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

On 12.02.2025 17:21, Cédric Le Goater wrote:
> On 1/30/25 11:08, Maciej S. Szmigiero wrote:
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> Load device config received via multifd using the existing machinery
>> behind vfio_load_device_config_state().
>>
>> Also, make sure to process the relevant main migration channel flags.
>>
>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>> ---
>>   hw/vfio/migration.c | 103 +++++++++++++++++++++++++++++++++++++++++---
>>   1 file changed, 98 insertions(+), 5 deletions(-)
>>
>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>> index ab5b097f59c9..31f651ffee85 100644
>> --- a/hw/vfio/migration.c
>> +++ b/hw/vfio/migration.c
>> @@ -15,6 +15,7 @@
>>   #include <linux/vfio.h>
>>   #include <sys/ioctl.h>
>> +#include "io/channel-buffer.h"
>>   #include "system/runstate.h"
>>   #include "hw/vfio/vfio-common.h"
>>   #include "migration/misc.h"
>> @@ -457,6 +458,57 @@ static bool vfio_load_state_buffer(void *opaque, char *data, size_t data_size,
>>       return true;
>>   }
>> +static int vfio_load_device_config_state(QEMUFile *f, void *opaque);
>> +
>> +static int vfio_load_bufs_thread_load_config(VFIODevice *vbasedev)
>> +{
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    VFIOMultifd *multifd = migration->multifd;
>> +    VFIOStateBuffer *lb;
>> +    g_autoptr(QIOChannelBuffer) bioc = NULL;
>> +    QEMUFile *f_out = NULL, *f_in = NULL;
>> +    uint64_t mig_header;
>> +    int ret;
>> +
>> +    assert(multifd->load_buf_idx == multifd->load_buf_idx_last);
>> +    lb = vfio_state_buffers_at(&multifd->load_bufs, multifd->load_buf_idx);
>> +    assert(lb->is_present);
>> +
>> +    bioc = qio_channel_buffer_new(lb->len);
>> +    qio_channel_set_name(QIO_CHANNEL(bioc), "vfio-device-config-load");
>> +
>> +    f_out = qemu_file_new_output(QIO_CHANNEL(bioc));
>> +    qemu_put_buffer(f_out, (uint8_t *)lb->data, lb->len);
>> +
>> +    ret = qemu_fflush(f_out);
>> +    if (ret) {
>> +        g_clear_pointer(&f_out, qemu_fclose);
>> +        return ret;
>> +    }
>> +
>> +    qio_channel_io_seek(QIO_CHANNEL(bioc), 0, 0, NULL);
>> +    f_in = qemu_file_new_input(QIO_CHANNEL(bioc));
>> +
>> +    mig_header = qemu_get_be64(f_in);
>> +    if (mig_header != VFIO_MIG_FLAG_DEV_CONFIG_STATE) {
>> +        g_clear_pointer(&f_out, qemu_fclose);
>> +        g_clear_pointer(&f_in, qemu_fclose);
>> +        return -EINVAL;
>> +    }
>> +
>> +    bql_lock();
>> +    ret = vfio_load_device_config_state(f_in, vbasedev);
>> +    bql_unlock();
>> +
>> +    g_clear_pointer(&f_out, qemu_fclose);
>> +    g_clear_pointer(&f_in, qemu_fclose);
>> +    if (ret < 0) {
>> +        return ret;
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>>   static VFIOStateBuffer *vfio_load_state_buffer_get(VFIOMultifd *multifd)
>>   {
>>       VFIOStateBuffer *lb;
>> @@ -477,11 +529,6 @@ static VFIOStateBuffer *vfio_load_state_buffer_get(VFIOMultifd *multifd)
>>       return lb;
>>   }
>> -static int vfio_load_bufs_thread_load_config(VFIODevice *vbasedev)
>> -{
>> -    return -EINVAL;
>> -}
> 
> Please remove this change from this patch and from patch 28.

The dummy implementation has to be there, otherwise the tree wouldn't
compile at the previous commit, since the
vfio_load_bufs_thread_load_config() call is part of
vfio_load_bufs_thread().

This is an artifact of splitting the whole load operation into
multiple commits.
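A minimal illustration of that bisectability pattern, with hypothetical names rather than the actual QEMU functions: an early commit lands a placeholder so every intermediate commit still compiles, and a later commit replaces its body with the real logic.

```c
#include <errno.h>

/*
 * Commit N: placeholder implementation.  Callers such as the load
 * thread can already reference it, so the tree compiles at every
 * point in the series.
 */
static int load_config_stub(void)
{
    return -EINVAL; /* "not wired up yet" */
}

/* Commit N+1 would replace this body with the real config-loading code. */
static int load_config(void)
{
    return load_config_stub();
}
```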

>>   static bool vfio_load_state_buffer_write(VFIODevice *vbasedev,
>>                                            VFIOStateBuffer *lb,
>>                                            Error **errp)
>> @@ -1168,6 +1215,8 @@ static int vfio_load_cleanup(void *opaque)
>>   static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
>>   {
>>       VFIODevice *vbasedev = opaque;
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    VFIOMultifd *multifd = migration->multifd;
>>       int ret = 0;
>>       uint64_t data;
>> @@ -1179,6 +1228,12 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
>>           switch (data) {
>>           case VFIO_MIG_FLAG_DEV_CONFIG_STATE:
>>           {
>> +            if (migration->multifd_transfer) {
>> +                error_report("%s: got DEV_CONFIG_STATE but doing multifd transfer",
>> +                             vbasedev->name);
>> +                return -EINVAL;
>> +            }
>> +
>>               return vfio_load_device_config_state(f, opaque);
>>           }
>>           case VFIO_MIG_FLAG_DEV_SETUP_STATE:
>> @@ -1223,6 +1278,44 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
>>               return ret;
>>           }
>> +        case VFIO_MIG_FLAG_DEV_CONFIG_LOAD_READY:
>> +        {
>> +            if (!migration->multifd_transfer) {
>> +                error_report("%s: got DEV_CONFIG_LOAD_READY outside multifd transfer",
>> +                             vbasedev->name);
>> +                return -EINVAL;
>> +            }
>> +
>> +            if (!vfio_load_config_after_iter(vbasedev)) {
>> +                error_report("%s: got DEV_CONFIG_LOAD_READY but was disabled",
>> +                             vbasedev->name);
>> +                return -EINVAL;
>> +            }
> 
> Please put the above chunk at the end of the series with the patch
> adding ARM support.

Done.

> 
>> +            assert(multifd);
>> +
>> +            /* The lock order is load_bufs_mutex -> BQL so unlock BQL here first */
>> +            bql_unlock();
>> +            WITH_QEMU_LOCK_GUARD(&multifd->load_bufs_mutex) {
>> +                if (multifd->load_bufs_iter_done) {
>> +                    /* Can't print error here as we're outside BQL */
>> +                    ret = -EINVAL;
>> +                    break;
>> +                }
>> +
>> +                multifd->load_bufs_iter_done = true;
>> +                qemu_cond_signal(&multifd->load_bufs_iter_done_cond);
>> +                ret = 0;
>> +            }
>> +            bql_lock();
> 
> Please introduce a vfio_multifd routine for the code above.

Done.
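The helper asked for here boils down to a flag-plus-condvar handoff. A standalone pthreads sketch under assumed, simplified names (the real code additionally has to drop the BQL first, because the lock order is load_bufs_mutex -> BQL):

```c
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical cut-down state mirroring VFIOMultifd's iter-done fields. */
typedef struct {
    pthread_mutex_t load_bufs_mutex;
    pthread_cond_t load_bufs_iter_done_cond;
    bool load_bufs_iter_done;
} Multifd;

/*
 * Mark the main-channel iteration data as fully loaded and wake the
 * load-bufs thread waiting on load_bufs_iter_done_cond.  A second
 * DEV_CONFIG_LOAD_READY marker is a stream error.
 */
static int multifd_signal_iter_done(Multifd *mf)
{
    int ret = 0;

    pthread_mutex_lock(&mf->load_bufs_mutex);
    if (mf->load_bufs_iter_done) {
        ret = -EINVAL; /* duplicate marker */
    } else {
        mf->load_bufs_iter_done = true;
        pthread_cond_signal(&mf->load_bufs_iter_done_cond);
    }
    pthread_mutex_unlock(&mf->load_bufs_mutex);

    return ret;
}
```

Signalling while still holding the mutex is fine here; the waiter re-checks the flag in a while loop after qemu_cond_wait() returns, as the quoted vfio_load_bufs_thread() does.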

> 
> 
> Thanks,
> 
> C.
> 

Thanks,
Maciej




* Re: [PATCH v4 31/33] vfio/migration: Multifd device state transfer support - send side
  2025-02-12 17:03   ` Cédric Le Goater
@ 2025-02-17 22:12     ` Maciej S. Szmigiero
  0 siblings, 0 replies; 137+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-17 22:12 UTC (permalink / raw)
  To: Cédric Le Goater
  Cc: Alex Williamson, Eric Blake, Peter Xu, Fabiano Rosas,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

On 12.02.2025 18:03, Cédric Le Goater wrote:
> On 1/30/25 11:08, Maciej S. Szmigiero wrote:
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> Implement the multifd device state transfer via additional per-device
>> thread inside save_live_complete_precopy_thread handler.
>>
>> Switch between doing the data transfer in the new handler and doing it
>> in the old save_state handler depending on the
>> x-migration-multifd-transfer device property value.
>>
>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>> ---
>>   hw/vfio/migration.c  | 159 +++++++++++++++++++++++++++++++++++++++++++
>>   hw/vfio/trace-events |   2 +
>>   2 files changed, 161 insertions(+)
>>
>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>> index 31f651ffee85..37d1c0f3d32f 100644
>> --- a/hw/vfio/migration.c
>> +++ b/hw/vfio/migration.c
>> @@ -943,6 +943,24 @@ static int vfio_save_setup(QEMUFile *f, void *opaque, Error **errp)
>>       uint64_t stop_copy_size = VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE;
>>       int ret;
>> +    /*
>> +     * Make a copy of this setting at the start in case it is changed
>> +     * mid-migration.
>> +     */
>> +    if (vbasedev->migration_multifd_transfer == ON_OFF_AUTO_AUTO) {
>> +        migration->multifd_transfer = vfio_multifd_transfer_supported();
>> +    } else {
>> +        migration->multifd_transfer =
>> +            vbasedev->migration_multifd_transfer == ON_OFF_AUTO_ON;
>> +    }
>> +
>> +    if (migration->multifd_transfer && !vfio_multifd_transfer_supported()) {
>> +        error_setg(errp,
>> +                   "%s: Multifd device transfer requested but unsupported in the current config",
>> +                   vbasedev->name);
>> +        return -EINVAL;
>> +    }
> 
> Please implement a common routine vfio_multifd_is_enabled() that can be
> shared with vfio_load_setup().

Done/almost done (details are being worked out in the conversation about another patch).
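For reference, the common setup routine under discussion could look roughly like the self-contained sketch below. Note that `OnOffAuto`, the `multifd_transfer_supported` flag and the function name are simplified stand-ins modeled on the checks quoted above, not the real QEMU API:

```c
#include <stdbool.h>

/* Stand-in for QEMU's OnOffAuto property type (assumption, not the real enum) */
typedef enum { ON_OFF_AUTO_AUTO, ON_OFF_AUTO_ON, ON_OFF_AUTO_OFF } OnOffAuto;

/* Stands in for vfio_multifd_transfer_supported() */
static bool multifd_transfer_supported;

/*
 * Resolve the ON_OFF_AUTO property into a concrete bool once, at save/load
 * setup time, and reject an explicit "on" when the config can't support it.
 * Returns false on error (the real code would also fill an Error **errp).
 */
static bool resolve_multifd_transfer(OnOffAuto prop, bool *enabled)
{
    if (prop == ON_OFF_AUTO_AUTO) {
        *enabled = multifd_transfer_supported;
    } else {
        *enabled = (prop == ON_OFF_AUTO_ON);
    }

    if (*enabled && !multifd_transfer_supported) {
        /* Multifd device transfer requested but unsupported */
        return false;
    }
    return true;
}
```

Both vfio_save_setup() and vfio_load_setup() could then call this once and store the resolved bool, which is what the "make a copy of this setting at the start" comment in the patch is about.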

>> +
>>       qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
>>       vfio_query_stop_copy_size(vbasedev, &stop_copy_size);
>> @@ -1114,13 +1132,32 @@ static int vfio_save_iterate(QEMUFile *f, void *opaque)
>>       return !migration->precopy_init_size && !migration->precopy_dirty_size;
>>   }
>> +static void vfio_save_multifd_emit_dummy_eos(VFIODevice *vbasedev, QEMUFile *f)
> 
> I would prefer naming it vfio_multifd_emit_dummy_eos().

Done.

>> +{
>> +    VFIOMigration *migration = vbasedev->migration;
>> +
>> +    assert(migration->multifd_transfer);
>> +
>> +    /*
>> +     * Emit dummy NOP data on the main migration channel since the actual
>> +     * device state transfer is done via multifd channels.
>> +     */
>> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>> +}
>> +
>>   static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
>>   {
>>       VFIODevice *vbasedev = opaque;
>> +    VFIOMigration *migration = vbasedev->migration;
>>       ssize_t data_size;
>>       int ret;
>>       Error *local_err = NULL;
>> +    if (migration->multifd_transfer) {
>> +        vfio_save_multifd_emit_dummy_eos(vbasedev, f);
>> +        return 0;
>> +    }
>> +
>>       trace_vfio_save_complete_precopy_start(vbasedev->name);
>>       /* We reach here with device state STOP or STOP_COPY only */
>> @@ -1146,12 +1183,133 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
>>       return ret;
>>   }
>> +static int
>> +vfio_save_complete_precopy_async_thread_config_state(VFIODevice *vbasedev,
>> +                                                     char *idstr,
>> +                                                     uint32_t instance_id,
>> +                                                     uint32_t idx)
> 
> why use 'async_thread' in the name ?
> 
> vfio_save_complete_precopy_config_state() should be enough to refer
> to its caller vfio_save_complete_precopy_thread(). 

That "async" part is truly unnecessary since it's a leftover from
patch set v1 days when the thread entry point was called
"vfio_save_complete_precopy_async_thread".

But I will keep the "thread" part since naming it just
"vfio_save_complete_precopy_config_state()" would suggest it's
called from normal precopy handler (vfio_save_complete_precopy()).

> Please add
> an 'Error **' argument too.
> 

Good idea, done now.

>> +{
>> +    g_autoptr(QIOChannelBuffer) bioc = NULL;
>> +    g_autoptr(QEMUFile) f = NULL;
>> +    int ret;
>> +    g_autofree VFIODeviceStatePacket *packet = NULL;
>> +    size_t packet_len;
>> +
>> +    bioc = qio_channel_buffer_new(0);
>> +    qio_channel_set_name(QIO_CHANNEL(bioc), "vfio-device-config-save");
>> +
>> +    f = qemu_file_new_output(QIO_CHANNEL(bioc));
>> +
>> +    ret = vfio_save_device_config_state(f, vbasedev, NULL);
> 
> I would prefer that we catch the error and propagate it to the caller.

Sure.

>> +    if (ret) {
>> +        return ret;
>> +    }
>> +
>> +    ret = qemu_fflush(f);
>> +    if (ret) {
>> +        return ret;
>> +    }
>> +
>> +    packet_len = sizeof(*packet) + bioc->usage;
>> +    packet = g_malloc0(packet_len);
>> +    packet->idx = idx;
>> +    packet->flags = VFIO_DEVICE_STATE_CONFIG_STATE;
>> +    memcpy(&packet->data, bioc->data, bioc->usage);
>> +
>> +    if (!multifd_queue_device_state(idstr, instance_id,
>> +                                    (char *)packet, packet_len)) {
>> +        return -1;
>> +    }
>> +
>> +    qatomic_add(&bytes_transferred, packet_len);
>> +
>> +    return 0;
>> +}
>> +
>> +static int vfio_save_complete_precopy_thread(char *idstr,
>> +                                             uint32_t instance_id,
>> +                                             bool *abort_flag,
>> +                                             void *opaque)
> 
> This lacks an "Error **" argument. I am not sure what was decided
> in patch 19 "migration: Add save_live_complete_precopy_thread
> handler".
> 
> We should do our best to collect and propagate errors and avoid
> error_report() calls. With VFIO involved, the reasons why errors
> can occur are increasingly numerous, as hardware is exposed and
> host drivers are involved.
> 
> I understand this is a complex request for code when this code
> relies on a framework using callbacks, even more with threads.

It now has an Error argument from the changes resulting from
discussions with Peter:
https://gitlab.com/maciejsszmigiero/qemu/-/commit/0e23b66291b95c10ec1f0d82830320cae9e06ce4

>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    int ret;
>> +    g_autofree VFIODeviceStatePacket *packet = NULL;
>> +    uint32_t idx;
>> +
>> +    if (!migration->multifd_transfer) {
>> +        /* Nothing to do, vfio_save_complete_precopy() does the transfer. */
> 
> why would vfio_save_complete_precopy_thread be called then ? 

The migration core launches these threads if it supports device
state transfer.

But the driver (in this case VFIO) might not support such transfer
for its own reasons - for example because the user disabled the
switchover start message or just explicitly disabled this transfer.

Or maybe even the device does not like this transfer for some
reason (possible in the ARM case without config state interlock).

We discussed this detail during v3 with Avihai and Peter and
decided to do it this way (launching this thread unconditionally)
rather than export additional SaveVMHandler through which
the device could tell the migration core whether it wants such thread:
https://lore.kernel.org/qemu-devel/Z2BkbkF6P-2MHNN2@x1n/

> Looks
> like an error to me, may be not fatal but an error report would be
> good to have. no ?
> 
>> +        return 0;
>> +    }
>> +
>> +    trace_vfio_save_complete_precopy_thread_start(vbasedev->name,
>> +                                                  idstr, instance_id);
>> +
>> +    /* We reach here with device state STOP or STOP_COPY only */
>> +    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP_COPY,
>> +                                   VFIO_DEVICE_STATE_STOP, NULL);
> 
> Error missing.

Already taken care of when adding Error parameter to the thread entry
point function, but good point anyway.

>> +    if (ret) {
>> +        goto ret_finish;
>> +    }
>> +
>> +    packet = g_malloc0(sizeof(*packet) + migration->data_buffer_size);
>> +
>> +    for (idx = 0; ; idx++) {
>> +        ssize_t data_size;
>> +        size_t packet_size;
>> +
>> +        if (qatomic_read(abort_flag)) {
>> +            ret = -ECANCELED;
>> +            goto ret_finish;
>> +        }
>> +
>> +        data_size = read(migration->data_fd, &packet->data,
>> +                         migration->data_buffer_size);
>> +        if (data_size < 0) {
>> +            ret = -errno;
>> +            goto ret_finish;
>> +        } else if (data_size == 0) {
>> +            break;
>> +        }
>> +
>> +        packet->idx = idx;
>> +        packet_size = sizeof(*packet) + data_size;
>> +
>> +        if (!multifd_queue_device_state(idstr, instance_id,
>> +                                        (char *)packet, packet_size)) {
>> +            ret = -1;
>> +            goto ret_finish;
>> +        }
>> +
>> +        qatomic_add(&bytes_transferred, packet_size);
>> +    }
>> +
>> +    ret = vfio_save_complete_precopy_async_thread_config_state(vbasedev, idstr,
>> +                                                               instance_id,
>> +                                                               idx);
>> +
>> +ret_finish:
>> +    trace_vfio_save_complete_precopy_thread_end(vbasedev->name, ret);
>> +
>> +    return ret;
>> +}
>> +
>>   static void vfio_save_state(QEMUFile *f, void *opaque)
>>   {
>>       VFIODevice *vbasedev = opaque;
>> +    VFIOMigration *migration = vbasedev->migration;
>>       Error *local_err = NULL;
>>       int ret;
>> +    if (migration->multifd_transfer) {
>> +        if (vfio_load_config_after_iter(vbasedev)) {
>> +            qemu_put_be64(f, VFIO_MIG_FLAG_DEV_CONFIG_LOAD_READY);
> 
> 
> Please put the above chunck at the end of the series with the patch
> adding ARM support.

Done.

>> +        } else {
>> +            vfio_save_multifd_emit_dummy_eos(vbasedev, f);
>> +        }
> 
> Please introduce a vfio_multifd_save_state() routine and a
> vfio_"normal"_save_state() routine and change vfio_save_state()
> to call one or the other.
> 

So what should be the name of this "normal" save state routine
then, since you put "normal" in quotes?
vfio_nonmultifd_save_state()?

> Thanks,
> 
> C.

Thanks,
Maciej



^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH v4 26/33] vfio/migration: Multifd device state transfer support - receive init/cleanup
  2025-02-17  9:38       ` Cédric Le Goater
@ 2025-02-17 22:13         ` Maciej S. Szmigiero
  2025-02-18  7:54           ` Cédric Le Goater
  0 siblings, 1 reply; 137+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-17 22:13 UTC (permalink / raw)
  To: Cédric Le Goater
  Cc: Alex Williamson, Fabiano Rosas, Peter Xu, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

On 17.02.2025 10:38, Cédric Le Goater wrote:
> On 2/14/25 21:55, Maciej S. Szmigiero wrote:
>> On 12.02.2025 11:55, Cédric Le Goater wrote:
>>> On 1/30/25 11:08, Maciej S. Szmigiero wrote:
>>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>>
>>>> Add support for VFIOMultifd data structure that will contain most of the
>>>> receive-side data together with its init/cleanup methods.
>>>>
>>>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>>>> ---
>>>>   hw/vfio/migration.c           | 52 +++++++++++++++++++++++++++++++++--
>>>>   include/hw/vfio/vfio-common.h |  5 ++++
>>>>   2 files changed, 55 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>>>> index 3211041939c6..bcdf204d5cf4 100644
>>>> --- a/hw/vfio/migration.c
>>>> +++ b/hw/vfio/migration.c
>>>> @@ -300,6 +300,9 @@ typedef struct VFIOStateBuffer {
>>>>       size_t len;
>>>>   } VFIOStateBuffer;
>>>> +typedef struct VFIOMultifd {
>>>> +} VFIOMultifd;
>>>> +
>>>>   static void vfio_state_buffer_clear(gpointer data)
>>>>   {
>>>>       VFIOStateBuffer *lb = data;
>>>> @@ -398,6 +401,18 @@ static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
>>>>       return qemu_file_get_error(f);
>>>>   }
>>>> +static VFIOMultifd *vfio_multifd_new(void)
>>>> +{
>>>> +    VFIOMultifd *multifd = g_new(VFIOMultifd, 1);
>>>> +
>>>> +    return multifd;
>>>> +}
>>>> +
>>>> +static void vfio_multifd_free(VFIOMultifd *multifd)
>>>> +{
>>>> +    g_free(multifd);
>>>> +}
>>>> +
>>>>   static void vfio_migration_cleanup(VFIODevice *vbasedev)
>>>>   {
>>>>       VFIOMigration *migration = vbasedev->migration;
>>>> @@ -785,14 +800,47 @@ static void vfio_save_state(QEMUFile *f, void *opaque)
>>>>   static int vfio_load_setup(QEMUFile *f, void *opaque, Error **errp)
>>>>   {
>>>>       VFIODevice *vbasedev = opaque;
>>>> +    VFIOMigration *migration = vbasedev->migration;
>>>> +    int ret;
>>>> +
>>>> +    /*
>>>> +     * Make a copy of this setting at the start in case it is changed
>>>> +     * mid-migration.
>>>> +     */
>>>> +    if (vbasedev->migration_multifd_transfer == ON_OFF_AUTO_AUTO) {
>>>> +        migration->multifd_transfer = vfio_multifd_transfer_supported();
>>>
>>> Attribute "migration->multifd_transfer" is not necessary. It can be
>>> replaced by a small inline helper testing pointer migration->multifd
>>> and this routine can use a local variable instead.
>>
>> It's necessary for the send side since it does not need/allocate VFIOMultifd
>> at migration->multifd, so this (receive) side can use it for commonality too.
> 
> Hmm, we can allocate migration->multifd on the send side too, even
> if the attributes are unused and it is up to vfio_multifd_free() to
> make the difference between the send/recv side.

Allocating an unnecessary VFIOMultifd structure that has 12 members,
some of them complex like QemuThread, QemuCond or QemuMutex, just
to avoid having one extra bool variable (migration_multifd_transfer or
whatever it ends up being named) seems like a poor trade-off to me.

> 
> Something that is bothering me is the lack of introspection tools
> and statistics. What could be possibly added under VFIOMultifd and
> VfioStats ?

There's already VFIO bytes transferred counter and also a
multifd bytes transferred counter.

There are quite a few trace events (both existing and newly added
by this patch).

While even more statistics and traces may help with tuning/debugging
in some cases that's something easily added in the future.

>>> I don't think the '_transfer' suffix adds much to the understanding.
>>
>> The migration->multifd was already taken by VFIOMultifd struct, but
>> it could use other name (migration->multifd_switch? migration->multifd_on?).
> 
> yeah. Let's try to get rid of it first.
> 
>>>> +    } else {
>>>> +        migration->multifd_transfer =
>>>> +            vbasedev->migration_multifd_transfer == ON_OFF_AUTO_ON;
>>>> +    }
>>>> +
>>>> +    if (migration->multifd_transfer && !vfio_multifd_transfer_supported()) {
>>>> +        error_setg(errp,
>>>> +                   "%s: Multifd device transfer requested but unsupported in the current config",
>>>> +                   vbasedev->name);
>>>> +        return -EINVAL;
>>>> +    }
>>>
>>> The above checks are also introduced in vfio_save_setup(). Please
>>> implement a common routine vfio_multifd_is_enabled() or some other
>>> name.
>>
>> Done (as common vfio_multifd_transfer_setup()).
> 
> vfio_multifd_is_enabled() please, returning a bool.

Functions named *_is_something() normally just check some conditions
and return a computed value without having any side effects.

Here, vfio_multifd_transfer_setup() also sets migration->multifd_transfer
appropriately (or could migration->multifd) - that's common code for
save and load.

I guess you meant to move something else rather than this block
of code into vfio_multifd_is_enabled() - see my answer below.

>>>>       vfio_migration_cleanup(vbasedev);
>>>>       trace_vfio_load_cleanup(vbasedev->name);
>>>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>>>> index 153d03745dc7..c0c9c0b1b263 100644
>>>> --- a/include/hw/vfio/vfio-common.h
>>>> +++ b/include/hw/vfio/vfio-common.h
>>>> @@ -61,6 +61,8 @@ typedef struct VFIORegion {
>>>>       uint8_t nr; /* cache the region number for debug */
>>>>   } VFIORegion;
>>>> +typedef struct VFIOMultifd VFIOMultifd;
>>>> +
>>>>   typedef struct VFIOMigration {
>>>>       struct VFIODevice *vbasedev;
>>>>       VMChangeStateEntry *vm_state;
>>>> @@ -72,6 +74,8 @@ typedef struct VFIOMigration {
>>>>       uint64_t mig_flags;
>>>>       uint64_t precopy_init_size;
>>>>       uint64_t precopy_dirty_size;
>>>> +    bool multifd_transfer;
>>>> +    VFIOMultifd *multifd;
>>>>       bool initial_data_sent;
>>>>       bool event_save_iterate_started;
>>>> @@ -133,6 +137,7 @@ typedef struct VFIODevice {
>>>>       bool no_mmap;
>>>>       bool ram_block_discard_allowed;
>>>>       OnOffAuto enable_migration;
>>>> +    OnOffAuto migration_multifd_transfer;
>>>
>>> This property should be added at the end of the series, with documentation,
>>> and used in the vfio_multifd_some_name() routine I mentioned above.
>>>
>>
>> The property behind this variable *is* in fact introduced at the end of the series -
>> in a commit called "vfio/migration: Add x-migration-multifd-transfer VFIO property"
>> after which there are only commits adding the related compat entry and a VFIO
>> developer doc update.
>>
>> The variable itself needs to be introduced earlier since various newly
>> introduced code blocks depend on its value to only get activated when multifd
>> transfer is enabled.
> 
> Not if you introduce a vfio_multifd_is_enabled() routine hiding
> the details. In that case, the property and attribute can be added
> at the end of the series and you don't need to add the attribute
> earlier.

The part above that you wanted to be moved into vfio_multifd_is_enabled()
is one-time check for load or save setup time.

That's *not* the switch to be tested by other parts of the code
during the migration process to determine whether multifd transfer
is in use.

If you want vfio_multifd_is_enabled() to be that switch that's tested by
other parts of the VFIO migration code then it will finally consist of
just a single line of code:
"return migration->multifd_transfer" (or "return migration->multifd").

Then indeed the variable could be introduced with the property that
controls it, but a dummy vfio_multifd_is_enabled() will need to be
introduced earlier as "return false" to not break the build.

> 
> Thanks,
> 
> C.
> 

Thanks,
Maciej




^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH v4 27/33] vfio/migration: Multifd device state transfer support - received buffers queuing
  2025-02-17 13:48       ` Cédric Le Goater
@ 2025-02-17 22:15         ` Maciej S. Szmigiero
  0 siblings, 0 replies; 137+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-17 22:15 UTC (permalink / raw)
  To: Cédric Le Goater
  Cc: Alex Williamson, Eric Blake, Peter Xu, Fabiano Rosas,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

On 17.02.2025 14:48, Cédric Le Goater wrote:
> On 2/14/25 21:58, Maciej S. Szmigiero wrote:
>> On 12.02.2025 14:47, Cédric Le Goater wrote:
>>> On 1/30/25 11:08, Maciej S. Szmigiero wrote:
>>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>>
>>>> The multifd received data needs to be reassembled since device state
>>>> packets sent via different multifd channels can arrive out-of-order.
>>>>
>>>> Therefore, each VFIO device state packet carries a header indicating its
>>>> position in the stream.
>>>> The raw device state data is saved into a VFIOStateBuffer for later
>>>> in-order loading into the device.
>>>>
>>>> The last such VFIO device state packet should have
>>>> VFIO_DEVICE_STATE_CONFIG_STATE flag set and carry the device config state.
>>>>
>>>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>>>> ---
>>>>   hw/vfio/migration.c           | 116 ++++++++++++++++++++++++++++++++++
>>>>   hw/vfio/pci.c                 |   2 +
>>>>   hw/vfio/trace-events          |   1 +
>>>>   include/hw/vfio/vfio-common.h |   1 +
>>>>   4 files changed, 120 insertions(+)
>>>>
>>>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>>>> index bcdf204d5cf4..0c0caec1bd64 100644
>>>> --- a/hw/vfio/migration.c
>>>> +++ b/hw/vfio/migration.c
>>>> @@ -301,6 +301,12 @@ typedef struct VFIOStateBuffer {
>>>>   } VFIOStateBuffer;
>>>>   typedef struct VFIOMultifd {
>>>> +    VFIOStateBuffers load_bufs;
>>>> +    QemuCond load_bufs_buffer_ready_cond;
>>>> +    QemuMutex load_bufs_mutex; /* Lock order: this lock -> BQL */
>>>> +    uint32_t load_buf_idx;
>>>> +    uint32_t load_buf_idx_last;
>>>> +    uint32_t load_buf_queued_pending_buffers;
>>>>   } VFIOMultifd;
>>>>   static void vfio_state_buffer_clear(gpointer data)
>>>> @@ -346,6 +352,103 @@ static VFIOStateBuffer *vfio_state_buffers_at(VFIOStateBuffers *bufs, guint idx)
>>>>       return &g_array_index(bufs->array, VFIOStateBuffer, idx);
>>>>   }
>>> Each routine executed from a migration thread should have a preliminary
>>> comment saying from which context it is called: migration or VFIO
>>
>> Do you mean like whether it is called from the code in qemu/migration/
>> directory or the code in hw/vfio/ directory?
> 
> Threads are spawned from different subsystems: migration callbacks
> (save), and from VFIO (load, well, not load phase, switchover phase).
> It would be good to provide hints to the reader.

There are just two new threads here:
vfio_save_complete_precopy_thread and vfio_load_bufs_thread,
both have name ending in "_thread" to denote a thread entry point
function.

So you want to have a comment that vfio_save_complete_precopy_thread
is launched directly by migration core via SaveVMHandler while
vfio_load_bufs_thread is launched by vfio_switchover_start()
SaveVMHandler, correct?

> I am struggling to understand how this works. Imagine a new comer
> looking at the code and at the git history in 2y time ... Check
> vfio in QEMU 1.3 (one small file) and see what it has become today.
> 
>> What about internal linkage ("static") functions?
> 
> There shouldn't be any static left when all multifd code is moved
> to its own hw/vfio/migration-multifd.c file.

There are 15 static functions in the new hw/vfio/migration-multifd.c:
https://gitlab.com/maciejsszmigiero/qemu/-/blob/622de616178467f2ca968c6f0bd1e67f6249677f/hw/vfio/migration-multifd.c

But if these launching thread comments are to be added to
vfio_save_complete_precopy_thread and vfio_load_bufs_thread then
that's not a problem.

>> Do they need such comment too?  That would actually decrease the readability
>> of these one-or-two line helpers due to high comment-to-code ratio.
> 
> I meant the higher level routines.
> 
> Tbh, this lacks tons of documentation, under docs, under each file,
> for the properties, etc. This should be addressed before resend.

This series adds a grand total of 3 properties:
x-migration-multifd-transfer, x-migration-load-config-after-iter and
x-migration-max-queued-buffers.

The first two of these three are already described in the updated
docs/devel/migration/vfio.rst:
https://gitlab.com/maciejsszmigiero/qemu/-/commit/622de616178467f2ca968c6f0bd1e67f6249677f

Adding the description of x-migration-max-queued-buffers can be done too.

Looking at the first 5 VFIOPCIDevice properties in QEMU source tree
(excluding "host" and "display" which are too generic to grep):
vf-token, x-pre-copy-dirty-page-tracking, x-device-dirty-page-tracking and xres
are not documented anywhere, sysfsdev is documented only for s390 and
has a short mention in vfio-iommufd.rst.

It's also rare for a function in hw/vfio/migration.c to have any
descriptive comment attached too: out of 41 functions currently in the
upstream QEMU git tree only two have a comment above describing what
they do: vfio_migration_set_state_or_reset() and vfio_migration_realize().

In addition to these, vfio_save_block() has a comment above that
describes the return value - don't know whether this qualifies as a
function description.

>> As far as I can see, pretty much no existing VFIO migration function
>> has such comment.
>
>>>> +static bool vfio_load_state_buffer_insert(VFIODevice *vbasedev,
>>>> +                                          VFIODeviceStatePacket *packet,
>>>> +                                          size_t packet_total_size,
>>>> +                                          Error **errp)
>>>> +{
>>>> +    VFIOMigration *migration = vbasedev->migration;
>>>> +    VFIOMultifd *multifd = migration->multifd;
>>>> +    VFIOStateBuffer *lb;
>>>> +
>>>> +    vfio_state_buffers_assert_init(&multifd->load_bufs);
>>>> +    if (packet->idx >= vfio_state_buffers_size_get(&multifd->load_bufs)) {
>>>> +        vfio_state_buffers_size_set(&multifd->load_bufs, packet->idx + 1);
>>>> +    }
>>>> +
>>>> +    lb = vfio_state_buffers_at(&multifd->load_bufs, packet->idx);
>>>> +    if (lb->is_present) {
>>>> +        error_setg(errp, "state buffer %" PRIu32 " already filled",
>>>> +                   packet->idx);
>>>> +        return false;
>>>> +    }
>>>> +
>>>> +    assert(packet->idx >= multifd->load_buf_idx);
>>>> +
>>>> +    multifd->load_buf_queued_pending_buffers++;
>>>> +    if (multifd->load_buf_queued_pending_buffers >
>>>> +        vbasedev->migration_max_queued_buffers) {
>>>> +        error_setg(errp,
>>>> +                   "queuing state buffer %" PRIu32 " would exceed the max of %" PRIu64,
>>>> +                   packet->idx, vbasedev->migration_max_queued_buffers);
>>>> +        return false;
>>>> +    }
>>>
>>> AFAICT, attributes multifd->load_buf_queued_pending_buffers and
>>> vbasedev->migration_max_queued_buffers are not strictly necessary.
>>> They allow to count buffers and check an arbitrary limit, which
>>> is UINT64_MAX today. It makes me wonder how useful they are.
>>
>> You are right they aren't strictly necessary and in fact they weren't
>> there in early versions of this patch set.
>>
>> It was introduced upon Peter's request since otherwise the source
>> could theoretically cause the target QEMU to allocate unlimited
>> amounts of memory for buffers-in-flight:
>> https://lore.kernel.org/qemu-devel/9e85016e-ac72-4207-8e69-8cba054cefb7@maciej.szmigiero.name/
>> (scroll to the "Risk of OOM on unlimited VFIO buffering" section).
>>
>> If that's an actual risk in someone's use case then that person
>> could lower that limit from UINT64_MAX to, for example, 10 buffers.
>>
>>> Please introduce them in a separate patch at the end of the series,
>>> adding documentation on the "x-migration-max-queued-buffers" property
>>> and also general documentation on why and how to use it.
>>
>> I can certainly move it to the end of the series - done now.
> 
> Great. Please add the comment above in the commit log. We will decide
> it this is experimental or not.

The description above about the property use case was already
added to the commit log last week:
https://gitlab.com/maciejsszmigiero/qemu/-/commit/15fc96349940b6c2a113753d41e5369f786deb7d

I guess by "i[f] this is experimental or not" you mean whether
this property should be included or not (rather than literally
whether it should be marked with the experimental prefix "x-").

That queuing limit was introduced in v2 in August last year upon
Peter's justified comment to v1 as to give graceful possibility
to avoid target QEMU unbounded memory allocation and OOM.

If you have other ways to achieve that please let me know since
we shouldn't leave this for the very last moment.

The current implementation of this limit is really simple -
it's just a counter that gets incremented when new buffer gets
queued and decremented when a buffer gets consumed (written
into the device), with a max value check on increment.
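That increment/decrement-with-limit scheme can be sketched in a few lines of self-contained C (the struct and function names here are illustrative stand-ins, not the actual fields in VFIOMultifd/VFIODevice):

```c
#include <stdbool.h>
#include <stdint.h>

/* Minimal stand-in for the queued-buffer accounting state */
typedef struct {
    uint32_t queued_pending_buffers;  /* load_buf_queued_pending_buffers */
    uint64_t max_queued_buffers;      /* x-migration-max-queued-buffers */
} BufQueueState;

/* Called when a new out-of-order buffer arrives from a multifd channel.
 * Returns false if queuing it would exceed the configured limit
 * (the real code would then set an Error and fail the migration). */
static bool buf_queue_inc(BufQueueState *s)
{
    s->queued_pending_buffers++;
    if (s->queued_pending_buffers > s->max_queued_buffers) {
        return false;
    }
    return true;
}

/* Called when a queued buffer is consumed (written into the device). */
static void buf_queue_dec(BufQueueState *s)
{
    s->queued_pending_buffers--;
}
```

With the default limit of UINT64_MAX the increment check effectively never fires, which matches the description above of the limit as an opt-in OOM safeguard.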

> 
> Also, I wonder if this should be a global migration property.

It's VFIO migration code that does buffer queuing, not the main
migration code.

(..)
>>
>>>
>>> This sequence is expected to be called to release the vfio thread
>>>
>>>         while (multifd->load_bufs_thread_running) {
>>>              multifd->load_bufs_thread_want_exit = true;
>>>
>>>              qemu_cond_signal(&multifd->load_bufs_buffer_ready_cond);
>>>          ...
>>>         }
>>>
>>> right ?
>>
>> Right, that's a part of the code in vfio_load_cleanup_load_bufs_thread().
> 
> ok. So I think this lacks comments on thread termination points.
> Please try to comment a bit more these areas in the code. I will
> check next version more closely.

Will try to add more comments about thread termination then.
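The shutdown handshake quoted above (a want_exit flag plus condition variables) can be illustrated with a self-contained pthreads sketch; the field names are modeled on the patch, but the types and the pthread usage here are stand-ins rather than the actual QEMU code, which uses QemuMutex/QemuCond:

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t buffer_ready;     /* load_bufs_buffer_ready_cond */
    pthread_cond_t thread_finished;
    bool thread_running;             /* load_bufs_thread_running */
    bool want_exit;                  /* load_bufs_thread_want_exit */
} LoadThreadState;

/* Thread side: wait for work until asked to exit, then announce exit. */
static void *load_bufs_thread(void *opaque)
{
    LoadThreadState *s = opaque;

    pthread_mutex_lock(&s->lock);
    while (!s->want_exit) {
        /* the real thread would consume queued buffers here */
        pthread_cond_wait(&s->buffer_ready, &s->lock);
    }
    s->thread_running = false;
    pthread_cond_broadcast(&s->thread_finished);
    pthread_mutex_unlock(&s->lock);
    return NULL;
}

/* Cleanup side: request exit and wait until the thread observes it. */
static void shutdown_load_thread(LoadThreadState *s)
{
    pthread_mutex_lock(&s->lock);
    while (s->thread_running) {
        s->want_exit = true;
        pthread_cond_signal(&s->buffer_ready);  /* wake a waiting thread */
        pthread_cond_wait(&s->thread_finished, &s->lock);
    }
    pthread_mutex_unlock(&s->lock);
}
```

Because the flag is only read and written under the lock, the wakeup cannot be lost: either the thread is waiting and gets signaled, or it has not reached the wait yet and will see want_exit already set.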

>>> The way the series is presented makes it a bit complex to follow the
>>> proposition, especially regarding the creation and termination of
>>> threads, something the reader should be aware of.
>>>
>>> As an initial step in clarifying the design, I would have preferred
>>> a series of patches introducing the various threads, migration threads
>>> and VFIO threads, without any workload. Once the creation and termination
>>> points are established I would then introduce the work load for each
>>> thread.
>>
>> When I am doing review of anything more complex (though it's not usually
>> in QEMU) I mainly follow the final code flow as an operation is handled
>> since looking just from top to down at individual commits rarely gives
>> enough context to see how every part interacts together.
>>
>> But for this the reviewer needs to see the whole code for the logical
>> operation, rather than just a part of it.
> 
> and this is the very problematic :/ Very very hard to maintain on the
> long run. I also don't have *time* to dig in all the context. So please
> try to keep it as simple as possible.

I definitely try to keep things simple where possible (but not simpler,
so as not to end up with messy code).

For me, looking at the code flow for the whole operation also helps
avoid unnecessary comments/e-mail exchanges that add up to a lot of time.

It also totally makes sense to first ask the submitter about how the
operation code flows (where it is not obvious) or other implementation
details before suggesting changes there.

This helps avoid long discussions about changes which in the end turn out
to be a misunderstanding somewhere.

>> I think that adding the load operation in parts doesn't really
>> help since the reason why things are done such way in earlier patches
>> are only apparent in later patches and the earlier parts doesn't
>> really have much sense on their own.
>> Not to mention extra code churn when rebasing/reworking that increases
>> chance of a typo or a copy-paste mistake happening at some point.
>>
>> I also see that in comments to a later patch you dislike that
>> a dummy vfio_load_bufs_thread_load_config() gets added in one patch
>> then immediately replaced by the real implementation in the next patch.
>> Previously, you also said that vfio_load_config_after_iter() seems
>> to be unused in the patch that adds it - that's exactly the kind of
>> issues that bringing the complete operation in one patch avoids.
> 
> May be I did. Sorry I switched context may times already and this
> was lost in oblivion. Again, please help the reviewer. Changes
> should be made obvious.
>
>> I agree that, for example, x-migration-load-config-after-iter feature
>> could be a separate patch as it is a relatively simple change.
>>
>> Same goes for x-migration-max-queued-buffers checking/enforcement,
>> compat changes, exporting existing settings (variables) as properties
>> or adding a g_autoptr() cleanup function for an existing type.
>>
>> That's why originally the VFIO part of the series was divided into two
>> parts - receive and send, since these are two separate, yet internally
>> complete operations.
> 
> I am now asking to have a better understanding of how threads are
> created/terminated. It's another sub split of the load part AFAICT.
> If you prefer we can forget about the load thread first, like I
> asked initially iirc. I would very much prefer that for QEMU 10.0.

I think it makes sense to submit both the send and receive parts (or
save and load parts) rather than add code that effectively can't be
used in any meaningful way.

Especially since both send and receive parts need to have a common
understanding of the migration bit stream.

As I suggested above, please don't hesitate to ask questions about
the parts that aren't clear.

I will try to explain these ASAP, since explaining things is much
easier than discussing changes (where I am wondering a lot what the
change is really wanting to achieve).

> 
>> I also export the whole series (including the current WiP state, with
>> code moved to migration-multifd.{c,h} files, etc.) as a git tree at
>> https://gitlab.com/maciejsszmigiero/qemu/-/commits/multifd-device-state-transfer-vfio
>> since this way it can be easily seen how the QEMU code currently
>> looks after the whole patch set or set of patches there.
> 
> Overall, I think this is making great progress. For such a complex
> work, I would imagine a couple of RFCs first and half dozen normal
> series. So ~10 iterations. We are only at v4. At least two more are
> expected.

As I wrote above, I am trying to integrate changes immediately
after they have been discussed enough for them to be clear to me.

On the other hand, having a lot of versions isn't great either
since with each rebase/rework/update there's a possibility of
accidentally introducing a copy-paste error or a typo somewhere.

Especially changes like moving code between files tend to cause
conflicts with every later patch that touches this code or its
neighboring lines, so they need quite a bit of (risky) manual
editing.

On an overall note, my plan is to try adding more comments
about threading and the general operation flow and post a new
version, hopefully with most of the small changes discussed in other
recent messages also included.

> 
> Thanks,
> 
> C.

Thanks,
Maciej



^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [PATCH v4 26/33] vfio/migration: Multifd device state transfer support - receive init/cleanup
  2025-02-17 22:13         ` Maciej S. Szmigiero
@ 2025-02-18  7:54           ` Cédric Le Goater
  0 siblings, 0 replies; 137+ messages in thread
From: Cédric Le Goater @ 2025-02-18  7:54 UTC (permalink / raw)
  To: Maciej S. Szmigiero
  Cc: Alex Williamson, Fabiano Rosas, Peter Xu, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

On 2/17/25 23:13, Maciej S. Szmigiero wrote:
> On 17.02.2025 10:38, Cédric Le Goater wrote:
>> On 2/14/25 21:55, Maciej S. Szmigiero wrote:
>>> On 12.02.2025 11:55, Cédric Le Goater wrote:
>>>> On 1/30/25 11:08, Maciej S. Szmigiero wrote:
>>>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>>>
>>>>> Add support for VFIOMultifd data structure that will contain most of the
>>>>> receive-side data together with its init/cleanup methods.
>>>>>
>>>>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>>>>> ---
>>>>>   hw/vfio/migration.c           | 52 +++++++++++++++++++++++++++++++++--
>>>>>   include/hw/vfio/vfio-common.h |  5 ++++
>>>>>   2 files changed, 55 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>>>>> index 3211041939c6..bcdf204d5cf4 100644
>>>>> --- a/hw/vfio/migration.c
>>>>> +++ b/hw/vfio/migration.c
>>>>> @@ -300,6 +300,9 @@ typedef struct VFIOStateBuffer {
>>>>>       size_t len;
>>>>>   } VFIOStateBuffer;
>>>>> +typedef struct VFIOMultifd {
>>>>> +} VFIOMultifd;
>>>>> +
>>>>>   static void vfio_state_buffer_clear(gpointer data)
>>>>>   {
>>>>>       VFIOStateBuffer *lb = data;
>>>>> @@ -398,6 +401,18 @@ static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
>>>>>       return qemu_file_get_error(f);
>>>>>   }
>>>>> +static VFIOMultifd *vfio_multifd_new(void)
>>>>> +{
>>>>> +    VFIOMultifd *multifd = g_new(VFIOMultifd, 1);
>>>>> +
>>>>> +    return multifd;
>>>>> +}
>>>>> +
>>>>> +static void vfio_multifd_free(VFIOMultifd *multifd)
>>>>> +{
>>>>> +    g_free(multifd);
>>>>> +}
>>>>> +
>>>>>   static void vfio_migration_cleanup(VFIODevice *vbasedev)
>>>>>   {
>>>>>       VFIOMigration *migration = vbasedev->migration;
>>>>> @@ -785,14 +800,47 @@ static void vfio_save_state(QEMUFile *f, void *opaque)
>>>>>   static int vfio_load_setup(QEMUFile *f, void *opaque, Error **errp)
>>>>>   {
>>>>>       VFIODevice *vbasedev = opaque;
>>>>> +    VFIOMigration *migration = vbasedev->migration;
>>>>> +    int ret;
>>>>> +
>>>>> +    /*
>>>>> +     * Make a copy of this setting at the start in case it is changed
>>>>> +     * mid-migration.
>>>>> +     */
>>>>> +    if (vbasedev->migration_multifd_transfer == ON_OFF_AUTO_AUTO) {
>>>>> +        migration->multifd_transfer = vfio_multifd_transfer_supported();
>>>>
>>>> Attribute "migration->multifd_transfer" is not necessary. It can be
>>>> replaced by a small inline helper testing pointer migration->multifd
>>>> and this routine can use a local variable instead.
>>>
>>> It's necessary because the send side does not need/allocate a VFIOMultifd
>>> at migration->multifd, so the receive side can use the same flag for
>>> commonality too.
>>
>> Hmm, we can allocate migration->multifd on the send side too, even
>> if the attributes are unused and it is up to vfio_multifd_free() to
>> make the difference between the send/recv side.
> 
> Allocating an unnecessary VFIOMultifd structure that has 12 members,
> some of them complex like QemuThread, QemuCond or QemuMutex, just
> to avoid having one extra bool variable (migration_multifd_transfer or
> whatever it ends up being named) seems like a poor trade-off to me.
> 
>>
>> Something that is bothering me is the lack of introspection tools
>> and statistics. What could be possibly added under VFIOMultifd and
>> VfioStats ?
> 
> There's already a VFIO bytes transferred counter and also a
> multifd bytes transferred counter.
> 
> There are quite a few trace events (both existing and newly added
> by this patch).
> 
> While even more statistics and traces may help with tuning/debugging
> in some cases, that's something easily added in the future.
> 
>>>> I don't think the '_transfer' suffix adds much to the understanding.
>>>
>>> The name migration->multifd was already taken by the VFIOMultifd struct, but
>>> the flag could use another name (migration->multifd_switch? migration->multifd_on?).
>>
>> yeah. Let's try to get rid of it first.
>>
>>>>> +    } else {
>>>>> +        migration->multifd_transfer =
>>>>> +            vbasedev->migration_multifd_transfer == ON_OFF_AUTO_ON;
>>>>> +    }
>>>>> +
>>>>> +    if (migration->multifd_transfer && !vfio_multifd_transfer_supported()) {
>>>>> +        error_setg(errp,
>>>>> +                   "%s: Multifd device transfer requested but unsupported in the current config",
>>>>> +                   vbasedev->name);
>>>>> +        return -EINVAL;
>>>>> +    }
>>>>
>>>> The above checks are also introduced in vfio_save_setup(). Please
>>>> implement a common routine vfio_multifd_is_enabled() or some other
>>>> name.
>>>
>>> Done (as common vfio_multifd_transfer_setup()).
>>
>> vfio_multifd_is_enabled() please, returning a bool.
> 
> Functions named *_is_something() normally just check some conditions
> and return a computed value without having any side effects.
> 
> Here, vfio_multifd_transfer_setup() also sets migration->multifd_transfer
> appropriately (or could set migration->multifd) - that's common code for
> save and load.
> 
> I guess you meant to move something else rather than this block
> of code into vfio_multifd_is_enabled() - see my answer below.
> 
>>>>>       vfio_migration_cleanup(vbasedev);
>>>>>       trace_vfio_load_cleanup(vbasedev->name);
>>>>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>>>>> index 153d03745dc7..c0c9c0b1b263 100644
>>>>> --- a/include/hw/vfio/vfio-common.h
>>>>> +++ b/include/hw/vfio/vfio-common.h
>>>>> @@ -61,6 +61,8 @@ typedef struct VFIORegion {
>>>>>       uint8_t nr; /* cache the region number for debug */
>>>>>   } VFIORegion;
>>>>> +typedef struct VFIOMultifd VFIOMultifd;
>>>>> +
>>>>>   typedef struct VFIOMigration {
>>>>>       struct VFIODevice *vbasedev;
>>>>>       VMChangeStateEntry *vm_state;
>>>>> @@ -72,6 +74,8 @@ typedef struct VFIOMigration {
>>>>>       uint64_t mig_flags;
>>>>>       uint64_t precopy_init_size;
>>>>>       uint64_t precopy_dirty_size;
>>>>> +    bool multifd_transfer;
>>>>> +    VFIOMultifd *multifd;
>>>>>       bool initial_data_sent;
>>>>>       bool event_save_iterate_started;
>>>>> @@ -133,6 +137,7 @@ typedef struct VFIODevice {
>>>>>       bool no_mmap;
>>>>>       bool ram_block_discard_allowed;
>>>>>       OnOffAuto enable_migration;
>>>>> +    OnOffAuto migration_multifd_transfer;
>>>>
>>>> This property should be added at the end of the series, with documentation,
>>>> and used in the vfio_multifd_some_name() routine I mentioned above.
>>>>
>>>
>>> The property behind this variable *is* in fact introduced at the end of the series -
>>> in a commit called "vfio/migration: Add x-migration-multifd-transfer VFIO property"
>>> after which there are only commits adding the related compat entry and a VFIO
>>> developer doc update.
>>>
>>> The variable itself needs to be introduced earlier since various newly
>>> introduced code blocks depend on its value to only get activated when multifd
>>> transfer is enabled.
>>
>> Not if you introduce a vfio_multifd_is_enabled() routine hiding
>> the details. In that case, the property and attribute can be added
>> at the end of the series and you don't need to add the attribute
>> earlier.
> 
> The part above that you wanted to be moved into vfio_multifd_is_enabled()
> is a one-time check done at load or save setup time.
> 
> That's *not* the switch to be tested by other parts of the code
> during the migration process to determine whether multifd transfer
> is in use.
> 
> If you want vfio_multifd_is_enabled() to be that switch that's tested by
> other parts of the VFIO migration code then it will finally consist of
> just a single line of code:
> "return migration->multifd_transfer" (or "return migration->multifd").
> 
> Then indeed the variable could be introduced together with the property that
> controls it, but a dummy vfio_multifd_is_enabled() would need to be
> introduced earlier as "return false" to not break the build.


Sorry but I have switched to another series now, the one for live update,
and I will recheck your v5 proposal.


Thanks,

C.





* Re: [PATCH v4 00/33] Multifd 🔀 device state transfer support with VFIO consumer
  2025-02-03 14:19 ` Cédric Le Goater
@ 2025-02-21  6:57   ` Yanghang Liu
  2025-02-22  9:51     ` Maciej S. Szmigiero
  0 siblings, 1 reply; 137+ messages in thread
From: Yanghang Liu @ 2025-02-21  6:57 UTC (permalink / raw)
  To: Cédric Le Goater
  Cc: Maciej S. Szmigiero, Peter Xu, Fabiano Rosas, Alex Williamson,
	Eric Blake, Markus Armbruster, Daniel P . Berrangé,
	Avihai Horon, Joao Martins, qemu-devel

Let me share my performance report after applying the patches, for
your information:

1. live mlx VF migration

outgoing migration:
+------------------+---------------+---------------+----------------+
| VF(s) number     | 1             | 2             | 4              |
+------------------+---------------+---------------+----------------+
| Memory bandwidth | 733.693 MiB/s | 556.565 MiB/s | 475.310 MiB/s  |
| Total downtime   | 227ms         | 358ms         | 460ms          |
+------------------+---------------+---------------+----------------+

incoming migration:
+------------------+---------------+---------------+----------------+
| VF(s) number     | 1             | 2             | 4              |
+------------------+---------------+---------------+----------------+
| Memory bandwidth | 738.758 MiB/s | 566.175 MiB/s | 458.936 MiB/s  |
| Total downtime   | 220ms         | 342ms         | 459ms          |
+------------------+---------------+---------------+----------------+


2. live mlx VF multifd migration

outgoing migration:
+------------------+---------------+----------------+
| VF(s) number     | 1             | 1              |
+------------------+---------------+----------------+
| Channel          | 4             | 6              |
| Memory bandwidth | 786.942 MiB/s | 848.362 MiB/s  |
| Total downtime   | 142ms         | 188ms          |
+------------------+---------------+----------------+

+------------------+----------------+---------------+----------------+
| VF(s) number     | 2              | 2             | 2              |
+------------------+----------------+---------------+----------------+
| Channel          | 4              | 6             | 8              |
| Memory bandwidth |  774.315 MiB/s | 831.938 MiB/s | 769.799 MiB/s  |
| Total downtime   | 160ms          | 178ms         | 156ms          |
+------------------+----------------+---------------+----------------+

+------------------+----------------+---------------+----------------+
| VF(s) number     | 4              | 4             | 4              |
+------------------+----------------+---------------+----------------+
| Channel          | 6              | 8             | 16             |
| Memory bandwidth |  715.210 MiB/s | 742.962 MiB/s | 747.188 MiB/s  |
| Total downtime   | 180ms          | 219ms         | 190ms          |
+------------------+----------------+---------------+----------------+

incoming migration:
+------------------+---------------+----------------+
| VF(s) number     | 1             | 1              |
+------------------+---------------+----------------+
| Channel          | 4             | 6              |
| Memory bandwidth | 807.958 MiB/s | 859.525 MiB/s  |
| Total downtime   | 150ms         | 177ms          |
+------------------+---------------+----------------+

+------------------+---------------+---------------+----------------+
| VF(s) number     | 2             | 2             | 2              |
+------------------+---------------+---------------+----------------+
| Channel          | 4             | 6             | 8              |
| Memory bandwidth | 768.104 MiB/s | 825.462 MiB/s | 791.582 MiB/s  |
| Total downtime   | 170ms         | 185ms         | 175ms          |
+------------------+---------------+---------------+----------------+

+------------------+---------------+---------------+----------------+
| VF(s) number     | 4             | 4             | 4              |
+------------------+---------------+---------------+----------------+
| Channel          | 6             | 8             | 16             |
| Memory bandwidth | 706.921 MiB/s | 750.706 MiB/s | 746.295 MiB/s  |
| Total downtime   | 174ms         | 193ms         | 191ms          |
+------------------+---------------+---------------+----------------+
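To put the tables above in perspective, a quick back-of-the-envelope computation of the relative downtime reduction (taking the best channel count per VF count from the multifd outgoing tables against the single-channel outgoing runs):

```python
# Total downtime (ms) from the outgoing tables above:
# table 1 (no multifd) vs. the best channel count per VF count in table 2.
single = {1: 227, 2: 358, 4: 460}
multifd_best = {1: 142, 2: 156, 4: 180}  # 4, 8 and 6 channels respectively

# Fractional downtime reduction per VF count.
improvement = {n: round(1 - multifd_best[n] / single[n], 2) for n in single}
print(improvement)  # → {1: 0.37, 2: 0.56, 4: 0.61}
```

So the multifd transfer helps most with several VFs, where single-channel downtime grows fastest.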

Best Regards,
Yanghang Liu


On Mon, Feb 3, 2025 at 10:20 PM Cédric Le Goater <clg@redhat.com> wrote:
>
> Hello Maciej,
>
> > This patch set is targeting QEMU 10.0.
> >
> > What's not yet present is documentation update under docs/devel/migration
> > but I didn't want to delay posting the code any longer.
> > Such doc can still be merged later when the design is 100% finalized.
>
> The changes are quite complex, the design is not trivial, the benefits are
> not huge as far as we know. I'd rather have the doc update first please.
>
> Thanks,
>
> C.
>
>
>




* Re: [PATCH v4 00/33] Multifd 🔀 device state transfer support with VFIO consumer
  2025-02-21  6:57   ` Yanghang Liu
@ 2025-02-22  9:51     ` Maciej S. Szmigiero
  0 siblings, 0 replies; 137+ messages in thread
From: Maciej S. Szmigiero @ 2025-02-22  9:51 UTC (permalink / raw)
  To: Yanghang Liu, Cédric Le Goater
  Cc: Peter Xu, Fabiano Rosas, Alex Williamson, Eric Blake,
	Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
	Joao Martins, qemu-devel

Thanks Yanghang for your measurements.

Maciej

On 21.02.2025 07:57, Yanghang Liu wrote:
> Let me share my performance report after applying the patches for the
> information:
> [...]




end of thread, other threads:[~2025-02-22  9:52 UTC | newest]

Thread overview: 137+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-01-30 10:08 [PATCH v4 00/33] Multifd 🔀 device state transfer support with VFIO consumer Maciej S. Szmigiero
2025-01-30 10:08 ` [PATCH v4 01/33] migration: Clarify that {load, save}_cleanup handlers can run without setup Maciej S. Szmigiero
2025-01-30 10:08 ` [PATCH v4 02/33] thread-pool: Remove thread_pool_submit() function Maciej S. Szmigiero
2025-01-30 10:08 ` [PATCH v4 03/33] thread-pool: Rename AIO pool functions to *_aio() and data types to *Aio Maciej S. Szmigiero
2025-01-30 10:08 ` [PATCH v4 04/33] thread-pool: Implement generic (non-AIO) pool support Maciej S. Szmigiero
2025-01-30 10:08 ` [PATCH v4 05/33] migration: Add MIG_CMD_SWITCHOVER_START and its load handler Maciej S. Szmigiero
2025-01-30 10:08 ` [PATCH v4 06/33] migration: Add qemu_loadvm_load_state_buffer() and its handler Maciej S. Szmigiero
2025-01-30 10:08 ` [PATCH v4 07/33] io: tls: Allow terminating the TLS session gracefully with EOF Maciej S. Szmigiero
2025-02-04 15:15   ` Daniel P. Berrangé
2025-02-04 16:02     ` Maciej S. Szmigiero
2025-02-04 16:14       ` Daniel P. Berrangé
2025-02-04 18:25         ` Maciej S. Szmigiero
2025-02-06 21:53           ` Peter Xu
2025-01-30 10:08 ` [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels Maciej S. Szmigiero
2025-02-03 18:20   ` Peter Xu
2025-02-03 18:53     ` Maciej S. Szmigiero
2025-02-03 20:20       ` Peter Xu
2025-02-03 21:41         ` Maciej S. Szmigiero
2025-02-03 22:56           ` Peter Xu
2025-02-04 13:51             ` Fabiano Rosas
2025-02-04 14:39             ` Maciej S. Szmigiero
2025-02-04 15:00               ` Fabiano Rosas
2025-02-04 15:10                 ` Maciej S. Szmigiero
2025-02-04 15:31               ` Peter Xu
2025-02-04 15:39                 ` Daniel P. Berrangé
2025-02-05 19:09                   ` Fabiano Rosas
2025-02-05 20:42                     ` Fabiano Rosas
2025-02-05 20:55                       ` Maciej S. Szmigiero
2025-02-06 14:13                         ` Fabiano Rosas
2025-02-06 14:53                           ` Maciej S. Szmigiero
2025-02-06 15:20                             ` Fabiano Rosas
2025-02-06 16:01                               ` Maciej S. Szmigiero
2025-02-06 17:32                                 ` Fabiano Rosas
2025-02-06 17:55                                   ` Maciej S. Szmigiero
2025-02-06 21:51                                   ` Peter Xu
2025-02-07 13:17                                     ` Fabiano Rosas
2025-02-07 14:04                                       ` Peter Xu
2025-02-07 14:16                                         ` Fabiano Rosas
2025-02-05 21:13                       ` Peter Xu
2025-02-06 14:19                         ` Fabiano Rosas
2025-02-04 15:10         ` Daniel P. Berrangé
2025-02-04 15:08     ` Daniel P. Berrangé
2025-02-04 16:02       ` Peter Xu
2025-02-04 16:12         ` Daniel P. Berrangé
2025-02-04 16:29           ` Peter Xu
2025-02-04 18:25         ` Fabiano Rosas
2025-02-04 19:34           ` Maciej S. Szmigiero
2025-01-30 10:08 ` [PATCH v4 09/33] migration: postcopy_ram_listen_thread() needs to take BQL for some calls Maciej S. Szmigiero
2025-02-02  2:06   ` Dr. David Alan Gilbert
2025-02-02 11:55     ` Maciej S. Szmigiero
2025-02-02 12:45       ` Dr. David Alan Gilbert
2025-02-03 13:57         ` Maciej S. Szmigiero
2025-02-03 19:58           ` Peter Xu
2025-02-03 20:15             ` Maciej S. Szmigiero
2025-02-03 20:36               ` Peter Xu
2025-02-03 21:41                 ` Maciej S. Szmigiero
2025-02-03 23:02                   ` Peter Xu
2025-02-04 14:57                     ` Maciej S. Szmigiero
2025-02-04 15:39                       ` Peter Xu
2025-02-04 19:32                         ` Maciej S. Szmigiero
2025-01-30 10:08 ` [PATCH v4 10/33] error: define g_autoptr() cleanup function for the Error type Maciej S. Szmigiero
2025-02-03 20:53   ` Peter Xu
2025-02-03 21:13   ` Daniel P. Berrangé
2025-02-03 21:51     ` Maciej S. Szmigiero
2025-01-30 10:08 ` [PATCH v4 11/33] migration: Add thread pool of optional load threads Maciej S. Szmigiero
2025-01-30 10:08 ` [PATCH v4 12/33] migration/multifd: Split packet into header and RAM data Maciej S. Szmigiero
2025-01-30 10:08 ` [PATCH v4 13/33] migration/multifd: Device state transfer support - receive side Maciej S. Szmigiero
2025-02-03 21:27   ` Peter Xu
2025-02-03 22:18     ` Maciej S. Szmigiero
2025-02-03 22:59       ` Peter Xu
2025-02-04 14:40         ` Maciej S. Szmigiero
2025-01-30 10:08 ` [PATCH v4 14/33] migration/multifd: Make multifd_send() thread safe Maciej S. Szmigiero
2025-01-30 10:08 ` [PATCH v4 15/33] migration/multifd: Add an explicit MultiFDSendData destructor Maciej S. Szmigiero
2025-01-30 10:08 ` [PATCH v4 16/33] migration/multifd: Device state transfer support - send side Maciej S. Szmigiero
2025-02-03 21:47   ` Peter Xu
2025-01-30 10:08 ` [PATCH v4 17/33] migration/multifd: Make MultiFDSendData a struct Maciej S. Szmigiero
2025-02-07 14:36   ` Fabiano Rosas
2025-02-07 19:43     ` Maciej S. Szmigiero
2025-01-30 10:08 ` [PATCH v4 18/33] migration/multifd: Add multifd_device_state_supported() Maciej S. Szmigiero
2025-01-30 10:08 ` [PATCH v4 19/33] migration: Add save_live_complete_precopy_thread handler Maciej S. Szmigiero
2025-02-04 17:54   ` Peter Xu
2025-02-04 19:32     ` Maciej S. Szmigiero
2025-02-04 20:34       ` Peter Xu
2025-02-05 11:53         ` Maciej S. Szmigiero
2025-02-05 15:55           ` Peter Xu
2025-02-06 11:41             ` Maciej S. Szmigiero
2025-02-06 22:16               ` Peter Xu
2025-01-30 10:08 ` [PATCH v4 20/33] vfio/migration: Add x-migration-load-config-after-iter VFIO property Maciej S. Szmigiero
2025-02-10 17:24   ` Cédric Le Goater
2025-02-11 14:37     ` Maciej S. Szmigiero
2025-02-11 15:00       ` Cédric Le Goater
2025-02-11 15:57         ` Maciej S. Szmigiero
2025-02-11 16:28           ` Cédric Le Goater
2025-01-30 10:08 ` [PATCH v4 21/33] vfio/migration: Add load_device_config_state_start trace event Maciej S. Szmigiero
2025-01-30 10:08 ` [PATCH v4 22/33] vfio/migration: Convert bytes_transferred counter to atomic Maciej S. Szmigiero
2025-01-30 21:35   ` Cédric Le Goater
2025-01-31  9:47     ` Maciej S. Szmigiero
2025-01-30 10:08 ` [PATCH v4 23/33] vfio/migration: Multifd device state transfer support - basic types Maciej S. Szmigiero
2025-02-10 17:17   ` Cédric Le Goater
2025-01-30 10:08 ` [PATCH v4 24/33] vfio/migration: Multifd device state transfer support - VFIOStateBuffer(s) Maciej S. Szmigiero
2025-01-30 10:08 ` [PATCH v4 25/33] vfio/migration: Multifd device state transfer - add support checking function Maciej S. Szmigiero
2025-01-30 10:08 ` [PATCH v4 26/33] vfio/migration: Multifd device state transfer support - receive init/cleanup Maciej S. Szmigiero
2025-02-12 10:55   ` Cédric Le Goater
2025-02-14 20:55     ` Maciej S. Szmigiero
2025-02-17  9:38       ` Cédric Le Goater
2025-02-17 22:13         ` Maciej S. Szmigiero
2025-02-18  7:54           ` Cédric Le Goater
2025-01-30 10:08 ` [PATCH v4 27/33] vfio/migration: Multifd device state transfer support - received buffers queuing Maciej S. Szmigiero
2025-02-12 13:47   ` Cédric Le Goater
2025-02-14 20:58     ` Maciej S. Szmigiero
2025-02-17 13:48       ` Cédric Le Goater
2025-02-17 22:15         ` Maciej S. Szmigiero
2025-01-30 10:08 ` [PATCH v4 28/33] vfio/migration: Multifd device state transfer support - load thread Maciej S. Szmigiero
2025-02-12 15:48   ` Cédric Le Goater
2025-02-12 16:19     ` Cédric Le Goater
2025-02-17 22:09       ` Maciej S. Szmigiero
2025-02-17 22:09     ` Maciej S. Szmigiero
2025-01-30 10:08 ` [PATCH v4 29/33] vfio/migration: Multifd device state transfer support - config loading support Maciej S. Szmigiero
2025-02-12 16:21   ` Cédric Le Goater
2025-02-17 22:09     ` Maciej S. Szmigiero
2025-01-30 10:08 ` [PATCH v4 30/33] migration/qemu-file: Define g_autoptr() cleanup function for QEMUFile Maciej S. Szmigiero
2025-01-30 10:08 ` [PATCH v4 31/33] vfio/migration: Multifd device state transfer support - send side Maciej S. Szmigiero
2025-02-12 17:03   ` Cédric Le Goater
2025-02-17 22:12     ` Maciej S. Szmigiero
2025-01-30 10:08 ` [PATCH v4 32/33] vfio/migration: Add x-migration-multifd-transfer VFIO property Maciej S. Szmigiero
2025-02-12 17:10   ` Cédric Le Goater
2025-02-14 20:56     ` Maciej S. Szmigiero
2025-02-17 13:57       ` Cédric Le Goater
2025-02-17 14:16         ` Maciej S. Szmigiero
2025-01-30 10:08 ` [PATCH v4 33/33] hw/core/machine: Add compat for " Maciej S. Szmigiero
2025-01-30 20:19 ` [PATCH v4 00/33] Multifd 🔀 device state transfer support with VFIO consumer Fabiano Rosas
2025-01-30 20:27   ` Maciej S. Szmigiero
2025-01-30 20:46     ` Fabiano Rosas
2025-01-31 18:16     ` Maciej S. Szmigiero
2025-02-03 14:19 ` Cédric Le Goater
2025-02-21  6:57   ` Yanghang Liu
2025-02-22  9:51     ` Maciej S. Szmigiero

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).