* [PATCH v3 01/24] migration: Clarify that {load, save}_cleanup handlers can run without setup
2024-11-17 19:19 [PATCH v3 00/24] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
@ 2024-11-17 19:19 ` Maciej S. Szmigiero
2024-11-25 19:08 ` Fabiano Rosas
2024-11-26 16:25 ` [PATCH v3 01/24] migration: Clarify that {load,save}_cleanup " Cédric Le Goater
2024-11-17 19:19 ` [PATCH v3 02/24] thread-pool: Remove thread_pool_submit() function Maciej S. Szmigiero
` (24 subsequent siblings)
25 siblings, 2 replies; 140+ messages in thread
From: Maciej S. Szmigiero @ 2024-11-17 19:19 UTC (permalink / raw)
To: Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
It's possible for {load,save}_cleanup SaveVMHandlers to get called without
the corresponding {load,save}_setup handler being called first.
One such example is when the {load,save}_setup handler of a preceding
device returns an error.
In this case the migration core cleanup code will call all corresponding
cleanup handlers, even for those devices whose setup handler was never
called.
Since this behavior can generate some surprises, let's clearly document it
in the description of these SaveVMHandlers.
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
include/migration/register.h | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/include/migration/register.h b/include/migration/register.h
index f60e797894e5..0b0292738320 100644
--- a/include/migration/register.h
+++ b/include/migration/register.h
@@ -69,7 +69,9 @@ typedef struct SaveVMHandlers {
/**
* @save_cleanup
*
- * Uninitializes the data structures on the source
+ * Uninitializes the data structures on the source.
+ * Note that this handler can be called even if save_setup
+ * wasn't called earlier.
*
* @opaque: data pointer passed to register_savevm_live()
*/
@@ -244,6 +246,8 @@ typedef struct SaveVMHandlers {
* @load_cleanup
*
* Uninitializes the data structures on the destination.
+ * Note that this handler can be called even if load_setup
+ * wasn't called earlier.
*
* @opaque: data pointer passed to register_savevm_live()
*
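To make the documented behavior concrete, a device's save_cleanup handler
written with it in mind could look like the sketch below (illustrative
only; MyDevState, its save_buf field and my_dev_save_cleanup() are
hypothetical names, not part of this series):

static void my_dev_save_cleanup(void *opaque)
{
    MyDevState *s = opaque;

    /*
     * save_setup may never have run, for example when a preceding
     * device's setup handler failed, so tear down only what was
     * actually initialized.
     */
    if (!s->save_buf) {
        return;
    }

    g_free(s->save_buf);
    s->save_buf = NULL;
}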
^ permalink raw reply related [flat|nested] 140+ messages in thread
* Re: [PATCH v3 01/24] migration: Clarify that {load, save}_cleanup handlers can run without setup
2024-11-17 19:19 ` [PATCH v3 01/24] migration: Clarify that {load, save}_cleanup handlers can run without setup Maciej S. Szmigiero
@ 2024-11-25 19:08 ` Fabiano Rosas
2024-11-26 16:25 ` [PATCH v3 01/24] migration: Clarify that {load,save}_cleanup " Cédric Le Goater
1 sibling, 0 replies; 140+ messages in thread
From: Fabiano Rosas @ 2024-11-25 19:08 UTC (permalink / raw)
To: Maciej S. Szmigiero, Peter Xu
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
"Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>
> It's possible for {load,save}_cleanup SaveVMHandlers to get called without
> the corresponding {load,save}_setup handler being called first.
>
> One such example is when the {load,save}_setup handler of a preceding
> device returns an error.
> In this case the migration core cleanup code will call all corresponding
> cleanup handlers, even for those devices whose setup handler was never
> called.
>
> Since this behavior can generate some surprises, let's clearly document it
> in the description of these SaveVMHandlers.
>
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 01/24] migration: Clarify that {load,save}_cleanup handlers can run without setup
2024-11-17 19:19 ` [PATCH v3 01/24] migration: Clarify that {load, save}_cleanup handlers can run without setup Maciej S. Szmigiero
2024-11-25 19:08 ` Fabiano Rosas
@ 2024-11-26 16:25 ` Cédric Le Goater
1 sibling, 0 replies; 140+ messages in thread
From: Cédric Le Goater @ 2024-11-26 16:25 UTC (permalink / raw)
To: Maciej S. Szmigiero, Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Eric Blake, Markus Armbruster,
Daniel P . Berrangé, Avihai Horon, Joao Martins, qemu-devel
On 11/17/24 20:19, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>
> It's possible for {load,save}_cleanup SaveVMHandlers to get called without
> the corresponding {load,save}_setup handler being called first.
>
> One such example is when the {load,save}_setup handler of a preceding
> device returns an error.
> In this case the migration core cleanup code will call all corresponding
> cleanup handlers, even for those devices whose setup handler was never
> called.
>
> Since this behavior can generate some surprises, let's clearly document it
> in the description of these SaveVMHandlers.
I think we should spend some time analyzing the issues too. I would prefer
to avoid the changes in patch 18 ("vfio/migration: Don't run load cleanup
if load setup didn't run") if possible.
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Thanks,
C.
> ---
> include/migration/register.h | 6 +++++-
> 1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/include/migration/register.h b/include/migration/register.h
> index f60e797894e5..0b0292738320 100644
> --- a/include/migration/register.h
> +++ b/include/migration/register.h
> @@ -69,7 +69,9 @@ typedef struct SaveVMHandlers {
> /**
> * @save_cleanup
> *
> - * Uninitializes the data structures on the source
> + * Uninitializes the data structures on the source.
> + * Note that this handler can be called even if save_setup
> + * wasn't called earlier.
> *
> * @opaque: data pointer passed to register_savevm_live()
> */
> @@ -244,6 +246,8 @@ typedef struct SaveVMHandlers {
> * @load_cleanup
> *
> * Uninitializes the data structures on the destination.
> + * Note that this handler can be called even if load_setup
> + * wasn't called earlier.
> *
> * @opaque: data pointer passed to register_savevm_live()
> *
>
^ permalink raw reply [flat|nested] 140+ messages in thread
* [PATCH v3 02/24] thread-pool: Remove thread_pool_submit() function
2024-11-17 19:19 [PATCH v3 00/24] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
2024-11-17 19:19 ` [PATCH v3 01/24] migration: Clarify that {load, save}_cleanup handlers can run without setup Maciej S. Szmigiero
@ 2024-11-17 19:19 ` Maciej S. Szmigiero
2024-11-25 19:13 ` Fabiano Rosas
` (2 more replies)
2024-11-17 19:19 ` [PATCH v3 03/24] thread-pool: Rename AIO pool functions to *_aio() and data types to *Aio Maciej S. Szmigiero
` (23 subsequent siblings)
25 siblings, 3 replies; 140+ messages in thread
From: Maciej S. Szmigiero @ 2024-11-17 19:19 UTC (permalink / raw)
To: Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
This function name conflicts with one used by a future generic thread pool
function and it was only used by one test anyway.
Update the trace event name in thread_pool_submit_aio() accordingly.
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
include/block/thread-pool.h | 3 +--
tests/unit/test-thread-pool.c | 2 +-
util/thread-pool.c | 7 +------
util/trace-events | 2 +-
4 files changed, 4 insertions(+), 10 deletions(-)
diff --git a/include/block/thread-pool.h b/include/block/thread-pool.h
index 948ff5f30c31..4f6694026123 100644
--- a/include/block/thread-pool.h
+++ b/include/block/thread-pool.h
@@ -30,13 +30,12 @@ ThreadPool *thread_pool_new(struct AioContext *ctx);
void thread_pool_free(ThreadPool *pool);
/*
- * thread_pool_submit* API: submit I/O requests in the thread's
+ * thread_pool_submit_{aio,co} API: submit I/O requests in the thread's
* current AioContext.
*/
BlockAIOCB *thread_pool_submit_aio(ThreadPoolFunc *func, void *arg,
BlockCompletionFunc *cb, void *opaque);
int coroutine_fn thread_pool_submit_co(ThreadPoolFunc *func, void *arg);
-void thread_pool_submit(ThreadPoolFunc *func, void *arg);
void thread_pool_update_params(ThreadPool *pool, struct AioContext *ctx);
diff --git a/tests/unit/test-thread-pool.c b/tests/unit/test-thread-pool.c
index 1483e53473db..7a7055141ddb 100644
--- a/tests/unit/test-thread-pool.c
+++ b/tests/unit/test-thread-pool.c
@@ -46,7 +46,7 @@ static void done_cb(void *opaque, int ret)
static void test_submit(void)
{
WorkerTestData data = { .n = 0 };
- thread_pool_submit(worker_cb, &data);
+ thread_pool_submit_aio(worker_cb, &data, NULL, NULL);
while (data.n == 0) {
aio_poll(ctx, true);
}
diff --git a/util/thread-pool.c b/util/thread-pool.c
index 27eb777e855b..2f751d55b33f 100644
--- a/util/thread-pool.c
+++ b/util/thread-pool.c
@@ -256,7 +256,7 @@ BlockAIOCB *thread_pool_submit_aio(ThreadPoolFunc *func, void *arg,
QLIST_INSERT_HEAD(&pool->head, req, all);
- trace_thread_pool_submit(pool, req, arg);
+ trace_thread_pool_submit_aio(pool, req, arg);
qemu_mutex_lock(&pool->lock);
if (pool->idle_threads == 0 && pool->cur_threads < pool->max_threads) {
@@ -290,11 +290,6 @@ int coroutine_fn thread_pool_submit_co(ThreadPoolFunc *func, void *arg)
return tpc.ret;
}
-void thread_pool_submit(ThreadPoolFunc *func, void *arg)
-{
- thread_pool_submit_aio(func, arg, NULL, NULL);
-}
-
void thread_pool_update_params(ThreadPool *pool, AioContext *ctx)
{
qemu_mutex_lock(&pool->lock);
diff --git a/util/trace-events b/util/trace-events
index 49a4962e1886..5be12d7fab89 100644
--- a/util/trace-events
+++ b/util/trace-events
@@ -14,7 +14,7 @@ aio_co_schedule_bh_cb(void *ctx, void *co) "ctx %p co %p"
reentrant_aio(void *ctx, const char *name) "ctx %p name %s"
# thread-pool.c
-thread_pool_submit(void *pool, void *req, void *opaque) "pool %p req %p opaque %p"
+thread_pool_submit_aio(void *pool, void *req, void *opaque) "pool %p req %p opaque %p"
thread_pool_complete(void *pool, void *req, void *opaque, int ret) "pool %p req %p opaque %p ret %d"
thread_pool_cancel(void *req, void *opaque) "req %p opaque %p"
^ permalink raw reply related [flat|nested] 140+ messages in thread
* Re: [PATCH v3 02/24] thread-pool: Remove thread_pool_submit() function
2024-11-17 19:19 ` [PATCH v3 02/24] thread-pool: Remove thread_pool_submit() function Maciej S. Szmigiero
@ 2024-11-25 19:13 ` Fabiano Rosas
2024-11-26 16:25 ` Cédric Le Goater
2024-12-04 19:24 ` Peter Xu
2 siblings, 0 replies; 140+ messages in thread
From: Fabiano Rosas @ 2024-11-25 19:13 UTC (permalink / raw)
To: Maciej S. Szmigiero, Peter Xu
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
"Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>
> This function name conflicts with one used by a future generic thread pool
> function and it was only used by one test anyway.
>
> Update the trace event name in thread_pool_submit_aio() accordingly.
>
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Acked-by: Fabiano Rosas <farosas@suse.de>
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 02/24] thread-pool: Remove thread_pool_submit() function
2024-11-17 19:19 ` [PATCH v3 02/24] thread-pool: Remove thread_pool_submit() function Maciej S. Szmigiero
2024-11-25 19:13 ` Fabiano Rosas
@ 2024-11-26 16:25 ` Cédric Le Goater
2024-12-04 19:24 ` Peter Xu
2 siblings, 0 replies; 140+ messages in thread
From: Cédric Le Goater @ 2024-11-26 16:25 UTC (permalink / raw)
To: Maciej S. Szmigiero, Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Eric Blake, Markus Armbruster,
Daniel P . Berrangé, Avihai Horon, Joao Martins, qemu-devel
On 11/17/24 20:19, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>
> This function name conflicts with one used by a future generic thread pool
> function and it was only used by one test anyway.
>
> Update the trace event name in thread_pool_submit_aio() accordingly.
>
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Thanks,
C.
> ---
> include/block/thread-pool.h | 3 +--
> tests/unit/test-thread-pool.c | 2 +-
> util/thread-pool.c | 7 +------
> util/trace-events | 2 +-
> 4 files changed, 4 insertions(+), 10 deletions(-)
>
> diff --git a/include/block/thread-pool.h b/include/block/thread-pool.h
> index 948ff5f30c31..4f6694026123 100644
> --- a/include/block/thread-pool.h
> +++ b/include/block/thread-pool.h
> @@ -30,13 +30,12 @@ ThreadPool *thread_pool_new(struct AioContext *ctx);
> void thread_pool_free(ThreadPool *pool);
>
> /*
> - * thread_pool_submit* API: submit I/O requests in the thread's
> + * thread_pool_submit_{aio,co} API: submit I/O requests in the thread's
> * current AioContext.
> */
> BlockAIOCB *thread_pool_submit_aio(ThreadPoolFunc *func, void *arg,
> BlockCompletionFunc *cb, void *opaque);
> int coroutine_fn thread_pool_submit_co(ThreadPoolFunc *func, void *arg);
> -void thread_pool_submit(ThreadPoolFunc *func, void *arg);
>
> void thread_pool_update_params(ThreadPool *pool, struct AioContext *ctx);
>
> diff --git a/tests/unit/test-thread-pool.c b/tests/unit/test-thread-pool.c
> index 1483e53473db..7a7055141ddb 100644
> --- a/tests/unit/test-thread-pool.c
> +++ b/tests/unit/test-thread-pool.c
> @@ -46,7 +46,7 @@ static void done_cb(void *opaque, int ret)
> static void test_submit(void)
> {
> WorkerTestData data = { .n = 0 };
> - thread_pool_submit(worker_cb, &data);
> + thread_pool_submit_aio(worker_cb, &data, NULL, NULL);
> while (data.n == 0) {
> aio_poll(ctx, true);
> }
> diff --git a/util/thread-pool.c b/util/thread-pool.c
> index 27eb777e855b..2f751d55b33f 100644
> --- a/util/thread-pool.c
> +++ b/util/thread-pool.c
> @@ -256,7 +256,7 @@ BlockAIOCB *thread_pool_submit_aio(ThreadPoolFunc *func, void *arg,
>
> QLIST_INSERT_HEAD(&pool->head, req, all);
>
> - trace_thread_pool_submit(pool, req, arg);
> + trace_thread_pool_submit_aio(pool, req, arg);
>
> qemu_mutex_lock(&pool->lock);
> if (pool->idle_threads == 0 && pool->cur_threads < pool->max_threads) {
> @@ -290,11 +290,6 @@ int coroutine_fn thread_pool_submit_co(ThreadPoolFunc *func, void *arg)
> return tpc.ret;
> }
>
> -void thread_pool_submit(ThreadPoolFunc *func, void *arg)
> -{
> - thread_pool_submit_aio(func, arg, NULL, NULL);
> -}
> -
> void thread_pool_update_params(ThreadPool *pool, AioContext *ctx)
> {
> qemu_mutex_lock(&pool->lock);
> diff --git a/util/trace-events b/util/trace-events
> index 49a4962e1886..5be12d7fab89 100644
> --- a/util/trace-events
> +++ b/util/trace-events
> @@ -14,7 +14,7 @@ aio_co_schedule_bh_cb(void *ctx, void *co) "ctx %p co %p"
> reentrant_aio(void *ctx, const char *name) "ctx %p name %s"
>
> # thread-pool.c
> -thread_pool_submit(void *pool, void *req, void *opaque) "pool %p req %p opaque %p"
> +thread_pool_submit_aio(void *pool, void *req, void *opaque) "pool %p req %p opaque %p"
> thread_pool_complete(void *pool, void *req, void *opaque, int ret) "pool %p req %p opaque %p ret %d"
> thread_pool_cancel(void *req, void *opaque) "req %p opaque %p"
>
>
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 02/24] thread-pool: Remove thread_pool_submit() function
2024-11-17 19:19 ` [PATCH v3 02/24] thread-pool: Remove thread_pool_submit() function Maciej S. Szmigiero
2024-11-25 19:13 ` Fabiano Rosas
2024-11-26 16:25 ` Cédric Le Goater
@ 2024-12-04 19:24 ` Peter Xu
2024-12-06 21:11 ` Maciej S. Szmigiero
2 siblings, 1 reply; 140+ messages in thread
From: Peter Xu @ 2024-12-04 19:24 UTC (permalink / raw)
To: Maciej S. Szmigiero
Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On Sun, Nov 17, 2024 at 08:19:57PM +0100, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>
> This function name conflicts with one used by a future generic thread pool
> function and it was only used by one test anyway.
>
> Update the trace event name in thread_pool_submit_aio() accordingly.
>
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
One nitpick:
> ---
> include/block/thread-pool.h | 3 +--
> tests/unit/test-thread-pool.c | 2 +-
> util/thread-pool.c | 7 +------
> util/trace-events | 2 +-
> 4 files changed, 4 insertions(+), 10 deletions(-)
>
> diff --git a/include/block/thread-pool.h b/include/block/thread-pool.h
> index 948ff5f30c31..4f6694026123 100644
> --- a/include/block/thread-pool.h
> +++ b/include/block/thread-pool.h
> @@ -30,13 +30,12 @@ ThreadPool *thread_pool_new(struct AioContext *ctx);
> void thread_pool_free(ThreadPool *pool);
>
> /*
> - * thread_pool_submit* API: submit I/O requests in the thread's
> + * thread_pool_submit_{aio,co} API: submit I/O requests in the thread's
> * current AioContext.
> */
> BlockAIOCB *thread_pool_submit_aio(ThreadPoolFunc *func, void *arg,
> BlockCompletionFunc *cb, void *opaque);
> int coroutine_fn thread_pool_submit_co(ThreadPoolFunc *func, void *arg);
> -void thread_pool_submit(ThreadPoolFunc *func, void *arg);
>
> void thread_pool_update_params(ThreadPool *pool, struct AioContext *ctx);
>
> diff --git a/tests/unit/test-thread-pool.c b/tests/unit/test-thread-pool.c
> index 1483e53473db..7a7055141ddb 100644
> --- a/tests/unit/test-thread-pool.c
> +++ b/tests/unit/test-thread-pool.c
> @@ -46,7 +46,7 @@ static void done_cb(void *opaque, int ret)
> static void test_submit(void)
The test name was still trying to follow the name of the API. It can be
renamed to test_submit_no_complete() (also the test name str below).
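Something like the following sketch, reusing the names already visible in
the diff (illustrative only, not a complete change):

static void test_submit_no_complete(void)
{
    WorkerTestData data = { .n = 0 };
    thread_pool_submit_aio(worker_cb, &data, NULL, NULL);
    while (data.n == 0) {
        aio_poll(ctx, true);
    }
}

/* ... and in the test registration: */
g_test_add_func("/thread-pool/submit_no_complete", test_submit_no_complete);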
> {
> WorkerTestData data = { .n = 0 };
> - thread_pool_submit(worker_cb, &data);
> + thread_pool_submit_aio(worker_cb, &data, NULL, NULL);
> while (data.n == 0) {
> aio_poll(ctx, true);
> }
> diff --git a/util/thread-pool.c b/util/thread-pool.c
> index 27eb777e855b..2f751d55b33f 100644
> --- a/util/thread-pool.c
> +++ b/util/thread-pool.c
> @@ -256,7 +256,7 @@ BlockAIOCB *thread_pool_submit_aio(ThreadPoolFunc *func, void *arg,
>
> QLIST_INSERT_HEAD(&pool->head, req, all);
>
> - trace_thread_pool_submit(pool, req, arg);
> + trace_thread_pool_submit_aio(pool, req, arg);
>
> qemu_mutex_lock(&pool->lock);
> if (pool->idle_threads == 0 && pool->cur_threads < pool->max_threads) {
> @@ -290,11 +290,6 @@ int coroutine_fn thread_pool_submit_co(ThreadPoolFunc *func, void *arg)
> return tpc.ret;
> }
>
> -void thread_pool_submit(ThreadPoolFunc *func, void *arg)
> -{
> - thread_pool_submit_aio(func, arg, NULL, NULL);
> -}
> -
> void thread_pool_update_params(ThreadPool *pool, AioContext *ctx)
> {
> qemu_mutex_lock(&pool->lock);
> diff --git a/util/trace-events b/util/trace-events
> index 49a4962e1886..5be12d7fab89 100644
> --- a/util/trace-events
> +++ b/util/trace-events
> @@ -14,7 +14,7 @@ aio_co_schedule_bh_cb(void *ctx, void *co) "ctx %p co %p"
> reentrant_aio(void *ctx, const char *name) "ctx %p name %s"
>
> # thread-pool.c
> -thread_pool_submit(void *pool, void *req, void *opaque) "pool %p req %p opaque %p"
> +thread_pool_submit_aio(void *pool, void *req, void *opaque) "pool %p req %p opaque %p"
> thread_pool_complete(void *pool, void *req, void *opaque, int ret) "pool %p req %p opaque %p ret %d"
> thread_pool_cancel(void *req, void *opaque) "req %p opaque %p"
>
>
--
Peter Xu
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 02/24] thread-pool: Remove thread_pool_submit() function
2024-12-04 19:24 ` Peter Xu
@ 2024-12-06 21:11 ` Maciej S. Szmigiero
0 siblings, 0 replies; 140+ messages in thread
From: Maciej S. Szmigiero @ 2024-12-06 21:11 UTC (permalink / raw)
To: Peter Xu
Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P. Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On 4.12.2024 20:24, Peter Xu wrote:
> On Sun, Nov 17, 2024 at 08:19:57PM +0100, Maciej S. Szmigiero wrote:
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> This function name conflicts with one used by a future generic thread pool
>> function and it was only used by one test anyway.
>>
>> Update the trace event name in thread_pool_submit_aio() accordingly.
>>
>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>
> Reviewed-by: Peter Xu <peterx@redhat.com>
>
> One nitpick:
>
>> ---
>> include/block/thread-pool.h | 3 +--
>> tests/unit/test-thread-pool.c | 2 +-
>> util/thread-pool.c | 7 +------
>> util/trace-events | 2 +-
>> 4 files changed, 4 insertions(+), 10 deletions(-)
>>
>> diff --git a/include/block/thread-pool.h b/include/block/thread-pool.h
>> index 948ff5f30c31..4f6694026123 100644
>> --- a/include/block/thread-pool.h
>> +++ b/include/block/thread-pool.h
>> @@ -30,13 +30,12 @@ ThreadPool *thread_pool_new(struct AioContext *ctx);
>> void thread_pool_free(ThreadPool *pool);
>>
>> /*
>> - * thread_pool_submit* API: submit I/O requests in the thread's
>> + * thread_pool_submit_{aio,co} API: submit I/O requests in the thread's
>> * current AioContext.
>> */
>> BlockAIOCB *thread_pool_submit_aio(ThreadPoolFunc *func, void *arg,
>> BlockCompletionFunc *cb, void *opaque);
>> int coroutine_fn thread_pool_submit_co(ThreadPoolFunc *func, void *arg);
>> -void thread_pool_submit(ThreadPoolFunc *func, void *arg);
>>
>> void thread_pool_update_params(ThreadPool *pool, struct AioContext *ctx);
>>
>> diff --git a/tests/unit/test-thread-pool.c b/tests/unit/test-thread-pool.c
>> index 1483e53473db..7a7055141ddb 100644
>> --- a/tests/unit/test-thread-pool.c
>> +++ b/tests/unit/test-thread-pool.c
>> @@ -46,7 +46,7 @@ static void done_cb(void *opaque, int ret)
>> static void test_submit(void)
>
> The test name was still trying to follow the name of the API.
>
> It can be renamed to test_submit_no_complete()
Ack.
> (also the test name str below).
>
I guess you mean also changing "/thread-pool/submit" to
"/thread-pool/submit_no_complete" in the test main().
Thanks,
Maciej
^ permalink raw reply [flat|nested] 140+ messages in thread
* [PATCH v3 03/24] thread-pool: Rename AIO pool functions to *_aio() and data types to *Aio
2024-11-17 19:19 [PATCH v3 00/24] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
2024-11-17 19:19 ` [PATCH v3 01/24] migration: Clarify that {load, save}_cleanup handlers can run without setup Maciej S. Szmigiero
2024-11-17 19:19 ` [PATCH v3 02/24] thread-pool: Remove thread_pool_submit() function Maciej S. Szmigiero
@ 2024-11-17 19:19 ` Maciej S. Szmigiero
2024-11-25 19:15 ` Fabiano Rosas
` (2 more replies)
2024-11-17 19:19 ` [PATCH v3 04/24] thread-pool: Implement generic (non-AIO) pool support Maciej S. Szmigiero
` (22 subsequent siblings)
25 siblings, 3 replies; 140+ messages in thread
From: Maciej S. Szmigiero @ 2024-11-17 19:19 UTC (permalink / raw)
To: Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
These names conflict with ones used by future generic thread pool
equivalents.
Generic names should belong to the generic pool type, not to the specific
(AIO) type.
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
include/block/aio.h | 8 ++---
include/block/thread-pool.h | 8 ++---
util/async.c | 6 ++--
util/thread-pool.c | 58 ++++++++++++++++++-------------------
util/trace-events | 4 +--
5 files changed, 42 insertions(+), 42 deletions(-)
diff --git a/include/block/aio.h b/include/block/aio.h
index 43883a8a33a8..b2ab3514de23 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -54,7 +54,7 @@ typedef void QEMUBHFunc(void *opaque);
typedef bool AioPollFn(void *opaque);
typedef void IOHandler(void *opaque);
-struct ThreadPool;
+struct ThreadPoolAio;
struct LinuxAioState;
typedef struct LuringState LuringState;
@@ -207,7 +207,7 @@ struct AioContext {
/* Thread pool for performing work and receiving completion callbacks.
* Has its own locking.
*/
- struct ThreadPool *thread_pool;
+ struct ThreadPoolAio *thread_pool;
#ifdef CONFIG_LINUX_AIO
struct LinuxAioState *linux_aio;
@@ -500,8 +500,8 @@ void aio_set_event_notifier_poll(AioContext *ctx,
*/
GSource *aio_get_g_source(AioContext *ctx);
-/* Return the ThreadPool bound to this AioContext */
-struct ThreadPool *aio_get_thread_pool(AioContext *ctx);
+/* Return the ThreadPoolAio bound to this AioContext */
+struct ThreadPoolAio *aio_get_thread_pool(AioContext *ctx);
/* Setup the LinuxAioState bound to this AioContext */
struct LinuxAioState *aio_setup_linux_aio(AioContext *ctx, Error **errp);
diff --git a/include/block/thread-pool.h b/include/block/thread-pool.h
index 4f6694026123..6f27eb085b45 100644
--- a/include/block/thread-pool.h
+++ b/include/block/thread-pool.h
@@ -24,10 +24,10 @@
typedef int ThreadPoolFunc(void *opaque);
-typedef struct ThreadPool ThreadPool;
+typedef struct ThreadPoolAio ThreadPoolAio;
-ThreadPool *thread_pool_new(struct AioContext *ctx);
-void thread_pool_free(ThreadPool *pool);
+ThreadPoolAio *thread_pool_new_aio(struct AioContext *ctx);
+void thread_pool_free_aio(ThreadPoolAio *pool);
/*
* thread_pool_submit_{aio,co} API: submit I/O requests in the thread's
@@ -36,7 +36,7 @@ void thread_pool_free(ThreadPool *pool);
BlockAIOCB *thread_pool_submit_aio(ThreadPoolFunc *func, void *arg,
BlockCompletionFunc *cb, void *opaque);
int coroutine_fn thread_pool_submit_co(ThreadPoolFunc *func, void *arg);
+void thread_pool_update_params(ThreadPoolAio *pool, struct AioContext *ctx);
-void thread_pool_update_params(ThreadPool *pool, struct AioContext *ctx);
#endif
diff --git a/util/async.c b/util/async.c
index 99db28389f66..f8b7678aefc8 100644
--- a/util/async.c
+++ b/util/async.c
@@ -369,7 +369,7 @@ aio_ctx_finalize(GSource *source)
QEMUBH *bh;
unsigned flags;
- thread_pool_free(ctx->thread_pool);
+ thread_pool_free_aio(ctx->thread_pool);
#ifdef CONFIG_LINUX_AIO
if (ctx->linux_aio) {
@@ -435,10 +435,10 @@ GSource *aio_get_g_source(AioContext *ctx)
return &ctx->source;
}
-ThreadPool *aio_get_thread_pool(AioContext *ctx)
+ThreadPoolAio *aio_get_thread_pool(AioContext *ctx)
{
if (!ctx->thread_pool) {
- ctx->thread_pool = thread_pool_new(ctx);
+ ctx->thread_pool = thread_pool_new_aio(ctx);
}
return ctx->thread_pool;
}
diff --git a/util/thread-pool.c b/util/thread-pool.c
index 2f751d55b33f..908194dc070f 100644
--- a/util/thread-pool.c
+++ b/util/thread-pool.c
@@ -23,9 +23,9 @@
#include "block/thread-pool.h"
#include "qemu/main-loop.h"
-static void do_spawn_thread(ThreadPool *pool);
+static void do_spawn_thread(ThreadPoolAio *pool);
-typedef struct ThreadPoolElement ThreadPoolElement;
+typedef struct ThreadPoolElementAio ThreadPoolElementAio;
enum ThreadState {
THREAD_QUEUED,
@@ -33,9 +33,9 @@ enum ThreadState {
THREAD_DONE,
};
-struct ThreadPoolElement {
+struct ThreadPoolElementAio {
BlockAIOCB common;
- ThreadPool *pool;
+ ThreadPoolAio *pool;
ThreadPoolFunc *func;
void *arg;
@@ -47,13 +47,13 @@ struct ThreadPoolElement {
int ret;
/* Access to this list is protected by lock. */
- QTAILQ_ENTRY(ThreadPoolElement) reqs;
+ QTAILQ_ENTRY(ThreadPoolElementAio) reqs;
/* This list is only written by the thread pool's mother thread. */
- QLIST_ENTRY(ThreadPoolElement) all;
+ QLIST_ENTRY(ThreadPoolElementAio) all;
};
-struct ThreadPool {
+struct ThreadPoolAio {
AioContext *ctx;
QEMUBH *completion_bh;
QemuMutex lock;
@@ -62,10 +62,10 @@ struct ThreadPool {
QEMUBH *new_thread_bh;
/* The following variables are only accessed from one AioContext. */
- QLIST_HEAD(, ThreadPoolElement) head;
+ QLIST_HEAD(, ThreadPoolElementAio) head;
/* The following variables are protected by lock. */
- QTAILQ_HEAD(, ThreadPoolElement) request_list;
+ QTAILQ_HEAD(, ThreadPoolElementAio) request_list;
int cur_threads;
int idle_threads;
int new_threads; /* backlog of threads we need to create */
@@ -76,14 +76,14 @@ struct ThreadPool {
static void *worker_thread(void *opaque)
{
- ThreadPool *pool = opaque;
+ ThreadPoolAio *pool = opaque;
qemu_mutex_lock(&pool->lock);
pool->pending_threads--;
do_spawn_thread(pool);
while (pool->cur_threads <= pool->max_threads) {
- ThreadPoolElement *req;
+ ThreadPoolElementAio *req;
int ret;
if (QTAILQ_EMPTY(&pool->request_list)) {
@@ -131,7 +131,7 @@ static void *worker_thread(void *opaque)
return NULL;
}
-static void do_spawn_thread(ThreadPool *pool)
+static void do_spawn_thread(ThreadPoolAio *pool)
{
QemuThread t;
@@ -148,14 +148,14 @@ static void do_spawn_thread(ThreadPool *pool)
static void spawn_thread_bh_fn(void *opaque)
{
- ThreadPool *pool = opaque;
+ ThreadPoolAio *pool = opaque;
qemu_mutex_lock(&pool->lock);
do_spawn_thread(pool);
qemu_mutex_unlock(&pool->lock);
}
-static void spawn_thread(ThreadPool *pool)
+static void spawn_thread(ThreadPoolAio *pool)
{
pool->cur_threads++;
pool->new_threads++;
@@ -173,8 +173,8 @@ static void spawn_thread(ThreadPool *pool)
static void thread_pool_completion_bh(void *opaque)
{
- ThreadPool *pool = opaque;
- ThreadPoolElement *elem, *next;
+ ThreadPoolAio *pool = opaque;
+ ThreadPoolElementAio *elem, *next;
defer_call_begin(); /* cb() may use defer_call() to coalesce work */
@@ -184,8 +184,8 @@ restart:
continue;
}
- trace_thread_pool_complete(pool, elem, elem->common.opaque,
- elem->ret);
+ trace_thread_pool_complete_aio(pool, elem, elem->common.opaque,
+ elem->ret);
QLIST_REMOVE(elem, all);
if (elem->common.cb) {
@@ -217,10 +217,10 @@ restart:
static void thread_pool_cancel(BlockAIOCB *acb)
{
- ThreadPoolElement *elem = (ThreadPoolElement *)acb;
- ThreadPool *pool = elem->pool;
+ ThreadPoolElementAio *elem = (ThreadPoolElementAio *)acb;
+ ThreadPoolAio *pool = elem->pool;
- trace_thread_pool_cancel(elem, elem->common.opaque);
+ trace_thread_pool_cancel_aio(elem, elem->common.opaque);
QEMU_LOCK_GUARD(&pool->lock);
if (elem->state == THREAD_QUEUED) {
@@ -234,16 +234,16 @@ static void thread_pool_cancel(BlockAIOCB *acb)
}
static const AIOCBInfo thread_pool_aiocb_info = {
- .aiocb_size = sizeof(ThreadPoolElement),
+ .aiocb_size = sizeof(ThreadPoolElementAio),
.cancel_async = thread_pool_cancel,
};
BlockAIOCB *thread_pool_submit_aio(ThreadPoolFunc *func, void *arg,
BlockCompletionFunc *cb, void *opaque)
{
- ThreadPoolElement *req;
+ ThreadPoolElementAio *req;
AioContext *ctx = qemu_get_current_aio_context();
- ThreadPool *pool = aio_get_thread_pool(ctx);
+ ThreadPoolAio *pool = aio_get_thread_pool(ctx);
/* Assert that the thread submitting work is the same running the pool */
assert(pool->ctx == qemu_get_current_aio_context());
@@ -290,7 +290,7 @@ int coroutine_fn thread_pool_submit_co(ThreadPoolFunc *func, void *arg)
return tpc.ret;
}
-void thread_pool_update_params(ThreadPool *pool, AioContext *ctx)
+void thread_pool_update_params(ThreadPoolAio *pool, AioContext *ctx)
{
qemu_mutex_lock(&pool->lock);
@@ -317,7 +317,7 @@ void thread_pool_update_params(ThreadPool *pool, AioContext *ctx)
qemu_mutex_unlock(&pool->lock);
}
-static void thread_pool_init_one(ThreadPool *pool, AioContext *ctx)
+static void thread_pool_init_one(ThreadPoolAio *pool, AioContext *ctx)
{
if (!ctx) {
ctx = qemu_get_aio_context();
@@ -337,14 +337,14 @@ static void thread_pool_init_one(ThreadPool *pool, AioContext *ctx)
thread_pool_update_params(pool, ctx);
}
-ThreadPool *thread_pool_new(AioContext *ctx)
+ThreadPoolAio *thread_pool_new_aio(AioContext *ctx)
{
- ThreadPool *pool = g_new(ThreadPool, 1);
+ ThreadPoolAio *pool = g_new(ThreadPoolAio, 1);
thread_pool_init_one(pool, ctx);
return pool;
}
-void thread_pool_free(ThreadPool *pool)
+void thread_pool_free_aio(ThreadPoolAio *pool)
{
if (!pool) {
return;
diff --git a/util/trace-events b/util/trace-events
index 5be12d7fab89..bd8f25fb5920 100644
--- a/util/trace-events
+++ b/util/trace-events
@@ -15,8 +15,8 @@ reentrant_aio(void *ctx, const char *name) "ctx %p name %s"
# thread-pool.c
thread_pool_submit_aio(void *pool, void *req, void *opaque) "pool %p req %p opaque %p"
-thread_pool_complete(void *pool, void *req, void *opaque, int ret) "pool %p req %p opaque %p ret %d"
-thread_pool_cancel(void *req, void *opaque) "req %p opaque %p"
+thread_pool_complete_aio(void *pool, void *req, void *opaque, int ret) "pool %p req %p opaque %p ret %d"
+thread_pool_cancel_aio(void *req, void *opaque) "req %p opaque %p"
# buffer.c
buffer_resize(const char *buf, size_t olen, size_t len) "%s: old %zd, new %zd"
^ permalink raw reply related [flat|nested] 140+ messages in thread
* Re: [PATCH v3 03/24] thread-pool: Rename AIO pool functions to *_aio() and data types to *Aio
2024-11-17 19:19 ` [PATCH v3 03/24] thread-pool: Rename AIO pool functions to *_aio() and data types to *Aio Maciej S. Szmigiero
@ 2024-11-25 19:15 ` Fabiano Rosas
2024-11-26 16:26 ` Cédric Le Goater
2024-12-04 19:26 ` Peter Xu
2 siblings, 0 replies; 140+ messages in thread
From: Fabiano Rosas @ 2024-11-25 19:15 UTC (permalink / raw)
To: Maciej S. Szmigiero, Peter Xu
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
"Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>
> These names conflict with ones used by future generic thread pool
> equivalents.
> Generic names should belong to the generic pool type, not to the specific
> (AIO) type.
>
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Acked-by: Fabiano Rosas <farosas@suse.de>
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 03/24] thread-pool: Rename AIO pool functions to *_aio() and data types to *Aio
2024-11-17 19:19 ` [PATCH v3 03/24] thread-pool: Rename AIO pool functions to *_aio() and data types to *Aio Maciej S. Szmigiero
2024-11-25 19:15 ` Fabiano Rosas
@ 2024-11-26 16:26 ` Cédric Le Goater
2024-12-04 19:26 ` Peter Xu
2 siblings, 0 replies; 140+ messages in thread
From: Cédric Le Goater @ 2024-11-26 16:26 UTC (permalink / raw)
To: Maciej S. Szmigiero, Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Eric Blake, Markus Armbruster,
Daniel P . Berrangé, Avihai Horon, Joao Martins, qemu-devel
On 11/17/24 20:19, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>
> These names conflict with ones used by future generic thread pool
> equivalents.
> Generic names should belong to the generic pool type, not to the specific
> (AIO) type.
>
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Thanks,
C.
> ---
> include/block/aio.h | 8 ++---
> include/block/thread-pool.h | 8 ++---
> util/async.c | 6 ++--
> util/thread-pool.c | 58 ++++++++++++++++++-------------------
> util/trace-events | 4 +--
> 5 files changed, 42 insertions(+), 42 deletions(-)
>
> diff --git a/include/block/aio.h b/include/block/aio.h
> index 43883a8a33a8..b2ab3514de23 100644
> --- a/include/block/aio.h
> +++ b/include/block/aio.h
> @@ -54,7 +54,7 @@ typedef void QEMUBHFunc(void *opaque);
> typedef bool AioPollFn(void *opaque);
> typedef void IOHandler(void *opaque);
>
> -struct ThreadPool;
> +struct ThreadPoolAio;
> struct LinuxAioState;
> typedef struct LuringState LuringState;
>
> @@ -207,7 +207,7 @@ struct AioContext {
> /* Thread pool for performing work and receiving completion callbacks.
> * Has its own locking.
> */
> - struct ThreadPool *thread_pool;
> + struct ThreadPoolAio *thread_pool;
>
> #ifdef CONFIG_LINUX_AIO
> struct LinuxAioState *linux_aio;
> @@ -500,8 +500,8 @@ void aio_set_event_notifier_poll(AioContext *ctx,
> */
> GSource *aio_get_g_source(AioContext *ctx);
>
> -/* Return the ThreadPool bound to this AioContext */
> -struct ThreadPool *aio_get_thread_pool(AioContext *ctx);
> +/* Return the ThreadPoolAio bound to this AioContext */
> +struct ThreadPoolAio *aio_get_thread_pool(AioContext *ctx);
>
> /* Setup the LinuxAioState bound to this AioContext */
> struct LinuxAioState *aio_setup_linux_aio(AioContext *ctx, Error **errp);
> diff --git a/include/block/thread-pool.h b/include/block/thread-pool.h
> index 4f6694026123..6f27eb085b45 100644
> --- a/include/block/thread-pool.h
> +++ b/include/block/thread-pool.h
> @@ -24,10 +24,10 @@
>
> typedef int ThreadPoolFunc(void *opaque);
>
> -typedef struct ThreadPool ThreadPool;
> +typedef struct ThreadPoolAio ThreadPoolAio;
>
> -ThreadPool *thread_pool_new(struct AioContext *ctx);
> -void thread_pool_free(ThreadPool *pool);
> +ThreadPoolAio *thread_pool_new_aio(struct AioContext *ctx);
> +void thread_pool_free_aio(ThreadPoolAio *pool);
>
> /*
> * thread_pool_submit_{aio,co} API: submit I/O requests in the thread's
> @@ -36,7 +36,7 @@ void thread_pool_free(ThreadPool *pool);
> BlockAIOCB *thread_pool_submit_aio(ThreadPoolFunc *func, void *arg,
> BlockCompletionFunc *cb, void *opaque);
> int coroutine_fn thread_pool_submit_co(ThreadPoolFunc *func, void *arg);
> +void thread_pool_update_params(ThreadPoolAio *pool, struct AioContext *ctx);
>
> -void thread_pool_update_params(ThreadPool *pool, struct AioContext *ctx);
>
> #endif
> diff --git a/util/async.c b/util/async.c
> index 99db28389f66..f8b7678aefc8 100644
> --- a/util/async.c
> +++ b/util/async.c
> @@ -369,7 +369,7 @@ aio_ctx_finalize(GSource *source)
> QEMUBH *bh;
> unsigned flags;
>
> - thread_pool_free(ctx->thread_pool);
> + thread_pool_free_aio(ctx->thread_pool);
>
> #ifdef CONFIG_LINUX_AIO
> if (ctx->linux_aio) {
> @@ -435,10 +435,10 @@ GSource *aio_get_g_source(AioContext *ctx)
> return &ctx->source;
> }
>
> -ThreadPool *aio_get_thread_pool(AioContext *ctx)
> +ThreadPoolAio *aio_get_thread_pool(AioContext *ctx)
> {
> if (!ctx->thread_pool) {
> - ctx->thread_pool = thread_pool_new(ctx);
> + ctx->thread_pool = thread_pool_new_aio(ctx);
> }
> return ctx->thread_pool;
> }
> diff --git a/util/thread-pool.c b/util/thread-pool.c
> index 2f751d55b33f..908194dc070f 100644
> --- a/util/thread-pool.c
> +++ b/util/thread-pool.c
> @@ -23,9 +23,9 @@
> #include "block/thread-pool.h"
> #include "qemu/main-loop.h"
>
> -static void do_spawn_thread(ThreadPool *pool);
> +static void do_spawn_thread(ThreadPoolAio *pool);
>
> -typedef struct ThreadPoolElement ThreadPoolElement;
> +typedef struct ThreadPoolElementAio ThreadPoolElementAio;
>
> enum ThreadState {
> THREAD_QUEUED,
> @@ -33,9 +33,9 @@ enum ThreadState {
> THREAD_DONE,
> };
>
> -struct ThreadPoolElement {
> +struct ThreadPoolElementAio {
> BlockAIOCB common;
> - ThreadPool *pool;
> + ThreadPoolAio *pool;
> ThreadPoolFunc *func;
> void *arg;
>
> @@ -47,13 +47,13 @@ struct ThreadPoolElement {
> int ret;
>
> /* Access to this list is protected by lock. */
> - QTAILQ_ENTRY(ThreadPoolElement) reqs;
> + QTAILQ_ENTRY(ThreadPoolElementAio) reqs;
>
> /* This list is only written by the thread pool's mother thread. */
> - QLIST_ENTRY(ThreadPoolElement) all;
> + QLIST_ENTRY(ThreadPoolElementAio) all;
> };
>
> -struct ThreadPool {
> +struct ThreadPoolAio {
> AioContext *ctx;
> QEMUBH *completion_bh;
> QemuMutex lock;
> @@ -62,10 +62,10 @@ struct ThreadPool {
> QEMUBH *new_thread_bh;
>
> /* The following variables are only accessed from one AioContext. */
> - QLIST_HEAD(, ThreadPoolElement) head;
> + QLIST_HEAD(, ThreadPoolElementAio) head;
>
> /* The following variables are protected by lock. */
> - QTAILQ_HEAD(, ThreadPoolElement) request_list;
> + QTAILQ_HEAD(, ThreadPoolElementAio) request_list;
> int cur_threads;
> int idle_threads;
> int new_threads; /* backlog of threads we need to create */
> @@ -76,14 +76,14 @@ struct ThreadPool {
>
> static void *worker_thread(void *opaque)
> {
> - ThreadPool *pool = opaque;
> + ThreadPoolAio *pool = opaque;
>
> qemu_mutex_lock(&pool->lock);
> pool->pending_threads--;
> do_spawn_thread(pool);
>
> while (pool->cur_threads <= pool->max_threads) {
> - ThreadPoolElement *req;
> + ThreadPoolElementAio *req;
> int ret;
>
> if (QTAILQ_EMPTY(&pool->request_list)) {
> @@ -131,7 +131,7 @@ static void *worker_thread(void *opaque)
> return NULL;
> }
>
> -static void do_spawn_thread(ThreadPool *pool)
> +static void do_spawn_thread(ThreadPoolAio *pool)
> {
> QemuThread t;
>
> @@ -148,14 +148,14 @@ static void do_spawn_thread(ThreadPool *pool)
>
> static void spawn_thread_bh_fn(void *opaque)
> {
> - ThreadPool *pool = opaque;
> + ThreadPoolAio *pool = opaque;
>
> qemu_mutex_lock(&pool->lock);
> do_spawn_thread(pool);
> qemu_mutex_unlock(&pool->lock);
> }
>
> -static void spawn_thread(ThreadPool *pool)
> +static void spawn_thread(ThreadPoolAio *pool)
> {
> pool->cur_threads++;
> pool->new_threads++;
> @@ -173,8 +173,8 @@ static void spawn_thread(ThreadPool *pool)
>
> static void thread_pool_completion_bh(void *opaque)
> {
> - ThreadPool *pool = opaque;
> - ThreadPoolElement *elem, *next;
> + ThreadPoolAio *pool = opaque;
> + ThreadPoolElementAio *elem, *next;
>
> defer_call_begin(); /* cb() may use defer_call() to coalesce work */
>
> @@ -184,8 +184,8 @@ restart:
> continue;
> }
>
> - trace_thread_pool_complete(pool, elem, elem->common.opaque,
> - elem->ret);
> + trace_thread_pool_complete_aio(pool, elem, elem->common.opaque,
> + elem->ret);
> QLIST_REMOVE(elem, all);
>
> if (elem->common.cb) {
> @@ -217,10 +217,10 @@ restart:
>
> static void thread_pool_cancel(BlockAIOCB *acb)
> {
> - ThreadPoolElement *elem = (ThreadPoolElement *)acb;
> - ThreadPool *pool = elem->pool;
> + ThreadPoolElementAio *elem = (ThreadPoolElementAio *)acb;
> + ThreadPoolAio *pool = elem->pool;
>
> - trace_thread_pool_cancel(elem, elem->common.opaque);
> + trace_thread_pool_cancel_aio(elem, elem->common.opaque);
>
> QEMU_LOCK_GUARD(&pool->lock);
> if (elem->state == THREAD_QUEUED) {
> @@ -234,16 +234,16 @@ static void thread_pool_cancel(BlockAIOCB *acb)
> }
>
> static const AIOCBInfo thread_pool_aiocb_info = {
> - .aiocb_size = sizeof(ThreadPoolElement),
> + .aiocb_size = sizeof(ThreadPoolElementAio),
> .cancel_async = thread_pool_cancel,
> };
>
> BlockAIOCB *thread_pool_submit_aio(ThreadPoolFunc *func, void *arg,
> BlockCompletionFunc *cb, void *opaque)
> {
> - ThreadPoolElement *req;
> + ThreadPoolElementAio *req;
> AioContext *ctx = qemu_get_current_aio_context();
> - ThreadPool *pool = aio_get_thread_pool(ctx);
> + ThreadPoolAio *pool = aio_get_thread_pool(ctx);
>
> /* Assert that the thread submitting work is the same running the pool */
> assert(pool->ctx == qemu_get_current_aio_context());
> @@ -290,7 +290,7 @@ int coroutine_fn thread_pool_submit_co(ThreadPoolFunc *func, void *arg)
> return tpc.ret;
> }
>
> -void thread_pool_update_params(ThreadPool *pool, AioContext *ctx)
> +void thread_pool_update_params(ThreadPoolAio *pool, AioContext *ctx)
> {
> qemu_mutex_lock(&pool->lock);
>
> @@ -317,7 +317,7 @@ void thread_pool_update_params(ThreadPool *pool, AioContext *ctx)
> qemu_mutex_unlock(&pool->lock);
> }
>
> -static void thread_pool_init_one(ThreadPool *pool, AioContext *ctx)
> +static void thread_pool_init_one(ThreadPoolAio *pool, AioContext *ctx)
> {
> if (!ctx) {
> ctx = qemu_get_aio_context();
> @@ -337,14 +337,14 @@ static void thread_pool_init_one(ThreadPool *pool, AioContext *ctx)
> thread_pool_update_params(pool, ctx);
> }
>
> -ThreadPool *thread_pool_new(AioContext *ctx)
> +ThreadPoolAio *thread_pool_new_aio(AioContext *ctx)
> {
> - ThreadPool *pool = g_new(ThreadPool, 1);
> + ThreadPoolAio *pool = g_new(ThreadPoolAio, 1);
> thread_pool_init_one(pool, ctx);
> return pool;
> }
>
> -void thread_pool_free(ThreadPool *pool)
> +void thread_pool_free_aio(ThreadPoolAio *pool)
> {
> if (!pool) {
> return;
> diff --git a/util/trace-events b/util/trace-events
> index 5be12d7fab89..bd8f25fb5920 100644
> --- a/util/trace-events
> +++ b/util/trace-events
> @@ -15,8 +15,8 @@ reentrant_aio(void *ctx, const char *name) "ctx %p name %s"
>
> # thread-pool.c
> thread_pool_submit_aio(void *pool, void *req, void *opaque) "pool %p req %p opaque %p"
> -thread_pool_complete(void *pool, void *req, void *opaque, int ret) "pool %p req %p opaque %p ret %d"
> -thread_pool_cancel(void *req, void *opaque) "req %p opaque %p"
> +thread_pool_complete_aio(void *pool, void *req, void *opaque, int ret) "pool %p req %p opaque %p ret %d"
> +thread_pool_cancel_aio(void *req, void *opaque) "req %p opaque %p"
>
> # buffer.c
> buffer_resize(const char *buf, size_t olen, size_t len) "%s: old %zd, new %zd"
>
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 03/24] thread-pool: Rename AIO pool functions to *_aio() and data types to *Aio
2024-11-17 19:19 ` [PATCH v3 03/24] thread-pool: Rename AIO pool functions to *_aio() and data types to *Aio Maciej S. Szmigiero
2024-11-25 19:15 ` Fabiano Rosas
2024-11-26 16:26 ` Cédric Le Goater
@ 2024-12-04 19:26 ` Peter Xu
2 siblings, 0 replies; 140+ messages in thread
From: Peter Xu @ 2024-12-04 19:26 UTC (permalink / raw)
To: Maciej S. Szmigiero
Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On Sun, Nov 17, 2024 at 08:19:58PM +0100, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>
> These names conflict with ones used by future generic thread pool
> equivalents.
> Generic names should belong to the generic pool type, not to the specific
> (AIO) type.
>
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
--
Peter Xu
^ permalink raw reply [flat|nested] 140+ messages in thread
* [PATCH v3 04/24] thread-pool: Implement generic (non-AIO) pool support
2024-11-17 19:19 [PATCH v3 00/24] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
` (2 preceding siblings ...)
2024-11-17 19:19 ` [PATCH v3 03/24] thread-pool: Rename AIO pool functions to *_aio() and data types to *Aio Maciej S. Szmigiero
@ 2024-11-17 19:19 ` Maciej S. Szmigiero
2024-11-25 19:41 ` Fabiano Rosas
` (3 more replies)
2024-11-17 19:20 ` [PATCH v3 05/24] migration: Add MIG_CMD_SWITCHOVER_START and its load handler Maciej S. Szmigiero
` (21 subsequent siblings)
25 siblings, 4 replies; 140+ messages in thread
From: Maciej S. Szmigiero @ 2024-11-17 19:19 UTC (permalink / raw)
To: Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
Migration code wants to manage device data sending threads in one place.
QEMU has an existing thread pool implementation; however, it is limited
to queuing AIO operations only and essentially has a 1:1 mapping between
the current AioContext and the AIO ThreadPool in use.
Implement a generic (non-AIO) ThreadPool by essentially wrapping Glib's
GThreadPool.
This brings a few new operations on a pool:
* thread_pool_wait() operation waits until all the submitted work requests
have finished.
* thread_pool_set_max_threads() explicitly sets the maximum thread count
in the pool.
* thread_pool_adjust_max_threads_to_work() adjusts the maximum thread count
in the pool to equal the number of work items still waiting in the queue or
not yet finished.
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
include/block/thread-pool.h | 9 +++
util/thread-pool.c | 109 ++++++++++++++++++++++++++++++++++++
2 files changed, 118 insertions(+)
diff --git a/include/block/thread-pool.h b/include/block/thread-pool.h
index 6f27eb085b45..3f9f66307b65 100644
--- a/include/block/thread-pool.h
+++ b/include/block/thread-pool.h
@@ -38,5 +38,14 @@ BlockAIOCB *thread_pool_submit_aio(ThreadPoolFunc *func, void *arg,
int coroutine_fn thread_pool_submit_co(ThreadPoolFunc *func, void *arg);
void thread_pool_update_params(ThreadPoolAio *pool, struct AioContext *ctx);
+typedef struct ThreadPool ThreadPool;
+
+ThreadPool *thread_pool_new(void);
+void thread_pool_free(ThreadPool *pool);
+void thread_pool_submit(ThreadPool *pool, ThreadPoolFunc *func,
+ void *opaque, GDestroyNotify opaque_destroy);
+void thread_pool_wait(ThreadPool *pool);
+bool thread_pool_set_max_threads(ThreadPool *pool, int max_threads);
+bool thread_pool_adjust_max_threads_to_work(ThreadPool *pool);
#endif
diff --git a/util/thread-pool.c b/util/thread-pool.c
index 908194dc070f..d80c4181c897 100644
--- a/util/thread-pool.c
+++ b/util/thread-pool.c
@@ -374,3 +374,112 @@ void thread_pool_free_aio(ThreadPoolAio *pool)
qemu_mutex_destroy(&pool->lock);
g_free(pool);
}
+
+struct ThreadPool { /* type safety */
+ GThreadPool *t;
+ size_t unfinished_el_ctr;
+ QemuMutex unfinished_el_ctr_mutex;
+ QemuCond unfinished_el_ctr_zero_cond;
+};
+
+typedef struct {
+ ThreadPoolFunc *func;
+ void *opaque;
+ GDestroyNotify opaque_destroy;
+} ThreadPoolElement;
+
+static void thread_pool_func(gpointer data, gpointer user_data)
+{
+ ThreadPool *pool = user_data;
+ g_autofree ThreadPoolElement *el = data;
+
+ el->func(el->opaque);
+
+ if (el->opaque_destroy) {
+ el->opaque_destroy(el->opaque);
+ }
+
+ QEMU_LOCK_GUARD(&pool->unfinished_el_ctr_mutex);
+
+ assert(pool->unfinished_el_ctr > 0);
+ pool->unfinished_el_ctr--;
+
+ if (pool->unfinished_el_ctr == 0) {
+ qemu_cond_signal(&pool->unfinished_el_ctr_zero_cond);
+ }
+}
+
+ThreadPool *thread_pool_new(void)
+{
+ ThreadPool *pool = g_new(ThreadPool, 1);
+
+ pool->unfinished_el_ctr = 0;
+ qemu_mutex_init(&pool->unfinished_el_ctr_mutex);
+ qemu_cond_init(&pool->unfinished_el_ctr_zero_cond);
+
+ pool->t = g_thread_pool_new(thread_pool_func, pool, 0, TRUE, NULL);
+ /*
+ * g_thread_pool_new() can only return errors if initial thread(s)
+ * creation fails but we ask for 0 initial threads above.
+ */
+ assert(pool->t);
+
+ return pool;
+}
+
+void thread_pool_free(ThreadPool *pool)
+{
+ g_thread_pool_free(pool->t, FALSE, TRUE);
+
+ qemu_cond_destroy(&pool->unfinished_el_ctr_zero_cond);
+ qemu_mutex_destroy(&pool->unfinished_el_ctr_mutex);
+
+ g_free(pool);
+}
+
+void thread_pool_submit(ThreadPool *pool, ThreadPoolFunc *func,
+ void *opaque, GDestroyNotify opaque_destroy)
+{
+ ThreadPoolElement *el = g_new(ThreadPoolElement, 1);
+
+ el->func = func;
+ el->opaque = opaque;
+ el->opaque_destroy = opaque_destroy;
+
+ WITH_QEMU_LOCK_GUARD(&pool->unfinished_el_ctr_mutex) {
+ pool->unfinished_el_ctr++;
+ }
+
+ /*
+ * Ignore the return value since this function can only return errors
+ * if creation of an additional thread fails but even in this case the
+ * provided work is still getting queued (just for the existing threads).
+ */
+ g_thread_pool_push(pool->t, el, NULL);
+}
+
+void thread_pool_wait(ThreadPool *pool)
+{
+ QEMU_LOCK_GUARD(&pool->unfinished_el_ctr_mutex);
+
+ if (pool->unfinished_el_ctr > 0) {
+ qemu_cond_wait(&pool->unfinished_el_ctr_zero_cond,
+ &pool->unfinished_el_ctr_mutex);
+ assert(pool->unfinished_el_ctr == 0);
+ }
+}
+
+bool thread_pool_set_max_threads(ThreadPool *pool,
+ int max_threads)
+{
+ assert(max_threads > 0);
+
+ return g_thread_pool_set_max_threads(pool->t, max_threads, NULL);
+}
+
+bool thread_pool_adjust_max_threads_to_work(ThreadPool *pool)
+{
+ QEMU_LOCK_GUARD(&pool->unfinished_el_ctr_mutex);
+
+ return thread_pool_set_max_threads(pool, pool->unfinished_el_ctr);
+}
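As a usage illustration of the new API (a sketch assuming the usual
includes; example_work() and its argument handling are made up, not part
of the patch):

static int example_work(void *opaque)
{
    int *n = opaque;

    printf("work item %d done\n", *n);
    return 0;
}

static void example(void)
{
    ThreadPool *pool = thread_pool_new();
    int args[4] = { 0, 1, 2, 3 };

    thread_pool_set_max_threads(pool, 2);

    for (int i = 0; i < 4; i++) {
        /* NULL opaque_destroy: args[] lives on the caller's stack */
        thread_pool_submit(pool, example_work, &args[i], NULL);
    }

    thread_pool_wait(pool);  /* returns once all four items have finished */
    thread_pool_free(pool);
}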
^ permalink raw reply related [flat|nested] 140+ messages in thread
* Re: [PATCH v3 04/24] thread-pool: Implement generic (non-AIO) pool support
2024-11-17 19:19 ` [PATCH v3 04/24] thread-pool: Implement generic (non-AIO) pool support Maciej S. Szmigiero
@ 2024-11-25 19:41 ` Fabiano Rosas
2024-11-25 19:55 ` Maciej S. Szmigiero
2024-11-26 19:29 ` Cédric Le Goater
` (2 subsequent siblings)
3 siblings, 1 reply; 140+ messages in thread
From: Fabiano Rosas @ 2024-11-25 19:41 UTC (permalink / raw)
To: Maciej S. Szmigiero, Peter Xu
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
"Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>
> Migration code wants to manage device data sending threads in one place.
>
> QEMU has an existing thread pool implementation; however, it is limited
> to queuing AIO operations only and essentially has a 1:1 mapping between
> the current AioContext and the AIO ThreadPool in use.
>
> Implement a generic (non-AIO) ThreadPool by essentially wrapping Glib's
> GThreadPool.
>
> This brings a few new operations on a pool:
> * thread_pool_wait() operation waits until all the submitted work requests
> have finished.
>
> * thread_pool_set_max_threads() explicitly sets the maximum thread count
> in the pool.
>
> * thread_pool_adjust_max_threads_to_work() adjusts the maximum thread count
> in the pool to equal the number of work items still waiting in the queue or
> not yet finished.
>
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> ---
> include/block/thread-pool.h | 9 +++
> util/thread-pool.c | 109 ++++++++++++++++++++++++++++++++++++
> 2 files changed, 118 insertions(+)
>
> diff --git a/include/block/thread-pool.h b/include/block/thread-pool.h
> index 6f27eb085b45..3f9f66307b65 100644
> --- a/include/block/thread-pool.h
> +++ b/include/block/thread-pool.h
> @@ -38,5 +38,14 @@ BlockAIOCB *thread_pool_submit_aio(ThreadPoolFunc *func, void *arg,
> int coroutine_fn thread_pool_submit_co(ThreadPoolFunc *func, void *arg);
> void thread_pool_update_params(ThreadPoolAio *pool, struct AioContext *ctx);
>
> +typedef struct ThreadPool ThreadPool;
> +
> +ThreadPool *thread_pool_new(void);
> +void thread_pool_free(ThreadPool *pool);
> +void thread_pool_submit(ThreadPool *pool, ThreadPoolFunc *func,
> + void *opaque, GDestroyNotify opaque_destroy);
> +void thread_pool_wait(ThreadPool *pool);
> +bool thread_pool_set_max_threads(ThreadPool *pool, int max_threads);
> +bool thread_pool_adjust_max_threads_to_work(ThreadPool *pool);
>
> #endif
> diff --git a/util/thread-pool.c b/util/thread-pool.c
> index 908194dc070f..d80c4181c897 100644
> --- a/util/thread-pool.c
> +++ b/util/thread-pool.c
> @@ -374,3 +374,112 @@ void thread_pool_free_aio(ThreadPoolAio *pool)
> qemu_mutex_destroy(&pool->lock);
> g_free(pool);
> }
> +
> +struct ThreadPool { /* type safety */
> + GThreadPool *t;
> + size_t unfinished_el_ctr;
> + QemuMutex unfinished_el_ctr_mutex;
> + QemuCond unfinished_el_ctr_zero_cond;
> +};
> +
> +typedef struct {
> + ThreadPoolFunc *func;
> + void *opaque;
> + GDestroyNotify opaque_destroy;
> +} ThreadPoolElement;
> +
> +static void thread_pool_func(gpointer data, gpointer user_data)
> +{
> + ThreadPool *pool = user_data;
> + g_autofree ThreadPoolElement *el = data;
> +
> + el->func(el->opaque);
> +
> + if (el->opaque_destroy) {
> + el->opaque_destroy(el->opaque);
> + }
> +
> + QEMU_LOCK_GUARD(&pool->unfinished_el_ctr_mutex);
> +
> + assert(pool->unfinished_el_ctr > 0);
> + pool->unfinished_el_ctr--;
> +
> + if (pool->unfinished_el_ctr == 0) {
> + qemu_cond_signal(&pool->unfinished_el_ctr_zero_cond);
> + }
> +}
> +
> +ThreadPool *thread_pool_new(void)
> +{
> + ThreadPool *pool = g_new(ThreadPool, 1);
> +
> + pool->unfinished_el_ctr = 0;
> + qemu_mutex_init(&pool->unfinished_el_ctr_mutex);
> + qemu_cond_init(&pool->unfinished_el_ctr_zero_cond);
> +
> + pool->t = g_thread_pool_new(thread_pool_func, pool, 0, TRUE, NULL);
> + /*
> + * g_thread_pool_new() can only return errors if initial thread(s)
> + * creation fails but we ask for 0 initial threads above.
> + */
> + assert(pool->t);
> +
> + return pool;
> +}
> +
> +void thread_pool_free(ThreadPool *pool)
> +{
> + g_thread_pool_free(pool->t, FALSE, TRUE);
Should we make it an error to call thread_pool_free() without first
calling thread_pool_wait()? I worry the current usage will lead to having
two different ways of waiting, with one of them (this one) being quite
implicit.
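Purely as a sketch of that idea (using the patch's own field names), the
implicit wait could become an assertion that the caller has already
drained the pool:

void thread_pool_free(ThreadPool *pool)
{
    /* require an explicit thread_pool_wait() before freeing */
    WITH_QEMU_LOCK_GUARD(&pool->unfinished_el_ctr_mutex) {
        assert(pool->unfinished_el_ctr == 0);
    }

    g_thread_pool_free(pool->t, FALSE, TRUE);

    qemu_cond_destroy(&pool->unfinished_el_ctr_zero_cond);
    qemu_mutex_destroy(&pool->unfinished_el_ctr_mutex);

    g_free(pool);
}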
> +
> + qemu_cond_destroy(&pool->unfinished_el_ctr_zero_cond);
> + qemu_mutex_destroy(&pool->unfinished_el_ctr_mutex);
> +
> + g_free(pool);
> +}
> +
> +void thread_pool_submit(ThreadPool *pool, ThreadPoolFunc *func,
> + void *opaque, GDestroyNotify opaque_destroy)
> +{
> + ThreadPoolElement *el = g_new(ThreadPoolElement, 1);
> +
> + el->func = func;
> + el->opaque = opaque;
> + el->opaque_destroy = opaque_destroy;
> +
> + WITH_QEMU_LOCK_GUARD(&pool->unfinished_el_ctr_mutex) {
> + pool->unfinished_el_ctr++;
> + }
> +
> + /*
> + * Ignore the return value since this function can only return errors
> + * if creation of an additional thread fails but even in this case the
> + * provided work is still getting queued (just for the existing threads).
> + */
> + g_thread_pool_push(pool->t, el, NULL);
> +}
> +
> +void thread_pool_wait(ThreadPool *pool)
> +{
> + QEMU_LOCK_GUARD(&pool->unfinished_el_ctr_mutex);
> +
> + if (pool->unfinished_el_ctr > 0) {
> + qemu_cond_wait(&pool->unfinished_el_ctr_zero_cond,
> + &pool->unfinished_el_ctr_mutex);
> + assert(pool->unfinished_el_ctr == 0);
> + }
> +}
> +
> +bool thread_pool_set_max_threads(ThreadPool *pool,
> + int max_threads)
> +{
> + assert(max_threads > 0);
> +
> + return g_thread_pool_set_max_threads(pool->t, max_threads, NULL);
> +}
> +
> +bool thread_pool_adjust_max_threads_to_work(ThreadPool *pool)
> +{
> + QEMU_LOCK_GUARD(&pool->unfinished_el_ctr_mutex);
> +
> + return thread_pool_set_max_threads(pool, pool->unfinished_el_ctr);
> +}
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 04/24] thread-pool: Implement generic (non-AIO) pool support
2024-11-25 19:41 ` Fabiano Rosas
@ 2024-11-25 19:55 ` Maciej S. Szmigiero
2024-11-25 20:51 ` Fabiano Rosas
2024-11-26 19:25 ` Cédric Le Goater
0 siblings, 2 replies; 140+ messages in thread
From: Maciej S. Szmigiero @ 2024-11-25 19:55 UTC (permalink / raw)
To: Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel, Peter Xu
On 25.11.2024 20:41, Fabiano Rosas wrote:
> "Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:
>
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> Migration code wants to manage device data sending threads in one place.
>>
>> QEMU has an existing thread pool implementation, however it is limited
>> to queuing AIO operations only and essentially has a 1:1 mapping between
>> the current AioContext and the AIO ThreadPool in use.
>>
>> Implement generic (non-AIO) ThreadPool by essentially wrapping Glib's
>> GThreadPool.
>>
>> This brings a few new operations on a pool:
>> * thread_pool_wait() operation waits until all the submitted work requests
>> have finished.
>>
>> * thread_pool_set_max_threads() explicitly sets the maximum thread count
>> in the pool.
>>
>> * thread_pool_adjust_max_threads_to_work() adjusts the maximum thread count
>> in the pool to equal the number of work items still queued or unfinished.
>>
>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>> ---
>> include/block/thread-pool.h | 9 +++
>> util/thread-pool.c | 109 ++++++++++++++++++++++++++++++++++++
>> 2 files changed, 118 insertions(+)
>>
>> diff --git a/include/block/thread-pool.h b/include/block/thread-pool.h
>> index 6f27eb085b45..3f9f66307b65 100644
>> --- a/include/block/thread-pool.h
>> +++ b/include/block/thread-pool.h
>> @@ -38,5 +38,14 @@ BlockAIOCB *thread_pool_submit_aio(ThreadPoolFunc *func, void *arg,
>> int coroutine_fn thread_pool_submit_co(ThreadPoolFunc *func, void *arg);
>> void thread_pool_update_params(ThreadPoolAio *pool, struct AioContext *ctx);
>>
>> +typedef struct ThreadPool ThreadPool;
>> +
>> +ThreadPool *thread_pool_new(void);
>> +void thread_pool_free(ThreadPool *pool);
>> +void thread_pool_submit(ThreadPool *pool, ThreadPoolFunc *func,
>> + void *opaque, GDestroyNotify opaque_destroy);
>> +void thread_pool_wait(ThreadPool *pool);
>> +bool thread_pool_set_max_threads(ThreadPool *pool, int max_threads);
>> +bool thread_pool_adjust_max_threads_to_work(ThreadPool *pool);
>>
>> #endif
>> diff --git a/util/thread-pool.c b/util/thread-pool.c
>> index 908194dc070f..d80c4181c897 100644
>> --- a/util/thread-pool.c
>> +++ b/util/thread-pool.c
>> @@ -374,3 +374,112 @@ void thread_pool_free_aio(ThreadPoolAio *pool)
>> qemu_mutex_destroy(&pool->lock);
>> g_free(pool);
>> }
>> +
>> +struct ThreadPool { /* type safety */
>> + GThreadPool *t;
>> + size_t unfinished_el_ctr;
>> + QemuMutex unfinished_el_ctr_mutex;
>> + QemuCond unfinished_el_ctr_zero_cond;
>> +};
>> +
>> +typedef struct {
>> + ThreadPoolFunc *func;
>> + void *opaque;
>> + GDestroyNotify opaque_destroy;
>> +} ThreadPoolElement;
>> +
>> +static void thread_pool_func(gpointer data, gpointer user_data)
>> +{
>> + ThreadPool *pool = user_data;
>> + g_autofree ThreadPoolElement *el = data;
>> +
>> + el->func(el->opaque);
>> +
>> + if (el->opaque_destroy) {
>> + el->opaque_destroy(el->opaque);
>> + }
>> +
>> + QEMU_LOCK_GUARD(&pool->unfinished_el_ctr_mutex);
>> +
>> + assert(pool->unfinished_el_ctr > 0);
>> + pool->unfinished_el_ctr--;
>> +
>> + if (pool->unfinished_el_ctr == 0) {
>> + qemu_cond_signal(&pool->unfinished_el_ctr_zero_cond);
>> + }
>> +}
>> +
>> +ThreadPool *thread_pool_new(void)
>> +{
>> + ThreadPool *pool = g_new(ThreadPool, 1);
>> +
>> + pool->unfinished_el_ctr = 0;
>> + qemu_mutex_init(&pool->unfinished_el_ctr_mutex);
>> + qemu_cond_init(&pool->unfinished_el_ctr_zero_cond);
>> +
>> + pool->t = g_thread_pool_new(thread_pool_func, pool, 0, TRUE, NULL);
>> + /*
>> + * g_thread_pool_new() can only return errors if initial thread(s)
>> + * creation fails but we ask for 0 initial threads above.
>> + */
>> + assert(pool->t);
>> +
>> + return pool;
>> +}
>> +
>> +void thread_pool_free(ThreadPool *pool)
>> +{
>> + g_thread_pool_free(pool->t, FALSE, TRUE);
>
> Should we make it an error to call thread_pool_free() without first
> calling thread_pool_wait()? I worry the current usage will lead to having
> two different ways of waiting, with one of them (this one) being quite
> implicit.
>
thread_pool_wait() can be used as a barrier between two sets of
tasks executed on a thread pool without destroying it, or in a
performance-sensitive path where we just want to wait for task
completion while deferring the free operation to a later, less
sensitive time.
I don't think requiring an explicit thread_pool_wait() before
thread_pool_free() actually gives any advantage, while at the same
time it would make this API slightly more complex to use in cases
where the consumer is fine with having combined wait+free semantics
for thread_pool_free().
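For example (a usage sketch; do_phase1_work, do_phase2_work and the
data pointers are made up for illustration, not anything from this
series):

ThreadPool *pool = thread_pool_new();

/* phase 1 */
thread_pool_submit(pool, do_phase1_work, data1, NULL);
thread_pool_wait(pool);    /* barrier: phase 1 fully finished */

/* phase 2, reusing the same pool */
thread_pool_submit(pool, do_phase2_work, data2, NULL);
thread_pool_wait(pool);    /* hot path: only wait for completion... */

/* ...and tear the pool down later, outside the hot path */
thread_pool_free(pool);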
Thanks,
Maciej
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 04/24] thread-pool: Implement generic (non-AIO) pool support
2024-11-25 19:55 ` Maciej S. Szmigiero
@ 2024-11-25 20:51 ` Fabiano Rosas
2024-11-26 19:25 ` Cédric Le Goater
1 sibling, 0 replies; 140+ messages in thread
From: Fabiano Rosas @ 2024-11-25 20:51 UTC (permalink / raw)
To: Maciej S. Szmigiero
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel, Peter Xu
"Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:
> On 25.11.2024 20:41, Fabiano Rosas wrote:
>> "Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:
>>
>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>
>>> Migration code wants to manage device data sending threads in one place.
>>>
>>> QEMU has an existing thread pool implementation, however it is limited
>>> to queuing AIO operations only and essentially has a 1:1 mapping between
>>> the current AioContext and the AIO ThreadPool in use.
>>>
>>> Implement generic (non-AIO) ThreadPool by essentially wrapping Glib's
>>> GThreadPool.
>>>
>>> This brings a few new operations on a pool:
>>> * thread_pool_wait() operation waits until all the submitted work requests
>>> have finished.
>>>
>>> * thread_pool_set_max_threads() explicitly sets the maximum thread count
>>> in the pool.
>>>
>>> * thread_pool_adjust_max_threads_to_work() adjusts the maximum thread count
>>> in the pool to equal the number of work items still queued or unfinished.
>>>
>>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>>> ---
>>> include/block/thread-pool.h | 9 +++
>>> util/thread-pool.c | 109 ++++++++++++++++++++++++++++++++++++
>>> 2 files changed, 118 insertions(+)
>>>
>>> diff --git a/include/block/thread-pool.h b/include/block/thread-pool.h
>>> index 6f27eb085b45..3f9f66307b65 100644
>>> --- a/include/block/thread-pool.h
>>> +++ b/include/block/thread-pool.h
>>> @@ -38,5 +38,14 @@ BlockAIOCB *thread_pool_submit_aio(ThreadPoolFunc *func, void *arg,
>>> int coroutine_fn thread_pool_submit_co(ThreadPoolFunc *func, void *arg);
>>> void thread_pool_update_params(ThreadPoolAio *pool, struct AioContext *ctx);
>>>
>>> +typedef struct ThreadPool ThreadPool;
>>> +
>>> +ThreadPool *thread_pool_new(void);
>>> +void thread_pool_free(ThreadPool *pool);
>>> +void thread_pool_submit(ThreadPool *pool, ThreadPoolFunc *func,
>>> + void *opaque, GDestroyNotify opaque_destroy);
>>> +void thread_pool_wait(ThreadPool *pool);
>>> +bool thread_pool_set_max_threads(ThreadPool *pool, int max_threads);
>>> +bool thread_pool_adjust_max_threads_to_work(ThreadPool *pool);
>>>
>>> #endif
>>> diff --git a/util/thread-pool.c b/util/thread-pool.c
>>> index 908194dc070f..d80c4181c897 100644
>>> --- a/util/thread-pool.c
>>> +++ b/util/thread-pool.c
>>> @@ -374,3 +374,112 @@ void thread_pool_free_aio(ThreadPoolAio *pool)
>>> qemu_mutex_destroy(&pool->lock);
>>> g_free(pool);
>>> }
>>> +
>>> +struct ThreadPool { /* type safety */
>>> + GThreadPool *t;
>>> + size_t unfinished_el_ctr;
>>> + QemuMutex unfinished_el_ctr_mutex;
>>> + QemuCond unfinished_el_ctr_zero_cond;
>>> +};
>>> +
>>> +typedef struct {
>>> + ThreadPoolFunc *func;
>>> + void *opaque;
>>> + GDestroyNotify opaque_destroy;
>>> +} ThreadPoolElement;
>>> +
>>> +static void thread_pool_func(gpointer data, gpointer user_data)
>>> +{
>>> + ThreadPool *pool = user_data;
>>> + g_autofree ThreadPoolElement *el = data;
>>> +
>>> + el->func(el->opaque);
>>> +
>>> + if (el->opaque_destroy) {
>>> + el->opaque_destroy(el->opaque);
>>> + }
>>> +
>>> + QEMU_LOCK_GUARD(&pool->unfinished_el_ctr_mutex);
>>> +
>>> + assert(pool->unfinished_el_ctr > 0);
>>> + pool->unfinished_el_ctr--;
>>> +
>>> + if (pool->unfinished_el_ctr == 0) {
>>> + qemu_cond_signal(&pool->unfinished_el_ctr_zero_cond);
>>> + }
>>> +}
>>> +
>>> +ThreadPool *thread_pool_new(void)
>>> +{
>>> + ThreadPool *pool = g_new(ThreadPool, 1);
>>> +
>>> + pool->unfinished_el_ctr = 0;
>>> + qemu_mutex_init(&pool->unfinished_el_ctr_mutex);
>>> + qemu_cond_init(&pool->unfinished_el_ctr_zero_cond);
>>> +
>>> + pool->t = g_thread_pool_new(thread_pool_func, pool, 0, TRUE, NULL);
>>> + /*
>>> + * g_thread_pool_new() can only return errors if initial thread(s)
>>> + * creation fails but we ask for 0 initial threads above.
>>> + */
>>> + assert(pool->t);
>>> +
>>> + return pool;
>>> +}
>>> +
>>> +void thread_pool_free(ThreadPool *pool)
>>> +{
>>> + g_thread_pool_free(pool->t, FALSE, TRUE);
>>
>> Should we make it an error to call thread_pool_free() without first
>> calling thread_pool_wait()? I worry the current usage will lead to having
>> two different ways of waiting, with one of them (this one) being quite
>> implicit.
>>
>
> thread_pool_wait() can be used as a barrier between two sets of
> tasks executed on a thread pool without destroying it, or in a
> performance-sensitive path where we just want to wait for task
> completion while deferring the free operation to a later, less
> sensitive time.
>
> I don't think requiring an explicit thread_pool_wait() before
> thread_pool_free() actually gives any advantage, while at the same
> time it would make this API slightly more complex to use in cases
> where the consumer is fine with having combined wait+free semantics
> for thread_pool_free().
Fair enough,
Reviewed-by: Fabiano Rosas <farosas@suse.de>
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 04/24] thread-pool: Implement generic (non-AIO) pool support
2024-11-25 19:55 ` Maciej S. Szmigiero
2024-11-25 20:51 ` Fabiano Rosas
@ 2024-11-26 19:25 ` Cédric Le Goater
2024-11-26 21:21 ` Maciej S. Szmigiero
1 sibling, 1 reply; 140+ messages in thread
From: Cédric Le Goater @ 2024-11-26 19:25 UTC (permalink / raw)
To: Maciej S. Szmigiero, Fabiano Rosas
Cc: Alex Williamson, Eric Blake, Markus Armbruster,
Daniel P . Berrangé, Avihai Horon, Joao Martins, qemu-devel,
Peter Xu
On 11/25/24 20:55, Maciej S. Szmigiero wrote:
> On 25.11.2024 20:41, Fabiano Rosas wrote:
>> "Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:
>>
>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>
>>> Migration code wants to manage device data sending threads in one place.
>>>
>>> QEMU has an existing thread pool implementation, however it is limited
>>> to queuing AIO operations only and essentially has a 1:1 mapping between
>>> the current AioContext and the AIO ThreadPool in use.
>>>
>>> Implement generic (non-AIO) ThreadPool by essentially wrapping Glib's
>>> GThreadPool.
>>>
>>> This brings a few new operations on a pool:
>>> * thread_pool_wait() operation waits until all the submitted work requests
>>> have finished.
>>>
>>> * thread_pool_set_max_threads() explicitly sets the maximum thread count
>>> in the pool.
>>>
>>> * thread_pool_adjust_max_threads_to_work() adjusts the maximum thread count
>>> in the pool to equal the number of work items still queued or unfinished.
>>>
>>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>>> ---
>>> include/block/thread-pool.h | 9 +++
>>> util/thread-pool.c | 109 ++++++++++++++++++++++++++++++++++++
>>> 2 files changed, 118 insertions(+)
>>>
>>> diff --git a/include/block/thread-pool.h b/include/block/thread-pool.h
>>> index 6f27eb085b45..3f9f66307b65 100644
>>> --- a/include/block/thread-pool.h
>>> +++ b/include/block/thread-pool.h
>>> @@ -38,5 +38,14 @@ BlockAIOCB *thread_pool_submit_aio(ThreadPoolFunc *func, void *arg,
>>> int coroutine_fn thread_pool_submit_co(ThreadPoolFunc *func, void *arg);
>>> void thread_pool_update_params(ThreadPoolAio *pool, struct AioContext *ctx);
>>> +typedef struct ThreadPool ThreadPool;
>>> +
>>> +ThreadPool *thread_pool_new(void);
>>> +void thread_pool_free(ThreadPool *pool);
>>> +void thread_pool_submit(ThreadPool *pool, ThreadPoolFunc *func,
>>> + void *opaque, GDestroyNotify opaque_destroy);
>>> +void thread_pool_wait(ThreadPool *pool);
>>> +bool thread_pool_set_max_threads(ThreadPool *pool, int max_threads);
>>> +bool thread_pool_adjust_max_threads_to_work(ThreadPool *pool);
>>> #endif
>>> diff --git a/util/thread-pool.c b/util/thread-pool.c
>>> index 908194dc070f..d80c4181c897 100644
>>> --- a/util/thread-pool.c
>>> +++ b/util/thread-pool.c
>>> @@ -374,3 +374,112 @@ void thread_pool_free_aio(ThreadPoolAio *pool)
>>> qemu_mutex_destroy(&pool->lock);
>>> g_free(pool);
>>> }
>>> +
>>> +struct ThreadPool { /* type safety */
>>> + GThreadPool *t;
>>> + size_t unfinished_el_ctr;
>>> + QemuMutex unfinished_el_ctr_mutex;
>>> + QemuCond unfinished_el_ctr_zero_cond;
>>> +};
>>> +
>>> +typedef struct {
>>> + ThreadPoolFunc *func;
>>> + void *opaque;
>>> + GDestroyNotify opaque_destroy;
>>> +} ThreadPoolElement;
>>> +
>>> +static void thread_pool_func(gpointer data, gpointer user_data)
>>> +{
>>> + ThreadPool *pool = user_data;
>>> + g_autofree ThreadPoolElement *el = data;
>>> +
>>> + el->func(el->opaque);
>>> +
>>> + if (el->opaque_destroy) {
>>> + el->opaque_destroy(el->opaque);
>>> + }
>>> +
>>> + QEMU_LOCK_GUARD(&pool->unfinished_el_ctr_mutex);
>>> +
>>> + assert(pool->unfinished_el_ctr > 0);
>>> + pool->unfinished_el_ctr--;
>>> +
>>> + if (pool->unfinished_el_ctr == 0) {
>>> + qemu_cond_signal(&pool->unfinished_el_ctr_zero_cond);
>>> + }
>>> +}
>>> +
>>> +ThreadPool *thread_pool_new(void)
>>> +{
>>> + ThreadPool *pool = g_new(ThreadPool, 1);
>>> +
>>> + pool->unfinished_el_ctr = 0;
>>> + qemu_mutex_init(&pool->unfinished_el_ctr_mutex);
>>> + qemu_cond_init(&pool->unfinished_el_ctr_zero_cond);
>>> +
>>> + pool->t = g_thread_pool_new(thread_pool_func, pool, 0, TRUE, NULL);
>>> + /*
>>> + * g_thread_pool_new() can only return errors if initial thread(s)
>>> + * creation fails but we ask for 0 initial threads above.
>>> + */
>>> + assert(pool->t);
>>> +
>>> + return pool;
>>> +}
>>> +
>>> +void thread_pool_free(ThreadPool *pool)
>>> +{
>>> + g_thread_pool_free(pool->t, FALSE, TRUE);
>>
>> Should we make it an error to call thread_pool_free() without first
>> calling thread_pool_wait()? I worry the current usage will lead to having
>> two different ways of waiting, with one of them (this one) being quite
>> implicit.
>>
>
> thread_pool_wait() can be used as a barrier between two sets of
> tasks executed on a thread pool without destroying it, or in a
> performance-sensitive path where we just want to wait for task
> completion while deferring the free operation to a later, less
> sensitive time.
A comment above g_thread_pool_free() would be good to have since
the wait_ argument is TRUE and g_thread_pool_free() effectively
waits for all threads to complete.
Thanks,
C.
>
> I don't think requiring an explicit thread_pool_wait() before
> thread_pool_free() actually gives any advantage, while at the same
> time it would make this API slightly more complex to use in cases
> where the consumer is fine with having combined wait+free semantics
> for thread_pool_free().
>
> Thanks,
> Maciej
>
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 04/24] thread-pool: Implement generic (non-AIO) pool support
2024-11-26 19:25 ` Cédric Le Goater
@ 2024-11-26 21:21 ` Maciej S. Szmigiero
0 siblings, 0 replies; 140+ messages in thread
From: Maciej S. Szmigiero @ 2024-11-26 21:21 UTC (permalink / raw)
To: Cédric Le Goater
Cc: Alex Williamson, Fabiano Rosas, Eric Blake, Markus Armbruster,
Daniel P . Berrangé, Avihai Horon, Joao Martins, qemu-devel,
Peter Xu
On 26.11.2024 20:25, Cédric Le Goater wrote:
> On 11/25/24 20:55, Maciej S. Szmigiero wrote:
>> On 25.11.2024 20:41, Fabiano Rosas wrote:
>>> "Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:
>>>
>>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>>
>>>> Migration code wants to manage device data sending threads in one place.
>>>>
>>>> QEMU has an existing thread pool implementation, however it is limited
>>>> to queuing AIO operations only and essentially has a 1:1 mapping between
>>>> the current AioContext and the AIO ThreadPool in use.
>>>>
>>>> Implement generic (non-AIO) ThreadPool by essentially wrapping Glib's
>>>> GThreadPool.
>>>>
>>>> This brings a few new operations on a pool:
>>>> * thread_pool_wait() operation waits until all the submitted work requests
>>>> have finished.
>>>>
>>>> * thread_pool_set_max_threads() explicitly sets the maximum thread count
>>>> in the pool.
>>>>
>>>> * thread_pool_adjust_max_threads_to_work() adjusts the maximum thread count
>>>> in the pool to equal the number of work items still queued or unfinished.
>>>>
>>>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>>>> ---
>>>> include/block/thread-pool.h | 9 +++
>>>> util/thread-pool.c | 109 ++++++++++++++++++++++++++++++++++++
>>>> 2 files changed, 118 insertions(+)
>>>>
>>>> diff --git a/include/block/thread-pool.h b/include/block/thread-pool.h
>>>> index 6f27eb085b45..3f9f66307b65 100644
>>>> --- a/include/block/thread-pool.h
>>>> +++ b/include/block/thread-pool.h
>>>> @@ -38,5 +38,14 @@ BlockAIOCB *thread_pool_submit_aio(ThreadPoolFunc *func, void *arg,
>>>> int coroutine_fn thread_pool_submit_co(ThreadPoolFunc *func, void *arg);
>>>> void thread_pool_update_params(ThreadPoolAio *pool, struct AioContext *ctx);
>>>> +typedef struct ThreadPool ThreadPool;
>>>> +
>>>> +ThreadPool *thread_pool_new(void);
>>>> +void thread_pool_free(ThreadPool *pool);
>>>> +void thread_pool_submit(ThreadPool *pool, ThreadPoolFunc *func,
>>>> + void *opaque, GDestroyNotify opaque_destroy);
>>>> +void thread_pool_wait(ThreadPool *pool);
>>>> +bool thread_pool_set_max_threads(ThreadPool *pool, int max_threads);
>>>> +bool thread_pool_adjust_max_threads_to_work(ThreadPool *pool);
>>>> #endif
>>>> diff --git a/util/thread-pool.c b/util/thread-pool.c
>>>> index 908194dc070f..d80c4181c897 100644
>>>> --- a/util/thread-pool.c
>>>> +++ b/util/thread-pool.c
>>>> @@ -374,3 +374,112 @@ void thread_pool_free_aio(ThreadPoolAio *pool)
>>>> qemu_mutex_destroy(&pool->lock);
>>>> g_free(pool);
>>>> }
>>>> +
>>>> +struct ThreadPool { /* type safety */
>>>> + GThreadPool *t;
>>>> + size_t unfinished_el_ctr;
>>>> + QemuMutex unfinished_el_ctr_mutex;
>>>> + QemuCond unfinished_el_ctr_zero_cond;
>>>> +};
>>>> +
>>>> +typedef struct {
>>>> + ThreadPoolFunc *func;
>>>> + void *opaque;
>>>> + GDestroyNotify opaque_destroy;
>>>> +} ThreadPoolElement;
>>>> +
>>>> +static void thread_pool_func(gpointer data, gpointer user_data)
>>>> +{
>>>> + ThreadPool *pool = user_data;
>>>> + g_autofree ThreadPoolElement *el = data;
>>>> +
>>>> + el->func(el->opaque);
>>>> +
>>>> + if (el->opaque_destroy) {
>>>> + el->opaque_destroy(el->opaque);
>>>> + }
>>>> +
>>>> + QEMU_LOCK_GUARD(&pool->unfinished_el_ctr_mutex);
>>>> +
>>>> + assert(pool->unfinished_el_ctr > 0);
>>>> + pool->unfinished_el_ctr--;
>>>> +
>>>> + if (pool->unfinished_el_ctr == 0) {
>>>> + qemu_cond_signal(&pool->unfinished_el_ctr_zero_cond);
>>>> + }
>>>> +}
>>>> +
>>>> +ThreadPool *thread_pool_new(void)
>>>> +{
>>>> + ThreadPool *pool = g_new(ThreadPool, 1);
>>>> +
>>>> + pool->unfinished_el_ctr = 0;
>>>> + qemu_mutex_init(&pool->unfinished_el_ctr_mutex);
>>>> + qemu_cond_init(&pool->unfinished_el_ctr_zero_cond);
>>>> +
>>>> + pool->t = g_thread_pool_new(thread_pool_func, pool, 0, TRUE, NULL);
>>>> + /*
>>>> + * g_thread_pool_new() can only return errors if initial thread(s)
>>>> + * creation fails but we ask for 0 initial threads above.
>>>> + */
>>>> + assert(pool->t);
>>>> +
>>>> + return pool;
>>>> +}
>>>> +
>>>> +void thread_pool_free(ThreadPool *pool)
>>>> +{
>>>> + g_thread_pool_free(pool->t, FALSE, TRUE);
>>>
>>> Should we make it an error to call thread_pool_free() without first
>>> calling thread_pool_wait()? I worry the current usage will lead to having
>>> two different ways of waiting, with one of them (this one) being quite
>>> implicit.
>>>
>>
>> thread_pool_wait() can be used as a barrier between two sets of
>> tasks executed on a thread pool without destroying it, or in a
>> performance-sensitive path where we just want to wait for task
>> completion while deferring the free operation to a later, less
>> sensitive time.
>
> A comment above g_thread_pool_free() would be good to have since
> the wait_ argument is TRUE and g_thread_pool_free() effectively
> waits for all threads to complete.
>
Will add an appropriate comment there.
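Something along these lines, perhaps (exact wording TBD):

void thread_pool_free(ThreadPool *pool)
{
    /*
     * Setting the "wait_" argument to TRUE makes g_thread_pool_free()
     * wait for all the remaining work to finish first, so an explicit
     * thread_pool_wait() beforehand isn't required here.
     */
    g_thread_pool_free(pool->t, FALSE, TRUE);

    qemu_cond_destroy(&pool->unfinished_el_ctr_zero_cond);
    qemu_mutex_destroy(&pool->unfinished_el_ctr_mutex);

    g_free(pool);
}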
> Thanks,
>
> C.
>
Thanks,
Maciej
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 04/24] thread-pool: Implement generic (non-AIO) pool support
2024-11-17 19:19 ` [PATCH v3 04/24] thread-pool: Implement generic (non-AIO) pool support Maciej S. Szmigiero
2024-11-25 19:41 ` Fabiano Rosas
@ 2024-11-26 19:29 ` Cédric Le Goater
2024-11-26 21:22 ` Maciej S. Szmigiero
2024-11-28 10:08 ` Avihai Horon
2024-12-04 20:04 ` Peter Xu
3 siblings, 1 reply; 140+ messages in thread
From: Cédric Le Goater @ 2024-11-26 19:29 UTC (permalink / raw)
To: Maciej S. Szmigiero, Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Eric Blake, Markus Armbruster,
Daniel P . Berrangé, Avihai Horon, Joao Martins, qemu-devel
On 11/17/24 20:19, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>
> Migration code wants to manage device data sending threads in one place.
>
> QEMU has an existing thread pool implementation, however it is limited
> to queuing AIO operations only and essentially has a 1:1 mapping between
> the current AioContext and the AIO ThreadPool in use.
>
> Implement generic (non-AIO) ThreadPool by essentially wrapping Glib's
> GThreadPool.
>
> This brings a few new operations on a pool:
> * thread_pool_wait() operation waits until all the submitted work requests
> have finished.
>
> * thread_pool_set_max_threads() explicitly sets the maximum thread count
> in the pool.
>
> * thread_pool_adjust_max_threads_to_work() adjusts the maximum thread count
> in the pool to equal the number of work items still queued or unfinished.
>
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> ---
> include/block/thread-pool.h | 9 +++
> util/thread-pool.c | 109 ++++++++++++++++++++++++++++++++++++
> 2 files changed, 118 insertions(+)
>
> diff --git a/include/block/thread-pool.h b/include/block/thread-pool.h
> index 6f27eb085b45..3f9f66307b65 100644
> --- a/include/block/thread-pool.h
> +++ b/include/block/thread-pool.h
> @@ -38,5 +38,14 @@ BlockAIOCB *thread_pool_submit_aio(ThreadPoolFunc *func, void *arg,
> int coroutine_fn thread_pool_submit_co(ThreadPoolFunc *func, void *arg);
> void thread_pool_update_params(ThreadPoolAio *pool, struct AioContext *ctx);
>
> +typedef struct ThreadPool ThreadPool;
> +
> +ThreadPool *thread_pool_new(void);
> +void thread_pool_free(ThreadPool *pool);
> +void thread_pool_submit(ThreadPool *pool, ThreadPoolFunc *func,
> + void *opaque, GDestroyNotify opaque_destroy);
> +void thread_pool_wait(ThreadPool *pool);
> +bool thread_pool_set_max_threads(ThreadPool *pool, int max_threads);
> +bool thread_pool_adjust_max_threads_to_work(ThreadPool *pool);
We should add documentation for these routines.
> #endif
> diff --git a/util/thread-pool.c b/util/thread-pool.c
> index 908194dc070f..d80c4181c897 100644
> --- a/util/thread-pool.c
> +++ b/util/thread-pool.c
> @@ -374,3 +374,112 @@ void thread_pool_free_aio(ThreadPoolAio *pool)
> qemu_mutex_destroy(&pool->lock);
> g_free(pool);
> }
> +
> +struct ThreadPool { /* type safety */
> + GThreadPool *t;
> + size_t unfinished_el_ctr;
> + QemuMutex unfinished_el_ctr_mutex;
> + QemuCond unfinished_el_ctr_zero_cond;
> +};
I find the naming of the attributes a little confusing. Could we
use names similar to ThreadPoolAio? Something like:
struct ThreadPool { /* type safety */
GThreadPool *t;
int cur_threads;
QemuMutex lock;
QemuCond finished_cond;
};
> +
> +typedef struct {
> + ThreadPoolFunc *func;
> + void *opaque;
> + GDestroyNotify opaque_destroy;
> +} ThreadPoolElement;
> +
> +static void thread_pool_func(gpointer data, gpointer user_data)
> +{
> + ThreadPool *pool = user_data;
> + g_autofree ThreadPoolElement *el = data;
> +
> + el->func(el->opaque);
> +
> + if (el->opaque_destroy) {
> + el->opaque_destroy(el->opaque);
> + }
> +
> + QEMU_LOCK_GUARD(&pool->unfinished_el_ctr_mutex);
> +
> + assert(pool->unfinished_el_ctr > 0);
> + pool->unfinished_el_ctr--;
> +
> + if (pool->unfinished_el_ctr == 0) {
> + qemu_cond_signal(&pool->unfinished_el_ctr_zero_cond);
> + }
> +}
> +
> +ThreadPool *thread_pool_new(void)
> +{
> + ThreadPool *pool = g_new(ThreadPool, 1);
> +
> + pool->unfinished_el_ctr = 0;
> + qemu_mutex_init(&pool->unfinished_el_ctr_mutex);
> + qemu_cond_init(&pool->unfinished_el_ctr_zero_cond);
> +
> + pool->t = g_thread_pool_new(thread_pool_func, pool, 0, TRUE, NULL);
> + /*
> + * g_thread_pool_new() can only return errors if initial thread(s)
> + * creation fails but we ask for 0 initial threads above.
> + */
> + assert(pool->t);
> +
> + return pool;
> +}
> +
> +void thread_pool_free(ThreadPool *pool)
> +{
> + g_thread_pool_free(pool->t, FALSE, TRUE);
> +
> + qemu_cond_destroy(&pool->unfinished_el_ctr_zero_cond);
> + qemu_mutex_destroy(&pool->unfinished_el_ctr_mutex);
> +
> + g_free(pool);
> +}
> +
> +void thread_pool_submit(ThreadPool *pool, ThreadPoolFunc *func,
> + void *opaque, GDestroyNotify opaque_destroy)
> +{
> + ThreadPoolElement *el = g_new(ThreadPoolElement, 1);
Where are the ThreadPool elements freed? I am missing something,
maybe.
Thanks,
C.
> +
> + el->func = func;
> + el->opaque = opaque;
> + el->opaque_destroy = opaque_destroy;
> +
> + WITH_QEMU_LOCK_GUARD(&pool->unfinished_el_ctr_mutex) {
> + pool->unfinished_el_ctr++;
> + }
> +
> + /*
> + * Ignore the return value since this function can only return errors
> + * if creation of an additional thread fails but even in this case the
> + * provided work is still getting queued (just for the existing threads).
> + */
> + g_thread_pool_push(pool->t, el, NULL);
> +}
> +
> +void thread_pool_wait(ThreadPool *pool)
> +{
> + QEMU_LOCK_GUARD(&pool->unfinished_el_ctr_mutex);
> +
> + if (pool->unfinished_el_ctr > 0) {
> + qemu_cond_wait(&pool->unfinished_el_ctr_zero_cond,
> + &pool->unfinished_el_ctr_mutex);
> + assert(pool->unfinished_el_ctr == 0);
> + }
> +}
> +
> +bool thread_pool_set_max_threads(ThreadPool *pool,
> + int max_threads)
> +{
> + assert(max_threads > 0);
> +
> + return g_thread_pool_set_max_threads(pool->t, max_threads, NULL);
> +}
> +
> +bool thread_pool_adjust_max_threads_to_work(ThreadPool *pool)
> +{
> + QEMU_LOCK_GUARD(&pool->unfinished_el_ctr_mutex);
> +
> + return thread_pool_set_max_threads(pool, pool->unfinished_el_ctr);
> +}
>
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 04/24] thread-pool: Implement generic (non-AIO) pool support
2024-11-26 19:29 ` Cédric Le Goater
@ 2024-11-26 21:22 ` Maciej S. Szmigiero
2024-12-05 13:10 ` Cédric Le Goater
0 siblings, 1 reply; 140+ messages in thread
From: Maciej S. Szmigiero @ 2024-11-26 21:22 UTC (permalink / raw)
To: Cédric Le Goater
Cc: Alex Williamson, Fabiano Rosas, Peter Xu, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On 26.11.2024 20:29, Cédric Le Goater wrote:
> On 11/17/24 20:19, Maciej S. Szmigiero wrote:
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> Migration code wants to manage device data sending threads in one place.
>>
>> QEMU has an existing thread pool implementation, however it is limited
>> to queuing AIO operations only and essentially has a 1:1 mapping between
>> the current AioContext and the AIO ThreadPool in use.
>>
>> Implement generic (non-AIO) ThreadPool by essentially wrapping Glib's
>> GThreadPool.
>>
>> This brings a few new operations on a pool:
>> * thread_pool_wait() operation waits until all the submitted work requests
>> have finished.
>>
>> * thread_pool_set_max_threads() explicitly sets the maximum thread count
>> in the pool.
>>
>> * thread_pool_adjust_max_threads_to_work() adjusts the maximum thread count
>> in the pool to equal the number of work items still queued or unfinished.
>>
>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>> ---
>> include/block/thread-pool.h | 9 +++
>> util/thread-pool.c | 109 ++++++++++++++++++++++++++++++++++++
>> 2 files changed, 118 insertions(+)
>>
>> diff --git a/include/block/thread-pool.h b/include/block/thread-pool.h
>> index 6f27eb085b45..3f9f66307b65 100644
>> --- a/include/block/thread-pool.h
>> +++ b/include/block/thread-pool.h
>> @@ -38,5 +38,14 @@ BlockAIOCB *thread_pool_submit_aio(ThreadPoolFunc *func, void *arg,
>> int coroutine_fn thread_pool_submit_co(ThreadPoolFunc *func, void *arg);
>> void thread_pool_update_params(ThreadPoolAio *pool, struct AioContext *ctx);
>> +typedef struct ThreadPool ThreadPool;
>> +
>> +ThreadPool *thread_pool_new(void);
>> +void thread_pool_free(ThreadPool *pool);
>> +void thread_pool_submit(ThreadPool *pool, ThreadPoolFunc *func,
>> + void *opaque, GDestroyNotify opaque_destroy);
>> +void thread_pool_wait(ThreadPool *pool);
>> +bool thread_pool_set_max_threads(ThreadPool *pool, int max_threads);
>> +bool thread_pool_adjust_max_threads_to_work(ThreadPool *pool);
>
> We should add documentation for these routines.
Ack.
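E.g. a rough sketch of the kind of header comments I have in mind
(final wording TBD):

/*
 * thread_pool_submit:
 * @pool: pool to submit the work request to
 * @func: function to execute on a pool thread
 * @opaque: data passed to @func
 * @opaque_destroy: optional destructor called on @opaque after @func
 *                  has returned, may be NULL
 *
 * Submits a new work request to the pool.
 */
void thread_pool_submit(ThreadPool *pool, ThreadPoolFunc *func,
                        void *opaque, GDestroyNotify opaque_destroy);

/*
 * thread_pool_wait:
 * @pool: pool to wait on
 *
 * Waits until all previously submitted work requests have finished.
 */
void thread_pool_wait(ThreadPool *pool);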
>> #endif
>> diff --git a/util/thread-pool.c b/util/thread-pool.c
>> index 908194dc070f..d80c4181c897 100644
>> --- a/util/thread-pool.c
>> +++ b/util/thread-pool.c
>> @@ -374,3 +374,112 @@ void thread_pool_free_aio(ThreadPoolAio *pool)
>> qemu_mutex_destroy(&pool->lock);
>> g_free(pool);
>> }
>> +
>> +struct ThreadPool { /* type safety */
>> + GThreadPool *t;
>> + size_t unfinished_el_ctr;
>> + QemuMutex unfinished_el_ctr_mutex;
>> + QemuCond unfinished_el_ctr_zero_cond;
>> +};
>
>
> I find the naming of the attributes a little confusing. Could we
> use names similar to ThreadPoolAio? Something like:
>
> struct ThreadPool { /* type safety */
> GThreadPool *t;
> int cur_threads;
"cur_work" would probably be more accurate since the code that
decrements this counter is still running inside a worker thread
so by the time this reaches zero technically there are still
threads running.
> QemuMutex lock;
This lock only protects the counter above, not the rest of the
structure, so I guess "cur_work_lock" would be more accurate.
> QemuCond finished_cond;
I would go for "all_finished_cond", since it's only signaled once
all of the work is finished (the counter above reaches zero).
> };
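Putting these together, the struct would then look something like
this (names still tentative):

struct ThreadPool { /* type safety */
    GThreadPool *t;
    size_t cur_work;
    QemuMutex cur_work_lock;
    QemuCond all_finished_cond;
};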
>
>
>
>> +
>> +typedef struct {
>> + ThreadPoolFunc *func;
>> + void *opaque;
>> + GDestroyNotify opaque_destroy;
>> +} ThreadPoolElement;
>> +
>> +static void thread_pool_func(gpointer data, gpointer user_data)
>> +{
>> + ThreadPool *pool = user_data;
>> + g_autofree ThreadPoolElement *el = data;
>> +
>> + el->func(el->opaque);
>> +
>> + if (el->opaque_destroy) {
>> + el->opaque_destroy(el->opaque);
>> + }
>> +
>> + QEMU_LOCK_GUARD(&pool->unfinished_el_ctr_mutex);
>> +
>> + assert(pool->unfinished_el_ctr > 0);
>> + pool->unfinished_el_ctr--;
>> +
>> + if (pool->unfinished_el_ctr == 0) {
>> + qemu_cond_signal(&pool->unfinished_el_ctr_zero_cond);
>> + }
>> +}
>> +
>> +ThreadPool *thread_pool_new(void)
>> +{
>> + ThreadPool *pool = g_new(ThreadPool, 1);
>> +
>> + pool->unfinished_el_ctr = 0;
>> + qemu_mutex_init(&pool->unfinished_el_ctr_mutex);
>> + qemu_cond_init(&pool->unfinished_el_ctr_zero_cond);
>> +
>> + pool->t = g_thread_pool_new(thread_pool_func, pool, 0, TRUE, NULL);
>> + /*
>> + * g_thread_pool_new() can only return errors if initial thread(s)
>> + * creation fails but we ask for 0 initial threads above.
>> + */
>> + assert(pool->t);
>> +
>> + return pool;
>> +}
>> +
>> +void thread_pool_free(ThreadPool *pool)
>> +{
>> + g_thread_pool_free(pool->t, FALSE, TRUE);
>> +
>> + qemu_cond_destroy(&pool->unfinished_el_ctr_zero_cond);
>> + qemu_mutex_destroy(&pool->unfinished_el_ctr_mutex);
>> +
>> + g_free(pool);
>> +}
>> +
>> +void thread_pool_submit(ThreadPool *pool, ThreadPoolFunc *func,
>> + void *opaque, GDestroyNotify opaque_destroy)
>> +{
>> + ThreadPoolElement *el = g_new(ThreadPoolElement, 1);
>
> Where are the ThreadPool elements freed? I am missing something,
> maybe.
At the entry to thread_pool_func(), the initialization of the
automatic storage duration variable "ThreadPoolElement *el" takes
ownership of this object (RAII-style); since the variable is marked
g_autofree, the object is freed when the variable goes out of scope
(that is, when the function exits).
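A minimal standalone illustration of the pattern (example code, not
taken from the patch):

#include <glib.h>

static void take_ownership(gpointer data)
{
    /* initialization transfers ownership of the heap object to "el"... */
    g_autofree char *el = data;

    g_print("got: %s\n", el);
    /* ...and g_autofree frees it when "el" goes out of scope */
}

int main(void)
{
    take_ownership(g_strdup("hello"));   /* no matching g_free() needed */
    return 0;
}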
> Thanks,
>
> C.
Thanks,
Maciej
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 04/24] thread-pool: Implement generic (non-AIO) pool support
2024-11-26 21:22 ` Maciej S. Szmigiero
@ 2024-12-05 13:10 ` Cédric Le Goater
0 siblings, 0 replies; 140+ messages in thread
From: Cédric Le Goater @ 2024-12-05 13:10 UTC (permalink / raw)
To: Maciej S. Szmigiero
Cc: Alex Williamson, Fabiano Rosas, Peter Xu, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On 11/26/24 22:22, Maciej S. Szmigiero wrote:
> On 26.11.2024 20:29, Cédric Le Goater wrote:
>> On 11/17/24 20:19, Maciej S. Szmigiero wrote:
>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>
>>> Migration code wants to manage device data sending threads in one place.
>>>
>>> QEMU has an existing thread pool implementation, however it is limited
>>> to queuing AIO operations only and essentially has a 1:1 mapping between
>>> the current AioContext and the AIO ThreadPool in use.
>>>
>>> Implement generic (non-AIO) ThreadPool by essentially wrapping Glib's
>>> GThreadPool.
>>>
>>> This brings a few new operations on a pool:
>>> * thread_pool_wait() operation waits until all the submitted work requests
>>> have finished.
>>>
>>> * thread_pool_set_max_threads() explicitly sets the maximum thread count
>>> in the pool.
>>>
>>> * thread_pool_adjust_max_threads_to_work() adjusts the maximum thread count
>>> in the pool to equal the number of work items still queued or unfinished.
>>>
>>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>>> ---
>>> include/block/thread-pool.h | 9 +++
>>> util/thread-pool.c | 109 ++++++++++++++++++++++++++++++++++++
>>> 2 files changed, 118 insertions(+)
>>>
>>> diff --git a/include/block/thread-pool.h b/include/block/thread-pool.h
>>> index 6f27eb085b45..3f9f66307b65 100644
>>> --- a/include/block/thread-pool.h
>>> +++ b/include/block/thread-pool.h
>>> @@ -38,5 +38,14 @@ BlockAIOCB *thread_pool_submit_aio(ThreadPoolFunc *func, void *arg,
>>> int coroutine_fn thread_pool_submit_co(ThreadPoolFunc *func, void *arg);
>>> void thread_pool_update_params(ThreadPoolAio *pool, struct AioContext *ctx);
>>> +typedef struct ThreadPool ThreadPool;
>>> +
>>> +ThreadPool *thread_pool_new(void);
>>> +void thread_pool_free(ThreadPool *pool);
>>> +void thread_pool_submit(ThreadPool *pool, ThreadPoolFunc *func,
>>> + void *opaque, GDestroyNotify opaque_destroy);
>>> +void thread_pool_wait(ThreadPool *pool);
>>> +bool thread_pool_set_max_threads(ThreadPool *pool, int max_threads);
>>> +bool thread_pool_adjust_max_threads_to_work(ThreadPool *pool);
>>
>> We should add documentation for these routines.
>
> Ack.
>
>>> #endif
>>> diff --git a/util/thread-pool.c b/util/thread-pool.c
>>> index 908194dc070f..d80c4181c897 100644
>>> --- a/util/thread-pool.c
>>> +++ b/util/thread-pool.c
>>> @@ -374,3 +374,112 @@ void thread_pool_free_aio(ThreadPoolAio *pool)
>>> qemu_mutex_destroy(&pool->lock);
>>> g_free(pool);
>>> }
>>> +
>>> +struct ThreadPool { /* type safety */
>>> + GThreadPool *t;
>>> + size_t unfinished_el_ctr;
>>> + QemuMutex unfinished_el_ctr_mutex;
>>> + QemuCond unfinished_el_ctr_zero_cond;
>>> +};
>>
>>
>> I find the naming of the attributes a little confusing. Could we
>> use names similar to ThreadPoolAio? Something like:
>>
>> struct ThreadPool { /* type safety */
>> GThreadPool *t;
>> int cur_threads;
>
> "cur_work" would probably be more accurate since the code that
> decrements this counter is still running inside a worker thread
> so by the time this reaches zero technically there are still
> threads running.
>
>> QemuMutex lock;
>
> This lock only protects the counter above, not the rest of the
> structure, so I guess "cur_work_lock" would be more accurate.
>
>> QemuCond finished_cond;
>
> I would go for "all_finished_cond", since it's only signaled once
> all of the work is finished (the counter above reaches zero).
All good for me.
>
>> };
>>
>>
>>
>>> +
>>> +typedef struct {
>>> + ThreadPoolFunc *func;
>>> + void *opaque;
>>> + GDestroyNotify opaque_destroy;
>>> +} ThreadPoolElement;
>>> +
>>> +static void thread_pool_func(gpointer data, gpointer user_data)
>>> +{
>>> + ThreadPool *pool = user_data;
>>> + g_autofree ThreadPoolElement *el = data;
>>> +
>>> + el->func(el->opaque);
>>> +
>>> + if (el->opaque_destroy) {
>>> + el->opaque_destroy(el->opaque);
>>> + }
>>> +
>>> + QEMU_LOCK_GUARD(&pool->unfinished_el_ctr_mutex);
>>> +
>>> + assert(pool->unfinished_el_ctr > 0);
>>> + pool->unfinished_el_ctr--;
>>> +
>>> + if (pool->unfinished_el_ctr == 0) {
>>> + qemu_cond_signal(&pool->unfinished_el_ctr_zero_cond);
>>> + }
>>> +}
>>> +
>>> +ThreadPool *thread_pool_new(void)
>>> +{
>>> + ThreadPool *pool = g_new(ThreadPool, 1);
>>> +
>>> + pool->unfinished_el_ctr = 0;
>>> + qemu_mutex_init(&pool->unfinished_el_ctr_mutex);
>>> + qemu_cond_init(&pool->unfinished_el_ctr_zero_cond);
>>> +
>>> + pool->t = g_thread_pool_new(thread_pool_func, pool, 0, TRUE, NULL);
>>> + /*
>>> + * g_thread_pool_new() can only return errors if initial thread(s)
>>> + * creation fails but we ask for 0 initial threads above.
>>> + */
>>> + assert(pool->t);
>>> +
>>> + return pool;
>>> +}
>>> +
>>> +void thread_pool_free(ThreadPool *pool)
>>> +{
>>> + g_thread_pool_free(pool->t, FALSE, TRUE);
>>> +
>>> + qemu_cond_destroy(&pool->unfinished_el_ctr_zero_cond);
>>> + qemu_mutex_destroy(&pool->unfinished_el_ctr_mutex);
>>> +
>>> + g_free(pool);
>>> +}
>>> +
>>> +void thread_pool_submit(ThreadPool *pool, ThreadPoolFunc *func,
>>> + void *opaque, GDestroyNotify opaque_destroy)
>>> +{
>>> + ThreadPoolElement *el = g_new(ThreadPoolElement, 1);
>>
>> Where are the ThreadPool elements freed? I am missing something,
>> maybe.
>
> At the entry to thread_pool_func(), the initialization of the
> automatic storage duration variable "ThreadPoolElement *el" takes
> ownership of this object (RAII-style); since the variable is marked
> g_autofree, the object is freed when the variable goes out of scope
> (that is, when the function exits).
OK. I missed it.
Thanks,
C.
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 04/24] thread-pool: Implement generic (non-AIO) pool support
2024-11-17 19:19 ` [PATCH v3 04/24] thread-pool: Implement generic (non-AIO) pool support Maciej S. Szmigiero
2024-11-25 19:41 ` Fabiano Rosas
2024-11-26 19:29 ` Cédric Le Goater
@ 2024-11-28 10:08 ` Avihai Horon
2024-11-28 12:11 ` Maciej S. Szmigiero
2024-12-04 20:04 ` Peter Xu
3 siblings, 1 reply; 140+ messages in thread
From: Avihai Horon @ 2024-11-28 10:08 UTC (permalink / raw)
To: Maciej S. Szmigiero, Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Joao Martins,
qemu-devel
Hi Maciej,
On 17/11/2024 21:19, Maciej S. Szmigiero wrote:
>
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>
> Migration code wants to manage device data sending threads in one place.
>
> QEMU has an existing thread pool implementation, however it is limited
> to queuing AIO operations only and essentially has a 1:1 mapping between
> the current AioContext and the AIO ThreadPool in use.
>
> Implement generic (non-AIO) ThreadPool by essentially wrapping Glib's
> GThreadPool.
>
> This brings a few new operations on a pool:
> * thread_pool_wait() operation waits until all the submitted work requests
> have finished.
>
> * thread_pool_set_max_threads() explicitly sets the maximum thread count
> in the pool.
>
> * thread_pool_adjust_max_threads_to_work() adjusts the maximum thread count
> in the pool to equal the number of work items still queued or unfinished.
>
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> ---
> include/block/thread-pool.h | 9 +++
> util/thread-pool.c | 109 ++++++++++++++++++++++++++++++++++++
> 2 files changed, 118 insertions(+)
>
> diff --git a/include/block/thread-pool.h b/include/block/thread-pool.h
> index 6f27eb085b45..3f9f66307b65 100644
> --- a/include/block/thread-pool.h
> +++ b/include/block/thread-pool.h
> @@ -38,5 +38,14 @@ BlockAIOCB *thread_pool_submit_aio(ThreadPoolFunc *func, void *arg,
> int coroutine_fn thread_pool_submit_co(ThreadPoolFunc *func, void *arg);
> void thread_pool_update_params(ThreadPoolAio *pool, struct AioContext *ctx);
>
> +typedef struct ThreadPool ThreadPool;
> +
> +ThreadPool *thread_pool_new(void);
> +void thread_pool_free(ThreadPool *pool);
> +void thread_pool_submit(ThreadPool *pool, ThreadPoolFunc *func,
> + void *opaque, GDestroyNotify opaque_destroy);
> +void thread_pool_wait(ThreadPool *pool);
> +bool thread_pool_set_max_threads(ThreadPool *pool, int max_threads);
> +bool thread_pool_adjust_max_threads_to_work(ThreadPool *pool);
>
> #endif
> diff --git a/util/thread-pool.c b/util/thread-pool.c
> index 908194dc070f..d80c4181c897 100644
> --- a/util/thread-pool.c
> +++ b/util/thread-pool.c
> @@ -374,3 +374,112 @@ void thread_pool_free_aio(ThreadPoolAio *pool)
> qemu_mutex_destroy(&pool->lock);
> g_free(pool);
> }
> +
> +struct ThreadPool { /* type safety */
> + GThreadPool *t;
> + size_t unfinished_el_ctr;
> + QemuMutex unfinished_el_ctr_mutex;
> + QemuCond unfinished_el_ctr_zero_cond;
> +};
> +
> +typedef struct {
> + ThreadPoolFunc *func;
> + void *opaque;
> + GDestroyNotify opaque_destroy;
> +} ThreadPoolElement;
> +
> +static void thread_pool_func(gpointer data, gpointer user_data)
> +{
> + ThreadPool *pool = user_data;
> + g_autofree ThreadPoolElement *el = data;
> +
> + el->func(el->opaque);
> +
> + if (el->opaque_destroy) {
> + el->opaque_destroy(el->opaque);
> + }
> +
> + QEMU_LOCK_GUARD(&pool->unfinished_el_ctr_mutex);
> +
> + assert(pool->unfinished_el_ctr > 0);
> + pool->unfinished_el_ctr--;
> +
> + if (pool->unfinished_el_ctr == 0) {
> + qemu_cond_signal(&pool->unfinished_el_ctr_zero_cond);
> + }
> +}
> +
> +ThreadPool *thread_pool_new(void)
> +{
> + ThreadPool *pool = g_new(ThreadPool, 1);
> +
> + pool->unfinished_el_ctr = 0;
> + qemu_mutex_init(&pool->unfinished_el_ctr_mutex);
> + qemu_cond_init(&pool->unfinished_el_ctr_zero_cond);
> +
> + pool->t = g_thread_pool_new(thread_pool_func, pool, 0, TRUE, NULL);
> + /*
> + * g_thread_pool_new() can only return errors if initial thread(s)
> + * creation fails but we ask for 0 initial threads above.
> + */
> + assert(pool->t);
> +
> + return pool;
> +}
> +
> +void thread_pool_free(ThreadPool *pool)
> +{
> + g_thread_pool_free(pool->t, FALSE, TRUE);
> +
> + qemu_cond_destroy(&pool->unfinished_el_ctr_zero_cond);
> + qemu_mutex_destroy(&pool->unfinished_el_ctr_mutex);
> +
> + g_free(pool);
> +}
> +
> +void thread_pool_submit(ThreadPool *pool, ThreadPoolFunc *func,
> + void *opaque, GDestroyNotify opaque_destroy)
> +{
> + ThreadPoolElement *el = g_new(ThreadPoolElement, 1);
> +
> + el->func = func;
> + el->opaque = opaque;
> + el->opaque_destroy = opaque_destroy;
> +
> + WITH_QEMU_LOCK_GUARD(&pool->unfinished_el_ctr_mutex) {
> + pool->unfinished_el_ctr++;
> + }
> +
> + /*
> + * Ignore the return value since this function can only return errors
> + * if creation of an additional thread fails but even in this case the
> + * provided work is still getting queued (just for the existing threads).
> + */
> + g_thread_pool_push(pool->t, el, NULL);
> +}
> +
> +void thread_pool_wait(ThreadPool *pool)
> +{
> + QEMU_LOCK_GUARD(&pool->unfinished_el_ctr_mutex);
> +
> + if (pool->unfinished_el_ctr > 0) {
> + qemu_cond_wait(&pool->unfinished_el_ctr_zero_cond,
> + &pool->unfinished_el_ctr_mutex);
> + assert(pool->unfinished_el_ctr == 0);
> + }
Shouldn't we put the condition in a while loop and remove the assert (as
the wait may wake up spuriously)?
Thanks.
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 04/24] thread-pool: Implement generic (non-AIO) pool support
2024-11-28 10:08 ` Avihai Horon
@ 2024-11-28 12:11 ` Maciej S. Szmigiero
0 siblings, 0 replies; 140+ messages in thread
From: Maciej S. Szmigiero @ 2024-11-28 12:11 UTC (permalink / raw)
To: Avihai Horon
Cc: Alex Williamson, Fabiano Rosas, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Joao Martins,
qemu-devel, Peter Xu
On 28.11.2024 11:08, Avihai Horon wrote:
> Hi Maciej,
>
> On 17/11/2024 21:19, Maciej S. Szmigiero wrote:
>>
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> Migration code wants to manage device data sending threads in one place.
>>
>> QEMU has an existing thread pool implementation, however it is limited
>> to queuing AIO operations only and essentially has a 1:1 mapping between
>> the current AioContext and the AIO ThreadPool in use.
>>
>> Implement generic (non-AIO) ThreadPool by essentially wrapping Glib's
>> GThreadPool.
>>
>> This brings a few new operations on a pool:
>> * thread_pool_wait() operation waits until all the submitted work requests
>> have finished.
>>
>> * thread_pool_set_max_threads() explicitly sets the maximum thread count
>> in the pool.
>>
>> * thread_pool_adjust_max_threads_to_work() adjusts the maximum thread count
>> in the pool to equal the number of work items still queued or unfinished.
>>
>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>> ---
>> include/block/thread-pool.h | 9 +++
>> util/thread-pool.c | 109 ++++++++++++++++++++++++++++++++++++
>> 2 files changed, 118 insertions(+)
>>
>> diff --git a/include/block/thread-pool.h b/include/block/thread-pool.h
>> index 6f27eb085b45..3f9f66307b65 100644
>> --- a/include/block/thread-pool.h
>> +++ b/include/block/thread-pool.h
>> @@ -38,5 +38,14 @@ BlockAIOCB *thread_pool_submit_aio(ThreadPoolFunc *func, void *arg,
>> int coroutine_fn thread_pool_submit_co(ThreadPoolFunc *func, void *arg);
>> void thread_pool_update_params(ThreadPoolAio *pool, struct AioContext *ctx);
>>
>> +typedef struct ThreadPool ThreadPool;
>> +
>> +ThreadPool *thread_pool_new(void);
>> +void thread_pool_free(ThreadPool *pool);
>> +void thread_pool_submit(ThreadPool *pool, ThreadPoolFunc *func,
>> + void *opaque, GDestroyNotify opaque_destroy);
>> +void thread_pool_wait(ThreadPool *pool);
>> +bool thread_pool_set_max_threads(ThreadPool *pool, int max_threads);
>> +bool thread_pool_adjust_max_threads_to_work(ThreadPool *pool);
>>
>> #endif
>> diff --git a/util/thread-pool.c b/util/thread-pool.c
>> index 908194dc070f..d80c4181c897 100644
>> --- a/util/thread-pool.c
>> +++ b/util/thread-pool.c
>> @@ -374,3 +374,112 @@ void thread_pool_free_aio(ThreadPoolAio *pool)
>> qemu_mutex_destroy(&pool->lock);
>> g_free(pool);
>> }
>> +
>> +struct ThreadPool { /* type safety */
>> + GThreadPool *t;
>> + size_t unfinished_el_ctr;
>> + QemuMutex unfinished_el_ctr_mutex;
>> + QemuCond unfinished_el_ctr_zero_cond;
>> +};
>> +
>> +typedef struct {
>> + ThreadPoolFunc *func;
>> + void *opaque;
>> + GDestroyNotify opaque_destroy;
>> +} ThreadPoolElement;
>> +
>> +static void thread_pool_func(gpointer data, gpointer user_data)
>> +{
>> + ThreadPool *pool = user_data;
>> + g_autofree ThreadPoolElement *el = data;
>> +
>> + el->func(el->opaque);
>> +
>> + if (el->opaque_destroy) {
>> + el->opaque_destroy(el->opaque);
>> + }
>> +
>> + QEMU_LOCK_GUARD(&pool->unfinished_el_ctr_mutex);
>> +
>> + assert(pool->unfinished_el_ctr > 0);
>> + pool->unfinished_el_ctr--;
>> +
>> + if (pool->unfinished_el_ctr == 0) {
>> + qemu_cond_signal(&pool->unfinished_el_ctr_zero_cond);
>> + }
>> +}
>> +
>> +ThreadPool *thread_pool_new(void)
>> +{
>> + ThreadPool *pool = g_new(ThreadPool, 1);
>> +
>> + pool->unfinished_el_ctr = 0;
>> + qemu_mutex_init(&pool->unfinished_el_ctr_mutex);
>> + qemu_cond_init(&pool->unfinished_el_ctr_zero_cond);
>> +
>> + pool->t = g_thread_pool_new(thread_pool_func, pool, 0, TRUE, NULL);
>> + /*
>> + * g_thread_pool_new() can only return errors if initial thread(s)
>> + * creation fails but we ask for 0 initial threads above.
>> + */
>> + assert(pool->t);
>> +
>> + return pool;
>> +}
>> +
>> +void thread_pool_free(ThreadPool *pool)
>> +{
>> + g_thread_pool_free(pool->t, FALSE, TRUE);
>> +
>> + qemu_cond_destroy(&pool->unfinished_el_ctr_zero_cond);
>> + qemu_mutex_destroy(&pool->unfinished_el_ctr_mutex);
>> +
>> + g_free(pool);
>> +}
>> +
>> +void thread_pool_submit(ThreadPool *pool, ThreadPoolFunc *func,
>> + void *opaque, GDestroyNotify opaque_destroy)
>> +{
>> + ThreadPoolElement *el = g_new(ThreadPoolElement, 1);
>> +
>> + el->func = func;
>> + el->opaque = opaque;
>> + el->opaque_destroy = opaque_destroy;
>> +
>> + WITH_QEMU_LOCK_GUARD(&pool->unfinished_el_ctr_mutex) {
>> + pool->unfinished_el_ctr++;
>> + }
>> +
>> + /*
>> + * Ignore the return value since this function can only return errors
>> + * if creation of an additional thread fails but even in this case the
>> + * provided work is still getting queued (just for the existing threads).
>> + */
>> + g_thread_pool_push(pool->t, el, NULL);
>> +}
>> +
>> +void thread_pool_wait(ThreadPool *pool)
>> +{
>> + QEMU_LOCK_GUARD(&pool->unfinished_el_ctr_mutex);
>> +
>> + if (pool->unfinished_el_ctr > 0) {
>> + qemu_cond_wait(&pool->unfinished_el_ctr_zero_cond,
>> + &pool->unfinished_el_ctr_mutex);
>> + assert(pool->unfinished_el_ctr == 0);
>> + }
>
> Shouldn't we put the condition in a while loop and remove the assert (as the wait may wake up spuriously)?
You're right - spurious wake-ups can theoretically happen.
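So the wait will become something like:

void thread_pool_wait(ThreadPool *pool)
{
    QEMU_LOCK_GUARD(&pool->unfinished_el_ctr_mutex);

    while (pool->unfinished_el_ctr > 0) {
        qemu_cond_wait(&pool->unfinished_el_ctr_zero_cond,
                       &pool->unfinished_el_ctr_mutex);
    }
}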
> Thanks.
>
Thanks,
Maciej
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 04/24] thread-pool: Implement generic (non-AIO) pool support
2024-11-17 19:19 ` [PATCH v3 04/24] thread-pool: Implement generic (non-AIO) pool support Maciej S. Szmigiero
` (2 preceding siblings ...)
2024-11-28 10:08 ` Avihai Horon
@ 2024-12-04 20:04 ` Peter Xu
3 siblings, 0 replies; 140+ messages in thread
From: Peter Xu @ 2024-12-04 20:04 UTC (permalink / raw)
To: Maciej S. Szmigiero
Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On Sun, Nov 17, 2024 at 08:19:59PM +0100, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>
> Migration code wants to manage device data sending threads in one place.
>
> QEMU has an existing thread pool implementation, however it is limited
> to queuing AIO operations only and essentially has a 1:1 mapping between
> the current AioContext and the AIO ThreadPool in use.
>
> Implement generic (non-AIO) ThreadPool by essentially wrapping Glib's
> GThreadPool.
>
> This brings a few new operations on a pool:
> * thread_pool_wait() operation waits until all the submitted work requests
> have finished.
>
> * thread_pool_set_max_threads() explicitly sets the maximum thread count
> in the pool.
>
> * thread_pool_adjust_max_threads_to_work() adjusts the maximum thread count
> in the pool to equal the number of work items still queued or unfinished.
>
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
All the comments so far make sense to me too, so if you address all of
them, feel free to take this along:
Reviewed-by: Peter Xu <peterx@redhat.com>
--
Peter Xu
^ permalink raw reply [flat|nested] 140+ messages in thread
* [PATCH v3 05/24] migration: Add MIG_CMD_SWITCHOVER_START and its load handler
2024-11-17 19:19 [PATCH v3 00/24] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
` (3 preceding siblings ...)
2024-11-17 19:19 ` [PATCH v3 04/24] thread-pool: Implement generic (non-AIO) pool support Maciej S. Szmigiero
@ 2024-11-17 19:20 ` Maciej S. Szmigiero
2024-11-25 19:46 ` Fabiano Rosas
` (2 more replies)
2024-11-17 19:20 ` [PATCH v3 06/24] migration: Add qemu_loadvm_load_state_buffer() and its handler Maciej S. Szmigiero
` (20 subsequent siblings)
25 siblings, 3 replies; 140+ messages in thread
From: Maciej S. Szmigiero @ 2024-11-17 19:20 UTC (permalink / raw)
To: Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
This QEMU_VM_COMMAND sub-command and its switchover_start SaveVMHandler are
used to mark the switchover point in the main migration stream.
It can be used to inform the destination that all pre-switchover main
migration stream data has been sent/received, so it can start to process
post-switchover data that it might have received via other migration
channels, like the multifd ones.
Add also the relevant MigrationState bit stream compatibility property and
its hw_compat entry.
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
hw/core/machine.c | 1 +
include/migration/client-options.h | 4 +++
include/migration/register.h | 12 +++++++++
migration/colo.c | 3 +++
migration/migration-hmp-cmds.c | 2 ++
migration/migration.c | 3 +++
migration/migration.h | 2 ++
migration/options.c | 9 +++++++
migration/savevm.c | 39 ++++++++++++++++++++++++++++++
migration/savevm.h | 1 +
migration/trace-events | 1 +
scripts/analyze-migration.py | 11 +++++++++
12 files changed, 88 insertions(+)
diff --git a/hw/core/machine.c b/hw/core/machine.c
index a35c4a8faecb..ed8d39fd769f 100644
--- a/hw/core/machine.c
+++ b/hw/core/machine.c
@@ -38,6 +38,7 @@
GlobalProperty hw_compat_9_1[] = {
{ TYPE_PCI_DEVICE, "x-pcie-ext-tag", "false" },
+ { "migration", "send-switchover-start", "off"},
};
const size_t hw_compat_9_1_len = G_N_ELEMENTS(hw_compat_9_1);
diff --git a/include/migration/client-options.h b/include/migration/client-options.h
index 59f4b55cf4f7..289c9d776221 100644
--- a/include/migration/client-options.h
+++ b/include/migration/client-options.h
@@ -10,6 +10,10 @@
#ifndef QEMU_MIGRATION_CLIENT_OPTIONS_H
#define QEMU_MIGRATION_CLIENT_OPTIONS_H
+
+/* properties */
+bool migrate_send_switchover_start(void);
+
/* capabilities */
bool migrate_background_snapshot(void);
diff --git a/include/migration/register.h b/include/migration/register.h
index 0b0292738320..ff0faf5f68c8 100644
--- a/include/migration/register.h
+++ b/include/migration/register.h
@@ -279,6 +279,18 @@ typedef struct SaveVMHandlers {
* otherwise
*/
bool (*switchover_ack_needed)(void *opaque);
+
+ /**
+ * @switchover_start
+ *
+ * Notifies that the switchover has started. Called only on
+ * the destination.
+ *
+ * @opaque: data pointer passed to register_savevm_live()
+ *
+ * Returns zero to indicate success and negative for error
+ */
+ int (*switchover_start)(void *opaque);
} SaveVMHandlers;
/**
diff --git a/migration/colo.c b/migration/colo.c
index 9590f281d0f1..a75c2c41b464 100644
--- a/migration/colo.c
+++ b/migration/colo.c
@@ -452,6 +452,9 @@ static int colo_do_checkpoint_transaction(MigrationState *s,
bql_unlock();
goto out;
}
+
+ qemu_savevm_maybe_send_switchover_start(s->to_dst_file);
+
/* Note: device state is saved into buffer */
ret = qemu_save_device_state(fb);
diff --git a/migration/migration-hmp-cmds.c b/migration/migration-hmp-cmds.c
index 20d1a6e21948..59d0c48a3e0d 100644
--- a/migration/migration-hmp-cmds.c
+++ b/migration/migration-hmp-cmds.c
@@ -46,6 +46,8 @@ static void migration_global_dump(Monitor *mon)
ms->send_configuration ? "on" : "off");
monitor_printf(mon, "send-section-footer: %s\n",
ms->send_section_footer ? "on" : "off");
+ monitor_printf(mon, "send-switchover-start: %s\n",
+ ms->send_switchover_start ? "on" : "off");
monitor_printf(mon, "clear-bitmap-shift: %u\n",
ms->clear_bitmap_shift);
}
diff --git a/migration/migration.c b/migration/migration.c
index 8c5bd0a75c85..2e9d6d5087d7 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2543,6 +2543,8 @@ static int postcopy_start(MigrationState *ms, Error **errp)
}
restart_block = true;
+ qemu_savevm_maybe_send_switchover_start(ms->to_dst_file);
+
/*
* Cause any non-postcopiable, but iterative devices to
* send out their final data.
@@ -2742,6 +2744,7 @@ static int migration_completion_precopy(MigrationState *s,
*/
s->block_inactive = !migrate_colo();
migration_rate_set(RATE_LIMIT_DISABLED);
+ qemu_savevm_maybe_send_switchover_start(s->to_dst_file);
ret = qemu_savevm_state_complete_precopy(s->to_dst_file, false,
s->block_inactive);
out_unlock:
diff --git a/migration/migration.h b/migration/migration.h
index 0956e9274b2c..2a18349cfec2 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -403,6 +403,8 @@ struct MigrationState {
bool send_configuration;
/* Whether we send section footer during migration */
bool send_section_footer;
+ /* Whether we send switchover start notification during migration */
+ bool send_switchover_start;
/* Needed by postcopy-pause state */
QemuSemaphore postcopy_pause_sem;
diff --git a/migration/options.c b/migration/options.c
index ad8d6989a807..f916c8ed4e09 100644
--- a/migration/options.c
+++ b/migration/options.c
@@ -92,6 +92,8 @@ Property migration_properties[] = {
send_configuration, true),
DEFINE_PROP_BOOL("send-section-footer", MigrationState,
send_section_footer, true),
+ DEFINE_PROP_BOOL("send-switchover-start", MigrationState,
+ send_switchover_start, true),
DEFINE_PROP_BOOL("multifd-flush-after-each-section", MigrationState,
multifd_flush_after_each_section, false),
DEFINE_PROP_UINT8("x-clear-bitmap-shift", MigrationState,
@@ -206,6 +208,13 @@ bool migrate_auto_converge(void)
return s->capabilities[MIGRATION_CAPABILITY_AUTO_CONVERGE];
}
+bool migrate_send_switchover_start(void)
+{
+ MigrationState *s = migrate_get_current();
+
+ return s->send_switchover_start;
+}
+
bool migrate_background_snapshot(void)
{
MigrationState *s = migrate_get_current();
diff --git a/migration/savevm.c b/migration/savevm.c
index f4e4876f7202..a254c38edcca 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -90,6 +90,7 @@ enum qemu_vm_cmd {
MIG_CMD_ENABLE_COLO, /* Enable COLO */
MIG_CMD_POSTCOPY_RESUME, /* resume postcopy on dest */
MIG_CMD_RECV_BITMAP, /* Request for recved bitmap on dst */
+ MIG_CMD_SWITCHOVER_START, /* Switchover start notification */
MIG_CMD_MAX
};
@@ -109,6 +110,7 @@ static struct mig_cmd_args {
[MIG_CMD_POSTCOPY_RESUME] = { .len = 0, .name = "POSTCOPY_RESUME" },
[MIG_CMD_PACKAGED] = { .len = 4, .name = "PACKAGED" },
[MIG_CMD_RECV_BITMAP] = { .len = -1, .name = "RECV_BITMAP" },
+ [MIG_CMD_SWITCHOVER_START] = { .len = 0, .name = "SWITCHOVER_START" },
[MIG_CMD_MAX] = { .len = -1, .name = "MAX" },
};
@@ -1201,6 +1203,19 @@ void qemu_savevm_send_recv_bitmap(QEMUFile *f, char *block_name)
qemu_savevm_command_send(f, MIG_CMD_RECV_BITMAP, len + 1, (uint8_t *)buf);
}
+static void qemu_savevm_send_switchover_start(QEMUFile *f)
+{
+ trace_savevm_send_switchover_start();
+ qemu_savevm_command_send(f, MIG_CMD_SWITCHOVER_START, 0, NULL);
+}
+
+void qemu_savevm_maybe_send_switchover_start(QEMUFile *f)
+{
+ if (migrate_send_switchover_start()) {
+ qemu_savevm_send_switchover_start(f);
+ }
+}
+
bool qemu_savevm_state_blocked(Error **errp)
{
SaveStateEntry *se;
@@ -1713,6 +1728,7 @@ static int qemu_savevm_state(QEMUFile *f, Error **errp)
ret = qemu_file_get_error(f);
if (ret == 0) {
+ qemu_savevm_maybe_send_switchover_start(f);
qemu_savevm_state_complete_precopy(f, false, false);
ret = qemu_file_get_error(f);
}
@@ -2413,6 +2429,26 @@ static int loadvm_process_enable_colo(MigrationIncomingState *mis)
return ret;
}
+static int loadvm_postcopy_handle_switchover_start(void)
+{
+ SaveStateEntry *se;
+
+ QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+ int ret;
+
+ if (!se->ops || !se->ops->switchover_start) {
+ continue;
+ }
+
+ ret = se->ops->switchover_start(se->opaque);
+ if (ret < 0) {
+ return ret;
+ }
+ }
+
+ return 0;
+}
+
/*
* Process an incoming 'QEMU_VM_COMMAND'
* 0 just a normal return
@@ -2511,6 +2547,9 @@ static int loadvm_process_command(QEMUFile *f)
case MIG_CMD_ENABLE_COLO:
return loadvm_process_enable_colo(mis);
+
+ case MIG_CMD_SWITCHOVER_START:
+ return loadvm_postcopy_handle_switchover_start();
}
return 0;
diff --git a/migration/savevm.h b/migration/savevm.h
index 9ec96a995c93..4d402723bc3c 100644
--- a/migration/savevm.h
+++ b/migration/savevm.h
@@ -53,6 +53,7 @@ void qemu_savevm_send_postcopy_listen(QEMUFile *f);
void qemu_savevm_send_postcopy_run(QEMUFile *f);
void qemu_savevm_send_postcopy_resume(QEMUFile *f);
void qemu_savevm_send_recv_bitmap(QEMUFile *f, char *block_name);
+void qemu_savevm_maybe_send_switchover_start(QEMUFile *f);
void qemu_savevm_send_postcopy_ram_discard(QEMUFile *f, const char *name,
uint16_t len,
diff --git a/migration/trace-events b/migration/trace-events
index bb0e0cc6dcfe..551f5af0740f 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -39,6 +39,7 @@ savevm_send_postcopy_run(void) ""
savevm_send_postcopy_resume(void) ""
savevm_send_colo_enable(void) ""
savevm_send_recv_bitmap(char *name) "%s"
+savevm_send_switchover_start(void) ""
savevm_state_setup(void) ""
savevm_state_resume_prepare(void) ""
savevm_state_header(void) ""
diff --git a/scripts/analyze-migration.py b/scripts/analyze-migration.py
index 8a254a5b6a2e..a4d4042584c0 100755
--- a/scripts/analyze-migration.py
+++ b/scripts/analyze-migration.py
@@ -564,7 +564,9 @@ class MigrationDump(object):
QEMU_VM_SUBSECTION = 0x05
QEMU_VM_VMDESCRIPTION = 0x06
QEMU_VM_CONFIGURATION = 0x07
+ QEMU_VM_COMMAND = 0x08
QEMU_VM_SECTION_FOOTER= 0x7e
+ QEMU_MIG_CMD_SWITCHOVER_START = 0x0b
def __init__(self, filename):
self.section_classes = {
@@ -626,6 +628,15 @@ def read(self, desc_only = False, dump_memory = False, write_memory = False):
elif section_type == self.QEMU_VM_SECTION_PART or section_type == self.QEMU_VM_SECTION_END:
section_id = file.read32()
self.sections[section_id].read()
+ elif section_type == self.QEMU_VM_COMMAND:
+ command_type = file.read16()
+ command_data_len = file.read16()
+ if command_type != self.QEMU_MIG_CMD_SWITCHOVER_START:
+ raise Exception("Unknown QEMU_VM_COMMAND: %x" %
+ (command_type))
+ if command_data_len != 0:
+ raise Exception("Invalid SWITCHOVER_START length: %x" %
+ (command_data_len))
elif section_type == self.QEMU_VM_SECTION_FOOTER:
read_section_id = file.read32()
if read_section_id != section_id:
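(As an illustration of how a device might consume the new handler -- a
minimal sketch only, not part of the patch; the device type and field
names below are invented:)

    /* Hypothetical destination-side switchover_start implementation */
    static int mydev_switchover_start(void *opaque)
    {
        MyDevState *s = opaque;

        /*
         * All pre-switchover main stream data has been received;
         * wake the worker that applies device state buffered from
         * other channels (e.g. the multifd ones).
         */
        qemu_event_set(&s->switchover_event);
        return 0;
    }

    static const SaveVMHandlers mydev_handlers = {
        /* ...other handlers... */
        .switchover_start = mydev_switchover_start,
    };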
^ permalink raw reply related [flat|nested] 140+ messages in thread
* Re: [PATCH v3 05/24] migration: Add MIG_CMD_SWITCHOVER_START and its load handler
2024-11-17 19:20 ` [PATCH v3 05/24] migration: Add MIG_CMD_SWITCHOVER_START and its load handler Maciej S. Szmigiero
@ 2024-11-25 19:46 ` Fabiano Rosas
2024-11-26 19:37 ` Cédric Le Goater
2024-12-04 21:29 ` Peter Xu
2 siblings, 0 replies; 140+ messages in thread
From: Fabiano Rosas @ 2024-11-25 19:46 UTC (permalink / raw)
To: Maciej S. Szmigiero, Peter Xu
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
"Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>
> This QEMU_VM_COMMAND sub-command and its switchover_start SaveVMHandler are
> used to mark the switchover point in the main migration stream.
>
> It can be used to inform the destination that all pre-switchover main
> migration stream data has been sent/received, so it can start processing
> post-switchover data that it might have received via other migration
> channels, like the multifd ones.
>
> Also add the relevant MigrationState bit stream compatibility property and
> its hw_compat entry.
>
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 05/24] migration: Add MIG_CMD_SWITCHOVER_START and its load handler
2024-11-17 19:20 ` [PATCH v3 05/24] migration: Add MIG_CMD_SWITCHOVER_START and its load handler Maciej S. Szmigiero
2024-11-25 19:46 ` Fabiano Rosas
@ 2024-11-26 19:37 ` Cédric Le Goater
2024-11-26 21:22 ` Maciej S. Szmigiero
2024-12-04 21:29 ` Peter Xu
2 siblings, 1 reply; 140+ messages in thread
From: Cédric Le Goater @ 2024-11-26 19:37 UTC (permalink / raw)
To: Maciej S. Szmigiero, Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Eric Blake, Markus Armbruster,
Daniel P . Berrangé, Avihai Horon, Joao Martins, qemu-devel
On 11/17/24 20:20, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>
> This QEMU_VM_COMMAND sub-command and its switchover_start SaveVMHandler are
> used to mark the switchover point in the main migration stream.
>
> It can be used to inform the destination that all pre-switchover main
> migration stream data has been sent/received, so it can start processing
> post-switchover data that it might have received via other migration
> channels, like the multifd ones.
>
> Also add the relevant MigrationState bit stream compatibility property and
> its hw_compat entry.
>
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> ---
> hw/core/machine.c | 1 +
> include/migration/client-options.h | 4 +++
> include/migration/register.h | 12 +++++++++
> migration/colo.c | 3 +++
> migration/migration-hmp-cmds.c | 2 ++
> migration/migration.c | 3 +++
> migration/migration.h | 2 ++
> migration/options.c | 9 +++++++
> migration/savevm.c | 39 ++++++++++++++++++++++++++++++
> migration/savevm.h | 1 +
> migration/trace-events | 1 +
> scripts/analyze-migration.py | 11 +++++++++
> 12 files changed, 88 insertions(+)
>
> diff --git a/hw/core/machine.c b/hw/core/machine.c
> index a35c4a8faecb..ed8d39fd769f 100644
> --- a/hw/core/machine.c
> +++ b/hw/core/machine.c
> @@ -38,6 +38,7 @@
>
> GlobalProperty hw_compat_9_1[] = {
> { TYPE_PCI_DEVICE, "x-pcie-ext-tag", "false" },
> + { "migration", "send-switchover-start", "off"},
> };
> const size_t hw_compat_9_1_len = G_N_ELEMENTS(hw_compat_9_1);
>
> diff --git a/include/migration/client-options.h b/include/migration/client-options.h
> index 59f4b55cf4f7..289c9d776221 100644
> --- a/include/migration/client-options.h
> +++ b/include/migration/client-options.h
> @@ -10,6 +10,10 @@
> #ifndef QEMU_MIGRATION_CLIENT_OPTIONS_H
> #define QEMU_MIGRATION_CLIENT_OPTIONS_H
>
> +
> +/* properties */
> +bool migrate_send_switchover_start(void);
> +
> /* capabilities */
>
> bool migrate_background_snapshot(void);
> diff --git a/include/migration/register.h b/include/migration/register.h
> index 0b0292738320..ff0faf5f68c8 100644
> --- a/include/migration/register.h
> +++ b/include/migration/register.h
> @@ -279,6 +279,18 @@ typedef struct SaveVMHandlers {
> * otherwise
> */
> bool (*switchover_ack_needed)(void *opaque);
> +
> + /**
> + * @switchover_start
> + *
> + * Notifies that the switchover has started. Called only on
> + * the destination.
> + *
> + * @opaque: data pointer passed to register_savevm_live()
> + *
> + * Returns zero to indicate success and negative for error
> + */
> + int (*switchover_start)(void *opaque);
We don't need an 'Error **' parameter? Just asking.
> } SaveVMHandlers;
>
> /**
> diff --git a/migration/colo.c b/migration/colo.c
> index 9590f281d0f1..a75c2c41b464 100644
> --- a/migration/colo.c
> +++ b/migration/colo.c
> @@ -452,6 +452,9 @@ static int colo_do_checkpoint_transaction(MigrationState *s,
> bql_unlock();
> goto out;
> }
> +
> + qemu_savevm_maybe_send_switchover_start(s->to_dst_file);
I would drop '_maybe_' from the name.
Thanks,
C.
> +
> /* Note: device state is saved into buffer */
> ret = qemu_save_device_state(fb);
>
> diff --git a/migration/migration-hmp-cmds.c b/migration/migration-hmp-cmds.c
> index 20d1a6e21948..59d0c48a3e0d 100644
> --- a/migration/migration-hmp-cmds.c
> +++ b/migration/migration-hmp-cmds.c
> @@ -46,6 +46,8 @@ static void migration_global_dump(Monitor *mon)
> ms->send_configuration ? "on" : "off");
> monitor_printf(mon, "send-section-footer: %s\n",
> ms->send_section_footer ? "on" : "off");
> + monitor_printf(mon, "send-switchover-start: %s\n",
> + ms->send_switchover_start ? "on" : "off");
> monitor_printf(mon, "clear-bitmap-shift: %u\n",
> ms->clear_bitmap_shift);
> }
> diff --git a/migration/migration.c b/migration/migration.c
> index 8c5bd0a75c85..2e9d6d5087d7 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -2543,6 +2543,8 @@ static int postcopy_start(MigrationState *ms, Error **errp)
> }
> restart_block = true;
>
> + qemu_savevm_maybe_send_switchover_start(ms->to_dst_file);
> +
> /*
> * Cause any non-postcopiable, but iterative devices to
> * send out their final data.
> @@ -2742,6 +2744,7 @@ static int migration_completion_precopy(MigrationState *s,
> */
> s->block_inactive = !migrate_colo();
> migration_rate_set(RATE_LIMIT_DISABLED);
> + qemu_savevm_maybe_send_switchover_start(s->to_dst_file);
> ret = qemu_savevm_state_complete_precopy(s->to_dst_file, false,
> s->block_inactive);
> out_unlock:
> diff --git a/migration/migration.h b/migration/migration.h
> index 0956e9274b2c..2a18349cfec2 100644
> --- a/migration/migration.h
> +++ b/migration/migration.h
> @@ -403,6 +403,8 @@ struct MigrationState {
> bool send_configuration;
> /* Whether we send section footer during migration */
> bool send_section_footer;
> + /* Whether we send switchover start notification during migration */
> + bool send_switchover_start;
>
> /* Needed by postcopy-pause state */
> QemuSemaphore postcopy_pause_sem;
> diff --git a/migration/options.c b/migration/options.c
> index ad8d6989a807..f916c8ed4e09 100644
> --- a/migration/options.c
> +++ b/migration/options.c
> @@ -92,6 +92,8 @@ Property migration_properties[] = {
> send_configuration, true),
> DEFINE_PROP_BOOL("send-section-footer", MigrationState,
> send_section_footer, true),
> + DEFINE_PROP_BOOL("send-switchover-start", MigrationState,
> + send_switchover_start, true),
> DEFINE_PROP_BOOL("multifd-flush-after-each-section", MigrationState,
> multifd_flush_after_each_section, false),
> DEFINE_PROP_UINT8("x-clear-bitmap-shift", MigrationState,
> @@ -206,6 +208,13 @@ bool migrate_auto_converge(void)
> return s->capabilities[MIGRATION_CAPABILITY_AUTO_CONVERGE];
> }
>
> +bool migrate_send_switchover_start(void)
> +{
> + MigrationState *s = migrate_get_current();
> +
> + return s->send_switchover_start;
> +}
> +
> bool migrate_background_snapshot(void)
> {
> MigrationState *s = migrate_get_current();
> diff --git a/migration/savevm.c b/migration/savevm.c
> index f4e4876f7202..a254c38edcca 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -90,6 +90,7 @@ enum qemu_vm_cmd {
> MIG_CMD_ENABLE_COLO, /* Enable COLO */
> MIG_CMD_POSTCOPY_RESUME, /* resume postcopy on dest */
> MIG_CMD_RECV_BITMAP, /* Request for recved bitmap on dst */
> + MIG_CMD_SWITCHOVER_START, /* Switchover start notification */
> MIG_CMD_MAX
> };
>
> @@ -109,6 +110,7 @@ static struct mig_cmd_args {
> [MIG_CMD_POSTCOPY_RESUME] = { .len = 0, .name = "POSTCOPY_RESUME" },
> [MIG_CMD_PACKAGED] = { .len = 4, .name = "PACKAGED" },
> [MIG_CMD_RECV_BITMAP] = { .len = -1, .name = "RECV_BITMAP" },
> + [MIG_CMD_SWITCHOVER_START] = { .len = 0, .name = "SWITCHOVER_START" },
> [MIG_CMD_MAX] = { .len = -1, .name = "MAX" },
> };
>
> @@ -1201,6 +1203,19 @@ void qemu_savevm_send_recv_bitmap(QEMUFile *f, char *block_name)
> qemu_savevm_command_send(f, MIG_CMD_RECV_BITMAP, len + 1, (uint8_t *)buf);
> }
>
> +static void qemu_savevm_send_switchover_start(QEMUFile *f)
> +{
> + trace_savevm_send_switchover_start();
> + qemu_savevm_command_send(f, MIG_CMD_SWITCHOVER_START, 0, NULL);
> +}
> +
> +void qemu_savevm_maybe_send_switchover_start(QEMUFile *f)
> +{
> + if (migrate_send_switchover_start()) {
> + qemu_savevm_send_switchover_start(f);
> + }
> +}
> +
> bool qemu_savevm_state_blocked(Error **errp)
> {
> SaveStateEntry *se;
> @@ -1713,6 +1728,7 @@ static int qemu_savevm_state(QEMUFile *f, Error **errp)
>
> ret = qemu_file_get_error(f);
> if (ret == 0) {
> + qemu_savevm_maybe_send_switchover_start(f);
> qemu_savevm_state_complete_precopy(f, false, false);
> ret = qemu_file_get_error(f);
> }
> @@ -2413,6 +2429,26 @@ static int loadvm_process_enable_colo(MigrationIncomingState *mis)
> return ret;
> }
>
> +static int loadvm_postcopy_handle_switchover_start(void)
> +{
> + SaveStateEntry *se;
> +
> + QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> + int ret;
> +
> + if (!se->ops || !se->ops->switchover_start) {
> + continue;
> + }
> +
> + ret = se->ops->switchover_start(se->opaque);
> + if (ret < 0) {
> + return ret;
> + }
> + }
> +
> + return 0;
> +}
> +
> /*
> * Process an incoming 'QEMU_VM_COMMAND'
> * 0 just a normal return
> @@ -2511,6 +2547,9 @@ static int loadvm_process_command(QEMUFile *f)
>
> case MIG_CMD_ENABLE_COLO:
> return loadvm_process_enable_colo(mis);
> +
> + case MIG_CMD_SWITCHOVER_START:
> + return loadvm_postcopy_handle_switchover_start();
> }
>
> return 0;
> diff --git a/migration/savevm.h b/migration/savevm.h
> index 9ec96a995c93..4d402723bc3c 100644
> --- a/migration/savevm.h
> +++ b/migration/savevm.h
> @@ -53,6 +53,7 @@ void qemu_savevm_send_postcopy_listen(QEMUFile *f);
> void qemu_savevm_send_postcopy_run(QEMUFile *f);
> void qemu_savevm_send_postcopy_resume(QEMUFile *f);
> void qemu_savevm_send_recv_bitmap(QEMUFile *f, char *block_name);
> +void qemu_savevm_maybe_send_switchover_start(QEMUFile *f);
>
> void qemu_savevm_send_postcopy_ram_discard(QEMUFile *f, const char *name,
> uint16_t len,
> diff --git a/migration/trace-events b/migration/trace-events
> index bb0e0cc6dcfe..551f5af0740f 100644
> --- a/migration/trace-events
> +++ b/migration/trace-events
> @@ -39,6 +39,7 @@ savevm_send_postcopy_run(void) ""
> savevm_send_postcopy_resume(void) ""
> savevm_send_colo_enable(void) ""
> savevm_send_recv_bitmap(char *name) "%s"
> +savevm_send_switchover_start(void) ""
> savevm_state_setup(void) ""
> savevm_state_resume_prepare(void) ""
> savevm_state_header(void) ""
> diff --git a/scripts/analyze-migration.py b/scripts/analyze-migration.py
> index 8a254a5b6a2e..a4d4042584c0 100755
> --- a/scripts/analyze-migration.py
> +++ b/scripts/analyze-migration.py
> @@ -564,7 +564,9 @@ class MigrationDump(object):
> QEMU_VM_SUBSECTION = 0x05
> QEMU_VM_VMDESCRIPTION = 0x06
> QEMU_VM_CONFIGURATION = 0x07
> + QEMU_VM_COMMAND = 0x08
> QEMU_VM_SECTION_FOOTER= 0x7e
> + QEMU_MIG_CMD_SWITCHOVER_START = 0x0b
>
> def __init__(self, filename):
> self.section_classes = {
> @@ -626,6 +628,15 @@ def read(self, desc_only = False, dump_memory = False, write_memory = False):
> elif section_type == self.QEMU_VM_SECTION_PART or section_type == self.QEMU_VM_SECTION_END:
> section_id = file.read32()
> self.sections[section_id].read()
> + elif section_type == self.QEMU_VM_COMMAND:
> + command_type = file.read16()
> + command_data_len = file.read16()
> + if command_type != self.QEMU_MIG_CMD_SWITCHOVER_START:
> + raise Exception("Unknown QEMU_VM_COMMAND: %x" %
> + (command_type))
> + if command_data_len != 0:
> + raise Exception("Invalid SWITCHOVER_START length: %x" %
> + (command_data_len))
> elif section_type == self.QEMU_VM_SECTION_FOOTER:
> read_section_id = file.read32()
> if read_section_id != section_id:
>
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 05/24] migration: Add MIG_CMD_SWITCHOVER_START and its load handler
2024-11-26 19:37 ` Cédric Le Goater
@ 2024-11-26 21:22 ` Maciej S. Szmigiero
0 siblings, 0 replies; 140+ messages in thread
From: Maciej S. Szmigiero @ 2024-11-26 21:22 UTC (permalink / raw)
To: Cédric Le Goater
Cc: Alex Williamson, Peter Xu, Eric Blake, Markus Armbruster,
Daniel P . Berrangé, Avihai Horon, Joao Martins, qemu-devel,
Fabiano Rosas
On 26.11.2024 20:37, Cédric Le Goater wrote:
> On 11/17/24 20:20, Maciej S. Szmigiero wrote:
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> This QEMU_VM_COMMAND sub-command and its switchover_start SaveVMHandler are
>> used to mark the switchover point in the main migration stream.
>>
>> It can be used to inform the destination that all pre-switchover main
>> migration stream data has been sent/received, so it can start processing
>> post-switchover data that it might have received via other migration
>> channels, like the multifd ones.
>>
>> Also add the relevant MigrationState bit stream compatibility property and
>> its hw_compat entry.
>>
>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>> ---
>> hw/core/machine.c | 1 +
>> include/migration/client-options.h | 4 +++
>> include/migration/register.h | 12 +++++++++
>> migration/colo.c | 3 +++
>> migration/migration-hmp-cmds.c | 2 ++
>> migration/migration.c | 3 +++
>> migration/migration.h | 2 ++
>> migration/options.c | 9 +++++++
>> migration/savevm.c | 39 ++++++++++++++++++++++++++++++
>> migration/savevm.h | 1 +
>> migration/trace-events | 1 +
>> scripts/analyze-migration.py | 11 +++++++++
>> 12 files changed, 88 insertions(+)
>>
>> diff --git a/hw/core/machine.c b/hw/core/machine.c
>> index a35c4a8faecb..ed8d39fd769f 100644
>> --- a/hw/core/machine.c
>> +++ b/hw/core/machine.c
>> @@ -38,6 +38,7 @@
>> GlobalProperty hw_compat_9_1[] = {
>> { TYPE_PCI_DEVICE, "x-pcie-ext-tag", "false" },
>> + { "migration", "send-switchover-start", "off"},
>> };
>> const size_t hw_compat_9_1_len = G_N_ELEMENTS(hw_compat_9_1);
>> diff --git a/include/migration/client-options.h b/include/migration/client-options.h
>> index 59f4b55cf4f7..289c9d776221 100644
>> --- a/include/migration/client-options.h
>> +++ b/include/migration/client-options.h
>> @@ -10,6 +10,10 @@
>> #ifndef QEMU_MIGRATION_CLIENT_OPTIONS_H
>> #define QEMU_MIGRATION_CLIENT_OPTIONS_H
>> +
>> +/* properties */
>> +bool migrate_send_switchover_start(void);
>> +
>> /* capabilities */
>> bool migrate_background_snapshot(void);
>> diff --git a/include/migration/register.h b/include/migration/register.h
>> index 0b0292738320..ff0faf5f68c8 100644
>> --- a/include/migration/register.h
>> +++ b/include/migration/register.h
>> @@ -279,6 +279,18 @@ typedef struct SaveVMHandlers {
>> * otherwise
>> */
>> bool (*switchover_ack_needed)(void *opaque);
>> +
>> + /**
>> + * @switchover_start
>> + *
>> + * Notifies that the switchover has started. Called only on
>> + * the destination.
>> + *
>> + * @opaque: data pointer passed to register_savevm_live()
>> + *
>> + * Returns zero to indicate success and negative for error
>> + */
>> + int (*switchover_start)(void *opaque);
>
> We don't need an 'Error **' parameter? Just asking.
This is only called from "loadvm_process_command(QEMUFile *f)",
which does not support "Error" returns.
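(Without an 'Error **' parameter a handler can still surface details by
logging them itself and returning a negative errno, which
loadvm_process_command() then propagates -- a rough sketch with an
invented device and readiness check:)

    static int mydev_switchover_start(void *opaque)
    {
        MyDevState *s = opaque;

        if (!s->ready) {    /* hypothetical condition */
            error_report("mydev: switchover started before device ready");
            return -EINVAL;
        }
        return 0;
    }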
>> } SaveVMHandlers;
>> /**
>> diff --git a/migration/colo.c b/migration/colo.c
>> index 9590f281d0f1..a75c2c41b464 100644
>> --- a/migration/colo.c
>> +++ b/migration/colo.c
>> @@ -452,6 +452,9 @@ static int colo_do_checkpoint_transaction(MigrationState *s,
>> bql_unlock();
>> goto out;
>> }
>> +
>> + qemu_savevm_maybe_send_switchover_start(s->to_dst_file);
>
> I would drop '_maybe_' from the name.
I can drop it, but then there will be no hint in this function
name that this sending is conditional on the relevant migration
property (rather than unconditional).
>
> Thanks,
>
> C.
>
Thanks,
Maciej
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 05/24] migration: Add MIG_CMD_SWITCHOVER_START and its load handler
2024-11-17 19:20 ` [PATCH v3 05/24] migration: Add MIG_CMD_SWITCHOVER_START and its load handler Maciej S. Szmigiero
2024-11-25 19:46 ` Fabiano Rosas
2024-11-26 19:37 ` Cédric Le Goater
@ 2024-12-04 21:29 ` Peter Xu
2024-12-05 19:46 ` Zhang Chen
2 siblings, 1 reply; 140+ messages in thread
From: Peter Xu @ 2024-12-04 21:29 UTC (permalink / raw)
To: Maciej S. Szmigiero
Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On Sun, Nov 17, 2024 at 08:20:00PM +0100, Maciej S. Szmigiero wrote:
> diff --git a/migration/colo.c b/migration/colo.c
> index 9590f281d0f1..a75c2c41b464 100644
> --- a/migration/colo.c
> +++ b/migration/colo.c
> @@ -452,6 +452,9 @@ static int colo_do_checkpoint_transaction(MigrationState *s,
> bql_unlock();
> goto out;
> }
> +
> + qemu_savevm_maybe_send_switchover_start(s->to_dst_file);
> +
> /* Note: device state is saved into buffer */
> ret = qemu_save_device_state(fb);
Looks all good, except I'm not sure whether we should touch colo. IIUC it
should be safer to remove it.
--
Peter Xu
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 05/24] migration: Add MIG_CMD_SWITCHOVER_START and its load handler
2024-12-04 21:29 ` Peter Xu
@ 2024-12-05 19:46 ` Zhang Chen
2024-12-06 18:24 ` Maciej S. Szmigiero
0 siblings, 1 reply; 140+ messages in thread
From: Zhang Chen @ 2024-12-05 19:46 UTC (permalink / raw)
To: Peter Xu
Cc: Maciej S. Szmigiero, Fabiano Rosas, Alex Williamson,
Cédric Le Goater, Eric Blake, Markus Armbruster,
Daniel P . Berrangé, Avihai Horon, Joao Martins, qemu-devel
On Thu, Dec 5, 2024 at 5:30 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Sun, Nov 17, 2024 at 08:20:00PM +0100, Maciej S. Szmigiero wrote:
> > diff --git a/migration/colo.c b/migration/colo.c
> > index 9590f281d0f1..a75c2c41b464 100644
> > --- a/migration/colo.c
> > +++ b/migration/colo.c
> > @@ -452,6 +452,9 @@ static int colo_do_checkpoint_transaction(MigrationState *s,
> > bql_unlock();
> > goto out;
> > }
> > +
> > + qemu_savevm_maybe_send_switchover_start(s->to_dst_file);
> > +
> > /* Note: device state is saved into buffer */
> > ret = qemu_save_device_state(fb);
>
> Looks all good, except I'm not sure whether we should touch colo. IIUC it
> should be safer to remove it.
>
Agree with Peter's comments.
If I understand correctly, the current COLO doesn't support multifd migration.
Thanks
Chen
> --
> Peter Xu
>
>
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 05/24] migration: Add MIG_CMD_SWITCHOVER_START and its load handler
2024-12-05 19:46 ` Zhang Chen
@ 2024-12-06 18:24 ` Maciej S. Szmigiero
2024-12-06 22:12 ` Peter Xu
0 siblings, 1 reply; 140+ messages in thread
From: Maciej S. Szmigiero @ 2024-12-06 18:24 UTC (permalink / raw)
To: Zhang Chen, Peter Xu
Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On 5.12.2024 20:46, Zhang Chen wrote:
> On Thu, Dec 5, 2024 at 5:30 AM Peter Xu <peterx@redhat.com> wrote:
>>
>> On Sun, Nov 17, 2024 at 08:20:00PM +0100, Maciej S. Szmigiero wrote:
>>> diff --git a/migration/colo.c b/migration/colo.c
>>> index 9590f281d0f1..a75c2c41b464 100644
>>> --- a/migration/colo.c
>>> +++ b/migration/colo.c
>>> @@ -452,6 +452,9 @@ static int colo_do_checkpoint_transaction(MigrationState *s,
>>> bql_unlock();
>>> goto out;
>>> }
>>> +
>>> + qemu_savevm_maybe_send_switchover_start(s->to_dst_file);
>>> +
>>> /* Note: device state is saved into buffer */
>>> ret = qemu_save_device_state(fb);
>>
>> Looks all good, except I'm not sure whether we should touch colo. IIUC it
>> should be safer to remove it.
>>
>
> Agree with Peter's comments.
> If I understand correctly, the current COLO doesn't support multifd migration.
This patch adds a generic migration bit stream command, which could be used
for purposes other than multifd device state migration too.
It just so happens that we currently make use of it for VFIO driver multifd
device state migration, since we need a way to achieve the same functionality
as the save_live_complete_precopy_{begin,end} handlers did in previous
versions of this patch set.
Since adding this bit stream command to COLO does not cost anything
(it's already behind a compatibility migration property) and it may be
useful in the future, I would advise keeping it there.
On the other hand, if we don't add it to COLO now but it turns out it
will be needed there to implement some functionality in the future, then
we'll need to add yet another compatibility migration property for that.
> Thanks
> Chen
>
Thanks,
Maciej
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 05/24] migration: Add MIG_CMD_SWITCHOVER_START and its load handler
2024-12-06 18:24 ` Maciej S. Szmigiero
@ 2024-12-06 22:12 ` Peter Xu
2024-12-09 1:43 ` Zhang Chen
0 siblings, 1 reply; 140+ messages in thread
From: Peter Xu @ 2024-12-06 22:12 UTC (permalink / raw)
To: Maciej S. Szmigiero
Cc: Zhang Chen, Fabiano Rosas, Alex Williamson, Cédric Le Goater,
Eric Blake, Markus Armbruster, Daniel P . Berrangé,
Avihai Horon, Joao Martins, qemu-devel
On Fri, Dec 06, 2024 at 07:24:58PM +0100, Maciej S. Szmigiero wrote:
> On 5.12.2024 20:46, Zhang Chen wrote:
> > On Thu, Dec 5, 2024 at 5:30 AM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > On Sun, Nov 17, 2024 at 08:20:00PM +0100, Maciej S. Szmigiero wrote:
> > > > diff --git a/migration/colo.c b/migration/colo.c
> > > > index 9590f281d0f1..a75c2c41b464 100644
> > > > --- a/migration/colo.c
> > > > +++ b/migration/colo.c
> > > > @@ -452,6 +452,9 @@ static int colo_do_checkpoint_transaction(MigrationState *s,
> > > > bql_unlock();
> > > > goto out;
> > > > }
> > > > +
> > > > + qemu_savevm_maybe_send_switchover_start(s->to_dst_file);
> > > > +
> > > > /* Note: device state is saved into buffer */
> > > > ret = qemu_save_device_state(fb);
> > >
> > > Looks all good, except I'm not sure whether we should touch colo. IIUC it
> > > should be safer to remove it.
> > >
> >
> > Agree with Peter's comments.
> > If I understand correctly, the current COLO doesn't support multifd migration.
>
> This patch adds a generic migration bit stream command, which could be used
> for purposes other than multifd device state migration too.
>
> It just so happens that we currently make use of it for VFIO driver multifd
> device state migration, since we need a way to achieve the same functionality
> as the save_live_complete_precopy_{begin,end} handlers did in previous
> versions of this patch set.
>
> Since adding this bit stream command to COLO does not cost anything
> (it's already behind a compatibility migration property) and it may be
> useful in the future, I would advise keeping it there.
>
> On the other hand, if we don't add it to COLO now but it turns out it
> will be needed there to implement some functionality in the future, then
> we'll need to add yet another compatibility migration property for that.
There's one thing still slightly off for COLO, where IIUC COLO runs that in
a loop to synchronize device states (colo_do_checkpoint_transaction()) to
the other side, so that's not exactly where the "switchover" (in COLO's
wording, I think it's called "failover") happens for COLO.. Hence the name
qemu_savevm_maybe_send_switchover_start() may be slightly misleading in
COLO's case..
But that's not a huge deal. At least I checked and I agree the code should
work for COLO too, and I think COLO should need something like machine type
to work properly across upgrades, in that case I think COLO is safe. So
I'm OK with keeping this, as long as Chen doesn't object.
--
Peter Xu
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 05/24] migration: Add MIG_CMD_SWITCHOVER_START and its load handler
2024-12-06 22:12 ` Peter Xu
@ 2024-12-09 1:43 ` Zhang Chen
0 siblings, 0 replies; 140+ messages in thread
From: Zhang Chen @ 2024-12-09 1:43 UTC (permalink / raw)
To: Peter Xu
Cc: Maciej S. Szmigiero, Fabiano Rosas, Alex Williamson,
Cédric Le Goater, Eric Blake, Markus Armbruster,
Daniel P . Berrangé, Avihai Horon, Joao Martins,
qemu-devel@nongnu.org
On Sat, Dec 7, 2024, 6:12 AM Peter Xu <peterx@redhat.com> wrote:
> On Fri, Dec 06, 2024 at 07:24:58PM +0100, Maciej S. Szmigiero wrote:
> > On 5.12.2024 20:46, Zhang Chen wrote:
> > > On Thu, Dec 5, 2024 at 5:30 AM Peter Xu <peterx@redhat.com> wrote:
> > > >
> > > > On Sun, Nov 17, 2024 at 08:20:00PM +0100, Maciej S. Szmigiero wrote:
> > > > > diff --git a/migration/colo.c b/migration/colo.c
> > > > > index 9590f281d0f1..a75c2c41b464 100644
> > > > > --- a/migration/colo.c
> > > > > +++ b/migration/colo.c
> > > > > @@ -452,6 +452,9 @@ static int colo_do_checkpoint_transaction(MigrationState *s,
> > > > > bql_unlock();
> > > > > goto out;
> > > > > }
> > > > > +
> > > > > + qemu_savevm_maybe_send_switchover_start(s->to_dst_file);
> > > > > +
> > > > > /* Note: device state is saved into buffer */
> > > > > ret = qemu_save_device_state(fb);
> > > >
> > > > Looks all good, except I'm not sure whether we should touch colo. IIUC it
> > > > should be safer to remove it.
> > > >
> > >
> > > Agree with Peter's comments.
> > > If I understand correctly, the current COLO doesn't support multifd migration.
> >
> > This patch adds a generic migration bit stream command, which could be used
> > for purposes other than multifd device state migration too.
> >
> > It just so happens that we currently make use of it for VFIO driver multifd
> > device state migration, since we need a way to achieve the same functionality
> > as the save_live_complete_precopy_{begin,end} handlers did in previous
> > versions of this patch set.
> >
> > Since adding this bit stream command to COLO does not cost anything
> > (it's already behind a compatibility migration property) and it may be
> > useful in the future, I would advise keeping it there.
> >
> > On the other hand, if we don't add it to COLO now but it turns out it
> > will be needed there to implement some functionality in the future, then
> > we'll need to add yet another compatibility migration property for that.
>
> There's one thing still slightly off for COLO, where IIUC COLO runs that in
> a loop to synchronize device states (colo_do_checkpoint_transaction()) to
> the other side, so that's not exactly where the "switchover" (in COLO's
> wording, I think it's called "failover") happens for COLO.. Hence the name
> qemu_savevm_maybe_send_switchover_start() may be slightly misleading in
> COLO's case..
>
> But that's not a huge deal. At least I checked and I agree the code should
> work for COLO too, and I think COLO should need something like machine type
> to work properly across upgrades, in that case I think COLO is safe. So
> I'm OK with keeping this, as long as Chen doesn't object.
>
>
Thanks for explaining the details of this series. I think it's OK after
rechecking the COLO code; feel free to add my Reviewed-by for the COLO part.
Thanks
Chen
> --
> Peter Xu
>
>
^ permalink raw reply [flat|nested] 140+ messages in thread
* [PATCH v3 06/24] migration: Add qemu_loadvm_load_state_buffer() and its handler
2024-11-17 19:19 [PATCH v3 00/24] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
` (4 preceding siblings ...)
2024-11-17 19:20 ` [PATCH v3 05/24] migration: Add MIG_CMD_SWITCHOVER_START and its load handler Maciej S. Szmigiero
@ 2024-11-17 19:20 ` Maciej S. Szmigiero
2024-12-04 21:32 ` Peter Xu
2024-11-17 19:20 ` [PATCH v3 07/24] migration: Document the BQL behavior of load SaveVMHandlers Maciej S. Szmigiero
` (19 subsequent siblings)
25 siblings, 1 reply; 140+ messages in thread
From: Maciej S. Szmigiero @ 2024-11-17 19:20 UTC (permalink / raw)
To: Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
qemu_loadvm_load_state_buffer() and its load_state_buffer
SaveVMHandler allow providing a device state buffer to an explicitly
specified device via its idstr and instance id.
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
include/migration/register.h | 17 +++++++++++++++++
migration/savevm.c | 23 +++++++++++++++++++++++
migration/savevm.h | 3 +++
3 files changed, 43 insertions(+)
diff --git a/include/migration/register.h b/include/migration/register.h
index ff0faf5f68c8..39991f3cc5d0 100644
--- a/include/migration/register.h
+++ b/include/migration/register.h
@@ -229,6 +229,23 @@ typedef struct SaveVMHandlers {
*/
int (*load_state)(QEMUFile *f, void *opaque, int version_id);
+ /* This runs outside the BQL. */
+
+ /**
+ * @load_state_buffer
+ *
+ * Load device state buffer provided to qemu_loadvm_load_state_buffer().
+ *
+ * @opaque: data pointer passed to register_savevm_live()
+ * @buf: the data buffer to load
+ * @len: the data length in buffer
+ * @errp: pointer to Error*, to store an error if it happens.
+ *
+ * Returns zero to indicate success and negative for error
+ */
+ int (*load_state_buffer)(void *opaque, char *buf, size_t len,
+ Error **errp);
+
/**
* @load_setup
*
diff --git a/migration/savevm.c b/migration/savevm.c
index a254c38edcca..1f58a2fa54ae 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -3085,6 +3085,29 @@ int qemu_loadvm_approve_switchover(void)
return migrate_send_rp_switchover_ack(mis);
}
+int qemu_loadvm_load_state_buffer(const char *idstr, uint32_t instance_id,
+ char *buf, size_t len, Error **errp)
+{
+ SaveStateEntry *se;
+
+ se = find_se(idstr, instance_id);
+ if (!se) {
+ error_setg(errp,
+ "Unknown idstr %s or instance id %u for load state buffer",
+ idstr, instance_id);
+ return -1;
+ }
+
+ if (!se->ops || !se->ops->load_state_buffer) {
+ error_setg(errp,
+ "idstr %s / instance %u has no load state buffer operation",
+ idstr, instance_id);
+ return -1;
+ }
+
+ return se->ops->load_state_buffer(se->opaque, buf, len, errp);
+}
+
bool save_snapshot(const char *name, bool overwrite, const char *vmstate,
bool has_devices, strList *devices, Error **errp)
{
diff --git a/migration/savevm.h b/migration/savevm.h
index 4d402723bc3c..b5a4f8c8b440 100644
--- a/migration/savevm.h
+++ b/migration/savevm.h
@@ -71,4 +71,7 @@ int qemu_loadvm_approve_switchover(void);
int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
bool in_postcopy, bool inactivate_disks);
+int qemu_loadvm_load_state_buffer(const char *idstr, uint32_t instance_id,
+ char *buf, size_t len, Error **errp);
+
#endif
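(For illustration -- a minimal sketch of what a device-side
load_state_buffer implementation and its caller could look like; the
device type, packet header struct and helpers are invented, not part of
this patch:)

    static int mydev_load_state_buffer(void *opaque, char *buf, size_t len,
                                       Error **errp)
    {
        MyDevState *s = opaque;

        if (len < sizeof(MyDevPacketHdr)) {
            error_setg(errp, "mydev: buffer too short: %zu bytes", len);
            return -1;
        }

        /* This handler runs outside the BQL, touch only load-private state */
        return mydev_queue_packet(s, buf, len, errp);
    }

    /* e.g. from a multifd receive path: */
    Error *local_err = NULL;

    if (qemu_loadvm_load_state_buffer(idstr, instance_id,
                                      packet_buf, packet_len,
                                      &local_err) < 0) {
        error_report_err(local_err);
    }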
^ permalink raw reply related [flat|nested] 140+ messages in thread
* Re: [PATCH v3 06/24] migration: Add qemu_loadvm_load_state_buffer() and its handler
2024-11-17 19:20 ` [PATCH v3 06/24] migration: Add qemu_loadvm_load_state_buffer() and its handler Maciej S. Szmigiero
@ 2024-12-04 21:32 ` Peter Xu
2024-12-06 21:12 ` Maciej S. Szmigiero
0 siblings, 1 reply; 140+ messages in thread
From: Peter Xu @ 2024-12-04 21:32 UTC (permalink / raw)
To: Maciej S. Szmigiero
Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On Sun, Nov 17, 2024 at 08:20:01PM +0100, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>
> qemu_loadvm_load_state_buffer() and its load_state_buffer
> SaveVMHandler allow providing a device state buffer to an explicitly
> specified device via its idstr and instance id.
>
> Reviewed-by: Fabiano Rosas <farosas@suse.de>
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
One nitpick:
> ---
> include/migration/register.h | 17 +++++++++++++++++
> migration/savevm.c | 23 +++++++++++++++++++++++
> migration/savevm.h | 3 +++
> 3 files changed, 43 insertions(+)
>
> diff --git a/include/migration/register.h b/include/migration/register.h
> index ff0faf5f68c8..39991f3cc5d0 100644
> --- a/include/migration/register.h
> +++ b/include/migration/register.h
> @@ -229,6 +229,23 @@ typedef struct SaveVMHandlers {
> */
> int (*load_state)(QEMUFile *f, void *opaque, int version_id);
>
> + /* This runs outside the BQL. */
> +
> + /**
> + * @load_state_buffer
> + *
> + * Load device state buffer provided to qemu_loadvm_load_state_buffer().
> + *
> + * @opaque: data pointer passed to register_savevm_live()
> + * @buf: the data buffer to load
> + * @len: the data length in buffer
> + * @errp: pointer to Error*, to store an error if it happens.
> + *
> + * Returns zero to indicate success and negative for error
> + */
> + int (*load_state_buffer)(void *opaque, char *buf, size_t len,
> + Error **errp);
> +
> /**
> * @load_setup
> *
> diff --git a/migration/savevm.c b/migration/savevm.c
> index a254c38edcca..1f58a2fa54ae 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -3085,6 +3085,29 @@ int qemu_loadvm_approve_switchover(void)
> return migrate_send_rp_switchover_ack(mis);
> }
>
> +int qemu_loadvm_load_state_buffer(const char *idstr, uint32_t instance_id,
> + char *buf, size_t len, Error **errp)
Suggest to always return bool as success/fail, especially when using
Error**.
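(I.e., roughly the following shape, matching the usual qapi/error.h
convention of returning bool alongside an Error ** parameter -- just a
sketch of the suggestion:)

    bool qemu_loadvm_load_state_buffer(const char *idstr, uint32_t instance_id,
                                       char *buf, size_t len, Error **errp);

    /* so a caller can simply do: */
    if (!qemu_loadvm_load_state_buffer(idstr, instance_id, buf, len, errp)) {
        return false;
    }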
> +{
> + SaveStateEntry *se;
> +
> + se = find_se(idstr, instance_id);
> + if (!se) {
> + error_setg(errp,
> + "Unknown idstr %s or instance id %u for load state buffer",
> + idstr, instance_id);
> + return -1;
> + }
> +
> + if (!se->ops || !se->ops->load_state_buffer) {
> + error_setg(errp,
> + "idstr %s / instance %u has no load state buffer operation",
> + idstr, instance_id);
> + return -1;
> + }
> +
> + return se->ops->load_state_buffer(se->opaque, buf, len, errp);
> +}
> +
> bool save_snapshot(const char *name, bool overwrite, const char *vmstate,
> bool has_devices, strList *devices, Error **errp)
> {
> diff --git a/migration/savevm.h b/migration/savevm.h
> index 4d402723bc3c..b5a4f8c8b440 100644
> --- a/migration/savevm.h
> +++ b/migration/savevm.h
> @@ -71,4 +71,7 @@ int qemu_loadvm_approve_switchover(void);
> int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
> bool in_postcopy, bool inactivate_disks);
>
> +int qemu_loadvm_load_state_buffer(const char *idstr, uint32_t instance_id,
> + char *buf, size_t len, Error **errp);
> +
> #endif
>
--
Peter Xu
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 06/24] migration: Add qemu_loadvm_load_state_buffer() and its handler
2024-12-04 21:32 ` Peter Xu
@ 2024-12-06 21:12 ` Maciej S. Szmigiero
0 siblings, 0 replies; 140+ messages in thread
From: Maciej S. Szmigiero @ 2024-12-06 21:12 UTC (permalink / raw)
To: Peter Xu
Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P. Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On 4.12.2024 22:32, Peter Xu wrote:
> On Sun, Nov 17, 2024 at 08:20:01PM +0100, Maciej S. Szmigiero wrote:
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> qemu_loadvm_load_state_buffer() and its load_state_buffer
>> SaveVMHandler allow providing a device state buffer to an explicitly
>> specified device via its idstr and instance id.
>>
>> Reviewed-by: Fabiano Rosas <farosas@suse.de>
>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>
> Reviewed-by: Peter Xu <peterx@redhat.com>
>
> One nitpick:
>
>> ---
>> include/migration/register.h | 17 +++++++++++++++++
>> migration/savevm.c | 23 +++++++++++++++++++++++
>> migration/savevm.h | 3 +++
>> 3 files changed, 43 insertions(+)
>>
>> diff --git a/include/migration/register.h b/include/migration/register.h
>> index ff0faf5f68c8..39991f3cc5d0 100644
>> --- a/include/migration/register.h
>> +++ b/include/migration/register.h
>> @@ -229,6 +229,23 @@ typedef struct SaveVMHandlers {
>> */
>> int (*load_state)(QEMUFile *f, void *opaque, int version_id);
>>
>> + /* This runs outside the BQL. */
>> +
>> + /**
>> + * @load_state_buffer
>> + *
>> + * Load device state buffer provided to qemu_loadvm_load_state_buffer().
>> + *
>> + * @opaque: data pointer passed to register_savevm_live()
>> + * @buf: the data buffer to load
>> + * @len: the data length in buffer
>> + * @errp: pointer to Error*, to store an error if it happens.
>> + *
>> + * Returns zero to indicate success and negative for error
>> + */
>> + int (*load_state_buffer)(void *opaque, char *buf, size_t len,
>> + Error **errp);
>> +
>> /**
>> * @load_setup
>> *
>> diff --git a/migration/savevm.c b/migration/savevm.c
>> index a254c38edcca..1f58a2fa54ae 100644
>> --- a/migration/savevm.c
>> +++ b/migration/savevm.c
>> @@ -3085,6 +3085,29 @@ int qemu_loadvm_approve_switchover(void)
>> return migrate_send_rp_switchover_ack(mis);
>> }
>>
>> +int qemu_loadvm_load_state_buffer(const char *idstr, uint32_t instance_id,
>> + char *buf, size_t len, Error **errp)
>
> Suggest to always return bool as success/fail, especially when using
> Error**.
>
Will change the return type to bool then.
Thanks,
Maciej
^ permalink raw reply [flat|nested] 140+ messages in thread
* [PATCH v3 07/24] migration: Document the BQL behavior of load SaveVMHandlers
2024-11-17 19:19 [PATCH v3 00/24] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
` (5 preceding siblings ...)
2024-11-17 19:20 ` [PATCH v3 06/24] migration: Add qemu_loadvm_load_state_buffer() and its handler Maciej S. Szmigiero
@ 2024-11-17 19:20 ` Maciej S. Szmigiero
2024-12-04 21:38 ` Peter Xu
2024-11-17 19:20 ` [PATCH v3 08/24] migration: Add thread pool of optional load threads Maciej S. Szmigiero
` (18 subsequent siblings)
25 siblings, 1 reply; 140+ messages in thread
From: Maciej S. Szmigiero @ 2024-11-17 19:20 UTC (permalink / raw)
To: Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
Some of these SaveVMHandlers were missing the BQL behavior annotation,
making people wonder what exactly it is.
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
include/migration/register.h | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/include/migration/register.h b/include/migration/register.h
index 39991f3cc5d0..761e4e4d8bcb 100644
--- a/include/migration/register.h
+++ b/include/migration/register.h
@@ -212,6 +212,8 @@ typedef struct SaveVMHandlers {
void (*state_pending_exact)(void *opaque, uint64_t *must_precopy,
uint64_t *can_postcopy);
+ /* This runs inside the BQL. */
+
/**
* @load_state
*
@@ -246,6 +248,8 @@ typedef struct SaveVMHandlers {
int (*load_state_buffer)(void *opaque, char *buf, size_t len,
Error **errp);
+ /* The following handlers run inside the BQL. */
+
/**
* @load_setup
*
@@ -272,6 +276,9 @@ typedef struct SaveVMHandlers {
*/
int (*load_cleanup)(void *opaque);
+
+ /* This runs outside the BQL. */
+
/**
* @resume_prepare
*
@@ -284,6 +291,8 @@ typedef struct SaveVMHandlers {
*/
int (*resume_prepare)(MigrationState *s, void *opaque);
+ /* The following handlers run inside the BQL. */
+
/**
* @switchover_ack_needed
*
^ permalink raw reply related [flat|nested] 140+ messages in thread
* Re: [PATCH v3 07/24] migration: Document the BQL behavior of load SaveVMHandlers
2024-11-17 19:20 ` [PATCH v3 07/24] migration: Document the BQL behavior of load SaveVMHandlers Maciej S. Szmigiero
@ 2024-12-04 21:38 ` Peter Xu
2024-12-06 18:40 ` Maciej S. Szmigiero
0 siblings, 1 reply; 140+ messages in thread
From: Peter Xu @ 2024-12-04 21:38 UTC (permalink / raw)
To: Maciej S. Szmigiero
Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On Sun, Nov 17, 2024 at 08:20:02PM +0100, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>
> Some of these SaveVMHandlers were missing the BQL behavior annotation,
> making people wonder what exactly it is.
>
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> ---
> include/migration/register.h | 9 +++++++++
> 1 file changed, 9 insertions(+)
>
> diff --git a/include/migration/register.h b/include/migration/register.h
> index 39991f3cc5d0..761e4e4d8bcb 100644
> --- a/include/migration/register.h
> +++ b/include/migration/register.h
> @@ -212,6 +212,8 @@ typedef struct SaveVMHandlers {
> void (*state_pending_exact)(void *opaque, uint64_t *must_precopy,
> uint64_t *can_postcopy);
>
> + /* This runs inside the BQL. */
> +
> /**
> * @load_state
> *
> @@ -246,6 +248,8 @@ typedef struct SaveVMHandlers {
> int (*load_state_buffer)(void *opaque, char *buf, size_t len,
> Error **errp);
>
> + /* The following handlers run inside the BQL. */
> +
> /**
> * @load_setup
> *
> @@ -272,6 +276,9 @@ typedef struct SaveVMHandlers {
> */
> int (*load_cleanup)(void *opaque);
>
> +
> + /* This runs outside the BQL. */
> +
> /**
> * @resume_prepare
> *
> @@ -284,6 +291,8 @@ typedef struct SaveVMHandlers {
> */
> int (*resume_prepare)(MigrationState *s, void *opaque);
>
> + /* The following handlers run inside the BQL. */
> +
> /**
> * @switchover_ack_needed
> *
>
Such a change is not only error-prone when adding new hooks, it's also hard
to review.
If we do care about that, I suggest we attach that info to every command.
For example, changing from:
/**
* @save_state
* ...
To:
/**
* @save_state (invoked with BQL)
* ...
Or somewhere in the doc lines of each hook.
--
Peter Xu
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 07/24] migration: Document the BQL behavior of load SaveVMHandlers
2024-12-04 21:38 ` Peter Xu
@ 2024-12-06 18:40 ` Maciej S. Szmigiero
2024-12-06 22:15 ` Peter Xu
0 siblings, 1 reply; 140+ messages in thread
From: Maciej S. Szmigiero @ 2024-12-06 18:40 UTC (permalink / raw)
To: Peter Xu
Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P. Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On 4.12.2024 22:38, Peter Xu wrote:
> On Sun, Nov 17, 2024 at 08:20:02PM +0100, Maciej S. Szmigiero wrote:
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> Some of these SaveVMHandlers were missing the BQL behavior annotation,
>> making people wonder what exactly it is.
>>
>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>> ---
>> include/migration/register.h | 9 +++++++++
>> 1 file changed, 9 insertions(+)
>>
>> diff --git a/include/migration/register.h b/include/migration/register.h
>> index 39991f3cc5d0..761e4e4d8bcb 100644
>> --- a/include/migration/register.h
>> +++ b/include/migration/register.h
>> @@ -212,6 +212,8 @@ typedef struct SaveVMHandlers {
>> void (*state_pending_exact)(void *opaque, uint64_t *must_precopy,
>> uint64_t *can_postcopy);
>>
>> + /* This runs inside the BQL. */
>> +
>> /**
>> * @load_state
>> *
>> @@ -246,6 +248,8 @@ typedef struct SaveVMHandlers {
>> int (*load_state_buffer)(void *opaque, char *buf, size_t len,
>> Error **errp);
>>
>> + /* The following handlers run inside the BQL. */
>> +
>> /**
>> * @load_setup
>> *
>> @@ -272,6 +276,9 @@ typedef struct SaveVMHandlers {
>> */
>> int (*load_cleanup)(void *opaque);
>>
>> +
>> + /* This runs outside the BQL. */
>> +
>> /**
>> * @resume_prepare
>> *
>> @@ -284,6 +291,8 @@ typedef struct SaveVMHandlers {
>> */
>> int (*resume_prepare)(MigrationState *s, void *opaque);
>>
>> + /* The following handlers run inside the BQL. */
>> +
>> /**
>> * @switchover_ack_needed
>> *
>>
>
> Such a change is not only error-prone when adding new hooks, it's also hard
> to review.
>
> If we do care about that, I suggest we attach that info to every command.
> For example, changing from:
>
> /**
> * @save_state
> * ...
>
> To:
>
> /**
> * @save_state (invoked with BQL)
> * ...
>
> Or somewhere in the doc lines of each hook.
>
This would need rewriting all the existing BQL comments/annotations
in SaveVMHandlers, since all of them are of the same "separator" form
as those introduced in this patch.
Thanks,
Maciej
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 07/24] migration: Document the BQL behavior of load SaveVMHandlers
2024-12-06 18:40 ` Maciej S. Szmigiero
@ 2024-12-06 22:15 ` Peter Xu
0 siblings, 0 replies; 140+ messages in thread
From: Peter Xu @ 2024-12-06 22:15 UTC (permalink / raw)
To: Maciej S. Szmigiero
Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P. Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On Fri, Dec 06, 2024 at 07:40:19PM +0100, Maciej S. Szmigiero wrote:
> On 4.12.2024 22:38, Peter Xu wrote:
> > On Sun, Nov 17, 2024 at 08:20:02PM +0100, Maciej S. Szmigiero wrote:
> > > From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> > >
> > > Some of these SaveVMHandlers were missing the BQL behavior annotation,
> > > making people wonder what exactly it is.
> > >
> > > Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> > > ---
> > > include/migration/register.h | 9 +++++++++
> > > 1 file changed, 9 insertions(+)
> > >
> > > diff --git a/include/migration/register.h b/include/migration/register.h
> > > index 39991f3cc5d0..761e4e4d8bcb 100644
> > > --- a/include/migration/register.h
> > > +++ b/include/migration/register.h
> > > @@ -212,6 +212,8 @@ typedef struct SaveVMHandlers {
> > > void (*state_pending_exact)(void *opaque, uint64_t *must_precopy,
> > > uint64_t *can_postcopy);
> > > + /* This runs inside the BQL. */
> > > +
> > > /**
> > > * @load_state
> > > *
> > > @@ -246,6 +248,8 @@ typedef struct SaveVMHandlers {
> > > int (*load_state_buffer)(void *opaque, char *buf, size_t len,
> > > Error **errp);
> > > + /* The following handlers run inside the BQL. */
> > > +
> > > /**
> > > * @load_setup
> > > *
> > > @@ -272,6 +276,9 @@ typedef struct SaveVMHandlers {
> > > */
> > > int (*load_cleanup)(void *opaque);
> > > +
> > > + /* This runs outside the BQL. */
> > > +
> > > /**
> > > * @resume_prepare
> > > *
> > > @@ -284,6 +291,8 @@ typedef struct SaveVMHandlers {
> > > */
> > > int (*resume_prepare)(MigrationState *s, void *opaque);
> > > + /* The following handlers run inside the BQL. */
> > > +
> > > /**
> > > * @switchover_ack_needed
> > > *
> > >
> >
> > Such a change is not only error-prone when adding new hooks, it's also hard
> > to review.
> >
> > If we do care about that, I suggest we attach that info to every command.
> > For example, changing from:
> >
> > /**
> > * @save_state
> > * ...
> >
> > To:
> >
> > /**
> > * @save_state (invoked with BQL)
> > * ...
> >
> > Or somewhere in the doc lines of each hook.
> >
>
> This would need rewriting all the existing BQL comments/annotations
> in SaveVMHandlers, since all of them are of the same "separator" form
> as those introduced in this patch.
Yeah, I'd go for it if I were touching it. But it's your call: either use
this (it'll need 5 extra minutes for me to review such a change, but it's
ok), or go with what I said, or drop this patch and leave it for later.
--
Peter Xu
^ permalink raw reply [flat|nested] 140+ messages in thread
* [PATCH v3 08/24] migration: Add thread pool of optional load threads
2024-11-17 19:19 [PATCH v3 00/24] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
` (6 preceding siblings ...)
2024-11-17 19:20 ` [PATCH v3 07/24] migration: Document the BQL behavior of load SaveVMHandlers Maciej S. Szmigiero
@ 2024-11-17 19:20 ` Maciej S. Szmigiero
2024-11-25 19:58 ` Fabiano Rosas
` (2 more replies)
2024-11-17 19:20 ` [PATCH v3 09/24] migration/multifd: Split packet into header and RAM data Maciej S. Szmigiero
` (17 subsequent siblings)
25 siblings, 3 replies; 140+ messages in thread
From: Maciej S. Szmigiero @ 2024-11-17 19:20 UTC (permalink / raw)
To: Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
Some drivers might want to make use of auxiliary helper threads during VM
state loading, for example to make sure that their blocking (sync) I/O
operations don't block the rest of the migration process.
Add a migration-core-managed thread pool to facilitate this use case.
The migration core will wait for these threads to finish before
(re)starting the VM at the destination.
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
include/migration/misc.h | 3 ++
include/qemu/typedefs.h | 1 +
migration/savevm.c | 77 ++++++++++++++++++++++++++++++++++++++++
3 files changed, 81 insertions(+)
diff --git a/include/migration/misc.h b/include/migration/misc.h
index 804eb23c0607..c92ca018ab3b 100644
--- a/include/migration/misc.h
+++ b/include/migration/misc.h
@@ -45,9 +45,12 @@ bool migrate_ram_is_ignored(RAMBlock *block);
/* migration/block.c */
AnnounceParameters *migrate_announce_params(void);
+
/* migration/savevm.c */
void dump_vmstate_json_to_file(FILE *out_fp);
+void qemu_loadvm_start_load_thread(MigrationLoadThread function,
+ void *opaque);
/* migration/migration.c */
void migration_object_init(void);
diff --git a/include/qemu/typedefs.h b/include/qemu/typedefs.h
index 3d84efcac47a..8c8ea5c2840d 100644
--- a/include/qemu/typedefs.h
+++ b/include/qemu/typedefs.h
@@ -131,5 +131,6 @@ typedef struct IRQState *qemu_irq;
* Function types
*/
typedef void (*qemu_irq_handler)(void *opaque, int n, int level);
+typedef int (*MigrationLoadThread)(bool *abort_flag, void *opaque);
#endif /* QEMU_TYPEDEFS_H */
diff --git a/migration/savevm.c b/migration/savevm.c
index 1f58a2fa54ae..6ea9054c4083 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -54,6 +54,7 @@
#include "qemu/job.h"
#include "qemu/main-loop.h"
#include "block/snapshot.h"
+#include "block/thread-pool.h"
#include "qemu/cutils.h"
#include "io/channel-buffer.h"
#include "io/channel-file.h"
@@ -71,6 +72,10 @@
const unsigned int postcopy_ram_discard_version;
+static ThreadPool *load_threads;
+static int load_threads_ret;
+static bool load_threads_abort;
+
/* Subcommands for QEMU_VM_COMMAND */
enum qemu_vm_cmd {
MIG_CMD_INVALID = 0, /* Must be 0 */
@@ -2788,6 +2793,12 @@ static int qemu_loadvm_state_setup(QEMUFile *f, Error **errp)
int ret;
trace_loadvm_state_setup();
+
+ assert(!load_threads);
+ load_threads = thread_pool_new();
+ load_threads_ret = 0;
+ load_threads_abort = false;
+
QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
if (!se->ops || !se->ops->load_setup) {
continue;
@@ -2806,19 +2817,72 @@ static int qemu_loadvm_state_setup(QEMUFile *f, Error **errp)
return ret;
}
}
+
+ return 0;
+}
+
+struct LoadThreadData {
+ MigrationLoadThread function;
+ void *opaque;
+};
+
+static int qemu_loadvm_load_thread(void *thread_opaque)
+{
+ struct LoadThreadData *data = thread_opaque;
+ int ret;
+
+ ret = data->function(&load_threads_abort, data->opaque);
+ if (ret && !qatomic_read(&load_threads_ret)) {
+ /*
+ * Racy with the above read but that's okay - which thread error
+ * return we report is purely arbitrary anyway.
+ */
+ qatomic_set(&load_threads_ret, ret);
+ }
+
return 0;
}
+void qemu_loadvm_start_load_thread(MigrationLoadThread function,
+ void *opaque)
+{
+ struct LoadThreadData *data;
+
+ /* We only set it from this thread so it's okay to read it directly */
+ assert(!load_threads_abort);
+
+ data = g_new(struct LoadThreadData, 1);
+ data->function = function;
+ data->opaque = opaque;
+
+ thread_pool_submit(load_threads, qemu_loadvm_load_thread,
+ data, g_free);
+ thread_pool_adjust_max_threads_to_work(load_threads);
+}
+
void qemu_loadvm_state_cleanup(void)
{
SaveStateEntry *se;
trace_loadvm_state_cleanup();
+
QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
if (se->ops && se->ops->load_cleanup) {
se->ops->load_cleanup(se->opaque);
}
}
+
+ /*
+ * We might be called even without earlier qemu_loadvm_state_setup()
+ * call if qemu_loadvm_state() fails very early.
+ */
+ if (load_threads) {
+ qatomic_set(&load_threads_abort, true);
+ bql_unlock(); /* Load threads might be waiting for BQL */
+ thread_pool_wait(load_threads);
+ bql_lock();
+ g_clear_pointer(&load_threads, thread_pool_free);
+ }
}
/* Return true if we should continue the migration, or false. */
@@ -3007,6 +3071,19 @@ int qemu_loadvm_state(QEMUFile *f)
return ret;
}
+ if (ret == 0) {
+ bql_unlock(); /* Let load threads do work requiring BQL */
+ thread_pool_wait(load_threads);
+ bql_lock();
+
+ ret = load_threads_ret;
+ }
+ /*
+ * Set this flag unconditionally so we'll catch further attempts to
+ * start additional threads via an appropriate assert()
+ */
+ qatomic_set(&load_threads_abort, true);
+
if (ret == 0) {
ret = qemu_file_get_error(f);
}
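To make the API above concrete, here is a rough sketch of how a device
could consume this pool - the my_dev_* names and state are hypothetical;
only qemu_loadvm_start_load_thread() and the MigrationLoadThread signature
come from this patch (VFIO's real consumer arrives later in the series):
    /* Hypothetical device callback matching MigrationLoadThread */
    static int my_dev_load_thread(bool *abort_flag, void *opaque)
    {
        MyDevState *s = opaque;                 /* hypothetical state */
        while (my_dev_has_pending_buffer(s)) {  /* hypothetical */
            if (qatomic_read(abort_flag)) {
                /* The migration core asked all load threads to stop */
                return -EINTR;
            }
            /* Blocking (sync) I/O is fine here, off the main loop */
            my_dev_load_one_buffer(s);          /* hypothetical */
        }
        return 0;
    }
    /* Queued e.g. from the device's load_setup handler: */
    qemu_loadvm_start_load_thread(my_dev_load_thread, s);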
^ permalink raw reply related [flat|nested] 140+ messages in thread
* Re: [PATCH v3 08/24] migration: Add thread pool of optional load threads
2024-11-17 19:20 ` [PATCH v3 08/24] migration: Add thread pool of optional load threads Maciej S. Szmigiero
@ 2024-11-25 19:58 ` Fabiano Rosas
2024-11-27 9:13 ` Cédric Le Goater
2024-11-28 10:26 ` Avihai Horon
2 siblings, 0 replies; 140+ messages in thread
From: Fabiano Rosas @ 2024-11-25 19:58 UTC (permalink / raw)
To: Maciej S. Szmigiero, Peter Xu
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
"Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>
> Some drivers might want to make use of auxiliary helper threads during VM
> state loading, for example to make sure that their blocking (sync) I/O
> operations don't block the rest of the migration process.
>
> Add a migration core managed thread pool to facilitate this use case.
>
> The migration core will wait for these threads to finish before
> (re)starting the VM at destination.
>
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 08/24] migration: Add thread pool of optional load threads
2024-11-17 19:20 ` [PATCH v3 08/24] migration: Add thread pool of optional load threads Maciej S. Szmigiero
2024-11-25 19:58 ` Fabiano Rosas
@ 2024-11-27 9:13 ` Cédric Le Goater
2024-11-27 20:16 ` Maciej S. Szmigiero
2024-11-28 10:26 ` Avihai Horon
2 siblings, 1 reply; 140+ messages in thread
From: Cédric Le Goater @ 2024-11-27 9:13 UTC (permalink / raw)
To: Maciej S. Szmigiero, Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Eric Blake, Markus Armbruster,
Daniel P . Berrangé, Avihai Horon, Joao Martins, qemu-devel
On 11/17/24 20:20, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>
> Some drivers might want to make use of auxiliary helper threads during VM
> state loading, for example to make sure that their blocking (sync) I/O
> operations don't block the rest of the migration process.
>
> Add a migration core managed thread pool to facilitate this use case.
>
> The migration core will wait for these threads to finish before
> (re)starting the VM at destination.
>
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> ---
> include/migration/misc.h | 3 ++
> include/qemu/typedefs.h | 1 +
> migration/savevm.c | 77 ++++++++++++++++++++++++++++++++++++++++
> 3 files changed, 81 insertions(+)
>
> diff --git a/include/migration/misc.h b/include/migration/misc.h
> index 804eb23c0607..c92ca018ab3b 100644
> --- a/include/migration/misc.h
> +++ b/include/migration/misc.h
> @@ -45,9 +45,12 @@ bool migrate_ram_is_ignored(RAMBlock *block);
> /* migration/block.c */
>
> AnnounceParameters *migrate_announce_params(void);
> +
> /* migration/savevm.c */
>
> void dump_vmstate_json_to_file(FILE *out_fp);
> +void qemu_loadvm_start_load_thread(MigrationLoadThread function,
> + void *opaque);
>
> /* migration/migration.c */
> void migration_object_init(void);
> diff --git a/include/qemu/typedefs.h b/include/qemu/typedefs.h
> index 3d84efcac47a..8c8ea5c2840d 100644
> --- a/include/qemu/typedefs.h
> +++ b/include/qemu/typedefs.h
> @@ -131,5 +131,6 @@ typedef struct IRQState *qemu_irq;
> * Function types
> */
> typedef void (*qemu_irq_handler)(void *opaque, int n, int level);
> +typedef int (*MigrationLoadThread)(bool *abort_flag, void *opaque);
>
> #endif /* QEMU_TYPEDEFS_H */
> diff --git a/migration/savevm.c b/migration/savevm.c
> index 1f58a2fa54ae..6ea9054c4083 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -54,6 +54,7 @@
> #include "qemu/job.h"
> #include "qemu/main-loop.h"
> #include "block/snapshot.h"
> +#include "block/thread-pool.h"
> #include "qemu/cutils.h"
> #include "io/channel-buffer.h"
> #include "io/channel-file.h"
> @@ -71,6 +72,10 @@
>
> const unsigned int postcopy_ram_discard_version;
>
> +static ThreadPool *load_threads;
> +static int load_threads_ret;
> +static bool load_threads_abort;
> +
> /* Subcommands for QEMU_VM_COMMAND */
> enum qemu_vm_cmd {
> MIG_CMD_INVALID = 0, /* Must be 0 */
> @@ -2788,6 +2793,12 @@ static int qemu_loadvm_state_setup(QEMUFile *f, Error **errp)
> int ret;
>
> trace_loadvm_state_setup();
> +
> + assert(!load_threads);
> + load_threads = thread_pool_new();
> + load_threads_ret = 0;
> + load_threads_abort = false;
I would introduce a qemu_loadvm_thread_pool_create() helper.
Why is the thread pool always created ? Might be OK.
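For illustration, such a helper could simply factor out the quoted lines -
the name is the suggested one, the body mirrors the patch:
    static void qemu_loadvm_thread_pool_create(void)
    {
        assert(!load_threads);
        load_threads = thread_pool_new();
        load_threads_ret = 0;
        load_threads_abort = false;
    }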
> +
> QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> if (!se->ops || !se->ops->load_setup) {
> continue;
> @@ -2806,19 +2817,72 @@ static int qemu_loadvm_state_setup(QEMUFile *f, Error **errp)
> return ret;
> }
> }
> +
> + return 0;
> +}
> +
> +struct LoadThreadData {
> + MigrationLoadThread function;
> + void *opaque;
> +};
> +
> +static int qemu_loadvm_load_thread(void *thread_opaque)
> +{
> + struct LoadThreadData *data = thread_opaque;
> + int ret;
> +
> + ret = data->function(&load_threads_abort, data->opaque);
> + if (ret && !qatomic_read(&load_threads_ret)) {
> + /*
> + * Racy with the above read but that's okay - which thread error
> + * return we report is purely arbitrary anyway.
> + */
> + qatomic_set(&load_threads_ret, ret);
> + }
> +
> return 0;
> }
>
> +void qemu_loadvm_start_load_thread(MigrationLoadThread function,
> + void *opaque)
> +{
> + struct LoadThreadData *data;
> +
> + /* We only set it from this thread so it's okay to read it directly */
> + assert(!load_threads_abort);
> +
> + data = g_new(struct LoadThreadData, 1);
> + data->function = function;
> + data->opaque = opaque;
> +
> + thread_pool_submit(load_threads, qemu_loadvm_load_thread,
> + data, g_free);
> + thread_pool_adjust_max_threads_to_work(load_threads);
> +}
> +
> void qemu_loadvm_state_cleanup(void)
> {
> SaveStateEntry *se;
>
> trace_loadvm_state_cleanup();
> +
> QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> if (se->ops && se->ops->load_cleanup) {
> se->ops->load_cleanup(se->opaque);
> }
> }
> +
> + /*
> + * We might be called even without earlier qemu_loadvm_state_setup()
> + * call if qemu_loadvm_state() fails very early.
> + */
> + if (load_threads) {
> + qatomic_set(&load_threads_abort, true);
> + bql_unlock(); /* Load threads might be waiting for BQL */
> + thread_pool_wait(load_threads);
> + bql_lock();
> + g_clear_pointer(&load_threads, thread_pool_free);
> + }
I would introduce a qemu_loadvm_thread_pool_destroy() helper
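Again for illustration, a sketch factoring out the quoted block - the NULL
check keeps the "cleanup without prior setup" case working:
    static void qemu_loadvm_thread_pool_destroy(void)
    {
        if (!load_threads) {
            /* qemu_loadvm_state_setup() may never have run */
            return;
        }
        qatomic_set(&load_threads_abort, true);
        bql_unlock(); /* Load threads might be waiting for BQL */
        thread_pool_wait(load_threads);
        bql_lock();
        g_clear_pointer(&load_threads, thread_pool_free);
    }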
> }
>
> /* Return true if we should continue the migration, or false. */
> @@ -3007,6 +3071,19 @@ int qemu_loadvm_state(QEMUFile *f)
> return ret;
> }
>
> + if (ret == 0) {
> + bql_unlock(); /* Let load threads do work requiring BQL */
> + thread_pool_wait(load_threads);
> + bql_lock();
> +
> + ret = load_threads_ret;
> + }
> + /*
> + * Set this flag unconditionally so we'll catch further attempts to
> + * start additional threads via an appropriate assert()
> + */
> + qatomic_set(&load_threads_abort, true);
> +
I would introduce a qemu_loadvm_thread_pool_wait() helper
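A corresponding sketch, again just wrapping the quoted lines:
    /* Returns 0 on success, or the first error a load thread reported */
    static int qemu_loadvm_thread_pool_wait(void)
    {
        bql_unlock(); /* Let load threads do work requiring BQL */
        thread_pool_wait(load_threads);
        bql_lock();
        return qatomic_read(&load_threads_ret);
    }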
> if (ret == 0) {
> ret = qemu_file_get_error(f);
> }
>
I think we could hide the implementation in a new component of
the migration subsystem or, at least, we could group the
implementation at the top of the file. It would help the uninitiated
reader to become familiar with the migration area.
Thanks,
C.
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 08/24] migration: Add thread pool of optional load threads
2024-11-27 9:13 ` Cédric Le Goater
@ 2024-11-27 20:16 ` Maciej S. Szmigiero
2024-12-04 22:48 ` Peter Xu
0 siblings, 1 reply; 140+ messages in thread
From: Maciej S. Szmigiero @ 2024-11-27 20:16 UTC (permalink / raw)
To: Cédric Le Goater
Cc: Alex Williamson, Eric Blake, Peter Xu, Fabiano Rosas,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On 27.11.2024 10:13, Cédric Le Goater wrote:
> On 11/17/24 20:20, Maciej S. Szmigiero wrote:
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> Some drivers might want to make use of auxiliary helper threads during VM
>> state loading, for example to make sure that their blocking (sync) I/O
>> operations don't block the rest of the migration process.
>>
>> Add a migration core managed thread pool to facilitate this use case.
>>
>> The migration core will wait for these threads to finish before
>> (re)starting the VM at destination.
>>
>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>> ---
>> include/migration/misc.h | 3 ++
>> include/qemu/typedefs.h | 1 +
>> migration/savevm.c | 77 ++++++++++++++++++++++++++++++++++++++++
>> 3 files changed, 81 insertions(+)
>>
>> diff --git a/include/migration/misc.h b/include/migration/misc.h
>> index 804eb23c0607..c92ca018ab3b 100644
>> --- a/include/migration/misc.h
>> +++ b/include/migration/misc.h
>> @@ -45,9 +45,12 @@ bool migrate_ram_is_ignored(RAMBlock *block);
>> /* migration/block.c */
>> AnnounceParameters *migrate_announce_params(void);
>> +
>> /* migration/savevm.c */
>> void dump_vmstate_json_to_file(FILE *out_fp);
>> +void qemu_loadvm_start_load_thread(MigrationLoadThread function,
>> + void *opaque);
>> /* migration/migration.c */
>> void migration_object_init(void);
>> diff --git a/include/qemu/typedefs.h b/include/qemu/typedefs.h
>> index 3d84efcac47a..8c8ea5c2840d 100644
>> --- a/include/qemu/typedefs.h
>> +++ b/include/qemu/typedefs.h
>> @@ -131,5 +131,6 @@ typedef struct IRQState *qemu_irq;
>> * Function types
>> */
>> typedef void (*qemu_irq_handler)(void *opaque, int n, int level);
>> +typedef int (*MigrationLoadThread)(bool *abort_flag, void *opaque);
>> #endif /* QEMU_TYPEDEFS_H */
>> diff --git a/migration/savevm.c b/migration/savevm.c
>> index 1f58a2fa54ae..6ea9054c4083 100644
>> --- a/migration/savevm.c
>> +++ b/migration/savevm.c
>> @@ -54,6 +54,7 @@
>> #include "qemu/job.h"
>> #include "qemu/main-loop.h"
>> #include "block/snapshot.h"
>> +#include "block/thread-pool.h"
>> #include "qemu/cutils.h"
>> #include "io/channel-buffer.h"
>> #include "io/channel-file.h"
>> @@ -71,6 +72,10 @@
>> const unsigned int postcopy_ram_discard_version;
>> +static ThreadPool *load_threads;
>> +static int load_threads_ret;
>> +static bool load_threads_abort;
>> +
>> /* Subcommands for QEMU_VM_COMMAND */
>> enum qemu_vm_cmd {
>> MIG_CMD_INVALID = 0, /* Must be 0 */
>> @@ -2788,6 +2793,12 @@ static int qemu_loadvm_state_setup(QEMUFile *f, Error **errp)
>> int ret;
>> trace_loadvm_state_setup();
>> +
>> + assert(!load_threads);
>> + load_threads = thread_pool_new();
>> + load_threads_ret = 0;
>> + load_threads_abort = false;
>
> I would introduce a qemu_loadvm_thread_pool_create() helper.
Will do.
> Why is the thread pool always created ?
>
This functionality provides a generic pool of auxiliary load helper
threads, not necessarily tied to the multifd device state transfer.
That's why the pool is created unconditionally.
>> +
>> QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
>> if (!se->ops || !se->ops->load_setup) {
>> continue;
>> @@ -2806,19 +2817,72 @@ static int qemu_loadvm_state_setup(QEMUFile *f, Error **errp)
>> return ret;
>> }
>> }
>> +
>> + return 0;
>> +}
>> +
>> +struct LoadThreadData {
>> + MigrationLoadThread function;
>> + void *opaque;
>> +};
>> +
>> +static int qemu_loadvm_load_thread(void *thread_opaque)
>> +{
>> + struct LoadThreadData *data = thread_opaque;
>> + int ret;
>> +
>> + ret = data->function(&load_threads_abort, data->opaque);
>> + if (ret && !qatomic_read(&load_threads_ret)) {
>> + /*
>> + * Racy with the above read but that's okay - which thread error
>> + * return we report is purely arbitrary anyway.
>> + */
>> + qatomic_set(&load_threads_ret, ret);
>> + }
>> +
>> return 0;
>> }
>> +void qemu_loadvm_start_load_thread(MigrationLoadThread function,
>> + void *opaque)
>> +{
>> + struct LoadThreadData *data;
>> +
>> + /* We only set it from this thread so it's okay to read it directly */
>> + assert(!load_threads_abort);
>> +
>> + data = g_new(struct LoadThreadData, 1);
>> + data->function = function;
>> + data->opaque = opaque;
>> +
>> + thread_pool_submit(load_threads, qemu_loadvm_load_thread,
>> + data, g_free);
>> + thread_pool_adjust_max_threads_to_work(load_threads);
>> +}
>> +
>> void qemu_loadvm_state_cleanup(void)
>> {
>> SaveStateEntry *se;
>> trace_loadvm_state_cleanup();
>> +
>> QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
>> if (se->ops && se->ops->load_cleanup) {
>> se->ops->load_cleanup(se->opaque);
>> }
>> }
>> +
>> + /*
>> + * We might be called even without earlier qemu_loadvm_state_setup()
>> + * call if qemu_loadvm_state() fails very early.
>> + */
>> + if (load_threads) {
>> + qatomic_set(&load_threads_abort, true);
>> + bql_unlock(); /* Load threads might be waiting for BQL */
>> + thread_pool_wait(load_threads);
>> + bql_lock();
>> + g_clear_pointer(&load_threads, thread_pool_free);
>> + }
>
> I would introduce a qemu_loadvm_thread_pool_destroy() helper
Will do.
>> }
>> /* Return true if we should continue the migration, or false. */
>> @@ -3007,6 +3071,19 @@ int qemu_loadvm_state(QEMUFile *f)
>> return ret;
>> }
>> + if (ret == 0) {
>> + bql_unlock(); /* Let load threads do work requiring BQL */
>> + thread_pool_wait(load_threads);
>> + bql_lock();
>> +
>> + ret = load_threads_ret;
>> + }
>> + /*
>> + * Set this flag unconditionally so we'll catch further attempts to
>> + * start additional threads via an appropriate assert()
>> + */
>> + qatomic_set(&load_threads_abort, true);
>> +
>
>
> I would introduce a qemu_loadvm_thread_pool_wait() helper
Will do.
>> if (ret == 0) {
>> ret = qemu_file_get_error(f);
>> }
>>
>
> I think we could hide the implementation in a new component of
> the migration subsystem or, at least, we could group the
> implementation at the top of the file. It would help the uninitiated
> reader to become familiar with the migration area.
I will move these new helpers to a separate area of the "savevm.c"
file, marked/separated by an appropriate comment.
> Thanks,
>
> C.
>
Thanks,
Maciej
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 08/24] migration: Add thread pool of optional load threads
2024-11-27 20:16 ` Maciej S. Szmigiero
@ 2024-12-04 22:48 ` Peter Xu
2024-12-05 16:15 ` Peter Xu
2024-12-10 23:05 ` Maciej S. Szmigiero
0 siblings, 2 replies; 140+ messages in thread
From: Peter Xu @ 2024-12-04 22:48 UTC (permalink / raw)
To: Maciej S. Szmigiero
Cc: Cédric Le Goater, Alex Williamson, Eric Blake, Fabiano Rosas,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On Wed, Nov 27, 2024 at 09:16:49PM +0100, Maciej S. Szmigiero wrote:
> On 27.11.2024 10:13, Cédric Le Goater wrote:
> > On 11/17/24 20:20, Maciej S. Szmigiero wrote:
> > > From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> > >
> > > Some drivers might want to make use of auxiliary helper threads during VM
> > > state loading, for example to make sure that their blocking (sync) I/O
> > > operations don't block the rest of the migration process.
> > >
> > > Add a migration core managed thread pool to facilitate this use case.
> > >
> > > The migration core will wait for these threads to finish before
> > > (re)starting the VM at destination.
> > >
> > > Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> > > ---
> > > include/migration/misc.h | 3 ++
> > > include/qemu/typedefs.h | 1 +
> > > migration/savevm.c | 77 ++++++++++++++++++++++++++++++++++++++++
> > > 3 files changed, 81 insertions(+)
> > >
> > > diff --git a/include/migration/misc.h b/include/migration/misc.h
> > > index 804eb23c0607..c92ca018ab3b 100644
> > > --- a/include/migration/misc.h
> > > +++ b/include/migration/misc.h
> > > @@ -45,9 +45,12 @@ bool migrate_ram_is_ignored(RAMBlock *block);
> > > /* migration/block.c */
> > > AnnounceParameters *migrate_announce_params(void);
> > > +
> > > /* migration/savevm.c */
> > > void dump_vmstate_json_to_file(FILE *out_fp);
> > > +void qemu_loadvm_start_load_thread(MigrationLoadThread function,
> > > + void *opaque);
> > > /* migration/migration.c */
> > > void migration_object_init(void);
> > > diff --git a/include/qemu/typedefs.h b/include/qemu/typedefs.h
> > > index 3d84efcac47a..8c8ea5c2840d 100644
> > > --- a/include/qemu/typedefs.h
> > > +++ b/include/qemu/typedefs.h
> > > @@ -131,5 +131,6 @@ typedef struct IRQState *qemu_irq;
> > > * Function types
> > > */
> > > typedef void (*qemu_irq_handler)(void *opaque, int n, int level);
> > > +typedef int (*MigrationLoadThread)(bool *abort_flag, void *opaque);
> > > #endif /* QEMU_TYPEDEFS_H */
> > > diff --git a/migration/savevm.c b/migration/savevm.c
> > > index 1f58a2fa54ae..6ea9054c4083 100644
> > > --- a/migration/savevm.c
> > > +++ b/migration/savevm.c
> > > @@ -54,6 +54,7 @@
> > > #include "qemu/job.h"
> > > #include "qemu/main-loop.h"
> > > #include "block/snapshot.h"
> > > +#include "block/thread-pool.h"
> > > #include "qemu/cutils.h"
> > > #include "io/channel-buffer.h"
> > > #include "io/channel-file.h"
> > > @@ -71,6 +72,10 @@
> > > const unsigned int postcopy_ram_discard_version;
> > > +static ThreadPool *load_threads;
> > > +static int load_threads_ret;
> > > +static bool load_threads_abort;
> > > +
> > > /* Subcommands for QEMU_VM_COMMAND */
> > > enum qemu_vm_cmd {
> > > MIG_CMD_INVALID = 0, /* Must be 0 */
> > > @@ -2788,6 +2793,12 @@ static int qemu_loadvm_state_setup(QEMUFile *f, Error **errp)
> > > int ret;
> > > trace_loadvm_state_setup();
> > > +
> > > + assert(!load_threads);
> > > + load_threads = thread_pool_new();
> > > + load_threads_ret = 0;
> > > + load_threads_abort = false;
> >
> > I would introduce a qemu_loadvm_thread_pool_create() helper.
>
> Will do.
On top of Cedric's suggestion..
Maybe move it over to migration_object_init()? Then we keep
qemu_loadvm_state_setup() only invoke the load_setup()s.
>
> > Why is the thread pool always created ?
> >
>
> This functionality provides a generic pool of auxiliary load helper
> threads, not necessarily tied to the multifd device state transfer.
>
> That's why the pool is created unconditionally.
>
> > > +
> > > QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> > > if (!se->ops || !se->ops->load_setup) {
> > > continue;
> > > @@ -2806,19 +2817,72 @@ static int qemu_loadvm_state_setup(QEMUFile *f, Error **errp)
> > > return ret;
> > > }
> > > }
> > > +
> > > + return 0;
> > > +}
> > > +
> > > +struct LoadThreadData {
> > > + MigrationLoadThread function;
> > > + void *opaque;
> > > +};
> > > +
> > > +static int qemu_loadvm_load_thread(void *thread_opaque)
> > > +{
> > > + struct LoadThreadData *data = thread_opaque;
> > > + int ret;
> > > +
> > > + ret = data->function(&load_threads_abort, data->opaque);
> > > + if (ret && !qatomic_read(&load_threads_ret)) {
> > > + /*
> > > + * Racy with the above read but that's okay - which thread error
> > > + * return we report is purely arbitrary anyway.
> > > + */
> > > + qatomic_set(&load_threads_ret, ret);
> > > + }
> > > +
> > > return 0;
> > > }
> > > +void qemu_loadvm_start_load_thread(MigrationLoadThread function,
> > > + void *opaque)
> > > +{
> > > + struct LoadThreadData *data;
> > > +
> > > + /* We only set it from this thread so it's okay to read it directly */
> > > + assert(!load_threads_abort);
> > > +
> > > + data = g_new(struct LoadThreadData, 1);
> > > + data->function = function;
> > > + data->opaque = opaque;
> > > +
> > > + thread_pool_submit(load_threads, qemu_loadvm_load_thread,
> > > + data, g_free);
> > > + thread_pool_adjust_max_threads_to_work(load_threads);
> > > +}
> > > +
> > > void qemu_loadvm_state_cleanup(void)
> > > {
> > > SaveStateEntry *se;
> > > trace_loadvm_state_cleanup();
> > > +
> > > QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> > > if (se->ops && se->ops->load_cleanup) {
> > > se->ops->load_cleanup(se->opaque);
> > > }
> > > }
> > > +
> > > + /*
> > > + * We might be called even without earlier qemu_loadvm_state_setup()
> > > + * call if qemu_loadvm_state() fails very early.
> > > + */
> > > + if (load_threads) {
> > > + qatomic_set(&load_threads_abort, true);
> > > + bql_unlock(); /* Load threads might be waiting for BQL */
> > > + thread_pool_wait(load_threads);
> > > + bql_lock();
> > > + g_clear_pointer(&load_threads, thread_pool_free);
> > > + }
> >
> > I would introduce a qemu_loadvm_thread_pool_destroy() helper
>
> Will do.
Then this one may belong to migration_incoming_state_destroy().
>
> > > }
> > > /* Return true if we should continue the migration, or false. */
> > > @@ -3007,6 +3071,19 @@ int qemu_loadvm_state(QEMUFile *f)
> > > return ret;
> > > }
> > > + if (ret == 0) {
> > > + bql_unlock(); /* Let load threads do work requiring BQL */
> > > + thread_pool_wait(load_threads);
> > > + bql_lock();
> > > +
> > > + ret = load_threads_ret;
> > > + }
> > > + /*
> > > + * Set this flag unconditionally so we'll catch further attempts to
> > > + * start additional threads via an appropriate assert()
> > > + */
> > > + qatomic_set(&load_threads_abort, true);
I assume this is only for debugging purpose and not required.
Setting "abort all threads" to make sure "nobody will add more thread
tasks" is pretty awkward, IMHO. If we really want to protect against it
and fail hard, it might be easier after the thread_pool_wait() we free the
pool directly (destroy() will see NULL so it'll skip; still need to free
there in case migration failed before this). Then any enqueue will access
null pointer on the pool.
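Roughly, the alternative being described would look like this (a sketch of
the idea, not what the series does):
    if (ret == 0) {
        bql_unlock(); /* Let load threads do work requiring BQL */
        thread_pool_wait(load_threads);
        bql_lock();
        ret = load_threads_ret;
    }
    /*
     * Free the pool right away; a late qemu_loadvm_start_load_thread()
     * would then dereference a NULL pool and fail hard.
     */
    g_clear_pointer(&load_threads, thread_pool_free);
with qemu_loadvm_state_cleanup() skipping the free when load_threads is
already NULL, but still freeing it when migration failed before this point.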
--
Peter Xu
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 08/24] migration: Add thread pool of optional load threads
2024-12-04 22:48 ` Peter Xu
@ 2024-12-05 16:15 ` Peter Xu
2024-12-10 23:05 ` Maciej S. Szmigiero
2024-12-10 23:05 ` Maciej S. Szmigiero
1 sibling, 1 reply; 140+ messages in thread
From: Peter Xu @ 2024-12-05 16:15 UTC (permalink / raw)
To: Maciej S. Szmigiero
Cc: Cédric Le Goater, Alex Williamson, Eric Blake, Fabiano Rosas,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On Wed, Dec 04, 2024 at 05:48:52PM -0500, Peter Xu wrote:
> > > > @@ -71,6 +72,10 @@
> > > > const unsigned int postcopy_ram_discard_version;
> > > > +static ThreadPool *load_threads;
> > > > +static int load_threads_ret;
> > > > +static bool load_threads_abort;
One thing I forgot to mention in the previous reply..
We should avoid adding random global vars. I hope we can still move these
into MigrationIncomingState.
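That is, something along these lines (illustrative placement only):
    struct MigrationIncomingState {
        ...
        /* Load-thread state, replacing the savevm.c globals */
        ThreadPool *load_threads;
        int load_threads_ret;
        bool load_threads_abort;
        ...
    };
with savevm.c reaching the fields via migration_incoming_get_current().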
--
Peter Xu
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 08/24] migration: Add thread pool of optional load threads
2024-12-05 16:15 ` Peter Xu
@ 2024-12-10 23:05 ` Maciej S. Szmigiero
0 siblings, 0 replies; 140+ messages in thread
From: Maciej S. Szmigiero @ 2024-12-10 23:05 UTC (permalink / raw)
To: Peter Xu
Cc: Cédric Le Goater, Alex Williamson, Eric Blake, Fabiano Rosas,
Markus Armbruster, Daniel P. Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On 5.12.2024 17:15, Peter Xu wrote:
> On Wed, Dec 04, 2024 at 05:48:52PM -0500, Peter Xu wrote:
>>>>> @@ -71,6 +72,10 @@
>>>>> const unsigned int postcopy_ram_discard_version;
>>>>> +static ThreadPool *load_threads;
>>>>> +static int load_threads_ret;
>>>>> +static bool load_threads_abort;
>
> One thing I forgot to mention in the previous reply..
>
> We should avoid adding random global vars. I hope we can still move these
> into MigrationIncomingState.
>
Sure, this should be possible even if the thread pool
initialization happens in qemu_loadvm_state_setup().
Thanks,
Maciej
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 08/24] migration: Add thread pool of optional load threads
2024-12-04 22:48 ` Peter Xu
2024-12-05 16:15 ` Peter Xu
@ 2024-12-10 23:05 ` Maciej S. Szmigiero
2024-12-12 16:38 ` Peter Xu
1 sibling, 1 reply; 140+ messages in thread
From: Maciej S. Szmigiero @ 2024-12-10 23:05 UTC (permalink / raw)
To: Peter Xu
Cc: Cédric Le Goater, Alex Williamson, Eric Blake, Fabiano Rosas,
Markus Armbruster, Daniel P. Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On 4.12.2024 23:48, Peter Xu wrote:
> On Wed, Nov 27, 2024 at 09:16:49PM +0100, Maciej S. Szmigiero wrote:
>> On 27.11.2024 10:13, Cédric Le Goater wrote:
>>> On 11/17/24 20:20, Maciej S. Szmigiero wrote:
>>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>>
>>>> Some drivers might want to make use of auxiliary helper threads during VM
>>>> state loading, for example to make sure that their blocking (sync) I/O
>>>> operations don't block the rest of the migration process.
>>>>
>>>> Add a migration core managed thread pool to facilitate this use case.
>>>>
>>>> The migration core will wait for these threads to finish before
>>>> (re)starting the VM at destination.
>>>>
>>>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>>>> ---
>>>> include/migration/misc.h | 3 ++
>>>> include/qemu/typedefs.h | 1 +
>>>> migration/savevm.c | 77 ++++++++++++++++++++++++++++++++++++++++
>>>> 3 files changed, 81 insertions(+)
>>>>
>>>> diff --git a/include/migration/misc.h b/include/migration/misc.h
>>>> index 804eb23c0607..c92ca018ab3b 100644
>>>> --- a/include/migration/misc.h
>>>> +++ b/include/migration/misc.h
>>>> @@ -45,9 +45,12 @@ bool migrate_ram_is_ignored(RAMBlock *block);
>>>> /* migration/block.c */
>>>> AnnounceParameters *migrate_announce_params(void);
>>>> +
>>>> /* migration/savevm.c */
>>>> void dump_vmstate_json_to_file(FILE *out_fp);
>>>> +void qemu_loadvm_start_load_thread(MigrationLoadThread function,
>>>> + void *opaque);
>>>> /* migration/migration.c */
>>>> void migration_object_init(void);
>>>> diff --git a/include/qemu/typedefs.h b/include/qemu/typedefs.h
>>>> index 3d84efcac47a..8c8ea5c2840d 100644
>>>> --- a/include/qemu/typedefs.h
>>>> +++ b/include/qemu/typedefs.h
>>>> @@ -131,5 +131,6 @@ typedef struct IRQState *qemu_irq;
>>>> * Function types
>>>> */
>>>> typedef void (*qemu_irq_handler)(void *opaque, int n, int level);
>>>> +typedef int (*MigrationLoadThread)(bool *abort_flag, void *opaque);
>>>> #endif /* QEMU_TYPEDEFS_H */
>>>> diff --git a/migration/savevm.c b/migration/savevm.c
>>>> index 1f58a2fa54ae..6ea9054c4083 100644
>>>> --- a/migration/savevm.c
>>>> +++ b/migration/savevm.c
>>>> @@ -54,6 +54,7 @@
>>>> #include "qemu/job.h"
>>>> #include "qemu/main-loop.h"
>>>> #include "block/snapshot.h"
>>>> +#include "block/thread-pool.h"
>>>> #include "qemu/cutils.h"
>>>> #include "io/channel-buffer.h"
>>>> #include "io/channel-file.h"
>>>> @@ -71,6 +72,10 @@
>>>> const unsigned int postcopy_ram_discard_version;
>>>> +static ThreadPool *load_threads;
>>>> +static int load_threads_ret;
>>>> +static bool load_threads_abort;
>>>> +
>>>> /* Subcommands for QEMU_VM_COMMAND */
>>>> enum qemu_vm_cmd {
>>>> MIG_CMD_INVALID = 0, /* Must be 0 */
>>>> @@ -2788,6 +2793,12 @@ static int qemu_loadvm_state_setup(QEMUFile *f, Error **errp)
>>>> int ret;
>>>> trace_loadvm_state_setup();
>>>> +
>>>> + assert(!load_threads);
>>>> + load_threads = thread_pool_new();
>>>> + load_threads_ret = 0;
>>>> + load_threads_abort = false;
>>>
>>> I would introduce a qemu_loadvm_thread_pool_create() helper.
>>
>> Will do.
>
> On top of Cedric's suggestion..
>
> Maybe move it over to migration_object_init()? Then we keep
> qemu_loadvm_state_setup() only invoke the load_setup()s.
AFAIK migration_object_init() is called unconditionally
at QEMU startup even if there won't be any migration done?
Creating a load thread pool there seems wasteful if no
incoming migration will ever take place (or will but only
much later).
>>
>>> Why is the thread pool always created ?
>>>
>>
>> This functionality provides a generic pool of auxiliary load helper
>> threads, not necessarily tied to the multifd device state transfer.
>>
>> That's why the pool is created unconditionally.
>>
>>>> +
>>>> QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
>>>> if (!se->ops || !se->ops->load_setup) {
>>>> continue;
>>>> @@ -2806,19 +2817,72 @@ static int qemu_loadvm_state_setup(QEMUFile *f, Error **errp)
>>>> return ret;
>>>> }
>>>> }
>>>> +
>>>> + return 0;
>>>> +}
>>>> +
>>>> +struct LoadThreadData {
>>>> + MigrationLoadThread function;
>>>> + void *opaque;
>>>> +};
>>>> +
>>>> +static int qemu_loadvm_load_thread(void *thread_opaque)
>>>> +{
>>>> + struct LoadThreadData *data = thread_opaque;
>>>> + int ret;
>>>> +
>>>> + ret = data->function(&load_threads_abort, data->opaque);
>>>> + if (ret && !qatomic_read(&load_threads_ret)) {
>>>> + /*
>>>> + * Racy with the above read but that's okay - which thread error
>>>> + * return we report is purely arbitrary anyway.
>>>> + */
>>>> + qatomic_set(&load_threads_ret, ret);
>>>> + }
>>>> +
>>>> return 0;
>>>> }
>>>> +void qemu_loadvm_start_load_thread(MigrationLoadThread function,
>>>> + void *opaque)
>>>> +{
>>>> + struct LoadThreadData *data;
>>>> +
>>>> + /* We only set it from this thread so it's okay to read it directly */
>>>> + assert(!load_threads_abort);
>>>> +
>>>> + data = g_new(struct LoadThreadData, 1);
>>>> + data->function = function;
>>>> + data->opaque = opaque;
>>>> +
>>>> + thread_pool_submit(load_threads, qemu_loadvm_load_thread,
>>>> + data, g_free);
>>>> + thread_pool_adjust_max_threads_to_work(load_threads);
>>>> +}
>>>> +
>>>> void qemu_loadvm_state_cleanup(void)
>>>> {
>>>> SaveStateEntry *se;
>>>> trace_loadvm_state_cleanup();
>>>> +
>>>> QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
>>>> if (se->ops && se->ops->load_cleanup) {
>>>> se->ops->load_cleanup(se->opaque);
>>>> }
>>>> }
>>>> +
>>>> + /*
>>>> + * We might be called even without earlier qemu_loadvm_state_setup()
>>>> + * call if qemu_loadvm_state() fails very early.
>>>> + */
>>>> + if (load_threads) {
>>>> + qatomic_set(&load_threads_abort, true);
>>>> + bql_unlock(); /* Load threads might be waiting for BQL */
>>>> + thread_pool_wait(load_threads);
>>>> + bql_lock();
>>>> + g_clear_pointer(&load_threads, thread_pool_free);
>>>> + }
>>>
>>> I would introduce a qemu_loadvm_thread_pool_destroy() helper
>>
>> Will do.
>
> Then this one may belong to migration_incoming_state_destroy().
>
>>
>>>> }
>>>> /* Return true if we should continue the migration, or false. */
>>>> @@ -3007,6 +3071,19 @@ int qemu_loadvm_state(QEMUFile *f)
>>>> return ret;
>>>> }
>>>> + if (ret == 0) {
>>>> + bql_unlock(); /* Let load threads do work requiring BQL */
>>>> + thread_pool_wait(load_threads);
>>>> + bql_lock();
>>>> +
>>>> + ret = load_threads_ret;
>>>> + }
>>>> + /*
>>>> + * Set this flag unconditionally so we'll catch further attempts to
>>>> + * start additional threads via an appropriate assert()
>>>> + */
>>>> + qatomic_set(&load_threads_abort, true);
>
> I assume this is only for debugging purpose and not required.
>
> Setting "abort all threads" to make sure "nobody will add more thread
> tasks" is pretty awkward, IMHO. If we really want to protect against it
> and fail hard, it might be easier after the thread_pool_wait() we free the
> pool directly (destroy() will see NULL so it'll skip; still need to free
> there in case migration failed before this). Then any enqueue will access
> null pointer on the pool.
We don't want to destroy the thread pool in the path where the downtime
is still counting.
That's why we only do cleanup after the migration is complete.
The above setting of load_threads_abort flag also makes sure that we abort
load threads if the migration is going to fail for other reasons (non-load
threads related) - in other words, when the above block with thread_pool_wait()
isn't even entered due to ret already containing an earlier error.
Thanks,
Maciej
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 08/24] migration: Add thread pool of optional load threads
2024-12-10 23:05 ` Maciej S. Szmigiero
@ 2024-12-12 16:38 ` Peter Xu
2024-12-12 22:53 ` Maciej S. Szmigiero
0 siblings, 1 reply; 140+ messages in thread
From: Peter Xu @ 2024-12-12 16:38 UTC (permalink / raw)
To: Maciej S. Szmigiero
Cc: Cédric Le Goater, Alex Williamson, Eric Blake, Fabiano Rosas,
Markus Armbruster, Daniel P. Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On Wed, Dec 11, 2024 at 12:05:23AM +0100, Maciej S. Szmigiero wrote:
> > Maybe move it over to migration_object_init()? Then we keep
> > qemu_loadvm_state_setup() only invoke the load_setup()s.
>
> AFAIK migration_object_init() is called unconditionally
> at QEMU startup even if there won't be any migration done?
>
> Creating a load thread pool there seems wasteful if no
> incoming migration will ever take place (or will but only
> much later).
I was expecting an empty pool to not be a major resource, but if that's a
concern, yes we can do that until later.
[...]
> > > > > @@ -3007,6 +3071,19 @@ int qemu_loadvm_state(QEMUFile *f)
> > > > > return ret;
> > > > > }
> > > > > + if (ret == 0) {
> > > > > + bql_unlock(); /* Let load threads do work requiring BQL */
> > > > > + thread_pool_wait(load_threads);
> > > > > + bql_lock();
> > > > > +
> > > > > + ret = load_threads_ret;
> > > > > + }
> > > > > + /*
> > > > > + * Set this flag unconditionally so we'll catch further attempts to
> > > > > + * start additional threads via an appropriate assert()
> > > > > + */
> > > > > + qatomic_set(&load_threads_abort, true);
> >
> > I assume this is only for debugging purpose and not required.
> >
> > Setting "abort all threads" to make sure "nobody will add more thread
> > tasks" is pretty awkward, IMHO. If we really want to protect against it
> > and fail hard, it might be easier after the thread_pool_wait() we free the
> > pool directly (destroy() will see NULL so it'll skip; still need to free
> > there in case migration failed before this). Then any enqueue will access
> > null pointer on the pool.
>
> We don't want to destroy the thread pool in the path where the downtime
> is still counting.
Yeah this makes sense.
>
> That's why we only do cleanup after the migration is complete.
>
> The above setting of load_threads_abort flag also makes sure that we abort
> load threads if the migration is going to fail for other reasons (non-load
> threads related) - in other words, when the above block with thread_pool_wait()
> isn't even entered due to ret already containing an earlier error.
In that case IIUC we should cleanup the load threads in destroy(), not
here? Especially with the comment that's even more confusing.
--
Peter Xu
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 08/24] migration: Add thread pool of optional load threads
2024-12-12 16:38 ` Peter Xu
@ 2024-12-12 22:53 ` Maciej S. Szmigiero
2024-12-16 16:29 ` Peter Xu
0 siblings, 1 reply; 140+ messages in thread
From: Maciej S. Szmigiero @ 2024-12-12 22:53 UTC (permalink / raw)
To: Peter Xu
Cc: Cédric Le Goater, Alex Williamson, Eric Blake, Fabiano Rosas,
Markus Armbruster, Daniel P. Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On 12.12.2024 17:38, Peter Xu wrote:
> On Wed, Dec 11, 2024 at 12:05:23AM +0100, Maciej S. Szmigiero wrote:
>>> Maybe move it over to migration_object_init()? Then we keep
>>> qemu_loadvm_state_setup() only invoke the load_setup()s.
>>
>> AFAIK migration_object_init() is called unconditionally
>> at QEMU startup even if there won't be any migration done?
>>
>> Creating a load thread pool there seems wasteful if no
>> incoming migration will ever take place (or will but only
>> much later).
>
> I was expecting an empty pool to not be a major resource, but if that's a
> concern, yes we can do that until later.
>
> [...]
>
>>>>>> @@ -3007,6 +3071,19 @@ int qemu_loadvm_state(QEMUFile *f)
>>>>>> return ret;
>>>>>> }
>>>>>> + if (ret == 0) {
>>>>>> + bql_unlock(); /* Let load threads do work requiring BQL */
>>>>>> + thread_pool_wait(load_threads);
>>>>>> + bql_lock();
>>>>>> +
>>>>>> + ret = load_threads_ret;
>>>>>> + }
>>>>>> + /*
>>>>>> + * Set this flag unconditionally so we'll catch further attempts to
>>>>>> + * start additional threads via an appropriate assert()
>>>>>> + */
>>>>>> + qatomic_set(&load_threads_abort, true);
>>>
>>> I assume this is only for debugging purpose and not required.
>>>
>>> Setting "abort all threads" to make sure "nobody will add more thread
>>> tasks" is pretty awkward, IMHO. If we really want to protect against it
>>> and fail hard, it might be easier after the thread_pool_wait() we free the
>>> pool directly (destroy() will see NULL so it'll skip; still need to free
>>> there in case migration failed before this). Then any enqueue will access
>>> null pointer on the pool.
>>
>> We don't want to destroy the thread pool in the path where the downtime
>> is still counting.
>
> Yeah this makes sense.
>
>>
>> That's why we only do cleanup after the migration is complete.
>>
>> The above setting of load_threads_abort flag also makes sure that we abort
>> load threads if the migration is going to fail for other reasons (non-load
>> threads related) - in other words, when the above block with thread_pool_wait()
>> isn't even entered due to ret already containing an earlier error.
>
> In that case IIUC we should cleanup the load threads in destroy(), not
> here? Especially with the comment that's even more confusing.
>
This flag only asks the threads in the pool which are still running to exit ASAP
(without waiting for them in the "fail for other reasons"
qemu_loadvm_state() code flow).
Setting this flag does *not* do the cleanup of the whole thread pool - this
only happens in qemu_loadvm_state_cleanup().
Thanks,
Maciej
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 08/24] migration: Add thread pool of optional load threads
2024-12-12 22:53 ` Maciej S. Szmigiero
@ 2024-12-16 16:29 ` Peter Xu
2024-12-16 23:15 ` Maciej S. Szmigiero
0 siblings, 1 reply; 140+ messages in thread
From: Peter Xu @ 2024-12-16 16:29 UTC (permalink / raw)
To: Maciej S. Szmigiero
Cc: Cédric Le Goater, Alex Williamson, Eric Blake, Fabiano Rosas,
Markus Armbruster, Daniel P. Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On Thu, Dec 12, 2024 at 11:53:24PM +0100, Maciej S. Szmigiero wrote:
> On 12.12.2024 17:38, Peter Xu wrote:
> > On Wed, Dec 11, 2024 at 12:05:23AM +0100, Maciej S. Szmigiero wrote:
> > > > Maybe move it over to migration_object_init()? Then we keep
> > > > qemu_loadvm_state_setup() only invoke the load_setup()s.
> > >
> > > AFAIK migration_object_init() is called unconditionally
> > > at QEMU startup even if there won't be any migration done?
> > >
> > > Creating a load thread pool there seems wasteful if no
> > > incoming migration will ever take place (or will but only
> > > much later).
> >
> > I was expecting an empty pool to not be a major resource, but if that's a
> > concern, yes we can do that until later.
> >
> > [...]
> >
> > > > > > > @@ -3007,6 +3071,19 @@ int qemu_loadvm_state(QEMUFile *f)
> > > > > > > return ret;
> > > > > > > }
> > > > > > > + if (ret == 0) {
> > > > > > > + bql_unlock(); /* Let load threads do work requiring BQL */
> > > > > > > + thread_pool_wait(load_threads);
> > > > > > > + bql_lock();
> > > > > > > +
> > > > > > > + ret = load_threads_ret;
> > > > > > > + }
> > > > > > > + /*
> > > > > > > + * Set this flag unconditionally so we'll catch further attempts to
> > > > > > > + * start additional threads via an appropriate assert()
> > > > > > > + */
> > > > > > > + qatomic_set(&load_threads_abort, true);
> > > >
> > > > I assume this is only for debugging purpose and not required.
> > > >
> > > > Setting "abort all threads" to make sure "nobody will add more thread
> > > > tasks" is pretty awkward, IMHO. If we really want to protect against it
> > > > and fail hard, it might be easier after the thread_pool_wait() we free the
> > > > pool directly (destroy() will see NULL so it'll skip; still need to free
> > > > there in case migration failed before this). Then any enqueue will access
> > > > null pointer on the pool.
> > >
> > > We don't want to destroy the thread pool in the path where the downtime
> > > is still counting.
> >
> > Yeah this makes sense.
> >
> > >
> > > That's why we only do cleanup after the migration is complete.
> > >
> > > The above setting of load_threads_abort flag also makes sure that we abort
> > > load threads if the migration is going to fail for other reasons (non-load
> > > threads related) - in other words, when the above block with thread_pool_wait()
> > > isn't even entered due to ret already containing an earlier error.
> >
> > In that case IIUC we should cleanup the load threads in destroy(), not
> > here? Especially with the comment that's even more confusing.
> >
>
> This flag only asks the threads in the pool which are still running to exit ASAP
> (without waiting for them in the "fail for other reasons"
> qemu_loadvm_state() code flow).
I thought we could switch to an Error** model as we talked elsewhere, then
the thread who hits the error should set the quit flag, IIUC.
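For reference, the rough shape such an Error**-based callback could take
(an assumed sketch, nothing settled in this thread):
    typedef bool (*MigrationLoadThread)(void *opaque, bool *should_quit,
                                        Error **errp);
where a failing thread returns false after filling *errp, and sets the
shared quit flag so the remaining threads stop too.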
Even without it..
>
> Setting this flag does *not* do the cleanup of the whole thread pool - this
> only happens in qemu_loadvm_state_cleanup().
... we have two cases here:
Either no error at all, then thread_pool_wait() will wait for all threads
until finished. When reaching here setting this flag shouldn't matter for
the threads because they're all finished.
Or there's error in some thread, then QEMU should be stuck at
thread_pool_wait() anyway, until all threads quit. Again, I thought it
could be the qemu_loadvm_load_thread() that sets the quit flag (rather than
here) so the failed thread will notify all threads to quit.
I just still don't see what's the help of setting it after
thread_pool_wait(), which already marked all threads finished at its
return. That goes back to my question on whether it was only for debugging
(so no new threads to be created after this), rather than the flag to tell
all threads to quit.
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 08/24] migration: Add thread pool of optional load threads
2024-12-16 16:29 ` Peter Xu
@ 2024-12-16 23:15 ` Maciej S. Szmigiero
2024-12-17 14:50 ` Peter Xu
0 siblings, 1 reply; 140+ messages in thread
From: Maciej S. Szmigiero @ 2024-12-16 23:15 UTC (permalink / raw)
To: Peter Xu
Cc: Cédric Le Goater, Alex Williamson, Eric Blake, Fabiano Rosas,
Markus Armbruster, Daniel P. Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On 16.12.2024 17:29, Peter Xu wrote:
> On Thu, Dec 12, 2024 at 11:53:24PM +0100, Maciej S. Szmigiero wrote:
>> On 12.12.2024 17:38, Peter Xu wrote:
>>> On Wed, Dec 11, 2024 at 12:05:23AM +0100, Maciej S. Szmigiero wrote:
>>>>> Maybe move it over to migration_object_init()? Then we keep
>>>>> qemu_loadvm_state_setup() only invoke the load_setup()s.
>>>>
>>>> AFAIK migration_object_init() is called unconditionally
>>>> at QEMU startup even if there won't be any migration done?
>>>>
>>>> Creating a load thread pool there seems wasteful if no
>>>> incoming migration will ever take place (or will but only
>>>> much later).
>>>
>>> I was expecting an empty pool to not be a major resource, but if that's a
>>> concern, yes we can do that until later.
>>>
>>> [...]
>>>
>>>>>>>> @@ -3007,6 +3071,19 @@ int qemu_loadvm_state(QEMUFile *f)
>>>>>>>> return ret;
>>>>>>>> }
>>>>>>>> + if (ret == 0) {
>>>>>>>> + bql_unlock(); /* Let load threads do work requiring BQL */
>>>>>>>> + thread_pool_wait(load_threads);
>>>>>>>> + bql_lock();
>>>>>>>> +
>>>>>>>> + ret = load_threads_ret;
>>>>>>>> + }
>>>>>>>> + /*
>>>>>>>> + * Set this flag unconditionally so we'll catch further attempts to
>>>>>>>> + * start additional threads via an appropriate assert()
>>>>>>>> + */
>>>>>>>> + qatomic_set(&load_threads_abort, true);
>>>>>
>>>>> I assume this is only for debugging purpose and not required.
>>>>>
>>>>> Setting "abort all threads" to make sure "nobody will add more thread
>>>>> tasks" is pretty awkward, IMHO. If we really want to protect against it
>>>>> and fail hard, it might be easier after the thread_pool_wait() we free the
>>>>> pool directly (destroy() will see NULL so it'll skip; still need to free
>>>>> there in case migration failed before this). Then any enqueue will access
>>>>> null pointer on the pool.
>>>>
>>>> We don't want to destroy the thread pool in the path where the downtime
>>>> is still counting.
>>>
>>> Yeah this makes sense.
>>>
>>>>
>>>> That's why we only do cleanup after the migration is complete.
>>>>
>>>> The above setting of load_threads_abort flag also makes sure that we abort
>>>> load threads if the migration is going to fail for other reasons (non-load
>>>> threads related) - in other words, when the above block with thread_pool_wait()
>>>> isn't even entered due to ret already containing an earlier error.
>>>
>>> In that case IIUC we should cleanup the load threads in destroy(), not
>>> here? Especially with the comment that's even more confusing.
>>>
>>
>> This flag only asks the threads in the pool which are still running to exit ASAP
>> (without waiting for them in the "fail for other reasons"
>> qemu_loadvm_state() code flow).
>
> I thought we could switch to an Error** model as we talked elsewhere, then
> the thread who hits the error should set the quit flag, IIUC.
>
> Even without it..
>
>>
>> Setting this flag does *not* do the cleanup of the whole thread pool - this
>> only happens in qemu_loadvm_state_cleanup().
>
> ... we have two cases here:
>
> Either no error at all, then thread_pool_wait() will wait for all threads
> until finished. When reaching here setting this flag shouldn't matter for
> the threads because they're all finished.
>
> Or there's error in some thread, then QEMU should be stuck at
> thread_pool_wait() anyway, until all threads quit. Again, I thought it
> could be the qemu_loadvm_load_thread() that sets the quit flag (rather than
> here) so the failed thread will notify all threads to quit.
>
> I just still don't see what's the help of setting it after
> thread_pool_wait(), which already marked all threads finished at its
> return. That goes back to my question on whether it was only for debugging
> (so no new threads to be created after this), rather than the flag to tell
> all threads to quit.
There's also a possibility of earlier error in qemu_loadvm_state()
(not in the load threads themselves), for example if qemu_loadvm_state_main()
returns an error.
In this case thread_pool_wait() *won't* be called but the load threads
would still be running needlessly - setting load_threads_abort flag makes
them stop.
The debugging benefit of assert()ing when someone tries to create
a load thread after that point comes essentially for free then.
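Concretely, in that flow the flag is the only stop signal the threads get -
an annotated sketch of the qemu_loadvm_state() tail being discussed:
    ret = qemu_loadvm_state_main(f, mis);   /* suppose this failed */
    ...
    if (ret == 0) {                         /* skipped on that failure */
        bql_unlock();
        thread_pool_wait(load_threads);
        bql_lock();
        ret = load_threads_ret;
    }
    /*
     * Still reached: any running load threads see the flag and bail out
     * instead of doing work for a migration that is already lost.
     */
    qatomic_set(&load_threads_abort, true);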
> Thanks,
>
Thanks,
Maciej
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 08/24] migration: Add thread pool of optional load threads
2024-12-16 23:15 ` Maciej S. Szmigiero
@ 2024-12-17 14:50 ` Peter Xu
0 siblings, 0 replies; 140+ messages in thread
From: Peter Xu @ 2024-12-17 14:50 UTC (permalink / raw)
To: Maciej S. Szmigiero
Cc: Cédric Le Goater, Alex Williamson, Eric Blake, Fabiano Rosas,
Markus Armbruster, Daniel P. Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On Tue, Dec 17, 2024 at 12:15:36AM +0100, Maciej S. Szmigiero wrote:
> On 16.12.2024 17:29, Peter Xu wrote:
> > On Thu, Dec 12, 2024 at 11:53:24PM +0100, Maciej S. Szmigiero wrote:
> > > On 12.12.2024 17:38, Peter Xu wrote:
> > > > On Wed, Dec 11, 2024 at 12:05:23AM +0100, Maciej S. Szmigiero wrote:
> > > > > > Maybe move it over to migration_object_init()? Then we keep
> > > > > > qemu_loadvm_state_setup() only invoke the load_setup()s.
> > > > >
> > > > > AFAIK migration_object_init() is called unconditionally
> > > > > at QEMU startup even if there won't be any migration done?
> > > > >
> > > > > Creating a load thread pool there seems wasteful if no
> > > > > incoming migration will ever take place (or will but only
> > > > > much later).
> > > >
> > > > I was expecting an empty pool to not be a major resource, but if that's a
> > > > concern, yes we can do that until later.
> > > >
> > > > [...]
> > > >
> > > > > > > > > @@ -3007,6 +3071,19 @@ int qemu_loadvm_state(QEMUFile *f)
> > > > > > > > > return ret;
> > > > > > > > > }
> > > > > > > > > + if (ret == 0) {
> > > > > > > > > + bql_unlock(); /* Let load threads do work requiring BQL */
> > > > > > > > > + thread_pool_wait(load_threads);
> > > > > > > > > + bql_lock();
> > > > > > > > > +
> > > > > > > > > + ret = load_threads_ret;
> > > > > > > > > + }
> > > > > > > > > + /*
> > > > > > > > > + * Set this flag unconditionally so we'll catch further attempts to
> > > > > > > > > + * start additional threads via an appropriate assert()
> > > > > > > > > + */
> > > > > > > > > + qatomic_set(&load_threads_abort, true);
> > > > > >
> > > > > > I assume this is only for debugging purpose and not required.
> > > > > >
> > > > > > Setting "abort all threads" to make sure "nobody will add more thread
> > > > > > tasks" is pretty awkward, IMHO. If we really want to protect against it
> > > > > > and fail hard, it might be easier after the thread_pool_wait() we free the
> > > > > > pool directly (destroy() will see NULL so it'll skip; still need to free
> > > > > > there in case migration failed before this). Then any enqueue will access
> > > > > > null pointer on the pool.
> > > > >
> > > > > We don't want to destroy the thread pool in the path where the downtime
> > > > > is still counting.
> > > >
> > > > Yeah this makes sense.
> > > >
> > > > >
> > > > > That's why we only do cleanup after the migration is complete.
> > > > >
> > > > > The above setting of load_threads_abort flag also makes sure that we abort
> > > > > load threads if the migration is going to fail for other reasons (non-load
> > > > > threads related) - in other words, when the above block with thread_pool_wait()
> > > > > isn't even entered due to ret already containing an earlier error.
> > > >
> > > > In that case IIUC we should cleanup the load threads in destroy(), not
> > > > here? Especially with the comment that's even more confusing.
> > > >
> > >
> > > This flag only asks the threads in the pool which are still running to exit ASAP
> > > (without waiting for them in the "fail for other reasons"
> > > qemu_loadvm_state() code flow).
> >
> > I thought we could switch to an Error** model as we talked elsewhere, then
> > the thread who hits the error should set the quit flag, IIUC.
> >
> > Even without it..
> >
> > >
> > > Setting this flag does *not* do the cleanup of the whole thread pool - this
> > > only happens in qemu_loadvm_state_cleanup().
> >
> > ... we have two cases here:
> >
> > Either no error at all, then thread_pool_wait() will wait for all threads
> > until finished. When reaching here setting this flag shouldn't matter for
> > the threads because they're all finished.
> >
> > Or there's error in some thread, then QEMU should be stuck at
> > thread_pool_wait() anyway, until all threads quit. Again, I thought it
> > could be the qemu_loadvm_load_thread() that sets the quit flag (rather than
> > here) so the failed thread will notify all threads to quit.
> >
> > I just still don't see what's the help of setting it after
> > thread_pool_wait(), which already marked all threads finished at its
> > return. That goes back to my question on whether it was only for debugging
> > (so no new threads to be created after this), rather than the flag to tell
> > all threads to quit.
>
> There's also a possibility of earlier error in qemu_loadvm_state()
> (not in the load threads themselves), for example if qemu_loadvm_state_main()
> returns an error.
>
> In this case thread_pool_wait() *won't* be called but the load threads
> would still be running needlessly - setting load_threads_abort flag makes
> them stop.
>
> The debugging benefit of assert()ing when someone tries to create
> a load thread after that point comes essentially for free then.
In that case, IMHO we should put all cleanup stuff into the cleanup
function, like migration_incoming_state_destroy(). I'd rather we not keep
it only for such a debug purpose, OTOH.
This step can be easily overlooked when this function adds more things.
Personally, I'm totally not a fan of using "if (ret==0) {do something}" and
keeping on writing like that.. but that's not an issue of this patch alone, so
we can leave that for later. Even so, having it here is still error prone
(e.g. consider one "goto" in the future before this step, logically it
should skip all next steps if a prior one fails).
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 08/24] migration: Add thread pool of optional load threads
2024-11-17 19:20 ` [PATCH v3 08/24] migration: Add thread pool of optional load threads Maciej S. Szmigiero
2024-11-25 19:58 ` Fabiano Rosas
2024-11-27 9:13 ` Cédric Le Goater
@ 2024-11-28 10:26 ` Avihai Horon
2024-11-28 12:11 ` Maciej S. Szmigiero
2 siblings, 1 reply; 140+ messages in thread
From: Avihai Horon @ 2024-11-28 10:26 UTC (permalink / raw)
To: Maciej S. Szmigiero, Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Joao Martins,
qemu-devel
On 17/11/2024 21:20, Maciej S. Szmigiero wrote:
>
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>
> Some drivers might want to make use of auxiliary helper threads during VM
> state loading, for example to make sure that their blocking (sync) I/O
> operations don't block the rest of the migration process.
>
> Add a migration core managed thread pool to facilitate this use case.
>
> The migration core will wait for these threads to finish before
> (re)starting the VM at destination.
>
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> ---
> include/migration/misc.h | 3 ++
> include/qemu/typedefs.h | 1 +
> migration/savevm.c | 77 ++++++++++++++++++++++++++++++++++++++++
> 3 files changed, 81 insertions(+)
>
> diff --git a/include/migration/misc.h b/include/migration/misc.h
> index 804eb23c0607..c92ca018ab3b 100644
> --- a/include/migration/misc.h
> +++ b/include/migration/misc.h
> @@ -45,9 +45,12 @@ bool migrate_ram_is_ignored(RAMBlock *block);
> /* migration/block.c */
>
> AnnounceParameters *migrate_announce_params(void);
> +
> /* migration/savevm.c */
>
> void dump_vmstate_json_to_file(FILE *out_fp);
> +void qemu_loadvm_start_load_thread(MigrationLoadThread function,
> + void *opaque);
>
> /* migration/migration.c */
> void migration_object_init(void);
> diff --git a/include/qemu/typedefs.h b/include/qemu/typedefs.h
> index 3d84efcac47a..8c8ea5c2840d 100644
> --- a/include/qemu/typedefs.h
> +++ b/include/qemu/typedefs.h
> @@ -131,5 +131,6 @@ typedef struct IRQState *qemu_irq;
> * Function types
> */
> typedef void (*qemu_irq_handler)(void *opaque, int n, int level);
> +typedef int (*MigrationLoadThread)(bool *abort_flag, void *opaque);
>
> #endif /* QEMU_TYPEDEFS_H */
> diff --git a/migration/savevm.c b/migration/savevm.c
> index 1f58a2fa54ae..6ea9054c4083 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -54,6 +54,7 @@
> #include "qemu/job.h"
> #include "qemu/main-loop.h"
> #include "block/snapshot.h"
> +#include "block/thread-pool.h"
> #include "qemu/cutils.h"
> #include "io/channel-buffer.h"
> #include "io/channel-file.h"
> @@ -71,6 +72,10 @@
>
> const unsigned int postcopy_ram_discard_version;
>
> +static ThreadPool *load_threads;
> +static int load_threads_ret;
> +static bool load_threads_abort;
> +
> /* Subcommands for QEMU_VM_COMMAND */
> enum qemu_vm_cmd {
> MIG_CMD_INVALID = 0, /* Must be 0 */
> @@ -2788,6 +2793,12 @@ static int qemu_loadvm_state_setup(QEMUFile *f, Error **errp)
> int ret;
>
> trace_loadvm_state_setup();
> +
> + assert(!load_threads);
> + load_threads = thread_pool_new();
> + load_threads_ret = 0;
> + load_threads_abort = false;
> +
> QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> if (!se->ops || !se->ops->load_setup) {
> continue;
> @@ -2806,19 +2817,72 @@ static int qemu_loadvm_state_setup(QEMUFile *f, Error **errp)
> return ret;
> }
> }
> +
> + return 0;
> +}
> +
> +struct LoadThreadData {
> + MigrationLoadThread function;
> + void *opaque;
> +};
> +
> +static int qemu_loadvm_load_thread(void *thread_opaque)
> +{
> + struct LoadThreadData *data = thread_opaque;
> + int ret;
> +
> + ret = data->function(&load_threads_abort, data->opaque);
> + if (ret && !qatomic_read(&load_threads_ret)) {
> + /*
> + * Racy with the above read but that's okay - which thread error
> + * return we report is purely arbitrary anyway.
> + */
> + qatomic_set(&load_threads_ret, ret);
> + }
Can we use cmpxchg instead? E.g.:
if (ret) {
    qatomic_cmpxchg(&load_threads_ret, 0, ret);
}
> +
> return 0;
> }
>
> +void qemu_loadvm_start_load_thread(MigrationLoadThread function,
> + void *opaque)
> +{
> + struct LoadThreadData *data;
> +
> + /* We only set it from this thread so it's okay to read it directly */
> + assert(!load_threads_abort);
> +
> + data = g_new(struct LoadThreadData, 1);
> + data->function = function;
> + data->opaque = opaque;
> +
> + thread_pool_submit(load_threads, qemu_loadvm_load_thread,
> + data, g_free);
> + thread_pool_adjust_max_threads_to_work(load_threads);
> +}
> +
> void qemu_loadvm_state_cleanup(void)
> {
> SaveStateEntry *se;
>
> trace_loadvm_state_cleanup();
> +
> QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> if (se->ops && se->ops->load_cleanup) {
> se->ops->load_cleanup(se->opaque);
> }
> }
> +
> + /*
> + * We might be called even without earlier qemu_loadvm_state_setup()
> + * call if qemu_loadvm_state() fails very early.
> + */
> + if (load_threads) {
> + qatomic_set(&load_threads_abort, true);
> + bql_unlock(); /* Load threads might be waiting for BQL */
> + thread_pool_wait(load_threads);
> + bql_lock();
> + g_clear_pointer(&load_threads, thread_pool_free);
Since thread_pool_free() also waits for pending jobs before returning,
can we drop the explicit thread_pool_wait()? E.g.:
qatomic_set(&load_threads_abort, true);
bql_unlock(); /* Load threads might be waiting for BQL */
g_clear_pointer(&load_threads, thread_pool_free);
bql_lock();
Thanks.
> + }
> }
>
> /* Return true if we should continue the migration, or false. */
> @@ -3007,6 +3071,19 @@ int qemu_loadvm_state(QEMUFile *f)
> return ret;
> }
>
> + if (ret == 0) {
> + bql_unlock(); /* Let load threads do work requiring BQL */
> + thread_pool_wait(load_threads);
> + bql_lock();
> +
> + ret = load_threads_ret;
> + }
> + /*
> + * Set this flag unconditionally so we'll catch further attempts to
> + * start additional threads via an appropriate assert()
> + */
> + qatomic_set(&load_threads_abort, true);
> +
> if (ret == 0) {
> ret = qemu_file_get_error(f);
> }
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 08/24] migration: Add thread pool of optional load threads
2024-11-28 10:26 ` Avihai Horon
@ 2024-11-28 12:11 ` Maciej S. Szmigiero
2024-12-04 22:43 ` Peter Xu
0 siblings, 1 reply; 140+ messages in thread
From: Maciej S. Szmigiero @ 2024-11-28 12:11 UTC (permalink / raw)
To: Avihai Horon
Cc: Alex Williamson, Peter Xu, Fabiano Rosas, Cédric Le Goater,
Eric Blake, Markus Armbruster, Daniel P . Berrangé,
Joao Martins, qemu-devel
On 28.11.2024 11:26, Avihai Horon wrote:
>
> On 17/11/2024 21:20, Maciej S. Szmigiero wrote:
>>
>>
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> Some drivers might want to make use of auxiliary helper threads during VM
>> state loading, for example to make sure that their blocking (sync) I/O
>> operations don't block the rest of the migration process.
>>
>> Add a migration core managed thread pool to facilitate this use case.
>>
>> The migration core will wait for these threads to finish before
>> (re)starting the VM at destination.
>>
>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>> ---
>> include/migration/misc.h | 3 ++
>> include/qemu/typedefs.h | 1 +
>> migration/savevm.c | 77 ++++++++++++++++++++++++++++++++++++++++
>> 3 files changed, 81 insertions(+)
>>
>> diff --git a/include/migration/misc.h b/include/migration/misc.h
>> index 804eb23c0607..c92ca018ab3b 100644
>> --- a/include/migration/misc.h
>> +++ b/include/migration/misc.h
>> @@ -45,9 +45,12 @@ bool migrate_ram_is_ignored(RAMBlock *block);
>> /* migration/block.c */
>>
>> AnnounceParameters *migrate_announce_params(void);
>> +
>> /* migration/savevm.c */
>>
>> void dump_vmstate_json_to_file(FILE *out_fp);
>> +void qemu_loadvm_start_load_thread(MigrationLoadThread function,
>> + void *opaque);
>>
>> /* migration/migration.c */
>> void migration_object_init(void);
>> diff --git a/include/qemu/typedefs.h b/include/qemu/typedefs.h
>> index 3d84efcac47a..8c8ea5c2840d 100644
>> --- a/include/qemu/typedefs.h
>> +++ b/include/qemu/typedefs.h
>> @@ -131,5 +131,6 @@ typedef struct IRQState *qemu_irq;
>> * Function types
>> */
>> typedef void (*qemu_irq_handler)(void *opaque, int n, int level);
>> +typedef int (*MigrationLoadThread)(bool *abort_flag, void *opaque);
>>
>> #endif /* QEMU_TYPEDEFS_H */
>> diff --git a/migration/savevm.c b/migration/savevm.c
>> index 1f58a2fa54ae..6ea9054c4083 100644
>> --- a/migration/savevm.c
>> +++ b/migration/savevm.c
>> @@ -54,6 +54,7 @@
>> #include "qemu/job.h"
>> #include "qemu/main-loop.h"
>> #include "block/snapshot.h"
>> +#include "block/thread-pool.h"
>> #include "qemu/cutils.h"
>> #include "io/channel-buffer.h"
>> #include "io/channel-file.h"
>> @@ -71,6 +72,10 @@
>>
>> const unsigned int postcopy_ram_discard_version;
>>
>> +static ThreadPool *load_threads;
>> +static int load_threads_ret;
>> +static bool load_threads_abort;
>> +
>> /* Subcommands for QEMU_VM_COMMAND */
>> enum qemu_vm_cmd {
>> MIG_CMD_INVALID = 0, /* Must be 0 */
>> @@ -2788,6 +2793,12 @@ static int qemu_loadvm_state_setup(QEMUFile *f, Error **errp)
>> int ret;
>>
>> trace_loadvm_state_setup();
>> +
>> + assert(!load_threads);
>> + load_threads = thread_pool_new();
>> + load_threads_ret = 0;
>> + load_threads_abort = false;
>> +
>> QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
>> if (!se->ops || !se->ops->load_setup) {
>> continue;
>> @@ -2806,19 +2817,72 @@ static int qemu_loadvm_state_setup(QEMUFile *f, Error **errp)
>> return ret;
>> }
>> }
>> +
>> + return 0;
>> +}
>> +
>> +struct LoadThreadData {
>> + MigrationLoadThread function;
>> + void *opaque;
>> +};
>> +
>> +static int qemu_loadvm_load_thread(void *thread_opaque)
>> +{
>> + struct LoadThreadData *data = thread_opaque;
>> + int ret;
>> +
>> + ret = data->function(&load_threads_abort, data->opaque);
>> + if (ret && !qatomic_read(&load_threads_ret)) {
>> + /*
>> + * Racy with the above read but that's okay - which thread error
>> + * return we report is purely arbitrary anyway.
>> + */
>> + qatomic_set(&load_threads_ret, ret);
>> + }
>
> Can we use cmpxchg instead? E.g.:
>
> if (ret) {
> qatomic_cmpxchg(&load_threads_ret, 0, ret);
> }
cmpxchg always forces sequentially consistent ordering
while qatomic_read() and qatomic_set() have relaxed ordering.
As the comment above describes, there's no need for sequential
consistency since which thread's error gets reported is arbitrary
anyway.
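If we really wanted both the race-free first-error-wins semantics and
relaxed ordering, plain C11 atomics could express it - a sketch using
stdatomic.h directly, since AFAIK the qatomic_* macros don't provide a
relaxed cmpxchg variant:

#include <stdatomic.h>

static _Atomic int load_threads_ret;

static void record_first_error(int ret)
{
    int expected = 0;

    if (ret) {
        /* first-error-wins with no ordering stronger than relaxed */
        atomic_compare_exchange_strong_explicit(&load_threads_ret,
                                                &expected, ret,
                                                memory_order_relaxed,
                                                memory_order_relaxed);
    }
}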
>> +
>> return 0;
>> }
>>
>> +void qemu_loadvm_start_load_thread(MigrationLoadThread function,
>> + void *opaque)
>> +{
>> + struct LoadThreadData *data;
>> +
>> + /* We only set it from this thread so it's okay to read it directly */
>> + assert(!load_threads_abort);
>> +
>> + data = g_new(struct LoadThreadData, 1);
>> + data->function = function;
>> + data->opaque = opaque;
>> +
>> + thread_pool_submit(load_threads, qemu_loadvm_load_thread,
>> + data, g_free);
>> + thread_pool_adjust_max_threads_to_work(load_threads);
>> +}
>> +
>> void qemu_loadvm_state_cleanup(void)
>> {
>> SaveStateEntry *se;
>>
>> trace_loadvm_state_cleanup();
>> +
>> QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
>> if (se->ops && se->ops->load_cleanup) {
>> se->ops->load_cleanup(se->opaque);
>> }
>> }
>> +
>> + /*
>> + * We might be called even without earlier qemu_loadvm_state_setup()
>> + * call if qemu_loadvm_state() fails very early.
>> + */
>> + if (load_threads) {
>> + qatomic_set(&load_threads_abort, true);
>> + bql_unlock(); /* Load threads might be waiting for BQL */
>> + thread_pool_wait(load_threads);
>> + bql_lock();
>> + g_clear_pointer(&load_threads, thread_pool_free);
>
> Since thread_pool_free() also waits for pending jobs before returning, can we drop the explicit thread_pool_wait()? E.g.:
>
> qatomic_set(&load_threads_abort, true);
> bql_unlock(); /* Load threads might be waiting for BQL */
> g_clear_pointer(&load_threads, thread_pool_free);
> bql_lock();
If we document that thread_pool_free() also has wait semantics,
as Cédric has suggested, then we can indeed avoid the explicit
wait on cleanup.
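Something like this for the thread-pool header would do (just a sketch
of the wording):

/**
 * thread_pool_free:
 * @pool: the thread pool to destroy
 *
 * Waits for all previously submitted work items to finish, then frees
 * the pool - callers don't need a separate thread_pool_wait() first.
 */
void thread_pool_free(ThreadPool *pool);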
> Thanks.
Thanks,
Maciej
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 08/24] migration: Add thread pool of optional load threads
2024-11-28 12:11 ` Maciej S. Szmigiero
@ 2024-12-04 22:43 ` Peter Xu
2024-12-10 23:05 ` Maciej S. Szmigiero
0 siblings, 1 reply; 140+ messages in thread
From: Peter Xu @ 2024-12-04 22:43 UTC (permalink / raw)
To: Maciej S. Szmigiero
Cc: Avihai Horon, Alex Williamson, Fabiano Rosas,
Cédric Le Goater, Eric Blake, Markus Armbruster,
Daniel P . Berrangé, Joao Martins, qemu-devel
On Thu, Nov 28, 2024 at 01:11:53PM +0100, Maciej S. Szmigiero wrote:
> > > +static int qemu_loadvm_load_thread(void *thread_opaque)
> > > +{
> > > + struct LoadThreadData *data = thread_opaque;
> > > + int ret;
> > > +
> > > + ret = data->function(&load_threads_abort, data->opaque);
> > > + if (ret && !qatomic_read(&load_threads_ret)) {
> > > + /*
> > > + * Racy with the above read but that's okay - which thread error
> > > + * return we report is purely arbitrary anyway.
> > > + */
> > > + qatomic_set(&load_threads_ret, ret);
> > > + }
> >
> > Can we use cmpxchg instead? E.g.:
> >
> > if (ret) {
> > qatomic_cmpxchg(&load_threads_ret, 0, ret);
> > }
>
> cmpxchg always forces sequentially consistent ordering
> while qatomic_read() and qatomic_set() have relaxed ordering.
>
> As the comment above describes, there's no need for sequential
> consistency since which thread error is returned is arbitrary
> anyway.
IMHO this is not a hot path, so mem ordering isn't an issue. If we could
avoid any data race we should still try to.
I do feel uneasy about the current design where everybody shares the
"whether to quit" state via one bool, and any thread can set it...
meanwhile we can't stabilize the first error to report later.
E.g., ideally we want to capture the first error no matter where it came
from, then keep it with migrate_set_error() so that "query-migrate" on dest
later can tell us what was wrong. I think libvirt generally uses that.
So as to support a string error, at least we'll need to allow Error** in
the thread fn:
typedef bool (*MigrationLoadThread)(void *opaque, bool *should_quit,
                                    Error **errp);
I also changed retval to bool, as I mentioned elsewhere QEMU tries to stick
with "bool SOME_FUNCTION(..., Error **errp)" kind of error reporting.
Then any thread should only report error to qemu_loadvm_load_thread(), and
the report should always be a local Error**, then it further reports to the
global error. Something like:
static int qemu_loadvm_load_thread(void *thread_opaque)
{
    MigrationIncomingState *mis = migration_incoming_get_current();
    struct LoadThreadData *data = thread_opaque;
    Error *error = NULL;

    if (!data->function(data->opaque, &mis->should_quit, &error)) {
        migrate_set_error(migrate_get_current(), error);
    }

    return 0;
}
migrate_set_error() is thread-safe, and it'll only record the 1st error.
Then the thread should only read &should_quit, and only set &error. If we
want, migrate_set_error() can set &should_quit.
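A device-side thread function would then look roughly like this
(hypothetical helper names, just to show the shape):

static bool example_load_thread(void *opaque, bool *should_quit, Error **errp)
{
    while (!qatomic_read(should_quit)) {
        if (work_is_done(opaque)) {              /* hypothetical helper */
            return true;
        }
        if (!process_one_buffer(opaque, errp)) { /* hypothetical helper */
            return false;   /* *errp already set by the callee */
        }
    }

    /* asked to quit early - not a failure of this thread itself */
    return true;
}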
PS: I wish we had a unified place to tell whether we should quit the
incoming migration - we already have multifd_recv_state->exiting; a global
flag like that could have been reused here. But I know I'm asking too
much.. However, would you think it makes sense to at least have the
Error** report the error and record it?
--
Peter Xu
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 08/24] migration: Add thread pool of optional load threads
2024-12-04 22:43 ` Peter Xu
@ 2024-12-10 23:05 ` Maciej S. Szmigiero
2024-12-12 16:55 ` Peter Xu
0 siblings, 1 reply; 140+ messages in thread
From: Maciej S. Szmigiero @ 2024-12-10 23:05 UTC (permalink / raw)
To: Peter Xu
Cc: Avihai Horon, Alex Williamson, Fabiano Rosas,
Cédric Le Goater, Eric Blake, Markus Armbruster,
Daniel P. Berrangé, Joao Martins, qemu-devel
On 4.12.2024 23:43, Peter Xu wrote:
> On Thu, Nov 28, 2024 at 01:11:53PM +0100, Maciej S. Szmigiero wrote:
>>>> +static int qemu_loadvm_load_thread(void *thread_opaque)
>>>> +{
>>>> + struct LoadThreadData *data = thread_opaque;
>>>> + int ret;
>>>> +
>>>> + ret = data->function(&load_threads_abort, data->opaque);
>>>> + if (ret && !qatomic_read(&load_threads_ret)) {
>>>> + /*
>>>> + * Racy with the above read but that's okay - which thread error
>>>> + * return we report is purely arbitrary anyway.
>>>> + */
>>>> + qatomic_set(&load_threads_ret, ret);
>>>> + }
>>>
>>> Can we use cmpxchg instead? E.g.:
>>>
>>> if (ret) {
>>> qatomic_cmpxchg(&load_threads_ret, 0, ret);
>>> }
>>
>> cmpxchg always forces sequentially consistent ordering
>> while qatomic_read() and qatomic_set() have relaxed ordering.
>>
>> As the comment above describes, there's no need for sequential
>> consistency since which thread error is returned is arbitrary
>> anyway.
>
> IMHO this is not a hot path, so mem ordering isn't an issue. If we could
> avoid any data race we still should try to.
>
> I do feel uneasy on the current design where everybody shares the "whether
> to quit" via one bool, and any thread can set it... meanwhile we can't
> stablize the first error to report later.
>
> E.g., ideally we want to capture the first error no matter where it came
> from, then keep it with migrate_set_error() so that "query-migrate" on dest
> later can tell us what was wrong. I think libvirt generally uses that.
>
> So as to support a string error, at least we'll need to allow Error** in
> the thread fn:
>
> typedef bool (*MigrationLoadThread)(void *opaque, bool *should_quit,
> Error **errp);
>
> I also changed retval to bool, as I mentioned elsewhere QEMU tries to stick
> with "bool SOME_FUNCTION(..., Error **errp)" kind of error reporting.
>
> Then any thread should only report error to qemu_loadvm_load_thread(), and
> the report should always be a local Error**, then it further reports to the
> global error. Something like:
>
> static int qemu_loadvm_load_thread(void *thread_opaque)
> {
> MigrationIncomingState *mis = migration_incoming_get_current();
> struct LoadThreadData *data = thread_opaque;
> Error *error = NULL;
>
> if (!data->function(data->opaque, &mis->should_quit, &error)) {
> migrate_set_error(migrate_get_current(), error);
> }
>
> return 0;
> }
>
> migrate_set_error() is thread-safe, and it'll only record the 1st error.
>
> Then the thread should only read &should_quit, and only set &error. If we
> want, migrate_set_error() can set &should_quit.
>
> PS: I wished we have an unified place to tell whether we should quit
> incoming migration - we already have multifd_recv_state->exiting, we could
> have had a global flag like that then we can already use. But I know I'm
> asking too much.. However would you think it make sense to still have at
> least Error** report the error and record it?
>
This could work with the following changes/caveats:
* Needs g_autoptr(Error), otherwise these Error objects will leak - see
  the sketch after this list.
* "1st error" here is as arbitrary as with my current code since which
thread first acquires the mutex in migrate_set_error() is unspecified.
* We still need to test this new error flag (now as migrate_has_error())
in qemu_loadvm_state() to see whether we proceed forward with the
migration.
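For illustration, a sketch of how the leak-free variant could look,
assuming the signature proposed above but keeping the existing
load_threads_abort flag (more on that below):

static int qemu_loadvm_load_thread(void *thread_opaque)
{
    struct LoadThreadData *data = thread_opaque;
    g_autoptr(Error) error = NULL;

    if (!data->function(data->opaque, &load_threads_abort, &error)) {
        /*
         * migrate_set_error() stores a copy of the error, so the
         * local object still has to be freed - hence g_autoptr().
         */
        migrate_set_error(migrate_get_current(), error);
    }

    return 0;
}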
-------------------------------------------------------------------
Also, I am not in favor of replacing load_threads_abort with something
else since we still want to ask the threads to quit for other reasons, like
an earlier (non-load-threads related) failure in the migration process.
That's why we set this flag unconditionally in qemu_loadvm_state() -
see also my answer about that flag in the next message.
Thanks,
Maciej
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 08/24] migration: Add thread pool of optional load threads
2024-12-10 23:05 ` Maciej S. Szmigiero
@ 2024-12-12 16:55 ` Peter Xu
2024-12-12 22:53 ` Maciej S. Szmigiero
0 siblings, 1 reply; 140+ messages in thread
From: Peter Xu @ 2024-12-12 16:55 UTC (permalink / raw)
To: Maciej S. Szmigiero
Cc: Avihai Horon, Alex Williamson, Fabiano Rosas,
Cédric Le Goater, Eric Blake, Markus Armbruster,
Daniel P. Berrangé, Joao Martins, qemu-devel
On Wed, Dec 11, 2024 at 12:05:03AM +0100, Maciej S. Szmigiero wrote:
> On 4.12.2024 23:43, Peter Xu wrote:
> > On Thu, Nov 28, 2024 at 01:11:53PM +0100, Maciej S. Szmigiero wrote:
> > > > > +static int qemu_loadvm_load_thread(void *thread_opaque)
> > > > > +{
> > > > > + struct LoadThreadData *data = thread_opaque;
> > > > > + int ret;
> > > > > +
> > > > > + ret = data->function(&load_threads_abort, data->opaque);
> > > > > + if (ret && !qatomic_read(&load_threads_ret)) {
> > > > > + /*
> > > > > + * Racy with the above read but that's okay - which thread error
> > > > > + * return we report is purely arbitrary anyway.
> > > > > + */
> > > > > + qatomic_set(&load_threads_ret, ret);
> > > > > + }
> > > >
> > > > Can we use cmpxchg instead? E.g.:
> > > >
> > > > if (ret) {
> > > > qatomic_cmpxchg(&load_threads_ret, 0, ret);
> > > > }
> > >
> > > cmpxchg always forces sequentially consistent ordering
> > > while qatomic_read() and qatomic_set() have relaxed ordering.
> > >
> > > As the comment above describes, there's no need for sequential
> > > consistency since which thread error is returned is arbitrary
> > > anyway.
> >
> > IMHO this is not a hot path, so mem ordering isn't an issue. If we could
> > avoid any data race we still should try to.
> >
> > I do feel uneasy on the current design where everybody shares the "whether
> > to quit" via one bool, and any thread can set it... meanwhile we can't
> > stablize the first error to report later.
> >
> > E.g., ideally we want to capture the first error no matter where it came
> > from, then keep it with migrate_set_error() so that "query-migrate" on dest
> > later can tell us what was wrong. I think libvirt generally uses that.
> >
> > So as to support a string error, at least we'll need to allow Error** in
> > the thread fn:
> >
> > typedef bool (*MigrationLoadThread)(void *opaque, bool *should_quit,
> > Error **errp);
> >
> > I also changed retval to bool, as I mentioned elsewhere QEMU tries to stick
> > with "bool SOME_FUNCTION(..., Error **errp)" kind of error reporting.
> >
> > Then any thread should only report error to qemu_loadvm_load_thread(), and
> > the report should always be a local Error**, then it further reports to the
> > global error. Something like:
> >
> > static int qemu_loadvm_load_thread(void *thread_opaque)
> > {
> > MigrationIncomingState *mis = migration_incoming_get_current();
> > struct LoadThreadData *data = thread_opaque;
> > Error *error = NULL;
> >
> > if (!data->function(data->opaque, &mis->should_quit, &error)) {
> > migrate_set_error(migrate_get_current(), error);
> > }
> >
> > return 0;
> > }
> >
> > migrate_set_error() is thread-safe, and it'll only record the 1st error.
> >
> > Then the thread should only read &should_quit, and only set &error. If we
> > want, migrate_set_error() can set &should_quit.
> >
> > PS: I wished we have an unified place to tell whether we should quit
> > incoming migration - we already have multifd_recv_state->exiting, we could
> > have had a global flag like that then we can already use. But I know I'm
> > asking too much.. However would you think it make sense to still have at
> > least Error** report the error and record it?
> >
>
> This could work with the following changes/caveats:
> * Needs g_autoptr(Error) otherwise these Error objects will leak.
True.. or just error_free() it after set.
>
> * "1st error" here is as arbitrary as with my current code since which
> thread first acquires the mutex in migrate_set_error() is unspecified.
Yes, that's still a step forward in being verbose about errors, which is
almost always more helpful than a bool..
Not capturing exactly the 1st error in time sequence is OK - we don't
strongly require that, e.g. if two threads fail at nearly the same time
it's fine to record only one of them, no matter which one came first.
That's an unusual case to start with.
OTOH it does matter that we fail the other threads only _after_ we
set_error() for the first error. That way the captured error will almost
always be valid and the real 1st one.
>
> * We still need to test this new error flag (now as migrate_has_error())
> in qemu_loadvm_state() to see whether we proceed forward with the
> migration.
Yes, or just work like what this patch does: set mis->should_quit within
the 1st invocation of migrate_set_error(). For the longer term, maybe we
need to do more to consolidate all error setup/detection for migration..
but for now we can at least have this series set should_quit=true only
there. It should work like your series, except that the boolean won't be
writable by data->function() but read-only there, for the sake of
capturing the Error string.
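Roughly like this, on top of the current migrate_set_error() - a sketch,
with mis->should_quit being the hypothetical flag from my earlier mail:

void migrate_set_error(MigrationState *s, const Error *error)
{
    QEMU_LOCK_GUARD(&s->error_mutex);
    if (!s->error) {
        s->error = error_copy(error);
        /* first error recorded - ask the load threads to quit */
        qatomic_set(&migration_incoming_get_current()->should_quit, true);
    }
}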
>
> -------------------------------------------------------------------
>
> Also, I am not in favor of replacing load_threads_abort with something
> else since we still want to ask threads to quit for other reasons, like
> earlier (non-load threads related) failure in the migration process.
>
> That's why we set this flag unconditionally in qemu_loadvm_state() -
> see also my answer about that flag in the next message.
I'm not against having a boolean to say quit - maybe we should have that
for the !vfio use case too, and I'm OK if we introduce one. But I hope two
things can work out:
- Capture Error* and persist it in query-migrate (aka, use
migrate_set_error).
- Avoid setting load_threads_abort explicitly in the vmstate load path. It
should really be part of destroy(), IMHO, as I mentioned in the other
email, to recycle load threads in a failure case.
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 08/24] migration: Add thread pool of optional load threads
2024-12-12 16:55 ` Peter Xu
@ 2024-12-12 22:53 ` Maciej S. Szmigiero
2024-12-16 16:33 ` Peter Xu
0 siblings, 1 reply; 140+ messages in thread
From: Maciej S. Szmigiero @ 2024-12-12 22:53 UTC (permalink / raw)
To: Peter Xu
Cc: Avihai Horon, Alex Williamson, Fabiano Rosas,
Cédric Le Goater, Eric Blake, Markus Armbruster,
Daniel P. Berrangé, Joao Martins, qemu-devel
On 12.12.2024 17:55, Peter Xu wrote:
> On Wed, Dec 11, 2024 at 12:05:03AM +0100, Maciej S. Szmigiero wrote:
>> On 4.12.2024 23:43, Peter Xu wrote:
>>> On Thu, Nov 28, 2024 at 01:11:53PM +0100, Maciej S. Szmigiero wrote:
>>>>>> +static int qemu_loadvm_load_thread(void *thread_opaque)
>>>>>> +{
>>>>>> + struct LoadThreadData *data = thread_opaque;
>>>>>> + int ret;
>>>>>> +
>>>>>> + ret = data->function(&load_threads_abort, data->opaque);
>>>>>> + if (ret && !qatomic_read(&load_threads_ret)) {
>>>>>> + /*
>>>>>> + * Racy with the above read but that's okay - which thread error
>>>>>> + * return we report is purely arbitrary anyway.
>>>>>> + */
>>>>>> + qatomic_set(&load_threads_ret, ret);
>>>>>> + }
>>>>>
>>>>> Can we use cmpxchg instead? E.g.:
>>>>>
>>>>> if (ret) {
>>>>> qatomic_cmpxchg(&load_threads_ret, 0, ret);
>>>>> }
>>>>
>>>> cmpxchg always forces sequentially consistent ordering
>>>> while qatomic_read() and qatomic_set() have relaxed ordering.
>>>>
>>>> As the comment above describes, there's no need for sequential
>>>> consistency since which thread error is returned is arbitrary
>>>> anyway.
>>>
>>> IMHO this is not a hot path, so mem ordering isn't an issue. If we could
>>> avoid any data race we still should try to.
>>>
>>> I do feel uneasy on the current design where everybody shares the "whether
>>> to quit" via one bool, and any thread can set it... meanwhile we can't
>>> stablize the first error to report later.
>>>
>>> E.g., ideally we want to capture the first error no matter where it came
>>> from, then keep it with migrate_set_error() so that "query-migrate" on dest
>>> later can tell us what was wrong. I think libvirt generally uses that.
>>>
>>> So as to support a string error, at least we'll need to allow Error** in
>>> the thread fn:
>>>
>>> typedef bool (*MigrationLoadThread)(void *opaque, bool *should_quit,
>>> Error **errp);
>>>
>>> I also changed retval to bool, as I mentioned elsewhere QEMU tries to stick
>>> with "bool SOME_FUNCTION(..., Error **errp)" kind of error reporting.
>>>
>>> Then any thread should only report error to qemu_loadvm_load_thread(), and
>>> the report should always be a local Error**, then it further reports to the
>>> global error. Something like:
>>>
>>> static int qemu_loadvm_load_thread(void *thread_opaque)
>>> {
>>> MigrationIncomingState *mis = migration_incoming_get_current();
>>> struct LoadThreadData *data = thread_opaque;
>>> Error *error = NULL;
>>>
>>> if (!data->function(data->opaque, &mis->should_quit, &error)) {
>>> migrate_set_error(migrate_get_current(), error);
>>> }
>>>
>>> return 0;
>>> }
>>>
>>> migrate_set_error() is thread-safe, and it'll only record the 1st error.
>>>
>>> Then the thread should only read &should_quit, and only set &error. If we
>>> want, migrate_set_error() can set &should_quit.
>>>
>>> PS: I wished we have an unified place to tell whether we should quit
>>> incoming migration - we already have multifd_recv_state->exiting, we could
>>> have had a global flag like that then we can already use. But I know I'm
>>> asking too much.. However would you think it make sense to still have at
>>> least Error** report the error and record it?
>>>
>>
>> This could work with the following changes/caveats:
>> * Needs g_autoptr(Error) otherwise these Error objects will leak.
>
> True.. or just error_free() it after set.
>
>>
>> * "1st error" here is as arbitrary as with my current code since which
>> thread first acquires the mutex in migrate_set_error() is unspecified.
>
> Yes that's still a step forward on being verbose of errors, which is almost
> always more helpful than a bool..
>
> Not exactly the 1st error in time sequence is ok - we don't strongly ask
> for that, e.g. if two threads error at merely the same time it's ok we only
> record one of them no matter which one is first. That's unusual to start
> with.
>
> OTOH it matters on that we fail other threads only _after_ we set_error()
> for the first error. If so it's mostly always the case the captured error
> will be valid and the real 1st error.
>
>>
>> * We still need to test this new error flag (now as migrate_has_error())
>> in qemu_loadvm_state() to see whether we proceed forward with the
>> migration.
>
> Yes, or just to work like what this patch does: set mis->should_quit within
> the 1st setup of migrate_set_error(). For the longer term, maybe we need
> to do more to put together all error setup/detection for migration.. but
> for now we can at least do that for this series to set should_quit=true
> there only. It should work like your series, only that the boolean won't
> be writable to data->function() but read-only there, for the sake of
> capturing the Error string.
migrate_set_error() wouldn't be called until qemu_loadvm_state() exits
into process_incoming_migration_co().
Also this does not account for other qemu_loadvm_state() callers like
qmp_xen_load_devices_state() or load_snapshot().
While these other callers might not use load threads currently, it feels
wrong to wait for these threads in qemu_loadvm_state() but set their
termination/abort flag as a side effect of a completely different function
(migrate_set_error()).
Having a dedicated abort flag also makes the semantics easy to infer
from the code since one can simply grep for this flag name
(load_threads_abort) to see where it is being written.
Its name is also pretty descriptive, making it easy to immediately tell
what it does.
>>
>> -------------------------------------------------------------------
>>
>> Also, I am not in favor of replacing load_threads_abort with something
>> else since we still want to ask threads to quit for other reasons, like
>> earlier (non-load threads related) failure in the migration process.
>>
>> That's why we set this flag unconditionally in qemu_loadvm_state() -
>> see also my answer about that flag in the next message.
>
> I'm not against having a boolean to say quit, maybe we should have that for
> !vfio use case too, and I'm ok we introduce one. But I hope two things can
> work out:
>
> - Capture Error* and persist it in query-migrate (aka, use
> migrate_set_error).
Will do.
> - Avoid setting load_threads_abort explicitly in vmstate load path. It
> should really be part of destroy(), IMHO, as I mentioned in the other
> email, to recycle load threads in a failure case.
That's the same thread abort flag issue as in the first block of my reply
above.
> Thanks,
>
Thanks,
Maciej
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 08/24] migration: Add thread pool of optional load threads
2024-12-12 22:53 ` Maciej S. Szmigiero
@ 2024-12-16 16:33 ` Peter Xu
2024-12-16 23:15 ` Maciej S. Szmigiero
0 siblings, 1 reply; 140+ messages in thread
From: Peter Xu @ 2024-12-16 16:33 UTC (permalink / raw)
To: Maciej S. Szmigiero
Cc: Avihai Horon, Alex Williamson, Fabiano Rosas,
Cédric Le Goater, Eric Blake, Markus Armbruster,
Daniel P. Berrangé, Joao Martins, qemu-devel
On Thu, Dec 12, 2024 at 11:53:42PM +0100, Maciej S. Szmigiero wrote:
> migrate_set_error() wouldn't be called until qemu_loadvm_state() exits
> into process_incoming_migration_co().
>
> Also this does not account other qemu_loadvm_state() callers like
> qmp_xen_load_devices_state() or load_snapshot().
>
> While these other callers might not use load threads currently, it feels
> wrong to wait for these threads in qemu_loadvm_state() but set their
> termination/abort flag as a side effect of completely different function
> (migrate_set_error()).
>
> Having a dedicated abort flag also makes the semantics easy to infer
> from code since once can simply grep for this flag name (load_threads_abort)
> to see where it is being written.
>
> Its name is also pretty descriptive making it easy to immediately tell
> what it does.
That's fine. As long as we can at least report an Error** and remember it,
that's OK to me.
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 08/24] migration: Add thread pool of optional load threads
2024-12-16 16:33 ` Peter Xu
@ 2024-12-16 23:15 ` Maciej S. Szmigiero
0 siblings, 0 replies; 140+ messages in thread
From: Maciej S. Szmigiero @ 2024-12-16 23:15 UTC (permalink / raw)
To: Peter Xu
Cc: Avihai Horon, Alex Williamson, Fabiano Rosas,
Cédric Le Goater, Eric Blake, Markus Armbruster,
Daniel P. Berrangé, Joao Martins, qemu-devel
On 16.12.2024 17:33, Peter Xu wrote:
> On Thu, Dec 12, 2024 at 11:53:42PM +0100, Maciej S. Szmigiero wrote:
>> migrate_set_error() wouldn't be called until qemu_loadvm_state() exits
>> into process_incoming_migration_co().
>>
>> Also this does not account other qemu_loadvm_state() callers like
>> qmp_xen_load_devices_state() or load_snapshot().
>>
>> While these other callers might not use load threads currently, it feels
>> wrong to wait for these threads in qemu_loadvm_state() but set their
>> termination/abort flag as a side effect of completely different function
>> (migrate_set_error()).
>>
>> Having a dedicated abort flag also makes the semantics easy to infer
>> from code since once can simply grep for this flag name (load_threads_abort)
>> to see where it is being written.
>>
>> Its name is also pretty descriptive making it easy to immediately tell
>> what it does.
>
> That's fine. As long as we can at least report an Error** and remember that
> it's OK to me.
I think the above will be a good design indeed.
> Thanks,
>
Thanks,
Maciej
^ permalink raw reply [flat|nested] 140+ messages in thread
* [PATCH v3 09/24] migration/multifd: Split packet into header and RAM data
2024-11-17 19:19 [PATCH v3 00/24] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
` (7 preceding siblings ...)
2024-11-17 19:20 ` [PATCH v3 08/24] migration: Add thread pool of optional load threads Maciej S. Szmigiero
@ 2024-11-17 19:20 ` Maciej S. Szmigiero
2024-11-26 14:34 ` Fabiano Rosas
2024-12-05 15:29 ` Peter Xu
2024-11-17 19:20 ` [PATCH v3 10/24] migration/multifd: Device state transfer support - receive side Maciej S. Szmigiero
` (16 subsequent siblings)
25 siblings, 2 replies; 140+ messages in thread
From: Maciej S. Szmigiero @ 2024-11-17 19:20 UTC (permalink / raw)
To: Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
Read the packet header first so that in the future we will be able to
differentiate between a RAM multifd packet and a device state multifd
packet.
Since these two are of different sizes we can't read the packet body until
we know which packet type it is.
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
migration/multifd.c | 49 +++++++++++++++++++++++++++++++++++----------
migration/multifd.h | 5 +++++
2 files changed, 43 insertions(+), 11 deletions(-)
diff --git a/migration/multifd.c b/migration/multifd.c
index 498e71fd1024..999b88b7ebcb 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -209,10 +209,10 @@ void multifd_send_fill_packet(MultiFDSendParams *p)
memset(packet, 0, p->packet_len);
- packet->magic = cpu_to_be32(MULTIFD_MAGIC);
- packet->version = cpu_to_be32(MULTIFD_VERSION);
+ packet->hdr.magic = cpu_to_be32(MULTIFD_MAGIC);
+ packet->hdr.version = cpu_to_be32(MULTIFD_VERSION);
- packet->flags = cpu_to_be32(p->flags);
+ packet->hdr.flags = cpu_to_be32(p->flags);
packet->next_packet_size = cpu_to_be32(p->next_packet_size);
packet_num = qatomic_fetch_inc(&multifd_send_state->packet_num);
@@ -228,12 +228,12 @@ void multifd_send_fill_packet(MultiFDSendParams *p)
p->flags, p->next_packet_size);
}
-static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error **errp)
+static int multifd_recv_unfill_packet_header(MultiFDRecvParams *p,
+ const MultiFDPacketHdr_t *hdr,
+ Error **errp)
{
- const MultiFDPacket_t *packet = p->packet;
- uint32_t magic = be32_to_cpu(packet->magic);
- uint32_t version = be32_to_cpu(packet->version);
- int ret = 0;
+ uint32_t magic = be32_to_cpu(hdr->magic);
+ uint32_t version = be32_to_cpu(hdr->version);
if (magic != MULTIFD_MAGIC) {
error_setg(errp, "multifd: received packet magic %x, expected %x",
@@ -247,7 +247,16 @@ static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error **errp)
return -1;
}
- p->flags = be32_to_cpu(packet->flags);
+ p->flags = be32_to_cpu(hdr->flags);
+
+ return 0;
+}
+
+static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error **errp)
+{
+ const MultiFDPacket_t *packet = p->packet;
+ int ret = 0;
+
p->next_packet_size = be32_to_cpu(packet->next_packet_size);
p->packet_num = be64_to_cpu(packet->packet_num);
p->packets_recved++;
@@ -1126,8 +1135,12 @@ static void *multifd_recv_thread(void *opaque)
rcu_register_thread();
while (true) {
+ MultiFDPacketHdr_t hdr;
uint32_t flags = 0;
bool has_data = false;
+ uint8_t *pkt_buf;
+ size_t pkt_len;
+
p->normal_num = 0;
if (use_packets) {
@@ -1135,8 +1148,22 @@ static void *multifd_recv_thread(void *opaque)
break;
}
- ret = qio_channel_read_all_eof(p->c, (void *)p->packet,
- p->packet_len, &local_err);
+ ret = qio_channel_read_all_eof(p->c, (void *)&hdr,
+ sizeof(hdr), &local_err);
+ if (ret == 0 || ret == -1) { /* 0: EOF -1: Error */
+ break;
+ }
+
+ ret = multifd_recv_unfill_packet_header(p, &hdr, &local_err);
+ if (ret) {
+ break;
+ }
+
+ pkt_buf = (uint8_t *)p->packet + sizeof(hdr);
+ pkt_len = p->packet_len - sizeof(hdr);
+
+ ret = qio_channel_read_all_eof(p->c, (char *)pkt_buf, pkt_len,
+ &local_err);
if (ret == 0 || ret == -1) { /* 0: EOF -1: Error */
break;
}
diff --git a/migration/multifd.h b/migration/multifd.h
index 50d58c0c9cec..106a48496dc6 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -53,6 +53,11 @@ typedef struct {
uint32_t magic;
uint32_t version;
uint32_t flags;
+} __attribute__((packed)) MultiFDPacketHdr_t;
+
+typedef struct {
+ MultiFDPacketHdr_t hdr;
+
/* maximum number of allocated pages */
uint32_t pages_alloc;
/* non zero pages */
^ permalink raw reply related [flat|nested] 140+ messages in thread
* Re: [PATCH v3 09/24] migration/multifd: Split packet into header and RAM data
2024-11-17 19:20 ` [PATCH v3 09/24] migration/multifd: Split packet into header and RAM data Maciej S. Szmigiero
@ 2024-11-26 14:34 ` Fabiano Rosas
2024-12-05 15:29 ` Peter Xu
1 sibling, 0 replies; 140+ messages in thread
From: Fabiano Rosas @ 2024-11-26 14:34 UTC (permalink / raw)
To: Maciej S. Szmigiero, Peter Xu
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
"Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>
> Read packet header first so in the future we will be able to
> differentiate between a RAM multifd packet and a device state multifd
> packet.
>
> Since these two are of different size we can't read the packet body until
> we know which packet type it is.
>
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 09/24] migration/multifd: Split packet into header and RAM data
2024-11-17 19:20 ` [PATCH v3 09/24] migration/multifd: Split packet into header and RAM data Maciej S. Szmigiero
2024-11-26 14:34 ` Fabiano Rosas
@ 2024-12-05 15:29 ` Peter Xu
1 sibling, 0 replies; 140+ messages in thread
From: Peter Xu @ 2024-12-05 15:29 UTC (permalink / raw)
To: Maciej S. Szmigiero
Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On Sun, Nov 17, 2024 at 08:20:04PM +0100, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>
> Read packet header first so in the future we will be able to
> differentiate between a RAM multifd packet and a device state multifd
> packet.
>
> Since these two are of different size we can't read the packet body until
> we know which packet type it is.
>
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
--
Peter Xu
^ permalink raw reply [flat|nested] 140+ messages in thread
* [PATCH v3 10/24] migration/multifd: Device state transfer support - receive side
2024-11-17 19:19 [PATCH v3 00/24] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
` (8 preceding siblings ...)
2024-11-17 19:20 ` [PATCH v3 09/24] migration/multifd: Split packet into header and RAM data Maciej S. Szmigiero
@ 2024-11-17 19:20 ` Maciej S. Szmigiero
2024-12-05 16:06 ` Peter Xu
2024-11-17 19:20 ` [PATCH v3 11/24] migration/multifd: Make multifd_send() thread safe Maciej S. Szmigiero
` (15 subsequent siblings)
25 siblings, 1 reply; 140+ messages in thread
From: Maciej S. Szmigiero @ 2024-11-17 19:20 UTC (permalink / raw)
To: Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
Add basic support for receiving device state via multifd channels -
channels that are shared with RAM transfers.
Depending on whether the MULTIFD_FLAG_DEVICE_STATE flag is present in the
packet header, either device state (MultiFDPacketDeviceState_t) or RAM
data (existing MultiFDPacket_t) is read.
The received device state data is provided to the
qemu_loadvm_load_state_buffer() function for processing in the
device's load_state_buffer handler.
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
migration/multifd.c | 87 +++++++++++++++++++++++++++++++++++++++++----
migration/multifd.h | 26 +++++++++++++-
2 files changed, 105 insertions(+), 8 deletions(-)
diff --git a/migration/multifd.c b/migration/multifd.c
index 999b88b7ebcb..9578a985449b 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -21,6 +21,7 @@
#include "file.h"
#include "migration.h"
#include "migration-stats.h"
+#include "savevm.h"
#include "socket.h"
#include "tls.h"
#include "qemu-file.h"
@@ -252,14 +253,24 @@ static int multifd_recv_unfill_packet_header(MultiFDRecvParams *p,
return 0;
}
-static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error **errp)
+static int multifd_recv_unfill_packet_device_state(MultiFDRecvParams *p,
+ Error **errp)
+{
+ MultiFDPacketDeviceState_t *packet = p->packet_dev_state;
+
+ packet->instance_id = be32_to_cpu(packet->instance_id);
+ p->next_packet_size = be32_to_cpu(packet->next_packet_size);
+
+ return 0;
+}
+
+static int multifd_recv_unfill_packet_ram(MultiFDRecvParams *p, Error **errp)
{
const MultiFDPacket_t *packet = p->packet;
int ret = 0;
p->next_packet_size = be32_to_cpu(packet->next_packet_size);
p->packet_num = be64_to_cpu(packet->packet_num);
- p->packets_recved++;
if (!(p->flags & MULTIFD_FLAG_SYNC)) {
ret = multifd_ram_unfill_packet(p, errp);
@@ -271,6 +282,17 @@ static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error **errp)
return ret;
}
+static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error **errp)
+{
+ p->packets_recved++;
+
+ if (p->flags & MULTIFD_FLAG_DEVICE_STATE) {
+ return multifd_recv_unfill_packet_device_state(p, errp);
+ }
+
+ return multifd_recv_unfill_packet_ram(p, errp);
+}
+
static bool multifd_send_should_exit(void)
{
return qatomic_read(&multifd_send_state->exiting);
@@ -1023,6 +1045,7 @@ static void multifd_recv_cleanup_channel(MultiFDRecvParams *p)
p->packet_len = 0;
g_free(p->packet);
p->packet = NULL;
+ g_clear_pointer(&p->packet_dev_state, g_free);
g_free(p->normal);
p->normal = NULL;
g_free(p->zero);
@@ -1124,6 +1147,28 @@ void multifd_recv_sync_main(void)
trace_multifd_recv_sync_main(multifd_recv_state->packet_num);
}
+static int multifd_device_state_recv(MultiFDRecvParams *p, Error **errp)
+{
+ g_autofree char *idstr = NULL;
+ g_autofree char *dev_state_buf = NULL;
+ int ret;
+
+ dev_state_buf = g_malloc(p->next_packet_size);
+
+ ret = qio_channel_read_all(p->c, dev_state_buf, p->next_packet_size, errp);
+ if (ret != 0) {
+ return ret;
+ }
+
+ idstr = g_strndup(p->packet_dev_state->idstr,
+ sizeof(p->packet_dev_state->idstr));
+
+ return qemu_loadvm_load_state_buffer(idstr,
+ p->packet_dev_state->instance_id,
+ dev_state_buf, p->next_packet_size,
+ errp);
+}
+
static void *multifd_recv_thread(void *opaque)
{
MultiFDRecvParams *p = opaque;
@@ -1137,6 +1182,7 @@ static void *multifd_recv_thread(void *opaque)
while (true) {
MultiFDPacketHdr_t hdr;
uint32_t flags = 0;
+ bool is_device_state = false;
bool has_data = false;
uint8_t *pkt_buf;
size_t pkt_len;
@@ -1159,8 +1205,14 @@ static void *multifd_recv_thread(void *opaque)
break;
}
- pkt_buf = (uint8_t *)p->packet + sizeof(hdr);
- pkt_len = p->packet_len - sizeof(hdr);
+ is_device_state = p->flags & MULTIFD_FLAG_DEVICE_STATE;
+ if (is_device_state) {
+ pkt_buf = (uint8_t *)p->packet_dev_state + sizeof(hdr);
+ pkt_len = sizeof(*p->packet_dev_state) - sizeof(hdr);
+ } else {
+ pkt_buf = (uint8_t *)p->packet + sizeof(hdr);
+ pkt_len = p->packet_len - sizeof(hdr);
+ }
ret = qio_channel_read_all_eof(p->c, (char *)pkt_buf, pkt_len,
&local_err);
@@ -1178,9 +1230,14 @@ static void *multifd_recv_thread(void *opaque)
flags = p->flags;
/* recv methods don't know how to handle the SYNC flag */
p->flags &= ~MULTIFD_FLAG_SYNC;
- if (!(flags & MULTIFD_FLAG_SYNC)) {
- has_data = p->normal_num || p->zero_num;
+
+ if (is_device_state) {
+ has_data = p->next_packet_size > 0;
+ } else {
+ has_data = !(flags & MULTIFD_FLAG_SYNC) &&
+ (p->normal_num || p->zero_num);
}
+
qemu_mutex_unlock(&p->mutex);
} else {
/*
@@ -1209,14 +1266,29 @@ static void *multifd_recv_thread(void *opaque)
}
if (has_data) {
- ret = multifd_recv_state->ops->recv(p, &local_err);
+ if (is_device_state) {
+ assert(use_packets);
+ ret = multifd_device_state_recv(p, &local_err);
+ } else {
+ ret = multifd_recv_state->ops->recv(p, &local_err);
+ }
if (ret != 0) {
break;
}
+ } else if (is_device_state) {
+ error_setg(&local_err,
+ "multifd: received empty device state packet");
+ break;
}
if (use_packets) {
if (flags & MULTIFD_FLAG_SYNC) {
+ if (is_device_state) {
+ error_setg(&local_err,
+ "multifd: received SYNC device state packet");
+ break;
+ }
+
qemu_sem_post(&multifd_recv_state->sem_sync);
qemu_sem_wait(&p->sem_sync);
}
@@ -1285,6 +1357,7 @@ int multifd_recv_setup(Error **errp)
p->packet_len = sizeof(MultiFDPacket_t)
+ sizeof(uint64_t) * page_count;
p->packet = g_malloc0(p->packet_len);
+ p->packet_dev_state = g_malloc0(sizeof(*p->packet_dev_state));
}
p->name = g_strdup_printf(MIGRATION_THREAD_DST_MULTIFD, i);
p->normal = g_new0(ram_addr_t, page_count);
diff --git a/migration/multifd.h b/migration/multifd.h
index 106a48496dc6..026b653057e2 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -46,6 +46,12 @@ MultiFDRecvData *multifd_get_recv_data(void);
#define MULTIFD_FLAG_UADK (8 << 1)
#define MULTIFD_FLAG_QATZIP (16 << 1)
+/*
+ * If set it means that this packet contains device state
+ * (MultiFDPacketDeviceState_t), not RAM data (MultiFDPacket_t).
+ */
+#define MULTIFD_FLAG_DEVICE_STATE (1 << 6)
+
/* This value needs to be a multiple of qemu_target_page_size() */
#define MULTIFD_PACKET_SIZE (512 * 1024)
@@ -78,6 +84,16 @@ typedef struct {
uint64_t offset[];
} __attribute__((packed)) MultiFDPacket_t;
+typedef struct {
+ MultiFDPacketHdr_t hdr;
+
+ char idstr[256] QEMU_NONSTRING;
+ uint32_t instance_id;
+
+ /* size of the next packet that contains the actual data */
+ uint32_t next_packet_size;
+} __attribute__((packed)) MultiFDPacketDeviceState_t;
+
typedef struct {
/* number of used pages */
uint32_t num;
@@ -95,6 +111,13 @@ struct MultiFDRecvData {
off_t file_offset;
};
+typedef struct {
+ char *idstr;
+ uint32_t instance_id;
+ char *buf;
+ size_t buf_len;
+} MultiFDDeviceState_t;
+
typedef enum {
MULTIFD_PAYLOAD_NONE,
MULTIFD_PAYLOAD_RAM,
@@ -210,8 +233,9 @@ typedef struct {
/* thread local variables. No locking required */
- /* pointer to the packet */
+ /* pointers to the possible packet types */
MultiFDPacket_t *packet;
+ MultiFDPacketDeviceState_t *packet_dev_state;
/* size of the next packet that contains pages */
uint32_t next_packet_size;
/* packets received through this channel */
^ permalink raw reply related [flat|nested] 140+ messages in thread
* Re: [PATCH v3 10/24] migration/multifd: Device state transfer support - receive side
2024-11-17 19:20 ` [PATCH v3 10/24] migration/multifd: Device state transfer support - receive side Maciej S. Szmigiero
@ 2024-12-05 16:06 ` Peter Xu
2024-12-06 21:12 ` Maciej S. Szmigiero
0 siblings, 1 reply; 140+ messages in thread
From: Peter Xu @ 2024-12-05 16:06 UTC (permalink / raw)
To: Maciej S. Szmigiero
Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On Sun, Nov 17, 2024 at 08:20:05PM +0100, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>
> Add a basic support for receiving device state via multifd channels -
> channels that are shared with RAM transfers.
>
> Depending whether MULTIFD_FLAG_DEVICE_STATE flag is present or not in the
> packet header either device state (MultiFDPacketDeviceState_t) or RAM
> data (existing MultiFDPacket_t) is read.
>
> The received device state data is provided to
> qemu_loadvm_load_state_buffer() function for processing in the
> device's load_state_buffer handler.
>
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Only a few nitpicks:
> ---
> migration/multifd.c | 87 +++++++++++++++++++++++++++++++++++++++++----
> migration/multifd.h | 26 +++++++++++++-
> 2 files changed, 105 insertions(+), 8 deletions(-)
>
> diff --git a/migration/multifd.c b/migration/multifd.c
> index 999b88b7ebcb..9578a985449b 100644
> --- a/migration/multifd.c
> +++ b/migration/multifd.c
> @@ -21,6 +21,7 @@
> #include "file.h"
> #include "migration.h"
> #include "migration-stats.h"
> +#include "savevm.h"
> #include "socket.h"
> #include "tls.h"
> #include "qemu-file.h"
> @@ -252,14 +253,24 @@ static int multifd_recv_unfill_packet_header(MultiFDRecvParams *p,
> return 0;
> }
>
> -static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error **errp)
> +static int multifd_recv_unfill_packet_device_state(MultiFDRecvParams *p,
> + Error **errp)
> +{
> + MultiFDPacketDeviceState_t *packet = p->packet_dev_state;
> +
> + packet->instance_id = be32_to_cpu(packet->instance_id);
> + p->next_packet_size = be32_to_cpu(packet->next_packet_size);
> +
> + return 0;
> +}
> +
> +static int multifd_recv_unfill_packet_ram(MultiFDRecvParams *p, Error **errp)
> {
> const MultiFDPacket_t *packet = p->packet;
> int ret = 0;
>
> p->next_packet_size = be32_to_cpu(packet->next_packet_size);
> p->packet_num = be64_to_cpu(packet->packet_num);
> - p->packets_recved++;
>
> if (!(p->flags & MULTIFD_FLAG_SYNC)) {
> ret = multifd_ram_unfill_packet(p, errp);
> @@ -271,6 +282,17 @@ static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error **errp)
> return ret;
> }
>
> +static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error **errp)
> +{
> + p->packets_recved++;
> +
> + if (p->flags & MULTIFD_FLAG_DEVICE_STATE) {
> + return multifd_recv_unfill_packet_device_state(p, errp);
> + }
> +
> + return multifd_recv_unfill_packet_ram(p, errp);
> +}
> +
> static bool multifd_send_should_exit(void)
> {
> return qatomic_read(&multifd_send_state->exiting);
> @@ -1023,6 +1045,7 @@ static void multifd_recv_cleanup_channel(MultiFDRecvParams *p)
> p->packet_len = 0;
> g_free(p->packet);
> p->packet = NULL;
> + g_clear_pointer(&p->packet_dev_state, g_free);
> g_free(p->normal);
> p->normal = NULL;
> g_free(p->zero);
> @@ -1124,6 +1147,28 @@ void multifd_recv_sync_main(void)
> trace_multifd_recv_sync_main(multifd_recv_state->packet_num);
> }
>
> +static int multifd_device_state_recv(MultiFDRecvParams *p, Error **errp)
> +{
> + g_autofree char *idstr = NULL;
> + g_autofree char *dev_state_buf = NULL;
> + int ret;
> +
> + dev_state_buf = g_malloc(p->next_packet_size);
> +
> + ret = qio_channel_read_all(p->c, dev_state_buf, p->next_packet_size, errp);
> + if (ret != 0) {
> + return ret;
> + }
> +
> + idstr = g_strndup(p->packet_dev_state->idstr,
> + sizeof(p->packet_dev_state->idstr));
> +
> + return qemu_loadvm_load_state_buffer(idstr,
> + p->packet_dev_state->instance_id,
> + dev_state_buf, p->next_packet_size,
> + errp);
> +}
> +
> static void *multifd_recv_thread(void *opaque)
> {
> MultiFDRecvParams *p = opaque;
> @@ -1137,6 +1182,7 @@ static void *multifd_recv_thread(void *opaque)
> while (true) {
> MultiFDPacketHdr_t hdr;
> uint32_t flags = 0;
> + bool is_device_state = false;
> bool has_data = false;
> uint8_t *pkt_buf;
> size_t pkt_len;
> @@ -1159,8 +1205,14 @@ static void *multifd_recv_thread(void *opaque)
> break;
> }
>
> - pkt_buf = (uint8_t *)p->packet + sizeof(hdr);
> - pkt_len = p->packet_len - sizeof(hdr);
> + is_device_state = p->flags & MULTIFD_FLAG_DEVICE_STATE;
> + if (is_device_state) {
> + pkt_buf = (uint8_t *)p->packet_dev_state + sizeof(hdr);
> + pkt_len = sizeof(*p->packet_dev_state) - sizeof(hdr);
> + } else {
> + pkt_buf = (uint8_t *)p->packet + sizeof(hdr);
> + pkt_len = p->packet_len - sizeof(hdr);
> + }
>
> ret = qio_channel_read_all_eof(p->c, (char *)pkt_buf, pkt_len,
> &local_err);
> @@ -1178,9 +1230,14 @@ static void *multifd_recv_thread(void *opaque)
> flags = p->flags;
> /* recv methods don't know how to handle the SYNC flag */
> p->flags &= ~MULTIFD_FLAG_SYNC;
> - if (!(flags & MULTIFD_FLAG_SYNC)) {
> - has_data = p->normal_num || p->zero_num;
> +
> + if (is_device_state) {
> + has_data = p->next_packet_size > 0;
> + } else {
> + has_data = !(flags & MULTIFD_FLAG_SYNC) &&
> + (p->normal_num || p->zero_num);
> }
> +
> qemu_mutex_unlock(&p->mutex);
> } else {
> /*
> @@ -1209,14 +1266,29 @@ static void *multifd_recv_thread(void *opaque)
> }
>
> if (has_data) {
> - ret = multifd_recv_state->ops->recv(p, &local_err);
> + if (is_device_state) {
> + assert(use_packets);
> + ret = multifd_device_state_recv(p, &local_err);
> + } else {
> + ret = multifd_recv_state->ops->recv(p, &local_err);
> + }
> if (ret != 0) {
> break;
> }
> + } else if (is_device_state) {
> + error_setg(&local_err,
> + "multifd: received empty device state packet");
> + break;
You used assert anyway elsewhere, and this also smells like a programming
error. We could stick with the assert above and reduce the "if / else if ...":

if (is_device_state) {
    assert(p->next_packet_size > 0);
    has_data = true;
}
Then drop else if.
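So the has_data computation would collapse to something like (sketch):

if (is_device_state) {
    /* an empty device state packet would be a sender-side bug */
    assert(p->next_packet_size > 0);
    has_data = true;
} else {
    has_data = !(flags & MULTIFD_FLAG_SYNC) &&
               (p->normal_num || p->zero_num);
}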
> }
>
> if (use_packets) {
> if (flags & MULTIFD_FLAG_SYNC) {
> + if (is_device_state) {
> + error_setg(&local_err,
> + "multifd: received SYNC device state packet");
> + break;
> + }
Same here. I'd use assert().
> +
> qemu_sem_post(&multifd_recv_state->sem_sync);
> qemu_sem_wait(&p->sem_sync);
> }
> @@ -1285,6 +1357,7 @@ int multifd_recv_setup(Error **errp)
> p->packet_len = sizeof(MultiFDPacket_t)
> + sizeof(uint64_t) * page_count;
> p->packet = g_malloc0(p->packet_len);
> + p->packet_dev_state = g_malloc0(sizeof(*p->packet_dev_state));
> }
> p->name = g_strdup_printf(MIGRATION_THREAD_DST_MULTIFD, i);
> p->normal = g_new0(ram_addr_t, page_count);
> diff --git a/migration/multifd.h b/migration/multifd.h
> index 106a48496dc6..026b653057e2 100644
> --- a/migration/multifd.h
> +++ b/migration/multifd.h
> @@ -46,6 +46,12 @@ MultiFDRecvData *multifd_get_recv_data(void);
> #define MULTIFD_FLAG_UADK (8 << 1)
> #define MULTIFD_FLAG_QATZIP (16 << 1)
>
> +/*
> + * If set it means that this packet contains device state
> + * (MultiFDPacketDeviceState_t), not RAM data (MultiFDPacket_t).
> + */
> +#define MULTIFD_FLAG_DEVICE_STATE (1 << 6)
> +
> /* This value needs to be a multiple of qemu_target_page_size() */
> #define MULTIFD_PACKET_SIZE (512 * 1024)
>
> @@ -78,6 +84,16 @@ typedef struct {
> uint64_t offset[];
> } __attribute__((packed)) MultiFDPacket_t;
>
> +typedef struct {
> + MultiFDPacketHdr_t hdr;
> +
> + char idstr[256] QEMU_NONSTRING;
> + uint32_t instance_id;
> +
> + /* size of the next packet that contains the actual data */
> + uint32_t next_packet_size;
> +} __attribute__((packed)) MultiFDPacketDeviceState_t;
> +
> typedef struct {
> /* number of used pages */
> uint32_t num;
> @@ -95,6 +111,13 @@ struct MultiFDRecvData {
> off_t file_offset;
> };
>
> +typedef struct {
> + char *idstr;
> + uint32_t instance_id;
> + char *buf;
> + size_t buf_len;
> +} MultiFDDeviceState_t;
> +
> typedef enum {
> MULTIFD_PAYLOAD_NONE,
> MULTIFD_PAYLOAD_RAM,
> @@ -210,8 +233,9 @@ typedef struct {
>
> /* thread local variables. No locking required */
>
> - /* pointer to the packet */
> + /* pointers to the possible packet types */
> MultiFDPacket_t *packet;
> + MultiFDPacketDeviceState_t *packet_dev_state;
> /* size of the next packet that contains pages */
> uint32_t next_packet_size;
> /* packets received through this channel */
>
--
Peter Xu
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 10/24] migration/multifd: Device state transfer support - receive side
2024-12-05 16:06 ` Peter Xu
@ 2024-12-06 21:12 ` Maciej S. Szmigiero
2024-12-06 21:57 ` Peter Xu
0 siblings, 1 reply; 140+ messages in thread
From: Maciej S. Szmigiero @ 2024-12-06 21:12 UTC (permalink / raw)
To: Peter Xu
Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P. Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On 5.12.2024 17:06, Peter Xu wrote:
> On Sun, Nov 17, 2024 at 08:20:05PM +0100, Maciej S. Szmigiero wrote:
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> Add a basic support for receiving device state via multifd channels -
>> channels that are shared with RAM transfers.
>>
>> Depending whether MULTIFD_FLAG_DEVICE_STATE flag is present or not in the
>> packet header either device state (MultiFDPacketDeviceState_t) or RAM
>> data (existing MultiFDPacket_t) is read.
>>
>> The received device state data is provided to
>> qemu_loadvm_load_state_buffer() function for processing in the
>> device's load_state_buffer handler.
>>
>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>
> Reviewed-by: Peter Xu <peterx@redhat.com>
>
> Only a few nitpicks:
>
>> ---
>> migration/multifd.c | 87 +++++++++++++++++++++++++++++++++++++++++----
>> migration/multifd.h | 26 +++++++++++++-
>> 2 files changed, 105 insertions(+), 8 deletions(-)
>>
>> diff --git a/migration/multifd.c b/migration/multifd.c
>> index 999b88b7ebcb..9578a985449b 100644
>> --- a/migration/multifd.c
>> +++ b/migration/multifd.c
>> @@ -21,6 +21,7 @@
>> #include "file.h"
>> #include "migration.h"
>> #include "migration-stats.h"
>> +#include "savevm.h"
>> #include "socket.h"
>> #include "tls.h"
>> #include "qemu-file.h"
>> @@ -252,14 +253,24 @@ static int multifd_recv_unfill_packet_header(MultiFDRecvParams *p,
>> return 0;
>> }
>>
>> -static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error **errp)
>> +static int multifd_recv_unfill_packet_device_state(MultiFDRecvParams *p,
>> + Error **errp)
>> +{
>> + MultiFDPacketDeviceState_t *packet = p->packet_dev_state;
>> +
>> + packet->instance_id = be32_to_cpu(packet->instance_id);
>> + p->next_packet_size = be32_to_cpu(packet->next_packet_size);
>> +
>> + return 0;
>> +}
>> +
>> +static int multifd_recv_unfill_packet_ram(MultiFDRecvParams *p, Error **errp)
>> {
>> const MultiFDPacket_t *packet = p->packet;
>> int ret = 0;
>>
>> p->next_packet_size = be32_to_cpu(packet->next_packet_size);
>> p->packet_num = be64_to_cpu(packet->packet_num);
>> - p->packets_recved++;
>>
>> if (!(p->flags & MULTIFD_FLAG_SYNC)) {
>> ret = multifd_ram_unfill_packet(p, errp);
>> @@ -271,6 +282,17 @@ static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error **errp)
>> return ret;
>> }
>>
>> +static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error **errp)
>> +{
>> + p->packets_recved++;
>> +
>> + if (p->flags & MULTIFD_FLAG_DEVICE_STATE) {
>> + return multifd_recv_unfill_packet_device_state(p, errp);
>> + }
>> +
>> + return multifd_recv_unfill_packet_ram(p, errp);
>> +}
>> +
>> static bool multifd_send_should_exit(void)
>> {
>> return qatomic_read(&multifd_send_state->exiting);
>> @@ -1023,6 +1045,7 @@ static void multifd_recv_cleanup_channel(MultiFDRecvParams *p)
>> p->packet_len = 0;
>> g_free(p->packet);
>> p->packet = NULL;
>> + g_clear_pointer(&p->packet_dev_state, g_free);
>> g_free(p->normal);
>> p->normal = NULL;
>> g_free(p->zero);
>> @@ -1124,6 +1147,28 @@ void multifd_recv_sync_main(void)
>> trace_multifd_recv_sync_main(multifd_recv_state->packet_num);
>> }
>>
>> +static int multifd_device_state_recv(MultiFDRecvParams *p, Error **errp)
>> +{
>> + g_autofree char *idstr = NULL;
>> + g_autofree char *dev_state_buf = NULL;
>> + int ret;
>> +
>> + dev_state_buf = g_malloc(p->next_packet_size);
>> +
>> + ret = qio_channel_read_all(p->c, dev_state_buf, p->next_packet_size, errp);
>> + if (ret != 0) {
>> + return ret;
>> + }
>> +
>> + idstr = g_strndup(p->packet_dev_state->idstr,
>> + sizeof(p->packet_dev_state->idstr));
>> +
>> + return qemu_loadvm_load_state_buffer(idstr,
>> + p->packet_dev_state->instance_id,
>> + dev_state_buf, p->next_packet_size,
>> + errp);
>> +}
>> +
>> static void *multifd_recv_thread(void *opaque)
>> {
>> MultiFDRecvParams *p = opaque;
>> @@ -1137,6 +1182,7 @@ static void *multifd_recv_thread(void *opaque)
>> while (true) {
>> MultiFDPacketHdr_t hdr;
>> uint32_t flags = 0;
>> + bool is_device_state = false;
>> bool has_data = false;
>> uint8_t *pkt_buf;
>> size_t pkt_len;
>> @@ -1159,8 +1205,14 @@ static void *multifd_recv_thread(void *opaque)
>> break;
>> }
>>
>> - pkt_buf = (uint8_t *)p->packet + sizeof(hdr);
>> - pkt_len = p->packet_len - sizeof(hdr);
>> + is_device_state = p->flags & MULTIFD_FLAG_DEVICE_STATE;
>> + if (is_device_state) {
>> + pkt_buf = (uint8_t *)p->packet_dev_state + sizeof(hdr);
>> + pkt_len = sizeof(*p->packet_dev_state) - sizeof(hdr);
>> + } else {
>> + pkt_buf = (uint8_t *)p->packet + sizeof(hdr);
>> + pkt_len = p->packet_len - sizeof(hdr);
>> + }
>>
>> ret = qio_channel_read_all_eof(p->c, (char *)pkt_buf, pkt_len,
>> &local_err);
>> @@ -1178,9 +1230,14 @@ static void *multifd_recv_thread(void *opaque)
>> flags = p->flags;
>> /* recv methods don't know how to handle the SYNC flag */
>> p->flags &= ~MULTIFD_FLAG_SYNC;
>> - if (!(flags & MULTIFD_FLAG_SYNC)) {
>> - has_data = p->normal_num || p->zero_num;
>> +
>> + if (is_device_state) {
>> + has_data = p->next_packet_size > 0;
>> + } else {
>> + has_data = !(flags & MULTIFD_FLAG_SYNC) &&
>> + (p->normal_num || p->zero_num);
>> }
>> +
>> qemu_mutex_unlock(&p->mutex);
>> } else {
>> /*
>> @@ -1209,14 +1266,29 @@ static void *multifd_recv_thread(void *opaque)
>> }
>>
>> if (has_data) {
>> - ret = multifd_recv_state->ops->recv(p, &local_err);
>> + if (is_device_state) {
>> + assert(use_packets);
>> + ret = multifd_device_state_recv(p, &local_err);
>> + } else {
>> + ret = multifd_recv_state->ops->recv(p, &local_err);
>> + }
>> if (ret != 0) {
>> break;
>> }
>> + } else if (is_device_state) {
>> + error_setg(&local_err,
>> + "multifd: received empty device state packet");
>> + break;
>
> You used assert anyway elsewhere, and this also smells like a programming
> error. We could stick with assert above and reduce the "if / else if ...":
>
> if (is_device_state) {
>     assert(p->next_packet_size > 0);
>     has_data = true;
> }
>
> Then drop else if.
It's not necessarily a programming error, but rather a problem with the
received bit stream or its incompatibility with the receiving QEMU version.
So I think returning an error is more appropriate than triggering
an assert() failure for that.
>> }
>>
>> if (use_packets) {
>> if (flags & MULTIFD_FLAG_SYNC) {
>> + if (is_device_state) {
>> + error_setg(&local_err,
>> + "multifd: received SYNC device state packet");
>> + break;
>> + }
>
> Same here. I'd use assert().
>
Same here :) - the sender may have sent us a wrong packet, or a packet of
an incompatible version; we should handle this gracefully rather than
assert()/abort() QEMU.
Thanks,
Maciej
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 10/24] migration/multifd: Device state transfer support - receive side
2024-12-06 21:12 ` Maciej S. Szmigiero
@ 2024-12-06 21:57 ` Peter Xu
0 siblings, 0 replies; 140+ messages in thread
From: Peter Xu @ 2024-12-06 21:57 UTC (permalink / raw)
To: Maciej S. Szmigiero
Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P. Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On Fri, Dec 06, 2024 at 10:12:27PM +0100, Maciej S. Szmigiero wrote:
> Same here :) - the sender sent us possibly wrong packet or packet of
> incompatible version, we should handle this gracefully rather than
> assert()/abort() QEMU.
Ah, sure. Feel free to keep them. My R-b still stands.
--
Peter Xu
^ permalink raw reply [flat|nested] 140+ messages in thread
* [PATCH v3 11/24] migration/multifd: Make multifd_send() thread safe
2024-11-17 19:19 [PATCH v3 00/24] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
` (9 preceding siblings ...)
2024-11-17 19:20 ` [PATCH v3 10/24] migration/multifd: Device state transfer support - receive side Maciej S. Szmigiero
@ 2024-11-17 19:20 ` Maciej S. Szmigiero
2024-12-05 16:17 ` Peter Xu
2024-11-17 19:20 ` [PATCH v3 12/24] migration/multifd: Add an explicit MultiFDSendData destructor Maciej S. Szmigiero
` (14 subsequent siblings)
25 siblings, 1 reply; 140+ messages in thread
From: Maciej S. Szmigiero @ 2024-11-17 19:20 UTC (permalink / raw)
To: Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
The multifd_send() function is currently not thread safe; make it thread
safe by holding a lock during its execution.
This way it will be possible to safely call it concurrently from multiple
threads.
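The part that actually needs serializing is mainly the round-robin channel
selection - roughly this fragment of multifd_send() (quoted here just for
illustration):

for (i = next_channel;; i = (i + 1) % migrate_multifd_channels()) {
    p = &multifd_send_state->params[i];
    if (qatomic_read(&p->pending_job) == false) {
        next_channel = (i + 1) % migrate_multifd_channels();
        break;
    }
}

Without a lock, two concurrent callers could race on the static
next_channel variable and claim the same idle channel.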
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
migration/multifd.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/migration/multifd.c b/migration/multifd.c
index 9578a985449b..4575495c8816 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -50,6 +50,10 @@ typedef struct {
struct {
MultiFDSendParams *params;
+
+ /* multifd_send() body is not thread safe, needs serialization */
+ QemuMutex multifd_send_mutex;
+
/*
* Global number of generated multifd packets.
*
@@ -331,6 +335,7 @@ static void multifd_send_kick_main(MultiFDSendParams *p)
*/
bool multifd_send(MultiFDSendData **send_data)
{
+ QEMU_LOCK_GUARD(&multifd_send_state->multifd_send_mutex);
int i;
static int next_channel;
MultiFDSendParams *p = NULL; /* make happy gcc */
@@ -508,6 +513,7 @@ static void multifd_send_cleanup_state(void)
socket_cleanup_outgoing_migration();
qemu_sem_destroy(&multifd_send_state->channels_created);
qemu_sem_destroy(&multifd_send_state->channels_ready);
+ qemu_mutex_destroy(&multifd_send_state->multifd_send_mutex);
g_free(multifd_send_state->params);
multifd_send_state->params = NULL;
g_free(multifd_send_state);
@@ -853,6 +859,7 @@ bool multifd_send_setup(void)
thread_count = migrate_multifd_channels();
multifd_send_state = g_malloc0(sizeof(*multifd_send_state));
multifd_send_state->params = g_new0(MultiFDSendParams, thread_count);
+ qemu_mutex_init(&multifd_send_state->multifd_send_mutex);
qemu_sem_init(&multifd_send_state->channels_created, 0);
qemu_sem_init(&multifd_send_state->channels_ready, 0);
qatomic_set(&multifd_send_state->exiting, 0);
^ permalink raw reply related [flat|nested] 140+ messages in thread
* Re: [PATCH v3 11/24] migration/multifd: Make multifd_send() thread safe
2024-11-17 19:20 ` [PATCH v3 11/24] migration/multifd: Make multifd_send() thread safe Maciej S. Szmigiero
@ 2024-12-05 16:17 ` Peter Xu
2024-12-06 21:12 ` Maciej S. Szmigiero
0 siblings, 1 reply; 140+ messages in thread
From: Peter Xu @ 2024-12-05 16:17 UTC (permalink / raw)
To: Maciej S. Szmigiero
Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On Sun, Nov 17, 2024 at 08:20:06PM +0100, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>
> The multifd_send() function is currently not thread safe; make it thread
> safe by holding a lock during its execution.
>
> This way it will be possible to safely call it concurrently from multiple
> threads.
>
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
One nitpick:
> ---
> migration/multifd.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/migration/multifd.c b/migration/multifd.c
> index 9578a985449b..4575495c8816 100644
> --- a/migration/multifd.c
> +++ b/migration/multifd.c
> @@ -50,6 +50,10 @@ typedef struct {
>
> struct {
> MultiFDSendParams *params;
> +
> + /* multifd_send() body is not thread safe, needs serialization */
> + QemuMutex multifd_send_mutex;
> +
> /*
> * Global number of generated multifd packets.
> *
> @@ -331,6 +335,7 @@ static void multifd_send_kick_main(MultiFDSendParams *p)
> */
> bool multifd_send(MultiFDSendData **send_data)
> {
> + QEMU_LOCK_GUARD(&multifd_send_state->multifd_send_mutex);
Better move this after the variable declarations to be clear.
Perhaps even after multifd_send_should_exit(), because reading that doesn't
need a lock - just in case something wants to quit but would otherwise get
stuck on the mutex.
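Something like this (untested sketch):

bool multifd_send(MultiFDSendData **send_data)
{
    int i;
    static int next_channel;
    MultiFDSendParams *p = NULL; /* make happy gcc */

    if (multifd_send_should_exit()) {
        return false;
    }

    /* Only grab the mutex once we know we aren't exiting */
    QEMU_LOCK_GUARD(&multifd_send_state->multifd_send_mutex);
    ...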
> int i;
> static int next_channel;
> MultiFDSendParams *p = NULL; /* make happy gcc */
> @@ -508,6 +513,7 @@ static void multifd_send_cleanup_state(void)
> socket_cleanup_outgoing_migration();
> qemu_sem_destroy(&multifd_send_state->channels_created);
> qemu_sem_destroy(&multifd_send_state->channels_ready);
> + qemu_mutex_destroy(&multifd_send_state->multifd_send_mutex);
> g_free(multifd_send_state->params);
> multifd_send_state->params = NULL;
> g_free(multifd_send_state);
> @@ -853,6 +859,7 @@ bool multifd_send_setup(void)
> thread_count = migrate_multifd_channels();
> multifd_send_state = g_malloc0(sizeof(*multifd_send_state));
> multifd_send_state->params = g_new0(MultiFDSendParams, thread_count);
> + qemu_mutex_init(&multifd_send_state->multifd_send_mutex);
> qemu_sem_init(&multifd_send_state->channels_created, 0);
> qemu_sem_init(&multifd_send_state->channels_ready, 0);
> qatomic_set(&multifd_send_state->exiting, 0);
>
--
Peter Xu
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 11/24] migration/multifd: Make multifd_send() thread safe
2024-12-05 16:17 ` Peter Xu
@ 2024-12-06 21:12 ` Maciej S. Szmigiero
0 siblings, 0 replies; 140+ messages in thread
From: Maciej S. Szmigiero @ 2024-12-06 21:12 UTC (permalink / raw)
To: Peter Xu
Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P. Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On 5.12.2024 17:17, Peter Xu wrote:
> On Sun, Nov 17, 2024 at 08:20:06PM +0100, Maciej S. Szmigiero wrote:
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> The multifd_send() function is currently not thread safe; make it thread
>> safe by holding a lock during its execution.
>>
>> This way it will be possible to safely call it concurrently from multiple
>> threads.
>>
>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>
> Reviewed-by: Peter Xu <peterx@redhat.com>
>
> One nitpick:
>
>> ---
>> migration/multifd.c | 7 +++++++
>> 1 file changed, 7 insertions(+)
>>
>> diff --git a/migration/multifd.c b/migration/multifd.c
>> index 9578a985449b..4575495c8816 100644
>> --- a/migration/multifd.c
>> +++ b/migration/multifd.c
>> @@ -50,6 +50,10 @@ typedef struct {
>>
>> struct {
>> MultiFDSendParams *params;
>> +
>> + /* multifd_send() body is not thread safe, needs serialization */
>> + QemuMutex multifd_send_mutex;
>> +
>> /*
>> * Global number of generated multifd packets.
>> *
>> @@ -331,6 +335,7 @@ static void multifd_send_kick_main(MultiFDSendParams *p)
>> */
>> bool multifd_send(MultiFDSendData **send_data)
>> {
>> + QEMU_LOCK_GUARD(&multifd_send_state->multifd_send_mutex);
>
> Better move this after the variable declarations to be clear.
>
> Perhaps even after multifd_send_should_exit(), because reading that doesn't
> need a lock - just in case something wants to quit but would otherwise get
> stuck on the mutex.
>
Will do.
Thanks,
Maciej
^ permalink raw reply [flat|nested] 140+ messages in thread
* [PATCH v3 12/24] migration/multifd: Add an explicit MultiFDSendData destructor
2024-11-17 19:19 [PATCH v3 00/24] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
` (10 preceding siblings ...)
2024-11-17 19:20 ` [PATCH v3 11/24] migration/multifd: Make multifd_send() thread safe Maciej S. Szmigiero
@ 2024-11-17 19:20 ` Maciej S. Szmigiero
2024-12-05 16:23 ` Peter Xu
2024-11-17 19:20 ` [PATCH v3 13/24] migration/multifd: Device state transfer support - send side Maciej S. Szmigiero
` (13 subsequent siblings)
25 siblings, 1 reply; 140+ messages in thread
From: Maciej S. Szmigiero @ 2024-11-17 19:20 UTC (permalink / raw)
To: Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
This way, if there are fields there that need explicit disposal (like, for
example, some attached buffers) they will be handled appropriately.
Add a related assert to multifd_set_payload_type() in order to make sure
that this function is only used to fill a previously empty MultiFDSendData
with some payload, not the other way around.
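For clarity, the intended payload lifecycle then becomes (an illustrative
sketch, not part of the patch):

MultiFDSendData *data = multifd_send_data_alloc();

assert(multifd_payload_empty(data));
multifd_set_payload_type(data, MULTIFD_PAYLOAD_RAM);
/* ... fill data->u.ram and hand it over to a channel ... */
multifd_send_data_clear(data); /* dispose of payload, type back to NONE */

multifd_send_data_free(data); /* clears any leftover payload, then frees */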
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
migration/multifd-nocomp.c | 3 +--
migration/multifd.c | 31 ++++++++++++++++++++++++++++---
migration/multifd.h | 5 +++++
3 files changed, 34 insertions(+), 5 deletions(-)
diff --git a/migration/multifd-nocomp.c b/migration/multifd-nocomp.c
index 55191152f9cb..fa0fd0289eca 100644
--- a/migration/multifd-nocomp.c
+++ b/migration/multifd-nocomp.c
@@ -41,8 +41,7 @@ void multifd_ram_save_setup(void)
void multifd_ram_save_cleanup(void)
{
- g_free(multifd_ram_send);
- multifd_ram_send = NULL;
+ g_clear_pointer(&multifd_ram_send, multifd_send_data_free);
}
static void multifd_set_file_bitmap(MultiFDSendParams *p)
diff --git a/migration/multifd.c b/migration/multifd.c
index 4575495c8816..730acf55cfad 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -123,6 +123,32 @@ MultiFDSendData *multifd_send_data_alloc(void)
return g_malloc0(size_minus_payload + max_payload_size);
}
+void multifd_send_data_clear(MultiFDSendData *data)
+{
+ if (multifd_payload_empty(data)) {
+ return;
+ }
+
+ switch (data->type) {
+ default:
+ /* Nothing to do */
+ break;
+ }
+
+ data->type = MULTIFD_PAYLOAD_NONE;
+}
+
+void multifd_send_data_free(MultiFDSendData *data)
+{
+ if (!data) {
+ return;
+ }
+
+ multifd_send_data_clear(data);
+
+ g_free(data);
+}
+
static bool multifd_use_packets(void)
{
return !migrate_mapped_ram();
@@ -496,8 +522,7 @@ static bool multifd_send_cleanup_channel(MultiFDSendParams *p, Error **errp)
qemu_sem_destroy(&p->sem_sync);
g_free(p->name);
p->name = NULL;
- g_free(p->data);
- p->data = NULL;
+ g_clear_pointer(&p->data, multifd_send_data_free);
p->packet_len = 0;
g_free(p->packet);
p->packet = NULL;
@@ -663,7 +688,7 @@ static void *multifd_send_thread(void *opaque)
(uint64_t)p->next_packet_size + p->packet_len);
p->next_packet_size = 0;
- multifd_set_payload_type(p->data, MULTIFD_PAYLOAD_NONE);
+ multifd_send_data_clear(p->data);
/*
* Making sure p->data is published before saying "we're
diff --git a/migration/multifd.h b/migration/multifd.h
index 026b653057e2..d2f1d0d74da7 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -140,6 +140,9 @@ static inline bool multifd_payload_empty(MultiFDSendData *data)
static inline void multifd_set_payload_type(MultiFDSendData *data,
MultiFDPayloadType type)
{
+ assert(multifd_payload_empty(data));
+ assert(type != MULTIFD_PAYLOAD_NONE);
+
data->type = type;
}
@@ -353,6 +356,8 @@ static inline void multifd_send_prepare_header(MultiFDSendParams *p)
void multifd_channel_connect(MultiFDSendParams *p, QIOChannel *ioc);
bool multifd_send(MultiFDSendData **send_data);
MultiFDSendData *multifd_send_data_alloc(void);
+void multifd_send_data_clear(MultiFDSendData *data);
+void multifd_send_data_free(MultiFDSendData *data);
static inline uint32_t multifd_ram_page_size(void)
{
^ permalink raw reply related [flat|nested] 140+ messages in thread
* Re: [PATCH v3 12/24] migration/multifd: Add an explicit MultiFDSendData destructor
2024-11-17 19:20 ` [PATCH v3 12/24] migration/multifd: Add an explicit MultiFDSendData destructor Maciej S. Szmigiero
@ 2024-12-05 16:23 ` Peter Xu
0 siblings, 0 replies; 140+ messages in thread
From: Peter Xu @ 2024-12-05 16:23 UTC (permalink / raw)
To: Maciej S. Szmigiero
Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On Sun, Nov 17, 2024 at 08:20:07PM +0100, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>
> This way, if there are fields there that need explicit disposal (like, for
> example, some attached buffers) they will be handled appropriately.
>
> Add a related assert to multifd_set_payload_type() in order to make sure
> that this function is only used to fill a previously empty MultiFDSendData
> with some payload, not the other way around.
>
> Reviewed-by: Fabiano Rosas <farosas@suse.de>
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
--
Peter Xu
^ permalink raw reply [flat|nested] 140+ messages in thread
* [PATCH v3 13/24] migration/multifd: Device state transfer support - send side
2024-11-17 19:19 [PATCH v3 00/24] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
` (11 preceding siblings ...)
2024-11-17 19:20 ` [PATCH v3 12/24] migration/multifd: Add an explicit MultiFDSendData destructor Maciej S. Szmigiero
@ 2024-11-17 19:20 ` Maciej S. Szmigiero
2024-11-26 19:58 ` Fabiano Rosas
2024-11-17 19:20 ` [PATCH v3 14/24] migration/multifd: Make MultiFDSendData a struct Maciej S. Szmigiero
` (12 subsequent siblings)
25 siblings, 1 reply; 140+ messages in thread
From: Maciej S. Szmigiero @ 2024-11-17 19:20 UTC (permalink / raw)
To: Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
A new function multifd_queue_device_state() is provided for a device to
queue its state for transmission via a multifd channel.
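Expected usage from a device's saving code is roughly as follows (a sketch
only - read_state_chunk() is a made-up placeholder and error handling is
trimmed):

g_autofree char *buf = NULL;
size_t len;

len = read_state_chunk(dev, &buf); /* placeholder */
if (!multifd_queue_device_state(idstr, instance_id, buf, len)) {
    return -1;
}
/* buf can be freed right away - multifd_queue_device_state() copies it */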
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
include/migration/misc.h | 4 ++
migration/meson.build | 1 +
migration/multifd-device-state.c | 106 +++++++++++++++++++++++++++++++
migration/multifd-nocomp.c | 11 +++-
migration/multifd.c | 43 +++++++++++--
migration/multifd.h | 24 ++++---
6 files changed, 173 insertions(+), 16 deletions(-)
create mode 100644 migration/multifd-device-state.c
diff --git a/include/migration/misc.h b/include/migration/misc.h
index c92ca018ab3b..118e205bbcc6 100644
--- a/include/migration/misc.h
+++ b/include/migration/misc.h
@@ -109,4 +109,8 @@ bool migration_incoming_postcopy_advised(void);
/* True if background snapshot is active */
bool migration_in_bg_snapshot(void);
+/* migration/multifd-device-state.c */
+bool multifd_queue_device_state(char *idstr, uint32_t instance_id,
+ char *data, size_t len);
+
#endif
diff --git a/migration/meson.build b/migration/meson.build
index d53cf3417ab8..9788c47bb56e 100644
--- a/migration/meson.build
+++ b/migration/meson.build
@@ -22,6 +22,7 @@ system_ss.add(files(
'migration-hmp-cmds.c',
'migration.c',
'multifd.c',
+ 'multifd-device-state.c',
'multifd-nocomp.c',
'multifd-zlib.c',
'multifd-zero-page.c',
diff --git a/migration/multifd-device-state.c b/migration/multifd-device-state.c
new file mode 100644
index 000000000000..7741a64fbd4d
--- /dev/null
+++ b/migration/multifd-device-state.c
@@ -0,0 +1,106 @@
+/*
+ * Multifd device state migration
+ *
+ * Copyright (C) 2024 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/lockable.h"
+#include "migration/misc.h"
+#include "multifd.h"
+
+static QemuMutex queue_job_mutex;
+
+static MultiFDSendData *device_state_send;
+
+size_t multifd_device_state_payload_size(void)
+{
+ return sizeof(MultiFDDeviceState_t);
+}
+
+void multifd_device_state_send_setup(void)
+{
+ qemu_mutex_init(&queue_job_mutex);
+
+ device_state_send = multifd_send_data_alloc();
+}
+
+void multifd_device_state_clear(MultiFDDeviceState_t *device_state)
+{
+ g_clear_pointer(&device_state->idstr, g_free);
+ g_clear_pointer(&device_state->buf, g_free);
+}
+
+void multifd_device_state_send_cleanup(void)
+{
+ g_clear_pointer(&device_state_send, multifd_send_data_free);
+
+ qemu_mutex_destroy(&queue_job_mutex);
+}
+
+static void multifd_device_state_fill_packet(MultiFDSendParams *p)
+{
+ MultiFDDeviceState_t *device_state = &p->data->u.device_state;
+ MultiFDPacketDeviceState_t *packet = p->packet_device_state;
+
+ packet->hdr.flags = cpu_to_be32(p->flags);
+ strncpy(packet->idstr, device_state->idstr, sizeof(packet->idstr));
+ packet->instance_id = cpu_to_be32(device_state->instance_id);
+ packet->next_packet_size = cpu_to_be32(p->next_packet_size);
+}
+
+static void multifd_prepare_header_device_state(MultiFDSendParams *p)
+{
+ p->iov[0].iov_len = sizeof(*p->packet_device_state);
+ p->iov[0].iov_base = p->packet_device_state;
+ p->iovs_num++;
+}
+
+void multifd_device_state_send_prepare(MultiFDSendParams *p)
+{
+ MultiFDDeviceState_t *device_state = &p->data->u.device_state;
+
+ assert(multifd_payload_device_state(p->data));
+
+ multifd_prepare_header_device_state(p);
+
+ assert(!(p->flags & MULTIFD_FLAG_SYNC));
+
+ p->next_packet_size = device_state->buf_len;
+ if (p->next_packet_size > 0) {
+ p->iov[p->iovs_num].iov_base = device_state->buf;
+ p->iov[p->iovs_num].iov_len = p->next_packet_size;
+ p->iovs_num++;
+ }
+
+ p->flags |= MULTIFD_FLAG_NOCOMP | MULTIFD_FLAG_DEVICE_STATE;
+
+ multifd_device_state_fill_packet(p);
+}
+
+bool multifd_queue_device_state(char *idstr, uint32_t instance_id,
+ char *data, size_t len)
+{
+ /* Device state submissions can come from multiple threads */
+ QEMU_LOCK_GUARD(&queue_job_mutex);
+ MultiFDDeviceState_t *device_state;
+
+ assert(multifd_payload_empty(device_state_send));
+
+ multifd_set_payload_type(device_state_send, MULTIFD_PAYLOAD_DEVICE_STATE);
+ device_state = &device_state_send->u.device_state;
+ device_state->idstr = g_strdup(idstr);
+ device_state->instance_id = instance_id;
+ device_state->buf = g_memdup2(data, len);
+ device_state->buf_len = len;
+
+ if (!multifd_send(&device_state_send)) {
+ multifd_send_data_clear(device_state_send);
+ return false;
+ }
+
+ return true;
+}
diff --git a/migration/multifd-nocomp.c b/migration/multifd-nocomp.c
index fa0fd0289eca..23564ce9aea9 100644
--- a/migration/multifd-nocomp.c
+++ b/migration/multifd-nocomp.c
@@ -84,6 +84,13 @@ static void multifd_nocomp_send_cleanup(MultiFDSendParams *p, Error **errp)
return;
}
+static void multifd_ram_prepare_header(MultiFDSendParams *p)
+{
+ p->iov[0].iov_len = p->packet_len;
+ p->iov[0].iov_base = p->packet;
+ p->iovs_num++;
+}
+
static void multifd_send_prepare_iovs(MultiFDSendParams *p)
{
MultiFDPages_t *pages = &p->data->u.ram;
@@ -117,7 +124,7 @@ static int multifd_nocomp_send_prepare(MultiFDSendParams *p, Error **errp)
* Only !zerocopy needs the header in IOV; zerocopy will
* send it separately.
*/
- multifd_send_prepare_header(p);
+ multifd_ram_prepare_header(p);
}
multifd_send_prepare_iovs(p);
@@ -368,7 +375,7 @@ bool multifd_send_prepare_common(MultiFDSendParams *p)
return false;
}
- multifd_send_prepare_header(p);
+ multifd_ram_prepare_header(p);
return true;
}
diff --git a/migration/multifd.c b/migration/multifd.c
index 730acf55cfad..56419af417cc 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -12,6 +12,7 @@
#include "qemu/osdep.h"
#include "qemu/cutils.h"
+#include "qemu/iov.h"
#include "qemu/rcu.h"
#include "exec/target_page.h"
#include "sysemu/sysemu.h"
@@ -19,6 +20,7 @@
#include "qemu/error-report.h"
#include "qapi/error.h"
#include "file.h"
+#include "migration/misc.h"
#include "migration.h"
#include "migration-stats.h"
#include "savevm.h"
@@ -111,7 +113,9 @@ MultiFDSendData *multifd_send_data_alloc(void)
* added to the union in the future are larger than
* (MultiFDPages_t + flex array).
*/
- max_payload_size = MAX(multifd_ram_payload_size(), sizeof(MultiFDPayload));
+ max_payload_size = MAX(multifd_ram_payload_size(),
+ multifd_device_state_payload_size());
+ max_payload_size = MAX(max_payload_size, sizeof(MultiFDPayload));
/*
* Account for any holes the compiler might insert. We can't pack
@@ -130,6 +134,9 @@ void multifd_send_data_clear(MultiFDSendData *data)
}
switch (data->type) {
+ case MULTIFD_PAYLOAD_DEVICE_STATE:
+ multifd_device_state_clear(&data->u.device_state);
+ break;
default:
/* Nothing to do */
break;
@@ -232,6 +239,7 @@ static int multifd_recv_initial_packet(QIOChannel *c, Error **errp)
return msg.id;
}
+/* Fills a RAM multifd packet */
void multifd_send_fill_packet(MultiFDSendParams *p)
{
MultiFDPacket_t *packet = p->packet;
@@ -524,6 +532,7 @@ static bool multifd_send_cleanup_channel(MultiFDSendParams *p, Error **errp)
p->name = NULL;
g_clear_pointer(&p->data, multifd_send_data_free);
p->packet_len = 0;
+ g_clear_pointer(&p->packet_device_state, g_free);
g_free(p->packet);
p->packet = NULL;
multifd_send_state->ops->send_cleanup(p, errp);
@@ -536,6 +545,7 @@ static void multifd_send_cleanup_state(void)
{
file_cleanup_outgoing_migration();
socket_cleanup_outgoing_migration();
+ multifd_device_state_send_cleanup();
qemu_sem_destroy(&multifd_send_state->channels_created);
qemu_sem_destroy(&multifd_send_state->channels_ready);
qemu_mutex_destroy(&multifd_send_state->multifd_send_mutex);
@@ -662,16 +672,33 @@ static void *multifd_send_thread(void *opaque)
* qatomic_store_release() in multifd_send().
*/
if (qatomic_load_acquire(&p->pending_job)) {
+ bool is_device_state = multifd_payload_device_state(p->data);
+ size_t total_size;
+
p->flags = 0;
p->iovs_num = 0;
assert(!multifd_payload_empty(p->data));
- ret = multifd_send_state->ops->send_prepare(p, &local_err);
- if (ret != 0) {
- break;
+ if (is_device_state) {
+ multifd_device_state_send_prepare(p);
+
+ total_size = iov_size(p->iov, p->iovs_num);
+ } else {
+ ret = multifd_send_state->ops->send_prepare(p, &local_err);
+ if (ret != 0) {
+ break;
+ }
+
+ /*
+ * Can't just always measure IOVs since these do not include
+ * packet header in the zerocopy RAM case.
+ */
+ total_size = (uint64_t)p->next_packet_size + p->packet_len;
}
if (migrate_mapped_ram()) {
+ assert(!is_device_state);
+
ret = file_write_ramblock_iov(p->c, p->iov, p->iovs_num,
&p->data->u.ram, &local_err);
} else {
@@ -684,8 +711,7 @@ static void *multifd_send_thread(void *opaque)
break;
}
- stat64_add(&mig_stats.multifd_bytes,
- (uint64_t)p->next_packet_size + p->packet_len);
+ stat64_add(&mig_stats.multifd_bytes, total_size);
p->next_packet_size = 0;
multifd_send_data_clear(p->data);
@@ -903,6 +929,9 @@ bool multifd_send_setup(void)
p->packet_len = sizeof(MultiFDPacket_t)
+ sizeof(uint64_t) * page_count;
p->packet = g_malloc0(p->packet_len);
+ p->packet_device_state = g_malloc0(sizeof(*p->packet_device_state));
+ p->packet_device_state->hdr.magic = cpu_to_be32(MULTIFD_MAGIC);
+ p->packet_device_state->hdr.version = cpu_to_be32(MULTIFD_VERSION);
}
p->name = g_strdup_printf(MIGRATION_THREAD_SRC_MULTIFD, i);
p->write_flags = 0;
@@ -938,6 +967,8 @@ bool multifd_send_setup(void)
assert(p->iov);
}
+ multifd_device_state_send_setup();
+
return true;
err:
diff --git a/migration/multifd.h b/migration/multifd.h
index d2f1d0d74da7..dec7d9404434 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -121,10 +121,12 @@ typedef struct {
typedef enum {
MULTIFD_PAYLOAD_NONE,
MULTIFD_PAYLOAD_RAM,
+ MULTIFD_PAYLOAD_DEVICE_STATE,
} MultiFDPayloadType;
typedef union MultiFDPayload {
MultiFDPages_t ram;
+ MultiFDDeviceState_t device_state;
} MultiFDPayload;
struct MultiFDSendData {
@@ -137,6 +139,11 @@ static inline bool multifd_payload_empty(MultiFDSendData *data)
return data->type == MULTIFD_PAYLOAD_NONE;
}
+static inline bool multifd_payload_device_state(MultiFDSendData *data)
+{
+ return data->type == MULTIFD_PAYLOAD_DEVICE_STATE;
+}
+
static inline void multifd_set_payload_type(MultiFDSendData *data,
MultiFDPayloadType type)
{
@@ -188,8 +195,9 @@ typedef struct {
/* thread local variables. No locking required */
- /* pointer to the packet */
+ /* pointers to the possible packet types */
MultiFDPacket_t *packet;
+ MultiFDPacketDeviceState_t *packet_device_state;
/* size of the next packet that contains pages */
uint32_t next_packet_size;
/* packets sent through this channel */
@@ -346,13 +354,6 @@ bool multifd_send_prepare_common(MultiFDSendParams *p);
void multifd_send_zero_page_detect(MultiFDSendParams *p);
void multifd_recv_zero_page_process(MultiFDRecvParams *p);
-static inline void multifd_send_prepare_header(MultiFDSendParams *p)
-{
- p->iov[0].iov_len = p->packet_len;
- p->iov[0].iov_base = p->packet;
- p->iovs_num++;
-}
-
void multifd_channel_connect(MultiFDSendParams *p, QIOChannel *ioc);
bool multifd_send(MultiFDSendData **send_data);
MultiFDSendData *multifd_send_data_alloc(void);
@@ -375,4 +376,11 @@ int multifd_ram_flush_and_sync(void);
size_t multifd_ram_payload_size(void);
void multifd_ram_fill_packet(MultiFDSendParams *p);
int multifd_ram_unfill_packet(MultiFDRecvParams *p, Error **errp);
+
+size_t multifd_device_state_payload_size(void);
+void multifd_device_state_send_setup(void);
+void multifd_device_state_clear(MultiFDDeviceState_t *device_state);
+void multifd_device_state_send_cleanup(void);
+void multifd_device_state_send_prepare(MultiFDSendParams *p);
+
#endif
^ permalink raw reply related [flat|nested] 140+ messages in thread
* Re: [PATCH v3 13/24] migration/multifd: Device state transfer support - send side
2024-11-17 19:20 ` [PATCH v3 13/24] migration/multifd: Device state transfer support - send side Maciej S. Szmigiero
@ 2024-11-26 19:58 ` Fabiano Rosas
2024-11-26 21:22 ` Maciej S. Szmigiero
0 siblings, 1 reply; 140+ messages in thread
From: Fabiano Rosas @ 2024-11-26 19:58 UTC (permalink / raw)
To: Maciej S. Szmigiero, Peter Xu
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
"Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>
> A new function multifd_queue_device_state() is provided for a device to
> queue its state for transmission via a multifd channel.
>
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> ---
> include/migration/misc.h | 4 ++
> migration/meson.build | 1 +
> migration/multifd-device-state.c | 106 +++++++++++++++++++++++++++++++
> migration/multifd-nocomp.c | 11 +++-
> migration/multifd.c | 43 +++++++++++--
> migration/multifd.h | 24 ++++---
> 6 files changed, 173 insertions(+), 16 deletions(-)
> create mode 100644 migration/multifd-device-state.c
>
> diff --git a/include/migration/misc.h b/include/migration/misc.h
> index c92ca018ab3b..118e205bbcc6 100644
> --- a/include/migration/misc.h
> +++ b/include/migration/misc.h
> @@ -109,4 +109,8 @@ bool migration_incoming_postcopy_advised(void);
> /* True if background snapshot is active */
> bool migration_in_bg_snapshot(void);
>
> +/* migration/multifd-device-state.c */
> +bool multifd_queue_device_state(char *idstr, uint32_t instance_id,
> + char *data, size_t len);
> +
> #endif
> diff --git a/migration/meson.build b/migration/meson.build
> index d53cf3417ab8..9788c47bb56e 100644
> --- a/migration/meson.build
> +++ b/migration/meson.build
> @@ -22,6 +22,7 @@ system_ss.add(files(
> 'migration-hmp-cmds.c',
> 'migration.c',
> 'multifd.c',
> + 'multifd-device-state.c',
> 'multifd-nocomp.c',
> 'multifd-zlib.c',
> 'multifd-zero-page.c',
> diff --git a/migration/multifd-device-state.c b/migration/multifd-device-state.c
> new file mode 100644
> index 000000000000..7741a64fbd4d
> --- /dev/null
> +++ b/migration/multifd-device-state.c
> @@ -0,0 +1,106 @@
> +/*
> + * Multifd device state migration
> + *
> + * Copyright (C) 2024 Oracle and/or its affiliates.
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + */
> +
> +#include "qemu/osdep.h"
> +#include "qemu/lockable.h"
> +#include "migration/misc.h"
> +#include "multifd.h"
> +
> +static QemuMutex queue_job_mutex;
> +
> +static MultiFDSendData *device_state_send;
> +
> +size_t multifd_device_state_payload_size(void)
> +{
> + return sizeof(MultiFDDeviceState_t);
> +}
> +
> +void multifd_device_state_send_setup(void)
> +{
> + qemu_mutex_init(&queue_job_mutex);
> +
> + device_state_send = multifd_send_data_alloc();
> +}
> +
> +void multifd_device_state_clear(MultiFDDeviceState_t *device_state)
> +{
> + g_clear_pointer(&device_state->idstr, g_free);
> + g_clear_pointer(&device_state->buf, g_free);
> +}
> +
> +void multifd_device_state_send_cleanup(void)
> +{
> + g_clear_pointer(&device_state_send, multifd_send_data_free);
> +
> + qemu_mutex_destroy(&queue_job_mutex);
> +}
> +
> +static void multifd_device_state_fill_packet(MultiFDSendParams *p)
> +{
> + MultiFDDeviceState_t *device_state = &p->data->u.device_state;
> + MultiFDPacketDeviceState_t *packet = p->packet_device_state;
> +
> + packet->hdr.flags = cpu_to_be32(p->flags);
> + strncpy(packet->idstr, device_state->idstr, sizeof(packet->idstr));
> + packet->instance_id = cpu_to_be32(device_state->instance_id);
> + packet->next_packet_size = cpu_to_be32(p->next_packet_size);
> +}
> +
> +static void multifd_prepare_header_device_state(MultiFDSendParams *p)
> +{
> + p->iov[0].iov_len = sizeof(*p->packet_device_state);
> + p->iov[0].iov_base = p->packet_device_state;
> + p->iovs_num++;
> +}
> +
> +void multifd_device_state_send_prepare(MultiFDSendParams *p)
> +{
> + MultiFDDeviceState_t *device_state = &p->data->u.device_state;
> +
> + assert(multifd_payload_device_state(p->data));
> +
> + multifd_prepare_header_device_state(p);
> +
> + assert(!(p->flags & MULTIFD_FLAG_SYNC));
> +
> + p->next_packet_size = device_state->buf_len;
> + if (p->next_packet_size > 0) {
> + p->iov[p->iovs_num].iov_base = device_state->buf;
> + p->iov[p->iovs_num].iov_len = p->next_packet_size;
> + p->iovs_num++;
> + }
> +
> + p->flags |= MULTIFD_FLAG_NOCOMP | MULTIFD_FLAG_DEVICE_STATE;
> +
> + multifd_device_state_fill_packet(p);
> +}
> +
> +bool multifd_queue_device_state(char *idstr, uint32_t instance_id,
> + char *data, size_t len)
> +{
> + /* Device state submissions can come from multiple threads */
> + QEMU_LOCK_GUARD(&queue_job_mutex);
> + MultiFDDeviceState_t *device_state;
> +
> + assert(multifd_payload_empty(device_state_send));
> +
> + multifd_set_payload_type(device_state_send, MULTIFD_PAYLOAD_DEVICE_STATE);
> + device_state = &device_state_send->u.device_state;
> + device_state->idstr = g_strdup(idstr);
> + device_state->instance_id = instance_id;
> + device_state->buf = g_memdup2(data, len);
> + device_state->buf_len = len;
> +
> + if (!multifd_send(&device_state_send)) {
> + multifd_send_data_clear(device_state_send);
> + return false;
> + }
> +
> + return true;
> +}
> diff --git a/migration/multifd-nocomp.c b/migration/multifd-nocomp.c
> index fa0fd0289eca..23564ce9aea9 100644
> --- a/migration/multifd-nocomp.c
> +++ b/migration/multifd-nocomp.c
> @@ -84,6 +84,13 @@ static void multifd_nocomp_send_cleanup(MultiFDSendParams *p, Error **errp)
> return;
> }
>
> +static void multifd_ram_prepare_header(MultiFDSendParams *p)
> +{
> + p->iov[0].iov_len = p->packet_len;
> + p->iov[0].iov_base = p->packet;
> + p->iovs_num++;
> +}
> +
> static void multifd_send_prepare_iovs(MultiFDSendParams *p)
> {
> MultiFDPages_t *pages = &p->data->u.ram;
> @@ -117,7 +124,7 @@ static int multifd_nocomp_send_prepare(MultiFDSendParams *p, Error **errp)
> * Only !zerocopy needs the header in IOV; zerocopy will
> * send it separately.
> */
> - multifd_send_prepare_header(p);
> + multifd_ram_prepare_header(p);
> }
>
> multifd_send_prepare_iovs(p);
> @@ -368,7 +375,7 @@ bool multifd_send_prepare_common(MultiFDSendParams *p)
> return false;
> }
>
> - multifd_send_prepare_header(p);
> + multifd_ram_prepare_header(p);
>
> return true;
> }
> diff --git a/migration/multifd.c b/migration/multifd.c
> index 730acf55cfad..56419af417cc 100644
> --- a/migration/multifd.c
> +++ b/migration/multifd.c
> @@ -12,6 +12,7 @@
>
> #include "qemu/osdep.h"
> #include "qemu/cutils.h"
> +#include "qemu/iov.h"
> #include "qemu/rcu.h"
> #include "exec/target_page.h"
> #include "sysemu/sysemu.h"
> @@ -19,6 +20,7 @@
> #include "qemu/error-report.h"
> #include "qapi/error.h"
> #include "file.h"
> +#include "migration/misc.h"
> #include "migration.h"
> #include "migration-stats.h"
> #include "savevm.h"
> @@ -111,7 +113,9 @@ MultiFDSendData *multifd_send_data_alloc(void)
> * added to the union in the future are larger than
> * (MultiFDPages_t + flex array).
> */
> - max_payload_size = MAX(multifd_ram_payload_size(), sizeof(MultiFDPayload));
> + max_payload_size = MAX(multifd_ram_payload_size(),
> + multifd_device_state_payload_size());
> + max_payload_size = MAX(max_payload_size, sizeof(MultiFDPayload));
>
> /*
> * Account for any holes the compiler might insert. We can't pack
> @@ -130,6 +134,9 @@ void multifd_send_data_clear(MultiFDSendData *data)
> }
>
> switch (data->type) {
> + case MULTIFD_PAYLOAD_DEVICE_STATE:
> + multifd_device_state_clear(&data->u.device_state);
> + break;
> default:
> /* Nothing to do */
> break;
> @@ -232,6 +239,7 @@ static int multifd_recv_initial_packet(QIOChannel *c, Error **errp)
> return msg.id;
> }
>
> +/* Fills a RAM multifd packet */
> void multifd_send_fill_packet(MultiFDSendParams *p)
> {
> MultiFDPacket_t *packet = p->packet;
> @@ -524,6 +532,7 @@ static bool multifd_send_cleanup_channel(MultiFDSendParams *p, Error **errp)
> p->name = NULL;
> g_clear_pointer(&p->data, multifd_send_data_free);
> p->packet_len = 0;
> + g_clear_pointer(&p->packet_device_state, g_free);
> g_free(p->packet);
> p->packet = NULL;
> multifd_send_state->ops->send_cleanup(p, errp);
> @@ -536,6 +545,7 @@ static void multifd_send_cleanup_state(void)
> {
> file_cleanup_outgoing_migration();
> socket_cleanup_outgoing_migration();
> + multifd_device_state_send_cleanup();
> qemu_sem_destroy(&multifd_send_state->channels_created);
> qemu_sem_destroy(&multifd_send_state->channels_ready);
> qemu_mutex_destroy(&multifd_send_state->multifd_send_mutex);
> @@ -662,16 +672,33 @@ static void *multifd_send_thread(void *opaque)
> * qatomic_store_release() in multifd_send().
> */
> if (qatomic_load_acquire(&p->pending_job)) {
> + bool is_device_state = multifd_payload_device_state(p->data);
> + size_t total_size;
> +
> p->flags = 0;
> p->iovs_num = 0;
> assert(!multifd_payload_empty(p->data));
>
> - ret = multifd_send_state->ops->send_prepare(p, &local_err);
> - if (ret != 0) {
> - break;
> + if (is_device_state) {
> + multifd_device_state_send_prepare(p);
> +
> + total_size = iov_size(p->iov, p->iovs_num);
This is such a good idea, because it allows us to kill
next_packet_size. Let's make it work.
What if you add packet_len to mig_stats under use_zero_copy at
multifd_nocomp_send_prepare? It's only fair since that's when the data
is actually sent. Then this total_size gets consolidated between the
paths.
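Something like this, I imagine (untested sketch):

/* in multifd_nocomp_send_prepare(), the zerocopy branch */
if (use_zero_copy_send) {
    /* Send header first, without zerocopy */
    ret = qio_channel_write_all(p->c, (void *)p->packet,
                                p->packet_len, errp);
    if (ret != 0) {
        return -1;
    }
    /* the header bytes go out right here, so account for them here */
    stat64_add(&mig_stats.multifd_bytes, p->packet_len);
}

Then the send thread could always do "total_size = iov_size(p->iov,
p->iovs_num)" regardless of the payload type.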
> + } else {
> + ret = multifd_send_state->ops->send_prepare(p, &local_err);
> + if (ret != 0) {
> + break;
> + }
> +
> + /*
> + * Can't just always measure IOVs since these do not include
> + * packet header in the zerocopy RAM case.
> + */
> + total_size = (uint64_t)p->next_packet_size + p->packet_len;
> }
>
> if (migrate_mapped_ram()) {
> + assert(!is_device_state);
> +
> ret = file_write_ramblock_iov(p->c, p->iov, p->iovs_num,
> &p->data->u.ram, &local_err);
> } else {
> @@ -684,8 +711,7 @@ static void *multifd_send_thread(void *opaque)
> break;
> }
>
> - stat64_add(&mig_stats.multifd_bytes,
> - (uint64_t)p->next_packet_size + p->packet_len);
> + stat64_add(&mig_stats.multifd_bytes, total_size);
>
> p->next_packet_size = 0;
> multifd_send_data_clear(p->data);
> @@ -903,6 +929,9 @@ bool multifd_send_setup(void)
> p->packet_len = sizeof(MultiFDPacket_t)
> + sizeof(uint64_t) * page_count;
> p->packet = g_malloc0(p->packet_len);
> + p->packet_device_state = g_malloc0(sizeof(*p->packet_device_state));
> + p->packet_device_state->hdr.magic = cpu_to_be32(MULTIFD_MAGIC);
> + p->packet_device_state->hdr.version = cpu_to_be32(MULTIFD_VERSION);
> }
> p->name = g_strdup_printf(MIGRATION_THREAD_SRC_MULTIFD, i);
> p->write_flags = 0;
> @@ -938,6 +967,8 @@ bool multifd_send_setup(void)
> assert(p->iov);
> }
>
> + multifd_device_state_send_setup();
> +
> return true;
>
> err:
> diff --git a/migration/multifd.h b/migration/multifd.h
> index d2f1d0d74da7..dec7d9404434 100644
> --- a/migration/multifd.h
> +++ b/migration/multifd.h
> @@ -121,10 +121,12 @@ typedef struct {
> typedef enum {
> MULTIFD_PAYLOAD_NONE,
> MULTIFD_PAYLOAD_RAM,
> + MULTIFD_PAYLOAD_DEVICE_STATE,
> } MultiFDPayloadType;
>
> typedef union MultiFDPayload {
> MultiFDPages_t ram;
> + MultiFDDeviceState_t device_state;
> } MultiFDPayload;
>
> struct MultiFDSendData {
> @@ -137,6 +139,11 @@ static inline bool multifd_payload_empty(MultiFDSendData *data)
> return data->type == MULTIFD_PAYLOAD_NONE;
> }
>
> +static inline bool multifd_payload_device_state(MultiFDSendData *data)
> +{
> + return data->type == MULTIFD_PAYLOAD_DEVICE_STATE;
> +}
> +
> static inline void multifd_set_payload_type(MultiFDSendData *data,
> MultiFDPayloadType type)
> {
> @@ -188,8 +195,9 @@ typedef struct {
>
> /* thread local variables. No locking required */
>
> - /* pointer to the packet */
> + /* pointers to the possible packet types */
> MultiFDPacket_t *packet;
> + MultiFDPacketDeviceState_t *packet_device_state;
> /* size of the next packet that contains pages */
> uint32_t next_packet_size;
> /* packets sent through this channel */
> @@ -346,13 +354,6 @@ bool multifd_send_prepare_common(MultiFDSendParams *p);
> void multifd_send_zero_page_detect(MultiFDSendParams *p);
> void multifd_recv_zero_page_process(MultiFDRecvParams *p);
>
> -static inline void multifd_send_prepare_header(MultiFDSendParams *p)
> -{
> - p->iov[0].iov_len = p->packet_len;
> - p->iov[0].iov_base = p->packet;
> - p->iovs_num++;
> -}
> -
> void multifd_channel_connect(MultiFDSendParams *p, QIOChannel *ioc);
> bool multifd_send(MultiFDSendData **send_data);
> MultiFDSendData *multifd_send_data_alloc(void);
> @@ -375,4 +376,11 @@ int multifd_ram_flush_and_sync(void);
> size_t multifd_ram_payload_size(void);
> void multifd_ram_fill_packet(MultiFDSendParams *p);
> int multifd_ram_unfill_packet(MultiFDRecvParams *p, Error **errp);
> +
> +size_t multifd_device_state_payload_size(void);
> +void multifd_device_state_send_setup(void);
> +void multifd_device_state_clear(MultiFDDeviceState_t *device_state);
> +void multifd_device_state_send_cleanup(void);
> +void multifd_device_state_send_prepare(MultiFDSendParams *p);
> +
> #endif
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 13/24] migration/multifd: Device state transfer support - send side
2024-11-26 19:58 ` Fabiano Rosas
@ 2024-11-26 21:22 ` Maciej S. Szmigiero
0 siblings, 0 replies; 140+ messages in thread
From: Maciej S. Szmigiero @ 2024-11-26 21:22 UTC (permalink / raw)
To: Fabiano Rosas, Peter Xu
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On 26.11.2024 20:58, Fabiano Rosas wrote:
> "Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:
>
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> A new function multifd_queue_device_state() is provided for a device to
>> queue its state for transmission via a multifd channel.
>>
>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>> ---
>> include/migration/misc.h | 4 ++
>> migration/meson.build | 1 +
>> migration/multifd-device-state.c | 106 +++++++++++++++++++++++++++++++
>> migration/multifd-nocomp.c | 11 +++-
>> migration/multifd.c | 43 +++++++++++--
>> migration/multifd.h | 24 ++++---
>> 6 files changed, 173 insertions(+), 16 deletions(-)
>> create mode 100644 migration/multifd-device-state.c
>>
>> diff --git a/include/migration/misc.h b/include/migration/misc.h
>> index c92ca018ab3b..118e205bbcc6 100644
>> --- a/include/migration/misc.h
>> +++ b/include/migration/misc.h
>> @@ -109,4 +109,8 @@ bool migration_incoming_postcopy_advised(void);
>> /* True if background snapshot is active */
>> bool migration_in_bg_snapshot(void);
>>
>> +/* migration/multifd-device-state.c */
>> +bool multifd_queue_device_state(char *idstr, uint32_t instance_id,
>> + char *data, size_t len);
>> +
>> #endif
>> diff --git a/migration/meson.build b/migration/meson.build
>> index d53cf3417ab8..9788c47bb56e 100644
>> --- a/migration/meson.build
>> +++ b/migration/meson.build
>> @@ -22,6 +22,7 @@ system_ss.add(files(
>> 'migration-hmp-cmds.c',
>> 'migration.c',
>> 'multifd.c',
>> + 'multifd-device-state.c',
>> 'multifd-nocomp.c',
>> 'multifd-zlib.c',
>> 'multifd-zero-page.c',
>> diff --git a/migration/multifd-device-state.c b/migration/multifd-device-state.c
>> new file mode 100644
>> index 000000000000..7741a64fbd4d
>> --- /dev/null
>> +++ b/migration/multifd-device-state.c
>> @@ -0,0 +1,106 @@
>> +/*
>> + * Multifd device state migration
>> + *
>> + * Copyright (C) 2024 Oracle and/or its affiliates.
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
>> + * See the COPYING file in the top-level directory.
>> + */
>> +
>> +#include "qemu/osdep.h"
>> +#include "qemu/lockable.h"
>> +#include "migration/misc.h"
>> +#include "multifd.h"
>> +
>> +static QemuMutex queue_job_mutex;
>> +
>> +static MultiFDSendData *device_state_send;
>> +
>> +size_t multifd_device_state_payload_size(void)
>> +{
>> + return sizeof(MultiFDDeviceState_t);
>> +}
>> +
>> +void multifd_device_state_send_setup(void)
>> +{
>> + qemu_mutex_init(&queue_job_mutex);
>> +
>> + device_state_send = multifd_send_data_alloc();
>> +}
>> +
>> +void multifd_device_state_clear(MultiFDDeviceState_t *device_state)
>> +{
>> + g_clear_pointer(&device_state->idstr, g_free);
>> + g_clear_pointer(&device_state->buf, g_free);
>> +}
>> +
>> +void multifd_device_state_send_cleanup(void)
>> +{
>> + g_clear_pointer(&device_state_send, multifd_send_data_free);
>> +
>> + qemu_mutex_destroy(&queue_job_mutex);
>> +}
>> +
>> +static void multifd_device_state_fill_packet(MultiFDSendParams *p)
>> +{
>> + MultiFDDeviceState_t *device_state = &p->data->u.device_state;
>> + MultiFDPacketDeviceState_t *packet = p->packet_device_state;
>> +
>> + packet->hdr.flags = cpu_to_be32(p->flags);
>> + strncpy(packet->idstr, device_state->idstr, sizeof(packet->idstr));
>> + packet->instance_id = cpu_to_be32(device_state->instance_id);
>> + packet->next_packet_size = cpu_to_be32(p->next_packet_size);
>> +}
>> +
>> +static void multifd_prepare_header_device_state(MultiFDSendParams *p)
>> +{
>> + p->iov[0].iov_len = sizeof(*p->packet_device_state);
>> + p->iov[0].iov_base = p->packet_device_state;
>> + p->iovs_num++;
>> +}
>> +
>> +void multifd_device_state_send_prepare(MultiFDSendParams *p)
>> +{
>> + MultiFDDeviceState_t *device_state = &p->data->u.device_state;
>> +
>> + assert(multifd_payload_device_state(p->data));
>> +
>> + multifd_prepare_header_device_state(p);
>> +
>> + assert(!(p->flags & MULTIFD_FLAG_SYNC));
>> +
>> + p->next_packet_size = device_state->buf_len;
>> + if (p->next_packet_size > 0) {
>> + p->iov[p->iovs_num].iov_base = device_state->buf;
>> + p->iov[p->iovs_num].iov_len = p->next_packet_size;
>> + p->iovs_num++;
>> + }
>> +
>> + p->flags |= MULTIFD_FLAG_NOCOMP | MULTIFD_FLAG_DEVICE_STATE;
>> +
>> + multifd_device_state_fill_packet(p);
>> +}
>> +
>> +bool multifd_queue_device_state(char *idstr, uint32_t instance_id,
>> + char *data, size_t len)
>> +{
>> + /* Device state submissions can come from multiple threads */
>> + QEMU_LOCK_GUARD(&queue_job_mutex);
>> + MultiFDDeviceState_t *device_state;
>> +
>> + assert(multifd_payload_empty(device_state_send));
>> +
>> + multifd_set_payload_type(device_state_send, MULTIFD_PAYLOAD_DEVICE_STATE);
>> + device_state = &device_state_send->u.device_state;
>> + device_state->idstr = g_strdup(idstr);
>> + device_state->instance_id = instance_id;
>> + device_state->buf = g_memdup2(data, len);
>> + device_state->buf_len = len;
>> +
>> + if (!multifd_send(&device_state_send)) {
>> + multifd_send_data_clear(device_state_send);
>> + return false;
>> + }
>> +
>> + return true;
>> +}
>> diff --git a/migration/multifd-nocomp.c b/migration/multifd-nocomp.c
>> index fa0fd0289eca..23564ce9aea9 100644
>> --- a/migration/multifd-nocomp.c
>> +++ b/migration/multifd-nocomp.c
>> @@ -84,6 +84,13 @@ static void multifd_nocomp_send_cleanup(MultiFDSendParams *p, Error **errp)
>> return;
>> }
>>
>> +static void multifd_ram_prepare_header(MultiFDSendParams *p)
>> +{
>> + p->iov[0].iov_len = p->packet_len;
>> + p->iov[0].iov_base = p->packet;
>> + p->iovs_num++;
>> +}
>> +
>> static void multifd_send_prepare_iovs(MultiFDSendParams *p)
>> {
>> MultiFDPages_t *pages = &p->data->u.ram;
>> @@ -117,7 +124,7 @@ static int multifd_nocomp_send_prepare(MultiFDSendParams *p, Error **errp)
>> * Only !zerocopy needs the header in IOV; zerocopy will
>> * send it separately.
>> */
>> - multifd_send_prepare_header(p);
>> + multifd_ram_prepare_header(p);
>> }
>>
>> multifd_send_prepare_iovs(p);
>> @@ -368,7 +375,7 @@ bool multifd_send_prepare_common(MultiFDSendParams *p)
>> return false;
>> }
>>
>> - multifd_send_prepare_header(p);
>> + multifd_ram_prepare_header(p);
>>
>> return true;
>> }
>> diff --git a/migration/multifd.c b/migration/multifd.c
>> index 730acf55cfad..56419af417cc 100644
>> --- a/migration/multifd.c
>> +++ b/migration/multifd.c
>> @@ -12,6 +12,7 @@
>>
>> #include "qemu/osdep.h"
>> #include "qemu/cutils.h"
>> +#include "qemu/iov.h"
>> #include "qemu/rcu.h"
>> #include "exec/target_page.h"
>> #include "sysemu/sysemu.h"
>> @@ -19,6 +20,7 @@
>> #include "qemu/error-report.h"
>> #include "qapi/error.h"
>> #include "file.h"
>> +#include "migration/misc.h"
>> #include "migration.h"
>> #include "migration-stats.h"
>> #include "savevm.h"
>> @@ -111,7 +113,9 @@ MultiFDSendData *multifd_send_data_alloc(void)
>> * added to the union in the future are larger than
>> * (MultiFDPages_t + flex array).
>> */
>> - max_payload_size = MAX(multifd_ram_payload_size(), sizeof(MultiFDPayload));
>> + max_payload_size = MAX(multifd_ram_payload_size(),
>> + multifd_device_state_payload_size());
>> + max_payload_size = MAX(max_payload_size, sizeof(MultiFDPayload));
>>
>> /*
>> * Account for any holes the compiler might insert. We can't pack
>> @@ -130,6 +134,9 @@ void multifd_send_data_clear(MultiFDSendData *data)
>> }
>>
>> switch (data->type) {
>> + case MULTIFD_PAYLOAD_DEVICE_STATE:
>> + multifd_device_state_clear(&data->u.device_state);
>> + break;
>> default:
>> /* Nothing to do */
>> break;
>> @@ -232,6 +239,7 @@ static int multifd_recv_initial_packet(QIOChannel *c, Error **errp)
>> return msg.id;
>> }
>>
>> +/* Fills a RAM multifd packet */
>> void multifd_send_fill_packet(MultiFDSendParams *p)
>> {
>> MultiFDPacket_t *packet = p->packet;
>> @@ -524,6 +532,7 @@ static bool multifd_send_cleanup_channel(MultiFDSendParams *p, Error **errp)
>> p->name = NULL;
>> g_clear_pointer(&p->data, multifd_send_data_free);
>> p->packet_len = 0;
>> + g_clear_pointer(&p->packet_device_state, g_free);
>> g_free(p->packet);
>> p->packet = NULL;
>> multifd_send_state->ops->send_cleanup(p, errp);
>> @@ -536,6 +545,7 @@ static void multifd_send_cleanup_state(void)
>> {
>> file_cleanup_outgoing_migration();
>> socket_cleanup_outgoing_migration();
>> + multifd_device_state_send_cleanup();
>> qemu_sem_destroy(&multifd_send_state->channels_created);
>> qemu_sem_destroy(&multifd_send_state->channels_ready);
>> qemu_mutex_destroy(&multifd_send_state->multifd_send_mutex);
>> @@ -662,16 +672,33 @@ static void *multifd_send_thread(void *opaque)
>> * qatomic_store_release() in multifd_send().
>> */
>> if (qatomic_load_acquire(&p->pending_job)) {
>> + bool is_device_state = multifd_payload_device_state(p->data);
>> + size_t total_size;
>> +
>> p->flags = 0;
>> p->iovs_num = 0;
>> assert(!multifd_payload_empty(p->data));
>>
>> - ret = multifd_send_state->ops->send_prepare(p, &local_err);
>> - if (ret != 0) {
>> - break;
>> + if (is_device_state) {
>> + multifd_device_state_send_prepare(p);
>> +
>> + total_size = iov_size(p->iov, p->iovs_num);
>
> This is such a good idea, because it allows us to kill
> next_packet_size. Let's make it work.
>
> What if you add packet_len to mig_stats under use_zero_copy at
> multifd_nocomp_send_prepare? It's only fair since that's when the data
> is actually sent. Then this total_size gets consolidated between the
> paths.
>
Adding the header to multifd_bytes where it is actually sent
(in multifd_nocomp_send_prepare() in this case) makes sense to me -
will change it so.
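Roughly like this, I think (just a sketch of the idea, not the final
code; stat64_add() and mig_stats are the existing migration counters):

    static int multifd_nocomp_send_prepare(MultiFDSendParams *p, Error **errp)
    {
        bool use_zero_copy_send = migrate_zero_copy_send();

        /* ... existing setup ... */

        if (use_zero_copy_send) {
            /*
             * Zero copy sends the header separately, so account for its
             * bytes here, where the data actually hits the wire.
             */
            stat64_add(&mig_stats.multifd_bytes, p->packet_len);
        } else {
            /* !zerocopy: the header travels as iov[0] */
            multifd_ram_prepare_header(p);
        }

        /* ... prepare iovs, fill the packet, etc. ... */
    }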
Thanks,
Maciej
^ permalink raw reply [flat|nested] 140+ messages in thread
* [PATCH v3 14/24] migration/multifd: Make MultiFDSendData a struct
2024-11-17 19:19 [PATCH v3 00/24] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
` (12 preceding siblings ...)
2024-11-17 19:20 ` [PATCH v3 13/24] migration/multifd: Device state transfer support - send side Maciej S. Szmigiero
@ 2024-11-17 19:20 ` Maciej S. Szmigiero
2024-11-17 19:20 ` [PATCH v3 15/24] migration/multifd: Add migration_has_device_state_support() Maciej S. Szmigiero
` (11 subsequent siblings)
25 siblings, 0 replies; 140+ messages in thread
From: Maciej S. Szmigiero @ 2024-11-17 19:20 UTC (permalink / raw)
To: Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
From: Peter Xu <peterx@redhat.com>
The newly introduced device state buffer can be used not only for storing
VFIO's read() raw data but also for storing generic device states. Since
device states may not easily provide a max buffer size (and the RAM
MultiFDPages_t also wants flexibility in managing its offset[] array), it
may not be a good idea to stick with a union on MultiFDSendData, as a
union won't play well with such flexibility.
Switch MultiFDSendData to a struct.
It won't consume much more space in reality: the real buffers were already
dynamically allocated, so only the two structs (pages, device_state) get
duplicated, and they're small.
With this, we can remove the rather hard to understand allocation size
logic, because now we can allocate offset[] together with the SendData and
properly free it when the SendData is freed.
Signed-off-by: Peter Xu <peterx@redhat.com>
[MSS: Make sure to clear possible device state payload before freeing
MultiFDSendData, remove placeholders for other patches not included]
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
migration/multifd-device-state.c | 5 -----
migration/multifd-nocomp.c | 13 ++++++-------
migration/multifd.c | 25 +++++++------------------
migration/multifd.h | 14 +++++++++-----
4 files changed, 22 insertions(+), 35 deletions(-)
diff --git a/migration/multifd-device-state.c b/migration/multifd-device-state.c
index 7741a64fbd4d..8cf5a6c2668c 100644
--- a/migration/multifd-device-state.c
+++ b/migration/multifd-device-state.c
@@ -16,11 +16,6 @@ static QemuMutex queue_job_mutex;
static MultiFDSendData *device_state_send;
-size_t multifd_device_state_payload_size(void)
-{
- return sizeof(MultiFDDeviceState_t);
-}
-
void multifd_device_state_send_setup(void)
{
qemu_mutex_init(&queue_job_mutex);
diff --git a/migration/multifd-nocomp.c b/migration/multifd-nocomp.c
index 23564ce9aea9..90c0927b9bcb 100644
--- a/migration/multifd-nocomp.c
+++ b/migration/multifd-nocomp.c
@@ -23,15 +23,14 @@
static MultiFDSendData *multifd_ram_send;
-size_t multifd_ram_payload_size(void)
+void multifd_ram_payload_alloc(MultiFDPages_t *pages)
{
- uint32_t n = multifd_ram_page_count();
+ pages->offset = g_new0(ram_addr_t, multifd_ram_page_count());
+}
- /*
- * We keep an array of page offsets at the end of MultiFDPages_t,
- * add space for it in the allocation.
- */
- return sizeof(MultiFDPages_t) + n * sizeof(ram_addr_t);
+void multifd_ram_payload_free(MultiFDPages_t *pages)
+{
+ g_clear_pointer(&pages->offset, g_free);
}
void multifd_ram_save_setup(void)
diff --git a/migration/multifd.c b/migration/multifd.c
index 56419af417cc..4b03253f739e 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -105,26 +105,12 @@ struct {
MultiFDSendData *multifd_send_data_alloc(void)
{
- size_t max_payload_size, size_minus_payload;
+ MultiFDSendData *new = g_new0(MultiFDSendData, 1);
- /*
- * MultiFDPages_t has a flexible array at the end, account for it
- * when allocating MultiFDSendData. Use max() in case other types
- * added to the union in the future are larger than
- * (MultiFDPages_t + flex array).
- */
- max_payload_size = MAX(multifd_ram_payload_size(),
- multifd_device_state_payload_size());
- max_payload_size = MAX(max_payload_size, sizeof(MultiFDPayload));
-
- /*
- * Account for any holes the compiler might insert. We can't pack
- * the structure because that misaligns the members and triggers
- * Waddress-of-packed-member.
- */
- size_minus_payload = sizeof(MultiFDSendData) - sizeof(MultiFDPayload);
+ multifd_ram_payload_alloc(&new->u.ram);
+ /* Device state allocates its payload on-demand */
- return g_malloc0(size_minus_payload + max_payload_size);
+ return new;
}
void multifd_send_data_clear(MultiFDSendData *data)
@@ -151,8 +137,11 @@ void multifd_send_data_free(MultiFDSendData *data)
return;
}
+ /* This also frees the device state payload */
multifd_send_data_clear(data);
+ multifd_ram_payload_free(&data->u.ram);
+
g_free(data);
}
diff --git a/migration/multifd.h b/migration/multifd.h
index dec7d9404434..05ddfb4bf119 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -99,9 +99,13 @@ typedef struct {
uint32_t num;
/* number of normal pages */
uint32_t normal_num;
+ /*
+ * Pointer to the ramblock. NOTE: it's caller's responsibility to make
+ * sure the pointer is always valid!
+ */
RAMBlock *block;
- /* offset of each page */
- ram_addr_t offset[];
+ /* offset array of each page, managed by multifd */
+ ram_addr_t *offset;
} MultiFDPages_t;
struct MultiFDRecvData {
@@ -124,7 +128,7 @@ typedef enum {
MULTIFD_PAYLOAD_DEVICE_STATE,
} MultiFDPayloadType;
-typedef union MultiFDPayload {
+typedef struct MultiFDPayload {
MultiFDPages_t ram;
MultiFDDeviceState_t device_state;
} MultiFDPayload;
@@ -373,11 +377,11 @@ static inline uint32_t multifd_ram_page_count(void)
void multifd_ram_save_setup(void);
void multifd_ram_save_cleanup(void);
int multifd_ram_flush_and_sync(void);
-size_t multifd_ram_payload_size(void);
+void multifd_ram_payload_alloc(MultiFDPages_t *pages);
+void multifd_ram_payload_free(MultiFDPages_t *pages);
void multifd_ram_fill_packet(MultiFDSendParams *p);
int multifd_ram_unfill_packet(MultiFDRecvParams *p, Error **errp);
-size_t multifd_device_state_payload_size(void);
void multifd_device_state_send_setup(void);
void multifd_device_state_clear(MultiFDDeviceState_t *device_state);
void multifd_device_state_send_cleanup(void);
^ permalink raw reply related [flat|nested] 140+ messages in thread
* [PATCH v3 15/24] migration/multifd: Add migration_has_device_state_support()
2024-11-17 19:19 [PATCH v3 00/24] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
` (13 preceding siblings ...)
2024-11-17 19:20 ` [PATCH v3 14/24] migration/multifd: Make MultiFDSendData a struct Maciej S. Szmigiero
@ 2024-11-17 19:20 ` Maciej S. Szmigiero
2024-11-26 20:05 ` Fabiano Rosas
2024-11-28 10:33 ` Avihai Horon
2024-11-17 19:20 ` [PATCH v3 16/24] migration/multifd: Send final SYNC only after device state is complete Maciej S. Szmigiero
` (10 subsequent siblings)
25 siblings, 2 replies; 140+ messages in thread
From: Maciej S. Szmigiero @ 2024-11-17 19:20 UTC (permalink / raw)
To: Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
Since device state transfer via multifd channels requires multifd
channels with packets and is currently not compatible with multifd
compression, add an appropriate query function so a device can learn
whether it can actually make use of it.
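For example (an illustrative sketch only - idstr/instance_id/buf/len
stand for whatever state chunk the device wants to send):

    if (migration_has_device_state_support()) {
        /* stream the device state chunk via a multifd channel */
        if (!multifd_queue_device_state(idstr, instance_id, buf, len)) {
            return -1;
        }
    } else {
        /* fall back to sending state via the main migration channel */
    }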
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
include/migration/misc.h | 1 +
migration/multifd-device-state.c | 7 +++++++
2 files changed, 8 insertions(+)
diff --git a/include/migration/misc.h b/include/migration/misc.h
index 118e205bbcc6..43558d9198f7 100644
--- a/include/migration/misc.h
+++ b/include/migration/misc.h
@@ -112,5 +112,6 @@ bool migration_in_bg_snapshot(void);
/* migration/multifd-device-state.c */
bool multifd_queue_device_state(char *idstr, uint32_t instance_id,
char *data, size_t len);
+bool migration_has_device_state_support(void);
#endif
diff --git a/migration/multifd-device-state.c b/migration/multifd-device-state.c
index 8cf5a6c2668c..bcbea926b6be 100644
--- a/migration/multifd-device-state.c
+++ b/migration/multifd-device-state.c
@@ -11,6 +11,7 @@
#include "qemu/lockable.h"
#include "migration/misc.h"
#include "multifd.h"
+#include "options.h"
static QemuMutex queue_job_mutex;
@@ -99,3 +100,9 @@ bool multifd_queue_device_state(char *idstr, uint32_t instance_id,
return true;
}
+
+bool migration_has_device_state_support(void)
+{
+ return migrate_multifd() && !migrate_mapped_ram() &&
+ migrate_multifd_compression() == MULTIFD_COMPRESSION_NONE;
+}
^ permalink raw reply related [flat|nested] 140+ messages in thread
* Re: [PATCH v3 15/24] migration/multifd: Add migration_has_device_state_support()
2024-11-17 19:20 ` [PATCH v3 15/24] migration/multifd: Add migration_has_device_state_support() Maciej S. Szmigiero
@ 2024-11-26 20:05 ` Fabiano Rosas
2024-11-28 10:33 ` Avihai Horon
1 sibling, 0 replies; 140+ messages in thread
From: Fabiano Rosas @ 2024-11-26 20:05 UTC (permalink / raw)
To: Maciej S. Szmigiero, Peter Xu
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
"Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>
> Since device state transfer via multifd channels requires multifd
> channels with packets and is currently not compatible with multifd
> compression, add an appropriate query function so a device can learn
> whether it can actually make use of it.
>
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 15/24] migration/multifd: Add migration_has_device_state_support()
2024-11-17 19:20 ` [PATCH v3 15/24] migration/multifd: Add migration_has_device_state_support() Maciej S. Szmigiero
2024-11-26 20:05 ` Fabiano Rosas
@ 2024-11-28 10:33 ` Avihai Horon
2024-11-28 12:12 ` Maciej S. Szmigiero
1 sibling, 1 reply; 140+ messages in thread
From: Avihai Horon @ 2024-11-28 10:33 UTC (permalink / raw)
To: Maciej S. Szmigiero, Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Joao Martins,
qemu-devel
On 17/11/2024 21:20, Maciej S. Szmigiero wrote:
> External email: Use caution opening links or attachments
>
>
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>
> Since device state transfer via multifd channels requires multifd
> channels with packets and is currently not compatible with multifd
> compression, add an appropriate query function so a device can learn
> whether it can actually make use of it.
>
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> ---
> include/migration/misc.h | 1 +
> migration/multifd-device-state.c | 7 +++++++
> 2 files changed, 8 insertions(+)
>
> diff --git a/include/migration/misc.h b/include/migration/misc.h
> index 118e205bbcc6..43558d9198f7 100644
> --- a/include/migration/misc.h
> +++ b/include/migration/misc.h
> @@ -112,5 +112,6 @@ bool migration_in_bg_snapshot(void);
> /* migration/multifd-device-state.c */
> bool multifd_queue_device_state(char *idstr, uint32_t instance_id,
> char *data, size_t len);
> +bool migration_has_device_state_support(void);
Nit: maybe rename to multifd_device_state_supported or
migration_multifd_device_state_supported, as it's specifically related
to multifd?
Thanks.
>
> #endif
> diff --git a/migration/multifd-device-state.c b/migration/multifd-device-state.c
> index 8cf5a6c2668c..bcbea926b6be 100644
> --- a/migration/multifd-device-state.c
> +++ b/migration/multifd-device-state.c
> @@ -11,6 +11,7 @@
> #include "qemu/lockable.h"
> #include "migration/misc.h"
> #include "multifd.h"
> +#include "options.h"
>
> static QemuMutex queue_job_mutex;
>
> @@ -99,3 +100,9 @@ bool multifd_queue_device_state(char *idstr, uint32_t instance_id,
>
> return true;
> }
> +
> +bool migration_has_device_state_support(void)
> +{
> + return migrate_multifd() && !migrate_mapped_ram() &&
> + migrate_multifd_compression() == MULTIFD_COMPRESSION_NONE;
> +}
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 15/24] migration/multifd: Add migration_has_device_state_support()
2024-11-28 10:33 ` Avihai Horon
@ 2024-11-28 12:12 ` Maciej S. Szmigiero
2024-12-05 16:44 ` Peter Xu
0 siblings, 1 reply; 140+ messages in thread
From: Maciej S. Szmigiero @ 2024-11-28 12:12 UTC (permalink / raw)
To: Avihai Horon
Cc: Alex Williamson, Peter Xu, Fabiano Rosas, Cédric Le Goater,
Eric Blake, Markus Armbruster, Daniel P . Berrangé,
Joao Martins, qemu-devel
On 28.11.2024 11:33, Avihai Horon wrote:
>
> On 17/11/2024 21:20, Maciej S. Szmigiero wrote:
>> External email: Use caution opening links or attachments
>>
>>
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> Since device state transfer via multifd channels requires multifd
>> channels with packets and is currently not compatible with multifd
>> compression, add an appropriate query function so a device can learn
>> whether it can actually make use of it.
>>
>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>> ---
>> include/migration/misc.h | 1 +
>> migration/multifd-device-state.c | 7 +++++++
>> 2 files changed, 8 insertions(+)
>>
>> diff --git a/include/migration/misc.h b/include/migration/misc.h
>> index 118e205bbcc6..43558d9198f7 100644
>> --- a/include/migration/misc.h
>> +++ b/include/migration/misc.h
>> @@ -112,5 +112,6 @@ bool migration_in_bg_snapshot(void);
>> /* migration/multifd-device-state.c */
>> bool multifd_queue_device_state(char *idstr, uint32_t instance_id,
>> char *data, size_t len);
>> +bool migration_has_device_state_support(void);
>
> Nit: maybe rename to multifd_device_state_supported or migration_multifd_device_state_supported, as it's specifically related to multifd?
Sure, will do.
> Thanks.
Thanks,
Maciej
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 15/24] migration/multifd: Add migration_has_device_state_support()
2024-11-28 12:12 ` Maciej S. Szmigiero
@ 2024-12-05 16:44 ` Peter Xu
0 siblings, 0 replies; 140+ messages in thread
From: Peter Xu @ 2024-12-05 16:44 UTC (permalink / raw)
To: Maciej S. Szmigiero
Cc: Avihai Horon, Alex Williamson, Fabiano Rosas,
Cédric Le Goater, Eric Blake, Markus Armbruster,
Daniel P . Berrangé, Joao Martins, qemu-devel
On Thu, Nov 28, 2024 at 01:12:01PM +0100, Maciej S. Szmigiero wrote:
> On 28.11.2024 11:33, Avihai Horon wrote:
> >
> > On 17/11/2024 21:20, Maciej S. Szmigiero wrote:
> > > External email: Use caution opening links or attachments
> > >
> > >
> > > From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> > >
> > > Since device state transfer via multifd channels requires multifd
> > > channels with packets and is currently not compatible with multifd
> > > compression, add an appropriate query function so a device can learn
> > > whether it can actually make use of it.
> > >
> > > Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> > > ---
> > > include/migration/misc.h | 1 +
> > > migration/multifd-device-state.c | 7 +++++++
> > > 2 files changed, 8 insertions(+)
> > >
> > > diff --git a/include/migration/misc.h b/include/migration/misc.h
> > > index 118e205bbcc6..43558d9198f7 100644
> > > --- a/include/migration/misc.h
> > > +++ b/include/migration/misc.h
> > > @@ -112,5 +112,6 @@ bool migration_in_bg_snapshot(void);
> > > /* migration/multifd-device-state.c */
> > > bool multifd_queue_device_state(char *idstr, uint32_t instance_id,
> > > char *data, size_t len);
> > > +bool migration_has_device_state_support(void);
> >
> > Nit: maybe rename to multifd_device_state_supported or migration_multifd_device_state_supported, as it's specifically related to multifd?
>
> Sure, will do.
With that, feel free to take:
Reviewed-by: Peter Xu <peterx@redhat.com>
--
Peter Xu
^ permalink raw reply [flat|nested] 140+ messages in thread
* [PATCH v3 16/24] migration/multifd: Send final SYNC only after device state is complete
2024-11-17 19:19 [PATCH v3 00/24] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
` (14 preceding siblings ...)
2024-11-17 19:20 ` [PATCH v3 15/24] migration/multifd: Add migration_has_device_state_support() Maciej S. Szmigiero
@ 2024-11-17 19:20 ` Maciej S. Szmigiero
2024-11-26 20:52 ` Fabiano Rosas
2024-11-17 19:20 ` [PATCH v3 17/24] migration: Add save_live_complete_precopy_thread handler Maciej S. Szmigiero
` (9 subsequent siblings)
25 siblings, 1 reply; 140+ messages in thread
From: Maciej S. Szmigiero @ 2024-11-17 19:20 UTC (permalink / raw)
To: Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
Currently, ram_save_complete() sends a final SYNC multifd packet near the
end of the function, after sending all of the remaining RAM data.
On the receive side, this SYNC packet will cause multifd channel threads
to block, waiting for the final sem_sync posting in
multifd_recv_terminate_threads().
However, multifd_recv_terminate_threads() won't be called until the
migration is complete, which causes a problem if multifd channels are
still required for transferring device state data after RAM transfer is
complete but before finishing the migration process.
Instead, if device state transfer is possible, defer sending the final
SYNC packet until the end of sending the post-switchover iterable data.
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
migration/multifd-nocomp.c | 18 +++++++++++++++++-
migration/multifd.h | 1 +
migration/ram.c | 10 +++++++++-
migration/savevm.c | 11 +++++++++++
4 files changed, 38 insertions(+), 2 deletions(-)
diff --git a/migration/multifd-nocomp.c b/migration/multifd-nocomp.c
index 90c0927b9bcb..db87b1262ffa 100644
--- a/migration/multifd-nocomp.c
+++ b/migration/multifd-nocomp.c
@@ -348,7 +348,7 @@ retry:
return true;
}
-int multifd_ram_flush_and_sync(void)
+int multifd_ram_flush(void)
{
if (!migrate_multifd()) {
return 0;
@@ -361,6 +361,22 @@ int multifd_ram_flush_and_sync(void)
}
}
+ return 0;
+}
+
+int multifd_ram_flush_and_sync(void)
+{
+ int ret;
+
+ if (!migrate_multifd()) {
+ return 0;
+ }
+
+ ret = multifd_ram_flush();
+ if (ret) {
+ return ret;
+ }
+
return multifd_send_sync_main();
}
diff --git a/migration/multifd.h b/migration/multifd.h
index 05ddfb4bf119..3abf9578e2ae 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -376,6 +376,7 @@ static inline uint32_t multifd_ram_page_count(void)
void multifd_ram_save_setup(void);
void multifd_ram_save_cleanup(void);
+int multifd_ram_flush(void);
int multifd_ram_flush_and_sync(void);
void multifd_ram_payload_alloc(MultiFDPages_t *pages);
void multifd_ram_payload_free(MultiFDPages_t *pages);
diff --git a/migration/ram.c b/migration/ram.c
index 05ff9eb32876..cf7bea3f073b 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -3283,7 +3283,15 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
}
}
- ret = multifd_ram_flush_and_sync();
+ if (migration_has_device_state_support()) {
+ /*
+ * Can't do the final SYNC here since device state might still
+ * be transferring via multifd channels.
+ */
+ ret = multifd_ram_flush();
+ } else {
+ ret = multifd_ram_flush_and_sync();
+ }
if (ret < 0) {
return ret;
}
diff --git a/migration/savevm.c b/migration/savevm.c
index 6ea9054c4083..98049cb9b09a 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -37,6 +37,7 @@
#include "migration/register.h"
#include "migration/global_state.h"
#include "migration/channel-block.h"
+#include "multifd.h"
#include "ram.h"
#include "qemu-file.h"
#include "savevm.h"
@@ -1496,6 +1497,7 @@ int qemu_savevm_state_complete_precopy_iterable(QEMUFile *f, bool in_postcopy)
int64_t start_ts_each, end_ts_each;
SaveStateEntry *se;
int ret;
+ bool multifd_device_state = migration_has_device_state_support();
QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
if (!se->ops ||
@@ -1528,6 +1530,15 @@ int qemu_savevm_state_complete_precopy_iterable(QEMUFile *f, bool in_postcopy)
end_ts_each - start_ts_each);
}
+ if (multifd_device_state) {
+ /* Send the final SYNC */
+ ret = multifd_send_sync_main();
+ if (ret) {
+ qemu_file_set_error(f, ret);
+ return -1;
+ }
+ }
+
trace_vmstate_downtime_checkpoint("src-iterable-saved");
return 0;
^ permalink raw reply related [flat|nested] 140+ messages in thread
* Re: [PATCH v3 16/24] migration/multifd: Send final SYNC only after device state is complete
2024-11-17 19:20 ` [PATCH v3 16/24] migration/multifd: Send final SYNC only after device state is complete Maciej S. Szmigiero
@ 2024-11-26 20:52 ` Fabiano Rosas
2024-11-26 21:22 ` Maciej S. Szmigiero
0 siblings, 1 reply; 140+ messages in thread
From: Fabiano Rosas @ 2024-11-26 20:52 UTC (permalink / raw)
To: Maciej S. Szmigiero, Peter Xu
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
"Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>
> Currently, ram_save_complete() sends a final SYNC multifd packet near the
> end of the function, after sending all of the remaining RAM data.
>
> On the receive side, this SYNC packet will cause multifd channel threads
> to block, waiting for the final sem_sync posting in
> multifd_recv_terminate_threads().
>
> However, multifd_recv_terminate_threads() won't be called until the
> migration is complete, which causes a problem if multifd channels are
> still required for transferring device state data after RAM transfer is
> complete but before finishing the migration process.
>
> Instead, if device state transfer is possible, defer sending the final
> SYNC packet until the end of sending the post-switchover iterable data.
>
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
I wonder whether we could just defer the sync for the !device_state case
as well.
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 16/24] migration/multifd: Send final SYNC only after device state is complete
2024-11-26 20:52 ` Fabiano Rosas
@ 2024-11-26 21:22 ` Maciej S. Szmigiero
2024-12-05 19:02 ` Peter Xu
0 siblings, 1 reply; 140+ messages in thread
From: Maciej S. Szmigiero @ 2024-11-26 21:22 UTC (permalink / raw)
To: Fabiano Rosas
Cc: Alex Williamson, Peter Xu, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On 26.11.2024 21:52, Fabiano Rosas wrote:
> "Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:
>
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> Currently, ram_save_complete() sends a final SYNC multifd packet near the
>> end of the function, after sending all of the remaining RAM data.
>>
>> On the receive side, this SYNC packet will cause multifd channel threads
>> to block, waiting for the final sem_sync posting in
>> multifd_recv_terminate_threads().
>>
>> However, multifd_recv_terminate_threads() won't be called until the
>> migration is complete, which causes a problem if multifd channels are
>> still required for transferring device state data after RAM transfer is
>> complete but before finishing the migration process.
>>
>> Instead, if device state transfer is possible, defer sending the final
>> SYNC packet until the end of sending the post-switchover iterable data.
>>
>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>
> Reviewed-by: Fabiano Rosas <farosas@suse.de>
>
> I wonder whether we could just defer the sync for the !device_state case
> as well.
>
AFAIK this should work, just wanted to be extra cautious with bit
stream timing changes in case there's for example some race in an
older QEMU version.
Thanks,
Maciej
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 16/24] migration/multifd: Send final SYNC only after device state is complete
2024-11-26 21:22 ` Maciej S. Szmigiero
@ 2024-12-05 19:02 ` Peter Xu
2024-12-10 23:05 ` Maciej S. Szmigiero
0 siblings, 1 reply; 140+ messages in thread
From: Peter Xu @ 2024-12-05 19:02 UTC (permalink / raw)
To: Maciej S. Szmigiero
Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On Tue, Nov 26, 2024 at 10:22:42PM +0100, Maciej S. Szmigiero wrote:
> On 26.11.2024 21:52, Fabiano Rosas wrote:
> > "Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:
> >
> > > From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> > >
> > > Currently, ram_save_complete() sends a final SYNC multifd packet near the
> > > end of the function, after sending all of the remaining RAM data.
> > >
> > > On the receive side, this SYNC packet will cause multifd channel threads
> > > to block, waiting for the final sem_sync posting in
> > > multifd_recv_terminate_threads().
> > >
> > > However, multifd_recv_terminate_threads() won't be called until the
> > > migration is complete, which causes a problem if multifd channels are
> > > still required for transferring device state data after RAM transfer is
> > > complete but before finishing the migration process.
> > >
> > > Instead, if device state transfer is possible, defer sending the final
> > > SYNC packet until the end of sending the post-switchover iterable data.
> > >
> > > Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> >
> > Reviewed-by: Fabiano Rosas <farosas@suse.de>
> >
> > I wonder whether we could just defer the sync for the !device_state case
> > as well.
> >
>
> AFAIK this should work, just wanted to be extra cautious with bit
> stream timing changes in case there's for example some race in an
> older QEMU version.
I see the issue, but maybe we don't even need this patch..
When I was working on commit 637280aeb2 previously, I forgot that the SYNC
messages went together with the FLUSH which got removed. It means now in
complete() we will send SYNCs always, but always without FLUSH.
On new binaries, it means SYNCs are not collected properly on dest threads
so it'll hang all threads there.
So yeah, at least for that part I'm the one to blame.
I think maybe VFIO doesn't need to change the generic path to sync, because
logically speaking VFIO can also use multifd_send_sync_main() in its own
complete() hook to flush everything. The trick here is that such a sync
doesn't need to be attached to any message (either SYNC or FLUSH, which only
RAM uses). The sync is about "sync against all sender threads", exactly like
what we do with mapped-ram. Mapped-ram tricked that path with a use_packet
check in the sender thread; for VFIO, however, we could expose a new
parameter to multifd_send_sync_main() saying "let's only sync threads".
I sent two small patches here:
https://lore.kernel.org/r/20241205185303.897010-1-peterx@redhat.com
The 1st patch should fix the SYNC message hang from 637280aeb2 that I
introduced. The 2nd patch introduces the flag I mentioned. I think with
that applied VFIO should be able to sync directly with:
multifd_send_sync_main(MULTIFD_SYNC_THREADS);
Then maybe we don't need this patch anymore. Please have a look.
PS: the two patches could be ready to merge already even before VFIO, if
they're properly reviewed and acked.
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 16/24] migration/multifd: Send final SYNC only after device state is complete
2024-12-05 19:02 ` Peter Xu
@ 2024-12-10 23:05 ` Maciej S. Szmigiero
2024-12-11 13:20 ` Peter Xu
0 siblings, 1 reply; 140+ messages in thread
From: Maciej S. Szmigiero @ 2024-12-10 23:05 UTC (permalink / raw)
To: Peter Xu
Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P. Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On 5.12.2024 20:02, Peter Xu wrote:
> On Tue, Nov 26, 2024 at 10:22:42PM +0100, Maciej S. Szmigiero wrote:
>> On 26.11.2024 21:52, Fabiano Rosas wrote:
>>> "Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:
>>>
>>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>>
>>>> Currently, ram_save_complete() sends a final SYNC multifd packet near the
>>>> end of the function, after sending all of the remaining RAM data.
>>>>
>>>> On the receive side, this SYNC packet will cause multifd channel threads
>>>> to block, waiting for the final sem_sync posting in
>>>> multifd_recv_terminate_threads().
>>>>
>>>> However, multifd_recv_terminate_threads() won't be called until the
>>>> migration is complete, which causes a problem if multifd channels are
>>>> still required for transferring device state data after RAM transfer is
>>>> complete but before finishing the migration process.
>>>>
>>>> Instead, if device state transfer is possible, defer sending the final
>>>> SYNC packet until the end of sending the post-switchover iterable data.
>>>>
>>>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>>>
>>> Reviewed-by: Fabiano Rosas <farosas@suse.de>
>>>
>>> I wonder whether we could just defer the sync for the !device_state case
>>> as well.
>>>
>>
>> AFAIK this should work, just wanted to be extra cautious with bit
>> stream timing changes in case there's for example some race in an
>> older QEMU version.
>
> I see the issue, but maybe we don't even need this patch..
>
> When I was working on commit 637280aeb2 previously, I forgot that the SYNC
> messages went together with the FLUSH which got removed. It means now in
> complete() we will send SYNCs always, but always without FLUSH.
>
> On new binaries, it means SYNCs are not collected properly on dest threads
> so it'll hang all threads there.
>
> So yeah, at least for that part I'm the one to blame.
>
> I think maybe VFIO doesn't need to change the generic path to sync, because
> logically speaking VFIO can also use multifd_send_sync_main() in its own
> complete() hook to flush everything. The trick here is that such a sync
> doesn't need to be attached to any message (either SYNC or FLUSH, which only
> RAM uses). The sync is about "sync against all sender threads", exactly like
> what we do with mapped-ram. Mapped-ram tricked that path with a use_packet
> check in the sender thread; for VFIO, however, we could expose a new
> parameter to multifd_send_sync_main() saying "let's only sync threads".
>
> I sent two small patches here:
>
> https://lore.kernel.org/r/20241205185303.897010-1-peterx@redhat.com
>
> The 1st patch should fix the SYNC message hang from 637280aeb2 that I
> introduced. The 2nd patch introduces the flag I mentioned. I think with
> that applied VFIO should be able to sync directly with:
>
> multifd_send_sync_main(MULTIFD_SYNC_THREADS);
>
> Then maybe we don't need this patch anymore. Please have a look.
>
> PS: the two patches could be ready to merge already even before VFIO, if
> they're properly reviewed and acked.
Thanks, Peter, for this alternate solution.
I think/hope that by the time I'm preparing the next version of this
multifd device state patch set these SYNC patches will already be merged
and I can develop/test against them.
> Thanks,
>
Thanks,
Maciej
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 16/24] migration/multifd: Send final SYNC only after device state is complete
2024-12-10 23:05 ` Maciej S. Szmigiero
@ 2024-12-11 13:20 ` Peter Xu
0 siblings, 0 replies; 140+ messages in thread
From: Peter Xu @ 2024-12-11 13:20 UTC (permalink / raw)
To: Maciej S. Szmigiero
Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P. Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On Wed, Dec 11, 2024 at 12:05:40AM +0100, Maciej S. Szmigiero wrote:
> > I sent two small patches here:
> >
> > https://lore.kernel.org/r/20241205185303.897010-1-peterx@redhat.com
> >
> > The 1st patch should fix the SYNC message hang from 637280aeb2 that I
> > introduced. The 2nd patch introduces the flag I mentioned. I think with
> > that applied VFIO should be able to sync directly with:
> >
> > multifd_send_sync_main(MULTIFD_SYNC_THREADS);
> >
> > Then maybe we don't need this patch anymore. Please have a look.
> >
> > PS: the two patches could be ready to merge already even before VFIO, if
> > they're properly reviewed and acked.
>
> Thanks, Peter, for this alternate solution.
>
> I think/hope that by the time I'm preparing the next version of this
> multifd device state patch set these SYNC patches will already be merged
> and I can develop/test against them.
Yes, that's the plan; even if it hasn't landed yet you can also collect the
first two patches, especially if you agree with the changes. I think we
should fix it one way or another, so basing on top of that might be best
for this series (it should hopefully mean less code to change).
Just to mention: when rebased on top, multifd_send_sync_main() may or may
not need a lock to protect it when VFIO uses it. I think not, as long as it
always comes from the migration thread, but it's worth double checking as I
don't 100% know what the next version will look like (or it can simply
share the same multifd mutex, I think).
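If sharing the mutex turns out to be needed, something as simple as this
should do (a sketch, assuming the multifd-device-state queue_job_mutex is
reused):

    WITH_QEMU_LOCK_GUARD(&queue_job_mutex) {
        ret = multifd_send_sync_main(MULTIFD_SYNC_THREADS);
    }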
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 140+ messages in thread
* [PATCH v3 17/24] migration: Add save_live_complete_precopy_thread handler
2024-11-17 19:19 [PATCH v3 00/24] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
` (15 preceding siblings ...)
2024-11-17 19:20 ` [PATCH v3 16/24] migration/multifd: Send final SYNC only after device state is complete Maciej S. Szmigiero
@ 2024-11-17 19:20 ` Maciej S. Szmigiero
2024-11-29 14:03 ` Cédric Le Goater
2024-11-17 19:20 ` [PATCH v3 18/24] vfio/migration: Don't run load cleanup if load setup didn't run Maciej S. Szmigiero
` (8 subsequent siblings)
25 siblings, 1 reply; 140+ messages in thread
From: Maciej S. Szmigiero @ 2024-11-17 19:20 UTC (permalink / raw)
To: Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
This SaveVMHandler helps a device provide its own asynchronous transmission
of the remaining data at the end of a precopy phase via multifd channels,
in parallel with the transfer done by save_live_complete_precopy handlers.
These threads are launched only when multifd device state transfer is
supported.
Management of these threads is done in the multifd migration code,
wrapping them in the generic thread pool.
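As an illustration, a device-side implementation could look roughly like
the sketch below. MyDevState, CHUNK_SIZE and my_dev_read_chunk() are
placeholders; note that multifd_queue_device_state() copies the buffer,
so a stack buffer is fine here:

    static int my_dev_save_complete_precopy_thread(char *idstr,
                                                   uint32_t instance_id,
                                                   bool *abort_flag,
                                                   void *opaque)
    {
        MyDevState *s = opaque;
        char buf[CHUNK_SIZE];
        ssize_t len;

        while ((len = my_dev_read_chunk(s, buf, sizeof(buf))) > 0) {
            if (qatomic_read(abort_flag)) {
                return -1;  /* the migration core asked us to exit ASAP */
            }
            if (!multifd_queue_device_state(idstr, instance_id, buf, len)) {
                return -1;
            }
        }

        return len < 0 ? -1 : 0;
    }

    static const SaveVMHandlers my_dev_savevm_handlers = {
        /* ... the usual handlers ... */
        .save_live_complete_precopy_thread =
            my_dev_save_complete_precopy_thread,
    };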
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
include/migration/misc.h | 8 +++
include/migration/register.h | 23 +++++++++
include/qemu/typedefs.h | 4 ++
migration/multifd-device-state.c | 85 ++++++++++++++++++++++++++++++++
migration/savevm.c | 33 ++++++++++++-
5 files changed, 152 insertions(+), 1 deletion(-)
diff --git a/include/migration/misc.h b/include/migration/misc.h
index 43558d9198f7..67014122dcff 100644
--- a/include/migration/misc.h
+++ b/include/migration/misc.h
@@ -114,4 +114,12 @@ bool multifd_queue_device_state(char *idstr, uint32_t instance_id,
char *data, size_t len);
bool migration_has_device_state_support(void);
+void
+multifd_spawn_device_state_save_thread(SaveLiveCompletePrecopyThreadHandler hdlr,
+ char *idstr, uint32_t instance_id,
+ void *opaque);
+
+void multifd_abort_device_state_save_threads(void);
+int multifd_join_device_state_save_threads(void);
+
#endif
diff --git a/include/migration/register.h b/include/migration/register.h
index 761e4e4d8bcb..ab702e0a930b 100644
--- a/include/migration/register.h
+++ b/include/migration/register.h
@@ -105,6 +105,29 @@ typedef struct SaveVMHandlers {
*/
int (*save_live_complete_precopy)(QEMUFile *f, void *opaque);
+ /* This runs in a separate thread. */
+
+ /**
+ * @save_live_complete_precopy_thread
+ *
+ * Called at the end of a precopy phase from a separate worker thread
+ * in configurations where multifd device state transfer is supported
+ * in order to perform asynchronous transmission of the remaining data in
+ * parallel with @save_live_complete_precopy handlers.
+ * When postcopy is enabled, devices that support postcopy will skip this
+ * step.
+ *
+ * @idstr: this device section idstr
+ * @instance_id: this device section instance_id
+ * @abort_flag: flag indicating that the migration core wants to abort
+ * the transmission and so the handler should exit ASAP. To be read by
+ * qatomic_read() or similar.
+ * @opaque: data pointer passed to register_savevm_live()
+ *
+ * Returns zero to indicate success and negative for error
+ */
+ SaveLiveCompletePrecopyThreadHandler save_live_complete_precopy_thread;
+
/* This runs both outside and inside the BQL. */
/**
diff --git a/include/qemu/typedefs.h b/include/qemu/typedefs.h
index 8c8ea5c2840d..926baaad211f 100644
--- a/include/qemu/typedefs.h
+++ b/include/qemu/typedefs.h
@@ -132,5 +132,9 @@ typedef struct IRQState *qemu_irq;
*/
typedef void (*qemu_irq_handler)(void *opaque, int n, int level);
typedef int (*MigrationLoadThread)(bool *abort_flag, void *opaque);
+typedef int (*SaveLiveCompletePrecopyThreadHandler)(char *idstr,
+ uint32_t instance_id,
+ bool *abort_flag,
+ void *opaque);
#endif /* QEMU_TYPEDEFS_H */
diff --git a/migration/multifd-device-state.c b/migration/multifd-device-state.c
index bcbea926b6be..74a4aef346c8 100644
--- a/migration/multifd-device-state.c
+++ b/migration/multifd-device-state.c
@@ -9,12 +9,17 @@
#include "qemu/osdep.h"
#include "qemu/lockable.h"
+#include "block/thread-pool.h"
#include "migration/misc.h"
#include "multifd.h"
#include "options.h"
static QemuMutex queue_job_mutex;
+static ThreadPool *send_threads;
+static int send_threads_ret;
+static bool send_threads_abort;
+
static MultiFDSendData *device_state_send;
void multifd_device_state_send_setup(void)
@@ -22,6 +27,10 @@ void multifd_device_state_send_setup(void)
qemu_mutex_init(&queue_job_mutex);
device_state_send = multifd_send_data_alloc();
+
+ send_threads = thread_pool_new();
+ send_threads_ret = 0;
+ send_threads_abort = false;
}
void multifd_device_state_clear(MultiFDDeviceState_t *device_state)
@@ -32,6 +41,7 @@ void multifd_device_state_clear(MultiFDDeviceState_t *device_state)
void multifd_device_state_send_cleanup(void)
{
+ g_clear_pointer(&send_threads, thread_pool_free);
g_clear_pointer(&device_state_send, multifd_send_data_free);
qemu_mutex_destroy(&queue_job_mutex);
@@ -106,3 +116,78 @@ bool migration_has_device_state_support(void)
return migrate_multifd() && !migrate_mapped_ram() &&
migrate_multifd_compression() == MULTIFD_COMPRESSION_NONE;
}
+
+struct MultiFDDSSaveThreadData {
+ SaveLiveCompletePrecopyThreadHandler hdlr;
+ char *idstr;
+ uint32_t instance_id;
+ void *handler_opaque;
+};
+
+static void multifd_device_state_save_thread_data_free(void *opaque)
+{
+ struct MultiFDDSSaveThreadData *data = opaque;
+
+ g_clear_pointer(&data->idstr, g_free);
+ g_free(data);
+}
+
+static int multifd_device_state_save_thread(void *opaque)
+{
+ struct MultiFDDSSaveThreadData *data = opaque;
+ int ret;
+
+ ret = data->hdlr(data->idstr, data->instance_id, &send_threads_abort,
+ data->handler_opaque);
+ if (ret && !qatomic_read(&send_threads_ret)) {
+ /*
+ * Racy with the above read but that's okay - which thread error
+ * return we report is purely arbitrary anyway.
+ */
+ qatomic_set(&send_threads_ret, ret);
+ }
+
+ return 0;
+}
+
+void
+multifd_spawn_device_state_save_thread(SaveLiveCompletePrecopyThreadHandler hdlr,
+ char *idstr, uint32_t instance_id,
+ void *opaque)
+{
+ struct MultiFDDSSaveThreadData *data;
+
+ assert(migration_has_device_state_support());
+
+ data = g_new(struct MultiFDDSSaveThreadData, 1);
+ data->hdlr = hdlr;
+ data->idstr = g_strdup(idstr);
+ data->instance_id = instance_id;
+ data->handler_opaque = opaque;
+
+ thread_pool_submit(send_threads,
+ multifd_device_state_save_thread,
+ data, multifd_device_state_save_thread_data_free);
+
+ /*
+ * Make sure that this new thread is actually spawned immediately so it
+ * can start its work right now.
+ */
+ thread_pool_adjust_max_threads_to_work(send_threads);
+}
+
+void multifd_abort_device_state_save_threads(void)
+{
+ assert(migration_has_device_state_support());
+
+ qatomic_set(&send_threads_abort, true);
+}
+
+int multifd_join_device_state_save_threads(void)
+{
+ assert(migration_has_device_state_support());
+
+ thread_pool_wait(send_threads);
+
+ return send_threads_ret;
+}
diff --git a/migration/savevm.c b/migration/savevm.c
index 98049cb9b09a..177849e7d493 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -1499,6 +1499,23 @@ int qemu_savevm_state_complete_precopy_iterable(QEMUFile *f, bool in_postcopy)
int ret;
bool multifd_device_state = migration_has_device_state_support();
+ if (multifd_device_state) {
+ QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+ SaveLiveCompletePrecopyThreadHandler hdlr;
+
+ if (!se->ops || (in_postcopy && se->ops->has_postcopy &&
+ se->ops->has_postcopy(se->opaque)) ||
+ !se->ops->save_live_complete_precopy_thread) {
+ continue;
+ }
+
+ hdlr = se->ops->save_live_complete_precopy_thread;
+ multifd_spawn_device_state_save_thread(hdlr,
+ se->idstr, se->instance_id,
+ se->opaque);
+ }
+ }
+
QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
if (!se->ops ||
(in_postcopy && se->ops->has_postcopy &&
@@ -1523,7 +1540,7 @@ int qemu_savevm_state_complete_precopy_iterable(QEMUFile *f, bool in_postcopy)
save_section_footer(f, se);
if (ret < 0) {
qemu_file_set_error(f, ret);
- return -1;
+ goto ret_fail_abort_threads;
}
end_ts_each = qemu_clock_get_us(QEMU_CLOCK_REALTIME);
trace_vmstate_downtime_save("iterable", se->idstr, se->instance_id,
@@ -1531,6 +1548,12 @@ int qemu_savevm_state_complete_precopy_iterable(QEMUFile *f, bool in_postcopy)
}
if (multifd_device_state) {
+ ret = multifd_join_device_state_save_threads();
+ if (ret) {
+ qemu_file_set_error(f, ret);
+ return -1;
+ }
+
/* Send the final SYNC */
ret = multifd_send_sync_main();
if (ret) {
@@ -1542,6 +1565,14 @@ int qemu_savevm_state_complete_precopy_iterable(QEMUFile *f, bool in_postcopy)
trace_vmstate_downtime_checkpoint("src-iterable-saved");
return 0;
+
+ret_fail_abort_threads:
+ if (multifd_device_state) {
+ multifd_abort_device_state_save_threads();
+ multifd_join_device_state_save_threads();
+ }
+
+ return -1;
}
int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
^ permalink raw reply related [flat|nested] 140+ messages in thread
* Re: [PATCH v3 17/24] migration: Add save_live_complete_precopy_thread handler
2024-11-17 19:20 ` [PATCH v3 17/24] migration: Add save_live_complete_precopy_thread handler Maciej S. Szmigiero
@ 2024-11-29 14:03 ` Cédric Le Goater
2024-11-29 17:14 ` Maciej S. Szmigiero
0 siblings, 1 reply; 140+ messages in thread
From: Cédric Le Goater @ 2024-11-29 14:03 UTC (permalink / raw)
To: Maciej S. Szmigiero, Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Eric Blake, Markus Armbruster,
Daniel P . Berrangé, Avihai Horon, Joao Martins, qemu-devel
On 11/17/24 20:20, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>
> This SaveVMHandler helps a device provide its own asynchronous transmission
> of the remaining data at the end of a precopy phase via multifd channels,
> in parallel with the transfer done by save_live_complete_precopy handlers.
>
> These threads are launched only when multifd device state transfer is
> supported.
>
> Management of these threads is done in the multifd migration code,
> wrapping them in the generic thread pool.
>
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> ---
> include/migration/misc.h | 8 +++
> include/migration/register.h | 23 +++++++++
> include/qemu/typedefs.h | 4 ++
> migration/multifd-device-state.c | 85 ++++++++++++++++++++++++++++++++
> migration/savevm.c | 33 ++++++++++++-
> 5 files changed, 152 insertions(+), 1 deletion(-)
>
> diff --git a/include/migration/misc.h b/include/migration/misc.h
> index 43558d9198f7..67014122dcff 100644
> --- a/include/migration/misc.h
> +++ b/include/migration/misc.h
> @@ -114,4 +114,12 @@ bool multifd_queue_device_state(char *idstr, uint32_t instance_id,
> char *data, size_t len);
> bool migration_has_device_state_support(void);
>
> +void
> +multifd_spawn_device_state_save_thread(SaveLiveCompletePrecopyThreadHandler hdlr,
> + char *idstr, uint32_t instance_id,
> + void *opaque);
> +
> +void multifd_abort_device_state_save_threads(void);
> +int multifd_join_device_state_save_threads(void);
> +
> #endif
> diff --git a/include/migration/register.h b/include/migration/register.h
> index 761e4e4d8bcb..ab702e0a930b 100644
> --- a/include/migration/register.h
> +++ b/include/migration/register.h
> @@ -105,6 +105,29 @@ typedef struct SaveVMHandlers {
> */
> int (*save_live_complete_precopy)(QEMUFile *f, void *opaque);
>
> + /* This runs in a separate thread. */
> +
> + /**
> + * @save_live_complete_precopy_thread
> + *
> + * Called at the end of a precopy phase from a separate worker thread
> + * in configurations where multifd device state transfer is supported
> + * in order to perform asynchronous transmission of the remaining data in
> + * parallel with @save_live_complete_precopy handlers.
> + * When postcopy is enabled, devices that support postcopy will skip this
> + * step.
> + *
> + * @idstr: this device section idstr
> + * @instance_id: this device section instance_id
> + * @abort_flag: flag indicating that the migration core wants to abort
> + * the transmission and so the handler should exit ASAP. To be read by
> + * qatomic_read() or similar.
> + * @opaque: data pointer passed to register_savevm_live()
> + *
> + * Returns zero to indicate success and negative for error
> + */
> + SaveLiveCompletePrecopyThreadHandler save_live_complete_precopy_thread;
> +
> /* This runs both outside and inside the BQL. */
>
> /**
> diff --git a/include/qemu/typedefs.h b/include/qemu/typedefs.h
> index 8c8ea5c2840d..926baaad211f 100644
> --- a/include/qemu/typedefs.h
> +++ b/include/qemu/typedefs.h
> @@ -132,5 +132,9 @@ typedef struct IRQState *qemu_irq;
> */
> typedef void (*qemu_irq_handler)(void *opaque, int n, int level);
> typedef int (*MigrationLoadThread)(bool *abort_flag, void *opaque);
> +typedef int (*SaveLiveCompletePrecopyThreadHandler)(char *idstr,
> + uint32_t instance_id,
> + bool *abort_flag,
> + void *opaque);
>
> #endif /* QEMU_TYPEDEFS_H */
> diff --git a/migration/multifd-device-state.c b/migration/multifd-device-state.c
> index bcbea926b6be..74a4aef346c8 100644
> --- a/migration/multifd-device-state.c
> +++ b/migration/multifd-device-state.c
> @@ -9,12 +9,17 @@
>
> #include "qemu/osdep.h"
> #include "qemu/lockable.h"
> +#include "block/thread-pool.h"
> #include "migration/misc.h"
> #include "multifd.h"
> #include "options.h"
>
> static QemuMutex queue_job_mutex;
>
> +static ThreadPool *send_threads;
> +static int send_threads_ret;
> +static bool send_threads_abort;
> +
> static MultiFDSendData *device_state_send;
>
> void multifd_device_state_send_setup(void)
> @@ -22,6 +27,10 @@ void multifd_device_state_send_setup(void)
> qemu_mutex_init(&queue_job_mutex);
>
> device_state_send = multifd_send_data_alloc();
> +
> + send_threads = thread_pool_new();
> + send_threads_ret = 0;
> + send_threads_abort = false;
> }
>
> void multifd_device_state_clear(MultiFDDeviceState_t *device_state)
> @@ -32,6 +41,7 @@ void multifd_device_state_clear(MultiFDDeviceState_t *device_state)
>
> void multifd_device_state_send_cleanup(void)
> {
> + g_clear_pointer(&send_threads, thread_pool_free);
> g_clear_pointer(&device_state_send, multifd_send_data_free);
>
> qemu_mutex_destroy(&queue_job_mutex);
> @@ -106,3 +116,78 @@ bool migration_has_device_state_support(void)
> return migrate_multifd() && !migrate_mapped_ram() &&
> migrate_multifd_compression() == MULTIFD_COMPRESSION_NONE;
> }
> +
> +struct MultiFDDSSaveThreadData {
> + SaveLiveCompletePrecopyThreadHandler hdlr;
> + char *idstr;
> + uint32_t instance_id;
> + void *handler_opaque;
> +};
> +
> +static void multifd_device_state_save_thread_data_free(void *opaque)
> +{
> + struct MultiFDDSSaveThreadData *data = opaque;
> +
> + g_clear_pointer(&data->idstr, g_free);
> + g_free(data);
> +}
> +
> +static int multifd_device_state_save_thread(void *opaque)
> +{
> + struct MultiFDDSSaveThreadData *data = opaque;
> + int ret;
> +
> + ret = data->hdlr(data->idstr, data->instance_id, &send_threads_abort,
> + data->handler_opaque);
> + if (ret && !qatomic_read(&send_threads_ret)) {
> + /*
> + * Racy with the above read but that's okay - which thread error
> + * return we report is purely arbitrary anyway.
> + */
> + qatomic_set(&send_threads_ret, ret);
> + }
> +
> + return 0;
> +}
> +
> +void
> +multifd_spawn_device_state_save_thread(SaveLiveCompletePrecopyThreadHandler hdlr,
> + char *idstr, uint32_t instance_id,
> + void *opaque)
> +{
> + struct MultiFDDSSaveThreadData *data;
> +
> + assert(migration_has_device_state_support());
> +
> + data = g_new(struct MultiFDDSSaveThreadData, 1);
> + data->hdlr = hdlr;
> + data->idstr = g_strdup(idstr);
> + data->instance_id = instance_id;
> + data->handler_opaque = opaque;
> +
> + thread_pool_submit(send_threads,
> + multifd_device_state_save_thread,
> + data, multifd_device_state_save_thread_data_free);
> +
> + /*
> + * Make sure that this new thread is actually spawned immediately so it
> + * can start its work right now.
> + */
> + thread_pool_adjust_max_threads_to_work(send_threads);
> +}
> +
> +void multifd_abort_device_state_save_threads(void)
> +{
> + assert(migration_has_device_state_support());
> +
> + qatomic_set(&send_threads_abort, true);
> +}
> +
> +int multifd_join_device_state_save_threads(void)
> +{
> + assert(migration_has_device_state_support());
> +
> + thread_pool_wait(send_threads);
> +
> + return send_threads_ret;
> +}
There is a lot in common with the load_thread part in patch 8. I think
more code could be shared.
C.
> diff --git a/migration/savevm.c b/migration/savevm.c
> index 98049cb9b09a..177849e7d493 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -1499,6 +1499,23 @@ int qemu_savevm_state_complete_precopy_iterable(QEMUFile *f, bool in_postcopy)
> int ret;
> bool multifd_device_state = migration_has_device_state_support();
>
> + if (multifd_device_state) {
> + QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> + SaveLiveCompletePrecopyThreadHandler hdlr;
> +
> + if (!se->ops || (in_postcopy && se->ops->has_postcopy &&
> + se->ops->has_postcopy(se->opaque)) ||
> + !se->ops->save_live_complete_precopy_thread) {
> + continue;
> + }
> +
> + hdlr = se->ops->save_live_complete_precopy_thread;
> + multifd_spawn_device_state_save_thread(hdlr,
> + se->idstr, se->instance_id,
> + se->opaque);
> + }
> + }
> +
> QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> if (!se->ops ||
> (in_postcopy && se->ops->has_postcopy &&
> @@ -1523,7 +1540,7 @@ int qemu_savevm_state_complete_precopy_iterable(QEMUFile *f, bool in_postcopy)
> save_section_footer(f, se);
> if (ret < 0) {
> qemu_file_set_error(f, ret);
> - return -1;
> + goto ret_fail_abort_threads;
> }
> end_ts_each = qemu_clock_get_us(QEMU_CLOCK_REALTIME);
> trace_vmstate_downtime_save("iterable", se->idstr, se->instance_id,
> @@ -1531,6 +1548,12 @@ int qemu_savevm_state_complete_precopy_iterable(QEMUFile *f, bool in_postcopy)
> }
>
> if (multifd_device_state) {
> + ret = multifd_join_device_state_save_threads();
> + if (ret) {
> + qemu_file_set_error(f, ret);
> + return -1;
> + }
> +
> /* Send the final SYNC */
> ret = multifd_send_sync_main();
> if (ret) {
> @@ -1542,6 +1565,14 @@ int qemu_savevm_state_complete_precopy_iterable(QEMUFile *f, bool in_postcopy)
> trace_vmstate_downtime_checkpoint("src-iterable-saved");
>
> return 0;
> +
> +ret_fail_abort_threads:
> + if (multifd_device_state) {
> + multifd_abort_device_state_save_threads();
> + multifd_join_device_state_save_threads();
> + }
> +
> + return -1;
> }
>
> int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
>
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 17/24] migration: Add save_live_complete_precopy_thread handler
2024-11-29 14:03 ` Cédric Le Goater
@ 2024-11-29 17:14 ` Maciej S. Szmigiero
0 siblings, 0 replies; 140+ messages in thread
From: Maciej S. Szmigiero @ 2024-11-29 17:14 UTC (permalink / raw)
To: Cédric Le Goater
Cc: Alex Williamson, Eric Blake, Peter Xu, Fabiano Rosas,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On 29.11.2024 15:03, Cédric Le Goater wrote:
> On 11/17/24 20:20, Maciej S. Szmigiero wrote:
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> This SaveVMHandler helps a device provide its own asynchronous transmission
>> of the remaining data at the end of a precopy phase via multifd channels,
>> in parallel with the transfer done by save_live_complete_precopy handlers.
>>
>> These threads are launched only when multifd device state transfer is
>> supported.
>>
>> Management of these threads is done in the multifd migration code,
>> wrapping them in the generic thread pool.
>>
>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>> ---
>> include/migration/misc.h | 8 +++
>> include/migration/register.h | 23 +++++++++
>> include/qemu/typedefs.h | 4 ++
>> migration/multifd-device-state.c | 85 ++++++++++++++++++++++++++++++++
>> migration/savevm.c | 33 ++++++++++++-
>> 5 files changed, 152 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/migration/misc.h b/include/migration/misc.h
>> index 43558d9198f7..67014122dcff 100644
>> --- a/include/migration/misc.h
>> +++ b/include/migration/misc.h
>> @@ -114,4 +114,12 @@ bool multifd_queue_device_state(char *idstr, uint32_t instance_id,
>> char *data, size_t len);
>> bool migration_has_device_state_support(void);
>> +void
>> +multifd_spawn_device_state_save_thread(SaveLiveCompletePrecopyThreadHandler hdlr,
>> + char *idstr, uint32_t instance_id,
>> + void *opaque);
>> +
>> +void multifd_abort_device_state_save_threads(void);
>> +int multifd_join_device_state_save_threads(void);
>> +
>> #endif
>> diff --git a/include/migration/register.h b/include/migration/register.h
>> index 761e4e4d8bcb..ab702e0a930b 100644
>> --- a/include/migration/register.h
>> +++ b/include/migration/register.h
>> @@ -105,6 +105,29 @@ typedef struct SaveVMHandlers {
>> */
>> int (*save_live_complete_precopy)(QEMUFile *f, void *opaque);
>> + /* This runs in a separate thread. */
>> +
>> + /**
>> + * @save_live_complete_precopy_thread
>> + *
>> + * Called at the end of a precopy phase from a separate worker thread
>> + * in configurations where multifd device state transfer is supported
>> + * in order to perform asynchronous transmission of the remaining data in
>> + * parallel with @save_live_complete_precopy handlers.
>> + * When postcopy is enabled, devices that support postcopy will skip this
>> + * step.
>> + *
>> + * @idstr: this device section idstr
>> + * @instance_id: this device section instance_id
>> + * @abort_flag: flag indicating that the migration core wants to abort
>> + * the transmission and so the handler should exit ASAP. To be read by
>> + * qatomic_read() or similar.
>> + * @opaque: data pointer passed to register_savevm_live()
>> + *
>> + * Returns zero to indicate success and negative for error
>> + */
>> + SaveLiveCompletePrecopyThreadHandler save_live_complete_precopy_thread;
>> +
>> /* This runs both outside and inside the BQL. */
>> /**
>> diff --git a/include/qemu/typedefs.h b/include/qemu/typedefs.h
>> index 8c8ea5c2840d..926baaad211f 100644
>> --- a/include/qemu/typedefs.h
>> +++ b/include/qemu/typedefs.h
>> @@ -132,5 +132,9 @@ typedef struct IRQState *qemu_irq;
>> */
>> typedef void (*qemu_irq_handler)(void *opaque, int n, int level);
>> typedef int (*MigrationLoadThread)(bool *abort_flag, void *opaque);
>> +typedef int (*SaveLiveCompletePrecopyThreadHandler)(char *idstr,
>> + uint32_t instance_id,
>> + bool *abort_flag,
>> + void *opaque);
>> #endif /* QEMU_TYPEDEFS_H */
>> diff --git a/migration/multifd-device-state.c b/migration/multifd-device-state.c
>> index bcbea926b6be..74a4aef346c8 100644
>> --- a/migration/multifd-device-state.c
>> +++ b/migration/multifd-device-state.c
>> @@ -9,12 +9,17 @@
>> #include "qemu/osdep.h"
>> #include "qemu/lockable.h"
>> +#include "block/thread-pool.h"
>> #include "migration/misc.h"
>> #include "multifd.h"
>> #include "options.h"
>> static QemuMutex queue_job_mutex;
>> +static ThreadPool *send_threads;
>> +static int send_threads_ret;
>> +static bool send_threads_abort;
>> +
>> static MultiFDSendData *device_state_send;
>> void multifd_device_state_send_setup(void)
>> @@ -22,6 +27,10 @@ void multifd_device_state_send_setup(void)
>> qemu_mutex_init(&queue_job_mutex);
>> device_state_send = multifd_send_data_alloc();
>> +
>> + send_threads = thread_pool_new();
>> + send_threads_ret = 0;
>> + send_threads_abort = false;
>> }
>> void multifd_device_state_clear(MultiFDDeviceState_t *device_state)
>> @@ -32,6 +41,7 @@ void multifd_device_state_clear(MultiFDDeviceState_t *device_state)
>> void multifd_device_state_send_cleanup(void)
>> {
>> + g_clear_pointer(&send_threads, thread_pool_free);
>> g_clear_pointer(&device_state_send, multifd_send_data_free);
>> qemu_mutex_destroy(&queue_job_mutex);
>> @@ -106,3 +116,78 @@ bool migration_has_device_state_support(void)
>> return migrate_multifd() && !migrate_mapped_ram() &&
>> migrate_multifd_compression() == MULTIFD_COMPRESSION_NONE;
>> }
>> +
>> +struct MultiFDDSSaveThreadData {
>> + SaveLiveCompletePrecopyThreadHandler hdlr;
>> + char *idstr;
>> + uint32_t instance_id;
>> + void *handler_opaque;
>> +};
>> +
>> +static void multifd_device_state_save_thread_data_free(void *opaque)
>> +{
>> + struct MultiFDDSSaveThreadData *data = opaque;
>> +
>> + g_clear_pointer(&data->idstr, g_free);
>> + g_free(data);
>> +}
>> +
>> +static int multifd_device_state_save_thread(void *opaque)
>> +{
>> + struct MultiFDDSSaveThreadData *data = opaque;
>> + int ret;
>> +
>> + ret = data->hdlr(data->idstr, data->instance_id, &send_threads_abort,
>> + data->handler_opaque);
>> + if (ret && !qatomic_read(&send_threads_ret)) {
>> + /*
>> + * Racy with the above read but that's okay - which thread error
>> + * return we report is purely arbitrary anyway.
>> + */
>> + qatomic_set(&send_threads_ret, ret);
>> + }
>> +
>> + return 0;
>> +}
>> +
>> +void
>> +multifd_spawn_device_state_save_thread(SaveLiveCompletePrecopyThreadHandler hdlr,
>> + char *idstr, uint32_t instance_id,
>> + void *opaque)
>> +{
>> + struct MultiFDDSSaveThreadData *data;
>> +
>> + assert(migration_has_device_state_support());
>> +
>> + data = g_new(struct MultiFDDSSaveThreadData, 1);
>> + data->hdlr = hdlr;
>> + data->idstr = g_strdup(idstr);
>> + data->instance_id = instance_id;
>> + data->handler_opaque = opaque;
>> +
>> + thread_pool_submit(send_threads,
>> + multifd_device_state_save_thread,
>> + data, multifd_device_state_save_thread_data_free);
>> +
>> + /*
>> + * Make sure that this new thread is actually spawned immediately so it
>> + * can start its work right now.
>> + */
>> + thread_pool_adjust_max_threads_to_work(send_threads);
>> +}
>> +
>> +void multifd_abort_device_state_save_threads(void)
>> +{
>> + assert(migration_has_device_state_support());
>> +
>> + qatomic_set(&send_threads_abort, true);
>> +}
>> +
>> +int multifd_join_device_state_save_threads(void)
>> +{
>> + assert(migration_has_device_state_support());
>> +
>> + thread_pool_wait(send_threads);
>> +
>> + return send_threads_ret;
>> +}
>
> There is a lot in common with the load_thread part in patch 8. I think
> more code could be shared.
I will take a second look at whether some code can indeed be shared with
the load threads here when preparing the next version of this
patch set.
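For reference, a device-side implementation of this handler could look
roughly like the sketch below (illustrative only - MyDeviceState and
my_device_read_state() are hypothetical stand-ins, not code from this
series):

static int my_save_live_complete_precopy_thread(char *idstr,
                                                uint32_t instance_id,
                                                bool *abort_flag,
                                                void *opaque)
{
    MyDeviceState *dev = opaque;   /* hypothetical device type */
    char buf[64 * 1024];
    ssize_t len;

    /*
     * my_device_read_state() is a hypothetical helper returning the
     * next chunk of device state, 0 on EOF and negative on error.
     */
    while ((len = my_device_read_state(dev, buf, sizeof(buf))) > 0) {
        /* Bail out quickly if the migration core requested an abort. */
        if (qatomic_read(abort_flag)) {
            return -ECANCELED;
        }
        /* Queue the chunk for transmission over a multifd channel. */
        if (!multifd_queue_device_state(idstr, instance_id, buf, len)) {
            return -1;
        }
    }

    return len < 0 ? -1 : 0;
}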
> C.
Thanks,
Maciej
^ permalink raw reply [flat|nested] 140+ messages in thread
* [PATCH v3 18/24] vfio/migration: Don't run load cleanup if load setup didn't run
2024-11-17 19:19 [PATCH v3 00/24] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
` (16 preceding siblings ...)
2024-11-17 19:20 ` [PATCH v3 17/24] migration: Add save_live_complete_precopy_thread handler Maciej S. Szmigiero
@ 2024-11-17 19:20 ` Maciej S. Szmigiero
2024-11-29 14:08 ` Cédric Le Goater
2024-11-17 19:20 ` [PATCH v3 19/24] vfio/migration: Add x-migration-multifd-transfer VFIO property Maciej S. Szmigiero
` (7 subsequent siblings)
25 siblings, 1 reply; 140+ messages in thread
From: Maciej S. Szmigiero @ 2024-11-17 19:20 UTC (permalink / raw)
To: Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
It's possible for load_cleanup SaveVMHandler to get called without
load_setup handler being called first.
Since we'll soon be running cleanup operations there that access objects
that need earlier initialization in load_setup, let's make sure these
cleanups only run when the load_setup handler has indeed been called
earlier.
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
hw/vfio/migration.c | 21 +++++++++++++++++++--
include/hw/vfio/vfio-common.h | 1 +
2 files changed, 20 insertions(+), 2 deletions(-)
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 01aa11013e42..9e2657073012 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -688,16 +688,33 @@ static void vfio_save_state(QEMUFile *f, void *opaque)
static int vfio_load_setup(QEMUFile *f, void *opaque, Error **errp)
{
VFIODevice *vbasedev = opaque;
+ VFIOMigration *migration = vbasedev->migration;
+ int ret;
+
+ assert(!migration->load_setup);
+
+ ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RESUMING,
+ migration->device_state, errp);
+ if (ret) {
+ return ret;
+ }
- return vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RESUMING,
- vbasedev->migration->device_state, errp);
+ migration->load_setup = true;
+
+ return 0;
}
static int vfio_load_cleanup(void *opaque)
{
VFIODevice *vbasedev = opaque;
+ VFIOMigration *migration = vbasedev->migration;
+
+ if (!migration->load_setup) {
+ return 0;
+ }
vfio_migration_cleanup(vbasedev);
+ migration->load_setup = false;
trace_vfio_load_cleanup(vbasedev->name);
return 0;
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index e0ce6ec3a9b3..246250ed8b75 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -66,6 +66,7 @@ typedef struct VFIOMigration {
VMChangeStateEntry *vm_state;
NotifierWithReturn migration_state;
uint32_t device_state;
+ bool load_setup;
int data_fd;
void *data_buffer;
size_t data_buffer_size;
^ permalink raw reply related [flat|nested] 140+ messages in thread
* Re: [PATCH v3 18/24] vfio/migration: Don't run load cleanup if load setup didn't run
2024-11-17 19:20 ` [PATCH v3 18/24] vfio/migration: Don't run load cleanup if load setup didn't run Maciej S. Szmigiero
@ 2024-11-29 14:08 ` Cédric Le Goater
2024-11-29 17:15 ` Maciej S. Szmigiero
0 siblings, 1 reply; 140+ messages in thread
From: Cédric Le Goater @ 2024-11-29 14:08 UTC (permalink / raw)
To: Maciej S. Szmigiero, Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Eric Blake, Markus Armbruster,
Daniel P . Berrangé, Avihai Horon, Joao Martins, qemu-devel
On 11/17/24 20:20, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>
> It's possible for load_cleanup SaveVMHandler to get called without
> load_setup handler being called first.
>
> Since we'll soon be running cleanup operations there that access objects
> that need earlier initialization in load_setup, let's make sure these
> cleanups only run when the load_setup handler has indeed been called
> earlier.
>
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
tbh, that's a bit ugly. I agree it's similar to those 'bool initialized'
attributes we have in some structs, so nothing new or really wrong.
But it does look like a workaround for a problem or cleanups missing
that would need time to untangle.
I would prefer to avoid this change and address the issue from the
migration subsystem if possible.
Thanks,
C.
> ---
> hw/vfio/migration.c | 21 +++++++++++++++++++--
> include/hw/vfio/vfio-common.h | 1 +
> 2 files changed, 20 insertions(+), 2 deletions(-)
>
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 01aa11013e42..9e2657073012 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -688,16 +688,33 @@ static void vfio_save_state(QEMUFile *f, void *opaque)
> static int vfio_load_setup(QEMUFile *f, void *opaque, Error **errp)
> {
> VFIODevice *vbasedev = opaque;
> + VFIOMigration *migration = vbasedev->migration;
> + int ret;
> +
> + assert(!migration->load_setup);
> +
> + ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RESUMING,
> + migration->device_state, errp);
> + if (ret) {
> + return ret;
> + }
>
> - return vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RESUMING,
> - vbasedev->migration->device_state, errp);
> + migration->load_setup = true;
> +
> + return 0;
> }
>
> static int vfio_load_cleanup(void *opaque)
> {
> VFIODevice *vbasedev = opaque;
> + VFIOMigration *migration = vbasedev->migration;
> +
> + if (!migration->load_setup) {
> + return 0;
> + }
>
> vfio_migration_cleanup(vbasedev);
> + migration->load_setup = false;
> trace_vfio_load_cleanup(vbasedev->name);
>
> return 0;
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index e0ce6ec3a9b3..246250ed8b75 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -66,6 +66,7 @@ typedef struct VFIOMigration {
> VMChangeStateEntry *vm_state;
> NotifierWithReturn migration_state;
> uint32_t device_state;
> + bool load_setup;
> int data_fd;
> void *data_buffer;
> size_t data_buffer_size;
>
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 18/24] vfio/migration: Don't run load cleanup if load setup didn't run
2024-11-29 14:08 ` Cédric Le Goater
@ 2024-11-29 17:15 ` Maciej S. Szmigiero
2024-12-03 15:09 ` Avihai Horon
0 siblings, 1 reply; 140+ messages in thread
From: Maciej S. Szmigiero @ 2024-11-29 17:15 UTC (permalink / raw)
To: Cédric Le Goater
Cc: Alex Williamson, Eric Blake, Peter Xu, Fabiano Rosas,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On 29.11.2024 15:08, Cédric Le Goater wrote:
> On 11/17/24 20:20, Maciej S. Szmigiero wrote:
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> It's possible for load_cleanup SaveVMHandler to get called without
>> load_setup handler being called first.
>>
>> Since we'll soon be running cleanup operations there that access objects
>> that need earlier initialization in load_setup, let's make sure these
>> cleanups only run when the load_setup handler has indeed been called
>> earlier.
>>
>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>
> tbh, that's a bit ugly. I agree it's similar to those 'bool initialized'
> attributes we have in some structs, so nothing new or really wrong.
> But it does look like a workaround for a problem or cleanups missing
> that would need time to untangle.
>
> I would prefer to avoid this change and address the issue from the
> migration subsystem if possible.
While it would be pretty simple to only call {load,save}_cleanup
SaveVMHandlers when the relevant {load,save}_setup handler was
successfully called first, this would amount to a change of these
handler semantics.
This would risk introducing regressions - for example vfio_save_setup()
doesn't clean up (free) newly allocated migration->data_buffer
if vfio_migration_set_state() were to fail later in this handler
and relies on an unconditional call to vfio_save_cleanup() in
order to clean it up.
There might be similar issues in other drivers too.
> Thanks,
>
> C.
Thanks,
Maciej
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 18/24] vfio/migration: Don't run load cleanup if load setup didn't run
2024-11-29 17:15 ` Maciej S. Szmigiero
@ 2024-12-03 15:09 ` Avihai Horon
2024-12-10 23:04 ` Maciej S. Szmigiero
0 siblings, 1 reply; 140+ messages in thread
From: Avihai Horon @ 2024-12-03 15:09 UTC (permalink / raw)
To: Maciej S. Szmigiero, Cédric Le Goater
Cc: Alex Williamson, Eric Blake, Peter Xu, Fabiano Rosas,
Markus Armbruster, Daniel P . Berrangé, Joao Martins,
qemu-devel
On 29/11/2024 19:15, Maciej S. Szmigiero wrote:
> On 29.11.2024 15:08, Cédric Le Goater wrote:
>> On 11/17/24 20:20, Maciej S. Szmigiero wrote:
>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>
>>> It's possible for load_cleanup SaveVMHandler to get called without
>>> load_setup handler being called first.
>>>
>>> Since we'll soon be running cleanup operations there that access
>>> objects
>>> that need earlier initialization in load_setup, let's make sure these
>>> cleanups only run when the load_setup handler has indeed been called
>>> earlier.
>>>
>>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>>
>> tbh, that's a bit ugly. I agree it's similar to those 'bool initialized'
>> attributes we have in some structs, so nothing new or really wrong.
>> But it does look like a workaround for a problem or cleanups missing
>> that would need time to untangle.
>>
>> I would prefer to avoid this change and address the issue from the
>> migration subsystem if possible.
>
> While it would be pretty simple to only call {load,save}_cleanup
> SaveVMHandlers when the relevant {load,save}_setup handler was
> successfully called first this would amount to a change of these
> handler semantics.
>
> This would risk introducing regressions - for example vfio_save_setup()
> doesn't clean up (free) newly allocated migration->data_buffer
> if vfio_migration_set_state() were to fail later in this handler
> and relies on an unconditional call to vfio_save_cleanup() in
> order to clean it up.
>
> There might be similar issues in other drivers too.
We can put all objects related to multifd load in their own struct (as
suggested by Cedric in patch #22) and allocate the struct only if
multifd device state transfer is used.
Then in the cleanup flow we clean the struct only if it was allocated.
This way we don't need to add the load_setup flag and we can keep the
SaveVMHandlers semantics as is.
Do you think this will be OK?
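Something like the following rough sketch (the VFIOMultifd type and the
migration->multifd field are hypothetical names used here just to
illustrate the idea):

/* Hypothetical container for the multifd load state. */
typedef struct VFIOMultifd {
    QemuMutex load_bufs_mutex;
    /* the other multifd load state would live here too */
} VFIOMultifd;

static int vfio_load_setup(QEMUFile *f, void *opaque, Error **errp)
{
    VFIODevice *vbasedev = opaque;
    VFIOMigration *migration = vbasedev->migration;

    if (migration->multifd_transfer) {
        /* Allocated only when multifd device state transfer is used. */
        migration->multifd = g_new0(VFIOMultifd, 1);
        qemu_mutex_init(&migration->multifd->load_bufs_mutex);
    }

    return vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RESUMING,
                                    migration->device_state, errp);
}

static int vfio_load_cleanup(void *opaque)
{
    VFIODevice *vbasedev = opaque;
    VFIOMigration *migration = vbasedev->migration;

    /* A non-NULL pointer means the multifd load state was set up. */
    if (migration->multifd) {
        qemu_mutex_destroy(&migration->multifd->load_bufs_mutex);
        g_clear_pointer(&migration->multifd, g_free);
    }

    vfio_migration_cleanup(vbasedev);
    trace_vfio_load_cleanup(vbasedev->name);
    return 0;
}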
Thanks.
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 18/24] vfio/migration: Don't run load cleanup if load setup didn't run
2024-12-03 15:09 ` Avihai Horon
@ 2024-12-10 23:04 ` Maciej S. Szmigiero
2024-12-12 14:30 ` Avihai Horon
0 siblings, 1 reply; 140+ messages in thread
From: Maciej S. Szmigiero @ 2024-12-10 23:04 UTC (permalink / raw)
To: Avihai Horon
Cc: Cédric Le Goater, Alex Williamson, Eric Blake, Peter Xu,
Fabiano Rosas, Markus Armbruster, Daniel P . Berrangé,
Joao Martins, qemu-devel
On 3.12.2024 16:09, Avihai Horon wrote:
>
> On 29/11/2024 19:15, Maciej S. Szmigiero wrote:
>> On 29.11.2024 15:08, Cédric Le Goater wrote:
>>> On 11/17/24 20:20, Maciej S. Szmigiero wrote:
>>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>>
>>>> It's possible for load_cleanup SaveVMHandler to get called without
>>>> load_setup handler being called first.
>>>>
>>>> Since we'll soon be running cleanup operations there that access objects
>>>> that need earlier initialization in load_setup, let's make sure these
>>>> cleanups only run when the load_setup handler has indeed been called
>>>> earlier.
>>>>
>>>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>>>
>>> tbh, that's a bit ugly. I agree it's similar to those 'bool initialized'
>>> attributes we have in some structs, so nothing new or really wrong.
>>> But it does look like a workaround for a problem or cleanups missing
>>> that would need time to untangle.
>>>
>>> I would prefer to avoid this change and address the issue from the
>>> migration subsystem if possible.
>>
>> While it would be pretty simple to only call {load,save}_cleanup
>> SaveVMHandlers when the relevant {load,save}_setup handler was
>> successfully called first this would amount to a change of these
>> handler semantics.
>>
>> This would risk introducing regressions - for example vfio_save_setup()
>> doesn't clean up (free) newly allocated migration->data_buffer
>> if vfio_migration_set_state() were to fail later in this handler
>>> and relies on an unconditional call to vfio_save_cleanup() in
>> order to clean it up.
>>
>> There might be similar issues in other drivers too.
>
> We can put all objects related to multifd load in their own struct (as suggested by Cedric in patch #22) and allocate the struct only if multifd device state transfer is used.
> Then in the cleanup flow we clean the struct only if it was allocated.
>
> This way we don't need to add the load_setup flag and we can keep the SaveVMHandlers semantics as is.
>
> Do you think this will be OK?
I think the discussion here is more about whether we refactor the
{load,save}_cleanup handler semantics to a "cleaner" design where
these handlers are only called if the relevant {load,save}_setup
handler was successfully called first (but at the same time risk
introducing regressions).
If we keep the existing semantics of these handlers (like this
patch set did) then it is just an implementation detail whether
we keep an explicit flag like "migration->load_setup" or have
a struct pointer that serves as an implicit equivalent flag
(when not NULL) - I don't have a strong opinion on this particular
detail.
> Thanks.
>
Thanks,
Maciej
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 18/24] vfio/migration: Don't run load cleanup if load setup didn't run
2024-12-10 23:04 ` Maciej S. Szmigiero
@ 2024-12-12 14:30 ` Avihai Horon
2024-12-12 22:52 ` Maciej S. Szmigiero
0 siblings, 1 reply; 140+ messages in thread
From: Avihai Horon @ 2024-12-12 14:30 UTC (permalink / raw)
To: Maciej S. Szmigiero
Cc: Cédric Le Goater, Alex Williamson, Eric Blake, Peter Xu,
Fabiano Rosas, Markus Armbruster, Daniel P . Berrangé,
Joao Martins, qemu-devel
On 11/12/2024 1:04, Maciej S. Szmigiero wrote:
> On 3.12.2024 16:09, Avihai Horon wrote:
>>
>> On 29/11/2024 19:15, Maciej S. Szmigiero wrote:
>>> On 29.11.2024 15:08, Cédric Le Goater wrote:
>>>> On 11/17/24 20:20, Maciej S. Szmigiero wrote:
>>>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>>>
>>>>> It's possible for load_cleanup SaveVMHandler to get called without
>>>>> load_setup handler being called first.
>>>>>
>>>>> Since we'll soon be running cleanup operations there that access
>>>>> objects
>>>>> that need earlier initialization in load_setup, let's make sure these
>>>>> cleanups only run when the load_setup handler has indeed been called
>>>>> earlier.
>>>>>
>>>>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>>>>
>>>> tbh, that's a bit ugly. I agree it's similar to those 'bool
>>>> initialized'
>>>> attributes we have in some structs, so nothing new or really wrong.
>>>> But it does look like a workaround for a problem or cleanups missing
>>>> that would need time to untangle.
>>>>
>>>> I would prefer to avoid this change and address the issue from the
>>>> migration subsystem if possible.
>>>
>>> While it would be pretty simple to only call {load,save}_cleanup
>>> SaveVMHandlers when the relevant {load,save}_setup handler was
>>> successfully called first this would amount to a change of these
>>> handler semantics.
>>>
>>> This would risk introducing regressions - for example vfio_save_setup()
>>> doesn't clean up (free) newly allocated migration->data_buffer
>>> if vfio_migration_set_state() were to fail later in this handler
>>> and relies on an unconditional call to vfio_save_cleanup() in
>>> order to clean it up.
>>>
>>> There might be similar issues in other drivers too.
>>
>> We can put all objects related to multifd load in their own struct
>> (as suggested by Cedric in patch #22) and allocate the struct only if
>> multifd device state transfer is used.
>> Then in the cleanup flow we clean the struct only if it was allocated.
>>
>> This way we don't need to add the load_setup flag and we can keep the
>> SaveVMHandlers semantics as is.
>>
>> Do you think this will be OK?
>
> I think the discussion here is more about whether we refactor the
> {load,save}_cleanup handler semantics to a "cleaner" design where
> these handlers are only called if the relevant {load,save}_setup
> handler was successfully called first (but at the same time risk
> introducing regressions).
Yes, and I agree with you that changing the semantics of SaveVMHandlers
can be risky and may deserve a series of its own.
But Cedric didn't like the flag option, so I suggested to do what we
usually do, AFAIU, which is to check if the structs are allocated and
need cleanup.
>
>
> If we keep the existing semantics of these handlers (like this
> patch set did) then it is just an implementation detail whether
> we keep an explicit flag like "migration->load_setup" or have
> a struct pointer that serves as an implicit equivalent flag
> (when not NULL) - I don't have a strong opinion on this particular
> detail.
>
I prefer the struct pointer way, it seems less cumbersome to me.
But it's Cedric's call at the end.
Thanks.
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 18/24] vfio/migration: Don't run load cleanup if load setup didn't run
2024-12-12 14:30 ` Avihai Horon
@ 2024-12-12 22:52 ` Maciej S. Szmigiero
2024-12-19 9:19 ` Cédric Le Goater
0 siblings, 1 reply; 140+ messages in thread
From: Maciej S. Szmigiero @ 2024-12-12 22:52 UTC (permalink / raw)
To: Avihai Horon
Cc: Cédric Le Goater, Alex Williamson, Eric Blake, Peter Xu,
Fabiano Rosas, Markus Armbruster, Daniel P . Berrangé,
Joao Martins, qemu-devel
On 12.12.2024 15:30, Avihai Horon wrote:
>
> On 11/12/2024 1:04, Maciej S. Szmigiero wrote:
>> On 3.12.2024 16:09, Avihai Horon wrote:
>>>
>>> On 29/11/2024 19:15, Maciej S. Szmigiero wrote:
>>>> On 29.11.2024 15:08, Cédric Le Goater wrote:
>>>>> On 11/17/24 20:20, Maciej S. Szmigiero wrote:
>>>>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>>>>
>>>>>> It's possible for load_cleanup SaveVMHandler to get called without
>>>>>> load_setup handler being called first.
>>>>>>
>>>>>> Since we'll soon be running cleanup operations there that access objects
>>>>>> that need earlier initialization in load_setup, let's make sure these
>>>>>> cleanups only run when the load_setup handler has indeed been called
>>>>>> earlier.
>>>>>>
>>>>>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>>>>>
>>>>> tbh, that's a bit ugly. I agree it's similar to those 'bool initialized'
>>>>> attributes we have in some structs, so nothing new or really wrong.
>>>>> But it does look like a workaround for a problem or cleanups missing
>>>>> that would need time to untangle.
>>>>>
>>>>> I would prefer to avoid this change and address the issue from the
>>>>> migration subsystem if possible.
>>>>
>>>> While it would be pretty simple to only call {load,save}_cleanup
>>>> SaveVMHandlers when the relevant {load,save}_setup handler was
>>>> successfully called first this would amount to a change of these
>>>> handler semantics.
>>>>
>>>> This would risk introducing regressions - for example vfio_save_setup()
>>>> doesn't clean up (free) newly allocated migration->data_buffer
>>>> if vfio_migration_set_state() were to fail later in this handler
>>>> and relies on an unconditional call to vfio_save_cleanup() in
>>>> order to clean it up.
>>>>
>>>> There might be similar issues in other drivers too.
>>>
>>> We can put all objects related to multifd load in their own struct (as suggested by Cedric in patch #22) and allocate the struct only if multifd device state transfer is used.
>>> Then in the cleanup flow we clean the struct only if it was allocated.
>>>
>>> This way we don't need to add the load_setup flag and we can keep the SaveVMHandlers semantics as is.
>>>
>>> Do you think this will be OK?
>>
>> I think the discussion here is more about whether we refactor the
>> {load,save}_cleanup handler semantics to a "cleaner" design where
>> these handlers are only called if the relevant {load,save}_setup
>> handler was successfully called first (but at the same time risk
>> introducing regressions).
>
> Yes, and I agree with you that changing the semantics of SaveVMHandlers can be risky and may deserve a series of its own.
> But Cedric didn't like the flag option, so I suggested to do what we usually do, AFAIU, which is to check if the structs are allocated and need cleanup.
>
>>
>>
>> If we keep the existing semantics of these handlers (like this
>> patch set did) then it is just an implementation detail whether
>> we keep an explicit flag like "migration->load_setup" or have
>> a struct pointer that serves as an implicit equivalent flag
>> (when not NULL) - I don't have a strong opinion on this particular
>> detail.
>>
> I prefer the struct pointer way, it seems less cumbersome to me.
> But it's Cedric's call at the end.
As I wrote above "I don't have a strong opinion on this particular
detail" - I'm okay with moving these new variables to a dedicated
struct.
I guess this means we settled on *not* changing the semantics of
{load,save}_cleanup SaveVMHandlers - that was the important
decision for me.
> Thanks.
>
>
Thanks,
Maciej
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 18/24] vfio/migration: Don't run load cleanup if load setup didn't run
2024-12-12 22:52 ` Maciej S. Szmigiero
@ 2024-12-19 9:19 ` Cédric Le Goater
0 siblings, 0 replies; 140+ messages in thread
From: Cédric Le Goater @ 2024-12-19 9:19 UTC (permalink / raw)
To: Maciej S. Szmigiero, Avihai Horon
Cc: Alex Williamson, Eric Blake, Peter Xu, Fabiano Rosas,
Markus Armbruster, Daniel P . Berrangé, Joao Martins,
qemu-devel
On 12/12/24 23:52, Maciej S. Szmigiero wrote:
> On 12.12.2024 15:30, Avihai Horon wrote:
>>
>> On 11/12/2024 1:04, Maciej S. Szmigiero wrote:
>>> On 3.12.2024 16:09, Avihai Horon wrote:
>>>>
>>>> On 29/11/2024 19:15, Maciej S. Szmigiero wrote:
>>>>> On 29.11.2024 15:08, Cédric Le Goater wrote:
>>>>>> On 11/17/24 20:20, Maciej S. Szmigiero wrote:
>>>>>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>>>>>
>>>>>>> It's possible for load_cleanup SaveVMHandler to get called without
>>>>>>> load_setup handler being called first.
>>>>>>>
>>>>>>> Since we'll soon be running cleanup operations there that access objects
>>>>>>> that need earlier initialization in load_setup, let's make sure these
>>>>>>> cleanups only run when the load_setup handler has indeed been called
>>>>>>> earlier.
>>>>>>>
>>>>>>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>>>>>>
>>>>>> tbh, that's a bit ugly. I agree it's similar to those 'bool initialized'
>>>>>> attributes we have in some structs, so nothing new or really wrong.
>>>>>> But it does look like a workaround for a problem or cleanups missing
>>>>>> that would need time to untangle.
>>>>>>
>>>>>> I would prefer to avoid this change and address the issue from the
>>>>>> migration subsystem if possible.
>>>>>
>>>>> While it would be pretty simple to only call {load,save}_cleanup
>>>>> SaveVMHandlers when the relevant {load,save}_setup handler was
>>>>> successfully called first this would amount to a change of these
>>>>> handler semantics.
>>>>>
>>>>> This would risk introducing regressions - for example vfio_save_setup()
>>>>> doesn't clean up (free) newly allocated migration->data_buffer
>>>>> if vfio_migration_set_state() were to fail later in this handler
>>>>> and relies on an unconditional call to vfio_save_cleanup() in
>>>>> order to clean it up.
>>>>>
>>>>> There might be similar issues in other drivers too.
>>>>
>>>> We can put all objects related to multifd load in their own struct (as suggested by Cedric in patch #22) and allocate the struct only if multifd device state transfer is used.
>>>> Then in the cleanup flow we clean the struct only if it was allocated.
>>>>
>>>> This way we don't need to add the load_setup flag and we can keep the SaveVMHandlers semantics as is.
>>>>
>>>> Do you think this will be OK?
>>>
>>> I think the discussion here is more about whether we refactor the
>>> {load,save}_cleanup handler semantics to a "cleaner" design where
>>> these handlers are only called if the relevant {load,save}_setup
>>> handler was successfully called first (but at the same time risk
>>> introducing regressions).
>>
>> Yes, and I agree with you that changing the semantics of SaveVMHandlers can be risky and may deserve a series of its own.
>> But Cedric didn't like the flag option, so I suggested to do what we usually do, AFAIU, which is to check if the structs are allocated and need cleanup.
>>
>>>
>>>
>>> If we keep the existing semantics of these handlers (like this
>>> patch set did) then it is just an implementation detail whether
>>> we keep an explicit flag like "migration->load_setup" or have
>>> a struct pointer that serves as an implicit equivalent flag
>>> (when not NULL) - I don't have a strong opinion on this particular
>>> detail.
>>>
>> I prefer the struct pointer way, it seems less cumbersome to me.
>> But it's Cedric's call at the end.
>
> As I wrote above "I don't have a strong opinion on this particular
> detail" - I'm okay with moving these new variables to a dedicated
> struct.
I would prefer that, to isolate multifd migration support from the rest.
> I guess this means we settled on *not* changing the semantics of
> {load,save}_cleanup SaveVMHandlers - that was the important
> decision for me.
Handling errors locally in SaveVMHandlers and unrolling what was done
previously is better practice than relying on another callback to do
the cleanup.
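For the vfio_save_setup() case mentioned earlier in this thread, the
local-unroll style would look roughly like this (a simplified sketch
only, not the actual code; the state transition shown is illustrative):

static int vfio_save_setup(QEMUFile *f, void *opaque, Error **errp)
{
    VFIODevice *vbasedev = opaque;
    VFIOMigration *migration = vbasedev->migration;
    int ret;

    migration->data_buffer = g_malloc0(migration->data_buffer_size);

    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_PRE_COPY,
                                   migration->device_state, errp);
    if (ret) {
        /* Unroll locally instead of relying on save_cleanup. */
        g_clear_pointer(&migration->data_buffer, g_free);
        return ret;
    }

    return 0;
}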
Let's see when v4 comes out.
Thanks,
C.
^ permalink raw reply [flat|nested] 140+ messages in thread
* [PATCH v3 19/24] vfio/migration: Add x-migration-multifd-transfer VFIO property
2024-11-17 19:19 [PATCH v3 00/24] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
` (17 preceding siblings ...)
2024-11-17 19:20 ` [PATCH v3 18/24] vfio/migration: Don't run load cleanup if load setup didn't run Maciej S. Szmigiero
@ 2024-11-17 19:20 ` Maciej S. Szmigiero
2024-11-29 14:11 ` Cédric Le Goater
2024-11-17 19:20 ` [PATCH v3 20/24] vfio/migration: Add load_device_config_state_start trace event Maciej S. Szmigiero
` (6 subsequent siblings)
25 siblings, 1 reply; 140+ messages in thread
From: Maciej S. Szmigiero @ 2024-11-17 19:20 UTC (permalink / raw)
To: Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
This property allows configuring at runtime whether to transfer the
particular device state via multifd channels when live migrating that
device.
It defaults to AUTO, which means that VFIO device state transfer via
multifd channels is attempted in configurations that otherwise support it.
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
hw/core/machine.c | 1 +
hw/vfio/pci.c | 9 +++++++++
include/hw/vfio/vfio-common.h | 1 +
3 files changed, 11 insertions(+)
diff --git a/hw/core/machine.c b/hw/core/machine.c
index ed8d39fd769f..fda0f8280edd 100644
--- a/hw/core/machine.c
+++ b/hw/core/machine.c
@@ -39,6 +39,7 @@
GlobalProperty hw_compat_9_1[] = {
{ TYPE_PCI_DEVICE, "x-pcie-ext-tag", "false" },
{ "migration", "send-switchover-start", "off"},
+ { "vfio-pci", "x-migration-multifd-transfer", "off" },
};
const size_t hw_compat_9_1_len = G_N_ELEMENTS(hw_compat_9_1);
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 14bcc725c301..9d547cb5cdff 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -3354,6 +3354,8 @@ static void vfio_instance_init(Object *obj)
pci_dev->cap_present |= QEMU_PCI_CAP_EXPRESS;
}
+static PropertyInfo qdev_prop_on_off_auto_mutable;
+
static Property vfio_pci_dev_properties[] = {
DEFINE_PROP_PCI_HOST_DEVADDR("host", VFIOPCIDevice, host),
DEFINE_PROP_UUID_NODEFAULT("vf-token", VFIOPCIDevice, vf_token),
@@ -3378,6 +3380,10 @@ static Property vfio_pci_dev_properties[] = {
VFIO_FEATURE_ENABLE_IGD_OPREGION_BIT, false),
DEFINE_PROP_ON_OFF_AUTO("enable-migration", VFIOPCIDevice,
vbasedev.enable_migration, ON_OFF_AUTO_AUTO),
+ DEFINE_PROP("x-migration-multifd-transfer", VFIOPCIDevice,
+ vbasedev.migration_multifd_transfer,
+ qdev_prop_on_off_auto_mutable, OnOffAuto,
+ .set_default = true, .defval.i = ON_OFF_AUTO_AUTO),
DEFINE_PROP_BOOL("migration-events", VFIOPCIDevice,
vbasedev.migration_events, false),
DEFINE_PROP_BOOL("x-no-mmap", VFIOPCIDevice, vbasedev.no_mmap, false),
@@ -3475,6 +3481,9 @@ static const TypeInfo vfio_pci_nohotplug_dev_info = {
static void register_vfio_pci_dev_type(void)
{
+ qdev_prop_on_off_auto_mutable = qdev_prop_on_off_auto;
+ qdev_prop_on_off_auto_mutable.realized_set_allowed = true;
+
type_register_static(&vfio_pci_dev_info);
type_register_static(&vfio_pci_nohotplug_dev_info);
}
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 246250ed8b75..b1c03a82eec8 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -134,6 +134,7 @@ typedef struct VFIODevice {
bool no_mmap;
bool ram_block_discard_allowed;
OnOffAuto enable_migration;
+ OnOffAuto migration_multifd_transfer;
bool migration_events;
VFIODeviceOps *ops;
unsigned int num_irqs;
^ permalink raw reply related [flat|nested] 140+ messages in thread
* Re: [PATCH v3 19/24] vfio/migration: Add x-migration-multifd-transfer VFIO property
2024-11-17 19:20 ` [PATCH v3 19/24] vfio/migration: Add x-migration-multifd-transfer VFIO property Maciej S. Szmigiero
@ 2024-11-29 14:11 ` Cédric Le Goater
2024-11-29 17:15 ` Maciej S. Szmigiero
0 siblings, 1 reply; 140+ messages in thread
From: Cédric Le Goater @ 2024-11-29 14:11 UTC (permalink / raw)
To: Maciej S. Szmigiero, Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Eric Blake, Markus Armbruster,
Daniel P . Berrangé, Avihai Horon, Joao Martins, qemu-devel
On 11/17/24 20:20, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>
> This property allows configuring at runtime whether to transfer the
> particular device state via multifd channels when live migrating that
> device.
>
> It defaults to AUTO, which means that VFIO device state transfer via
> multifd channels is attempted in configurations that otherwise support it.
>
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> ---
> hw/core/machine.c | 1 +
> hw/vfio/pci.c | 9 +++++++++
> include/hw/vfio/vfio-common.h | 1 +
> 3 files changed, 11 insertions(+)
>
> diff --git a/hw/core/machine.c b/hw/core/machine.c
> index ed8d39fd769f..fda0f8280edd 100644
> --- a/hw/core/machine.c
> +++ b/hw/core/machine.c
> @@ -39,6 +39,7 @@
> GlobalProperty hw_compat_9_1[] = {
> { TYPE_PCI_DEVICE, "x-pcie-ext-tag", "false" },
> { "migration", "send-switchover-start", "off"},
> + { "vfio-pci", "x-migration-multifd-transfer", "off" },
Could you please move the compat changes into their own patch?
It's easier for backports.
> };
> const size_t hw_compat_9_1_len = G_N_ELEMENTS(hw_compat_9_1);
>
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 14bcc725c301..9d547cb5cdff 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -3354,6 +3354,8 @@ static void vfio_instance_init(Object *obj)
> pci_dev->cap_present |= QEMU_PCI_CAP_EXPRESS;
> }
>
> +static PropertyInfo qdev_prop_on_off_auto_mutable;
> +
> static Property vfio_pci_dev_properties[] = {
> DEFINE_PROP_PCI_HOST_DEVADDR("host", VFIOPCIDevice, host),
> DEFINE_PROP_UUID_NODEFAULT("vf-token", VFIOPCIDevice, vf_token),
> @@ -3378,6 +3380,10 @@ static Property vfio_pci_dev_properties[] = {
> VFIO_FEATURE_ENABLE_IGD_OPREGION_BIT, false),
> DEFINE_PROP_ON_OFF_AUTO("enable-migration", VFIOPCIDevice,
> vbasedev.enable_migration, ON_OFF_AUTO_AUTO),
> + DEFINE_PROP("x-migration-multifd-transfer", VFIOPCIDevice,
> + vbasedev.migration_multifd_transfer,
> + qdev_prop_on_off_auto_mutable, OnOffAuto,
> + .set_default = true, .defval.i = ON_OFF_AUTO_AUTO),
What are you trying to do that DEFINE_PROP_ON_OFF_AUTO() cannot satisfy?
Thanks,
C.
> DEFINE_PROP_BOOL("migration-events", VFIOPCIDevice,
> vbasedev.migration_events, false),
> DEFINE_PROP_BOOL("x-no-mmap", VFIOPCIDevice, vbasedev.no_mmap, false),
> @@ -3475,6 +3481,9 @@ static const TypeInfo vfio_pci_nohotplug_dev_info = {
>
> static void register_vfio_pci_dev_type(void)
> {
> + qdev_prop_on_off_auto_mutable = qdev_prop_on_off_auto;
> + qdev_prop_on_off_auto_mutable.realized_set_allowed = true;
> +
> type_register_static(&vfio_pci_dev_info);
> type_register_static(&vfio_pci_nohotplug_dev_info);
> }
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 246250ed8b75..b1c03a82eec8 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -134,6 +134,7 @@ typedef struct VFIODevice {
> bool no_mmap;
> bool ram_block_discard_allowed;
> OnOffAuto enable_migration;
> + OnOffAuto migration_multifd_transfer;
> bool migration_events;
> VFIODeviceOps *ops;
> unsigned int num_irqs;
>
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 19/24] vfio/migration: Add x-migration-multifd-transfer VFIO property
2024-11-29 14:11 ` Cédric Le Goater
@ 2024-11-29 17:15 ` Maciej S. Szmigiero
2024-12-19 9:37 ` Cédric Le Goater
0 siblings, 1 reply; 140+ messages in thread
From: Maciej S. Szmigiero @ 2024-11-29 17:15 UTC (permalink / raw)
To: Cédric Le Goater
Cc: Alex Williamson, Peter Xu, Fabiano Rosas, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On 29.11.2024 15:11, Cédric Le Goater wrote:
> On 11/17/24 20:20, Maciej S. Szmigiero wrote:
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> This property allows configuring at runtime whether to transfer the
>> particular device state via multifd channels when live migrating that
>> device.
>>
>> It defaults to AUTO, which means that VFIO device state transfer via
>> multifd channels is attempted in configurations that otherwise support it.
>>
>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>> ---
>> hw/core/machine.c | 1 +
>> hw/vfio/pci.c | 9 +++++++++
>> include/hw/vfio/vfio-common.h | 1 +
>> 3 files changed, 11 insertions(+)
>>
>> diff --git a/hw/core/machine.c b/hw/core/machine.c
>> index ed8d39fd769f..fda0f8280edd 100644
>> --- a/hw/core/machine.c
>> +++ b/hw/core/machine.c
>> @@ -39,6 +39,7 @@
>> GlobalProperty hw_compat_9_1[] = {
>> { TYPE_PCI_DEVICE, "x-pcie-ext-tag", "false" },
>> { "migration", "send-switchover-start", "off"},
>> + { "vfio-pci", "x-migration-multifd-transfer", "off" },
>
> Could you please move the compat changes into their own patch?
> It's easier for backports.
>
>> };
>> const size_t hw_compat_9_1_len = G_N_ELEMENTS(hw_compat_9_1);
>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>> index 14bcc725c301..9d547cb5cdff 100644
>> --- a/hw/vfio/pci.c
>> +++ b/hw/vfio/pci.c
>> @@ -3354,6 +3354,8 @@ static void vfio_instance_init(Object *obj)
>> pci_dev->cap_present |= QEMU_PCI_CAP_EXPRESS;
>> }
>> +static PropertyInfo qdev_prop_on_off_auto_mutable;
>> +
>> static Property vfio_pci_dev_properties[] = {
>> DEFINE_PROP_PCI_HOST_DEVADDR("host", VFIOPCIDevice, host),
>> DEFINE_PROP_UUID_NODEFAULT("vf-token", VFIOPCIDevice, vf_token),
>> @@ -3378,6 +3380,10 @@ static Property vfio_pci_dev_properties[] = {
>> VFIO_FEATURE_ENABLE_IGD_OPREGION_BIT, false),
>> DEFINE_PROP_ON_OFF_AUTO("enable-migration", VFIOPCIDevice,
>> vbasedev.enable_migration, ON_OFF_AUTO_AUTO),
>> + DEFINE_PROP("x-migration-multifd-transfer", VFIOPCIDevice,
>> + vbasedev.migration_multifd_transfer,
>> + qdev_prop_on_off_auto_mutable, OnOffAuto,
>> + .set_default = true, .defval.i = ON_OFF_AUTO_AUTO),
>
> What are you trying to do that DEFINE_PROP_ON_OFF_AUTO() cannot satisfy?
>
A DEFINE_PROP_ON_OFF_AUTO() property isn't runtime-mutable, so using it
would mean that the source VM would need to decide already at startup
time whether it wants to do a multifd device state transfer.
The source VM can run for a long time before being migrated, so it is
desirable to have a fallback mechanism to the old way of transferring
VFIO device state if it turns out to be necessary for some reason.
After all, ordinary migration parameters can be adjusted at run time
too.
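For example, with a runtime-mutable property, management software could
still turn the feature off right before migration via QOM (assuming a
vfio-pci device that was created with id=vfio0):

(qemu) qom-set /machine/peripheral/vfio0 x-migration-multifd-transfer off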
> Thanks,
>
> C.
Thanks,
Maciej
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 19/24] vfio/migration: Add x-migration-multifd-transfer VFIO property
2024-11-29 17:15 ` Maciej S. Szmigiero
@ 2024-12-19 9:37 ` Cédric Le Goater
0 siblings, 0 replies; 140+ messages in thread
From: Cédric Le Goater @ 2024-12-19 9:37 UTC (permalink / raw)
To: Maciej S. Szmigiero
Cc: Alex Williamson, Peter Xu, Fabiano Rosas, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On 11/29/24 18:15, Maciej S. Szmigiero wrote:
> On 29.11.2024 15:11, Cédric Le Goater wrote:
>> On 11/17/24 20:20, Maciej S. Szmigiero wrote:
>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>
>>> This property allows configuring at runtime whether to transfer the
>>> particular device state via multifd channels when live migrating that
>>> device.
>>>
>>> It defaults to AUTO, which means that VFIO device state transfer via
>>> multifd channels is attempted in configurations that otherwise support it.
>>>
>>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>>> ---
>>> hw/core/machine.c | 1 +
>>> hw/vfio/pci.c | 9 +++++++++
>>> include/hw/vfio/vfio-common.h | 1 +
>>> 3 files changed, 11 insertions(+)
>>>
>>> diff --git a/hw/core/machine.c b/hw/core/machine.c
>>> index ed8d39fd769f..fda0f8280edd 100644
>>> --- a/hw/core/machine.c
>>> +++ b/hw/core/machine.c
>>> @@ -39,6 +39,7 @@
>>> GlobalProperty hw_compat_9_1[] = {
>>> { TYPE_PCI_DEVICE, "x-pcie-ext-tag", "false" },
>>> { "migration", "send-switchover-start", "off"},
>>> + { "vfio-pci", "x-migration-multifd-transfer", "off" },
>>
>> Could you please move the compat changes into their own patch ?
>> It's easier for backports
>>
>>> };
>>> const size_t hw_compat_9_1_len = G_N_ELEMENTS(hw_compat_9_1);
>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>>> index 14bcc725c301..9d547cb5cdff 100644
>>> --- a/hw/vfio/pci.c
>>> +++ b/hw/vfio/pci.c
>>> @@ -3354,6 +3354,8 @@ static void vfio_instance_init(Object *obj)
>>> pci_dev->cap_present |= QEMU_PCI_CAP_EXPRESS;
>>> }
>>> +static PropertyInfo qdev_prop_on_off_auto_mutable;
>>> +
>>> static Property vfio_pci_dev_properties[] = {
>>> DEFINE_PROP_PCI_HOST_DEVADDR("host", VFIOPCIDevice, host),
>>> DEFINE_PROP_UUID_NODEFAULT("vf-token", VFIOPCIDevice, vf_token),
>>> @@ -3378,6 +3380,10 @@ static Property vfio_pci_dev_properties[] = {
>>> VFIO_FEATURE_ENABLE_IGD_OPREGION_BIT, false),
>>> DEFINE_PROP_ON_OFF_AUTO("enable-migration", VFIOPCIDevice,
>>> vbasedev.enable_migration, ON_OFF_AUTO_AUTO),
>>> + DEFINE_PROP("x-migration-multifd-transfer", VFIOPCIDevice,
>>> + vbasedev.migration_multifd_transfer,
>>> + qdev_prop_on_off_auto_mutable, OnOffAuto,
>>> + .set_default = true, .defval.i = ON_OFF_AUTO_AUTO),
>>
>> What are you trying to do that DEFINE_PROP_ON_OFF_AUTO() can not satisfy ?
>>
>
> A DEFINE_PROP_ON_OFF_AUTO() property isn't runtime-mutable, so using it
> would mean that the source VM would need to decide already at startup
> time whether it wants to do a multifd device state transfer.
>
> The source VM can run for a long time before being migrated, so it is
> desirable to have a fallback mechanism to the old way of transferring
> VFIO device state if it turns out to be necessary for some reason.
>
> After all, ordinary migration parameters can be adjusted at run time
> too.
I see. I don't think it works this way. Anyhow, it won't compile anymore
with upstream so this part needs to be reworked. Let's keep it in mind
and make it simpler first, that is, rely on the values of
vfio_multifd_transfer_supported() and "x-migration-multifd-transfer".
Thanks,
C.
^ permalink raw reply [flat|nested] 140+ messages in thread
* [PATCH v3 20/24] vfio/migration: Add load_device_config_state_start trace event
2024-11-17 19:19 [PATCH v3 00/24] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
` (18 preceding siblings ...)
2024-11-17 19:20 ` [PATCH v3 19/24] vfio/migration: Add x-migration-multifd-transfer VFIO property Maciej S. Szmigiero
@ 2024-11-17 19:20 ` Maciej S. Szmigiero
2024-11-29 14:26 ` Cédric Le Goater
2024-11-17 19:20 ` [PATCH v3 21/24] vfio/migration: Convert bytes_transferred counter to atomic Maciej S. Szmigiero
` (5 subsequent siblings)
25 siblings, 1 reply; 140+ messages in thread
From: Maciej S. Szmigiero @ 2024-11-17 19:20 UTC (permalink / raw)
To: Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
And rename the existing load_device_config_state trace event to
load_device_config_state_end for consistency, since it is triggered at the
end of loading the VFIO device config state.
This way both the start and end points of a particular device config
loading operation (a long, BQL-serialized operation) are known.
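Both trace points can then be watched together with a single pattern on
the QEMU command line (using the usual -trace glob syntax; the exact
invocation below is an illustration, not part of this patch):

-trace 'vfio_load_device_config_state_*'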
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
hw/vfio/migration.c | 4 +++-
hw/vfio/trace-events | 3 ++-
2 files changed, 5 insertions(+), 2 deletions(-)
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 9e2657073012..4b2b06b45195 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -285,6 +285,8 @@ static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
VFIODevice *vbasedev = opaque;
uint64_t data;
+ trace_vfio_load_device_config_state_start(vbasedev->name);
+
if (vbasedev->ops && vbasedev->ops->vfio_load_config) {
int ret;
@@ -303,7 +305,7 @@ static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
return -EINVAL;
}
- trace_vfio_load_device_config_state(vbasedev->name);
+ trace_vfio_load_device_config_state_end(vbasedev->name);
return qemu_file_get_error(f);
}
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index cab1cf1de0a2..1bebe9877d88 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -149,7 +149,8 @@ vfio_display_edid_write_error(void) ""
# migration.c
vfio_load_cleanup(const char *name) " (%s)"
-vfio_load_device_config_state(const char *name) " (%s)"
+vfio_load_device_config_state_start(const char *name) " (%s)"
+vfio_load_device_config_state_end(const char *name) " (%s)"
vfio_load_state(const char *name, uint64_t data) " (%s) data 0x%"PRIx64
vfio_load_state_device_data(const char *name, uint64_t data_size, int ret) " (%s) size %"PRIu64" ret %d"
vfio_migration_realize(const char *name) " (%s)"
^ permalink raw reply related [flat|nested] 140+ messages in thread
* Re: [PATCH v3 20/24] vfio/migration: Add load_device_config_state_start trace event
2024-11-17 19:20 ` [PATCH v3 20/24] vfio/migration: Add load_device_config_state_start trace event Maciej S. Szmigiero
@ 2024-11-29 14:26 ` Cédric Le Goater
0 siblings, 0 replies; 140+ messages in thread
From: Cédric Le Goater @ 2024-11-29 14:26 UTC (permalink / raw)
To: Maciej S. Szmigiero, Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Eric Blake, Markus Armbruster,
Daniel P . Berrangé, Avihai Horon, Joao Martins, qemu-devel
On 11/17/24 20:20, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>
> And rename the existing load_device_config_state trace event to
> load_device_config_state_end for consistency, since it is triggered at the
> end of loading the VFIO device config state.
>
> This way both the start and end points of a particular device config
> loading operation (a long, BQL-serialized operation) are known.
>
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
I think we should add more trace events regarding the new threads
this series is adding, at all levels:
hw/vfio/trace-events
migration/trace-events
util/trace-events
Some time ago, Peter proposed a series adding an "info migrationthreads"
HMP command [*]. I found it useful for dev/debug. I wonder about its
status.
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Thanks,
C.
[*] migration: query-migrationthreads enhancements and cleanups
https://lore.kernel.org/all/20240930195837.825728-1-peterx@redhat.com/
> ---
> hw/vfio/migration.c | 4 +++-
> hw/vfio/trace-events | 3 ++-
> 2 files changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 9e2657073012..4b2b06b45195 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -285,6 +285,8 @@ static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
> VFIODevice *vbasedev = opaque;
> uint64_t data;
>
> + trace_vfio_load_device_config_state_start(vbasedev->name);
> +
> if (vbasedev->ops && vbasedev->ops->vfio_load_config) {
> int ret;
>
> @@ -303,7 +305,7 @@ static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
> return -EINVAL;
> }
>
> - trace_vfio_load_device_config_state(vbasedev->name);
> + trace_vfio_load_device_config_state_end(vbasedev->name);
> return qemu_file_get_error(f);
> }
>
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index cab1cf1de0a2..1bebe9877d88 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -149,7 +149,8 @@ vfio_display_edid_write_error(void) ""
>
> # migration.c
> vfio_load_cleanup(const char *name) " (%s)"
> -vfio_load_device_config_state(const char *name) " (%s)"
> +vfio_load_device_config_state_start(const char *name) " (%s)"
> +vfio_load_device_config_state_end(const char *name) " (%s)"
> vfio_load_state(const char *name, uint64_t data) " (%s) data 0x%"PRIx64
> vfio_load_state_device_data(const char *name, uint64_t data_size, int ret) " (%s) size %"PRIu64" ret %d"
> vfio_migration_realize(const char *name) " (%s)"
>
^ permalink raw reply [flat|nested] 140+ messages in thread
* [PATCH v3 21/24] vfio/migration: Convert bytes_transferred counter to atomic
2024-11-17 19:19 [PATCH v3 00/24] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
` (19 preceding siblings ...)
2024-11-17 19:20 ` [PATCH v3 20/24] vfio/migration: Add load_device_config_state_start trace event Maciej S. Szmigiero
@ 2024-11-17 19:20 ` Maciej S. Szmigiero
2024-11-17 19:20 ` [PATCH v3 22/24] vfio/migration: Multifd device state transfer support - receive side Maciej S. Szmigiero
` (4 subsequent siblings)
25 siblings, 0 replies; 140+ messages in thread
From: Maciej S. Szmigiero @ 2024-11-17 19:20 UTC (permalink / raw)
To: Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
So it can be safely accessed from multiple threads.
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
hw/vfio/migration.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 4b2b06b45195..683f2ae98d5e 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -391,7 +391,7 @@ static ssize_t vfio_save_block(QEMUFile *f, VFIOMigration *migration)
qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
qemu_put_be64(f, data_size);
qemu_put_buffer(f, migration->data_buffer, data_size);
- bytes_transferred += data_size;
+ qatomic_add(&bytes_transferred, data_size);
trace_vfio_save_block(migration->vbasedev->name, data_size);
@@ -1030,12 +1030,12 @@ static int vfio_block_migration(VFIODevice *vbasedev, Error *err, Error **errp)
int64_t vfio_mig_bytes_transferred(void)
{
- return bytes_transferred;
+ return qatomic_read(&bytes_transferred);
}
void vfio_reset_bytes_transferred(void)
{
- bytes_transferred = 0;
+ qatomic_set(&bytes_transferred, 0);
}
/*
^ permalink raw reply related [flat|nested] 140+ messages in thread
* [PATCH v3 22/24] vfio/migration: Multifd device state transfer support - receive side
2024-11-17 19:19 [PATCH v3 00/24] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
` (20 preceding siblings ...)
2024-11-17 19:20 ` [PATCH v3 21/24] vfio/migration: Convert bytes_transferred counter to atomic Maciej S. Szmigiero
@ 2024-11-17 19:20 ` Maciej S. Szmigiero
2024-12-02 17:56 ` Cédric Le Goater
2024-12-09 9:13 ` Avihai Horon
2024-11-17 19:20 ` [PATCH v3 23/24] migration/qemu-file: Define g_autoptr() cleanup function for QEMUFile Maciej S. Szmigiero
` (3 subsequent siblings)
25 siblings, 2 replies; 140+ messages in thread
From: Maciej S. Szmigiero @ 2024-11-17 19:20 UTC (permalink / raw)
To: Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
The multifd-received data needs to be reassembled since device state
packets sent via different multifd channels can arrive out-of-order.
Therefore, each VFIO device state packet carries a header indicating its
position in the stream.
The last such VFIO device state packet should have the
VFIO_DEVICE_STATE_CONFIG_STATE flag set and carry the device config state.
Since it's important to finish loading the device state transferred via
the main migration channel (via the save_live_iterate SaveVMHandler)
before starting to load the data asynchronously transferred via multifd,
the thread doing the actual loading of the multifd-transferred data is
only started from the switchover_start SaveVMHandler.
The switchover_start handler is called when the MIG_CMD_SWITCHOVER_START
sub-command of QEMU_VM_COMMAND is received via the main migration channel.
This sub-command is only sent after all save_live_iterate data have
already been posted, so it is safe to commence loading of the
multifd-transferred device state upon receiving it - loading of
save_live_iterate data happens synchronously in the main migration
thread (much like the processing of MIG_CMD_SWITCHOVER_START), so by
the time MIG_CMD_SWITCHOVER_START is processed all the preceding data
must have already been loaded.
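For reference, the framing this implies on the send side looks roughly
like the sketch below (illustrative only - the actual sender lives in an
earlier patch of this series; endianness handling is omitted and the
helper name is hypothetical):

static bool my_send_state_packet(char *idstr, uint32_t instance_id,
                                 uint32_t idx, bool is_config,
                                 const void *buf, size_t len)
{
    g_autofree VFIODeviceStatePacket *packet =
        g_malloc0(sizeof(*packet) + len);

    packet->version = 0;   /* only version 0 is defined so far */
    packet->idx = idx;     /* position in this device's stream */
    packet->flags = is_config ? VFIO_DEVICE_STATE_CONFIG_STATE : 0;
    memcpy(packet->data, buf, len);

    /* Queue on any multifd channel; the receiver reorders by idx. */
    return multifd_queue_device_state(idstr, instance_id,
                                      (char *)packet,
                                      sizeof(*packet) + len);
}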
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
hw/vfio/migration.c | 402 ++++++++++++++++++++++++++++++++++
hw/vfio/pci.c | 2 +
hw/vfio/trace-events | 6 +
include/hw/vfio/vfio-common.h | 19 ++
4 files changed, 429 insertions(+)
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 683f2ae98d5e..b54879fe6209 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -15,6 +15,7 @@
#include <linux/vfio.h>
#include <sys/ioctl.h>
+#include "io/channel-buffer.h"
#include "sysemu/runstate.h"
#include "hw/vfio/vfio-common.h"
#include "migration/misc.h"
@@ -55,6 +56,15 @@
*/
#define VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE (1 * MiB)
+#define VFIO_DEVICE_STATE_CONFIG_STATE (1)
+
+typedef struct VFIODeviceStatePacket {
+ uint32_t version;
+ uint32_t idx;
+ uint32_t flags;
+ uint8_t data[0];
+} QEMU_PACKED VFIODeviceStatePacket;
+
static int64_t bytes_transferred;
static const char *mig_state_to_str(enum vfio_device_mig_state state)
@@ -254,6 +264,292 @@ static int vfio_load_buffer(QEMUFile *f, VFIODevice *vbasedev,
return ret;
}
+typedef struct VFIOStateBuffer {
+ bool is_present;
+ char *data;
+ size_t len;
+} VFIOStateBuffer;
+
+static void vfio_state_buffer_clear(gpointer data)
+{
+ VFIOStateBuffer *lb = data;
+
+ if (!lb->is_present) {
+ return;
+ }
+
+ g_clear_pointer(&lb->data, g_free);
+ lb->is_present = false;
+}
+
+static void vfio_state_buffers_init(VFIOStateBuffers *bufs)
+{
+ bufs->array = g_array_new(FALSE, TRUE, sizeof(VFIOStateBuffer));
+ g_array_set_clear_func(bufs->array, vfio_state_buffer_clear);
+}
+
+static void vfio_state_buffers_destroy(VFIOStateBuffers *bufs)
+{
+ g_clear_pointer(&bufs->array, g_array_unref);
+}
+
+static void vfio_state_buffers_assert_init(VFIOStateBuffers *bufs)
+{
+ assert(bufs->array);
+}
+
+static guint vfio_state_buffers_size_get(VFIOStateBuffers *bufs)
+{
+ return bufs->array->len;
+}
+
+static void vfio_state_buffers_size_set(VFIOStateBuffers *bufs, guint size)
+{
+ g_array_set_size(bufs->array, size);
+}
+
+static VFIOStateBuffer *vfio_state_buffers_at(VFIOStateBuffers *bufs, guint idx)
+{
+ return &g_array_index(bufs->array, VFIOStateBuffer, idx);
+}
+
+static int vfio_load_state_buffer(void *opaque, char *data, size_t data_size,
+ Error **errp)
+{
+ VFIODevice *vbasedev = opaque;
+ VFIOMigration *migration = vbasedev->migration;
+ VFIODeviceStatePacket *packet = (VFIODeviceStatePacket *)data;
+ VFIOStateBuffer *lb;
+
+ /*
+ * Holding BQL here would violate the lock order and can cause
+ * a deadlock once we attempt to lock load_bufs_mutex below.
+ */
+ assert(!bql_locked());
+
+ if (!migration->multifd_transfer) {
+ error_setg(errp,
+ "got device state packet but not doing multifd transfer");
+ return -1;
+ }
+
+ if (data_size < sizeof(*packet)) {
+ error_setg(errp, "packet too short at %zu (min is %zu)",
+ data_size, sizeof(*packet));
+ return -1;
+ }
+
+ if (packet->version != 0) {
+ error_setg(errp, "packet has unknown version %" PRIu32,
+ packet->version);
+ return -1;
+ }
+
+ if (packet->idx == UINT32_MAX) {
+ error_setg(errp, "packet has too high idx %" PRIu32,
+ packet->idx);
+ return -1;
+ }
+
+ trace_vfio_load_state_device_buffer_incoming(vbasedev->name, packet->idx);
+
+ QEMU_LOCK_GUARD(&migration->load_bufs_mutex);
+
+ /* config state packet should be the last one in the stream */
+ if (packet->flags & VFIO_DEVICE_STATE_CONFIG_STATE) {
+ migration->load_buf_idx_last = packet->idx;
+ }
+
+ vfio_state_buffers_assert_init(&migration->load_bufs);
+ if (packet->idx >= vfio_state_buffers_size_get(&migration->load_bufs)) {
+ vfio_state_buffers_size_set(&migration->load_bufs, packet->idx + 1);
+ }
+
+ lb = vfio_state_buffers_at(&migration->load_bufs, packet->idx);
+ if (lb->is_present) {
+ error_setg(errp, "state buffer %" PRIu32 " already filled",
+ packet->idx);
+ return -1;
+ }
+
+ assert(packet->idx >= migration->load_buf_idx);
+
+ migration->load_buf_queued_pending_buffers++;
+ if (migration->load_buf_queued_pending_buffers >
+ vbasedev->migration_max_queued_buffers) {
+ error_setg(errp,
+ "queuing state buffer %" PRIu32 " would exceed the max of %" PRIu64,
+ packet->idx, vbasedev->migration_max_queued_buffers);
+ return -1;
+ }
+
+ lb->data = g_memdup2(&packet->data, data_size - sizeof(*packet));
+ lb->len = data_size - sizeof(*packet);
+ lb->is_present = true;
+
+ qemu_cond_signal(&migration->load_bufs_buffer_ready_cond);
+
+ return 0;
+}
+
+static int vfio_load_device_config_state(QEMUFile *f, void *opaque);
+
+static int vfio_load_bufs_thread_load_config(VFIODevice *vbasedev)
+{
+ VFIOMigration *migration = vbasedev->migration;
+ VFIOStateBuffer *lb;
+ g_autoptr(QIOChannelBuffer) bioc = NULL;
+ QEMUFile *f_out = NULL, *f_in = NULL;
+ uint64_t mig_header;
+ int ret;
+
+ assert(migration->load_buf_idx == migration->load_buf_idx_last);
+ lb = vfio_state_buffers_at(&migration->load_bufs, migration->load_buf_idx);
+ assert(lb->is_present);
+
+ bioc = qio_channel_buffer_new(lb->len);
+ qio_channel_set_name(QIO_CHANNEL(bioc), "vfio-device-config-load");
+
+ f_out = qemu_file_new_output(QIO_CHANNEL(bioc));
+ qemu_put_buffer(f_out, (uint8_t *)lb->data, lb->len);
+
+ ret = qemu_fflush(f_out);
+ if (ret) {
+ g_clear_pointer(&f_out, qemu_fclose);
+ return ret;
+ }
+
+ qio_channel_io_seek(QIO_CHANNEL(bioc), 0, 0, NULL);
+ f_in = qemu_file_new_input(QIO_CHANNEL(bioc));
+
+ mig_header = qemu_get_be64(f_in);
+ if (mig_header != VFIO_MIG_FLAG_DEV_CONFIG_STATE) {
+ g_clear_pointer(&f_out, qemu_fclose);
+ g_clear_pointer(&f_in, qemu_fclose);
+ return -EINVAL;
+ }
+
+ bql_lock();
+ ret = vfio_load_device_config_state(f_in, vbasedev);
+ bql_unlock();
+
+ g_clear_pointer(&f_out, qemu_fclose);
+ g_clear_pointer(&f_in, qemu_fclose);
+ if (ret < 0) {
+ return ret;
+ }
+
+ return 0;
+}
+
+static bool vfio_load_bufs_thread_want_abort(VFIODevice *vbasedev,
+ bool *abort_flag)
+{
+ VFIOMigration *migration = vbasedev->migration;
+
+ return migration->load_bufs_thread_want_exit || qatomic_read(abort_flag);
+}
+
+static int vfio_load_bufs_thread(bool *abort_flag, void *opaque)
+{
+ VFIODevice *vbasedev = opaque;
+ VFIOMigration *migration = vbasedev->migration;
+ QEMU_LOCK_GUARD(&migration->load_bufs_mutex);
+ int ret;
+
+ assert(migration->load_bufs_thread_running);
+
+ while (!vfio_load_bufs_thread_want_abort(vbasedev, abort_flag)) {
+ VFIOStateBuffer *lb;
+ guint bufs_len;
+ bool starved;
+
+ assert(migration->load_buf_idx <= migration->load_buf_idx_last);
+
+ bufs_len = vfio_state_buffers_size_get(&migration->load_bufs);
+ if (migration->load_buf_idx >= bufs_len) {
+ assert(migration->load_buf_idx == bufs_len);
+ starved = true;
+ } else {
+ lb = vfio_state_buffers_at(&migration->load_bufs,
+ migration->load_buf_idx);
+ starved = !lb->is_present;
+ }
+
+ if (starved) {
+ trace_vfio_load_state_device_buffer_starved(vbasedev->name,
+ migration->load_buf_idx);
+ qemu_cond_wait(&migration->load_bufs_buffer_ready_cond,
+ &migration->load_bufs_mutex);
+ continue;
+ }
+
+ if (migration->load_buf_idx == migration->load_buf_idx_last) {
+ break;
+ }
+
+ if (migration->load_buf_idx == 0) {
+ trace_vfio_load_state_device_buffer_start(vbasedev->name);
+ }
+
+ if (lb->len) {
+ g_autofree char *buf = NULL;
+ size_t buf_len;
+ ssize_t wr_ret;
+ int errno_save;
+
+ trace_vfio_load_state_device_buffer_load_start(vbasedev->name,
+ migration->load_buf_idx);
+
+ /* lb might become re-allocated when we drop the lock */
+ buf = g_steal_pointer(&lb->data);
+ buf_len = lb->len;
+
+ /*
+ * Loading data to the device takes a while,
+ * drop the lock during this process.
+ */
+ qemu_mutex_unlock(&migration->load_bufs_mutex);
+ wr_ret = write(migration->data_fd, buf, buf_len);
+ errno_save = errno;
+ qemu_mutex_lock(&migration->load_bufs_mutex);
+
+ if (wr_ret < 0) {
+ ret = -errno_save;
+ goto ret_signal;
+ } else if (wr_ret < buf_len) {
+ ret = -EINVAL;
+ goto ret_signal;
+ }
+
+ trace_vfio_load_state_device_buffer_load_end(vbasedev->name,
+ migration->load_buf_idx);
+ }
+
+ assert(migration->load_buf_queued_pending_buffers > 0);
+ migration->load_buf_queued_pending_buffers--;
+
+ if (migration->load_buf_idx == migration->load_buf_idx_last - 1) {
+ trace_vfio_load_state_device_buffer_end(vbasedev->name);
+ }
+
+ migration->load_buf_idx++;
+ }
+
+ if (vfio_load_bufs_thread_want_abort(vbasedev, abort_flag)) {
+ ret = -ECANCELED;
+ goto ret_signal;
+ }
+
+ ret = vfio_load_bufs_thread_load_config(vbasedev);
+
+ret_signal:
+ migration->load_bufs_thread_running = false;
+ qemu_cond_signal(&migration->load_bufs_thread_finished_cond);
+
+ return ret;
+}
+
static int vfio_save_device_config_state(QEMUFile *f, void *opaque,
Error **errp)
{
@@ -430,6 +726,12 @@ static bool vfio_precopy_supported(VFIODevice *vbasedev)
return migration->mig_flags & VFIO_MIGRATION_PRE_COPY;
}
+static bool vfio_multifd_transfer_supported(void)
+{
+ return migration_has_device_state_support() &&
+ migrate_send_switchover_start();
+}
+
/* ---------------------------------------------------------------------- */
static int vfio_save_prepare(void *opaque, Error **errp)
@@ -695,17 +997,73 @@ static int vfio_load_setup(QEMUFile *f, void *opaque, Error **errp)
assert(!migration->load_setup);
+ /*
+ * Make a copy of this setting at the start in case it is changed
+ * mid-migration.
+ */
+ if (vbasedev->migration_multifd_transfer == ON_OFF_AUTO_AUTO) {
+ migration->multifd_transfer = vfio_multifd_transfer_supported();
+ } else {
+ migration->multifd_transfer =
+ vbasedev->migration_multifd_transfer == ON_OFF_AUTO_ON;
+ }
+
+ if (migration->multifd_transfer && !vfio_multifd_transfer_supported()) {
+ error_setg(errp,
+ "%s: Multifd device transfer requested but unsupported in the current config",
+ vbasedev->name);
+ return -EINVAL;
+ }
+
ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RESUMING,
migration->device_state, errp);
if (ret) {
return ret;
}
+ if (migration->multifd_transfer) {
+ assert(!migration->load_bufs.array);
+ vfio_state_buffers_init(&migration->load_bufs);
+
+ qemu_mutex_init(&migration->load_bufs_mutex);
+
+ migration->load_buf_idx = 0;
+ migration->load_buf_idx_last = UINT32_MAX;
+ migration->load_buf_queued_pending_buffers = 0;
+ qemu_cond_init(&migration->load_bufs_buffer_ready_cond);
+
+ migration->load_bufs_thread_running = false;
+ migration->load_bufs_thread_want_exit = false;
+ qemu_cond_init(&migration->load_bufs_thread_finished_cond);
+ }
+
migration->load_setup = true;
return 0;
}
+static void vfio_load_cleanup_load_bufs_thread(VFIODevice *vbasedev)
+{
+ VFIOMigration *migration = vbasedev->migration;
+
+ /* The lock order is load_bufs_mutex -> BQL so unlock BQL here first */
+ bql_unlock();
+ WITH_QEMU_LOCK_GUARD(&migration->load_bufs_mutex) {
+ if (!migration->load_bufs_thread_running) {
+ break;
+ }
+
+ migration->load_bufs_thread_want_exit = true;
+
+ qemu_cond_signal(&migration->load_bufs_buffer_ready_cond);
+ qemu_cond_wait(&migration->load_bufs_thread_finished_cond,
+ &migration->load_bufs_mutex);
+
+ assert(!migration->load_bufs_thread_running);
+ }
+ bql_lock();
+}
+
static int vfio_load_cleanup(void *opaque)
{
VFIODevice *vbasedev = opaque;
@@ -715,7 +1073,19 @@ static int vfio_load_cleanup(void *opaque)
return 0;
}
+ if (migration->multifd_transfer) {
+ vfio_load_cleanup_load_bufs_thread(vbasedev);
+ }
+
vfio_migration_cleanup(vbasedev);
+
+ if (migration->multifd_transfer) {
+ qemu_cond_destroy(&migration->load_bufs_thread_finished_cond);
+ vfio_state_buffers_destroy(&migration->load_bufs);
+ qemu_cond_destroy(&migration->load_bufs_buffer_ready_cond);
+ qemu_mutex_destroy(&migration->load_bufs_mutex);
+ }
+
migration->load_setup = false;
trace_vfio_load_cleanup(vbasedev->name);
@@ -725,6 +1095,7 @@ static int vfio_load_cleanup(void *opaque)
static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
{
VFIODevice *vbasedev = opaque;
+ VFIOMigration *migration = vbasedev->migration;
int ret = 0;
uint64_t data;
@@ -736,6 +1107,12 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
switch (data) {
case VFIO_MIG_FLAG_DEV_CONFIG_STATE:
{
+ if (migration->multifd_transfer) {
+ error_report("%s: got DEV_CONFIG_STATE but doing multifd transfer",
+ vbasedev->name);
+ return -EINVAL;
+ }
+
return vfio_load_device_config_state(f, opaque);
}
case VFIO_MIG_FLAG_DEV_SETUP_STATE:
@@ -801,6 +1178,29 @@ static bool vfio_switchover_ack_needed(void *opaque)
return vfio_precopy_supported(vbasedev);
}
+static int vfio_switchover_start(void *opaque)
+{
+ VFIODevice *vbasedev = opaque;
+ VFIOMigration *migration = vbasedev->migration;
+
+ if (!migration->multifd_transfer) {
+ /* Load thread is only used for multifd transfer */
+ return 0;
+ }
+
+ /* The lock order is load_bufs_mutex -> BQL so unlock BQL here first */
+ bql_unlock();
+ WITH_QEMU_LOCK_GUARD(&migration->load_bufs_mutex) {
+ assert(!migration->load_bufs_thread_running);
+ migration->load_bufs_thread_running = true;
+ }
+ bql_lock();
+
+ qemu_loadvm_start_load_thread(vfio_load_bufs_thread, vbasedev);
+
+ return 0;
+}
+
static const SaveVMHandlers savevm_vfio_handlers = {
.save_prepare = vfio_save_prepare,
.save_setup = vfio_save_setup,
@@ -814,7 +1214,9 @@ static const SaveVMHandlers savevm_vfio_handlers = {
.load_setup = vfio_load_setup,
.load_cleanup = vfio_load_cleanup,
.load_state = vfio_load_state,
+ .load_state_buffer = vfio_load_state_buffer,
.switchover_ack_needed = vfio_switchover_ack_needed,
+ .switchover_start = vfio_switchover_start,
};
/* ---------------------------------------------------------------------- */
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 9d547cb5cdff..72d62ada8a39 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -3384,6 +3384,8 @@ static Property vfio_pci_dev_properties[] = {
vbasedev.migration_multifd_transfer,
qdev_prop_on_off_auto_mutable, OnOffAuto,
.set_default = true, .defval.i = ON_OFF_AUTO_AUTO),
+ DEFINE_PROP_UINT64("x-migration-max-queued-buffers", VFIOPCIDevice,
+ vbasedev.migration_max_queued_buffers, UINT64_MAX),
DEFINE_PROP_BOOL("migration-events", VFIOPCIDevice,
vbasedev.migration_events, false),
DEFINE_PROP_BOOL("x-no-mmap", VFIOPCIDevice, vbasedev.no_mmap, false),
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 1bebe9877d88..418b378ebd29 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -153,6 +153,12 @@ vfio_load_device_config_state_start(const char *name) " (%s)"
vfio_load_device_config_state_end(const char *name) " (%s)"
vfio_load_state(const char *name, uint64_t data) " (%s) data 0x%"PRIx64
vfio_load_state_device_data(const char *name, uint64_t data_size, int ret) " (%s) size %"PRIu64" ret %d"
+vfio_load_state_device_buffer_incoming(const char *name, uint32_t idx) " (%s) idx %"PRIu32
+vfio_load_state_device_buffer_start(const char *name) " (%s)"
+vfio_load_state_device_buffer_starved(const char *name, uint32_t idx) " (%s) idx %"PRIu32
+vfio_load_state_device_buffer_load_start(const char *name, uint32_t idx) " (%s) idx %"PRIu32
+vfio_load_state_device_buffer_load_end(const char *name, uint32_t idx) " (%s) idx %"PRIu32
+vfio_load_state_device_buffer_end(const char *name) " (%s)"
vfio_migration_realize(const char *name) " (%s)"
vfio_migration_set_device_state(const char *name, const char *state) " (%s) state %s"
vfio_migration_set_state(const char *name, const char *new_state, const char *recover_state) " (%s) new state %s, recover state %s"
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index b1c03a82eec8..0954d6981a22 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -61,6 +61,11 @@ typedef struct VFIORegion {
uint8_t nr; /* cache the region number for debug */
} VFIORegion;
+/* type safety */
+typedef struct VFIOStateBuffers {
+ GArray *array;
+} VFIOStateBuffers;
+
typedef struct VFIOMigration {
struct VFIODevice *vbasedev;
VMChangeStateEntry *vm_state;
@@ -73,10 +78,23 @@ typedef struct VFIOMigration {
uint64_t mig_flags;
uint64_t precopy_init_size;
uint64_t precopy_dirty_size;
+ bool multifd_transfer;
bool initial_data_sent;
bool event_save_iterate_started;
bool event_precopy_empty_hit;
+
+ QemuThread load_bufs_thread;
+ bool load_bufs_thread_running;
+ bool load_bufs_thread_want_exit;
+
+ VFIOStateBuffers load_bufs;
+ QemuCond load_bufs_buffer_ready_cond;
+ QemuCond load_bufs_thread_finished_cond;
+ QemuMutex load_bufs_mutex; /* Lock order: this lock -> BQL */
+ uint32_t load_buf_idx;
+ uint32_t load_buf_idx_last;
+ uint32_t load_buf_queued_pending_buffers;
} VFIOMigration;
struct VFIOGroup;
@@ -136,6 +154,7 @@ typedef struct VFIODevice {
OnOffAuto enable_migration;
OnOffAuto migration_multifd_transfer;
bool migration_events;
+ uint64_t migration_max_queued_buffers;
VFIODeviceOps *ops;
unsigned int num_irqs;
unsigned int num_regions;
^ permalink raw reply related [flat|nested] 140+ messages in thread
* Re: [PATCH v3 22/24] vfio/migration: Multifd device state transfer support - receive side
2024-11-17 19:20 ` [PATCH v3 22/24] vfio/migration: Multifd device state transfer support - receive side Maciej S. Szmigiero
@ 2024-12-02 17:56 ` Cédric Le Goater
2024-12-10 23:04 ` Maciej S. Szmigiero
2024-12-09 9:13 ` Avihai Horon
1 sibling, 1 reply; 140+ messages in thread
From: Cédric Le Goater @ 2024-12-02 17:56 UTC (permalink / raw)
To: Maciej S. Szmigiero, Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Eric Blake, Markus Armbruster,
Daniel P . Berrangé, Avihai Horon, Joao Martins, qemu-devel
Hello Maciej,
On 11/17/24 20:20, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>
> The multifd received data needs to be reassembled since device state
> packets sent via different multifd channels can arrive out-of-order.
>
> Therefore, each VFIO device state packet carries a header indicating its
> position in the stream.
>
> The last such VFIO device state packet should have
> VFIO_DEVICE_STATE_CONFIG_STATE flag set and carry the device config state.
>
> Since it's important to finish loading the device state transferred via the
> main migration channel (via the save_live_iterate SaveVMHandler) before
> starting to load the data asynchronously transferred via multifd, the thread
> doing the actual loading of the multifd-transferred data is only started
> from the switchover_start SaveVMHandler.
>
> The switchover_start handler is called when the MIG_CMD_SWITCHOVER_START
> sub-command of QEMU_VM_COMMAND is received via the main migration channel.
>
> This sub-command is only sent after all save_live_iterate data have already
> been posted, so it is safe to commence loading of the multifd-transferred
> device state upon receiving it - loading of save_live_iterate data happens
> synchronously in the main migration thread (much like the processing of
> MIG_CMD_SWITCHOVER_START), so by the time MIG_CMD_SWITCHOVER_START is
> processed, all the preceding data must have already been loaded.
>
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> ---
> hw/vfio/migration.c | 402 ++++++++++++++++++++++++++++++++++
This is quite a significant update to introduce all at once. It lacks a
comprehensive overview of the design for those who were not involved in
the earlier discussions adding support for multifd migration of device
state. There are multiple threads and migration streams involved at
load time which deserve some description. I think the best place
would be at the end of :
https://qemu.readthedocs.io/en/v9.1.0/devel/migration/vfio.html
Could you please break down the patch to progressively introduce the
various elements needed for the receive sequence ? Something like :
- data structures first
- init phase
- run time
- and clean up phase
- toggles to enable/disable/tune
- finally, documentation update (under vfio migration)
Some more below,
> hw/vfio/pci.c | 2 +
> hw/vfio/trace-events | 6 +
> include/hw/vfio/vfio-common.h | 19 ++
> 4 files changed, 429 insertions(+)
>
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 683f2ae98d5e..b54879fe6209 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -15,6 +15,7 @@
> #include <linux/vfio.h>
> #include <sys/ioctl.h>
>
> +#include "io/channel-buffer.h"
> #include "sysemu/runstate.h"
> #include "hw/vfio/vfio-common.h"
> #include "migration/misc.h"
> @@ -55,6 +56,15 @@
> */
> #define VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE (1 * MiB)
>
> +#define VFIO_DEVICE_STATE_CONFIG_STATE (1)
> +
> +typedef struct VFIODeviceStatePacket {
> + uint32_t version;
> + uint32_t idx;
> + uint32_t flags;
> + uint8_t data[0];
> +} QEMU_PACKED VFIODeviceStatePacket;
> +
> static int64_t bytes_transferred;
>
> static const char *mig_state_to_str(enum vfio_device_mig_state state)
> @@ -254,6 +264,292 @@ static int vfio_load_buffer(QEMUFile *f, VFIODevice *vbasedev,
> return ret;
> }
>
> +typedef struct VFIOStateBuffer {
> + bool is_present;
> + char *data;
> + size_t len;
> +} VFIOStateBuffer;
> +
> +static void vfio_state_buffer_clear(gpointer data)
> +{
> + VFIOStateBuffer *lb = data;
> +
> + if (!lb->is_present) {
> + return;
> + }
> +
> + g_clear_pointer(&lb->data, g_free);
> + lb->is_present = false;
> +}
> +
> +static void vfio_state_buffers_init(VFIOStateBuffers *bufs)
> +{
> + bufs->array = g_array_new(FALSE, TRUE, sizeof(VFIOStateBuffer));
> + g_array_set_clear_func(bufs->array, vfio_state_buffer_clear);
> +}
> +
> +static void vfio_state_buffers_destroy(VFIOStateBuffers *bufs)
> +{
> + g_clear_pointer(&bufs->array, g_array_unref);
> +}
> +
> +static void vfio_state_buffers_assert_init(VFIOStateBuffers *bufs)
> +{
> + assert(bufs->array);
> +}
> +
> +static guint vfio_state_buffers_size_get(VFIOStateBuffers *bufs)
> +{
> + return bufs->array->len;
> +}
> +
> +static void vfio_state_buffers_size_set(VFIOStateBuffers *bufs, guint size)
> +{
> + g_array_set_size(bufs->array, size);
> +}
> +
> +static VFIOStateBuffer *vfio_state_buffers_at(VFIOStateBuffers *bufs, guint idx)
> +{
> + return &g_array_index(bufs->array, VFIOStateBuffer, idx);
> +}
> +
> +static int vfio_load_state_buffer(void *opaque, char *data, size_t data_size,
> + Error **errp)
> +{
> + VFIODevice *vbasedev = opaque;
> + VFIOMigration *migration = vbasedev->migration;
> + VFIODeviceStatePacket *packet = (VFIODeviceStatePacket *)data;
> + VFIOStateBuffer *lb;
> +
> + /*
> + * Holding BQL here would violate the lock order and can cause
> + * a deadlock once we attempt to lock load_bufs_mutex below.
> + */
> + assert(!bql_locked());
> +
> + if (!migration->multifd_transfer) {
Hmm, why is 'multifd_transfer' a migration attribute ? Shouldn't it
be at the device level ? Or should all devices of a VM support multifd
transfer ? That said, I'm a bit unclear about the limitations, if there
are any. Could you please explain a bit more when the migration sequence
is set up for the device ?
> + error_setg(errp,
> + "got device state packet but not doing multifd transfer");
> + return -1;
> + }
> +
> + if (data_size < sizeof(*packet)) {
> + error_setg(errp, "packet too short at %zu (min is %zu)",
> + data_size, sizeof(*packet));
> + return -1;
> + }
> +
> + if (packet->version != 0) {
> + error_setg(errp, "packet has unknown version %" PRIu32,
> + packet->version);
> + return -1;
> + }
> +
> + if (packet->idx == UINT32_MAX) {
> + error_setg(errp, "packet has too high idx %" PRIu32,
> + packet->idx);
> + return -1;
> + }
> +
> + trace_vfio_load_state_device_buffer_incoming(vbasedev->name, packet->idx);
> +
> + QEMU_LOCK_GUARD(&migration->load_bufs_mutex);
> +
> + /* config state packet should be the last one in the stream */
> + if (packet->flags & VFIO_DEVICE_STATE_CONFIG_STATE) {
> + migration->load_buf_idx_last = packet->idx;
> + }
> +
> + vfio_state_buffers_assert_init(&migration->load_bufs);
> + if (packet->idx >= vfio_state_buffers_size_get(&migration->load_bufs)) {
> + vfio_state_buffers_size_set(&migration->load_bufs, packet->idx + 1);
> + }
> +
> + lb = vfio_state_buffers_at(&migration->load_bufs, packet->idx);
> + if (lb->is_present) {
> + error_setg(errp, "state buffer %" PRIu32 " already filled",
> + packet->idx);
> + return -1;
> + }
> +
> + assert(packet->idx >= migration->load_buf_idx);
> +
> + migration->load_buf_queued_pending_buffers++;
> + if (migration->load_buf_queued_pending_buffers >
> + vbasedev->migration_max_queued_buffers) {
> + error_setg(errp,
> + "queuing state buffer %" PRIu32 " would exceed the max of %" PRIu64,
> + packet->idx, vbasedev->migration_max_queued_buffers);
> + return -1;
> + }
> +
> + lb->data = g_memdup2(&packet->data, data_size - sizeof(*packet));
> + lb->len = data_size - sizeof(*packet);
> + lb->is_present = true;
> +
> + qemu_cond_signal(&migration->load_bufs_buffer_ready_cond);
> +
> + return 0;
> +}
> +
> +static int vfio_load_device_config_state(QEMUFile *f, void *opaque);
> +
> +static int vfio_load_bufs_thread_load_config(VFIODevice *vbasedev)
> +{
> + VFIOMigration *migration = vbasedev->migration;
> + VFIOStateBuffer *lb;
> + g_autoptr(QIOChannelBuffer) bioc = NULL;
> + QEMUFile *f_out = NULL, *f_in = NULL;
> + uint64_t mig_header;
> + int ret;
> +
> + assert(migration->load_buf_idx == migration->load_buf_idx_last);
> + lb = vfio_state_buffers_at(&migration->load_bufs, migration->load_buf_idx);
> + assert(lb->is_present);
> +
> + bioc = qio_channel_buffer_new(lb->len);
> + qio_channel_set_name(QIO_CHANNEL(bioc), "vfio-device-config-load");
> +
> + f_out = qemu_file_new_output(QIO_CHANNEL(bioc));
> + qemu_put_buffer(f_out, (uint8_t *)lb->data, lb->len);
> +
> + ret = qemu_fflush(f_out);
> + if (ret) {
> + g_clear_pointer(&f_out, qemu_fclose);
> + return ret;
> + }
> +
> + qio_channel_io_seek(QIO_CHANNEL(bioc), 0, 0, NULL);
> + f_in = qemu_file_new_input(QIO_CHANNEL(bioc));
> +
> + mig_header = qemu_get_be64(f_in);
> + if (mig_header != VFIO_MIG_FLAG_DEV_CONFIG_STATE) {
> + g_clear_pointer(&f_out, qemu_fclose);
> + g_clear_pointer(&f_in, qemu_fclose);
> + return -EINVAL;
> + }
All the above code is using the QIOChannel interface which is sort of an
internal API of the migration subsystem. Can we move it under migration ?
> +
> + bql_lock();
> + ret = vfio_load_device_config_state(f_in, vbasedev);
> + bql_unlock();
> +
> + g_clear_pointer(&f_out, qemu_fclose);
> + g_clear_pointer(&f_in, qemu_fclose);
> + if (ret < 0) {
> + return ret;
> + }
> +
> + return 0;
> +}
> +
> +static bool vfio_load_bufs_thread_want_abort(VFIODevice *vbasedev,
> + bool *abort_flag)
> +{
> + VFIOMigration *migration = vbasedev->migration;
> +
> + return migration->load_bufs_thread_want_exit || qatomic_read(abort_flag);
> +}
> +
> +static int vfio_load_bufs_thread(bool *abort_flag, void *opaque)
> +{
> + VFIODevice *vbasedev = opaque;
> + VFIOMigration *migration = vbasedev->migration;
> + QEMU_LOCK_GUARD(&migration->load_bufs_mutex);
> + int ret;
> +
> + assert(migration->load_bufs_thread_running);
> +
> + while (!vfio_load_bufs_thread_want_abort(vbasedev, abort_flag)) {
> + VFIOStateBuffer *lb;
> + guint bufs_len;
> + bool starved;
> +
> + assert(migration->load_buf_idx <= migration->load_buf_idx_last);
> +
> + bufs_len = vfio_state_buffers_size_get(&migration->load_bufs);
> + if (migration->load_buf_idx >= bufs_len) {
> + assert(migration->load_buf_idx == bufs_len);
> + starved = true;
> + } else {
> + lb = vfio_state_buffers_at(&migration->load_bufs,
> + migration->load_buf_idx);
> + starved = !lb->is_present;
> + }
> +
> + if (starved) {
> + trace_vfio_load_state_device_buffer_starved(vbasedev->name,
> + migration->load_buf_idx);
> + qemu_cond_wait(&migration->load_bufs_buffer_ready_cond,
> + &migration->load_bufs_mutex);
> + continue;
> + }
> +
> + if (migration->load_buf_idx == migration->load_buf_idx_last) {
> + break;
> + }
> +
> + if (migration->load_buf_idx == 0) {
> + trace_vfio_load_state_device_buffer_start(vbasedev->name);
> + }
> +
> + if (lb->len) {
> + g_autofree char *buf = NULL;
> + size_t buf_len;
> + ssize_t wr_ret;
> + int errno_save;
> +
> + trace_vfio_load_state_device_buffer_load_start(vbasedev->name,
> + migration->load_buf_idx);
> +
> + /* lb might become re-allocated when we drop the lock */
> + buf = g_steal_pointer(&lb->data);
> + buf_len = lb->len;
> +
> + /*
> + * Loading data to the device takes a while,
> + * drop the lock during this process.
> + */
> + qemu_mutex_unlock(&migration->load_bufs_mutex);
> + wr_ret = write(migration->data_fd, buf, buf_len);
> + errno_save = errno;
> + qemu_mutex_lock(&migration->load_bufs_mutex);
> +
> + if (wr_ret < 0) {
> + ret = -errno_save;
> + goto ret_signal;
> + } else if (wr_ret < buf_len) {
> + ret = -EINVAL;
> + goto ret_signal;
> + }
> +
> + trace_vfio_load_state_device_buffer_load_end(vbasedev->name,
> + migration->load_buf_idx);
> + }
> +
> + assert(migration->load_buf_queued_pending_buffers > 0);
> + migration->load_buf_queued_pending_buffers--;
> +
> + if (migration->load_buf_idx == migration->load_buf_idx_last - 1) {
> + trace_vfio_load_state_device_buffer_end(vbasedev->name);
> + }
> +
> + migration->load_buf_idx++;
> + }
> +
> + if (vfio_load_bufs_thread_want_abort(vbasedev, abort_flag)) {
> + ret = -ECANCELED;
> + goto ret_signal;
> + }
> +
> + ret = vfio_load_bufs_thread_load_config(vbasedev);
> +
> +ret_signal:
> + migration->load_bufs_thread_running = false;
> + qemu_cond_signal(&migration->load_bufs_thread_finished_cond);
> +
> + return ret;
Is the error reported to the migration subsystem ?
> +}
> +
> static int vfio_save_device_config_state(QEMUFile *f, void *opaque,
> Error **errp)
> {
> @@ -430,6 +726,12 @@ static bool vfio_precopy_supported(VFIODevice *vbasedev)
> return migration->mig_flags & VFIO_MIGRATION_PRE_COPY;
> }
>
> +static bool vfio_multifd_transfer_supported(void)
> +{
> + return migration_has_device_state_support() &&
> + migrate_send_switchover_start();
> +}
> +
> /* ---------------------------------------------------------------------- */
>
> static int vfio_save_prepare(void *opaque, Error **errp)
> @@ -695,17 +997,73 @@ static int vfio_load_setup(QEMUFile *f, void *opaque, Error **errp)
>
> assert(!migration->load_setup);
>
> + /*
> + * Make a copy of this setting at the start in case it is changed
> + * mid-migration.
> + */
> + if (vbasedev->migration_multifd_transfer == ON_OFF_AUTO_AUTO) {
> + migration->multifd_transfer = vfio_multifd_transfer_supported();
> + } else {
> + migration->multifd_transfer =
> + vbasedev->migration_multifd_transfer == ON_OFF_AUTO_ON;
> + }
> +
> + if (migration->multifd_transfer && !vfio_multifd_transfer_supported()) {
> + error_setg(errp,
> + "%s: Multifd device transfer requested but unsupported in the current config",
> + vbasedev->name);
> + return -EINVAL;
> + }
Can we move these checks earlier ? In vfio_migration_realize() ?
If possible, it would be good to avoid the multifd_transfer attribute also.
> ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RESUMING,
> migration->device_state, errp);
> if (ret) {
> return ret;
> }
>
> + if (migration->multifd_transfer) {
> + assert(!migration->load_bufs.array);
> + vfio_state_buffers_init(&migration->load_bufs);
> +
> + qemu_mutex_init(&migration->load_bufs_mutex);
> +
> + migration->load_buf_idx = 0;
> + migration->load_buf_idx_last = UINT32_MAX;
> + migration->load_buf_queued_pending_buffers = 0;
> + qemu_cond_init(&migration->load_bufs_buffer_ready_cond);
> +
> + migration->load_bufs_thread_running = false;
> + migration->load_bufs_thread_want_exit = false;
> + qemu_cond_init(&migration->load_bufs_thread_finished_cond);
Please provide a helper routine to initialize all the multifd transfer
attributes. We might want to add a struct to gather them all, by the way.
> + }
> +
> migration->load_setup = true;
>
> return 0;
> }
>
> +static void vfio_load_cleanup_load_bufs_thread(VFIODevice *vbasedev)
> +{
> + VFIOMigration *migration = vbasedev->migration;
> +
> + /* The lock order is load_bufs_mutex -> BQL so unlock BQL here first */
> + bql_unlock();
> + WITH_QEMU_LOCK_GUARD(&migration->load_bufs_mutex) {
> + if (!migration->load_bufs_thread_running) {
> + break;
> + }
> +
> + migration->load_bufs_thread_want_exit = true;
> +
> + qemu_cond_signal(&migration->load_bufs_buffer_ready_cond);
> + qemu_cond_wait(&migration->load_bufs_thread_finished_cond,
> + &migration->load_bufs_mutex);
> +
> + assert(!migration->load_bufs_thread_running);
> + }
> + bql_lock();
> +}
> +
> static int vfio_load_cleanup(void *opaque)
> {
> VFIODevice *vbasedev = opaque;
> @@ -715,7 +1073,19 @@ static int vfio_load_cleanup(void *opaque)
> return 0;
> }
>
> + if (migration->multifd_transfer) {
> + vfio_load_cleanup_load_bufs_thread(vbasedev);
> + }
> +
> vfio_migration_cleanup(vbasedev);
Why is the cleanup done in two steps ?
> +
> + if (migration->multifd_transfer) {
> + qemu_cond_destroy(&migration->load_bufs_thread_finished_cond);
> + vfio_state_buffers_destroy(&migration->load_bufs);
> + qemu_cond_destroy(&migration->load_bufs_buffer_ready_cond);
> + qemu_mutex_destroy(&migration->load_bufs_mutex);
> + }
> +
> migration->load_setup = false;
> trace_vfio_load_cleanup(vbasedev->name);
>
> @@ -725,6 +1095,7 @@ static int vfio_load_cleanup(void *opaque)
> static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
> {
> VFIODevice *vbasedev = opaque;
> + VFIOMigration *migration = vbasedev->migration;
> int ret = 0;
> uint64_t data;
>
> @@ -736,6 +1107,12 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
> switch (data) {
> case VFIO_MIG_FLAG_DEV_CONFIG_STATE:
> {
> + if (migration->multifd_transfer) {
> + error_report("%s: got DEV_CONFIG_STATE but doing multifd transfer",
> + vbasedev->name);
> + return -EINVAL;
> + }
> +
> return vfio_load_device_config_state(f, opaque);
> }
> case VFIO_MIG_FLAG_DEV_SETUP_STATE:
> @@ -801,6 +1178,29 @@ static bool vfio_switchover_ack_needed(void *opaque)
> return vfio_precopy_supported(vbasedev);
> }
>
> +static int vfio_switchover_start(void *opaque)
> +{
> + VFIODevice *vbasedev = opaque;
> + VFIOMigration *migration = vbasedev->migration;
> +
> + if (!migration->multifd_transfer) {
> + /* Load thread is only used for multifd transfer */
> + return 0;
> + }
> +
> + /* The lock order is load_bufs_mutex -> BQL so unlock BQL here first */
> + bql_unlock();
> + WITH_QEMU_LOCK_GUARD(&migration->load_bufs_mutex) {
> + assert(!migration->load_bufs_thread_running);
> + migration->load_bufs_thread_running = true;
> + }
> + bql_lock();
> +
> + qemu_loadvm_start_load_thread(vfio_load_bufs_thread, vbasedev);
> +
> + return 0;
> +}
> +
> static const SaveVMHandlers savevm_vfio_handlers = {
> .save_prepare = vfio_save_prepare,
> .save_setup = vfio_save_setup,
> @@ -814,7 +1214,9 @@ static const SaveVMHandlers savevm_vfio_handlers = {
> .load_setup = vfio_load_setup,
> .load_cleanup = vfio_load_cleanup,
> .load_state = vfio_load_state,
> + .load_state_buffer = vfio_load_state_buffer,
> .switchover_ack_needed = vfio_switchover_ack_needed,
> + .switchover_start = vfio_switchover_start,
> };
>
> /* ---------------------------------------------------------------------- */
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 9d547cb5cdff..72d62ada8a39 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -3384,6 +3384,8 @@ static Property vfio_pci_dev_properties[] = {
> vbasedev.migration_multifd_transfer,
> qdev_prop_on_off_auto_mutable, OnOffAuto,
> .set_default = true, .defval.i = ON_OFF_AUTO_AUTO),
> + DEFINE_PROP_UINT64("x-migration-max-queued-buffers", VFIOPCIDevice,
> + vbasedev.migration_max_queued_buffers, UINT64_MAX),
> DEFINE_PROP_BOOL("migration-events", VFIOPCIDevice,
> vbasedev.migration_events, false),
> DEFINE_PROP_BOOL("x-no-mmap", VFIOPCIDevice, vbasedev.no_mmap, false),
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index 1bebe9877d88..418b378ebd29 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -153,6 +153,12 @@ vfio_load_device_config_state_start(const char *name) " (%s)"
> vfio_load_device_config_state_end(const char *name) " (%s)"
> vfio_load_state(const char *name, uint64_t data) " (%s) data 0x%"PRIx64
> vfio_load_state_device_data(const char *name, uint64_t data_size, int ret) " (%s) size %"PRIu64" ret %d"
> +vfio_load_state_device_buffer_incoming(const char *name, uint32_t idx) " (%s) idx %"PRIu32
> +vfio_load_state_device_buffer_start(const char *name) " (%s)"
> +vfio_load_state_device_buffer_starved(const char *name, uint32_t idx) " (%s) idx %"PRIu32
> +vfio_load_state_device_buffer_load_start(const char *name, uint32_t idx) " (%s) idx %"PRIu32
> +vfio_load_state_device_buffer_load_end(const char *name, uint32_t idx) " (%s) idx %"PRIu32
> +vfio_load_state_device_buffer_end(const char *name) " (%s)"
> vfio_migration_realize(const char *name) " (%s)"
> vfio_migration_set_device_state(const char *name, const char *state) " (%s) state %s"
> vfio_migration_set_state(const char *name, const char *new_state, const char *recover_state) " (%s) new state %s, recover state %s"
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index b1c03a82eec8..0954d6981a22 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -61,6 +61,11 @@ typedef struct VFIORegion {
> uint8_t nr; /* cache the region number for debug */
> } VFIORegion;
>
> +/* type safety */
> +typedef struct VFIOStateBuffers {
> + GArray *array;
> +} VFIOStateBuffers;
> +
> typedef struct VFIOMigration {
> struct VFIODevice *vbasedev;
> VMChangeStateEntry *vm_state;
> @@ -73,10 +78,23 @@ typedef struct VFIOMigration {
> uint64_t mig_flags;
> uint64_t precopy_init_size;
> uint64_t precopy_dirty_size;
> + bool multifd_transfer;
> bool initial_data_sent;
>
> bool event_save_iterate_started;
> bool event_precopy_empty_hit;
> +
> + QemuThread load_bufs_thread;
> + bool load_bufs_thread_running;
> + bool load_bufs_thread_want_exit;
> +
> + VFIOStateBuffers load_bufs;
> + QemuCond load_bufs_buffer_ready_cond;
> + QemuCond load_bufs_thread_finished_cond;
> + QemuMutex load_bufs_mutex; /* Lock order: this lock -> BQL */
> + uint32_t load_buf_idx;
> + uint32_t load_buf_idx_last;
> + uint32_t load_buf_queued_pending_buffers;
> } VFIOMigration;
>
> struct VFIOGroup;
> @@ -136,6 +154,7 @@ typedef struct VFIODevice {
> OnOffAuto enable_migration;
> OnOffAuto migration_multifd_transfer;
> bool migration_events;
> + uint64_t migration_max_queued_buffers;
> VFIODeviceOps *ops;
> unsigned int num_irqs;
> unsigned int num_regions;
>
Thanks,
C.
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 22/24] vfio/migration: Multifd device state transfer support - receive side
2024-12-02 17:56 ` Cédric Le Goater
@ 2024-12-10 23:04 ` Maciej S. Szmigiero
2024-12-19 14:13 ` Cédric Le Goater
0 siblings, 1 reply; 140+ messages in thread
From: Maciej S. Szmigiero @ 2024-12-10 23:04 UTC (permalink / raw)
To: Cédric Le Goater
Cc: Alex Williamson, Eric Blake, Peter Xu, Fabiano Rosas,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
Hi Cédric,
On 2.12.2024 18:56, Cédric Le Goater wrote:
> Hello Maciej,
>
> On 11/17/24 20:20, Maciej S. Szmigiero wrote:
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> The multifd received data needs to be reassembled since device state
>> packets sent via different multifd channels can arrive out-of-order.
>>
>> Therefore, each VFIO device state packet carries a header indicating its
>> position in the stream.
>>
>> The last such VFIO device state packet should have
>> VFIO_DEVICE_STATE_CONFIG_STATE flag set and carry the device config state.
>>
>> Since it's important to finish loading the device state transferred via the
>> main migration channel (via the save_live_iterate SaveVMHandler) before
>> starting to load the data asynchronously transferred via multifd, the thread
>> doing the actual loading of the multifd-transferred data is only started
>> from the switchover_start SaveVMHandler.
>>
>> The switchover_start handler is called when the MIG_CMD_SWITCHOVER_START
>> sub-command of QEMU_VM_COMMAND is received via the main migration channel.
>>
>> This sub-command is only sent after all save_live_iterate data have already
>> been posted, so it is safe to commence loading of the multifd-transferred
>> device state upon receiving it - loading of save_live_iterate data happens
>> synchronously in the main migration thread (much like the processing of
>> MIG_CMD_SWITCHOVER_START), so by the time MIG_CMD_SWITCHOVER_START is
>> processed, all the preceding data must have already been loaded.
>>
>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>> ---
>> hw/vfio/migration.c | 402 ++++++++++++++++++++++++++++++++++
>
> This is quite a significant update to introduce all at once. It lacks a
> comprehensive overview of the design for those who were not involved in
> the earlier discussions adding support for multifd migration of device
> state. There are multiple threads and migration streams involved at
> load time which deserve some description. I think the best place
> would be at the end of :
>
> https://qemu.readthedocs.io/en/v9.1.0/devel/migration/vfio.html
Will try to add some design/implementation descriptions to
docs/devel/migration/vfio.rst.
> Could you please break down the patch to progressively introduce the
> various elements needed for the receive sequence ? Something like :
>
> - data structures first
> - init phase
> - run time
> - and clean up phase
> - toggles to enable/disable/tune
> - finally, documentation update (under vfio migration)
Obviously I can split the VFIO patch into smaller fragments,
but this means that the intermediate form won't be testable
(I guess that's okay).
> Some more below,
>
>> hw/vfio/pci.c | 2 +
>> hw/vfio/trace-events | 6 +
>> include/hw/vfio/vfio-common.h | 19 ++
>> 4 files changed, 429 insertions(+)
>>
>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>> index 683f2ae98d5e..b54879fe6209 100644
>> --- a/hw/vfio/migration.c
>> +++ b/hw/vfio/migration.c
>> @@ -15,6 +15,7 @@
>> #include <linux/vfio.h>
>> #include <sys/ioctl.h>
>> +#include "io/channel-buffer.h"
>> #include "sysemu/runstate.h"
>> #include "hw/vfio/vfio-common.h"
>> #include "migration/misc.h"
>> @@ -55,6 +56,15 @@
>> */
>> #define VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE (1 * MiB)
>> +#define VFIO_DEVICE_STATE_CONFIG_STATE (1)
>> +
>> +typedef struct VFIODeviceStatePacket {
>> + uint32_t version;
>> + uint32_t idx;
>> + uint32_t flags;
>> + uint8_t data[0];
>> +} QEMU_PACKED VFIODeviceStatePacket;
>> +
>> static int64_t bytes_transferred;
>> static const char *mig_state_to_str(enum vfio_device_mig_state state)
>> @@ -254,6 +264,292 @@ static int vfio_load_buffer(QEMUFile *f, VFIODevice *vbasedev,
>> return ret;
>> }
>> +typedef struct VFIOStateBuffer {
>> + bool is_present;
>> + char *data;
>> + size_t len;
>> +} VFIOStateBuffer;
>> +
>> +static void vfio_state_buffer_clear(gpointer data)
>> +{
>> + VFIOStateBuffer *lb = data;
>> +
>> + if (!lb->is_present) {
>> + return;
>> + }
>> +
>> + g_clear_pointer(&lb->data, g_free);
>> + lb->is_present = false;
>> +}
>> +
>> +static void vfio_state_buffers_init(VFIOStateBuffers *bufs)
>> +{
>> + bufs->array = g_array_new(FALSE, TRUE, sizeof(VFIOStateBuffer));
>> + g_array_set_clear_func(bufs->array, vfio_state_buffer_clear);
>> +}
>> +
>> +static void vfio_state_buffers_destroy(VFIOStateBuffers *bufs)
>> +{
>> + g_clear_pointer(&bufs->array, g_array_unref);
>> +}
>> +
>> +static void vfio_state_buffers_assert_init(VFIOStateBuffers *bufs)
>> +{
>> + assert(bufs->array);
>> +}
>> +
>> +static guint vfio_state_buffers_size_get(VFIOStateBuffers *bufs)
>> +{
>> + return bufs->array->len;
>> +}
>> +
>> +static void vfio_state_buffers_size_set(VFIOStateBuffers *bufs, guint size)
>> +{
>> + g_array_set_size(bufs->array, size);
>> +}
>> +
>> +static VFIOStateBuffer *vfio_state_buffers_at(VFIOStateBuffers *bufs, guint idx)
>> +{
>> + return &g_array_index(bufs->array, VFIOStateBuffer, idx);
>> +}
>> +
>> +static int vfio_load_state_buffer(void *opaque, char *data, size_t data_size,
>> + Error **errp)
>> +{
>> + VFIODevice *vbasedev = opaque;
>> + VFIOMigration *migration = vbasedev->migration;
>> + VFIODeviceStatePacket *packet = (VFIODeviceStatePacket *)data;
>> + VFIOStateBuffer *lb;
>> +
>> + /*
>> + * Holding BQL here would violate the lock order and can cause
>> + * a deadlock once we attempt to lock load_bufs_mutex below.
>> + */
>> + assert(!bql_locked());
>> +
>> + if (!migration->multifd_transfer) {
>
> Hmm, why is 'multifd_transfer' a migration attribute ? Shouldn't it
> be at the device level ?
I thought migration-time data goes into VFIOMigration?
I don't have any strong objections against moving it into VFIODevice though.
> Or should all devices of a VM support multifd
> transfer ? That said, I'm a bit unclear about the limitations, if there
> are any. Could you please explain a bit more when the migration sequence
> is set up for the device ?
>
We need this setting on the receive side because we have to know
whether to start the load_bufs_thread (the migration core will later
wait for this thread to finish before proceeding further).
We also need to know whether to allocate the multifd-related data
structures in the VFIO driver, again based on this setting.
This setting ultimately comes from the "x-migration-multifd-transfer"
VFIOPCIDevice property, which is an ON_OFF_AUTO setting (the "AUTO" value
means that multifd use in the driver is attempted in configurations that
otherwise support it).
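In other words, the effective boolean is resolved once at load_setup time,
roughly like this (a condensed restatement of the logic already in this
patch, not new code):

    /* AUTO means "use multifd if the current config supports it" */
    if (vbasedev->migration_multifd_transfer == ON_OFF_AUTO_AUTO) {
        migration->multifd_transfer = vfio_multifd_transfer_supported();
    } else {
        migration->multifd_transfer =
            vbasedev->migration_multifd_transfer == ON_OFF_AUTO_ON;
    }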
>
>> + error_setg(errp,
>> + "got device state packet but not doing multifd transfer");
>> + return -1;
>> + }
>> +
>> + if (data_size < sizeof(*packet)) {
>> + error_setg(errp, "packet too short at %zu (min is %zu)",
>> + data_size, sizeof(*packet));
>> + return -1;
>> + }
>> +
>> + if (packet->version != 0) {
>> + error_setg(errp, "packet has unknown version %" PRIu32,
>> + packet->version);
>> + return -1;
>> + }
>> +
>> + if (packet->idx == UINT32_MAX) {
>> + error_setg(errp, "packet has too high idx %" PRIu32,
>> + packet->idx);
>> + return -1;
>> + }
>> +
>> + trace_vfio_load_state_device_buffer_incoming(vbasedev->name, packet->idx);
>> +
>> + QEMU_LOCK_GUARD(&migration->load_bufs_mutex);
>> +
>> + /* config state packet should be the last one in the stream */
>> + if (packet->flags & VFIO_DEVICE_STATE_CONFIG_STATE) {
>> + migration->load_buf_idx_last = packet->idx;
>> + }
>> +
>> + vfio_state_buffers_assert_init(&migration->load_bufs);
>> + if (packet->idx >= vfio_state_buffers_size_get(&migration->load_bufs)) {
>> + vfio_state_buffers_size_set(&migration->load_bufs, packet->idx + 1);
>> + }
>> +
>> + lb = vfio_state_buffers_at(&migration->load_bufs, packet->idx);
>> + if (lb->is_present) {
>> + error_setg(errp, "state buffer %" PRIu32 " already filled",
>> + packet->idx);
>> + return -1;
>> + }
>> +
>> + assert(packet->idx >= migration->load_buf_idx);
>> +
>> + migration->load_buf_queued_pending_buffers++;
>> + if (migration->load_buf_queued_pending_buffers >
>> + vbasedev->migration_max_queued_buffers) {
>> + error_setg(errp,
>> + "queuing state buffer %" PRIu32 " would exceed the max of %" PRIu64,
>> + packet->idx, vbasedev->migration_max_queued_buffers);
>> + return -1;
>> + }
>> +
>> + lb->data = g_memdup2(&packet->data, data_size - sizeof(*packet));
>> + lb->len = data_size - sizeof(*packet);
>> + lb->is_present = true;
>> +
>> + qemu_cond_signal(&migration->load_bufs_buffer_ready_cond);
>> +
>> + return 0;
>> +}
>> +
>> +static int vfio_load_device_config_state(QEMUFile *f, void *opaque);
>> +
>> +static int vfio_load_bufs_thread_load_config(VFIODevice *vbasedev)
>> +{
>> + VFIOMigration *migration = vbasedev->migration;
>> + VFIOStateBuffer *lb;
>> + g_autoptr(QIOChannelBuffer) bioc = NULL;
>> + QEMUFile *f_out = NULL, *f_in = NULL;
>> + uint64_t mig_header;
>> + int ret;
>> +
>> + assert(migration->load_buf_idx == migration->load_buf_idx_last);
>> + lb = vfio_state_buffers_at(&migration->load_bufs, migration->load_buf_idx);
>> + assert(lb->is_present);
>> +
>> + bioc = qio_channel_buffer_new(lb->len);
>> + qio_channel_set_name(QIO_CHANNEL(bioc), "vfio-device-config-load");
>> +
>> + f_out = qemu_file_new_output(QIO_CHANNEL(bioc));
>> + qemu_put_buffer(f_out, (uint8_t *)lb->data, lb->len);
>> +
>> + ret = qemu_fflush(f_out);
>> + if (ret) {
>> + g_clear_pointer(&f_out, qemu_fclose);
>> + return ret;
>> + }
>> +
>> + qio_channel_io_seek(QIO_CHANNEL(bioc), 0, 0, NULL);
>> + f_in = qemu_file_new_input(QIO_CHANNEL(bioc));
>> +
>> + mig_header = qemu_get_be64(f_in);
>> + if (mig_header != VFIO_MIG_FLAG_DEV_CONFIG_STATE) {
>> + g_clear_pointer(&f_out, qemu_fclose);
>> + g_clear_pointer(&f_in, qemu_fclose);
>> + return -EINVAL;
>> + }
>
> All the above code is using the QIOChannel interface which is sort of an
> internal API of the migration subsystem. Can we move it under migration ?
hw/remote and hw/virtio are also using the QIOChannel API, not to mention
qemu-nbd, block/nbd and backends/tpm, so it's definitely not just the
core migration code that uses it.
I don't think introducing a tiny generic migration core helper which takes
a VFIO-specific buffer with config data and ends up calling a VFIO-specific
device config state load function really makes sense.
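For reference, the pattern in question is just the usual QIOChannelBuffer
round-trip (a condensed sketch of what this patch does, with the header
check and error handling omitted): the reassembled config bytes are
written into a memory-backed QEMUFile, the channel is rewound, and the
data is re-read through the standard QEMUFile loading path:

    g_autoptr(QIOChannelBuffer) bioc = qio_channel_buffer_new(lb->len);
    QEMUFile *f_out = qemu_file_new_output(QIO_CHANNEL(bioc));
    QEMUFile *f_in;
    int ret;

    /* copy the reassembled buffer into the channel's backing memory */
    qemu_put_buffer(f_out, (uint8_t *)lb->data, lb->len);
    qemu_fflush(f_out);

    /* rewind, then parse it back as a regular migration stream */
    qio_channel_io_seek(QIO_CHANNEL(bioc), 0, 0, NULL);
    f_in = qemu_file_new_input(QIO_CHANNEL(bioc));
    ret = vfio_load_device_config_state(f_in, vbasedev);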
>
>> +
>> + bql_lock();
>> + ret = vfio_load_device_config_state(f_in, vbasedev);
>> + bql_unlock();
>> +
>> + g_clear_pointer(&f_out, qemu_fclose);
>> + g_clear_pointer(&f_in, qemu_fclose);
>> + if (ret < 0) {
>> + return ret;
>> + }
>> +
>> + return 0;
>> +}
>> +
>> +static bool vfio_load_bufs_thread_want_abort(VFIODevice *vbasedev,
>> + bool *abort_flag)
>> +{
>> + VFIOMigration *migration = vbasedev->migration;
>> +
>> + return migration->load_bufs_thread_want_exit || qatomic_read(abort_flag);
>> +}
>> +
>> +static int vfio_load_bufs_thread(bool *abort_flag, void *opaque)
>> +{
>> + VFIODevice *vbasedev = opaque;
>> + VFIOMigration *migration = vbasedev->migration;
>> + QEMU_LOCK_GUARD(&migration->load_bufs_mutex);
>> + int ret;
>> +
>> + assert(migration->load_bufs_thread_running);
>> +
>> + while (!vfio_load_bufs_thread_want_abort(vbasedev, abort_flag)) {
>> + VFIOStateBuffer *lb;
>> + guint bufs_len;
>> + bool starved;
>> +
>> + assert(migration->load_buf_idx <= migration->load_buf_idx_last);
>> +
>> + bufs_len = vfio_state_buffers_size_get(&migration->load_bufs);
>> + if (migration->load_buf_idx >= bufs_len) {
>> + assert(migration->load_buf_idx == bufs_len);
>> + starved = true;
>> + } else {
>> + lb = vfio_state_buffers_at(&migration->load_bufs,
>> + migration->load_buf_idx);
>> + starved = !lb->is_present;
>> + }
>> +
>> + if (starved) {
>> + trace_vfio_load_state_device_buffer_starved(vbasedev->name,
>> + migration->load_buf_idx);
>> + qemu_cond_wait(&migration->load_bufs_buffer_ready_cond,
>> + &migration->load_bufs_mutex);
>> + continue;
>> + }
>> +
>> + if (migration->load_buf_idx == migration->load_buf_idx_last) {
>> + break;
>> + }
>> +
>> + if (migration->load_buf_idx == 0) {
>> + trace_vfio_load_state_device_buffer_start(vbasedev->name);
>> + }
>> +
>> + if (lb->len) {
>> + g_autofree char *buf = NULL;
>> + size_t buf_len;
>> + ssize_t wr_ret;
>> + int errno_save;
>> +
>> + trace_vfio_load_state_device_buffer_load_start(vbasedev->name,
>> + migration->load_buf_idx);
>> +
>> + /* lb might become re-allocated when we drop the lock */
>> + buf = g_steal_pointer(&lb->data);
>> + buf_len = lb->len;
>> +
>> + /*
>> + * Loading data to the device takes a while,
>> + * drop the lock during this process.
>> + */
>> + qemu_mutex_unlock(&migration->load_bufs_mutex);
>> + wr_ret = write(migration->data_fd, buf, buf_len);
>> + errno_save = errno;
>> + qemu_mutex_lock(&migration->load_bufs_mutex);
>> +
>> + if (wr_ret < 0) {
>> + ret = -errno_save;
>> + goto ret_signal;
>> + } else if (wr_ret < buf_len) {
>> + ret = -EINVAL;
>> + goto ret_signal;
>> + }
>> +
>> + trace_vfio_load_state_device_buffer_load_end(vbasedev->name,
>> + migration->load_buf_idx);
>> + }
>> +
>> + assert(migration->load_buf_queued_pending_buffers > 0);
>> + migration->load_buf_queued_pending_buffers--;
>> +
>> + if (migration->load_buf_idx == migration->load_buf_idx_last - 1) {
>> + trace_vfio_load_state_device_buffer_end(vbasedev->name);
>> + }
>> +
>> + migration->load_buf_idx++;
>> + }
>> +
>> + if (vfio_load_bufs_thread_want_abort(vbasedev, abort_flag)) {
>> + ret = -ECANCELED;
>> + goto ret_signal;
>> + }
>> +
>> + ret = vfio_load_bufs_thread_load_config(vbasedev);
>> +
>> +ret_signal:
>> + migration->load_bufs_thread_running = false;
>> + qemu_cond_signal(&migration->load_bufs_thread_finished_cond);
>> +
>> + return ret;
>
> Is the error reported to the migration subsystem ?
Yes, via setting "load_threads_ret" in qemu_loadvm_load_thread().
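To sketch that plumbing for readers without the earlier patches at hand
(everything below other than qemu_loadvm_load_thread() and load_threads_ret
is a hypothetical name, and the real helper added earlier in this series
may well differ):

    /* hypothetical sketch of the core-side load thread wrapper */
    static void *qemu_loadvm_load_thread(void *opaque)
    {
        LoadThreadData *data = opaque;   /* hypothetical container type */
        int ret;

        ret = data->function(&data->abort_flag, data->opaque);
        if (ret) {
            /* record the failure so the migration core can fail the load */
            qatomic_cmpxchg(&load_threads_ret, 0, ret);
        }

        return NULL;
    }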
>> +}
>> +
>> static int vfio_save_device_config_state(QEMUFile *f, void *opaque,
>> Error **errp)
>> {
>> @@ -430,6 +726,12 @@ static bool vfio_precopy_supported(VFIODevice *vbasedev)
>> return migration->mig_flags & VFIO_MIGRATION_PRE_COPY;
>> }
>> +static bool vfio_multifd_transfer_supported(void)
>> +{
>> + return migration_has_device_state_support() &&
>> + migrate_send_switchover_start();
>> +}
>> +
>> /* ---------------------------------------------------------------------- */
>> static int vfio_save_prepare(void *opaque, Error **errp)
>> @@ -695,17 +997,73 @@ static int vfio_load_setup(QEMUFile *f, void *opaque, Error **errp)
>> assert(!migration->load_setup);
>> + /*
>> + * Make a copy of this setting at the start in case it is changed
>> + * mid-migration.
>> + */
>> + if (vbasedev->migration_multifd_transfer == ON_OFF_AUTO_AUTO) {
>> + migration->multifd_transfer = vfio_multifd_transfer_supported();
>> + } else {
>> + migration->multifd_transfer =
>> + vbasedev->migration_multifd_transfer == ON_OFF_AUTO_ON;
>> + }
>> +
>> + if (migration->multifd_transfer && !vfio_multifd_transfer_supported()) {
>> + error_setg(errp,
>> + "%s: Multifd device transfer requested but unsupported in the current config",
>> + vbasedev->name);
>> + return -EINVAL;
>> + }
>
> Can we move these checks earlier ? In vfio_migration_realize() ?
> If possible, it would be good to avoid the multifd_transfer attribute also.
We can't, since the value is changeable at runtime, so it could have been
changed after the VFIO device was realized.
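For example (the device id/path here is an assumption, purely to
illustrate the runtime mutability), the property can still be flipped on
the destination before the incoming migration starts:

    (qemu) qom-set /machine/peripheral/vfio0 x-migration-multifd-transfer on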
>> ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RESUMING,
>> migration->device_state, errp);
>> if (ret) {
>> return ret;
>> }
>> + if (migration->multifd_transfer) {
>> + assert(!migration->load_bufs.array);
>> + vfio_state_buffers_init(&migration->load_bufs);
>> +
>> + qemu_mutex_init(&migration->load_bufs_mutex);
>> +
>> + migration->load_buf_idx = 0;
>> + migration->load_buf_idx_last = UINT32_MAX;
>> + migration->load_buf_queued_pending_buffers = 0;
>> + qemu_cond_init(&migration->load_bufs_buffer_ready_cond);
>> +
>> + migration->load_bufs_thread_running = false;
>> + migration->load_bufs_thread_want_exit = false;
>> + qemu_cond_init(&migration->load_bufs_thread_finished_cond);
>
> Please provide a helper routine to initialize all the multifd transfer
> attributes. We might want to add a struct to gather them all, by the way.
Will move these to a new helper.
>> + }
>> +
>> migration->load_setup = true;
>> return 0;
>> }
>> +static void vfio_load_cleanup_load_bufs_thread(VFIODevice *vbasedev)
>> +{
>> + VFIOMigration *migration = vbasedev->migration;
>> +
>> + /* The lock order is load_bufs_mutex -> BQL so unlock BQL here first */
>> + bql_unlock();
>> + WITH_QEMU_LOCK_GUARD(&migration->load_bufs_mutex) {
>> + if (!migration->load_bufs_thread_running) {
>> + break;
>> + }
>> +
>> + migration->load_bufs_thread_want_exit = true;
>> +
>> + qemu_cond_signal(&migration->load_bufs_buffer_ready_cond);
>> + qemu_cond_wait(&migration->load_bufs_thread_finished_cond,
>> + &migration->load_bufs_mutex);
>> +
>> + assert(!migration->load_bufs_thread_running);
>> + }
>> + bql_lock();
>> +}
>> +
>> static int vfio_load_cleanup(void *opaque)
>> {
>> VFIODevice *vbasedev = opaque;
>> @@ -715,7 +1073,19 @@ static int vfio_load_cleanup(void *opaque)
>> return 0;
>> }
>> + if (migration->multifd_transfer) {
>> + vfio_load_cleanup_load_bufs_thread(vbasedev);
>> + }
>> +
>> vfio_migration_cleanup(vbasedev);
>
> Why is the cleanup done in two steps ?
I'm not sure what "two steps" refers to here, but
if you mean merging the "if (migration->multifd_transfer)"
block below into the similar one above, then it should be possible.
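For example, the merged form could look roughly like this (a sketch only;
the load thread must be stopped before its resources are torn down):

    if (migration->multifd_transfer) {
        vfio_load_cleanup_load_bufs_thread(vbasedev);

        qemu_cond_destroy(&migration->load_bufs_thread_finished_cond);
        vfio_state_buffers_destroy(&migration->load_bufs);
        qemu_cond_destroy(&migration->load_bufs_buffer_ready_cond);
        qemu_mutex_destroy(&migration->load_bufs_mutex);
    }

    vfio_migration_cleanup(vbasedev);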
>> +
>> + if (migration->multifd_transfer) {
>> + qemu_cond_destroy(&migration->load_bufs_thread_finished_cond);
>> + vfio_state_buffers_destroy(&migration->load_bufs);
>> + qemu_cond_destroy(&migration->load_bufs_buffer_ready_cond);
>> + qemu_mutex_destroy(&migration->load_bufs_mutex);
>> + }
>> +
>> migration->load_setup = false;
>> trace_vfio_load_cleanup(vbasedev->name);
(..)
> Thanks,
>
> C.
>
Thanks,
Maciej
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 22/24] vfio/migration: Multifd device state transfer support - receive side
2024-12-10 23:04 ` Maciej S. Szmigiero
@ 2024-12-19 14:13 ` Cédric Le Goater
0 siblings, 0 replies; 140+ messages in thread
From: Cédric Le Goater @ 2024-12-19 14:13 UTC (permalink / raw)
To: Maciej S. Szmigiero
Cc: Alex Williamson, Eric Blake, Peter Xu, Fabiano Rosas,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On 12/11/24 00:04, Maciej S. Szmigiero wrote:
> Hi Cédric,
>
> On 2.12.2024 18:56, Cédric Le Goater wrote:
>> Hello Maciej,
>>
>> On 11/17/24 20:20, Maciej S. Szmigiero wrote:
>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>
>>> The multifd received data needs to be reassembled since device state
>>> packets sent via different multifd channels can arrive out-of-order.
>>>
>>> Therefore, each VFIO device state packet carries a header indicating its
>>> position in the stream.
>>>
>>> The last such VFIO device state packet should have
>>> VFIO_DEVICE_STATE_CONFIG_STATE flag set and carry the device config state.
>>>
>>> Since it's important to finish loading the device state transferred via the
>>> main migration channel (via the save_live_iterate SaveVMHandler) before
>>> starting to load the data asynchronously transferred via multifd, the thread
>>> doing the actual loading of the multifd-transferred data is only started
>>> from the switchover_start SaveVMHandler.
>>>
>>> The switchover_start handler is called when the MIG_CMD_SWITCHOVER_START
>>> sub-command of QEMU_VM_COMMAND is received via the main migration channel.
>>>
>>> This sub-command is only sent after all save_live_iterate data have already
>>> been posted so it is safe to commence loading of the multifd-transferred
>>> device state upon receiving it - loading of save_live_iterate data happens
>>> synchronously in the main migration thread (much like the processing of
>>> MIG_CMD_SWITCHOVER_START) so by the time MIG_CMD_SWITCHOVER_START is
>>> processed all the preceding data must have already been loaded.
>>>
>>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>>> ---
>>> hw/vfio/migration.c | 402 ++++++++++++++++++++++++++++++++++
>>
>> This is quite a significant update to introduce all at once. It lacks a
>> comprehensive overview of the design for those who were not involved in
>> the earlier discussions adding support for multifd migration of device
>> state. There are multiple threads and migration streams involved at
>> load time which deserve some descriptions. I think the best place
>> would be at the end of :
>>
>> https://qemu.readthedocs.io/en/v9.1.0/devel/migration/vfio.html
>
> Will try to add some design/implementation descriptions to
> docs/devel/migration/vfio.rst.
>
>> Could you please break down the patch to progressively introduce the
>> various elements needed for the receive sequence ? Something like :
>>
>> - data structures first
>> - init phase
>> - run time
>> - and clean up phase
>> - toggles to enable/disable/tune
>> - finaly, documentation update (under vfio migration)
>
> Obviously I can split the VFIO patch into smaller fragments,
> but this means that the intermediate form won't be testable
> (I guess that's okay).
As long as bisect is not broken, it is fine. Typically, the last patch
of a series is the one activating the new proposed feature.
>
>> Some more below,
>>
>>> hw/vfio/pci.c | 2 +
>>> hw/vfio/trace-events | 6 +
>>> include/hw/vfio/vfio-common.h | 19 ++
>>> 4 files changed, 429 insertions(+)
>>>
>>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>>> index 683f2ae98d5e..b54879fe6209 100644
>>> --- a/hw/vfio/migration.c
>>> +++ b/hw/vfio/migration.c
>>> @@ -15,6 +15,7 @@
>>> #include <linux/vfio.h>
>>> #include <sys/ioctl.h>
>>> +#include "io/channel-buffer.h"
>>> #include "sysemu/runstate.h"
>>> #include "hw/vfio/vfio-common.h"
>>> #include "migration/misc.h"
>>> @@ -55,6 +56,15 @@
>>> */
>>> #define VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE (1 * MiB)
>>> +#define VFIO_DEVICE_STATE_CONFIG_STATE (1)
>>> +
>>> +typedef struct VFIODeviceStatePacket {
>>> + uint32_t version;
>>> + uint32_t idx;
>>> + uint32_t flags;
>>> + uint8_t data[0];
>>> +} QEMU_PACKED VFIODeviceStatePacket;
>>> +
>>> static int64_t bytes_transferred;
>>> static const char *mig_state_to_str(enum vfio_device_mig_state state)
>>> @@ -254,6 +264,292 @@ static int vfio_load_buffer(QEMUFile *f, VFIODevice *vbasedev,
>>> return ret;
>>> }
>>> +typedef struct VFIOStateBuffer {
>>> + bool is_present;
>>> + char *data;
>>> + size_t len;
>>> +} VFIOStateBuffer;
>>> +
>>> +static void vfio_state_buffer_clear(gpointer data)
>>> +{
>>> + VFIOStateBuffer *lb = data;
>>> +
>>> + if (!lb->is_present) {
>>> + return;
>>> + }
>>> +
>>> + g_clear_pointer(&lb->data, g_free);
>>> + lb->is_present = false;
>>> +}
>>> +
>>> +static void vfio_state_buffers_init(VFIOStateBuffers *bufs)
>>> +{
>>> + bufs->array = g_array_new(FALSE, TRUE, sizeof(VFIOStateBuffer));
>>> + g_array_set_clear_func(bufs->array, vfio_state_buffer_clear);
>>> +}
>>> +
>>> +static void vfio_state_buffers_destroy(VFIOStateBuffers *bufs)
>>> +{
>>> + g_clear_pointer(&bufs->array, g_array_unref);
>>> +}
>>> +
>>> +static void vfio_state_buffers_assert_init(VFIOStateBuffers *bufs)
>>> +{
>>> + assert(bufs->array);
>>> +}
>>> +
>>> +static guint vfio_state_buffers_size_get(VFIOStateBuffers *bufs)
>>> +{
>>> + return bufs->array->len;
>>> +}
>>> +
>>> +static void vfio_state_buffers_size_set(VFIOStateBuffers *bufs, guint size)
>>> +{
>>> + g_array_set_size(bufs->array, size);
>>> +}
>>> +
>>> +static VFIOStateBuffer *vfio_state_buffers_at(VFIOStateBuffers *bufs, guint idx)
>>> +{
>>> + return &g_array_index(bufs->array, VFIOStateBuffer, idx);
>>> +}
>>> +
>>> +static int vfio_load_state_buffer(void *opaque, char *data, size_t data_size,
>>> + Error **errp)
>>> +{
>>> + VFIODevice *vbasedev = opaque;
>>> + VFIOMigration *migration = vbasedev->migration;
>>> + VFIODeviceStatePacket *packet = (VFIODeviceStatePacket *)data;
>>> + VFIOStateBuffer *lb;
>>> +
>>> + /*
>>> + * Holding BQL here would violate the lock order and can cause
>>> + * a deadlock once we attempt to lock load_bufs_mutex below.
>>> + */
>>> + assert(!bql_locked());
>>> +
>>> + if (!migration->multifd_transfer) {
>>
>> Hmm, why is 'multifd_transfer' a migration attribute ? Shouldn't it
>> be at the device level ?
>
> I thought migration-time data goes into VFIOMigration?
Yes. Sorry, I was confused by the MigrationState object, which is global.
VFIOMigration is a device-level object. We are fine.
AFAICT, this supports hybrid configs: some devices using multifd
migration and others using standard migration.
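For instance, something like this (an illustrative command line, the host
addresses below are made up) would mix both modes:

  -device vfio-pci,host=0000:01:00.0,x-migration-multifd-transfer=on \
  -device vfio-pci,host=0000:02:00.0,x-migration-multifd-transfer=off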
> I don't have any strong objections against moving it into VFIODevice though.
>
>> Or should all devices of a VM support multifd
>> transfer ? That said, I'm a bit unclear about the limitations, if there
>> are any. Could please you explain a bit more when the migration sequence
>> is setup for the device ?
>>
>
> The reason we need this setting on the receive side is because we
> need to know whether to start the load_bufs_thread (the migration
> core will later wait for this thread to finish before proceeding further).
>
> We also need to know whether to allocate multifd-related data structures
> in the VFIO driver based on this setting.
>
> This setting ultimately comes from "x-migration-multifd-transfer"
> VFIOPCIDevice setting, which is an ON_OFF_AUTO setting ("AUTO" value means
> that multifd use in the driver is attempted in configurations that
> otherwise support it).
>
>>
>>> + error_setg(errp,
>>> + "got device state packet but not doing multifd transfer");
>>> + return -1;
>>> + }
>>> +
>>> + if (data_size < sizeof(*packet)) {
>>> + error_setg(errp, "packet too short at %zu (min is %zu)",
>>> + data_size, sizeof(*packet));
>>> + return -1;
>>> + }
>>> +
>>> + if (packet->version != 0) {
>>> + error_setg(errp, "packet has unknown version %" PRIu32,
>>> + packet->version);
>>> + return -1;
>>> + }
>>> +
>>> + if (packet->idx == UINT32_MAX) {
>>> + error_setg(errp, "packet has too high idx %" PRIu32,
>>> + packet->idx);
>>> + return -1;
>>> + }
>>> +
>>> + trace_vfio_load_state_device_buffer_incoming(vbasedev->name, packet->idx);
>>> +
>>> + QEMU_LOCK_GUARD(&migration->load_bufs_mutex);
>>> +
>>> + /* config state packet should be the last one in the stream */
>>> + if (packet->flags & VFIO_DEVICE_STATE_CONFIG_STATE) {
>>> + migration->load_buf_idx_last = packet->idx;
>>> + }
>>> +
>>> + vfio_state_buffers_assert_init(&migration->load_bufs);
>>> + if (packet->idx >= vfio_state_buffers_size_get(&migration->load_bufs)) {
>>> + vfio_state_buffers_size_set(&migration->load_bufs, packet->idx + 1);
>>> + }
>>> +
>>> + lb = vfio_state_buffers_at(&migration->load_bufs, packet->idx);
>>> + if (lb->is_present) {
>>> + error_setg(errp, "state buffer %" PRIu32 " already filled",
>>> + packet->idx);
>>> + return -1;
>>> + }
>>> +
>>> + assert(packet->idx >= migration->load_buf_idx);
>>> +
>>> + migration->load_buf_queued_pending_buffers++;
>>> + if (migration->load_buf_queued_pending_buffers >
>>> + vbasedev->migration_max_queued_buffers) {
>>> + error_setg(errp,
>>> + "queuing state buffer %" PRIu32 " would exceed the max of %" PRIu64,
>>> + packet->idx, vbasedev->migration_max_queued_buffers);
>>> + return -1;
>>> + }
>>> +
>>> + lb->data = g_memdup2(&packet->data, data_size - sizeof(*packet));
>>> + lb->len = data_size - sizeof(*packet);
>>> + lb->is_present = true;
>>> +
>>> + qemu_cond_signal(&migration->load_bufs_buffer_ready_cond);
>>> +
>>> + return 0;
>>> +}
>>> +
>>> +static int vfio_load_device_config_state(QEMUFile *f, void *opaque);
>>> +
>>> +static int vfio_load_bufs_thread_load_config(VFIODevice *vbasedev)
>>> +{
>>> + VFIOMigration *migration = vbasedev->migration;
>>> + VFIOStateBuffer *lb;
>>> + g_autoptr(QIOChannelBuffer) bioc = NULL;
>>> + QEMUFile *f_out = NULL, *f_in = NULL;
>>> + uint64_t mig_header;
>>> + int ret;
>>> +
>>> + assert(migration->load_buf_idx == migration->load_buf_idx_last);
>>> + lb = vfio_state_buffers_at(&migration->load_bufs, migration->load_buf_idx);
>>> + assert(lb->is_present);
>>> +
>>> + bioc = qio_channel_buffer_new(lb->len);
>>> + qio_channel_set_name(QIO_CHANNEL(bioc), "vfio-device-config-load");
>>> +
>>> + f_out = qemu_file_new_output(QIO_CHANNEL(bioc));
>>> + qemu_put_buffer(f_out, (uint8_t *)lb->data, lb->len);
>>> +
>>> + ret = qemu_fflush(f_out);
>>> + if (ret) {
>>> + g_clear_pointer(&f_out, qemu_fclose);
>>> + return ret;
>>> + }
>>> +
>>> + qio_channel_io_seek(QIO_CHANNEL(bioc), 0, 0, NULL);
>>> + f_in = qemu_file_new_input(QIO_CHANNEL(bioc));
>>> +
>>> + mig_header = qemu_get_be64(f_in);
>>> + if (mig_header != VFIO_MIG_FLAG_DEV_CONFIG_STATE) {
>>> + g_clear_pointer(&f_out, qemu_fclose);
>>> + g_clear_pointer(&f_in, qemu_fclose);
>>> + return -EINVAL;
>>> + }
>>
>> All the above code is using the QIOChannel interface which is sort of an
>> internal API of the migration subsystem. Can we move it under migration ?
>
> hw/remote and hw/virtio are also using QIOChannel API, not to mention
> qemu-nbd, block/nbd and backends/tpm, so definitely it's not just the
> core migration code that uses it.
These examples are not device models.
> I don't think introducing a tiny generic migration core helper which takes
> VFIO-specific buffer with config data and ends calling VFIO-specific
> device config state load function really makes sense.
qemu_file_new_input/output, qio_channel_buffer_new and qio_channel_io_seek
are solely used in migration. That's why I am reluctant to use them
directly in VFIO.
I agree it is small, for now.
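To be clearer, what I had in mind is a small helper in the migration core,
roughly like this (the name and signature below are only hypothetical):

/*
 * Wrap @data in a memory-backed QEMUFile, check that the stream starts
 * with @expected_flag and call @load_fn on the resulting input file.
 */
int qemu_loadvm_load_config_buffer(const char *data, size_t len,
                                   uint64_t expected_flag,
                                   int (*load_fn)(QEMUFile *f, void *opaque),
                                   void *opaque);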
>
>>
>>> +
>>> + bql_lock();
>>> + ret = vfio_load_device_config_state(f_in, vbasedev);
>>> + bql_unlock();
>>> +
>>> + g_clear_pointer(&f_out, qemu_fclose);
>>> + g_clear_pointer(&f_in, qemu_fclose);
>>> + if (ret < 0) {
>>> + return ret;
>>> + }
>>> +
>>> + return 0;
>>> +}
>>> +
>>> +static bool vfio_load_bufs_thread_want_abort(VFIODevice *vbasedev,
>>> + bool *abort_flag)
>>> +{
>>> + VFIOMigration *migration = vbasedev->migration;
>>> +
>>> + return migration->load_bufs_thread_want_exit || qatomic_read(abort_flag);
>>> +}
>>> +
>>> +static int vfio_load_bufs_thread(bool *abort_flag, void *opaque)
>>> +{
>>> + VFIODevice *vbasedev = opaque;
>>> + VFIOMigration *migration = vbasedev->migration;
>>> + QEMU_LOCK_GUARD(&migration->load_bufs_mutex);
>>> + int ret;
>>> +
>>> + assert(migration->load_bufs_thread_running);
>>> +
>>> + while (!vfio_load_bufs_thread_want_abort(vbasedev, abort_flag)) {
>>> + VFIOStateBuffer *lb;
>>> + guint bufs_len;
>>> + bool starved;
>>> +
>>> + assert(migration->load_buf_idx <= migration->load_buf_idx_last);
>>> +
>>> + bufs_len = vfio_state_buffers_size_get(&migration->load_bufs);
>>> + if (migration->load_buf_idx >= bufs_len) {
>>> + assert(migration->load_buf_idx == bufs_len);
>>> + starved = true;
>>> + } else {
>>> + lb = vfio_state_buffers_at(&migration->load_bufs,
>>> + migration->load_buf_idx);
>>> + starved = !lb->is_present;
>>> + }
>>> +
>>> + if (starved) {
>>> + trace_vfio_load_state_device_buffer_starved(vbasedev->name,
>>> + migration->load_buf_idx);
>>> + qemu_cond_wait(&migration->load_bufs_buffer_ready_cond,
>>> + &migration->load_bufs_mutex);
>>> + continue;
>>> + }
>>> +
>>> + if (migration->load_buf_idx == migration->load_buf_idx_last) {
>>> + break;
>>> + }
>>> +
>>> + if (migration->load_buf_idx == 0) {
>>> + trace_vfio_load_state_device_buffer_start(vbasedev->name);
>>> + }
>>> +
>>> + if (lb->len) {
>>> + g_autofree char *buf = NULL;
>>> + size_t buf_len;
>>> + ssize_t wr_ret;
>>> + int errno_save;
>>> +
>>> + trace_vfio_load_state_device_buffer_load_start(vbasedev->name,
>>> + migration->load_buf_idx);
>>> +
>>> + /* lb might become re-allocated when we drop the lock */
>>> + buf = g_steal_pointer(&lb->data);
>>> + buf_len = lb->len;
>>> +
>>> + /*
>>> + * Loading data to the device takes a while,
>>> + * drop the lock during this process.
>>> + */
>>> + qemu_mutex_unlock(&migration->load_bufs_mutex);
>>> + wr_ret = write(migration->data_fd, buf, buf_len);
>>> + errno_save = errno;
>>> + qemu_mutex_lock(&migration->load_bufs_mutex);
>>> +
>>> + if (wr_ret < 0) {
>>> + ret = -errno_save;
>>> + goto ret_signal;
>>> + } else if (wr_ret < buf_len) {
>>> + ret = -EINVAL;
>>> + goto ret_signal;
>>> + }
>>> +
>>> + trace_vfio_load_state_device_buffer_load_end(vbasedev->name,
>>> + migration->load_buf_idx);
>>> + }
>>> +
>>> + assert(migration->load_buf_queued_pending_buffers > 0);
>>> + migration->load_buf_queued_pending_buffers--;
>>> +
>>> + if (migration->load_buf_idx == migration->load_buf_idx_last - 1) {
>>> + trace_vfio_load_state_device_buffer_end(vbasedev->name);
>>> + }
>>> +
>>> + migration->load_buf_idx++;
>>> + }
>>> +
>>> + if (vfio_load_bufs_thread_want_abort(vbasedev, abort_flag)) {
>>> + ret = -ECANCELED;
>>> + goto ret_signal;
>>> + }
>>> +
>>> + ret = vfio_load_bufs_thread_load_config(vbasedev);
>>> +
>>> +ret_signal:
>>> + migration->load_bufs_thread_running = false;
>>> + qemu_cond_signal(&migration->load_bufs_thread_finished_cond);
>>> +
>>> + return ret;
>>
>> Is the error reported to the migration subsystem?
>
> Yes, via setting "load_threads_ret" in qemu_loadvm_load_thread().
>
>>> +}
>>> +
>>> static int vfio_save_device_config_state(QEMUFile *f, void *opaque,
>>> Error **errp)
>>> {
>>> @@ -430,6 +726,12 @@ static bool vfio_precopy_supported(VFIODevice *vbasedev)
>>> return migration->mig_flags & VFIO_MIGRATION_PRE_COPY;
>>> }
>>> +static bool vfio_multifd_transfer_supported(void)
>>> +{
>>> + return migration_has_device_state_support() &&
>>> + migrate_send_switchover_start();
>>> +}
>>> +
>>> /* ---------------------------------------------------------------------- */
>>> static int vfio_save_prepare(void *opaque, Error **errp)
>>> @@ -695,17 +997,73 @@ static int vfio_load_setup(QEMUFile *f, void *opaque, Error **errp)
>>> assert(!migration->load_setup);
>>> + /*
>>> + * Make a copy of this setting at the start in case it is changed
>>> + * mid-migration.
>>> + */
>>> + if (vbasedev->migration_multifd_transfer == ON_OFF_AUTO_AUTO) {
>>> + migration->multifd_transfer = vfio_multifd_transfer_supported();
>>> + } else {
>>> + migration->multifd_transfer =
>>> + vbasedev->migration_multifd_transfer == ON_OFF_AUTO_ON;
>>> + }
>>> +
>>> + if (migration->multifd_transfer && !vfio_multifd_transfer_supported()) {
>>> + error_setg(errp,
>>> + "%s: Multifd device transfer requested but unsupported in the current config",
>>> + vbasedev->name);
>>> + return -EINVAL;
>>> + }
>>
>> Can we move these checks ealier ? in vfio_migration_realize() ?
>> If possible, it would be good to avoid the multifd_transfer attribute also.
>
> We can't since the value is changeable at runtime, so it could have been
> changed after the VFIO device got realized.
We will need to discuss this part again. Let's keep it that way for now.
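For reference, since the property uses qdev_prop_on_off_auto_mutable it
should be changeable at runtime with a QMP command along these lines (the
device id below is made up):

{ "execute": "qom-set",
  "arguments": { "path": "/machine/peripheral/vfio0",
                 "property": "x-migration-multifd-transfer",
                 "value": "on" } }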
>>> ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RESUMING,
>>> migration->device_state, errp);
>>> if (ret) {
>>> return ret;
>>> }
>>> + if (migration->multifd_transfer) {
>>> + assert(!migration->load_bufs.array);
>>> + vfio_state_buffers_init(&migration->load_bufs);
>>> +
>>> + qemu_mutex_init(&migration->load_bufs_mutex);
>>> +
>>> + migration->load_buf_idx = 0;
>>> + migration->load_buf_idx_last = UINT32_MAX;
>>> + migration->load_buf_queued_pending_buffers = 0;
>>> + qemu_cond_init(&migration->load_bufs_buffer_ready_cond);
>>> +
>>> + migration->load_bufs_thread_running = false;
>>> + migration->load_bufs_thread_want_exit = false;
>>> + qemu_cond_init(&migration->load_bufs_thread_finished_cond);
>>
>> Please provide a helper routine to initialize all the multifd transfer
>> attributes. We might want to add a struct to gather them all by the way.
>
> Will move these to a new helper.
>
>>> + }
>>> +
>>> migration->load_setup = true;
>>> return 0;
>>> }
>>> +static void vfio_load_cleanup_load_bufs_thread(VFIODevice *vbasedev)
>>> +{
>>> + VFIOMigration *migration = vbasedev->migration;
>>> +
>>> + /* The lock order is load_bufs_mutex -> BQL so unlock BQL here first */
>>> + bql_unlock();
>>> + WITH_QEMU_LOCK_GUARD(&migration->load_bufs_mutex) {
>>> + if (!migration->load_bufs_thread_running) {
>>> + break;
>>> + }
>>> +
>>> + migration->load_bufs_thread_want_exit = true;
>>> +
>>> + qemu_cond_signal(&migration->load_bufs_buffer_ready_cond);
>>> + qemu_cond_wait(&migration->load_bufs_thread_finished_cond,
>>> + &migration->load_bufs_mutex);
>>> +
>>> + assert(!migration->load_bufs_thread_running);
>>> + }
>>> + bql_lock();
>>> +}
>>> +
>>> static int vfio_load_cleanup(void *opaque)
>>> {
>>> VFIODevice *vbasedev = opaque;
>>> @@ -715,7 +1073,19 @@ static int vfio_load_cleanup(void *opaque)
>>> return 0;
>>> }
>>> + if (migration->multifd_transfer) {
>>> + vfio_load_cleanup_load_bufs_thread(vbasedev);
>>> + }
>>> +
>>> vfio_migration_cleanup(vbasedev);
>>
>> Why is the cleanup done in two steps ?
>
> I'm not sure what "two steps" here refer to, but
> if you mean to move the "if (migration->multifd_transfer)"
> block below to the similar one above then it should be possible.
Good. It is preferable.
Thanks,
C.
>
>>> +
>>> + if (migration->multifd_transfer) {
>>> + qemu_cond_destroy(&migration->load_bufs_thread_finished_cond);
>>> + vfio_state_buffers_destroy(&migration->load_bufs);
>>> + qemu_cond_destroy(&migration->load_bufs_buffer_ready_cond);
>>> + qemu_mutex_destroy(&migration->load_bufs_mutex);
>>> + }
>>> +
>>> migration->load_setup = false;
>>> trace_vfio_load_cleanup(vbasedev->name);
> (..)
>
>
>> Thanks,
>>
>> C.
>>
>
> Thanks,
> Maciej
>
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 22/24] vfio/migration: Multifd device state transfer support - receive side
2024-11-17 19:20 ` [PATCH v3 22/24] vfio/migration: Multifd device state transfer support - receive side Maciej S. Szmigiero
2024-12-02 17:56 ` Cédric Le Goater
@ 2024-12-09 9:13 ` Avihai Horon
2024-12-10 23:06 ` Maciej S. Szmigiero
1 sibling, 1 reply; 140+ messages in thread
From: Avihai Horon @ 2024-12-09 9:13 UTC (permalink / raw)
To: Maciej S. Szmigiero, Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Joao Martins,
qemu-devel
Hi Maciej,
On 17/11/2024 21:20, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>
> The multifd received data needs to be reassembled since device state
> packets sent via different multifd channels can arrive out-of-order.
>
> Therefore, each VFIO device state packet carries a header indicating its
> position in the stream.
>
> The last such VFIO device state packet should have
> VFIO_DEVICE_STATE_CONFIG_STATE flag set and carry the device config state.
>
> Since it's important to finish loading device state transferred via the
> main migration channel (via save_live_iterate SaveVMHandler) before
> starting loading the data asynchronously transferred via multifd the thread
> doing the actual loading of the multifd transferred data is only started
> from switchover_start SaveVMHandler.
>
> switchover_start handler is called when MIG_CMD_SWITCHOVER_START
> sub-command of QEMU_VM_COMMAND is received via the main migration channel.
>
> This sub-command is only sent after all save_live_iterate data have already
> been posted so it is safe to commence loading of the multifd-transferred
> device state upon receiving it - loading of save_live_iterate data happens
> synchronously in the main migration thread (much like the processing of
> MIG_CMD_SWITCHOVER_START) so by the time MIG_CMD_SWITCHOVER_START is
> processed all the preceding data must have already been loaded.
>
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> ---
> hw/vfio/migration.c | 402 ++++++++++++++++++++++++++++++++++
> hw/vfio/pci.c | 2 +
> hw/vfio/trace-events | 6 +
> include/hw/vfio/vfio-common.h | 19 ++
> 4 files changed, 429 insertions(+)
>
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 683f2ae98d5e..b54879fe6209 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -15,6 +15,7 @@
> #include <linux/vfio.h>
> #include <sys/ioctl.h>
>
> +#include "io/channel-buffer.h"
> #include "sysemu/runstate.h"
> #include "hw/vfio/vfio-common.h"
> #include "migration/misc.h"
> @@ -55,6 +56,15 @@
> */
> #define VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE (1 * MiB)
>
> +#define VFIO_DEVICE_STATE_CONFIG_STATE (1)
> +
> +typedef struct VFIODeviceStatePacket {
> + uint32_t version;
> + uint32_t idx;
> + uint32_t flags;
> + uint8_t data[0];
> +} QEMU_PACKED VFIODeviceStatePacket;
> +
> static int64_t bytes_transferred;
>
> static const char *mig_state_to_str(enum vfio_device_mig_state state)
> @@ -254,6 +264,292 @@ static int vfio_load_buffer(QEMUFile *f, VFIODevice *vbasedev,
> return ret;
> }
>
> +typedef struct VFIOStateBuffer {
> + bool is_present;
> + char *data;
> + size_t len;
> +} VFIOStateBuffer;
> +
> +static void vfio_state_buffer_clear(gpointer data)
> +{
> + VFIOStateBuffer *lb = data;
> +
> + if (!lb->is_present) {
> + return;
> + }
> +
> + g_clear_pointer(&lb->data, g_free);
> + lb->is_present = false;
> +}
> +
> +static void vfio_state_buffers_init(VFIOStateBuffers *bufs)
> +{
> + bufs->array = g_array_new(FALSE, TRUE, sizeof(VFIOStateBuffer));
> + g_array_set_clear_func(bufs->array, vfio_state_buffer_clear);
> +}
> +
> +static void vfio_state_buffers_destroy(VFIOStateBuffers *bufs)
> +{
> + g_clear_pointer(&bufs->array, g_array_unref);
> +}
> +
> +static void vfio_state_buffers_assert_init(VFIOStateBuffers *bufs)
> +{
> + assert(bufs->array);
> +}
> +
> +static guint vfio_state_buffers_size_get(VFIOStateBuffers *bufs)
> +{
> + return bufs->array->len;
> +}
> +
> +static void vfio_state_buffers_size_set(VFIOStateBuffers *bufs, guint size)
> +{
> + g_array_set_size(bufs->array, size);
> +}
The above three functions seem a bit too specific.
How about:
Instead of size_set and assert_init, introduce a
vfio_state_buffers_insert() function that handles buffer insertion to
the array from the validated packet.
Instead of size_get, introduce vfio_state_buffers_get() that handles the
array length and is_present checks.
We can also add a vfio_state_buffer_write() function that handles
writing the buffer to the device.
IMHO this will also make vfio_load_state_buffer() and
vfio_load_bufs_thread(), which are rather long, clearer.
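Roughly something like this minimal sketch (the names and exact signatures
are only a proposal):

static VFIOStateBuffer *vfio_state_buffers_insert(VFIOStateBuffers *bufs,
                                                  guint idx)
{
    /* Grow the array on demand so the packet's slot always exists */
    if (idx >= bufs->array->len) {
        g_array_set_size(bufs->array, idx + 1);
    }

    return &g_array_index(bufs->array, VFIOStateBuffer, idx);
}

static VFIOStateBuffer *vfio_state_buffers_get(VFIOStateBuffers *bufs,
                                               guint idx)
{
    VFIOStateBuffer *lb;

    if (idx >= bufs->array->len) {
        return NULL;
    }

    lb = &g_array_index(bufs->array, VFIOStateBuffer, idx);
    return lb->is_present ? lb : NULL;
}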
> +
> +static VFIOStateBuffer *vfio_state_buffers_at(VFIOStateBuffers *bufs, guint idx)
> +{
> + return &g_array_index(bufs->array, VFIOStateBuffer, idx);
> +}
> +
> +static int vfio_load_state_buffer(void *opaque, char *data, size_t data_size,
> + Error **errp)
> +{
> + VFIODevice *vbasedev = opaque;
> + VFIOMigration *migration = vbasedev->migration;
> + VFIODeviceStatePacket *packet = (VFIODeviceStatePacket *)data;
> + VFIOStateBuffer *lb;
> +
> + /*
> + * Holding BQL here would violate the lock order and can cause
> + * a deadlock once we attempt to lock load_bufs_mutex below.
> + */
> + assert(!bql_locked());
> +
> + if (!migration->multifd_transfer) {
> + error_setg(errp,
> + "got device state packet but not doing multifd transfer");
> + return -1;
> + }
> +
> + if (data_size < sizeof(*packet)) {
> + error_setg(errp, "packet too short at %zu (min is %zu)",
> + data_size, sizeof(*packet));
> + return -1;
> + }
> +
> + if (packet->version != 0) {
> + error_setg(errp, "packet has unknown version %" PRIu32,
> + packet->version);
> + return -1;
> + }
> +
> + if (packet->idx == UINT32_MAX) {
> + error_setg(errp, "packet has too high idx %" PRIu32,
> + packet->idx);
> + return -1;
> + }
> +
> + trace_vfio_load_state_device_buffer_incoming(vbasedev->name, packet->idx);
> +
> + QEMU_LOCK_GUARD(&migration->load_bufs_mutex);
> +
> + /* config state packet should be the last one in the stream */
> + if (packet->flags & VFIO_DEVICE_STATE_CONFIG_STATE) {
> + migration->load_buf_idx_last = packet->idx;
> + }
> +
> + vfio_state_buffers_assert_init(&migration->load_bufs);
> + if (packet->idx >= vfio_state_buffers_size_get(&migration->load_bufs)) {
> + vfio_state_buffers_size_set(&migration->load_bufs, packet->idx + 1);
> + }
> +
> + lb = vfio_state_buffers_at(&migration->load_bufs, packet->idx);
> + if (lb->is_present) {
> + error_setg(errp, "state buffer %" PRIu32 " already filled",
> + packet->idx);
> + return -1;
> + }
> +
> + assert(packet->idx >= migration->load_buf_idx);
> +
> + migration->load_buf_queued_pending_buffers++;
> + if (migration->load_buf_queued_pending_buffers >
> + vbasedev->migration_max_queued_buffers) {
> + error_setg(errp,
> + "queuing state buffer %" PRIu32 " would exceed the max of %" PRIu64,
> + packet->idx, vbasedev->migration_max_queued_buffers);
> + return -1;
> + }
Copying my question from v2:
Should we count bytes instead of buffers? The current buffer size is 1MB
but this could change, and the normal user should not care or even know
what the buffer size is.
So maybe rename to migration_max_pending_bytes or such?
And Maciej replied:
Since it was Peter who asked for this limit to be introduced in the first
place, I would like to ask for his preference here.
@Peter: max queued buffers or bytes?
@Peter: max queued buffers or bytes?
So Peter, what's your opinion here?
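For illustration, byte-based accounting in vfio_load_state_buffer() could
look roughly like this (the field and property names are hypothetical):

    migration->load_buf_pending_bytes += data_size - sizeof(*packet);
    if (migration->load_buf_pending_bytes >
        vbasedev->migration_max_pending_bytes) {
        error_setg(errp,
                   "queuing state buffer %" PRIu32 " would exceed the max of %" PRIu64 " bytes",
                   packet->idx, vbasedev->migration_max_pending_bytes);
        return -1;
    }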
> +
> + lb->data = g_memdup2(&packet->data, data_size - sizeof(*packet));
> + lb->len = data_size - sizeof(*packet);
> + lb->is_present = true;
> +
> + qemu_cond_signal(&migration->load_bufs_buffer_ready_cond);
> +
> + return 0;
> +}
> +
> +static int vfio_load_device_config_state(QEMUFile *f, void *opaque);
> +
> +static int vfio_load_bufs_thread_load_config(VFIODevice *vbasedev)
> +{
> + VFIOMigration *migration = vbasedev->migration;
> + VFIOStateBuffer *lb;
> + g_autoptr(QIOChannelBuffer) bioc = NULL;
> + QEMUFile *f_out = NULL, *f_in = NULL;
> + uint64_t mig_header;
> + int ret;
> +
> + assert(migration->load_buf_idx == migration->load_buf_idx_last);
> + lb = vfio_state_buffers_at(&migration->load_bufs, migration->load_buf_idx);
> + assert(lb->is_present);
> +
> + bioc = qio_channel_buffer_new(lb->len);
> + qio_channel_set_name(QIO_CHANNEL(bioc), "vfio-device-config-load");
> +
> + f_out = qemu_file_new_output(QIO_CHANNEL(bioc));
> + qemu_put_buffer(f_out, (uint8_t *)lb->data, lb->len);
> +
> + ret = qemu_fflush(f_out);
> + if (ret) {
> + g_clear_pointer(&f_out, qemu_fclose);
> + return ret;
> + }
> +
> + qio_channel_io_seek(QIO_CHANNEL(bioc), 0, 0, NULL);
> + f_in = qemu_file_new_input(QIO_CHANNEL(bioc));
> +
> + mig_header = qemu_get_be64(f_in);
> + if (mig_header != VFIO_MIG_FLAG_DEV_CONFIG_STATE) {
> + g_clear_pointer(&f_out, qemu_fclose);
> + g_clear_pointer(&f_in, qemu_fclose);
> + return -EINVAL;
> + }
> +
> + bql_lock();
> + ret = vfio_load_device_config_state(f_in, vbasedev);
> + bql_unlock();
> +
> + g_clear_pointer(&f_out, qemu_fclose);
> + g_clear_pointer(&f_in, qemu_fclose);
> + if (ret < 0) {
> + return ret;
> + }
> +
> + return 0;
> +}
> +
> +static bool vfio_load_bufs_thread_want_abort(VFIODevice *vbasedev,
> + bool *abort_flag)
> +{
> + VFIOMigration *migration = vbasedev->migration;
> +
> + return migration->load_bufs_thread_want_exit || qatomic_read(abort_flag);
> +}
> +
> +static int vfio_load_bufs_thread(bool *abort_flag, void *opaque)
> +{
> + VFIODevice *vbasedev = opaque;
> + VFIOMigration *migration = vbasedev->migration;
> + QEMU_LOCK_GUARD(&migration->load_bufs_mutex);
Move QEMU_LOCK_GUARD() below the local var declaration?
I usually don't expect to see mutex locking as part of the local var
declaration block, which makes it easy to miss when reading the code.
(Although QEMU_LOCK_GUARD declares a local variable under the hood, it's
implicit and not visible to the user).
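I.e. something like:

    VFIODevice *vbasedev = opaque;
    VFIOMigration *migration = vbasedev->migration;
    int ret;

    QEMU_LOCK_GUARD(&migration->load_bufs_mutex);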
> + int ret;
> +
> + assert(migration->load_bufs_thread_running);
> +
> + while (!vfio_load_bufs_thread_want_abort(vbasedev, abort_flag)) {
> + VFIOStateBuffer *lb;
> + guint bufs_len;
> + bool starved;
> +
> + assert(migration->load_buf_idx <= migration->load_buf_idx_last);
> +
> + bufs_len = vfio_state_buffers_size_get(&migration->load_bufs);
> + if (migration->load_buf_idx >= bufs_len) {
> + assert(migration->load_buf_idx == bufs_len);
> + starved = true;
> + } else {
> + lb = vfio_state_buffers_at(&migration->load_bufs,
> + migration->load_buf_idx);
> + starved = !lb->is_present;
> + }
> +
> + if (starved) {
> + trace_vfio_load_state_device_buffer_starved(vbasedev->name,
> + migration->load_buf_idx);
> + qemu_cond_wait(&migration->load_bufs_buffer_ready_cond,
> + &migration->load_bufs_mutex);
> + continue;
> + }
> +
> + if (migration->load_buf_idx == migration->load_buf_idx_last) {
> + break;
> + }
> +
> + if (migration->load_buf_idx == 0) {
> + trace_vfio_load_state_device_buffer_start(vbasedev->name);
> + }
> +
> + if (lb->len) {
> + g_autofree char *buf = NULL;
> + size_t buf_len;
> + ssize_t wr_ret;
> + int errno_save;
> +
> + trace_vfio_load_state_device_buffer_load_start(vbasedev->name,
> + migration->load_buf_idx);
> +
> + /* lb might become re-allocated when we drop the lock */
> + buf = g_steal_pointer(&lb->data);
> + buf_len = lb->len;
> +
> + /*
> + * Loading data to the device takes a while,
> + * drop the lock during this process.
> + */
> + qemu_mutex_unlock(&migration->load_bufs_mutex);
> + wr_ret = write(migration->data_fd, buf, buf_len);
> + errno_save = errno;
> + qemu_mutex_lock(&migration->load_bufs_mutex);
> +
> + if (wr_ret < 0) {
> + ret = -errno_save;
> + goto ret_signal;
> + } else if (wr_ret < buf_len) {
> + ret = -EINVAL;
> + goto ret_signal;
> + }
Should we loop the write until reaching buf_len bytes?
A partial write is not considered an error according to the write(2) man page.
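E.g. an untested sketch, with buf_cur starting at buf and the mutex
handling around the write omitted for brevity:

            while (buf_len > 0) {
                wr_ret = write(migration->data_fd, buf_cur, buf_len);
                if (wr_ret < 0) {
                    if (errno == EINTR) {
                        continue;
                    }

                    ret = -errno;
                    goto ret_signal;
                }

                buf_cur += wr_ret;
                buf_len -= wr_ret;
            }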
Thanks.
> +
> + trace_vfio_load_state_device_buffer_load_end(vbasedev->name,
> + migration->load_buf_idx);
> + }
> +
> + assert(migration->load_buf_queued_pending_buffers > 0);
> + migration->load_buf_queued_pending_buffers--;
> +
> + if (migration->load_buf_idx == migration->load_buf_idx_last - 1) {
> + trace_vfio_load_state_device_buffer_end(vbasedev->name);
> + }
> +
> + migration->load_buf_idx++;
> + }
> +
> + if (vfio_load_bufs_thread_want_abort(vbasedev, abort_flag)) {
> + ret = -ECANCELED;
> + goto ret_signal;
> + }
> +
> + ret = vfio_load_bufs_thread_load_config(vbasedev);
> +
> +ret_signal:
> + migration->load_bufs_thread_running = false;
> + qemu_cond_signal(&migration->load_bufs_thread_finished_cond);
> +
> + return ret;
> +}
> +
> static int vfio_save_device_config_state(QEMUFile *f, void *opaque,
> Error **errp)
> {
> @@ -430,6 +726,12 @@ static bool vfio_precopy_supported(VFIODevice *vbasedev)
> return migration->mig_flags & VFIO_MIGRATION_PRE_COPY;
> }
>
> +static bool vfio_multifd_transfer_supported(void)
> +{
> + return migration_has_device_state_support() &&
> + migrate_send_switchover_start();
> +}
> +
> /* ---------------------------------------------------------------------- */
>
> static int vfio_save_prepare(void *opaque, Error **errp)
> @@ -695,17 +997,73 @@ static int vfio_load_setup(QEMUFile *f, void *opaque, Error **errp)
>
> assert(!migration->load_setup);
>
> + /*
> + * Make a copy of this setting at the start in case it is changed
> + * mid-migration.
> + */
> + if (vbasedev->migration_multifd_transfer == ON_OFF_AUTO_AUTO) {
> + migration->multifd_transfer = vfio_multifd_transfer_supported();
> + } else {
> + migration->multifd_transfer =
> + vbasedev->migration_multifd_transfer == ON_OFF_AUTO_ON;
> + }
> +
> + if (migration->multifd_transfer && !vfio_multifd_transfer_supported()) {
> + error_setg(errp,
> + "%s: Multifd device transfer requested but unsupported in the current config",
> + vbasedev->name);
> + return -EINVAL;
> + }
> +
> ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RESUMING,
> migration->device_state, errp);
> if (ret) {
> return ret;
> }
>
> + if (migration->multifd_transfer) {
> + assert(!migration->load_bufs.array);
> + vfio_state_buffers_init(&migration->load_bufs);
> +
> + qemu_mutex_init(&migration->load_bufs_mutex);
> +
> + migration->load_buf_idx = 0;
> + migration->load_buf_idx_last = UINT32_MAX;
> + migration->load_buf_queued_pending_buffers = 0;
> + qemu_cond_init(&migration->load_bufs_buffer_ready_cond);
> +
> + migration->load_bufs_thread_running = false;
> + migration->load_bufs_thread_want_exit = false;
> + qemu_cond_init(&migration->load_bufs_thread_finished_cond);
> + }
> +
> migration->load_setup = true;
>
> return 0;
> }
>
> +static void vfio_load_cleanup_load_bufs_thread(VFIODevice *vbasedev)
> +{
> + VFIOMigration *migration = vbasedev->migration;
> +
> + /* The lock order is load_bufs_mutex -> BQL so unlock BQL here first */
> + bql_unlock();
> + WITH_QEMU_LOCK_GUARD(&migration->load_bufs_mutex) {
> + if (!migration->load_bufs_thread_running) {
> + break;
> + }
> +
> + migration->load_bufs_thread_want_exit = true;
> +
> + qemu_cond_signal(&migration->load_bufs_buffer_ready_cond);
> + qemu_cond_wait(&migration->load_bufs_thread_finished_cond,
> + &migration->load_bufs_mutex);
> +
> + assert(!migration->load_bufs_thread_running);
> + }
> + bql_lock();
> +}
> +
> static int vfio_load_cleanup(void *opaque)
> {
> VFIODevice *vbasedev = opaque;
> @@ -715,7 +1073,19 @@ static int vfio_load_cleanup(void *opaque)
> return 0;
> }
>
> + if (migration->multifd_transfer) {
> + vfio_load_cleanup_load_bufs_thread(vbasedev);
> + }
> +
> vfio_migration_cleanup(vbasedev);
> +
> + if (migration->multifd_transfer) {
> + qemu_cond_destroy(&migration->load_bufs_thread_finished_cond);
> + vfio_state_buffers_destroy(&migration->load_bufs);
> + qemu_cond_destroy(&migration->load_bufs_buffer_ready_cond);
> + qemu_mutex_destroy(&migration->load_bufs_mutex);
> + }
> +
> migration->load_setup = false;
> trace_vfio_load_cleanup(vbasedev->name);
>
> @@ -725,6 +1095,7 @@ static int vfio_load_cleanup(void *opaque)
> static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
> {
> VFIODevice *vbasedev = opaque;
> + VFIOMigration *migration = vbasedev->migration;
> int ret = 0;
> uint64_t data;
>
> @@ -736,6 +1107,12 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
> switch (data) {
> case VFIO_MIG_FLAG_DEV_CONFIG_STATE:
> {
> + if (migration->multifd_transfer) {
> + error_report("%s: got DEV_CONFIG_STATE but doing multifd transfer",
> + vbasedev->name);
> + return -EINVAL;
> + }
> +
> return vfio_load_device_config_state(f, opaque);
> }
> case VFIO_MIG_FLAG_DEV_SETUP_STATE:
> @@ -801,6 +1178,29 @@ static bool vfio_switchover_ack_needed(void *opaque)
> return vfio_precopy_supported(vbasedev);
> }
>
> +static int vfio_switchover_start(void *opaque)
> +{
> + VFIODevice *vbasedev = opaque;
> + VFIOMigration *migration = vbasedev->migration;
> +
> + if (!migration->multifd_transfer) {
> + /* Load thread is only used for multifd transfer */
> + return 0;
> + }
> +
> + /* The lock order is load_bufs_mutex -> BQL so unlock BQL here first */
> + bql_unlock();
> + WITH_QEMU_LOCK_GUARD(&migration->load_bufs_mutex) {
> + assert(!migration->load_bufs_thread_running);
> + migration->load_bufs_thread_running = true;
> + }
> + bql_lock();
> +
> + qemu_loadvm_start_load_thread(vfio_load_bufs_thread, vbasedev);
> +
> + return 0;
> +}
> +
> static const SaveVMHandlers savevm_vfio_handlers = {
> .save_prepare = vfio_save_prepare,
> .save_setup = vfio_save_setup,
> @@ -814,7 +1214,9 @@ static const SaveVMHandlers savevm_vfio_handlers = {
> .load_setup = vfio_load_setup,
> .load_cleanup = vfio_load_cleanup,
> .load_state = vfio_load_state,
> + .load_state_buffer = vfio_load_state_buffer,
> .switchover_ack_needed = vfio_switchover_ack_needed,
> + .switchover_start = vfio_switchover_start,
> };
>
> /* ---------------------------------------------------------------------- */
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 9d547cb5cdff..72d62ada8a39 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -3384,6 +3384,8 @@ static Property vfio_pci_dev_properties[] = {
> vbasedev.migration_multifd_transfer,
> qdev_prop_on_off_auto_mutable, OnOffAuto,
> .set_default = true, .defval.i = ON_OFF_AUTO_AUTO),
> + DEFINE_PROP_UINT64("x-migration-max-queued-buffers", VFIOPCIDevice,
> + vbasedev.migration_max_queued_buffers, UINT64_MAX),
> DEFINE_PROP_BOOL("migration-events", VFIOPCIDevice,
> vbasedev.migration_events, false),
> DEFINE_PROP_BOOL("x-no-mmap", VFIOPCIDevice, vbasedev.no_mmap, false),
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index 1bebe9877d88..418b378ebd29 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -153,6 +153,12 @@ vfio_load_device_config_state_start(const char *name) " (%s)"
> vfio_load_device_config_state_end(const char *name) " (%s)"
> vfio_load_state(const char *name, uint64_t data) " (%s) data 0x%"PRIx64
> vfio_load_state_device_data(const char *name, uint64_t data_size, int ret) " (%s) size %"PRIu64" ret %d"
> +vfio_load_state_device_buffer_incoming(const char *name, uint32_t idx) " (%s) idx %"PRIu32
> +vfio_load_state_device_buffer_start(const char *name) " (%s)"
> +vfio_load_state_device_buffer_starved(const char *name, uint32_t idx) " (%s) idx %"PRIu32
> +vfio_load_state_device_buffer_load_start(const char *name, uint32_t idx) " (%s) idx %"PRIu32
> +vfio_load_state_device_buffer_load_end(const char *name, uint32_t idx) " (%s) idx %"PRIu32
> +vfio_load_state_device_buffer_end(const char *name) " (%s)"
> vfio_migration_realize(const char *name) " (%s)"
> vfio_migration_set_device_state(const char *name, const char *state) " (%s) state %s"
> vfio_migration_set_state(const char *name, const char *new_state, const char *recover_state) " (%s) new state %s, recover state %s"
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index b1c03a82eec8..0954d6981a22 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -61,6 +61,11 @@ typedef struct VFIORegion {
> uint8_t nr; /* cache the region number for debug */
> } VFIORegion;
>
> +/* type safety */
> +typedef struct VFIOStateBuffers {
> + GArray *array;
> +} VFIOStateBuffers;
> +
> typedef struct VFIOMigration {
> struct VFIODevice *vbasedev;
> VMChangeStateEntry *vm_state;
> @@ -73,10 +78,23 @@ typedef struct VFIOMigration {
> uint64_t mig_flags;
> uint64_t precopy_init_size;
> uint64_t precopy_dirty_size;
> + bool multifd_transfer;
> bool initial_data_sent;
>
> bool event_save_iterate_started;
> bool event_precopy_empty_hit;
> +
> + QemuThread load_bufs_thread;
> + bool load_bufs_thread_running;
> + bool load_bufs_thread_want_exit;
> +
> + VFIOStateBuffers load_bufs;
> + QemuCond load_bufs_buffer_ready_cond;
> + QemuCond load_bufs_thread_finished_cond;
> + QemuMutex load_bufs_mutex; /* Lock order: this lock -> BQL */
> + uint32_t load_buf_idx;
> + uint32_t load_buf_idx_last;
> + uint32_t load_buf_queued_pending_buffers;
> } VFIOMigration;
>
> struct VFIOGroup;
> @@ -136,6 +154,7 @@ typedef struct VFIODevice {
> OnOffAuto enable_migration;
> OnOffAuto migration_multifd_transfer;
> bool migration_events;
> + uint64_t migration_max_queued_buffers;
> VFIODeviceOps *ops;
> unsigned int num_irqs;
> unsigned int num_regions;
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 22/24] vfio/migration: Multifd device state transfer support - receive side
2024-12-09 9:13 ` Avihai Horon
@ 2024-12-10 23:06 ` Maciej S. Szmigiero
2024-12-12 14:33 ` Avihai Horon
0 siblings, 1 reply; 140+ messages in thread
From: Maciej S. Szmigiero @ 2024-12-10 23:06 UTC (permalink / raw)
To: Avihai Horon
Cc: Alex Williamson, Peter Xu, Fabiano Rosas, Cédric Le Goater,
Eric Blake, Markus Armbruster, Daniel P . Berrangé,
Joao Martins, qemu-devel
Hi Avihai,
On 9.12.2024 10:13, Avihai Horon wrote:
> Hi Maciej,
>
> On 17/11/2024 21:20, Maciej S. Szmigiero wrote:
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> The multifd received data needs to be reassembled since device state
>> packets sent via different multifd channels can arrive out-of-order.
>>
>> Therefore, each VFIO device state packet carries a header indicating its
>> position in the stream.
>>
>> The last such VFIO device state packet should have
>> VFIO_DEVICE_STATE_CONFIG_STATE flag set and carry the device config state.
>>
>> Since it's important to finish loading device state transferred via the
>> main migration channel (via save_live_iterate SaveVMHandler) before
>> starting loading the data asynchronously transferred via multifd the thread
>> doing the actual loading of the multifd transferred data is only started
>> from switchover_start SaveVMHandler.
>>
>> switchover_start handler is called when MIG_CMD_SWITCHOVER_START
>> sub-command of QEMU_VM_COMMAND is received via the main migration channel.
>>
>> This sub-command is only sent after all save_live_iterate data have already
>> been posted so it is safe to commence loading of the multifd-transferred
>> device state upon receiving it - loading of save_live_iterate data happens
>> synchronously in the main migration thread (much like the processing of
>> MIG_CMD_SWITCHOVER_START) so by the time MIG_CMD_SWITCHOVER_START is
>> processed all the preceding data must have already been loaded.
>>
>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>> ---
>> hw/vfio/migration.c | 402 ++++++++++++++++++++++++++++++++++
>> hw/vfio/pci.c | 2 +
>> hw/vfio/trace-events | 6 +
>> include/hw/vfio/vfio-common.h | 19 ++
>> 4 files changed, 429 insertions(+)
>>
>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>> index 683f2ae98d5e..b54879fe6209 100644
>> --- a/hw/vfio/migration.c
>> +++ b/hw/vfio/migration.c
>> @@ -15,6 +15,7 @@
>> #include <linux/vfio.h>
>> #include <sys/ioctl.h>
>>
>> +#include "io/channel-buffer.h"
>> #include "sysemu/runstate.h"
>> #include "hw/vfio/vfio-common.h"
>> #include "migration/misc.h"
>> @@ -55,6 +56,15 @@
>> */
>> #define VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE (1 * MiB)
>>
>> +#define VFIO_DEVICE_STATE_CONFIG_STATE (1)
>> +
>> +typedef struct VFIODeviceStatePacket {
>> + uint32_t version;
>> + uint32_t idx;
>> + uint32_t flags;
>> + uint8_t data[0];
>> +} QEMU_PACKED VFIODeviceStatePacket;
>> +
>> static int64_t bytes_transferred;
>>
>> static const char *mig_state_to_str(enum vfio_device_mig_state state)
>> @@ -254,6 +264,292 @@ static int vfio_load_buffer(QEMUFile *f, VFIODevice *vbasedev,
>> return ret;
>> }
>>
>> +typedef struct VFIOStateBuffer {
>> + bool is_present;
>> + char *data;
>> + size_t len;
>> +} VFIOStateBuffer;
>> +
>> +static void vfio_state_buffer_clear(gpointer data)
>> +{
>> + VFIOStateBuffer *lb = data;
>> +
>> + if (!lb->is_present) {
>> + return;
>> + }
>> +
>> + g_clear_pointer(&lb->data, g_free);
>> + lb->is_present = false;
>> +}
>> +
>> +static void vfio_state_buffers_init(VFIOStateBuffers *bufs)
>> +{
>> + bufs->array = g_array_new(FALSE, TRUE, sizeof(VFIOStateBuffer));
>> + g_array_set_clear_func(bufs->array, vfio_state_buffer_clear);
>> +}
>> +
>> +static void vfio_state_buffers_destroy(VFIOStateBuffers *bufs)
>> +{
>> + g_clear_pointer(&bufs->array, g_array_unref);
>> +}
>> +
>> +static void vfio_state_buffers_assert_init(VFIOStateBuffers *bufs)
>> +{
>> + assert(bufs->array);
>> +}
>> +
>> +static guint vfio_state_buffers_size_get(VFIOStateBuffers *bufs)
>> +{
>> + return bufs->array->len;
>> +}
>> +
>> +static void vfio_state_buffers_size_set(VFIOStateBuffers *bufs, guint size)
>> +{
>> + g_array_set_size(bufs->array, size);
>> +}
>
> The above three functions seem a bit too specific.
You asked to have "full API for this [VFIOStateBuffers - MSS],
that wraps the g_array_* calls and holds the extra members"
during the review of the previous version of this patch set so here it is.
>
> How about:
> Instead of size_set and assert_init, introduce a vfio_state_buffers_insert() function that handles buffer insertion to the array from the validated packet.
>
> Instead of size_get, introduce vfio_state_buffers_get() that handles the array length and is_present checks.
> We can also add a vfio_state_buffer_write() function that handles writing the buffer to the device.
>
> IMHO this will also make vfio_load_state_buffer() and vfio_load_bufs_thread(), which are rather long, clearer.
I think it would be even nicer to keep the vfio_state_buffer_*() methods as thin
wrappers (low level API) and introduce an intermediate API doing more or less
what you have described above to simplify vfio_load_bufs_thread() (and possibly
vfio_load_state_buffer() too).
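For example, an intermediate helper like the following could hide the
"starved" logic from vfio_load_bufs_thread() (just a sketch, the name is
made up):

/* Returns NULL if the next buffer to load hasn't arrived yet */
static VFIOStateBuffer *vfio_load_state_buffer_get(VFIOMigration *migration)
{
    VFIOStateBuffer *lb;
    guint bufs_len;

    bufs_len = vfio_state_buffers_size_get(&migration->load_bufs);
    if (migration->load_buf_idx >= bufs_len) {
        assert(migration->load_buf_idx == bufs_len);
        return NULL;
    }

    lb = vfio_state_buffers_at(&migration->load_bufs,
                               migration->load_buf_idx);
    return lb->is_present ? lb : NULL;
}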
>> +
>> +static VFIOStateBuffer *vfio_state_buffers_at(VFIOStateBuffers *bufs, guint idx)
>> +{
>> + return &g_array_index(bufs->array, VFIOStateBuffer, idx);
>> +}
>> +
>> +static int vfio_load_state_buffer(void *opaque, char *data, size_t data_size,
>> + Error **errp)
>> +{
>> + VFIODevice *vbasedev = opaque;
>> + VFIOMigration *migration = vbasedev->migration;
>> + VFIODeviceStatePacket *packet = (VFIODeviceStatePacket *)data;
>> + VFIOStateBuffer *lb;
>> +
>> + /*
>> + * Holding BQL here would violate the lock order and can cause
>> + * a deadlock once we attempt to lock load_bufs_mutex below.
>> + */
>> + assert(!bql_locked());
>> +
>> + if (!migration->multifd_transfer) {
>> + error_setg(errp,
>> + "got device state packet but not doing multifd transfer");
>> + return -1;
>> + }
>> +
>> + if (data_size < sizeof(*packet)) {
>> + error_setg(errp, "packet too short at %zu (min is %zu)",
>> + data_size, sizeof(*packet));
>> + return -1;
>> + }
>> +
>> + if (packet->version != 0) {
>> + error_setg(errp, "packet has unknown version %" PRIu32,
>> + packet->version);
>> + return -1;
>> + }
>> +
>> + if (packet->idx == UINT32_MAX) {
>> + error_setg(errp, "packet has too high idx %" PRIu32,
>> + packet->idx);
>> + return -1;
>> + }
>> +
>> + trace_vfio_load_state_device_buffer_incoming(vbasedev->name, packet->idx);
>> +
>> + QEMU_LOCK_GUARD(&migration->load_bufs_mutex);
>> +
>> + /* config state packet should be the last one in the stream */
>> + if (packet->flags & VFIO_DEVICE_STATE_CONFIG_STATE) {
>> + migration->load_buf_idx_last = packet->idx;
>> + }
>> +
>> + vfio_state_buffers_assert_init(&migration->load_bufs);
>> + if (packet->idx >= vfio_state_buffers_size_get(&migration->load_bufs)) {
>> + vfio_state_buffers_size_set(&migration->load_bufs, packet->idx + 1);
>> + }
>> +
>> + lb = vfio_state_buffers_at(&migration->load_bufs, packet->idx);
>> + if (lb->is_present) {
>> + error_setg(errp, "state buffer %" PRIu32 " already filled",
>> + packet->idx);
>> + return -1;
>> + }
>> +
>> + assert(packet->idx >= migration->load_buf_idx);
>> +
>> + migration->load_buf_queued_pending_buffers++;
>> + if (migration->load_buf_queued_pending_buffers >
>> + vbasedev->migration_max_queued_buffers) {
>> + error_setg(errp,
>> + "queuing state buffer %" PRIu32 " would exceed the max of %" PRIu64,
>> + packet->idx, vbasedev->migration_max_queued_buffers);
>> + return -1;
>> + }
>
> Copying my question from v2:
>
> Should we count bytes instead of buffers? Current buffer size is 1MB but this could change, and the normal user should not care or know what is the buffer size.
> So maybe rename to migration_max_pending_bytes or such?
>
> And Maciej replied:
>
> Since it's Peter that asked for this limit to be introduced in the first place
> I would like to ask him what his preference here.
> @Peter: max queued buffers or bytes?
>
> So Peter, what's your opinion here?
>
>> +
>> + lb->data = g_memdup2(&packet->data, data_size - sizeof(*packet));
>> + lb->len = data_size - sizeof(*packet);
>> + lb->is_present = true;
>> +
>> + qemu_cond_signal(&migration->load_bufs_buffer_ready_cond);
>> +
>> + return 0;
>> +}
>> +
>> +static int vfio_load_device_config_state(QEMUFile *f, void *opaque);
>> +
>> +static int vfio_load_bufs_thread_load_config(VFIODevice *vbasedev)
>> +{
>> + VFIOMigration *migration = vbasedev->migration;
>> + VFIOStateBuffer *lb;
>> + g_autoptr(QIOChannelBuffer) bioc = NULL;
>> + QEMUFile *f_out = NULL, *f_in = NULL;
>> + uint64_t mig_header;
>> + int ret;
>> +
>> + assert(migration->load_buf_idx == migration->load_buf_idx_last);
>> + lb = vfio_state_buffers_at(&migration->load_bufs, migration->load_buf_idx);
>> + assert(lb->is_present);
>> +
>> + bioc = qio_channel_buffer_new(lb->len);
>> + qio_channel_set_name(QIO_CHANNEL(bioc), "vfio-device-config-load");
>> +
>> + f_out = qemu_file_new_output(QIO_CHANNEL(bioc));
>> + qemu_put_buffer(f_out, (uint8_t *)lb->data, lb->len);
>> +
>> + ret = qemu_fflush(f_out);
>> + if (ret) {
>> + g_clear_pointer(&f_out, qemu_fclose);
>> + return ret;
>> + }
>> +
>> + qio_channel_io_seek(QIO_CHANNEL(bioc), 0, 0, NULL);
>> + f_in = qemu_file_new_input(QIO_CHANNEL(bioc));
>> +
>> + mig_header = qemu_get_be64(f_in);
>> + if (mig_header != VFIO_MIG_FLAG_DEV_CONFIG_STATE) {
>> + g_clear_pointer(&f_out, qemu_fclose);
>> + g_clear_pointer(&f_in, qemu_fclose);
>> + return -EINVAL;
>> + }
>> +
>> + bql_lock();
>> + ret = vfio_load_device_config_state(f_in, vbasedev);
>> + bql_unlock();
>> +
>> + g_clear_pointer(&f_out, qemu_fclose);
>> + g_clear_pointer(&f_in, qemu_fclose);
>> + if (ret < 0) {
>> + return ret;
>> + }
>> +
>> + return 0;
>> +}
>> +
>> +static bool vfio_load_bufs_thread_want_abort(VFIODevice *vbasedev,
>> + bool *abort_flag)
>> +{
>> + VFIOMigration *migration = vbasedev->migration;
>> +
>> + return migration->load_bufs_thread_want_exit || qatomic_read(abort_flag);
>> +}
>> +
>> +static int vfio_load_bufs_thread(bool *abort_flag, void *opaque)
>> +{
>> + VFIODevice *vbasedev = opaque;
>> + VFIOMigration *migration = vbasedev->migration;
>> + QEMU_LOCK_GUARD(&migration->load_bufs_mutex);
>
> Move QEMU_LOCK_GUARD() below the local var declaration?
> I usually don't expect to see mutex lockings as part of local var declaration block, which makes it easy to miss when reading the code.
> (Although QEMU_LOCK_GUARD declares a local variable under the hood, it's implicit and not visible to the user).
I guess you mean moving it..
>> + int ret;
^ ..here.
Will do.
>> + assert(migration->load_bufs_thread_running);
>> +
>> + while (!vfio_load_bufs_thread_want_abort(vbasedev, abort_flag)) {
>> + VFIOStateBuffer *lb;
>> + guint bufs_len;
>> + bool starved;
>> +
>> + assert(migration->load_buf_idx <= migration->load_buf_idx_last);
>> +
>> + bufs_len = vfio_state_buffers_size_get(&migration->load_bufs);
>> + if (migration->load_buf_idx >= bufs_len) {
>> + assert(migration->load_buf_idx == bufs_len);
>> + starved = true;
>> + } else {
>> + lb = vfio_state_buffers_at(&migration->load_bufs,
>> + migration->load_buf_idx);
>> + starved = !lb->is_present;
>> + }
>> +
>> + if (starved) {
>> + trace_vfio_load_state_device_buffer_starved(vbasedev->name,
>> + migration->load_buf_idx);
>> + qemu_cond_wait(&migration->load_bufs_buffer_ready_cond,
>> + &migration->load_bufs_mutex);
>> + continue;
>> + }
>> +
>> + if (migration->load_buf_idx == migration->load_buf_idx_last) {
>> + break;
>> + }
>> +
>> + if (migration->load_buf_idx == 0) {
>> + trace_vfio_load_state_device_buffer_start(vbasedev->name);
>> + }
>> +
>> + if (lb->len) {
>> + g_autofree char *buf = NULL;
>> + size_t buf_len;
>> + ssize_t wr_ret;
>> + int errno_save;
>> +
>> + trace_vfio_load_state_device_buffer_load_start(vbasedev->name,
>> + migration->load_buf_idx);
>> +
>> + /* lb might become re-allocated when we drop the lock */
>> + buf = g_steal_pointer(&lb->data);
>> + buf_len = lb->len;
>> +
>> + /*
>> + * Loading data to the device takes a while,
>> + * drop the lock during this process.
>> + */
>> + qemu_mutex_unlock(&migration->load_bufs_mutex);
>> + wr_ret = write(migration->data_fd, buf, buf_len);
>> + errno_save = errno;
>> + qemu_mutex_lock(&migration->load_bufs_mutex);
>> +
>> + if (wr_ret < 0) {
>> + ret = -errno_save;
>> + goto ret_signal;
>> + } else if (wr_ret < buf_len) {
>> + ret = -EINVAL;
>> + goto ret_signal;
>> + }
>
> Should we loop the write until reaching buf_len bytes?
> Partial write is not considered error according to write(2) manpage.
Yes, it's probably better to allow partial writes in case
some VFIO kernel driver actually makes use of them.
>
> Thanks.
Thanks,
Maciej
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 22/24] vfio/migration: Multifd device state transfer support - receive side
2024-12-10 23:06 ` Maciej S. Szmigiero
@ 2024-12-12 14:33 ` Avihai Horon
0 siblings, 0 replies; 140+ messages in thread
From: Avihai Horon @ 2024-12-12 14:33 UTC (permalink / raw)
To: Maciej S. Szmigiero
Cc: Alex Williamson, Peter Xu, Fabiano Rosas, Cédric Le Goater,
Eric Blake, Markus Armbruster, Daniel P . Berrangé,
Joao Martins, qemu-devel
On 11/12/2024 1:06, Maciej S. Szmigiero wrote:
> Hi Avihai,
>
> On 9.12.2024 10:13, Avihai Horon wrote:
>> Hi Maciej,
>>
>> On 17/11/2024 21:20, Maciej S. Szmigiero wrote:
>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>
>>> The multifd received data needs to be reassembled since device state
>>> packets sent via different multifd channels can arrive out-of-order.
>>>
>>> Therefore, each VFIO device state packet carries a header indicating
>>> its
>>> position in the stream.
>>>
>>> The last such VFIO device state packet should have
>>> VFIO_DEVICE_STATE_CONFIG_STATE flag set and carry the device config
>>> state.
>>>
>>> Since it's important to finish loading device state transferred via the
>>> main migration channel (via save_live_iterate SaveVMHandler) before
>>> starting loading the data asynchronously transferred via multifd the
>>> thread
>>> doing the actual loading of the multifd transferred data is only
>>> started
>>> from switchover_start SaveVMHandler.
>>>
>>> switchover_start handler is called when MIG_CMD_SWITCHOVER_START
>>> sub-command of QEMU_VM_COMMAND is received via the main migration
>>> channel.
>>>
>>> This sub-command is only sent after all save_live_iterate data have
>>> already
>>> been posted so it is safe to commence loading of the
>>> multifd-transferred
>>> device state upon receiving it - loading of save_live_iterate data
>>> happens
>>> synchronously in the main migration thread (much like the processing of
>>> MIG_CMD_SWITCHOVER_START) so by the time MIG_CMD_SWITCHOVER_START is
>>> processed all the preceding data must have already been loaded.
>>>
>>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>>> ---
>>> hw/vfio/migration.c | 402
>>> ++++++++++++++++++++++++++++++++++
>>> hw/vfio/pci.c | 2 +
>>> hw/vfio/trace-events | 6 +
>>> include/hw/vfio/vfio-common.h | 19 ++
>>> 4 files changed, 429 insertions(+)
>>>
>>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>>> index 683f2ae98d5e..b54879fe6209 100644
>>> --- a/hw/vfio/migration.c
>>> +++ b/hw/vfio/migration.c
>>> @@ -15,6 +15,7 @@
>>> #include <linux/vfio.h>
>>> #include <sys/ioctl.h>
>>>
>>> +#include "io/channel-buffer.h"
>>> #include "sysemu/runstate.h"
>>> #include "hw/vfio/vfio-common.h"
>>> #include "migration/misc.h"
>>> @@ -55,6 +56,15 @@
>>> */
>>> #define VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE (1 * MiB)
>>>
>>> +#define VFIO_DEVICE_STATE_CONFIG_STATE (1)
>>> +
>>> +typedef struct VFIODeviceStatePacket {
>>> + uint32_t version;
>>> + uint32_t idx;
>>> + uint32_t flags;
>>> + uint8_t data[0];
>>> +} QEMU_PACKED VFIODeviceStatePacket;
>>> +
>>> static int64_t bytes_transferred;
>>>
>>> static const char *mig_state_to_str(enum vfio_device_mig_state state)
>>> @@ -254,6 +264,292 @@ static int vfio_load_buffer(QEMUFile *f,
>>> VFIODevice *vbasedev,
>>> return ret;
>>> }
>>>
>>> +typedef struct VFIOStateBuffer {
>>> + bool is_present;
>>> + char *data;
>>> + size_t len;
>>> +} VFIOStateBuffer;
>>> +
>>> +static void vfio_state_buffer_clear(gpointer data)
>>> +{
>>> + VFIOStateBuffer *lb = data;
>>> +
>>> + if (!lb->is_present) {
>>> + return;
>>> + }
>>> +
>>> + g_clear_pointer(&lb->data, g_free);
>>> + lb->is_present = false;
>>> +}
>>> +
>>> +static void vfio_state_buffers_init(VFIOStateBuffers *bufs)
>>> +{
>>> + bufs->array = g_array_new(FALSE, TRUE, sizeof(VFIOStateBuffer));
>>> + g_array_set_clear_func(bufs->array, vfio_state_buffer_clear);
>>> +}
>>> +
>>> +static void vfio_state_buffers_destroy(VFIOStateBuffers *bufs)
>>> +{
>>> + g_clear_pointer(&bufs->array, g_array_unref);
>>> +}
>>> +
>>> +static void vfio_state_buffers_assert_init(VFIOStateBuffers *bufs)
>>> +{
>>> + assert(bufs->array);
>>> +}
>>> +
>>> +static guint vfio_state_buffers_size_get(VFIOStateBuffers *bufs)
>>> +{
>>> + return bufs->array->len;
>>> +}
>>> +
>>> +static void vfio_state_buffers_size_set(VFIOStateBuffers *bufs,
>>> guint size)
>>> +{
>>> + g_array_set_size(bufs->array, size);
>>> +}
>>
>> The above three functions seem a bit too specific.
>
> You asked to have a "full API for this [VFIOStateBuffers - MSS],
> that wraps the g_array_* calls and holds the extra members"
> during the review of the previous version of this patch set, so here
> it is.
>
>>
>> How about:
>> Instead of size_set and assert_init, introduce a
>> vfio_state_buffers_insert() function that handles buffer insertion to
>> the array from the validated packet.
>>
>> Instead of size_get, introduce vfio_state_buffers_get() that handles
>> the array length and is_present checks.
>> We can also add a vfio_state_buffer_write() function that handles
>> writing the buffer to the device.
>>
>> IMHO this will also make vfio_load_state_buffer() and
>> vfio_load_bufs_thread(), which are rather long, clearer.
>
> I think it would be even nicer to keep the vfio_state_buffer_*() methods
> as thin wrappers (a low-level API) and introduce an intermediate API
> doing more or less what you have described above to simplify
> vfio_load_bufs_thread() (and possibly vfio_load_state_buffer() too).
Yes, enriching the APIs sounds good.
Thanks.
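
For illustration, such an intermediate layer could look roughly like the
sketch below. The function names (vfio_state_buffers_insert/get) follow the
review suggestion and are hypothetical; only the thin wrappers above are
from the actual patch.

/*
 * Sketch of the suggested intermediate helpers, layered on the thin
 * vfio_state_buffers_*() wrappers from the patch.
 */
static VFIOStateBuffer *vfio_state_buffers_insert(VFIOStateBuffers *bufs,
                                                  guint idx)
{
    /* grow the array on demand so that index idx becomes addressable */
    if (idx >= vfio_state_buffers_size_get(bufs)) {
        vfio_state_buffers_size_set(bufs, idx + 1);
    }

    return vfio_state_buffers_at(bufs, idx);
}

static VFIOStateBuffer *vfio_state_buffers_get(VFIOStateBuffers *bufs,
                                               guint idx)
{
    VFIOStateBuffer *lb;

    /* NULL means "this buffer hasn't been received yet" */
    if (idx >= vfio_state_buffers_size_get(bufs)) {
        return NULL;
    }

    lb = vfio_state_buffers_at(bufs, idx);
    return lb->is_present ? lb : NULL;
}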
>
>>> +
>>> +static VFIOStateBuffer *vfio_state_buffers_at(VFIOStateBuffers *bufs, guint idx)
>>> +{
>>> + return &g_array_index(bufs->array, VFIOStateBuffer, idx);
>>> +}
>>> +
>>> +static int vfio_load_state_buffer(void *opaque, char *data, size_t data_size,
>>> + Error **errp)
>>> +{
>>> + VFIODevice *vbasedev = opaque;
>>> + VFIOMigration *migration = vbasedev->migration;
>>> + VFIODeviceStatePacket *packet = (VFIODeviceStatePacket *)data;
>>> + VFIOStateBuffer *lb;
>>> +
>>> + /*
>>> + * Holding BQL here would violate the lock order and can cause
>>> + * a deadlock once we attempt to lock load_bufs_mutex below.
>>> + */
>>> + assert(!bql_locked());
>>> +
>>> + if (!migration->multifd_transfer) {
>>> + error_setg(errp,
>>> + "got device state packet but not doing multifd
>>> transfer");
>>> + return -1;
>>> + }
>>> +
>>> + if (data_size < sizeof(*packet)) {
>>> + error_setg(errp, "packet too short at %zu (min is %zu)",
>>> + data_size, sizeof(*packet));
>>> + return -1;
>>> + }
>>> +
>>> + if (packet->version != 0) {
>>> + error_setg(errp, "packet has unknown version %" PRIu32,
>>> + packet->version);
>>> + return -1;
>>> + }
>>> +
>>> + if (packet->idx == UINT32_MAX) {
>>> + error_setg(errp, "packet has too high idx %" PRIu32,
>>> + packet->idx);
>>> + return -1;
>>> + }
>>> +
>>> + trace_vfio_load_state_device_buffer_incoming(vbasedev->name, packet->idx);
>>> +
>>> + QEMU_LOCK_GUARD(&migration->load_bufs_mutex);
>>> +
>>> + /* config state packet should be the last one in the stream */
>>> + if (packet->flags & VFIO_DEVICE_STATE_CONFIG_STATE) {
>>> + migration->load_buf_idx_last = packet->idx;
>>> + }
>>> +
>>> + vfio_state_buffers_assert_init(&migration->load_bufs);
>>> + if (packet->idx >= vfio_state_buffers_size_get(&migration->load_bufs)) {
>>> + vfio_state_buffers_size_set(&migration->load_bufs, packet->idx + 1);
>>> + }
>>> +
>>> + lb = vfio_state_buffers_at(&migration->load_bufs, packet->idx);
>>> + if (lb->is_present) {
>>> + error_setg(errp, "state buffer %" PRIu32 " already filled",
>>> + packet->idx);
>>> + return -1;
>>> + }
>>> +
>>> + assert(packet->idx >= migration->load_buf_idx);
>>> +
>>> + migration->load_buf_queued_pending_buffers++;
>>> + if (migration->load_buf_queued_pending_buffers >
>>> + vbasedev->migration_max_queued_buffers) {
>>> + error_setg(errp,
>>> + "queuing state buffer %" PRIu32 " would exceed
>>> the max of %" PRIu64,
>>> + packet->idx, vbasedev->migration_max_queued_buffers);
>>> + return -1;
>>> + }
>>
>> Copying my question from v2:
>>
>> Should we count bytes instead of buffers? Current buffer size is 1MB
>> but this could change, and the normal user should not care or know
>> what is the buffer size.
>> So maybe rename to migration_max_pending_bytes or such?
>>
>> And Maciej replied:
>>
>> Since it's Peter that asked for this limit to be introduced in the
>> first place
>> I would like to ask him what his preference here.
>> @Peter: max queued buffers or bytes?
>>
>> So Peter, what's your opinion here?
>>
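
For comparison, a byte-based variant of the limit check could look roughly
like the sketch below, where load_buf_queued_pending_bytes and
migration_max_pending_bytes are hypothetical renames of the fields used in
the patch:

/* sketch: account queued bytes instead of a queued buffer count */
migration->load_buf_queued_pending_bytes += data_size - sizeof(*packet);
if (migration->load_buf_queued_pending_bytes >
    vbasedev->migration_max_pending_bytes) {
    error_setg(errp,
               "queuing state buffer %" PRIu32
               " would exceed the max of %" PRIu64 " pending bytes",
               packet->idx, vbasedev->migration_max_pending_bytes);
    return -1;
}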
>>> +
>>> + lb->data = g_memdup2(&packet->data, data_size - sizeof(*packet));
>>> + lb->len = data_size - sizeof(*packet);
>>> + lb->is_present = true;
>>> +
>>> + qemu_cond_signal(&migration->load_bufs_buffer_ready_cond);
>>> +
>>> + return 0;
>>> +}
>>> +
>>> +static int vfio_load_device_config_state(QEMUFile *f, void *opaque);
>>> +
>>> +static int vfio_load_bufs_thread_load_config(VFIODevice *vbasedev)
>>> +{
>>> + VFIOMigration *migration = vbasedev->migration;
>>> + VFIOStateBuffer *lb;
>>> + g_autoptr(QIOChannelBuffer) bioc = NULL;
>>> + QEMUFile *f_out = NULL, *f_in = NULL;
>>> + uint64_t mig_header;
>>> + int ret;
>>> +
>>> + assert(migration->load_buf_idx == migration->load_buf_idx_last);
>>> + lb = vfio_state_buffers_at(&migration->load_bufs, migration->load_buf_idx);
>>> + assert(lb->is_present);
>>> +
>>> + bioc = qio_channel_buffer_new(lb->len);
>>> + qio_channel_set_name(QIO_CHANNEL(bioc), "vfio-device-config-load");
>>> +
>>> + f_out = qemu_file_new_output(QIO_CHANNEL(bioc));
>>> + qemu_put_buffer(f_out, (uint8_t *)lb->data, lb->len);
>>> +
>>> + ret = qemu_fflush(f_out);
>>> + if (ret) {
>>> + g_clear_pointer(&f_out, qemu_fclose);
>>> + return ret;
>>> + }
>>> +
>>> + qio_channel_io_seek(QIO_CHANNEL(bioc), 0, 0, NULL);
>>> + f_in = qemu_file_new_input(QIO_CHANNEL(bioc));
>>> +
>>> + mig_header = qemu_get_be64(f_in);
>>> + if (mig_header != VFIO_MIG_FLAG_DEV_CONFIG_STATE) {
>>> + g_clear_pointer(&f_out, qemu_fclose);
>>> + g_clear_pointer(&f_in, qemu_fclose);
>>> + return -EINVAL;
>>> + }
>>> +
>>> + bql_lock();
>>> + ret = vfio_load_device_config_state(f_in, vbasedev);
>>> + bql_unlock();
>>> +
>>> + g_clear_pointer(&f_out, qemu_fclose);
>>> + g_clear_pointer(&f_in, qemu_fclose);
>>> + if (ret < 0) {
>>> + return ret;
>>> + }
>>> +
>>> + return 0;
>>> +}
>>> +
>>> +static bool vfio_load_bufs_thread_want_abort(VFIODevice *vbasedev,
>>> + bool *abort_flag)
>>> +{
>>> + VFIOMigration *migration = vbasedev->migration;
>>> +
>>> + return migration->load_bufs_thread_want_exit || qatomic_read(abort_flag);
>>> +}
>>> +
>>> +static int vfio_load_bufs_thread(bool *abort_flag, void *opaque)
>>> +{
>>> + VFIODevice *vbasedev = opaque;
>>> + VFIOMigration *migration = vbasedev->migration;
>>> + QEMU_LOCK_GUARD(&migration->load_bufs_mutex);
>>
>> Move QEMU_LOCK_GUARD() below the local var declaration?
>> I usually don't expect to see mutex lockings as part of local var
>> declaration block, which makes it easy to miss when reading the code.
>> (Although QEMU_LOCK_GUARD declares a local variable under the hood,
>> it's implicit and not visible to the user).
>
> I guess you mean moving it..
>
>>> + int ret;
>
> ^ ..here.
>
> Will do.
>
>>> + assert(migration->load_bufs_thread_running);
>>> +
>>> + while (!vfio_load_bufs_thread_want_abort(vbasedev, abort_flag)) {
>>> + VFIOStateBuffer *lb;
>>> + guint bufs_len;
>>> + bool starved;
>>> +
>>> + assert(migration->load_buf_idx <= migration->load_buf_idx_last);
>>> +
>>> + bufs_len = vfio_state_buffers_size_get(&migration->load_bufs);
>>> + if (migration->load_buf_idx >= bufs_len) {
>>> + assert(migration->load_buf_idx == bufs_len);
>>> + starved = true;
>>> + } else {
>>> + lb = vfio_state_buffers_at(&migration->load_bufs,
>>> + migration->load_buf_idx);
>>> + starved = !lb->is_present;
>>> + }
>>> +
>>> + if (starved) {
>>> + trace_vfio_load_state_device_buffer_starved(vbasedev->name,
>>> + migration->load_buf_idx);
>>> + qemu_cond_wait(&migration->load_bufs_buffer_ready_cond,
>>> + &migration->load_bufs_mutex);
>>> + continue;
>>> + }
>>> +
>>> + if (migration->load_buf_idx == migration->load_buf_idx_last) {
>>> + break;
>>> + }
>>> +
>>> + if (migration->load_buf_idx == 0) {
>>> + trace_vfio_load_state_device_buffer_start(vbasedev->name);
>>> + }
>>> +
>>> + if (lb->len) {
>>> + g_autofree char *buf = NULL;
>>> + size_t buf_len;
>>> + ssize_t wr_ret;
>>> + int errno_save;
>>> +
>>> + trace_vfio_load_state_device_buffer_load_start(vbasedev->name,
>>> + migration->load_buf_idx);
>>> +
>>> + /* lb might become re-allocated when we drop the lock */
>>> + buf = g_steal_pointer(&lb->data);
>>> + buf_len = lb->len;
>>> +
>>> + /*
>>> + * Loading data to the device takes a while,
>>> + * drop the lock during this process.
>>> + */
>>> + qemu_mutex_unlock(&migration->load_bufs_mutex);
>>> + wr_ret = write(migration->data_fd, buf, buf_len);
>>> + errno_save = errno;
>>> + qemu_mutex_lock(&migration->load_bufs_mutex);
>>> +
>>> + if (wr_ret < 0) {
>>> + ret = -errno_save;
>>> + goto ret_signal;
>>> + } else if (wr_ret < buf_len) {
>>> + ret = -EINVAL;
>>> + goto ret_signal;
>>> + }
>>
>> Should we loop the write until reaching buf_len bytes?
>> Partial write is not considered error according to write(2) manpage.
>
> Yes, it's probably better to allow partial writes in case
> some VFIO kernel driver actually makes use of them.
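
A full-write loop along these lines would tolerate short writes as well as
EINTR; this is a sketch only, reusing the ret_signal error path of
vfio_load_bufs_thread() above:

/* sketch: retry the write until the whole buffer has been consumed */
size_t written = 0;

while (written < buf_len) {
    ssize_t wr_ret = write(migration->data_fd, buf + written,
                           buf_len - written);
    if (wr_ret < 0) {
        if (errno == EINTR) {
            continue; /* interrupted before writing anything - retry */
        }
        ret = -errno;
        goto ret_signal;
    }

    written += wr_ret;
}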
>
>>
>> Thanks.
>
> Thanks,
> Maciej
>
^ permalink raw reply [flat|nested] 140+ messages in thread
* [PATCH v3 23/24] migration/qemu-file: Define g_autoptr() cleanup function for QEMUFile
2024-11-17 19:19 [PATCH v3 00/24] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
` (21 preceding siblings ...)
2024-11-17 19:20 ` [PATCH v3 22/24] vfio/migration: Multifd device state transfer support - receive side Maciej S. Szmigiero
@ 2024-11-17 19:20 ` Maciej S. Szmigiero
2024-11-26 21:01 ` Fabiano Rosas
2024-12-05 19:49 ` Peter Xu
2024-11-17 19:20 ` [PATCH v3 24/24] vfio/migration: Multifd device state transfer support - send side Maciej S. Szmigiero
` (2 subsequent siblings)
25 siblings, 2 replies; 140+ messages in thread
From: Maciej S. Szmigiero @ 2024-11-17 19:20 UTC (permalink / raw)
To: Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
Automatic memory management helps avoid memory safety issues.
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
migration/qemu-file.h | 2 ++
1 file changed, 2 insertions(+)
diff --git a/migration/qemu-file.h b/migration/qemu-file.h
index 11c2120edd72..fdf21324df07 100644
--- a/migration/qemu-file.h
+++ b/migration/qemu-file.h
@@ -33,6 +33,8 @@ QEMUFile *qemu_file_new_input(QIOChannel *ioc);
QEMUFile *qemu_file_new_output(QIOChannel *ioc);
int qemu_fclose(QEMUFile *f);
+G_DEFINE_AUTOPTR_CLEANUP_FUNC(QEMUFile, qemu_fclose)
+
/*
* qemu_file_transferred:
*
^ permalink raw reply related [flat|nested] 140+ messages in thread
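
For reference, this lets call sites rely on the usual g_autoptr() pattern;
a minimal sketch mirroring how the later patches in this series use it:

/* sketch: the QEMUFile is closed automatically when f goes out of scope */
g_autoptr(QEMUFile) f = qemu_file_new_output(QIO_CHANNEL(bioc));

ret = vfio_save_device_config_state(f, vbasedev, NULL);
if (ret) {
    return ret; /* qemu_fclose(f) runs here via the cleanup attribute */
}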
* Re: [PATCH v3 23/24] migration/qemu-file: Define g_autoptr() cleanup function for QEMUFile
2024-11-17 19:20 ` [PATCH v3 23/24] migration/qemu-file: Define g_autoptr() cleanup function for QEMUFile Maciej S. Szmigiero
@ 2024-11-26 21:01 ` Fabiano Rosas
2024-12-05 19:49 ` Peter Xu
1 sibling, 0 replies; 140+ messages in thread
From: Fabiano Rosas @ 2024-11-26 21:01 UTC (permalink / raw)
To: Maciej S. Szmigiero, Peter Xu
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
"Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>
> Automatic memory management helps avoid memory safety issues.
>
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> ---
> migration/qemu-file.h | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/migration/qemu-file.h b/migration/qemu-file.h
> index 11c2120edd72..fdf21324df07 100644
> --- a/migration/qemu-file.h
> +++ b/migration/qemu-file.h
> @@ -33,6 +33,8 @@ QEMUFile *qemu_file_new_input(QIOChannel *ioc);
> QEMUFile *qemu_file_new_output(QIOChannel *ioc);
> int qemu_fclose(QEMUFile *f);
>
> +G_DEFINE_AUTOPTR_CLEANUP_FUNC(QEMUFile, qemu_fclose)
> +
> /*
> * qemu_file_transferred:
> *
Reviewed-by: Fabiano Rosas <farosas@suse.de>
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 23/24] migration/qemu-file: Define g_autoptr() cleanup function for QEMUFile
2024-11-17 19:20 ` [PATCH v3 23/24] migration/qemu-file: Define g_autoptr() cleanup function for QEMUFile Maciej S. Szmigiero
2024-11-26 21:01 ` Fabiano Rosas
@ 2024-12-05 19:49 ` Peter Xu
1 sibling, 0 replies; 140+ messages in thread
From: Peter Xu @ 2024-12-05 19:49 UTC (permalink / raw)
To: Maciej S. Szmigiero
Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On Sun, Nov 17, 2024 at 08:20:18PM +0100, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>
> Automatic memory management helps avoid memory safety issues.
>
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
--
Peter Xu
^ permalink raw reply [flat|nested] 140+ messages in thread
* [PATCH v3 24/24] vfio/migration: Multifd device state transfer support - send side
2024-11-17 19:19 [PATCH v3 00/24] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
` (22 preceding siblings ...)
2024-11-17 19:20 ` [PATCH v3 23/24] migration/qemu-file: Define g_autoptr() cleanup function for QEMUFile Maciej S. Szmigiero
@ 2024-11-17 19:20 ` Maciej S. Szmigiero
2024-12-09 9:28 ` Avihai Horon
2024-12-04 19:10 ` [PATCH v3 00/24] Multifd device state transfer support with VFIO consumer Peter Xu
2024-12-05 21:27 ` Cédric Le Goater
25 siblings, 1 reply; 140+ messages in thread
From: Maciej S. Szmigiero @ 2024-11-17 19:20 UTC (permalink / raw)
To: Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
Implement the multifd device state transfer via additional per-device
thread inside save_live_complete_precopy_thread handler.
Switch between doing the data transfer in the new handler and doing it
in the old save_state handler depending on the
x-migration-multifd-transfer device property value.
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
hw/vfio/migration.c | 155 +++++++++++++++++++++++++++++++++++++++++++
hw/vfio/trace-events | 2 +
2 files changed, 157 insertions(+)
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index b54879fe6209..8709672ada48 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -771,6 +771,24 @@ static int vfio_save_setup(QEMUFile *f, void *opaque, Error **errp)
uint64_t stop_copy_size = VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE;
int ret;
+ /*
+ * Make a copy of this setting at the start in case it is changed
+ * mid-migration.
+ */
+ if (vbasedev->migration_multifd_transfer == ON_OFF_AUTO_AUTO) {
+ migration->multifd_transfer = vfio_multifd_transfer_supported();
+ } else {
+ migration->multifd_transfer =
+ vbasedev->migration_multifd_transfer == ON_OFF_AUTO_ON;
+ }
+
+ if (migration->multifd_transfer && !vfio_multifd_transfer_supported()) {
+ error_setg(errp,
+ "%s: Multifd device transfer requested but unsupported in the current config",
+ vbasedev->name);
+ return -EINVAL;
+ }
+
qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
vfio_query_stop_copy_size(vbasedev, &stop_copy_size);
@@ -942,13 +960,32 @@ static int vfio_save_iterate(QEMUFile *f, void *opaque)
return !migration->precopy_init_size && !migration->precopy_dirty_size;
}
+static void vfio_save_multifd_emit_dummy_eos(VFIODevice *vbasedev, QEMUFile *f)
+{
+ VFIOMigration *migration = vbasedev->migration;
+
+ assert(migration->multifd_transfer);
+
+ /*
+ * Emit dummy NOP data on the main migration channel since the actual
+ * device state transfer is done via multifd channels.
+ */
+ qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
+}
+
static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
{
VFIODevice *vbasedev = opaque;
+ VFIOMigration *migration = vbasedev->migration;
ssize_t data_size;
int ret;
Error *local_err = NULL;
+ if (migration->multifd_transfer) {
+ vfio_save_multifd_emit_dummy_eos(vbasedev, f);
+ return 0;
+ }
+
trace_vfio_save_complete_precopy_start(vbasedev->name);
/* We reach here with device state STOP or STOP_COPY only */
@@ -974,12 +1011,129 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
return ret;
}
+static int
+vfio_save_complete_precopy_async_thread_config_state(VFIODevice *vbasedev,
+ char *idstr,
+ uint32_t instance_id,
+ uint32_t idx)
+{
+ g_autoptr(QIOChannelBuffer) bioc = NULL;
+ g_autoptr(QEMUFile) f = NULL;
+ int ret;
+ g_autofree VFIODeviceStatePacket *packet = NULL;
+ size_t packet_len;
+
+ bioc = qio_channel_buffer_new(0);
+ qio_channel_set_name(QIO_CHANNEL(bioc), "vfio-device-config-save");
+
+ f = qemu_file_new_output(QIO_CHANNEL(bioc));
+
+ ret = vfio_save_device_config_state(f, vbasedev, NULL);
+ if (ret) {
+ return ret;
+ }
+
+ ret = qemu_fflush(f);
+ if (ret) {
+ return ret;
+ }
+
+ packet_len = sizeof(*packet) + bioc->usage;
+ packet = g_malloc0(packet_len);
+ packet->idx = idx;
+ packet->flags = VFIO_DEVICE_STATE_CONFIG_STATE;
+ memcpy(&packet->data, bioc->data, bioc->usage);
+
+ if (!multifd_queue_device_state(idstr, instance_id,
+ (char *)packet, packet_len)) {
+ return -1;
+ }
+
+ qatomic_add(&bytes_transferred, packet_len);
+
+ return 0;
+}
+
+static int vfio_save_complete_precopy_thread(char *idstr,
+ uint32_t instance_id,
+ bool *abort_flag,
+ void *opaque)
+{
+ VFIODevice *vbasedev = opaque;
+ VFIOMigration *migration = vbasedev->migration;
+ int ret;
+ g_autofree VFIODeviceStatePacket *packet = NULL;
+ uint32_t idx;
+
+ if (!migration->multifd_transfer) {
+ /* Nothing to do, vfio_save_complete_precopy() does the transfer. */
+ return 0;
+ }
+
+ trace_vfio_save_complete_precopy_thread_start(vbasedev->name,
+ idstr, instance_id);
+
+ /* We reach here with device state STOP or STOP_COPY only */
+ ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP_COPY,
+ VFIO_DEVICE_STATE_STOP, NULL);
+ if (ret) {
+ goto ret_finish;
+ }
+
+ packet = g_malloc0(sizeof(*packet) + migration->data_buffer_size);
+
+ for (idx = 0; ; idx++) {
+ ssize_t data_size;
+ size_t packet_size;
+
+ if (qatomic_read(abort_flag)) {
+ ret = -ECANCELED;
+ goto ret_finish;
+ }
+
+ data_size = read(migration->data_fd, &packet->data,
+ migration->data_buffer_size);
+ if (data_size < 0) {
+ ret = -errno;
+ goto ret_finish;
+ } else if (data_size == 0) {
+ break;
+ }
+
+ packet->idx = idx;
+ packet_size = sizeof(*packet) + data_size;
+
+ if (!multifd_queue_device_state(idstr, instance_id,
+ (char *)packet, packet_size)) {
+ ret = -1;
+ goto ret_finish;
+ }
+
+ qatomic_add(&bytes_transferred, packet_size);
+ }
+
+ ret = vfio_save_complete_precopy_async_thread_config_state(vbasedev, idstr,
+ instance_id,
+ idx);
+
+ret_finish:
+ trace_vfio_save_complete_precopy_thread_end(vbasedev->name, ret);
+
+ return ret;
+}
+
static void vfio_save_state(QEMUFile *f, void *opaque)
{
VFIODevice *vbasedev = opaque;
+ VFIOMigration *migration = vbasedev->migration;
Error *local_err = NULL;
int ret;
+ if (migration->multifd_transfer) {
+ vfio_save_multifd_emit_dummy_eos(vbasedev, f);
+ return;
+ }
+
ret = vfio_save_device_config_state(f, opaque, &local_err);
if (ret) {
error_prepend(&local_err,
@@ -1210,6 +1364,7 @@ static const SaveVMHandlers savevm_vfio_handlers = {
.is_active_iterate = vfio_is_active_iterate,
.save_live_iterate = vfio_save_iterate,
.save_live_complete_precopy = vfio_save_complete_precopy,
+ .save_live_complete_precopy_thread = vfio_save_complete_precopy_thread,
.save_state = vfio_save_state,
.load_setup = vfio_load_setup,
.load_cleanup = vfio_load_cleanup,
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 418b378ebd29..039979bdd98f 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -168,6 +168,8 @@ vfio_save_block_precopy_empty_hit(const char *name) " (%s)"
vfio_save_cleanup(const char *name) " (%s)"
vfio_save_complete_precopy(const char *name, int ret) " (%s) ret %d"
vfio_save_complete_precopy_start(const char *name) " (%s)"
+vfio_save_complete_precopy_thread_start(const char *name, const char *idstr, uint32_t instance_id) " (%s) idstr %s instance %"PRIu32
+vfio_save_complete_precopy_thread_end(const char *name, int ret) " (%s) ret %d"
vfio_save_device_config_state(const char *name) " (%s)"
vfio_save_iterate(const char *name, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy initial size %"PRIu64" precopy dirty size %"PRIu64
vfio_save_iterate_start(const char *name) " (%s)"
^ permalink raw reply related [flat|nested] 140+ messages in thread
* Re: [PATCH v3 24/24] vfio/migration: Multifd device state transfer support - send side
2024-11-17 19:20 ` [PATCH v3 24/24] vfio/migration: Multifd device state transfer support - send side Maciej S. Szmigiero
@ 2024-12-09 9:28 ` Avihai Horon
2024-12-10 23:06 ` Maciej S. Szmigiero
0 siblings, 1 reply; 140+ messages in thread
From: Avihai Horon @ 2024-12-09 9:28 UTC (permalink / raw)
To: Maciej S. Szmigiero, Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Joao Martins,
qemu-devel
On 17/11/2024 21:20, Maciej S. Szmigiero wrote:
> External email: Use caution opening links or attachments
>
>
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>
> Implement the multifd device state transfer via additional per-device
> thread inside save_live_complete_precopy_thread handler.
>
> Switch between doing the data transfer in the new handler and doing it
> in the old save_state handler depending on the
> x-migration-multifd-transfer device property value.
>
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> ---
> hw/vfio/migration.c | 155 +++++++++++++++++++++++++++++++++++++++++++
> hw/vfio/trace-events | 2 +
> 2 files changed, 157 insertions(+)
>
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index b54879fe6209..8709672ada48 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -771,6 +771,24 @@ static int vfio_save_setup(QEMUFile *f, void *opaque, Error **errp)
> uint64_t stop_copy_size = VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE;
> int ret;
>
> + /*
> + * Make a copy of this setting at the start in case it is changed
> + * mid-migration.
> + */
> + if (vbasedev->migration_multifd_transfer == ON_OFF_AUTO_AUTO) {
> + migration->multifd_transfer = vfio_multifd_transfer_supported();
> + } else {
> + migration->multifd_transfer =
> + vbasedev->migration_multifd_transfer == ON_OFF_AUTO_ON;
> + }
> +
> + if (migration->multifd_transfer && !vfio_multifd_transfer_supported()) {
> + error_setg(errp,
> + "%s: Multifd device transfer requested but unsupported in the current config",
> + vbasedev->name);
> + return -EINVAL;
> + }
> +
> qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
>
> vfio_query_stop_copy_size(vbasedev, &stop_copy_size);
> @@ -942,13 +960,32 @@ static int vfio_save_iterate(QEMUFile *f, void *opaque)
> return !migration->precopy_init_size && !migration->precopy_dirty_size;
> }
>
> +static void vfio_save_multifd_emit_dummy_eos(VFIODevice *vbasedev, QEMUFile *f)
> +{
> + VFIOMigration *migration = vbasedev->migration;
> +
> + assert(migration->multifd_transfer);
> +
> + /*
> + * Emit dummy NOP data on the main migration channel since the actual
> + * device state transfer is done via multifd channels.
> + */
> + qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> +}
> +
> static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
> {
> VFIODevice *vbasedev = opaque;
> + VFIOMigration *migration = vbasedev->migration;
> ssize_t data_size;
> int ret;
> Error *local_err = NULL;
>
> + if (migration->multifd_transfer) {
> + vfio_save_multifd_emit_dummy_eos(vbasedev, f);
> + return 0;
> + }
I wonder whether we should add a .save_live_use_thread SaveVMHandlers
through which a device can indicate if it wants to save its data with
the async or sync handler.
This will allow migration layer (i.e.,
qemu_savevm_state_complete_precopy_iterable) to know which handler to
call instead of calling both of them and letting each device implicitly
decide.
IMHO it will make the code clearer and will allow us to drop
vfio_save_multifd_emit_dummy_eos().
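
A sketch of the proposal could look like the following; the handler name and
the core-side branch are hypothetical, not existing QEMU API:

/* hypothetical new SaveVMHandlers member: */
bool (*save_live_use_thread)(void *opaque);

/* in qemu_savevm_state_complete_precopy_iterable(), roughly: */
if (se->ops->save_live_use_thread &&
    se->ops->save_live_use_thread(se->opaque)) {
    /* this device's data goes via save_live_complete_precopy_thread only */
    continue;
}

ret = se->ops->save_live_complete_precopy(f, se->opaque);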
> +
> trace_vfio_save_complete_precopy_start(vbasedev->name);
>
> /* We reach here with device state STOP or STOP_COPY only */
> @@ -974,12 +1011,129 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
> return ret;
> }
>
> +static int
> +vfio_save_complete_precopy_async_thread_config_state(VFIODevice *vbasedev,
> + char *idstr,
> + uint32_t instance_id,
> + uint32_t idx)
> +{
> + g_autoptr(QIOChannelBuffer) bioc = NULL;
> + g_autoptr(QEMUFile) f = NULL;
> + int ret;
> + g_autofree VFIODeviceStatePacket *packet = NULL;
> + size_t packet_len;
> +
> + bioc = qio_channel_buffer_new(0);
> + qio_channel_set_name(QIO_CHANNEL(bioc), "vfio-device-config-save");
> +
> + f = qemu_file_new_output(QIO_CHANNEL(bioc));
> +
> + ret = vfio_save_device_config_state(f, vbasedev, NULL);
> + if (ret) {
> + return ret;
> + }
> +
> + ret = qemu_fflush(f);
> + if (ret) {
> + return ret;
> + }
> +
> + packet_len = sizeof(*packet) + bioc->usage;
> + packet = g_malloc0(packet_len);
> + packet->idx = idx;
> + packet->flags = VFIO_DEVICE_STATE_CONFIG_STATE;
> + memcpy(&packet->data, bioc->data, bioc->usage);
> +
> + if (!multifd_queue_device_state(idstr, instance_id,
> + (char *)packet, packet_len)) {
> + return -1;
> + }
> +
> + qatomic_add(&bytes_transferred, packet_len);
> +
> + return 0;
> +}
> +
> +static int vfio_save_complete_precopy_thread(char *idstr,
> + uint32_t instance_id,
> + bool *abort_flag,
> + void *opaque)
> +{
> + VFIODevice *vbasedev = opaque;
> + VFIOMigration *migration = vbasedev->migration;
> + int ret;
> + g_autofree VFIODeviceStatePacket *packet = NULL;
> + uint32_t idx;
> +
> + if (!migration->multifd_transfer) {
> + /* Nothing to do, vfio_save_complete_precopy() does the transfer. */
> + return 0;
> + }
> +
> + trace_vfio_save_complete_precopy_thread_start(vbasedev->name,
> + idstr, instance_id);
> +
> + /* We reach here with device state STOP or STOP_COPY only */
> + ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP_COPY,
> + VFIO_DEVICE_STATE_STOP, NULL);
> + if (ret) {
> + goto ret_finish;
> + }
> +
> + packet = g_malloc0(sizeof(*packet) + migration->data_buffer_size);
> +
> + for (idx = 0; ; idx++) {
> + ssize_t data_size;
> + size_t packet_size;
> +
> + if (qatomic_read(abort_flag)) {
> + ret = -ECANCELED;
> + goto ret_finish;
> + }
> +
> + data_size = read(migration->data_fd, &packet->data,
> + migration->data_buffer_size);
> + if (data_size < 0) {
> + ret = -errno;
> + goto ret_finish;
> + } else if (data_size == 0) {
> + break;
> + }
> +
> + packet->idx = idx;
> + packet_size = sizeof(*packet) + data_size;
> +
> + if (!multifd_queue_device_state(idstr, instance_id,
> + (char *)packet, packet_size)) {
> + ret = -1;
> + goto ret_finish;
> + }
> +
> + qatomic_add(&bytes_transferred, packet_size);
> + }
> +
> + ret = vfio_save_complete_precopy_async_thread_config_state(vbasedev, idstr,
> + instance_id,
> + idx);
I am not sure it's safe to save the config space asynchronously in the
thread, as it might be dependent on other devices' non-iterable state
being loaded first.
See commit d329f5032e17 ("vfio: Move the saving of the config space to
the right place in VFIO migration") which moved config space saving to
the non-iterable state saving.
Thanks.
> +
> +ret_finish:
> + trace_vfio_save_complete_precopy_thread_end(vbasedev->name, ret);
> +
> + return ret;
> +}
> +
> static void vfio_save_state(QEMUFile *f, void *opaque)
> {
> VFIODevice *vbasedev = opaque;
> + VFIOMigration *migration = vbasedev->migration;
> Error *local_err = NULL;
> int ret;
>
> + if (migration->multifd_transfer) {
> + vfio_save_multifd_emit_dummy_eos(vbasedev, f);
> + return;
> + }
> +
> ret = vfio_save_device_config_state(f, opaque, &local_err);
> if (ret) {
> error_prepend(&local_err,
> @@ -1210,6 +1364,7 @@ static const SaveVMHandlers savevm_vfio_handlers = {
> .is_active_iterate = vfio_is_active_iterate,
> .save_live_iterate = vfio_save_iterate,
> .save_live_complete_precopy = vfio_save_complete_precopy,
> + .save_live_complete_precopy_thread = vfio_save_complete_precopy_thread,
> .save_state = vfio_save_state,
> .load_setup = vfio_load_setup,
> .load_cleanup = vfio_load_cleanup,
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index 418b378ebd29..039979bdd98f 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -168,6 +168,8 @@ vfio_save_block_precopy_empty_hit(const char *name) " (%s)"
> vfio_save_cleanup(const char *name) " (%s)"
> vfio_save_complete_precopy(const char *name, int ret) " (%s) ret %d"
> vfio_save_complete_precopy_start(const char *name) " (%s)"
> +vfio_save_complete_precopy_thread_start(const char *name, const char *idstr, uint32_t instance_id) " (%s) idstr %s instance %"PRIu32
> +vfio_save_complete_precopy_thread_end(const char *name, int ret) " (%s) ret %d"
> vfio_save_device_config_state(const char *name) " (%s)"
> vfio_save_iterate(const char *name, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy initial size %"PRIu64" precopy dirty size %"PRIu64
> vfio_save_iterate_start(const char *name) " (%s)"
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 24/24] vfio/migration: Multifd device state transfer support - send side
2024-12-09 9:28 ` Avihai Horon
@ 2024-12-10 23:06 ` Maciej S. Szmigiero
2024-12-12 11:10 ` Cédric Le Goater
2024-12-12 14:54 ` Avihai Horon
0 siblings, 2 replies; 140+ messages in thread
From: Maciej S. Szmigiero @ 2024-12-10 23:06 UTC (permalink / raw)
To: Avihai Horon
Cc: Alex Williamson, Peter Xu, Fabiano Rosas, Cédric Le Goater,
Eric Blake, Markus Armbruster, Daniel P . Berrangé,
Joao Martins, qemu-devel
On 9.12.2024 10:28, Avihai Horon wrote:
>
> On 17/11/2024 21:20, Maciej S. Szmigiero wrote:
>> External email: Use caution opening links or attachments
>>
>>
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> Implement the multifd device state transfer via additional per-device
>> thread inside save_live_complete_precopy_thread handler.
>>
>> Switch between doing the data transfer in the new handler and doing it
>> in the old save_state handler depending on the
>> x-migration-multifd-transfer device property value.
>>
>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>> ---
>> hw/vfio/migration.c | 155 +++++++++++++++++++++++++++++++++++++++++++
>> hw/vfio/trace-events | 2 +
>> 2 files changed, 157 insertions(+)
>>
>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>> index b54879fe6209..8709672ada48 100644
>> --- a/hw/vfio/migration.c
>> +++ b/hw/vfio/migration.c
>> @@ -771,6 +771,24 @@ static int vfio_save_setup(QEMUFile *f, void *opaque, Error **errp)
>> uint64_t stop_copy_size = VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE;
>> int ret;
>>
>> + /*
>> + * Make a copy of this setting at the start in case it is changed
>> + * mid-migration.
>> + */
>> + if (vbasedev->migration_multifd_transfer == ON_OFF_AUTO_AUTO) {
>> + migration->multifd_transfer = vfio_multifd_transfer_supported();
>> + } else {
>> + migration->multifd_transfer =
>> + vbasedev->migration_multifd_transfer == ON_OFF_AUTO_ON;
>> + }
>> +
>> + if (migration->multifd_transfer && !vfio_multifd_transfer_supported()) {
>> + error_setg(errp,
>> + "%s: Multifd device transfer requested but unsupported in the current config",
>> + vbasedev->name);
>> + return -EINVAL;
>> + }
>> +
>> qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
>>
>> vfio_query_stop_copy_size(vbasedev, &stop_copy_size);
>> @@ -942,13 +960,32 @@ static int vfio_save_iterate(QEMUFile *f, void *opaque)
>> return !migration->precopy_init_size && !migration->precopy_dirty_size;
>> }
>>
>> +static void vfio_save_multifd_emit_dummy_eos(VFIODevice *vbasedev, QEMUFile *f)
>> +{
>> + VFIOMigration *migration = vbasedev->migration;
>> +
>> + assert(migration->multifd_transfer);
>> +
>> + /*
>> + * Emit dummy NOP data on the main migration channel since the actual
>> + * device state transfer is done via multifd channels.
>> + */
>> + qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>> +}
>> +
>> static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
>> {
>> VFIODevice *vbasedev = opaque;
>> + VFIOMigration *migration = vbasedev->migration;
>> ssize_t data_size;
>> int ret;
>> Error *local_err = NULL;
>>
>> + if (migration->multifd_transfer) {
>> + vfio_save_multifd_emit_dummy_eos(vbasedev, f);
>> + return 0;
>> + }
>
> I wonder whether we should add a .save_live_use_thread SaveVMHandlers through which a device can indicate if it wants to save its data with the async or sync handler.
> This will allow migration layer (i.e., qemu_savevm_state_complete_precopy_iterable) to know which handler to call instead of calling both of them and letting each device implicitly decide.
> IMHO it will make the code clearer and will allow us to drop vfio_save_multifd_emit_dummy_eos().
I think that it's not worth adding a new SaveVMHandler just for this specific
use case, considering that it's easy to handle it inside the driver by emitting
that FLAG_END_OF_STATE.
Especially considering that, for compatibility with other drivers that do not
define that hypothetical new SaveVMHandler, not having it defined would need to
have the same effect as it always returning "false".
>> +
>> trace_vfio_save_complete_precopy_start(vbasedev->name);
>>
>> /* We reach here with device state STOP or STOP_COPY only */
>> @@ -974,12 +1011,129 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
>> return ret;
>> }
>>
>> +static int
>> +vfio_save_complete_precopy_async_thread_config_state(VFIODevice *vbasedev,
>> + char *idstr,
>> + uint32_t instance_id,
>> + uint32_t idx)
>> +{
>> + g_autoptr(QIOChannelBuffer) bioc = NULL;
>> + g_autoptr(QEMUFile) f = NULL;
>> + int ret;
>> + g_autofree VFIODeviceStatePacket *packet = NULL;
>> + size_t packet_len;
>> +
>> + bioc = qio_channel_buffer_new(0);
>> + qio_channel_set_name(QIO_CHANNEL(bioc), "vfio-device-config-save");
>> +
>> + f = qemu_file_new_output(QIO_CHANNEL(bioc));
>> +
>> + ret = vfio_save_device_config_state(f, vbasedev, NULL);
>> + if (ret) {
>> + return ret;
>> + }
>> +
>> + ret = qemu_fflush(f);
>> + if (ret) {
>> + return ret;
>> + }
>> +
>> + packet_len = sizeof(*packet) + bioc->usage;
>> + packet = g_malloc0(packet_len);
>> + packet->idx = idx;
>> + packet->flags = VFIO_DEVICE_STATE_CONFIG_STATE;
>> + memcpy(&packet->data, bioc->data, bioc->usage);
>> +
>> + if (!multifd_queue_device_state(idstr, instance_id,
>> + (char *)packet, packet_len)) {
>> + return -1;
>> + }
>> +
>> + qatomic_add(&bytes_transferred, packet_len);
>> +
>> + return 0;
>> +}
>> +
>> +static int vfio_save_complete_precopy_thread(char *idstr,
>> + uint32_t instance_id,
>> + bool *abort_flag,
>> + void *opaque)
>> +{
>> + VFIODevice *vbasedev = opaque;
>> + VFIOMigration *migration = vbasedev->migration;
>> + int ret;
>> + g_autofree VFIODeviceStatePacket *packet = NULL;
>> + uint32_t idx;
>> +
>> + if (!migration->multifd_transfer) {
>> + /* Nothing to do, vfio_save_complete_precopy() does the transfer. */
>> + return 0;
>> + }
>> +
>> + trace_vfio_save_complete_precopy_thread_start(vbasedev->name,
>> + idstr, instance_id);
>> +
>> + /* We reach here with device state STOP or STOP_COPY only */
>> + ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP_COPY,
>> + VFIO_DEVICE_STATE_STOP, NULL);
>> + if (ret) {
>> + goto ret_finish;
>> + }
>> +
>> + packet = g_malloc0(sizeof(*packet) + migration->data_buffer_size);
>> +
>> + for (idx = 0; ; idx++) {
>> + ssize_t data_size;
>> + size_t packet_size;
>> +
>> + if (qatomic_read(abort_flag)) {
>> + ret = -ECANCELED;
>> + goto ret_finish;
>> + }
>> +
>> + data_size = read(migration->data_fd, &packet->data,
>> + migration->data_buffer_size);
>> + if (data_size < 0) {
>> + ret = -errno;
>> + goto ret_finish;
>> + } else if (data_size == 0) {
>> + break;
>> + }
>> +
>> + packet->idx = idx;
>> + packet_size = sizeof(*packet) + data_size;
>> +
>> + if (!multifd_queue_device_state(idstr, instance_id,
>> + (char *)packet, packet_size)) {
>> + ret = -1;
>> + goto ret_finish;
>> + }
>> +
>> + qatomic_add(&bytes_transferred, packet_size);
>> + }
>> +
>> + ret = vfio_save_complete_precopy_async_thread_config_state(vbasedev, idstr,
>> + instance_id,
>> + idx);
>
> I am not sure it's safe to save the config space asynchronously in the thread, as it might be dependent on other devices' non-iterable state being loaded first.
> See commit d329f5032e17 ("vfio: Move the saving of the config space to the right place in VFIO migration") which moved config space saving to the non-iterable state saving.
That's important information - thanks for pointing this out.
Since we don't want to lose this config state saving parallelism
(and any future config state saving parallelism) on unaffected platforms
we'll probably need to disable this functionality for ARM64.
By the way, this kind of implicit dependency in VMState between devices
is really hard to manage; there should be a way to specify it in code somehow...
> Thanks.
Thanks,
Maciej
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 24/24] vfio/migration: Multifd device state transfer support - send side
2024-12-10 23:06 ` Maciej S. Szmigiero
@ 2024-12-12 11:10 ` Cédric Le Goater
2024-12-12 22:52 ` Maciej S. Szmigiero
2024-12-12 14:54 ` Avihai Horon
1 sibling, 1 reply; 140+ messages in thread
From: Cédric Le Goater @ 2024-12-12 11:10 UTC (permalink / raw)
To: Maciej S. Szmigiero, Avihai Horon
Cc: Alex Williamson, Peter Xu, Fabiano Rosas, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Joao Martins,
qemu-devel
On 12/11/24 00:06, Maciej S. Szmigiero wrote:
> On 9.12.2024 10:28, Avihai Horon wrote:
>>
>> On 17/11/2024 21:20, Maciej S. Szmigiero wrote:
>>> External email: Use caution opening links or attachments
>>>
>>>
>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>
>>> Implement the multifd device state transfer via additional per-device
>>> thread inside save_live_complete_precopy_thread handler.
>>>
>>> Switch between doing the data transfer in the new handler and doing it
>>> in the old save_state handler depending on the
>>> x-migration-multifd-transfer device property value.
>>>
>>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>>> ---
>>> hw/vfio/migration.c | 155 +++++++++++++++++++++++++++++++++++++++++++
>>> hw/vfio/trace-events | 2 +
>>> 2 files changed, 157 insertions(+)
>>>
>>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>>> index b54879fe6209..8709672ada48 100644
>>> --- a/hw/vfio/migration.c
>>> +++ b/hw/vfio/migration.c
>>> @@ -771,6 +771,24 @@ static int vfio_save_setup(QEMUFile *f, void *opaque, Error **errp)
>>> uint64_t stop_copy_size = VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE;
>>> int ret;
>>>
>>> + /*
>>> + * Make a copy of this setting at the start in case it is changed
>>> + * mid-migration.
>>> + */
>>> + if (vbasedev->migration_multifd_transfer == ON_OFF_AUTO_AUTO) {
>>> + migration->multifd_transfer = vfio_multifd_transfer_supported();
>>> + } else {
>>> + migration->multifd_transfer =
>>> + vbasedev->migration_multifd_transfer == ON_OFF_AUTO_ON;
>>> + }
>>> +
>>> + if (migration->multifd_transfer && !vfio_multifd_transfer_supported()) {
>>> + error_setg(errp,
>>> + "%s: Multifd device transfer requested but unsupported in the current config",
>>> + vbasedev->name);
>>> + return -EINVAL;
>>> + }
>>> +
>>> qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
>>>
>>> vfio_query_stop_copy_size(vbasedev, &stop_copy_size);
>>> @@ -942,13 +960,32 @@ static int vfio_save_iterate(QEMUFile *f, void *opaque)
>>> return !migration->precopy_init_size && !migration->precopy_dirty_size;
>>> }
>>>
>>> +static void vfio_save_multifd_emit_dummy_eos(VFIODevice *vbasedev, QEMUFile *f)
>>> +{
>>> + VFIOMigration *migration = vbasedev->migration;
>>> +
>>> + assert(migration->multifd_transfer);
>>> +
>>> + /*
>>> + * Emit dummy NOP data on the main migration channel since the actual
>>> + * device state transfer is done via multifd channels.
>>> + */
>>> + qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>>> +}
>>> +
>>> static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
>>> {
>>> VFIODevice *vbasedev = opaque;
>>> + VFIOMigration *migration = vbasedev->migration;
>>> ssize_t data_size;
>>> int ret;
>>> Error *local_err = NULL;
>>>
>>> + if (migration->multifd_transfer) {
>>> + vfio_save_multifd_emit_dummy_eos(vbasedev, f);
>>> + return 0;
>>> + }
>>
>> I wonder whether we should add a .save_live_use_thread SaveVMHandlers through which a device can indicate if it wants to save its data with the async or sync handler.
>> This will allow migration layer (i.e., qemu_savevm_state_complete_precopy_iterable) to know which handler to call instead of calling both of them and letting each device implicitly decide.
>> IMHO it will make the code clearer and will allow us to drop vfio_save_multifd_emit_dummy_eos().
>
> I think that it's not worth adding a new SaveVMHandler just for this specific
> use case, considering that it's easy to handle it inside driver by emitting that
> FLAG_END_OF_STATE.
>
> Especially considering that for compatibility with other drivers that do not
> define that hypothetical new SaveVMHandler not having it defined would need to
> have the same effect as it always returning "false".
>
>>> +
>>> trace_vfio_save_complete_precopy_start(vbasedev->name);
>>>
>>> /* We reach here with device state STOP or STOP_COPY only */
>>> @@ -974,12 +1011,129 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
>>> return ret;
>>> }
>>>
>>> +static int
>>> +vfio_save_complete_precopy_async_thread_config_state(VFIODevice *vbasedev,
>>> + char *idstr,
>>> + uint32_t instance_id,
>>> + uint32_t idx)
>>> +{
>>> + g_autoptr(QIOChannelBuffer) bioc = NULL;
>>> + g_autoptr(QEMUFile) f = NULL;
>>> + int ret;
>>> + g_autofree VFIODeviceStatePacket *packet = NULL;
>>> + size_t packet_len;
>>> +
>>> + bioc = qio_channel_buffer_new(0);
>>> + qio_channel_set_name(QIO_CHANNEL(bioc), "vfio-device-config-save");
>>> +
>>> + f = qemu_file_new_output(QIO_CHANNEL(bioc));
>>> +
>>> + ret = vfio_save_device_config_state(f, vbasedev, NULL);
>>> + if (ret) {
>>> + return ret;
>>> + }
>>> +
>>> + ret = qemu_fflush(f);
>>> + if (ret) {
>>> + return ret;
>>> + }
>>> +
>>> + packet_len = sizeof(*packet) + bioc->usage;
>>> + packet = g_malloc0(packet_len);
>>> + packet->idx = idx;
>>> + packet->flags = VFIO_DEVICE_STATE_CONFIG_STATE;
>>> + memcpy(&packet->data, bioc->data, bioc->usage);
>>> +
>>> + if (!multifd_queue_device_state(idstr, instance_id,
>>> + (char *)packet, packet_len)) {
>>> + return -1;
>>> + }
>>> +
>>> + qatomic_add(&bytes_transferred, packet_len);
>>> +
>>> + return 0;
>>> +}
>>> +
>>> +static int vfio_save_complete_precopy_thread(char *idstr,
>>> + uint32_t instance_id,
>>> + bool *abort_flag,
>>> + void *opaque)
>>> +{
>>> + VFIODevice *vbasedev = opaque;
>>> + VFIOMigration *migration = vbasedev->migration;
>>> + int ret;
>>> + g_autofree VFIODeviceStatePacket *packet = NULL;
>>> + uint32_t idx;
>>> +
>>> + if (!migration->multifd_transfer) {
>>> + /* Nothing to do, vfio_save_complete_precopy() does the transfer. */
>>> + return 0;
>>> + }
>>> +
>>> + trace_vfio_save_complete_precopy_thread_start(vbasedev->name,
>>> + idstr, instance_id);
>>> +
>>> + /* We reach here with device state STOP or STOP_COPY only */
>>> + ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP_COPY,
>>> + VFIO_DEVICE_STATE_STOP, NULL);
>>> + if (ret) {
>>> + goto ret_finish;
>>> + }
>>> +
>>> + packet = g_malloc0(sizeof(*packet) + migration->data_buffer_size);
>>> +
>>> + for (idx = 0; ; idx++) {
>>> + ssize_t data_size;
>>> + size_t packet_size;
>>> +
>>> + if (qatomic_read(abort_flag)) {
>>> + ret = -ECANCELED;
>>> + goto ret_finish;
>>> + }
>>> +
>>> + data_size = read(migration->data_fd, &packet->data,
>>> + migration->data_buffer_size);
>>> + if (data_size < 0) {
>>> + ret = -errno;
>>> + goto ret_finish;
>>> + } else if (data_size == 0) {
>>> + break;
>>> + }
>>> +
>>> + packet->idx = idx;
>>> + packet_size = sizeof(*packet) + data_size;
>>> +
>>> + if (!multifd_queue_device_state(idstr, instance_id,
>>> + (char *)packet, packet_size)) {
>>> + ret = -1;
>>> + goto ret_finish;
>>> + }
>>> +
>>> + qatomic_add(&bytes_transferred, packet_size);
>>> + }
>>> +
>>> + ret = vfio_save_complete_precopy_async_thread_config_state(vbasedev, idstr,
>>> + instance_id,
>>> + idx);
>>
>> I am not sure it's safe to save the config space asynchronously in the thread, as it might be dependent on other devices' non-iterable state being loaded first.
>> See commit d329f5032e17 ("vfio: Move the saving of the config space to the right place in VFIO migration") which moved config space saving to the non-iterable state saving.
>
> That's important information - thanks for pointing this out.
>
> Since we don't want to lose this config state saving parallelism
> (and any future config state saving parallelism) on unaffected platforms
> we'll probably need to disable this functionality for ARM64.
>
> By the way, this kind of implicit dependency in VMState between devices
> is really hard to manage; there should be a way to specify it in code somehow...
vmstate has a MigrationPriority field to order loading between
devices. Maybe we could extend it, but I think it is better to handle
ordering at the device level when there are no external dependencies.
It should be well documented though in the code.
Thanks,
C.
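
For reference, the field mentioned above is the .priority member of
VMStateDescription; a minimal sketch of a device setting it (the device
name and the chosen priority value are examples only):

static const VMStateDescription vmstate_example_dev = {
    .name = "example-dev",
    .version_id = 1,
    .minimum_version_id = 1,
    /* restored before MIG_PRI_DEFAULT devices on the destination */
    .priority = MIG_PRI_IOMMU,
    .fields = (const VMStateField[]) {
        VMSTATE_END_OF_LIST()
    },
};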
>
>> Thanks.
>
> Thanks,
> Maciej
>
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 24/24] vfio/migration: Multifd device state transfer support - send side
2024-12-12 11:10 ` Cédric Le Goater
@ 2024-12-12 22:52 ` Maciej S. Szmigiero
2024-12-13 11:08 ` Cédric Le Goater
0 siblings, 1 reply; 140+ messages in thread
From: Maciej S. Szmigiero @ 2024-12-12 22:52 UTC (permalink / raw)
To: Cédric Le Goater
Cc: Alex Williamson, Avihai Horon, Peter Xu, Fabiano Rosas,
Eric Blake, Markus Armbruster, Daniel P . Berrangé,
Joao Martins, qemu-devel
On 12.12.2024 12:10, Cédric Le Goater wrote:
> On 12/11/24 00:06, Maciej S. Szmigiero wrote:
>> On 9.12.2024 10:28, Avihai Horon wrote:
>>>
>>> On 17/11/2024 21:20, Maciej S. Szmigiero wrote:
>>>> External email: Use caution opening links or attachments
>>>>
>>>>
>>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>>
>>>> Implement the multifd device state transfer via additional per-device
>>>> thread inside save_live_complete_precopy_thread handler.
>>>>
>>>> Switch between doing the data transfer in the new handler and doing it
>>>> in the old save_state handler depending on the
>>>> x-migration-multifd-transfer device property value.
>>>>
>>>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>>>> ---
>>>> hw/vfio/migration.c | 155 +++++++++++++++++++++++++++++++++++++++++++
>>>> hw/vfio/trace-events | 2 +
>>>> 2 files changed, 157 insertions(+)
>>>>
>>>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>>>> index b54879fe6209..8709672ada48 100644
>>>> --- a/hw/vfio/migration.c
>>>> +++ b/hw/vfio/migration.c
(...)
>>>> +
>>>> trace_vfio_save_complete_precopy_start(vbasedev->name);
>>>>
>>>> /* We reach here with device state STOP or STOP_COPY only */
>>>> @@ -974,12 +1011,129 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
>>>> return ret;
>>>> }
>>>>
>>>> +static int
>>>> +vfio_save_complete_precopy_async_thread_config_state(VFIODevice *vbasedev,
>>>> + char *idstr,
>>>> + uint32_t instance_id,
>>>> + uint32_t idx)
>>>> +{
>>>> + g_autoptr(QIOChannelBuffer) bioc = NULL;
>>>> + g_autoptr(QEMUFile) f = NULL;
>>>> + int ret;
>>>> + g_autofree VFIODeviceStatePacket *packet = NULL;
>>>> + size_t packet_len;
>>>> +
>>>> + bioc = qio_channel_buffer_new(0);
>>>> + qio_channel_set_name(QIO_CHANNEL(bioc), "vfio-device-config-save");
>>>> +
>>>> + f = qemu_file_new_output(QIO_CHANNEL(bioc));
>>>> +
>>>> + ret = vfio_save_device_config_state(f, vbasedev, NULL);
>>>> + if (ret) {
>>>> + return ret;
>>>> + }
>>>> +
>>>> + ret = qemu_fflush(f);
>>>> + if (ret) {
>>>> + return ret;
>>>> + }
>>>> +
>>>> + packet_len = sizeof(*packet) + bioc->usage;
>>>> + packet = g_malloc0(packet_len);
>>>> + packet->idx = idx;
>>>> + packet->flags = VFIO_DEVICE_STATE_CONFIG_STATE;
>>>> + memcpy(&packet->data, bioc->data, bioc->usage);
>>>> +
>>>> + if (!multifd_queue_device_state(idstr, instance_id,
>>>> + (char *)packet, packet_len)) {
>>>> + return -1;
>>>> + }
>>>> +
>>>> + qatomic_add(&bytes_transferred, packet_len);
>>>> +
>>>> + return 0;
>>>> +}
>>>> +
>>>> +static int vfio_save_complete_precopy_thread(char *idstr,
>>>> + uint32_t instance_id,
>>>> + bool *abort_flag,
>>>> + void *opaque)
>>>> +{
>>>> + VFIODevice *vbasedev = opaque;
>>>> + VFIOMigration *migration = vbasedev->migration;
>>>> + int ret;
>>>> + g_autofree VFIODeviceStatePacket *packet = NULL;
>>>> + uint32_t idx;
>>>> +
>>>> + if (!migration->multifd_transfer) {
>>>> + /* Nothing to do, vfio_save_complete_precopy() does the transfer. */
>>>> + return 0;
>>>> + }
>>>> +
>>>> + trace_vfio_save_complete_precopy_thread_start(vbasedev->name,
>>>> + idstr, instance_id);
>>>> +
>>>> + /* We reach here with device state STOP or STOP_COPY only */
>>>> + ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP_COPY,
>>>> + VFIO_DEVICE_STATE_STOP, NULL);
>>>> + if (ret) {
>>>> + goto ret_finish;
>>>> + }
>>>> +
>>>> + packet = g_malloc0(sizeof(*packet) + migration->data_buffer_size);
>>>> +
>>>> + for (idx = 0; ; idx++) {
>>>> + ssize_t data_size;
>>>> + size_t packet_size;
>>>> +
>>>> + if (qatomic_read(abort_flag)) {
>>>> + ret = -ECANCELED;
>>>> + goto ret_finish;
>>>> + }
>>>> +
>>>> + data_size = read(migration->data_fd, &packet->data,
>>>> + migration->data_buffer_size);
>>>> + if (data_size < 0) {
>>>> + ret = -errno;
>>>> + goto ret_finish;
>>>> + } else if (data_size == 0) {
>>>> + break;
>>>> + }
>>>> +
>>>> + packet->idx = idx;
>>>> + packet_size = sizeof(*packet) + data_size;
>>>> +
>>>> + if (!multifd_queue_device_state(idstr, instance_id,
>>>> + (char *)packet, packet_size)) {
>>>> + ret = -1;
>>>> + goto ret_finish;
>>>> + }
>>>> +
>>>> + qatomic_add(&bytes_transferred, packet_size);
>>>> + }
>>>> +
>>>> + ret = vfio_save_complete_precopy_async_thread_config_state(vbasedev, idstr,
>>>> + instance_id,
>>>> + idx);
>>>
>>> I am not sure it's safe to save the config space asynchronously in the thread, as it might be dependent on other devices' non-iterable state being loaded first.
>>> See commit d329f5032e17 ("vfio: Move the saving of the config space to the right place in VFIO migration") which moved config space saving to the non-iterable state saving.
>>
>> That's important information - thanks for pointing this out.
>>
>> Since we don't want to lose this config state saving parallelism
>> (and any future config state saving parallelism) on unaffected platforms
>> we'll probably need to disable this functionality for ARM64.
>>
>> By the way, this kind of implicit dependency in VMState between devices
>> is really hard to manage; there should be a way to specify it in code somehow...
>
> vmstate has a MigrationPriority field to order loading between
> devices. Maybe we could extend it, but I think it is better to handle
> ordering at the device level when there are no external dependencies.
> It should be well documented though in the code.
>
To be clear, by "handling ordering at the device level" you mean
just disabling this functionality for ARM64 as proposed above?
>
> Thanks,
>
> C.
>
>
>>
>>> Thanks.
Thanks,
Maciej
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 24/24] vfio/migration: Multifd device state transfer support - send side
2024-12-12 22:52 ` Maciej S. Szmigiero
@ 2024-12-13 11:08 ` Cédric Le Goater
2024-12-13 18:25 ` Maciej S. Szmigiero
0 siblings, 1 reply; 140+ messages in thread
From: Cédric Le Goater @ 2024-12-13 11:08 UTC (permalink / raw)
To: Maciej S. Szmigiero
Cc: Alex Williamson, Avihai Horon, Peter Xu, Fabiano Rosas,
Eric Blake, Markus Armbruster, Daniel P . Berrangé,
Joao Martins, qemu-devel
>>> By the way, this kind of implicit dependency in VMState between devices
>>> is really hard to manage; there should be a way to specify it in code somehow...
>>
>> vmstate has a MigrationPriority field to order loading between
>> devices. Maybe we could extend it, but I think it is better to handle
>> ordering at the device level when there are no external dependencies.
>> It should be well documented though in the code.
>>
>
> To be clear, by "handling ordering at the device level" you mean
> just disabling this functionality for ARM64 as proposed above?
I meant handling the migration ordering in the device load/save
handlers without making assumptions on other devices.
Regarding ARM64, it would be unfortunate to deactivate the feature
since migration works correctly today, on ARM64 64k kernels too,
and this series should also improve downtime. Support can be added
gradually though.
Thanks,
C.
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 24/24] vfio/migration: Multifd device state transfer support - send side
2024-12-13 11:08 ` Cédric Le Goater
@ 2024-12-13 18:25 ` Maciej S. Szmigiero
0 siblings, 0 replies; 140+ messages in thread
From: Maciej S. Szmigiero @ 2024-12-13 18:25 UTC (permalink / raw)
To: Cédric Le Goater
Cc: Alex Williamson, Avihai Horon, Peter Xu, Fabiano Rosas,
Eric Blake, Markus Armbruster, Daniel P . Berrangé,
Joao Martins, qemu-devel
On 13.12.2024 12:08, Cédric Le Goater wrote:
>>>> By the way, this kind of implicit dependency in VMState between devices
>>>> is really hard to manage; there should be a way to specify it in code somehow...
>>>
>>> vmstate has a MigrationPriority field to order loading between
>>> devices. Maybe we could extend it, but I think it is better to handle
>>> ordering at the device level when there are no external dependencies.
>>> It should be well documented though in the code.
>>>
>>
>> To be clear, by "handling ordering at the device level" you mean
>> just disabling this functionality for ARM64 as proposed above?
>
> I meant handling the migration ordering in the device load/save
> handlers without making assumptions on other devices.
>
> Regarding ARM64, it would be unfortunate to deactivate the feature
> since migration works correctly today, on ARM64 64k kernels too,
> and this series should also improve downtime. Support can be added
> gradually though.
I wasn't thinking about disabling the whole multifd migration support
in VFIO on ARM64, but just the config state transfer part.
While adding proper migration device ordering is probably a good
future goal, it's not quite within this patch set's scope.
> Thanks,
>
> C.
>
Thanks,
Maciej
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 24/24] vfio/migration: Multifd device state transfer support - send side
2024-12-10 23:06 ` Maciej S. Szmigiero
2024-12-12 11:10 ` Cédric Le Goater
@ 2024-12-12 14:54 ` Avihai Horon
2024-12-12 22:53 ` Maciej S. Szmigiero
1 sibling, 1 reply; 140+ messages in thread
From: Avihai Horon @ 2024-12-12 14:54 UTC (permalink / raw)
To: Maciej S. Szmigiero
Cc: Alex Williamson, Peter Xu, Fabiano Rosas, Cédric Le Goater,
Eric Blake, Markus Armbruster, Daniel P . Berrangé,
Joao Martins, qemu-devel
On 11/12/2024 1:06, Maciej S. Szmigiero wrote:
> External email: Use caution opening links or attachments
>
>
> On 9.12.2024 10:28, Avihai Horon wrote:
>>
>> On 17/11/2024 21:20, Maciej S. Szmigiero wrote:
>>> External email: Use caution opening links or attachments
>>>
>>>
>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>
>>> Implement the multifd device state transfer via additional per-device
>>> thread inside save_live_complete_precopy_thread handler.
>>>
>>> Switch between doing the data transfer in the new handler and doing it
>>> in the old save_state handler depending on the
>>> x-migration-multifd-transfer device property value.
>>>
>>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>>> ---
>>> hw/vfio/migration.c | 155
>>> +++++++++++++++++++++++++++++++++++++++++++
>>> hw/vfio/trace-events | 2 +
>>> 2 files changed, 157 insertions(+)
>>>
>>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>>> index b54879fe6209..8709672ada48 100644
>>> --- a/hw/vfio/migration.c
>>> +++ b/hw/vfio/migration.c
>>> @@ -771,6 +771,24 @@ static int vfio_save_setup(QEMUFile *f, void *opaque, Error **errp)
>>> uint64_t stop_copy_size = VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE;
>>> int ret;
>>>
>>> + /*
>>> + * Make a copy of this setting at the start in case it is changed
>>> + * mid-migration.
>>> + */
>>> + if (vbasedev->migration_multifd_transfer == ON_OFF_AUTO_AUTO) {
>>> + migration->multifd_transfer =
>>> vfio_multifd_transfer_supported();
>>> + } else {
>>> + migration->multifd_transfer =
>>> + vbasedev->migration_multifd_transfer == ON_OFF_AUTO_ON;
>>> + }
>>> +
>>> + if (migration->multifd_transfer &&
>>> !vfio_multifd_transfer_supported()) {
>>> + error_setg(errp,
>>> + "%s: Multifd device transfer requested but
>>> unsupported in the current config",
>>> + vbasedev->name);
>>> + return -EINVAL;
>>> + }
>>> +
>>> qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
>>>
>>> vfio_query_stop_copy_size(vbasedev, &stop_copy_size);
>>> @@ -942,13 +960,32 @@ static int vfio_save_iterate(QEMUFile *f, void
>>> *opaque)
>>> return !migration->precopy_init_size &&
>>> !migration->precopy_dirty_size;
>>> }
>>>
>>> +static void vfio_save_multifd_emit_dummy_eos(VFIODevice *vbasedev,
>>> QEMUFile *f)
>>> +{
>>> + VFIOMigration *migration = vbasedev->migration;
>>> +
>>> + assert(migration->multifd_transfer);
>>> +
>>> + /*
>>> + * Emit dummy NOP data on the main migration channel since the
>>> actual
>>> + * device state transfer is done via multifd channels.
>>> + */
>>> + qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>>> +}
>>> +
>>> static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
>>> {
>>> VFIODevice *vbasedev = opaque;
>>> + VFIOMigration *migration = vbasedev->migration;
>>> ssize_t data_size;
>>> int ret;
>>> Error *local_err = NULL;
>>>
>>> + if (migration->multifd_transfer) {
>>> + vfio_save_multifd_emit_dummy_eos(vbasedev, f);
>>> + return 0;
>>> + }
>>
>> I wonder whether we should add a .save_live_use_thread SaveVMHandlers
>> through which a device can indicate if it wants to save its data with
>> the async or sync handler.
>> This will allow migration layer (i.e.,
>> qemu_savevm_state_complete_precopy_iterable) to know which handler to
>> call instead of calling both of them and letting each device
>> implicitly decide.
>> IMHO it will make the code clearer and will allow us to drop
>> vfio_save_multifd_emit_dummy_eos().
>
> I think that it's not worth adding a new SaveVMHandler just for this specific
> use case, considering that it's easy to handle it inside driver by emitting that
> FLAG_END_OF_STATE.
>
> Especially considering that for compatibility with other drivers that do not
> define that hypothetical new SaveVMHandler not having it defined would need to
> have the same effect as it always returning "false".
We already have such handlers like .is_active, .has_postcopy and
.is_active_iterate.
Since VFIO migration with multifd involves a lot of threads and
convoluted code paths, I thought this could put some order (even if
small) into things, especially if it allows us to avoid the
vfio_save_multifd_emit_dummy_eos() which feels a bit hackish.
But anyway, that's only my opinion, and I can understand why this could
be seen as overkill.
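For what it's worth, a minimal sketch of the device side of what I have in
mind (the handler name and this implementation are invented here, not taken
from any posted patch):

/*
 * Hypothetical new SaveVMHandlers member -- NOT part of QEMU:
 *
 *     bool (*save_live_use_thread)(void *opaque);
 *
 * A device without the handler would keep today's sync behavior, so
 * existing devices stay compatible.  VFIO would simply report the
 * setting it cached at save_setup time:
 */
static bool vfio_save_live_use_thread(void *opaque)
{
    VFIODevice *vbasedev = opaque;
    VFIOMigration *migration = vbasedev->migration;

    return migration->multifd_transfer;
}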
Thanks.
>
>>> +
>>>      trace_vfio_save_complete_precopy_start(vbasedev->name);
>>>
>>>      /* We reach here with device state STOP or STOP_COPY only */
>>> @@ -974,12 +1011,129 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
>>>      return ret;
>>>  }
>>>
>>> +static int
>>> +vfio_save_complete_precopy_async_thread_config_state(VFIODevice *vbasedev,
>>> +                                                     char *idstr,
>>> +                                                     uint32_t instance_id,
>>> +                                                     uint32_t idx)
>>> +{
>>> +    g_autoptr(QIOChannelBuffer) bioc = NULL;
>>> +    g_autoptr(QEMUFile) f = NULL;
>>> +    int ret;
>>> +    g_autofree VFIODeviceStatePacket *packet = NULL;
>>> +    size_t packet_len;
>>> +
>>> +    bioc = qio_channel_buffer_new(0);
>>> +    qio_channel_set_name(QIO_CHANNEL(bioc), "vfio-device-config-save");
>>> +
>>> +    f = qemu_file_new_output(QIO_CHANNEL(bioc));
>>> +
>>> +    ret = vfio_save_device_config_state(f, vbasedev, NULL);
>>> +    if (ret) {
>>> +        return ret;
>>> +    }
>>> +
>>> +    ret = qemu_fflush(f);
>>> +    if (ret) {
>>> +        return ret;
>>> +    }
>>> +
>>> +    packet_len = sizeof(*packet) + bioc->usage;
>>> +    packet = g_malloc0(packet_len);
>>> +    packet->idx = idx;
>>> +    packet->flags = VFIO_DEVICE_STATE_CONFIG_STATE;
>>> +    memcpy(&packet->data, bioc->data, bioc->usage);
>>> +
>>> +    if (!multifd_queue_device_state(idstr, instance_id,
>>> +                                    (char *)packet, packet_len)) {
>>> +        return -1;
>>> +    }
>>> +
>>> +    qatomic_add(&bytes_transferred, packet_len);
>>> +
>>> +    return 0;
>>> +}
>>> +
>>> +static int vfio_save_complete_precopy_thread(char *idstr,
>>> +                                             uint32_t instance_id,
>>> +                                             bool *abort_flag,
>>> +                                             void *opaque)
>>> +{
>>> +    VFIODevice *vbasedev = opaque;
>>> +    VFIOMigration *migration = vbasedev->migration;
>>> +    int ret;
>>> +    g_autofree VFIODeviceStatePacket *packet = NULL;
>>> +    uint32_t idx;
>>> +
>>> +    if (!migration->multifd_transfer) {
>>> +        /* Nothing to do, vfio_save_complete_precopy() does the transfer. */
>>> +        return 0;
>>> +    }
>>> +
>>> +    trace_vfio_save_complete_precopy_thread_start(vbasedev->name,
>>> +                                                  idstr, instance_id);
>>> +
>>> +    /* We reach here with device state STOP or STOP_COPY only */
>>> +    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP_COPY,
>>> +                                   VFIO_DEVICE_STATE_STOP, NULL);
>>> +    if (ret) {
>>> +        goto ret_finish;
>>> +    }
>>> +
>>> +    packet = g_malloc0(sizeof(*packet) + migration->data_buffer_size);
>>> +
>>> +    for (idx = 0; ; idx++) {
>>> +        ssize_t data_size;
>>> +        size_t packet_size;
>>> +
>>> +        if (qatomic_read(abort_flag)) {
>>> +            ret = -ECANCELED;
>>> +            goto ret_finish;
>>> +        }
>>> +
>>> +        data_size = read(migration->data_fd, &packet->data,
>>> +                         migration->data_buffer_size);
>>> +        if (data_size < 0) {
>>> +            ret = -errno;
>>> +            goto ret_finish;
>>> +        } else if (data_size == 0) {
>>> +            break;
>>> +        }
>>> +
>>> +        packet->idx = idx;
>>> +        packet_size = sizeof(*packet) + data_size;
>>> +
>>> +        if (!multifd_queue_device_state(idstr, instance_id,
>>> +                                        (char *)packet, packet_size)) {
>>> +            ret = -1;
>>> +            goto ret_finish;
>>> +        }
>>> +
>>> +        qatomic_add(&bytes_transferred, packet_size);
>>> +    }
>>> +
>>> +    ret = vfio_save_complete_precopy_async_thread_config_state(vbasedev, idstr,
>>> +                                                                instance_id,
>>> +                                                                idx);
>>
>> I am not sure it's safe to save the config space asynchronously in the
>> thread, as it might be dependent on other devices' non-iterable state
>> being loaded first.
>> See commit d329f5032e17 ("vfio: Move the saving of the config space
>> to the right place in VFIO migration") which moved config space
>> saving to the non-iterable state saving.
>
> That's important information - thanks for pointing this out.
>
> Since we don't want to lose this config state saving parallelism
> (and the future config state saving parallelism) on unaffected platforms,
> we'll probably need to disable this functionality for ARM64.
>
> By the way, this kind of an implicit dependency in VMState between devices
> is really hard to manage, there should be a way to specify it in code
> somehow...
>
>> Thanks.
>
> Thanks,
> Maciej
>
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 24/24] vfio/migration: Multifd device state transfer support - send side
2024-12-12 14:54 ` Avihai Horon
@ 2024-12-12 22:53 ` Maciej S. Szmigiero
2024-12-16 17:33 ` Peter Xu
0 siblings, 1 reply; 140+ messages in thread
From: Maciej S. Szmigiero @ 2024-12-12 22:53 UTC (permalink / raw)
To: Avihai Horon, Cédric Le Goater, Peter Xu
Cc: Alex Williamson, Fabiano Rosas, Eric Blake, Markus Armbruster,
Daniel P . Berrangé, Joao Martins, qemu-devel
On 12.12.2024 15:54, Avihai Horon wrote:
>
> On 11/12/2024 1:06, Maciej S. Szmigiero wrote:
>>
>> On 9.12.2024 10:28, Avihai Horon wrote:
>>>
>>> On 17/11/2024 21:20, Maciej S. Szmigiero wrote:
>>>>
>>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>>
>>>> Implement the multifd device state transfer via additional per-device
>>>> thread inside save_live_complete_precopy_thread handler.
>>>>
>>>> Switch between doing the data transfer in the new handler and doing it
>>>> in the old save_state handler depending on the
>>>> x-migration-multifd-transfer device property value.
>>>>
>>>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>>>> ---
>>>> hw/vfio/migration.c | 155 +++++++++++++++++++++++++++++++++++++++++++
>>>> hw/vfio/trace-events | 2 +
>>>> 2 files changed, 157 insertions(+)
>>>>
>>>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>>>> index b54879fe6209..8709672ada48 100644
>>>> --- a/hw/vfio/migration.c
>>>> +++ b/hw/vfio/migration.c
>>>> @@ -771,6 +771,24 @@ static int vfio_save_setup(QEMUFile *f, void *opaque, Error **errp)
>>>> uint64_t stop_copy_size = VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE;
>>>> int ret;
>>>>
>>>> + /*
>>>> + * Make a copy of this setting at the start in case it is changed
>>>> + * mid-migration.
>>>> + */
>>>> + if (vbasedev->migration_multifd_transfer == ON_OFF_AUTO_AUTO) {
>>>> + migration->multifd_transfer = vfio_multifd_transfer_supported();
>>>> + } else {
>>>> + migration->multifd_transfer =
>>>> + vbasedev->migration_multifd_transfer == ON_OFF_AUTO_ON;
>>>> + }
>>>> +
>>>> + if (migration->multifd_transfer && !vfio_multifd_transfer_supported()) {
>>>> + error_setg(errp,
>>>> + "%s: Multifd device transfer requested but unsupported in the current config",
>>>> + vbasedev->name);
>>>> + return -EINVAL;
>>>> + }
>>>> +
>>>> qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
>>>>
>>>> vfio_query_stop_copy_size(vbasedev, &stop_copy_size);
>>>> @@ -942,13 +960,32 @@ static int vfio_save_iterate(QEMUFile *f, void *opaque)
>>>> return !migration->precopy_init_size && !migration->precopy_dirty_size;
>>>> }
>>>>
>>>> +static void vfio_save_multifd_emit_dummy_eos(VFIODevice *vbasedev, QEMUFile *f)
>>>> +{
>>>> + VFIOMigration *migration = vbasedev->migration;
>>>> +
>>>> + assert(migration->multifd_transfer);
>>>> +
>>>> + /*
>>>> + * Emit dummy NOP data on the main migration channel since the actual
>>>> + * device state transfer is done via multifd channels.
>>>> + */
>>>> + qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>>>> +}
>>>> +
>>>> static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
>>>> {
>>>> VFIODevice *vbasedev = opaque;
>>>> + VFIOMigration *migration = vbasedev->migration;
>>>> ssize_t data_size;
>>>> int ret;
>>>> Error *local_err = NULL;
>>>>
>>>> + if (migration->multifd_transfer) {
>>>> + vfio_save_multifd_emit_dummy_eos(vbasedev, f);
>>>> + return 0;
>>>> + }
>>>
>>> I wonder whether we should add a .save_live_use_thread SaveVMHandlers through which a device can indicate if it wants to save its data with the async or sync handler.
>>> This will allow migration layer (i.e., qemu_savevm_state_complete_precopy_iterable) to know which handler to call instead of calling both of them and letting each device implicitly decide.
>>> IMHO it will make the code clearer and will allow us to drop vfio_save_multifd_emit_dummy_eos().
>>
>> I think that it's not worth adding a new SaveVMHandler just for this specific
>> use case, considering that it's easy to handle it inside driver by emitting that
>> FLAG_END_OF_STATE.
>>
>> Especially considering that for compatibility with other drivers that do not
>> define that hypothetical new SaveVMHandler not having it defined would need to
>> have the same effect as it always returning "false".
>
> We already have such handlers like .is_active, .has_postcopy and .is_active_iterate.
> Since VFIO migration with multifd involves a lot of threads and convoluted code paths, I thought this could put some order (even if small) into things, especially if it allows us to avoid the vfio_save_multifd_emit_dummy_eos() which feels a bit hackish.
>
> But anyway, that's only my opinion, and I can understand why this could be seen as an overkill.
@Cedric, @Peter:
what's your opinion here?
Is it better to add a new "flag" SaveVMHandler or keep handling
the multifd/non-multifd transfer difference in the VFIO driver
by emitting VFIO_MIG_FLAG_END_OF_STATE in
vfio_save_complete_precopy() and vfio_save_state()?
Note that this new "flag" SaveVMHandler would need to have
semantics of disabling both save_live_complete_precopy and
save_state handlers and enabling save_live_complete_precopy_thread
instead.
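To illustrate, with such a handler the core-side dispatch could look
roughly like this (hypothetical sketch only, reusing the invented
save_live_use_thread() name from earlier in the thread):

static int savevm_complete_precopy_one(SaveStateEntry *se, QEMUFile *f)
{
    bool use_thread = se->ops->save_live_use_thread &&
                      se->ops->save_live_use_thread(se->opaque);

    if (use_thread) {
        /*
         * Device state goes via save_live_complete_precopy_thread on the
         * multifd channels; skipping both save_live_complete_precopy and
         * save_state here means no dummy VFIO_MIG_FLAG_END_OF_STATE has
         * to be emitted on the main channel.
         */
        return 0;
    }

    return se->ops->save_live_complete_precopy(f, se->opaque);
}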
> Thanks.
Thanks,
Maciej
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 24/24] vfio/migration: Multifd device state transfer support - send side
2024-12-12 22:53 ` Maciej S. Szmigiero
@ 2024-12-16 17:33 ` Peter Xu
2024-12-19 9:50 ` Cédric Le Goater
0 siblings, 1 reply; 140+ messages in thread
From: Peter Xu @ 2024-12-16 17:33 UTC (permalink / raw)
To: Maciej S. Szmigiero
Cc: Avihai Horon, Cédric Le Goater, Alex Williamson,
Fabiano Rosas, Eric Blake, Markus Armbruster,
Daniel P . Berrangé, Joao Martins, qemu-devel
On Thu, Dec 12, 2024 at 11:53:05PM +0100, Maciej S. Szmigiero wrote:
> On 12.12.2024 15:54, Avihai Horon wrote:
> >
> > On 11/12/2024 1:06, Maciej S. Szmigiero wrote:
> > >
> > > On 9.12.2024 10:28, Avihai Horon wrote:
> > > >
> > > > On 17/11/2024 21:20, Maciej S. Szmigiero wrote:
> > > > >
> > > > > From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> > > > >
> > > > > Implement the multifd device state transfer via additional per-device
> > > > > thread inside save_live_complete_precopy_thread handler.
> > > > >
> > > > > Switch between doing the data transfer in the new handler and doing it
> > > > > in the old save_state handler depending on the
> > > > > x-migration-multifd-transfer device property value.
> > > > >
> > > > > Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> > > > > ---
> > > > > hw/vfio/migration.c | 155 +++++++++++++++++++++++++++++++++++++++++++
> > > > > hw/vfio/trace-events | 2 +
> > > > > 2 files changed, 157 insertions(+)
> > > > >
> > > > > diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> > > > > index b54879fe6209..8709672ada48 100644
> > > > > --- a/hw/vfio/migration.c
> > > > > +++ b/hw/vfio/migration.c
> > > > > @@ -771,6 +771,24 @@ static int vfio_save_setup(QEMUFile *f, void *opaque, Error **errp)
> > > > > uint64_t stop_copy_size = VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE;
> > > > > int ret;
> > > > >
> > > > > + /*
> > > > > + * Make a copy of this setting at the start in case it is changed
> > > > > + * mid-migration.
> > > > > + */
> > > > > + if (vbasedev->migration_multifd_transfer == ON_OFF_AUTO_AUTO) {
> > > > > + migration->multifd_transfer = vfio_multifd_transfer_supported();
> > > > > + } else {
> > > > > + migration->multifd_transfer =
> > > > > + vbasedev->migration_multifd_transfer == ON_OFF_AUTO_ON;
> > > > > + }
> > > > > +
> > > > > + if (migration->multifd_transfer && !vfio_multifd_transfer_supported()) {
> > > > > + error_setg(errp,
> > > > > + "%s: Multifd device transfer requested but unsupported in the current config",
> > > > > + vbasedev->name);
> > > > > + return -EINVAL;
> > > > > + }
> > > > > +
> > > > > qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
> > > > >
> > > > > vfio_query_stop_copy_size(vbasedev, &stop_copy_size);
> > > > > @@ -942,13 +960,32 @@ static int vfio_save_iterate(QEMUFile *f, void *opaque)
> > > > > return !migration->precopy_init_size && !migration->precopy_dirty_size;
> > > > > }
> > > > >
> > > > > +static void vfio_save_multifd_emit_dummy_eos(VFIODevice *vbasedev, QEMUFile *f)
> > > > > +{
> > > > > + VFIOMigration *migration = vbasedev->migration;
> > > > > +
> > > > > + assert(migration->multifd_transfer);
> > > > > +
> > > > > + /*
> > > > > + * Emit dummy NOP data on the main migration channel since the actual
> > > > > + * device state transfer is done via multifd channels.
> > > > > + */
> > > > > + qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> > > > > +}
> > > > > +
> > > > > static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
> > > > > {
> > > > > VFIODevice *vbasedev = opaque;
> > > > > + VFIOMigration *migration = vbasedev->migration;
> > > > > ssize_t data_size;
> > > > > int ret;
> > > > > Error *local_err = NULL;
> > > > >
> > > > > + if (migration->multifd_transfer) {
> > > > > + vfio_save_multifd_emit_dummy_eos(vbasedev, f);
> > > > > + return 0;
> > > > > + }
> > > >
> > > > I wonder whether we should add a .save_live_use_thread SaveVMHandlers through which a device can indicate if it wants to save its data with the async or sync handler.
> > > > This will allow migration layer (i.e., qemu_savevm_state_complete_precopy_iterable) to know which handler to call instead of calling both of them and letting each device implicitly decide.
> > > > IMHO it will make the code clearer and will allow us to drop vfio_save_multifd_emit_dummy_eos().
> > >
> > > I think that it's not worth adding a new SaveVMHandler just for this specific
> > > use case, considering that it's easy to handle it inside driver by emitting that
> > > FLAG_END_OF_STATE.
> > >
> > > Especially considering that for compatibility with other drivers that do not
> > > define that hypothetical new SaveVMHandler not having it defined would need to
> > > have the same effect as it always returning "false".
> >
> > We already have such handlers like .is_active, .has_postcopy and .is_active_iterate.
> > Since VFIO migration with multifd involves a lot of threads and convoluted code paths, I thought this could put some order (even if small) into things, especially if it allows us to avoid the vfio_save_multifd_emit_dummy_eos() which feels a bit hackish.
> >
> > But anyway, that's only my opinion, and I can understand why this could be seen as an overkill.
>
> @Cedric, @Peter:
> what's your opinion here?
>
> Is it better to add a new "flag" SaveVMHandler or keep handling
> the multifd/non-multifd transfer difference in the VFIO driver
> by emitting VFIO_MIG_FLAG_END_OF_STATE in
> vfio_save_complete_precopy() and vfio_save_state()?
>
> Note that this new "flag" SaveVMHandler would need to have
> semantics of disabling both save_live_complete_precopy and
> save_state handlers and enabling save_live_complete_precopy_thread
> instead.
If it's about adding one more global vmstate hook (even if only used in
vfio), only to conditionally disable two other random vmstate hooks, then
it isn't a very attractive idea to me, indeed.
PS: when I look at is_active (which is only used by two sites but needs to
be invoked in literally all the rest of the hooks, and I doubt whether it
could change during migration at all..), or has_postcopy (which is weird in
another way, e.g., should we simply forbid pmem+postcopy setup upfront?), I
doubt whether they were the best solution for the problems.. but those are
separate questions to ask.
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 24/24] vfio/migration: Multifd device state transfer support - send side
2024-12-16 17:33 ` Peter Xu
@ 2024-12-19 9:50 ` Cédric Le Goater
0 siblings, 0 replies; 140+ messages in thread
From: Cédric Le Goater @ 2024-12-19 9:50 UTC (permalink / raw)
To: Peter Xu, Maciej S. Szmigiero
Cc: Avihai Horon, Alex Williamson, Fabiano Rosas, Eric Blake,
Markus Armbruster, Daniel P. Berrangé, Joao Martins,
qemu-devel
On 12/16/24 18:33, Peter Xu wrote:
> On Thu, Dec 12, 2024 at 11:53:05PM +0100, Maciej S. Szmigiero wrote:
>> On 12.12.2024 15:54, Avihai Horon wrote:
>>>
>>> On 11/12/2024 1:06, Maciej S. Szmigiero wrote:
>>>>
>>>> On 9.12.2024 10:28, Avihai Horon wrote:
>>>>>
>>>>> On 17/11/2024 21:20, Maciej S. Szmigiero wrote:
>>>>>>
>>>>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>>>>
>>>>>> Implement the multifd device state transfer via additional per-device
>>>>>> thread inside save_live_complete_precopy_thread handler.
>>>>>>
>>>>>> Switch between doing the data transfer in the new handler and doing it
>>>>>> in the old save_state handler depending on the
>>>>>> x-migration-multifd-transfer device property value.
>>>>>>
>>>>>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>>>>>> ---
>>>>>> hw/vfio/migration.c | 155 +++++++++++++++++++++++++++++++++++++++++++
>>>>>> hw/vfio/trace-events | 2 +
>>>>>> 2 files changed, 157 insertions(+)
>>>>>>
>>>>>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>>>>>> index b54879fe6209..8709672ada48 100644
>>>>>> --- a/hw/vfio/migration.c
>>>>>> +++ b/hw/vfio/migration.c
>>>>>> @@ -771,6 +771,24 @@ static int vfio_save_setup(QEMUFile *f, void *opaque, Error **errp)
>>>>>> uint64_t stop_copy_size = VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE;
>>>>>> int ret;
>>>>>>
>>>>>> + /*
>>>>>> + * Make a copy of this setting at the start in case it is changed
>>>>>> + * mid-migration.
>>>>>> + */
>>>>>> + if (vbasedev->migration_multifd_transfer == ON_OFF_AUTO_AUTO) {
>>>>>> + migration->multifd_transfer = vfio_multifd_transfer_supported();
>>>>>> + } else {
>>>>>> + migration->multifd_transfer =
>>>>>> + vbasedev->migration_multifd_transfer == ON_OFF_AUTO_ON;
>>>>>> + }
>>>>>> +
>>>>>> + if (migration->multifd_transfer && !vfio_multifd_transfer_supported()) {
>>>>>> + error_setg(errp,
>>>>>> + "%s: Multifd device transfer requested but unsupported in the current config",
>>>>>> + vbasedev->name);
>>>>>> + return -EINVAL;
>>>>>> + }
>>>>>> +
>>>>>> qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
>>>>>>
>>>>>> vfio_query_stop_copy_size(vbasedev, &stop_copy_size);
>>>>>> @@ -942,13 +960,32 @@ static int vfio_save_iterate(QEMUFile *f, void *opaque)
>>>>>> return !migration->precopy_init_size && !migration->precopy_dirty_size;
>>>>>> }
>>>>>>
>>>>>> +static void vfio_save_multifd_emit_dummy_eos(VFIODevice *vbasedev, QEMUFile *f)
>>>>>> +{
>>>>>> + VFIOMigration *migration = vbasedev->migration;
>>>>>> +
>>>>>> + assert(migration->multifd_transfer);
>>>>>> +
>>>>>> + /*
>>>>>> + * Emit dummy NOP data on the main migration channel since the actual
>>>>>> + * device state transfer is done via multifd channels.
>>>>>> + */
>>>>>> + qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>>>>>> +}
>>>>>> +
>>>>>> static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
>>>>>> {
>>>>>> VFIODevice *vbasedev = opaque;
>>>>>> + VFIOMigration *migration = vbasedev->migration;
>>>>>> ssize_t data_size;
>>>>>> int ret;
>>>>>> Error *local_err = NULL;
>>>>>>
>>>>>> + if (migration->multifd_transfer) {
>>>>>> + vfio_save_multifd_emit_dummy_eos(vbasedev, f);
>>>>>> + return 0;
>>>>>> + }
>>>>>
>>>>> I wonder whether we should add a .save_live_use_thread SaveVMHandlers through which a device can indicate if it wants to save its data with the async or sync handler.
>>>>> This will allow migration layer (i.e., qemu_savevm_state_complete_precopy_iterable) to know which handler to call instead of calling both of them and letting each device implicitly decide.
>>>>> IMHO it will make the code clearer and will allow us to drop vfio_save_multifd_emit_dummy_eos().
>>>>
>>>> I think that it's not worth adding a new SaveVMHandler just for this specific
>>>> use case, considering that it's easy to handle it inside driver by emitting that
>>>> FLAG_END_OF_STATE.
>>>>
>>>> Especially considering that for compatibility with other drivers that do not
>>>> define that hypothetical new SaveVMHandler not having it defined would need to
>>>> have the same effect as it always returning "false".
>>>
>>> We already have such handlers like .is_active, .has_postcopy and .is_active_iterate.
>>> Since VFIO migration with multifd involves a lot of threads and convoluted code paths, I thought this could put some order (even if small) into things, especially if it allows us to avoid the vfio_save_multifd_emit_dummy_eos() which feels a bit hackish.
>>>
>>> But anyway, that's only my opinion, and I can understand why this could be seen as an overkill.
>>
>> @Cedric, @Peter:
>> what's your opinion here?
>>
>> Is it better to add a new "flag" SaveVMHandler or keep handling
>> the multifd/non-multifd transfer difference in the VFIO driver
>> by emitting VFIO_MIG_FLAG_END_OF_STATE in
>> vfio_save_complete_precopy() and vfio_save_state()?
>>
>> Note that this new "flag" SaveVMHandler would need to have
>> semantics of disabling both save_live_complete_precopy and
>> save_state handlers and enabling save_live_complete_precopy_thread
>> instead.
>
> If it's about adding one more global vmstate hook (even if only used in
> vfio), only to conditionally disable two other random vmstate hooks, then
> it isn't very attractive idea to me indeed.
We will need the 'multifd_transfer' VFIO field anyhow (not sure why it
is not at the device level yet though). So I guess it is fine to keep
it that way. However, I would rename vfio_save_multifd_emit_dummy_eos()
to something more explicit like vfio_multifd_complete_precopy().
Thanks,
C.
>
> PS: when I look at is_active (which is only used by two sites but needs to
> be invoked in literally all the rest hooks, and I doubt whether it could
> change during migration at all..), or has_postcopy (which is weird in
> another way, e.g., should we simply forbid pmem+postcopy setup upfront?), I
> doubt whether they were the best solution for the problems.. but that's
> separate questions to ask.
>
> Thanks,
>
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 00/24] Multifd 🔀 device state transfer support with VFIO consumer
2024-11-17 19:19 [PATCH v3 00/24] Multifd 🔀 device state transfer support with VFIO consumer Maciej S. Szmigiero
` (23 preceding siblings ...)
2024-11-17 19:20 ` [PATCH v3 24/24] vfio/migration: Multifd device state transfer support - send side Maciej S. Szmigiero
@ 2024-12-04 19:10 ` Peter Xu
2024-12-06 18:03 ` Maciej S. Szmigiero
2024-12-05 21:27 ` Cédric Le Goater
25 siblings, 1 reply; 140+ messages in thread
From: Peter Xu @ 2024-12-04 19:10 UTC (permalink / raw)
To: Maciej S. Szmigiero
Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On Sun, Nov 17, 2024 at 08:19:55PM +0100, Maciej S. Szmigiero wrote:
> Important note:
> 4 VF benchmarks were done with commit 5504a8126115
> ("KVM: Dynamic sized kvm memslots array") and its revert-dependencies
> reverted since this seems to improve performance in this VM config if the
> multifd transfer is enabled: the downtime performance with this commit
> present is 1141 ms enabled / 1730 ms disabled.
>
> Smaller VF counts actually do seem to benefit from this commit, so it's
> likely that in the future adding some kind of a memslot pre-allocation
> bit stream message might make sense to avoid this downtime regression for
> 4 VF configs (and likely higher VF count too).
I'm confused why reverting 5504a8126115 could be faster, and that it affects
as much as 600 ms. Also, how can that effect differ depending on the number
of VFs?
Could you share more on this regression? Because if that's problematic we
need to fix it, or upstream QEMU (after this series is merged) will still not
work.
--
Peter Xu
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 00/24] Multifd 🔀 device state transfer support with VFIO consumer
2024-12-04 19:10 ` [PATCH v3 00/24] Multifd 🔀 device state transfer support with VFIO consumer Peter Xu
@ 2024-12-06 18:03 ` Maciej S. Szmigiero
2024-12-06 22:20 ` Peter Xu
0 siblings, 1 reply; 140+ messages in thread
From: Maciej S. Szmigiero @ 2024-12-06 18:03 UTC (permalink / raw)
To: Peter Xu
Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P. Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On 4.12.2024 20:10, Peter Xu wrote:
> On Sun, Nov 17, 2024 at 08:19:55PM +0100, Maciej S. Szmigiero wrote:
>> Important note:
>> 4 VF benchmarks were done with commit 5504a8126115
>> ("KVM: Dynamic sized kvm memslots array") and its revert-dependencies
>> reverted since this seems to improve performance in this VM config if the
>> multifd transfer is enabled: the downtime performance with this commit
>> present is 1141 ms enabled / 1730 ms disabled.
>>
>> Smaller VF counts actually do seem to benefit from this commit, so it's
>> likely that in the future adding some kind of a memslot pre-allocation
>> bit stream message might make sense to avoid this downtime regression for
>> 4 VF configs (and likely higher VF count too).
>
> I'm confused why revert 5504a8126115 could be faster, and it affects as
> much as 600ms. Also how that effect differs can relevant to num of VFs.
>
> Could you share more on this regression? Because if that's problematic we
> need to fix it, or upstream QEMU (after this series merged) will still not
> work.
>
The number of memslots that the VM uses seems to differ depending on its
VF count, each VF using 2 memslots:
2 VFs, used slots: 13
4 VFs, used slots: 17
5 VFs, used slots: 19
So I suspect this performance difference is due to these higher counts
of memslots possibly benefiting from being preallocated in the previous
QEMU code (before commit 5504a8126115).
I can see that with this commit:
> #define KVM_MEMSLOTS_NR_ALLOC_DEFAULT 16
So it would explain why the difference is visible on 4 VFs only (and
possibly higher VF counts, I just don't have the ability to test migrating
them) since with 4 VF configs we exceed KVM_MEMSLOTS_NR_ALLOC_DEFAULT.
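As a quick cross-check of that arithmetic (standalone sketch; the
"9 + 2 * n" baseline is inferred from the slot counts above, not taken
from the QEMU sources):

#include <stdio.h>

int main(void)
{
    const int alloc_default = 16; /* KVM_MEMSLOTS_NR_ALLOC_DEFAULT */

    for (int vfs = 2; vfs <= 5; vfs++) {
        int slots = 9 + 2 * vfs; /* 13, 15, 17, 19 used slots */

        printf("%d VFs -> %d slots%s\n", vfs, slots,
               slots > alloc_default ? " (triggers kvm_slots_grow())" : "");
    }

    return 0;
}

So a 4 VF config is exactly the first one to outgrow the preallocated
array.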
Thanks,
Maciej
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 00/24] Multifd 🔀 device state transfer support with VFIO consumer
2024-12-06 18:03 ` Maciej S. Szmigiero
@ 2024-12-06 22:20 ` Peter Xu
2024-12-10 23:06 ` Maciej S. Szmigiero
0 siblings, 1 reply; 140+ messages in thread
From: Peter Xu @ 2024-12-06 22:20 UTC (permalink / raw)
To: Maciej S. Szmigiero
Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P. Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On Fri, Dec 06, 2024 at 07:03:36PM +0100, Maciej S. Szmigiero wrote:
> On 4.12.2024 20:10, Peter Xu wrote:
> > On Sun, Nov 17, 2024 at 08:19:55PM +0100, Maciej S. Szmigiero wrote:
> > > Important note:
> > > 4 VF benchmarks were done with commit 5504a8126115
> > > ("KVM: Dynamic sized kvm memslots array") and its revert-dependencies
> > > reverted since this seems to improve performance in this VM config if the
> > > multifd transfer is enabled: the downtime performance with this commit
> > > present is 1141 ms enabled / 1730 ms disabled.
> > >
> > > Smaller VF counts actually do seem to benefit from this commit, so it's
> > > likely that in the future adding some kind of a memslot pre-allocation
> > > bit stream message might make sense to avoid this downtime regression for
> > > 4 VF configs (and likely higher VF count too).
> >
> > I'm confused why revert 5504a8126115 could be faster, and it affects as
> > much as 600ms. Also how that effect differs can relevant to num of VFs.
> >
> > Could you share more on this regression? Because if that's problematic we
> > need to fix it, or upstream QEMU (after this series merged) will still not
> > work.
> >
>
> The number of memslots that the VM uses seems to differ depending on its
> VF count, each VF using 2 memslots:
> 2 VFs, used slots: 13
> 4 VFs, used slots: 17
> 5 VFs, used slots: 19
It's still pretty few.
>
> So I suspect this performance difference is due to these higher counts
> of memslots possibly benefiting from being preallocated on the previous
> QEMU code (before commit 5504a8126115).
>
> I can see that with this commit:
> > #define KVM_MEMSLOTS_NR_ALLOC_DEFAULT 16
>
> So it would explain why the difference is visible on 4 VFs only (and
> possibly higher VF counts, just I don't have an ability to test migrating
> it) since with 4 VF configs we exceed KVM_MEMSLOTS_NR_ALLOC_DEFAULT.
I suppose it means kvm_slots_grow() is called once, but I don't understand
why it caused 500ms downtime!
Not to mention, that patchset should at least reduce downtime OTOH due to
the small num of slots, because some of the dirty sync / clear path would
need to walk the whole slot array (our lookup is pretty slow for now, but
probably no good reason to rework it yet if it's mostly 10-20).
In general, I would still expect the dynamic memslot work to speed up
(instead of slow down) VFIO migrations.
There's something off here, or something I overlooked. I suggest we figure
it out, even if we need to revert the kvm series on master - though I so far
doubt it.
Otherwise we should at least report the numbers with things as they are on
the master branch, and evaluate merging this series with those real numbers,
because fundamentally those are the numbers people will get when they start
using this feature on master later.
--
Peter Xu
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 00/24] Multifd 🔀 device state transfer support with VFIO consumer
2024-12-06 22:20 ` Peter Xu
@ 2024-12-10 23:06 ` Maciej S. Szmigiero
2024-12-12 17:35 ` Peter Xu
0 siblings, 1 reply; 140+ messages in thread
From: Maciej S. Szmigiero @ 2024-12-10 23:06 UTC (permalink / raw)
To: Peter Xu
Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P. Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On 6.12.2024 23:20, Peter Xu wrote:
> On Fri, Dec 06, 2024 at 07:03:36PM +0100, Maciej S. Szmigiero wrote:
>> On 4.12.2024 20:10, Peter Xu wrote:
>>> On Sun, Nov 17, 2024 at 08:19:55PM +0100, Maciej S. Szmigiero wrote:
>>>> Important note:
>>>> 4 VF benchmarks were done with commit 5504a8126115
>>>> ("KVM: Dynamic sized kvm memslots array") and its revert-dependencies
>>>> reverted since this seems to improve performance in this VM config if the
>>>> multifd transfer is enabled: the downtime performance with this commit
>>>> present is 1141 ms enabled / 1730 ms disabled.
>>>>
>>>> Smaller VF counts actually do seem to benefit from this commit, so it's
>>>> likely that in the future adding some kind of a memslot pre-allocation
>>>> bit stream message might make sense to avoid this downtime regression for
>>>> 4 VF configs (and likely higher VF count too).
>>>
>>> I'm confused why revert 5504a8126115 could be faster, and it affects as
>>> much as 600ms. Also how that effect differs can relevant to num of VFs.
>>>
>>> Could you share more on this regression? Because if that's problematic we
>>> need to fix it, or upstream QEMU (after this series merged) will still not
>>> work.
>>>
>>
>> The number of memslots that the VM uses seems to differ depending on its
>> VF count, each VF using 2 memslots:
>> 2 VFs, used slots: 13
>> 4 VFs, used slots: 17
>> 5 VFs, used slots: 19
>
> It's still pretty less.
>
>>
>> So I suspect this performance difference is due to these higher counts
>> of memslots possibly benefiting from being preallocated on the previous
>> QEMU code (before commit 5504a8126115).
>>
>> I can see that with this commit:
>>> #define KVM_MEMSLOTS_NR_ALLOC_DEFAULT 16
>>
>> So it would explain why the difference is visible on 4 VFs only (and
>> possibly higher VF counts, just I don't have an ability to test migrating
>> it) since with 4 VF configs we exceed KVM_MEMSLOTS_NR_ALLOC_DEFAULT.
>
> I suppose it means kvm_slots_grow() is called once, but I don't understand
> why it caused 500ms downtime!
In this cover letter sentence:
> "the downtime performance with this commit present is 1141 ms enabled / 1730 ms disabled"
"enabled" and "disabled" refer to *multifd transfer* being enabled, not
your patch being present (sorry for not being 100% clear there).
So the difference that the memslot patch makes is 1141 ms - 1095 ms = 46 ms
of extra downtime, not 500 ms.
I can guess this is because of extra contention on the BQL, with unfortunate timing.
> Not to mention, that patchset should at least reduce downtime OTOH due to
> the small num of slots, because some of the dirty sync / clear path would
> need to walk the whole slot array (our lookup is pretty slow for now, but
> probably no good reason to rework it yet if it's mostly 10-20).
With multifd transfer being disabled your memslot patch indeed improves the
downtime by 1900 ms - 1730 ms = 170 ms.
> In general, I would still expect that dynamic memslot work to speedup
> (instead of slowing down) VFIO migrations.
>
> There's something off here, or something I overlooked. I suggest we figure
> it out.. Even if we need to revert the kvm series on master, but I so far
> doubt it.
>
> Otherwise we should at least report the number with things on the master
> branch, and we evaluate merging this series with that real number, because
> fundamentally that's the numbers people will get when start using this
> feature on master later.
Sure, that's why in the cover letter I provided the numbers with your commit
present, too.
Thanks,
Maciej
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 00/24] Multifd 🔀 device state transfer support with VFIO consumer
2024-12-10 23:06 ` Maciej S. Szmigiero
@ 2024-12-12 17:35 ` Peter Xu
2024-12-19 7:55 ` Yanghang Liu
0 siblings, 1 reply; 140+ messages in thread
From: Peter Xu @ 2024-12-12 17:35 UTC (permalink / raw)
To: Maciej S. Szmigiero
Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
Markus Armbruster, Daniel P. Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On Wed, Dec 11, 2024 at 12:06:04AM +0100, Maciej S. Szmigiero wrote:
> On 6.12.2024 23:20, Peter Xu wrote:
> > On Fri, Dec 06, 2024 at 07:03:36PM +0100, Maciej S. Szmigiero wrote:
> > > On 4.12.2024 20:10, Peter Xu wrote:
> > > > On Sun, Nov 17, 2024 at 08:19:55PM +0100, Maciej S. Szmigiero wrote:
> > > > > Important note:
> > > > > 4 VF benchmarks were done with commit 5504a8126115
> > > > > ("KVM: Dynamic sized kvm memslots array") and its revert-dependencies
> > > > > reverted since this seems to improve performance in this VM config if the
> > > > > multifd transfer is enabled: the downtime performance with this commit
> > > > > present is 1141 ms enabled / 1730 ms disabled.
> > > > >
> > > > > Smaller VF counts actually do seem to benefit from this commit, so it's
> > > > > likely that in the future adding some kind of a memslot pre-allocation
> > > > > bit stream message might make sense to avoid this downtime regression for
> > > > > 4 VF configs (and likely higher VF count too).
> > > >
> > > > I'm confused why revert 5504a8126115 could be faster, and it affects as
> > > > much as 600ms. Also how that effect differs can relevant to num of VFs.
> > > >
> > > > Could you share more on this regression? Because if that's problematic we
> > > > need to fix it, or upstream QEMU (after this series merged) will still not
> > > > work.
> > > >
> > >
> > > The number of memslots that the VM uses seems to differ depending on its
> > > VF count, each VF using 2 memslots:
> > > 2 VFs, used slots: 13
> > > 4 VFs, used slots: 17
> > > 5 VFs, used slots: 19
> >
> > It's still pretty less.
> >
> > >
> > > So I suspect this performance difference is due to these higher counts
> > > of memslots possibly benefiting from being preallocated on the previous
> > > QEMU code (before commit 5504a8126115).
> > >
> > > I can see that with this commit:
> > > > #define KVM_MEMSLOTS_NR_ALLOC_DEFAULT 16
> > >
> > > So it would explain why the difference is visible on 4 VFs only (and
> > > possibly higher VF counts, just I don't have an ability to test migrating
> > > it) since with 4 VF configs we exceed KVM_MEMSLOTS_NR_ALLOC_DEFAULT.
> >
> > I suppose it means kvm_slots_grow() is called once, but I don't understand
> > why it caused 500ms downtime!
>
> In this cover letter sentence:
> > "the downtime performance with this commit present is 1141 ms enabled / 1730 ms disabled"
> "enabled" and "disabled" refer to *multifd transfer* being enabled, not
> your patch being present (sorry for not being 100% clear there).
>
> So the difference that the memslot patch makes is 1141 ms - 1095ms = 46 ms extra
> downtime, not 500 ms.
>
> I can guess this is because of extra contention on BQL, with unfortunate timing.
Hmm, I wonder why the address space changed during switchover. I was
expecting the whole address space to be set up when QEMU boots, and to stay
as is during migration. Especially why that matters with multifd at
all.. I could have overlooked something.
>
> > Not to mention, that patchset should at least reduce downtime OTOH due to
> > the small num of slots, because some of the dirty sync / clear path would
> > need to walk the whole slot array (our lookup is pretty slow for now, but
> > probably no good reason to rework it yet if it's mostly 10-20).
>
> With multifd transfer being disabled your memslot patch indeed improves the
> downtime by 1900 ms - 1730 ms = 170 ms.
That's probably the other side of the change when slots grow, as compared to
the pure win case where the series definitely should speed up the dirty
tracking operations quite a bit.
>
> > In general, I would still expect that dynamic memslot work to speedup
> > (instead of slowing down) VFIO migrations.
> >
> > There's something off here, or something I overlooked. I suggest we figure
> > it out.. Even if we need to revert the kvm series on master, but I so far
> > doubt it.
> >
> > Otherwise we should at least report the number with things on the master
> > branch, and we evaluate merging this series with that real number, because
> > fundamentally that's the numbers people will get when start using this
> > feature on master later.
>
> Sure, that's why in the cover letter I provided the numbers with your commit
> present, too.
It seems to me we're not far away from the truth. Anyway, feel free to
update if you figure out the reason, or get some news from profiling.
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 00/24] Multifd 🔀 device state transfer support with VFIO consumer
2024-12-12 17:35 ` Peter Xu
@ 2024-12-19 7:55 ` Yanghang Liu
2024-12-19 8:53 ` Cédric Le Goater
0 siblings, 1 reply; 140+ messages in thread
From: Yanghang Liu @ 2024-12-19 7:55 UTC (permalink / raw)
To: Peter Xu
Cc: Maciej S. Szmigiero, Fabiano Rosas, Alex Williamson,
Cédric Le Goater, Eric Blake, Markus Armbruster,
Daniel P. Berrangé, Avihai Horon, Joao Martins, qemu-devel,
Chao Yang
FYI. The following data comes from the first ping-pong mlx VF
migration after rebooting the host.
1. Test for multifd=0:
1.1 Outgoing migration:
VF number: 1 VF 4 VF
Time elapsed: 10194 ms 10650 ms
Memory processed: 903.911 MiB 783.698 MiB
Memory bandwidth: 108.722 MiB/s 101.978 MiB/s
Iteration: 4 6
Normal data: 881.297 MiB 747.613 MiB
Total downtime 358ms 518ms
Setup time 52ms 450ms
1.2 In coming migration:
VF number: 1 VF 4 VF
Time elapsed: 10161 ms 10569 ms
Memory processed: 903.881 MiB 785.400 MiB
Memory bandwidth: 107.952 MiB/s 100.512 MiB/s
Iteration: 4 7
Normal data: 881.262 MiB 749.297 MiB
Total downtime 315ms 513ms
Setup time 47ms 414ms
2. Test for multifd=1:
2.1 Outgoing migration:
VF number 1 VF 1 VF
Channel number 4 5
Time elapsed: 10962 ms 10071 ms
Memory processed: 908.968 MiB 908.424 MiB
Memory bandwidth: 108.378 MiB/s 110.109 MiB/s
Iteration: 4
4
Normal data: 882.852 MiB 882.566 MiB
Total downtime 318ms 255ms
Setup time 54ms 43ms
VF number 4 VFs 4 VFs
Channel number 8 16
Time elapsed: 10805 ms 10943 ms
Setup time 445 ms 463ms
Memory processed: 786.334 MiB 784.926 MiB
Memory bandwidth 109.062 MiB/s 108.610 MiB/s
Iteration: 5 7
Normal data: 746.758 MiB 744.938 MiB
Total downtime 344 ms 335ms
2.2 Incoming migration:
VF number 1 VF 1 VF
Channel number 4 5
Time elapsed: 10064ms 10072 ms
Memory processed: 909.786 MiB 923.746 MiB
Memory bandwidth: 109.997 MiB/s 111.308 MiB/s
Iteration: 4 4
Normal data: 883.664 MiB 897.848 MiB
Total downtime 313ms 328ms
Setup time 46ms 47ms
VF number 4 VFs 4 VFs
Channel number 8 16
Time elapsed: 10126 ms 9941 ms
Memory processed: 791.308 MiB 779.560 MiB
Memory bandwidth: 108.876 MiB/s 110.170 MiB/s
Iteration: 7 5
Normal data: 751.672 MiB 739.680 MiB
Total downtime 304 ms 309ms
Setup time 442 ms 446ms
Best Regards,
Yanghang Liu
On Fri, Dec 13, 2024 at 1:36 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Wed, Dec 11, 2024 at 12:06:04AM +0100, Maciej S. Szmigiero wrote:
> > On 6.12.2024 23:20, Peter Xu wrote:
> > > On Fri, Dec 06, 2024 at 07:03:36PM +0100, Maciej S. Szmigiero wrote:
> > > > On 4.12.2024 20:10, Peter Xu wrote:
> > > > > On Sun, Nov 17, 2024 at 08:19:55PM +0100, Maciej S. Szmigiero wrote:
> > > > > > Important note:
> > > > > > 4 VF benchmarks were done with commit 5504a8126115
> > > > > > ("KVM: Dynamic sized kvm memslots array") and its revert-dependencies
> > > > > > reverted since this seems to improve performance in this VM config if the
> > > > > > multifd transfer is enabled: the downtime performance with this commit
> > > > > > present is 1141 ms enabled / 1730 ms disabled.
> > > > > >
> > > > > > Smaller VF counts actually do seem to benefit from this commit, so it's
> > > > > > likely that in the future adding some kind of a memslot pre-allocation
> > > > > > bit stream message might make sense to avoid this downtime regression for
> > > > > > 4 VF configs (and likely higher VF count too).
> > > > >
> > > > > I'm confused why revert 5504a8126115 could be faster, and it affects as
> > > > > much as 600ms. Also how that effect differs can relevant to num of VFs.
> > > > >
> > > > > Could you share more on this regression? Because if that's problematic we
> > > > > need to fix it, or upstream QEMU (after this series merged) will still not
> > > > > work.
> > > > >
> > > >
> > > > The number of memslots that the VM uses seems to differ depending on its
> > > > VF count, each VF using 2 memslots:
> > > > 2 VFs, used slots: 13
> > > > 4 VFs, used slots: 17
> > > > 5 VFs, used slots: 19
> > >
> > > It's still pretty less.
> > >
> > > >
> > > > So I suspect this performance difference is due to these higher counts
> > > > of memslots possibly benefiting from being preallocated on the previous
> > > > QEMU code (before commit 5504a8126115).
> > > >
> > > > I can see that with this commit:
> > > > > #define KVM_MEMSLOTS_NR_ALLOC_DEFAULT 16
> > > >
> > > > So it would explain why the difference is visible on 4 VFs only (and
> > > > possibly higher VF counts, just I don't have an ability to test migrating
> > > > it) since with 4 VF configs we exceed KVM_MEMSLOTS_NR_ALLOC_DEFAULT.
> > >
> > > I suppose it means kvm_slots_grow() is called once, but I don't understand
> > > why it caused 500ms downtime!
> >
> > In this cover letter sentence:
> > > "the downtime performance with this commit present is 1141 ms enabled / 1730 ms disabled"
> > "enabled" and "disabled" refer to *multifd transfer* being enabled, not
> > your patch being present (sorry for not being 100% clear there).
> >
> > So the difference that the memslot patch makes is 1141 ms - 1095ms = 46 ms extra
> > downtime, not 500 ms.
> >
> > I can guess this is because of extra contention on BQL, with unfortunate timing.
>
> Hmm, I wonder why the address space changed during switchover. I was
> expecting the whole address space is updated on qemu boots up, and should
> keep as is during migration. Especially why that matters with mulitifd at
> all.. I could have overlooked something.
>
> >
> > > Not to mention, that patchset should at least reduce downtime OTOH due to
> > > the small num of slots, because some of the dirty sync / clear path would
> > > need to walk the whole slot array (our lookup is pretty slow for now, but
> > > probably no good reason to rework it yet if it's mostly 10-20).
> >
> > With multifd transfer being disabled your memslot patch indeed improves the
> > downtime by 1900 ms - 1730 ms = 170 ms.
>
> That's probably the other side of the change when slots grow, comparing to
> the pure win where the series definitely should speedup the dirty track
> operations quite a bit.
>
> >
> > > In general, I would still expect that dynamic memslot work to speedup
> > > (instead of slowing down) VFIO migrations.
> > >
> > > There's something off here, or something I overlooked. I suggest we figure
> > > it out.. Even if we need to revert the kvm series on master, but I so far
> > > doubt it.
> > >
> > > Otherwise we should at least report the number with things on the master
> > > branch, and we evaluate merging this series with that real number, because
> > > fundamentally that's the numbers people will get when start using this
> > > feature on master later.
> >
> > Sure, that's why in the cover letter I provided the numbers with your commit
> > present, too.
>
> It seems to me we're not far away from the truth. Anyway, feel free to
> update if you figure out the reason, or got some news on profiling.
>
> Thanks,
>
> --
> Peter Xu
>
>
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 00/24] Multifd 🔀 device state transfer support with VFIO consumer
2024-12-19 7:55 ` Yanghang Liu
@ 2024-12-19 8:53 ` Cédric Le Goater
2024-12-19 13:00 ` Yanghang Liu
0 siblings, 1 reply; 140+ messages in thread
From: Cédric Le Goater @ 2024-12-19 8:53 UTC (permalink / raw)
To: Yanghang Liu, Peter Xu
Cc: Maciej S. Szmigiero, Fabiano Rosas, Alex Williamson, Eric Blake,
Markus Armbruster, Daniel P. Berrangé, Avihai Horon,
Joao Martins, qemu-devel, Chao Yang
Hello Yanghang
On 12/19/24 08:55, Yanghang Liu wrote:
> FYI. The following data comes from the first ping-pong mlx VF
> migration after rebooting the host.
>
>
> 1. Test for multifd=0:
>
> 1.1 Outgoing migration:
> VF number: 1 VF 4 VF
> Time elapsed: 10194 ms 10650 ms
> Memory processed: 903.911 MiB 783.698 MiB
> Memory bandwidth: 108.722 MiB/s 101.978 MiB/s
> Iteration: 4 6
> Normal data: 881.297 MiB 747.613 MiB
> Total downtime 358ms 518ms
> Setup time 52ms 450ms
>
> 1.2 In coming migration:
> VF number: 1 VF 4 VF
> Time elapsed: 10161 ms 10569 ms
> Memory processed: 903.881 MiB 785.400 MiB
> Memory bandwidth: 107.952 MiB/s 100.512 MiB/s
> Iteration: 4 7
> Normal data: 881.262 MiB 749.297 MiB
> Total downtime 315ms 513ms
> Setup time 47ms 414ms
>
>
> 2. Test for multifd=1:
>
> 2.1 Outgoing migration:
> VF number 1 VF 1 VF
> Channel number 4 5
> Time elapsed: 10962 ms 10071 ms
> Memory processed: 908.968 MiB 908.424 MiB
> Memory bandwidth: 108.378 MiB/s 110.109 MiB/s
> Iteration: 4
> 4
> Normal data: 882.852 MiB 882.566 MiB
> Total downtime 318ms 255ms
> Setup time 54ms 43ms
>
>
> VF number 4 VFs 4 VFs
> Channel number 8 16
> Time elapsed: 10805 ms 10943 ms
> Setup time 445 ms 463ms
> Memory processed: 786.334 MiB 784.926 MiB
> Memory bandwidth 109.062 MiB/s 108.610 MiB/s
> Iteration: 5 7
> Normal data: 746.758 MiB 744.938 MiB
> Total downtime 344 ms 335ms
>
>
> 2.2 Incoming migration:
> VF number 1 VF 1 VF
> Channel number 4 5
> Time elapsed: 10064ms 10072 ms
> Memory processed: 909.786 MiB 923.746 MiB
> Memory bandwidth: 109.997 MiB/s 111.308 MiB/s
> Iteration: 4 4
> Normal data: 883.664 MiB 897.848 MiB
> Total downtime 313ms 328ms
> Setup time 46ms 47ms
>
> VF number 4 VFs 4 VFs
> Channel number 8 16
> Time elapsed: 10126 ms 9941 ms
> Memory processed: 791.308 MiB 779.560 MiB
> Memory bandwidth: 108.876 MiB/s 110.170 MiB/s
> Iteration: 7 5
> Normal data: 751.672 MiB 739.680 MiB
> Total downtime 304 ms 309ms
> Setup time 442 ms 446ms
>
This is difficult to read. Could you please resend with fixed
indentation?
We would need more information on the host and VM config too.
Thanks,
C.
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 00/24] Multifd 🔀 device state transfer support with VFIO consumer
2024-12-19 8:53 ` Cédric Le Goater
@ 2024-12-19 13:00 ` Yanghang Liu
0 siblings, 0 replies; 140+ messages in thread
From: Yanghang Liu @ 2024-12-19 13:00 UTC (permalink / raw)
To: Cédric Le Goater
Cc: Peter Xu, Maciej S. Szmigiero, Fabiano Rosas, Alex Williamson,
Eric Blake, Markus Armbruster, Daniel P. Berrangé,
Avihai Horon, Joao Martins, qemu-devel, Chao Yang
Sorry for the inconvenience. Let me try to re-send my data via my Gmail client.
Test environment:
Host : Dell 7625
CPU : EPYC-Genoa
VM config : 4 vCPU, 8G memory
Network Device: MT2910
Test report:
+------------------+---------------+----------------+
| multifd=0 | outgoing migration |
+------------------+---------------+----------------+
| VF(s) number | 1 | 4 |
| Time elapsed | 10194 ms | 10650 ms |
| Memory processed | 903.911 MiB | 783.698 MiB |
| Memory bandwidth | 108.722 MiB/s | 101.978 MiB/s |
| Iteration | 4 | 6 |
| Normal data | 881.297 MiB | 747.613 MiB |
| Total downtime | 358ms | 518ms |
| Setup time | 52ms | 450ms |
+------------------+---------------+----------------+
+------------------+---------------+----------------+
| multifd=0 | incoming migration |
+------------------+---------------+----------------+
| VF(s) number | 1 | 4 |
| Time elapsed | 10161 ms | 10569 ms |
| Memory processed | 903.881 MiB | 785.400 MiB |
| Memory bandwidth | 107.952 MiB/s | 100.512 MiB/s |
| Iteration | 4 | 7 |
| Normal data | 881.262 MiB | 749.297 MiB |
| Total downtime | 315ms | 513ms |
| Setup time | 47ms | 414ms |
+------------------+---------------+----------------+
+------------------+---------------+---------------+
| multifd=1 | outgoing migration |
+------------------+---------------+---------------+
| VF(s) number | 1 | 1 |
| Channel | 4 | 5 |
| Time elapsed | 10962 ms | 10071 ms |
| Memory processed | 908.968 MiB | 908.424 MiB |
| Memory bandwidth | 108.378 MiB/s | 110.109 MiB/s |
| Iteration | 4 | 4 |
| Normal data | 882.852 MiB | 882.566 MiB |
| Total downtime | 318ms | 255ms |
| Setup time | 54ms | 43ms |
+------------------+---------------+---------------+
+------------------+---------------+----------------+
| multifd=1 | incoming migration |
+------------------+---------------+----------------+
| VF(s) number | 1 | 1 |
| Channel | 4 | 5 |
| Time elapsed | 10064ms | 10072 ms |
| Memory processed | 909.786 MiB | 923.746 MiB |
| Memory bandwidth | 109.997 MiB/s | 111.308 MiB/s |
| Iteration | 4 | 4 |
| Normal data | 883.664 MiB | 897.848 MiB |
| Total downtime | 313ms | 328ms |
| Setup time | 46ms | 47ms |
+------------------+---------------+----------------+
+------------------+---------------+----------------+
| multifd=1 | outgoing migration |
+------------------+---------------+----------------+
| VF(s) number | 4 | 4 |
| Channel | 8 | 16 |
| Time elapsed | 10805 ms | 10943 ms |
| Memory processed | 786.334 MiB | 784.926 MiB |
| Memory bandwidth | 109.062 MiB/s | 108.610 MiB/s |
| Iteration | 5 | 7 |
| Normal data | 746.758 MiB | 744.938 MiB |
| Total downtime | 344 ms | 335ms |
| Setup time | 445 ms | 463ms |
+------------------+---------------+----------------+
+------------------+---------------+------------------+
| multifd=1 | incoming migration |
+------------------+---------------+------------------+
| VF(s) number | 4 | 4 |
| Channel | 8 | 16 |
| Time elapsed | 10126 ms | 9941 ms |
| Memory processed | 791.308 MiB | 779.560 MiB |
| Memory bandwidth | 108.876 MiB/s | 110.170 MiB/s |
| Iteration | 7 | 5 |
| Normal data | 751.672 MiB | 739.680 MiB |
| Total downtime | 304 ms | 309ms |
| Setup time | 442 ms | 446ms |
+------------------+---------------+------------------+
Best Regards,
Yanghang Liu
On Thu, Dec 19, 2024 at 4:53 PM Cédric Le Goater <clg@redhat.com> wrote:
>
> Hello Yanghang
>
> On 12/19/24 08:55, Yanghang Liu wrote:
> > FYI. The following data comes from the first ping-pong mlx VF
> > migration after rebooting the host.
> >
> >
> > 1. Test for multifd=0:
> >
> > 1.1 Outgoing migration:
> > VF number: 1 VF 4 VF
> > Time elapsed: 10194 ms 10650 ms
> > Memory processed: 903.911 MiB 783.698 MiB
> > Memory bandwidth: 108.722 MiB/s 101.978 MiB/s
> > Iteration: 4 6
> > Normal data: 881.297 MiB 747.613 MiB
> > Total downtime 358ms 518ms
> > Setup time 52ms 450ms
> >
> > 1.2 In coming migration:
> > VF number: 1 VF 4 VF
> > Time elapsed: 10161 ms 10569 ms
> > Memory processed: 903.881 MiB 785.400 MiB
> > Memory bandwidth: 107.952 MiB/s 100.512 MiB/s
> > Iteration: 4 7
> > Normal data: 881.262 MiB 749.297 MiB
> > Total downtime 315ms 513ms
> > Setup time 47ms 414ms
> >
> >
> > 2. Test for multifd=1:
> >
> > 2.1 Outgoing migration:
> > VF number 1 VF 1 VF
> > Channel number 4 5
> > Time elapsed: 10962 ms 10071 ms
> > Memory processed: 908.968 MiB 908.424 MiB
> > Memory bandwidth: 108.378 MiB/s 110.109 MiB/s
> > Iteration: 4
> > 4
> > Normal data: 882.852 MiB 882.566 MiB
> > Total downtime 318ms 255ms
> > Setup time 54ms 43ms
> >
> >
> > VF number 4 VFs 4 VFs
> > Channel number 8 16
> > Time elapsed: 10805 ms 10943 ms
> > Setup time 445 ms 463ms
> > Memory processed: 786.334 MiB 784.926 MiB
> > Memory bandwidth 109.062 MiB/s 108.610 MiB/s
> > Iteration: 5 7
> > Normal data: 746.758 MiB 744.938 MiB
> > Total downtime 344 ms 335ms
> >
> >
> > 2.2 Incoming migration:
> > VF number 1 VF 1 VF
> > Channel number 4 5
> > Time elapsed: 10064ms 10072 ms
> > Memory processed: 909.786 MiB 923.746 MiB
> > Memory bandwidth: 109.997 MiB/s 111.308 MiB/s
> > Iteration: 4 4
> > Normal data: 883.664 MiB 897.848 MiB
> > Total downtime 313ms 328ms
> > Setup time 46ms 47ms
> >
> > VF number 4 VFs 4 VFs
> > Channel number 8 16
> > Time elapsed: 10126 ms 9941 ms
> > Memory processed: 791.308 MiB 779.560 MiB
> > Memory bandwidth: 108.876 MiB/s 110.170 MiB/s
> > Iteration: 7 5
> > Normal data: 751.672 MiB 739.680 MiB
> > Total downtime 304 ms 309ms
> > Setup time 442 ms 446ms
> >
>
> This is difficult to read. Could you please resend with fixed
> indentation?
>
> We would need more information on the host and VM config too.
>
> Thanks,
>
> C.
>
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 00/24] Multifd device state transfer support with VFIO consumer
2024-11-17 19:19 [PATCH v3 00/24] Multifd device state transfer support with VFIO consumer Maciej S. Szmigiero
` (24 preceding siblings ...)
2024-12-04 19:10 ` [PATCH v3 00/24] Multifd device state transfer support with VFIO consumer Peter Xu
@ 2024-12-05 21:27 ` Cédric Le Goater
2024-12-05 21:42 ` Peter Xu
2024-12-06 18:44 ` Maciej S. Szmigiero
25 siblings, 2 replies; 140+ messages in thread
From: Cédric Le Goater @ 2024-12-05 21:27 UTC (permalink / raw)
To: Maciej S. Szmigiero, Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Eric Blake, Markus Armbruster,
Daniel P . Berrangé, Avihai Horon, Joao Martins, qemu-devel
On 11/17/24 20:19, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>
> This is an updated v3 patch series of the v2 series located here:
> https://lore.kernel.org/qemu-devel/cover.1724701542.git.maciej.szmigiero@oracle.com/
>
> Changes from v2:
> * Reworked the non-AIO (generic) thread pool to use Glib's GThreadPool
> instead of making the current QEMU AIO thread pool generic.
>
> * Added QEMU_VM_COMMAND MIG_CMD_SWITCHOVER_START sub-command to the
> migration bit stream protocol via migration compatibility flag.
> Used this new bit stream sub-command to achieve a barrier between main
> migration channel device state data and multifd device state data instead
> of introducing save_live_complete_precopy_{begin,end} handlers for that as
> the previous patch set version did.
>
> * Added a new migration core thread pool of optional load threads and used
> it to implement VFIO load thread instead of introducing load_finish handler
> as the previous patch set version did.
>
> * Made VFIO device config state load operation happen from that device load
> thread instead of from (now gone) load_finish handler that did such load on
> the main migration thread.
> In the future this may allow pushing BQL deeper into the device config
> state load operation internals and so doing more of it in parallel.
>
> * Switched multifd_send() to using a serializing mutex for thread safety
> instead of atomics as suggested by Peter since this seems to not cause
> any performance regression while being simpler.
>
> * Added two patches improving SaveVMHandlers documentation: one documenting
> the BQL behavior of load SaveVMHandlers, another one explaining
> {load,save}_cleanup handlers semantics.
>
> * Added Peter's proposed patch making MultiFDSendData a struct from
> https://lore.kernel.org/qemu-devel/ZuCickYhs3nf2ERC@x1n/
> The other two patches from that message bring no performance benefits, so
> they were skipped (as discussed in that e-mail thread).
>
> * Switched x-migration-multifd-transfer VFIO property to tri-state (On,
> Off, Auto), with Auto now being the default value.
> This means that VFIO device state transfer via multifd channels is
> automatically attempted in configurations that otherwise support it.
> Note that in this patch set version (in contrast with the previous version)
> the x-migration-multifd-transfer setting is meaningful both on source AND
> destination QEMU.
>
> * Fixed a race condition with respect to the final multifd channel SYNC
> packet sent by the RAM transfer code.
>
> * Made VFIO's bytes_transferred counter atomic since it is accessed from
> multiple threads (thanks Avihai for spotting it).
>
> * Fixed an issue where VFIO device config sender QEMUFile wouldn't be
> closed in some error conditions, switched to QEMUFile g_autoptr() automatic
> memory management there to avoid such bugs in the future (also thanks
> to Avihai for spotting the issue).
>
> * Many, MANY small changes, like renamed functions, added review tags,
> locks annotations, code formatting, split out changes into separate
> commits, etc.
>
> * Redid benchmarks.
>
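As an aside: a minimal generic (non-AIO) pool on top of GLib's GThreadPool,
as mentioned in the first changelog item above, could look roughly like the
sketch below. The helper names are invented for illustration and are not the
ones the patches use.

    /* Sketch of a generic thread pool wrapping GLib's GThreadPool.
     * Illustrative only; not the code from the patch series. */
    #include <glib.h>

    typedef void (*PoolFunc)(gpointer opaque);

    typedef struct {
        PoolFunc func;
        gpointer opaque;
    } PoolWork;

    static void pool_trampoline(gpointer data, gpointer user_data)
    {
        PoolWork *w = data;

        w->func(w->opaque);
        g_free(w);
    }

    static GThreadPool *pool_new(gint max_threads)
    {
        /* exclusive=TRUE gives the pool its own dedicated threads */
        return g_thread_pool_new(pool_trampoline, NULL, max_threads,
                                 TRUE, NULL);
    }

    static void pool_submit(GThreadPool *pool, PoolFunc func, gpointer opaque)
    {
        PoolWork *w = g_new(PoolWork, 1);

        w->func = func;
        w->opaque = opaque;
        g_thread_pool_push(pool, w, NULL);
    }

    static void pool_free_wait(GThreadPool *pool)
    {
        /* immediate=FALSE, wait=TRUE: drain queued work, then join */
        g_thread_pool_free(pool, FALSE, TRUE);
    }
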
> ========================================================================
>
> Benchmark results:
> These are the 25th percentile of downtime results from 70-100 back-and-forth
> live migrations with the same VM config (the guest wasn't restarted during
> these migrations).
>
> Previous benchmarks reported the lowest downtime results ("0th percentile")
> instead, but these were subject to variation since they were often
> outliers.
>
> The benchmarking setup was the same as the one the RFC version of this
> patch set used.
>
>
> Results with 6 multifd channels:
>             4 VFs     2 VFs    1 VF
> Disabled:   1900 ms   859 ms   487 ms
> Enabled:    1095 ms   556 ms   366 ms
>
> Results with 4 VFs but varied multifd channel count:
>                6 ch      8 ch     15 ch
> Enabled:    1095 ms   1104 ms   1125 ms
>
>
> Important note:
> 4 VF benchmarks were done with commit 5504a8126115
> ("KVM: Dynamic sized kvm memslots array") and its revert-dependencies
> reverted since this seems to improve performance in this VM config if the
> multifd transfer is enabled: the downtime performance with this commit
> present is 1141 ms enabled / 1730 ms disabled.
>
> Smaller VF counts actually do seem to benefit from this commit, so it's
> likely that in the future adding some kind of memslot pre-allocation
> bit stream message might make sense to avoid this downtime regression for
> 4 VF configs (and likely higher VF counts too).
>
> ========================================================================
>
> This series is now obviously targeting the post-QEMU-9.2 release
> (AFAIK to be called 10.0).
>
> It will need to be changed to use hw_compat_10_0 once that becomes
> available.
>
> ========================================================================
>
> Maciej S. Szmigiero (23):
> migration: Clarify that {load,save}_cleanup handlers can run without
> setup
> thread-pool: Remove thread_pool_submit() function
> thread-pool: Rename AIO pool functions to *_aio() and data types to
> *Aio
> thread-pool: Implement generic (non-AIO) pool support
> migration: Add MIG_CMD_SWITCHOVER_START and its load handler
> migration: Add qemu_loadvm_load_state_buffer() and its handler
> migration: Document the BQL behavior of load SaveVMHandlers
> migration: Add thread pool of optional load threads
> migration/multifd: Split packet into header and RAM data
> migration/multifd: Device state transfer support - receive side
> migration/multifd: Make multifd_send() thread safe
> migration/multifd: Add an explicit MultiFDSendData destructor
> migration/multifd: Device state transfer support - send side
> migration/multifd: Add migration_has_device_state_support()
> migration/multifd: Send final SYNC only after device state is complete
> migration: Add save_live_complete_precopy_thread handler
> vfio/migration: Don't run load cleanup if load setup didn't run
> vfio/migration: Add x-migration-multifd-transfer VFIO property
> vfio/migration: Add load_device_config_state_start trace event
> vfio/migration: Convert bytes_transferred counter to atomic
> vfio/migration: Multifd device state transfer support - receive side
> migration/qemu-file: Define g_autoptr() cleanup function for QEMUFile
> vfio/migration: Multifd device state transfer support - send side
>
> Peter Xu (1):
> migration/multifd: Make MultiFDSendData a struct
>
> hw/core/machine.c | 2 +
> hw/vfio/migration.c | 588 ++++++++++++++++++++++++++++-
> hw/vfio/pci.c | 11 +
> hw/vfio/trace-events | 11 +-
> include/block/aio.h | 8 +-
> include/block/thread-pool.h | 20 +-
> include/hw/vfio/vfio-common.h | 21 ++
> include/migration/client-options.h | 4 +
> include/migration/misc.h | 16 +
> include/migration/register.h | 67 +++-
> include/qemu/typedefs.h | 5 +
> migration/colo.c | 3 +
> migration/meson.build | 1 +
> migration/migration-hmp-cmds.c | 2 +
> migration/migration.c | 3 +
> migration/migration.h | 2 +
> migration/multifd-device-state.c | 193 ++++++++++
> migration/multifd-nocomp.c | 45 ++-
> migration/multifd.c | 228 +++++++++--
> migration/multifd.h | 73 +++-
> migration/options.c | 9 +
> migration/qemu-file.h | 2 +
> migration/ram.c | 10 +-
> migration/savevm.c | 183 ++++++++-
> migration/savevm.h | 4 +
> migration/trace-events | 1 +
> scripts/analyze-migration.py | 11 +
> tests/unit/test-thread-pool.c | 2 +-
> util/async.c | 6 +-
> util/thread-pool.c | 174 +++++++--
> util/trace-events | 6 +-
> 31 files changed, 1586 insertions(+), 125 deletions(-)
> create mode 100644 migration/multifd-device-state.c
I did a quick run of a VM with a mlx5 VF and a vGPU and didn't see
any issues when migrating. I used 4 channels for multifd. The trace
events looked OK and useful. We will tune these over time. I wish
we had some way to dump the thread and channel usage on each side.
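For illustration, the kind of per-channel bookkeeping that could back such
a dump might look like the sketch below; the counters and their hookup
points are invented here, nothing of this is in the series.

    /* Hypothetical per-channel usage counters for a debug dump. */
    #include <inttypes.h>
    #include <stdatomic.h>
    #include <stdint.h>
    #include <stdio.h>

    #define MAX_CHANNELS 32

    static _Atomic uint64_t chan_bytes[MAX_CHANNELS];
    static _Atomic uint64_t chan_packets[MAX_CHANNELS];

    /* Would be called by each multifd send/recv thread per packet. */
    static void chan_account(unsigned id, uint64_t bytes)
    {
        atomic_fetch_add_explicit(&chan_bytes[id], bytes,
                                  memory_order_relaxed);
        atomic_fetch_add_explicit(&chan_packets[id], 1,
                                  memory_order_relaxed);
    }

    /* Would be dumped when the migration finishes, on both sides. */
    static void chan_dump(unsigned nr_channels)
    {
        for (unsigned i = 0; i < nr_channels; i++) {
            fprintf(stderr, "channel %u: %" PRIu64 " packets, %" PRIu64
                    " bytes\n", i, atomic_load(&chan_packets[i]),
                    atomic_load(&chan_bytes[i]));
        }
    }
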
A build was provided to RHEL QE to get more results under stress and
with larger device states. Don't expect feedback before next year
though!
Having a small cookbook to run the migration from QEMU and from
libvirt would be a plus.
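As a starting point, the bare-QEMU flow would roughly be the sketch below;
the port, host name and device BDF are placeholders:

    # Destination: same command line as the source, plus deferred incoming
    qemu-system-x86_64 ... -device vfio-pci,host=0000:3b:00.1 -incoming defer

    # On both HMP monitors, before starting the migration:
    (qemu) migrate_set_capability multifd on
    (qemu) migrate_set_parameter multifd-channels 4

    # Destination monitor:
    (qemu) migrate_incoming tcp:0:4444

    # Source monitor:
    (qemu) migrate -d tcp:dst-host:4444
    (qemu) info migrate

With libvirt, "virsh migrate --live --parallel --parallel-connections 4"
should map to the same multifd setup. The VFIO side should need no extra
step here, since x-migration-multifd-transfer defaults to Auto in this
version of the series.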
Thanks,
C.
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 00/24] Multifd device state transfer support with VFIO consumer
2024-12-05 21:27 ` Cédric Le Goater
@ 2024-12-05 21:42 ` Peter Xu
2024-12-06 10:24 ` Cédric Le Goater
2024-12-06 18:44 ` Maciej S. Szmigiero
1 sibling, 1 reply; 140+ messages in thread
From: Peter Xu @ 2024-12-05 21:42 UTC (permalink / raw)
To: Cédric Le Goater
Cc: Maciej S. Szmigiero, Fabiano Rosas, Alex Williamson, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On Thu, Dec 05, 2024 at 10:27:09PM +0100, Cédric Le Goater wrote:
[...]
> > Important note:
> > 4 VF benchmarks were done with commit 5504a8126115
> > ("KVM: Dynamic sized kvm memslots array") and its revert-dependencies
> > reverted since this seems to improve performance in this VM config if the
> > multifd transfer is enabled: the downtime performance with this commit
> > present is 1141 ms enabled / 1730 ms disabled.
[1]
> >
> > Smaller VF counts actually do seem to benefit from this commit, so it's
> > likely that in the future adding some kind of a memslot pre-allocation
> > bit stream message might make sense to avoid this downtime regression for
> > 4 VF configs (and likely higher VF count too).
[...]
> I did a quick run of a VM with a mlx5 VF and a vGPU and I didn't see
> any issue when migrating. I used 4 channels for multifd. The trace
> events looked ok and useful. We will tune these with time. I wished
> we had some way to dump the thread and channel usage on each side.
>
> A build was provided to RHEL QE. This to get more results when under
> stress and with larger device states. Don't expect feedback before
> next year though !
>
> Having a small cookbook to run the migration from QEMU and from
> libvirt would be a plus.
Cédric,
Did you also test with commit 5504a8126115 and the relevant patches
reverted, as mentioned above [1]? Or the vanilla master branch?
I wonder whether it shows the same regression in your setup.
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 00/24] Multifd device state transfer support with VFIO consumer
2024-12-05 21:42 ` Peter Xu
@ 2024-12-06 10:24 ` Cédric Le Goater
0 siblings, 0 replies; 140+ messages in thread
From: Cédric Le Goater @ 2024-12-06 10:24 UTC (permalink / raw)
To: Peter Xu
Cc: Maciej S. Szmigiero, Fabiano Rosas, Alex Williamson, Eric Blake,
Markus Armbruster, Daniel P. Berrangé, Avihai Horon,
Joao Martins, qemu-devel
Hello !
[ ... ]
> Did you test also with commit 5504a8126115 and relevant patches reverted,
> per mentioned above [1]? Or vanilla master branch?
I am on master and I didn't revert the "Dynamic sized kvm memslots array"
series, which we know has benefits for other VM configs and workloads.
For testing purposes, we could maybe add a toggle defining a constant
number of memslots, with 0 meaning grow-on-demand?
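Something as small as the sketch below would express those semantics (the
helper and the initial size are invented for illustration):

    /* Hypothetical toggle: fixed KVM memslot array size, where 0 keeps
     * the current grow-on-demand behaviour. */
    static unsigned int kvm_memslots_initial(unsigned int fixed_nr_memslots)
    {
        if (fixed_nr_memslots == 0) {
            return 16;  /* assumed small initial size, grown on demand */
        }
        /* non-zero: pre-allocate a constant number of slots up front */
        return fixed_nr_memslots;
    }
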
> I wonder whether it shows the same regression in your setup.
I haven't looked at performance yet, and downtime with mlx5 VFs was not
a big issue either. So I am expecting QE to share test results with
vGPUs under load.
Thanks,
C.
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [PATCH v3 00/24] Multifd device state transfer support with VFIO consumer
2024-12-05 21:27 ` Cédric Le Goater
2024-12-05 21:42 ` Peter Xu
@ 2024-12-06 18:44 ` Maciej S. Szmigiero
1 sibling, 0 replies; 140+ messages in thread
From: Maciej S. Szmigiero @ 2024-12-06 18:44 UTC (permalink / raw)
To: Cédric Le Goater
Cc: Alex Williamson, Peter Xu, Fabiano Rosas, Eric Blake,
Markus Armbruster, Daniel P . Berrangé, Avihai Horon,
Joao Martins, qemu-devel
On 5.12.2024 22:27, Cédric Le Goater wrote:
> On 11/17/24 20:19, Maciej S. Szmigiero wrote:
[...]
>
>
> I did a quick run of a VM with a mlx5 VF and a vGPU and I didn't see
> any issue when migrating. I used 4 channels for multifd. The trace
> events looked ok and useful. We will tune these with time. I wished
> we had some way to dump the thread and channel usage on each side.
>
> A build was provided to RHEL QE. This to get more results when under
> stress and with larger device states. Don't expect feedback before
> next year though !
Thanks Cédric, more testing of a complex code change is always
appreciated, especially since your test environment is probably
significantly different from mine.
> Having a small cookbook to run the migration from QEMU and from
> libvirt would be a plus.
>
> Thanks,
>
> C.
Thanks,
Maciej
^ permalink raw reply [flat|nested] 140+ messages in thread