[PATCH 0/2] VFIO device migration parallel state transitions

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH 0/2] VFIO device migration parallel state transitions
@ 2026-05-20 14:41 Maciej S. Szmigiero
  2026-05-20 14:41 ` [PATCH 1/2] system/runstate: Allow adjustment of priority for VM state change handlers Maciej S. Szmigiero
  2026-05-20 14:41 ` [PATCH 2/2] vfio/migration: Parallelize device state transitions Maciej S. Szmigiero
  0 siblings, 2 replies; 9+ messages in thread
From: Maciej S. Szmigiero @ 2026-05-20 14:41 UTC (permalink / raw)
  To: Alex Williamson, Cédric Le Goater
  Cc: Peter Xu, Fabiano Rosas, Paolo Bonzini, Avihai Horon, qemu-devel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

When multiple VFIO devices are present in a VM the fact that their state
transitions on migration happen sequentially has a visible impact on
migration downtime.

This is because both PRE_COPY -> PRE_COPY_P2P -> STOP_COPY transitions on
the source and RESUMING -> RUNNING_P2P -> RUNNING transitions on the target
happen during the switchover phase.
During this phase the VM is stopped so the downtime is ticking.

These device state transitions are performed by VM state change handlers
registered by the VFIO device migration code.

Instead of performing such state transition synchronously launch a thread
performing the state change in parallel with other VFIO devices and other
VM state change handlers at the particular VFIO device qdev tree depth.
Only wait for this thread to finish *after* all other handlers at this
tree depth finish doing their jobs.

To implement the above allow adjustment of priority for VM state change
handlers - specifically allow registering qdev VM state change handlers
below and above the normal priority level for the registering device qdev
tree depth, but still properly ordered with respect to handlers
registered at other tree depths.

This way these state transitions can happen in parallel not only with
respect to other VFIO device instances but also ordinary (serialized)
handlers for other devices at this qdev tree depth.

Downtime results:
               4 VFs    2 VFs    1 VF
Disabled:    1385 ms   758 ms  497 ms
Enabled:      986 ms   653 ms  493 ms 
IMPROVEMENT:   ~29 %    ~14 %    ~0 %

Test VM shape:
vCPU 12 cores x 2 threads, 15 GiB RAM.

VFIO devices in the source and target machine:
Mellanox ConnectX-7 with 100GbE link and ~100 MiB of device state per VF.

Maciej S. Szmigiero (2):
  system/runstate: Allow adjustment of priority for VM state change
    handlers
  vfio/migration: Parallelize device state transitions

 hw/core/vm-change-state-handler.c |  22 ++--
 hw/vfio/migration.c               | 174 ++++++++++++++++++++++++++++--
 hw/vfio/pci.c                     |   2 +
 hw/vfio/vfio-migration-internal.h |   4 +-
 include/hw/vfio/vfio-device.h     |   1 +
 include/system/runstate.h         |   2 +-
 6 files changed, 189 insertions(+), 16 deletions(-)

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH 1/2] system/runstate: Allow adjustment of priority for VM state change handlers
  2026-05-20 14:41 [PATCH 0/2] VFIO device migration parallel state transitions Maciej S. Szmigiero
@ 2026-05-20 14:41 ` Maciej S. Szmigiero
  2026-06-01 13:26   ` Fabiano Rosas
  2026-05-20 14:41 ` [PATCH 2/2] vfio/migration: Parallelize device state transitions Maciej S. Szmigiero
  1 sibling, 1 reply; 9+ messages in thread
From: Maciej S. Szmigiero @ 2026-05-20 14:41 UTC (permalink / raw)
  To: Alex Williamson, Cédric Le Goater
  Cc: Peter Xu, Fabiano Rosas, Paolo Bonzini, Avihai Horon, qemu-devel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

A future patch will need an ability to register qdev VM state change
handlers below and above the normal priority level for the registering
device qdev tree depth, but still properly ordered with respect to handlers
registered at other tree depths.

To implement this split the priority argument passed to
qemu_add_vm_change_state_handler_prio_full() into two parts:
its 15 most significant bits will now carry the actual qdev tree depth
while the 16 least significant bits will now carry the caller provided
priority adjustment value.

Although this will limit the qdev tree to a depth of 32k such high limit
shouldn't be a problem in practice.

Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
 hw/core/vm-change-state-handler.c | 22 +++++++++++++++-------
 hw/vfio/migration.c               |  2 +-
 include/system/runstate.h         |  2 +-
 3 files changed, 17 insertions(+), 9 deletions(-)

diff --git a/hw/core/vm-change-state-handler.c b/hw/core/vm-change-state-handler.c
index 2c111350298d..3db0819984c6 100644
--- a/hw/core/vm-change-state-handler.c
+++ b/hw/core/vm-change-state-handler.c
@@ -19,9 +19,9 @@
 #include "hw/core/qdev.h"
 #include "system/runstate.h"
 
-static int qdev_get_dev_tree_depth(DeviceState *dev)
+static unsigned int qdev_get_dev_tree_depth(DeviceState *dev)
 {
-    int depth;
+    unsigned int depth;
 
     for (depth = 0; dev; depth++) {
         BusState *bus = dev->parent_bus;
@@ -61,20 +61,28 @@ VMChangeStateEntry *qdev_add_vm_change_state_handler(DeviceState *dev,
                                                      void *opaque)
 {
     assert(!cb || !cb_ret);
-    return qdev_add_vm_change_state_handler_full(dev, cb, NULL, cb_ret, opaque);
+    return qdev_add_vm_change_state_handler_full(dev, cb, NULL, cb_ret, opaque, 0);
 }
 
 /*
  * Exactly like qdev_add_vm_change_state_handler() but passes a prepare_cb
- * and the cb_ret arguments too.
+ * and the cb_ret arguments too and allows for adjustment of priority.
  */
 VMChangeStateEntry *qdev_add_vm_change_state_handler_full(
     DeviceState *dev, VMChangeStateHandler *cb, VMChangeStateHandler *prepare_cb,
-    VMChangeStateHandlerWithRet *cb_ret, void *opaque)
+    VMChangeStateHandlerWithRet *cb_ret, void *opaque, int adj)
 {
-    int depth = qdev_get_dev_tree_depth(dev);
+    unsigned int depth = qdev_get_dev_tree_depth(dev);
+    int prio;
+
+    /* 32k depth should be enough for everyone */
+    assert(depth <= INT16_MAX);
+
+    /* encode depth on 15 MSB and adj on 16 LSB */
+    assert(adj >= INT16_MIN && adj <= INT16_MAX);
+    prio = (depth << 16) + (adj - INT16_MIN);
 
     assert(!cb || !cb_ret);
     return qemu_add_vm_change_state_handler_prio_full(cb, prepare_cb, cb_ret,
-                                                      opaque, depth);
+                                                      opaque, prio);
 }
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index dbfd13b83a15..9889b20ad7dd 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -1227,7 +1227,7 @@ static int vfio_migration_init(VFIODevice *vbasedev, Error **errp)
                      vfio_vmstate_change_prepare :
                      NULL;
     migration->vm_state = qdev_add_vm_change_state_handler_full(
-        vbasedev->dev, vfio_vmstate_change, prepare_cb, NULL, vbasedev);
+        vbasedev->dev, vfio_vmstate_change, prepare_cb, NULL, vbasedev, 0);
     migration_add_notifier(&migration->migration_state,
                            vfio_migration_state_notifier);
 
diff --git a/include/system/runstate.h b/include/system/runstate.h
index 929379adae41..306e2684c195 100644
--- a/include/system/runstate.h
+++ b/include/system/runstate.h
@@ -69,7 +69,7 @@ VMChangeStateEntry *qdev_add_vm_change_state_handler(DeviceState *dev,
                                                      void *opaque);
 VMChangeStateEntry *qdev_add_vm_change_state_handler_full(
     DeviceState *dev, VMChangeStateHandler *cb, VMChangeStateHandler *prepare_cb,
-    VMChangeStateHandlerWithRet *cb_ret, void *opaque);
+    VMChangeStateHandlerWithRet *cb_ret, void *opaque, int adj);
 void qemu_del_vm_change_state_handler(VMChangeStateEntry *e);
 /**
  * vm_state_notify: Notify the state of the VM


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH 2/2] vfio/migration: Parallelize device state transitions
  2026-05-20 14:41 [PATCH 0/2] VFIO device migration parallel state transitions Maciej S. Szmigiero
  2026-05-20 14:41 ` [PATCH 1/2] system/runstate: Allow adjustment of priority for VM state change handlers Maciej S. Szmigiero
@ 2026-05-20 14:41 ` Maciej S. Szmigiero
  2026-06-01 13:39   ` Fabiano Rosas
  1 sibling, 1 reply; 9+ messages in thread
From: Maciej S. Szmigiero @ 2026-05-20 14:41 UTC (permalink / raw)
  To: Alex Williamson, Cédric Le Goater
  Cc: Peter Xu, Fabiano Rosas, Paolo Bonzini, Avihai Horon, qemu-devel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

When multiple VFIO devices are present in a VM the fact that their state
transitions on migration happen sequentially has a visible impact on
migration downtime.

This is because both PRE_COPY -> PRE_COPY_P2P -> STOP_COPY transitions on
the source and RESUMING -> RUNNING_P2P -> RUNNING transitions on the target
happen during the switchover phase.
During this phase the VM is stopped so the downtime is ticking.

These device state transitions are performed by VM state change handlers
registered by the VFIO device migration code.

Instead of performing such state transition synchronously use the priority
adjustment mechanism from the previous patch to launch a thread performing
the state change at the priority level *before* all other VM state change
handlers at the particular VFIO device qdev tree depth.
Only wait for this thread to finish at the priority level ordered *after*
all other handlers at this tree depth.

This way these state transitions can happen in parallel not only with
respect to other VFIO device instances but also ordinary (serialized)
handlers for other devices at this qdev tree depth while still being
properly ordered with respect to handlers registered at other tree depths.

Unfortunately, the order in which VM state handlers are called depends
on whether the VM is starting or stopping.
Because of this, one extra layer of indirection is necessary to make the
(first, second) ordering of these handlers constant.

Enable the feature by default since it has no impact on the migration bit
stream protocol - it shouldn't need disabling for anything else but
debugging scenarios.

Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
 hw/vfio/migration.c               | 174 ++++++++++++++++++++++++++++--
 hw/vfio/pci.c                     |   2 +
 hw/vfio/vfio-migration-internal.h |   4 +-
 include/hw/vfio/vfio-device.h     |   1 +
 4 files changed, 173 insertions(+), 8 deletions(-)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 9889b20ad7dd..fd2b37a85f4b 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -1047,6 +1047,135 @@ static void vfio_vmstate_change(void *opaque, bool running, RunState state)
                               mig_state_to_str(new_state));
 }
 
+typedef struct VMStateChangeThreadData {
+    VFIODevice *vbasedev;
+    bool is_prepare;
+    bool running;
+    RunState state;
+} VMStateChangeThreadData;
+
+static void *vfio_vmstate_change_thread(void *opaque)
+{
+    g_autofree VMStateChangeThreadData *data = opaque;
+
+    if (data->is_prepare) {
+        vfio_vmstate_change_prepare(data->vbasedev, data->running, data->state);
+    } else {
+        vfio_vmstate_change(data->vbasedev, data->running, data->state);
+    }
+
+    return NULL;
+}
+
+static void vfio_vmstate_change_thread_launch(VFIODevice *vbasedev,
+                                              bool is_prepare,
+                                              bool running,
+                                              RunState state)
+{
+    VFIOMigration *migration = vbasedev->migration;
+    VMStateChangeThreadData *data = g_new(VMStateChangeThreadData, 1);
+
+    data->vbasedev = vbasedev;
+    data->is_prepare = is_prepare;
+    data->running = running;
+    data->state = state;
+
+    assert(!migration->vm_state_thread_running);
+    migration->vm_state_thread_running = true;
+
+    qemu_thread_create(&migration->vm_state_thread,
+                       is_prepare ? "vfio_vmstate_change_prepare" :
+                       "vfio_vmstate_change",
+                       vfio_vmstate_change_thread, data,
+                       QEMU_THREAD_JOINABLE);
+}
+
+static void vfio_vmstate_change_thread_join(VFIODevice *vbasedev)
+{
+    VFIOMigration *migration = vbasedev->migration;
+
+    assert(migration->vm_state_thread_running);
+
+    qemu_thread_join(&migration->vm_state_thread);
+
+    migration->vm_state_thread_running = false;
+}
+
+/*
+ * The first handler called during a vmstate change at a particular depth -
+ * launch the VFIO device state change thread.
+ */
+static void vfio_vmstate_change_first(VFIODevice *vbasedev,
+                                      bool is_prepare,
+                                      bool running, RunState state)
+{
+    vfio_vmstate_change_thread_launch(vbasedev,
+                                      is_prepare,
+                                      running,
+                                      state);
+}
+
+/*
+ * The last handler called during a vmstate change at a particular depth -
+ * wait for the VFIO device state change thread to finish.
+ */
+static void vfio_vmstate_change_second(VFIODevice *vbasedev)
+{
+    vfio_vmstate_change_thread_join(vbasedev);
+}
+
+/*
+ * Lower priority number handler:
+ * Called before higher number handler when VM is starting
+ * but after higher number handler when VM is stopping.
+ */
+static void vfio_vmstate_change_prepare_lower_prio(void *opaque, bool running,
+                                                   RunState state)
+{
+    if (running) {
+        vfio_vmstate_change_first(opaque, true, running, state);
+    } else {
+        vfio_vmstate_change_second(opaque);
+    }
+}
+
+/*
+ * Higher priority number handler:
+ * Called after lower number handler when VM is starting
+ * but before lower number handler when VM is stopping.
+ */
+static void vfio_vmstate_change_prepare_higher_prio(void *opaque, bool running,
+                                                    RunState state)
+{
+    if (running) {
+        vfio_vmstate_change_second(opaque);
+    } else {
+        vfio_vmstate_change_first(opaque, true, running, state);
+    }
+}
+
+/* Same ordering issues as for vfio_vmstate_change_prepare_lower_prio() */
+static void vfio_vmstate_change_lower_prio(void *opaque, bool running,
+                                           RunState state)
+{
+    if (running) {
+        vfio_vmstate_change_first(opaque, false, running, state);
+    } else {
+        vfio_vmstate_change_second(opaque);
+    }
+}
+
+/* Same ordering issues as for vfio_vmstate_change_prepare_higher_prio() */
+static void vfio_vmstate_change_higher_prio(void *opaque, bool running,
+                                            RunState state)
+{
+    if (running) {
+        vfio_vmstate_change_second(opaque);
+    } else {
+        vfio_vmstate_change_first(opaque, false, running, state);
+    }
+}
+
 static int vfio_migration_state_notifier(NotifierWithReturn *notifier,
                                          MigrationEvent *e, Error **errp)
 {
@@ -1063,6 +1192,8 @@ static int vfio_migration_state_notifier(NotifierWithReturn *notifier,
          * MigrationNotifyFunc may not return an error code and an Error
          * object for MIG_EVENT_FAILED. Hence, report the error
          * locally and ignore the errp argument.
+         * This state change is not parallelized as it is not expected to be
+         * performance critical.
          */
         ret = vfio_migration_set_state_or_reset(vbasedev,
                                                 VFIO_DEVICE_STATE_RUNNING,
@@ -1143,7 +1274,7 @@ static int vfio_migration_init(VFIODevice *vbasedev, Error **errp)
     char id[256] = "";
     g_autofree char *path = NULL, *oid = NULL;
     uint64_t mig_flags = 0;
-    VMChangeStateHandler *prepare_cb;
+    VMChangeStateHandler *prepare_cb, *prepare_cb_lower, *prepare_cb_higher;
 
     if (!vbasedev->ops->vfio_get_object) {
         ret = -EINVAL;
@@ -1223,11 +1354,34 @@ static int vfio_migration_init(VFIODevice *vbasedev, Error **errp)
     register_savevm_live(id, VMSTATE_INSTANCE_ID_ANY, 1, &savevm_vfio_handlers,
                          vbasedev);
 
-    prepare_cb = migration->mig_flags & VFIO_MIGRATION_P2P ?
-                     vfio_vmstate_change_prepare :
-                     NULL;
-    migration->vm_state = qdev_add_vm_change_state_handler_full(
-        vbasedev->dev, vfio_vmstate_change, prepare_cb, NULL, vbasedev, 0);
+    if (vbasedev->migration_parallel_states) {
+        /*
+         * Unfortunately, the order in which vmstate handlers are called depends
+         * on whether the VM is starting or stopping.
+         * Because of this, one extra layer of indirection is necessary
+         * to make the (first, second) ordering of these handlers constant.
+         */
+        prepare_cb_lower = migration->mig_flags & VFIO_MIGRATION_P2P ?
+            vfio_vmstate_change_prepare_lower_prio : NULL;
+        prepare_cb_higher = migration->mig_flags & VFIO_MIGRATION_P2P ?
+            vfio_vmstate_change_prepare_higher_prio : NULL;
+        migration->vm_state_lower_prio = qdev_add_vm_change_state_handler_full(
+            vbasedev->dev, vfio_vmstate_change_lower_prio, prepare_cb_lower,
+            NULL, vbasedev, -1);
+        migration->vm_state_higher_prio = qdev_add_vm_change_state_handler_full(
+            vbasedev->dev, vfio_vmstate_change_higher_prio, prepare_cb_higher,
+            NULL, vbasedev, 1);
+    } else {
+        prepare_cb = migration->mig_flags & VFIO_MIGRATION_P2P ?
+            vfio_vmstate_change_prepare : NULL;
+        /* Arbitrarily use lower_prio field to store non-parallel handler */
+        migration->vm_state_lower_prio =
+            qdev_add_vm_change_state_handler_full(vbasedev->dev,
+                                                  vfio_vmstate_change,
+                                                  prepare_cb, NULL,
+                                                  vbasedev, 0);
+    }
+
     migration_add_notifier(&migration->migration_state,
                            vfio_migration_state_notifier);
 
@@ -1302,7 +1456,13 @@ static void vfio_migration_deinit(VFIODevice *vbasedev)
     VFIOMigration *migration = vbasedev->migration;
 
     migration_remove_notifier(&migration->migration_state);
-    qemu_del_vm_change_state_handler(migration->vm_state);
+
+    if (vbasedev->migration_parallel_states) {
+        qemu_del_vm_change_state_handler(migration->vm_state_higher_prio);
+    }
+    /* Non-parallel state change uses lower_prio field to store its handler */
+    qemu_del_vm_change_state_handler(migration->vm_state_lower_prio);
+
     unregister_savevm(VMSTATE_IF(vbasedev->dev), "vfio", vbasedev);
     vfio_migration_free(vbasedev);
     vfio_unblock_multiple_devices_migration();
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index b2a07f6bb421..fa2411474c9b 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -3777,6 +3777,8 @@ static const Property vfio_pci_properties[] = {
                             ON_OFF_AUTO_AUTO),
     DEFINE_PROP_SIZE("x-migration-max-queued-buffers-size", VFIOPCIDevice,
                      vbasedev.migration_max_queued_buffers_size, UINT64_MAX),
+    DEFINE_PROP_BOOL("x-migration-parallel-states", VFIOPCIDevice,
+                     vbasedev.migration_parallel_states, true),
     DEFINE_PROP_BOOL("migration-events", VFIOPCIDevice,
                      vbasedev.migration_events, false),
     DEFINE_PROP_BOOL("x-no-mmap", VFIOPCIDevice, vbasedev.no_mmap, false),
diff --git a/hw/vfio/vfio-migration-internal.h b/hw/vfio/vfio-migration-internal.h
index a1c58b112685..9fb00f9f4d7d 100644
--- a/hw/vfio/vfio-migration-internal.h
+++ b/hw/vfio/vfio-migration-internal.h
@@ -38,7 +38,9 @@ typedef struct VFIOMultifd VFIOMultifd;
 
 typedef struct VFIOMigration {
     struct VFIODevice *vbasedev;
-    VMChangeStateEntry *vm_state;
+    VMChangeStateEntry *vm_state_lower_prio, *vm_state_higher_prio;
+    QemuThread vm_state_thread;
+    bool vm_state_thread_running;
     NotifierWithReturn migration_state;
     uint32_t device_state;
     int data_fd;
diff --git a/include/hw/vfio/vfio-device.h b/include/hw/vfio/vfio-device.h
index 380a55d6e5ea..28004e1e99f4 100644
--- a/include/hw/vfio/vfio-device.h
+++ b/include/hw/vfio/vfio-device.h
@@ -69,6 +69,7 @@ typedef struct VFIODevice {
     OnOffAuto migration_multifd_transfer;
     OnOffAuto migration_load_config_after_iter;
     uint64_t migration_max_queued_buffers_size;
+    bool migration_parallel_states;
     bool migration_events;
     bool use_region_fds;
     VFIODeviceOps *ops;


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH 1/2] system/runstate: Allow adjustment of priority for VM state change handlers
  2026-05-20 14:41 ` [PATCH 1/2] system/runstate: Allow adjustment of priority for VM state change handlers Maciej S. Szmigiero
@ 2026-06-01 13:26   ` Fabiano Rosas
  0 siblings, 0 replies; 9+ messages in thread
From: Fabiano Rosas @ 2026-06-01 13:26 UTC (permalink / raw)
  To: Maciej S. Szmigiero, Alex Williamson, Cédric Le Goater
  Cc: Peter Xu, Paolo Bonzini, Avihai Horon, qemu-devel

"Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:

> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>
> A future patch will need an ability to register qdev VM state change
> handlers below and above the normal priority level for the registering
> device qdev tree depth, but still properly ordered with respect to handlers
> registered at other tree depths.
>
> To implement this split the priority argument passed to
> qemu_add_vm_change_state_handler_prio_full() into two parts:
> its 15 most significant bits will now carry the actual qdev tree depth
> while the 16 least significant bits will now carry the caller provided
> priority adjustment value.
>
> Although this will limit the qdev tree to a depth of 32k such high limit
> shouldn't be a problem in practice.
>
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>

Reviewed-by: Fabiano Rosas <farosas@suse.de>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 2/2] vfio/migration: Parallelize device state transitions
  2026-05-20 14:41 ` [PATCH 2/2] vfio/migration: Parallelize device state transitions Maciej S. Szmigiero
@ 2026-06-01 13:39   ` Fabiano Rosas
  2026-06-01 19:33     ` Peter Xu
  0 siblings, 1 reply; 9+ messages in thread
From: Fabiano Rosas @ 2026-06-01 13:39 UTC (permalink / raw)
  To: Maciej S. Szmigiero, Alex Williamson, Cédric Le Goater
  Cc: Peter Xu, Paolo Bonzini, Avihai Horon, qemu-devel

"Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:

> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>
> When multiple VFIO devices are present in a VM the fact that their state
> transitions on migration happen sequentially has a visible impact on
> migration downtime.
>
> This is because both PRE_COPY -> PRE_COPY_P2P -> STOP_COPY transitions on
> the source and RESUMING -> RUNNING_P2P -> RUNNING transitions on the target
> happen during the switchover phase.
> During this phase the VM is stopped so the downtime is ticking.
>
> These device state transitions are performed by VM state change handlers
> registered by the VFIO device migration code.
>
> Instead of performing such state transition synchronously use the priority
> adjustment mechanism from the previous patch to launch a thread performing
> the state change at the priority level *before* all other VM state change
> handlers at the particular VFIO device qdev tree depth.
> Only wait for this thread to finish at the priority level ordered *after*
> all other handlers at this tree depth.
>
> This way these state transitions can happen in parallel not only with
> respect to other VFIO device instances but also ordinary (serialized)
> handlers for other devices at this qdev tree depth while still being
> properly ordered with respect to handlers registered at other tree depths.
>
> Unfortunately, the order in which VM state handlers are called depends
> on whether the VM is starting or stopping.
> Because of this, one extra layer of indirection is necessary to make the
> (first, second) ordering of these handlers constant.
>
> Enable the feature by default since it has no impact on the migration bit
> stream protocol - it shouldn't need disabling for anything else but
> debugging scenarios.
>
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> ---
>  hw/vfio/migration.c               | 174 ++++++++++++++++++++++++++++--
>  hw/vfio/pci.c                     |   2 +
>  hw/vfio/vfio-migration-internal.h |   4 +-
>  include/hw/vfio/vfio-device.h     |   1 +
>  4 files changed, 173 insertions(+), 8 deletions(-)
>
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 9889b20ad7dd..fd2b37a85f4b 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -1047,6 +1047,135 @@ static void vfio_vmstate_change(void *opaque, bool running, RunState state)
>                                mig_state_to_str(new_state));
>  }
>  
> +typedef struct VMStateChangeThreadData {
> +    VFIODevice *vbasedev;
> +    bool is_prepare;
> +    bool running;
> +    RunState state;
> +} VMStateChangeThreadData;
> +
> +static void *vfio_vmstate_change_thread(void *opaque)
> +{
> +    g_autofree VMStateChangeThreadData *data = opaque;
> +
> +    if (data->is_prepare) {
> +        vfio_vmstate_change_prepare(data->vbasedev, data->running, data->state);
> +    } else {
> +        vfio_vmstate_change(data->vbasedev, data->running, data->state);
> +    }
> +
> +    return NULL;
> +}
> +
> +static void vfio_vmstate_change_thread_launch(VFIODevice *vbasedev,
> +                                              bool is_prepare,
> +                                              bool running,
> +                                              RunState state)
> +{
> +    VFIOMigration *migration = vbasedev->migration;
> +    VMStateChangeThreadData *data = g_new(VMStateChangeThreadData, 1);
> +
> +    data->vbasedev = vbasedev;
> +    data->is_prepare = is_prepare;
> +    data->running = running;
> +    data->state = state;
> +
> +    assert(!migration->vm_state_thread_running);
> +    migration->vm_state_thread_running = true;
> +
> +    qemu_thread_create(&migration->vm_state_thread,
> +                       is_prepare ? "vfio_vmstate_change_prepare" :
> +                       "vfio_vmstate_change",
> +                       vfio_vmstate_change_thread, data,
> +                       QEMU_THREAD_JOINABLE);
> +}
> +
> +static void vfio_vmstate_change_thread_join(VFIODevice *vbasedev)
> +{
> +    VFIOMigration *migration = vbasedev->migration;
> +
> +    assert(migration->vm_state_thread_running);
> +
> +    qemu_thread_join(&migration->vm_state_thread);
> +
> +    migration->vm_state_thread_running = false;
> +}
> +
> +/*
> + * The first handler called during a vmstate change at a particular depth -
> + * launch the VFIO device state change thread.
> + */
> +static void vfio_vmstate_change_first(VFIODevice *vbasedev,
> +                                      bool is_prepare,
> +                                      bool running, RunState state)
> +{
> +    vfio_vmstate_change_thread_launch(vbasedev,
> +                                      is_prepare,
> +                                      running,
> +                                      state);
> +}
> +
> +/*
> + * The last handler called during a vmstate change at a particular depth -
> + * wait for the VFIO device state change thread to finish.
> + */
> +static void vfio_vmstate_change_second(VFIODevice *vbasedev)
> +{
> +    vfio_vmstate_change_thread_join(vbasedev);
> +}
> +
> +/*
> + * Lower priority number handler:
> + * Called before higher number handler when VM is starting
> + * but after higher number handler when VM is stopping.
> + */
> +static void vfio_vmstate_change_prepare_lower_prio(void *opaque, bool running,
> +                                                   RunState state)
> +{
> +    if (running) {
> +        vfio_vmstate_change_first(opaque, true, running, state);
> +    } else {
> +        vfio_vmstate_change_second(opaque);
> +    }
> +}
> +
> +/*
> + * Higher priority number handler:
> + * Called after lower number handler when VM is starting
> + * but before lower number handler when VM is stopping.
> + */
> +static void vfio_vmstate_change_prepare_higher_prio(void *opaque, bool running,
> +                                                    RunState state)
> +{
> +    if (running) {
> +        vfio_vmstate_change_second(opaque);
> +    } else {
> +        vfio_vmstate_change_first(opaque, true, running, state);
> +    }
> +}
> +
> +/* Same ordering issues as for vfio_vmstate_change_prepare_lower_prio() */
> +static void vfio_vmstate_change_lower_prio(void *opaque, bool running,
> +                                           RunState state)
> +{
> +    if (running) {
> +        vfio_vmstate_change_first(opaque, false, running, state);
> +    } else {
> +        vfio_vmstate_change_second(opaque);
> +    }
> +}
> +
> +/* Same ordering issues as for vfio_vmstate_change_prepare_higher_prio() */
> +static void vfio_vmstate_change_higher_prio(void *opaque, bool running,
> +                                            RunState state)
> +{
> +    if (running) {
> +        vfio_vmstate_change_second(opaque);
> +    } else {
> +        vfio_vmstate_change_first(opaque, false, running, state);
> +    }
> +}
> +
>  static int vfio_migration_state_notifier(NotifierWithReturn *notifier,
>                                           MigrationEvent *e, Error **errp)
>  {
> @@ -1063,6 +1192,8 @@ static int vfio_migration_state_notifier(NotifierWithReturn *notifier,
>           * MigrationNotifyFunc may not return an error code and an Error
>           * object for MIG_EVENT_FAILED. Hence, report the error
>           * locally and ignore the errp argument.
> +         * This state change is not parallelized as it is not expected to be
> +         * performance critical.
>           */
>          ret = vfio_migration_set_state_or_reset(vbasedev,
>                                                  VFIO_DEVICE_STATE_RUNNING,
> @@ -1143,7 +1274,7 @@ static int vfio_migration_init(VFIODevice *vbasedev, Error **errp)
>      char id[256] = "";
>      g_autofree char *path = NULL, *oid = NULL;
>      uint64_t mig_flags = 0;
> -    VMChangeStateHandler *prepare_cb;
> +    VMChangeStateHandler *prepare_cb, *prepare_cb_lower, *prepare_cb_higher;
>  
>      if (!vbasedev->ops->vfio_get_object) {
>          ret = -EINVAL;
> @@ -1223,11 +1354,34 @@ static int vfio_migration_init(VFIODevice *vbasedev, Error **errp)
>      register_savevm_live(id, VMSTATE_INSTANCE_ID_ANY, 1, &savevm_vfio_handlers,
>                           vbasedev);
>  
> -    prepare_cb = migration->mig_flags & VFIO_MIGRATION_P2P ?
> -                     vfio_vmstate_change_prepare :
> -                     NULL;
> -    migration->vm_state = qdev_add_vm_change_state_handler_full(
> -        vbasedev->dev, vfio_vmstate_change, prepare_cb, NULL, vbasedev, 0);
> +    if (vbasedev->migration_parallel_states) {
> +        /*
> +         * Unfortunately, the order in which vmstate handlers are called depends
> +         * on whether the VM is starting or stopping.
> +         * Because of this, one extra layer of indirection is necessary
> +         * to make the (first, second) ordering of these handlers constant.
> +         */
> +        prepare_cb_lower = migration->mig_flags & VFIO_MIGRATION_P2P ?
> +            vfio_vmstate_change_prepare_lower_prio : NULL;
> +        prepare_cb_higher = migration->mig_flags & VFIO_MIGRATION_P2P ?
> +            vfio_vmstate_change_prepare_higher_prio : NULL;
> +        migration->vm_state_lower_prio = qdev_add_vm_change_state_handler_full(
> +            vbasedev->dev, vfio_vmstate_change_lower_prio, prepare_cb_lower,
> +            NULL, vbasedev, -1);
> +        migration->vm_state_higher_prio = qdev_add_vm_change_state_handler_full(
> +            vbasedev->dev, vfio_vmstate_change_higher_prio, prepare_cb_higher,
> +            NULL, vbasedev, 1);
> +    } else {
> +        prepare_cb = migration->mig_flags & VFIO_MIGRATION_P2P ?
> +            vfio_vmstate_change_prepare : NULL;
> +        /* Arbitrarily use lower_prio field to store non-parallel handler */
> +        migration->vm_state_lower_prio =
> +            qdev_add_vm_change_state_handler_full(vbasedev->dev,
> +                                                  vfio_vmstate_change,
> +                                                  prepare_cb, NULL,
> +                                                  vbasedev, 0);
> +    }
> +
>      migration_add_notifier(&migration->migration_state,
>                             vfio_migration_state_notifier);
>  
> @@ -1302,7 +1456,13 @@ static void vfio_migration_deinit(VFIODevice *vbasedev)
>      VFIOMigration *migration = vbasedev->migration;
>  
>      migration_remove_notifier(&migration->migration_state);
> -    qemu_del_vm_change_state_handler(migration->vm_state);
> +
> +    if (vbasedev->migration_parallel_states) {
> +        qemu_del_vm_change_state_handler(migration->vm_state_higher_prio);
> +    }
> +    /* Non-parallel state change uses lower_prio field to store its handler */
> +    qemu_del_vm_change_state_handler(migration->vm_state_lower_prio);
> +
>      unregister_savevm(VMSTATE_IF(vbasedev->dev), "vfio", vbasedev);
>      vfio_migration_free(vbasedev);
>      vfio_unblock_multiple_devices_migration();
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index b2a07f6bb421..fa2411474c9b 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -3777,6 +3777,8 @@ static const Property vfio_pci_properties[] = {
>                              ON_OFF_AUTO_AUTO),
>      DEFINE_PROP_SIZE("x-migration-max-queued-buffers-size", VFIOPCIDevice,
>                       vbasedev.migration_max_queued_buffers_size, UINT64_MAX),
> +    DEFINE_PROP_BOOL("x-migration-parallel-states", VFIOPCIDevice,
> +                     vbasedev.migration_parallel_states, true),
>      DEFINE_PROP_BOOL("migration-events", VFIOPCIDevice,
>                       vbasedev.migration_events, false),
>      DEFINE_PROP_BOOL("x-no-mmap", VFIOPCIDevice, vbasedev.no_mmap, false),
> diff --git a/hw/vfio/vfio-migration-internal.h b/hw/vfio/vfio-migration-internal.h
> index a1c58b112685..9fb00f9f4d7d 100644
> --- a/hw/vfio/vfio-migration-internal.h
> +++ b/hw/vfio/vfio-migration-internal.h
> @@ -38,7 +38,9 @@ typedef struct VFIOMultifd VFIOMultifd;
>  
>  typedef struct VFIOMigration {
>      struct VFIODevice *vbasedev;
> -    VMChangeStateEntry *vm_state;
> +    VMChangeStateEntry *vm_state_lower_prio, *vm_state_higher_prio;
> +    QemuThread vm_state_thread;
> +    bool vm_state_thread_running;
>      NotifierWithReturn migration_state;
>      uint32_t device_state;
>      int data_fd;
> diff --git a/include/hw/vfio/vfio-device.h b/include/hw/vfio/vfio-device.h
> index 380a55d6e5ea..28004e1e99f4 100644
> --- a/include/hw/vfio/vfio-device.h
> +++ b/include/hw/vfio/vfio-device.h
> @@ -69,6 +69,7 @@ typedef struct VFIODevice {
>      OnOffAuto migration_multifd_transfer;
>      OnOffAuto migration_load_config_after_iter;
>      uint64_t migration_max_queued_buffers_size;
> +    bool migration_parallel_states;
>      bool migration_events;
>      bool use_region_fds;
>      VFIODeviceOps *ops;

A bit idiosyncratic, but I don't have any better suggestions.

Also, pre-existing but maybe you'll want to fix it in this patch,
'vmstate_change' is the wrong term. This is vm_change_state or
vm_state_change.

Acked-by: Fabiano Rosas <farosas@suse.de>



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 2/2] vfio/migration: Parallelize device state transitions
  2026-06-01 13:39   ` Fabiano Rosas
@ 2026-06-01 19:33     ` Peter Xu
  2026-06-02 10:01       ` Maciej S. Szmigiero
  0 siblings, 1 reply; 9+ messages in thread
From: Peter Xu @ 2026-06-01 19:33 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: Maciej S. Szmigiero, Alex Williamson, Cédric Le Goater,
	Paolo Bonzini, Avihai Horon, qemu-devel

On Mon, Jun 01, 2026 at 10:39:26AM -0300, Fabiano Rosas wrote:
> A bit idiosyncratic, but I don't have any better suggestions.

One thing I don't think proper with the current approach: if it's a real
"priority" it should be invoked always with high->low order, no matter if
running=true or not..  but that's not what will happen if we encode it with
the "depth" field.

That is an implication that we have two demands, and they aren't the same,
but this solution wrongly packed them together.

In 2023, VFIO already introduced VMChangeStateEntry.prepare_cb().  That is
a real impl of prority queues with only two: prepare_cb() always has higher
priority than the cb() itself.

Can VFIO simply leverage this existing interface, instead of hijacking
"depth" in an unwanted way?

I believe the _P2P states still need to happen first for all vfio devices,
then it can do the next step. Logically I think it can also be done
internally within VFIO by proper impl of these two callbacks, to achieve
both:

- Proper transition to P2P states first for all devices, meanwhile,
- Proper concurrency of all VFIO devices on device state transitions

Would that work in a cleaner way?

Thanks,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 2/2] vfio/migration: Parallelize device state transitions
  2026-06-01 19:33     ` Peter Xu
@ 2026-06-02 10:01       ` Maciej S. Szmigiero
  2026-06-02 20:17         ` Peter Xu
  0 siblings, 1 reply; 9+ messages in thread
From: Maciej S. Szmigiero @ 2026-06-02 10:01 UTC (permalink / raw)
  To: Peter Xu, Fabiano Rosas
  Cc: Alex Williamson, Cédric Le Goater, Paolo Bonzini,
	Avihai Horon, qemu-devel

On 1.06.2026 21:33, Peter Xu wrote:
> On Mon, Jun 01, 2026 at 10:39:26AM -0300, Fabiano Rosas wrote:
>> A bit idiosyncratic, but I don't have any better suggestions.
> 
> One thing I don't think proper with the current approach: if it's a real
> "priority" it should be invoked always with high->low order, no matter if
> running=true or not..  but that's not what will happen if we encode it with
> the "depth" field.
> 
> That is an implication that we have two demands, and they aren't the same,
> but this solution wrongly packed them together.
> 
> In 2023, VFIO already introduced VMChangeStateEntry.prepare_cb().  That is
> a real impl of prority queues with only two: prepare_cb() always has higher
> priority than the cb() itself.
> 
> Can VFIO simply leverage this existing interface, instead of hijacking
> "depth" in an unwanted way?
> 
> I believe the _P2P states still need to happen first for all vfio devices,
> then it can do the next step. Logically I think it can also be done
> internally within VFIO by proper impl of these two callbacks, to achieve
> both:
> 
> - Proper transition to P2P states first for all devices, meanwhile,
> - Proper concurrency of all VFIO devices on device state transitions
> 
> Would that work in a cleaner way?

In the VFIO migration save/source code we have two transitions:
PRE_COPY -> PRE_COPY_P2P done via prepare_cb() and
PRE_COPY_P2P -> STOP_COPY done via cb().
On the load/target side the situation is similar, only states are
different.

For each of these transitions we need two callbacks: one has to launch
per-VFIO device state change thread and the other one has to "collect"
(or join) these threads.

All handlers of the first callback have to be complete before calling
any handler of the second callback since otherwise there would be
VFIO device vs VFIO device serialization.

Moreover, all the prepare_cb()-related transitions have to be done
before any of the cb()-related is started.

I presume what you suggest above is to launch per-VFIO device state
transition thread in prepare_cb() which would then be responsible for
doing *both* state transitions for this VFIO device.
And then collect this thread in cb().

However, this would need some extra interlocking:
- A lock inside VFIO code shared by all VFIO devices making sure
no VFIO device starts the second transition (the cb() one) before
all VFIO devices finish the first transition (the prepare_cb() one).

- Looking at the current QEMU code, there are some non-VFIO cb()
handlers registered via qdev_add_vm_change_state_handler() and
quite a few handlers registered via qemu_add_vm_change_state_handler().

I presume these VFIO prepare_cb()-related transitions need to finish
before calling any of these handlers to maintain the current ordering
so some kind of interlocking in vm_state_notify() between calling
prepare_cb() and cb() may be necessary since any "prio" based
ordering will have the same problem of reverse calling order on VM
start vs stop.

To be clear, I'm not insisting on the implementation from this patch -
any alternative one that preserves the aforementioned ordering rules
and has similar downtime performance is probably okay too.

> Thanks,
> 

Thanks,
Maciej

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 2/2] vfio/migration: Parallelize device state transitions
  2026-06-02 10:01       ` Maciej S. Szmigiero
@ 2026-06-02 20:17         ` Peter Xu
  2026-06-03 21:34           ` Maciej S. Szmigiero
  0 siblings, 1 reply; 9+ messages in thread
From: Peter Xu @ 2026-06-02 20:17 UTC (permalink / raw)
  To: Maciej S. Szmigiero
  Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater,
	Paolo Bonzini, Avihai Horon, qemu-devel

On Tue, Jun 02, 2026 at 12:01:32PM +0200, Maciej S. Szmigiero wrote:
> On 1.06.2026 21:33, Peter Xu wrote:
> > On Mon, Jun 01, 2026 at 10:39:26AM -0300, Fabiano Rosas wrote:
> > > A bit idiosyncratic, but I don't have any better suggestions.
> > 
> > One thing I don't think proper with the current approach: if it's a real
> > "priority" it should be invoked always with high->low order, no matter if
> > running=true or not..  but that's not what will happen if we encode it with
> > the "depth" field.
> > 
> > That is an implication that we have two demands, and they aren't the same,
> > but this solution wrongly packed them together.
> > 
> > In 2023, VFIO already introduced VMChangeStateEntry.prepare_cb().  That is
> > a real impl of prority queues with only two: prepare_cb() always has higher
> > priority than the cb() itself.
> > 
> > Can VFIO simply leverage this existing interface, instead of hijacking
> > "depth" in an unwanted way?
> > 
> > I believe the _P2P states still need to happen first for all vfio devices,
> > then it can do the next step. Logically I think it can also be done
> > internally within VFIO by proper impl of these two callbacks, to achieve
> > both:
> > 
> > - Proper transition to P2P states first for all devices, meanwhile,
> > - Proper concurrency of all VFIO devices on device state transitions
> > 
> > Would that work in a cleaner way?
> 
> In the VFIO migration save/source code we have two transitions:
> PRE_COPY -> PRE_COPY_P2P done via prepare_cb() and
> PRE_COPY_P2P -> STOP_COPY done via cb().
> On the load/target side the situation is similar, only states are
> different.
> 
> For each of these transitions we need two callbacks: one has to launch
> per-VFIO device state change thread and the other one has to "collect"
> (or join) these threads.
> 
> All handlers of the first callback have to be complete before calling
> any handler of the second callback since otherwise there would be
> VFIO device vs VFIO device serialization.
> 
> Moreover, all the prepare_cb()-related transitions have to be done
> before any of the cb()-related is started.
> 
> 
> I presume what you suggest above is to launch per-VFIO device state
> transition thread in prepare_cb() which would then be responsible for
> doing *both* state transitions for this VFIO device.
> And then collect this thread in cb().

Yes.

> 
> However, this would need some extra interlocking:
> - A lock inside VFIO code shared by all VFIO devices making sure
> no VFIO device starts the second transition (the cb() one) before
> all VFIO devices finish the first transition (the prepare_cb() one).

Yes, more below.

> 
> - Looking at the current QEMU code, there are some non-VFIO cb()
> handlers registered via qdev_add_vm_change_state_handler() and
> quite a few handlers registered via qemu_add_vm_change_state_handler().
> 
> I presume these VFIO prepare_cb()-related transitions need to finish
> before calling any of these handlers to maintain the current ordering

IIUC, we don't necessarily need to keep this ordering; or do you see any
real demand of that?

What I see is that the order is only introduced due to P2P state of VFIO,
it seems very specific to me.

> so some kind of interlocking in vm_state_notify() between calling
> prepare_cb() and cb() may be necessary since any "prio" based
> ordering will have the same problem of reverse calling order on VM
> start vs stop.
> 
> 
> To be clear, I'm not insisting on the implementation from this patch -
> any alternative one that preserves the aforementioned ordering rules
> and has similar downtime performance is probably okay too.

The problem is, IIUC patch 1 needs better justification on its own, and I
feel like it's not doing the right thing, as discussed previously. So even
if we want to introduce a new priority-based approach, we may want to
rethink patch 1.

IMHO it'll be perfect if we can simply reuse prepare_cb().  It seems
working.

Due to this demand, I got to learn we have pthread_barrier_init(), but..
personally I like a simple atomic+event approach:

vfio thread:

  # Do step 1, ->P2P
  if (qatomic_fetch_inc(&vfio_p2p_done) + 1 == vfio_total_devices) {                                                                  
      qemu_event_set(&vfio_p2p_done_event);
  }                                                                  
  qemu_event_wait(&vfio_p2p_done_event);
  # Do step 2, ->STOPCOPY

cb() only join()s.

Would this work?

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 2/2] vfio/migration: Parallelize device state transitions
  2026-06-02 20:17         ` Peter Xu
@ 2026-06-03 21:34           ` Maciej S. Szmigiero
  0 siblings, 0 replies; 9+ messages in thread
From: Maciej S. Szmigiero @ 2026-06-03 21:34 UTC (permalink / raw)
  To: Peter Xu
  Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater,
	Paolo Bonzini, Avihai Horon, qemu-devel

On 2.06.2026 22:17, Peter Xu wrote:
> On Tue, Jun 02, 2026 at 12:01:32PM +0200, Maciej S. Szmigiero wrote:
>> On 1.06.2026 21:33, Peter Xu wrote:
>>> On Mon, Jun 01, 2026 at 10:39:26AM -0300, Fabiano Rosas wrote:
>>>> A bit idiosyncratic, but I don't have any better suggestions.
>>>
>>> One thing I don't think proper with the current approach: if it's a real
>>> "priority" it should be invoked always with high->low order, no matter if
>>> running=true or not..  but that's not what will happen if we encode it with
>>> the "depth" field.
>>>
>>> That is an implication that we have two demands, and they aren't the same,
>>> but this solution wrongly packed them together.
>>>
>>> In 2023, VFIO already introduced VMChangeStateEntry.prepare_cb().  That is
>>> a real impl of prority queues with only two: prepare_cb() always has higher
>>> priority than the cb() itself.
>>>
>>> Can VFIO simply leverage this existing interface, instead of hijacking
>>> "depth" in an unwanted way?
>>>
>>> I believe the _P2P states still need to happen first for all vfio devices,
>>> then it can do the next step. Logically I think it can also be done
>>> internally within VFIO by proper impl of these two callbacks, to achieve
>>> both:
>>>
>>> - Proper transition to P2P states first for all devices, meanwhile,
>>> - Proper concurrency of all VFIO devices on device state transitions
>>>
>>> Would that work in a cleaner way?
>>
>> In the VFIO migration save/source code we have two transitions:
>> PRE_COPY -> PRE_COPY_P2P done via prepare_cb() and
>> PRE_COPY_P2P -> STOP_COPY done via cb().
>> On the load/target side the situation is similar, only states are
>> different.
>>
>> For each of these transitions we need two callbacks: one has to launch
>> per-VFIO device state change thread and the other one has to "collect"
>> (or join) these threads.
>>
>> All handlers of the first callback have to be complete before calling
>> any handler of the second callback since otherwise there would be
>> VFIO device vs VFIO device serialization.
>>
>> Moreover, all the prepare_cb()-related transitions have to be done
>> before any of the cb()-related is started.
>>
>>
>> I presume what you suggest above is to launch per-VFIO device state
>> transition thread in prepare_cb() which would then be responsible for
>> doing *both* state transitions for this VFIO device.
>> And then collect this thread in cb().
> 
> Yes.
> 
>>
>> However, this would need some extra interlocking:
>> - A lock inside VFIO code shared by all VFIO devices making sure
>> no VFIO device starts the second transition (the cb() one) before
>> all VFIO devices finish the first transition (the prepare_cb() one).
> 
> Yes, more below.
> 
>>
>> - Looking at the current QEMU code, there are some non-VFIO cb()
>> handlers registered via qdev_add_vm_change_state_handler() and
>> quite a few handlers registered via qemu_add_vm_change_state_handler().
>>
>> I presume these VFIO prepare_cb()-related transitions need to finish
>> before calling any of these handlers to maintain the current ordering
> 
> IIUC, we don't necessarily need to keep this ordering; or do you see any
> real demand of that?
> 
> What I see is that the order is only introduced due to P2P state of VFIO,
> it seems very specific to me.

While I don't have a specific example why relaxing this ordering is unsafe
I am not certain that it is safe either - we can end up introducing subtle
race condition and possible guest memory corruption.

Maybe Alex and Cédric can provide some additional input here.

If nobody can authoritatively say that calling non-VFIO cb()
VM state change handlers before being certain that all VFIO devices
reached PRE_COPY_P2P / RUNNING_P2P state is indeed safe thing to do
I think we'll have to play safe here.

> 
>> so some kind of interlocking in vm_state_notify() between calling
>> prepare_cb() and cb() may be necessary since any "prio" based
>> ordering will have the same problem of reverse calling order on VM
>> start vs stop.
>>
>>
>> To be clear, I'm not insisting on the implementation from this patch -
>> any alternative one that preserves the aforementioned ordering rules
>> and has similar downtime performance is probably okay too.
> 
> The problem is, IIUC patch 1 needs better justification on its own, and I
> feel like it's not doing the right thing, as discussed previously. So even
> if we want to introduce a new priority-based approach, we may want to
> rethink patch 1.
> 
> IMHO it'll be perfect if we can simply reuse prepare_cb().  It seems
> working.
> 
> Due to this demand, I got to learn we have pthread_barrier_init(), but..
> personally I like a simple atomic+event approach:
> 
> vfio thread:
> 
>    # Do step 1, ->P2P
>    if (qatomic_fetch_inc(&vfio_p2p_done) + 1 == vfio_total_devices) {
>        qemu_event_set(&vfio_p2p_done_event);
>    }
>    qemu_event_wait(&vfio_p2p_done_event);
>    # Do step 2, ->STOPCOPY
> 
> cb() only join()s.
> 
> Would this work?

At the first glance it should work - of course, we'd need to introduce
the vfio_total_devices counter too.

However the issue about synchronization with other (non-VFIO) cb()
handlers described above still (possibly) remains.

> Thanks,
> 

Thanks,
Maciej



^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2026-06-03 21:36 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-05-20 14:41 [PATCH 0/2] VFIO device migration parallel state transitions Maciej S. Szmigiero
2026-05-20 14:41 ` [PATCH 1/2] system/runstate: Allow adjustment of priority for VM state change handlers Maciej S. Szmigiero
2026-06-01 13:26   ` Fabiano Rosas
2026-05-20 14:41 ` [PATCH 2/2] vfio/migration: Parallelize device state transitions Maciej S. Szmigiero
2026-06-01 13:39   ` Fabiano Rosas
2026-06-01 19:33     ` Peter Xu
2026-06-02 10:01       ` Maciej S. Szmigiero
2026-06-02 20:17         ` Peter Xu
2026-06-03 21:34           ` Maciej S. Szmigiero

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.