* [RFC 0/6] virtio-net: initial iterative live migration support
@ 2025-07-22 12:41 Jonah Palmer
2025-07-22 12:41 ` [RFC 1/6] migration: Add virtio-iterative capability Jonah Palmer
` (6 more replies)
0 siblings, 7 replies; 66+ messages in thread
From: Jonah Palmer @ 2025-07-22 12:41 UTC (permalink / raw)
To: qemu-devel
Cc: peterx, farosas, eblake, armbru, jasowang, mst, si-wei.liu,
eperezma, boris.ostrovsky, jonah.palmer
This series is an RFC initial implementation of iterative live
migration for virtio-net devices.
The main motivation behind implementing iterative migration for
virtio-net devices is to start on heavy, time-consuming operations
for the destination while the source is still active (i.e. before
the stop-and-copy phase).
The motivation behind this RFC series specifically is to provide an
initial framework for such an implementation and get feedback on the
design and direction.
-------
This implementation of iterative live migration for a virtio-net device
is enabled via setting the migration capability 'virtio-iterative' to
on for both the source & destination, e.g. (HMP):
(qemu) migrate_set_capability virtio-iterative on
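The QMP equivalent (as used in patch 1 of this series) is:
{"execute": "migrate-set-capabilities", "arguments": {
    "capabilities": [
      { "capability": "virtio-iterative", "state": true }
    ]
  }
}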
The virtio-net device's SaveVMHandlers hooks are registered/unregistered
during the device's realize/unrealize phase.
Currently, this series only sends and loads the vmstate at the start of
migration. The vmstate is still sent (again) during the stop-and-copy
phase, as it is today, to handle any deltas in the state since it was
initially sent. A future patch in this series could avoid having to
re-send and re-load the entire state again and instead focus only on the
deltas.
There is a slight, modest improvement in guest-visible downtime from
this series. More specifically, when using iterative live migration with
a virtio-net device, the downtime contributed by migrating a virtio-net
device decreased from ~3.2ms to ~1.4ms on average:
Before:
-------
vmstate_downtime_load type=non-iterable idstr=0000:00:03.0/virtio-net
instance_id=0 downtime=3594
After:
------
vmstate_downtime_load type=non-iterable idstr=0000:00:03.0/virtio-net
instance_id=0 downtime=1607
This slight improvement is likely due to the initial vmstate_load_state
call "warming up" pages in memory such that, when it's called a second
time during the stop-and-copy phase, allocation and page-fault latencies
are reduced.
-------
Comments, suggestions, etc. are welcome here.
Jonah Palmer (6):
migration: Add virtio-iterative capability
virtio-net: Reorder vmstate_virtio_net and helpers
virtio-net: Add SaveVMHandlers for iterative migration
virtio-net: iter live migration - migrate vmstate
virtio,virtio-net: skip consistency check in virtio_load for iterative
migration
virtio-net: skip vhost_started assertion during iterative migration
hw/net/virtio-net.c | 246 +++++++++++++++++++++++++++------
hw/virtio/virtio.c | 32 +++--
include/hw/virtio/virtio-net.h | 8 ++
include/hw/virtio/virtio.h | 7 +
migration/savevm.c | 1 +
qapi/migration.json | 7 +-
6 files changed, 247 insertions(+), 54 deletions(-)
--
2.47.1
* [RFC 1/6] migration: Add virtio-iterative capability
2025-07-22 12:41 [RFC 0/6] virtio-net: initial iterative live migration support Jonah Palmer
@ 2025-07-22 12:41 ` Jonah Palmer
2025-08-06 15:58 ` Peter Xu
2025-08-08 10:48 ` Markus Armbruster
2025-07-22 12:41 ` [RFC 2/6] virtio-net: Reorder vmstate_virtio_net and helpers Jonah Palmer
` (5 subsequent siblings)
6 siblings, 2 replies; 66+ messages in thread
From: Jonah Palmer @ 2025-07-22 12:41 UTC (permalink / raw)
To: qemu-devel
Cc: peterx, farosas, eblake, armbru, jasowang, mst, si-wei.liu,
eperezma, boris.ostrovsky, jonah.palmer
Adds a new migration capability 'virtio-iterative' that will allow
virtio devices, where supported, to iteratively migrate configuration
changes that occur during the migration process.
This capability is added to the validated capabilities list to ensure
both the source and destination support it before enabling.
The capability defaults to off to maintain backward compatibility.
To enable the capability via HMP:
(qemu) migrate_set_capability virtio-iterative on
To enable the capability via QMP:
{"execute": "migrate-set-capabilities", "arguments": {
"capabilities": [
{ "capability": "virtio-iterative", "state": true }
]
}
}
Signed-off-by: Jonah Palmer <jonah.palmer@oracle.com>
---
migration/savevm.c | 1 +
qapi/migration.json | 7 ++++++-
2 files changed, 7 insertions(+), 1 deletion(-)
diff --git a/migration/savevm.c b/migration/savevm.c
index bb04a4520d..40a2189866 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -279,6 +279,7 @@ static bool should_validate_capability(int capability)
switch (capability) {
case MIGRATION_CAPABILITY_X_IGNORE_SHARED:
case MIGRATION_CAPABILITY_MAPPED_RAM:
+ case MIGRATION_CAPABILITY_VIRTIO_ITERATIVE:
return true;
default:
return false;
diff --git a/qapi/migration.json b/qapi/migration.json
index 4963f6ca12..8f042c3ba5 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -479,6 +479,11 @@
# each RAM page. Requires a migration URI that supports seeking,
# such as a file. (since 9.0)
#
+# @virtio-iterative: Enable iterative migration for virtio devices, if
+# the device supports it. When enabled, and where supported, virtio
+# devices will track and migrate configuration changes that may
+# occur during the migration process. (Since 10.1)
+#
# Features:
#
# @unstable: Members @x-colo and @x-ignore-shared are experimental.
@@ -498,7 +503,7 @@
{ 'name': 'x-ignore-shared', 'features': [ 'unstable' ] },
'validate-uuid', 'background-snapshot',
'zero-copy-send', 'postcopy-preempt', 'switchover-ack',
- 'dirty-limit', 'mapped-ram'] }
+ 'dirty-limit', 'mapped-ram', 'virtio-iterative'] }
##
# @MigrationCapabilityStatus:
--
2.47.1
* [RFC 2/6] virtio-net: Reorder vmstate_virtio_net and helpers
2025-07-22 12:41 [RFC 0/6] virtio-net: initial iterative live migration support Jonah Palmer
2025-07-22 12:41 ` [RFC 1/6] migration: Add virtio-iterative capability Jonah Palmer
@ 2025-07-22 12:41 ` Jonah Palmer
2025-07-22 12:41 ` [RFC 3/6] virtio-net: Add SaveVMHandlers for iterative migration Jonah Palmer
` (4 subsequent siblings)
6 siblings, 0 replies; 66+ messages in thread
From: Jonah Palmer @ 2025-07-22 12:41 UTC (permalink / raw)
To: qemu-devel
Cc: peterx, farosas, eblake, armbru, jasowang, mst, si-wei.liu,
eperezma, boris.ostrovsky, jonah.palmer
This reordering makes the vmstate_virtio_net available for use by future
virtio-net SaveVMHandlers hooks that will need to be placed before
virtio_net_device_realize.
Signed-off-by: Jonah Palmer <jonah.palmer@oracle.com>
---
hw/net/virtio-net.c | 90 ++++++++++++++++++++++-----------------------
1 file changed, 45 insertions(+), 45 deletions(-)
diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index 221252e00a..93029104b3 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -3746,6 +3746,51 @@ static bool failover_hide_primary_device(DeviceListener *listener,
return qatomic_read(&n->failover_primary_hidden);
}
+static int virtio_net_pre_save(void *opaque)
+{
+ VirtIONet *n = opaque;
+
+ /* At this point, backend must be stopped, otherwise
+ * it might keep writing to memory. */
+ assert(!n->vhost_started);
+
+ return 0;
+}
+
+static bool primary_unplug_pending(void *opaque)
+{
+ DeviceState *dev = opaque;
+ DeviceState *primary;
+ VirtIODevice *vdev = VIRTIO_DEVICE(dev);
+ VirtIONet *n = VIRTIO_NET(vdev);
+
+ if (!virtio_vdev_has_feature(vdev, VIRTIO_NET_F_STANDBY)) {
+ return false;
+ }
+ primary = failover_find_primary_device(n);
+ return primary ? primary->pending_deleted_event : false;
+}
+
+static bool dev_unplug_pending(void *opaque)
+{
+ DeviceState *dev = opaque;
+ VirtioDeviceClass *vdc = VIRTIO_DEVICE_GET_CLASS(dev);
+
+ return vdc->primary_unplug_pending(dev);
+}
+
+static const VMStateDescription vmstate_virtio_net = {
+ .name = "virtio-net",
+ .minimum_version_id = VIRTIO_NET_VM_VERSION,
+ .version_id = VIRTIO_NET_VM_VERSION,
+ .fields = (const VMStateField[]) {
+ VMSTATE_VIRTIO_DEVICE,
+ VMSTATE_END_OF_LIST()
+ },
+ .pre_save = virtio_net_pre_save,
+ .dev_unplug_pending = dev_unplug_pending,
+};
+
static void virtio_net_device_realize(DeviceState *dev, Error **errp)
{
VirtIODevice *vdev = VIRTIO_DEVICE(dev);
@@ -4016,51 +4061,6 @@ static void virtio_net_instance_init(Object *obj)
ebpf_rss_init(&n->ebpf_rss);
}
-static int virtio_net_pre_save(void *opaque)
-{
- VirtIONet *n = opaque;
-
- /* At this point, backend must be stopped, otherwise
- * it might keep writing to memory. */
- assert(!n->vhost_started);
-
- return 0;
-}
-
-static bool primary_unplug_pending(void *opaque)
-{
- DeviceState *dev = opaque;
- DeviceState *primary;
- VirtIODevice *vdev = VIRTIO_DEVICE(dev);
- VirtIONet *n = VIRTIO_NET(vdev);
-
- if (!virtio_vdev_has_feature(vdev, VIRTIO_NET_F_STANDBY)) {
- return false;
- }
- primary = failover_find_primary_device(n);
- return primary ? primary->pending_deleted_event : false;
-}
-
-static bool dev_unplug_pending(void *opaque)
-{
- DeviceState *dev = opaque;
- VirtioDeviceClass *vdc = VIRTIO_DEVICE_GET_CLASS(dev);
-
- return vdc->primary_unplug_pending(dev);
-}
-
-static const VMStateDescription vmstate_virtio_net = {
- .name = "virtio-net",
- .minimum_version_id = VIRTIO_NET_VM_VERSION,
- .version_id = VIRTIO_NET_VM_VERSION,
- .fields = (const VMStateField[]) {
- VMSTATE_VIRTIO_DEVICE,
- VMSTATE_END_OF_LIST()
- },
- .pre_save = virtio_net_pre_save,
- .dev_unplug_pending = dev_unplug_pending,
-};
-
static const Property virtio_net_properties[] = {
DEFINE_PROP_BIT64("csum", VirtIONet, host_features,
VIRTIO_NET_F_CSUM, true),
--
2.47.1
* [RFC 3/6] virtio-net: Add SaveVMHandlers for iterative migration
2025-07-22 12:41 [RFC 0/6] virtio-net: initial iterative live migration support Jonah Palmer
2025-07-22 12:41 ` [RFC 1/6] migration: Add virtio-iterative capability Jonah Palmer
2025-07-22 12:41 ` [RFC 2/6] virtio-net: Reorder vmstate_virtio_net and helpers Jonah Palmer
@ 2025-07-22 12:41 ` Jonah Palmer
2025-07-22 12:41 ` [RFC 4/6] virtio-net: iter live migration - migrate vmstate Jonah Palmer
` (3 subsequent siblings)
6 siblings, 0 replies; 66+ messages in thread
From: Jonah Palmer @ 2025-07-22 12:41 UTC (permalink / raw)
To: qemu-devel
Cc: peterx, farosas, eblake, armbru, jasowang, mst, si-wei.liu,
eperezma, boris.ostrovsky, jonah.palmer
Add SaveVMHandlers struct for virtio-net iterative migration support.
The handlers are registered but only contain no-op implementations.
This provides the framework for iterative migration without changing any
actual migration behavior when the capability is disabled.
A BDF representation is used when registering a virtio-net device's
SaveVMHandlers hooks. This is to create unique IDs in the case of
multiple virtio-net devices.
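For example, for a virtio-net device at PCI BDF 0000:00:03.0 (the
address appearing in the cover letter's traces), the registered section
ID would look like:
  0000:00:03.0/virtio-net-iterative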
Signed-off-by: Jonah Palmer <jonah.palmer@oracle.com>
---
hw/net/virtio-net.c | 85 +++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 85 insertions(+)
diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index 93029104b3..19aa5b5936 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -17,6 +17,7 @@
#include "qemu/log.h"
#include "qemu/main-loop.h"
#include "qemu/module.h"
+#include "qemu/cutils.h"
#include "hw/virtio/virtio.h"
#include "net/net.h"
#include "net/checksum.h"
@@ -38,6 +39,9 @@
#include "qapi/qapi-events-migration.h"
#include "hw/virtio/virtio-access.h"
#include "migration/misc.h"
+#include "migration/register.h"
+#include "migration/qemu-file.h"
+#include "migration/migration.h"
#include "standard-headers/linux/ethtool.h"
#include "system/system.h"
#include "system/replay.h"
@@ -3791,11 +3795,77 @@ static const VMStateDescription vmstate_virtio_net = {
.dev_unplug_pending = dev_unplug_pending,
};
+static bool virtio_net_iterative_migration_enabled(void)
+{
+ MigrationState *s = migrate_get_current();
+ return s->capabilities[MIGRATION_CAPABILITY_VIRTIO_ITERATIVE];
+}
+
+static bool virtio_net_is_active(void *opaque)
+{
+ return virtio_net_iterative_migration_enabled();
+}
+
+static int virtio_net_save_setup(QEMUFile *f, void *opaque, Error **errp)
+{
+ return 0;
+}
+
+static int virtio_net_save_live_iterate(QEMUFile *f, void *opaque)
+{
+ return 1;
+}
+
+static int virtio_net_save_live_complete_precopy(QEMUFile *f, void *opaque)
+{
+ return 0;
+}
+
+static void virtio_net_save_cleanup(void *opaque)
+{
+
+}
+
+static int virtio_net_load_setup(QEMUFile *f, void *opaque, Error **errp)
+{
+ return 0;
+}
+
+static int virtio_net_load_state(QEMUFile *f, void *opaque, int version_id)
+{
+ return 0;
+}
+
+static int virtio_net_load_cleanup(void *opaque)
+{
+ return 0;
+}
+
+static void virtio_net_state_pending_exact(void *opaque, uint64_t *must_precopy,
+ uint64_t *can_postcopy)
+{
+
+}
+
+static const SaveVMHandlers savevm_virtio_net_handlers = {
+ .is_active = virtio_net_is_active,
+ .save_setup = virtio_net_save_setup,
+ .save_live_iterate = virtio_net_save_live_iterate,
+ .save_live_complete_precopy = virtio_net_save_live_complete_precopy,
+ .save_cleanup = virtio_net_save_cleanup,
+ .load_setup = virtio_net_load_setup,
+ .load_state = virtio_net_load_state,
+ .load_cleanup = virtio_net_load_cleanup,
+ .state_pending_exact = virtio_net_state_pending_exact,
+};
+
static void virtio_net_device_realize(DeviceState *dev, Error **errp)
{
VirtIODevice *vdev = VIRTIO_DEVICE(dev);
VirtIONet *n = VIRTIO_NET(dev);
NetClientState *nc;
+ g_autofree char *path = NULL;
+ char id[256] = "";
int i;
if (n->net_conf.mtu) {
@@ -3963,12 +4033,21 @@ static void virtio_net_device_realize(DeviceState *dev, Error **errp)
if (virtio_has_feature(n->host_features, VIRTIO_NET_F_RSS)) {
virtio_net_load_ebpf(n, errp);
}
+
+ /* Register handlers for iterative migration */
+ path = qdev_get_dev_path(DEVICE(&n->parent_obj));
+ path = g_strdup_printf("%s/virtio-net-iterative", path);
+ strpadcpy(id, sizeof(id), path, '\0');
+ register_savevm_live(id, VMSTATE_INSTANCE_ID_ANY, 1,
+ &savevm_virtio_net_handlers, n);
}
static void virtio_net_device_unrealize(DeviceState *dev)
{
VirtIODevice *vdev = VIRTIO_DEVICE(dev);
VirtIONet *n = VIRTIO_NET(dev);
+ g_autofree char *path = NULL;
+ char id[256] = "";
int i, max_queue_pairs;
if (virtio_has_feature(n->host_features, VIRTIO_NET_F_RSS)) {
@@ -4007,6 +4086,12 @@ static void virtio_net_device_unrealize(DeviceState *dev)
g_free(n->rss_data.indirections_table);
net_rx_pkt_uninit(n->rx_pkt);
virtio_cleanup(vdev);
+
+ /* Unregister migration handlers */
+ path = qdev_get_dev_path(DEVICE(&n->parent_obj));
+ path = g_strdup_printf("%s/virtio-net-iterative", path);
+ strpadcpy(id, sizeof(id), path, '\0');
+ unregister_savevm(VMSTATE_IF(dev), id, n);
}
static void virtio_net_reset(VirtIODevice *vdev)
--
2.47.1
* [RFC 4/6] virtio-net: iter live migration - migrate vmstate
2025-07-22 12:41 [RFC 0/6] virtio-net: initial iterative live migration support Jonah Palmer
` (2 preceding siblings ...)
2025-07-22 12:41 ` [RFC 3/6] virtio-net: Add SaveVMHandlers for iterative migration Jonah Palmer
@ 2025-07-22 12:41 ` Jonah Palmer
2025-07-23 6:51 ` Michael S. Tsirkin
2025-07-22 12:41 ` [RFC 5/6] virtio, virtio-net: skip consistency check in virtio_load for iterative migration Jonah Palmer via
` (2 subsequent siblings)
6 siblings, 1 reply; 66+ messages in thread
From: Jonah Palmer @ 2025-07-22 12:41 UTC (permalink / raw)
To: qemu-devel
Cc: peterx, farosas, eblake, armbru, jasowang, mst, si-wei.liu,
eperezma, boris.ostrovsky, jonah.palmer
Lays out the initial groundwork for iteratively migrating the state of a
virtio-net device, starting with its vmstate (via vmstate_save_state &
vmstate_load_state).
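In terms of the migration stream, the framing added by this patch looks
roughly as follows (a sketch of the layout using the VNET_MIG_F_*
delimiters defined below, not an exact byte-level dump):
  save_setup:                  VNET_MIG_F_INIT_STATE | vmstate_virtio_net | VNET_MIG_F_END_DATA
  save_live_iterate:           VNET_MIG_F_NO_DATA   (no new data yet)
  save_live_complete_precopy:  VNET_MIG_F_NO_DATA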
The original non-iterative vmstate framework still runs during the
stop-and-copy phase when the guest is paused, which is still necessary
to migrate over the final state of the virtqueues once the source has
been paused.
Although the vmstate framework is used twice (once during the iterative
portion and once during the stop-and-copy phase), it appears that
there's some modest improvement in guest-visible downtime when using a
virtio-net device.
When tracing the vmstate_downtime_save and vmstate_downtime_load
tracepoints, for a virtio-net device using iterative live migration, the
non-iterative downtime portion improved modestly, going from ~3.2ms to
~1.4ms:
Before:
-------
vmstate_downtime_load type=non-iterable idstr=0000:00:03.0/virtio-net
instance_id=0 downtime=3594
After:
------
vmstate_downtime_load type=non-iterable idstr=0000:00:03.0/virtio-net
instance_id=0 downtime=1607
This improvement is likely due to the initial vmstate_load_state call
(while the guest is still running) "warming up" all related pages and
structures on the destination. In other words, by the time the final
stop-and-copy phase starts, the heavy allocations and page-fault
latencies are reduced, making the device re-loads slightly faster and
the guest-visible downtime window slightly smaller.
Future patches could improve upon this by skipping the second
vmstate_save/load_state calls (during the stop-and-copy phase) and
instead only send deltas right before/after the source is stopped.
Signed-off-by: Jonah Palmer <jonah.palmer@oracle.com>
---
hw/net/virtio-net.c | 37 ++++++++++++++++++++++++++++++++++
include/hw/virtio/virtio-net.h | 8 ++++++++
2 files changed, 45 insertions(+)
diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index 19aa5b5936..86a6fe5b91 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -3808,16 +3808,31 @@ static bool virtio_net_is_active(void *opaque)
static int virtio_net_save_setup(QEMUFile *f, void *opaque, Error **errp)
{
+ VirtIONet *n = opaque;
+
+ qemu_put_be64(f, VNET_MIG_F_INIT_STATE);
+ vmstate_save_state(f, &vmstate_virtio_net, n, NULL);
+ qemu_put_be64(f, VNET_MIG_F_END_DATA);
+
return 0;
}
static int virtio_net_save_live_iterate(QEMUFile *f, void *opaque)
{
+ bool new_data = false;
+
+ if (!new_data) {
+ qemu_put_be64(f, VNET_MIG_F_NO_DATA);
+ return 1;
+ }
+
+ qemu_put_be64(f, VNET_MIG_F_END_DATA);
return 1;
}
static int virtio_net_save_live_complete_precopy(QEMUFile *f, void *opaque)
{
+ qemu_put_be64(f, VNET_MIG_F_NO_DATA);
return 0;
}
@@ -3833,6 +3848,28 @@ static int virtio_net_load_setup(QEMUFile *f, void *opaque, Error **errp)
static int virtio_net_load_state(QEMUFile *f, void *opaque, int version_id)
{
+ VirtIONet *n = opaque;
+ uint64_t flag;
+
+ flag = qemu_get_be64(f);
+ if (flag == VNET_MIG_F_NO_DATA) {
+ return 0;
+ }
+
+ while (flag != VNET_MIG_F_END_DATA) {
+ switch (flag) {
+ case VNET_MIG_F_INIT_STATE:
+ {
+ vmstate_load_state(f, &vmstate_virtio_net, n, VIRTIO_NET_VM_VERSION);
+ break;
+ }
+ default:
+ qemu_log_mask(LOG_GUEST_ERROR, "%s: Unknown flag 0x%"PRIx64, __func__, flag);
+ return -EINVAL;
+ }
+
+ flag = qemu_get_be64(f);
+ }
return 0;
}
diff --git a/include/hw/virtio/virtio-net.h b/include/hw/virtio/virtio-net.h
index b9ea9e824e..d6c7619053 100644
--- a/include/hw/virtio/virtio-net.h
+++ b/include/hw/virtio/virtio-net.h
@@ -163,6 +163,14 @@ typedef struct VirtIONetQueue {
struct VirtIONet *n;
} VirtIONetQueue;
+/*
+ * Flags to be used as unique delimiters for virtio-net devices in the
+ * migration stream.
+ */
+#define VNET_MIG_F_INIT_STATE (0xffffffffef200000ULL)
+#define VNET_MIG_F_END_DATA (0xffffffffef200001ULL)
+#define VNET_MIG_F_NO_DATA (0xffffffffef200002ULL)
+
struct VirtIONet {
VirtIODevice parent_obj;
uint8_t mac[ETH_ALEN];
--
2.47.1
* [RFC 5/6] virtio, virtio-net: skip consistency check in virtio_load for iterative migration
2025-07-22 12:41 [RFC 0/6] virtio-net: initial iterative live migration support Jonah Palmer
` (3 preceding siblings ...)
2025-07-22 12:41 ` [RFC 4/6] virtio-net: iter live migration - migrate vmstate Jonah Palmer
@ 2025-07-22 12:41 ` Jonah Palmer via
2025-07-28 15:30 ` [RFC 5/6] virtio,virtio-net: " Eugenio Perez Martin
2025-08-06 16:27 ` Peter Xu
2025-07-22 12:41 ` [RFC 6/6] virtio-net: skip vhost_started assertion during " Jonah Palmer
2025-07-23 5:51 ` [RFC 0/6] virtio-net: initial iterative live migration support Jason Wang
6 siblings, 2 replies; 66+ messages in thread
From: Jonah Palmer via @ 2025-07-22 12:41 UTC (permalink / raw)
To: qemu-devel
Cc: peterx, farosas, eblake, armbru, jasowang, mst, si-wei.liu,
eperezma, boris.ostrovsky, jonah.palmer
Iterative live migration for virtio-net sends an initial
VMStateDescription while the source is still active. Because data
continues to flow for virtio-net, the guest's avail index continues to
increment after last_avail_idx has already been sent. This causes the
destination to often see something like this from virtio_error():
VQ 0 size 0x100 Guest index 0x0 inconsistent with Host index 0xc: delta 0xfff4
This patch suppresses this consistency check if we're loading the
initial VMStateDescriptions via iterative migration and unsuppresses
it for the stop-and-copy phase when the final VMStateDescriptions
(carrying the correct indices) are loaded.
A temporary VirtIODevMigration data structure is introduced here to
represent the iterative migration process for a VirtIODevice. For now it
just holds a flag to indicate whether or not the initial
VMStateDescription was loaded during the iterative live migration process.
Signed-off-by: Jonah Palmer <jonah.palmer@oracle.com>
---
hw/net/virtio-net.c | 13 +++++++++++++
hw/virtio/virtio.c | 32 ++++++++++++++++++++++++--------
include/hw/virtio/virtio.h | 6 ++++++
3 files changed, 43 insertions(+), 8 deletions(-)
diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index 86a6fe5b91..b7ac5e8278 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -3843,12 +3843,19 @@ static void virtio_net_save_cleanup(void *opaque)
static int virtio_net_load_setup(QEMUFile *f, void *opaque, Error **errp)
{
+ VirtIONet *n = opaque;
+ VirtIODevice *vdev = VIRTIO_DEVICE(n);
+ vdev->migration = g_new0(VirtIODevMigration, 1);
+ vdev->migration->iterative_vmstate_loaded = false;
+
return 0;
}
static int virtio_net_load_state(QEMUFile *f, void *opaque, int version_id)
{
VirtIONet *n = opaque;
+ VirtIODevice *vdev = VIRTIO_DEVICE(n);
+ VirtIODevMigration *mig = vdev->migration;
uint64_t flag;
flag = qemu_get_be64(f);
@@ -3861,6 +3868,7 @@ static int virtio_net_load_state(QEMUFile *f, void *opaque, int version_id)
case VNET_MIG_F_INIT_STATE:
{
vmstate_load_state(f, &vmstate_virtio_net, n, VIRTIO_NET_VM_VERSION);
+ mig->iterative_vmstate_loaded = true;
break;
}
default:
@@ -3875,6 +3883,11 @@ static int virtio_net_load_state(QEMUFile *f, void *opaque, int version_id)
static int virtio_net_load_cleanup(void *opaque)
{
+ VirtIONet *n = opaque;
+ VirtIODevice *vdev = VIRTIO_DEVICE(n);
+ g_free(vdev->migration);
+ vdev->migration = NULL;
+
return 0;
}
diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index 5534251e01..68957ee7d1 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -3222,6 +3222,7 @@ virtio_load(VirtIODevice *vdev, QEMUFile *f, int version_id)
int32_t config_len;
uint32_t num;
uint32_t features;
+ bool inconsistent_indices;
BusState *qbus = qdev_get_parent_bus(DEVICE(vdev));
VirtioBusClass *k = VIRTIO_BUS_GET_CLASS(qbus);
VirtioDeviceClass *vdc = VIRTIO_DEVICE_GET_CLASS(vdev);
@@ -3365,6 +3366,16 @@ virtio_load(VirtIODevice *vdev, QEMUFile *f, int version_id)
if (vdev->vq[i].vring.desc) {
uint16_t nheads;
+ /*
+ * Ring indices will be inconsistent during iterative migration. The actual
+ * indices will be sent later during the stop-and-copy phase.
+ */
+ if (vdev->migration) {
+ inconsistent_indices = !vdev->migration->iterative_vmstate_loaded;
+ } else {
+ inconsistent_indices = false;
+ }
+
/*
* VIRTIO-1 devices migrate desc, used, and avail ring addresses so
* only the region cache needs to be set up. Legacy devices need
@@ -3384,14 +3395,19 @@ virtio_load(VirtIODevice *vdev, QEMUFile *f, int version_id)
continue;
}
- nheads = vring_avail_idx(&vdev->vq[i]) - vdev->vq[i].last_avail_idx;
- /* Check it isn't doing strange things with descriptor numbers. */
- if (nheads > vdev->vq[i].vring.num) {
- virtio_error(vdev, "VQ %d size 0x%x Guest index 0x%x "
- "inconsistent with Host index 0x%x: delta 0x%x",
- i, vdev->vq[i].vring.num,
- vring_avail_idx(&vdev->vq[i]),
- vdev->vq[i].last_avail_idx, nheads);
+ if (!inconsistent_indices) {
+ nheads = vring_avail_idx(&vdev->vq[i]) - vdev->vq[i].last_avail_idx;
+ /* Check it isn't doing strange things with descriptor numbers. */
+ if (nheads > vdev->vq[i].vring.num) {
+ virtio_error(vdev, "VQ %d size 0x%x Guest index 0x%x "
+ "inconsistent with Host index 0x%x: delta 0x%x",
+ i, vdev->vq[i].vring.num,
+ vring_avail_idx(&vdev->vq[i]),
+ vdev->vq[i].last_avail_idx, nheads);
+ inconsistent_indices = true;
+ }
+ }
+ if (inconsistent_indices) {
vdev->vq[i].used_idx = 0;
vdev->vq[i].shadow_avail_idx = 0;
vdev->vq[i].inuse = 0;
diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
index 214d4a77e9..06b6e6ba65 100644
--- a/include/hw/virtio/virtio.h
+++ b/include/hw/virtio/virtio.h
@@ -98,6 +98,11 @@ enum virtio_device_endian {
VIRTIO_DEVICE_ENDIAN_BIG,
};
+/* VirtIODevice iterative live migration data structure */
+typedef struct VirtIODevMigration {
+ bool iterative_vmstate_loaded;
+} VirtIODevMigration;
+
/**
* struct VirtIODevice - common VirtIO structure
* @name: name of the device
@@ -151,6 +156,7 @@ struct VirtIODevice
bool disable_legacy_check;
bool vhost_started;
VMChangeStateEntry *vmstate;
+ VirtIODevMigration *migration;
char *bus_name;
uint8_t device_endian;
/**
--
2.47.1
* [RFC 6/6] virtio-net: skip vhost_started assertion during iterative migration
2025-07-22 12:41 [RFC 0/6] virtio-net: initial iterative live migration support Jonah Palmer
` (4 preceding siblings ...)
2025-07-22 12:41 ` [RFC 5/6] virtio, virtio-net: skip consistency check in virtio_load for iterative migration Jonah Palmer via
@ 2025-07-22 12:41 ` Jonah Palmer
2025-07-23 5:51 ` [RFC 0/6] virtio-net: initial iterative live migration support Jason Wang
6 siblings, 0 replies; 66+ messages in thread
From: Jonah Palmer @ 2025-07-22 12:41 UTC (permalink / raw)
To: qemu-devel
Cc: peterx, farosas, eblake, armbru, jasowang, mst, si-wei.liu,
eperezma, boris.ostrovsky, jonah.palmer
Enables vhost-net support for iterative live migration by skipping the
assertion that vhost must be stopped before proceeding with sending the
initial VMStateDescription for virtio-net.
This should be okay since, for the initial send of the device state, we
only care about the static device state and not the dynamic ring state.
After the iterative migration portion is finished and the source is
stopped, we still assert that vhost is also stopped.
Signed-off-by: Jonah Palmer <jonah.palmer@oracle.com>
---
hw/net/virtio-net.c | 23 ++++++++++++++++++++++-
include/hw/virtio/virtio.h | 1 +
2 files changed, 23 insertions(+), 1 deletion(-)
diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index b7ac5e8278..07941f991e 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -3753,6 +3753,19 @@ static bool failover_hide_primary_device(DeviceListener *listener,
static int virtio_net_pre_save(void *opaque)
{
VirtIONet *n = opaque;
+ VirtIODevice *vdev = VIRTIO_DEVICE(n);
+
+ /*
+ * During iterative migration, vhost will still be active. However,
+ * this shouldn't be an issue since we don't care about the dynamic
+ * ring states at this point.
+ *
+ * The final migration at the end will still occur with vhost stopped
+ * and any inconsistencies will be overwritten.
+ */
+ if (vdev->migration && !vdev->migration->iterative_vmstate_sent) {
+ return 0;
+ }
/* At this point, backend must be stopped, otherwise
* it might keep writing to memory. */
@@ -3809,11 +3822,16 @@ static bool virtio_net_is_active(void *opaque)
static int virtio_net_save_setup(QEMUFile *f, void *opaque, Error **errp)
{
VirtIONet *n = opaque;
+ VirtIODevice *vdev = VIRTIO_DEVICE(n);
+ vdev->migration = g_new0(VirtIODevMigration, 1);
+ vdev->migration->iterative_vmstate_sent = false;
qemu_put_be64(f, VNET_MIG_F_INIT_STATE);
vmstate_save_state(f, &vmstate_virtio_net, n, NULL);
qemu_put_be64(f, VNET_MIG_F_END_DATA);
+ vdev->migration->iterative_vmstate_sent = true;
+
return 0;
}
@@ -3838,7 +3856,10 @@ static int virtio_net_save_live_complete_precopy(QEMUFile *f, void *opaque)
static void virtio_net_save_cleanup(void *opaque)
{
-
+ VirtIONet *n = opaque;
+ VirtIODevice *vdev = VIRTIO_DEVICE(n);
+ g_free(vdev->migration);
+ vdev->migration = NULL;
}
static int virtio_net_load_setup(QEMUFile *f, void *opaque, Error **errp)
diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
index 06b6e6ba65..aa3f60cb7b 100644
--- a/include/hw/virtio/virtio.h
+++ b/include/hw/virtio/virtio.h
@@ -101,6 +101,7 @@ enum virtio_device_endian {
/* VirtIODevice iterative live migration data structure */
typedef struct VirtIODevMigration {
bool iterative_vmstate_loaded;
+ bool iterative_vmstate_sent;
} VirtIODevMigration;
/**
--
2.47.1
* Re: [RFC 0/6] virtio-net: initial iterative live migration support
2025-07-22 12:41 [RFC 0/6] virtio-net: initial iterative live migration support Jonah Palmer
` (5 preceding siblings ...)
2025-07-22 12:41 ` [RFC 6/6] virtio-net: skip vhost_started assertion during " Jonah Palmer
@ 2025-07-23 5:51 ` Jason Wang
2025-07-24 21:59 ` Jonah Palmer
6 siblings, 1 reply; 66+ messages in thread
From: Jason Wang @ 2025-07-23 5:51 UTC (permalink / raw)
To: Jonah Palmer
Cc: qemu-devel, peterx, farosas, eblake, armbru, mst, si-wei.liu,
eperezma, boris.ostrovsky
On Tue, Jul 22, 2025 at 8:41 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
>
> This series is an RFC initial implementation of iterative live
> migration for virtio-net devices.
>
> The main motivation behind implementing iterative migration for
> virtio-net devices is to start on heavy, time-consuming operations
> for the destination while the source is still active (i.e. before
> the stop-and-copy phase).
It would be better to explain which kind of operations were heavy and
time-consuming and how iterative migration helps.
>
> The motivation behind this RFC series specifically is to provide an
> initial framework for such an implementation and get feedback on the
> design and direction.
> -------
>
> This implementation of iterative live migration for a virtio-net device
> is enabled via setting the migration capability 'virtio-iterative' to
> on for both the source & destination, e.g. (HMP):
>
> (qemu) migrate_set_capability virtio-iterative on
>
> The virtio-net device's SaveVMHandlers hooks are registered/unregistered
> during the device's realize/unrealize phase.
I wonder about the plan for libvirt support.
>
> Currently, this series only sends and loads the vmstate at the start of
> migration. The vmstate is still sent (again) during the stop-and-copy
> phase, as it is today, to handle any deltas in the state since it was
> initially sent. A future patch in this series could avoid having to
> re-send and re-load the entire state again and instead focus only on the
> deltas.
>
> There is a slight, modest improvement in guest-visible downtime from
> this series. More specifically, when using iterative live migration with
> a virtio-net device, the downtime contributed by migrating a virtio-net
> device decreased from ~3.2ms to ~1.4ms on average:
Are you testing this via a software virtio device or hardware one?
>
> Before:
> -------
> vmstate_downtime_load type=non-iterable idstr=0000:00:03.0/virtio-net
> instance_id=0 downtime=3594
>
> After:
> ------
> vmstate_downtime_load type=non-iterable idstr=0000:00:03.0/virtio-net
> instance_id=0 downtime=1607
>
> This slight improvement is likely due to the initial vmstate_load_state
> call "warming up" pages in memory such that, when it's called a second
> time during the stop-and-copy phase, allocation and page-fault latencies
> are reduced.
> -------
>
> Comments, suggestions, etc. are welcome here.
>
> Jonah Palmer (6):
> migration: Add virtio-iterative capability
> virtio-net: Reorder vmstate_virtio_net and helpers
> virtio-net: Add SaveVMHandlers for iterative migration
> virtio-net: iter live migration - migrate vmstate
> virtio,virtio-net: skip consistency check in virtio_load for iterative
> migration
> virtio-net: skip vhost_started assertion during iterative migration
>
> hw/net/virtio-net.c | 246 +++++++++++++++++++++++++++------
> hw/virtio/virtio.c | 32 +++--
> include/hw/virtio/virtio-net.h | 8 ++
> include/hw/virtio/virtio.h | 7 +
> migration/savevm.c | 1 +
> qapi/migration.json | 7 +-
> 6 files changed, 247 insertions(+), 54 deletions(-)
>
> --
> 2.47.1
Thanks
>
* Re: [RFC 4/6] virtio-net: iter live migration - migrate vmstate
2025-07-22 12:41 ` [RFC 4/6] virtio-net: iter live migration - migrate vmstate Jonah Palmer
@ 2025-07-23 6:51 ` Michael S. Tsirkin
2025-07-24 14:45 ` Jonah Palmer
0 siblings, 1 reply; 66+ messages in thread
From: Michael S. Tsirkin @ 2025-07-23 6:51 UTC (permalink / raw)
To: Jonah Palmer
Cc: qemu-devel, peterx, farosas, eblake, armbru, jasowang, si-wei.liu,
eperezma, boris.ostrovsky
On Tue, Jul 22, 2025 at 12:41:25PM +0000, Jonah Palmer wrote:
> Lays out the initial groundwork for iteratively migrating the state of a
> virtio-net device, starting with its vmstate (via vmstate_save_state &
> vmstate_load_state).
>
> The original non-iterative vmstate framework still runs during the
> stop-and-copy phase when the guest is paused, which is still necessary
> to migrate over the final state of the virtqueues once the source has
> been paused.
>
> Although the vmstate framework is used twice (once during the iterative
> portion and once during the stop-and-copy phase), it appears that
> there's some modest improvement in guest-visible downtime when using a
> virtio-net device.
>
> When tracing the vmstate_downtime_save and vmstate_downtime_load
> tracepoints, for a virtio-net device using iterative live migration, the
> non-iterative downtime portion improved modestly, going from ~3.2ms to
> ~1.4ms:
>
> Before:
> -------
> vmstate_downtime_load type=non-iterable idstr=0000:00:03.0/virtio-net
> instance_id=0 downtime=3594
>
> After:
> ------
> vmstate_downtime_load type=non-iterable idstr=0000:00:03.0/virtio-net
> instance_id=0 downtime=1607
>
> This improvement is likely due to the initial vmstate_load_state call
> (while the guest is still running) "warming up" all related pages and
> structures on the destination. In other words, by the time the final
> stop-and-copy phase starts, the heavy allocations and page-fault
> latencies are reduced, making the device re-loads slightly faster and
> the guest-visible downtime window slightly smaller.
Did I get it right that it's just the vmstate load for this single device?
If the theory is right, is it not possible that while the
tracepoints are now closer together, you have pushed something
else out of the cache, making the effect on guest visible downtime
unpredictable? How about the total vmstate load time?
> Future patches could improve upon this by skipping the second
> vmstate_save/load_state calls (during the stop-and-copy phase) and
> instead only send deltas right before/after the source is stopped.
>
> Signed-off-by: Jonah Palmer <jonah.palmer@oracle.com>
> ---
> hw/net/virtio-net.c | 37 ++++++++++++++++++++++++++++++++++
> include/hw/virtio/virtio-net.h | 8 ++++++++
> 2 files changed, 45 insertions(+)
>
> diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
> index 19aa5b5936..86a6fe5b91 100644
> --- a/hw/net/virtio-net.c
> +++ b/hw/net/virtio-net.c
> @@ -3808,16 +3808,31 @@ static bool virtio_net_is_active(void *opaque)
>
> static int virtio_net_save_setup(QEMUFile *f, void *opaque, Error **errp)
> {
> + VirtIONet *n = opaque;
> +
> + qemu_put_be64(f, VNET_MIG_F_INIT_STATE);
> + vmstate_save_state(f, &vmstate_virtio_net, n, NULL);
> + qemu_put_be64(f, VNET_MIG_F_END_DATA);
> +
> return 0;
> }
>
> static int virtio_net_save_live_iterate(QEMUFile *f, void *opaque)
> {
> + bool new_data = false;
> +
> + if (!new_data) {
> + qemu_put_be64(f, VNET_MIG_F_NO_DATA);
> + return 1;
> + }
> +
> + qemu_put_be64(f, VNET_MIG_F_END_DATA);
> return 1;
> }
>
> static int virtio_net_save_live_complete_precopy(QEMUFile *f, void *opaque)
> {
> + qemu_put_be64(f, VNET_MIG_F_NO_DATA);
> return 0;
> }
>
> @@ -3833,6 +3848,28 @@ static int virtio_net_load_setup(QEMUFile *f, void *opaque, Error **errp)
>
> static int virtio_net_load_state(QEMUFile *f, void *opaque, int version_id)
> {
> + VirtIONet *n = opaque;
> + uint64_t flag;
> +
> + flag = qemu_get_be64(f);
> + if (flag == VNET_MIG_F_NO_DATA) {
> + return 0;
> + }
> +
> + while (flag != VNET_MIG_F_END_DATA) {
> + switch (flag) {
> + case VNET_MIG_F_INIT_STATE:
> + {
> + vmstate_load_state(f, &vmstate_virtio_net, n, VIRTIO_NET_VM_VERSION);
> + break;
> + }
> + default:
> + qemu_log_mask(LOG_GUEST_ERROR, "%s: Unknown flag 0x%"PRIx64, __func__, flag);
> + return -EINVAL;
> + }
> +
> + flag = qemu_get_be64(f);
> + }
> return 0;
> }
>
> diff --git a/include/hw/virtio/virtio-net.h b/include/hw/virtio/virtio-net.h
> index b9ea9e824e..d6c7619053 100644
> --- a/include/hw/virtio/virtio-net.h
> +++ b/include/hw/virtio/virtio-net.h
> @@ -163,6 +163,14 @@ typedef struct VirtIONetQueue {
> struct VirtIONet *n;
> } VirtIONetQueue;
>
> +/*
> + * Flags to be used as unique delimiters for virtio-net devices in the
> + * migration stream.
> + */
> +#define VNET_MIG_F_INIT_STATE (0xffffffffef200000ULL)
> +#define VNET_MIG_F_END_DATA (0xffffffffef200001ULL)
> +#define VNET_MIG_F_NO_DATA (0xffffffffef200002ULL)
> +
> struct VirtIONet {
> VirtIODevice parent_obj;
> uint8_t mac[ETH_ALEN];
> --
> 2.47.1
* Re: [RFC 4/6] virtio-net: iter live migration - migrate vmstate
2025-07-23 6:51 ` Michael S. Tsirkin
@ 2025-07-24 14:45 ` Jonah Palmer
2025-07-25 9:31 ` Michael S. Tsirkin
0 siblings, 1 reply; 66+ messages in thread
From: Jonah Palmer @ 2025-07-24 14:45 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: qemu-devel, peterx, farosas, eblake, armbru, jasowang, si-wei.liu,
eperezma, boris.ostrovsky
On 7/23/25 2:51 AM, Michael S. Tsirkin wrote:
> On Tue, Jul 22, 2025 at 12:41:25PM +0000, Jonah Palmer wrote:
>> Lays out the initial groundwork for iteratively migrating the state of a
>> virtio-net device, starting with its vmstate (via vmstate_save_state &
>> vmstate_load_state).
>>
>> The original non-iterative vmstate framework still runs during the
>> stop-and-copy phase when the guest is paused, which is still necessary
>> to migrate over the final state of the virtqueues once the source has
>> been paused.
>>
>> Although the vmstate framework is used twice (once during the iterative
>> portion and once during the stop-and-copy phase), it appears that
>> there's some modest improvement in guest-visible downtime when using a
>> virtio-net device.
>>
>> When tracing the vmstate_downtime_save and vmstate_downtime_load
>> tracepoints, for a virtio-net device using iterative live migration, the
>> non-iterative downtime portion improved modestly, going from ~3.2ms to
>> ~1.4ms:
>>
>> Before:
>> -------
>> vmstate_downtime_load type=non-iterable idstr=0000:00:03.0/virtio-net
>> instance_id=0 downtime=3594
>>
>> After:
>> ------
>> vmstate_downtime_load type=non-iterable idstr=0000:00:03.0/virtio-net
>> instance_id=0 downtime=1607
>>
>> This improvement is likely due to the initial vmstate_load_state call
>> (while the guest is still running) "warming up" all related pages and
>> structures on the destination. In other words, by the time the final
>> stop-and-copy phase starts, the heavy allocations and page-fault
>> latencies are reduced, making the device re-loads slightly faster and
>> the guest-visible downtime window slightly smaller.
>
> Did I get it right that it's just the vmstate load for this single device?
> If the theory is right, is it not possible that while the
> tracepoints are now closer together, you have pushed something
> else out of the cache, making the effect on guest visible downtime
> unpredictable? How about the total vmstate load time?
>
Correct, the data above is just from the virtio-net device's downtime
contribution (specifically during the stop-and-copy phase).
Theoretically, yes, I believe so. To try and get a feel for this, I ran
some slightly heavier testing for the virtio-net device: vhost-net + 4
queue pairs (the one above was just a virtio-net device with 1 queue pair).
I traced the reported downtimes of the devices that come right before
and after virtio-net's vmstate_load_state call with and without
iterative migration on the virtio-net device.
The downtimes below are all from the vmstate_load_state calls that
happen while the source has been stopped:
With iterative migration for virtio-net:
----------------------------------------
vga: 1.50ms | 1.39ms | 1.37ms | 1.50ms | 1.63ms |
virtio-console: 13.78ms | 14.24ms | 13.74ms | 13.89ms | 13.60ms |
virtio-net: 13.91ms | 13.52ms | 13.09ms | 13.59ms | 13.37ms |
virtio-scsi: 18.71ms | 13.96ms | 14.05ms | 16.55ms | 14.30ms |
vga: Avg. 1.47ms | Var: 0.0109ms² | Std. Dev (σ): 0.104ms
virtio-console: Avg. 13.85ms | Var: 0.0583ms² | Std. Dev (σ): 0.241ms
virtio-net: Avg. 13.49ms | Var: 0.0904ms² | Std. Dev (σ): 0.301ms
virtio-scsi: Avg. 15.51ms | Var: 4.3299ms² | Std. Dev (σ): 2.081ms
Without iterative migration for virtio-net:
-------------------------------------------
vga: 1.47ms | 1.28ms | 1.55ms | 1.36ms | 1.22ms |
virtio-console: 13.39ms | 13.40ms | 14.37ms | 13.93ms | 13.36ms |
virtio-net: 18.52ms | 17.77ms | 17.52ms | 15.52ms | 17.32ms |
virtio-scsi: 13.35ms | 13.94ms | 15.17ms | 16.01ms | 14.08ms |
vga: Avg. 1.37ms | Var: 0.0182ms² | Std. Dev (σ): 0.135ms
virtio-console: Avg. 13.69ms | Var: 0.2007ms² | Std. Dev (σ): 0.448ms
virtio-net: Avg. 17.33ms | Var: 1.2305ms² | Std. Dev (σ): 1.109ms
virtio-scsi: Avg. 14.51ms | Var: 1.1352ms² | Std. Dev (σ): 1.065ms
The most notable difference here is the standard deviation of
virtio-scsi's migration downtime, which comes after virtio-net's
migration: virtio-scsi's σ rises from ~1.07ms to ~2.08ms when virtio-net
is iteratively migrated.
However, since I only got 5 samples per device, the trend is indicative
but not definitive.
Total vmstate load time per device ≈ downtimes reported above, unless
you're referring to overall downtime across all devices?
----------
Having said all this, this RFC is just an initial first step for
iterative migration of a virtio-net device. This second
vmstate_load_state call during the stop-and-copy phase isn't optimal. A
future version of this series could do away with this second call and
only send the deltas instead of the entire state again.
>> Future patches could improve upon this by skipping the second
>> vmstate_save/load_state calls (during the stop-and-copy phase) and
>> instead only send deltas right before/after the source is stopped.
>>
>> Signed-off-by: Jonah Palmer <jonah.palmer@oracle.com>
>> ---
>> hw/net/virtio-net.c | 37 ++++++++++++++++++++++++++++++++++
>> include/hw/virtio/virtio-net.h | 8 ++++++++
>> 2 files changed, 45 insertions(+)
>>
>> diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
>> index 19aa5b5936..86a6fe5b91 100644
>> --- a/hw/net/virtio-net.c
>> +++ b/hw/net/virtio-net.c
>> @@ -3808,16 +3808,31 @@ static bool virtio_net_is_active(void *opaque)
>>
>> static int virtio_net_save_setup(QEMUFile *f, void *opaque, Error **errp)
>> {
>> + VirtIONet *n = opaque;
>> +
>> + qemu_put_be64(f, VNET_MIG_F_INIT_STATE);
>> + vmstate_save_state(f, &vmstate_virtio_net, n, NULL);
>> + qemu_put_be64(f, VNET_MIG_F_END_DATA);
>> +
>> return 0;
>> }
>>
>> static int virtio_net_save_live_iterate(QEMUFile *f, void *opaque)
>> {
>> + bool new_data = false;
>> +
>> + if (!new_data) {
>> + qemu_put_be64(f, VNET_MIG_F_NO_DATA);
>> + return 1;
>> + }
>> +
>> + qemu_put_be64(f, VNET_MIG_F_END_DATA);
>> return 1;
>> }
>>
>> static int virtio_net_save_live_complete_precopy(QEMUFile *f, void *opaque)
>> {
>> + qemu_put_be64(f, VNET_MIG_F_NO_DATA);
>> return 0;
>> }
>>
>> @@ -3833,6 +3848,28 @@ static int virtio_net_load_setup(QEMUFile *f, void *opaque, Error **errp)
>>
>> static int virtio_net_load_state(QEMUFile *f, void *opaque, int version_id)
>> {
>> + VirtIONet *n = opaque;
>> + uint64_t flag;
>> +
>> + flag = qemu_get_be64(f);
>> + if (flag == VNET_MIG_F_NO_DATA) {
>> + return 0;
>> + }
>> +
>> + while (flag != VNET_MIG_F_END_DATA) {
>> + switch (flag) {
>> + case VNET_MIG_F_INIT_STATE:
>> + {
>> + vmstate_load_state(f, &vmstate_virtio_net, n, VIRTIO_NET_VM_VERSION);
>> + break;
>> + }
>> + default:
>> + qemu_log_mask(LOG_GUEST_ERROR, "%s: Unknown flag 0x%"PRIx64, __func__, flag);
>> + return -EINVAL;
>> + }
>> +
>> + flag = qemu_get_be64(f);
>> + }
>> return 0;
>> }
>>
>> diff --git a/include/hw/virtio/virtio-net.h b/include/hw/virtio/virtio-net.h
>> index b9ea9e824e..d6c7619053 100644
>> --- a/include/hw/virtio/virtio-net.h
>> +++ b/include/hw/virtio/virtio-net.h
>> @@ -163,6 +163,14 @@ typedef struct VirtIONetQueue {
>> struct VirtIONet *n;
>> } VirtIONetQueue;
>>
>> +/*
>> + * Flags to be used as unique delimiters for virtio-net devices in the
>> + * migration stream.
>> + */
>> +#define VNET_MIG_F_INIT_STATE (0xffffffffef200000ULL)
>> +#define VNET_MIG_F_END_DATA (0xffffffffef200001ULL)
>> +#define VNET_MIG_F_NO_DATA (0xffffffffef200002ULL)
>> +
>> struct VirtIONet {
>> VirtIODevice parent_obj;
>> uint8_t mac[ETH_ALEN];
>> --
>> 2.47.1
>
* Re: [RFC 0/6] virtio-net: initial iterative live migration support
2025-07-23 5:51 ` [RFC 0/6] virtio-net: initial iterative live migration support Jason Wang
@ 2025-07-24 21:59 ` Jonah Palmer
2025-07-25 9:18 ` Lei Yang
2025-07-25 9:33 ` Michael S. Tsirkin
0 siblings, 2 replies; 66+ messages in thread
From: Jonah Palmer @ 2025-07-24 21:59 UTC (permalink / raw)
To: Jason Wang
Cc: qemu-devel, peterx, farosas, eblake, armbru, mst, si-wei.liu,
eperezma, boris.ostrovsky
On 7/23/25 1:51 AM, Jason Wang wrote:
> On Tue, Jul 22, 2025 at 8:41 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
>>
>> This series is an RFC initial implementation of iterative live
>> migration for virtio-net devices.
>>
>> The main motivation behind implementing iterative migration for
>> virtio-net devices is to start on heavy, time-consuming operations
>> for the destination while the source is still active (i.e. before
>> the stop-and-copy phase).
>
> It would be better to explain which kind of operations were heavy and
> time-consuming and how iterative migration helps.
>
You're right. Apologies for being vague here.
I did do some profiling of the virtio_load call for virtio-net to try
and narrow down where exactly most of the downtime is coming from during
the stop-and-copy phase.
Pretty much the entirety of the downtime comes from the
vmstate_load_state call for the vmstate_virtio's subsections:
/* Subsections */
ret = vmstate_load_state(f, &vmstate_virtio, vdev, 1);
if (ret) {
return ret;
}
More specifically, the vmstate_virtio_virtqueues and
vmstate_virtio_extra_state subsections.
For example, currently (with no iterative migration), for a virtio-net
device, the virtio_load call took 13.29ms to finish. 13.20ms of that
time was spent in vmstate_load_state(f, &vmstate_virtio, vdev, 1).
Of that 13.21ms, ~6.83ms was spent migrating vmstate_virtio_virtqueues
and ~6.33ms was spent migrating the vmstate_virtio_extra_state
subsections. And I believe this is from walking VIRTIO_QUEUE_MAX
virtqueues, twice.
vmstate_load_state virtio-net v11
vmstate_load_state PCIDevice v2
vmstate_load_state_end PCIDevice end/0
vmstate_load_state virtio-net-device v11
vmstate_load_state virtio-net-queue-tx_waiting v0
vmstate_load_state_end virtio-net-queue-tx_waiting end/0
vmstate_load_state virtio-net-vnet v0
vmstate_load_state_end virtio-net-vnet end/0
vmstate_load_state virtio-net-ufo v0
vmstate_load_state_end virtio-net-ufo end/0
vmstate_load_state virtio-net-tx_waiting v0
vmstate_load_state virtio-net-queue-tx_waiting v0
vmstate_load_state_end virtio-net-queue-tx_waiting end/0
vmstate_load_state virtio-net-queue-tx_waiting v0
vmstate_load_state_end virtio-net-queue-tx_waiting end/0
vmstate_load_state virtio-net-queue-tx_waiting v0
vmstate_load_state_end virtio-net-queue-tx_waiting end/0
vmstate_load_state_end virtio-net-tx_waiting end/0
vmstate_load_state_end virtio-net-device end/0
vmstate_load_state virtio v1
vmstate_load_state virtio/64bit_features v1
vmstate_load_state_end virtio/64bit_features end/0
vmstate_load_state virtio/virtqueues v1
vmstate_load_state virtqueue_state v1 <--- Queue idx 0
...
vmstate_load_state_end virtqueue_state end/0
vmstate_load_state virtqueue_state v1 <--- Queue idx 1023
vmstate_load_state_end virtqueue_state end/0
vmstate_load_state_end virtio/virtqueues end/0
vmstate_load_state virtio/extra_state v1
vmstate_load_state virtio_pci v1
vmstate_load_state virtio_pci/modern_state v1
vmstate_load_state virtio_pci/modern_queue_state v1 <--- Queue idx 0
vmstate_load_state_end virtio_pci/modern_queue_state end/0
...
vmstate_load_state virtio_pci/modern_queue_state v1 <--- Queue idx 1023
vmstate_load_state_end virtio_pci/modern_queue_state end/0
vmstate_load_state_end virtio_pci/modern_state end/0
vmstate_load_state_end virtio_pci end/0
vmstate_load_state_end virtio/extra_state end/0
vmstate_load_state virtio/started v1
vmstate_load_state_end virtio/started end/0
vmstate_load_state_end virtio end/0
vmstate_load_state_end virtio-net end/0
vmstate_downtime_load type=non-iterable idstr=0000:00:03.0/virtio-net
instance_id=0 downtime=13260
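In other words, the per-queue loading in those two subsections behaves
roughly like the sketch below (a paraphrase, not the exact code in
hw/virtio/virtio.c; the load_one_* helpers are placeholder names), where
VIRTIO_QUEUE_MAX is 1024, matching the queue idx 0..1023 walk in the
trace above:
    /* Illustrative sketch: both subsections iterate over every possible
     * queue, not just the queues the guest actually configured. */
    for (i = 0; i < VIRTIO_QUEUE_MAX; i++) {
        load_one_virtqueue_state(f, &vdev->vq[i]);      /* virtio/virtqueues */
    }
    for (i = 0; i < VIRTIO_QUEUE_MAX; i++) {
        load_one_modern_queue_state(f, &proxy->vqs[i]); /* virtio_pci/modern_state */
    }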
With iterative migration for virtio-net (maybe all virtio devices?), we
can send this early while the source is still running and then only send
the deltas during the stop-and-copy phase. It's likely that the source
won't be using all VIRTIO_QUEUE_MAX virtqueues during the migration
period, so this could eliminate a large majority of the downtime
contributed by virtio-net.
This could be one example.
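To make the delta idea concrete, one purely illustrative shape for the
final-stage handler (not something this series implements; the
VNET_MIG_F_DELTA_STATE flag is hypothetical) could be to send only the
ring indices of the queues actually in use:
static int virtio_net_save_live_complete_precopy(QEMUFile *f, void *opaque)
{
    VirtIONet *n = opaque;
    VirtIODevice *vdev = VIRTIO_DEVICE(n);
    int i;
    /* Hypothetical sketch: send per-queue deltas for in-use queues only,
     * instead of re-sending the full vmstate during stop-and-copy. */
    qemu_put_be64(f, VNET_MIG_F_DELTA_STATE);
    for (i = 0; i < VIRTIO_QUEUE_MAX; i++) {
        if (!virtio_queue_get_desc_addr(vdev, i)) {
            continue;   /* queue not configured by the guest */
        }
        qemu_put_be32(f, i);
        qemu_put_be16(f, virtio_queue_get_last_avail_idx(vdev, i));
    }
    qemu_put_be64(f, VNET_MIG_F_END_DATA);
    return 0;
}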
>>
>> The motivation behind this RFC series specifically is to provide an
>> initial framework for such an implementation and get feedback on the
>> design and direction.
>> -------
>>
>> This implementation of iterative live migration for a virtio-net device
>> is enabled via setting the migration capability 'virtio-iterative' to
>> on for both the source & destination, e.g. (HMP):
>>
>> (qemu) migrate_set_capability virtio-iterative on
>>
>> The virtio-net device's SaveVMHandlers hooks are registered/unregistered
>> during the device's realize/unrealize phase.
>
> I wonder about the plan for libvirt support.
>
Could you elaborate on this a bit?
>>
>> Currently, this series only sends and loads the vmstate at the start of
>> migration. The vmstate is still sent (again) during the stop-and-copy
>> phase, as it is today, to handle any deltas in the state since it was
>> initially sent. A future patch in this series could avoid having to
>> re-send and re-load the entire state again and instead focus only on the
>> deltas.
>>
>> There is a slight, modest improvement in guest-visible downtime from
>> this series. More specifically, when using iterative live migration with
>> a virtio-net device, the downtime contributed by migrating a virtio-net
>> device decreased from ~3.2ms to ~1.4ms on average:
>
> Are you testing this via a software virtio device or hardware one?
>
Just software (virtio-device, vhost-net) with these numbers. I can run
some tests with vDPA hardware though.
Those numbers were from a simple, 1 queue-pair virtio-net device.
>>
>> Before:
>> -------
>> vmstate_downtime_load type=non-iterable idstr=0000:00:03.0/virtio-net
>> instance_id=0 downtime=3594
>>
>> After:
>> ------
>> vmstate_downtime_load type=non-iterable idstr=0000:00:03.0/virtio-net
>> instance_id=0 downtime=1607
>>
>> This slight improvement is likely due to the initial vmstate_load_state
>> call "warming up" pages in memory such that, when it's called a second
>> time during the stop-and-copy phase, allocation and page-fault latencies
>> are reduced.
>> -------
>>
>> Comments, suggestions, etc. are welcome here.
>>
>> Jonah Palmer (6):
>> migration: Add virtio-iterative capability
>> virtio-net: Reorder vmstate_virtio_net and helpers
>> virtio-net: Add SaveVMHandlers for iterative migration
>> virtio-net: iter live migration - migrate vmstate
>> virtio,virtio-net: skip consistency check in virtio_load for iterative
>> migration
>> virtio-net: skip vhost_started assertion during iterative migration
>>
>> hw/net/virtio-net.c | 246 +++++++++++++++++++++++++++------
>> hw/virtio/virtio.c | 32 +++--
>> include/hw/virtio/virtio-net.h | 8 ++
>> include/hw/virtio/virtio.h | 7 +
>> migration/savevm.c | 1 +
>> qapi/migration.json | 7 +-
>> 6 files changed, 247 insertions(+), 54 deletions(-)
>>
>> --
>> 2.47.1
>
> Thanks
>
>>
>
* Re: [RFC 0/6] virtio-net: initial iterative live migration support
2025-07-24 21:59 ` Jonah Palmer
@ 2025-07-25 9:18 ` Lei Yang
2025-07-25 9:33 ` Michael S. Tsirkin
1 sibling, 0 replies; 66+ messages in thread
From: Lei Yang @ 2025-07-25 9:18 UTC (permalink / raw)
To: Jonah Palmer
Cc: Jason Wang, qemu-devel, peterx, farosas, eblake, armbru, mst,
si-wei.liu, eperezma, boris.ostrovsky
Tested this series of patches with virtio-net regression tests;
everything works fine.
Tested-by: Lei Yang <leiyang@redhat.com>
On Fri, Jul 25, 2025 at 6:01 AM Jonah Palmer <jonah.palmer@oracle.com> wrote:
>
>
>
> On 7/23/25 1:51 AM, Jason Wang wrote:
> > On Tue, Jul 22, 2025 at 8:41 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
> >>
> >> This series is an RFC initial implementation of iterative live
> >> migration for virtio-net devices.
> >>
> >> The main motivation behind implementing iterative migration for
> >> virtio-net devices is to start on heavy, time-consuming operations
> >> for the destination while the source is still active (i.e. before
> >> the stop-and-copy phase).
> >
> > It would be better to explain which kind of operations were heavy and
> > time-consuming and how iterative migration helps.
> >
>
> You're right. Apologies for being vague here.
>
> I did do some profiling of the virtio_load call for virtio-net to try
> and narrow down where exactly most of the downtime is coming from during
> the stop-and-copy phase.
>
> Pretty much the entirety of the downtime comes from the
> vmstate_load_state call for the vmstate_virtio's subsections:
>
> /* Subsections */
> ret = vmstate_load_state(f, &vmstate_virtio, vdev, 1);
> if (ret) {
> return ret;
> }
>
> More specifically, the vmstate_virtio_virtqueues and
> vmstate_virtio_extra_state subsections.
>
> For example, currently (with no iterative migration), for a virtio-net
> device, the virtio_load call took 13.29ms to finish. 13.20ms of that
> time was spent in vmstate_load_state(f, &vmstate_virtio, vdev, 1).
>
> Of that 13.21ms, ~6.83ms was spent migrating vmstate_virtio_virtqueues
> and ~6.33ms was spent migrating the vmstate_virtio_extra_state
> subsections. And I believe this is from walking VIRTIO_QUEUE_MAX
> virtqueues, twice.
>
> vmstate_load_state virtio-net v11
> vmstate_load_state PCIDevice v2
> vmstate_load_state_end PCIDevice end/0
> vmstate_load_state virtio-net-device v11
> vmstate_load_state virtio-net-queue-tx_waiting v0
> vmstate_load_state_end virtio-net-queue-tx_waiting end/0
> vmstate_load_state virtio-net-vnet v0
> vmstate_load_state_end virtio-net-vnet end/0
> vmstate_load_state virtio-net-ufo v0
> vmstate_load_state_end virtio-net-ufo end/0
> vmstate_load_state virtio-net-tx_waiting v0
> vmstate_load_state virtio-net-queue-tx_waiting v0
> vmstate_load_state_end virtio-net-queue-tx_waiting end/0
> vmstate_load_state virtio-net-queue-tx_waiting v0
> vmstate_load_state_end virtio-net-queue-tx_waiting end/0
> vmstate_load_state virtio-net-queue-tx_waiting v0
> vmstate_load_state_end virtio-net-queue-tx_waiting end/0
> vmstate_load_state_end virtio-net-tx_waiting end/0
> vmstate_load_state_end virtio-net-device end/0
> vmstate_load_state virtio v1
> vmstate_load_state virtio/64bit_features v1
> vmstate_load_state_end virtio/64bit_features end/0
> vmstate_load_state virtio/virtqueues v1
> vmstate_load_state virtqueue_state v1 <--- Queue idx 0
> ...
> vmstate_load_state_end virtqueue_state end/0
> vmstate_load_state virtqueue_state v1 <--- Queue idx 1023
> vmstate_load_state_end virtqueue_state end/0
> vmstate_load_state_end virtio/virtqueues end/0
> vmstate_load_state virtio/extra_state v1
> vmstate_load_state virtio_pci v1
> vmstate_load_state virtio_pci/modern_state v1
> vmstate_load_state virtio_pci/modern_queue_state v1 <--- Queue idx 0
> vmstate_load_state_end virtio_pci/modern_queue_state end/0
> ...
> vmstate_load_state virtio_pci/modern_queue_state v1 <--- Queue idx 1023
> vmstate_load_state_end virtio_pci/modern_queue_state end/0
> vmstate_load_state_end virtio_pci/modern_state end/0
> vmstate_load_state_end virtio_pci end/0
> vmstate_load_state_end virtio/extra_state end/0
> vmstate_load_state virtio/started v1
> vmstate_load_state_end virtio/started end/0
> vmstate_load_state_end virtio end/0
> vmstate_load_state_end virtio-net end/0
> vmstate_downtime_load type=non-iterable idstr=0000:00:03.0/virtio-net
> instance_id=0 downtime=13260
>
> With iterative migration for virtio-net (maybe all virtio devices?), we
> can send this early while the source is still running and then only send
> the deltas during the stop-and-copy phase. It's likely that the source
> won't be using all VIRTIO_QUEUE_MAX virtqueues during the migration
> period, so this could really minimize a large majority of the downtime
> contributed by virtio-net.
>
> This could be one example.
>
> >>
> >> The motivation behind this RFC series specifically is to provide an
> >> initial framework for such an implementation and get feedback on the
> >> design and direction.
> >> -------
> >>
> >> This implementation of iterative live migration for a virtio-net device
> >> is enabled via setting the migration capability 'virtio-iterative' to
> >> on for both the source & destination, e.g. (HMP):
> >>
> >> (qemu) migrate_set_capability virtio-iterative on
> >>
> >> The virtio-net device's SaveVMHandlers hooks are registered/unregistered
> >> during the device's realize/unrealize phase.
> >
> > I wonder about the plan for libvirt support.
> >
>
> Could you elaborate on this a bit?
>
> >>
> >> Currently, this series only sends and loads the vmstate at the start of
> >> migration. The vmstate is still sent (again) during the stop-and-copy
> >> phase, as it is today, to handle any deltas in the state since it was
> >> initially sent. A future patch in this series could avoid having to
> >> re-send and re-load the entire state again and instead focus only on the
> >> deltas.
> >>
> >> There is a slight, modest improvement in guest-visible downtime from
> >> this series. More specifically, when using iterative live migration with
> >> a virtio-net device, the downtime contributed by migrating a virtio-net
> >> device decreased from ~3.2ms to ~1.4ms on average:
> >
> > Are you testing this via a software virtio device or hardware one?
> >
>
> Just software (virtio-device, vhost-net) with these numbers. I can run
> some tests with vDPA hardware though.
>
> Those numbers were from a simple, 1 queue-pair virtio-net device.
>
> >>
> >> Before:
> >> -------
> >> vmstate_downtime_load type=non-iterable idstr=0000:00:03.0/virtio-net
> >> instance_id=0 downtime=3594
> >>
> >> After:
> >> ------
> >> vmstate_downtime_load type=non-iterable idstr=0000:00:03.0/virtio-net
> >> instance_id=0 downtime=1607
> >>
> >> This slight improvement is likely due to the initial vmstate_load_state
> >> call "warming up" pages in memory such that, when it's called a second
> >> time during the stop-and-copy phase, allocation and page-fault latencies
> >> are reduced.
> >> -------
> >>
> >> Comments, suggestions, etc. are welcome here.
> >>
> >> Jonah Palmer (6):
> >> migration: Add virtio-iterative capability
> >> virtio-net: Reorder vmstate_virtio_net and helpers
> >> virtio-net: Add SaveVMHandlers for iterative migration
> >> virtio-net: iter live migration - migrate vmstate
> >> virtio,virtio-net: skip consistency check in virtio_load for iterative
> >> migration
> >> virtio-net: skip vhost_started assertion during iterative migration
> >>
> >> hw/net/virtio-net.c | 246 +++++++++++++++++++++++++++------
> >> hw/virtio/virtio.c | 32 +++--
> >> include/hw/virtio/virtio-net.h | 8 ++
> >> include/hw/virtio/virtio.h | 7 +
> >> migration/savevm.c | 1 +
> >> qapi/migration.json | 7 +-
> >> 6 files changed, 247 insertions(+), 54 deletions(-)
> >>
> >> --
> >> 2.47.1
> >
> > Thanks
> >
> >>
> >
>
>
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC 4/6] virtio-net: iter live migration - migrate vmstate
2025-07-24 14:45 ` Jonah Palmer
@ 2025-07-25 9:31 ` Michael S. Tsirkin
2025-07-28 12:30 ` Jonah Palmer
0 siblings, 1 reply; 66+ messages in thread
From: Michael S. Tsirkin @ 2025-07-25 9:31 UTC (permalink / raw)
To: Jonah Palmer
Cc: qemu-devel, peterx, farosas, eblake, armbru, jasowang, si-wei.liu,
eperezma, boris.ostrovsky
On Thu, Jul 24, 2025 at 10:45:34AM -0400, Jonah Palmer wrote:
>
>
> On 7/23/25 2:51 AM, Michael S. Tsirkin wrote:
> > On Tue, Jul 22, 2025 at 12:41:25PM +0000, Jonah Palmer wrote:
> > > Lays out the initial groundwork for iteratively migrating the state of a
> > > virtio-net device, starting with its vmstate (via vmstate_save_state &
> > > vmstate_load_state).
> > >
> > > The original non-iterative vmstate framework still runs during the
> > > stop-and-copy phase when the guest is paused, which is still necessary
> > > to migrate over the final state of the virtqueues once the source has
> > > been paused.
> > >
> > > Although the vmstate framework is used twice (once during the iterative
> > > portion and once during the stop-and-copy phase), it appears that
> > > there's some modest improvement in guest-visible downtime when using a
> > > virtio-net device.
> > >
> > > When tracing the vmstate_downtime_save and vmstate_downtime_load
> > > tracepoints, for a virtio-net device using iterative live migration, the
> > > non-iterative downtime portion improved modestly, going from ~3.2ms to
> > > ~1.4ms:
> > >
> > > Before:
> > > -------
> > > vmstate_downtime_load type=non-iterable idstr=0000:00:03.0/virtio-net
> > > instance_id=0 downtime=3594
> > >
> > > After:
> > > ------
> > > vmstate_downtime_load type=non-iterable idstr=0000:00:03.0/virtio-net
> > > instance_id=0 downtime=1607
> > >
> > > This improvement is likely due to the initial vmstate_load_state call
> > > (while the guest is still running) "warming up" all related pages and
> > > structures on the destination. In other words, by the time the final
> > > stop-and-copy phase starts, the heavy allocations and page-fault
> > > latencies are reduced, making the device re-loads slightly faster and
> > > the guest-visible downtime window slightly smaller.
> >
> > did I get it right it's just the vmstate load for this single device?
> > If the theory is right, is it not possible that while the
> > tracepoints are now closer together, you have pushed something
> > else out of the cache, making the effect on guest visible downtime
> > unpredictable? how about the total vmstate load time?
> >
>
> Correct, the data above is just from the virtio-net device's downtime
> contribution (specifically during the stop-and-copy phase).
>
> Theoretically, yes I believe so. To try and get a feel on this, I ran some
> slightly heavier testing for the virtio-net device: vhost-net + 4 queue
> pairs (the one above was just a virtio-net device with 1 queue pair).
>
> I traced the reported downtimes of the devices that come right before and
> after virtio-net's vmstate_load_state call with and without iterative
> migration on the virtio-net device.
>
> The downtimes below are all from the vmstate_load_state calls that happen
> while the source has been stopped:
>
> With iterative migration for virtio-net:
> ----------------------------------------
> vga: 1.50ms | 1.39ms | 1.37ms | 1.50ms | 1.63ms |
> virtio-console: 13.78ms | 14.24ms | 13.74ms | 13.89ms | 13.60ms |
> virtio-net: 13.91ms | 13.52ms | 13.09ms | 13.59ms | 13.37ms |
> virtio-scsi: 18.71ms | 13.96ms | 14.05ms | 16.55ms | 14.30ms |
>
> vga: Avg. 1.47ms | Var: 0.0109ms² | Std. Dev (σ): 0.104ms
> virtio-console: Avg. 13.85ms | Var: 0.0583ms² | Std. Dev (σ): 0.241ms
> virtio-net: Avg. 13.49ms | Var: 0.0904ms² | Std. Dev (σ): 0.301ms
> virtio-scsi: Avg. 15.51ms | Var: 4.3299ms² | Std. Dev (σ): 2.081ms
>
> Without iterative migration for virtio-net:
> -------------------------------------------
> vga: 1.47ms | 1.28ms | 1.55ms | 1.36ms | 1.22ms |
> virtio-console: 13.39ms | 13.40ms | 14.37ms | 13.93ms | 13.36ms |
> virtio-net: 18.52ms | 17.77ms | 17.52ms | 15.52ms | 17.32ms |
> virtio-scsi: 13.35ms | 13.94ms | 15.17ms | 16.01ms | 14.08ms |
>
> vga: Avg. 1.37ms | Var: 0.0182ms² | Std. Dev (σ): 0.135ms
> virtio-console: Avg. 13.69ms | Var: 0.2007ms² | Std. Dev (σ): 0.448ms
> virtio-net: Avg. 17.33ms | Var: 1.2305ms² | Std. Dev (σ): 1.109ms
> virtio-scsi: Avg. 14.51ms | Var: 1.1352ms² | Std. Dev (σ): 1.065ms
>
> The most notable difference here is the standard deviation of virtio-scsi's
> migration downtime, which comes after virtio-net's migration: virtio-scsi's
> σ rises from ~1.07ms to ~2.08ms when virtio-net is iteratively migrated.
>
> However, since I only got 5 samples per device, the trend is indicative but
> not definitive.
>
> Total vmstate load time per device ≈ downtimes reported above, unless you're
> referring to overall downtime across all devices?
Indeed.
I also wonder, if preheating the cache is a big gain, why don't we just
do it for all devices? There is nothing special in virtio: just
call save for all devices, send the state, call load on destination
then call reset to discard the state.
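A minimal sketch of that generic "preheat" idea, assuming hypothetical
per-device hooks and only the existing vmstate_save_state() /
vmstate_load_state() / device_cold_reset() APIs (QEMU includes omitted),
could look like this:

/*
 * Sketch only: provisional save/load hooks that a device could register
 * so its full vmstate is sent and loaded once while the source is still
 * running, then discarded before the real stop-and-copy load. The hook
 * names and the idea of wiring them up for every device are assumptions.
 */
static int preheat_save_setup(QEMUFile *f, void *opaque, Error **errp)
{
    DeviceState *dev = opaque;
    const VMStateDescription *vmsd = DEVICE_GET_CLASS(dev)->vmsd;

    /* Provisional copy of the device state, sent while the guest runs. */
    vmstate_save_state(f, vmsd, dev, NULL);
    return 0;
}

static int preheat_load_state(QEMUFile *f, void *opaque, int version_id)
{
    DeviceState *dev = opaque;
    const VMStateDescription *vmsd = DEVICE_GET_CLASS(dev)->vmsd;
    int ret;

    /* Loading here warms up allocations/page faults on the destination... */
    ret = vmstate_load_state(f, vmsd, dev, version_id);
    if (ret) {
        return ret;
    }

    /* ...and the provisional state is discarded before the final load. */
    device_cold_reset(dev);
    return 0;
}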
> ----------
>
> Having said all this, this RFC is just an initial, first-step for iterative
> migration of a virtio-net device. This second vmstate_load_state call during
> the stop-and-copy phase isn't optimal. A future version of this series could
> do away with this second call and only send the deltas instead of the entire
> state again.
I see how this could be a win, in theory, if the state is big.
> > > Future patches could improve upon this by skipping the second
> > > vmstate_save/load_state calls (during the stop-and-copy phase) and
> > > instead only send deltas right before/after the source is stopped.
> > >
> > > Signed-off-by: Jonah Palmer <jonah.palmer@oracle.com>
> > > ---
> > > hw/net/virtio-net.c | 37 ++++++++++++++++++++++++++++++++++
> > > include/hw/virtio/virtio-net.h | 8 ++++++++
> > > 2 files changed, 45 insertions(+)
> > >
> > > diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
> > > index 19aa5b5936..86a6fe5b91 100644
> > > --- a/hw/net/virtio-net.c
> > > +++ b/hw/net/virtio-net.c
> > > @@ -3808,16 +3808,31 @@ static bool virtio_net_is_active(void *opaque)
> > > static int virtio_net_save_setup(QEMUFile *f, void *opaque, Error **errp)
> > > {
> > > + VirtIONet *n = opaque;
> > > +
> > > + qemu_put_be64(f, VNET_MIG_F_INIT_STATE);
> > > + vmstate_save_state(f, &vmstate_virtio_net, n, NULL);
> > > + qemu_put_be64(f, VNET_MIG_F_END_DATA);
> > > +
> > > return 0;
> > > }
> > > static int virtio_net_save_live_iterate(QEMUFile *f, void *opaque)
> > > {
> > > + bool new_data = false;
> > > +
> > > + if (!new_data) {
> > > + qemu_put_be64(f, VNET_MIG_F_NO_DATA);
> > > + return 1;
> > > + }
> > > +
> > > + qemu_put_be64(f, VNET_MIG_F_END_DATA);
> > > return 1;
> > > }
> > > static int virtio_net_save_live_complete_precopy(QEMUFile *f, void *opaque)
> > > {
> > > + qemu_put_be64(f, VNET_MIG_F_NO_DATA);
> > > return 0;
> > > }
> > > @@ -3833,6 +3848,28 @@ static int virtio_net_load_setup(QEMUFile *f, void *opaque, Error **errp)
> > > static int virtio_net_load_state(QEMUFile *f, void *opaque, int version_id)
> > > {
> > > + VirtIONet *n = opaque;
> > > + uint64_t flag;
> > > +
> > > + flag = qemu_get_be64(f);
> > > + if (flag == VNET_MIG_F_NO_DATA) {
> > > + return 0;
> > > + }
> > > +
> > > + while (flag != VNET_MIG_F_END_DATA) {
> > > + switch (flag) {
> > > + case VNET_MIG_F_INIT_STATE:
> > > + {
> > > + vmstate_load_state(f, &vmstate_virtio_net, n, VIRTIO_NET_VM_VERSION);
> > > + break;
> > > + }
> > > + default:
> > > + qemu_log_mask(LOG_GUEST_ERROR, "%s: Unknown flag 0x%"PRIx64, __func__, flag);
> > > + return -EINVAL;
> > > + }
> > > +
> > > + flag = qemu_get_be64(f);
> > > + }
> > > return 0;
> > > }
> > > diff --git a/include/hw/virtio/virtio-net.h b/include/hw/virtio/virtio-net.h
> > > index b9ea9e824e..d6c7619053 100644
> > > --- a/include/hw/virtio/virtio-net.h
> > > +++ b/include/hw/virtio/virtio-net.h
> > > @@ -163,6 +163,14 @@ typedef struct VirtIONetQueue {
> > > struct VirtIONet *n;
> > > } VirtIONetQueue;
> > > +/*
> > > + * Flags to be used as unique delimiters for virtio-net devices in the
> > > + * migration stream.
> > > + */
> > > +#define VNET_MIG_F_INIT_STATE (0xffffffffef200000ULL)
> > > +#define VNET_MIG_F_END_DATA (0xffffffffef200001ULL)
> > > +#define VNET_MIG_F_NO_DATA (0xffffffffef200002ULL)
> > > +
> > > struct VirtIONet {
> > > VirtIODevice parent_obj;
> > > uint8_t mac[ETH_ALEN];
> > > --
> > > 2.47.1
> >
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC 0/6] virtio-net: initial iterative live migration support
2025-07-24 21:59 ` Jonah Palmer
2025-07-25 9:18 ` Lei Yang
@ 2025-07-25 9:33 ` Michael S. Tsirkin
2025-07-28 7:09 ` Jason Wang
1 sibling, 1 reply; 66+ messages in thread
From: Michael S. Tsirkin @ 2025-07-25 9:33 UTC (permalink / raw)
To: Jonah Palmer
Cc: Jason Wang, qemu-devel, peterx, farosas, eblake, armbru,
si-wei.liu, eperezma, boris.ostrovsky
On Thu, Jul 24, 2025 at 05:59:20PM -0400, Jonah Palmer wrote:
>
>
> On 7/23/25 1:51 AM, Jason Wang wrote:
> > On Tue, Jul 22, 2025 at 8:41 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
> > >
> > > This series is an RFC initial implementation of iterative live
> > > migration for virtio-net devices.
> > >
> > > The main motivation behind implementing iterative migration for
> > > virtio-net devices is to start on heavy, time-consuming operations
> > > for the destination while the source is still active (i.e. before
> > > the stop-and-copy phase).
> >
> > It would be better to explain which kind of operations were heavy and
> > time-consuming and how iterative migration helps.
> >
>
> You're right. Apologies for being vague here.
>
> I did do some profiling of the virtio_load call for virtio-net to try and
> narrow down where exactly most of the downtime is coming from during the
> stop-and-copy phase.
>
> Pretty much the entirety of the downtime comes from the vmstate_load_state
> call for the vmstate_virtio's subsections:
>
> /* Subsections */
> ret = vmstate_load_state(f, &vmstate_virtio, vdev, 1);
> if (ret) {
> return ret;
> }
>
> More specifically, the vmstate_virtio_virtqueues and
> vmstate_virtio_extra_state subsections.
>
> For example, currently (with no iterative migration), for a virtio-net
> device, the virtio_load call took 13.29ms to finish. 13.20ms of that time
> was spent in vmstate_load_state(f, &vmstate_virtio, vdev, 1).
>
> Of that 13.20ms, ~6.83ms was spent migrating vmstate_virtio_virtqueues and
> ~6.33ms was spent migrating the vmstate_virtio_extra_state subsections. And
> I believe this is from walking VIRTIO_QUEUE_MAX virtqueues, twice.
Can we optimize it simply by sending a bitmap of used vqs?
> vmstate_load_state virtio-net v11
> vmstate_load_state PCIDevice v2
> vmstate_load_state_end PCIDevice end/0
> vmstate_load_state virtio-net-device v11
> vmstate_load_state virtio-net-queue-tx_waiting v0
> vmstate_load_state_end virtio-net-queue-tx_waiting end/0
> vmstate_load_state virtio-net-vnet v0
> vmstate_load_state_end virtio-net-vnet end/0
> vmstate_load_state virtio-net-ufo v0
> vmstate_load_state_end virtio-net-ufo end/0
> vmstate_load_state virtio-net-tx_waiting v0
> vmstate_load_state virtio-net-queue-tx_waiting v0
> vmstate_load_state_end virtio-net-queue-tx_waiting end/0
> vmstate_load_state virtio-net-queue-tx_waiting v0
> vmstate_load_state_end virtio-net-queue-tx_waiting end/0
> vmstate_load_state virtio-net-queue-tx_waiting v0
> vmstate_load_state_end virtio-net-queue-tx_waiting end/0
> vmstate_load_state_end virtio-net-tx_waiting end/0
> vmstate_load_state_end virtio-net-device end/0
> vmstate_load_state virtio v1
> vmstate_load_state virtio/64bit_features v1
> vmstate_load_state_end virtio/64bit_features end/0
> vmstate_load_state virtio/virtqueues v1
> vmstate_load_state virtqueue_state v1 <--- Queue idx 0
> ...
> vmstate_load_state_end virtqueue_state end/0
> vmstate_load_state virtqueue_state v1 <--- Queue idx 1023
> vmstate_load_state_end virtqueue_state end/0
> vmstate_load_state_end virtio/virtqueues end/0
> vmstate_load_state virtio/extra_state v1
> vmstate_load_state virtio_pci v1
> vmstate_load_state virtio_pci/modern_state v1
> vmstate_load_state virtio_pci/modern_queue_state v1 <--- Queue idx 0
> vmstate_load_state_end virtio_pci/modern_queue_state end/0
> ...
> vmstate_load_state virtio_pci/modern_queue_state v1 <--- Queue idx 1023
> vmstate_load_state_end virtio_pci/modern_queue_state end/0
> vmstate_load_state_end virtio_pci/modern_state end/0
> vmstate_load_state_end virtio_pci end/0
> vmstate_load_state_end virtio/extra_state end/0
> vmstate_load_state virtio/started v1
> vmstate_load_state_end virtio/started end/0
> vmstate_load_state_end virtio end/0
> vmstate_load_state_end virtio-net end/0
> vmstate_downtime_load type=non-iterable idstr=0000:00:03.0/virtio-net
> instance_id=0 downtime=13260
>
> With iterative migration for virtio-net (maybe all virtio devices?), we can
> send this early while the source is still running and then only send the
> deltas during the stop-and-copy phase. It's likely that the source wont be
> using all VIRTIO_QUEUE_MAX virtqueues during the migration period, so this
> could really minimize a large majority of the downtime contributed by
> virtio-net.
>
> This could be one example.
>
> > >
> > > The motivation behind this RFC series specifically is to provide an
> > > initial framework for such an implementation and get feedback on the
> > > design and direction.
> > > -------
> > >
> > > This implementation of iterative live migration for a virtio-net device
> > > is enabled via setting the migration capability 'virtio-iterative' to
> > > on for both the source & destination, e.g. (HMP):
> > >
> > > (qemu) migrate_set_capability virtio-iterative on
> > >
> > > The virtio-net device's SaveVMHandlers hooks are registered/unregistered
> > > during the device's realize/unrealize phase.
> >
> > I wonder about the plan for libvirt support.
> >
>
> Could you elaborate on this a bit?
>
> > >
> > > Currently, this series only sends and loads the vmstate at the start of
> > > migration. The vmstate is still sent (again) during the stop-and-copy
> > > phase, as it is today, to handle any deltas in the state since it was
> > > initially sent. A future patch in this series could avoid having to
> > > re-send and re-load the entire state again and instead focus only on the
> > > deltas.
> > >
> > > There is a slight, modest improvement in guest-visible downtime from
> > > this series. More specifically, when using iterative live migration with
> > > a virtio-net device, the downtime contributed by migrating a virtio-net
> > > device decreased from ~3.2ms to ~1.4ms on average:
> >
> > Are you testing this via a software virtio device or hardware one?
> >
>
> Just software (virtio-device, vhost-net) with these numbers. I can run some
> tests with vDPA hardware though.
>
> Those numbers were from a simple, 1 queue-pair virtio-net device.
>
> > >
> > > Before:
> > > -------
> > > vmstate_downtime_load type=non-iterable idstr=0000:00:03.0/virtio-net
> > > instance_id=0 downtime=3594
> > >
> > > After:
> > > ------
> > > vmstate_downtime_load type=non-iterable idstr=0000:00:03.0/virtio-net
> > > instance_id=0 downtime=1607
> > >
> > > This slight improvement is likely due to the initial vmstate_load_state
> > > call "warming up" pages in memory such that, when it's called a second
> > > time during the stop-and-copy phase, allocation and page-fault latencies
> > > are reduced.
> > > -------
> > >
> > > Comments, suggestions, etc. are welcome here.
> > >
> > > Jonah Palmer (6):
> > > migration: Add virtio-iterative capability
> > > virtio-net: Reorder vmstate_virtio_net and helpers
> > > virtio-net: Add SaveVMHandlers for iterative migration
> > > virtio-net: iter live migration - migrate vmstate
> > > virtio,virtio-net: skip consistency check in virtio_load for iterative
> > > migration
> > > virtio-net: skip vhost_started assertion during iterative migration
> > >
> > > hw/net/virtio-net.c | 246 +++++++++++++++++++++++++++------
> > > hw/virtio/virtio.c | 32 +++--
> > > include/hw/virtio/virtio-net.h | 8 ++
> > > include/hw/virtio/virtio.h | 7 +
> > > migration/savevm.c | 1 +
> > > qapi/migration.json | 7 +-
> > > 6 files changed, 247 insertions(+), 54 deletions(-)
> > >
> > > --
> > > 2.47.1
> >
> > Thanks
> >
> > >
> >
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC 0/6] virtio-net: initial iterative live migration support
2025-07-25 9:33 ` Michael S. Tsirkin
@ 2025-07-28 7:09 ` Jason Wang
2025-07-28 7:35 ` Jason Wang
0 siblings, 1 reply; 66+ messages in thread
From: Jason Wang @ 2025-07-28 7:09 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Jonah Palmer, qemu-devel, peterx, farosas, eblake, armbru,
si-wei.liu, eperezma, boris.ostrovsky
On Fri, Jul 25, 2025 at 5:33 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Thu, Jul 24, 2025 at 05:59:20PM -0400, Jonah Palmer wrote:
> >
> >
> > On 7/23/25 1:51 AM, Jason Wang wrote:
> > > On Tue, Jul 22, 2025 at 8:41 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
> > > >
> > > > This series is an RFC initial implementation of iterative live
> > > > migration for virtio-net devices.
> > > >
> > > > The main motivation behind implementing iterative migration for
> > > > virtio-net devices is to start on heavy, time-consuming operations
> > > > for the destination while the source is still active (i.e. before
> > > > the stop-and-copy phase).
> > >
> > > It would be better to explain which kind of operations were heavy and
> > > time-consuming and how iterative migration helps.
> > >
> >
> > You're right. Apologies for being vague here.
> >
> > I did do some profiling of the virtio_load call for virtio-net to try and
> > narrow down where exactly most of the downtime is coming from during the
> > stop-and-copy phase.
> >
> > Pretty much the entirety of the downtime comes from the vmstate_load_state
> > call for the vmstate_virtio's subsections:
> >
> > /* Subsections */
> > ret = vmstate_load_state(f, &vmstate_virtio, vdev, 1);
> > if (ret) {
> > return ret;
> > }
> >
> > More specifically, the vmstate_virtio_virtqueues and
> > vmstate_virtio_extra_state subsections.
> >
> > For example, currently (with no iterative migration), for a virtio-net
> > device, the virtio_load call took 13.29ms to finish. 13.20ms of that time
> > was spent in vmstate_load_state(f, &vmstate_virtio, vdev, 1).
> >
> > Of that 13.20ms, ~6.83ms was spent migrating vmstate_virtio_virtqueues and
> > ~6.33ms was spent migrating the vmstate_virtio_extra_state subsections. And
> > I believe this is from walking VIRTIO_QUEUE_MAX virtqueues, twice.
>
> Can we optimize it simply by sending a bitmap of used vqs?
+1.
For example devices like virtio-net may know exactly the number of
virtqueues that will be used.
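As a rough illustration (an assumption, not code from this series), the
save side in hw/virtio/virtio.c could publish a bitmap of the configured
queues so the destination only walks the marked entries:

/*
 * Sketch only: send a bitmap of the virtqueues that actually have a ring
 * configured instead of walking all VIRTIO_QUEUE_MAX entries. The helper
 * name is made up and endianness/compatibility handling is ignored.
 */
static void virtio_put_used_vq_bitmap(VirtIODevice *vdev, QEMUFile *f)
{
    unsigned long bitmap[BITS_TO_LONGS(VIRTIO_QUEUE_MAX)] = { 0 };
    int i;

    for (i = 0; i < VIRTIO_QUEUE_MAX; i++) {
        if (vdev->vq[i].vring.num != 0) {
            set_bit(i, bitmap);
        }
    }

    qemu_put_buffer(f, (uint8_t *)bitmap, sizeof(bitmap));

    /* Only the marked queues would then be saved (and later loaded). */
    for (i = 0; i < VIRTIO_QUEUE_MAX; i++) {
        if (test_bit(i, bitmap)) {
            qemu_put_be32(f, vdev->vq[i].vring.num);
            qemu_put_be64(f, vdev->vq[i].vring.desc);
            qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
        }
    }
}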
>
> > vmstate_load_state virtio-net v11
> > vmstate_load_state PCIDevice v2
> > vmstate_load_state_end PCIDevice end/0
> > vmstate_load_state virtio-net-device v11
> > vmstate_load_state virtio-net-queue-tx_waiting v0
> > vmstate_load_state_end virtio-net-queue-tx_waiting end/0
> > vmstate_load_state virtio-net-vnet v0
> > vmstate_load_state_end virtio-net-vnet end/0
> > vmstate_load_state virtio-net-ufo v0
> > vmstate_load_state_end virtio-net-ufo end/0
> > vmstate_load_state virtio-net-tx_waiting v0
> > vmstate_load_state virtio-net-queue-tx_waiting v0
> > vmstate_load_state_end virtio-net-queue-tx_waiting end/0
> > vmstate_load_state virtio-net-queue-tx_waiting v0
> > vmstate_load_state_end virtio-net-queue-tx_waiting end/0
> > vmstate_load_state virtio-net-queue-tx_waiting v0
> > vmstate_load_state_end virtio-net-queue-tx_waiting end/0
> > vmstate_load_state_end virtio-net-tx_waiting end/0
> > vmstate_load_state_end virtio-net-device end/0
> > vmstate_load_state virtio v1
> > vmstate_load_state virtio/64bit_features v1
> > vmstate_load_state_end virtio/64bit_features end/0
> > vmstate_load_state virtio/virtqueues v1
> > vmstate_load_state virtqueue_state v1 <--- Queue idx 0
> > ...
> > vmstate_load_state_end virtqueue_state end/0
> > vmstate_load_state virtqueue_state v1 <--- Queue idx 1023
> > vmstate_load_state_end virtqueue_state end/0
> > vmstate_load_state_end virtio/virtqueues end/0
> > vmstate_load_state virtio/extra_state v1
> > vmstate_load_state virtio_pci v1
> > vmstate_load_state virtio_pci/modern_state v1
> > vmstate_load_state virtio_pci/modern_queue_state v1 <--- Queue idx 0
> > vmstate_load_state_end virtio_pci/modern_queue_state end/0
> > ...
> > vmstate_load_state virtio_pci/modern_queue_state v1 <--- Queue idx 1023
> > vmstate_load_state_end virtio_pci/modern_queue_state end/0
> > vmstate_load_state_end virtio_pci/modern_state end/0
> > vmstate_load_state_end virtio_pci end/0
> > vmstate_load_state_end virtio/extra_state end/0
> > vmstate_load_state virtio/started v1
> > vmstate_load_state_end virtio/started end/0
> > vmstate_load_state_end virtio end/0
> > vmstate_load_state_end virtio-net end/0
> > vmstate_downtime_load type=non-iterable idstr=0000:00:03.0/virtio-net
> > instance_id=0 downtime=13260
> >
> > With iterative migration for virtio-net (maybe all virtio devices?), we can
> > send this early while the source is still running and then only send the
> > deltas during the stop-and-copy phase. It's likely that the source won't be
> > using all VIRTIO_QUEUE_MAX virtqueues during the migration period, so this
> > could really minimize a large majority of the downtime contributed by
> > virtio-net.
> >
> > This could be one example.
Or if the system call is expensive, could we try io_uring to mitigate it?
> >
> > > >
> > > > The motivation behind this RFC series specifically is to provide an
> > > > initial framework for such an implementation and get feedback on the
> > > > design and direction.
> > > > -------
> > > >
> > > > This implementation of iterative live migration for a virtio-net device
> > > > is enabled via setting the migration capability 'virtio-iterative' to
> > > > on for both the source & destination, e.g. (HMP):
> > > >
> > > > (qemu) migrate_set_capability virtio-iterative on
> > > >
> > > > The virtio-net device's SaveVMHandlers hooks are registered/unregistered
> > > > during the device's realize/unrealize phase.
> > >
> > > I wonder about the plan for libvirt support.
> > >
> >
> > Could you elaborate on this a bit?
I meant how this feature will be supported by libvirt.
> >
> > > >
> > > > Currently, this series only sends and loads the vmstate at the start of
> > > > migration. The vmstate is still sent (again) during the stop-and-copy
> > > > phase, as it is today, to handle any deltas in the state since it was
> > > > initially sent. A future patch in this series could avoid having to
> > > > re-send and re-load the entire state again and instead focus only on the
> > > > deltas.
> > > >
> > > > There is a slight, modest improvement in guest-visible downtime from
> > > > this series. More specifically, when using iterative live migration with
> > > > a virtio-net device, the downtime contributed by migrating a virtio-net
> > > > device decreased from ~3.2ms to ~1.4ms on average:
> > >
> > > Are you testing this via a software virtio device or hardware one?
> > >
> >
> > Just software (virtio-device, vhost-net) with these numbers. I can run some
> > tests with vDPA hardware though.
I see. Considering you see a great improvement with software devices, it
should be sufficient.
> >
> > Those numbers were from a simple, 1 queue-pair virtio-net device.
Thanks
> >
> > > >
> > > > Before:
> > > > -------
> > > > vmstate_downtime_load type=non-iterable idstr=0000:00:03.0/virtio-net
> > > > instance_id=0 downtime=3594
> > > >
> > > > After:
> > > > ------
> > > > vmstate_downtime_load type=non-iterable idstr=0000:00:03.0/virtio-net
> > > > instance_id=0 downtime=1607
> > > >
> > > > This slight improvement is likely due to the initial vmstate_load_state
> > > > call "warming up" pages in memory such that, when it's called a second
> > > > time during the stop-and-copy phase, allocation and page-fault latencies
> > > > are reduced.
> > > > -------
> > > >
> > > > Comments, suggestions, etc. are welcome here.
> > > >
> > > > Jonah Palmer (6):
> > > > migration: Add virtio-iterative capability
> > > > virtio-net: Reorder vmstate_virtio_net and helpers
> > > > virtio-net: Add SaveVMHandlers for iterative migration
> > > > virtio-net: iter live migration - migrate vmstate
> > > > virtio,virtio-net: skip consistency check in virtio_load for iterative
> > > > migration
> > > > virtio-net: skip vhost_started assertion during iterative migration
> > > >
> > > > hw/net/virtio-net.c | 246 +++++++++++++++++++++++++++------
> > > > hw/virtio/virtio.c | 32 +++--
> > > > include/hw/virtio/virtio-net.h | 8 ++
> > > > include/hw/virtio/virtio.h | 7 +
> > > > migration/savevm.c | 1 +
> > > > qapi/migration.json | 7 +-
> > > > 6 files changed, 247 insertions(+), 54 deletions(-)
> > > >
> > > > --
> > > > 2.47.1
> > >
> > > Thanks
> > >
> > > >
> > >
>
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC 0/6] virtio-net: initial iterative live migration support
2025-07-28 7:09 ` Jason Wang
@ 2025-07-28 7:35 ` Jason Wang
2025-07-28 12:41 ` Jonah Palmer
2025-07-28 14:51 ` Eugenio Perez Martin
0 siblings, 2 replies; 66+ messages in thread
From: Jason Wang @ 2025-07-28 7:35 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Jonah Palmer, qemu-devel, peterx, farosas, eblake, armbru,
si-wei.liu, eperezma, boris.ostrovsky
On Mon, Jul 28, 2025 at 3:09 PM Jason Wang <jasowang@redhat.com> wrote:
>
> On Fri, Jul 25, 2025 at 5:33 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> > On Thu, Jul 24, 2025 at 05:59:20PM -0400, Jonah Palmer wrote:
> > >
> > >
> > > On 7/23/25 1:51 AM, Jason Wang wrote:
> > > > On Tue, Jul 22, 2025 at 8:41 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
> > > > >
> > > > > This series is an RFC initial implementation of iterative live
> > > > > migration for virtio-net devices.
> > > > >
> > > > > The main motivation behind implementing iterative migration for
> > > > > virtio-net devices is to start on heavy, time-consuming operations
> > > > > for the destination while the source is still active (i.e. before
> > > > > the stop-and-copy phase).
> > > >
> > > > It would be better to explain which kind of operations were heavy and
> > > > time-consuming and how iterative migration helps.
> > > >
> > >
> > > You're right. Apologies for being vague here.
> > >
> > > I did do some profiling of the virtio_load call for virtio-net to try and
> > > narrow down where exactly most of the downtime is coming from during the
> > > stop-and-copy phase.
> > >
> > > Pretty much the entirety of the downtime comes from the vmstate_load_state
> > > call for the vmstate_virtio's subsections:
> > >
> > > /* Subsections */
> > > ret = vmstate_load_state(f, &vmstate_virtio, vdev, 1);
> > > if (ret) {
> > > return ret;
> > > }
> > >
> > > More specifically, the vmstate_virtio_virtqueues and
> > > vmstate_virtio_extra_state subsections.
> > >
> > > For example, currently (with no iterative migration), for a virtio-net
> > > device, the virtio_load call took 13.29ms to finish. 13.20ms of that time
> > > was spent in vmstate_load_state(f, &vmstate_virtio, vdev, 1).
> > >
> > > Of that 13.20ms, ~6.83ms was spent migrating vmstate_virtio_virtqueues and
> > > ~6.33ms was spent migrating the vmstate_virtio_extra_state subsections. And
> > > I believe this is from walking VIRTIO_QUEUE_MAX virtqueues, twice.
> >
> > Can we optimize it simply by sending a bitmap of used vqs?
>
> +1.
>
> For example devices like virtio-net may know exactly the number of
> virtqueues that will be used.
Ok, I think it comes from the following subsections:
static const VMStateDescription vmstate_virtio_virtqueues = {
.name = "virtio/virtqueues",
.version_id = 1,
.minimum_version_id = 1,
.needed = &virtio_virtqueue_needed,
.fields = (const VMStateField[]) {
VMSTATE_STRUCT_VARRAY_POINTER_KNOWN(vq, struct VirtIODevice,
VIRTIO_QUEUE_MAX, 0, vmstate_virtqueue, VirtQueue),
VMSTATE_END_OF_LIST()
}
};
static const VMStateDescription vmstate_virtio_packed_virtqueues = {
.name = "virtio/packed_virtqueues",
.version_id = 1,
.minimum_version_id = 1,
.needed = &virtio_packed_virtqueue_needed,
.fields = (const VMStateField[]) {
VMSTATE_STRUCT_VARRAY_POINTER_KNOWN(vq, struct VirtIODevice,
VIRTIO_QUEUE_MAX, 0, vmstate_packed_virtqueue, VirtQueue),
VMSTATE_END_OF_LIST()
}
};
A rough idea is to disable those subsections and use new subsections
instead (and do the compatibility work) like virtio_save():
for (i = 0; i < VIRTIO_QUEUE_MAX; i++) {
if (vdev->vq[i].vring.num == 0)
break;
}
qemu_put_be32(f, i);
....
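The load side counterpart (again just a sketch, not this series' code)
would read that count back and only touch the queues that were actually
sent, instead of all VIRTIO_QUEUE_MAX:

/*
 * Sketch only: partial load-side counterpart to the count-prefixed save
 * above. Compatibility with the existing virtio/virtqueues subsection and
 * with packed rings is not handled here, and only a few of the per-queue
 * fields are shown.
 */
static int virtio_load_used_virtqueues(VirtIODevice *vdev, QEMUFile *f)
{
    uint32_t num_vqs, i;

    num_vqs = qemu_get_be32(f);
    if (num_vqs > VIRTIO_QUEUE_MAX) {
        return -EINVAL;
    }

    for (i = 0; i < num_vqs; i++) {
        vdev->vq[i].vring.num = qemu_get_be32(f);
        vdev->vq[i].vring.desc = qemu_get_be64(f);
        qemu_get_be16s(f, &vdev->vq[i].last_avail_idx);
    }

    return 0;
}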
Thanks
>
> >
> > > vmstate_load_state virtio-net v11
> > > vmstate_load_state PCIDevice v2
> > > vmstate_load_state_end PCIDevice end/0
> > > vmstate_load_state virtio-net-device v11
> > > vmstate_load_state virtio-net-queue-tx_waiting v0
> > > vmstate_load_state_end virtio-net-queue-tx_waiting end/0
> > > vmstate_load_state virtio-net-vnet v0
> > > vmstate_load_state_end virtio-net-vnet end/0
> > > vmstate_load_state virtio-net-ufo v0
> > > vmstate_load_state_end virtio-net-ufo end/0
> > > vmstate_load_state virtio-net-tx_waiting v0
> > > vmstate_load_state virtio-net-queue-tx_waiting v0
> > > vmstate_load_state_end virtio-net-queue-tx_waiting end/0
> > > vmstate_load_state virtio-net-queue-tx_waiting v0
> > > vmstate_load_state_end virtio-net-queue-tx_waiting end/0
> > > vmstate_load_state virtio-net-queue-tx_waiting v0
> > > vmstate_load_state_end virtio-net-queue-tx_waiting end/0
> > > vmstate_load_state_end virtio-net-tx_waiting end/0
> > > vmstate_load_state_end virtio-net-device end/0
> > > vmstate_load_state virtio v1
> > > vmstate_load_state virtio/64bit_features v1
> > > vmstate_load_state_end virtio/64bit_features end/0
> > > vmstate_load_state virtio/virtqueues v1
> > > vmstate_load_state virtqueue_state v1 <--- Queue idx 0
> > > ...
> > > vmstate_load_state_end virtqueue_state end/0
> > > vmstate_load_state virtqueue_state v1 <--- Queue idx 1023
> > > vmstate_load_state_end virtqueue_state end/0
> > > vmstate_load_state_end virtio/virtqueues end/0
> > > vmstate_load_state virtio/extra_state v1
> > > vmstate_load_state virtio_pci v1
> > > vmstate_load_state virtio_pci/modern_state v1
> > > vmstate_load_state virtio_pci/modern_queue_state v1 <--- Queue idx 0
> > > vmstate_load_state_end virtio_pci/modern_queue_state end/0
> > > ...
> > > vmstate_load_state virtio_pci/modern_queue_state v1 <--- Queue idx 1023
> > > vmstate_load_state_end virtio_pci/modern_queue_state end/0
> > > vmstate_load_state_end virtio_pci/modern_state end/0
> > > vmstate_load_state_end virtio_pci end/0
> > > vmstate_load_state_end virtio/extra_state end/0
> > > vmstate_load_state virtio/started v1
> > > vmstate_load_state_end virtio/started end/0
> > > vmstate_load_state_end virtio end/0
> > > vmstate_load_state_end virtio-net end/0
> > > vmstate_downtime_load type=non-iterable idstr=0000:00:03.0/virtio-net
> > > instance_id=0 downtime=13260
> > >
> > > With iterative migration for virtio-net (maybe all virtio devices?), we can
> > > send this early while the source is still running and then only send the
> > > deltas during the stop-and-copy phase. It's likely that the source won't be
> > > using all VIRTIO_QUEUE_MAX virtqueues during the migration period, so this
> > > could really minimize a large majority of the downtime contributed by
> > > virtio-net.
> > >
> > > This could be one example.
>
> Or if the system call is expensive, could we try io_uring to mitigate it?
>
> > >
> > > > >
> > > > > The motivation behind this RFC series specifically is to provide an
> > > > > initial framework for such an implementation and get feedback on the
> > > > > design and direction.
> > > > > -------
> > > > >
> > > > > This implementation of iterative live migration for a virtio-net device
> > > > > is enabled via setting the migration capability 'virtio-iterative' to
> > > > > on for both the source & destination, e.g. (HMP):
> > > > >
> > > > > (qemu) migrate_set_capability virtio-iterative on
> > > > >
> > > > > The virtio-net device's SaveVMHandlers hooks are registered/unregistered
> > > > > during the device's realize/unrealize phase.
> > > >
> > > > I wonder about the plan for libvirt support.
> > > >
> > >
> > > Could you elaborate on this a bit?
>
> I meant how this feature will be supported by libvirt.
>
> > >
> > > > >
> > > > > Currently, this series only sends and loads the vmstate at the start of
> > > > > migration. The vmstate is still sent (again) during the stop-and-copy
> > > > > phase, as it is today, to handle any deltas in the state since it was
> > > > > initially sent. A future patch in this series could avoid having to
> > > > > re-send and re-load the entire state again and instead focus only on the
> > > > > deltas.
> > > > >
> > > > > There is a slight, modest improvement in guest-visible downtime from
> > > > > this series. More specifically, when using iterative live migration with
> > > > > a virtio-net device, the downtime contributed by migrating a virtio-net
> > > > > device decreased from ~3.2ms to ~1.4ms on average:
> > > >
> > > > Are you testing this via a software virtio device or hardware one?
> > > >
> > >
> > > Just software (virtio-device, vhost-net) with these numbers. I can run some
> > > tests with vDPA hardware though.
>
> I see. Considering you see a great improvement with software devices, it
> should be sufficient.
>
> > >
> > > Those numbers were from a simple, 1 queue-pair virtio-net device.
>
> Thanks
>
> > >
> > > > >
> > > > > Before:
> > > > > -------
> > > > > vmstate_downtime_load type=non-iterable idstr=0000:00:03.0/virtio-net
> > > > > instance_id=0 downtime=3594
> > > > >
> > > > > After:
> > > > > ------
> > > > > vmstate_downtime_load type=non-iterable idstr=0000:00:03.0/virtio-net
> > > > > instance_id=0 downtime=1607
> > > > >
> > > > > This slight improvement is likely due to the initial vmstate_load_state
> > > > > call "warming up" pages in memory such that, when it's called a second
> > > > > time during the stop-and-copy phase, allocation and page-fault latencies
> > > > > are reduced.
> > > > > -------
> > > > >
> > > > > Comments, suggestions, etc. are welcome here.
> > > > >
> > > > > Jonah Palmer (6):
> > > > > migration: Add virtio-iterative capability
> > > > > virtio-net: Reorder vmstate_virtio_net and helpers
> > > > > virtio-net: Add SaveVMHandlers for iterative migration
> > > > > virtio-net: iter live migration - migrate vmstate
> > > > > virtio,virtio-net: skip consistency check in virtio_load for iterative
> > > > > migration
> > > > > virtio-net: skip vhost_started assertion during iterative migration
> > > > >
> > > > > hw/net/virtio-net.c | 246 +++++++++++++++++++++++++++------
> > > > > hw/virtio/virtio.c | 32 +++--
> > > > > include/hw/virtio/virtio-net.h | 8 ++
> > > > > include/hw/virtio/virtio.h | 7 +
> > > > > migration/savevm.c | 1 +
> > > > > qapi/migration.json | 7 +-
> > > > > 6 files changed, 247 insertions(+), 54 deletions(-)
> > > > >
> > > > > --
> > > > > 2.47.1
> > > >
> > > > Thanks
> > > >
> > > > >
> > > >
> >
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC 4/6] virtio-net: iter live migration - migrate vmstate
2025-07-25 9:31 ` Michael S. Tsirkin
@ 2025-07-28 12:30 ` Jonah Palmer
0 siblings, 0 replies; 66+ messages in thread
From: Jonah Palmer @ 2025-07-28 12:30 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: qemu-devel, peterx, farosas, eblake, armbru, jasowang, si-wei.liu,
eperezma, boris.ostrovsky
On 7/25/25 5:31 AM, Michael S. Tsirkin wrote:
> On Thu, Jul 24, 2025 at 10:45:34AM -0400, Jonah Palmer wrote:
>>
>>
>> On 7/23/25 2:51 AM, Michael S. Tsirkin wrote:
>>> On Tue, Jul 22, 2025 at 12:41:25PM +0000, Jonah Palmer wrote:
>>>> Lays out the initial groundwork for iteratively migrating the state of a
>>>> virtio-net device, starting with its vmstate (via vmstate_save_state &
>>>> vmstate_load_state).
>>>>
>>>> The original non-iterative vmstate framework still runs during the
>>>> stop-and-copy phase when the guest is paused, which is still necessary
>>>> to migrate over the final state of the virtqueues once the source has
>>>> been paused.
>>>>
>>>> Although the vmstate framework is used twice (once during the iterative
>>>> portion and once during the stop-and-copy phase), it appears that
>>>> there's some modest improvement in guest-visible downtime when using a
>>>> virtio-net device.
>>>>
>>>> When tracing the vmstate_downtime_save and vmstate_downtime_load
>>>> tracepoints, for a virtio-net device using iterative live migration, the
>>>> non-iterative downtime portion improved modestly, going from ~3.2ms to
>>>> ~1.4ms:
>>>>
>>>> Before:
>>>> -------
>>>> vmstate_downtime_load type=non-iterable idstr=0000:00:03.0/virtio-net
>>>> instance_id=0 downtime=3594
>>>>
>>>> After:
>>>> ------
>>>> vmstate_downtime_load type=non-iterable idstr=0000:00:03.0/virtio-net
>>>> instance_id=0 downtime=1607
>>>>
>>>> This improvement is likely due to the initial vmstate_load_state call
>>>> (while the guest is still running) "warming up" all related pages and
>>>> structures on the destination. In other words, by the time the final
>>>> stop-and-copy phase starts, the heavy allocations and page-fault
>>>> latencies are reduced, making the device re-loads slightly faster and
>>>> the guest-visible downtime window slightly smaller.
>>>
>>> did I get it right it's just the vmstate load for this single device?
>>> If the theory is right, is it not possible that while the
>>> tracepoints are now closer together, you have pushed something
>>> else out of the cache, making the effect on guest visible downtime
>>> unpredictable? how about the total vmstate load time?
>>>
>>
>> Correct, the data above is just from the virtio-net device's downtime
>> contribution (specifically during the stop-and-copy phase).
>>
>> Theoretically, yes I believe so. To try and get a feel on this, I ran some
>> slightly heavier testing for the virtio-net device: vhost-net + 4 queue
>> pairs (the one above was just a virtio-net device with 1 queue pair).
>>
>> I traced the reported downtimes of the devices that come right before and
>> after virtio-net's vmstate_load_state call with and without iterative
>> migration on the virtio-net device.
>>
>> The downtimes below are all from the vmstate_load_state calls that happen
>> while the source has been stopped:
>>
>> With iterative migration for virtio-net:
>> ----------------------------------------
>> vga: 1.50ms | 1.39ms | 1.37ms | 1.50ms | 1.63ms |
>> virtio-console: 13.78ms | 14.24ms | 13.74ms | 13.89ms | 13.60ms |
>> virtio-net: 13.91ms | 13.52ms | 13.09ms | 13.59ms | 13.37ms |
>> virtio-scsi: 18.71ms | 13.96ms | 14.05ms | 16.55ms | 14.30ms |
>>
>> vga: Avg. 1.47ms | Var: 0.0109ms² | Std. Dev (σ): 0.104ms
>> virtio-console: Avg. 13.85ms | Var: 0.0583ms² | Std. Dev (σ): 0.241ms
>> virtio-net: Avg. 13.49ms | Var: 0.0904ms² | Std. Dev (σ): 0.301ms
>> virtio-scsi: Avg. 15.51ms | Var: 4.3299ms² | Std. Dev (σ): 2.081ms
>>
>> Without iterative migration for virtio-net:
>> -------------------------------------------
>> vga: 1.47ms | 1.28ms | 1.55ms | 1.36ms | 1.22ms |
>> virtio-console: 13.39ms | 13.40ms | 14.37ms | 13.93ms | 13.36ms |
>> virtio-net: 18.52ms | 17.77ms | 17.52ms | 15.52ms | 17.32ms |
>> virtio-scsi: 13.35ms | 13.94ms | 15.17ms | 16.01ms | 14.08ms |
>>
>> vga: Avg. 1.37ms | Var: 0.0182ms² | Std. Dev (σ): 0.135ms
>> virtio-console: Avg. 13.69ms | Var: 0.2007ms² | Std. Dev (σ): 0.448ms
>> virtio-net: Avg. 17.33ms | Var: 1.2305ms² | Std. Dev (σ): 1.109ms
>> virtio-scsi: Avg. 14.51ms | Var: 1.1352ms² | Std. Dev (σ): 1.065ms
>>
>> The most notable difference here is the standard deviation of virtio-scsi's
>> migration downtime, which comes after virtio-net's migration: virtio-scsi's
>> σ rises from ~1.07ms to ~2.08ms when virtio-net is iteratively migrated.
>>
>> However, since I only got 5 samples per device, the trend is indicative but
>> not definitive.
>>
>> Total vmstate load time per device ≈ downtimes reported above, unless you're
>> referring to overall downtime across all devices?
>
> Indeed.
>
> I also wonder, if preheating the cache is a big gain, why don't we just
> do it for all devices? There is nothing special in virtio: just
> call save for all devices, send the state, call load on destination
> then call reset to discard the state.
>
So with a relatively simple guest with vhost-net (4x queue pairs),
virtio-scsi, and virtio-serial (virtio-console), total downtime across
all devices came out to ~66.29ms. This was with iterative live migration
for the virtio-net device.
The 5 largest contributors to downtime were virtio-scsi, virtio-serial,
virtio-net, RAM, and CPU:
virtio-scsi: 13.994ms
virtio-console: 13.796ms
virtio-net: 13.495ms
RAM: 9.994ms
CPU: 4.125ms
...
-----------
Perhaps we could do it for all devices, but it would probably be much
more efficient to not discard the state after the iterative portion and
just send the deltas at the end. This will probably be the next goal for
this series.
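One way to picture that delta step (purely an assumption about a future
revision, not code in this series) is a dirty flag that gets set whenever
the device state changes after the setup-phase send, so the completion
hook only re-emits state when something actually changed:

/*
 * Sketch only: hypothetical delta handling for the completion hook. The
 * vnet_state_dirty flag and VNET_MIG_F_DELTA marker do not exist in this
 * series; the full vmstate_save_state() call is a placeholder for a real
 * delta encoding.
 */
static int virtio_net_save_live_complete_precopy(QEMUFile *f, void *opaque)
{
    VirtIONet *n = opaque;

    if (!n->vnet_state_dirty) {
        /* Nothing changed since the setup-phase send. */
        qemu_put_be64(f, VNET_MIG_F_NO_DATA);
        return 0;
    }

    /* Re-send only what changed while the source was still live. */
    qemu_put_be64(f, VNET_MIG_F_DELTA);
    vmstate_save_state(f, &vmstate_virtio_net, n, NULL);
    qemu_put_be64(f, VNET_MIG_F_END_DATA);
    return 0;
}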
>
>
>> ----------
>>
>> Having said all this, this RFC is just an initial, first-step for iterative
>> migration of a virtio-net device. This second vmstate_load_state call during
>> the stop-and-copy phase isn't optimal. A future version of this series could
>> do away with this second call and only send the deltas instead of the entire
>> state again.
>
> I see how this could be a win, in theory, if the state is big.
>
>
>>>> Future patches could improve upon this by skipping the second
>>>> vmstate_save/load_state calls (during the stop-and-copy phase) and
>>>> instead only send deltas right before/after the source is stopped.
>>>>
>>>> Signed-off-by: Jonah Palmer <jonah.palmer@oracle.com>
>>>> ---
>>>> hw/net/virtio-net.c | 37 ++++++++++++++++++++++++++++++++++
>>>> include/hw/virtio/virtio-net.h | 8 ++++++++
>>>> 2 files changed, 45 insertions(+)
>>>>
>>>> diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
>>>> index 19aa5b5936..86a6fe5b91 100644
>>>> --- a/hw/net/virtio-net.c
>>>> +++ b/hw/net/virtio-net.c
>>>> @@ -3808,16 +3808,31 @@ static bool virtio_net_is_active(void *opaque)
>>>> static int virtio_net_save_setup(QEMUFile *f, void *opaque, Error **errp)
>>>> {
>>>> + VirtIONet *n = opaque;
>>>> +
>>>> + qemu_put_be64(f, VNET_MIG_F_INIT_STATE);
>>>> + vmstate_save_state(f, &vmstate_virtio_net, n, NULL);
>>>> + qemu_put_be64(f, VNET_MIG_F_END_DATA);
>>>> +
>>>> return 0;
>>>> }
>>>> static int virtio_net_save_live_iterate(QEMUFile *f, void *opaque)
>>>> {
>>>> + bool new_data = false;
>>>> +
>>>> + if (!new_data) {
>>>> + qemu_put_be64(f, VNET_MIG_F_NO_DATA);
>>>> + return 1;
>>>> + }
>>>> +
>>>> + qemu_put_be64(f, VNET_MIG_F_END_DATA);
>>>> return 1;
>>>> }
>>>> static int virtio_net_save_live_complete_precopy(QEMUFile *f, void *opaque)
>>>> {
>>>> + qemu_put_be64(f, VNET_MIG_F_NO_DATA);
>>>> return 0;
>>>> }
>>>> @@ -3833,6 +3848,28 @@ static int virtio_net_load_setup(QEMUFile *f, void *opaque, Error **errp)
>>>> static int virtio_net_load_state(QEMUFile *f, void *opaque, int version_id)
>>>> {
>>>> + VirtIONet *n = opaque;
>>>> + uint64_t flag;
>>>> +
>>>> + flag = qemu_get_be64(f);
>>>> + if (flag == VNET_MIG_F_NO_DATA) {
>>>> + return 0;
>>>> + }
>>>> +
>>>> + while (flag != VNET_MIG_F_END_DATA) {
>>>> + switch (flag) {
>>>> + case VNET_MIG_F_INIT_STATE:
>>>> + {
>>>> + vmstate_load_state(f, &vmstate_virtio_net, n, VIRTIO_NET_VM_VERSION);
>>>> + break;
>>>> + }
>>>> + default:
>>>> + qemu_log_mask(LOG_GUEST_ERROR, "%s: Unknown flag 0x%"PRIx64, __func__, flag);
>>>> + return -EINVAL;
>>>> + }
>>>> +
>>>> + flag = qemu_get_be64(f);
>>>> + }
>>>> return 0;
>>>> }
>>>> diff --git a/include/hw/virtio/virtio-net.h b/include/hw/virtio/virtio-net.h
>>>> index b9ea9e824e..d6c7619053 100644
>>>> --- a/include/hw/virtio/virtio-net.h
>>>> +++ b/include/hw/virtio/virtio-net.h
>>>> @@ -163,6 +163,14 @@ typedef struct VirtIONetQueue {
>>>> struct VirtIONet *n;
>>>> } VirtIONetQueue;
>>>> +/*
>>>> + * Flags to be used as unique delimiters for virtio-net devices in the
>>>> + * migration stream.
>>>> + */
>>>> +#define VNET_MIG_F_INIT_STATE (0xffffffffef200000ULL)
>>>> +#define VNET_MIG_F_END_DATA (0xffffffffef200001ULL)
>>>> +#define VNET_MIG_F_NO_DATA (0xffffffffef200002ULL)
>>>> +
>>>> struct VirtIONet {
>>>> VirtIODevice parent_obj;
>>>> uint8_t mac[ETH_ALEN];
>>>> --
>>>> 2.47.1
>>>
>
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC 0/6] virtio-net: initial iterative live migration support
2025-07-28 7:35 ` Jason Wang
@ 2025-07-28 12:41 ` Jonah Palmer
2025-07-28 14:51 ` Eugenio Perez Martin
1 sibling, 0 replies; 66+ messages in thread
From: Jonah Palmer @ 2025-07-28 12:41 UTC (permalink / raw)
To: Jason Wang, Michael S. Tsirkin
Cc: qemu-devel, peterx, farosas, eblake, armbru, si-wei.liu, eperezma,
boris.ostrovsky
On 7/28/25 3:35 AM, Jason Wang wrote:
> On Mon, Jul 28, 2025 at 3:09 PM Jason Wang <jasowang@redhat.com> wrote:
>>
>> On Fri, Jul 25, 2025 at 5:33 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>>>
>>> On Thu, Jul 24, 2025 at 05:59:20PM -0400, Jonah Palmer wrote:
>>>>
>>>>
>>>> On 7/23/25 1:51 AM, Jason Wang wrote:
>>>>> On Tue, Jul 22, 2025 at 8:41 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
>>>>>>
>>>>>> This series is an RFC initial implementation of iterative live
>>>>>> migration for virtio-net devices.
>>>>>>
>>>>>> The main motivation behind implementing iterative migration for
>>>>>> virtio-net devices is to start on heavy, time-consuming operations
>>>>>> for the destination while the source is still active (i.e. before
>>>>>> the stop-and-copy phase).
>>>>>
>>>>> It would be better to explain which kind of operations were heavy and
>>>>> time-consuming and how iterative migration helps.
>>>>>
>>>>
>>>> You're right. Apologies for being vague here.
>>>>
>>>> I did do some profiling of the virtio_load call for virtio-net to try and
>>>> narrow down where exactly most of the downtime is coming from during the
>>>> stop-and-copy phase.
>>>>
>>>> Pretty much the entirety of the downtime comes from the vmstate_load_state
>>>> call for the vmstate_virtio's subsections:
>>>>
>>>> /* Subsections */
>>>> ret = vmstate_load_state(f, &vmstate_virtio, vdev, 1);
>>>> if (ret) {
>>>> return ret;
>>>> }
>>>>
>>>> More specifically, the vmstate_virtio_virtqueues and
>>>> vmstate_virtio_extra_state subsections.
>>>>
>>>> For example, currently (with no iterative migration), for a virtio-net
>>>> device, the virtio_load call took 13.29ms to finish. 13.20ms of that time
>>>> was spent in vmstate_load_state(f, &vmstate_virtio, vdev, 1).
>>>>
>>>> Of that 13.20ms, ~6.83ms was spent migrating vmstate_virtio_virtqueues and
>>>> ~6.33ms was spent migrating the vmstate_virtio_extra_state subsections. And
>>>> I believe this is from walking VIRTIO_QUEUE_MAX virtqueues, twice.
>>>
>>> Can we optimize it simply by sending a bitmap of used vqs?
>>
>> +1.
>>
>> For example devices like virtio-net may know exactly the number of
>> virtqueues that will be used.
>
> Ok, I think it comes from the following subsections:
>
> static const VMStateDescription vmstate_virtio_virtqueues = {
> .name = "virtio/virtqueues",
> .version_id = 1,
> .minimum_version_id = 1,
> .needed = &virtio_virtqueue_needed,
> .fields = (const VMStateField[]) {
> VMSTATE_STRUCT_VARRAY_POINTER_KNOWN(vq, struct VirtIODevice,
> VIRTIO_QUEUE_MAX, 0, vmstate_virtqueue, VirtQueue),
> VMSTATE_END_OF_LIST()
> }
> };
>
> static const VMStateDescription vmstate_virtio_packed_virtqueues = {
> .name = "virtio/packed_virtqueues",
> .version_id = 1,
> .minimum_version_id = 1,
> .needed = &virtio_packed_virtqueue_needed,
> .fields = (const VMStateField[]) {
> VMSTATE_STRUCT_VARRAY_POINTER_KNOWN(vq, struct VirtIODevice,
> VIRTIO_QUEUE_MAX, 0, vmstate_packed_virtqueue, VirtQueue),
> VMSTATE_END_OF_LIST()
> }
> };
>
> A rough idea is to disable those subsections and use new subsections
> instead (and do the compatibility work) like virtio_save():
>
> for (i = 0; i < VIRTIO_QUEUE_MAX; i++) {
> if (vdev->vq[i].vring.num == 0)
> break;
> }
>
> qemu_put_be32(f, i);
> ....
>
> Thanks
>
Right. There's this for split/packed VQs, and then there's also the
vmstate_virtio_extra_state subsection, which ends up loading this
virtio_pci/modern_state:
static const VMStateDescription vmstate_virtio_pci_modern_state_sub = {
.name = "virtio_pci/modern_state",
.version_id = 1,
.minimum_version_id = 1,
.needed = &virtio_pci_modern_state_needed,
.fields = (const VMStateField[]) {
VMSTATE_UINT32(dfselect, VirtIOPCIProxy),
VMSTATE_UINT32(gfselect, VirtIOPCIProxy),
VMSTATE_UINT32_ARRAY(guest_features, VirtIOPCIProxy, 2),
VMSTATE_STRUCT_ARRAY(vqs, VirtIOPCIProxy, VIRTIO_QUEUE_MAX, 0,
vmstate_virtio_pci_modern_queue_state,
VirtIOPCIQueue),
VMSTATE_END_OF_LIST()
}
};
...
vmstate_load_state_end virtio/virtqueues end/0
vmstate_load_state virtio/extra_state v1
vmstate_load_state virtio_pci v1
vmstate_load_state virtio_pci/modern_state v1
vmstate_load_state virtio_pci/modern_queue_state v1
...
I'll take a look at what could be done here and try and get it into the
next series.
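For the virtio_pci/modern_state side, the same idea could apply. A rough
sketch (assumed helper name, and assuming the VirtIOPCIQueue
'enabled'/'num' fields are enough to identify used slots) that skips
proxy queue slots which were never enabled:

/*
 * Sketch only: save just the enabled PCI proxy queue slots, prefixed by
 * their count, instead of all VIRTIO_QUEUE_MAX VirtIOPCIQueue entries.
 */
static void virtio_pci_put_used_queue_state(VirtIOPCIProxy *proxy, QEMUFile *f)
{
    uint32_t i, count = 0;

    for (i = 0; i < VIRTIO_QUEUE_MAX; i++) {
        if (proxy->vqs[i].enabled) {
            count++;
        }
    }
    qemu_put_be32(f, count);

    for (i = 0; i < VIRTIO_QUEUE_MAX; i++) {
        if (!proxy->vqs[i].enabled) {
            continue;
        }
        qemu_put_be32(f, i);                  /* queue index */
        qemu_put_be16(f, proxy->vqs[i].num);  /* ring size */
    }
}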
>>
>>>
>>>> vmstate_load_state virtio-net v11
>>>> vmstate_load_state PCIDevice v2
>>>> vmstate_load_state_end PCIDevice end/0
>>>> vmstate_load_state virtio-net-device v11
>>>> vmstate_load_state virtio-net-queue-tx_waiting v0
>>>> vmstate_load_state_end virtio-net-queue-tx_waiting end/0
>>>> vmstate_load_state virtio-net-vnet v0
>>>> vmstate_load_state_end virtio-net-vnet end/0
>>>> vmstate_load_state virtio-net-ufo v0
>>>> vmstate_load_state_end virtio-net-ufo end/0
>>>> vmstate_load_state virtio-net-tx_waiting v0
>>>> vmstate_load_state virtio-net-queue-tx_waiting v0
>>>> vmstate_load_state_end virtio-net-queue-tx_waiting end/0
>>>> vmstate_load_state virtio-net-queue-tx_waiting v0
>>>> vmstate_load_state_end virtio-net-queue-tx_waiting end/0
>>>> vmstate_load_state virtio-net-queue-tx_waiting v0
>>>> vmstate_load_state_end virtio-net-queue-tx_waiting end/0
>>>> vmstate_load_state_end virtio-net-tx_waiting end/0
>>>> vmstate_load_state_end virtio-net-device end/0
>>>> vmstate_load_state virtio v1
>>>> vmstate_load_state virtio/64bit_features v1
>>>> vmstate_load_state_end virtio/64bit_features end/0
>>>> vmstate_load_state virtio/virtqueues v1
>>>> vmstate_load_state virtqueue_state v1 <--- Queue idx 0
>>>> ...
>>>> vmstate_load_state_end virtqueue_state end/0
>>>> vmstate_load_state virtqueue_state v1 <--- Queue idx 1023
>>>> vmstate_load_state_end virtqueue_state end/0
>>>> vmstate_load_state_end virtio/virtqueues end/0
>>>> vmstate_load_state virtio/extra_state v1
>>>> vmstate_load_state virtio_pci v1
>>>> vmstate_load_state virtio_pci/modern_state v1
>>>> vmstate_load_state virtio_pci/modern_queue_state v1 <--- Queue idx 0
>>>> vmstate_load_state_end virtio_pci/modern_queue_state end/0
>>>> ...
>>>> vmstate_load_state virtio_pci/modern_queue_state v1 <--- Queue idx 1023
>>>> vmstate_load_state_end virtio_pci/modern_queue_state end/0
>>>> vmstate_load_state_end virtio_pci/modern_state end/0
>>>> vmstate_load_state_end virtio_pci end/0
>>>> vmstate_load_state_end virtio/extra_state end/0
>>>> vmstate_load_state virtio/started v1
>>>> vmstate_load_state_end virtio/started end/0
>>>> vmstate_load_state_end virtio end/0
>>>> vmstate_load_state_end virtio-net end/0
>>>> vmstate_downtime_load type=non-iterable idstr=0000:00:03.0/virtio-net
>>>> instance_id=0 downtime=13260
>>>>
>>>> With iterative migration for virtio-net (maybe all virtio devices?), we can
>>>> send this early while the source is still running and then only send the
>>>> deltas during the stop-and-copy phase. It's likely that the source won't be
>>>> using all VIRTIO_QUEUE_MAX virtqueues during the migration period, so this
>>>> could really minimize a large majority of the downtime contributed by
>>>> virtio-net.
>>>>
>>>> This could be one example.
>>
>> Or if the system call is expensive, could we try io_uring to mitigate it?
>>
>>>>
>>>>>>
>>>>>> The motivation behind this RFC series specifically is to provide an
>>>>>> initial framework for such an implementation and get feedback on the
>>>>>> design and direction.
>>>>>> -------
>>>>>>
>>>>>> This implementation of iterative live migration for a virtio-net device
>>>>>> is enabled via setting the migration capability 'virtio-iterative' to
>>>>>> on for both the source & destination, e.g. (HMP):
>>>>>>
>>>>>> (qemu) migrate_set_capability virtio-iterative on
>>>>>>
>>>>>> The virtio-net device's SaveVMHandlers hooks are registered/unregistered
>>>>>> during the device's realize/unrealize phase.
>>>>>
>>>>> I wonder about the plan for libvirt support.
>>>>>
>>>>
>>>> Could you elaborate on this a bit?
>>
>> I meant how this feature will be supported by the libvirt.
>>
>>>>
>>>>>>
>>>>>> Currently, this series only sends and loads the vmstate at the start of
>>>>>> migration. The vmstate is still sent (again) during the stop-and-copy
>>>>>> phase, as it is today, to handle any deltas in the state since it was
>>>>>> initially sent. A future patch in this series could avoid having to
>>>>>> re-send and re-load the entire state again and instead focus only on the
>>>>>> deltas.
>>>>>>
>>>>>> There is a slight, modest improvement in guest-visible downtime from
>>>>>> this series. More specifically, when using iterative live migration with
>>>>>> a virtio-net device, the downtime contributed by migrating a virtio-net
>>>>>> device decreased from ~3.2ms to ~1.4ms on average:
>>>>>
>>>>> Are you testing this via a software virtio device or hardware one?
>>>>>
>>>>
>>>> Just software (virtio-device, vhost-net) with these numbers. I can run some
>>>> tests with vDPA hardware though.
>>
>> I see. Considering you see great improvement with software devices, it
>> should be sufficient.
>>
>>>>
>>>> Those numbers were from a simple, 1 queue-pair virtio-net device.
>>
>> Thanks
>>
>>>>
>>>>>>
>>>>>> Before:
>>>>>> -------
>>>>>> vmstate_downtime_load type=non-iterable idstr=0000:00:03.0/virtio-net
>>>>>> instance_id=0 downtime=3594
>>>>>>
>>>>>> After:
>>>>>> ------
>>>>>> vmstate_downtime_load type=non-iterable idstr=0000:00:03.0/virtio-net
>>>>>> instance_id=0 downtime=1607
>>>>>>
>>>>>> This slight improvement is likely due to the initial vmstate_load_state
>>>>>> call "warming up" pages in memory such that, when it's called a second
>>>>>> time during the stop-and-copy phase, allocation and page-fault latencies
>>>>>> are reduced.
>>>>>> -------
>>>>>>
>>>>>> Comments, suggestions, etc. are welcome here.
>>>>>>
>>>>>> Jonah Palmer (6):
>>>>>> migration: Add virtio-iterative capability
>>>>>> virtio-net: Reorder vmstate_virtio_net and helpers
>>>>>> virtio-net: Add SaveVMHandlers for iterative migration
>>>>>> virtio-net: iter live migration - migrate vmstate
>>>>>> virtio,virtio-net: skip consistency check in virtio_load for iterative
>>>>>> migration
>>>>>> virtio-net: skip vhost_started assertion during iterative migration
>>>>>>
>>>>>> hw/net/virtio-net.c | 246 +++++++++++++++++++++++++++------
>>>>>> hw/virtio/virtio.c | 32 +++--
>>>>>> include/hw/virtio/virtio-net.h | 8 ++
>>>>>> include/hw/virtio/virtio.h | 7 +
>>>>>> migration/savevm.c | 1 +
>>>>>> qapi/migration.json | 7 +-
>>>>>> 6 files changed, 247 insertions(+), 54 deletions(-)
>>>>>>
>>>>>> --
>>>>>> 2.47.1
>>>>>
>>>>> Thanks
>>>>>
>>>>>>
>>>>>
>>>
>
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC 0/6] virtio-net: initial iterative live migration support
2025-07-28 7:35 ` Jason Wang
2025-07-28 12:41 ` Jonah Palmer
@ 2025-07-28 14:51 ` Eugenio Perez Martin
2025-07-28 15:38 ` Eugenio Perez Martin
2025-07-29 2:38 ` Jason Wang
1 sibling, 2 replies; 66+ messages in thread
From: Eugenio Perez Martin @ 2025-07-28 14:51 UTC (permalink / raw)
To: Jason Wang
Cc: Michael S. Tsirkin, Jonah Palmer, qemu-devel, peterx, farosas,
eblake, armbru, si-wei.liu, boris.ostrovsky
On Mon, Jul 28, 2025 at 9:36 AM Jason Wang <jasowang@redhat.com> wrote:
>
> On Mon, Jul 28, 2025 at 3:09 PM Jason Wang <jasowang@redhat.com> wrote:
> >
> > On Fri, Jul 25, 2025 at 5:33 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> > >
> > > On Thu, Jul 24, 2025 at 05:59:20PM -0400, Jonah Palmer wrote:
> > > >
> > > >
> > > > On 7/23/25 1:51 AM, Jason Wang wrote:
> > > > > On Tue, Jul 22, 2025 at 8:41 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
> > > > > >
> > > > > > This series is an RFC initial implementation of iterative live
> > > > > > migration for virtio-net devices.
> > > > > >
> > > > > > The main motivation behind implementing iterative migration for
> > > > > > virtio-net devices is to start on heavy, time-consuming operations
> > > > > > for the destination while the source is still active (i.e. before
> > > > > > the stop-and-copy phase).
> > > > >
> > > > > It would be better to explain which kind of operations were heavy and
> > > > > time-consuming and how iterative migration help.
> > > > >
> > > >
> > > > You're right. Apologies for being vague here.
> > > >
> > > > I did do some profiling of the virtio_load call for virtio-net to try and
> > > > narrow down where exactly most of the downtime is coming from during the
> > > > stop-and-copy phase.
> > > >
> > > > Pretty much the entirety of the downtime comes from the vmstate_load_state
> > > > call for the vmstate_virtio's subsections:
> > > >
> > > > /* Subsections */
> > > > ret = vmstate_load_state(f, &vmstate_virtio, vdev, 1);
> > > > if (ret) {
> > > > return ret;
> > > > }
> > > >
> > > > More specifically, the vmstate_virtio_virtqueues and
> > > > vmstate_virtio_extra_state subsections.
> > > >
> > > > For example, currently (with no iterative migration), for a virtio-net
> > > > device, the virtio_load call took 13.29ms to finish. 13.20ms of that time
> > > > was spent in vmstate_load_state(f, &vmstate_virtio, vdev, 1).
> > > >
> > > > Of that 13.21ms, ~6.83ms was spent migrating vmstate_virtio_virtqueues and
> > > > ~6.33ms was spent migrating the vmstate_virtio_extra_state subsections. And
> > > > I believe this is from walking VIRTIO_QUEUE_MAX virtqueues, twice.
> > >
> > > Can we optimize it simply by sending a bitmap of used vqs?
> >
> > +1.
> >
> > For example devices like virtio-net may know exactly the number of
> > virtqueues that will be used.
>
> Ok, I think it comes from the following subsections:
>
> static const VMStateDescription vmstate_virtio_virtqueues = {
> .name = "virtio/virtqueues",
> .version_id = 1,
> .minimum_version_id = 1,
> .needed = &virtio_virtqueue_needed,
> .fields = (const VMStateField[]) {
> VMSTATE_STRUCT_VARRAY_POINTER_KNOWN(vq, struct VirtIODevice,
> VIRTIO_QUEUE_MAX, 0, vmstate_virtqueue, VirtQueue),
> VMSTATE_END_OF_LIST()
> }
> };
>
> static const VMStateDescription vmstate_virtio_packed_virtqueues = {
> .name = "virtio/packed_virtqueues",
> .version_id = 1,
> .minimum_version_id = 1,
> .needed = &virtio_packed_virtqueue_needed,
> .fields = (const VMStateField[]) {
> VMSTATE_STRUCT_VARRAY_POINTER_KNOWN(vq, struct VirtIODevice,
> VIRTIO_QUEUE_MAX, 0, vmstate_packed_virtqueue, VirtQueue),
> VMSTATE_END_OF_LIST()
> }
> };
>
> A rough idea is to disable those subsections and use new subsections
> instead (and do the compatibility work) like virtio_save():
>
> for (i = 0; i < VIRTIO_QUEUE_MAX; i++) {
> if (vdev->vq[i].vring.num == 0)
> break;
> }
>
> qemu_put_be32(f, i);
> ....
>
While I think this is a very good area to explore, I think we will get
more benefit by pre-warming vhost-vdpa devices, as they take one or
two orders of magnitude longer than sending and processing the
virtio-net state (1s~10s vs 10~100ms).
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC 5/6] virtio,virtio-net: skip consistency check in virtio_load for iterative migration
2025-07-22 12:41 ` [RFC 5/6] virtio, virtio-net: skip consistency check in virtio_load for iterative migration Jonah Palmer via
@ 2025-07-28 15:30 ` Eugenio Perez Martin
2025-07-28 16:23 ` Jonah Palmer
2025-08-06 16:27 ` Peter Xu
1 sibling, 1 reply; 66+ messages in thread
From: Eugenio Perez Martin @ 2025-07-28 15:30 UTC (permalink / raw)
To: Jonah Palmer
Cc: qemu-devel, peterx, farosas, eblake, armbru, jasowang, mst,
si-wei.liu, boris.ostrovsky
On Tue, Jul 22, 2025 at 2:41 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
>
> Iterative live migration for virtio-net sends an initial
> VMStateDescription while the source is still active. Because data
> continues to flow for virtio-net, the guest's avail index continues to
> increment after last_avail_idx had already been sent. This causes the
> destination to often see something like this from virtio_error():
>
> VQ 0 size 0x100 Guest index 0x0 inconsistent with Host index 0xc: delta 0xfff4
>
> This patch suppresses this consistency check if we're loading the
> initial VMStateDescriptions via iterative migration and unsuppresses
> it for the stop-and-copy phase when the final VMStateDescriptions
> (carrying the correct indices) are loaded.
>
> A temporary VirtIODevMigration migration data structure is introduced here to
> represent the iterative migration process for a VirtIODevice. For now it
> just holds a flag to indicate whether or not the initial
> VMStateDescription was sent during the iterative live migration process.
>
> Signed-off-by: Jonah Palmer <jonah.palmer@oracle.com>
> ---
> hw/net/virtio-net.c | 13 +++++++++++++
> hw/virtio/virtio.c | 32 ++++++++++++++++++++++++--------
> include/hw/virtio/virtio.h | 6 ++++++
> 3 files changed, 43 insertions(+), 8 deletions(-)
>
> diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
> index 86a6fe5b91..b7ac5e8278 100644
> --- a/hw/net/virtio-net.c
> +++ b/hw/net/virtio-net.c
> @@ -3843,12 +3843,19 @@ static void virtio_net_save_cleanup(void *opaque)
>
> static int virtio_net_load_setup(QEMUFile *f, void *opaque, Error **errp)
> {
> + VirtIONet *n = opaque;
> + VirtIODevice *vdev = VIRTIO_DEVICE(n);
> + vdev->migration = g_new0(VirtIODevMigration, 1);
> + vdev->migration->iterative_vmstate_loaded = false;
> +
> return 0;
> }
>
> static int virtio_net_load_state(QEMUFile *f, void *opaque, int version_id)
> {
> VirtIONet *n = opaque;
> + VirtIODevice *vdev = VIRTIO_DEVICE(n);
> + VirtIODevMigration *mig = vdev->migration;
> uint64_t flag;
>
> flag = qemu_get_be64(f);
> @@ -3861,6 +3868,7 @@ static int virtio_net_load_state(QEMUFile *f, void *opaque, int version_id)
> case VNET_MIG_F_INIT_STATE:
> {
> vmstate_load_state(f, &vmstate_virtio_net, n, VIRTIO_NET_VM_VERSION);
> + mig->iterative_vmstate_loaded = true;
This code will need to change if we send the status iteratively more
than once. For example, if the guest changes the mac address, the
number of vqs, etc.
In my opinion, we should set a flag named "in_iterative_migration" (or
equivalent) in virtio_net_load_setup and clear it in
virtio_net_load_cleanup. That's enough to tell in virtio_load if we
should perform actions like checking for inconsistent indices.
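Something along these lines (sketch only; "in_iterative_migration" would be a
new, hypothetical bool on VirtIODevice):

/* Sketch: set the flag for the whole incoming-iterative window and test it
 * where virtio_load() does the index consistency check today. */
static int virtio_net_load_setup(QEMUFile *f, void *opaque, Error **errp)
{
    VirtIONet *n = opaque;
    VirtIODevice *vdev = VIRTIO_DEVICE(n);

    vdev->in_iterative_migration = true;
    return 0;
}

static int virtio_net_load_cleanup(void *opaque)
{
    VirtIONet *n = opaque;
    VirtIODevice *vdev = VIRTIO_DEVICE(n);

    vdev->in_iterative_migration = false;
    return 0;
}

/* ...and in virtio_load(): */
    if (!vdev->in_iterative_migration) {
        /* keep the Guest/Host index consistency check as it is today */
    }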
> break;
> }
> default:
> @@ -3875,6 +3883,11 @@ static int virtio_net_load_state(QEMUFile *f, void *opaque, int version_id)
>
> static int virtio_net_load_cleanup(void *opaque)
> {
> + VirtIONet *n = opaque;
> + VirtIODevice *vdev = VIRTIO_DEVICE(n);
> + g_free(vdev->migration);
> + vdev->migration = NULL;
> +
> return 0;
> }
>
> diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
> index 5534251e01..68957ee7d1 100644
> --- a/hw/virtio/virtio.c
> +++ b/hw/virtio/virtio.c
> @@ -3222,6 +3222,7 @@ virtio_load(VirtIODevice *vdev, QEMUFile *f, int version_id)
> int32_t config_len;
> uint32_t num;
> uint32_t features;
> + bool inconsistent_indices;
> BusState *qbus = qdev_get_parent_bus(DEVICE(vdev));
> VirtioBusClass *k = VIRTIO_BUS_GET_CLASS(qbus);
> VirtioDeviceClass *vdc = VIRTIO_DEVICE_GET_CLASS(vdev);
> @@ -3365,6 +3366,16 @@ virtio_load(VirtIODevice *vdev, QEMUFile *f, int version_id)
> if (vdev->vq[i].vring.desc) {
> uint16_t nheads;
>
> + /*
> + * Ring indices will be inconsistent during iterative migration. The actual
> + * indices will be sent later during the stop-and-copy phase.
> + */
> + if (vdev->migration) {
> + inconsistent_indices = !vdev->migration->iterative_vmstate_loaded;
> + } else {
> + inconsistent_indices = false;
> + }
Nit, "inconsistent_indices = vdev->migration &&
!vdev->migration->iterative_vmstate_loaded" ? I'm happy with the
current "if else" too, but I think the one line is clearer. Your call
:).
> +
> /*
> * VIRTIO-1 devices migrate desc, used, and avail ring addresses so
> * only the region cache needs to be set up. Legacy devices need
> @@ -3384,14 +3395,19 @@ virtio_load(VirtIODevice *vdev, QEMUFile *f, int version_id)
> continue;
> }
>
> - nheads = vring_avail_idx(&vdev->vq[i]) - vdev->vq[i].last_avail_idx;
> - /* Check it isn't doing strange things with descriptor numbers. */
> - if (nheads > vdev->vq[i].vring.num) {
> - virtio_error(vdev, "VQ %d size 0x%x Guest index 0x%x "
> - "inconsistent with Host index 0x%x: delta 0x%x",
> - i, vdev->vq[i].vring.num,
> - vring_avail_idx(&vdev->vq[i]),
> - vdev->vq[i].last_avail_idx, nheads);
> + if (!inconsistent_indices) {
> + nheads = vring_avail_idx(&vdev->vq[i]) - vdev->vq[i].last_avail_idx;
> + /* Check it isn't doing strange things with descriptor numbers. */
> + if (nheads > vdev->vq[i].vring.num) {
> + virtio_error(vdev, "VQ %d size 0x%x Guest index 0x%x "
> + "inconsistent with Host index 0x%x: delta 0x%x",
> + i, vdev->vq[i].vring.num,
> + vring_avail_idx(&vdev->vq[i]),
> + vdev->vq[i].last_avail_idx, nheads);
> + inconsistent_indices = true;
> + }
> + }
> + if (inconsistent_indices) {
> vdev->vq[i].used_idx = 0;
> vdev->vq[i].shadow_avail_idx = 0;
> vdev->vq[i].inuse = 0;
> diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
> index 214d4a77e9..06b6e6ba65 100644
> --- a/include/hw/virtio/virtio.h
> +++ b/include/hw/virtio/virtio.h
> @@ -98,6 +98,11 @@ enum virtio_device_endian {
> VIRTIO_DEVICE_ENDIAN_BIG,
> };
>
> +/* VirtIODevice iterative live migration data structure */
> +typedef struct VirtIODevMigration {
> + bool iterative_vmstate_loaded;
> +} VirtIODevMigration;
> +
> /**
> * struct VirtIODevice - common VirtIO structure
> * @name: name of the device
> @@ -151,6 +156,7 @@ struct VirtIODevice
> bool disable_legacy_check;
> bool vhost_started;
> VMChangeStateEntry *vmstate;
> + VirtIODevMigration *migration;
> char *bus_name;
> uint8_t device_endian;
> /**
> --
> 2.47.1
>
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC 0/6] virtio-net: initial iterative live migration support
2025-07-28 14:51 ` Eugenio Perez Martin
@ 2025-07-28 15:38 ` Eugenio Perez Martin
2025-07-29 2:38 ` Jason Wang
1 sibling, 0 replies; 66+ messages in thread
From: Eugenio Perez Martin @ 2025-07-28 15:38 UTC (permalink / raw)
To: Jason Wang
Cc: Michael S. Tsirkin, Jonah Palmer, qemu-devel, peterx, farosas,
eblake, armbru, si-wei.liu, boris.ostrovsky
On Mon, Jul 28, 2025 at 4:51 PM Eugenio Perez Martin
<eperezma@redhat.com> wrote:
>
> On Mon, Jul 28, 2025 at 9:36 AM Jason Wang <jasowang@redhat.com> wrote:
> >
> > On Mon, Jul 28, 2025 at 3:09 PM Jason Wang <jasowang@redhat.com> wrote:
> > >
> > > On Fri, Jul 25, 2025 at 5:33 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> > > >
> > > > On Thu, Jul 24, 2025 at 05:59:20PM -0400, Jonah Palmer wrote:
> > > > >
> > > > >
> > > > > On 7/23/25 1:51 AM, Jason Wang wrote:
> > > > > > On Tue, Jul 22, 2025 at 8:41 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
> > > > > > >
> > > > > > > This series is an RFC initial implementation of iterative live
> > > > > > > migration for virtio-net devices.
> > > > > > >
> > > > > > > The main motivation behind implementing iterative migration for
> > > > > > > virtio-net devices is to start on heavy, time-consuming operations
> > > > > > > for the destination while the source is still active (i.e. before
> > > > > > > the stop-and-copy phase).
> > > > > >
> > > > > > It would be better to explain which kind of operations were heavy and
> > > > > > time-consuming and how iterative migration help.
> > > > > >
> > > > >
> > > > > You're right. Apologies for being vague here.
> > > > >
> > > > > I did do some profiling of the virtio_load call for virtio-net to try and
> > > > > narrow down where exactly most of the downtime is coming from during the
> > > > > stop-and-copy phase.
> > > > >
> > > > > Pretty much the entirety of the downtime comes from the vmstate_load_state
> > > > > call for the vmstate_virtio's subsections:
> > > > >
> > > > > /* Subsections */
> > > > > ret = vmstate_load_state(f, &vmstate_virtio, vdev, 1);
> > > > > if (ret) {
> > > > > return ret;
> > > > > }
> > > > >
> > > > > More specifically, the vmstate_virtio_virtqueues and
> > > > > vmstate_virtio_extra_state subsections.
> > > > >
> > > > > For example, currently (with no iterative migration), for a virtio-net
> > > > > device, the virtio_load call took 13.29ms to finish. 13.20ms of that time
> > > > > was spent in vmstate_load_state(f, &vmstate_virtio, vdev, 1).
> > > > >
> > > > > Of that 13.21ms, ~6.83ms was spent migrating vmstate_virtio_virtqueues and
> > > > > ~6.33ms was spent migrating the vmstate_virtio_extra_state subsections. And
> > > > > I believe this is from walking VIRTIO_QUEUE_MAX virtqueues, twice.
> > > >
> > > > Can we optimize it simply by sending a bitmap of used vqs?
> > >
> > > +1.
> > >
> > > For example devices like virtio-net may know exactly the number of
> > > virtqueues that will be used.
> >
> > Ok, I think it comes from the following subsections:
> >
> > static const VMStateDescription vmstate_virtio_virtqueues = {
> > .name = "virtio/virtqueues",
> > .version_id = 1,
> > .minimum_version_id = 1,
> > .needed = &virtio_virtqueue_needed,
> > .fields = (const VMStateField[]) {
> > VMSTATE_STRUCT_VARRAY_POINTER_KNOWN(vq, struct VirtIODevice,
> > VIRTIO_QUEUE_MAX, 0, vmstate_virtqueue, VirtQueue),
> > VMSTATE_END_OF_LIST()
> > }
> > };
> >
> > static const VMStateDescription vmstate_virtio_packed_virtqueues = {
> > .name = "virtio/packed_virtqueues",
> > .version_id = 1,
> > .minimum_version_id = 1,
> > .needed = &virtio_packed_virtqueue_needed,
> > .fields = (const VMStateField[]) {
> > VMSTATE_STRUCT_VARRAY_POINTER_KNOWN(vq, struct VirtIODevice,
> > VIRTIO_QUEUE_MAX, 0, vmstate_packed_virtqueue, VirtQueue),
> > VMSTATE_END_OF_LIST()
> > }
> > };
> >
> > A rough idea is to disable those subsections and use new subsections
> > instead (and do the compatibility work) like virtio_save():
> >
> > for (i = 0; i < VIRTIO_QUEUE_MAX; i++) {
> > if (vdev->vq[i].vring.num == 0)
> > break;
> > }
> >
> > qemu_put_be32(f, i);
> > ....
> >
>
> While I think this is a very good area to explore, I think we will get
> more benefits by pre-warming vhost-vdpa devices, as they take one or
> two orders of magnitude more than sending and processing the
> virtio-net state (1s~10s vs 10~100ms).
Expanding on this:
This is a great base to start from! My proposal is to take these as the
next steps:
1) Track in the destination which members change between the vmstate sent
in the iterative phase and the one sent in the downtime phase. I would
start by creating a copy of the last VirtIODevice and VirtIONet, at least
for a first RFC on top of this one (a rough sketch follows below).
2) Start the vhost-vdpa net device by the time the first iterative
state reaches us. Just creating the virtqueues should already be a
noticeable win, but sending the CVQ messages here gives us even more
downtime reduction. Do not send these messages during the downtime if
the properties have not changed!
If we want to take a small detour to make a first revision simpler, we
could do the same with vhost-net first: pre-warm the device creation
with the mq setup, etc.
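For step 1), a rough sketch of the kind of comparison I mean (hypothetical
struct and helper names; real code would have to cover far more fields):

/* Sketch: keep a copy of what was last sent and flag only what changed. */
typedef struct VirtIONetMigLast {
    uint8_t mac[ETH_ALEN];
    uint16_t curr_queue_pairs;
} VirtIONetMigLast;

static bool virtio_net_mig_state_changed(VirtIONet *n, VirtIONetMigLast *last)
{
    bool changed = false;

    if (memcmp(n->mac, last->mac, ETH_ALEN) != 0) {
        memcpy(last->mac, n->mac, ETH_ALEN);
        changed = true;          /* e.g. MAC/CVQ state must be re-sent */
    }
    if (n->curr_queue_pairs != last->curr_queue_pairs) {
        last->curr_queue_pairs = n->curr_queue_pairs;
        changed = true;          /* e.g. the mq setup must be re-sent */
    }
    return changed;
}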
Thanks!
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC 5/6] virtio,virtio-net: skip consistency check in virtio_load for iterative migration
2025-07-28 15:30 ` [RFC 5/6] virtio,virtio-net: " Eugenio Perez Martin
@ 2025-07-28 16:23 ` Jonah Palmer
2025-07-30 8:59 ` Eugenio Perez Martin
0 siblings, 1 reply; 66+ messages in thread
From: Jonah Palmer @ 2025-07-28 16:23 UTC (permalink / raw)
To: Eugenio Perez Martin
Cc: qemu-devel, peterx, farosas, eblake, armbru, jasowang, mst,
si-wei.liu, boris.ostrovsky
On 7/28/25 11:30 AM, Eugenio Perez Martin wrote:
> On Tue, Jul 22, 2025 at 2:41 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
>>
>> Iterative live migration for virtio-net sends an initial
>> VMStateDescription while the source is still active. Because data
>> continues to flow for virtio-net, the guest's avail index continues to
>> increment after last_avail_idx had already been sent. This causes the
>> destination to often see something like this from virtio_error():
>>
>> VQ 0 size 0x100 Guest index 0x0 inconsistent with Host index 0xc: delta 0xfff4
>>
>> This patch suppresses this consistency check if we're loading the
>> initial VMStateDescriptions via iterative migration and unsuppresses
>> it for the stop-and-copy phase when the final VMStateDescriptions
>> (carrying the correct indices) are loaded.
>>
>> A temporary VirtIODevMigration migration data structure is introduced here to
>> represent the iterative migration process for a VirtIODevice. For now it
>> just holds a flag to indicate whether or not the initial
>> VMStateDescription was sent during the iterative live migration process.
>>
>> Signed-off-by: Jonah Palmer <jonah.palmer@oracle.com>
>> ---
>> hw/net/virtio-net.c | 13 +++++++++++++
>> hw/virtio/virtio.c | 32 ++++++++++++++++++++++++--------
>> include/hw/virtio/virtio.h | 6 ++++++
>> 3 files changed, 43 insertions(+), 8 deletions(-)
>>
>> diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
>> index 86a6fe5b91..b7ac5e8278 100644
>> --- a/hw/net/virtio-net.c
>> +++ b/hw/net/virtio-net.c
>> @@ -3843,12 +3843,19 @@ static void virtio_net_save_cleanup(void *opaque)
>>
>> static int virtio_net_load_setup(QEMUFile *f, void *opaque, Error **errp)
>> {
>> + VirtIONet *n = opaque;
>> + VirtIODevice *vdev = VIRTIO_DEVICE(n);
>> + vdev->migration = g_new0(VirtIODevMigration, 1);
>> + vdev->migration->iterative_vmstate_loaded = false;
>> +
>> return 0;
>> }
>>
>> static int virtio_net_load_state(QEMUFile *f, void *opaque, int version_id)
>> {
>> VirtIONet *n = opaque;
>> + VirtIODevice *vdev = VIRTIO_DEVICE(n);
>> + VirtIODevMigration *mig = vdev->migration;
>> uint64_t flag;
>>
>> flag = qemu_get_be64(f);
>> @@ -3861,6 +3868,7 @@ static int virtio_net_load_state(QEMUFile *f, void *opaque, int version_id)
>> case VNET_MIG_F_INIT_STATE:
>> {
>> vmstate_load_state(f, &vmstate_virtio_net, n, VIRTIO_NET_VM_VERSION);
>> + mig->iterative_vmstate_loaded = true;
>
> This code will need to change if we send the status iteratively more
> than once. For example, if the guest changes the mac address, the
> number of vqs, etc.
>
Hopefully we can reach a solution where we'd only need to call the full
vmstate_load_state(f, &vmstate_virtio_net, ...) for a virtio-net device
once and then handle any changes afterwards individually.
Perhaps, for simplicity, we could just send the
sub-states/subsections (instead of the whole state again) iteratively if
there were any changes in the fields that those sub-states/subsections
govern.
Definitely something I'll keep in mind as this series develops.
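For example (rough sketch; VNET_MIG_F_SUBSECTION and the subsection VMSD name
are placeholders, reusing the be64 flag framing this series already uses):

/* Source side sketch: re-send only the subsection whose fields changed. */
static void virtio_net_save_changed_subsection(QEMUFile *f, VirtIONet *n)
{
    qemu_put_be64(f, VNET_MIG_F_SUBSECTION);
    vmstate_save_state(f, &vmstate_virtio_net_some_subsection, n, NULL);
}

/* Destination side sketch, as another case in virtio_net_load_state(): */
    case VNET_MIG_F_SUBSECTION:
        vmstate_load_state(f, &vmstate_virtio_net_some_subsection, n, 1);
        break;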
> In my opinion, we should set a flag named "in_iterative_migration" (or
> equivalent) in virtio_net_load_setup and clear it in
> virtio_net_load_cleanup. That's enough to tell in virtio_load if we
> should perform actions like checking for inconsistent indices.
>
I did try something like this, but I realized that the .load_cleanup
and .save_cleanup hooks actually fire at the very end of live migration
(i.e. during the stop-and-copy phase). I thought they fired at the end
of the iterative portion of live migration, but that didn't appear to
be the case.
>> break;
>> }
>> default:
>> @@ -3875,6 +3883,11 @@ static int virtio_net_load_state(QEMUFile *f, void *opaque, int version_id)
>>
>> static int virtio_net_load_cleanup(void *opaque)
>> {
>> + VirtIONet *n = opaque;
>> + VirtIODevice *vdev = VIRTIO_DEVICE(n);
>> + g_free(vdev->migration);
>> + vdev->migration = NULL;
>> +
>> return 0;
>> }
>>
>> diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
>> index 5534251e01..68957ee7d1 100644
>> --- a/hw/virtio/virtio.c
>> +++ b/hw/virtio/virtio.c
>> @@ -3222,6 +3222,7 @@ virtio_load(VirtIODevice *vdev, QEMUFile *f, int version_id)
>> int32_t config_len;
>> uint32_t num;
>> uint32_t features;
>> + bool inconsistent_indices;
>> BusState *qbus = qdev_get_parent_bus(DEVICE(vdev));
>> VirtioBusClass *k = VIRTIO_BUS_GET_CLASS(qbus);
>> VirtioDeviceClass *vdc = VIRTIO_DEVICE_GET_CLASS(vdev);
>> @@ -3365,6 +3366,16 @@ virtio_load(VirtIODevice *vdev, QEMUFile *f, int version_id)
>> if (vdev->vq[i].vring.desc) {
>> uint16_t nheads;
>>
>> + /*
>> + * Ring indices will be inconsistent during iterative migration. The actual
>> + * indices will be sent later during the stop-and-copy phase.
>> + */
>> + if (vdev->migration) {
>> + inconsistent_indices = !vdev->migration->iterative_vmstate_loaded;
>> + } else {
>> + inconsistent_indices = false;
>> + }
>
> Nit, "inconsistent_indices = vdev->migration &&
> !vdev->migration->iterative_vmstate_loaded" ? I'm happy with the
> current "if else" too, but I think the one line is clearer. Your call
> :).
>
Ah, nice catch! I like the one-liner more :) Will change this for next
series.
>> +
>> /*
>> * VIRTIO-1 devices migrate desc, used, and avail ring addresses so
>> * only the region cache needs to be set up. Legacy devices need
>> @@ -3384,14 +3395,19 @@ virtio_load(VirtIODevice *vdev, QEMUFile *f, int version_id)
>> continue;
>> }
>>
>> - nheads = vring_avail_idx(&vdev->vq[i]) - vdev->vq[i].last_avail_idx;
>> - /* Check it isn't doing strange things with descriptor numbers. */
>> - if (nheads > vdev->vq[i].vring.num) {
>> - virtio_error(vdev, "VQ %d size 0x%x Guest index 0x%x "
>> - "inconsistent with Host index 0x%x: delta 0x%x",
>> - i, vdev->vq[i].vring.num,
>> - vring_avail_idx(&vdev->vq[i]),
>> - vdev->vq[i].last_avail_idx, nheads);
>> + if (!inconsistent_indices) {
>> + nheads = vring_avail_idx(&vdev->vq[i]) - vdev->vq[i].last_avail_idx;
>> + /* Check it isn't doing strange things with descriptor numbers. */
>> + if (nheads > vdev->vq[i].vring.num) {
>> + virtio_error(vdev, "VQ %d size 0x%x Guest index 0x%x "
>> + "inconsistent with Host index 0x%x: delta 0x%x",
>> + i, vdev->vq[i].vring.num,
>> + vring_avail_idx(&vdev->vq[i]),
>> + vdev->vq[i].last_avail_idx, nheads);
>> + inconsistent_indices = true;
>> + }
>> + }
>> + if (inconsistent_indices) {
>> vdev->vq[i].used_idx = 0;
>> vdev->vq[i].shadow_avail_idx = 0;
>> vdev->vq[i].inuse = 0;
>> diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
>> index 214d4a77e9..06b6e6ba65 100644
>> --- a/include/hw/virtio/virtio.h
>> +++ b/include/hw/virtio/virtio.h
>> @@ -98,6 +98,11 @@ enum virtio_device_endian {
>> VIRTIO_DEVICE_ENDIAN_BIG,
>> };
>>
>> +/* VirtIODevice iterative live migration data structure */
>> +typedef struct VirtIODevMigration {
>> + bool iterative_vmstate_loaded;
>> +} VirtIODevMigration;
>> +
>> /**
>> * struct VirtIODevice - common VirtIO structure
>> * @name: name of the device
>> @@ -151,6 +156,7 @@ struct VirtIODevice
>> bool disable_legacy_check;
>> bool vhost_started;
>> VMChangeStateEntry *vmstate;
>> + VirtIODevMigration *migration;
>> char *bus_name;
>> uint8_t device_endian;
>> /**
>> --
>> 2.47.1
>>
>
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC 0/6] virtio-net: initial iterative live migration support
2025-07-28 14:51 ` Eugenio Perez Martin
2025-07-28 15:38 ` Eugenio Perez Martin
@ 2025-07-29 2:38 ` Jason Wang
2025-07-29 12:41 ` Jonah Palmer
1 sibling, 1 reply; 66+ messages in thread
From: Jason Wang @ 2025-07-29 2:38 UTC (permalink / raw)
To: Eugenio Perez Martin
Cc: Michael S. Tsirkin, Jonah Palmer, qemu-devel, peterx, farosas,
eblake, armbru, si-wei.liu, boris.ostrovsky
On Mon, Jul 28, 2025 at 10:51 PM Eugenio Perez Martin
<eperezma@redhat.com> wrote:
>
> On Mon, Jul 28, 2025 at 9:36 AM Jason Wang <jasowang@redhat.com> wrote:
> >
> > On Mon, Jul 28, 2025 at 3:09 PM Jason Wang <jasowang@redhat.com> wrote:
> > >
> > > On Fri, Jul 25, 2025 at 5:33 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> > > >
> > > > On Thu, Jul 24, 2025 at 05:59:20PM -0400, Jonah Palmer wrote:
> > > > >
> > > > >
> > > > > On 7/23/25 1:51 AM, Jason Wang wrote:
> > > > > > On Tue, Jul 22, 2025 at 8:41 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
> > > > > > >
> > > > > > > This series is an RFC initial implementation of iterative live
> > > > > > > migration for virtio-net devices.
> > > > > > >
> > > > > > > The main motivation behind implementing iterative migration for
> > > > > > > virtio-net devices is to start on heavy, time-consuming operations
> > > > > > > for the destination while the source is still active (i.e. before
> > > > > > > the stop-and-copy phase).
> > > > > >
> > > > > > It would be better to explain which kind of operations were heavy and
> > > > > > time-consuming and how iterative migration help.
> > > > > >
> > > > >
> > > > > You're right. Apologies for being vague here.
> > > > >
> > > > > I did do some profiling of the virtio_load call for virtio-net to try and
> > > > > narrow down where exactly most of the downtime is coming from during the
> > > > > stop-and-copy phase.
> > > > >
> > > > > Pretty much the entirety of the downtime comes from the vmstate_load_state
> > > > > call for the vmstate_virtio's subsections:
> > > > >
> > > > > /* Subsections */
> > > > > ret = vmstate_load_state(f, &vmstate_virtio, vdev, 1);
> > > > > if (ret) {
> > > > > return ret;
> > > > > }
> > > > >
> > > > > More specifically, the vmstate_virtio_virtqueues and
> > > > > vmstate_virtio_extra_state subsections.
> > > > >
> > > > > For example, currently (with no iterative migration), for a virtio-net
> > > > > device, the virtio_load call took 13.29ms to finish. 13.20ms of that time
> > > > > was spent in vmstate_load_state(f, &vmstate_virtio, vdev, 1).
> > > > >
> > > > > Of that 13.21ms, ~6.83ms was spent migrating vmstate_virtio_virtqueues and
> > > > > ~6.33ms was spent migrating the vmstate_virtio_extra_state subsections. And
> > > > > I believe this is from walking VIRTIO_QUEUE_MAX virtqueues, twice.
> > > >
> > > > Can we optimize it simply by sending a bitmap of used vqs?
> > >
> > > +1.
> > >
> > > For example devices like virtio-net may know exactly the number of
> > > virtqueues that will be used.
> >
> > Ok, I think it comes from the following subsections:
> >
> > static const VMStateDescription vmstate_virtio_virtqueues = {
> > .name = "virtio/virtqueues",
> > .version_id = 1,
> > .minimum_version_id = 1,
> > .needed = &virtio_virtqueue_needed,
> > .fields = (const VMStateField[]) {
> > VMSTATE_STRUCT_VARRAY_POINTER_KNOWN(vq, struct VirtIODevice,
> > VIRTIO_QUEUE_MAX, 0, vmstate_virtqueue, VirtQueue),
> > VMSTATE_END_OF_LIST()
> > }
> > };
> >
> > static const VMStateDescription vmstate_virtio_packed_virtqueues = {
> > .name = "virtio/packed_virtqueues",
> > .version_id = 1,
> > .minimum_version_id = 1,
> > .needed = &virtio_packed_virtqueue_needed,
> > .fields = (const VMStateField[]) {
> > VMSTATE_STRUCT_VARRAY_POINTER_KNOWN(vq, struct VirtIODevice,
> > VIRTIO_QUEUE_MAX, 0, vmstate_packed_virtqueue, VirtQueue),
> > VMSTATE_END_OF_LIST()
> > }
> > };
> >
> > A rough idea is to disable those subsections and use new subsections
> > instead (and do the compatibility work) like virtio_save():
> >
> > for (i = 0; i < VIRTIO_QUEUE_MAX; i++) {
> > if (vdev->vq[i].vring.num == 0)
> > break;
> > }
> >
> > qemu_put_be32(f, i);
> > ....
> >
>
> While I think this is a very good area to explore, I think we will get
> more benefits by pre-warming vhost-vdpa devices, as they take one or
> two orders of magnitude more than sending and processing the
> virtio-net state (1s~10s vs 10~100ms).
Yes, but note that Jonah does the testing on a software virtio device.
Thanks
>
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC 0/6] virtio-net: initial iterative live migration support
2025-07-29 2:38 ` Jason Wang
@ 2025-07-29 12:41 ` Jonah Palmer
0 siblings, 0 replies; 66+ messages in thread
From: Jonah Palmer @ 2025-07-29 12:41 UTC (permalink / raw)
To: Jason Wang, Eugenio Perez Martin
Cc: Michael S. Tsirkin, qemu-devel, peterx, farosas, eblake, armbru,
si-wei.liu, boris.ostrovsky
On 7/28/25 10:38 PM, Jason Wang wrote:
> On Mon, Jul 28, 2025 at 10:51 PM Eugenio Perez Martin
> <eperezma@redhat.com> wrote:
>>
>> On Mon, Jul 28, 2025 at 9:36 AM Jason Wang <jasowang@redhat.com> wrote:
>>>
>>> On Mon, Jul 28, 2025 at 3:09 PM Jason Wang <jasowang@redhat.com> wrote:
>>>>
>>>> On Fri, Jul 25, 2025 at 5:33 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>>>>>
>>>>> On Thu, Jul 24, 2025 at 05:59:20PM -0400, Jonah Palmer wrote:
>>>>>>
>>>>>>
>>>>>> On 7/23/25 1:51 AM, Jason Wang wrote:
>>>>>>> On Tue, Jul 22, 2025 at 8:41 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
>>>>>>>>
>>>>>>>> This series is an RFC initial implementation of iterative live
>>>>>>>> migration for virtio-net devices.
>>>>>>>>
>>>>>>>> The main motivation behind implementing iterative migration for
>>>>>>>> virtio-net devices is to start on heavy, time-consuming operations
>>>>>>>> for the destination while the source is still active (i.e. before
>>>>>>>> the stop-and-copy phase).
>>>>>>>
>>>>>>> It would be better to explain which kind of operations were heavy and
>>>>>>> time-consuming and how iterative migration help.
>>>>>>>
>>>>>>
>>>>>> You're right. Apologies for being vague here.
>>>>>>
>>>>>> I did do some profiling of the virtio_load call for virtio-net to try and
>>>>>> narrow down where exactly most of the downtime is coming from during the
>>>>>> stop-and-copy phase.
>>>>>>
>>>>>> Pretty much the entirety of the downtime comes from the vmstate_load_state
>>>>>> call for the vmstate_virtio's subsections:
>>>>>>
>>>>>> /* Subsections */
>>>>>> ret = vmstate_load_state(f, &vmstate_virtio, vdev, 1);
>>>>>> if (ret) {
>>>>>> return ret;
>>>>>> }
>>>>>>
>>>>>> More specifically, the vmstate_virtio_virtqueues and
>>>>>> vmstate_virtio_extra_state subsections.
>>>>>>
>>>>>> For example, currently (with no iterative migration), for a virtio-net
>>>>>> device, the virtio_load call took 13.29ms to finish. 13.20ms of that time
>>>>>> was spent in vmstate_load_state(f, &vmstate_virtio, vdev, 1).
>>>>>>
>>>>>> Of that 13.21ms, ~6.83ms was spent migrating vmstate_virtio_virtqueues and
>>>>>> ~6.33ms was spent migrating the vmstate_virtio_extra_state subsections. And
>>>>>> I believe this is from walking VIRTIO_QUEUE_MAX virtqueues, twice.
>>>>>
>>>>> Can we optimize it simply by sending a bitmap of used vqs?
>>>>
>>>> +1.
>>>>
>>>> For example devices like virtio-net may know exactly the number of
>>>> virtqueues that will be used.
>>>
>>> Ok, I think it comes from the following subsections:
>>>
>>> static const VMStateDescription vmstate_virtio_virtqueues = {
>>> .name = "virtio/virtqueues",
>>> .version_id = 1,
>>> .minimum_version_id = 1,
>>> .needed = &virtio_virtqueue_needed,
>>> .fields = (const VMStateField[]) {
>>> VMSTATE_STRUCT_VARRAY_POINTER_KNOWN(vq, struct VirtIODevice,
>>> VIRTIO_QUEUE_MAX, 0, vmstate_virtqueue, VirtQueue),
>>> VMSTATE_END_OF_LIST()
>>> }
>>> };
>>>
>>> static const VMStateDescription vmstate_virtio_packed_virtqueues = {
>>> .name = "virtio/packed_virtqueues",
>>> .version_id = 1,
>>> .minimum_version_id = 1,
>>> .needed = &virtio_packed_virtqueue_needed,
>>> .fields = (const VMStateField[]) {
>>> VMSTATE_STRUCT_VARRAY_POINTER_KNOWN(vq, struct VirtIODevice,
>>> VIRTIO_QUEUE_MAX, 0, vmstate_packed_virtqueue, VirtQueue),
>>> VMSTATE_END_OF_LIST()
>>> }
>>> };
>>>
>>> A rough idea is to disable those subsections and use new subsections
>>> instead (and do the compatibility work) like virtio_save():
>>>
>>> for (i = 0; i < VIRTIO_QUEUE_MAX; i++) {
>>> if (vdev->vq[i].vring.num == 0)
>>> break;
>>> }
>>>
>>> qemu_put_be32(f, i);
>>> ....
>>>
>>
>> While I think this is a very good area to explore, I think we will get
>> more benefits by pre-warming vhost-vdpa devices, as they take one or
>> two orders of magnitude more than sending and processing the
>> virtio-net state (1s~10s vs 10~100ms).
>
> Yes, but note that Jonah does the testing on a software virtio device.
>
> Thanks
>
All good. I also have MLX vDPA hardware for testing this on.
Jonah
>>
>
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC 5/6] virtio,virtio-net: skip consistency check in virtio_load for iterative migration
2025-07-28 16:23 ` Jonah Palmer
@ 2025-07-30 8:59 ` Eugenio Perez Martin
0 siblings, 0 replies; 66+ messages in thread
From: Eugenio Perez Martin @ 2025-07-30 8:59 UTC (permalink / raw)
To: Jonah Palmer
Cc: qemu-devel, peterx, farosas, eblake, armbru, jasowang, mst,
si-wei.liu, boris.ostrovsky
On Mon, Jul 28, 2025 at 6:24 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
>
>
>
> On 7/28/25 11:30 AM, Eugenio Perez Martin wrote:
> > On Tue, Jul 22, 2025 at 2:41 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
> >>
> >> Iterative live migration for virtio-net sends an initial
> >> VMStateDescription while the source is still active. Because data
> >> continues to flow for virtio-net, the guest's avail index continues to
> >> increment after last_avail_idx had already been sent. This causes the
> >> destination to often see something like this from virtio_error():
> >>
> >> VQ 0 size 0x100 Guest index 0x0 inconsistent with Host index 0xc: delta 0xfff4
> >>
> >> This patch suppresses this consistency check if we're loading the
> >> initial VMStateDescriptions via iterative migration and unsuppresses
> >> it for the stop-and-copy phase when the final VMStateDescriptions
> >> (carrying the correct indices) are loaded.
> >>
> >> A temporary VirtIODevMigration migration data structure is introduced here to
> >> represent the iterative migration process for a VirtIODevice. For now it
> >> just holds a flag to indicate whether or not the initial
> >> VMStateDescription was sent during the iterative live migration process.
> >>
> >> Signed-off-by: Jonah Palmer <jonah.palmer@oracle.com>
> >> ---
> >> hw/net/virtio-net.c | 13 +++++++++++++
> >> hw/virtio/virtio.c | 32 ++++++++++++++++++++++++--------
> >> include/hw/virtio/virtio.h | 6 ++++++
> >> 3 files changed, 43 insertions(+), 8 deletions(-)
> >>
> >> diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
> >> index 86a6fe5b91..b7ac5e8278 100644
> >> --- a/hw/net/virtio-net.c
> >> +++ b/hw/net/virtio-net.c
> >> @@ -3843,12 +3843,19 @@ static void virtio_net_save_cleanup(void *opaque)
> >>
> >> static int virtio_net_load_setup(QEMUFile *f, void *opaque, Error **errp)
> >> {
> >> + VirtIONet *n = opaque;
> >> + VirtIODevice *vdev = VIRTIO_DEVICE(n);
> >> + vdev->migration = g_new0(VirtIODevMigration, 1);
> >> + vdev->migration->iterative_vmstate_loaded = false;
> >> +
> >> return 0;
> >> }
> >>
> >> static int virtio_net_load_state(QEMUFile *f, void *opaque, int version_id)
> >> {
> >> VirtIONet *n = opaque;
> >> + VirtIODevice *vdev = VIRTIO_DEVICE(n);
> >> + VirtIODevMigration *mig = vdev->migration;
> >> uint64_t flag;
> >>
> >> flag = qemu_get_be64(f);
> >> @@ -3861,6 +3868,7 @@ static int virtio_net_load_state(QEMUFile *f, void *opaque, int version_id)
> >> case VNET_MIG_F_INIT_STATE:
> >> {
> >> vmstate_load_state(f, &vmstate_virtio_net, n, VIRTIO_NET_VM_VERSION);
> >> + mig->iterative_vmstate_loaded = true;
> >
> > This code will need to change if we send the status iteratively more
> > than once. For example, if the guest changes the mac address, the
> > number of vqs, etc.
> >
>
> Hopefully we can reach a solution where we'd only need to call the full
> vmstate_load_state(f, &vmstate_virtio_net, ...) for a virtio-net device
> once and then handle any changes afterwards individually.
>
> Perhaps, maybe for simplicity, we could just send the
> sub-states/subsections (instead of the whole state again) iteratively if
> there were any changes in the fields that those sub-states/subsections
> govern.
>
> Definitely something I'll keep in mind as this series develops.
>
> > In my opinion, we should set a flag named "in_iterative_migration" (or
> > equivalent) in virtio_net_load_setup and clear it in
> > virtio_net_load_cleanup. That's enough to tell in virtio_load if we
> > should perform actions like checking for inconsistent indices.
> >
>
> I did actually try something like this but I realized that the
> .load_cleanup and .save_cleanup hooks actually fire at the very end of
> live migration (e.g. during the stop-and-copy phase). I thought they
> fired at the end of the iterative portion of live migration, but this
> didn't appear to be the case.
>
Ok, that makes a lot of sense. What about .switchover_start? We would
need the switchover capability though, and I'm not sure it is a good
idea to mandate it as a requirement. So yes, maybe this patch is the
most reliable way to do it.
> >> break;
> >> }
> >> default:
> >> @@ -3875,6 +3883,11 @@ static int virtio_net_load_state(QEMUFile *f, void *opaque, int version_id)
> >>
> >> static int virtio_net_load_cleanup(void *opaque)
> >> {
> >> + VirtIONet *n = opaque;
> >> + VirtIODevice *vdev = VIRTIO_DEVICE(n);
> >> + g_free(vdev->migration);
> >> + vdev->migration = NULL;
> >> +
> >> return 0;
> >> }
> >>
> >> diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
> >> index 5534251e01..68957ee7d1 100644
> >> --- a/hw/virtio/virtio.c
> >> +++ b/hw/virtio/virtio.c
> >> @@ -3222,6 +3222,7 @@ virtio_load(VirtIODevice *vdev, QEMUFile *f, int version_id)
> >> int32_t config_len;
> >> uint32_t num;
> >> uint32_t features;
> >> + bool inconsistent_indices;
> >> BusState *qbus = qdev_get_parent_bus(DEVICE(vdev));
> >> VirtioBusClass *k = VIRTIO_BUS_GET_CLASS(qbus);
> >> VirtioDeviceClass *vdc = VIRTIO_DEVICE_GET_CLASS(vdev);
> >> @@ -3365,6 +3366,16 @@ virtio_load(VirtIODevice *vdev, QEMUFile *f, int version_id)
> >> if (vdev->vq[i].vring.desc) {
> >> uint16_t nheads;
> >>
> >> + /*
> >> + * Ring indices will be inconsistent during iterative migration. The actual
> >> + * indices will be sent later during the stop-and-copy phase.
> >> + */
> >> + if (vdev->migration) {
> >> + inconsistent_indices = !vdev->migration->iterative_vmstate_loaded;
> >> + } else {
> >> + inconsistent_indices = false;
> >> + }
> >
> > Nit, "inconsistent_indices = vdev->migration &&
> > !vdev->migration->iterative_vmstate_loaded" ? I'm happy with the
> > current "if else" too, but I think the one line is clearer. Your call
> > :).
> >
>
> Ah, nice catch! I like the one-liner more :) Will change this for next
> series.
>
> >> +
> >> /*
> >> * VIRTIO-1 devices migrate desc, used, and avail ring addresses so
> >> * only the region cache needs to be set up. Legacy devices need
> >> @@ -3384,14 +3395,19 @@ virtio_load(VirtIODevice *vdev, QEMUFile *f, int version_id)
> >> continue;
> >> }
> >>
> >> - nheads = vring_avail_idx(&vdev->vq[i]) - vdev->vq[i].last_avail_idx;
> >> - /* Check it isn't doing strange things with descriptor numbers. */
> >> - if (nheads > vdev->vq[i].vring.num) {
> >> - virtio_error(vdev, "VQ %d size 0x%x Guest index 0x%x "
> >> - "inconsistent with Host index 0x%x: delta 0x%x",
> >> - i, vdev->vq[i].vring.num,
> >> - vring_avail_idx(&vdev->vq[i]),
> >> - vdev->vq[i].last_avail_idx, nheads);
> >> + if (!inconsistent_indices) {
> >> + nheads = vring_avail_idx(&vdev->vq[i]) - vdev->vq[i].last_avail_idx;
> >> + /* Check it isn't doing strange things with descriptor numbers. */
> >> + if (nheads > vdev->vq[i].vring.num) {
> >> + virtio_error(vdev, "VQ %d size 0x%x Guest index 0x%x "
> >> + "inconsistent with Host index 0x%x: delta 0x%x",
> >> + i, vdev->vq[i].vring.num,
> >> + vring_avail_idx(&vdev->vq[i]),
> >> + vdev->vq[i].last_avail_idx, nheads);
> >> + inconsistent_indices = true;
> >> + }
> >> + }
> >> + if (inconsistent_indices) {
> >> vdev->vq[i].used_idx = 0;
> >> vdev->vq[i].shadow_avail_idx = 0;
> >> vdev->vq[i].inuse = 0;
> >> diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
> >> index 214d4a77e9..06b6e6ba65 100644
> >> --- a/include/hw/virtio/virtio.h
> >> +++ b/include/hw/virtio/virtio.h
> >> @@ -98,6 +98,11 @@ enum virtio_device_endian {
> >> VIRTIO_DEVICE_ENDIAN_BIG,
> >> };
> >>
> >> +/* VirtIODevice iterative live migration data structure */
> >> +typedef struct VirtIODevMigration {
> >> + bool iterative_vmstate_loaded;
> >> +} VirtIODevMigration;
> >> +
> >> /**
> >> * struct VirtIODevice - common VirtIO structure
> >> * @name: name of the device
> >> @@ -151,6 +156,7 @@ struct VirtIODevice
> >> bool disable_legacy_check;
> >> bool vhost_started;
> >> VMChangeStateEntry *vmstate;
> >> + VirtIODevMigration *migration;
> >> char *bus_name;
> >> uint8_t device_endian;
> >> /**
> >> --
> >> 2.47.1
> >>
> >
>
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC 1/6] migration: Add virtio-iterative capability
2025-07-22 12:41 ` [RFC 1/6] migration: Add virtio-iterative capability Jonah Palmer
@ 2025-08-06 15:58 ` Peter Xu
2025-08-07 12:50 ` Jonah Palmer
2025-08-08 10:48 ` Markus Armbruster
1 sibling, 1 reply; 66+ messages in thread
From: Peter Xu @ 2025-08-06 15:58 UTC (permalink / raw)
To: Jonah Palmer
Cc: qemu-devel, farosas, eblake, armbru, jasowang, mst, si-wei.liu,
eperezma, boris.ostrovsky
On Tue, Jul 22, 2025 at 12:41:22PM +0000, Jonah Palmer wrote:
> Adds a new migration capability 'virtio-iterative' that will allow
> virtio devices, where supported, to iteratively migrate configuration
> changes that occur during the migration process.
>
> This capability is added to the validated capabilities list to ensure
> both the source and destination support it before enabling.
>
> The capability defaults to off to maintain backward compatibility.
>
> To enable the capability via HMP:
> (qemu) migrate_set_capability virtio-iterative on
>
> To enable the capability via QMP:
> {"execute": "migrate-set-capabilities", "arguments": {
> "capabilities": [
> { "capability": "virtio-iterative", "state": true }
> ]
> }
> }
>
> Signed-off-by: Jonah Palmer <jonah.palmer@oracle.com>
> ---
> migration/savevm.c | 1 +
> qapi/migration.json | 7 ++++++-
> 2 files changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/migration/savevm.c b/migration/savevm.c
> index bb04a4520d..40a2189866 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -279,6 +279,7 @@ static bool should_validate_capability(int capability)
> switch (capability) {
> case MIGRATION_CAPABILITY_X_IGNORE_SHARED:
> case MIGRATION_CAPABILITY_MAPPED_RAM:
> + case MIGRATION_CAPABILITY_VIRTIO_ITERATIVE:
> return true;
> default:
> return false;
> diff --git a/qapi/migration.json b/qapi/migration.json
> index 4963f6ca12..8f042c3ba5 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -479,6 +479,11 @@
> # each RAM page. Requires a migration URI that supports seeking,
> # such as a file. (since 9.0)
> #
> +# @virtio-iterative: Enable iterative migration for virtio devices, if
> +# the device supports it. When enabled, and where supported, virtio
> +# devices will track and migrate configuration changes that may
> +# occur during the migration process. (Since 10.1)
> +#
Having a migration capability to enable iterative support for a specific
type of device sounds wrong.
If virtio will be able to support iterative saves, it could provide the
save_live_iterate() function. Any explanation why it needs to be a
migration capability?
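I.e., roughly something like this (sketch only; the handler names are
placeholders for whatever functions the device already registers):

static const SaveVMHandlers savevm_virtio_net_handlers = {
    .save_setup                 = virtio_net_save_setup,
    .save_live_iterate          = virtio_net_save_live_iterate,  /* called
                                     repeatedly while the VM still runs */
    .save_live_complete_precopy = virtio_net_save_complete,
    .load_setup                 = virtio_net_load_setup,
    .load_state                 = virtio_net_load_state,
    .load_cleanup               = virtio_net_load_cleanup,
};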
> # Features:
> #
> # @unstable: Members @x-colo and @x-ignore-shared are experimental.
> @@ -498,7 +503,7 @@
> { 'name': 'x-ignore-shared', 'features': [ 'unstable' ] },
> 'validate-uuid', 'background-snapshot',
> 'zero-copy-send', 'postcopy-preempt', 'switchover-ack',
> - 'dirty-limit', 'mapped-ram'] }
> + 'dirty-limit', 'mapped-ram', 'virtio-iterative'] }
>
> ##
> # @MigrationCapabilityStatus:
> --
> 2.47.1
>
--
Peter Xu
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC 5/6] virtio,virtio-net: skip consistency check in virtio_load for iterative migration
2025-07-22 12:41 ` [RFC 5/6] virtio, virtio-net: skip consistency check in virtio_load for iterative migration Jonah Palmer via
2025-07-28 15:30 ` [RFC 5/6] virtio,virtio-net: " Eugenio Perez Martin
@ 2025-08-06 16:27 ` Peter Xu
2025-08-07 14:18 ` Jonah Palmer
1 sibling, 1 reply; 66+ messages in thread
From: Peter Xu @ 2025-08-06 16:27 UTC (permalink / raw)
To: Jonah Palmer
Cc: qemu-devel, farosas, eblake, armbru, jasowang, mst, si-wei.liu,
eperezma, boris.ostrovsky
On Tue, Jul 22, 2025 at 12:41:26PM +0000, Jonah Palmer wrote:
> Iterative live migration for virtio-net sends an initial
> VMStateDescription while the source is still active. Because data
> continues to flow for virtio-net, the guest's avail index continues to
> increment after last_avail_idx had already been sent. This causes the
> destination to often see something like this from virtio_error():
>
> VQ 0 size 0x100 Guest index 0x0 inconsistent with Host index 0xc: delta 0xfff4
This is pretty much understandable, as vmstate_save() / vmstate_load() are,
IMHO, not designed to be used while the VM is running.
To me, it's still illegal (per the previous patch) to use vmstate_save_state()
while the VM is running, i.e. in a save_setup() phase.
Some very high level questions from migration POV:
- Have we figured out why the downtime can be shrunk just by sending the
vmstate twice?
If we suspect it's the memory getting preheated, have we tried other ways
to simply heat the memory up on the dest side? For example, some form of
mlock[all]()? IMHO it's pretty important that we figure out where such an
optimization comes from.
I do remember we had a downtime issue with the number of max_vqueues that
may cause post_load() to be slow; I wonder whether there are other ways to
improve it instead of vmstate_save(), especially in the setup phase.
- Normally devices need an iterative phase because:
(a) the device may contain a huge amount of data to transfer
E.g. RAM and VFIO are good examples and fall into this category.
(b) the device states are "iterable" in concept
This is definitely true for RAM. VFIO somehow mimicked that even though
it was a streamed binary protocol.
What's the answer for virtio-net here? How large is the device state?
Is this relevant to vDPA and real hardware (so virtio-net can look
similar to VFIO at some point)?
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC 1/6] migration: Add virtio-iterative capability
2025-08-06 15:58 ` Peter Xu
@ 2025-08-07 12:50 ` Jonah Palmer
2025-08-07 13:13 ` Peter Xu
0 siblings, 1 reply; 66+ messages in thread
From: Jonah Palmer @ 2025-08-07 12:50 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, farosas, eblake, armbru, jasowang, mst, si-wei.liu,
eperezma, boris.ostrovsky
On 8/6/25 11:58 AM, Peter Xu wrote:
> On Tue, Jul 22, 2025 at 12:41:22PM +0000, Jonah Palmer wrote:
>> Adds a new migration capability 'virtio-iterative' that will allow
>> virtio devices, where supported, to iteratively migrate configuration
>> changes that occur during the migration process.
>>
>> This capability is added to the validated capabilities list to ensure
>> both the source and destination support it before enabling.
>>
>> The capability defaults to off to maintain backward compatibility.
>>
>> To enable the capability via HMP:
>> (qemu) migrate_set_capability virtio-iterative on
>>
>> To enable the capability via QMP:
>> {"execute": "migrate-set-capabilities", "arguments": {
>> "capabilities": [
>> { "capability": "virtio-iterative", "state": true }
>> ]
>> }
>> }
>>
>> Signed-off-by: Jonah Palmer <jonah.palmer@oracle.com>
>> ---
>> migration/savevm.c | 1 +
>> qapi/migration.json | 7 ++++++-
>> 2 files changed, 7 insertions(+), 1 deletion(-)
>>
>> diff --git a/migration/savevm.c b/migration/savevm.c
>> index bb04a4520d..40a2189866 100644
>> --- a/migration/savevm.c
>> +++ b/migration/savevm.c
>> @@ -279,6 +279,7 @@ static bool should_validate_capability(int capability)
>> switch (capability) {
>> case MIGRATION_CAPABILITY_X_IGNORE_SHARED:
>> case MIGRATION_CAPABILITY_MAPPED_RAM:
>> + case MIGRATION_CAPABILITY_VIRTIO_ITERATIVE:
>> return true;
>> default:
>> return false;
>> diff --git a/qapi/migration.json b/qapi/migration.json
>> index 4963f6ca12..8f042c3ba5 100644
>> --- a/qapi/migration.json
>> +++ b/qapi/migration.json
>> @@ -479,6 +479,11 @@
>> # each RAM page. Requires a migration URI that supports seeking,
>> # such as a file. (since 9.0)
>> #
>> +# @virtio-iterative: Enable iterative migration for virtio devices, if
>> +# the device supports it. When enabled, and where supported, virtio
>> +# devices will track and migrate configuration changes that may
>> +# occur during the migration process. (Since 10.1)
>> +#
>
> Having a migration capability to enable iterative support for a specific
> type of device sounds wrong.
>
> If virtio will be able to support iterative saves, it could provide the
> save_live_iterate() function. Any explanation why it needs to be a
> migration capability?
>
It certainly doesn't have to be a migration capability. Perhaps it's
better as a per-device compatibility property? E.g.:
-device virtio-net-pci,x-iterative-migration=on,...
I was just thinking along the lines of not having this feature enabled
by default for backwards-compatibility (and something to toggle to
compare performance during development).
Totally open to suggestions though. I wasn't really sure how a
feature/capability like this would best be introduced.
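For illustration, a per-device knob could be a single line in virtio-net's
property list (sketch; "iterative_migration" would be a new VirtIONet field,
checked at realize time before registering the SaveVMHandlers):

/* Sketch: opt-in device property instead of a migration capability. */
DEFINE_PROP_BOOL("x-iterative-migration", VirtIONet,
                 iterative_migration, false),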
>> # Features:
>> #
>> # @unstable: Members @x-colo and @x-ignore-shared are experimental.
>> @@ -498,7 +503,7 @@
>> { 'name': 'x-ignore-shared', 'features': [ 'unstable' ] },
>> 'validate-uuid', 'background-snapshot',
>> 'zero-copy-send', 'postcopy-preempt', 'switchover-ack',
>> - 'dirty-limit', 'mapped-ram'] }
>> + 'dirty-limit', 'mapped-ram', 'virtio-iterative'] }
>>
>> ##
>> # @MigrationCapabilityStatus:
>> --
>> 2.47.1
>>
>
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC 1/6] migration: Add virtio-iterative capability
2025-08-07 12:50 ` Jonah Palmer
@ 2025-08-07 13:13 ` Peter Xu
2025-08-07 14:20 ` Jonah Palmer
0 siblings, 1 reply; 66+ messages in thread
From: Peter Xu @ 2025-08-07 13:13 UTC (permalink / raw)
To: Jonah Palmer
Cc: qemu-devel, farosas, eblake, armbru, jasowang, mst, si-wei.liu,
eperezma, boris.ostrovsky
On Thu, Aug 07, 2025 at 08:50:38AM -0400, Jonah Palmer wrote:
>
>
> On 8/6/25 11:58 AM, Peter Xu wrote:
> > On Tue, Jul 22, 2025 at 12:41:22PM +0000, Jonah Palmer wrote:
> > > Adds a new migration capability 'virtio-iterative' that will allow
> > > virtio devices, where supported, to iteratively migrate configuration
> > > changes that occur during the migration process.
> > >
> > > This capability is added to the validated capabilities list to ensure
> > > both the source and destination support it before enabling.
> > >
> > > The capability defaults to off to maintain backward compatibility.
> > >
> > > To enable the capability via HMP:
> > > (qemu) migrate_set_capability virtio-iterative on
> > >
> > > To enable the capability via QMP:
> > > {"execute": "migrate-set-capabilities", "arguments": {
> > > "capabilities": [
> > > { "capability": "virtio-iterative", "state": true }
> > > ]
> > > }
> > > }
> > >
> > > Signed-off-by: Jonah Palmer <jonah.palmer@oracle.com>
> > > ---
> > > migration/savevm.c | 1 +
> > > qapi/migration.json | 7 ++++++-
> > > 2 files changed, 7 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/migration/savevm.c b/migration/savevm.c
> > > index bb04a4520d..40a2189866 100644
> > > --- a/migration/savevm.c
> > > +++ b/migration/savevm.c
> > > @@ -279,6 +279,7 @@ static bool should_validate_capability(int capability)
> > > switch (capability) {
> > > case MIGRATION_CAPABILITY_X_IGNORE_SHARED:
> > > case MIGRATION_CAPABILITY_MAPPED_RAM:
> > > + case MIGRATION_CAPABILITY_VIRTIO_ITERATIVE:
> > > return true;
> > > default:
> > > return false;
> > > diff --git a/qapi/migration.json b/qapi/migration.json
> > > index 4963f6ca12..8f042c3ba5 100644
> > > --- a/qapi/migration.json
> > > +++ b/qapi/migration.json
> > > @@ -479,6 +479,11 @@
> > > # each RAM page. Requires a migration URI that supports seeking,
> > > # such as a file. (since 9.0)
> > > #
> > > +# @virtio-iterative: Enable iterative migration for virtio devices, if
> > > +# the device supports it. When enabled, and where supported, virtio
> > > +# devices will track and migrate configuration changes that may
> > > +# occur during the migration process. (Since 10.1)
> > > +#
> >
> > Having a migration capability to enable iterative support for a specific
> > type of device sounds wrong.
> >
> > If virtio will be able to support iterative saves, it could provide the
> > save_live_iterate() function. Any explanation why it needs to be a
> > migration capability?
> >
>
> It certainly doesn't have to be a migration capability. Perhaps it's better
> as a per-device compatibility property? E.g.:
>
> -device virtio-net-pci,x-iterative-migration=on,...
>
> I was just thinking along the lines of not having this feature enabled by
> default for backwards-compatibility (and something to toggle to compare
> performance during development).
>
> Totally open to suggestions though. I wasn't really sure how best a
> feature/capability like this should be introduced.
Yep, for an RFC this is fine; if there'll be a formal patch please propose
it as a device property whenever needed, thanks.
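(For illustration only, a per-device knob along those lines might look like
the sketch below; the property name "x-iterative-migration" and the
VirtIONet backing field are hypothetical, not something the series adds,
and the property-array plumbing varies across QEMU versions.)

    /* Sketch: opt-in compat-style property for virtio-net.  The
     * "iterative_migration" field in VirtIONet is assumed here. */
    #include "hw/qdev-properties.h"
    #include "hw/virtio/virtio-net.h"

    static const Property virtio_net_iterative_properties[] = {
        DEFINE_PROP_BOOL("x-iterative-migration", VirtIONet,
                         iterative_migration, false),
    };

It would then be toggled with -device virtio-net-pci,x-iterative-migration=on.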
--
Peter Xu
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC 5/6] virtio,virtio-net: skip consistency check in virtio_load for iterative migration
2025-08-06 16:27 ` Peter Xu
@ 2025-08-07 14:18 ` Jonah Palmer
2025-08-07 16:31 ` Peter Xu
0 siblings, 1 reply; 66+ messages in thread
From: Jonah Palmer @ 2025-08-07 14:18 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, farosas, eblake, armbru, jasowang, mst, si-wei.liu,
eperezma, boris.ostrovsky
On 8/6/25 12:27 PM, Peter Xu wrote:
> On Tue, Jul 22, 2025 at 12:41:26PM +0000, Jonah Palmer wrote:
>> Iterative live migration for virtio-net sends an initial
>> VMStateDescription while the source is still active. Because data
>> continues to flow for virtio-net, the guest's avail index continues to
>> increment after last_avail_idx had already been sent. This causes the
>> destination to often see something like this from virtio_error():
>>
>> VQ 0 size 0x100 Guest index 0x0 inconsistent with Host index 0xc: delta 0xfff4
>
> This is pretty much understanable, as vmstate_save() / vmstate_load() are,
> IMHO, not designed to be used while VM is running.
>
> To me, it's still illegal (per previous patch) to use vmstate_save_state()
> while VM is running, in a save_setup() phase.
Yea I understand where you're coming from. It just seemed too good to
pass up as a way to send and receive the entire state of a device.
I felt that if I were to implement something similar just for iterative
migration, I'd, more or less, be duplicating a lot of already existing
code or vmstate logic.
>
> Some very high level questions from migration POV:
>
> - Have we figured out why the downtime can be shrinked just by sending the
> vmstate twice?
>
> If we suspect it's memory got preheated, have we tried other ways to
> simply heat the memory up on dest side? For example, some form of
> mlock[all]()? IMHO it's pretty important we figure out the root of why
> such optimization came from.
>
> I do remember we have downtime issue with number of max_vqueues that may
> cause post_load() to be slow, I wonder there're other ways to improve it
> instead of vmstate_save(), especially in setup phase.
>
Yea I believe that the downtime shrinks on the second vmstate_load_state
due to preheated memory. But I'd like to stress that it's not my
intention to resend the entire vmstate again during the stop-and-copy
phase if iterative migration was used. A future iteration of this series
will eventually include a more efficient approach to update the
destination with any deltas since the vmstate was sent during the
iterative portion (instead of just resending the entire vmstate again).
And yea there is an inefficiency regarding walking through
VIRTIO_QUEUE_MAX (1024) VQs (twice with PCI) that I mentioned here in
another comment:
https://lore.kernel.org/qemu-devel/0f5b804d-3852-4159-b151-308a57f1ec74@oracle.com/
This might be better handled in a separate series though rather than as
part of this one.
> - Normally devices need iterative phase because:
>
> (a) the device may contain huge amount of data to transfer
>
> E.g. RAM and VFIO are good examples and fall into this category.
>
> (b) the device states are "iterable" from concept
>
> RAM is definitely true. VFIO somehow mimiced that even though it was
> a streamed binary protocol..
>
> What's the answer for virtio-net here? How large is the device state?
> Is this relevant to vDPA and real hardware (so virtio-net can look
> similar to VFIO at some point)?
>
The main motivation behind implementing iterative migration for
virtio-net is really to improve the guest visible downtime seen when
migrating a vDPA device.
That is, by implementing iterative migration for virtio-net, we can see
the state of the device early on and get a head start on work that's
currently being done during the stop-and-copy phase. If we do this work
before the stop-and-copy phase, we can further decrease the time spent
in this window.
This would include work such as sending down the CVQ commands for
queue-pair creation (even more beneficial for multiqueue), RSS, filters,
etc.
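(To make "CVQ commands" concrete, here is a minimal sketch of the
multiqueue pair-set command using the structures and constants from the
virtio spec headers; the helper function itself is illustrative, not code
from this series.)

    #include <stdint.h>
    #include <endian.h>
    #include <linux/virtio_net.h>

    /* Build the control-virtqueue command that sets the number of queue
     * pairs.  hdr, mq and a one-byte ack buffer would then be chained on
     * the control virtqueue. */
    static void build_mq_cmd(uint16_t queue_pairs,
                             struct virtio_net_ctrl_hdr *hdr,
                             struct virtio_net_ctrl_mq *mq)
    {
        hdr->class = VIRTIO_NET_CTRL_MQ;
        hdr->cmd = VIRTIO_NET_CTRL_MQ_VQ_PAIRS_SET;
        mq->virtqueue_pairs = htole16(queue_pairs);
    }

Staging commands like this before the stop-and-copy phase is exactly the
kind of work being moved out of the downtime window.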
I'm hoping to show this more explicitly in the next version of this RFC
series that I'm working on now.
> Thanks,
>
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC 1/6] migration: Add virtio-iterative capability
2025-08-07 13:13 ` Peter Xu
@ 2025-08-07 14:20 ` Jonah Palmer
0 siblings, 0 replies; 66+ messages in thread
From: Jonah Palmer @ 2025-08-07 14:20 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, farosas, eblake, armbru, jasowang, mst, si-wei.liu,
eperezma, boris.ostrovsky
On 8/7/25 9:13 AM, Peter Xu wrote:
> On Thu, Aug 07, 2025 at 08:50:38AM -0400, Jonah Palmer wrote:
>>
>>
>> On 8/6/25 11:58 AM, Peter Xu wrote:
>>> On Tue, Jul 22, 2025 at 12:41:22PM +0000, Jonah Palmer wrote:
>>>> Adds a new migration capability 'virtio-iterative' that will allow
>>>> virtio devices, where supported, to iteratively migrate configuration
>>>> changes that occur during the migration process.
>>>>
>>>> This capability is added to the validated capabilities list to ensure
>>>> both the source and destination support it before enabling.
>>>>
>>>> The capability defaults to off to maintain backward compatibility.
>>>>
>>>> To enable the capability via HMP:
>>>> (qemu) migrate_set_capability virtio-iterative on
>>>>
>>>> To enable the capability via QMP:
>>>> {"execute": "migrate-set-capabilities", "arguments": {
>>>> "capabilities": [
>>>> { "capability": "virtio-iterative", "state": true }
>>>> ]
>>>> }
>>>> }
>>>>
>>>> Signed-off-by: Jonah Palmer <jonah.palmer@oracle.com>
>>>> ---
>>>> migration/savevm.c | 1 +
>>>> qapi/migration.json | 7 ++++++-
>>>> 2 files changed, 7 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/migration/savevm.c b/migration/savevm.c
>>>> index bb04a4520d..40a2189866 100644
>>>> --- a/migration/savevm.c
>>>> +++ b/migration/savevm.c
>>>> @@ -279,6 +279,7 @@ static bool should_validate_capability(int capability)
>>>> switch (capability) {
>>>> case MIGRATION_CAPABILITY_X_IGNORE_SHARED:
>>>> case MIGRATION_CAPABILITY_MAPPED_RAM:
>>>> + case MIGRATION_CAPABILITY_VIRTIO_ITERATIVE:
>>>> return true;
>>>> default:
>>>> return false;
>>>> diff --git a/qapi/migration.json b/qapi/migration.json
>>>> index 4963f6ca12..8f042c3ba5 100644
>>>> --- a/qapi/migration.json
>>>> +++ b/qapi/migration.json
>>>> @@ -479,6 +479,11 @@
>>>> # each RAM page. Requires a migration URI that supports seeking,
>>>> # such as a file. (since 9.0)
>>>> #
>>>> +# @virtio-iterative: Enable iterative migration for virtio devices, if
>>>> +# the device supports it. When enabled, and where supported, virtio
>>>> +# devices will track and migrate configuration changes that may
>>>> +# occur during the migration process. (Since 10.1)
>>>> +#
>>>
>>> Having a migration capability to enable iterative support for a specific
>>> type of device sounds wrong.
>>>
>>> If virtio will be able to support iterative saves, it could provide the
>>> save_live_iterate() function. Any explanation why it needs to be a
>>> migration capability?
>>>
>>
>> It certainly doesn't have to be a migration capability. Perhaps it's better
>> as a per-device compatibility property? E.g.:
>>
>> -device virtio-net-pci,x-iterative-migration=on,...
>>
>> I was just thinking along the lines of not having this feature enabled by
>> default for backwards-compatibility (and something to toggle to compare
>> performance during development).
>>
>> Totally open to suggestions though. I wasn't really sure how best a
>> feature/capability like this should be introduced.
>
> Yep, for RFC is fine, if there'll be a formal patch please propose it as a
> device property whenever needed, thanks.
>
Gotcha, will do! Thanks for the suggestion :)
Jonah
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC 5/6] virtio,virtio-net: skip consistency check in virtio_load for iterative migration
2025-08-07 14:18 ` Jonah Palmer
@ 2025-08-07 16:31 ` Peter Xu
2025-08-11 12:30 ` Jonah Palmer
0 siblings, 1 reply; 66+ messages in thread
From: Peter Xu @ 2025-08-07 16:31 UTC (permalink / raw)
To: Jonah Palmer
Cc: qemu-devel, farosas, eblake, armbru, jasowang, mst, si-wei.liu,
eperezma, boris.ostrovsky
On Thu, Aug 07, 2025 at 10:18:38AM -0400, Jonah Palmer wrote:
>
>
> On 8/6/25 12:27 PM, Peter Xu wrote:
> > On Tue, Jul 22, 2025 at 12:41:26PM +0000, Jonah Palmer wrote:
> > > Iterative live migration for virtio-net sends an initial
> > > VMStateDescription while the source is still active. Because data
> > > continues to flow for virtio-net, the guest's avail index continues to
> > > increment after last_avail_idx had already been sent. This causes the
> > > destination to often see something like this from virtio_error():
> > >
> > > VQ 0 size 0x100 Guest index 0x0 inconsistent with Host index 0xc: delta 0xfff4
> >
> > This is pretty much understanable, as vmstate_save() / vmstate_load() are,
> > IMHO, not designed to be used while VM is running.
> >
> > To me, it's still illegal (per previous patch) to use vmstate_save_state()
> > while VM is running, in a save_setup() phase.
>
> Yea I understand where you're coming from. It just seemed too good to pass
> up on as a way to send and receive the entire state of a device.
>
> I felt that if I were to implement something similar for iterative migration
> only that I'd, more or less, be duplicating a lot of already existing code
> or vmstate logic.
>
> >
> > Some very high level questions from migration POV:
> >
> > - Have we figured out why the downtime can be shrinked just by sending the
> > vmstate twice?
> >
> > If we suspect it's memory got preheated, have we tried other ways to
> > simply heat the memory up on dest side? For example, some form of
> > mlock[all]()? IMHO it's pretty important we figure out the root of why
> > such optimization came from.
> >
> > I do remember we have downtime issue with number of max_vqueues that may
> > cause post_load() to be slow, I wonder there're other ways to improve it
> > instead of vmstate_save(), especially in setup phase.
> >
>
> Yea I believe that the downtime shrinks on the second vmstate_load_state due
> to preheated memory. But I'd like to stress that it's not my intention to
> resend the entire vmstate again during the stop-and-copy phase if iterative
> migration was used. A future iteration of this series will eventually
> include a more efficient approach to update the destination with any deltas
> since the vmstate was sent during the iterative portion (instead of just
> resending the entire vmstate again).
>
> And yea there is an inefficiency regarding walking through VIRTIO_QUEUE_MAX
> (1024) VQs (twice with PCI) that I mentioned here in another comment: https://lore.kernel.org/qemu-devel/0f5b804d-3852-4159-b151-308a57f1ec74@oracle.com/
>
> This might be better handled in a separate series though rather than as part
> of this one.
One thing to mention is I recall some other developer was trying to
optimize device load from memory side:
https://lore.kernel.org/all/20230317081904.24389-1-xuchuangxclwt@bytedance.com/
So maybe there's more than one way of doing this, and I'm not sure which
way is better, or whether we need both.
>
> > - Normally devices need iterative phase because:
> >
> > (a) the device may contain huge amount of data to transfer
> >
> > E.g. RAM and VFIO are good examples and fall into this category.
> >
> > (b) the device states are "iterable" from concept
> >
> > RAM is definitely true. VFIO somehow mimiced that even though it was
> > a streamed binary protocol..
> >
> > What's the answer for virtio-net here? How large is the device state?
> > Is this relevant to vDPA and real hardware (so virtio-net can look
> > similar to VFIO at some point)?
>
>
> The main motivation behind implementing iterative migration for virtio-net
> is really to improve the guest visible downtime seen when migrating a vDPA
> device.
>
> That is, by implementing iterative migration for virtio-net, we can see the
> state of the device early on and get a head start on work that's currently
> being done during the stop-and-copy phase. If we do this work before the
> stop-and-copy phase, we can further decrease the time spent in this window.
>
> This would include work such as sending down the CVQ commands for queue-pair
> creation (even more beneficial for multiqueue), RSS, filters, etc.
>
> I'm hoping to show this more explicitly in the next version of this RFC
> series that I'm working on now.
OK, thanks for the context. I can wait and read the new version.
In all cases, please note that since the migration thread does not take the
BQL, either the setup or the iterable phase may happen concurrently with any
of the vCPU threads. I think it means maybe it's not wise to try to iterate
everything: please be ready to see e.g. a 64-bit MMIO register being
partially updated when dumping it to the wire.
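(As a generic illustration of that hazard, not QEMU code: a plain 64-bit
load racing with a vCPU-side store can observe a torn value; both sides
need atomic accesses, e.g. with C11 atomics.)

    #include <stdatomic.h>
    #include <stdint.h>

    /* The race described above: a migration thread dumping a 64-bit field
     * while a vCPU thread updates it.  Without atomics the reader may see
     * a partially updated value on some hosts. */
    static _Atomic uint64_t guest_reg;   /* stand-in for a device field */

    static void vcpu_write(uint64_t val)          /* vCPU thread */
    {
        atomic_store_explicit(&guest_reg, val, memory_order_relaxed);
    }

    static uint64_t migration_read(void)          /* migration thread */
    {
        return atomic_load_explicit(&guest_reg, memory_order_relaxed);
    }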
Do you have a rough estimation of the size of the device states to migrate?
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC 1/6] migration: Add virtio-iterative capability
2025-07-22 12:41 ` [RFC 1/6] migration: Add virtio-iterative capability Jonah Palmer
2025-08-06 15:58 ` Peter Xu
@ 2025-08-08 10:48 ` Markus Armbruster
2025-08-11 12:18 ` Jonah Palmer
1 sibling, 1 reply; 66+ messages in thread
From: Markus Armbruster @ 2025-08-08 10:48 UTC (permalink / raw)
To: Jonah Palmer
Cc: qemu-devel, peterx, farosas, eblake, armbru, jasowang, mst,
si-wei.liu, eperezma, boris.ostrovsky
I apologize for the lateness of my review.
Jonah Palmer <jonah.palmer@oracle.com> writes:
> Adds a new migration capability 'virtio-iterative' that will allow
> virtio devices, where supported, to iteratively migrate configuration
> changes that occur during the migration process.
Why is that desirable?
> This capability is added to the validated capabilities list to ensure
> both the source and destination support it before enabling.
What happens when only one side enables it?
> The capability defaults to off to maintain backward compatibility.
>
> To enable the capability via HMP:
> (qemu) migrate_set_capability virtio-iterative on
>
> To enable the capability via QMP:
> {"execute": "migrate-set-capabilities", "arguments": {
> "capabilities": [
> { "capability": "virtio-iterative", "state": true }
> ]
> }
> }
>
> Signed-off-by: Jonah Palmer <jonah.palmer@oracle.com>
> ---
> migration/savevm.c | 1 +
> qapi/migration.json | 7 ++++++-
> 2 files changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/migration/savevm.c b/migration/savevm.c
> index bb04a4520d..40a2189866 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -279,6 +279,7 @@ static bool should_validate_capability(int capability)
> switch (capability) {
> case MIGRATION_CAPABILITY_X_IGNORE_SHARED:
> case MIGRATION_CAPABILITY_MAPPED_RAM:
> + case MIGRATION_CAPABILITY_VIRTIO_ITERATIVE:
> return true;
> default:
> return false;
> diff --git a/qapi/migration.json b/qapi/migration.json
> index 4963f6ca12..8f042c3ba5 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -479,6 +479,11 @@
> # each RAM page. Requires a migration URI that supports seeking,
> # such as a file. (since 9.0)
> #
> +# @virtio-iterative: Enable iterative migration for virtio devices, if
> +# the device supports it. When enabled, and where supported, virtio
> +# devices will track and migrate configuration changes that may
> +# occur during the migration process. (Since 10.1)
When and why should the user enable this?
What exactly do you mean by "where supported"?
docs/devel/qapi-code-gen.rst:
For legibility, wrap text paragraphs so every line is at most 70
characters long.
Separate sentences with two spaces.
> +#
> # Features:
> #
> # @unstable: Members @x-colo and @x-ignore-shared are experimental.
> @@ -498,7 +503,7 @@
> { 'name': 'x-ignore-shared', 'features': [ 'unstable' ] },
> 'validate-uuid', 'background-snapshot',
> 'zero-copy-send', 'postcopy-preempt', 'switchover-ack',
> - 'dirty-limit', 'mapped-ram'] }
> + 'dirty-limit', 'mapped-ram', 'virtio-iterative'] }
>
> ##
> # @MigrationCapabilityStatus:
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC 1/6] migration: Add virtio-iterative capability
2025-08-08 10:48 ` Markus Armbruster
@ 2025-08-11 12:18 ` Jonah Palmer
2025-08-25 12:44 ` Markus Armbruster
0 siblings, 1 reply; 66+ messages in thread
From: Jonah Palmer @ 2025-08-11 12:18 UTC (permalink / raw)
To: Markus Armbruster
Cc: qemu-devel, peterx, farosas, eblake, jasowang, mst, si-wei.liu,
eperezma, boris.ostrovsky
On 8/8/25 6:48 AM, Markus Armbruster wrote:
> I apologize for the lateness of my review.
>
> Jonah Palmer <jonah.palmer@oracle.com> writes:
>
>> Adds a new migration capability 'virtio-iterative' that will allow
>> virtio devices, where supported, to iteratively migrate configuration
>> changes that occur during the migration process.
>
> Why is that desirable?
>
To be frank, I wasn't sure if having a migration capability, or even
having it toggleable at all, would be desirable or not. It appears though
that this might be better off as a per-device feature, set via
--device virtio-net-pci,iterative-mig=on,..., for example.
And by "iteratively migrate configuration changes" I meant more along
the lines of the device's state as it continues running on the source.
But perhaps actual configuration changes (e.g. changing the number of
queue pairs) could also be supported mid-migration like this?
>> This capability is added to the validated capabilities list to ensure
>> both the source and destination support it before enabling.
>
> What happens when only one side enables it?
>
The migration stream breaks if only one side enables it.
This is poor wording on my part, my apologies. I don't think it's even
possible to know the capabilities between the source & destination.
>> The capability defaults to off to maintain backward compatibility.
>>
>> To enable the capability via HMP:
>> (qemu) migrate_set_capability virtio-iterative on
>>
>> To enable the capability via QMP:
>> {"execute": "migrate-set-capabilities", "arguments": {
>> "capabilities": [
>> { "capability": "virtio-iterative", "state": true }
>> ]
>> }
>> }
>>
>> Signed-off-by: Jonah Palmer <jonah.palmer@oracle.com>
>> ---
>> migration/savevm.c | 1 +
>> qapi/migration.json | 7 ++++++-
>> 2 files changed, 7 insertions(+), 1 deletion(-)
>>
>> diff --git a/migration/savevm.c b/migration/savevm.c
>> index bb04a4520d..40a2189866 100644
>> --- a/migration/savevm.c
>> +++ b/migration/savevm.c
>> @@ -279,6 +279,7 @@ static bool should_validate_capability(int capability)
>> switch (capability) {
>> case MIGRATION_CAPABILITY_X_IGNORE_SHARED:
>> case MIGRATION_CAPABILITY_MAPPED_RAM:
>> + case MIGRATION_CAPABILITY_VIRTIO_ITERATIVE:
>> return true;
>> default:
>> return false;
>> diff --git a/qapi/migration.json b/qapi/migration.json
>> index 4963f6ca12..8f042c3ba5 100644
>> --- a/qapi/migration.json
>> +++ b/qapi/migration.json
>> @@ -479,6 +479,11 @@
>> # each RAM page. Requires a migration URI that supports seeking,
>> # such as a file. (since 9.0)
>> #
>> +# @virtio-iterative: Enable iterative migration for virtio devices, if
>> +# the device supports it. When enabled, and where supported, virtio
>> +# devices will track and migrate configuration changes that may
>> +# occur during the migration process. (Since 10.1)
>
> When and why should the user enable this?
>
Well if all goes according to plan, always (at least for virtio-net).
This should improve the overall speed of live migration for a virtio-net
device (and vhost-net/vhost-vdpa).
> What exactly do you mean by "where supported"?
>
I meant if both the source's QEMU and the destination's QEMU support it,
as well as other virtio devices in the future if they decide to implement
iterative migration (e.g. a more general "enable iterative migration for
virtio devices").
But I think for now this is better left as a virtio-net configuration
rather than as a migration capability (e.g. --device
virtio-net-pci,iterative-mig=on/off,...)
> docs/devel/qapi-code-gen.rst:
>
> For legibility, wrap text paragraphs so every line is at most 70
> characters long.
>
> Separate sentences with two spaces.
>
Ack - thank you.
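(For reference, a possible reflow of the doc comment following those
conventions, wording unchanged:)

    # @virtio-iterative: Enable iterative migration for virtio devices,
    #     if the device supports it.  When enabled, and where supported,
    #     virtio devices will track and migrate configuration changes
    #     that may occur during the migration process.  (Since 10.1)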
>> +#
>> # Features:
>> #
>> # @unstable: Members @x-colo and @x-ignore-shared are experimental.
>> @@ -498,7 +503,7 @@
>> { 'name': 'x-ignore-shared', 'features': [ 'unstable' ] },
>> 'validate-uuid', 'background-snapshot',
>> 'zero-copy-send', 'postcopy-preempt', 'switchover-ack',
>> - 'dirty-limit', 'mapped-ram'] }
>> + 'dirty-limit', 'mapped-ram', 'virtio-iterative'] }
>>
>> ##
>> # @MigrationCapabilityStatus:
>
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC 5/6] virtio,virtio-net: skip consistency check in virtio_load for iterative migration
2025-08-07 16:31 ` Peter Xu
@ 2025-08-11 12:30 ` Jonah Palmer
2025-08-11 13:39 ` Peter Xu
0 siblings, 1 reply; 66+ messages in thread
From: Jonah Palmer @ 2025-08-11 12:30 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, farosas, eblake, armbru, jasowang, mst, si-wei.liu,
eperezma, boris.ostrovsky
On 8/7/25 12:31 PM, Peter Xu wrote:
> On Thu, Aug 07, 2025 at 10:18:38AM -0400, Jonah Palmer wrote:
>>
>>
>> On 8/6/25 12:27 PM, Peter Xu wrote:
>>> On Tue, Jul 22, 2025 at 12:41:26PM +0000, Jonah Palmer wrote:
>>>> Iterative live migration for virtio-net sends an initial
>>>> VMStateDescription while the source is still active. Because data
>>>> continues to flow for virtio-net, the guest's avail index continues to
>>>> increment after last_avail_idx had already been sent. This causes the
>>>> destination to often see something like this from virtio_error():
>>>>
>>>> VQ 0 size 0x100 Guest index 0x0 inconsistent with Host index 0xc: delta 0xfff4
>>>
>>> This is pretty much understanable, as vmstate_save() / vmstate_load() are,
>>> IMHO, not designed to be used while VM is running.
>>>
>>> To me, it's still illegal (per previous patch) to use vmstate_save_state()
>>> while VM is running, in a save_setup() phase.
>>
>> Yea I understand where you're coming from. It just seemed too good to pass
>> up on as a way to send and receive the entire state of a device.
>>
>> I felt that if I were to implement something similar for iterative migration
>> only that I'd, more or less, be duplicating a lot of already existing code
>> or vmstate logic.
>>
>>>
>>> Some very high level questions from migration POV:
>>>
>>> - Have we figured out why the downtime can be shrinked just by sending the
>>> vmstate twice?
>>>
>>> If we suspect it's memory got preheated, have we tried other ways to
>>> simply heat the memory up on dest side? For example, some form of
>>> mlock[all]()? IMHO it's pretty important we figure out the root of why
>>> such optimization came from.
>>>
>>> I do remember we have downtime issue with number of max_vqueues that may
>>> cause post_load() to be slow, I wonder there're other ways to improve it
>>> instead of vmstate_save(), especially in setup phase.
>>>
>>
>> Yea I believe that the downtime shrinks on the second vmstate_load_state due
>> to preheated memory. But I'd like to stress that it's not my intention to
>> resend the entire vmstate again during the stop-and-copy phase if iterative
>> migration was used. A future iteration of this series will eventually
>> include a more efficient approach to update the destination with any deltas
>> since the vmstate was sent during the iterative portion (instead of just
>> resending the entire vmstate again).
>>
>> And yea there is an inefficiency regarding walking through VIRTIO_QUEUE_MAX
>> (1024) VQs (twice with PCI) that I mentioned here in another comment: https://lore.kernel.org/qemu-devel/0f5b804d-3852-4159-b151-308a57f1ec74@oracle.com/
>>
>> This might be better handled in a separate series though rather than as part
>> of this one.
>
> One thing to mention is I recall some other developer was trying to
> optimize device load from memory side:
>
> https://lore.kernel.org/all/20230317081904.24389-1-xuchuangxclwt@bytedance.com/
>
> So maybe there're more than one way of doing this, and I'm not sure which
> way is better, or both.
>
Ack. I'll take a look at this.
>>
>>> - Normally devices need iterative phase because:
>>>
>>> (a) the device may contain huge amount of data to transfer
>>>
>>> E.g. RAM and VFIO are good examples and fall into this category.
>>>
>>> (b) the device states are "iterable" from concept
>>>
>>> RAM is definitely true. VFIO somehow mimiced that even though it was
>>> a streamed binary protocol..
>>>
>>> What's the answer for virtio-net here? How large is the device state?
>>> Is this relevant to vDPA and real hardware (so virtio-net can look
>>> similar to VFIO at some point)?
>>
>>
>> The main motivation behind implementing iterative migration for virtio-net
>> is really to improve the guest visible downtime seen when migrating a vDPA
>> device.
>>
>> That is, by implementing iterative migration for virtio-net, we can see the
>> state of the device early on and get a head start on work that's currently
>> being done during the stop-and-copy phase. If we do this work before the
>> stop-and-copy phase, we can further decrease the time spent in this window.
>>
>> This would include work such as sending down the CVQ commands for queue-pair
>> creation (even more beneficial for multiqueue), RSS, filters, etc.
>>
>> I'm hoping to show this more explicitly in the next version of this RFC
>> series that I'm working on now.
>
> OK, thanks for the context. I can wait and read the new version.
>
> In all cases, please be noted that since migration thread does not take
> BQL, it means either the setup or iterable phase may happen concurrently
> with any of the vCPU threads. I think it means maybe it's not wise to try
> to iterate everything: please be ready to see e.g. 64bits MMIO register
> being partially updated when dumping it to the wire, for example.
>
Gotcha. Some of the iterative hooks like .save_setup, .load_state,
etc. do hold the BQL though, right?
> Do you have a rough estimation of the size of the device states to migrate?
>
Do you have a method for how I might be able to estimate this? I've been
trying to get some kind of rough estimate but failing to do so.
> Thanks,
>
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC 5/6] virtio,virtio-net: skip consistency check in virtio_load for iterative migration
2025-08-11 12:30 ` Jonah Palmer
@ 2025-08-11 13:39 ` Peter Xu
2025-08-11 21:26 ` Jonah Palmer
0 siblings, 1 reply; 66+ messages in thread
From: Peter Xu @ 2025-08-11 13:39 UTC (permalink / raw)
To: Jonah Palmer
Cc: qemu-devel, farosas, eblake, armbru, jasowang, mst, si-wei.liu,
eperezma, boris.ostrovsky
On Mon, Aug 11, 2025 at 08:30:19AM -0400, Jonah Palmer wrote:
>
>
> On 8/7/25 12:31 PM, Peter Xu wrote:
> > On Thu, Aug 07, 2025 at 10:18:38AM -0400, Jonah Palmer wrote:
> > >
> > >
> > > On 8/6/25 12:27 PM, Peter Xu wrote:
> > > > On Tue, Jul 22, 2025 at 12:41:26PM +0000, Jonah Palmer wrote:
> > > > > Iterative live migration for virtio-net sends an initial
> > > > > VMStateDescription while the source is still active. Because data
> > > > > continues to flow for virtio-net, the guest's avail index continues to
> > > > > increment after last_avail_idx had already been sent. This causes the
> > > > > destination to often see something like this from virtio_error():
> > > > >
> > > > > VQ 0 size 0x100 Guest index 0x0 inconsistent with Host index 0xc: delta 0xfff4
> > > >
> > > > This is pretty much understanable, as vmstate_save() / vmstate_load() are,
> > > > IMHO, not designed to be used while VM is running.
> > > >
> > > > To me, it's still illegal (per previous patch) to use vmstate_save_state()
> > > > while VM is running, in a save_setup() phase.
> > >
> > > Yea I understand where you're coming from. It just seemed too good to pass
> > > up on as a way to send and receive the entire state of a device.
> > >
> > > I felt that if I were to implement something similar for iterative migration
> > > only that I'd, more or less, be duplicating a lot of already existing code
> > > or vmstate logic.
> > >
> > > >
> > > > Some very high level questions from migration POV:
> > > >
> > > > - Have we figured out why the downtime can be shrinked just by sending the
> > > > vmstate twice?
> > > >
> > > > If we suspect it's memory got preheated, have we tried other ways to
> > > > simply heat the memory up on dest side? For example, some form of
> > > > mlock[all]()? IMHO it's pretty important we figure out the root of why
> > > > such optimization came from.
> > > >
> > > > I do remember we have downtime issue with number of max_vqueues that may
> > > > cause post_load() to be slow, I wonder there're other ways to improve it
> > > > instead of vmstate_save(), especially in setup phase.
> > > >
> > >
> > > Yea I believe that the downtime shrinks on the second vmstate_load_state due
> > > to preheated memory. But I'd like to stress that it's not my intention to
> > > resend the entire vmstate again during the stop-and-copy phase if iterative
> > > migration was used. A future iteration of this series will eventually
> > > include a more efficient approach to update the destination with any deltas
> > > since the vmstate was sent during the iterative portion (instead of just
> > > resending the entire vmstate again).
> > >
> > > And yea there is an inefficiency regarding walking through VIRTIO_QUEUE_MAX
> > > (1024) VQs (twice with PCI) that I mentioned here in another comment: https://lore.kernel.org/qemu-devel/0f5b804d-3852-4159-b151-308a57f1ec74@oracle.com/
> > >
> > > This might be better handled in a separate series though rather than as part
> > > of this one.
> >
> > One thing to mention is I recall some other developer was trying to
> > optimize device load from memory side:
> >
> > https://lore.kernel.org/all/20230317081904.24389-1-xuchuangxclwt@bytedance.com/
> >
> > So maybe there're more than one way of doing this, and I'm not sure which
> > way is better, or both.
> >
>
> Ack. I'll take a look at this.
>
> > >
> > > > - Normally devices need iterative phase because:
> > > >
> > > > (a) the device may contain huge amount of data to transfer
> > > >
> > > > E.g. RAM and VFIO are good examples and fall into this category.
> > > >
> > > > (b) the device states are "iterable" from concept
> > > >
> > > > RAM is definitely true. VFIO somehow mimiced that even though it was
> > > > a streamed binary protocol..
> > > >
> > > > What's the answer for virtio-net here? How large is the device state?
> > > > Is this relevant to vDPA and real hardware (so virtio-net can look
> > > > similar to VFIO at some point)?
> > >
> > >
> > > The main motivation behind implementing iterative migration for virtio-net
> > > is really to improve the guest visible downtime seen when migrating a vDPA
> > > device.
> > >
> > > That is, by implementing iterative migration for virtio-net, we can see the
> > > state of the device early on and get a head start on work that's currently
> > > being done during the stop-and-copy phase. If we do this work before the
> > > stop-and-copy phase, we can further decrease the time spent in this window.
> > >
> > > This would include work such as sending down the CVQ commands for queue-pair
> > > creation (even more beneficial for multiqueue), RSS, filters, etc.
> > >
> > > I'm hoping to show this more explicitly in the next version of this RFC
> > > series that I'm working on now.
> >
> > OK, thanks for the context. I can wait and read the new version.
> >
> > In all cases, please be noted that since migration thread does not take
> > BQL, it means either the setup or iterable phase may happen concurrently
> > with any of the vCPU threads. I think it means maybe it's not wise to try
> > to iterate everything: please be ready to see e.g. 64bits MMIO register
> > being partially updated when dumping it to the wire, for example.
> >
>
> Gotcha. Some of the iterative hooks though like .save_setup, .load_state,
> etc. do hold the BQL though, right?
load_state() definitely needs the lock.
save_setup(), yes we have the bql, but I really wish we didn't depend on it,
and I don't know whether it'll keep holding true - AFAIU, the majority of it
really doesn't need the lock.. and I always wanted to see whether I can
remove it.
Normal iterations definitely run without the lock.
>
> > Do you have a rough estimation of the size of the device states to migrate?
> >
>
> Do you have a method at how I might be able to estimate this? I've been
> trying to get some kind of rough estimation but failing to do so.
Could I ask why you started this "migrate virtio-net in iteration phase"
effort?
I thought it was because there's a lot of data to migrate, and there
should be a way to estimate the minimum. So is it not the case?
How about vDPA devices? Do those devices have a lot of data to migrate?
We really need a good enough reason to have a device provide
save_iterate(). If it's only about "preheat some MMIO registers", we
should, IMHO, look at more generic ways first.
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC 5/6] virtio,virtio-net: skip consistency check in virtio_load for iterative migration
2025-08-11 13:39 ` Peter Xu
@ 2025-08-11 21:26 ` Jonah Palmer
2025-08-11 21:55 ` Peter Xu
0 siblings, 1 reply; 66+ messages in thread
From: Jonah Palmer @ 2025-08-11 21:26 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, farosas, eblake, armbru, jasowang, mst, si-wei.liu,
eperezma, boris.ostrovsky
On 8/11/25 9:39 AM, Peter Xu wrote:
> On Mon, Aug 11, 2025 at 08:30:19AM -0400, Jonah Palmer wrote:
>>
>>
>> On 8/7/25 12:31 PM, Peter Xu wrote:
>>> On Thu, Aug 07, 2025 at 10:18:38AM -0400, Jonah Palmer wrote:
>>>>
>>>>
>>>> On 8/6/25 12:27 PM, Peter Xu wrote:
>>>>> On Tue, Jul 22, 2025 at 12:41:26PM +0000, Jonah Palmer wrote:
>>>>>> Iterative live migration for virtio-net sends an initial
>>>>>> VMStateDescription while the source is still active. Because data
>>>>>> continues to flow for virtio-net, the guest's avail index continues to
>>>>>> increment after last_avail_idx had already been sent. This causes the
>>>>>> destination to often see something like this from virtio_error():
>>>>>>
>>>>>> VQ 0 size 0x100 Guest index 0x0 inconsistent with Host index 0xc: delta 0xfff4
>>>>>
>>>>> This is pretty much understanable, as vmstate_save() / vmstate_load() are,
>>>>> IMHO, not designed to be used while VM is running.
>>>>>
>>>>> To me, it's still illegal (per previous patch) to use vmstate_save_state()
>>>>> while VM is running, in a save_setup() phase.
>>>>
>>>> Yea I understand where you're coming from. It just seemed too good to pass
>>>> up on as a way to send and receive the entire state of a device.
>>>>
>>>> I felt that if I were to implement something similar for iterative migration
>>>> only that I'd, more or less, be duplicating a lot of already existing code
>>>> or vmstate logic.
>>>>
>>>>>
>>>>> Some very high level questions from migration POV:
>>>>>
>>>>> - Have we figured out why the downtime can be shrinked just by sending the
>>>>> vmstate twice?
>>>>>
>>>>> If we suspect it's memory got preheated, have we tried other ways to
>>>>> simply heat the memory up on dest side? For example, some form of
>>>>> mlock[all]()? IMHO it's pretty important we figure out the root of why
>>>>> such optimization came from.
>>>>>
>>>>> I do remember we have downtime issue with number of max_vqueues that may
>>>>> cause post_load() to be slow, I wonder there're other ways to improve it
>>>>> instead of vmstate_save(), especially in setup phase.
>>>>>
>>>>
>>>> Yea I believe that the downtime shrinks on the second vmstate_load_state due
>>>> to preheated memory. But I'd like to stress that it's not my intention to
>>>> resend the entire vmstate again during the stop-and-copy phase if iterative
>>>> migration was used. A future iteration of this series will eventually
>>>> include a more efficient approach to update the destination with any deltas
>>>> since the vmstate was sent during the iterative portion (instead of just
>>>> resending the entire vmstate again).
>>>>
>>>> And yea there is an inefficiency regarding walking through VIRTIO_QUEUE_MAX
> > > > (1024) VQs (twice with PCI) that I mentioned here in another comment: https://lore.kernel.org/qemu-devel/0f5b804d-3852-4159-b151-308a57f1ec74@oracle.com/
>>>>
>>>> This might be better handled in a separate series though rather than as part
>>>> of this one.
>>>
>>> One thing to mention is I recall some other developer was trying to
>>> optimize device load from memory side:
>>>
> > > https://lore.kernel.org/all/20230317081904.24389-1-xuchuangxclwt@bytedance.com/
>>>
>>> So maybe there're more than one way of doing this, and I'm not sure which
>>> way is better, or both.
>>>
>>
>> Ack. I'll take a look at this.
>>
>>>>
>>>>> - Normally devices need iterative phase because:
>>>>>
>>>>> (a) the device may contain huge amount of data to transfer
>>>>>
>>>>> E.g. RAM and VFIO are good examples and fall into this category.
>>>>>
>>>>> (b) the device states are "iterable" from concept
>>>>>
>>>>> RAM is definitely true. VFIO somehow mimiced that even though it was
>>>>> a streamed binary protocol..
>>>>>
>>>>> What's the answer for virtio-net here? How large is the device state?
>>>>> Is this relevant to vDPA and real hardware (so virtio-net can look
>>>>> similar to VFIO at some point)?
>>>>
>>>>
>>>> The main motivation behind implementing iterative migration for virtio-net
>>>> is really to improve the guest visible downtime seen when migrating a vDPA
>>>> device.
>>>>
>>>> That is, by implementing iterative migration for virtio-net, we can see the
>>>> state of the device early on and get a head start on work that's currently
>>>> being done during the stop-and-copy phase. If we do this work before the
>>>> stop-and-copy phase, we can further decrease the time spent in this window.
>>>>
>>>> This would include work such as sending down the CVQ commands for queue-pair
>>>> creation (even more beneficial for multiqueue), RSS, filters, etc.
>>>>
>>>> I'm hoping to show this more explicitly in the next version of this RFC
>>>> series that I'm working on now.
>>>
>>> OK, thanks for the context. I can wait and read the new version.
>>>
>>> In all cases, please be noted that since migration thread does not take
>>> BQL, it means either the setup or iterable phase may happen concurrently
>>> with any of the vCPU threads. I think it means maybe it's not wise to try
>>> to iterate everything: please be ready to see e.g. 64bits MMIO register
>>> being partially updated when dumping it to the wire, for example.
>>>
>>
>> Gotcha. Some of the iterative hooks though like .save_setup, .load_state,
>> etc. do hold the BQL though, right?
>
> load_state() definitely needs the lock.
>
> save_setup(), yes we have bql, but I really wish we don't depend on it, and
> I don't know whether it'll keep holding true - AFAIU, the majority of it
> really doesn't need the lock.. and I always wanted to see whether I can
> remove it.
>
> Normal iterations definitely runs without the lock.
>
Gotcha. Shouldn't be an issue for my implementation (for .save_setup
anyway).
>>
>>> Do you have a rough estimation of the size of the device states to migrate?
>>>
>>
>> Do you have a method at how I might be able to estimate this? I've been
>> trying to get some kind of rough estimation but failing to do so.
>
> Could I ask why you started this "migrate virtio-net in iteration phase"
> effort?
>
> I thought it was because there're a lot of data to migrate, and there
> should be a way to estimate the minumum. So is it not the case?
>
> How about vDPA devices? Do those devices have a lot of data to migrate?
>
> We really need a good enough reason to have a device provide
> save_iterate(). If it's only about "preheat some MMIO registers", we
> should, IMHO, look at more generic ways first.
>
This effort was started to reduce the guest visible downtime by
virtio-net/vhost-net/vhost-vDPA during live migration, especially
vhost-vDPA.
The downtime contributed by vhost-vDPA, for example, is not from having
to migrate a lot of state but rather from expensive backend control-plane
latency like CVQ configurations (e.g. MQ queue pairs, RSS, MAC/VLAN
filters, offload settings, MTU, etc.). Doing this requires kernel/HW NIC
operations, which dominate its downtime.
In other words, by migrating the state of virtio-net early (before the
stop-and-copy phase), we can also start staging backend configurations,
which is the main contributor of downtime when migrating a vhost-vDPA
device.
I apologize if this series gives the impression that we're migrating a
lot of data here. It's more along the lines of moving control-plane
latency out of the stop-and-copy phase.
> Thanks,
>
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC 5/6] virtio,virtio-net: skip consistency check in virtio_load for iterative migration
2025-08-11 21:26 ` Jonah Palmer
@ 2025-08-11 21:55 ` Peter Xu
2025-08-12 15:51 ` Jonah Palmer
2025-08-13 9:25 ` Eugenio Perez Martin
0 siblings, 2 replies; 66+ messages in thread
From: Peter Xu @ 2025-08-11 21:55 UTC (permalink / raw)
To: Jonah Palmer
Cc: qemu-devel, farosas, eblake, armbru, jasowang, mst, si-wei.liu,
eperezma, boris.ostrovsky
On Mon, Aug 11, 2025 at 05:26:05PM -0400, Jonah Palmer wrote:
> This effort was started to reduce the guest visible downtime by
> virtio-net/vhost-net/vhost-vDPA during live migration, especially
> vhost-vDPA.
>
> The downtime contributed by vhost-vDPA, for example, is not from having to
> migrate a lot of state but rather expensive backend control-plane latency
> like CVQ configurations (e.g. MQ queue pairs, RSS, MAC/VLAN filters, offload
> settings, MTU, etc.). Doing this requires kernel/HW NIC operations which
> dominates its downtime.
>
> In other words, by migrating the state of virtio-net early (before the
> stop-and-copy phase), we can also start staging backend configurations,
> which is the main contributor of downtime when migrating a vhost-vDPA
> device.
>
> I apologize if this series gives the impression that we're migrating a lot
> of data here. It's more along the lines of moving control-plane latency out
> of the stop-and-copy phase.
I see, thanks.
Please add these into the cover letter of the next post. IMHO it's
extremely important information to explain the real goal of this work. I
bet it is not expected for most people when reading the current cover
letter.
Then it could have nothing to do with iterative phase, am I right?
What are the data needed for the dest QEMU to start staging backend
configurations to the HWs underneath? Does dest QEMU already have them in
the cmdlines?
Asking this because I want to know whether it can be done completely
without src QEMU at all, e.g. when dest QEMU starts.
If src QEMU's data is still needed, please also first consider providing
such facility using an "early VMSD" if it is ever possible: feel free to
refer to commit 3b95a71b22827d26178.
So the data to be transferred is still in VMSD form, aka, data are still
described by VMSD macros, instead of hard-coded streamed protocols using
e.g. qemufile APIs via save_setup()/load_setup().
When things are described in VMSDs, they get the most benefit from the live
migration framework, and it's much, much more flexible. It's the most
suggested way for a device to cooperate with live migration; savevmhandlers
are only the last resort because they're almost not in control of migration.
In short, please avoid using savevmhandlers as long as there can be any
other way to achieve similar results.
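(For reference, a rough sketch of the "early VMSD" idea for virtio-net,
modeled on the virtio-mem precedent; the section name and the choice of
fields are purely illustrative, not part of this series.)

    #include "migration/vmstate.h"
    #include "hw/virtio/virtio-net.h"

    /* A VMSD flagged with .early_setup is sent during the setup phase,
     * before RAM, and loaded early on the destination. */
    static const VMStateDescription vmstate_virtio_net_early = {
        .name = "virtio-net-device/early",
        .version_id = 1,
        .minimum_version_id = 1,
        .early_setup = true,
        .fields = (const VMStateField[]) {
            VMSTATE_UINT16(max_queue_pairs, VirtIONet),
            VMSTATE_UINT16(curr_queue_pairs, VirtIONet),
            VMSTATE_END_OF_LIST()
        },
    };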
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC 5/6] virtio,virtio-net: skip consistency check in virtio_load for iterative migration
2025-08-11 21:55 ` Peter Xu
@ 2025-08-12 15:51 ` Jonah Palmer
2025-08-13 9:25 ` Eugenio Perez Martin
1 sibling, 0 replies; 66+ messages in thread
From: Jonah Palmer @ 2025-08-12 15:51 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, farosas, eblake, armbru, jasowang, mst, si-wei.liu,
eperezma, boris.ostrovsky
On 8/11/25 5:55 PM, Peter Xu wrote:
> On Mon, Aug 11, 2025 at 05:26:05PM -0400, Jonah Palmer wrote:
>> This effort was started to reduce the guest visible downtime by
>> virtio-net/vhost-net/vhost-vDPA during live migration, especially
>> vhost-vDPA.
>>
>> The downtime contributed by vhost-vDPA, for example, is not from having to
>> migrate a lot of state but rather expensive backend control-plane latency
>> like CVQ configurations (e.g. MQ queue pairs, RSS, MAC/VLAN filters, offload
>> settings, MTU, etc.). Doing this requires kernel/HW NIC operations which
>> dominates its downtime.
>>
>> In other words, by migrating the state of virtio-net early (before the
>> stop-and-copy phase), we can also start staging backend configurations,
>> which is the main contributor of downtime when migrating a vhost-vDPA
>> device.
>>
>> I apologize if this series gives the impression that we're migrating a lot
>> of data here. It's more along the lines of moving control-plane latency out
>> of the stop-and-copy phase.
>
> I see, thanks.
>
> Please add these into the cover letter of the next post. IMHO it's
> extremely important information to explain the real goal of this work. I
> bet it is not expected for most people when reading the current cover
> letter.
>
> Then it could have nothing to do with iterative phase, am I right?
>
> What are the data needed for the dest QEMU to start staging backend
> configurations to the HWs underneath? Does dest QEMU already have them in
> the cmdlines?
>
> Asking this because I want to know whether it can be done completely
> without src QEMU at all, e.g. when dest QEMU starts.
>
> If src QEMU's data is still needed, please also first consider providing
> such facility using an "early VMSD" if it is ever possible: feel free to
> refer to commit 3b95a71b22827d26178.
>
> So the data to be transferred is still in VMSD form, aka, data are still
> described by VMSD macros, instead of hard-coded streamline protocols using
> e.g. qemufile APIs using save_setup()/load_setup().
>
> When things are described in VMSDs, it get the most benefit from the live
> migration framework, and it's much, much more flexible. It's the most
> suggested way for device to cooperate with live migration, savevmhandlers
> are only the last resort because it's almost not in control of migration..
>
> In short, please avoid using savevmhandlers as long as there can be any
> other way to achieve similar results.
>
Oh this early VMSD is interesting and, at first glance, appears to be
suitable for what we're trying to do here. I'll take a look at it and
see if this is something we can use instead of the SaveVMHandlers hooks.
Thank you for mentioning this.
Jonah
> Thanks,
>
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC 5/6] virtio,virtio-net: skip consistency check in virtio_load for iterative migration
2025-08-11 21:55 ` Peter Xu
2025-08-12 15:51 ` Jonah Palmer
@ 2025-08-13 9:25 ` Eugenio Perez Martin
2025-08-13 14:06 ` Peter Xu
1 sibling, 1 reply; 66+ messages in thread
From: Eugenio Perez Martin @ 2025-08-13 9:25 UTC (permalink / raw)
To: Peter Xu
Cc: Jonah Palmer, qemu-devel, farosas, eblake, armbru, jasowang, mst,
si-wei.liu, boris.ostrovsky
On Mon, Aug 11, 2025 at 11:56 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Mon, Aug 11, 2025 at 05:26:05PM -0400, Jonah Palmer wrote:
> > This effort was started to reduce the guest visible downtime by
> > virtio-net/vhost-net/vhost-vDPA during live migration, especially
> > vhost-vDPA.
> >
> > The downtime contributed by vhost-vDPA, for example, is not from having to
> > migrate a lot of state but rather expensive backend control-plane latency
> > like CVQ configurations (e.g. MQ queue pairs, RSS, MAC/VLAN filters, offload
> > settings, MTU, etc.). Doing this requires kernel/HW NIC operations which
> > dominates its downtime.
> >
> > In other words, by migrating the state of virtio-net early (before the
> > stop-and-copy phase), we can also start staging backend configurations,
> > which is the main contributor of downtime when migrating a vhost-vDPA
> > device.
> >
> > I apologize if this series gives the impression that we're migrating a lot
> > of data here. It's more along the lines of moving control-plane latency out
> > of the stop-and-copy phase.
>
> I see, thanks.
>
> Please add these into the cover letter of the next post. IMHO it's
> extremely important information to explain the real goal of this work. I
> bet it is not expected for most people when reading the current cover
> letter.
>
> Then it could have nothing to do with iterative phase, am I right?
>
> What are the data needed for the dest QEMU to start staging backend
> configurations to the HWs underneath? Does dest QEMU already have them in
> the cmdlines?
>
> Asking this because I want to know whether it can be done completely
> without src QEMU at all, e.g. when dest QEMU starts.
>
> If src QEMU's data is still needed, please also first consider providing
> such facility using an "early VMSD" if it is ever possible: feel free to
> refer to commit 3b95a71b22827d26178.
>
While it works for this series, it does not allow resending the state
when the src device changes. For example, if the number of virtqueues
is modified.
> So the data to be transferred is still in VMSD form, aka, data are still
> described by VMSD macros, instead of hard-coded streamline protocols using
> e.g. qemufile APIs using save_setup()/load_setup().
>
> When things are described in VMSDs, it get the most benefit from the live
> migration framework, and it's much, much more flexible. It's the most
> suggested way for device to cooperate with live migration, savevmhandlers
> are only the last resort because it's almost not in control of migration..
>
> In short, please avoid using savevmhandlers as long as there can be any
> other way to achieve similar results.
>
> Thanks,
>
> --
> Peter Xu
>
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC 5/6] virtio,virtio-net: skip consistency check in virtio_load for iterative migration
2025-08-13 9:25 ` Eugenio Perez Martin
@ 2025-08-13 14:06 ` Peter Xu
2025-08-14 9:28 ` Eugenio Perez Martin
0 siblings, 1 reply; 66+ messages in thread
From: Peter Xu @ 2025-08-13 14:06 UTC (permalink / raw)
To: Eugenio Perez Martin
Cc: Jonah Palmer, qemu-devel, farosas, eblake, armbru, jasowang, mst,
si-wei.liu, boris.ostrovsky
On Wed, Aug 13, 2025 at 11:25:00AM +0200, Eugenio Perez Martin wrote:
> On Mon, Aug 11, 2025 at 11:56 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Mon, Aug 11, 2025 at 05:26:05PM -0400, Jonah Palmer wrote:
> > > This effort was started to reduce the guest visible downtime by
> > > virtio-net/vhost-net/vhost-vDPA during live migration, especially
> > > vhost-vDPA.
> > >
> > > The downtime contributed by vhost-vDPA, for example, is not from having to
> > > migrate a lot of state but rather expensive backend control-plane latency
> > > like CVQ configurations (e.g. MQ queue pairs, RSS, MAC/VLAN filters, offload
> > > settings, MTU, etc.). Doing this requires kernel/HW NIC operations which
> > > dominates its downtime.
> > >
> > > In other words, by migrating the state of virtio-net early (before the
> > > stop-and-copy phase), we can also start staging backend configurations,
> > > which is the main contributor of downtime when migrating a vhost-vDPA
> > > device.
> > >
> > > I apologize if this series gives the impression that we're migrating a lot
> > > of data here. It's more along the lines of moving control-plane latency out
> > > of the stop-and-copy phase.
> >
> > I see, thanks.
> >
> > Please add these into the cover letter of the next post. IMHO it's
> > extremely important information to explain the real goal of this work. I
> > bet it is not expected for most people when reading the current cover
> > letter.
> >
> > Then it could have nothing to do with iterative phase, am I right?
> >
> > What are the data needed for the dest QEMU to start staging backend
> > configurations to the HWs underneath? Does dest QEMU already have them in
> > the cmdlines?
> >
> > Asking this because I want to know whether it can be done completely
> > without src QEMU at all, e.g. when dest QEMU starts.
> >
> > If src QEMU's data is still needed, please also first consider providing
> > such facility using an "early VMSD" if it is ever possible: feel free to
> > refer to commit 3b95a71b22827d26178.
> >
>
> While it works for this series, it does not allow to resend the state
> when the src device changes. For example, if the number of virtqueues
> is modified.
Some explanation on "how sync number of vqueues helps downtime" would help.
Not "it might preheat things", but exactly why, and how that differs when
it's pure software, and when hardware will be involved.
If it's only about pre-heat, could dest qemu preheat with max num of
vqueues? Is it the same cost of downtime when growing num of queues,
v.s. shrinking num of queues?
For softwares, is it about memory transaction updates due to the vqueues?
If so, have we investigated a more generic approach on memory side, likely
some form of continuation from Chuang's work I previously mentioned?
--
Peter Xu
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC 5/6] virtio,virtio-net: skip consistency check in virtio_load for iterative migration
2025-08-13 14:06 ` Peter Xu
@ 2025-08-14 9:28 ` Eugenio Perez Martin
2025-08-14 16:16 ` Dragos Tatulea
` (2 more replies)
0 siblings, 3 replies; 66+ messages in thread
From: Eugenio Perez Martin @ 2025-08-14 9:28 UTC (permalink / raw)
To: Peter Xu
Cc: Jonah Palmer, qemu-devel, farosas, eblake, armbru, jasowang, mst,
si-wei.liu, boris.ostrovsky, Dragos Tatulea DE
On Wed, Aug 13, 2025 at 4:06 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Wed, Aug 13, 2025 at 11:25:00AM +0200, Eugenio Perez Martin wrote:
> > On Mon, Aug 11, 2025 at 11:56 PM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > On Mon, Aug 11, 2025 at 05:26:05PM -0400, Jonah Palmer wrote:
> > > > This effort was started to reduce the guest visible downtime by
> > > > virtio-net/vhost-net/vhost-vDPA during live migration, especially
> > > > vhost-vDPA.
> > > >
> > > > The downtime contributed by vhost-vDPA, for example, is not from having to
> > > > migrate a lot of state but rather expensive backend control-plane latency
> > > > like CVQ configurations (e.g. MQ queue pairs, RSS, MAC/VLAN filters, offload
> > > > settings, MTU, etc.). Doing this requires kernel/HW NIC operations which
> > > > dominates its downtime.
> > > >
> > > > In other words, by migrating the state of virtio-net early (before the
> > > > stop-and-copy phase), we can also start staging backend configurations,
> > > > which is the main contributor of downtime when migrating a vhost-vDPA
> > > > device.
> > > >
> > > > I apologize if this series gives the impression that we're migrating a lot
> > > > of data here. It's more along the lines of moving control-plane latency out
> > > > of the stop-and-copy phase.
> > >
> > > I see, thanks.
> > >
> > > Please add these into the cover letter of the next post. IMHO it's
> > > extremely important information to explain the real goal of this work. I
> > > bet it is not expected for most people when reading the current cover
> > > letter.
> > >
> > > Then it could have nothing to do with iterative phase, am I right?
> > >
> > > What are the data needed for the dest QEMU to start staging backend
> > > configurations to the HWs underneath? Does dest QEMU already have them in
> > > the cmdlines?
> > >
> > > Asking this because I want to know whether it can be done completely
> > > without src QEMU at all, e.g. when dest QEMU starts.
> > >
> > > If src QEMU's data is still needed, please also first consider providing
> > > such facility using an "early VMSD" if it is ever possible: feel free to
> > > refer to commit 3b95a71b22827d26178.
> > >
> >
> > While it works for this series, it does not allow to resend the state
> > when the src device changes. For example, if the number of virtqueues
> > is modified.
>
> Some explanation on "how sync number of vqueues helps downtime" would help.
> Not "it might preheat things", but exactly why, and how that differs when
> it's pure software, and when hardware will be involved.
>
According to Nvidia engineers, configuring the vqs (number, size, RSS, etc.) takes
about ~200ms:
https://lore.kernel.org/qemu-devel/6c8ebb97-d546-3f1c-4cdd-54e23a566f61@nvidia.com/T/
Adding Dragos here in case he can provide more details. Maybe the
numbers have changed though.
And I guess the difference with pure SW will always come down to PCI
communication, which I assume is slower than configuring the host SW
device in RAM or even CPU cache. But I admit that proper profiling is
needed before making those claims.
Jonah, can you print the time it takes to configure the vDPA device
with traces vs. the time it takes to enable the dataplane of the
device, so we can get an idea of how much time we save with this?
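To be concrete, something like the sketch below is what I mean: take a
timestamp when vhost_vdpa_dev_start() is entered and another one right
after the first dataplane vring is enabled. The ConfigTimer helper and
its call sites are only illustrative, nothing like it exists in the tree:

/* Rough sketch of the timing I'm asking for; the helper itself and its
 * call sites (vhost_vdpa_dev_start() entry, right after the first
 * VHOST_VDPA_SET_VRING_ENABLE) are illustrative only. */
#include "qemu/osdep.h"
#include "qemu/timer.h"
#include "qemu/log.h"

typedef struct ConfigTimer {
    int64_t start_ns;
} ConfigTimer;

static inline void config_timer_start(ConfigTimer *t)
{
    t->start_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
}

static inline void config_timer_report(ConfigTimer *t, const char *what)
{
    int64_t delta_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME) - t->start_ns;

    qemu_log("%s: %" PRId64 " us\n", what, delta_ns / 1000);
}

Start it at the top of vhost_vdpa_dev_start() and report once right
after the first vring is enabled and once after the last one, so both
numbers come from the same run.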
> If it's only about pre-heat, could dest qemu preheat with max num of
> vqueues? Is it the same cost of downtime when growing num of queues,
> v.s. shrinking num of queues?
>
Well, you need to send the vq addresses and properties to preheat
these. If the address is invalid, the destination device will still
interpret it as the avail ring, for example, and will read an invalid
avail idx.
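To make it concrete, these are the per-vq bits I mean (field names come
from <linux/vhost.h>; the preheat_vq() wrapper and how the addresses
would actually reach the destination are of course hypothetical):

/* Sketch only: the per-virtqueue state the destination would need to
 * receive early for preheating to be safe. preheat_vq() is a made-up
 * wrapper around the standard vhost ioctls. */
#include <linux/vhost.h>
#include <stdint.h>
#include <sys/ioctl.h>

static int preheat_vq(int vhost_fd, unsigned int idx, uint16_t num,
                      uint64_t desc_gpa, uint64_t avail_gpa, uint64_t used_gpa)
{
    struct vhost_vring_state size = { .index = idx, .num = num };
    struct vhost_vring_addr addr = {
        .index           = idx,
        .desc_user_addr  = desc_gpa,   /* descriptor table */
        .avail_user_addr = avail_gpa,  /* driver (avail) ring */
        .used_user_addr  = used_gpa,   /* device (used) ring */
    };

    if (ioctl(vhost_fd, VHOST_SET_VRING_NUM, &size) < 0) {
        return -1;
    }
    /* If avail_user_addr here is stale, the device reads a bogus
     * avail idx, which is the failure mode described above. */
    return ioctl(vhost_fd, VHOST_SET_VRING_ADDR, &addr);
}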
> For softwares, is it about memory transaction updates due to the vqueues?
> If so, have we investigated a more generic approach on memory side, likely
> some form of continuation from Chuang's work I previously mentioned?
>
This work is very interesting, and most of the downtime was indeed
because of memory pinning. Thanks for bringing it up! But the downtime
is not caused by the individual vq memory config, but by pinning all
the guest's memory so the device can access it.
I think it is worth exploring whether it affects the downtime in the HW
case. I don't see any reason why that series stalled other than a lack
of reviews, is there?
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC 5/6] virtio,virtio-net: skip consistency check in virtio_load for iterative migration
2025-08-14 9:28 ` Eugenio Perez Martin
@ 2025-08-14 16:16 ` Dragos Tatulea
2025-08-14 20:27 ` Peter Xu
2025-08-15 14:50 ` Jonah Palmer
2 siblings, 0 replies; 66+ messages in thread
From: Dragos Tatulea @ 2025-08-14 16:16 UTC (permalink / raw)
To: Eugenio Perez Martin, Peter Xu
Cc: Jonah Palmer, qemu-devel, farosas, eblake, armbru, jasowang, mst,
si-wei.liu, boris.ostrovsky
On Thu, Aug 14, 2025 at 11:28:24AM +0200, Eugenio Perez Martin wrote:
> On Wed, Aug 13, 2025 at 4:06 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Wed, Aug 13, 2025 at 11:25:00AM +0200, Eugenio Perez Martin wrote:
> > > On Mon, Aug 11, 2025 at 11:56 PM Peter Xu <peterx@redhat.com> wrote:
> > > >
> > > > On Mon, Aug 11, 2025 at 05:26:05PM -0400, Jonah Palmer wrote:
> > > > > This effort was started to reduce the guest visible downtime by
> > > > > virtio-net/vhost-net/vhost-vDPA during live migration, especially
> > > > > vhost-vDPA.
> > > > >
> > > > > The downtime contributed by vhost-vDPA, for example, is not from having to
> > > > > migrate a lot of state but rather expensive backend control-plane latency
> > > > > like CVQ configurations (e.g. MQ queue pairs, RSS, MAC/VLAN filters, offload
> > > > > settings, MTU, etc.). Doing this requires kernel/HW NIC operations which
> > > > > dominates its downtime.
> > > > >
> > > > > In other words, by migrating the state of virtio-net early (before the
> > > > > stop-and-copy phase), we can also start staging backend configurations,
> > > > > which is the main contributor of downtime when migrating a vhost-vDPA
> > > > > device.
> > > > >
> > > > > I apologize if this series gives the impression that we're migrating a lot
> > > > > of data here. It's more along the lines of moving control-plane latency out
> > > > > of the stop-and-copy phase.
> > > >
> > > > I see, thanks.
> > > >
> > > > Please add these into the cover letter of the next post. IMHO it's
> > > > extremely important information to explain the real goal of this work. I
> > > > bet it is not expected for most people when reading the current cover
> > > > letter.
> > > >
> > > > Then it could have nothing to do with iterative phase, am I right?
> > > >
> > > > What are the data needed for the dest QEMU to start staging backend
> > > > configurations to the HWs underneath? Does dest QEMU already have them in
> > > > the cmdlines?
> > > >
> > > > Asking this because I want to know whether it can be done completely
> > > > without src QEMU at all, e.g. when dest QEMU starts.
> > > >
> > > > If src QEMU's data is still needed, please also first consider providing
> > > > such facility using an "early VMSD" if it is ever possible: feel free to
> > > > refer to commit 3b95a71b22827d26178.
> > > >
> > >
> > > While it works for this series, it does not allow to resend the state
> > > when the src device changes. For example, if the number of virtqueues
> > > is modified.
> >
> > Some explanation on "how sync number of vqueues helps downtime" would help.
> > Not "it might preheat things", but exactly why, and how that differs when
> > it's pure software, and when hardware will be involved.
> >
>
> By nvidia engineers to configure vqs (number, size, RSS, etc) takes
> about ~200ms:
> https://lore.kernel.org/qemu-devel/6c8ebb97-d546-3f1c-4cdd-54e23a566f61@nvidia.com/T/
>
> Adding Dragos here in case he can provide more details. Maybe the
> numbers have changed though.
For kernel mlx5_vdpa it can be even more on larger systems (256 GB VM
with 32 VQs):
https://lore.kernel.org/virtualization/20240830105838.2666587-2-dtatulea@nvidia.com/
As pointed out in the above link, configuring VQs can take a lot of
time when many VQs are used (32 in our example). So having them
pre-configured during migration would be a worthwhile optimization.
Thanks,
Dragos
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC 5/6] virtio,virtio-net: skip consistency check in virtio_load for iterative migration
2025-08-14 9:28 ` Eugenio Perez Martin
2025-08-14 16:16 ` Dragos Tatulea
@ 2025-08-14 20:27 ` Peter Xu
2025-08-15 14:50 ` Jonah Palmer
2 siblings, 0 replies; 66+ messages in thread
From: Peter Xu @ 2025-08-14 20:27 UTC (permalink / raw)
To: Eugenio Perez Martin
Cc: Jonah Palmer, qemu-devel, farosas, eblake, armbru, jasowang, mst,
si-wei.liu, boris.ostrovsky, Dragos Tatulea DE, Stefan Hajnoczi
On Thu, Aug 14, 2025 at 11:28:24AM +0200, Eugenio Perez Martin wrote:
> Well you need to send the vq addresses and properties to preheat
> these. If the address is invalid, the destination device will
> interpret the vq address as the avail ring, for example, and will read
> an invalid avail idx.
I see now. But.. aren't the vq addresses assigned by the guest driver? What
happens if one pre-heated the vqs but the VM rebooted right before live
migration decides to switch over to the dest QEMU?
>
> > For softwares, is it about memory transaction updates due to the vqueues?
> > If so, have we investigated a more generic approach on memory side, likely
> > some form of continuation from Chuang's work I previously mentioned?
> >
>
> This work is very interesting, and most of the downtime was because of
> memory pinning indeed. Thanks for bringing it up! But the downtime is
> not caused for the individual vq memory config, but for pinning all
> the guest's memory for the device to access to it.
>
> I think it is worth exploring if it affects the downtime in the case
> of HW. I don't see any reason to reject that series but lack of
> reviews, isn't it?
Partly yes.. but not fully.
I don't remember many details, but I do remember the series tried to mark
the whole device load as one memory transaction, which causes the
guest GPA flatview to be obsolete during that period.
The issue is that some of the special devices need to access guest
memory during post_load(), hence one transaction wouldn't be enough, and
I don't remember whether we have captured all such outliers, or all the
side effects of a possibly obsolete flatview being present.
In one of the later discussions, Stefan mentioned we could provide a
smaller transaction window, and I think that might be something we can
also try.
For example, I think it's worthwhile to try one transaction per virtio-net
device, then all the vqueues will be loaded in one transaction as long as
the load of the virtio-net device doesn't need to access guest memory.
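A minimal sketch of what I mean, assuming the virtio-net load path
itself never dereferences guest memory; the wrapper name below is made
up, the transaction API is the existing
memory_region_transaction_begin()/commit() pair:

/* Sketch: commit all memory updates triggered by one device's load in a
 * single flatview rebuild. Only valid if virtio_load() for this device
 * never needs to read guest memory itself. */
#include "qemu/osdep.h"
#include "exec/memory.h"
#include "hw/virtio/virtio.h"

static int virtio_net_load_one_transaction(VirtIODevice *vdev, QEMUFile *f,
                                           int version_id)
{
    int ret;

    memory_region_transaction_begin();
    ret = virtio_load(vdev, f, version_id);
    memory_region_transaction_commit();

    return ret;
}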
--
Peter Xu
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC 5/6] virtio,virtio-net: skip consistency check in virtio_load for iterative migration
2025-08-14 9:28 ` Eugenio Perez Martin
2025-08-14 16:16 ` Dragos Tatulea
2025-08-14 20:27 ` Peter Xu
@ 2025-08-15 14:50 ` Jonah Palmer
2025-08-15 19:35 ` Si-Wei Liu
2025-08-18 6:51 ` Eugenio Perez Martin
2 siblings, 2 replies; 66+ messages in thread
From: Jonah Palmer @ 2025-08-15 14:50 UTC (permalink / raw)
To: Eugenio Perez Martin, Peter Xu
Cc: qemu-devel, farosas, eblake, armbru, jasowang, mst, si-wei.liu,
boris.ostrovsky, Dragos Tatulea DE
On 8/14/25 5:28 AM, Eugenio Perez Martin wrote:
> On Wed, Aug 13, 2025 at 4:06 PM Peter Xu <peterx@redhat.com> wrote:
>>
>> On Wed, Aug 13, 2025 at 11:25:00AM +0200, Eugenio Perez Martin wrote:
>>> On Mon, Aug 11, 2025 at 11:56 PM Peter Xu <peterx@redhat.com> wrote:
>>>>
>>>> On Mon, Aug 11, 2025 at 05:26:05PM -0400, Jonah Palmer wrote:
>>>>> This effort was started to reduce the guest visible downtime by
>>>>> virtio-net/vhost-net/vhost-vDPA during live migration, especially
>>>>> vhost-vDPA.
>>>>>
>>>>> The downtime contributed by vhost-vDPA, for example, is not from having to
>>>>> migrate a lot of state but rather expensive backend control-plane latency
>>>>> like CVQ configurations (e.g. MQ queue pairs, RSS, MAC/VLAN filters, offload
>>>>> settings, MTU, etc.). Doing this requires kernel/HW NIC operations which
>>>>> dominates its downtime.
>>>>>
>>>>> In other words, by migrating the state of virtio-net early (before the
>>>>> stop-and-copy phase), we can also start staging backend configurations,
>>>>> which is the main contributor of downtime when migrating a vhost-vDPA
>>>>> device.
>>>>>
>>>>> I apologize if this series gives the impression that we're migrating a lot
>>>>> of data here. It's more along the lines of moving control-plane latency out
>>>>> of the stop-and-copy phase.
>>>>
>>>> I see, thanks.
>>>>
>>>> Please add these into the cover letter of the next post. IMHO it's
>>>> extremely important information to explain the real goal of this work. I
>>>> bet it is not expected for most people when reading the current cover
>>>> letter.
>>>>
>>>> Then it could have nothing to do with iterative phase, am I right?
>>>>
>>>> What are the data needed for the dest QEMU to start staging backend
>>>> configurations to the HWs underneath? Does dest QEMU already have them in
>>>> the cmdlines?
>>>>
>>>> Asking this because I want to know whether it can be done completely
>>>> without src QEMU at all, e.g. when dest QEMU starts.
>>>>
>>>> If src QEMU's data is still needed, please also first consider providing
>>>> such facility using an "early VMSD" if it is ever possible: feel free to
>>>> refer to commit 3b95a71b22827d26178.
>>>>
>>>
>>> While it works for this series, it does not allow to resend the state
>>> when the src device changes. For example, if the number of virtqueues
>>> is modified.
>>
>> Some explanation on "how sync number of vqueues helps downtime" would help.
>> Not "it might preheat things", but exactly why, and how that differs when
>> it's pure software, and when hardware will be involved.
>>
>
> By nvidia engineers to configure vqs (number, size, RSS, etc) takes
> about ~200ms:
> https://lore.kernel.org/qemu-devel/6c8ebb97-d546-3f1c-4cdd-54e23a566f61@nvidia.com/T/
>
> Adding Dragos here in case he can provide more details. Maybe the
> numbers have changed though.
>
> And I guess the difference with pure SW will always come down to PCI
> communications, which assume it is slower than configuring the host SW
> device in RAM or even CPU cache. But I admin that proper profiling is
> needed before making those claims.
>
> Jonah, can you print the time it takes to configure the vDPA device
> with traces vs the time it takes to enable the dataplane of the
> device? So we can get an idea of how much time we save with this.
>
Let me know if this isn't what you're looking for.
I'm assuming by "configuration time" you mean:
- Time from device startup (entry to vhost_vdpa_dev_start()) to right
before we start enabling the vrings (e.g.
VHOST_VDPA_SET_VRING_ENABLE in vhost_vdpa_net_cvq_load()).
And by "time taken to enable the dataplane" I'm assuming you mean:
- Time right before we start enabling the vrings (see above) to right
after we enable the last vring (at the end of
vhost_vdpa_net_cvq_load())
Guest specs: 128G Mem, SVQ=on, CVQ=on, 8 queue pairs:
-netdev type=vhost-vdpa,vhostdev=$VHOST_VDPA_0,id=vhost-vdpa0,
queues=8,x-svq=on
-device virtio-net-pci,netdev=vhost-vdpa0,id=vdpa0,bootindex=-1,
romfile=,page-per-vq=on,mac=$VF1_MAC,ctrl_vq=on,mq=on,
ctrl_vlan=off,vectors=18,host_mtu=9000,
disable-legacy=on,disable-modern=off
---
Configuration time: ~31s
Dataplane enable time: ~0.14ms
>> If it's only about pre-heat, could dest qemu preheat with max num of
>> vqueues? Is it the same cost of downtime when growing num of queues,
>> v.s. shrinking num of queues?
>>
>
> Well you need to send the vq addresses and properties to preheat
> these. If the address is invalid, the destination device will
> interpret the vq address as the avail ring, for example, and will read
> an invalid avail idx.
>
>> For softwares, is it about memory transaction updates due to the vqueues?
>> If so, have we investigated a more generic approach on memory side, likely
>> some form of continuation from Chuang's work I previously mentioned?
>>
>
> This work is very interesting, and most of the downtime was because of
> memory pinning indeed. Thanks for bringing it up! But the downtime is
> not caused for the individual vq memory config, but for pinning all
> the guest's memory for the device to access to it.
>
> I think it is worth exploring if it affects the downtime in the case
> of HW. I don't see any reason to reject that series but lack of
> reviews, isn't it?
>
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC 5/6] virtio,virtio-net: skip consistency check in virtio_load for iterative migration
2025-08-15 14:50 ` Jonah Palmer
@ 2025-08-15 19:35 ` Si-Wei Liu
2025-08-18 6:51 ` Eugenio Perez Martin
1 sibling, 0 replies; 66+ messages in thread
From: Si-Wei Liu @ 2025-08-15 19:35 UTC (permalink / raw)
To: Jonah Palmer, Eugenio Perez Martin, Peter Xu
Cc: qemu-devel, farosas, eblake, armbru, jasowang, mst,
boris.ostrovsky, Dragos Tatulea DE
Hi Jonah,
On 8/15/2025 7:50 AM, Jonah Palmer wrote:
>
>
> On 8/14/25 5:28 AM, Eugenio Perez Martin wrote:
>> On Wed, Aug 13, 2025 at 4:06 PM Peter Xu <peterx@redhat.com> wrote:
>>>
>>> On Wed, Aug 13, 2025 at 11:25:00AM +0200, Eugenio Perez Martin wrote:
>>>> On Mon, Aug 11, 2025 at 11:56 PM Peter Xu <peterx@redhat.com> wrote:
>>>>>
>>>>> On Mon, Aug 11, 2025 at 05:26:05PM -0400, Jonah Palmer wrote:
>>>>>> This effort was started to reduce the guest visible downtime by
>>>>>> virtio-net/vhost-net/vhost-vDPA during live migration, especially
>>>>>> vhost-vDPA.
>>>>>>
>>>>>> The downtime contributed by vhost-vDPA, for example, is not from
>>>>>> having to
>>>>>> migrate a lot of state but rather expensive backend control-plane
>>>>>> latency
>>>>>> like CVQ configurations (e.g. MQ queue pairs, RSS, MAC/VLAN
>>>>>> filters, offload
>>>>>> settings, MTU, etc.). Doing this requires kernel/HW NIC
>>>>>> operations which
>>>>>> dominates its downtime.
>>>>>>
>>>>>> In other words, by migrating the state of virtio-net early
>>>>>> (before the
>>>>>> stop-and-copy phase), we can also start staging backend
>>>>>> configurations,
>>>>>> which is the main contributor of downtime when migrating a
>>>>>> vhost-vDPA
>>>>>> device.
>>>>>>
>>>>>> I apologize if this series gives the impression that we're
>>>>>> migrating a lot
>>>>>> of data here. It's more along the lines of moving control-plane
>>>>>> latency out
>>>>>> of the stop-and-copy phase.
>>>>>
>>>>> I see, thanks.
>>>>>
>>>>> Please add these into the cover letter of the next post. IMHO it's
>>>>> extremely important information to explain the real goal of this
>>>>> work. I
>>>>> bet it is not expected for most people when reading the current cover
>>>>> letter.
>>>>>
>>>>> Then it could have nothing to do with iterative phase, am I right?
>>>>>
>>>>> What are the data needed for the dest QEMU to start staging backend
>>>>> configurations to the HWs underneath? Does dest QEMU already have
>>>>> them in
>>>>> the cmdlines?
>>>>>
>>>>> Asking this because I want to know whether it can be done completely
>>>>> without src QEMU at all, e.g. when dest QEMU starts.
>>>>>
>>>>> If src QEMU's data is still needed, please also first consider
>>>>> providing
>>>>> such facility using an "early VMSD" if it is ever possible: feel
>>>>> free to
>>>>> refer to commit 3b95a71b22827d26178.
>>>>>
>>>>
>>>> While it works for this series, it does not allow to resend the state
>>>> when the src device changes. For example, if the number of virtqueues
>>>> is modified.
>>>
>>> Some explanation on "how sync number of vqueues helps downtime"
>>> would help.
>>> Not "it might preheat things", but exactly why, and how that differs
>>> when
>>> it's pure software, and when hardware will be involved.
>>>
>>
>> By nvidia engineers to configure vqs (number, size, RSS, etc) takes
>> about ~200ms:
>> https://lore.kernel.org/qemu-devel/6c8ebb97-d546-3f1c-4cdd-54e23a566f61@nvidia.com/T/
>>
>>
>> Adding Dragos here in case he can provide more details. Maybe the
>> numbers have changed though.
>>
>> And I guess the difference with pure SW will always come down to PCI
>> communications, which assume it is slower than configuring the host SW
>> device in RAM or even CPU cache. But I admin that proper profiling is
>> needed before making those claims.
>>
>> Jonah, can you print the time it takes to configure the vDPA device
>> with traces vs the time it takes to enable the dataplane of the
>> device? So we can get an idea of how much time we save with this.
>>
>
> Let me know if this isn't what you're looking for.
>
> I'm assuming by "configuration time" you mean:
> - Time from device startup (entry to vhost_vdpa_dev_start()) to right
> before we start enabling the vrings (e.g.
> VHOST_VDPA_SET_VRING_ENABLE in vhost_vdpa_net_cvq_load()).
>
> And by "time taken to enable the dataplane" I'm assuming you mean:
> - Time right before we start enabling the vrings (see above) to right
> after we enable the last vring (at the end of
> vhost_vdpa_net_cvq_load())
>
> Guest specs: 128G Mem, SVQ=on, CVQ=on, 8 queue pairs:
I guess what Eugenio may want to see is the config with SVQ=off (i.e.
without x-svq=on in the netdev line below). Do you have numbers for that
as well? Then, since the measurement starts at vhost_vdpa_dev_start(),
it should exclude the time for pinning, and you could easily
profile/measure the vq configure time (the CVQ commands to configure vq
number, size, RSS, etc.) vs. dataplane enablement, the same way as you
did for SVQ=on.
Regards,
-Siwei
>
> -netdev type=vhost-vdpa,vhostdev=$VHOST_VDPA_0,id=vhost-vdpa0,
> queues=8,x-svq=on
>
> -device virtio-net-pci,netdev=vhost-vdpa0,id=vdpa0,bootindex=-1,
> romfile=,page-per-vq=on,mac=$VF1_MAC,ctrl_vq=on,mq=on,
> ctrl_vlan=off,vectors=18,host_mtu=9000,
> disable-legacy=on,disable-modern=off
>
> ---
>
> Configuration time: ~31s
> Dataplane enable time: ~0.14ms
>
>>> If it's only about pre-heat, could dest qemu preheat with max num of
>>> vqueues? Is it the same cost of downtime when growing num of queues,
>>> v.s. shrinking num of queues?
>>>
>>
>> Well you need to send the vq addresses and properties to preheat
>> these. If the address is invalid, the destination device will
>> interpret the vq address as the avail ring, for example, and will read
>> an invalid avail idx.
>>
>>> For softwares, is it about memory transaction updates due to the
>>> vqueues?
>>> If so, have we investigated a more generic approach on memory side,
>>> likely
>>> some form of continuation from Chuang's work I previously mentioned?
>>>
>>
>> This work is very interesting, and most of the downtime was because of
>> memory pinning indeed. Thanks for bringing it up! But the downtime is
>> not caused for the individual vq memory config, but for pinning all
>> the guest's memory for the device to access to it.
>>
>> I think it is worth exploring if it affects the downtime in the case
>> of HW. I don't see any reason to reject that series but lack of
>> reviews, isn't it?
>>
>
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC 5/6] virtio,virtio-net: skip consistency check in virtio_load for iterative migration
2025-08-15 14:50 ` Jonah Palmer
2025-08-15 19:35 ` Si-Wei Liu
@ 2025-08-18 6:51 ` Eugenio Perez Martin
2025-08-18 14:46 ` Jonah Palmer
1 sibling, 1 reply; 66+ messages in thread
From: Eugenio Perez Martin @ 2025-08-18 6:51 UTC (permalink / raw)
To: Jonah Palmer
Cc: Peter Xu, qemu-devel, farosas, eblake, armbru, jasowang, mst,
si-wei.liu, boris.ostrovsky, Dragos Tatulea DE
On Fri, Aug 15, 2025 at 4:50 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
>
>
>
> On 8/14/25 5:28 AM, Eugenio Perez Martin wrote:
> > On Wed, Aug 13, 2025 at 4:06 PM Peter Xu <peterx@redhat.com> wrote:
> >>
> >> On Wed, Aug 13, 2025 at 11:25:00AM +0200, Eugenio Perez Martin wrote:
> >>> On Mon, Aug 11, 2025 at 11:56 PM Peter Xu <peterx@redhat.com> wrote:
> >>>>
> >>>> On Mon, Aug 11, 2025 at 05:26:05PM -0400, Jonah Palmer wrote:
> >>>>> This effort was started to reduce the guest visible downtime by
> >>>>> virtio-net/vhost-net/vhost-vDPA during live migration, especially
> >>>>> vhost-vDPA.
> >>>>>
> >>>>> The downtime contributed by vhost-vDPA, for example, is not from having to
> >>>>> migrate a lot of state but rather expensive backend control-plane latency
> >>>>> like CVQ configurations (e.g. MQ queue pairs, RSS, MAC/VLAN filters, offload
> >>>>> settings, MTU, etc.). Doing this requires kernel/HW NIC operations which
> >>>>> dominates its downtime.
> >>>>>
> >>>>> In other words, by migrating the state of virtio-net early (before the
> >>>>> stop-and-copy phase), we can also start staging backend configurations,
> >>>>> which is the main contributor of downtime when migrating a vhost-vDPA
> >>>>> device.
> >>>>>
> >>>>> I apologize if this series gives the impression that we're migrating a lot
> >>>>> of data here. It's more along the lines of moving control-plane latency out
> >>>>> of the stop-and-copy phase.
> >>>>
> >>>> I see, thanks.
> >>>>
> >>>> Please add these into the cover letter of the next post. IMHO it's
> >>>> extremely important information to explain the real goal of this work. I
> >>>> bet it is not expected for most people when reading the current cover
> >>>> letter.
> >>>>
> >>>> Then it could have nothing to do with iterative phase, am I right?
> >>>>
> >>>> What are the data needed for the dest QEMU to start staging backend
> >>>> configurations to the HWs underneath? Does dest QEMU already have them in
> >>>> the cmdlines?
> >>>>
> >>>> Asking this because I want to know whether it can be done completely
> >>>> without src QEMU at all, e.g. when dest QEMU starts.
> >>>>
> >>>> If src QEMU's data is still needed, please also first consider providing
> >>>> such facility using an "early VMSD" if it is ever possible: feel free to
> >>>> refer to commit 3b95a71b22827d26178.
> >>>>
> >>>
> >>> While it works for this series, it does not allow to resend the state
> >>> when the src device changes. For example, if the number of virtqueues
> >>> is modified.
> >>
> >> Some explanation on "how sync number of vqueues helps downtime" would help.
> >> Not "it might preheat things", but exactly why, and how that differs when
> >> it's pure software, and when hardware will be involved.
> >>
> >
> > By nvidia engineers to configure vqs (number, size, RSS, etc) takes
> > about ~200ms:
> > > https://lore.kernel.org/qemu-devel/6c8ebb97-d546-3f1c-4cdd-54e23a566f61@nvidia.com/T/
> >
> > Adding Dragos here in case he can provide more details. Maybe the
> > numbers have changed though.
> >
> > And I guess the difference with pure SW will always come down to PCI
> > communications, which assume it is slower than configuring the host SW
> > device in RAM or even CPU cache. But I admin that proper profiling is
> > needed before making those claims.
> >
> > Jonah, can you print the time it takes to configure the vDPA device
> > with traces vs the time it takes to enable the dataplane of the
> > device? So we can get an idea of how much time we save with this.
> >
>
> Let me know if this isn't what you're looking for.
>
> I'm assuming by "configuration time" you mean:
> - Time from device startup (entry to vhost_vdpa_dev_start()) to right
> before we start enabling the vrings (e.g.
> VHOST_VDPA_SET_VRING_ENABLE in vhost_vdpa_net_cvq_load()).
>
> And by "time taken to enable the dataplane" I'm assuming you mean:
> - Time right before we start enabling the vrings (see above) to right
> after we enable the last vring (at the end of
> vhost_vdpa_net_cvq_load())
>
> Guest specs: 128G Mem, SVQ=on, CVQ=on, 8 queue pairs:
>
> -netdev type=vhost-vdpa,vhostdev=$VHOST_VDPA_0,id=vhost-vdpa0,
> queues=8,x-svq=on
>
> -device virtio-net-pci,netdev=vhost-vdpa0,id=vdpa0,bootindex=-1,
> romfile=,page-per-vq=on,mac=$VF1_MAC,ctrl_vq=on,mq=on,
> ctrl_vlan=off,vectors=18,host_mtu=9000,
> disable-legacy=on,disable-modern=off
>
> ---
>
> Configuration time: ~31s
> Dataplane enable time: ~0.14ms
>
I was vague, but yes, that's representative enough! It would be more
accurate if the configuration time ended at the point QEMU enables the
first queue of the dataplane, though.
As Si-Wei mentions, is v->shared->listener_registered == true at the
beginning of vhost_vdpa_dev_start()?
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC 5/6] virtio,virtio-net: skip consistency check in virtio_load for iterative migration
2025-08-18 6:51 ` Eugenio Perez Martin
@ 2025-08-18 14:46 ` Jonah Palmer
2025-08-18 16:21 ` Peter Xu
2025-08-19 7:10 ` Eugenio Perez Martin
0 siblings, 2 replies; 66+ messages in thread
From: Jonah Palmer @ 2025-08-18 14:46 UTC (permalink / raw)
To: Eugenio Perez Martin
Cc: Peter Xu, qemu-devel, farosas, eblake, armbru, jasowang, mst,
si-wei.liu, boris.ostrovsky, Dragos Tatulea DE
On 8/18/25 2:51 AM, Eugenio Perez Martin wrote:
> On Fri, Aug 15, 2025 at 4:50 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
>>
>>
>>
>> On 8/14/25 5:28 AM, Eugenio Perez Martin wrote:
>>> On Wed, Aug 13, 2025 at 4:06 PM Peter Xu <peterx@redhat.com> wrote:
>>>>
>>>> On Wed, Aug 13, 2025 at 11:25:00AM +0200, Eugenio Perez Martin wrote:
>>>>> On Mon, Aug 11, 2025 at 11:56 PM Peter Xu <peterx@redhat.com> wrote:
>>>>>>
>>>>>> On Mon, Aug 11, 2025 at 05:26:05PM -0400, Jonah Palmer wrote:
>>>>>>> This effort was started to reduce the guest visible downtime by
>>>>>>> virtio-net/vhost-net/vhost-vDPA during live migration, especially
>>>>>>> vhost-vDPA.
>>>>>>>
>>>>>>> The downtime contributed by vhost-vDPA, for example, is not from having to
>>>>>>> migrate a lot of state but rather expensive backend control-plane latency
>>>>>>> like CVQ configurations (e.g. MQ queue pairs, RSS, MAC/VLAN filters, offload
>>>>>>> settings, MTU, etc.). Doing this requires kernel/HW NIC operations which
>>>>>>> dominates its downtime.
>>>>>>>
>>>>>>> In other words, by migrating the state of virtio-net early (before the
>>>>>>> stop-and-copy phase), we can also start staging backend configurations,
>>>>>>> which is the main contributor of downtime when migrating a vhost-vDPA
>>>>>>> device.
>>>>>>>
>>>>>>> I apologize if this series gives the impression that we're migrating a lot
>>>>>>> of data here. It's more along the lines of moving control-plane latency out
>>>>>>> of the stop-and-copy phase.
>>>>>>
>>>>>> I see, thanks.
>>>>>>
>>>>>> Please add these into the cover letter of the next post. IMHO it's
>>>>>> extremely important information to explain the real goal of this work. I
>>>>>> bet it is not expected for most people when reading the current cover
>>>>>> letter.
>>>>>>
>>>>>> Then it could have nothing to do with iterative phase, am I right?
>>>>>>
>>>>>> What are the data needed for the dest QEMU to start staging backend
>>>>>> configurations to the HWs underneath? Does dest QEMU already have them in
>>>>>> the cmdlines?
>>>>>>
>>>>>> Asking this because I want to know whether it can be done completely
>>>>>> without src QEMU at all, e.g. when dest QEMU starts.
>>>>>>
>>>>>> If src QEMU's data is still needed, please also first consider providing
>>>>>> such facility using an "early VMSD" if it is ever possible: feel free to
>>>>>> refer to commit 3b95a71b22827d26178.
>>>>>>
>>>>>
>>>>> While it works for this series, it does not allow to resend the state
>>>>> when the src device changes. For example, if the number of virtqueues
>>>>> is modified.
>>>>
>>>> Some explanation on "how sync number of vqueues helps downtime" would help.
>>>> Not "it might preheat things", but exactly why, and how that differs when
>>>> it's pure software, and when hardware will be involved.
>>>>
>>>
>>> By nvidia engineers to configure vqs (number, size, RSS, etc) takes
>>> about ~200ms:
>>> https://lore.kernel.org/qemu-devel/6c8ebb97-d546-3f1c-4cdd-54e23a566f61@nvidia.com/T/
>>>
>>> Adding Dragos here in case he can provide more details. Maybe the
>>> numbers have changed though.
>>>
>>> And I guess the difference with pure SW will always come down to PCI
>>> communications, which assume it is slower than configuring the host SW
>>> device in RAM or even CPU cache. But I admin that proper profiling is
>>> needed before making those claims.
>>>
>>> Jonah, can you print the time it takes to configure the vDPA device
>>> with traces vs the time it takes to enable the dataplane of the
>>> device? So we can get an idea of how much time we save with this.
>>>
>>
>> Let me know if this isn't what you're looking for.
>>
>> I'm assuming by "configuration time" you mean:
>> - Time from device startup (entry to vhost_vdpa_dev_start()) to right
>> before we start enabling the vrings (e.g.
>> VHOST_VDPA_SET_VRING_ENABLE in vhost_vdpa_net_cvq_load()).
>>
>> And by "time taken to enable the dataplane" I'm assuming you mean:
>> - Time right before we start enabling the vrings (see above) to right
>> after we enable the last vring (at the end of
>> vhost_vdpa_net_cvq_load())
>>
>> Guest specs: 128G Mem, SVQ=on, CVQ=on, 8 queue pairs:
>>
>> -netdev type=vhost-vdpa,vhostdev=$VHOST_VDPA_0,id=vhost-vdpa0,
>> queues=8,x-svq=on
>>
>> -device virtio-net-pci,netdev=vhost-vdpa0,id=vdpa0,bootindex=-1,
>> romfile=,page-per-vq=on,mac=$VF1_MAC,ctrl_vq=on,mq=on,
>> ctrl_vlan=off,vectors=18,host_mtu=9000,
>> disable-legacy=on,disable-modern=off
>>
>> ---
>>
>> Configuration time: ~31s
>> Dataplane enable time: ~0.14ms
>>
>
> I was vague, but yes, that's representative enough! It would be more
> accurate if the configuration time ends by the time QEMU enables the
> first queue of the dataplane though.
>
> As Si-Wei mentions, is v->shared->listener_registered == true at the
> beginning of vhost_vdpa_dev_start?
>
Ah, I also realized that the QEMU I was using for measurements was a
version before the listener_registered member was introduced.
I retested with the latest changes in QEMU and set x-svq=off, e.g.:
guest specs: 128G Mem, SVQ=off, CVQ=on, 8 queue pairs. I ran the test 3
times for measurements.
v->shared->listener_registered == false at the beginning of
vhost_vdpa_dev_start().
---
Configuration time: Time from first entry into vhost_vdpa_dev_start() to
right after Qemu enables the first VQ.
- 26.947s, 26.606s, 27.326s
Enable dataplane: Time from right after first VQ is enabled to right
after the last VQ is enabled.
- 0.081ms, 0.081ms, 0.079ms
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC 5/6] virtio,virtio-net: skip consistency check in virtio_load for iterative migration
2025-08-18 14:46 ` Jonah Palmer
@ 2025-08-18 16:21 ` Peter Xu
2025-08-19 7:20 ` Eugenio Perez Martin
2025-08-19 7:10 ` Eugenio Perez Martin
1 sibling, 1 reply; 66+ messages in thread
From: Peter Xu @ 2025-08-18 16:21 UTC (permalink / raw)
To: Jonah Palmer
Cc: Eugenio Perez Martin, qemu-devel, farosas, eblake, armbru,
jasowang, mst, si-wei.liu, boris.ostrovsky, Dragos Tatulea DE
On Mon, Aug 18, 2025 at 10:46:00AM -0400, Jonah Palmer wrote:
>
>
> On 8/18/25 2:51 AM, Eugenio Perez Martin wrote:
> > On Fri, Aug 15, 2025 at 4:50 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
> > >
> > >
> > >
> > > On 8/14/25 5:28 AM, Eugenio Perez Martin wrote:
> > > > On Wed, Aug 13, 2025 at 4:06 PM Peter Xu <peterx@redhat.com> wrote:
> > > > >
> > > > > On Wed, Aug 13, 2025 at 11:25:00AM +0200, Eugenio Perez Martin wrote:
> > > > > > On Mon, Aug 11, 2025 at 11:56 PM Peter Xu <peterx@redhat.com> wrote:
> > > > > > >
> > > > > > > On Mon, Aug 11, 2025 at 05:26:05PM -0400, Jonah Palmer wrote:
> > > > > > > > This effort was started to reduce the guest visible downtime by
> > > > > > > > virtio-net/vhost-net/vhost-vDPA during live migration, especially
> > > > > > > > vhost-vDPA.
> > > > > > > >
> > > > > > > > The downtime contributed by vhost-vDPA, for example, is not from having to
> > > > > > > > migrate a lot of state but rather expensive backend control-plane latency
> > > > > > > > like CVQ configurations (e.g. MQ queue pairs, RSS, MAC/VLAN filters, offload
> > > > > > > > settings, MTU, etc.). Doing this requires kernel/HW NIC operations which
> > > > > > > > dominates its downtime.
> > > > > > > >
> > > > > > > > In other words, by migrating the state of virtio-net early (before the
> > > > > > > > stop-and-copy phase), we can also start staging backend configurations,
> > > > > > > > which is the main contributor of downtime when migrating a vhost-vDPA
> > > > > > > > device.
> > > > > > > >
> > > > > > > > I apologize if this series gives the impression that we're migrating a lot
> > > > > > > > of data here. It's more along the lines of moving control-plane latency out
> > > > > > > > of the stop-and-copy phase.
> > > > > > >
> > > > > > > I see, thanks.
> > > > > > >
> > > > > > > Please add these into the cover letter of the next post. IMHO it's
> > > > > > > extremely important information to explain the real goal of this work. I
> > > > > > > bet it is not expected for most people when reading the current cover
> > > > > > > letter.
> > > > > > >
> > > > > > > Then it could have nothing to do with iterative phase, am I right?
> > > > > > >
> > > > > > > What are the data needed for the dest QEMU to start staging backend
> > > > > > > configurations to the HWs underneath? Does dest QEMU already have them in
> > > > > > > the cmdlines?
> > > > > > >
> > > > > > > Asking this because I want to know whether it can be done completely
> > > > > > > without src QEMU at all, e.g. when dest QEMU starts.
> > > > > > >
> > > > > > > If src QEMU's data is still needed, please also first consider providing
> > > > > > > such facility using an "early VMSD" if it is ever possible: feel free to
> > > > > > > refer to commit 3b95a71b22827d26178.
> > > > > > >
> > > > > >
> > > > > > While it works for this series, it does not allow to resend the state
> > > > > > when the src device changes. For example, if the number of virtqueues
> > > > > > is modified.
> > > > >
> > > > > Some explanation on "how sync number of vqueues helps downtime" would help.
> > > > > Not "it might preheat things", but exactly why, and how that differs when
> > > > > it's pure software, and when hardware will be involved.
> > > > >
> > > >
> > > > By nvidia engineers to configure vqs (number, size, RSS, etc) takes
> > > > about ~200ms:
> > > > https://lore.kernel.org/qemu-devel/6c8ebb97-d546-3f1c-4cdd-54e23a566f61@nvidia.com/T/
> > > >
> > > > Adding Dragos here in case he can provide more details. Maybe the
> > > > numbers have changed though.
> > > >
> > > > And I guess the difference with pure SW will always come down to PCI
> > > > communications, which assume it is slower than configuring the host SW
> > > > device in RAM or even CPU cache. But I admin that proper profiling is
> > > > needed before making those claims.
> > > >
> > > > Jonah, can you print the time it takes to configure the vDPA device
> > > > with traces vs the time it takes to enable the dataplane of the
> > > > device? So we can get an idea of how much time we save with this.
> > > >
> > >
> > > Let me know if this isn't what you're looking for.
> > >
> > > I'm assuming by "configuration time" you mean:
> > > - Time from device startup (entry to vhost_vdpa_dev_start()) to right
> > > before we start enabling the vrings (e.g.
> > > VHOST_VDPA_SET_VRING_ENABLE in vhost_vdpa_net_cvq_load()).
> > >
> > > And by "time taken to enable the dataplane" I'm assuming you mean:
> > > - Time right before we start enabling the vrings (see above) to right
> > > after we enable the last vring (at the end of
> > > vhost_vdpa_net_cvq_load())
> > >
> > > Guest specs: 128G Mem, SVQ=on, CVQ=on, 8 queue pairs:
> > >
> > > -netdev type=vhost-vdpa,vhostdev=$VHOST_VDPA_0,id=vhost-vdpa0,
> > > queues=8,x-svq=on
> > >
> > > -device virtio-net-pci,netdev=vhost-vdpa0,id=vdpa0,bootindex=-1,
> > > romfile=,page-per-vq=on,mac=$VF1_MAC,ctrl_vq=on,mq=on,
> > > ctrl_vlan=off,vectors=18,host_mtu=9000,
> > > disable-legacy=on,disable-modern=off
> > >
> > > ---
> > >
> > > Configuration time: ~31s
> > > Dataplane enable time: ~0.14ms
> > >
> >
> > I was vague, but yes, that's representative enough! It would be more
> > accurate if the configuration time ends by the time QEMU enables the
> > first queue of the dataplane though.
> >
> > As Si-Wei mentions, is v->shared->listener_registered == true at the
> > beginning of vhost_vdpa_dev_start?
> >
>
> Ah, I also realized that Qemu I was using for measurements was using a
> version before the listener_registered member was introduced.
>
> I retested with the latest changes in Qemu and set x-svq=off, e.g.: guest
> specs: 128G Mem, SVQ=off, CVQ=on, 8 queue pairs. I ran testing 3 times for
> measurements.
>
> v->shared->listener_registered == false at the beginning of
> vhost_vdpa_dev_start().
>
> ---
>
> Configuration time: Time from first entry into vhost_vdpa_dev_start() to
> right after Qemu enables the first VQ.
> - 26.947s, 26.606s, 27.326s
It's surprising that it takes 20+ seconds for one device to load.
Sorry, I'm not familiar with CVQ, please bear with me on my ignorance: how
much does CVQ=on contribute to this? Is page pinning involved here? Is the
128GB using small pages only?
It looks to me like vDPA will still face many of the same challenges
that VFIO already had. For example, there's current work
optimizing pinning for VFIO here:
https://lore.kernel.org/all/20250814064714.56485-1-lizhe.67@bytedance.com/
For the long term, I'm not sure if (for either VFIO or vDPA, or similar devices
that need guest pinning) it would make more sense to start using 1G huge
pages just for the sake of fast pinning.
PFNMAP in VFIO already works with 1G pfnmaps with commit eb996eec783c.
Logically if we could use 1G pages (e.g. on x86_64) for guest, then pinning
/ unpinning can also be easily batched, and DMA pinning should be much
faster. The same logic may also apply to vDPA if it works the similar way.
The work above was still generic, but I mentioned the idea of optimizing
for 1G huge pages here:
https://lore.kernel.org/all/aC3z_gUxJbY1_JP7@x1.local/#t
The above is just FYI.. definitely not a request to work on that. So if we
could split the issue into multiple smaller scopes of work, it would be
nicer. The "iterable migratable virtio-net" might just hide too
many things under the hood.
>
> Enable dataplane: Time from right after first VQ is enabled to right after
> the last VQ is enabled.
> - 0.081ms, 0.081ms, 0.079ms
>
The other thing that might be worth mentioning.. from the migration
perspective, VFIO introduced a feature called switchover-ack:
# @switchover-ack: If enabled, migration will not stop the source VM
# and complete the migration until an ACK is received from the
# destination that it's OK to do so. Exactly when this ACK is
# sent depends on the migrated devices that use this feature. For
# example, a device can use it to make sure some of its data is
# sent and loaded in the destination before doing switchover.
# This can reduce downtime if devices that support this capability
# are present. 'return-path' capability must be enabled to use
# it. (since 8.1)
If the above 20+ seconds are not avoidable, I'm not sure if virtio-net
would like to opt in to this feature too, so that switchover won't happen
too soon during a premature preheat and that time won't be accounted as
downtime.
Again, just FYI. I'm not sure if it's applicable.
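For illustration, and with the caveat that I'm going from memory on the
exact API, opting in would look roughly like the VFIO pattern: implement
.switchover_ack_needed in the SaveVMHandlers and call
qemu_loadvm_approve_switchover() on the destination once the backend has
finished the expensive staging. Treat the names below as assumptions to
double-check against the tree; virtio_net_early_config_enabled() is a
made-up placeholder:

#include "qemu/osdep.h"
#include "migration/register.h"   /* SaveVMHandlers */
#include "migration/savevm.h"     /* qemu_loadvm_approve_switchover() */

/* Placeholder: whether the early-config / preheat path is in use. */
static bool virtio_net_early_config_enabled(void *opaque)
{
    return true;
}

static bool virtio_net_switchover_ack_needed(void *opaque)
{
    /* Only hold off switchover when there is something to preheat. */
    return virtio_net_early_config_enabled(opaque);
}

/* Destination side: call once the backend has finished staging the
 * early configuration (CVQ setup, vq preheat, ...). */
static void virtio_net_early_config_done(void)
{
    qemu_loadvm_approve_switchover();
}

static const SaveVMHandlers savevm_virtio_net_handlers = {
    .switchover_ack_needed = virtio_net_switchover_ack_needed,
    /* ... plus the iterative handlers from this series ... */
};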
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC 5/6] virtio,virtio-net: skip consistency check in virtio_load for iterative migration
2025-08-18 14:46 ` Jonah Palmer
2025-08-18 16:21 ` Peter Xu
@ 2025-08-19 7:10 ` Eugenio Perez Martin
2025-08-19 15:10 ` Jonah Palmer
1 sibling, 1 reply; 66+ messages in thread
From: Eugenio Perez Martin @ 2025-08-19 7:10 UTC (permalink / raw)
To: Jonah Palmer
Cc: Peter Xu, qemu-devel, farosas, eblake, armbru, jasowang, mst,
si-wei.liu, boris.ostrovsky, Dragos Tatulea DE
On Mon, Aug 18, 2025 at 4:46 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
>
>
>
> On 8/18/25 2:51 AM, Eugenio Perez Martin wrote:
> > On Fri, Aug 15, 2025 at 4:50 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
> >>
> >>
> >>
> >> On 8/14/25 5:28 AM, Eugenio Perez Martin wrote:
> >>> On Wed, Aug 13, 2025 at 4:06 PM Peter Xu <peterx@redhat.com> wrote:
> >>>>
> >>>> On Wed, Aug 13, 2025 at 11:25:00AM +0200, Eugenio Perez Martin wrote:
> >>>>> On Mon, Aug 11, 2025 at 11:56 PM Peter Xu <peterx@redhat.com> wrote:
> >>>>>>
> >>>>>> On Mon, Aug 11, 2025 at 05:26:05PM -0400, Jonah Palmer wrote:
> >>>>>>> This effort was started to reduce the guest visible downtime by
> >>>>>>> virtio-net/vhost-net/vhost-vDPA during live migration, especially
> >>>>>>> vhost-vDPA.
> >>>>>>>
> >>>>>>> The downtime contributed by vhost-vDPA, for example, is not from having to
> >>>>>>> migrate a lot of state but rather expensive backend control-plane latency
> >>>>>>> like CVQ configurations (e.g. MQ queue pairs, RSS, MAC/VLAN filters, offload
> >>>>>>> settings, MTU, etc.). Doing this requires kernel/HW NIC operations which
> >>>>>>> dominates its downtime.
> >>>>>>>
> >>>>>>> In other words, by migrating the state of virtio-net early (before the
> >>>>>>> stop-and-copy phase), we can also start staging backend configurations,
> >>>>>>> which is the main contributor of downtime when migrating a vhost-vDPA
> >>>>>>> device.
> >>>>>>>
> >>>>>>> I apologize if this series gives the impression that we're migrating a lot
> >>>>>>> of data here. It's more along the lines of moving control-plane latency out
> >>>>>>> of the stop-and-copy phase.
> >>>>>>
> >>>>>> I see, thanks.
> >>>>>>
> >>>>>> Please add these into the cover letter of the next post. IMHO it's
> >>>>>> extremely important information to explain the real goal of this work. I
> >>>>>> bet it is not expected for most people when reading the current cover
> >>>>>> letter.
> >>>>>>
> >>>>>> Then it could have nothing to do with iterative phase, am I right?
> >>>>>>
> >>>>>> What are the data needed for the dest QEMU to start staging backend
> >>>>>> configurations to the HWs underneath? Does dest QEMU already have them in
> >>>>>> the cmdlines?
> >>>>>>
> >>>>>> Asking this because I want to know whether it can be done completely
> >>>>>> without src QEMU at all, e.g. when dest QEMU starts.
> >>>>>>
> >>>>>> If src QEMU's data is still needed, please also first consider providing
> >>>>>> such facility using an "early VMSD" if it is ever possible: feel free to
> >>>>>> refer to commit 3b95a71b22827d26178.
> >>>>>>
> >>>>>
> >>>>> While it works for this series, it does not allow to resend the state
> >>>>> when the src device changes. For example, if the number of virtqueues
> >>>>> is modified.
> >>>>
> >>>> Some explanation on "how sync number of vqueues helps downtime" would help.
> >>>> Not "it might preheat things", but exactly why, and how that differs when
> >>>> it's pure software, and when hardware will be involved.
> >>>>
> >>>
> >>> By nvidia engineers to configure vqs (number, size, RSS, etc) takes
> >>> about ~200ms:
> >>> https://lore.kernel.org/qemu-devel/6c8ebb97-d546-3f1c-4cdd-54e23a566f61@nvidia.com/T/
> >>>
> >>> Adding Dragos here in case he can provide more details. Maybe the
> >>> numbers have changed though.
> >>>
> >>> And I guess the difference with pure SW will always come down to PCI
> >>> communications, which assume it is slower than configuring the host SW
> >>> device in RAM or even CPU cache. But I admin that proper profiling is
> >>> needed before making those claims.
> >>>
> >>> Jonah, can you print the time it takes to configure the vDPA device
> >>> with traces vs the time it takes to enable the dataplane of the
> >>> device? So we can get an idea of how much time we save with this.
> >>>
> >>
> >> Let me know if this isn't what you're looking for.
> >>
> >> I'm assuming by "configuration time" you mean:
> >> - Time from device startup (entry to vhost_vdpa_dev_start()) to right
> >> before we start enabling the vrings (e.g.
> >> VHOST_VDPA_SET_VRING_ENABLE in vhost_vdpa_net_cvq_load()).
> >>
> >> And by "time taken to enable the dataplane" I'm assuming you mean:
> >> - Time right before we start enabling the vrings (see above) to right
> >> after we enable the last vring (at the end of
> >> vhost_vdpa_net_cvq_load())
> >>
> >> Guest specs: 128G Mem, SVQ=on, CVQ=on, 8 queue pairs:
> >>
> >> -netdev type=vhost-vdpa,vhostdev=$VHOST_VDPA_0,id=vhost-vdpa0,
> >> queues=8,x-svq=on
> >>
> >> -device virtio-net-pci,netdev=vhost-vdpa0,id=vdpa0,bootindex=-1,
> >> romfile=,page-per-vq=on,mac=$VF1_MAC,ctrl_vq=on,mq=on,
> >> ctrl_vlan=off,vectors=18,host_mtu=9000,
> >> disable-legacy=on,disable-modern=off
> >>
> >> ---
> >>
> >> Configuration time: ~31s
> >> Dataplane enable time: ~0.14ms
> >>
> >
> > I was vague, but yes, that's representative enough! It would be more
> > accurate if the configuration time ends by the time QEMU enables the
> > first queue of the dataplane though.
> >
> > As Si-Wei mentions, is v->shared->listener_registered == true at the
> > beginning of vhost_vdpa_dev_start?
> >
>
> Ah, I also realized that Qemu I was using for measurements was using a
> version before the listener_registered member was introduced.
>
> I retested with the latest changes in Qemu and set x-svq=off, e.g.:
> guest specs: 128G Mem, SVQ=off, CVQ=on, 8 queue pairs. I ran testing 3
> times for measurements.
>
> v->shared->listener_registered == false at the beginning of
> vhost_vdpa_dev_start().
>
Let's move the effect of the mem pinning out of the downtime by
registering the listener before the migration. Can you check why it is
not registered at vhost_vdpa_set_owner?
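Something like the sketch below is what I'd expect to happen at
set_owner time, reusing the v->shared->listener / listener_registered
bookkeeping mentioned above; the surrounding function is simplified and
the vhost_vdpa_call() helper name is the one I remember from the
backend, so take it as an assumption:

/* Sketch: register the memory listener (and hence trigger the pinning)
 * when the owner is set, long before the switchover, instead of from
 * vhost_vdpa_dev_start(). Error handling trimmed for brevity. */
#include "qemu/osdep.h"
#include "exec/address-spaces.h"   /* address_space_memory */
#include "hw/virtio/vhost-vdpa.h"

static int vhost_vdpa_set_owner_early_pin(struct vhost_dev *dev)
{
    struct vhost_vdpa *v = dev->opaque;
    int r = vhost_vdpa_call(dev, VHOST_SET_OWNER, NULL);

    if (r) {
        return r;
    }

    if (!v->shared->listener_registered) {
        memory_listener_register(&v->shared->listener,
                                 &address_space_memory);
        v->shared->listener_registered = true;
    }

    return 0;
}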
> ---
>
> Configuration time: Time from first entry into vhost_vdpa_dev_start() to
> right after Qemu enables the first VQ.
> - 26.947s, 26.606s, 27.326s
>
> Enable dataplane: Time from right after first VQ is enabled to right
> after the last VQ is enabled.
> - 0.081ms, 0.081ms, 0.079ms
>
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC 5/6] virtio,virtio-net: skip consistency check in virtio_load for iterative migration
2025-08-18 16:21 ` Peter Xu
@ 2025-08-19 7:20 ` Eugenio Perez Martin
0 siblings, 0 replies; 66+ messages in thread
From: Eugenio Perez Martin @ 2025-08-19 7:20 UTC (permalink / raw)
To: Peter Xu
Cc: Jonah Palmer, qemu-devel, farosas, eblake, armbru, jasowang, mst,
si-wei.liu, boris.ostrovsky, Dragos Tatulea DE
On Mon, Aug 18, 2025 at 6:21 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Mon, Aug 18, 2025 at 10:46:00AM -0400, Jonah Palmer wrote:
> >
> >
> > On 8/18/25 2:51 AM, Eugenio Perez Martin wrote:
> > > On Fri, Aug 15, 2025 at 4:50 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
> > > >
> > > >
> > > >
> > > > On 8/14/25 5:28 AM, Eugenio Perez Martin wrote:
> > > > > On Wed, Aug 13, 2025 at 4:06 PM Peter Xu <peterx@redhat.com> wrote:
> > > > > >
> > > > > > On Wed, Aug 13, 2025 at 11:25:00AM +0200, Eugenio Perez Martin wrote:
> > > > > > > On Mon, Aug 11, 2025 at 11:56 PM Peter Xu <peterx@redhat.com> wrote:
> > > > > > > >
> > > > > > > > On Mon, Aug 11, 2025 at 05:26:05PM -0400, Jonah Palmer wrote:
> > > > > > > > > This effort was started to reduce the guest visible downtime by
> > > > > > > > > virtio-net/vhost-net/vhost-vDPA during live migration, especially
> > > > > > > > > vhost-vDPA.
> > > > > > > > >
> > > > > > > > > The downtime contributed by vhost-vDPA, for example, is not from having to
> > > > > > > > > migrate a lot of state but rather expensive backend control-plane latency
> > > > > > > > > like CVQ configurations (e.g. MQ queue pairs, RSS, MAC/VLAN filters, offload
> > > > > > > > > settings, MTU, etc.). Doing this requires kernel/HW NIC operations which
> > > > > > > > > dominates its downtime.
> > > > > > > > >
> > > > > > > > > In other words, by migrating the state of virtio-net early (before the
> > > > > > > > > stop-and-copy phase), we can also start staging backend configurations,
> > > > > > > > > which is the main contributor of downtime when migrating a vhost-vDPA
> > > > > > > > > device.
> > > > > > > > >
> > > > > > > > > I apologize if this series gives the impression that we're migrating a lot
> > > > > > > > > of data here. It's more along the lines of moving control-plane latency out
> > > > > > > > > of the stop-and-copy phase.
> > > > > > > >
> > > > > > > > I see, thanks.
> > > > > > > >
> > > > > > > > Please add these into the cover letter of the next post. IMHO it's
> > > > > > > > extremely important information to explain the real goal of this work. I
> > > > > > > > bet it is not expected for most people when reading the current cover
> > > > > > > > letter.
> > > > > > > >
> > > > > > > > Then it could have nothing to do with iterative phase, am I right?
> > > > > > > >
> > > > > > > > What are the data needed for the dest QEMU to start staging backend
> > > > > > > > configurations to the HWs underneath? Does dest QEMU already have them in
> > > > > > > > the cmdlines?
> > > > > > > >
> > > > > > > > Asking this because I want to know whether it can be done completely
> > > > > > > > without src QEMU at all, e.g. when dest QEMU starts.
> > > > > > > >
> > > > > > > > If src QEMU's data is still needed, please also first consider providing
> > > > > > > > such facility using an "early VMSD" if it is ever possible: feel free to
> > > > > > > > refer to commit 3b95a71b22827d26178.
> > > > > > > >
> > > > > > >
> > > > > > > While it works for this series, it does not allow to resend the state
> > > > > > > when the src device changes. For example, if the number of virtqueues
> > > > > > > is modified.
> > > > > >
> > > > > > Some explanation on "how sync number of vqueues helps downtime" would help.
> > > > > > Not "it might preheat things", but exactly why, and how that differs when
> > > > > > it's pure software, and when hardware will be involved.
> > > > > >
> > > > >
> > > > > By nvidia engineers to configure vqs (number, size, RSS, etc) takes
> > > > > about ~200ms:
> > > > > https://lore.kernel.org/qemu-devel/6c8ebb97-d546-3f1c-4cdd-54e23a566f61@nvidia.com/T/
> > > > >
> > > > > Adding Dragos here in case he can provide more details. Maybe the
> > > > > numbers have changed though.
> > > > >
> > > > > And I guess the difference with pure SW will always come down to PCI
> > > > > communications, which assume it is slower than configuring the host SW
> > > > > device in RAM or even CPU cache. But I admin that proper profiling is
> > > > > needed before making those claims.
> > > > >
> > > > > Jonah, can you print the time it takes to configure the vDPA device
> > > > > with traces vs the time it takes to enable the dataplane of the
> > > > > device? So we can get an idea of how much time we save with this.
> > > > >
> > > >
> > > > Let me know if this isn't what you're looking for.
> > > >
> > > > I'm assuming by "configuration time" you mean:
> > > > - Time from device startup (entry to vhost_vdpa_dev_start()) to right
> > > > before we start enabling the vrings (e.g.
> > > > VHOST_VDPA_SET_VRING_ENABLE in vhost_vdpa_net_cvq_load()).
> > > >
> > > > And by "time taken to enable the dataplane" I'm assuming you mean:
> > > > - Time right before we start enabling the vrings (see above) to right
> > > > after we enable the last vring (at the end of
> > > > vhost_vdpa_net_cvq_load())
> > > >
> > > > Guest specs: 128G Mem, SVQ=on, CVQ=on, 8 queue pairs:
> > > >
> > > > -netdev type=vhost-vdpa,vhostdev=$VHOST_VDPA_0,id=vhost-vdpa0,
> > > > queues=8,x-svq=on
> > > >
> > > > -device virtio-net-pci,netdev=vhost-vdpa0,id=vdpa0,bootindex=-1,
> > > > romfile=,page-per-vq=on,mac=$VF1_MAC,ctrl_vq=on,mq=on,
> > > > ctrl_vlan=off,vectors=18,host_mtu=9000,
> > > > disable-legacy=on,disable-modern=off
> > > >
> > > > ---
> > > >
> > > > Configuration time: ~31s
> > > > Dataplane enable time: ~0.14ms
> > > >
> > >
> > > I was vague, but yes, that's representative enough! It would be more
> > > accurate if the configuration time ends by the time QEMU enables the
> > > first queue of the dataplane though.
> > >
> > > As Si-Wei mentions, is v->shared->listener_registered == true at the
> > > beginning of vhost_vdpa_dev_start?
> > >
> >
> > Ah, I also realized that Qemu I was using for measurements was using a
> > version before the listener_registered member was introduced.
> >
> > I retested with the latest changes in Qemu and set x-svq=off, e.g.: guest
> > specs: 128G Mem, SVQ=off, CVQ=on, 8 queue pairs. I ran testing 3 times for
> > measurements.
> >
> > v->shared->listener_registered == false at the beginning of
> > vhost_vdpa_dev_start().
> >
> > ---
> >
> > Configuration time: Time from first entry into vhost_vdpa_dev_start() to
> > right after Qemu enables the first VQ.
> > - 26.947s, 26.606s, 27.326s
>
> It's surprising to know it takes 20+ seconds for one device to load.
>
> Sorry I'm not familiar with CVQ, please bare with me on my ignorance: how
> much CVQ=on contributes to this? Is page pinning involved here? Is 128GB
> using small pages only?
>
CVQ=on is enabled just so we can enable multiqueue, as the HW device
configuration time seems ~linear with it.
> It looks to me there can still be many things that vDPA will face similar
> challenges that VFIO already had. For example, there's current work
> optimizing pinning for VFIO here:
>
> https://lore.kernel.org/all/20250814064714.56485-1-lizhe.67@bytedance.com/
>
> For the long term, not sure if (for either VFIO or vDPA, or similar devices
> that needs guest pinning) it would make more sense to start using 1G huge
> pages just for the sake of fast pinning.
>
> PFNMAP in VFIO already works with 1G pfnmaps with commit eb996eec783c.
> Logically if we could use 1G pages (e.g. on x86_64) for guest, then pinning
> / unpinning can also be easily batched, and DMA pinning should be much
> faster. The same logic may also apply to vDPA if it works the similar way.
>
> The work above was still generic, but I mentioned the idea of optimizing
> for 1G huge pages here:
>
> https://lore.kernel.org/all/aC3z_gUxJbY1_JP7@x1.local/#t
>
> Above is just FYI.. definitely not an request to work on that. So if we
> can better split the issue into smaller but multiple scope of works it
> would be nicer.
I agree. QEMU master is already able to do the memory pinning before
the downtime, so let's profile that way.
> The "iterable migratable virtio-net" might just hide too
> many things under the hood.
>
> >
> > Enable dataplane: Time from right after first VQ is enabled to right after
> > the last VQ is enabled.
> > - 0.081ms, 0.081ms, 0.079ms
> >
>
> The other thing that might worth mention.. from migration perspective, VFIO
> used to introduce one feature called switchover-ack:
>
> # @switchover-ack: If enabled, migration will not stop the source VM
> # and complete the migration until an ACK is received from the
> # destination that it's OK to do so. Exactly when this ACK is
> # sent depends on the migrated devices that use this feature. For
> # example, a device can use it to make sure some of its data is
> # sent and loaded in the destination before doing switchover.
> # This can reduce downtime if devices that support this capability
> # are present. 'return-path' capability must be enabled to use
> # it. (since 8.1)
>
> If above 20+ seconds are not avoidable, not sure if virtio-net would like
> to opt-in in this feature too, so that switchover won't happen too soon
> during an pre-mature preheat, so that won't be accounted into downtime.
>
> Again, just FYI. I'm not sure if it's applicable.
>
Yes it is; my first versions used it :). As you said, maybe we need to
use it here too, so it is worth not missing it!
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC 5/6] virtio,virtio-net: skip consistency check in virtio_load for iterative migration
2025-08-19 7:10 ` Eugenio Perez Martin
@ 2025-08-19 15:10 ` Jonah Palmer
2025-08-20 7:59 ` Eugenio Perez Martin
0 siblings, 1 reply; 66+ messages in thread
From: Jonah Palmer @ 2025-08-19 15:10 UTC (permalink / raw)
To: Eugenio Perez Martin
Cc: Peter Xu, qemu-devel, farosas, eblake, armbru, jasowang, mst,
si-wei.liu, boris.ostrovsky, Dragos Tatulea DE
On 8/19/25 3:10 AM, Eugenio Perez Martin wrote:
> On Mon, Aug 18, 2025 at 4:46 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
>>
>>
>>
>> On 8/18/25 2:51 AM, Eugenio Perez Martin wrote:
>>> On Fri, Aug 15, 2025 at 4:50 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
>>>>
>>>>
>>>>
>>>> On 8/14/25 5:28 AM, Eugenio Perez Martin wrote:
>>>>> On Wed, Aug 13, 2025 at 4:06 PM Peter Xu <peterx@redhat.com> wrote:
>>>>>>
>>>>>> On Wed, Aug 13, 2025 at 11:25:00AM +0200, Eugenio Perez Martin wrote:
>>>>>>> On Mon, Aug 11, 2025 at 11:56 PM Peter Xu <peterx@redhat.com> wrote:
>>>>>>>>
>>>>>>>> On Mon, Aug 11, 2025 at 05:26:05PM -0400, Jonah Palmer wrote:
>>>>>>>>> This effort was started to reduce the guest visible downtime by
>>>>>>>>> virtio-net/vhost-net/vhost-vDPA during live migration, especially
>>>>>>>>> vhost-vDPA.
>>>>>>>>>
>>>>>>>>> The downtime contributed by vhost-vDPA, for example, is not from having to
>>>>>>>>> migrate a lot of state but rather expensive backend control-plane latency
>>>>>>>>> like CVQ configurations (e.g. MQ queue pairs, RSS, MAC/VLAN filters, offload
>>>>>>>>> settings, MTU, etc.). Doing this requires kernel/HW NIC operations which
>>>>>>>>> dominates its downtime.
>>>>>>>>>
>>>>>>>>> In other words, by migrating the state of virtio-net early (before the
>>>>>>>>> stop-and-copy phase), we can also start staging backend configurations,
>>>>>>>>> which is the main contributor of downtime when migrating a vhost-vDPA
>>>>>>>>> device.
>>>>>>>>>
>>>>>>>>> I apologize if this series gives the impression that we're migrating a lot
>>>>>>>>> of data here. It's more along the lines of moving control-plane latency out
>>>>>>>>> of the stop-and-copy phase.
>>>>>>>>
>>>>>>>> I see, thanks.
>>>>>>>>
>>>>>>>> Please add these into the cover letter of the next post. IMHO it's
>>>>>>>> extremely important information to explain the real goal of this work. I
>>>>>>>> bet it is not expected for most people when reading the current cover
>>>>>>>> letter.
>>>>>>>>
>>>>>>>> Then it could have nothing to do with iterative phase, am I right?
>>>>>>>>
>>>>>>>> What are the data needed for the dest QEMU to start staging backend
>>>>>>>> configurations to the HWs underneath? Does dest QEMU already have them in
>>>>>>>> the cmdlines?
>>>>>>>>
>>>>>>>> Asking this because I want to know whether it can be done completely
>>>>>>>> without src QEMU at all, e.g. when dest QEMU starts.
>>>>>>>>
>>>>>>>> If src QEMU's data is still needed, please also first consider providing
>>>>>>>> such facility using an "early VMSD" if it is ever possible: feel free to
>>>>>>>> refer to commit 3b95a71b22827d26178.
>>>>>>>>
>>>>>>>
>>>>>>> While it works for this series, it does not allow to resend the state
>>>>>>> when the src device changes. For example, if the number of virtqueues
>>>>>>> is modified.
>>>>>>
>>>>>> Some explanation on "how sync number of vqueues helps downtime" would help.
>>>>>> Not "it might preheat things", but exactly why, and how that differs when
>>>>>> it's pure software, and when hardware will be involved.
>>>>>>
>>>>>
>>>>> By nvidia engineers to configure vqs (number, size, RSS, etc) takes
>>>>> about ~200ms:
>>>>> https://urldefense.com/v3/__https://lore.kernel.org/qemu-devel/6c8ebb97-d546-3f1c-4cdd-54e23a566f61@nvidia.com/T/__;!!ACWV5N9M2RV99hQ!OQdf7sGaBlbXhcFHX7AC7HgYxvFljgwWlIgJCvMgWwFvPqMrAMbWqf0862zV5shIjaUvlrk54fLTK6uo2pA$
>>>>>
>>>>> Adding Dragos here in case he can provide more details. Maybe the
>>>>> numbers have changed though.
>>>>>
>>>>> And I guess the difference with pure SW will always come down to PCI
>>>>> communications, which assume it is slower than configuring the host SW
>>>>> device in RAM or even CPU cache. But I admin that proper profiling is
>>>>> needed before making those claims.
>>>>>
>>>>> Jonah, can you print the time it takes to configure the vDPA device
>>>>> with traces vs the time it takes to enable the dataplane of the
>>>>> device? So we can get an idea of how much time we save with this.
>>>>>
>>>>
>>>> Let me know if this isn't what you're looking for.
>>>>
>>>> I'm assuming by "configuration time" you mean:
>>>> - Time from device startup (entry to vhost_vdpa_dev_start()) to right
>>>> before we start enabling the vrings (e.g.
>>>> VHOST_VDPA_SET_VRING_ENABLE in vhost_vdpa_net_cvq_load()).
>>>>
>>>> And by "time taken to enable the dataplane" I'm assuming you mean:
>>>> - Time right before we start enabling the vrings (see above) to right
>>>> after we enable the last vring (at the end of
>>>> vhost_vdpa_net_cvq_load())
>>>>
>>>> Guest specs: 128G Mem, SVQ=on, CVQ=on, 8 queue pairs:
>>>>
>>>> -netdev type=vhost-vdpa,vhostdev=$VHOST_VDPA_0,id=vhost-vdpa0,
>>>> queues=8,x-svq=on
>>>>
>>>> -device virtio-net-pci,netdev=vhost-vdpa0,id=vdpa0,bootindex=-1,
>>>> romfile=,page-per-vq=on,mac=$VF1_MAC,ctrl_vq=on,mq=on,
>>>> ctrl_vlan=off,vectors=18,host_mtu=9000,
>>>> disable-legacy=on,disable-modern=off
>>>>
>>>> ---
>>>>
>>>> Configuration time: ~31s
>>>> Dataplane enable time: ~0.14ms
>>>>
>>>
>>> I was vague, but yes, that's representative enough! It would be more
>>> accurate if the configuration time ends by the time QEMU enables the
>>> first queue of the dataplane though.
>>>
>>> As Si-Wei mentions, is v->shared->listener_registered == true at the
>>> beginning of vhost_vdpa_dev_start?
>>>
>>
>> Ah, I also realized that Qemu I was using for measurements was using a
>> version before the listener_registered member was introduced.
>>
>> I retested with the latest changes in Qemu and set x-svq=off, e.g.:
>> guest specs: 128G Mem, SVQ=off, CVQ=on, 8 queue pairs. I ran testing 3
>> times for measurements.
>>
>> v->shared->listener_registered == false at the beginning of
>> vhost_vdpa_dev_start().
>>
>
> Let's move out the effect of the mem pinning from the downtime by
> registering the listener before the migration. Can you check why is it
> not registered at vhost_vdpa_set_owner?
>
Sorry I was profiling improperly. The listener is registered at
vhost_vdpa_set_owner initially and v->shared->listener_registered is set
to true, but once we reach the first vhost_vdpa_dev_start call, it shows
as false and is re-registered later in the function.
Should we always expect listener_registered == true at every
vhost_vdpa_dev_start call during startup? This is what I traced during
startup of a single guest (no migration). Tracepoint is right at the
start of the vhost_vdpa_dev_start function:
vhost_vdpa_set_owner() - register memory listener
vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
...
* VQs are now being enabled *
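(The lines above come from an ad-hoc probe of roughly the following shape;
this is illustrative only, with a hypothetical helper name, and a real patch
would use a proper trace event instead:)

/* Illustrative probe behind the output above, not an actual tracepoint:
 * called at the top of vhost_vdpa_dev_start(), where dev->opaque is the
 * struct vhost_vdpa for the vDPA backend. */
static void dump_dev_start_state(struct vhost_vdpa *v, bool started)
{
    fprintf(stderr,
            "vhost_vdpa_dev_start() - v->shared->listener_registered = %d, "
            "started = %d\n",
            v->shared->listener_registered, started);
}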
I'm also seeing that when the guest is being shut down,
dev->vhost_ops->vhost_get_vring_base() is failing in
do_vhost_virtqueue_stop():
...
[ 114.718429] systemd-shutdown[1]: Syncing filesystems and block devices.
[ 114.719255] systemd-shutdown[1]: Powering off.
[ 114.719916] sd 0:0:0:0: [sda] Synchronizing SCSI cache
[ 114.724826] ACPI: PM: Preparing to enter system sleep state S5
[ 114.725593] reboot: Power down
vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
qemu-system-x86_64: vhost VQ 2 ring restore failed: -1: Operation not
permitted (1)
qemu-system-x86_64: vhost VQ 3 ring restore failed: -1: Operation not
permitted (1)
vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
qemu-system-x86_64: vhost VQ 4 ring restore failed: -1: Operation not
permitted (1)
qemu-system-x86_64: vhost VQ 5 ring restore failed: -1: Operation not
permitted (1)
vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
qemu-system-x86_64: vhost VQ 6 ring restore failed: -1: Operation not
permitted (1)
qemu-system-x86_64: vhost VQ 7 ring restore failed: -1: Operation not
permitted (1)
vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
qemu-system-x86_64: vhost VQ 8 ring restore failed: -1: Operation not
permitted (1)
qemu-system-x86_64: vhost VQ 9 ring restore failed: -1: Operation not
permitted (1)
vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
qemu-system-x86_64: vhost VQ 10 ring restore failed: -1: Operation not
permitted (1)
qemu-system-x86_64: vhost VQ 11 ring restore failed: -1: Operation not
permitted (1)
vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
qemu-system-x86_64: vhost VQ 12 ring restore failed: -1: Operation not
permitted (1)
qemu-system-x86_64: vhost VQ 13 ring restore failed: -1: Operation not
permitted (1)
vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
qemu-system-x86_64: vhost VQ 14 ring restore failed: -1: Operation not
permitted (1)
qemu-system-x86_64: vhost VQ 15 ring restore failed: -1: Operation not
permitted (1)
vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
However, when x-svq=on, I don't see these errors on shutdown.
>> ---
>>
>> Configuration time: Time from first entry into vhost_vdpa_dev_start() to
>> right after Qemu enables the first VQ.
>> - 26.947s, 26.606s, 27.326s
>>
>> Enable dataplane: Time from right after first VQ is enabled to right
>> after the last VQ is enabled.
>> - 0.081ms, 0.081ms, 0.079ms
>>
>
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC 5/6] virtio,virtio-net: skip consistency check in virtio_load for iterative migration
2025-08-19 15:10 ` Jonah Palmer
@ 2025-08-20 7:59 ` Eugenio Perez Martin
2025-08-25 12:16 ` Jonah Palmer
2025-08-27 16:55 ` Jonah Palmer
0 siblings, 2 replies; 66+ messages in thread
From: Eugenio Perez Martin @ 2025-08-20 7:59 UTC (permalink / raw)
To: Jonah Palmer
Cc: Peter Xu, qemu-devel, farosas, eblake, armbru, jasowang, mst,
si-wei.liu, boris.ostrovsky, Dragos Tatulea DE
On Tue, Aug 19, 2025 at 5:11 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
>
>
>
> On 8/19/25 3:10 AM, Eugenio Perez Martin wrote:
> > On Mon, Aug 18, 2025 at 4:46 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
> >>
> >>
> >>
> >> On 8/18/25 2:51 AM, Eugenio Perez Martin wrote:
> >>> On Fri, Aug 15, 2025 at 4:50 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 8/14/25 5:28 AM, Eugenio Perez Martin wrote:
> >>>>> On Wed, Aug 13, 2025 at 4:06 PM Peter Xu <peterx@redhat.com> wrote:
> >>>>>>
> >>>>>> On Wed, Aug 13, 2025 at 11:25:00AM +0200, Eugenio Perez Martin wrote:
> >>>>>>> On Mon, Aug 11, 2025 at 11:56 PM Peter Xu <peterx@redhat.com> wrote:
> >>>>>>>>
> >>>>>>>> On Mon, Aug 11, 2025 at 05:26:05PM -0400, Jonah Palmer wrote:
> >>>>>>>>> This effort was started to reduce the guest visible downtime by
> >>>>>>>>> virtio-net/vhost-net/vhost-vDPA during live migration, especially
> >>>>>>>>> vhost-vDPA.
> >>>>>>>>>
> >>>>>>>>> The downtime contributed by vhost-vDPA, for example, is not from having to
> >>>>>>>>> migrate a lot of state but rather expensive backend control-plane latency
> >>>>>>>>> like CVQ configurations (e.g. MQ queue pairs, RSS, MAC/VLAN filters, offload
> >>>>>>>>> settings, MTU, etc.). Doing this requires kernel/HW NIC operations which
> >>>>>>>>> dominates its downtime.
> >>>>>>>>>
> >>>>>>>>> In other words, by migrating the state of virtio-net early (before the
> >>>>>>>>> stop-and-copy phase), we can also start staging backend configurations,
> >>>>>>>>> which is the main contributor of downtime when migrating a vhost-vDPA
> >>>>>>>>> device.
> >>>>>>>>>
> >>>>>>>>> I apologize if this series gives the impression that we're migrating a lot
> >>>>>>>>> of data here. It's more along the lines of moving control-plane latency out
> >>>>>>>>> of the stop-and-copy phase.
> >>>>>>>>
> >>>>>>>> I see, thanks.
> >>>>>>>>
> >>>>>>>> Please add these into the cover letter of the next post. IMHO it's
> >>>>>>>> extremely important information to explain the real goal of this work. I
> >>>>>>>> bet it is not expected for most people when reading the current cover
> >>>>>>>> letter.
> >>>>>>>>
> >>>>>>>> Then it could have nothing to do with iterative phase, am I right?
> >>>>>>>>
> >>>>>>>> What are the data needed for the dest QEMU to start staging backend
> >>>>>>>> configurations to the HWs underneath? Does dest QEMU already have them in
> >>>>>>>> the cmdlines?
> >>>>>>>>
> >>>>>>>> Asking this because I want to know whether it can be done completely
> >>>>>>>> without src QEMU at all, e.g. when dest QEMU starts.
> >>>>>>>>
> >>>>>>>> If src QEMU's data is still needed, please also first consider providing
> >>>>>>>> such facility using an "early VMSD" if it is ever possible: feel free to
> >>>>>>>> refer to commit 3b95a71b22827d26178.
> >>>>>>>>
> >>>>>>>
> >>>>>>> While it works for this series, it does not allow to resend the state
> >>>>>>> when the src device changes. For example, if the number of virtqueues
> >>>>>>> is modified.
> >>>>>>
> >>>>>> Some explanation on "how sync number of vqueues helps downtime" would help.
> >>>>>> Not "it might preheat things", but exactly why, and how that differs when
> >>>>>> it's pure software, and when hardware will be involved.
> >>>>>>
> >>>>>
> >>>>> By nvidia engineers to configure vqs (number, size, RSS, etc) takes
> >>>>> about ~200ms:
> >>>>> https://urldefense.com/v3/__https://lore.kernel.org/qemu-devel/6c8ebb97-d546-3f1c-4cdd-54e23a566f61@nvidia.com/T/__;!!ACWV5N9M2RV99hQ!OQdf7sGaBlbXhcFHX7AC7HgYxvFljgwWlIgJCvMgWwFvPqMrAMbWqf0862zV5shIjaUvlrk54fLTK6uo2pA$
> >>>>>
> >>>>> Adding Dragos here in case he can provide more details. Maybe the
> >>>>> numbers have changed though.
> >>>>>
> >>>>> And I guess the difference with pure SW will always come down to PCI
> >>>>> communications, which assume it is slower than configuring the host SW
> >>>>> device in RAM or even CPU cache. But I admin that proper profiling is
> >>>>> needed before making those claims.
> >>>>>
> >>>>> Jonah, can you print the time it takes to configure the vDPA device
> >>>>> with traces vs the time it takes to enable the dataplane of the
> >>>>> device? So we can get an idea of how much time we save with this.
> >>>>>
> >>>>
> >>>> Let me know if this isn't what you're looking for.
> >>>>
> >>>> I'm assuming by "configuration time" you mean:
> >>>> - Time from device startup (entry to vhost_vdpa_dev_start()) to right
> >>>> before we start enabling the vrings (e.g.
> >>>> VHOST_VDPA_SET_VRING_ENABLE in vhost_vdpa_net_cvq_load()).
> >>>>
> >>>> And by "time taken to enable the dataplane" I'm assuming you mean:
> >>>> - Time right before we start enabling the vrings (see above) to right
> >>>> after we enable the last vring (at the end of
> >>>> vhost_vdpa_net_cvq_load())
> >>>>
> >>>> Guest specs: 128G Mem, SVQ=on, CVQ=on, 8 queue pairs:
> >>>>
> >>>> -netdev type=vhost-vdpa,vhostdev=$VHOST_VDPA_0,id=vhost-vdpa0,
> >>>> queues=8,x-svq=on
> >>>>
> >>>> -device virtio-net-pci,netdev=vhost-vdpa0,id=vdpa0,bootindex=-1,
> >>>> romfile=,page-per-vq=on,mac=$VF1_MAC,ctrl_vq=on,mq=on,
> >>>> ctrl_vlan=off,vectors=18,host_mtu=9000,
> >>>> disable-legacy=on,disable-modern=off
> >>>>
> >>>> ---
> >>>>
> >>>> Configuration time: ~31s
> >>>> Dataplane enable time: ~0.14ms
> >>>>
> >>>
> >>> I was vague, but yes, that's representative enough! It would be more
> >>> accurate if the configuration time ends by the time QEMU enables the
> >>> first queue of the dataplane though.
> >>>
> >>> As Si-Wei mentions, is v->shared->listener_registered == true at the
> >>> beginning of vhost_vdpa_dev_start?
> >>>
> >>
> >> Ah, I also realized that Qemu I was using for measurements was using a
> >> version before the listener_registered member was introduced.
> >>
> >> I retested with the latest changes in Qemu and set x-svq=off, e.g.:
> >> guest specs: 128G Mem, SVQ=off, CVQ=on, 8 queue pairs. I ran testing 3
> >> times for measurements.
> >>
> >> v->shared->listener_registered == false at the beginning of
> >> vhost_vdpa_dev_start().
> >>
> >
> > Let's move out the effect of the mem pinning from the downtime by
> > registering the listener before the migration. Can you check why is it
> > not registered at vhost_vdpa_set_owner?
> >
>
> Sorry I was profiling improperly. The listener is registered at
> vhost_vdpa_set_owner initially and v->shared->listener_registered is set
> to true, but once we reach the first vhost_vdpa_dev_start call, it shows
> as false and is re-registered later in the function.
>
> Should we always expect listener_registered == true at every
> vhost_vdpa_dev_start call during startup?
Yes, that leaves all the memory pinning time out of the downtime.
> This is what I traced during
> startup of a single guest (no migration).
We can trace the destination's QEMU to be more accurate, but probably
it makes no difference.
> Tracepoint is right at the
> start of the vhost_vdpa_dev_start function:
>
> vhost_vdpa_set_owner() - register memory listener
> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
This is surprising. Can you trace how listener_registered goes to 0 again?
> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> ...
> * VQs are now being enabled *
>
> I'm also seeing that when the guest is being shutdown,
> dev->vhost_ops->vhost_get_vring_base() is failing in
> do_vhost_virtqueue_stop():
>
> ...
> [ 114.718429] systemd-shutdown[1]: Syncing filesystems and block devices.
> [ 114.719255] systemd-shutdown[1]: Powering off.
> [ 114.719916] sd 0:0:0:0: [sda] Synchronizing SCSI cache
> [ 114.724826] ACPI: PM: Preparing to enter system sleep state S5
> [ 114.725593] reboot: Power down
> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
> qemu-system-x86_64: vhost VQ 2 ring restore failed: -1: Operation not
> permitted (1)
> qemu-system-x86_64: vhost VQ 3 ring restore failed: -1: Operation not
> permitted (1)
> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
> qemu-system-x86_64: vhost VQ 4 ring restore failed: -1: Operation not
> permitted (1)
> qemu-system-x86_64: vhost VQ 5 ring restore failed: -1: Operation not
> permitted (1)
> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
> qemu-system-x86_64: vhost VQ 6 ring restore failed: -1: Operation not
> permitted (1)
> qemu-system-x86_64: vhost VQ 7 ring restore failed: -1: Operation not
> permitted (1)
> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
> qemu-system-x86_64: vhost VQ 8 ring restore failed: -1: Operation not
> permitted (1)
> qemu-system-x86_64: vhost VQ 9 ring restore failed: -1: Operation not
> permitted (1)
> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
> qemu-system-x86_64: vhost VQ 10 ring restore failed: -1: Operation not
> permitted (1)
> qemu-system-x86_64: vhost VQ 11 ring restore failed: -1: Operation not
> permitted (1)
> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
> qemu-system-x86_64: vhost VQ 12 ring restore failed: -1: Operation not
> permitted (1)
> qemu-system-x86_64: vhost VQ 13 ring restore failed: -1: Operation not
> permitted (1)
> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
> qemu-system-x86_64: vhost VQ 14 ring restore failed: -1: Operation not
> permitted (1)
> qemu-system-x86_64: vhost VQ 15 ring restore failed: -1: Operation not
> permitted (1)
> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
>
> However when x-svq=on, I don't see these errors on shutdown.
>
SVQ can mask this error as it does not need to forward the ring
restore message to the device. It can just start with 0 and convert
indexes.
Let's focus on listener_registered first :).
> >> ---
> >>
> >> Configuration time: Time from first entry into vhost_vdpa_dev_start() to
> >> right after Qemu enables the first VQ.
> >> - 26.947s, 26.606s, 27.326s
> >>
> >> Enable dataplane: Time from right after first VQ is enabled to right
> >> after the last VQ is enabled.
> >> - 0.081ms, 0.081ms, 0.079ms
> >>
> >
>
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC 5/6] virtio,virtio-net: skip consistency check in virtio_load for iterative migration
2025-08-20 7:59 ` Eugenio Perez Martin
@ 2025-08-25 12:16 ` Jonah Palmer
2025-08-27 16:55 ` Jonah Palmer
1 sibling, 0 replies; 66+ messages in thread
From: Jonah Palmer @ 2025-08-25 12:16 UTC (permalink / raw)
To: Eugenio Perez Martin
Cc: Peter Xu, qemu-devel, farosas, eblake, armbru, jasowang, mst,
si-wei.liu, boris.ostrovsky, Dragos Tatulea DE
On 8/20/25 3:59 AM, Eugenio Perez Martin wrote:
> On Tue, Aug 19, 2025 at 5:11 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
>>
>>
>>
>> On 8/19/25 3:10 AM, Eugenio Perez Martin wrote:
>>> On Mon, Aug 18, 2025 at 4:46 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
>>>>
>>>>
>>>>
>>>> On 8/18/25 2:51 AM, Eugenio Perez Martin wrote:
>>>>> On Fri, Aug 15, 2025 at 4:50 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 8/14/25 5:28 AM, Eugenio Perez Martin wrote:
>>>>>>> On Wed, Aug 13, 2025 at 4:06 PM Peter Xu <peterx@redhat.com> wrote:
>>>>>>>>
>>>>>>>> On Wed, Aug 13, 2025 at 11:25:00AM +0200, Eugenio Perez Martin wrote:
>>>>>>>>> On Mon, Aug 11, 2025 at 11:56 PM Peter Xu <peterx@redhat.com> wrote:
>>>>>>>>>>
>>>>>>>>>> On Mon, Aug 11, 2025 at 05:26:05PM -0400, Jonah Palmer wrote:
>>>>>>>>>>> This effort was started to reduce the guest visible downtime by
>>>>>>>>>>> virtio-net/vhost-net/vhost-vDPA during live migration, especially
>>>>>>>>>>> vhost-vDPA.
>>>>>>>>>>>
>>>>>>>>>>> The downtime contributed by vhost-vDPA, for example, is not from having to
>>>>>>>>>>> migrate a lot of state but rather expensive backend control-plane latency
>>>>>>>>>>> like CVQ configurations (e.g. MQ queue pairs, RSS, MAC/VLAN filters, offload
>>>>>>>>>>> settings, MTU, etc.). Doing this requires kernel/HW NIC operations which
>>>>>>>>>>> dominates its downtime.
>>>>>>>>>>>
>>>>>>>>>>> In other words, by migrating the state of virtio-net early (before the
>>>>>>>>>>> stop-and-copy phase), we can also start staging backend configurations,
>>>>>>>>>>> which is the main contributor of downtime when migrating a vhost-vDPA
>>>>>>>>>>> device.
>>>>>>>>>>>
>>>>>>>>>>> I apologize if this series gives the impression that we're migrating a lot
>>>>>>>>>>> of data here. It's more along the lines of moving control-plane latency out
>>>>>>>>>>> of the stop-and-copy phase.
>>>>>>>>>>
>>>>>>>>>> I see, thanks.
>>>>>>>>>>
>>>>>>>>>> Please add these into the cover letter of the next post. IMHO it's
>>>>>>>>>> extremely important information to explain the real goal of this work. I
>>>>>>>>>> bet it is not expected for most people when reading the current cover
>>>>>>>>>> letter.
>>>>>>>>>>
>>>>>>>>>> Then it could have nothing to do with iterative phase, am I right?
>>>>>>>>>>
>>>>>>>>>> What are the data needed for the dest QEMU to start staging backend
>>>>>>>>>> configurations to the HWs underneath? Does dest QEMU already have them in
>>>>>>>>>> the cmdlines?
>>>>>>>>>>
>>>>>>>>>> Asking this because I want to know whether it can be done completely
>>>>>>>>>> without src QEMU at all, e.g. when dest QEMU starts.
>>>>>>>>>>
>>>>>>>>>> If src QEMU's data is still needed, please also first consider providing
>>>>>>>>>> such facility using an "early VMSD" if it is ever possible: feel free to
>>>>>>>>>> refer to commit 3b95a71b22827d26178.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> While it works for this series, it does not allow to resend the state
>>>>>>>>> when the src device changes. For example, if the number of virtqueues
>>>>>>>>> is modified.
>>>>>>>>
>>>>>>>> Some explanation on "how sync number of vqueues helps downtime" would help.
>>>>>>>> Not "it might preheat things", but exactly why, and how that differs when
>>>>>>>> it's pure software, and when hardware will be involved.
>>>>>>>>
>>>>>>>
>>>>>>> By nvidia engineers to configure vqs (number, size, RSS, etc) takes
>>>>>>> about ~200ms:
>>>>>>> https://urldefense.com/v3/__https://lore.kernel.org/qemu-devel/6c8ebb97-d546-3f1c-4cdd-54e23a566f61@nvidia.com/T/__;!!ACWV5N9M2RV99hQ!OQdf7sGaBlbXhcFHX7AC7HgYxvFljgwWlIgJCvMgWwFvPqMrAMbWqf0862zV5shIjaUvlrk54fLTK6uo2pA$
>>>>>>>
>>>>>>> Adding Dragos here in case he can provide more details. Maybe the
>>>>>>> numbers have changed though.
>>>>>>>
>>>>>>> And I guess the difference with pure SW will always come down to PCI
>>>>>>> communications, which assume it is slower than configuring the host SW
>>>>>>> device in RAM or even CPU cache. But I admin that proper profiling is
>>>>>>> needed before making those claims.
>>>>>>>
>>>>>>> Jonah, can you print the time it takes to configure the vDPA device
>>>>>>> with traces vs the time it takes to enable the dataplane of the
>>>>>>> device? So we can get an idea of how much time we save with this.
>>>>>>>
>>>>>>
>>>>>> Let me know if this isn't what you're looking for.
>>>>>>
>>>>>> I'm assuming by "configuration time" you mean:
>>>>>> - Time from device startup (entry to vhost_vdpa_dev_start()) to right
>>>>>> before we start enabling the vrings (e.g.
>>>>>> VHOST_VDPA_SET_VRING_ENABLE in vhost_vdpa_net_cvq_load()).
>>>>>>
>>>>>> And by "time taken to enable the dataplane" I'm assuming you mean:
>>>>>> - Time right before we start enabling the vrings (see above) to right
>>>>>> after we enable the last vring (at the end of
>>>>>> vhost_vdpa_net_cvq_load())
>>>>>>
>>>>>> Guest specs: 128G Mem, SVQ=on, CVQ=on, 8 queue pairs:
>>>>>>
>>>>>> -netdev type=vhost-vdpa,vhostdev=$VHOST_VDPA_0,id=vhost-vdpa0,
>>>>>> queues=8,x-svq=on
>>>>>>
>>>>>> -device virtio-net-pci,netdev=vhost-vdpa0,id=vdpa0,bootindex=-1,
>>>>>> romfile=,page-per-vq=on,mac=$VF1_MAC,ctrl_vq=on,mq=on,
>>>>>> ctrl_vlan=off,vectors=18,host_mtu=9000,
>>>>>> disable-legacy=on,disable-modern=off
>>>>>>
>>>>>> ---
>>>>>>
>>>>>> Configuration time: ~31s
>>>>>> Dataplane enable time: ~0.14ms
>>>>>>
>>>>>
>>>>> I was vague, but yes, that's representative enough! It would be more
>>>>> accurate if the configuration time ends by the time QEMU enables the
>>>>> first queue of the dataplane though.
>>>>>
>>>>> As Si-Wei mentions, is v->shared->listener_registered == true at the
>>>>> beginning of vhost_vdpa_dev_start?
>>>>>
>>>>
>>>> Ah, I also realized that Qemu I was using for measurements was using a
>>>> version before the listener_registered member was introduced.
>>>>
>>>> I retested with the latest changes in Qemu and set x-svq=off, e.g.:
>>>> guest specs: 128G Mem, SVQ=off, CVQ=on, 8 queue pairs. I ran testing 3
>>>> times for measurements.
>>>>
>>>> v->shared->listener_registered == false at the beginning of
>>>> vhost_vdpa_dev_start().
>>>>
>>>
>>> Let's move out the effect of the mem pinning from the downtime by
>>> registering the listener before the migration. Can you check why is it
>>> not registered at vhost_vdpa_set_owner?
>>>
>>
>> Sorry I was profiling improperly. The listener is registered at
>> vhost_vdpa_set_owner initially and v->shared->listener_registered is set
>> to true, but once we reach the first vhost_vdpa_dev_start call, it shows
>> as false and is re-registered later in the function.
>>
>> Should we always expect listener_registered == true at every
>> vhost_vdpa_dev_start call during startup?
>
> Yes, that leaves all the memory pinning time out of the downtime.
>
>> This is what I traced during
>> startup of a single guest (no migration).
>
> We can trace the destination's QEMU to be more accurate, but probably
> it makes no difference.
>
>> Tracepoint is right at the
>> start of the vhost_vdpa_dev_start function:
>>
>> vhost_vdpa_set_owner() - register memory listener
>> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
>> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
>> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
>
> This is surprising. Can you trace how listener_registered goes to 0 again?
>
When vhost_vdpa_dev_start gets called with started == false,
vhost_vdpa_suspend is called, which calls vhost_vdpa_reset_device. That
is where v->shared->listener_registered gets set to false.
And even by the time of the first vhost_vdpa_dev_start call, there had
already been another device reset after the memory listener was
registered in vhost_vdpa_set_owner.
>> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
>> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
>> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
>> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
>> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
>> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
>> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
>> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
>> ...
>> * VQs are now being enabled *
>>
>> I'm also seeing that when the guest is being shutdown,
>> dev->vhost_ops->vhost_get_vring_base() is failing in
>> do_vhost_virtqueue_stop():
>>
>> ...
>> [ 114.718429] systemd-shutdown[1]: Syncing filesystems and block devices.
>> [ 114.719255] systemd-shutdown[1]: Powering off.
>> [ 114.719916] sd 0:0:0:0: [sda] Synchronizing SCSI cache
>> [ 114.724826] ACPI: PM: Preparing to enter system sleep state S5
>> [ 114.725593] reboot: Power down
>> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
>> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
>> qemu-system-x86_64: vhost VQ 2 ring restore failed: -1: Operation not
>> permitted (1)
>> qemu-system-x86_64: vhost VQ 3 ring restore failed: -1: Operation not
>> permitted (1)
>> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
>> qemu-system-x86_64: vhost VQ 4 ring restore failed: -1: Operation not
>> permitted (1)
>> qemu-system-x86_64: vhost VQ 5 ring restore failed: -1: Operation not
>> permitted (1)
>> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
>> qemu-system-x86_64: vhost VQ 6 ring restore failed: -1: Operation not
>> permitted (1)
>> qemu-system-x86_64: vhost VQ 7 ring restore failed: -1: Operation not
>> permitted (1)
>> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
>> qemu-system-x86_64: vhost VQ 8 ring restore failed: -1: Operation not
>> permitted (1)
>> qemu-system-x86_64: vhost VQ 9 ring restore failed: -1: Operation not
>> permitted (1)
>> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
>> qemu-system-x86_64: vhost VQ 10 ring restore failed: -1: Operation not
>> permitted (1)
>> qemu-system-x86_64: vhost VQ 11 ring restore failed: -1: Operation not
>> permitted (1)
>> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
>> qemu-system-x86_64: vhost VQ 12 ring restore failed: -1: Operation not
>> permitted (1)
>> qemu-system-x86_64: vhost VQ 13 ring restore failed: -1: Operation not
>> permitted (1)
>> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
>> qemu-system-x86_64: vhost VQ 14 ring restore failed: -1: Operation not
>> permitted (1)
>> qemu-system-x86_64: vhost VQ 15 ring restore failed: -1: Operation not
>> permitted (1)
>> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
>>
>> However when x-svq=on, I don't see these errors on shutdown.
>>
>
> SVQ can mask this error as it does not need to forward the ring
> restore message to the device. It can just start with 0 and convert
> indexes.
>
> Let's focus on listened_registered first :).
>
>>>> ---
>>>>
>>>> Configuration time: Time from first entry into vhost_vdpa_dev_start() to
>>>> right after Qemu enables the first VQ.
>>>> - 26.947s, 26.606s, 27.326s
>>>>
>>>> Enable dataplane: Time from right after first VQ is enabled to right
>>>> after the last VQ is enabled.
>>>> - 0.081ms, 0.081ms, 0.079ms
>>>>
>>>
>>
>
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC 1/6] migration: Add virtio-iterative capability
2025-08-11 12:18 ` Jonah Palmer
@ 2025-08-25 12:44 ` Markus Armbruster
2025-08-25 14:57 ` Jonah Palmer
0 siblings, 1 reply; 66+ messages in thread
From: Markus Armbruster @ 2025-08-25 12:44 UTC (permalink / raw)
To: Jonah Palmer
Cc: qemu-devel, peterx, farosas, eblake, jasowang, mst, si-wei.liu,
eperezma, boris.ostrovsky
Please excuse the delay, I was on vacation.
Jonah Palmer <jonah.palmer@oracle.com> writes:
> On 8/8/25 6:48 AM, Markus Armbruster wrote:
>> I apologize for the lateness of my review.
Late again: I was on vacation.
>> Jonah Palmer <jonah.palmer@oracle.com> writes:
>>
>>> Adds a new migration capability 'virtio-iterative' that will allow
>>> virtio devices, where supported, to iteratively migrate configuration
>>> changes that occur during the migration process.
>>
>> Why is that desirable?
>
> To be frank, I wasn't sure if having a migration capability, or even
> have it toggleable at all, would be desirable or not. It appears though
> that this might be better off as a per-device feature set via
> --device virtio-net-pci,iterative-mig=on,..., for example.
See below.
> And by "iteratively migrate configuration changes" I meant more along
> the lines of the device's state as it continues running on the source.
Isn't that what migration does always?
> But perhaps actual configuration changes (e.g. changing the number of
> queue pairs) could also be supported mid-migration like this?
I don't know.
>>> This capability is added to the validated capabilities list to ensure
>>> both the source and destination support it before enabling.
>>
>> What happens when only one side enables it?
>
> The migration stream breaks if only one side enables it.
How does it break? Error message pointing out the misconfiguration?
> This is poor wording on my part, my apologies. I don't think it's even
> possible to know the capabilities between the source & destination.
>
>>> The capability defaults to off to maintain backward compatibility.
>>>
>>> To enable the capability via HMP:
>>> (qemu) migrate_set_capability virtio-iterative on
>>>
>>> To enable the capability via QMP:
>>> {"execute": "migrate-set-capabilities", "arguments": {
>>> "capabilities": [
>>> { "capability": "virtio-iterative", "state": true }
>>> ]
>>> }
>>> }
>>>
>>> Signed-off-by: Jonah Palmer <jonah.palmer@oracle.com>
>>> ---
>>> migration/savevm.c | 1 +
>>> qapi/migration.json | 7 ++++++-
>>> 2 files changed, 7 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/migration/savevm.c b/migration/savevm.c
>>> index bb04a4520d..40a2189866 100644
>>> --- a/migration/savevm.c
>>> +++ b/migration/savevm.c
>>> @@ -279,6 +279,7 @@ static bool should_validate_capability(int capability)
>>> switch (capability) {
>>> case MIGRATION_CAPABILITY_X_IGNORE_SHARED:
>>> case MIGRATION_CAPABILITY_MAPPED_RAM:
>>> + case MIGRATION_CAPABILITY_VIRTIO_ITERATIVE:
>>> return true;
>>> default:
>>> return false;
>>> diff --git a/qapi/migration.json b/qapi/migration.json
>>> index 4963f6ca12..8f042c3ba5 100644
>>> --- a/qapi/migration.json
>>> +++ b/qapi/migration.json
>>> @@ -479,6 +479,11 @@
>>> # each RAM page. Requires a migration URI that supports seeking,
>>> # such as a file. (since 9.0)
>>> #
>>> +# @virtio-iterative: Enable iterative migration for virtio devices, if
>>> +# the device supports it. When enabled, and where supported, virtio
>>> +# devices will track and migrate configuration changes that may
>>> +# occur during the migration process. (Since 10.1)
>>
>> When and why should the user enable this?
>
> Well if all goes according to plan, always (at least for virtio-net).
> This should improve the overall speed of live migration for a virtio-net
> device (and vhost-net/vhost-vdpa).
So the only use for "disabled" would be when migrating to or from an
older version of QEMU that doesn't support this. Fair?
What's the default?
>> What exactly do you mean by "where supported"?
>
> I meant if both source's Qemu and destination's Qemu support it, as well
> as for other virtio devices in the future if they decide to implement
> iterative migration (e.g. a more general "enable iterative migration for
> virtio devices").
>
> But I think for now this is better left as a virtio-net configuration
> rather than as a migration capability (e.g. --device
> virtio-net-pci,iterative-mig=on/off,...)
Makes sense to me (but I'm not a migration expert).
[...]
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC 1/6] migration: Add virtio-iterative capability
2025-08-25 12:44 ` Markus Armbruster
@ 2025-08-25 14:57 ` Jonah Palmer
2025-08-26 6:11 ` Markus Armbruster
0 siblings, 1 reply; 66+ messages in thread
From: Jonah Palmer @ 2025-08-25 14:57 UTC (permalink / raw)
To: Markus Armbruster
Cc: qemu-devel, peterx, farosas, eblake, jasowang, mst, si-wei.liu,
eperezma, boris.ostrovsky
On 8/25/25 8:44 AM, Markus Armbruster wrote:
> Please excuse the delay, I was on vacation.
>
> Jonah Palmer <jonah.palmer@oracle.com> writes:
>
>> On 8/8/25 6:48 AM, Markus Armbruster wrote:
>>> I apologize for the lateness of my review.
>
> Late again: I was on vacation.
>
>>> Jonah Palmer <jonah.palmer@oracle.com> writes:
>>>
>>>> Adds a new migration capability 'virtio-iterative' that will allow
>>>> virtio devices, where supported, to iteratively migrate configuration
>>>> changes that occur during the migration process.
>>>
>>> Why is that desirable?
>>
>> To be frank, I wasn't sure if having a migration capability, or even
>> have it toggleable at all, would be desirable or not. It appears though
>> that this might be better off as a per-device feature set via
>> --device virtio-net-pci,iterative-mig=on,..., for example.
>
> See below.
>
>> And by "iteratively migrate configuration changes" I meant more along
>> the lines of the device's state as it continues running on the source.
>
> Isn't that what migration does always?
>
Essentially yes, but today all of the state is only migrated at the end,
once the source has been paused. So the final correct state is always
sent to the destination.
If we're no longer waiting until the source has been paused and the
initial state is sent early, then we need to make sure that any changes
that happen are still communicated to the destination.
This RFC handles this by just re-sending the entire state again once the
source has been paused. But of course this isn't optimal and I'm looking
into how to better optimize this part.
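To make the shape of the RFC concrete, it boils down to roughly the sketch
below (illustrative only: hook names follow SaveVMHandlers, but exact
signatures differ across QEMU versions and the real patches do more):

/* Rough sketch of the RFC's approach: send the full vmstate once during
 * setup (while the source still runs) and again at stop-and-copy to
 * cover any deltas.  Illustrative only, not the actual patch. */
static int virtio_net_iter_save_setup(QEMUFile *f, void *opaque)
{
    VirtIONet *n = opaque;

    /* early send, during stage 1 */
    return vmstate_save_state(f, &vmstate_virtio_net, n, NULL);
}

static int virtio_net_iter_save_complete(QEMUFile *f, void *opaque)
{
    VirtIONet *n = opaque;

    /* stop-and-copy: re-send to pick up anything that changed since setup */
    return vmstate_save_state(f, &vmstate_virtio_net, n, NULL);
}

static const SaveVMHandlers savevm_virtio_net_iter_handlers = {
    .save_setup                 = virtio_net_iter_save_setup,
    .save_live_complete_precopy = virtio_net_iter_save_complete,
};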
>> But perhaps actual configuration changes (e.g. changing the number of
>> queue pairs) could also be supported mid-migration like this?
>
> I don't know.
>
>>>> This capability is added to the validated capabilities list to ensure
>>>> both the source and destination support it before enabling.
>>>
>>> What happens when only one side enables it?
>>
>> The migration stream breaks if only one side enables it.
>
> How does it break? Error message pointing out the misconfiguration?
>
The destination VM is torn down and the source just reports that
migration failed.
I don't believe the source/destination could be aware of the
misconfiguration. IIUC the destination reads the migration stream and
expects certain pieces of data in a certain order. If new data is added
to the migration stream or the order has changed and the destination
isn't expecting it, then the migration fails. It doesn't know exactly
why, just that it read in data that it wasn't expecting.
>> This is poor wording on my part, my apologies. I don't think it's even
>> possible to know the capabilities between the source & destination.
>>
>>>> The capability defaults to off to maintain backward compatibility.
>>>>
>>>> To enable the capability via HMP:
>>>> (qemu) migrate_set_capability virtio-iterative on
>>>>
>>>> To enable the capability via QMP:
>>>> {"execute": "migrate-set-capabilities", "arguments": {
>>>> "capabilities": [
>>>> { "capability": "virtio-iterative", "state": true }
>>>> ]
>>>> }
>>>> }
>>>>
>>>> Signed-off-by: Jonah Palmer <jonah.palmer@oracle.com>
>>>> ---
>>>> migration/savevm.c | 1 +
>>>> qapi/migration.json | 7 ++++++-
>>>> 2 files changed, 7 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/migration/savevm.c b/migration/savevm.c
>>>> index bb04a4520d..40a2189866 100644
>>>> --- a/migration/savevm.c
>>>> +++ b/migration/savevm.c
>>>> @@ -279,6 +279,7 @@ static bool should_validate_capability(int capability)
>>>> switch (capability) {
>>>> case MIGRATION_CAPABILITY_X_IGNORE_SHARED:
>>>> case MIGRATION_CAPABILITY_MAPPED_RAM:
>>>> + case MIGRATION_CAPABILITY_VIRTIO_ITERATIVE:
>>>> return true;
>>>> default:
>>>> return false;
>>>> diff --git a/qapi/migration.json b/qapi/migration.json
>>>> index 4963f6ca12..8f042c3ba5 100644
>>>> --- a/qapi/migration.json
>>>> +++ b/qapi/migration.json
>>>> @@ -479,6 +479,11 @@
>>>> # each RAM page. Requires a migration URI that supports seeking,
>>>> # such as a file. (since 9.0)
>>>> #
>>>> +# @virtio-iterative: Enable iterative migration for virtio devices, if
>>>> +# the device supports it. When enabled, and where supported, virtio
>>>> +# devices will track and migrate configuration changes that may
>>>> +# occur during the migration process. (Since 10.1)
>>>
>>> When and why should the user enable this?
>>
>> Well if all goes according to plan, always (at least for virtio-net).
>> This should improve the overall speed of live migration for a virtio-net
>> device (and vhost-net/vhost-vdpa).
>
> So the only use for "disabled" would be when migrating to or from an
> older version of QEMU that doesn't support this. Fair?
>
Correct.
> What's the default?
>
Disabled.
>>> What exactly do you mean by "where supported"?
>>
>> I meant if both source's Qemu and destination's Qemu support it, as well
>> as for other virtio devices in the future if they decide to implement
>> iterative migration (e.g. a more general "enable iterative migration for
>> virtio devices").
>>
>> But I think for now this is better left as a virtio-net configuration
>> rather than as a migration capability (e.g. --device
>> virtio-net-pci,iterative-mig=on/off,...)
>
> Makes sense to me (but I'm not a migration expert).
>
> [...]
>
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC 1/6] migration: Add virtio-iterative capability
2025-08-25 14:57 ` Jonah Palmer
@ 2025-08-26 6:11 ` Markus Armbruster
2025-08-26 18:08 ` Jonah Palmer
0 siblings, 1 reply; 66+ messages in thread
From: Markus Armbruster @ 2025-08-26 6:11 UTC (permalink / raw)
To: Jonah Palmer
Cc: qemu-devel, peterx, farosas, eblake, jasowang, mst, si-wei.liu,
eperezma, boris.ostrovsky
Jonah Palmer <jonah.palmer@oracle.com> writes:
> On 8/25/25 8:44 AM, Markus Armbruster wrote:
[...]
>> Jonah Palmer <jonah.palmer@oracle.com> writes:
>>
>>> On 8/8/25 6:48 AM, Markus Armbruster wrote:
[...]
>>>> Jonah Palmer <jonah.palmer@oracle.com> writes:
>>>>> Adds a new migration capability 'virtio-iterative' that will allow
>>>>> virtio devices, where supported, to iteratively migrate configuration
>>>>> changes that occur during the migration process.
>>>>
>>>> Why is that desirable?
>>>
>>> To be frank, I wasn't sure if having a migration capability, or even
>>> have it toggleable at all, would be desirable or not. It appears though
>>> that this might be better off as a per-device feature set via
>>> --device virtio-net-pci,iterative-mig=on,..., for example.
>>
>> See below.
>>
>>> And by "iteratively migrate configuration changes" I meant more along
>>> the lines of the device's state as it continues running on the source.
>>
>> Isn't that what migration does always?
>
> Essentially yes, but today all of the state is only migrated at the end, once the source has been paused. So the final correct state is always sent to the destination.
As far as I understand (and ignoring lots of detail, including post
copy), we have three stages:
1. Source runs, migrate memory pages. Pages that get dirtied after they
are migrated need to be migrated again.
2. Neither source nor destination runs, migrate remaining memory pages
and device state.
3. Destination starts to run.
If the duration of stage 2 (downtime) was of no concern, we'd switch to
it immediately, i.e. without migrating anything in stage 1. This would
minimize I/O.
Of course, we actually care for limiting downtime. We switch to stage 2
when "little enough" is left for stage two to migrate.
> If we're no longer waiting until the source has been paused and the initial state is sent early, then we need to make sure that any changes that happen is still communicated to the destination.
So you're proposing to treat suitable parts of the device state more
like memory pages. Correct?
Cover letter and commit message of PATCH 4 provide the motivation: you
observe a shorter downtime. You speculate this is due to moving "heavy
allocations and page-fault latencies" from stage 2 to stage 1. Correct?
Is there anything that makes virtio-net particularly suitable?
I think this patch's commit message should at least hint at the
motivation at a high level. Details like measurements are best left to
PATCH 4.
> This RFC handles this by just re-sending the entire state again once the source has been paused. But of course this isn't optimal and I'm looking into how to better optimize this part.
How much is the entire state?
>>> But perhaps actual configuration changes (e.g. changing the number of
>>> queue pairs) could also be supported mid-migration like this?
>>
>> I don't know.
>>
>>>>> This capability is added to the validated capabilities list to ensure
>>>>> both the source and destination support it before enabling.
>>>>
>>>> What happens when only one side enables it?
>>>
>>> The migration stream breaks if only one side enables it.
>>
>> How does it break? Error message pointing out the misconfiguration?
>>
>
> The destination VM is torn down and the source just reports that migration failed.
Exact same failure as for other misconfigurations, like missing a device
on the destination?
> I don't believe the source/destination could be aware of the misconfiguration. IIUC the destination reads the migration stream and expects certain pieces of data in a certain order. If new data is added to the migration stream or the order has changed and the destination isn't expecting it, then the migration fails. It doesn't know exactly why, just that it read-in data that it wasn't expecting.
>
>>> This is poor wording on my part, my apologies. I don't think it's even
>>> possible to know the capabilities between the source & destination.
>>>
>>>>> The capability defaults to off to maintain backward compatibility.
>>>>>
>>>>> To enable the capability via HMP:
>>>>> (qemu) migrate_set_capability virtio-iterative on
>>>>>
>>>>> To enable the capability via QMP:
>>>>> {"execute": "migrate-set-capabilities", "arguments": {
>>>>> "capabilities": [
>>>>> { "capability": "virtio-iterative", "state": true }
>>>>> ]
>>>>> }
>>>>> }
>>>>>
>>>>> Signed-off-by: Jonah Palmer <jonah.palmer@oracle.com>
[...]
>>>>> diff --git a/qapi/migration.json b/qapi/migration.json
>>>>> index 4963f6ca12..8f042c3ba5 100644
>>>>> --- a/qapi/migration.json
>>>>> +++ b/qapi/migration.json
>>>>> @@ -479,6 +479,11 @@
>>>>> # each RAM page. Requires a migration URI that supports seeking,
>>>>> # such as a file. (since 9.0)
>>>>> #
>>>>> +# @virtio-iterative: Enable iterative migration for virtio devices, if
>>>>> +# the device supports it. When enabled, and where supported, virtio
>>>>> +# devices will track and migrate configuration changes that may
>>>>> +# occur during the migration process. (Since 10.1)
>>>>
>>>> When and why should the user enable this?
>>>
>>> Well if all goes according to plan, always (at least for virtio-net).
>>> This should improve the overall speed of live migration for a virtio-net
>>> device (and vhost-net/vhost-vdpa).
>>
>> So the only use for "disabled" would be when migrating to or from an
>> older version of QEMU that doesn't support this. Fair?
>
> Correct.
>
>> What's the default?
>
> Disabled.
Awkward for something that should always be enabled. But see below.
Please document defaults in the doc comment.
>>>> What exactly do you mean by "where supported"?
>>>
>>> I meant if both source's Qemu and destination's Qemu support it, as well
>>> as for other virtio devices in the future if they decide to implement
>>> iterative migration (e.g. a more general "enable iterative migration for
>>> virtio devices").
>>>
>>> But I think for now this is better left as a virtio-net configuration
>>> rather than as a migration capability (e.g. --device
>>> virtio-net-pci,iterative-mig=on/off,...)
>>
>> Makes sense to me (but I'm not a migration expert).
A device property's default can depend on the machine type via compat
properties. This is normally used to restrict a guest-visible change to
newer machine types. Here, it's not guest-visible. But it can get you
this:
* Migrate new machine type from new QEMU to new QEMU (old QEMU doesn't
have the machine type): iterative is enabled by default. Good. User
can disable it on both ends to not get the improvement. Enabling it
on just one breaks migration.
All other cases go away with time.
* Migrate old machine type from new QEMU to new QEMU: iterative is
disabled by default, which is sad, but no worse than before. User can
enable it on both ends to get the improvement. Enabling it on just
one breaks migration.
* Migrate old machine type from new QEMU to old QEMU or vice versa:
iterative is off by default. Good. Enabling it on the new one breaks
migration.
* Migrate old machine type from old QEMU to old QEMU: iterative is off
I figure almost all users could simply ignore this configuration knob
then.
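Sketch of the mechanism, with made-up property and array names (the
compat-property machinery itself is real):

/* Hypothetical example only: the new device property defaults to "on",
 * and older machine types turn it back off via a compat property so
 * cross-version migration keeps working. */
GlobalProperty hw_compat_10_0[] = {
    { "virtio-net-pci", "iterative-mig", "off" },
};
const size_t hw_compat_10_0_len = G_N_ELEMENTS(hw_compat_10_0);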
>> [...]
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC 1/6] migration: Add virtio-iterative capability
2025-08-26 6:11 ` Markus Armbruster
@ 2025-08-26 18:08 ` Jonah Palmer
2025-08-27 6:37 ` Markus Armbruster
0 siblings, 1 reply; 66+ messages in thread
From: Jonah Palmer @ 2025-08-26 18:08 UTC (permalink / raw)
To: Markus Armbruster
Cc: qemu-devel, peterx, farosas, eblake, jasowang, mst, si-wei.liu,
eperezma, boris.ostrovsky
On 8/26/25 2:11 AM, Markus Armbruster wrote:
> Jonah Palmer <jonah.palmer@oracle.com> writes:
>
>> On 8/25/25 8:44 AM, Markus Armbruster wrote:
>
> [...]
>
>>> Jonah Palmer <jonah.palmer@oracle.com> writes:
>>>
>>>> On 8/8/25 6:48 AM, Markus Armbruster wrote:
>
> [...]
>
>>>>> Jonah Palmer <jonah.palmer@oracle.com> writes:
>>>>>> Adds a new migration capability 'virtio-iterative' that will allow
>>>>>> virtio devices, where supported, to iteratively migrate configuration
>>>>>> changes that occur during the migration process.
>>>>>
>>>>> Why is that desirable?
>>>>
>>>> To be frank, I wasn't sure if having a migration capability, or even
>>>> have it toggleable at all, would be desirable or not. It appears though
>>>> that this might be better off as a per-device feature set via
>>>> --device virtio-net-pci,iterative-mig=on,..., for example.
>>>
>>> See below.
>>>
>>>> And by "iteratively migrate configuration changes" I meant more along
>>>> the lines of the device's state as it continues running on the source.
>>>
>>> Isn't that what migration does always?
>>
>> Essentially yes, but today all of the state is only migrated at the end, once the source has been paused. So the final correct state is always sent to the destination.
>
> As far as I understand (and ignoring lots of detail, including post
> copy), we have three stages:
>
> 1. Source runs, migrate memory pages. Pages that get dirtied after they
> are migrated need to be migrated again.
>
> 2. Neither source or destination runs, migrate remaining memory pages
> and device state.
>
> 3. Destination starts to run.
>
> If the duration of stage 2 (downtime) was of no concern, we'd switch to
> it immediately, i.e. without migrating anything in stage 1. This would
> minimize I/O.
>
> Of course, we actually care for limiting downtime. We switch to stage 2
> when "little enough" is left for stage two to migrate.
>
>> If we're no longer waiting until the source has been paused and the initial state is sent early, then we need to make sure that any changes that happen is still communicated to the destination.
>
> So you're proposing to treat suitable parts of the device state more
> like memory pages. Correct?
>
Not in the sense of "something got dirtied so let's immediately re-send
that" like we would with RAM. It's more along the lines of "something
got dirtied so let's make sure that gets re-sent at the start of stage 2".
The entire state of a virtio-net device (even with vhost-net /
vhost-vDPA) is <10KB I believe. I don't believe there's much to gain by
"iteratively" re-sending changes for virtio-net. It should be suitable
enough to just re-send whatever changed during stage 1 (after the
initial state was sent) at the start of stage 2.
This is why I'm currently looking into a solution that uses VMSD's
.early_setup flag (that Peter recommended) rather than implementing a
suite of SaveVMHandlers hooks (like this RFC does). We don't need this
iterative capability as much as we need to start migrating the state
earlier (and doing corresponding config/prep work) during stage 1.
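As a rough idea of what that could look like (a sketch only; the field
selection and name are hypothetical, not a finished design):

/* Sketch of an "early" VMSD along the lines of what .early_setup enables
 * (see commit 3b95a71b228 referenced earlier in the thread). */
static const VMStateDescription vmstate_virtio_net_early = {
    .name = "virtio-net-early",
    .version_id = 1,
    .minimum_version_id = 1,
    .early_setup = true,   /* migrated in stage 1, before main device state */
    .fields = (const VMStateField[]) {
        VMSTATE_UINT16(curr_queue_pairs, VirtIONet),
        VMSTATE_UINT16(max_queue_pairs, VirtIONet),
        VMSTATE_END_OF_LIST()
    },
};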
> Cover letter and commit message of PATCH 4 provide the motivation: you
> observe a shorter downtime. You speculate this is due to moving "heavy
> allocations and page-fault latencies" from stage 2 to stage 1. Correct?
>
Correct. But again I'd like to stress that this is just one part in
reducing downtime during stage 2. The biggest reductions will come from
the config/prep work that we're trying to move from stage 2 to stage 1,
especially when vhost-vDPA is involved. And we can only do this early
work once we have the state, hence why we're sending it earlier.
> Is there anything that makes virtio-net particularly suitable?
>
Yes, especially with vhost-vDPA and configuring VQs. See Eugenio's
comment here
https://lore.kernel.org/qemu-devel/CAJaqyWdUutZrAWKy9d=ip+h+y3BnptUrcL8Xj06XfizNxPtfpw@mail.gmail.com/.
> I think this patch's commit message should at least hint at the
> motivation at a high level. Details like measurements are best left to
> PATCH 4.
>
You're right, this was my bad for not framing this RFC and its true
motivations more clearly. I will certainly be more direct and
descriptive in the next RFC for this effort.
>> This RFC handles this by just re-sending the entire state again once the source has been paused. But of course this isn't optimal and I'm looking into how to better optimize this part.
>
> How much is the entire state?
>
I'm not exactly sure how large it is but it should be <10KB even with
vhost-vDPA. It could be slightly larger if we really up the number of
queue pairs and/or have huge MAC/multicast lists.
>>>> But perhaps actual configuration changes (e.g. changing the number of
>>>> queue pairs) could also be supported mid-migration like this?
>>>
>>> I don't know.
>>>
>>>>>> This capability is added to the validated capabilities list to ensure
>>>>>> both the source and destination support it before enabling.
>>>>>
>>>>> What happens when only one side enables it?
>>>>
>>>> The migration stream breaks if only one side enables it.
>>>
>>> How does it break? Error message pointing out the misconfiguration?
>>>
>>
>> The destination VM is torn down and the source just reports that migration failed.
>
> Exact same failure as for other misconfigurations, like missing a device
> on the destination?
>
I hesitate to say "exact" but for example, when missing a device on one
side you might see something like below (I removed a serial device):
qemu-system-x86_64: Unknown ramblock "0000:00:03.0/virtio-net-pci.rom",
cannot accept migration
qemu-system-x86_64: error while loading state for instance 0x0 of device
'ram'
qemu-system-x86_64: load of migration failed: Invalid argument
...
The expected order gets messed up and eventually the wrong data will end
up somewhere else. In this case it was the RAM.
>> I don't believe the source/destination could be aware of the misconfiguration. IIUC the destination reads the migration stream and expects certain pieces of data in a certain order. If new data is added to the migration stream or the order has changed and the destination isn't expecting it, then the migration fails. It doesn't know exactly why, just that it read-in data that it wasn't expecting.
>>
>>>> This is poor wording on my part, my apologies. I don't think it's even
>>>> possible to know the capabilities between the source & destination.
>>>>
>>>>>> The capability defaults to off to maintain backward compatibility.
>>>>>>
>>>>>> To enable the capability via HMP:
>>>>>> (qemu) migrate_set_capability virtio-iterative on
>>>>>>
>>>>>> To enable the capability via QMP:
>>>>>> {"execute": "migrate-set-capabilities", "arguments": {
>>>>>> "capabilities": [
>>>>>> { "capability": "virtio-iterative", "state": true }
>>>>>> ]
>>>>>> }
>>>>>> }
>>>>>>
>>>>>> Signed-off-by: Jonah Palmer <jonah.palmer@oracle.com>
>
> [...]
>
>>>>>> diff --git a/qapi/migration.json b/qapi/migration.json
>>>>>> index 4963f6ca12..8f042c3ba5 100644
>>>>>> --- a/qapi/migration.json
>>>>>> +++ b/qapi/migration.json
>>>>>> @@ -479,6 +479,11 @@
>>>>>> # each RAM page. Requires a migration URI that supports seeking,
>>>>>> # such as a file. (since 9.0)
>>>>>> #
>>>>>> +# @virtio-iterative: Enable iterative migration for virtio devices, if
>>>>>> +# the device supports it. When enabled, and where supported, virtio
>>>>>> +# devices will track and migrate configuration changes that may
>>>>>> +# occur during the migration process. (Since 10.1)
>>>>>
>>>>> When and why should the user enable this?
>>>>
>>>> Well if all goes according to plan, always (at least for virtio-net).
>>>> This should improve the overall speed of live migration for a virtio-net
>>>> device (and vhost-net/vhost-vdpa).
>>>
>>> So the only use for "disabled" would be when migrating to or from an
>>> older version of QEMU that doesn't support this. Fair?
>>
>> Correct.
>>
>>> What's the default?
>>
>> Disabled.
>
> Awkward for something that should always be enabled. But see below.
>
> Please document defaults in the doc comment.
>
Ack.
>>>>> What exactly do you mean by "where supported"?
>>>>
>>>> I meant if both source's Qemu and destination's Qemu support it, as well
>>>> as for other virtio devices in the future if they decide to implement
>>>> iterative migration (e.g. a more general "enable iterative migration for
>>>> virtio devices").
>>>>
>>>> But I think for now this is better left as a virtio-net configuration
>>>> rather than as a migration capability (e.g. --device
>>>> virtio-net-pci,iterative-mig=on/off,...)
>>>
>>> Makes sense to me (but I'm not a migration expert).
>
> A device property's default can depend on the machine type via compat
> properties. This is normally used to restrict a guest-visible change to
> newer machine types. Here, it's not guest-visible. But it can get you
> this:
>
> * Migrate new machine type from new QEMU to new QEMU (old QEMU doesn't
> have the machine type): iterative is enabled by default. Good. User
> can disable it on both ends to not get the improvement. Enabling it
> on just one breaks migration.
>
> All other cases go away with time.
>
> * Migrate old machine type from new QEMU to new QEMU: iterative is
> disabled by default, which is sad, but no worse than before. User can
> enable it on both ends to get the improvement. Enabling it on just
> one breaks migration.
>
> * Migrate old machine type from new QEMU to old QEMU or vice versa:
> iterative is off by default. Good. Enabling it on the new one breaks
> migration.
>
> * Migrate old machine type from old QEMU to old QEMU: iterative is off
>
> I figure almost all users could simply ignore this configuration knob
> then.
>
Oh, that's interesting. I wasn't aware of this. But couldn't this
potentially cause some headaches and confusion when attempting to
migrate between 2 guests where one VM is using a machine type that does
support it and the other isn't?
For example, the source and destination VMs both specify '-machine
q35,...' and the q35 alias resolves into, say, pc-q35-10.1 for the
source VM and pc-q35-10.0 for the destination VM. And say this property
is supported on >= pc-q35-10.1.
IIUC, this would mean that iterative is enabled by default on the source
VM but disabled by default on the destination VM.
Then a user attempts the migration, the migration fails, and then they'd
have to try and figure out why it's failing.
Furthermore, since it's a device property that's essentially set at VM
creation time, either the source would have to be reset and explicitly
set this property to off or the destination would have to be reset and
use a newer (>= pc-q35-10.1) machine type before starting it back up and
perform the migration.
Am I understanding this correctly?
>>> [...]
>
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC 1/6] migration: Add virtio-iterative capability
2025-08-26 18:08 ` Jonah Palmer
@ 2025-08-27 6:37 ` Markus Armbruster
2025-08-28 15:29 ` Jonah Palmer
0 siblings, 1 reply; 66+ messages in thread
From: Markus Armbruster @ 2025-08-27 6:37 UTC (permalink / raw)
To: Jonah Palmer
Cc: qemu-devel, peterx, farosas, eblake, jasowang, mst, si-wei.liu,
eperezma, boris.ostrovsky
Jonah Palmer <jonah.palmer@oracle.com> writes:
> On 8/26/25 2:11 AM, Markus Armbruster wrote:
>> Jonah Palmer <jonah.palmer@oracle.com> writes:
>>
>>> On 8/25/25 8:44 AM, Markus Armbruster wrote:
>>
>> [...]
>>
>>>> Jonah Palmer <jonah.palmer@oracle.com> writes:
>>>>
>>>>> On 8/8/25 6:48 AM, Markus Armbruster wrote:
>>
>> [...]
>>
>>>>>> Jonah Palmer <jonah.palmer@oracle.com> writes:
>>>>>>> Adds a new migration capability 'virtio-iterative' that will allow
>>>>>>> virtio devices, where supported, to iteratively migrate configuration
>>>>>>> changes that occur during the migration process.
>>>>>>
>>>>>> Why is that desirable?
>>>>>
>>>>> To be frank, I wasn't sure if having a migration capability, or even
>>>>> have it toggleable at all, would be desirable or not. It appears though
>>>>> that this might be better off as a per-device feature set via
>>>>> --device virtio-net-pci,iterative-mig=on,..., for example.
>>>>
>>>> See below.
>>>>
>>>>> And by "iteratively migrate configuration changes" I meant more along
>>>>> the lines of the device's state as it continues running on the source.
>>>>
>>>> Isn't that what migration does always?
>>>
>>> Essentially yes, but today all of the state is only migrated at the end, once the source has been paused. So the final correct state is always sent to the destination.
>>
>> As far as I understand (and ignoring lots of detail, including post
>> copy), we have three stages:
>>
>> 1. Source runs, migrate memory pages. Pages that get dirtied after they
>> are migrated need to be migrated again.
>>
>> 2. Neither source or destination runs, migrate remaining memory pages
>> and device state.
>>
>> 3. Destination starts to run.
>>
>> If the duration of stage 2 (downtime) was of no concern, we'd switch to
>> it immediately, i.e. without migrating anything in stage 1. This would
>> minimize I/O.
>>
>> Of course, we actually care for limiting downtime. We switch to stage 2
>> when "little enough" is left for stage two to migrate.
>>
>>> If we're no longer waiting until the source has been paused and the initial state is sent early, then we need to make sure that any changes that happen is still communicated to the destination.
>>
>> So you're proposing to treat suitable parts of the device state more
>> like memory pages. Correct?
>>
>
> Not in the sense of "something got dirtied so let's immediately re-send
> that" like we would with RAM. It's more along the lines of "something
> got dirtied so let's make sure that gets re-sent at the start of stage 2".
Or is it "something might have dirtied, just resend in stage 2"?
> The entire state of a virtio-net device (even with vhost-net /
> vhost-vDPA) is <10KB I believe. I don't believe there's much to gain by
> "iteratively" re-sending changes for virtio-net. It should be suitable
> enough to just re-send whatever changed during stage 1 (after the
> initial state was sent) at the start of stage 2.
Got it.
> This is why I'm currently looking into a solution that uses VMSD's
> .early_setup flag (that Peter recommended) rather than implementing a
> suite of SaveVMHandlers hooks (like this RFC does). We don't need this
> iterative capability as much as we need to start migrating the state
> earlier (and doing corresponding config/prep work) during stage 1.
>
>> Cover letter and commit message of PATCH 4 provide the motivation: you
>> observe a shorter downtime. You speculate this is due to moving "heavy
>> allocations and page-fault latencies" from stage 2 to stage 1. Correct?
>
> Correct. But again I'd like to stress that this is just one part in
> reducing downtime during stage 2. The biggest reductions will come from
> the config/prep work that we're trying to move from stage 2 to stage 1,
> especially when vhost-vDPA is involved. And we can only do this early
> work once we have the state, hence why we're sending it earlier.
This is an important bit of detail I've been missing so far. Easy
enough to fix in a future commit message and cover letter.
>> Is there anything that makes virtio-net particularly suitable?
>
> Yes, especially with vhost-vDPA and configuring VQs. See Eugenio's
> comment here
> https://lore.kernel.org/qemu-devel/CAJaqyWdUutZrAWKy9d=ip+h+y3BnptUrcL8Xj06XfizNxPtfpw@mail.gmail.com/.
Such prep work commonly depends only on device configuration, not state.
I'm curious: what state bits exactly does the prep work need?
Device configuration is available at the start of stage 1, state is
fully available only at the end of stage 2.
Your patches make *tentative* device state available in stage 1.
Tentative, because it may still change afterwards.
You use tentative state to do certain expensive work in stage 1 already,
in order to cut downtime in stage 2.
Fair?
Can state change in ways that invalidate this work?
If yes, how do you handle this?
If no, do you verify the "no change" design assumption holds?
>> I think this patch's commit message should at least hint at the
>> motivation at a high level. Details like measurements are best left to
>> PATCH 4.
>
> You're right, this was my bad for not framing this RFC and its true
> motivations more clearly. I will certainly be more direct and
> descriptive in the next RFC for this effort.
>
>>> This RFC handles this by just re-sending the entire state again once the source has been paused. But of course this isn't optimal and I'm looking into how to better optimize this part.
>>
>> How much is the entire state?
>
> I'm not exactly sure how large it is but it should be <10KB even with
> vhost-vDPA. It could be slightly larger if we really up the number of
> queue pairs and/or have huge MAC/multicast lists.
No worries then.
>>>>> But perhaps actual configuration changes (e.g. changing the number of
>>>>> queue pairs) could also be supported mid-migration like this?
>>>>
>>>> I don't know.
>>>>
>>>>>>> This capability is added to the validated capabilities list to ensure
>>>>>>> both the source and destination support it before enabling.
>>>>>>
>>>>>> What happens when only one side enables it?
>>>>>
>>>>> The migration stream breaks if only one side enables it.
>>>>
>>>> How does it break? Error message pointing out the misconfiguration?
>>>>
>>>
>>> The destination VM is torn down and the source just reports that migration failed.
>>
>> Exact same failure as for other misconfigurations, like missing a device
>> on the destination?
>
> I hesitate to say "exact" but for example, when missing a device on one
> side you might see something like below (I removed a serial device):
>
> qemu-system-x86_64: Unknown ramblock "0000:00:03.0/virtio-net-pci.rom",
> cannot accept migration
> qemu-system-x86_64: error while loading state for instance 0x0 of device
> 'ram'
> qemu-system-x86_64: load of migration failed: Invalid argument
> ...
Aside: ugly error cascade due to migration's well-known failure to
propagate errors up properly.
> The expected order gets messed up and eventually the wrong data will end
> up somewhere else. In this case it was the RAM.
It's messy. If we started on a green field today, we'd do better, I
hope.
What error message do you observe when only one side enables
@virtio-iterative? Question is moot if you plan to switch to a
different interface. Answer it for that interface in a commit message
then.
>>> I don't believe the source/destination could be aware of the misconfiguration. IIUC the destination reads the migration stream and expects certain pieces of data in a certain order. If new data is added to the migration stream or the order has changed and the destination isn't expecting it, then the migration fails. It doesn't know exactly why, just that it read-in data that it wasn't expecting.
>>>
>>>>> This is poor wording on my part, my apologies. I don't think it's even
>>>>> possible to know the capabilities between the source & destination.
>>>>>
>>>>>>> The capability defaults to off to maintain backward compatibility.
>>>>>>>
>>>>>>> To enable the capability via HMP:
>>>>>>> (qemu) migrate_set_capability virtio-iterative on
>>>>>>>
>>>>>>> To enable the capability via QMP:
>>>>>>> {"execute": "migrate-set-capabilities", "arguments": {
>>>>>>> "capabilities": [
>>>>>>> { "capability": "virtio-iterative", "state": true }
>>>>>>> ]
>>>>>>> }
>>>>>>> }
>>>>>>>
>>>>>>> Signed-off-by: Jonah Palmer <jonah.palmer@oracle.com>
>>
>> [...]
>>
>>>>>>> diff --git a/qapi/migration.json b/qapi/migration.json
>>>>>>> index 4963f6ca12..8f042c3ba5 100644
>>>>>>> --- a/qapi/migration.json
>>>>>>> +++ b/qapi/migration.json
>>>>>>> @@ -479,6 +479,11 @@
>>>>>>> # each RAM page. Requires a migration URI that supports seeking,
>>>>>>> # such as a file. (since 9.0)
>>>>>>> #
>>>>>>> +# @virtio-iterative: Enable iterative migration for virtio devices, if
>>>>>>> +# the device supports it. When enabled, and where supported, virtio
>>>>>>> +# devices will track and migrate configuration changes that may
>>>>>>> +# occur during the migration process. (Since 10.1)
>>>>>>
>>>>>> When and why should the user enable this?
>>>>>
>>>>> Well if all goes according to plan, always (at least for virtio-net).
>>>>> This should improve the overall speed of live migration for a virtio-net
>>>>> device (and vhost-net/vhost-vdpa).
>>>>
>>>> So the only use for "disabled" would be when migrating to or from an
>>>> older version of QEMU that doesn't support this. Fair?
>>>
>>> Correct.
>>>
>>>> What's the default?
>>>
>>> Disabled.
>>
>> Awkward for something that should always be enabled. But see below.
>>
>> Please document defaults in the doc comment.
>
> Ack.
>
>>>>>> What exactly do you mean by "where supported"?
>>>>>
>>>>> I meant if both source's Qemu and destination's Qemu support it, as well
>>>>> as for other virtio devices in the future if they decide to implement
>>>>> iterative migration (e.g. a more general "enable iterative migration for
>>>>> virtio devices").
>>>>>
>>>>> But I think for now this is better left as a virtio-net configuration
>>>>> rather than as a migration capability (e.g. --device
>>>>> virtio-net-pci,iterative-mig=on/off,...)
>>>>
>>>> Makes sense to me (but I'm not a migration expert).
>>
>> A device property's default can depend on the machine type via compat
>> properties. This is normally used to restrict a guest-visible change to
>> newer machine types. Here, it's not guest-visible. But it can get you
>> this:
>>
>> * Migrate new machine type from new QEMU to new QEMU (old QEMU doesn't
>> have the machine type): iterative is enabled by default. Good. User
>> can disable it on both ends to not get the improvement. Enabling it
>> on just one breaks migration.
>>
>> All other cases go away with time.
>>
>> * Migrate old machine type from new QEMU to new QEMU: iterative is
>> disabled by default, which is sad, but no worse than before. User can
>> enable it on both ends to get the improvement. Enabling it on just
>> one breaks migration.
>>
>> * Migrate old machine type from new QEMU to old QEMU or vice versa:
>> iterative is off by default. Good. Enabling it on the new one breaks
>> migration.
>>
>> * Migrate old machine type from old QEMU to old QEMU: iterative is off
>>
>> I figure almost all users could simply ignore this configuration knob
>> then.
>
> Oh, that's interesting. I wasn't aware of this. But couldn't this
> potentially cause some headaches and confusion when attempting to
> migrate between 2 guests where one VM is using a machine type that does
> support it and the other isn't?
>
> For example, the source and destination VMs both specify '-machine
> q35,...' and the q35 alias resolves into, say, pc-q35-10.1 for the
> source VM and pc-q35-10.0 for the destination VM. And say this property
> is supported on >= pc-q35-10.1.
In my understanding, migration requires identical machine types on both
ends, and all bets are off when they're different.
> IIUC, this would mean that iterative is enabled by default on the source
> VM but disabled by default on the destination VM.
>
> Then a user attempts the migration, the migration fails, and then they'd
> have to try and figure out why it's failing.
Migration failures due to mismatched configuration tend to be that way,
don't they?
> Furthermore, since it's a device property that's essentially set at VM
> creation time, either the source would have to be reset and explicitly
> set this property to off or the destination would have to be reset and
> use a newer (>= pc-q35-10.1) machine type before starting it back up and
> perform the migration.
You can use qom-set to change a device property after you created the
device. It might even work. However, qom-set is a deeply problematic
and seriously underdocumented interface. Avoid.
But will you need to change it?
If you started the source with an explicit property value, start the
destination the same way. Same as for any number of other configuration
knobs.
If you started the source with the default property value, start the
destination the same way. Values will match as long as the machine type
matches, as it should.
> Am I understanding this correctly?
>
>>>> [...]
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC 5/6] virtio,virtio-net: skip consistency check in virtio_load for iterative migration
2025-08-20 7:59 ` Eugenio Perez Martin
2025-08-25 12:16 ` Jonah Palmer
@ 2025-08-27 16:55 ` Jonah Palmer
2025-09-01 6:57 ` Eugenio Perez Martin
1 sibling, 1 reply; 66+ messages in thread
From: Jonah Palmer @ 2025-08-27 16:55 UTC (permalink / raw)
To: Eugenio Perez Martin, Peter Xu, si-wei.liu
Cc: qemu-devel, farosas, eblake, armbru, jasowang, mst,
boris.ostrovsky, Dragos Tatulea DE
On 8/20/25 3:59 AM, Eugenio Perez Martin wrote:
> On Tue, Aug 19, 2025 at 5:11 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
>>
>>
>>
>> On 8/19/25 3:10 AM, Eugenio Perez Martin wrote:
>>> On Mon, Aug 18, 2025 at 4:46 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
>>>>
>>>>
>>>>
>>>> On 8/18/25 2:51 AM, Eugenio Perez Martin wrote:
>>>>> On Fri, Aug 15, 2025 at 4:50 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 8/14/25 5:28 AM, Eugenio Perez Martin wrote:
>>>>>>> On Wed, Aug 13, 2025 at 4:06 PM Peter Xu <peterx@redhat.com> wrote:
>>>>>>>>
>>>>>>>> On Wed, Aug 13, 2025 at 11:25:00AM +0200, Eugenio Perez Martin wrote:
>>>>>>>>> On Mon, Aug 11, 2025 at 11:56 PM Peter Xu <peterx@redhat.com> wrote:
>>>>>>>>>>
>>>>>>>>>> On Mon, Aug 11, 2025 at 05:26:05PM -0400, Jonah Palmer wrote:
>>>>>>>>>>> This effort was started to reduce the guest visible downtime by
>>>>>>>>>>> virtio-net/vhost-net/vhost-vDPA during live migration, especially
>>>>>>>>>>> vhost-vDPA.
>>>>>>>>>>>
>>>>>>>>>>> The downtime contributed by vhost-vDPA, for example, is not from having to
>>>>>>>>>>> migrate a lot of state but rather expensive backend control-plane latency
>>>>>>>>>>> like CVQ configurations (e.g. MQ queue pairs, RSS, MAC/VLAN filters, offload
>>>>>>>>>>> settings, MTU, etc.). Doing this requires kernel/HW NIC operations which
>>>>>>>>>>> dominates its downtime.
>>>>>>>>>>>
>>>>>>>>>>> In other words, by migrating the state of virtio-net early (before the
>>>>>>>>>>> stop-and-copy phase), we can also start staging backend configurations,
>>>>>>>>>>> which is the main contributor of downtime when migrating a vhost-vDPA
>>>>>>>>>>> device.
>>>>>>>>>>>
>>>>>>>>>>> I apologize if this series gives the impression that we're migrating a lot
>>>>>>>>>>> of data here. It's more along the lines of moving control-plane latency out
>>>>>>>>>>> of the stop-and-copy phase.
>>>>>>>>>>
>>>>>>>>>> I see, thanks.
>>>>>>>>>>
>>>>>>>>>> Please add these into the cover letter of the next post. IMHO it's
>>>>>>>>>> extremely important information to explain the real goal of this work. I
>>>>>>>>>> bet it is not expected for most people when reading the current cover
>>>>>>>>>> letter.
>>>>>>>>>>
>>>>>>>>>> Then it could have nothing to do with iterative phase, am I right?
>>>>>>>>>>
>>>>>>>>>> What are the data needed for the dest QEMU to start staging backend
>>>>>>>>>> configurations to the HWs underneath? Does dest QEMU already have them in
>>>>>>>>>> the cmdlines?
>>>>>>>>>>
>>>>>>>>>> Asking this because I want to know whether it can be done completely
>>>>>>>>>> without src QEMU at all, e.g. when dest QEMU starts.
>>>>>>>>>>
>>>>>>>>>> If src QEMU's data is still needed, please also first consider providing
>>>>>>>>>> such facility using an "early VMSD" if it is ever possible: feel free to
>>>>>>>>>> refer to commit 3b95a71b22827d26178.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> While it works for this series, it does not allow to resend the state
>>>>>>>>> when the src device changes. For example, if the number of virtqueues
>>>>>>>>> is modified.
>>>>>>>>
>>>>>>>> Some explanation on "how sync number of vqueues helps downtime" would help.
>>>>>>>> Not "it might preheat things", but exactly why, and how that differs when
>>>>>>>> it's pure software, and when hardware will be involved.
>>>>>>>>
>>>>>>>
>>>>>>> By nvidia engineers to configure vqs (number, size, RSS, etc) takes
>>>>>>> about ~200ms:
>>>>>>> https://lore.kernel.org/qemu-devel/6c8ebb97-d546-3f1c-4cdd-54e23a566f61@nvidia.com/T/
>>>>>>>
>>>>>>> Adding Dragos here in case he can provide more details. Maybe the
>>>>>>> numbers have changed though.
>>>>>>>
>>>>>>> And I guess the difference with pure SW will always come down to PCI
>>>>>>> communications, which assume it is slower than configuring the host SW
>>>>>>> device in RAM or even CPU cache. But I admin that proper profiling is
>>>>>>> needed before making those claims.
>>>>>>>
>>>>>>> Jonah, can you print the time it takes to configure the vDPA device
>>>>>>> with traces vs the time it takes to enable the dataplane of the
>>>>>>> device? So we can get an idea of how much time we save with this.
>>>>>>>
>>>>>>
>>>>>> Let me know if this isn't what you're looking for.
>>>>>>
>>>>>> I'm assuming by "configuration time" you mean:
>>>>>> - Time from device startup (entry to vhost_vdpa_dev_start()) to right
>>>>>> before we start enabling the vrings (e.g.
>>>>>> VHOST_VDPA_SET_VRING_ENABLE in vhost_vdpa_net_cvq_load()).
>>>>>>
>>>>>> And by "time taken to enable the dataplane" I'm assuming you mean:
>>>>>> - Time right before we start enabling the vrings (see above) to right
>>>>>> after we enable the last vring (at the end of
>>>>>> vhost_vdpa_net_cvq_load())
>>>>>>
>>>>>> Guest specs: 128G Mem, SVQ=on, CVQ=on, 8 queue pairs:
>>>>>>
>>>>>> -netdev type=vhost-vdpa,vhostdev=$VHOST_VDPA_0,id=vhost-vdpa0,
>>>>>> queues=8,x-svq=on
>>>>>>
>>>>>> -device virtio-net-pci,netdev=vhost-vdpa0,id=vdpa0,bootindex=-1,
>>>>>> romfile=,page-per-vq=on,mac=$VF1_MAC,ctrl_vq=on,mq=on,
>>>>>> ctrl_vlan=off,vectors=18,host_mtu=9000,
>>>>>> disable-legacy=on,disable-modern=off
>>>>>>
>>>>>> ---
>>>>>>
>>>>>> Configuration time: ~31s
>>>>>> Dataplane enable time: ~0.14ms
>>>>>>
>>>>>
>>>>> I was vague, but yes, that's representative enough! It would be more
>>>>> accurate if the configuration time ends by the time QEMU enables the
>>>>> first queue of the dataplane though.
>>>>>
>>>>> As Si-Wei mentions, is v->shared->listener_registered == true at the
>>>>> beginning of vhost_vdpa_dev_start?
>>>>>
>>>>
>>>> Ah, I also realized that Qemu I was using for measurements was using a
>>>> version before the listener_registered member was introduced.
>>>>
>>>> I retested with the latest changes in Qemu and set x-svq=off, e.g.:
>>>> guest specs: 128G Mem, SVQ=off, CVQ=on, 8 queue pairs. I ran testing 3
>>>> times for measurements.
>>>>
>>>> v->shared->listener_registered == false at the beginning of
>>>> vhost_vdpa_dev_start().
>>>>
>>>
>>> Let's move out the effect of the mem pinning from the downtime by
>>> registering the listener before the migration. Can you check why is it
>>> not registered at vhost_vdpa_set_owner?
>>>
>>
>> Sorry I was profiling improperly. The listener is registered at
>> vhost_vdpa_set_owner initially and v->shared->listener_registered is set
>> to true, but once we reach the first vhost_vdpa_dev_start call, it shows
>> as false and is re-registered later in the function.
>>
>> Should we always expect listener_registered == true at every
>> vhost_vdpa_dev_start call during startup?
>
> Yes, that leaves all the memory pinning time out of the downtime.
>
>> This is what I traced during
>> startup of a single guest (no migration).
>
> We can trace the destination's QEMU to be more accurate, but probably
> it makes no difference.
>
>> Tracepoint is right at the
>> start of the vhost_vdpa_dev_start function:
>>
>> vhost_vdpa_set_owner() - register memory listener
>> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
>> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
>> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
>
> This is surprising. Can you trace how listener_registered goes to 0 again?
>
>> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
>> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
>> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
>> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
>> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
>> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
>> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
>> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
>> ...
>> * VQs are now being enabled *
>>
>> I'm also seeing that when the guest is being shutdown,
>> dev->vhost_ops->vhost_get_vring_base() is failing in
>> do_vhost_virtqueue_stop():
>>
>> ...
>> [ 114.718429] systemd-shutdown[1]: Syncing filesystems and block devices.
>> [ 114.719255] systemd-shutdown[1]: Powering off.
>> [ 114.719916] sd 0:0:0:0: [sda] Synchronizing SCSI cache
>> [ 114.724826] ACPI: PM: Preparing to enter system sleep state S5
>> [ 114.725593] reboot: Power down
>> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
>> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
>> qemu-system-x86_64: vhost VQ 2 ring restore failed: -1: Operation not
>> permitted (1)
>> qemu-system-x86_64: vhost VQ 3 ring restore failed: -1: Operation not
>> permitted (1)
>> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
>> qemu-system-x86_64: vhost VQ 4 ring restore failed: -1: Operation not
>> permitted (1)
>> qemu-system-x86_64: vhost VQ 5 ring restore failed: -1: Operation not
>> permitted (1)
>> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
>> qemu-system-x86_64: vhost VQ 6 ring restore failed: -1: Operation not
>> permitted (1)
>> qemu-system-x86_64: vhost VQ 7 ring restore failed: -1: Operation not
>> permitted (1)
>> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
>> qemu-system-x86_64: vhost VQ 8 ring restore failed: -1: Operation not
>> permitted (1)
>> qemu-system-x86_64: vhost VQ 9 ring restore failed: -1: Operation not
>> permitted (1)
>> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
>> qemu-system-x86_64: vhost VQ 10 ring restore failed: -1: Operation not
>> permitted (1)
>> qemu-system-x86_64: vhost VQ 11 ring restore failed: -1: Operation not
>> permitted (1)
>> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
>> qemu-system-x86_64: vhost VQ 12 ring restore failed: -1: Operation not
>> permitted (1)
>> qemu-system-x86_64: vhost VQ 13 ring restore failed: -1: Operation not
>> permitted (1)
>> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
>> qemu-system-x86_64: vhost VQ 14 ring restore failed: -1: Operation not
>> permitted (1)
>> qemu-system-x86_64: vhost VQ 15 ring restore failed: -1: Operation not
>> permitted (1)
>> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
>>
>> However when x-svq=on, I don't see these errors on shutdown.
>>
>
> SVQ can mask this error as it does not need to forward the ring
> restore message to the device. It can just start with 0 and convert
> indexes.
>
> Let's focus on listened_registered first :).
>
>>>> ---
>>>>
>>>> Configuration time: Time from first entry into vhost_vdpa_dev_start() to
>>>> right after Qemu enables the first VQ.
>>>> - 26.947s, 26.606s, 27.326s
>>>>
>>>> Enable dataplane: Time from right after first VQ is enabled to right
>>>> after the last VQ is enabled.
>>>> - 0.081ms, 0.081ms, 0.079ms
>>>>
>>>
>>
>
I looked into this a bit more and realized I was being naive in thinking
that the vhost-vDPA device startup path of a single VM would be the same
as that on a destination VM during live migration. This is **not** the
case and I apologize for the confusion I caused.
What I described and profiled above is indeed true for the startup of a
single VM / source VM with a vhost-vDPA device. However, this is not
true on the destination side and its configuration time is drastically
different.
Under the same specs, but now with a live migration performed between a
source and destination VM (128G Mem, SVQ=off, CVQ=on, 8 queue pairs),
and using the same tracepoints to find the configuration time and enable
dataplane time, these are the measurements I found for the **destination
VM**:
Configuration time: Time from first entry into vhost_vdpa_dev_start to
right after Qemu enables the first VQ.
- 268.603ms, 241.515ms, 249.007ms
Enable dataplane time: Time from right after the first VQ is enabled to
right after the last VQ is enabled.
- 0.072ms, 0.071ms, 0.070ms
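For reference, the tracepoint I'm using for these intervals sits at the
very top of vhost_vdpa_dev_start() and is roughly of the shape below.
The event name and exact fields here are my own illustration, not what
is actually in the tree:

/* hw/virtio/trace-events (illustrative):
 *   vhost_vdpa_dev_start_state(void *dev, int registered, int started) \
 *       "dev: %p listener_registered: %d started: %d"
 */

/* at the top of vhost_vdpa_dev_start(), before any configuration work: */
struct vhost_vdpa *v = dev->opaque;

trace_vhost_vdpa_dev_start_state(dev, v->shared->listener_registered,
                                 started);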
---
For those curious, using the same printouts as I did above, this is what
it actually looks like on the destination side:
* Destination VM is started *
vhost_vdpa_set_owner() - register memory listener
vhost_vdpa_reset_device() - unregistering listener
* Start live migration on source VM *
(qemu) migrate unix:/tmp/lm.sock
...
vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
vhost_vdpa_dev_start() - register listener
And this is very different from the churn we saw in my previous email,
which happens on the source / single-guest VM during the vhost-vDPA
startup path.
---
Again, apologies for the confusion this caused. This was my fault for not
being more careful.
Jonah
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC 1/6] migration: Add virtio-iterative capability
2025-08-27 6:37 ` Markus Armbruster
@ 2025-08-28 15:29 ` Jonah Palmer
2025-08-29 9:24 ` Markus Armbruster
0 siblings, 1 reply; 66+ messages in thread
From: Jonah Palmer @ 2025-08-28 15:29 UTC (permalink / raw)
To: Markus Armbruster
Cc: qemu-devel, peterx, farosas, eblake, jasowang, mst, si-wei.liu,
eperezma, boris.ostrovsky
On 8/27/25 2:37 AM, Markus Armbruster wrote:
> Jonah Palmer <jonah.palmer@oracle.com> writes:
>
>> On 8/26/25 2:11 AM, Markus Armbruster wrote:
>>> Jonah Palmer <jonah.palmer@oracle.com> writes:
>>>
>>>> On 8/25/25 8:44 AM, Markus Armbruster wrote:
>>>
>>> [...]
>>>
>>>>> Jonah Palmer <jonah.palmer@oracle.com> writes:
>>>>>
>>>>>> On 8/8/25 6:48 AM, Markus Armbruster wrote:
>>>
>>> [...]
>>>
>>>>>>> Jonah Palmer <jonah.palmer@oracle.com> writes:
>>>>>>>> Adds a new migration capability 'virtio-iterative' that will allow
>>>>>>>> virtio devices, where supported, to iteratively migrate configuration
>>>>>>>> changes that occur during the migration process.
>>>>>>>
>>>>>>> Why is that desirable?
>>>>>>
>>>>>> To be frank, I wasn't sure if having a migration capability, or even
>>>>>> have it toggleable at all, would be desirable or not. It appears though
>>>>>> that this might be better off as a per-device feature set via
>>>>>> --device virtio-net-pci,iterative-mig=on,..., for example.
>>>>>
>>>>> See below.
>>>>>
>>>>>> And by "iteratively migrate configuration changes" I meant more along
>>>>>> the lines of the device's state as it continues running on the source.
>>>>>
>>>>> Isn't that what migration does always?
>>>>
>>>> Essentially yes, but today all of the state is only migrated at the end, once the source has been paused. So the final correct state is always sent to the destination.
>>>
>>> As far as I understand (and ignoring lots of detail, including post
>>> copy), we have three stages:
>>>
>>> 1. Source runs, migrate memory pages. Pages that get dirtied after they
>>> are migrated need to be migrated again.
>>>
>>> 2. Neither source or destination runs, migrate remaining memory pages
>>> and device state.
>>>
>>> 3. Destination starts to run.
>>>
>>> If the duration of stage 2 (downtime) was of no concern, we'd switch to
>>> it immediately, i.e. without migrating anything in stage 1. This would
>>> minimize I/O.
>>>
>>> Of course, we actually care for limiting downtime. We switch to stage 2
>>> when "little enough" is left for stage two to migrate.
>>>
>>>> If we're no longer waiting until the source has been paused and the initial state is sent early, then we need to make sure that any changes that happen is still communicated to the destination.
>>>
>>> So you're proposing to treat suitable parts of the device state more
>>> like memory pages. Correct?
>>>
>>
>> Not in the sense of "something got dirtied so let's immediately re-send
>> that" like we would with RAM. It's more along the lines of "something
>> got dirtied so let's make sure that gets re-sent at the start of stage 2".
>
> Or is it "something might have dirtied, just resend in stage 2"?
>
Exactly. This is better wording since it doesn't necessarily have to be
sent at the "start" of stage 2. Just at some point during it.
>> The entire state of a virtio-net device (even with vhost-net /
>> vhost-vDPA) is <10KB I believe. I don't believe there's much to gain by
>> "iteratively" re-sending changes for virtio-net. It should be suitable
>> enough to just re-send whatever changed during stage 1 (after the
>> initial state was sent) at the start of stage 2.
>
> Got it.
>
>> This is why I'm currently looking into a solution that uses VMSD's
>> .early_setup flag (that Peter recommended) rather than implementing a
>> suite of SaveVMHandlers hooks (like this RFC does). We don't need this
>> iterative capability as much as we need to start migrating the state
>> earlier (and doing corresponding config/prep work) during stage 1.
>>
>>> Cover letter and commit message of PATCH 4 provide the motivation: you
>>> observe a shorter downtime. You speculate this is due to moving "heavy
>>> allocations and page-fault latencies" from stage 2 to stage 1. Correct?
>>
>> Correct. But again I'd like to stress that this is just one part in
>> reducing downtime during stage 2. The biggest reductions will come from
>> the config/prep work that we're trying to move from stage 2 to stage 1,
>> especially when vhost-vDPA is involved. And we can only do this early
>> work once we have the state, hence why we're sending it earlier.
>
> This is an important bit of detail I've been missing so far. Easy
> enough to fix in a future commit message and cover letter.
>
Ack.
>>> Is there anything that makes virtio-net particularly suitable?
>>
>> Yes, especially with vhost-vDPA and configuring VQs. See Eugenio's
>> comment here
>> https://lore.kernel.org/qemu-devel/CAJaqyWdUutZrAWKy9d=ip+h+y3BnptUrcL8Xj06XfizNxPtfpw@mail.gmail.com/.
>
> Such prep work commonly depends only on device configuration, not state.
> I'm curious: what state bits exactly does the prep work need?
>
> Device configuration is available at the start of stage 1, state is
> fully available only at the end of stage 2.
>
We pretty much need, more or less, all of the state of the VirtIODevice
itself as well as the bits of the VirtIONet device. Essentially, barring
ring indices, we'd need whatever is required throughout most of the
device's startup routine.
In this series, we get everything we need from the vmstate_save_state(f,
&vmstate_virtio_net, ...) and vmstate_load_state(f, &vmstate_virtio_net,
...) calls early during stage 1 (see patch 4/6).
Once we've gotten this data, we can start on the prep work that's
normally done today during stage 2.
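At its core it's not much more than the sketch below (hook wiring and
error handling omitted, and the hook names are just illustrative; the
VMSD is the existing vmstate_virtio_net):

static int virtio_net_save_early_state(QEMUFile *f, void *opaque)
{
    VirtIONet *n = opaque;

    /* source: serialize the full (tentative) device state in stage 1 */
    return vmstate_save_state(f, &vmstate_virtio_net, n, NULL);
}

static int virtio_net_load_early_state(QEMUFile *f, void *opaque,
                                       int version_id)
{
    VirtIONet *n = opaque;

    /* destination: apply the tentative state, then start the backend
     * config/prep work that normally happens during stage 2 */
    return vmstate_load_state(f, &vmstate_virtio_net, n, version_id);
}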
> Your patches make *tentative* device state available in stage 1.
> Tentative, because it may still change afterwards.
>
> You use tentative state to do certain expensive work in stage 1 already,
> in order to cut downtime in stage 2.
>
> Fair?
Correct.
>
> Can state change in ways that invalidate this work?
>
If, for some reason, the guest wanted to change everything during
migration (specifically during stage 1), then it'd more or less negate
the early prep work we'd've done. How impactful this is would depend on
which route we go (see below). God forbid the guest just wait until
migration is complete.
> If yes, how do you handle this?
>
So it depends on the route this series goes. That is, whether we go the
truly iterative SaveVMHandlers hooks route (which this series uses) or
if we go the early_setup VMSD route (which Peter recommended).
---
If we go the truly iterative route, then technically we can still handle
these changes during stage 1 and still keep the work out of stage 2.
However, given the nicheness of such a corner case (where things are
being changed last minute during migration), handling these changes
iteratively might be overdesign.
And we'd have to guard against the scenario where the guest acts
maliciously by constantly changing things to prevent migration from
continuing.
---
If we go the early_setup VMSD route, where we get one shot early to do
stuff during stage 1 and one last shot to do things later during stage
2, then the more that gets changed, the less beneficial this early
work becomes. This is because any changes made during stage 1 could only
be handled during stage 2, which is what this overall effort is trying
to minimize.
> If no, do you verify the "no change" design assumption holds?
>
When it comes to early migration for devices, we can never support a "no
change" design assumption. The difference in the design lies in how (and
when) such changes are handled during migration.
>>> I think this patch's commit message should at least hint at the
>>> motivation at a high level. Details like measurements are best left to
>>> PATCH 4.
>>
>> You're right, this was my bad for not framing this RFC and its true
>> motivations more clearly. I will certainly be more direct and
>> descriptive in the next RFC for this effort.
>>
>>>> This RFC handles this by just re-sending the entire state again once the source has been paused. But of course this isn't optimal and I'm looking into how to better optimize this part.
>>>
>>> How much is the entire state?
>>
>> I'm not exactly sure how large it is but it should be <10KB even with
>> vhost-vDPA. It could be slightly larger if we really up the number of
>> queue pairs and/or have huge MAC/multicast lists.
>
> No worries then.
>
>>>>>> But perhaps actual configuration changes (e.g. changing the number of
>>>>>> queue pairs) could also be supported mid-migration like this?
>>>>>
>>>>> I don't know.
>>>>>
>>>>>>>> This capability is added to the validated capabilities list to ensure
>>>>>>>> both the source and destination support it before enabling.
>>>>>>>
>>>>>>> What happens when only one side enables it?
>>>>>>
>>>>>> The migration stream breaks if only one side enables it.
>>>>>
>>>>> How does it break? Error message pointing out the misconfiguration?
>>>>>
>>>>
>>>> The destination VM is torn down and the source just reports that migration failed.
>>>
>>> Exact same failure as for other misconfigurations, like missing a device
>>> on the destination?
>>
>> I hesitate to say "exact" but for example, when missing a device on one
>> side you might see something like below (I removed a serial device):
>>
>> qemu-system-x86_64: Unknown ramblock "0000:00:03.0/virtio-net-pci.rom",
>> cannot accept migration
>> qemu-system-x86_64: error while loading state for instance 0x0 of device
>> 'ram'
>> qemu-system-x86_64: load of migration failed: Invalid argument
>> ...
>
> Aside: ugly error cascade due to migration's well-known failure to
> propagate errors up properly.
>
>> The expected order gets messed up and eventually the wrong data will end
>> up somewhere else. In this case it was the RAM.
>
> It's messy. If we started on a green field today, we'd do better, I
> hope.
>
> What error message do you observe when only one side enables
> @virtio-iterative? Question is moot if you plan to switch to a
> different interface. Answer it for that interface in a commit message
> then.
>
Will do.
>>>> I don't believe the source/destination could be aware of the misconfiguration. IIUC the destination reads the migration stream and expects certain pieces of data in a certain order. If new data is added to the migration stream or the order has changed and the destination isn't expecting it, then the migration fails. It doesn't know exactly why, just that it read-in data that it wasn't expecting.
>>>>
>>>>>> This is poor wording on my part, my apologies. I don't think it's even
>>>>>> possible to know the capabilities between the source & destination.
>>>>>>
>>>>>>>> The capability defaults to off to maintain backward compatibility.
>>>>>>>>
>>>>>>>> To enable the capability via HMP:
>>>>>>>> (qemu) migrate_set_capability virtio-iterative on
>>>>>>>>
>>>>>>>> To enable the capability via QMP:
>>>>>>>> {"execute": "migrate-set-capabilities", "arguments": {
>>>>>>>> "capabilities": [
>>>>>>>> { "capability": "virtio-iterative", "state": true }
>>>>>>>> ]
>>>>>>>> }
>>>>>>>> }
>>>>>>>>
>>>>>>>> Signed-off-by: Jonah Palmer <jonah.palmer@oracle.com>
>>>
>>> [...]
>>>
>>>>>>>> diff --git a/qapi/migration.json b/qapi/migration.json
>>>>>>>> index 4963f6ca12..8f042c3ba5 100644
>>>>>>>> --- a/qapi/migration.json
>>>>>>>> +++ b/qapi/migration.json
>>>>>>>> @@ -479,6 +479,11 @@
>>>>>>>> # each RAM page. Requires a migration URI that supports seeking,
>>>>>>>> # such as a file. (since 9.0)
>>>>>>>> #
>>>>>>>> +# @virtio-iterative: Enable iterative migration for virtio devices, if
>>>>>>>> +# the device supports it. When enabled, and where supported, virtio
>>>>>>>> +# devices will track and migrate configuration changes that may
>>>>>>>> +# occur during the migration process. (Since 10.1)
>>>>>>>
>>>>>>> When and why should the user enable this?
>>>>>>
>>>>>> Well if all goes according to plan, always (at least for virtio-net).
>>>>>> This should improve the overall speed of live migration for a virtio-net
>>>>>> device (and vhost-net/vhost-vdpa).
>>>>>
>>>>> So the only use for "disabled" would be when migrating to or from an
>>>>> older version of QEMU that doesn't support this. Fair?
>>>>
>>>> Correct.
>>>>
>>>>> What's the default?
>>>>
>>>> Disabled.
>>>
>>> Awkward for something that should always be enabled. But see below.
>>>
>>> Please document defaults in the doc comment.
>>
>> Ack.
>>
>>>>>>> What exactly do you mean by "where supported"?
>>>>>>
>>>>>> I meant if both source's Qemu and destination's Qemu support it, as well
>>>>>> as for other virtio devices in the future if they decide to implement
>>>>>> iterative migration (e.g. a more general "enable iterative migration for
>>>>>> virtio devices").
>>>>>>
>>>>>> But I think for now this is better left as a virtio-net configuration
>>>>>> rather than as a migration capability (e.g. --device
>>>>>> virtio-net-pci,iterative-mig=on/off,...)
>>>>>
>>>>> Makes sense to me (but I'm not a migration expert).
>>>
>>> A device property's default can depend on the machine type via compat
>>> properties. This is normally used to restrict a guest-visible change to
>>> newer machine types. Here, it's not guest-visible. But it can get you
>>> this:
>>>
>>> * Migrate new machine type from new QEMU to new QEMU (old QEMU doesn't
>>> have the machine type): iterative is enabled by default. Good. User
>>> can disable it on both ends to not get the improvement. Enabling it
>>> on just one breaks migration.
>>>
>>> All other cases go away with time.
>>>
>>> * Migrate old machine type from new QEMU to new QEMU: iterative is
>>> disabled by default, which is sad, but no worse than before. User can
>>> enable it on both ends to get the improvement. Enabling it on just
>>> one breaks migration.
>>>
>>> * Migrate old machine type from new QEMU to old QEMU or vice versa:
>>> iterative is off by default. Good. Enabling it on the new one breaks
>>> migration.
>>>
>>> * Migrate old machine type from old QEMU to old QEMU: iterative is off
>>>
>>> I figure almost all users could simply ignore this configuration knob
>>> then.
>>
>> Oh, that's interesting. I wasn't aware of this. But couldn't this
>> potentially cause some headaches and confusion when attempting to
>> migrate between 2 guests where one VM is using a machine type that does
>> support it and the other isn't?
>>
>> For example, the source and destination VMs both specify '-machine
>> q35,...' and the q35 alias resolves into, say, pc-q35-10.1 for the
>> source VM and pc-q35-10.0 for the destination VM. And say this property
>> is supported on >= pc-q35-10.1.
>
> In my understanding, migration requires identical machine types on both
> ends, and all bets are off when they're different.
>
Ah, true.
>> IIUC, this would mean that iterative is enabled by default on the source
>> VM but disabled by default on the destination VM.
>>
>> Then a user attempts the migration, the migration fails, and then they'd
>> have to try and figure out why it's failing.
>
> Migration failures due to mismatched configuration tend to be that way,
> don't they?
>
Right.
So if we pin this feature to always be enabled for machine type, say, >=
pc-q35-XX.X, then can we assume that both guests can actually support
this feature?
In other words, conversely, is it possible in production that both
guests use pc-q35-XX.X but one build supports this early migration
feature and the other doesn't?
If we can assume the former, then this would probably be the right approach
for something like this.
>> Furthermore, since it's a device property that's essentially set at VM
>> creation time, either the source would have to be reset and explicitly
>> set this property to off or the destination would have to be reset and
>> use a newer (>= pc-q35-10.1) machine type before starting it back up and
>> perform the migration.
>
> You can use qom-set to change a device property after you created the
> device. It might even work. However, qom-set is a deeply problematic
> and seriously underdocumented interface. Avoid.
>
> But will you need to change it?
>
> If you started the source with an explicit property value, start the
> destination the same way. Same as for any number of other configuration
> knobs.
>
> If you started the source with the default property value, start the
> destination the same way. Values will match as long as the machine type
> matches, as it should.
>
Given that migration can only be done with matching machine types and if
we can assume that guests using pc-q35-XX.X, for example, will always
have this support, then my concerns about this are allayed.
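In that case the wiring would be something like the sketch below. The
property name, struct field and the exact compat array are placeholders
on my end, not existing code:

/* hw/net/virtio-net.c, in virtio_net_properties[]: default to on for
 * new machine types */
DEFINE_PROP_BOOL("x-iterative-mig", VirtIONet, iterative_mig, true),

/* hw/core/machine.c: force it back off for older machine types via the
 * hw_compat_* array of the previous release, so they keep today's
 * behaviour on both ends */
GlobalProperty hw_compat_<prev>[] = {
    ...
    { "virtio-net-pci", "x-iterative-mig", "off" },
};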
>> Am I understanding this correctly?
>>
>>>>> [...]
>
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC 1/6] migration: Add virtio-iterative capability
2025-08-28 15:29 ` Jonah Palmer
@ 2025-08-29 9:24 ` Markus Armbruster
2025-09-01 14:10 ` Jonah Palmer
0 siblings, 1 reply; 66+ messages in thread
From: Markus Armbruster @ 2025-08-29 9:24 UTC (permalink / raw)
To: Jonah Palmer
Cc: qemu-devel, peterx, farosas, eblake, jasowang, mst, si-wei.liu,
eperezma, boris.ostrovsky
Jonah Palmer <jonah.palmer@oracle.com> writes:
> On 8/27/25 2:37 AM, Markus Armbruster wrote:
>> Jonah Palmer <jonah.palmer@oracle.com> writes:
>>
>>> On 8/26/25 2:11 AM, Markus Armbruster wrote:
>>>> Jonah Palmer <jonah.palmer@oracle.com> writes:
>>>>
>>>>> On 8/25/25 8:44 AM, Markus Armbruster wrote:
>>>>
>>>> [...]
>>>>
>>>>>> Jonah Palmer <jonah.palmer@oracle.com> writes:
>>>>>>
>>>>>>> On 8/8/25 6:48 AM, Markus Armbruster wrote:
>>>>
>>>> [...]
>>>>
>>>>>>>> Jonah Palmer <jonah.palmer@oracle.com> writes:
>>>>>>>>> Adds a new migration capability 'virtio-iterative' that will allow
>>>>>>>>> virtio devices, where supported, to iteratively migrate configuration
>>>>>>>>> changes that occur during the migration process.
>>>>>>>>
>>>>>>>> Why is that desirable?
>>>>>>>
>>>>>>> To be frank, I wasn't sure if having a migration capability, or even
>>>>>>> have it toggleable at all, would be desirable or not. It appears though
>>>>>>> that this might be better off as a per-device feature set via
>>>>>>> --device virtio-net-pci,iterative-mig=on,..., for example.
>>>>>>
>>>>>> See below.
>>>>>>
>>>>>>> And by "iteratively migrate configuration changes" I meant more along
>>>>>>> the lines of the device's state as it continues running on the source.
>>>>>>
>>>>>> Isn't that what migration does always?
>>>>>
>>>>> Essentially yes, but today all of the state is only migrated at the end, once the source has been paused. So the final correct state is always sent to the destination.
>>>>
>>>> As far as I understand (and ignoring lots of detail, including post
>>>> copy), we have three stages:
>>>>
>>>> 1. Source runs, migrate memory pages. Pages that get dirtied after they
>>>> are migrated need to be migrated again.
>>>>
>>>> 2. Neither source or destination runs, migrate remaining memory pages
>>>> and device state.
>>>>
>>>> 3. Destination starts to run.
>>>>
>>>> If the duration of stage 2 (downtime) was of no concern, we'd switch to
>>>> it immediately, i.e. without migrating anything in stage 1. This would
>>>> minimize I/O.
>>>>
>>>> Of course, we actually care for limiting downtime. We switch to stage 2
>>>> when "little enough" is left for stage two to migrate.
>>>>
>>>>> If we're no longer waiting until the source has been paused and the initial state is sent early, then we need to make sure that any changes that happen is still communicated to the destination.
>>>>
>>>> So you're proposing to treat suitable parts of the device state more
>>>> like memory pages. Correct?
>>>>
>>>
>>> Not in the sense of "something got dirtied so let's immediately re-send
>>> that" like we would with RAM. It's more along the lines of "something
>>> got dirtied so let's make sure that gets re-sent at the start of stage 2".
>>
>> Or is it "something might have dirtied, just resend in stage 2"?
>>
>
> Exactly. This is better wording since it doesn't necessarily have to be
> sent at the "start" of stage 2. Just at some point during it.
Got it.
>>> The entire state of a virtio-net device (even with vhost-net /
>>> vhost-vDPA) is <10KB I believe. I don't believe there's much to gain by
>>> "iteratively" re-sending changes for virtio-net. It should be suitable
>>> enough to just re-send whatever changed during stage 1 (after the
>>> initial state was sent) at the start of stage 2.
>>
>> Got it.
>>
>>> This is why I'm currently looking into a solution that uses VMSD's
>>> .early_setup flag (that Peter recommended) rather than implementing a
>>> suite of SaveVMHandlers hooks (like this RFC does). We don't need this
>>> iterative capability as much as we need to start migrating the state
>>> earlier (and doing corresponding config/prep work) during stage 1.
>>>
>>>> Cover letter and commit message of PATCH 4 provide the motivation: you
>>>> observe a shorter downtime. You speculate this is due to moving "heavy
>>>> allocations and page-fault latencies" from stage 2 to stage 1. Correct?
>>>
>>> Correct. But again I'd like to stress that this is just one part in
>>> reducing downtime during stage 2. The biggest reductions will come from
>>> the config/prep work that we're trying to move from stage 2 to stage 1,
>>> especially when vhost-vDPA is involved. And we can only do this early
>>> work once we have the state, hence why we're sending it earlier.
>>
>> This is an important bit of detail I've been missing so far. Easy
>> enough to fix in a future commit message and cover letter.
>>
>
> Ack.
>
>>>> Is there anything that makes virtio-net particularly suitable?
>>>
>>> Yes, especially with vhost-vDPA and configuring VQs. See Eugenio's
>>> comment here
>>> https://lore.kernel.org/qemu-devel/CAJaqyWdUutZrAWKy9d=ip+h+y3BnptUrcL8Xj06XfizNxPtfpw@mail.gmail.com/ .
>>
>> Such prep work commonly depends only on device configuration, not state.
>> I'm curious: what state bits exactly does the prep work need?
>>
>> Device configuration is available at the start of stage 1, state is
>> fully available only at the end of stage 2.
>>
>
> We pretty much need, more or less, all of the state of the VirtIODevice
> itself as well as the bits of the VirtIONet device. Essentially, barring
> ring indices, we'd need whatever is required throughout most of the
> device's startup routine.
>
> In this series, we get everything we need from the vmstate_save_state(f,
> &vmstate_virtio_net, ...) and vmstate_load_state(f, &vmstate_virtio_net,
> ...) calls early during stage 1 (see patch 4/6).
>
> Once we've gotten this data, we can start on the prep work that's
> normally done today during stage 2.
This is unusual. I'd like to understand it better.
Non-migration startup:
1. We create the device. This runs its .init().
2. We configure the device by setting device properties.
3. We realize the device. This runs its .realize(), which initializes
the device state according to its configuration.
4. The guest interacts with the device. Device state changes.
When is the expensive prep work we've been discussing done here?
>> Your patches make *tentative* device state available in stage 1.
>> Tentative, because it may still change afterwards.
>>
>> You use tentative state to do certain expensive work in stage 1 already,
>> in order to cut downtime in stage 2.
>>
>> Fair?
>
> Correct.
Got it.
>> Can state change in ways that invalidate this work?
>>
>
> If, for some reason, the guest wanted to change everything during
> migration (specifically during stage 1), then it'd more or less negate
> the early prep work we'd've done. How impactful this is would depend on
> which route we go (see below). God forbid the guest just wait until
> migration is complete.
So the answer is yes.
>> If yes, how do you handle this?
>>
>
> So it depends on the route this series goes. That is, whether we go the
> truly iterative SaveVMHandlers hooks route (which this series uses) or
> if we go the early_setup VMSD route (which Peter recommended).
>
> ---
>
> If we go the truly iterative route, then technically we can still handle
> these changes during stage 1 and still keep the work out of stage 2.
>
> However, given the nicheness of such a corner case (where things are
> being changed last minute during migration), handling these changes
> iteratively might be overdesign.
>
> And we'd have to guard against the scenario where the guest acts
> maliciously by constantly changing things to prevent migration from
> continuing.
Yes.
> ---
>
> If we go the early_setup VMSD route, where we get one shot early to do
> stuff during stage 1 and one last shot to do things later during stage
> 2, then the more that gets changed means the less beneficial this early
> work becomes. This is because any changes made during stage 1 could only
> be handled during stage 2, which is what this overall effort is trying
> to minimize.
Stupidest solution that could possibly work: if anything impacting the
prep work changed, redo it from scratch.
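A minimal sketch of that idea in C; all names here (EarlyPrep,
maybe_redo_prep, the redo callback) are made up for illustration and are
not from this series:

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical bookkeeping: remember what the stage-1 prep was based on. */
    typedef struct EarlyPrep {
        bool     done;        /* early prep ran during stage 1       */
        uint64_t features;    /* negotiated features it was based on */
        uint16_t num_queues;  /* VQ count it was based on            */
    } EarlyPrep;

    /* Called from the final (stage 2) load path. */
    static void maybe_redo_prep(EarlyPrep *prep,
                                uint64_t final_features,
                                uint16_t final_num_queues,
                                void (*redo)(void))
    {
        if (!prep->done ||
            prep->features != final_features ||
            prep->num_queues != final_num_queues) {
            /* Something the prep depended on changed during stage 1:
             * throw it away and redo the backend configuration now. */
            redo();
        }
        /* else: the early prep still holds, nothing left to do in stage 2. */
    }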
>> If no, do you verify the "no change" design assumption holds?
>>
>
> When it comes to early migration for devices, we can never support a "no
> change" design assumption. The difference in the design lies in how (and
> when) such changes are handled during migration.
Just checking :)
[...]
>>>>>>> But I think for now this is better left as a virtio-net configuration
>>>>>>> rather than as a migration capability (e.g. --device
>>>>>>> virtio-net-pci,iterative-mig=on/off,...)
>>>>>>
>>>>>> Makes sense to me (but I'm not a migration expert).
>>>>
>>>> A device property's default can depend on the machine type via compat
>>>> properties. This is normally used to restrict a guest-visible change to
>>>> newer machine types. Here, it's not guest-visible. But it can get you
>>>> this:
>>>>
>>>> * Migrate new machine type from new QEMU to new QEMU (old QEMU doesn't
>>>> have the machine type): iterative is enabled by default. Good. User
>>>> can disable it on both ends to not get the improvement. Enabling it
>>>> on just one breaks migration.
>>>>
>>>> All other cases go away with time.
>>>>
>>>> * Migrate old machine type from new QEMU to new QEMU: iterative is
>>>> disabled by default, which is sad, but no worse than before. User can
>>>> enable it on both ends to get the improvement. Enabling it on just
>>>> one breaks migration.
>>>>
>>>> * Migrate old machine type from new QEMU to old QEMU or vice versa:
>>>> iterative is off by default. Good. Enabling it on the new one breaks
>>>> migration.
>>>>
>>>> * Migrate old machine type from old QEMU to old QEMU: iterative is off
>>>>
>>>> I figure almost all users could simply ignore this configuration knob
>>>> then.
>>>
>>> Oh, that's interesting. I wasn't aware of this. But couldn't this
>>> potentially cause some headaches and confusion when attempting to
>>> migrate between 2 guests where one VM is using a machine type does
>>> support it and the other isn't?
>>>
>>> For example, the source and destination VMs both specify '-machine
>>> q35,...' and the q35 alias resolves into, say, pc-q35-10.1 for the
>>> source VM and pc-q35-10.0 for the destination VM. And say this property
>>> is supported on >= pc-q35-10.1.
>>
>> In my understanding, migration requires identical machine types on both
>> ends, and all bets are off when they're different.
>>
>
> Ah, true.
>
>>> IIUC, this would mean that iterative is enabled by default on the source
>>> VM but disabled by default on the destination VM.
>>>
>>> Then a user attempts the migration, the migration fails, and then they'd
>>> have to try and figure out why it's failing.
>>
>> Migration failures due to mismatched configuration tend to be that way,
>> don't they?
>>
>
> Right.
>
> So if we pin this feature to always be enabled for machine type, say, >=
> pc-q35-XX.X, then can we assume that both guests can actually support
> this feature?
>
> In other words, conversely, is it possible in production that both
> guests use pc-q35-XX.X but one build supports this early migration
> feature and the other doesn't?
I'd call that a bug.
Here's how we commonly code property defaults depending on the machine
type.
The property defaults to the new default (here: feature enabled).
Machine types older than the current (unreleased) one use a compat
property to change it to the old default (here: feature disabled). With
this value, the device must be compatible with its older versions in
prior releases of QEMU, both for the guest and for migration.
Once you got that right, it's fairly unlikely to break accidentally.
The current machine type then defaults the feature to enabled in the
current and all future versions of QEMU. The machine type doesn't exist
in older versions of QEMU.
Older machine types default it to disabled in the current and all future
versions of QEMU, which is compatible with older versions of QEMU.
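Concretely it is just two pieces; the property and field names below
("x-early-mig", early_mig) are illustrative only, not from this series:

    /* hw/net/virtio-net.c: the property defaults to the new behaviour */
    DEFINE_PROP_BOOL("x-early-mig", VirtIONet, early_mig, true),

    /* hw/core/machine.c: an entry in the compat array for the previous
     * release (e.g. hw_compat_10_0[]) flips older machine types back to
     * the old default */
    { "virtio-net-pci", "x-early-mig", "off" },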
> If we can assume that, then this would probably be the right approach
> for something like this.
>
>>> Furthermore, since it's a device property that's essentially set at VM
>>> creation time, either the source would have to be reset and explicitly
>>> set this property to off or the destination would have to be reset and
>>> use a newer (>= pc-q35-10.1) machine type before starting it back up and
>>> perform the migration.
>>
>> You can use qom-set to change a device property after you created the
>> device. It might even work. However, qom-set is a deeply problematic
>> and seriously underdocumented interface. Avoid.
>>
>> But will you need to change it?
>>
>> If you started the source with an explicit property value, start the
>> destination the same way. Same as for any number of other configuration
>> knobs.
>>
>> If you started the source with the default property value, start the
>> destination the same way. Values will match as long as the machine type
>> matches, as it should.
>>
>
> Given that migration can only be done with matching machine types and if
> we can assume that guests using pc-q35-XX.X, for example, will always
> have this support, then my concerns about this are allayed.
Glad I was able to assist here!
>>> Am I understanding this correctly?
>>>
>>>>>> [...]
>>
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC 5/6] virtio,virtio-net: skip consistency check in virtio_load for iterative migration
2025-08-27 16:55 ` Jonah Palmer
@ 2025-09-01 6:57 ` Eugenio Perez Martin
2025-09-01 13:17 ` Jonah Palmer
0 siblings, 1 reply; 66+ messages in thread
From: Eugenio Perez Martin @ 2025-09-01 6:57 UTC (permalink / raw)
To: Jonah Palmer
Cc: Peter Xu, si-wei.liu, qemu-devel, farosas, eblake, armbru,
jasowang, mst, boris.ostrovsky, Dragos Tatulea DE
On Wed, Aug 27, 2025 at 6:56 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
>
>
>
> On 8/20/25 3:59 AM, Eugenio Perez Martin wrote:
> > On Tue, Aug 19, 2025 at 5:11 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
> >>
> >>
> >>
> >> On 8/19/25 3:10 AM, Eugenio Perez Martin wrote:
> >>> On Mon, Aug 18, 2025 at 4:46 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 8/18/25 2:51 AM, Eugenio Perez Martin wrote:
> >>>>> On Fri, Aug 15, 2025 at 4:50 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On 8/14/25 5:28 AM, Eugenio Perez Martin wrote:
> >>>>>>> On Wed, Aug 13, 2025 at 4:06 PM Peter Xu <peterx@redhat.com> wrote:
> >>>>>>>>
> >>>>>>>> On Wed, Aug 13, 2025 at 11:25:00AM +0200, Eugenio Perez Martin wrote:
> >>>>>>>>> On Mon, Aug 11, 2025 at 11:56 PM Peter Xu <peterx@redhat.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>> On Mon, Aug 11, 2025 at 05:26:05PM -0400, Jonah Palmer wrote:
> >>>>>>>>>>> This effort was started to reduce the guest visible downtime by
> >>>>>>>>>>> virtio-net/vhost-net/vhost-vDPA during live migration, especially
> >>>>>>>>>>> vhost-vDPA.
> >>>>>>>>>>>
> >>>>>>>>>>> The downtime contributed by vhost-vDPA, for example, is not from having to
> >>>>>>>>>>> migrate a lot of state but rather expensive backend control-plane latency
> >>>>>>>>>>> like CVQ configurations (e.g. MQ queue pairs, RSS, MAC/VLAN filters, offload
> >>>>>>>>>>> settings, MTU, etc.). Doing this requires kernel/HW NIC operations which
> >>>>>>>>>>> dominates its downtime.
> >>>>>>>>>>>
> >>>>>>>>>>> In other words, by migrating the state of virtio-net early (before the
> >>>>>>>>>>> stop-and-copy phase), we can also start staging backend configurations,
> >>>>>>>>>>> which is the main contributor of downtime when migrating a vhost-vDPA
> >>>>>>>>>>> device.
> >>>>>>>>>>>
> >>>>>>>>>>> I apologize if this series gives the impression that we're migrating a lot
> >>>>>>>>>>> of data here. It's more along the lines of moving control-plane latency out
> >>>>>>>>>>> of the stop-and-copy phase.
> >>>>>>>>>>
> >>>>>>>>>> I see, thanks.
> >>>>>>>>>>
> >>>>>>>>>> Please add these into the cover letter of the next post. IMHO it's
> >>>>>>>>>> extremely important information to explain the real goal of this work. I
> >>>>>>>>>> bet it is not expected for most people when reading the current cover
> >>>>>>>>>> letter.
> >>>>>>>>>>
> >>>>>>>>>> Then it could have nothing to do with iterative phase, am I right?
> >>>>>>>>>>
> >>>>>>>>>> What are the data needed for the dest QEMU to start staging backend
> >>>>>>>>>> configurations to the HWs underneath? Does dest QEMU already have them in
> >>>>>>>>>> the cmdlines?
> >>>>>>>>>>
> >>>>>>>>>> Asking this because I want to know whether it can be done completely
> >>>>>>>>>> without src QEMU at all, e.g. when dest QEMU starts.
> >>>>>>>>>>
> >>>>>>>>>> If src QEMU's data is still needed, please also first consider providing
> >>>>>>>>>> such facility using an "early VMSD" if it is ever possible: feel free to
> >>>>>>>>>> refer to commit 3b95a71b22827d26178.
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> While it works for this series, it does not allow to resend the state
> >>>>>>>>> when the src device changes. For example, if the number of virtqueues
> >>>>>>>>> is modified.
> >>>>>>>>
> >>>>>>>> Some explanation on "how sync number of vqueues helps downtime" would help.
> >>>>>>>> Not "it might preheat things", but exactly why, and how that differs when
> >>>>>>>> it's pure software, and when hardware will be involved.
> >>>>>>>>
> >>>>>>>
> >>>>>>> By nvidia engineers to configure vqs (number, size, RSS, etc) takes
> >>>>>>> about ~200ms:
> >>>>>>> https://lore.kernel.org/qemu-devel/6c8ebb97-d546-3f1c-4cdd-54e23a566f61@nvidia.com/T/
> >>>>>>>
> >>>>>>> Adding Dragos here in case he can provide more details. Maybe the
> >>>>>>> numbers have changed though.
> >>>>>>>
> >>>>>>> And I guess the difference with pure SW will always come down to PCI
> >>>>>>> communications, which assume it is slower than configuring the host SW
> >>>>>>> device in RAM or even CPU cache. But I admin that proper profiling is
> >>>>>>> needed before making those claims.
> >>>>>>>
> >>>>>>> Jonah, can you print the time it takes to configure the vDPA device
> >>>>>>> with traces vs the time it takes to enable the dataplane of the
> >>>>>>> device? So we can get an idea of how much time we save with this.
> >>>>>>>
> >>>>>>
> >>>>>> Let me know if this isn't what you're looking for.
> >>>>>>
> >>>>>> I'm assuming by "configuration time" you mean:
> >>>>>> - Time from device startup (entry to vhost_vdpa_dev_start()) to right
> >>>>>> before we start enabling the vrings (e.g.
> >>>>>> VHOST_VDPA_SET_VRING_ENABLE in vhost_vdpa_net_cvq_load()).
> >>>>>>
> >>>>>> And by "time taken to enable the dataplane" I'm assuming you mean:
> >>>>>> - Time right before we start enabling the vrings (see above) to right
> >>>>>> after we enable the last vring (at the end of
> >>>>>> vhost_vdpa_net_cvq_load())
> >>>>>>
> >>>>>> Guest specs: 128G Mem, SVQ=on, CVQ=on, 8 queue pairs:
> >>>>>>
> >>>>>> -netdev type=vhost-vdpa,vhostdev=$VHOST_VDPA_0,id=vhost-vdpa0,
> >>>>>> queues=8,x-svq=on
> >>>>>>
> >>>>>> -device virtio-net-pci,netdev=vhost-vdpa0,id=vdpa0,bootindex=-1,
> >>>>>> romfile=,page-per-vq=on,mac=$VF1_MAC,ctrl_vq=on,mq=on,
> >>>>>> ctrl_vlan=off,vectors=18,host_mtu=9000,
> >>>>>> disable-legacy=on,disable-modern=off
> >>>>>>
> >>>>>> ---
> >>>>>>
> >>>>>> Configuration time: ~31s
> >>>>>> Dataplane enable time: ~0.14ms
> >>>>>>
> >>>>>
> >>>>> I was vague, but yes, that's representative enough! It would be more
> >>>>> accurate if the configuration time ends by the time QEMU enables the
> >>>>> first queue of the dataplane though.
> >>>>>
> >>>>> As Si-Wei mentions, is v->shared->listener_registered == true at the
> >>>>> beginning of vhost_vdpa_dev_start?
> >>>>>
> >>>>
> >>>> Ah, I also realized that Qemu I was using for measurements was using a
> >>>> version before the listener_registered member was introduced.
> >>>>
> >>>> I retested with the latest changes in Qemu and set x-svq=off, e.g.:
> >>>> guest specs: 128G Mem, SVQ=off, CVQ=on, 8 queue pairs. I ran testing 3
> >>>> times for measurements.
> >>>>
> >>>> v->shared->listener_registered == false at the beginning of
> >>>> vhost_vdpa_dev_start().
> >>>>
> >>>
> >>> Let's move out the effect of the mem pinning from the downtime by
> >>> registering the listener before the migration. Can you check why is it
> >>> not registered at vhost_vdpa_set_owner?
> >>>
> >>
> >> Sorry I was profiling improperly. The listener is registered at
> >> vhost_vdpa_set_owner initially and v->shared->listener_registered is set
> >> to true, but once we reach the first vhost_vdpa_dev_start call, it shows
> >> as false and is re-registered later in the function.
> >>
> >> Should we always expect listener_registered == true at every
> >> vhost_vdpa_dev_start call during startup?
> >
> > Yes, that leaves all the memory pinning time out of the downtime.
> >
> >> This is what I traced during
> >> startup of a single guest (no migration).
> >
> > We can trace the destination's QEMU to be more accurate, but probably
> > it makes no difference.
> >
> >> Tracepoint is right at the
> >> start of the vhost_vdpa_dev_start function:
> >>
> >> vhost_vdpa_set_owner() - register memory listener
> >> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> >> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
> >> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> >
> > This is surprising. Can you trace how listener_registered goes to 0 again?
> >
> >> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> >> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> >> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> >> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> >> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> >> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> >> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> >> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> >> ...
> >> * VQs are now being enabled *
> >>
> >> I'm also seeing that when the guest is being shutdown,
> >> dev->vhost_ops->vhost_get_vring_base() is failing in
> >> do_vhost_virtqueue_stop():
> >>
> >> ...
> >> [ 114.718429] systemd-shutdown[1]: Syncing filesystems and block devices.
> >> [ 114.719255] systemd-shutdown[1]: Powering off.
> >> [ 114.719916] sd 0:0:0:0: [sda] Synchronizing SCSI cache
> >> [ 114.724826] ACPI: PM: Preparing to enter system sleep state S5
> >> [ 114.725593] reboot: Power down
> >> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
> >> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
> >> qemu-system-x86_64: vhost VQ 2 ring restore failed: -1: Operation not
> >> permitted (1)
> >> qemu-system-x86_64: vhost VQ 3 ring restore failed: -1: Operation not
> >> permitted (1)
> >> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
> >> qemu-system-x86_64: vhost VQ 4 ring restore failed: -1: Operation not
> >> permitted (1)
> >> qemu-system-x86_64: vhost VQ 5 ring restore failed: -1: Operation not
> >> permitted (1)
> >> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
> >> qemu-system-x86_64: vhost VQ 6 ring restore failed: -1: Operation not
> >> permitted (1)
> >> qemu-system-x86_64: vhost VQ 7 ring restore failed: -1: Operation not
> >> permitted (1)
> >> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
> >> qemu-system-x86_64: vhost VQ 8 ring restore failed: -1: Operation not
> >> permitted (1)
> >> qemu-system-x86_64: vhost VQ 9 ring restore failed: -1: Operation not
> >> permitted (1)
> >> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
> >> qemu-system-x86_64: vhost VQ 10 ring restore failed: -1: Operation not
> >> permitted (1)
> >> qemu-system-x86_64: vhost VQ 11 ring restore failed: -1: Operation not
> >> permitted (1)
> >> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
> >> qemu-system-x86_64: vhost VQ 12 ring restore failed: -1: Operation not
> >> permitted (1)
> >> qemu-system-x86_64: vhost VQ 13 ring restore failed: -1: Operation not
> >> permitted (1)
> >> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
> >> qemu-system-x86_64: vhost VQ 14 ring restore failed: -1: Operation not
> >> permitted (1)
> >> qemu-system-x86_64: vhost VQ 15 ring restore failed: -1: Operation not
> >> permitted (1)
> >> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
> >>
> >> However when x-svq=on, I don't see these errors on shutdown.
> >>
> >
> > SVQ can mask this error as it does not need to forward the ring
> > restore message to the device. It can just start with 0 and convert
> > indexes.
> >
> > Let's focus on listened_registered first :).
> >
> >>>> ---
> >>>>
> >>>> Configuration time: Time from first entry into vhost_vdpa_dev_start() to
> >>>> right after Qemu enables the first VQ.
> >>>> - 26.947s, 26.606s, 27.326s
> >>>>
> >>>> Enable dataplane: Time from right after first VQ is enabled to right
> >>>> after the last VQ is enabled.
> >>>> - 0.081ms, 0.081ms, 0.079ms
> >>>>
> >>>
> >>
> >
>
> I looked into this a bit more and realized I was being naive thinking
> that the vhost-vDPA device startup path of a single VM would be the same
> as that on a destination VM during live migration. This is **not** the
> case and I apologize for the confusion I caused.
>
> What I described and profiled above is indeed true for the startup of a
> single VM / source VM with a vhost-vDPA device. However, this is not
> true on the destination side and its configuration time is drastically
> different.
>
> Under the same specs, but now with a live migration performed between a
> source and destination VM (128G Mem, SVQ=off, CVQ=on, 8 queue pairs),
> and using the same tracepoints to find the configuration time and enable
> dataplane time, these are the measurements I found for the **destination
> VM**:
>
> Configuration time: Time from first entry into vhost_vdpa_dev_start to
> right after Qemu enables the first VQ.
> - 268.603ms, 241.515ms, 249.007ms
>
> Enable dataplane time: Time from right after the first VQ is enabled to
> right after the last VQ is enabled.
> - 0.072ms, 0.071ms, 0.070ms
>
> ---
>
> For those curious, using the same printouts as I did above, this is what
> it actually looks like on the destination side:
>
> * Destination VM is started *
>
> vhost_vdpa_set_owner() - register memory listener
> vhost_vdpa_reset_device() - unregistering listener
>
> * Start live migration on source VM *
> (qemu) migrate unix:/tmp/lm.sock
> ...
>
> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> vhost_vdpa_dev_start() - register listener
>
That's weird. Can you check why the memory listener is not registered
at vhost_vdpa_set_owner? Or, if it is registered, why it is no longer
registered by the time vhost_vdpa_dev_start is called? This affects the
downtime a lot; more than half of the time is spent on this, so it is
worth fixing before continuing.
> And this is very different than the churning we saw in my previous email
> that happens on the source / single guest VM with vhost-vDPA and its
> startup path.
>
> ---
>
> Again, apologies on the confusion this caused. This was my fault for not
> being more careful.
>
No worries!
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC 5/6] virtio,virtio-net: skip consistency check in virtio_load for iterative migration
2025-09-01 6:57 ` Eugenio Perez Martin
@ 2025-09-01 13:17 ` Jonah Palmer
2025-09-02 7:31 ` Eugenio Perez Martin
0 siblings, 1 reply; 66+ messages in thread
From: Jonah Palmer @ 2025-09-01 13:17 UTC (permalink / raw)
To: Eugenio Perez Martin
Cc: Peter Xu, si-wei.liu, qemu-devel, farosas, eblake, armbru,
jasowang, mst, boris.ostrovsky, Dragos Tatulea DE
On 9/1/25 2:57 AM, Eugenio Perez Martin wrote:
> On Wed, Aug 27, 2025 at 6:56 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
>>
>>
>>
>> On 8/20/25 3:59 AM, Eugenio Perez Martin wrote:
>>> On Tue, Aug 19, 2025 at 5:11 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
>>>>
>>>>
>>>>
>>>> On 8/19/25 3:10 AM, Eugenio Perez Martin wrote:
>>>>> On Mon, Aug 18, 2025 at 4:46 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 8/18/25 2:51 AM, Eugenio Perez Martin wrote:
>>>>>>> On Fri, Aug 15, 2025 at 4:50 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 8/14/25 5:28 AM, Eugenio Perez Martin wrote:
>>>>>>>>> On Wed, Aug 13, 2025 at 4:06 PM Peter Xu <peterx@redhat.com> wrote:
>>>>>>>>>>
>>>>>>>>>> On Wed, Aug 13, 2025 at 11:25:00AM +0200, Eugenio Perez Martin wrote:
>>>>>>>>>>> On Mon, Aug 11, 2025 at 11:56 PM Peter Xu <peterx@redhat.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Aug 11, 2025 at 05:26:05PM -0400, Jonah Palmer wrote:
>>>>>>>>>>>>> This effort was started to reduce the guest visible downtime by
>>>>>>>>>>>>> virtio-net/vhost-net/vhost-vDPA during live migration, especially
>>>>>>>>>>>>> vhost-vDPA.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The downtime contributed by vhost-vDPA, for example, is not from having to
>>>>>>>>>>>>> migrate a lot of state but rather expensive backend control-plane latency
>>>>>>>>>>>>> like CVQ configurations (e.g. MQ queue pairs, RSS, MAC/VLAN filters, offload
>>>>>>>>>>>>> settings, MTU, etc.). Doing this requires kernel/HW NIC operations which
>>>>>>>>>>>>> dominates its downtime.
>>>>>>>>>>>>>
>>>>>>>>>>>>> In other words, by migrating the state of virtio-net early (before the
>>>>>>>>>>>>> stop-and-copy phase), we can also start staging backend configurations,
>>>>>>>>>>>>> which is the main contributor of downtime when migrating a vhost-vDPA
>>>>>>>>>>>>> device.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I apologize if this series gives the impression that we're migrating a lot
>>>>>>>>>>>>> of data here. It's more along the lines of moving control-plane latency out
>>>>>>>>>>>>> of the stop-and-copy phase.
>>>>>>>>>>>>
>>>>>>>>>>>> I see, thanks.
>>>>>>>>>>>>
>>>>>>>>>>>> Please add these into the cover letter of the next post. IMHO it's
>>>>>>>>>>>> extremely important information to explain the real goal of this work. I
>>>>>>>>>>>> bet it is not expected for most people when reading the current cover
>>>>>>>>>>>> letter.
>>>>>>>>>>>>
>>>>>>>>>>>> Then it could have nothing to do with iterative phase, am I right?
>>>>>>>>>>>>
>>>>>>>>>>>> What are the data needed for the dest QEMU to start staging backend
>>>>>>>>>>>> configurations to the HWs underneath? Does dest QEMU already have them in
>>>>>>>>>>>> the cmdlines?
>>>>>>>>>>>>
>>>>>>>>>>>> Asking this because I want to know whether it can be done completely
>>>>>>>>>>>> without src QEMU at all, e.g. when dest QEMU starts.
>>>>>>>>>>>>
>>>>>>>>>>>> If src QEMU's data is still needed, please also first consider providing
>>>>>>>>>>>> such facility using an "early VMSD" if it is ever possible: feel free to
>>>>>>>>>>>> refer to commit 3b95a71b22827d26178.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> While it works for this series, it does not allow to resend the state
>>>>>>>>>>> when the src device changes. For example, if the number of virtqueues
>>>>>>>>>>> is modified.
>>>>>>>>>>
>>>>>>>>>> Some explanation on "how sync number of vqueues helps downtime" would help.
>>>>>>>>>> Not "it might preheat things", but exactly why, and how that differs when
>>>>>>>>>> it's pure software, and when hardware will be involved.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> By nvidia engineers to configure vqs (number, size, RSS, etc) takes
>>>>>>>>> about ~200ms:
>>>>>>>>> https://lore.kernel.org/qemu-devel/6c8ebb97-d546-3f1c-4cdd-54e23a566f61@nvidia.com/T/
>>>>>>>>>
>>>>>>>>> Adding Dragos here in case he can provide more details. Maybe the
>>>>>>>>> numbers have changed though.
>>>>>>>>>
>>>>>>>>> And I guess the difference with pure SW will always come down to PCI
>>>>>>>>> communications, which assume it is slower than configuring the host SW
>>>>>>>>> device in RAM or even CPU cache. But I admin that proper profiling is
>>>>>>>>> needed before making those claims.
>>>>>>>>>
>>>>>>>>> Jonah, can you print the time it takes to configure the vDPA device
>>>>>>>>> with traces vs the time it takes to enable the dataplane of the
>>>>>>>>> device? So we can get an idea of how much time we save with this.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Let me know if this isn't what you're looking for.
>>>>>>>>
>>>>>>>> I'm assuming by "configuration time" you mean:
>>>>>>>> - Time from device startup (entry to vhost_vdpa_dev_start()) to right
>>>>>>>> before we start enabling the vrings (e.g.
>>>>>>>> VHOST_VDPA_SET_VRING_ENABLE in vhost_vdpa_net_cvq_load()).
>>>>>>>>
>>>>>>>> And by "time taken to enable the dataplane" I'm assuming you mean:
>>>>>>>> - Time right before we start enabling the vrings (see above) to right
>>>>>>>> after we enable the last vring (at the end of
>>>>>>>> vhost_vdpa_net_cvq_load())
>>>>>>>>
>>>>>>>> Guest specs: 128G Mem, SVQ=on, CVQ=on, 8 queue pairs:
>>>>>>>>
>>>>>>>> -netdev type=vhost-vdpa,vhostdev=$VHOST_VDPA_0,id=vhost-vdpa0,
>>>>>>>> queues=8,x-svq=on
>>>>>>>>
>>>>>>>> -device virtio-net-pci,netdev=vhost-vdpa0,id=vdpa0,bootindex=-1,
>>>>>>>> romfile=,page-per-vq=on,mac=$VF1_MAC,ctrl_vq=on,mq=on,
>>>>>>>> ctrl_vlan=off,vectors=18,host_mtu=9000,
>>>>>>>> disable-legacy=on,disable-modern=off
>>>>>>>>
>>>>>>>> ---
>>>>>>>>
>>>>>>>> Configuration time: ~31s
>>>>>>>> Dataplane enable time: ~0.14ms
>>>>>>>>
>>>>>>>
>>>>>>> I was vague, but yes, that's representative enough! It would be more
>>>>>>> accurate if the configuration time ends by the time QEMU enables the
>>>>>>> first queue of the dataplane though.
>>>>>>>
>>>>>>> As Si-Wei mentions, is v->shared->listener_registered == true at the
>>>>>>> beginning of vhost_vdpa_dev_start?
>>>>>>>
>>>>>>
>>>>>> Ah, I also realized that Qemu I was using for measurements was using a
>>>>>> version before the listener_registered member was introduced.
>>>>>>
>>>>>> I retested with the latest changes in Qemu and set x-svq=off, e.g.:
>>>>>> guest specs: 128G Mem, SVQ=off, CVQ=on, 8 queue pairs. I ran testing 3
>>>>>> times for measurements.
>>>>>>
>>>>>> v->shared->listener_registered == false at the beginning of
>>>>>> vhost_vdpa_dev_start().
>>>>>>
>>>>>
>>>>> Let's move out the effect of the mem pinning from the downtime by
>>>>> registering the listener before the migration. Can you check why is it
>>>>> not registered at vhost_vdpa_set_owner?
>>>>>
>>>>
>>>> Sorry I was profiling improperly. The listener is registered at
>>>> vhost_vdpa_set_owner initially and v->shared->listener_registered is set
>>>> to true, but once we reach the first vhost_vdpa_dev_start call, it shows
>>>> as false and is re-registered later in the function.
>>>>
>>>> Should we always expect listener_registered == true at every
>>>> vhost_vdpa_dev_start call during startup?
>>>
>>> Yes, that leaves all the memory pinning time out of the downtime.
>>>
>>>> This is what I traced during
>>>> startup of a single guest (no migration).
>>>
>>> We can trace the destination's QEMU to be more accurate, but probably
>>> it makes no difference.
>>>
>>>> Tracepoint is right at the
>>>> start of the vhost_vdpa_dev_start function:
>>>>
>>>> vhost_vdpa_set_owner() - register memory listener
>>>> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
>>>> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
>>>> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
>>>
>>> This is surprising. Can you trace how listener_registered goes to 0 again?
>>>
>>>> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
>>>> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
>>>> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
>>>> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
>>>> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
>>>> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
>>>> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
>>>> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
>>>> ...
>>>> * VQs are now being enabled *
>>>>
>>>> I'm also seeing that when the guest is being shutdown,
>>>> dev->vhost_ops->vhost_get_vring_base() is failing in
>>>> do_vhost_virtqueue_stop():
>>>>
>>>> ...
>>>> [ 114.718429] systemd-shutdown[1]: Syncing filesystems and block devices.
>>>> [ 114.719255] systemd-shutdown[1]: Powering off.
>>>> [ 114.719916] sd 0:0:0:0: [sda] Synchronizing SCSI cache
>>>> [ 114.724826] ACPI: PM: Preparing to enter system sleep state S5
>>>> [ 114.725593] reboot: Power down
>>>> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
>>>> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
>>>> qemu-system-x86_64: vhost VQ 2 ring restore failed: -1: Operation not
>>>> permitted (1)
>>>> qemu-system-x86_64: vhost VQ 3 ring restore failed: -1: Operation not
>>>> permitted (1)
>>>> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
>>>> qemu-system-x86_64: vhost VQ 4 ring restore failed: -1: Operation not
>>>> permitted (1)
>>>> qemu-system-x86_64: vhost VQ 5 ring restore failed: -1: Operation not
>>>> permitted (1)
>>>> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
>>>> qemu-system-x86_64: vhost VQ 6 ring restore failed: -1: Operation not
>>>> permitted (1)
>>>> qemu-system-x86_64: vhost VQ 7 ring restore failed: -1: Operation not
>>>> permitted (1)
>>>> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
>>>> qemu-system-x86_64: vhost VQ 8 ring restore failed: -1: Operation not
>>>> permitted (1)
>>>> qemu-system-x86_64: vhost VQ 9 ring restore failed: -1: Operation not
>>>> permitted (1)
>>>> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
>>>> qemu-system-x86_64: vhost VQ 10 ring restore failed: -1: Operation not
>>>> permitted (1)
>>>> qemu-system-x86_64: vhost VQ 11 ring restore failed: -1: Operation not
>>>> permitted (1)
>>>> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
>>>> qemu-system-x86_64: vhost VQ 12 ring restore failed: -1: Operation not
>>>> permitted (1)
>>>> qemu-system-x86_64: vhost VQ 13 ring restore failed: -1: Operation not
>>>> permitted (1)
>>>> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
>>>> qemu-system-x86_64: vhost VQ 14 ring restore failed: -1: Operation not
>>>> permitted (1)
>>>> qemu-system-x86_64: vhost VQ 15 ring restore failed: -1: Operation not
>>>> permitted (1)
>>>> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
>>>>
>>>> However when x-svq=on, I don't see these errors on shutdown.
>>>>
>>>
>>> SVQ can mask this error as it does not need to forward the ring
>>> restore message to the device. It can just start with 0 and convert
>>> indexes.
>>>
>>> Let's focus on listened_registered first :).
>>>
>>>>>> ---
>>>>>>
>>>>>> Configuration time: Time from first entry into vhost_vdpa_dev_start() to
>>>>>> right after Qemu enables the first VQ.
>>>>>> - 26.947s, 26.606s, 27.326s
>>>>>>
>>>>>> Enable dataplane: Time from right after first VQ is enabled to right
>>>>>> after the last VQ is enabled.
>>>>>> - 0.081ms, 0.081ms, 0.079ms
>>>>>>
>>>>>
>>>>
>>>
>>
>> I looked into this a bit more and realized I was being naive thinking
>> that the vhost-vDPA device startup path of a single VM would be the same
>> as that on a destination VM during live migration. This is **not** the
>> case and I apologize for the confusion I caused.
>>
>> What I described and profiled above is indeed true for the startup of a
>> single VM / source VM with a vhost-vDPA device. However, this is not
>> true on the destination side and its configuration time is drastically
>> different.
>>
>> Under the same specs, but now with a live migration performed between a
>> source and destination VM (128G Mem, SVQ=off, CVQ=on, 8 queue pairs),
>> and using the same tracepoints to find the configuration time and enable
>> dataplane time, these are the measurements I found for the **destination
>> VM**:
>>
>> Configuration time: Time from first entry into vhost_vdpa_dev_start to
>> right after Qemu enables the first VQ.
>> - 268.603ms, 241.515ms, 249.007ms
>>
>> Enable dataplane time: Time from right after the first VQ is enabled to
>> right after the last VQ is enabled.
>> - 0.072ms, 0.071ms, 0.070ms
>>
>> ---
>>
>> For those curious, using the same printouts as I did above, this is what
>> it actually looks like on the destination side:
>>
>> * Destination VM is started *
>>
>> vhost_vdpa_set_owner() - register memory listener
>> vhost_vdpa_reset_device() - unregistering listener
>>
>> * Start live migration on source VM *
>> (qemu) migrate unix:/tmp/lm.sock
>> ...
>>
>> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
>> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
>> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
>> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
>> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
>> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
>> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
>> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
>> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
>> vhost_vdpa_dev_start() - register listener
>>
>
> That's weird. Can you check why the memory listener is not registered
> at vhost_vdpa_set_owner? Or, if it is registered, why it is no longer
> registered by the time vhost_vdpa_dev_start is called? This affects the
> downtime a lot; more than half of the time is spent on this, so it is
> worth fixing before continuing.
>
The memory listener is registered at vhost_vdpa_set_owner, but the
reason we see v->shared->listener_registered == 0 by the time
vhost_vdpa_dev_start is called is the vhost_vdpa_reset_device call made
shortly after, which unregisters it.
But this re-registration is relatively quick compared to how long the
registration takes during the VM's initial startup sequence.
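So the ordering observed on the destination is roughly:

    vhost_vdpa_set_owner()     -> memory_listener_register(),
                                  listener_registered = true
    vhost_vdpa_reset_device()  -> memory_listener_unregister(),
                                  listener_registered = false
    ...
    vhost_vdpa_dev_start()     -> sees listener_registered == false and
                                  registers the listener again, which is
                                  where the pinning cost lands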
>> And this is very different than the churning we saw in my previous email
>> that happens on the source / single guest VM with vhost-vDPA and its
>> startup path.
>>
>> ---
>>
>> Again, apologies on the confusion this caused. This was my fault for not
>> being more careful.
>>
>
> No worries!
>
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC 1/6] migration: Add virtio-iterative capability
2025-08-29 9:24 ` Markus Armbruster
@ 2025-09-01 14:10 ` Jonah Palmer
0 siblings, 0 replies; 66+ messages in thread
From: Jonah Palmer @ 2025-09-01 14:10 UTC (permalink / raw)
To: Markus Armbruster
Cc: qemu-devel, peterx, farosas, eblake, jasowang, mst, si-wei.liu,
eperezma, boris.ostrovsky
On 8/29/25 5:24 AM, Markus Armbruster wrote:
> Jonah Palmer <jonah.palmer@oracle.com> writes:
>
>> On 8/27/25 2:37 AM, Markus Armbruster wrote:
>>> Jonah Palmer <jonah.palmer@oracle.com> writes:
>>>
>>>> On 8/26/25 2:11 AM, Markus Armbruster wrote:
>>>>> Jonah Palmer <jonah.palmer@oracle.com> writes:
>>>>>
>>>>>> On 8/25/25 8:44 AM, Markus Armbruster wrote:
>>>>>
>>>>> [...]
>>>>>
>>>>>>> Jonah Palmer <jonah.palmer@oracle.com> writes:
>>>>>>>
>>>>>>>> On 8/8/25 6:48 AM, Markus Armbruster wrote:
>>>>>
>>>>> [...]
>>>>>
>>>>>>>>> Jonah Palmer <jonah.palmer@oracle.com> writes:
>>>>>>>>>> Adds a new migration capability 'virtio-iterative' that will allow
>>>>>>>>>> virtio devices, where supported, to iteratively migrate configuration
>>>>>>>>>> changes that occur during the migration process.
>>>>>>>>>
>>>>>>>>> Why is that desirable?
>>>>>>>>
>>>>>>>> To be frank, I wasn't sure if having a migration capability, or even
>>>>>>>> have it toggleable at all, would be desirable or not. It appears though
>>>>>>>> that this might be better off as a per-device feature set via
>>>>>>>> --device virtio-net-pci,iterative-mig=on,..., for example.
>>>>>>>
>>>>>>> See below.
>>>>>>>
>>>>>>>> And by "iteratively migrate configuration changes" I meant more along
>>>>>>>> the lines of the device's state as it continues running on the source.
>>>>>>>
>>>>>>> Isn't that what migration does always?
>>>>>>
>>>>>> Essentially yes, but today all of the state is only migrated at the end, once the source has been paused. So the final correct state is always sent to the destination.
>>>>>
>>>>> As far as I understand (and ignoring lots of detail, including post
>>>>> copy), we have three stages:
>>>>>
>>>>> 1. Source runs, migrate memory pages. Pages that get dirtied after they
>>>>> are migrated need to be migrated again.
>>>>>
>>>>> 2. Neither source or destination runs, migrate remaining memory pages
>>>>> and device state.
>>>>>
>>>>> 3. Destination starts to run.
>>>>>
>>>>> If the duration of stage 2 (downtime) was of no concern, we'd switch to
>>>>> it immediately, i.e. without migrating anything in stage 1. This would
>>>>> minimize I/O.
>>>>>
>>>>> Of course, we actually care for limiting downtime. We switch to stage 2
>>>>> when "little enough" is left for stage two to migrate.
>>>>>
>>>>>> If we're no longer waiting until the source has been paused and the initial state is sent early, then we need to make sure that any changes that happen is still communicated to the destination.
>>>>>
>>>>> So you're proposing to treat suitable parts of the device state more
>>>>> like memory pages. Correct?
>>>>>
>>>>
>>>> Not in the sense of "something got dirtied so let's immediately re-send
>>>> that" like we would with RAM. It's more along the lines of "something
>>>> got dirtied so let's make sure that gets re-sent at the start of stage 2".
>>>
>>> Or is it "something might have dirtied, just resend in stage 2"?
>>>
>>
>> Exactly. This is better wording since it doesn't necessarily have to be
>> sent at the "start" of stage 2. Just at some point during it.
>
> Got it.
>
>>>> The entire state of a virtio-net device (even with vhost-net /
>>>> vhost-vDPA) is <10KB I believe. I don't believe there's much to gain by
>>>> "iteratively" re-sending changes for virtio-net. It should be suitable
>>>> enough to just re-send whatever changed during stage 1 (after the
>>>> initial state was sent) at the start of stage 2.
>>>
>>> Got it.
>>>
>>>> This is why I'm currently looking into a solution that uses VMSD's
>>>> .early_setup flag (that Peter recommended) rather than implementing a
>>>> suite of SaveVMHandlers hooks (like this RFC does). We don't need this
>>>> iterative capability as much as we need to start migrating the state
>>>> earlier (and doing corresponding config/prep work) during stage 1.
>>>>
>>>>> Cover letter and commit message of PATCH 4 provide the motivation: you
>>>>> observe a shorter downtime. You speculate this is due to moving "heavy
>>>>> allocations and page-fault latencies" from stage 2 to stage 1. Correct?
>>>>
>>>> Correct. But again I'd like to stress that this is just one part in
>>>> reducing downtime during stage 2. The biggest reductions will come from
>>>> the config/prep work that we're trying to move from stage 2 to stage 1,
>>>> especially when vhost-vDPA is involved. And we can only do this early
>>>> work once we have the state, hence why we're sending it earlier.
>>>
>>> This is an important bit of detail I've been missing so far. Easy
>>> enough to fix in a future commit message and cover letter.
>>>
>>
>> Ack.
>>
>>>>> Is there anything that makes virtio-net particularly suitable?
>>>>
>>>> Yes, especially with vhost-vDPA and configuring VQs. See Eugenio's
>>>> comment here
>>>> https://lore.kernel.org/qemu-devel/CAJaqyWdUutZrAWKy9d=ip+h+y3BnptUrcL8Xj06XfizNxPtfpw@mail.gmail.com/ .
>>>
>>> Such prep work commonly depends only on device configuration, not state.
>>> I'm curious: what state bits exactly does the prep work need?
>>>
>>> Device configuration is available at the start of stage 1, state is
>>> fully available only at the end of stage 2.
>>>
>>
>> We pretty much need, more or less, all of the state of the VirtIODevice
>> itself as well as the bits of the VirtIONet device. Essentially, barring
>> ring indices, we'd need whatever is required throughout most of the
>> device's startup routine.
>>
>> In this series, we get everything we need from the vmstate_save_state(f,
>> &vmstate_virtio_net, ...) and vmstate_load_state(f, &vmstate_virtio_net,
>> ...) calls early during stage 1 (see patch 4/6).
>>
>> Once we've gotten this data, we can start on the prep work that's
>> normally done today during stage 2.
>
> This is unusual. I'd like to understand it better.
>
> Non-migration startup:
>
> 1. We create the device. This runs its .init().
>
> 2. We configure the device by setting device properties.
>
> 3. We realize the device. This runs its .realize(), which initializes
> the device state according to its configuration.
>
> 4. The guest interacts with the device. Device state changes.
>
> When is the expensive prep work we've been discussing done here?
>
During step 4. The expensive vhost bring-up (e.g. for vhost-vDPA) happens
during vhost_dev_start, when it has to send ioctls to configure the
memory table, VQs, etc.
This prep work depends on the negotiated features, device configuration
(MAC, MTU, MQ), VQ layouts (vring addresses), the memory table, etc. It
doesn't require dynamic VQ state like ring indices.
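For reference, a simplified outline of what that bring-up drives for a
vhost-vDPA backend (ordering and call sites are approximate, not exact
code):

    VHOST_SET_FEATURES            <- needs the negotiated features
    VHOST_SET_MEM_TABLE           <- needs the guest memory map (the pinning)
    per virtqueue:
        VHOST_SET_VRING_NUM/ADDR  <- needs VQ sizes and vring guest addresses
        VHOST_SET_VRING_BASE      <- ring indices (the only truly "late" piece)
        VHOST_SET_VRING_KICK/CALL <- eventfds
    VHOST_VDPA_SET_STATUS         <- DRIVER_OK
    CVQ commands                  <- MAC, MQ, RSS, offloads, MTU, ...
    VHOST_VDPA_SET_VRING_ENABLE   <- per data VQ, at the end of
                                     vhost_vdpa_net_cvq_load()

Everything above except VHOST_SET_VRING_BASE depends only on information
that is, in principle, available before the stop-and-copy phase.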
>>> Your patches make *tentative* device state available in stage 1.
>>> Tentative, because it may still change afterwards.
>>>
>>> You use tentative state to do certain expensive work in stage 1 already,
>>> in order to cut downtime in stage 2.
>>>
>>> Fair?
>>
>> Correct.
>
> Got it.
>
>>> Can state change in ways that invalidate this work?
>>>
>>
>> If, for some reason, the guest wanted to change everything during
>> migration (specifically during stage 1), then it'd more or less negate
>> the early prep work we'd've done. How impactful this is would depend on
>> which route we go (see below). God forbid the guest just wait until
>> migration is complete.
>
> So the answer is yes.
>
>>> If yes, how do you handle this?
>>>
>>
>> So it depends on the route this series goes. That is, whether we go the
>> truly iterative SaveVMHandlers hooks route (which this series uses) or
>> if we go the early_setup VMSD route (which Peter recommended).
>>
>> ---
>>
>> If we go the truly iterative route, then technically we can still handle
>> these changes during stage 1 and still keep the work out of stage 2.
>>
>> However, given the nicheness of such a corner case (where things are
>> being changed last minute during migration), handling these changes
>> iteratively might be overdesign.
>>
>> And we'd have to guard against the scenario where the guest acts
>> maliciously by constantly changing things to prevent migration from
>> continuing.
>
> Yes.
>
>> ---
>>
>> If we go the early_setup VMSD route, where we get one shot early to do
>> stuff during stage 1 and one last shot to do things later during stage
>> 2, then the more that gets changed means the less beneficial this early
>> work becomes. This is because any changes made during stage 1 could only
>> be handled during stage 2, which is what this overall effort is trying
>> to minimize.
>
> Stupidest solution that could possibly work: if anything impacting the
> prep work changed, redo it from scratch.
>
>>> If no, do you verify the "no change" design assumption holds?
>>>
>>
>> When it comes to early migration for devices, we can never support a "no
>> change" design assumption. The difference in the design lies in how (and
>> when) such changes are handled during migration.
>
> Just checking :)
>
> [...]
>
>>>>>>>> But I think for now this is better left as a virtio-net configuration
>>>>>>>> rather than as a migration capability (e.g. --device
>>>>>>>> virtio-net-pci,iterative-mig=on/off,...)
>>>>>>>
>>>>>>> Makes sense to me (but I'm not a migration expert).
>>>>>
>>>>> A device property's default can depend on the machine type via compat
>>>>> properties. This is normally used to restrict a guest-visible change to
>>>>> newer machine types. Here, it's not guest-visible. But it can get you
>>>>> this:
>>>>>
>>>>> * Migrate new machine type from new QEMU to new QEMU (old QEMU doesn't
>>>>> have the machine type): iterative is enabled by default. Good. User
>>>>> can disable it on both ends to not get the improvement. Enabling it
>>>>> on just one breaks migration.
>>>>>
>>>>> All other cases go away with time.
>>>>>
>>>>> * Migrate old machine type from new QEMU to new QEMU: iterative is
>>>>> disabled by default, which is sad, but no worse than before. User can
>>>>> enable it on both ends to get the improvement. Enabling it on just
>>>>> one breaks migration.
>>>>>
>>>>> * Migrate old machine type from new QEMU to old QEMU or vice versa:
>>>>> iterative is off by default. Good. Enabling it on the new one breaks
>>>>> migration.
>>>>>
>>>>> * Migrate old machine type from old QEMU to old QEMU: iterative is off
>>>>>
>>>>> I figure almost all users could simply ignore this configuration knob
>>>>> then.
>>>>
>>>> Oh, that's interesting. I wasn't aware of this. But couldn't this
>>>> potentially cause some headaches and confusion when attempting to
>>>> migrate between 2 guests where one VM is using a machine type does
>>>> support it and the other isn't?
>>>>
>>>> For example, the source and destination VMs both specify '-machine
>>>> q35,...' and the q35 alias resolves into, say, pc-q35-10.1 for the
>>>> source VM and pc-q35-10.0 for the destination VM. And say this property
>>>> is supported on >= pc-q35-10.1.
>>>
>>> In my understanding, migration requires identical machine types on both
>>> ends, and all bets are off when they're different.
>>>
>>
>> Ah, true.
>>
>>>> IIUC, this would mean that iterative is enabled by default on the source
>>>> VM but disabled by default on the destination VM.
>>>>
>>>> Then a user attempts the migration, the migration fails, and then they'd
>>>> have to try and figure out why it's failing.
>>>
>>> Migration failures due to mismatched configuration tend to be that way,
>>> don't they?
>>>
>>
>> Right.
>>
>> So if we pin this feature to always be enabled for machine type, say, >=
>> pc-q35-XX.X, then can we assume that both guests can actually support
>> this feature?
>>
>> In other words, conversely, is it possible in production that both
>> guests use pc-q35-XX.X but one build supports this early migration
>> feature and the other doesn't?
>
> I'd call that a bug.
>
> Here's how we commonly code property defaults depending on the machine
> type.
>
> The property defaults to the new default (here: feature enabled).
>
> Machine types older than the current (unreleased) one use a compat
> property to change it to the old default (here: feature disabled). With
> this value, the device must be compatible with its older versions in
> prior releases of QEMU, both for the guest and for migration.
>
> Once you got that right, it's fairly unlikely to break accidentally.
>
> The current machine type then defaults the feature to enabled in the
> current and all future versions of QEMU. The machine type doesn't exist
> in older versions of QEMU.
>
> Older machine types default it to disabled in the current and all future
> versions of QEMU, which is compatible with older versions of QEMU.
>
Got it. This is something I will look into then for this kind of
implementation. Thank you!
>> If we can assume that, then this would probably be the right approach
>> for something like this.
>>
>>>> Furthermore, since it's a device property that's essentially set at VM
>>>> creation time, either the source would have to be reset and explicitly
>>>> set this property to off or the destination would have to be reset and
>>>> use a newer (>= pc-q35-10.1) machine type before starting it back up and
>>>> perform the migration.
>>>
>>> You can use qom-set to change a device property after you created the
>>> device. It might even work. However, qom-set is a deeply problematic
>>> and seriously underdocumented interface. Avoid.
>>>
>>> But will you need to change it?
>>>
>>> If you started the source with an explicit property value, start the
>>> destination the same way. Same as for any number of other configuration
>>> knobs.
>>>
>>> If you started the source with the default property value, start the
>>> destination the same way. Values will match as long as the machine type
>>> matches, as it should.
>>>
>>
>> Given that migration can only be done with matching machine types and if
>> we can assume that guests using pc-q35-XX.X, for example, will always
>> have this support, then my concerns about this are allayed.
>
> Glad I was able to assist here!
>
>>>> Am I understanding this correctly?
>>>>
>>>>>>> [...]
>>>
>
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC 5/6] virtio,virtio-net: skip consistency check in virtio_load for iterative migration
2025-09-01 13:17 ` Jonah Palmer
@ 2025-09-02 7:31 ` Eugenio Perez Martin
0 siblings, 0 replies; 66+ messages in thread
From: Eugenio Perez Martin @ 2025-09-02 7:31 UTC (permalink / raw)
To: Jonah Palmer
Cc: Peter Xu, si-wei.liu, qemu-devel, farosas, eblake, armbru,
jasowang, mst, boris.ostrovsky, Dragos Tatulea DE
On Mon, Sep 1, 2025 at 3:17 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
>
>
>
> On 9/1/25 2:57 AM, Eugenio Perez Martin wrote:
> > On Wed, Aug 27, 2025 at 6:56 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
> >>
> >>
> >>
> >> On 8/20/25 3:59 AM, Eugenio Perez Martin wrote:
> >>> On Tue, Aug 19, 2025 at 5:11 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 8/19/25 3:10 AM, Eugenio Perez Martin wrote:
> >>>>> On Mon, Aug 18, 2025 at 4:46 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On 8/18/25 2:51 AM, Eugenio Perez Martin wrote:
> >>>>>>> On Fri, Aug 15, 2025 at 4:50 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On 8/14/25 5:28 AM, Eugenio Perez Martin wrote:
> >>>>>>>>> On Wed, Aug 13, 2025 at 4:06 PM Peter Xu <peterx@redhat.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>> On Wed, Aug 13, 2025 at 11:25:00AM +0200, Eugenio Perez Martin wrote:
> >>>>>>>>>>> On Mon, Aug 11, 2025 at 11:56 PM Peter Xu <peterx@redhat.com> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Mon, Aug 11, 2025 at 05:26:05PM -0400, Jonah Palmer wrote:
> >>>>>>>>>>>>> This effort was started to reduce the guest visible downtime by
> >>>>>>>>>>>>> virtio-net/vhost-net/vhost-vDPA during live migration, especially
> >>>>>>>>>>>>> vhost-vDPA.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> The downtime contributed by vhost-vDPA, for example, is not from having to
> >>>>>>>>>>>>> migrate a lot of state but rather expensive backend control-plane latency
> >>>>>>>>>>>>> like CVQ configurations (e.g. MQ queue pairs, RSS, MAC/VLAN filters, offload
> >>>>>>>>>>>>> settings, MTU, etc.). Doing this requires kernel/HW NIC operations which
> >>>>>>>>>>>>> dominates its downtime.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> In other words, by migrating the state of virtio-net early (before the
> >>>>>>>>>>>>> stop-and-copy phase), we can also start staging backend configurations,
> >>>>>>>>>>>>> which is the main contributor of downtime when migrating a vhost-vDPA
> >>>>>>>>>>>>> device.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I apologize if this series gives the impression that we're migrating a lot
> >>>>>>>>>>>>> of data here. It's more along the lines of moving control-plane latency out
> >>>>>>>>>>>>> of the stop-and-copy phase.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I see, thanks.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Please add these into the cover letter of the next post. IMHO it's
> >>>>>>>>>>>> extremely important information to explain the real goal of this work. I
> >>>>>>>>>>>> bet it is not expected for most people when reading the current cover
> >>>>>>>>>>>> letter.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Then it could have nothing to do with iterative phase, am I right?
> >>>>>>>>>>>>
> >>>>>>>>>>>> What are the data needed for the dest QEMU to start staging backend
> >>>>>>>>>>>> configurations to the HWs underneath? Does dest QEMU already have them in
> >>>>>>>>>>>> the cmdlines?
> >>>>>>>>>>>>
> >>>>>>>>>>>> Asking this because I want to know whether it can be done completely
> >>>>>>>>>>>> without src QEMU at all, e.g. when dest QEMU starts.
> >>>>>>>>>>>>
> >>>>>>>>>>>> If src QEMU's data is still needed, please also first consider
> >>>>>>>>>>>> providing such a facility using an "early VMSD" if that is at all
> >>>>>>>>>>>> possible; feel free to refer to commit 3b95a71b22827d26178.
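(A minimal sketch of what such an early VMSD could look like, assuming
"early VMSD" refers to the .early_setup flag in VMStateDescription; the
field shown is purely illustrative and not from the series:)

static const VMStateDescription vmstate_virtio_net_early = {
    .name = "virtio-net/early-cfg",
    .version_id = 1,
    .minimum_version_id = 1,
    .early_setup = true,    /* sent during the setup phase, before RAM */
    .fields = (const VMStateField[]) {
        /* enough for the destination to start staging queue-pair setup */
        VMSTATE_UINT16(max_queue_pairs, VirtIONet),
        VMSTATE_END_OF_LIST()
    },
};

(It would be registered like any other VMSD, e.g. via vmstate_register(),
at realize time.)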
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> While it works for this series, it does not allow resending the
> >>>>>>>>>>> state when the src device changes, for example if the number of
> >>>>>>>>>>> virtqueues is modified.
> >>>>>>>>>>
> >>>>>>>>>> Some explanation of "how syncing the number of vqueues helps downtime"
> >>>>>>>>>> would help. Not "it might preheat things", but exactly why, and how that
> >>>>>>>>>> differs between pure software and when hardware is involved.
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> According to Nvidia engineers, configuring the vqs (number, size,
> >>>>>>>>> RSS, etc.) takes about ~200ms:
> >>>>>>>>> https://lore.kernel.org/qemu-devel/6c8ebb97-d546-3f1c-4cdd-54e23a566f61@nvidia.com/T/
> >>>>>>>>>
> >>>>>>>>> Adding Dragos here in case he can provide more details. Maybe the
> >>>>>>>>> numbers have changed though.
> >>>>>>>>>
> >>>>>>>>> And I guess the difference from pure SW will always come down to PCI
> >>>>>>>>> communication, which I assume is slower than configuring the host SW
> >>>>>>>>> device in RAM or even CPU cache. But I admit that proper profiling is
> >>>>>>>>> needed before making those claims.
> >>>>>>>>>
> >>>>>>>>> Jonah, can you print the time it takes to configure the vDPA device
> >>>>>>>>> with traces vs the time it takes to enable the device's dataplane?
> >>>>>>>>> That way we can get an idea of how much time we save with this.
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> Let me know if this isn't what you're looking for.
> >>>>>>>>
> >>>>>>>> I'm assuming by "configuration time" you mean:
> >>>>>>>> - Time from device startup (entry to vhost_vdpa_dev_start()) to right
> >>>>>>>> before we start enabling the vrings (e.g.
> >>>>>>>> VHOST_VDPA_SET_VRING_ENABLE in vhost_vdpa_net_cvq_load()).
> >>>>>>>>
> >>>>>>>> And by "time taken to enable the dataplane" I'm assuming you mean:
> >>>>>>>> - Time right before we start enabling the vrings (see above) to right
> >>>>>>>> after we enable the last vring (at the end of
> >>>>>>>> vhost_vdpa_net_cvq_load())
> >>>>>>>>
> >>>>>>>> Guest specs: 128G Mem, SVQ=on, CVQ=on, 8 queue pairs:
> >>>>>>>>
> >>>>>>>> -netdev type=vhost-vdpa,vhostdev=$VHOST_VDPA_0,id=vhost-vdpa0,
> >>>>>>>> queues=8,x-svq=on
> >>>>>>>>
> >>>>>>>> -device virtio-net-pci,netdev=vhost-vdpa0,id=vdpa0,bootindex=-1,
> >>>>>>>> romfile=,page-per-vq=on,mac=$VF1_MAC,ctrl_vq=on,mq=on,
> >>>>>>>> ctrl_vlan=off,vectors=18,host_mtu=9000,
> >>>>>>>> disable-legacy=on,disable-modern=off
> >>>>>>>>
> >>>>>>>> ---
> >>>>>>>>
> >>>>>>>> Configuration time: ~31s
> >>>>>>>> Dataplane enable time: ~0.14ms
> >>>>>>>>
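(For reference, a minimal sketch of the kind of ad-hoc timing probes
behind numbers like these, assuming one hook at the first
vhost_vdpa_dev_start() entry and one per enabled vring; the helper names
are made up, not from the series:)

#include <glib.h>
#include <stdio.h>

static gint64 cfg_start_us;     /* first vhost_vdpa_dev_start() entry */
static gint64 first_vq_on_us;   /* first vring enabled */

/* call at the top of vhost_vdpa_dev_start() */
static void probe_dev_start(void)
{
    if (cfg_start_us == 0) {
        cfg_start_us = g_get_monotonic_time();
    }
}

/* call right after each VHOST_VDPA_SET_VRING_ENABLE */
static void probe_vring_enabled(int vq_idx, int last_vq_idx)
{
    gint64 now = g_get_monotonic_time();

    if (first_vq_on_us == 0) {
        first_vq_on_us = now;
        fprintf(stderr, "config time: %" G_GINT64_FORMAT " us\n",
                now - cfg_start_us);
    }
    if (vq_idx == last_vq_idx) {
        fprintf(stderr, "dataplane enable time: %" G_GINT64_FORMAT " us\n",
                now - first_vq_on_us);
    }
}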
> >>>>>>>
> >>>>>>> I was vague, but yes, that's representative enough! It would be more
> >>>>>>> accurate if the configuration time ended at the point where QEMU
> >>>>>>> enables the first queue of the dataplane, though.
> >>>>>>>
> >>>>>>> As Si-Wei mentions, is v->shared->listener_registered == true at the
> >>>>>>> beginning of vhost_vdpa_dev_start?
> >>>>>>>
> >>>>>>
> >>>>>> Ah, I also realized that the QEMU I was using for measurements was from
> >>>>>> a version before the listener_registered member was introduced.
> >>>>>>
> >>>>>> I retested with the latest QEMU changes and set x-svq=off, i.e. guest
> >>>>>> specs: 128G Mem, SVQ=off, CVQ=on, 8 queue pairs. I ran the test 3 times
> >>>>>> for measurements.
> >>>>>>
> >>>>>> v->shared->listener_registered == false at the beginning of
> >>>>>> vhost_vdpa_dev_start().
> >>>>>>
> >>>>>
> >>>>> Let's move the effect of the memory pinning out of the downtime by
> >>>>> registering the listener before the migration. Can you check why it is
> >>>>> not registered at vhost_vdpa_set_owner?
> >>>>>
> >>>>
> >>>> Sorry I was profiling improperly. The listener is registered at
> >>>> vhost_vdpa_set_owner initially and v->shared->listener_registered is set
> >>>> to true, but once we reach the first vhost_vdpa_dev_start call, it shows
> >>>> as false and is re-registered later in the function.
> >>>>
> >>>> Should we always expect listener_registered == true at every
> >>>> vhost_vdpa_dev_start call during startup?
> >>>
> >>> Yes, that leaves all the memory pinning time out of the downtime.
> >>>
> >>>> This is what I traced during
> >>>> startup of a single guest (no migration).
> >>>
> >>> We can trace the destination's QEMU to be more accurate, but it
> >>> probably makes no difference.
> >>>
> >>>> Tracepoint is right at the
> >>>> start of the vhost_vdpa_dev_start function:
> >>>>
> >>>> vhost_vdpa_set_owner() - register memory listener
> >>>> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> >>>> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
> >>>> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> >>>
> >>> This is surprising. Can you trace how listener_registered goes to 0 again?
> >>>
> >>>> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> >>>> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> >>>> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> >>>> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> >>>> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> >>>> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> >>>> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> >>>> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> >>>> ...
> >>>> * VQs are now being enabled *
> >>>>
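(Illustrative only: the sort of probe that produces the lines above; an
in-tree version would more likely be a proper trace event than an
fprintf. The field names follow current QEMU's struct vhost_vdpa:)

#include "hw/virtio/vhost-vdpa.h"
#include <stdbool.h>
#include <stdio.h>

/* called at the top of vhost_vdpa_dev_start() */
static void probe_dev_start_state(const struct vhost_vdpa *v, bool started)
{
    fprintf(stderr,
            "vhost_vdpa_dev_start() - v->shared->listener_registered = %d, "
            "started = %d\n",
            v->shared->listener_registered, started);
}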
> >>>> I'm also seeing that when the guest is being shut down,
> >>>> dev->vhost_ops->vhost_get_vring_base() is failing in
> >>>> do_vhost_virtqueue_stop():
> >>>>
> >>>> ...
> >>>> [ 114.718429] systemd-shutdown[1]: Syncing filesystems and block devices.
> >>>> [ 114.719255] systemd-shutdown[1]: Powering off.
> >>>> [ 114.719916] sd 0:0:0:0: [sda] Synchronizing SCSI cache
> >>>> [ 114.724826] ACPI: PM: Preparing to enter system sleep state S5
> >>>> [ 114.725593] reboot: Power down
> >>>> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
> >>>> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
> >>>> qemu-system-x86_64: vhost VQ 2 ring restore failed: -1: Operation not
> >>>> permitted (1)
> >>>> qemu-system-x86_64: vhost VQ 3 ring restore failed: -1: Operation not
> >>>> permitted (1)
> >>>> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
> >>>> qemu-system-x86_64: vhost VQ 4 ring restore failed: -1: Operation not
> >>>> permitted (1)
> >>>> qemu-system-x86_64: vhost VQ 5 ring restore failed: -1: Operation not
> >>>> permitted (1)
> >>>> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
> >>>> qemu-system-x86_64: vhost VQ 6 ring restore failed: -1: Operation not
> >>>> permitted (1)
> >>>> qemu-system-x86_64: vhost VQ 7 ring restore failed: -1: Operation not
> >>>> permitted (1)
> >>>> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
> >>>> qemu-system-x86_64: vhost VQ 8 ring restore failed: -1: Operation not
> >>>> permitted (1)
> >>>> qemu-system-x86_64: vhost VQ 9 ring restore failed: -1: Operation not
> >>>> permitted (1)
> >>>> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
> >>>> qemu-system-x86_64: vhost VQ 10 ring restore failed: -1: Operation not
> >>>> permitted (1)
> >>>> qemu-system-x86_64: vhost VQ 11 ring restore failed: -1: Operation not
> >>>> permitted (1)
> >>>> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
> >>>> qemu-system-x86_64: vhost VQ 12 ring restore failed: -1: Operation not
> >>>> permitted (1)
> >>>> qemu-system-x86_64: vhost VQ 13 ring restore failed: -1: Operation not
> >>>> permitted (1)
> >>>> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
> >>>> qemu-system-x86_64: vhost VQ 14 ring restore failed: -1: Operation not
> >>>> permitted (1)
> >>>> qemu-system-x86_64: vhost VQ 15 ring restore failed: -1: Operation not
> >>>> permitted (1)
> >>>> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
> >>>>
> >>>> However when x-svq=on, I don't see these errors on shutdown.
> >>>>
> >>>
> >>> SVQ can mask this error as it does not need to forward the ring
> >>> restore message to the device. It can just start with 0 and convert
> >>> indexes.
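(Conceptually, something like the sketch below, as an illustration only
rather than the actual SVQ code: the shadow ring starts from 0 on the
device side, and a per-queue base offset recovers the guest-visible
index, with the usual 16-bit wrap.)

#include <stdint.h>

typedef struct {
    uint16_t guest_base;  /* guest-visible index when the shadow ring started */
    uint16_t shadow_idx;  /* device-side index, always starting from 0 */
} ShadowIdx;

static inline uint16_t shadow_to_guest_idx(const ShadowIdx *s)
{
    /* 16-bit wrap-around is intentional, as with virtio ring indexes */
    return (uint16_t)(s->guest_base + s->shadow_idx);
}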
> >>>
> >>> Let's focus on listener_registered first :).
> >>>
> >>>>>> ---
> >>>>>>
> >>>>>> Configuration time: Time from first entry into vhost_vdpa_dev_start() to
> >>>>>> right after QEMU enables the first VQ.
> >>>>>> - 26.947s, 26.606s, 27.326s
> >>>>>>
> >>>>>> Enable dataplane: Time from right after first VQ is enabled to right
> >>>>>> after the last VQ is enabled.
> >>>>>> - 0.081ms, 0.081ms, 0.079ms
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> >> I looked into this a bit more and realized I was being naive in thinking
> >> that the vhost-vDPA device startup path of a single VM would be the same
> >> as that on a destination VM during live migration. This is **not** the
> >> case and I apologize for the confusion I caused.
> >>
> >> What I described and profiled above is indeed true for the startup of a
> >> single VM / source VM with a vhost-vDPA device. However, this is not
> >> true on the destination side and its configuration time is drastically
> >> different.
> >>
> >> Under the same specs, but now with a live migration performed between a
> >> source and destination VM (128G Mem, SVQ=off, CVQ=on, 8 queue pairs),
> >> and using the same tracepoints to find the configuration time and enable
> >> dataplane time, these are the measurements I found for the **destination
> >> VM**:
> >>
> >> Configuration time: Time from first entry into vhost_vdpa_dev_start to
> >> right after QEMU enables the first VQ.
> >> - 268.603ms, 241.515ms, 249.007ms
> >>
> >> Enable dataplane time: Time from right after the first VQ is enabled to
> >> right after the last VQ is enabled.
> >> - 0.072ms, 0.071ms, 0.070ms
> >>
> >> ---
> >>
> >> For those curious, using the same printouts as I did above, this is what
> >> it actually looks like on the destination side:
> >>
> >> * Destination VM is started *
> >>
> >> vhost_vdpa_set_owner() - register memory listener
> >> vhost_vdpa_reset_device() - unregistering listener
> >>
> >> * Start live migration on source VM *
> >> (qemu) migrate unix:/tmp/lm.sock
> >> ...
> >>
> >> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> >> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> >> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> >> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> >> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> >> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> >> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> >> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> >> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> >> vhost_vdpa_dev_start() - register listener
> >>
> >
> > That's weird, can you check why the memory listener is not registered
> > at vhost_vdpa_set_owner? Or, if it is registered, why it is no longer
> > registered by the time vhost_vdpa_dev_start is called? This changes
> > the downtime a lot; more than half of the time is spent on this, so it
> > is worth fixing before continuing.
> >
>
> The memory listener is registered at vhost_vdpa_set_owner, but the
> reason we see v->shared->listener_registered == 0 by the time
> vhost_vdpa_dev_start is called is due to the vhost_vdpa_reset_device
> that's called shortly after.
>
Ok, I missed the status of this.

This first reset is actually avoidable. I see two routes for this:

1) Do not reset if shared->listener_registered (see the sketch below).
Maybe we should rename that member, actually, as it now means something
like "the device is blank and ready to be configured". Or maybe
dedicate two variables or flags; it is a shame to lose the precision of
"listener_registered".

2) Implement the VHOST_BACKEND_F_IOTLB_PERSIST part of Si-Wei's series [1].

I'd greatly prefer option 1, as it does not depend on the backend
features and it is more generic. But option 2 will be needed to reduce
the SVQ transition downtime too.
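A rough sketch of option 1, only to illustrate the idea (the function
and field names follow current QEMU, but the control flow is mine, not
a tested patch):

/* Skip the early reset when the listener registered at
 * vhost_vdpa_set_owner() is still in place, so the mappings it pinned
 * are not thrown away before the first vhost_vdpa_dev_start(). */
static int vhost_vdpa_maybe_reset_device(struct vhost_dev *dev)
{
    struct vhost_vdpa *v = dev->opaque;

    if (v->shared->listener_registered) {
        /* Device is still "blank and ready to be configured";
         * keep the listener (and the maps it pinned) alive. */
        return 0;
    }

    return vhost_vdpa_reset_device(dev);
}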
> But this re-registering is relatively quick compared to how long it
> takes during its initialization sequence.
>
That's interesting; I guess it is because the regions are warm. Can
you measure how long it takes, so we can evaluate whether it is worth
comparing with the iterative migration?
Thanks!
[1] https://lists.nongnu.org/archive/html/qemu-devel/2023-12/msg00909.html
^ permalink raw reply [flat|nested] 66+ messages in thread
end of thread, other threads:[~2025-09-02 7:32 UTC | newest]
Thread overview: 66+ messages
2025-07-22 12:41 [RFC 0/6] virtio-net: initial iterative live migration support Jonah Palmer
2025-07-22 12:41 ` [RFC 1/6] migration: Add virtio-iterative capability Jonah Palmer
2025-08-06 15:58 ` Peter Xu
2025-08-07 12:50 ` Jonah Palmer
2025-08-07 13:13 ` Peter Xu
2025-08-07 14:20 ` Jonah Palmer
2025-08-08 10:48 ` Markus Armbruster
2025-08-11 12:18 ` Jonah Palmer
2025-08-25 12:44 ` Markus Armbruster
2025-08-25 14:57 ` Jonah Palmer
2025-08-26 6:11 ` Markus Armbruster
2025-08-26 18:08 ` Jonah Palmer
2025-08-27 6:37 ` Markus Armbruster
2025-08-28 15:29 ` Jonah Palmer
2025-08-29 9:24 ` Markus Armbruster
2025-09-01 14:10 ` Jonah Palmer
2025-07-22 12:41 ` [RFC 2/6] virtio-net: Reorder vmstate_virtio_net and helpers Jonah Palmer
2025-07-22 12:41 ` [RFC 3/6] virtio-net: Add SaveVMHandlers for iterative migration Jonah Palmer
2025-07-22 12:41 ` [RFC 4/6] virtio-net: iter live migration - migrate vmstate Jonah Palmer
2025-07-23 6:51 ` Michael S. Tsirkin
2025-07-24 14:45 ` Jonah Palmer
2025-07-25 9:31 ` Michael S. Tsirkin
2025-07-28 12:30 ` Jonah Palmer
2025-07-22 12:41 ` [RFC 5/6] virtio, virtio-net: skip consistency check in virtio_load for iterative migration Jonah Palmer via
2025-07-28 15:30 ` [RFC 5/6] virtio,virtio-net: " Eugenio Perez Martin
2025-07-28 16:23 ` Jonah Palmer
2025-07-30 8:59 ` Eugenio Perez Martin
2025-08-06 16:27 ` Peter Xu
2025-08-07 14:18 ` Jonah Palmer
2025-08-07 16:31 ` Peter Xu
2025-08-11 12:30 ` Jonah Palmer
2025-08-11 13:39 ` Peter Xu
2025-08-11 21:26 ` Jonah Palmer
2025-08-11 21:55 ` Peter Xu
2025-08-12 15:51 ` Jonah Palmer
2025-08-13 9:25 ` Eugenio Perez Martin
2025-08-13 14:06 ` Peter Xu
2025-08-14 9:28 ` Eugenio Perez Martin
2025-08-14 16:16 ` Dragos Tatulea
2025-08-14 20:27 ` Peter Xu
2025-08-15 14:50 ` Jonah Palmer
2025-08-15 19:35 ` Si-Wei Liu
2025-08-18 6:51 ` Eugenio Perez Martin
2025-08-18 14:46 ` Jonah Palmer
2025-08-18 16:21 ` Peter Xu
2025-08-19 7:20 ` Eugenio Perez Martin
2025-08-19 7:10 ` Eugenio Perez Martin
2025-08-19 15:10 ` Jonah Palmer
2025-08-20 7:59 ` Eugenio Perez Martin
2025-08-25 12:16 ` Jonah Palmer
2025-08-27 16:55 ` Jonah Palmer
2025-09-01 6:57 ` Eugenio Perez Martin
2025-09-01 13:17 ` Jonah Palmer
2025-09-02 7:31 ` Eugenio Perez Martin
2025-07-22 12:41 ` [RFC 6/6] virtio-net: skip vhost_started assertion during " Jonah Palmer
2025-07-23 5:51 ` [RFC 0/6] virtio-net: initial iterative live migration support Jason Wang
2025-07-24 21:59 ` Jonah Palmer
2025-07-25 9:18 ` Lei Yang
2025-07-25 9:33 ` Michael S. Tsirkin
2025-07-28 7:09 ` Jason Wang
2025-07-28 7:35 ` Jason Wang
2025-07-28 12:41 ` Jonah Palmer
2025-07-28 14:51 ` Eugenio Perez Martin
2025-07-28 15:38 ` Eugenio Perez Martin
2025-07-29 2:38 ` Jason Wang
2025-07-29 12:41 ` Jonah Palmer