* [PATCH v2 0/7] migration: Add precopy initial data capability and VFIO precopy support
@ 2023-05-17 15:52 Avihai Horon
2023-05-17 15:52 ` [PATCH v2 1/7] migration: Add precopy initial data capability Avihai Horon
` (6 more replies)
0 siblings, 7 replies; 15+ messages in thread
From: Avihai Horon @ 2023-05-17 15:52 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cédric Le Goater, Eduardo Habkost,
Marcel Apfelbaum, Philippe Mathieu-Daudé, Yanan Wang,
Juan Quintela, Peter Xu, Leonardo Bras, Eric Blake,
Markus Armbruster, Thomas Huth, Laurent Vivier, Paolo Bonzini,
Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Avihai Horon,
Kirti Wankhede, Tarun Gupta, Joao Martins
Hello everyone,
This is v2 of the precopy initial data series.
I am still wondering about the name -- maybe "explicit switchover", as
suggested by Peter, is better as it's more general?
Anyway,
Changes from v1 [3]:
* Rebased on latest master branch.
* Updated to latest QAPI doc comment conventions and refined
QAPI docs and capability error message. (Markus)
* Followed Peter/Juan suggestion and removed the handshake between
source and destination.
Now the capability must be set on both source and destination.
Compatibility of this feature between different QEMU versions or
different host capabilities (i.e., kernel) is achieved in the regular
way of device properties and hw_compat_x_y.
* Replaced is_initial_data_active() and initial_data_loaded()
SaveVMHandlers handlers with a notification mechanism. (Peter)
* Set the capability also in destination in the migration test.
* Added VFIO device property x-allow-pre-copy to be able to preserve
compatibility between different QEMU versions or different host
capabilities (i.e., kernel).
* Changed VFIO precopy initial data implementation according to the
above changes.
* Documented VFIO precopy initial data support in VFIO migration
documentation.
* Added R-bs.
===
This series adds a new migration capability called "precopy initial
data". The purpose of this capability is to reduce migration downtime in
cases where loading of migration data in the destination can take a lot
of time, such as with VFIO migration data.
The series then moves to add precopy support and precopy initial data
support for VFIO migration.
Precopy initial data is used by VFIO migration, but other migration
users can add support for it and use it as well.
=== Background ===
Migration downtime estimation is calculated based on bandwidth and
remaining migration data. This assumes that loading of migration data in
the destination takes a negligible amount of time and that downtime
depends only on network speed.
While this may be true for RAM, it's not necessarily true for other
migration users. For example, loading the data of a VFIO device in the
destination might require the device to allocate resources and prepare
internal data structures, which can take a significant amount of time.
This poses a problem, as the source may think that the remaining
migration data is small enough to meet the downtime limit, so it will
stop the VM and complete the migration, but in fact sending and loading
the data in the destination may take longer than the downtime limit.
To solve this, VFIO migration uAPI defines "initial bytes" as part of
its precopy stream [1]. Initial bytes can be used in various ways to
improve VFIO migration performance. For example, it can be used to
transfer device metadata to pre-allocate resources in the destination.
However, for this to work we need to make sure that all initial bytes
are sent and loaded in the destination before the source VM is stopped.
The new precopy initial data migration capability helps us achieve this.
It allows the source to send initial precopy data and the destination to
ACK that this data has been loaded. Migration will not attempt to stop
the source VM and complete the migration until this ACK is received.
Note that this relies on the return path capability to communicate from
the destination back to the source.
=== Flow of operation ===
To use precopy initial data, the capability must be enabled in both the
source and the destination.
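As a sketch of how this could look from a management tool's point of view, both capabilities might be enabled via the standard migrate-set-capabilities QMP command, issued on the source and the destination alike ("precopy-initial-data" is the capability name introduced by this series; "return-path" is required by it):

```json
{ "execute": "migrate-set-capabilities",
  "arguments": { "capabilities": [
      { "capability": "return-path", "state": true },
      { "capability": "precopy-initial-data", "state": true } ] } }
```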
During migration setup, migration code calls the initial_data_advise()
SaveVMHandlers handler of the migration users, both in the source and
the destination, to notify them that precopy initial data is used.
In the destination, an "initial_data_pending_num" counter is increased
for each migration user that supports this feature. It will be used
later to mark when an ACK should be sent to the source.
Migration then starts sending precopy data, which includes the initial
precopy data. Initial precopy data is just like any other precopy data
and as such, the migration core is not aware of it. Therefore, it's the
responsibility of the migration users (such as VFIO devices) to notify
their counterparts in the destination that their initial precopy data
has been sent (for example, VFIO migration does this when its initial
bytes reach zero).
In the destination, when a migration user finishes receiving and
loading its initial data, it notifies the migration code and the
"initial_data_pending_num" counter is decreased. When this counter
reaches zero, all initial data has been loaded in the destination, an
ACK is sent to the source, and the source can then complete the
migration when appropriate.
=== Test results ===
The table below shows the downtime of two identical migrations: in the
first, precopy initial data is disabled; in the second, it is enabled.
The migrated VM is assigned a mlx5 VFIO device which has 300MB of
device data to be migrated.
+----------------------+-----------------------+----------+
| Precopy initial data | VFIO device data size | Downtime |
+----------------------+-----------------------+----------+
| Disabled | 300MB | 1900ms |
| Enabled | 300MB | 420ms |
+----------------------+-----------------------+----------+
Precopy initial data gives a roughly 4.5x improvement in downtime. The
1480ms difference is the time used for resource allocation for the
VFIO device in the destination. Without precopy initial data, this
time is spent while the source VM is stopped, so the downtime is much
higher. With precopy initial data, the time is spent while the source
VM is still running.
=== Patch breakdown ===
- Patches 1-4 add the precopy initial data capability.
- Patches 5-6 add VFIO migration precopy support. A similar version of
these patches was previously sent here [2].
- Patch 7 adds precopy initial data support for VFIO migration.
Thanks for reviewing!
[1]
https://elixir.bootlin.com/linux/latest/source/include/uapi/linux/vfio.h#L1048
[2]
https://lore.kernel.org/qemu-devel/20230222174915.5647-3-avihaih@nvidia.com/
[3]
https://lore.kernel.org/qemu-devel/20230501140141.11743-1-avihaih@nvidia.com/
Avihai Horon (7):
migration: Add precopy initial data capability
migration: Implement precopy initial data logic
migration: Enable precopy initial data capability
tests: Add migration precopy initial data capability test
vfio/migration: Refactor vfio_save_block() to return saved data size
vfio/migration: Add VFIO migration pre-copy support
vfio/migration: Add support for precopy initial data capability
docs/devel/vfio-migration.rst | 45 +++++--
qapi/migration.json | 9 +-
include/hw/vfio/vfio-common.h | 6 +
include/migration/register.h | 6 +
migration/migration.h | 14 +++
migration/options.h | 1 +
migration/savevm.h | 2 +
hw/core/machine.c | 1 +
hw/vfio/common.c | 6 +-
hw/vfio/migration.c | 220 +++++++++++++++++++++++++++++++---
hw/vfio/pci.c | 2 +
migration/migration.c | 40 ++++++-
migration/options.c | 17 +++
migration/savevm.c | 65 ++++++++++
tests/qtest/migration-test.c | 26 ++++
hw/vfio/trace-events | 4 +-
migration/trace-events | 4 +
17 files changed, 435 insertions(+), 33 deletions(-)
--
2.26.3
^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH v2 1/7] migration: Add precopy initial data capability
2023-05-17 15:52 [PATCH v2 0/7] migration: Add precopy initial data capability and VFIO precopy support Avihai Horon
@ 2023-05-17 15:52 ` Avihai Horon
2023-05-17 15:52 ` [PATCH v2 2/7] migration: Implement precopy initial data logic Avihai Horon
` (5 subsequent siblings)
6 siblings, 0 replies; 15+ messages in thread
From: Avihai Horon @ 2023-05-17 15:52 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cédric Le Goater, Eduardo Habkost,
Marcel Apfelbaum, Philippe Mathieu-Daudé, Yanan Wang,
Juan Quintela, Peter Xu, Leonardo Bras, Eric Blake,
Markus Armbruster, Thomas Huth, Laurent Vivier, Paolo Bonzini,
Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Avihai Horon,
Kirti Wankhede, Tarun Gupta, Joao Martins
Migration downtime estimation is calculated based on bandwidth and
remaining migration data. This assumes that loading of migration data in
the destination takes a negligible amount of time and that downtime
depends only on network speed.
While this may be true for RAM, it's not necessarily true for other
migration users. For example, loading the data of a VFIO device in the
destination might require the device to allocate resources, prepare
internal data structures and so on. These operations can take a
significant amount of time, which increases migration downtime.
This patch adds a new capability "precopy initial data" that allows the
source to send initial precopy data and the destination to ACK that this
data has been loaded. Migration will not attempt to stop the source VM
and complete the migration until this ACK is received.
This will allow migration users to send initial precopy data that can
be used to reduce downtime (e.g., by pre-allocating resources), while
making sure that the source stops the VM and completes the migration
only after this initial precopy data has been sent and loaded in the
destination, so it can take full effect.
This new capability relies on the return path capability to communicate
from the destination back to the source.
The actual implementation of the capability will be added in the
following patches.
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
qapi/migration.json | 9 ++++++++-
migration/options.h | 1 +
migration/options.c | 21 +++++++++++++++++++++
3 files changed, 30 insertions(+), 1 deletion(-)
diff --git a/qapi/migration.json b/qapi/migration.json
index 179af0c4d8..a6c1942064 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -487,6 +487,13 @@
# and should not affect the correctness of postcopy migration.
# (since 7.1)
#
+# @precopy-initial-data: If enabled, migration will not attempt to
+# stop source VM and complete the migration until an ACK is
+# received from the destination that initial precopy data has been
+# loaded. This can improve downtime if there are migration users
+# that support precopy initial data. 'return-path' capability
+# must be enabled to use it. (since 8.1)
+#
# Features:
#
# @unstable: Members @x-colo and @x-ignore-shared are experimental.
@@ -502,7 +509,7 @@
'dirty-bitmaps', 'postcopy-blocktime', 'late-block-activate',
{ 'name': 'x-ignore-shared', 'features': [ 'unstable' ] },
'validate-uuid', 'background-snapshot',
- 'zero-copy-send', 'postcopy-preempt'] }
+ 'zero-copy-send', 'postcopy-preempt', 'precopy-initial-data'] }
##
# @MigrationCapabilityStatus:
diff --git a/migration/options.h b/migration/options.h
index 5cca3326d6..bba70a33bf 100644
--- a/migration/options.h
+++ b/migration/options.h
@@ -44,6 +44,7 @@ bool migrate_pause_before_switchover(void);
bool migrate_postcopy_blocktime(void);
bool migrate_postcopy_preempt(void);
bool migrate_postcopy_ram(void);
+bool migrate_precopy_initial_data(void);
bool migrate_rdma_pin_all(void);
bool migrate_release_ram(void);
bool migrate_return_path(void);
diff --git a/migration/options.c b/migration/options.c
index c2a278ee2d..0a31921a7a 100644
--- a/migration/options.c
+++ b/migration/options.c
@@ -184,6 +184,8 @@ Property migration_properties[] = {
DEFINE_PROP_MIG_CAP("x-zero-copy-send",
MIGRATION_CAPABILITY_ZERO_COPY_SEND),
#endif
+ DEFINE_PROP_MIG_CAP("x-precopy-initial-data",
+ MIGRATION_CAPABILITY_PRECOPY_INITIAL_DATA),
DEFINE_PROP_END_OF_LIST(),
};
@@ -286,6 +288,13 @@ bool migrate_postcopy_ram(void)
return s->capabilities[MIGRATION_CAPABILITY_POSTCOPY_RAM];
}
+bool migrate_precopy_initial_data(void)
+{
+ MigrationState *s = migrate_get_current();
+
+ return s->capabilities[MIGRATION_CAPABILITY_PRECOPY_INITIAL_DATA];
+}
+
bool migrate_rdma_pin_all(void)
{
MigrationState *s = migrate_get_current();
@@ -546,6 +555,18 @@ bool migrate_caps_check(bool *old_caps, bool *new_caps, Error **errp)
}
}
+ if (new_caps[MIGRATION_CAPABILITY_PRECOPY_INITIAL_DATA]) {
+ if (!new_caps[MIGRATION_CAPABILITY_RETURN_PATH]) {
+ error_setg(errp, "Capability 'precopy-initial-data' requires "
+ "capability 'return-path'");
+ return false;
+ }
+
+ /* Disable this capability until it's implemented */
+ error_setg(errp, "'precopy-initial-data' is not implemented yet");
+ return false;
+ }
+
return true;
}
--
2.26.3
* [PATCH v2 2/7] migration: Implement precopy initial data logic
2023-05-17 15:52 [PATCH v2 0/7] migration: Add precopy initial data capability and VFIO precopy support Avihai Horon
2023-05-17 15:52 ` [PATCH v2 1/7] migration: Add precopy initial data capability Avihai Horon
@ 2023-05-17 15:52 ` Avihai Horon
2023-05-17 16:39 ` Peter Xu
2023-05-17 15:52 ` [PATCH v2 3/7] migration: Enable precopy initial data capability Avihai Horon
` (4 subsequent siblings)
6 siblings, 1 reply; 15+ messages in thread
From: Avihai Horon @ 2023-05-17 15:52 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cédric Le Goater, Eduardo Habkost,
Marcel Apfelbaum, Philippe Mathieu-Daudé, Yanan Wang,
Juan Quintela, Peter Xu, Leonardo Bras, Eric Blake,
Markus Armbruster, Thomas Huth, Laurent Vivier, Paolo Bonzini,
Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Avihai Horon,
Kirti Wankhede, Tarun Gupta, Joao Martins
Implement precopy initial data logic. This allows migration users in the
source to send precopy initial data and the destination to ACK when this
data is loaded. Migration will not attempt to stop the source VM and
complete the migration until this ACK is received.
To achieve this, a new SaveVMHandlers handler initial_data_advise() and
a new return path message MIG_RP_MSG_INITIAL_DATA_LOADED_ACK are added.
The initial_data_advise() handler is called during migration setup both
in the source and the destination to advise the migration user that
precopy initial data is used, and its return value indicates whether
precopy initial data is supported by it.
When precopy initial data of all the migration users is loaded in
the destination, the MIG_RP_MSG_INITIAL_DATA_LOADED_ACK return path
message is sent to the source to notify it.
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
include/migration/register.h | 6 ++++
migration/migration.h | 14 ++++++++
migration/savevm.h | 2 ++
migration/migration.c | 40 ++++++++++++++++++++--
migration/savevm.c | 65 ++++++++++++++++++++++++++++++++++++
migration/trace-events | 4 +++
6 files changed, 129 insertions(+), 2 deletions(-)
diff --git a/include/migration/register.h b/include/migration/register.h
index a8dfd8fefd..3ac443a55f 100644
--- a/include/migration/register.h
+++ b/include/migration/register.h
@@ -71,6 +71,12 @@ typedef struct SaveVMHandlers {
int (*load_cleanup)(void *opaque);
/* Called when postcopy migration wants to resume from failure */
int (*resume_prepare)(MigrationState *s, void *opaque);
+
+ /*
+ * Advises that precopy initial data was requested. Returns true if it's
+ * supported or false otherwise. Called both in src and dest.
+ */
+ bool (*initial_data_advise)(void *opaque);
} SaveVMHandlers;
int register_savevm_live(const char *idstr,
diff --git a/migration/migration.h b/migration/migration.h
index 7721c7658b..cc4e817939 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -202,6 +202,13 @@ struct MigrationIncomingState {
* contains valid information.
*/
QemuMutex page_request_mutex;
+
+ /*
+ * Number of migration users that are waiting for their initial data to be
+ * loaded. When this reaches zero an ACK is sent to source. No lock is
+ * needed as this field is updated serially.
+ */
+ unsigned int initial_data_pending_num;
};
MigrationIncomingState *migration_incoming_get_current(void);
@@ -430,6 +437,12 @@ struct MigrationState {
/* QEMU_VM_VMDESCRIPTION content filled for all non-iterable devices. */
JSONWriter *vmdesc;
+
+ /*
+ * Indicates whether an ACK that precopy initial data was loaded in
+ * destination has been received.
+ */
+ bool initial_data_loaded_acked;
};
void migrate_set_state(int *state, int old_state, int new_state);
@@ -470,6 +483,7 @@ int migrate_send_rp_message_req_pages(MigrationIncomingState *mis,
void migrate_send_rp_recv_bitmap(MigrationIncomingState *mis,
char *block_name);
void migrate_send_rp_resume_ack(MigrationIncomingState *mis, uint32_t value);
+int migrate_send_rp_initial_data_loaded_ack(MigrationIncomingState *mis);
void dirty_bitmap_mig_before_vm_start(void);
void dirty_bitmap_mig_cancel_outgoing(void);
diff --git a/migration/savevm.h b/migration/savevm.h
index fb636735f0..e1d8a2b3b2 100644
--- a/migration/savevm.h
+++ b/migration/savevm.h
@@ -32,6 +32,7 @@
bool qemu_savevm_state_blocked(Error **errp);
void qemu_savevm_non_migratable_list(strList **reasons);
void qemu_savevm_state_setup(QEMUFile *f);
+void qemu_savevm_state_initial_data_advise(MigrationState *ms);
bool qemu_savevm_state_guest_unplug_pending(void);
int qemu_savevm_state_resume_prepare(MigrationState *s);
void qemu_savevm_state_header(QEMUFile *f);
@@ -65,6 +66,7 @@ int qemu_loadvm_state(QEMUFile *f);
void qemu_loadvm_state_cleanup(void);
int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis);
int qemu_load_device_state(QEMUFile *f);
+int qemu_loadvm_notify_initial_data_loaded(void);
int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
bool in_postcopy, bool inactivate_disks);
diff --git a/migration/migration.c b/migration/migration.c
index 00d8ba8da0..fdb8592e64 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -78,6 +78,7 @@ enum mig_rp_message_type {
MIG_RP_MSG_REQ_PAGES, /* data (start: be64, len: be32) */
MIG_RP_MSG_RECV_BITMAP, /* send recved_bitmap back to source */
MIG_RP_MSG_RESUME_ACK, /* tell source that we are ready to resume */
+ MIG_RP_MSG_INITIAL_DATA_LOADED_ACK, /* Tell source initial data is loaded */
MIG_RP_MSG_MAX
};
@@ -780,6 +781,12 @@ bool migration_has_all_channels(void)
return true;
}
+int migrate_send_rp_initial_data_loaded_ack(MigrationIncomingState *mis)
+{
+ return migrate_send_rp_message(mis, MIG_RP_MSG_INITIAL_DATA_LOADED_ACK, 0,
+ NULL);
+}
+
/*
* Send a 'SHUT' message on the return channel with the given value
* to indicate that we've finished with the RP. Non-0 value indicates
@@ -1425,6 +1432,7 @@ void migrate_init(MigrationState *s)
s->vm_was_running = false;
s->iteration_initial_bytes = 0;
s->threshold_size = 0;
+ s->initial_data_loaded_acked = false;
}
int migrate_add_blocker_internal(Error *reason, Error **errp)
@@ -1741,6 +1749,9 @@ static struct rp_cmd_args {
[MIG_RP_MSG_REQ_PAGES_ID] = { .len = -1, .name = "REQ_PAGES_ID" },
[MIG_RP_MSG_RECV_BITMAP] = { .len = -1, .name = "RECV_BITMAP" },
[MIG_RP_MSG_RESUME_ACK] = { .len = 4, .name = "RESUME_ACK" },
+ [MIG_RP_MSG_INITIAL_DATA_LOADED_ACK] = { .len = 0,
+ .name =
+ "INITIAL_DATA_LOADED_ACK" },
[MIG_RP_MSG_MAX] = { .len = -1, .name = "MAX" },
};
@@ -1979,6 +1990,11 @@ retry:
}
break;
+ case MIG_RP_MSG_INITIAL_DATA_LOADED_ACK:
+ ms->initial_data_loaded_acked = true;
+ trace_source_return_path_thread_initial_data_loaded_ack();
+ break;
+
default:
break;
}
@@ -2727,6 +2743,20 @@ static void migration_update_counters(MigrationState *s,
bandwidth, s->threshold_size);
}
+static bool initial_data_loaded_acked(MigrationState *s)
+{
+ if (!migrate_precopy_initial_data()) {
+ return true;
+ }
+
+ /* No reason to wait for precopy initial data loaded ACK if VM is stopped */
+ if (!runstate_is_running()) {
+ return true;
+ }
+
+ return s->initial_data_loaded_acked;
+}
+
/* Migration thread iteration status */
typedef enum {
MIG_ITERATE_RESUME, /* Resume current iteration */
@@ -2742,6 +2772,7 @@ static MigIterateState migration_iteration_run(MigrationState *s)
{
uint64_t must_precopy, can_postcopy;
bool in_postcopy = s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE;
+ bool initial_data_loaded = initial_data_loaded_acked(s);
qemu_savevm_state_pending_estimate(&must_precopy, &can_postcopy);
uint64_t pending_size = must_precopy + can_postcopy;
@@ -2754,7 +2785,8 @@ static MigIterateState migration_iteration_run(MigrationState *s)
trace_migrate_pending_exact(pending_size, must_precopy, can_postcopy);
}
- if (!pending_size || pending_size < s->threshold_size) {
+ if ((!pending_size || pending_size < s->threshold_size) &&
+ initial_data_loaded) {
trace_migration_thread_low_pending(pending_size);
migration_completion(s);
return MIG_ITERATE_BREAK;
@@ -2762,7 +2794,7 @@ static MigIterateState migration_iteration_run(MigrationState *s)
/* Still a significant amount to transfer */
if (!in_postcopy && must_precopy <= s->threshold_size &&
- qatomic_read(&s->start_postcopy)) {
+ initial_data_loaded && qatomic_read(&s->start_postcopy)) {
if (postcopy_start(s)) {
error_report("%s: postcopy failed to start", __func__);
}
@@ -2986,6 +3018,10 @@ static void *migration_thread(void *opaque)
qemu_savevm_state_setup(s->to_dst_file);
+ if (migrate_precopy_initial_data()) {
+ qemu_savevm_state_initial_data_advise(s);
+ }
+
qemu_savevm_wait_unplug(s, MIGRATION_STATUS_SETUP,
MIGRATION_STATUS_ACTIVE);
diff --git a/migration/savevm.c b/migration/savevm.c
index e33788343a..c713ace891 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -1233,6 +1233,32 @@ bool qemu_savevm_state_guest_unplug_pending(void)
return false;
}
+void qemu_savevm_state_initial_data_advise(MigrationState *ms)
+{
+ SaveStateEntry *se;
+ unsigned int supported_num = 0;
+
+ QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+ if (!se->ops || !se->ops->initial_data_advise) {
+ continue;
+ }
+
+ if (se->ops->initial_data_advise(se->opaque)) {
+ supported_num++;
+ }
+ }
+
+ if (!supported_num) {
+ /*
+ * There are no migration users that support precopy initial data. Set
+ * initial data loaded acked to true so migration can be completed.
+ */
+ ms->initial_data_loaded_acked = true;
+ }
+
+ trace_savevm_state_initial_data_advise(supported_num);
+}
+
void qemu_savevm_state_setup(QEMUFile *f)
{
MigrationState *ms = migrate_get_current();
@@ -2586,6 +2612,23 @@ static int qemu_loadvm_state_header(QEMUFile *f)
return 0;
}
+static void qemu_loadvm_state_initial_data_advise(MigrationIncomingState *mis)
+{
+ SaveStateEntry *se;
+
+ QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+ if (!se->ops || !se->ops->initial_data_advise) {
+ continue;
+ }
+
+ if (se->ops->initial_data_advise(se->opaque)) {
+ mis->initial_data_pending_num++;
+ }
+ }
+
+ trace_loadvm_state_initial_data_advise(mis->initial_data_pending_num);
+}
+
static int qemu_loadvm_state_setup(QEMUFile *f)
{
SaveStateEntry *se;
@@ -2789,6 +2832,10 @@ int qemu_loadvm_state(QEMUFile *f)
return -EINVAL;
}
+ if (migrate_precopy_initial_data()) {
+ qemu_loadvm_state_initial_data_advise(mis);
+ }
+
cpu_synchronize_all_pre_loadvm();
ret = qemu_loadvm_state_main(f, mis);
@@ -2862,6 +2909,24 @@ int qemu_load_device_state(QEMUFile *f)
return 0;
}
+int qemu_loadvm_notify_initial_data_loaded(void)
+{
+ MigrationIncomingState *mis = migration_incoming_get_current();
+
+ if (!mis->initial_data_pending_num) {
+ return -EINVAL;
+ }
+
+ mis->initial_data_pending_num--;
+ trace_loadvm_notify_initial_data_loaded(mis->initial_data_pending_num);
+
+ if (mis->initial_data_pending_num) {
+ return 0;
+ }
+
+ return migrate_send_rp_initial_data_loaded_ack(mis);
+}
+
bool save_snapshot(const char *name, bool overwrite, const char *vmstate,
bool has_devices, strList *devices, Error **errp)
{
diff --git a/migration/trace-events b/migration/trace-events
index f39818c329..807083c0a1 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -7,6 +7,7 @@ qemu_loadvm_state_section_partend(uint32_t section_id) "%u"
qemu_loadvm_state_post_main(int ret) "%d"
qemu_loadvm_state_section_startfull(uint32_t section_id, const char *idstr, uint32_t instance_id, uint32_t version_id) "%u(%s) %u %u"
qemu_savevm_send_packaged(void) ""
+loadvm_state_initial_data_advise(unsigned int initial_data_pending_num) "Initial data pending num=%u"
loadvm_state_setup(void) ""
loadvm_state_cleanup(void) ""
loadvm_handle_cmd_packaged(unsigned int length) "%u"
@@ -23,6 +24,7 @@ loadvm_postcopy_ram_handle_discard_end(void) ""
loadvm_postcopy_ram_handle_discard_header(const char *ramid, uint16_t len) "%s: %ud"
loadvm_process_command(const char *s, uint16_t len) "com=%s len=%d"
loadvm_process_command_ping(uint32_t val) "0x%x"
+loadvm_notify_initial_data_loaded(unsigned int initial_data_pending_num) "Initial data pending num=%u"
postcopy_ram_listen_thread_exit(void) ""
postcopy_ram_listen_thread_start(void) ""
qemu_savevm_send_postcopy_advise(void) ""
@@ -39,6 +41,7 @@ savevm_send_postcopy_resume(void) ""
savevm_send_colo_enable(void) ""
savevm_send_recv_bitmap(char *name) "%s"
savevm_state_setup(void) ""
+savevm_state_initial_data_advise(unsigned int initial_data_supported_num) "Initial data supported num=%u"
savevm_state_resume_prepare(void) ""
savevm_state_header(void) ""
savevm_state_iterate(void) ""
@@ -180,6 +183,7 @@ source_return_path_thread_loop_top(void) ""
source_return_path_thread_pong(uint32_t val) "0x%x"
source_return_path_thread_shut(uint32_t val) "0x%x"
source_return_path_thread_resume_ack(uint32_t v) "%"PRIu32
+source_return_path_thread_initial_data_loaded_ack(void) ""
migration_thread_low_pending(uint64_t pending) "%" PRIu64
migrate_transferred(uint64_t tranferred, uint64_t time_spent, uint64_t bandwidth, uint64_t size) "transferred %" PRIu64 " time_spent %" PRIu64 " bandwidth %" PRIu64 " max_size %" PRId64
process_incoming_migration_co_end(int ret, int ps) "ret=%d postcopy-state=%d"
--
2.26.3
* [PATCH v2 3/7] migration: Enable precopy initial data capability
2023-05-17 15:52 [PATCH v2 0/7] migration: Add precopy initial data capability and VFIO precopy support Avihai Horon
2023-05-17 15:52 ` [PATCH v2 1/7] migration: Add precopy initial data capability Avihai Horon
2023-05-17 15:52 ` [PATCH v2 2/7] migration: Implement precopy initial data logic Avihai Horon
@ 2023-05-17 15:52 ` Avihai Horon
2023-05-17 16:07 ` Peter Xu
2023-05-17 15:52 ` [PATCH v2 4/7] tests: Add migration precopy initial data capability test Avihai Horon
` (3 subsequent siblings)
6 siblings, 1 reply; 15+ messages in thread
From: Avihai Horon @ 2023-05-17 15:52 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cédric Le Goater, Eduardo Habkost,
Marcel Apfelbaum, Philippe Mathieu-Daudé, Yanan Wang,
Juan Quintela, Peter Xu, Leonardo Bras, Eric Blake,
Markus Armbruster, Thomas Huth, Laurent Vivier, Paolo Bonzini,
Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Avihai Horon,
Kirti Wankhede, Tarun Gupta, Joao Martins
Now that precopy initial data logic has been implemented, enable the
capability.
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
---
migration/options.c | 4 ----
1 file changed, 4 deletions(-)
diff --git a/migration/options.c b/migration/options.c
index 0a31921a7a..3449ce4f14 100644
--- a/migration/options.c
+++ b/migration/options.c
@@ -561,10 +561,6 @@ bool migrate_caps_check(bool *old_caps, bool *new_caps, Error **errp)
"capability 'return-path'");
return false;
}
-
- /* Disable this capability until it's implemented */
- error_setg(errp, "'precopy-initial-data' is not implemented yet");
- return false;
}
return true;
--
2.26.3
* [PATCH v2 4/7] tests: Add migration precopy initial data capability test
2023-05-17 15:52 [PATCH v2 0/7] migration: Add precopy initial data capability and VFIO precopy support Avihai Horon
` (2 preceding siblings ...)
2023-05-17 15:52 ` [PATCH v2 3/7] migration: Enable precopy initial data capability Avihai Horon
@ 2023-05-17 15:52 ` Avihai Horon
2023-05-17 16:40 ` Peter Xu
2023-05-17 15:52 ` [PATCH v2 5/7] vfio/migration: Refactor vfio_save_block() to return saved data size Avihai Horon
` (2 subsequent siblings)
6 siblings, 1 reply; 15+ messages in thread
From: Avihai Horon @ 2023-05-17 15:52 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cédric Le Goater, Eduardo Habkost,
Marcel Apfelbaum, Philippe Mathieu-Daudé, Yanan Wang,
Juan Quintela, Peter Xu, Leonardo Bras, Eric Blake,
Markus Armbruster, Thomas Huth, Laurent Vivier, Paolo Bonzini,
Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Avihai Horon,
Kirti Wankhede, Tarun Gupta, Joao Martins
Add a migration precopy initial data capability test. The test runs
without any migration user that supports this capability, but it is
still useful to make sure the capability didn't break anything.
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
---
tests/qtest/migration-test.c | 26 ++++++++++++++++++++++++++
1 file changed, 26 insertions(+)
diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index b99b49a314..3b3c806104 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -1648,6 +1648,28 @@ static void test_precopy_tcp_plain(void)
test_precopy_common(&args);
}
+static void *test_migrate_initial_data_start(QTestState *from, QTestState *to)
+{
+
+ migrate_set_capability(from, "return-path", true);
+ migrate_set_capability(to, "return-path", true);
+
+ migrate_set_capability(from, "precopy-initial-data", true);
+ migrate_set_capability(to, "precopy-initial-data", true);
+
+ return NULL;
+}
+
+static void test_precopy_tcp_initial_data(void)
+{
+ MigrateCommon args = {
+ .listen_uri = "tcp:127.0.0.1:0",
+ .start_hook = test_migrate_initial_data_start,
+ };
+
+ test_precopy_common(&args);
+}
+
#ifdef CONFIG_GNUTLS
static void test_precopy_tcp_tls_psk_match(void)
{
@@ -2695,6 +2717,10 @@ int main(int argc, char **argv)
#endif /* CONFIG_GNUTLS */
qtest_add_func("/migration/precopy/tcp/plain", test_precopy_tcp_plain);
+
+ qtest_add_func("/migration/precopy/tcp/plain/precopy-initial-data",
+ test_precopy_tcp_initial_data);
+
#ifdef CONFIG_GNUTLS
qtest_add_func("/migration/precopy/tcp/tls/psk/match",
test_precopy_tcp_tls_psk_match);
--
2.26.3
* [PATCH v2 5/7] vfio/migration: Refactor vfio_save_block() to return saved data size
2023-05-17 15:52 [PATCH v2 0/7] migration: Add precopy initial data capability and VFIO precopy support Avihai Horon
` (3 preceding siblings ...)
2023-05-17 15:52 ` [PATCH v2 4/7] tests: Add migration precopy initial data capability test Avihai Horon
@ 2023-05-17 15:52 ` Avihai Horon
2023-05-17 15:52 ` [PATCH v2 6/7] vfio/migration: Add VFIO migration pre-copy support Avihai Horon
2023-05-17 15:52 ` [PATCH v2 7/7] vfio/migration: Add support for precopy initial data capability Avihai Horon
6 siblings, 0 replies; 15+ messages in thread
From: Avihai Horon @ 2023-05-17 15:52 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cédric Le Goater, Eduardo Habkost,
Marcel Apfelbaum, Philippe Mathieu-Daudé, Yanan Wang,
Juan Quintela, Peter Xu, Leonardo Bras, Eric Blake,
Markus Armbruster, Thomas Huth, Laurent Vivier, Paolo Bonzini,
Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Avihai Horon,
Kirti Wankhede, Tarun Gupta, Joao Martins
Refactor vfio_save_block() to return the size of saved data on success
and -errno on error.
This will be used in the next patch to implement VFIO migration
pre-copy support.
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
---
hw/vfio/migration.c | 17 +++++++++--------
1 file changed, 9 insertions(+), 8 deletions(-)
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 6b58dddb88..235978fd68 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -241,8 +241,8 @@ static int vfio_query_stop_copy_size(VFIODevice *vbasedev,
return 0;
}
-/* Returns 1 if end-of-stream is reached, 0 if more data and -errno if error */
-static int vfio_save_block(QEMUFile *f, VFIOMigration *migration)
+/* Returns the size of saved data on success and -errno on error */
+static ssize_t vfio_save_block(QEMUFile *f, VFIOMigration *migration)
{
ssize_t data_size;
@@ -252,7 +252,7 @@ static int vfio_save_block(QEMUFile *f, VFIOMigration *migration)
return -errno;
}
if (data_size == 0) {
- return 1;
+ return 0;
}
qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
@@ -262,7 +262,7 @@ static int vfio_save_block(QEMUFile *f, VFIOMigration *migration)
trace_vfio_save_block(migration->vbasedev->name, data_size);
- return qemu_file_get_error(f);
+ return qemu_file_get_error(f) ?: data_size;
}
/* ---------------------------------------------------------------------- */
@@ -335,6 +335,7 @@ static void vfio_state_pending_exact(void *opaque, uint64_t *must_precopy,
static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
{
VFIODevice *vbasedev = opaque;
+ ssize_t data_size;
int ret;
/* We reach here with device state STOP only */
@@ -345,11 +346,11 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
}
do {
- ret = vfio_save_block(f, vbasedev->migration);
- if (ret < 0) {
- return ret;
+ data_size = vfio_save_block(f, vbasedev->migration);
+ if (data_size < 0) {
+ return data_size;
}
- } while (!ret);
+ } while (data_size);
qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
ret = qemu_file_get_error(f);
--
2.26.3
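The contract this refactor introduces — positive byte count on success, 0 at end-of-stream, `-errno` on failure, with GNU C's `?:` shorthand letting a stream error override the count — can be sketched outside QEMU as a pure function. The function and parameter names below are illustrative stand-ins, not QEMU APIs:

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>

/*
 * Illustrative mirror of the refactored vfio_save_block() contract:
 *   read_ret < 0  -> the read failed, surface -read_errno
 *   read_ret == 0 -> end of stream (the pre-refactor "return 1" case)
 *   read_ret > 0  -> bytes were saved; a non-zero stream error still
 *                    wins, via GNU C's "a ?: b" (a if non-zero, else b)
 */
static ssize_t save_block_result(ssize_t read_ret, int read_errno,
                                 int file_error)
{
    if (read_ret < 0) {
        return -read_errno;
    }
    if (read_ret == 0) {
        return 0;
    }
    return file_error ?: read_ret;
}
```

Note that a caller storing the result in an `int` (as the old code did with `ret`) would truncate large sizes, which is why the patch also switches `vfio_save_complete_precopy()` to a `ssize_t data_size` local.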
* [PATCH v2 6/7] vfio/migration: Add VFIO migration pre-copy support
2023-05-17 15:52 [PATCH v2 0/7] migration: Add precopy initial data capability and VFIO precopy support Avihai Horon
` (4 preceding siblings ...)
2023-05-17 15:52 ` [PATCH v2 5/7] vfio/migration: Refactor vfio_save_block() to return saved data size Avihai Horon
@ 2023-05-17 15:52 ` Avihai Horon
2023-05-17 15:52 ` [PATCH v2 7/7] vfio/migration: Add support for precopy initial data capability Avihai Horon
6 siblings, 0 replies; 15+ messages in thread
From: Avihai Horon @ 2023-05-17 15:52 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cédric Le Goater, Eduardo Habkost,
Marcel Apfelbaum, Philippe Mathieu-Daudé, Yanan Wang,
Juan Quintela, Peter Xu, Leonardo Bras, Eric Blake,
Markus Armbruster, Thomas Huth, Laurent Vivier, Paolo Bonzini,
Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Avihai Horon,
Kirti Wankhede, Tarun Gupta, Joao Martins
Pre-copy support allows the VFIO device data to be transferred while the
VM is running. This helps to accommodate VFIO devices that have a large
amount of data that needs to be transferred, and it can reduce migration
downtime.
Pre-copy support is optional in VFIO migration protocol v2.
Implement pre-copy of VFIO migration protocol v2 and use it for devices
that support it. A full description of it can be found in [1].
In addition, add a new VFIO device property x-allow-pre-copy to keep
migration compatibility to/from older QEMU versions that don't have VFIO
pre-copy support.
[1]
https://lore.kernel.org/kvm/20221206083438.37807-3-yishaih@nvidia.com/
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
docs/devel/vfio-migration.rst | 35 +++++---
include/hw/vfio/vfio-common.h | 4 +
hw/core/machine.c | 1 +
hw/vfio/common.c | 6 +-
hw/vfio/migration.c | 163 ++++++++++++++++++++++++++++++++--
hw/vfio/pci.c | 2 +
hw/vfio/trace-events | 4 +-
7 files changed, 193 insertions(+), 22 deletions(-)
diff --git a/docs/devel/vfio-migration.rst b/docs/devel/vfio-migration.rst
index 1b68ccf115..e896b2a673 100644
--- a/docs/devel/vfio-migration.rst
+++ b/docs/devel/vfio-migration.rst
@@ -7,12 +7,14 @@ the guest is running on source host and restoring this saved state on the
destination host. This document details how saving and restoring of VFIO
devices is done in QEMU.
-Migration of VFIO devices currently consists of a single stop-and-copy phase.
-During the stop-and-copy phase the guest is stopped and the entire VFIO device
-data is transferred to the destination.
-
-The pre-copy phase of migration is currently not supported for VFIO devices.
-Support for VFIO pre-copy will be added later on.
+Migration of VFIO devices consists of two phases: the optional pre-copy phase
+and the stop-and-copy phase. The pre-copy phase is iterative and makes it
+possible to accommodate VFIO devices that have a large amount of data that
+needs to be transferred. The iterative pre-copy phase allows the guest to
+continue running while the VFIO device state is transferred to the
+destination; this helps to reduce the total downtime of the VM. VFIO devices
+opt in to pre-copy support by reporting the VFIO_MIGRATION_PRE_COPY flag in
+the VFIO_DEVICE_FEATURE_MIGRATION ioctl.
Note that currently VFIO migration is supported only for a single device. This
is due to VFIO migration's lack of P2P support. However, P2P support is planned
@@ -29,10 +31,20 @@ VFIO implements the device hooks for the iterative approach as follows:
* A ``load_setup`` function that sets the VFIO device on the destination in
_RESUMING state.
+* A ``state_pending_estimate`` function that reports an estimate of the
+ remaining pre-copy data that the vendor driver has yet to save for the VFIO
+ device.
+
* A ``state_pending_exact`` function that reads pending_bytes from the vendor
driver, which indicates the amount of data that the vendor driver has yet to
save for the VFIO device.
+* An ``is_active_iterate`` function that indicates ``save_live_iterate`` is
+ active only when the VFIO device is in pre-copy states.
+
+* A ``save_live_iterate`` function that reads the VFIO device's data from the
+ vendor driver during iterative pre-copy phase.
+
* A ``save_state`` function to save the device config space if it is present.
* A ``save_live_complete_precopy`` function that sets the VFIO device in
@@ -111,8 +123,10 @@ Flow of state changes during Live migration
===========================================
Below is the flow of state change during live migration.
-The values in the brackets represent the VM state, the migration state, and
+The values in the parentheses represent the VM state, the migration state, and
the VFIO device state, respectively.
+The text in the square brackets represents the flow if the VFIO device supports
+pre-copy.
Live migration save path
------------------------
@@ -124,11 +138,12 @@ Live migration save path
|
migrate_init spawns migration_thread
Migration thread then calls each device's .save_setup()
- (RUNNING, _SETUP, _RUNNING)
+ (RUNNING, _SETUP, _RUNNING [_PRE_COPY])
|
- (RUNNING, _ACTIVE, _RUNNING)
- If device is active, get pending_bytes by .state_pending_exact()
+ (RUNNING, _ACTIVE, _RUNNING [_PRE_COPY])
+ If device is active, get pending_bytes by .state_pending_{estimate,exact}()
If total pending_bytes >= threshold_size, call .save_live_iterate()
+ [Data of VFIO device for pre-copy phase is copied]
Iterate till total pending bytes converge and are less than threshold
|
On migration completion, vCPU stops and calls .save_live_complete_precopy for
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index eed244f25f..5ce7a01d56 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -66,6 +66,9 @@ typedef struct VFIOMigration {
int data_fd;
void *data_buffer;
size_t data_buffer_size;
+ uint64_t precopy_init_size;
+ uint64_t precopy_dirty_size;
+ uint64_t mig_flags;
} VFIOMigration;
typedef struct VFIOAddressSpace {
@@ -143,6 +146,7 @@ typedef struct VFIODevice {
VFIOMigration *migration;
Error *migration_blocker;
OnOffAuto pre_copy_dirty_page_tracking;
+ bool allow_pre_copy;
bool dirty_pages_supported;
bool dirty_tracking;
} VFIODevice;
diff --git a/hw/core/machine.c b/hw/core/machine.c
index 47a34841a5..cde449a6c9 100644
--- a/hw/core/machine.c
+++ b/hw/core/machine.c
@@ -41,6 +41,7 @@
GlobalProperty hw_compat_8_0[] = {
{ "migration", "multifd-flush-after-each-section", "on"},
+ { "vfio-pci", "x-allow-pre-copy", "false" },
};
const size_t hw_compat_8_0_len = G_N_ELEMENTS(hw_compat_8_0);
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 78358ede27..b73086e17a 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -492,7 +492,8 @@ static bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
}
if (vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF &&
- migration->device_state == VFIO_DEVICE_STATE_RUNNING) {
+ (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
+ migration->device_state == VFIO_DEVICE_STATE_PRE_COPY)) {
return false;
}
}
@@ -537,7 +538,8 @@ static bool vfio_devices_all_running_and_mig_active(VFIOContainer *container)
return false;
}
- if (migration->device_state == VFIO_DEVICE_STATE_RUNNING) {
+ if (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
+ migration->device_state == VFIO_DEVICE_STATE_PRE_COPY) {
continue;
} else {
return false;
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 235978fd68..418efed019 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -68,6 +68,8 @@ static const char *mig_state_to_str(enum vfio_device_mig_state state)
return "STOP_COPY";
case VFIO_DEVICE_STATE_RESUMING:
return "RESUMING";
+ case VFIO_DEVICE_STATE_PRE_COPY:
+ return "PRE_COPY";
default:
return "UNKNOWN STATE";
}
@@ -241,6 +243,22 @@ static int vfio_query_stop_copy_size(VFIODevice *vbasedev,
return 0;
}
+static int vfio_query_precopy_size(VFIOMigration *migration)
+{
+ struct vfio_precopy_info precopy = {
+ .argsz = sizeof(precopy),
+ };
+
+ if (ioctl(migration->data_fd, VFIO_MIG_GET_PRECOPY_INFO, &precopy)) {
+ return -errno;
+ }
+
+ migration->precopy_init_size = precopy.initial_bytes;
+ migration->precopy_dirty_size = precopy.dirty_bytes;
+
+ return 0;
+}
+
/* Returns the size of saved data on success and -errno on error */
static ssize_t vfio_save_block(QEMUFile *f, VFIOMigration *migration)
{
@@ -249,6 +267,11 @@ static ssize_t vfio_save_block(QEMUFile *f, VFIOMigration *migration)
data_size = read(migration->data_fd, migration->data_buffer,
migration->data_buffer_size);
if (data_size < 0) {
+ /* Pre-copy emptied all the device state for now */
+ if (errno == ENOMSG) {
+ return 0;
+ }
+
return -errno;
}
if (data_size == 0) {
@@ -265,6 +288,39 @@ static ssize_t vfio_save_block(QEMUFile *f, VFIOMigration *migration)
return qemu_file_get_error(f) ?: data_size;
}
+static void vfio_update_estimated_pending_data(VFIOMigration *migration,
+ uint64_t data_size)
+{
+ if (!data_size) {
+ /*
+ * Pre-copy emptied all the device state for now, update estimated sizes
+ * accordingly.
+ */
+ migration->precopy_init_size = 0;
+ migration->precopy_dirty_size = 0;
+
+ return;
+ }
+
+ if (migration->precopy_init_size) {
+ uint64_t init_size = MIN(migration->precopy_init_size, data_size);
+
+ migration->precopy_init_size -= init_size;
+ data_size -= init_size;
+ }
+
+ migration->precopy_dirty_size -= MIN(migration->precopy_dirty_size,
+ data_size);
+}
+
+static bool vfio_precopy_supported(VFIODevice *vbasedev)
+{
+ VFIOMigration *migration = vbasedev->migration;
+
+ return vbasedev->allow_pre_copy &&
+ migration->mig_flags & VFIO_MIGRATION_PRE_COPY;
+}
+
/* ---------------------------------------------------------------------- */
static int vfio_save_setup(QEMUFile *f, void *opaque)
@@ -285,6 +341,31 @@ static int vfio_save_setup(QEMUFile *f, void *opaque)
return -ENOMEM;
}
+ if (vfio_precopy_supported(vbasedev)) {
+ int ret;
+
+ migration->precopy_init_size = 0;
+ migration->precopy_dirty_size = 0;
+
+ switch (migration->device_state) {
+ case VFIO_DEVICE_STATE_RUNNING:
+ ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_PRE_COPY,
+ VFIO_DEVICE_STATE_RUNNING);
+ if (ret) {
+ return ret;
+ }
+
+ vfio_query_precopy_size(migration);
+
+ break;
+ case VFIO_DEVICE_STATE_STOP:
+ /* vfio_save_complete_precopy() will go to STOP_COPY */
+ break;
+ default:
+ return -EINVAL;
+ }
+ }
+
trace_vfio_save_setup(vbasedev->name, migration->data_buffer_size);
qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
@@ -303,22 +384,36 @@ static void vfio_save_cleanup(void *opaque)
trace_vfio_save_cleanup(vbasedev->name);
}
+static void vfio_state_pending_estimate(void *opaque, uint64_t *must_precopy,
+ uint64_t *can_postcopy)
+{
+ VFIODevice *vbasedev = opaque;
+ VFIOMigration *migration = vbasedev->migration;
+
+ if (migration->device_state != VFIO_DEVICE_STATE_PRE_COPY) {
+ return;
+ }
+
+ *must_precopy +=
+ migration->precopy_init_size + migration->precopy_dirty_size;
+
+ trace_vfio_state_pending_estimate(vbasedev->name, *must_precopy,
+ *can_postcopy,
+ migration->precopy_init_size,
+ migration->precopy_dirty_size);
+}
+
/*
* Migration size of VFIO devices can be as little as a few KBs or as big as
* many GBs. This value should be big enough to cover the worst case.
*/
#define VFIO_MIG_STOP_COPY_SIZE (100 * GiB)
-/*
- * Only exact function is implemented and not estimate function. The reason is
- * that during pre-copy phase of migration the estimate function is called
- * repeatedly while pending RAM size is over the threshold, thus migration
- * can't converge and querying the VFIO device pending data size is useless.
- */
static void vfio_state_pending_exact(void *opaque, uint64_t *must_precopy,
uint64_t *can_postcopy)
{
VFIODevice *vbasedev = opaque;
+ VFIOMigration *migration = vbasedev->migration;
uint64_t stop_copy_size = VFIO_MIG_STOP_COPY_SIZE;
/*
@@ -328,8 +423,49 @@ static void vfio_state_pending_exact(void *opaque, uint64_t *must_precopy,
vfio_query_stop_copy_size(vbasedev, &stop_copy_size);
*must_precopy += stop_copy_size;
+ if (migration->device_state == VFIO_DEVICE_STATE_PRE_COPY) {
+ migration->precopy_init_size = 0;
+ migration->precopy_dirty_size = 0;
+ vfio_query_precopy_size(migration);
+
+ *must_precopy +=
+ migration->precopy_init_size + migration->precopy_dirty_size;
+ }
+
trace_vfio_state_pending_exact(vbasedev->name, *must_precopy, *can_postcopy,
- stop_copy_size);
+ stop_copy_size, migration->precopy_init_size,
+ migration->precopy_dirty_size);
+}
+
+static bool vfio_is_active_iterate(void *opaque)
+{
+ VFIODevice *vbasedev = opaque;
+ VFIOMigration *migration = vbasedev->migration;
+
+ return migration->device_state == VFIO_DEVICE_STATE_PRE_COPY;
+}
+
+static int vfio_save_iterate(QEMUFile *f, void *opaque)
+{
+ VFIODevice *vbasedev = opaque;
+ VFIOMigration *migration = vbasedev->migration;
+ ssize_t data_size;
+
+ data_size = vfio_save_block(f, migration);
+ if (data_size < 0) {
+ return data_size;
+ }
+ qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
+
+ vfio_update_estimated_pending_data(migration, data_size);
+
+ trace_vfio_save_iterate(vbasedev->name);
+
+ /*
+ * A VFIO device's pre-copy dirty_bytes is not guaranteed to reach zero.
+ * Return 1 so that the following handlers are not potentially blocked.
+ */
+ return 1;
}
static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
@@ -338,7 +474,7 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
ssize_t data_size;
int ret;
- /* We reach here with device state STOP only */
+ /* We reach here with device state STOP or STOP_COPY only */
ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP_COPY,
VFIO_DEVICE_STATE_STOP);
if (ret) {
@@ -457,7 +593,10 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
static const SaveVMHandlers savevm_vfio_handlers = {
.save_setup = vfio_save_setup,
.save_cleanup = vfio_save_cleanup,
+ .state_pending_estimate = vfio_state_pending_estimate,
.state_pending_exact = vfio_state_pending_exact,
+ .is_active_iterate = vfio_is_active_iterate,
+ .save_live_iterate = vfio_save_iterate,
.save_live_complete_precopy = vfio_save_complete_precopy,
.save_state = vfio_save_state,
.load_setup = vfio_load_setup,
@@ -470,13 +609,18 @@ static const SaveVMHandlers savevm_vfio_handlers = {
static void vfio_vmstate_change(void *opaque, bool running, RunState state)
{
VFIODevice *vbasedev = opaque;
+ VFIOMigration *migration = vbasedev->migration;
enum vfio_device_mig_state new_state;
int ret;
if (running) {
new_state = VFIO_DEVICE_STATE_RUNNING;
} else {
- new_state = VFIO_DEVICE_STATE_STOP;
+ new_state =
+ (migration->device_state == VFIO_DEVICE_STATE_PRE_COPY &&
+ (state == RUN_STATE_FINISH_MIGRATE || state == RUN_STATE_PAUSED)) ?
+ VFIO_DEVICE_STATE_STOP_COPY :
+ VFIO_DEVICE_STATE_STOP;
}
/*
@@ -603,6 +747,7 @@ static int vfio_migration_init(VFIODevice *vbasedev)
migration->vbasedev = vbasedev;
migration->device_state = VFIO_DEVICE_STATE_RUNNING;
migration->data_fd = -1;
+ migration->mig_flags = mig_flags;
vbasedev->dirty_pages_supported = vfio_dma_logging_supported(vbasedev);
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index bf27a39905..72f30ce09f 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -3335,6 +3335,8 @@ static Property vfio_pci_dev_properties[] = {
DEFINE_PROP_ON_OFF_AUTO("x-pre-copy-dirty-page-tracking", VFIOPCIDevice,
vbasedev.pre_copy_dirty_page_tracking,
ON_OFF_AUTO_ON),
+ DEFINE_PROP_BOOL("x-allow-pre-copy", VFIOPCIDevice,
+ vbasedev.allow_pre_copy, true),
DEFINE_PROP_ON_OFF_AUTO("display", VFIOPCIDevice,
display, ON_OFF_AUTO_OFF),
DEFINE_PROP_UINT32("xres", VFIOPCIDevice, display_xres, 0),
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 646e42fd27..fd6893cb43 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -162,6 +162,8 @@ vfio_save_block(const char *name, int data_size) " (%s) data_size %d"
vfio_save_cleanup(const char *name) " (%s)"
vfio_save_complete_precopy(const char *name, int ret) " (%s) ret %d"
vfio_save_device_config_state(const char *name) " (%s)"
+vfio_save_iterate(const char *name) " (%s)"
vfio_save_setup(const char *name, uint64_t data_buffer_size) " (%s) data buffer size 0x%"PRIx64
-vfio_state_pending_exact(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t stopcopy_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" stopcopy size 0x%"PRIx64
+vfio_state_pending_estimate(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64
+vfio_state_pending_exact(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t stopcopy_size, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" stopcopy size 0x%"PRIx64" precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64
vfio_vmstate_change(const char *name, int running, const char *reason, const char *dev_state) " (%s) running %d reason %s device state %s"
--
2.26.3
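The estimate bookkeeping in vfio_update_estimated_pending_data() above can be exercised standalone: data just sent is charged against the initial-bytes estimate first and any remainder against the dirty estimate, never letting either underflow. The struct below is a hypothetical stand-in for the two VFIOMigration fields, not QEMU code:

```c
#include <stdint.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* Stand-in for the two estimate fields of VFIOMigration. */
typedef struct {
    uint64_t precopy_init_size;
    uint64_t precopy_dirty_size;
} PrecopyEstimate;

/*
 * Mirrors the patch's rule: a zero data_size means the device emptied
 * its pre-copy state for now, so both estimates drop to zero; otherwise
 * the bytes just sent reduce the initial-bytes estimate first, then the
 * dirty estimate, clamped so neither counter underflows.
 */
static void update_estimate(PrecopyEstimate *e, uint64_t data_size)
{
    if (!data_size) {
        e->precopy_init_size = 0;
        e->precopy_dirty_size = 0;
        return;
    }

    if (e->precopy_init_size) {
        uint64_t init_size = MIN(e->precopy_init_size, data_size);

        e->precopy_init_size -= init_size;
        data_size -= init_size;
    }

    e->precopy_dirty_size -= MIN(e->precopy_dirty_size, data_size);
}
```

For example, starting from 100 initial bytes and 50 dirty bytes, sending 120 bytes drains all of the initial estimate and 20 of the dirty estimate.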
* [PATCH v2 7/7] vfio/migration: Add support for precopy initial data capability
2023-05-17 15:52 [PATCH v2 0/7] migration: Add precopy initial data capability and VFIO precopy support Avihai Horon
` (5 preceding siblings ...)
2023-05-17 15:52 ` [PATCH v2 6/7] vfio/migration: Add VFIO migration pre-copy support Avihai Horon
@ 2023-05-17 15:52 ` Avihai Horon
6 siblings, 0 replies; 15+ messages in thread
From: Avihai Horon @ 2023-05-17 15:52 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cédric Le Goater, Eduardo Habkost,
Marcel Apfelbaum, Philippe Mathieu-Daudé, Yanan Wang,
Juan Quintela, Peter Xu, Leonardo Bras, Eric Blake,
Markus Armbruster, Thomas Huth, Laurent Vivier, Paolo Bonzini,
Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Avihai Horon,
Kirti Wankhede, Tarun Gupta, Joao Martins
Loading of a VFIO device's data can take a substantial amount of time as
the device may need to allocate resources, prepare internal data
structures, etc. This can increase migration downtime, especially for
VFIO devices with a lot of resources.
To solve this, VFIO migration uAPI defines "initial bytes" as part of
its precopy data stream. Initial bytes can be used in various ways to
improve VFIO migration performance. For example, it can be used to
transfer device metadata to pre-allocate resources in the destination.
However, for this to work we need to make sure that all initial bytes
are sent and loaded in the destination before the source VM is stopped.
Use migration precopy initial data capability to make sure a VFIO
device's initial bytes are sent and loaded in the destination before the
source stops the VM and attempts to complete the migration.
This can significantly reduce migration downtime for some devices.
As precopy support and precopy initial data support come together in
VFIO migration, use x-allow-pre-copy device property to control usage of
this feature as well.
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
docs/devel/vfio-migration.rst | 10 +++++++++
include/hw/vfio/vfio-common.h | 2 ++
hw/vfio/migration.c | 42 ++++++++++++++++++++++++++++++++++-
3 files changed, 53 insertions(+), 1 deletion(-)
diff --git a/docs/devel/vfio-migration.rst b/docs/devel/vfio-migration.rst
index e896b2a673..aebf63b02b 100644
--- a/docs/devel/vfio-migration.rst
+++ b/docs/devel/vfio-migration.rst
@@ -16,6 +16,13 @@ helps to reduce the total downtime of the VM. VFIO devices opt-in to pre-copy
support by reporting the VFIO_MIGRATION_PRE_COPY flag in the
VFIO_DEVICE_FEATURE_MIGRATION ioctl.
+When pre-copy is supported, it's possible to further reduce downtime by
+enabling the "precopy-initial-data" migration capability.
+The VFIO migration uAPI defines "initial bytes" as part of its pre-copy data
+stream and recommends that the initial bytes are sent and loaded in the
+destination before stopping the source VM. Enabling this migration capability
+guarantees that, and can thus potentially reduce downtime even further.
+
Note that currently VFIO migration is supported only for a single device. This
is due to VFIO migration's lack of P2P support. However, P2P support is planned
to be added later on.
@@ -45,6 +52,9 @@ VFIO implements the device hooks for the iterative approach as follows:
* A ``save_live_iterate`` function that reads the VFIO device's data from the
vendor driver during iterative pre-copy phase.
+* An ``initial_data_advise`` function that advises the VFIO device to use
+ "precopy-initial-data" migration capability if supported.
+
* A ``save_state`` function to save the device config space if it is present.
* A ``save_live_complete_precopy`` function that sets the VFIO device in
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 5ce7a01d56..698b2b8cc5 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -69,6 +69,8 @@ typedef struct VFIOMigration {
uint64_t precopy_init_size;
uint64_t precopy_dirty_size;
uint64_t mig_flags;
+ bool initial_data_active;
+ bool initial_data_sent;
} VFIOMigration;
typedef struct VFIOAddressSpace {
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 418efed019..5d98c4cd94 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -18,6 +18,7 @@
#include "sysemu/runstate.h"
#include "hw/vfio/vfio-common.h"
#include "migration/migration.h"
+#include "migration/savevm.h"
#include "migration/vmstate.h"
#include "migration/qemu-file.h"
#include "migration/register.h"
@@ -45,6 +46,7 @@
#define VFIO_MIG_FLAG_DEV_CONFIG_STATE (0xffffffffef100002ULL)
#define VFIO_MIG_FLAG_DEV_SETUP_STATE (0xffffffffef100003ULL)
#define VFIO_MIG_FLAG_DEV_DATA_STATE (0xffffffffef100004ULL)
+#define VFIO_MIG_FLAG_DEV_INIT_DATA_SENT (0xffffffffef100005ULL)
/*
* This is an arbitrary size based on migration of mlx5 devices, where typically
@@ -380,6 +382,8 @@ static void vfio_save_cleanup(void *opaque)
g_free(migration->data_buffer);
migration->data_buffer = NULL;
+ migration->initial_data_sent = false;
+ migration->initial_data_active = false;
vfio_migration_cleanup(vbasedev);
trace_vfio_save_cleanup(vbasedev->name);
}
@@ -455,10 +459,17 @@ static int vfio_save_iterate(QEMUFile *f, void *opaque)
if (data_size < 0) {
return data_size;
}
- qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
vfio_update_estimated_pending_data(migration, data_size);
+ if (migration->initial_data_active && !migration->precopy_init_size &&
+ !migration->initial_data_sent) {
+ qemu_put_be64(f, VFIO_MIG_FLAG_DEV_INIT_DATA_SENT);
+ migration->initial_data_sent = true;
+ } else {
+ qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
+ }
+
trace_vfio_save_iterate(vbasedev->name);
/*
@@ -576,6 +587,24 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
}
break;
}
+ case VFIO_MIG_FLAG_DEV_INIT_DATA_SENT:
+ {
+ if (!vbasedev->migration->initial_data_active) {
+ error_report("%s: Received INIT_DATA_SENT but initial data is "
+ "not active",
+ vbasedev->name);
+ return -EINVAL;
+ }
+
+ ret = qemu_loadvm_notify_initial_data_loaded();
+ if (ret) {
+ error_report("%s: qemu_loadvm_notify_initial_data_loaded "
+ "failed, err=%d (%s)",
+ vbasedev->name, ret, strerror(-ret));
+ }
+
+ return ret;
+ }
default:
error_report("%s: Unknown tag 0x%"PRIx64, vbasedev->name, data);
return -EINVAL;
@@ -590,6 +619,16 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
return ret;
}
+static bool vfio_initial_data_advise(void *opaque)
+{
+ VFIODevice *vbasedev = opaque;
+ VFIOMigration *migration = vbasedev->migration;
+
+ migration->initial_data_active = vfio_precopy_supported(vbasedev);
+
+ return migration->initial_data_active;
+}
+
static const SaveVMHandlers savevm_vfio_handlers = {
.save_setup = vfio_save_setup,
.save_cleanup = vfio_save_cleanup,
@@ -602,6 +641,7 @@ static const SaveVMHandlers savevm_vfio_handlers = {
.load_setup = vfio_load_setup,
.load_cleanup = vfio_load_cleanup,
.load_state = vfio_load_state,
+ .initial_data_advise = vfio_initial_data_advise,
};
/* ---------------------------------------------------------------------- */
--
2.26.3
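The destination-side accounting this patch relies on — count the devices that opt in during setup, decrement as each one sees its INIT_DATA_SENT marker, and ACK the source when the count reaches zero — can be modeled in isolation. The names below are illustrative, not the actual migration-core API:

```c
#include <stdbool.h>

/* Minimal model of MigrationIncomingState's pending counter. */
typedef struct {
    unsigned int initial_data_pending_num;
    bool ack_sent;
} IncomingModel;

/* Setup phase: each device whose advise hook returns true is counted.
 * No lock is needed in QEMU either, as the field is updated serially. */
static void advise_device(IncomingModel *mis, bool device_opted_in)
{
    if (device_opted_in) {
        mis->initial_data_pending_num++;
    }
}

/*
 * Load phase: a device hit VFIO_MIG_FLAG_DEV_INIT_DATA_SENT in its
 * stream. When the last pending device reports in, the loaded-ACK goes
 * back to the source over the return path (modeled here by ack_sent).
 * Returns true if this call triggered the ACK.
 */
static bool notify_initial_data_loaded(IncomingModel *mis)
{
    if (mis->initial_data_pending_num == 0) {
        return false; /* unexpected; mirrors the -EINVAL error path */
    }

    if (--mis->initial_data_pending_num == 0) {
        mis->ack_sent = true;
        return true;
    }

    return false;
}
```

With two opted-in devices, the first notification leaves the count at one and only the second triggers the ACK, which is exactly what lets the source wait before stopping the VM.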
* Re: [PATCH v2 3/7] migration: Enable precopy initial data capability
2023-05-17 15:52 ` [PATCH v2 3/7] migration: Enable precopy initial data capability Avihai Horon
@ 2023-05-17 16:07 ` Peter Xu
2023-05-18 7:26 ` Avihai Horon
0 siblings, 1 reply; 15+ messages in thread
From: Peter Xu @ 2023-05-17 16:07 UTC (permalink / raw)
To: Avihai Horon
Cc: qemu-devel, Alex Williamson, Cédric Le Goater,
Eduardo Habkost, Marcel Apfelbaum, Philippe Mathieu-Daudé,
Yanan Wang, Juan Quintela, Leonardo Bras, Eric Blake,
Markus Armbruster, Thomas Huth, Laurent Vivier, Paolo Bonzini,
Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
Tarun Gupta, Joao Martins
On Wed, May 17, 2023 at 06:52:15PM +0300, Avihai Horon wrote:
> Now that precopy initial data logic has been implemented, enable the
> capability.
>
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> Reviewed-by: Juan Quintela <quintela@redhat.com>
> ---
> migration/options.c | 4 ----
> 1 file changed, 4 deletions(-)
>
> diff --git a/migration/options.c b/migration/options.c
> index 0a31921a7a..3449ce4f14 100644
> --- a/migration/options.c
> +++ b/migration/options.c
> @@ -561,10 +561,6 @@ bool migrate_caps_check(bool *old_caps, bool *new_caps, Error **errp)
> "capability 'return-path'");
> return false;
> }
> -
> - /* Disable this capability until it's implemented */
> - error_setg(errp, "'precopy-initial-data' is not implemented yet");
> - return false;
> }
I'm always confused about why we need this rather than squashing it into
patch 1 (or, say, never having these lines at all).
The only case where it matters is when someone backports patch 1 but not
the rest of the patches. But that's really, really weird for a backporter
to do, and I doubt it's happening.
Neither should we merge patch 1 without merging follow up patches to
master, as we should just always merge the whole feature or just keep
reworking on the list.
I'd like to know if I missed something else..
PS: Sorry for being late replying to your email on the previous version
due to travelling last week; I'll reply to your series instead. Actually,
I was just writing up the reply to your previous version when I received
this one. :)
Thanks,
--
Peter Xu
* Re: [PATCH v2 2/7] migration: Implement precopy initial data logic
2023-05-17 15:52 ` [PATCH v2 2/7] migration: Implement precopy initial data logic Avihai Horon
@ 2023-05-17 16:39 ` Peter Xu
2023-05-18 7:45 ` Avihai Horon
0 siblings, 1 reply; 15+ messages in thread
From: Peter Xu @ 2023-05-17 16:39 UTC (permalink / raw)
To: Avihai Horon
Cc: qemu-devel, Alex Williamson, Cédric Le Goater,
Eduardo Habkost, Marcel Apfelbaum, Philippe Mathieu-Daudé,
Yanan Wang, Juan Quintela, Leonardo Bras, Eric Blake,
Markus Armbruster, Thomas Huth, Laurent Vivier, Paolo Bonzini,
Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
Tarun Gupta, Joao Martins
On Wed, May 17, 2023 at 06:52:14PM +0300, Avihai Horon wrote:
> Implement precopy initial data logic. This allows migration users in the
> source to send precopy initial data and the destination to ACK when this
> data is loaded. Migration will not attempt to stop the source VM and
> complete the migration until this ACK is received.
>
> To achieve this, a new SaveVMHandlers handler initial_data_advise() and
> a new return path message MIG_RP_MSG_INITIAL_DATA_LOADED_ACK are added.
>
> The initial_data_advise() handler is called during migration setup both
> in the source and the destination to advise the migration user that
> precopy initial data is used, and its return value indicates whether
> precopy initial data is supported by it.
>
> When precopy initial data of all the migration users is loaded in
> the destination, the MIG_RP_MSG_INITIAL_DATA_LOADED_ACK return path
> message is sent to the source to notify it.
This looks much better and easier to digest, thanks a lot.
It does answer one of the questions I wanted to ask: whether both sides
would know which device will need this feature enabled. It seems both
sides are actually in consensus on it, which is great for reducing the
changeset. It's also less controversial whether we'll need a more generic
handshake, because we simply don't need the per-device handshake for now.
The name is probably prone to change indeed. Firstly, it's not really
only about precopy; it also applies to postcopy. If VFIO supports postcopy
in the future, this feature will also be needed to ensure low downtime
when switching to postcopy. I saw that you even changed postcopy for
this [1], which is definitely good. So "precopy" in the cap name is not
proper anymore.
How about "switchover-ack"? Then..
>
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> ---
> include/migration/register.h | 6 ++++
> migration/migration.h | 14 ++++++++
> migration/savevm.h | 2 ++
> migration/migration.c | 40 ++++++++++++++++++++--
> migration/savevm.c | 65 ++++++++++++++++++++++++++++++++++++
> migration/trace-events | 4 +++
> 6 files changed, 129 insertions(+), 2 deletions(-)
>
> diff --git a/include/migration/register.h b/include/migration/register.h
> index a8dfd8fefd..3ac443a55f 100644
> --- a/include/migration/register.h
> +++ b/include/migration/register.h
> @@ -71,6 +71,12 @@ typedef struct SaveVMHandlers {
> int (*load_cleanup)(void *opaque);
> /* Called when postcopy migration wants to resume from failure */
> int (*resume_prepare)(MigrationState *s, void *opaque);
> +
> + /*
> + * Advises that precopy initial data was requested. Returns true if it's
> + * supported or false otherwise. Called both in src and dest.
> + */
> + bool (*initial_data_advise)(void *opaque);
.. this can be "switchover_ack_needed()". Ditto below for most of the
variable names.
> } SaveVMHandlers;
>
> int register_savevm_live(const char *idstr,
> diff --git a/migration/migration.h b/migration/migration.h
> index 7721c7658b..cc4e817939 100644
> --- a/migration/migration.h
> +++ b/migration/migration.h
> @@ -202,6 +202,13 @@ struct MigrationIncomingState {
> * contains valid information.
> */
> QemuMutex page_request_mutex;
> +
> + /*
> + * Number of migration users that are waiting for their initial data to be
> + * loaded. When this reaches zero an ACK is sent to source. No lock is
> + * needed as this field is updated serially.
> + */
> + unsigned int initial_data_pending_num;
> };
>
> MigrationIncomingState *migration_incoming_get_current(void);
> @@ -430,6 +437,12 @@ struct MigrationState {
>
> /* QEMU_VM_VMDESCRIPTION content filled for all non-iterable devices. */
> JSONWriter *vmdesc;
> +
> + /*
> + * Indicates whether an ACK that precopy initial data was loaded in
> + * destination has been received.
> + */
> + bool initial_data_loaded_acked;
> };
>
> void migrate_set_state(int *state, int old_state, int new_state);
> @@ -470,6 +483,7 @@ int migrate_send_rp_message_req_pages(MigrationIncomingState *mis,
> void migrate_send_rp_recv_bitmap(MigrationIncomingState *mis,
> char *block_name);
> void migrate_send_rp_resume_ack(MigrationIncomingState *mis, uint32_t value);
> +int migrate_send_rp_initial_data_loaded_ack(MigrationIncomingState *mis);
>
> void dirty_bitmap_mig_before_vm_start(void);
> void dirty_bitmap_mig_cancel_outgoing(void);
> diff --git a/migration/savevm.h b/migration/savevm.h
> index fb636735f0..e1d8a2b3b2 100644
> --- a/migration/savevm.h
> +++ b/migration/savevm.h
> @@ -32,6 +32,7 @@
> bool qemu_savevm_state_blocked(Error **errp);
> void qemu_savevm_non_migratable_list(strList **reasons);
> void qemu_savevm_state_setup(QEMUFile *f);
> +void qemu_savevm_state_initial_data_advise(MigrationState *ms);
> bool qemu_savevm_state_guest_unplug_pending(void);
> int qemu_savevm_state_resume_prepare(MigrationState *s);
> void qemu_savevm_state_header(QEMUFile *f);
> @@ -65,6 +66,7 @@ int qemu_loadvm_state(QEMUFile *f);
> void qemu_loadvm_state_cleanup(void);
> int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis);
> int qemu_load_device_state(QEMUFile *f);
> +int qemu_loadvm_notify_initial_data_loaded(void);
> int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
> bool in_postcopy, bool inactivate_disks);
>
> diff --git a/migration/migration.c b/migration/migration.c
> index 00d8ba8da0..fdb8592e64 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -78,6 +78,7 @@ enum mig_rp_message_type {
> MIG_RP_MSG_REQ_PAGES, /* data (start: be64, len: be32) */
> MIG_RP_MSG_RECV_BITMAP, /* send recved_bitmap back to source */
> MIG_RP_MSG_RESUME_ACK, /* tell source that we are ready to resume */
> + MIG_RP_MSG_INITIAL_DATA_LOADED_ACK, /* Tell source initial data is loaded */
>
> MIG_RP_MSG_MAX
> };
> @@ -780,6 +781,12 @@ bool migration_has_all_channels(void)
> return true;
> }
>
> +int migrate_send_rp_initial_data_loaded_ack(MigrationIncomingState *mis)
> +{
> + return migrate_send_rp_message(mis, MIG_RP_MSG_INITIAL_DATA_LOADED_ACK, 0,
> + NULL);
> +}
> +
> /*
> * Send a 'SHUT' message on the return channel with the given value
> * to indicate that we've finished with the RP. Non-0 value indicates
> @@ -1425,6 +1432,7 @@ void migrate_init(MigrationState *s)
> s->vm_was_running = false;
> s->iteration_initial_bytes = 0;
> s->threshold_size = 0;
> + s->initial_data_loaded_acked = false;
> }
>
> int migrate_add_blocker_internal(Error *reason, Error **errp)
> @@ -1741,6 +1749,9 @@ static struct rp_cmd_args {
> [MIG_RP_MSG_REQ_PAGES_ID] = { .len = -1, .name = "REQ_PAGES_ID" },
> [MIG_RP_MSG_RECV_BITMAP] = { .len = -1, .name = "RECV_BITMAP" },
> [MIG_RP_MSG_RESUME_ACK] = { .len = 4, .name = "RESUME_ACK" },
> + [MIG_RP_MSG_INITIAL_DATA_LOADED_ACK] = { .len = 0,
> + .name =
> + "INITIAL_DATA_LOADED_ACK" },
Nit: let's just put it in a single line. If it'll be SWITCHOVER_ACK it's
even shorter. :)
> [MIG_RP_MSG_MAX] = { .len = -1, .name = "MAX" },
> };
>
> @@ -1979,6 +1990,11 @@ retry:
> }
> break;
>
> + case MIG_RP_MSG_INITIAL_DATA_LOADED_ACK:
> + ms->initial_data_loaded_acked = true;
> + trace_source_return_path_thread_initial_data_loaded_ack();
> + break;
> +
> default:
> break;
> }
> @@ -2727,6 +2743,20 @@ static void migration_update_counters(MigrationState *s,
> bandwidth, s->threshold_size);
> }
>
> +static bool initial_data_loaded_acked(MigrationState *s)
> +{
> + if (!migrate_precopy_initial_data()) {
> + return true;
> + }
> +
> + /* No reason to wait for precopy initial data loaded ACK if VM is stopped */
> + if (!runstate_is_running()) {
> + return true;
> + }
This check may or may not be a problem, as it can become running right away
after the check.
But I think I get your point here and I think it's fine; it at least means
the VM was stopped for at least a while, so the user presumably doesn't
care that strictly about downtime.
> +
> + return s->initial_data_loaded_acked;
> +}
> +
> /* Migration thread iteration status */
> typedef enum {
> MIG_ITERATE_RESUME, /* Resume current iteration */
> @@ -2742,6 +2772,7 @@ static MigIterateState migration_iteration_run(MigrationState *s)
> {
> uint64_t must_precopy, can_postcopy;
> bool in_postcopy = s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE;
> + bool initial_data_loaded = initial_data_loaded_acked(s);
>
> qemu_savevm_state_pending_estimate(&must_precopy, &can_postcopy);
> uint64_t pending_size = must_precopy + can_postcopy;
> @@ -2754,7 +2785,8 @@ static MigIterateState migration_iteration_run(MigrationState *s)
> trace_migrate_pending_exact(pending_size, must_precopy, can_postcopy);
> }
>
> - if (!pending_size || pending_size < s->threshold_size) {
> + if ((!pending_size || pending_size < s->threshold_size) &&
> + initial_data_loaded) {
> trace_migration_thread_low_pending(pending_size);
> migration_completion(s);
> return MIG_ITERATE_BREAK;
> @@ -2762,7 +2794,7 @@ static MigIterateState migration_iteration_run(MigrationState *s)
>
> /* Still a significant amount to transfer */
> if (!in_postcopy && must_precopy <= s->threshold_size &&
> - qatomic_read(&s->start_postcopy)) {
> + initial_data_loaded && qatomic_read(&s->start_postcopy)) {
[1]
> if (postcopy_start(s)) {
> error_report("%s: postcopy failed to start", __func__);
> }
> @@ -2986,6 +3018,10 @@ static void *migration_thread(void *opaque)
>
> qemu_savevm_state_setup(s->to_dst_file);
>
> + if (migrate_precopy_initial_data()) {
> + qemu_savevm_state_initial_data_advise(s);
> + }
> +
> qemu_savevm_wait_unplug(s, MIGRATION_STATUS_SETUP,
> MIGRATION_STATUS_ACTIVE);
>
> diff --git a/migration/savevm.c b/migration/savevm.c
> index e33788343a..c713ace891 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -1233,6 +1233,32 @@ bool qemu_savevm_state_guest_unplug_pending(void)
> return false;
> }
>
> +void qemu_savevm_state_initial_data_advise(MigrationState *ms)
> +{
> + SaveStateEntry *se;
> + unsigned int supported_num = 0;
> +
> + QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> + if (!se->ops || !se->ops->initial_data_advise) {
> + continue;
> + }
> +
> + if (se->ops->initial_data_advise(se->opaque)) {
> + supported_num++;
> + }
> + }
> +
> + if (!supported_num) {
> + /*
> + * There are no migration users that support precopy initial data. Set
> + * initial data loaded acked to true so migration can be completed.
> + */
> + ms->initial_data_loaded_acked = true;
> + }
This is fine but slightly hackish (e.g. someone might assume we received a
msg when this var is true, but in this case we just don't need one).
How about remembering this "supported_num" on src too and just checking it
in initial_data_loaded_acked() above?
PS: we should someday have a common object for both src/dst migration which
will be suitable for variables like this, so all common vars in
MigrationState and MigrationIncomingState will go there. Another story.
> +
> + trace_savevm_state_initial_data_advise(supported_num);
> +}
> +
> void qemu_savevm_state_setup(QEMUFile *f)
> {
> MigrationState *ms = migrate_get_current();
> @@ -2586,6 +2612,23 @@ static int qemu_loadvm_state_header(QEMUFile *f)
> return 0;
> }
>
> +static void qemu_loadvm_state_initial_data_advise(MigrationIncomingState *mis)
> +{
> + SaveStateEntry *se;
> +
> + QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> + if (!se->ops || !se->ops->initial_data_advise) {
> + continue;
> + }
> +
> + if (se->ops->initial_data_advise(se->opaque)) {
> + mis->initial_data_pending_num++;
> + }
> + }
> +
> + trace_loadvm_state_initial_data_advise(mis->initial_data_pending_num);
> +}
> +
> static int qemu_loadvm_state_setup(QEMUFile *f)
> {
> SaveStateEntry *se;
> @@ -2789,6 +2832,10 @@ int qemu_loadvm_state(QEMUFile *f)
> return -EINVAL;
> }
>
> + if (migrate_precopy_initial_data()) {
> + qemu_loadvm_state_initial_data_advise(mis);
> + }
> +
> cpu_synchronize_all_pre_loadvm();
>
> ret = qemu_loadvm_state_main(f, mis);
> @@ -2862,6 +2909,24 @@ int qemu_load_device_state(QEMUFile *f)
> return 0;
> }
>
> +int qemu_loadvm_notify_initial_data_loaded(void)
> +{
> + MigrationIncomingState *mis = migration_incoming_get_current();
> +
> + if (!mis->initial_data_pending_num) {
> + return -EINVAL;
> + }
> +
> + mis->initial_data_pending_num--;
> + trace_loadvm_notify_initial_data_loaded(mis->initial_data_pending_num);
> +
> + if (mis->initial_data_pending_num) {
> + return 0;
> + }
> +
> + return migrate_send_rp_initial_data_loaded_ack(mis);
> +}
> +
> bool save_snapshot(const char *name, bool overwrite, const char *vmstate,
> bool has_devices, strList *devices, Error **errp)
> {
> diff --git a/migration/trace-events b/migration/trace-events
> index f39818c329..807083c0a1 100644
> --- a/migration/trace-events
> +++ b/migration/trace-events
> @@ -7,6 +7,7 @@ qemu_loadvm_state_section_partend(uint32_t section_id) "%u"
> qemu_loadvm_state_post_main(int ret) "%d"
> qemu_loadvm_state_section_startfull(uint32_t section_id, const char *idstr, uint32_t instance_id, uint32_t version_id) "%u(%s) %u %u"
> qemu_savevm_send_packaged(void) ""
> +loadvm_state_initial_data_advise(unsigned int initial_data_pending_num) "Initial data pending num=%u"
> loadvm_state_setup(void) ""
> loadvm_state_cleanup(void) ""
> loadvm_handle_cmd_packaged(unsigned int length) "%u"
> @@ -23,6 +24,7 @@ loadvm_postcopy_ram_handle_discard_end(void) ""
> loadvm_postcopy_ram_handle_discard_header(const char *ramid, uint16_t len) "%s: %ud"
> loadvm_process_command(const char *s, uint16_t len) "com=%s len=%d"
> loadvm_process_command_ping(uint32_t val) "0x%x"
> +loadvm_notify_initial_data_loaded(unsigned int initial_data_pending_num) "Initial data pending num=%u"
> postcopy_ram_listen_thread_exit(void) ""
> postcopy_ram_listen_thread_start(void) ""
> qemu_savevm_send_postcopy_advise(void) ""
> @@ -39,6 +41,7 @@ savevm_send_postcopy_resume(void) ""
> savevm_send_colo_enable(void) ""
> savevm_send_recv_bitmap(char *name) "%s"
> savevm_state_setup(void) ""
> +savevm_state_initial_data_advise(unsigned int initial_data_supported_num) "Initial data supported num=%u"
> savevm_state_resume_prepare(void) ""
> savevm_state_header(void) ""
> savevm_state_iterate(void) ""
> @@ -180,6 +183,7 @@ source_return_path_thread_loop_top(void) ""
> source_return_path_thread_pong(uint32_t val) "0x%x"
> source_return_path_thread_shut(uint32_t val) "0x%x"
> source_return_path_thread_resume_ack(uint32_t v) "%"PRIu32
> +source_return_path_thread_initial_data_loaded_ack(void) ""
> migration_thread_low_pending(uint64_t pending) "%" PRIu64
> migrate_transferred(uint64_t tranferred, uint64_t time_spent, uint64_t bandwidth, uint64_t size) "transferred %" PRIu64 " time_spent %" PRIu64 " bandwidth %" PRIu64 " max_size %" PRId64
> process_incoming_migration_co_end(int ret, int ps) "ret=%d postcopy-state=%d"
> --
> 2.26.3
>
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH v2 4/7] tests: Add migration precopy initial data capability test
2023-05-17 15:52 ` [PATCH v2 4/7] tests: Add migration precopy initial data capability test Avihai Horon
@ 2023-05-17 16:40 ` Peter Xu
0 siblings, 0 replies; 15+ messages in thread
From: Peter Xu @ 2023-05-17 16:40 UTC (permalink / raw)
To: Avihai Horon
Cc: qemu-devel, Alex Williamson, Cédric Le Goater,
Eduardo Habkost, Marcel Apfelbaum, Philippe Mathieu-Daudé,
Yanan Wang, Juan Quintela, Leonardo Bras, Eric Blake,
Markus Armbruster, Thomas Huth, Laurent Vivier, Paolo Bonzini,
Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
Tarun Gupta, Joao Martins
On Wed, May 17, 2023 at 06:52:16PM +0300, Avihai Horon wrote:
> Add migration precopy initial data capability test. The test runs
> without migration users that support this capability, but is still
> useful to make sure it didn't break anything.
>
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> Reviewed-by: Juan Quintela <quintela@redhat.com>
It's a pity that there seems to be no easy way to test the RP msg and
notifications in the qtest. Besides the cap name, it all looks fine:
Reviewed-by: Peter Xu <peterx@redhat.com>
--
Peter Xu
* Re: [PATCH v2 3/7] migration: Enable precopy initial data capability
2023-05-17 16:07 ` Peter Xu
@ 2023-05-18 7:26 ` Avihai Horon
2023-05-18 13:42 ` Peter Xu
0 siblings, 1 reply; 15+ messages in thread
From: Avihai Horon @ 2023-05-18 7:26 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, Alex Williamson, Cédric Le Goater,
Eduardo Habkost, Marcel Apfelbaum, Philippe Mathieu-Daudé,
Yanan Wang, Juan Quintela, Leonardo Bras, Eric Blake,
Markus Armbruster, Thomas Huth, Laurent Vivier, Paolo Bonzini,
Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
Tarun Gupta, Joao Martins
On 17/05/2023 19:07, Peter Xu wrote:
> External email: Use caution opening links or attachments
>
>
> On Wed, May 17, 2023 at 06:52:15PM +0300, Avihai Horon wrote:
>> Now that precopy initial data logic has been implemented, enable the
>> capability.
>>
>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
>> Reviewed-by: Juan Quintela <quintela@redhat.com>
>> ---
>> migration/options.c | 4 ----
>> 1 file changed, 4 deletions(-)
>>
>> diff --git a/migration/options.c b/migration/options.c
>> index 0a31921a7a..3449ce4f14 100644
>> --- a/migration/options.c
>> +++ b/migration/options.c
>> @@ -561,10 +561,6 @@ bool migrate_caps_check(bool *old_caps, bool *new_caps, Error **errp)
>> "capability 'return-path'");
>> return false;
>> }
>> -
>> - /* Disable this capability until it's implemented */
>> - error_setg(errp, "'precopy-initial-data' is not implemented yet");
>> - return false;
>> }
> I'm always confused why we need this instead of squashing it into
> patch 1 (or, say, never having these lines).
>
> The only case where it matters is when someone backports patch 1 but not
> the rest of the patches. But that's really, really weird for a
> backporter to do, and I doubt it's happening.
>
> Neither should we merge patch 1 without merging the follow-up patches to
> master, as we should always either merge the whole feature or keep
> reworking it on the list.
>
> I'd like to know if I missed something else..
There are also git bisect considerations.
This practice is useful when bisecting features that are enabled by
default, so you won't mistakenly run "half a feature" during a bisect.
But here the capability must be enabled manually, so maybe it's not that
useful in this case.
I like it for the sake of good order, so this capability can't be
enabled before it's fully implemented (even if it's unlikely that
someone would do that).
>
> PS: sorry to be late on replying to your email for previous version due to
> travelling last week, I'll reply to your series instead. Actually I was
> just writting up the reply to your previous version when receiving this
> one. :)
No worries, thanks :)
* Re: [PATCH v2 2/7] migration: Implement precopy initial data logic
2023-05-17 16:39 ` Peter Xu
@ 2023-05-18 7:45 ` Avihai Horon
0 siblings, 0 replies; 15+ messages in thread
From: Avihai Horon @ 2023-05-18 7:45 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, Alex Williamson, Cédric Le Goater,
Eduardo Habkost, Marcel Apfelbaum, Philippe Mathieu-Daudé,
Yanan Wang, Juan Quintela, Leonardo Bras, Eric Blake,
Markus Armbruster, Thomas Huth, Laurent Vivier, Paolo Bonzini,
Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
Tarun Gupta, Joao Martins
On 17/05/2023 19:39, Peter Xu wrote:
> On Wed, May 17, 2023 at 06:52:14PM +0300, Avihai Horon wrote:
>> Implement precopy initial data logic. This allows migration users in the
>> source to send precopy initial data and the destination to ACK when this
>> data is loaded. Migration will not attempt to stop the source VM and
>> complete the migration until this ACK is received.
>>
>> To achieve this, a new SaveVMHandlers handler initial_data_advise() and
>> a new return path message MIG_RP_MSG_INITIAL_DATA_LOADED_ACK are added.
>>
>> The initial_data_advise() handler is called during migration setup both
>> in the source and the destination to advise the migration user that
>> precopy initial data is used, and its return value indicates whether
>> precopy initial data is supported by it.
>>
>> When precopy initial data of all the migration users is loaded in
>> the destination, the MIG_RP_MSG_INITIAL_DATA_LOADED_ACK return path
>> message is sent to the source to notify it.
> This looks much better and easier to digest, thanks a lot.
>
> It does answer one of the questions I wanted to ask, on whether both
> sides would know which device will need this feature enabled. It seems
> both sides are actually in consensus on it, which is great for reducing
> the changeset. Then it's less controversial whether we'll need a more
> generic handshake, because we simply don't need the per-dev handshake for
> now.
>
> The name is probably prone to change indeed. Firstly, it's not really
> only about precopy but also applies to postcopy. If VFIO supports postcopy
> in the future, this feature will also be needed to keep downtime low when
> switching over to postcopy. I saw that you even changed postcopy for
> this [1], which is definitely good. So "precopy" in the cap name is not
> proper anymore.
>
> How about "switchover-ack"?
Sounds good to me.
It's shorter than "explicit switchover" and maybe more understandable
even without context.
Plus, the "switchover" terminology is already used in migration, so
we're not adding a new term here.
> Then..
>
>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
>> ---
>> include/migration/register.h | 6 ++++
>> migration/migration.h | 14 ++++++++
>> migration/savevm.h | 2 ++
>> migration/migration.c | 40 ++++++++++++++++++++--
>> migration/savevm.c | 65 ++++++++++++++++++++++++++++++++++++
>> migration/trace-events | 4 +++
>> 6 files changed, 129 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/migration/register.h b/include/migration/register.h
>> index a8dfd8fefd..3ac443a55f 100644
>> --- a/include/migration/register.h
>> +++ b/include/migration/register.h
>> @@ -71,6 +71,12 @@ typedef struct SaveVMHandlers {
>> int (*load_cleanup)(void *opaque);
>> /* Called when postcopy migration wants to resume from failure */
>> int (*resume_prepare)(MigrationState *s, void *opaque);
>> +
>> + /*
>> + * Advises that precopy initial data was requested. Returns true if it's
>> + * supported or false otherwise. Called both in src and dest.
>> + */
>> + bool (*initial_data_advise)(void *opaque);
> .. this can be "switchover_ack_needed()". Ditto below on most of the name
> of variables.
Sure, will change.
>
>> } SaveVMHandlers;
>>
>> int register_savevm_live(const char *idstr,
>> diff --git a/migration/migration.h b/migration/migration.h
>> index 7721c7658b..cc4e817939 100644
>> --- a/migration/migration.h
>> +++ b/migration/migration.h
>> @@ -202,6 +202,13 @@ struct MigrationIncomingState {
>> * contains valid information.
>> */
>> QemuMutex page_request_mutex;
>> +
>> + /*
>> + * Number of migration users that are waiting for their initial data to be
>> + * loaded. When this reaches zero an ACK is sent to source. No lock is
>> + * needed as this field is updated serially.
>> + */
>> + unsigned int initial_data_pending_num;
>> };
>>
>> MigrationIncomingState *migration_incoming_get_current(void);
>> @@ -430,6 +437,12 @@ struct MigrationState {
>>
>> /* QEMU_VM_VMDESCRIPTION content filled for all non-iterable devices. */
>> JSONWriter *vmdesc;
>> +
>> + /*
>> + * Indicates whether an ACK that precopy initial data was loaded in
>> + * destination has been received.
>> + */
>> + bool initial_data_loaded_acked;
>> };
>>
>> void migrate_set_state(int *state, int old_state, int new_state);
>> @@ -470,6 +483,7 @@ int migrate_send_rp_message_req_pages(MigrationIncomingState *mis,
>> void migrate_send_rp_recv_bitmap(MigrationIncomingState *mis,
>> char *block_name);
>> void migrate_send_rp_resume_ack(MigrationIncomingState *mis, uint32_t value);
>> +int migrate_send_rp_initial_data_loaded_ack(MigrationIncomingState *mis);
>>
>> void dirty_bitmap_mig_before_vm_start(void);
>> void dirty_bitmap_mig_cancel_outgoing(void);
>> diff --git a/migration/savevm.h b/migration/savevm.h
>> index fb636735f0..e1d8a2b3b2 100644
>> --- a/migration/savevm.h
>> +++ b/migration/savevm.h
>> @@ -32,6 +32,7 @@
>> bool qemu_savevm_state_blocked(Error **errp);
>> void qemu_savevm_non_migratable_list(strList **reasons);
>> void qemu_savevm_state_setup(QEMUFile *f);
>> +void qemu_savevm_state_initial_data_advise(MigrationState *ms);
>> bool qemu_savevm_state_guest_unplug_pending(void);
>> int qemu_savevm_state_resume_prepare(MigrationState *s);
>> void qemu_savevm_state_header(QEMUFile *f);
>> @@ -65,6 +66,7 @@ int qemu_loadvm_state(QEMUFile *f);
>> void qemu_loadvm_state_cleanup(void);
>> int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis);
>> int qemu_load_device_state(QEMUFile *f);
>> +int qemu_loadvm_notify_initial_data_loaded(void);
>> int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
>> bool in_postcopy, bool inactivate_disks);
>>
>> diff --git a/migration/migration.c b/migration/migration.c
>> index 00d8ba8da0..fdb8592e64 100644
>> --- a/migration/migration.c
>> +++ b/migration/migration.c
>> @@ -78,6 +78,7 @@ enum mig_rp_message_type {
>> MIG_RP_MSG_REQ_PAGES, /* data (start: be64, len: be32) */
>> MIG_RP_MSG_RECV_BITMAP, /* send recved_bitmap back to source */
>> MIG_RP_MSG_RESUME_ACK, /* tell source that we are ready to resume */
>> + MIG_RP_MSG_INITIAL_DATA_LOADED_ACK, /* Tell source initial data is loaded */
>>
>> MIG_RP_MSG_MAX
>> };
>> @@ -780,6 +781,12 @@ bool migration_has_all_channels(void)
>> return true;
>> }
>>
>> +int migrate_send_rp_initial_data_loaded_ack(MigrationIncomingState *mis)
>> +{
>> + return migrate_send_rp_message(mis, MIG_RP_MSG_INITIAL_DATA_LOADED_ACK, 0,
>> + NULL);
>> +}
>> +
>> /*
>> * Send a 'SHUT' message on the return channel with the given value
>> * to indicate that we've finished with the RP. Non-0 value indicates
>> @@ -1425,6 +1432,7 @@ void migrate_init(MigrationState *s)
>> s->vm_was_running = false;
>> s->iteration_initial_bytes = 0;
>> s->threshold_size = 0;
>> + s->initial_data_loaded_acked = false;
>> }
>>
>> int migrate_add_blocker_internal(Error *reason, Error **errp)
>> @@ -1741,6 +1749,9 @@ static struct rp_cmd_args {
>> [MIG_RP_MSG_REQ_PAGES_ID] = { .len = -1, .name = "REQ_PAGES_ID" },
>> [MIG_RP_MSG_RECV_BITMAP] = { .len = -1, .name = "RECV_BITMAP" },
>> [MIG_RP_MSG_RESUME_ACK] = { .len = 4, .name = "RESUME_ACK" },
>> + [MIG_RP_MSG_INITIAL_DATA_LOADED_ACK] = { .len = 0,
>> + .name =
>> + "INITIAL_DATA_LOADED_ACK" },
> Nit: let's just put it in a single line. If it'll be SWITCHOVER_ACK it's
> even shorter. :)
Sure.
>
>> [MIG_RP_MSG_MAX] = { .len = -1, .name = "MAX" },
>> };
>>
>> @@ -1979,6 +1990,11 @@ retry:
>> }
>> break;
>>
>> + case MIG_RP_MSG_INITIAL_DATA_LOADED_ACK:
>> + ms->initial_data_loaded_acked = true;
>> + trace_source_return_path_thread_initial_data_loaded_ack();
>> + break;
>> +
>> default:
>> break;
>> }
>> @@ -2727,6 +2743,20 @@ static void migration_update_counters(MigrationState *s,
>> bandwidth, s->threshold_size);
>> }
>>
>> +static bool initial_data_loaded_acked(MigrationState *s)
>> +{
>> + if (!migrate_precopy_initial_data()) {
>> + return true;
>> + }
>> +
>> + /* No reason to wait for precopy initial data loaded ACK if VM is stopped */
>> + if (!runstate_is_running()) {
>> + return true;
>> + }
> This check may or may not be a problem, as it can become running right away
> after the check.
>
> But I think I get your point here and I think it's fine; it at least means
> the VM was stopped for at least a while, so the user presumably doesn't
> care that strictly about downtime.
Yes, exactly.
>
>> +
>> + return s->initial_data_loaded_acked;
>> +}
>> +
>> /* Migration thread iteration status */
>> typedef enum {
>> MIG_ITERATE_RESUME, /* Resume current iteration */
>> @@ -2742,6 +2772,7 @@ static MigIterateState migration_iteration_run(MigrationState *s)
>> {
>> uint64_t must_precopy, can_postcopy;
>> bool in_postcopy = s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE;
>> + bool initial_data_loaded = initial_data_loaded_acked(s);
>>
>> qemu_savevm_state_pending_estimate(&must_precopy, &can_postcopy);
>> uint64_t pending_size = must_precopy + can_postcopy;
>> @@ -2754,7 +2785,8 @@ static MigIterateState migration_iteration_run(MigrationState *s)
>> trace_migrate_pending_exact(pending_size, must_precopy, can_postcopy);
>> }
>>
>> - if (!pending_size || pending_size < s->threshold_size) {
>> + if ((!pending_size || pending_size < s->threshold_size) &&
>> + initial_data_loaded) {
>> trace_migration_thread_low_pending(pending_size);
>> migration_completion(s);
>> return MIG_ITERATE_BREAK;
>> @@ -2762,7 +2794,7 @@ static MigIterateState migration_iteration_run(MigrationState *s)
>>
>> /* Still a significant amount to transfer */
>> if (!in_postcopy && must_precopy <= s->threshold_size &&
>> - qatomic_read(&s->start_postcopy)) {
>> + initial_data_loaded && qatomic_read(&s->start_postcopy)) {
> [1]
>
>> if (postcopy_start(s)) {
>> error_report("%s: postcopy failed to start", __func__);
>> }
>> @@ -2986,6 +3018,10 @@ static void *migration_thread(void *opaque)
>>
>> qemu_savevm_state_setup(s->to_dst_file);
>>
>> + if (migrate_precopy_initial_data()) {
>> + qemu_savevm_state_initial_data_advise(s);
>> + }
>> +
>> qemu_savevm_wait_unplug(s, MIGRATION_STATUS_SETUP,
>> MIGRATION_STATUS_ACTIVE);
>>
>> diff --git a/migration/savevm.c b/migration/savevm.c
>> index e33788343a..c713ace891 100644
>> --- a/migration/savevm.c
>> +++ b/migration/savevm.c
>> @@ -1233,6 +1233,32 @@ bool qemu_savevm_state_guest_unplug_pending(void)
>> return false;
>> }
>>
>> +void qemu_savevm_state_initial_data_advise(MigrationState *ms)
>> +{
>> + SaveStateEntry *se;
>> + unsigned int supported_num = 0;
>> +
>> + QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
>> + if (!se->ops || !se->ops->initial_data_advise) {
>> + continue;
>> + }
>> +
>> + if (se->ops->initial_data_advise(se->opaque)) {
>> + supported_num++;
>> + }
>> + }
>> +
>> + if (!supported_num) {
>> + /*
>> + * There are no migration users that support precopy initial data. Set
>> + * initial data loaded acked to true so migration can be completed.
>> + */
>> + ms->initial_data_loaded_acked = true;
>> + }
> This is fine but slightly hackish (e.g. someone might assume we received a
> msg when this var is true, but in this case we just don't need one).
>
> How about remembering this "supported_num" on src too and just checking it
> in initial_data_loaded_acked() above?
I see what you mean.
Yes, I will change it.
>
> PS: we should someday have a common object for both src/dst migration which
> will be suitable for variables like this, so all common vars in
> MigrationState and MigrationIncomingState will go there. Another story.
Oh, cool!
Thanks!
>
>> +
>> + trace_savevm_state_initial_data_advise(supported_num);
>> +}
>> +
>> void qemu_savevm_state_setup(QEMUFile *f)
>> {
>> MigrationState *ms = migrate_get_current();
>> @@ -2586,6 +2612,23 @@ static int qemu_loadvm_state_header(QEMUFile *f)
>> return 0;
>> }
>>
>> +static void qemu_loadvm_state_initial_data_advise(MigrationIncomingState *mis)
>> +{
>> + SaveStateEntry *se;
>> +
>> + QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
>> + if (!se->ops || !se->ops->initial_data_advise) {
>> + continue;
>> + }
>> +
>> + if (se->ops->initial_data_advise(se->opaque)) {
>> + mis->initial_data_pending_num++;
>> + }
>> + }
>> +
>> + trace_loadvm_state_initial_data_advise(mis->initial_data_pending_num);
>> +}
>> +
>> static int qemu_loadvm_state_setup(QEMUFile *f)
>> {
>> SaveStateEntry *se;
>> @@ -2789,6 +2832,10 @@ int qemu_loadvm_state(QEMUFile *f)
>> return -EINVAL;
>> }
>>
>> + if (migrate_precopy_initial_data()) {
>> + qemu_loadvm_state_initial_data_advise(mis);
>> + }
>> +
>> cpu_synchronize_all_pre_loadvm();
>>
>> ret = qemu_loadvm_state_main(f, mis);
>> @@ -2862,6 +2909,24 @@ int qemu_load_device_state(QEMUFile *f)
>> return 0;
>> }
>>
>> +int qemu_loadvm_notify_initial_data_loaded(void)
>> +{
>> + MigrationIncomingState *mis = migration_incoming_get_current();
>> +
>> + if (!mis->initial_data_pending_num) {
>> + return -EINVAL;
>> + }
>> +
>> + mis->initial_data_pending_num--;
>> + trace_loadvm_notify_initial_data_loaded(mis->initial_data_pending_num);
>> +
>> + if (mis->initial_data_pending_num) {
>> + return 0;
>> + }
>> +
>> + return migrate_send_rp_initial_data_loaded_ack(mis);
>> +}
>> +
>> bool save_snapshot(const char *name, bool overwrite, const char *vmstate,
>> bool has_devices, strList *devices, Error **errp)
>> {
>> diff --git a/migration/trace-events b/migration/trace-events
>> index f39818c329..807083c0a1 100644
>> --- a/migration/trace-events
>> +++ b/migration/trace-events
>> @@ -7,6 +7,7 @@ qemu_loadvm_state_section_partend(uint32_t section_id) "%u"
>> qemu_loadvm_state_post_main(int ret) "%d"
>> qemu_loadvm_state_section_startfull(uint32_t section_id, const char *idstr, uint32_t instance_id, uint32_t version_id) "%u(%s) %u %u"
>> qemu_savevm_send_packaged(void) ""
>> +loadvm_state_initial_data_advise(unsigned int initial_data_pending_num) "Initial data pending num=%u"
>> loadvm_state_setup(void) ""
>> loadvm_state_cleanup(void) ""
>> loadvm_handle_cmd_packaged(unsigned int length) "%u"
>> @@ -23,6 +24,7 @@ loadvm_postcopy_ram_handle_discard_end(void) ""
>> loadvm_postcopy_ram_handle_discard_header(const char *ramid, uint16_t len) "%s: %ud"
>> loadvm_process_command(const char *s, uint16_t len) "com=%s len=%d"
>> loadvm_process_command_ping(uint32_t val) "0x%x"
>> +loadvm_notify_initial_data_loaded(unsigned int initial_data_pending_num) "Initial data pending num=%u"
>> postcopy_ram_listen_thread_exit(void) ""
>> postcopy_ram_listen_thread_start(void) ""
>> qemu_savevm_send_postcopy_advise(void) ""
>> @@ -39,6 +41,7 @@ savevm_send_postcopy_resume(void) ""
>> savevm_send_colo_enable(void) ""
>> savevm_send_recv_bitmap(char *name) "%s"
>> savevm_state_setup(void) ""
>> +savevm_state_initial_data_advise(unsigned int initial_data_supported_num) "Initial data supported num=%u"
>> savevm_state_resume_prepare(void) ""
>> savevm_state_header(void) ""
>> savevm_state_iterate(void) ""
>> @@ -180,6 +183,7 @@ source_return_path_thread_loop_top(void) ""
>> source_return_path_thread_pong(uint32_t val) "0x%x"
>> source_return_path_thread_shut(uint32_t val) "0x%x"
>> source_return_path_thread_resume_ack(uint32_t v) "%"PRIu32
>> +source_return_path_thread_initial_data_loaded_ack(void) ""
>> migration_thread_low_pending(uint64_t pending) "%" PRIu64
>> migrate_transferred(uint64_t tranferred, uint64_t time_spent, uint64_t bandwidth, uint64_t size) "transferred %" PRIu64 " time_spent %" PRIu64 " bandwidth %" PRIu64 " max_size %" PRId64
>> process_incoming_migration_co_end(int ret, int ps) "ret=%d postcopy-state=%d"
>> --
>> 2.26.3
>>
> Thanks,
>
> --
> Peter Xu
>
* Re: [PATCH v2 3/7] migration: Enable precopy initial data capability
2023-05-18 7:26 ` Avihai Horon
@ 2023-05-18 13:42 ` Peter Xu
2023-05-18 16:51 ` Avihai Horon
0 siblings, 1 reply; 15+ messages in thread
From: Peter Xu @ 2023-05-18 13:42 UTC (permalink / raw)
To: Avihai Horon
Cc: qemu-devel, Alex Williamson, Cédric Le Goater,
Eduardo Habkost, Marcel Apfelbaum, Philippe Mathieu-Daudé,
Yanan Wang, Juan Quintela, Leonardo Bras, Eric Blake,
Markus Armbruster, Thomas Huth, Laurent Vivier, Paolo Bonzini,
Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
Tarun Gupta, Joao Martins
On Thu, May 18, 2023 at 10:26:04AM +0300, Avihai Horon wrote:
>
> On 17/05/2023 19:07, Peter Xu wrote:
> > On Wed, May 17, 2023 at 06:52:15PM +0300, Avihai Horon wrote:
> > > Now that precopy initial data logic has been implemented, enable the
> > > capability.
> > >
> > > Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> > > Reviewed-by: Juan Quintela <quintela@redhat.com>
> > > ---
> > > migration/options.c | 4 ----
> > > 1 file changed, 4 deletions(-)
> > >
> > > diff --git a/migration/options.c b/migration/options.c
> > > index 0a31921a7a..3449ce4f14 100644
> > > --- a/migration/options.c
> > > +++ b/migration/options.c
> > > @@ -561,10 +561,6 @@ bool migrate_caps_check(bool *old_caps, bool *new_caps, Error **errp)
> > > "capability 'return-path'");
> > > return false;
> > > }
> > > -
> > > - /* Disable this capability until it's implemented */
> > > - error_setg(errp, "'precopy-initial-data' is not implemented yet");
> > > - return false;
> > > }
> > I'm always confused about why we need this instead of having it squashed
> > into patch 1 (or rather, never having these lines at all).
> >
> > The only case where it matters is when someone backports patch 1 but not
> > the rest of the patches. But that would be a really, really weird thing
> > for a backporter to do, and I doubt it's happening.
> >
> > Nor should we merge patch 1 to master without the follow-up patches; we
> > should always merge the whole feature, or keep reworking it on the list.
> >
> > I'd like to know if I missed something else..
>
> There are also git bisect considerations.
> This practice is useful for git bisect for features that are enabled by
> default, so you won't mistakenly run "half a feature" if you do bisect.
> But here the capability must be manually enabled, so maybe it's not that
> useful in this case.
>
> I like it for the sake of good order, so this capability can't be enabled
> before it's fully implemented (even if it's unlikely that someone would do
> that).
Right. I was assuming that someone bisecting such a feature will always
make sure to start from the last commit that got merged, but I see your
point as a general concept.
One slightly cleaner way to avoid adding something only to remove it again:
we can introduce migrate_precopy_initial_data() in patch 2, returning
constantly false, then move patch 1 (the QAPI interface) after the current
patch 2, at which point migrate_precopy_initial_data() starts returning
true. That saves a few lines to remove, and a patch dedicated to removing
them. But fundamentally it's similar indeed.
In case you'd like to keep this as is, feel free to add:
Reviewed-by: Peter Xu <peterx@redhat.com>
Thanks,
--
Peter Xu
* Re: [PATCH v2 3/7] migration: Enable precopy initial data capability
2023-05-18 13:42 ` Peter Xu
@ 2023-05-18 16:51 ` Avihai Horon
0 siblings, 0 replies; 15+ messages in thread
From: Avihai Horon @ 2023-05-18 16:51 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, Alex Williamson, Cédric Le Goater,
Eduardo Habkost, Marcel Apfelbaum, Philippe Mathieu-Daudé,
Yanan Wang, Juan Quintela, Leonardo Bras, Eric Blake,
Markus Armbruster, Thomas Huth, Laurent Vivier, Paolo Bonzini,
Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
Tarun Gupta, Joao Martins
On 18/05/2023 16:42, Peter Xu wrote:
> On Thu, May 18, 2023 at 10:26:04AM +0300, Avihai Horon wrote:
>> On 17/05/2023 19:07, Peter Xu wrote:
>>> On Wed, May 17, 2023 at 06:52:15PM +0300, Avihai Horon wrote:
>>>> Now that precopy initial data logic has been implemented, enable the
>>>> capability.
>>>>
>>>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
>>>> Reviewed-by: Juan Quintela <quintela@redhat.com>
>>>> ---
>>>> migration/options.c | 4 ----
>>>> 1 file changed, 4 deletions(-)
>>>>
>>>> diff --git a/migration/options.c b/migration/options.c
>>>> index 0a31921a7a..3449ce4f14 100644
>>>> --- a/migration/options.c
>>>> +++ b/migration/options.c
>>>> @@ -561,10 +561,6 @@ bool migrate_caps_check(bool *old_caps, bool *new_caps, Error **errp)
>>>> "capability 'return-path'");
>>>> return false;
>>>> }
>>>> -
>>>> - /* Disable this capability until it's implemented */
>>>> - error_setg(errp, "'precopy-initial-data' is not implemented yet");
>>>> - return false;
>>>> }
>>> I'm always confused about why we need this instead of having it squashed
>>> into patch 1 (or rather, never having these lines at all).
>>>
>>> The only case where it matters is when someone backports patch 1 but not
>>> the rest of the patches. But that would be a really, really weird thing
>>> for a backporter to do, and I doubt it's happening.
>>>
>>> Nor should we merge patch 1 to master without the follow-up patches; we
>>> should always merge the whole feature, or keep reworking it on the list.
>>>
>>> I'd like to know if I missed something else..
>> There are also git bisect considerations.
>> This practice is useful for git bisect for features that are enabled by
>> default, so you won't mistakenly run "half a feature" if you do bisect.
>> But here the capability must be manually enabled, so maybe it's not that
>> useful in this case.
>>
>> I like it for the sake of good order, so this capability can't be enabled
>> before it's fully implemented (even if it's unlikely that someone would do
>> that).
> Right. I was assuming that someone bisecting such a feature will always
> make sure to start from the last commit that got merged, but I see your
> point as a general concept.
>
> One slightly cleaner way to avoid adding something only to remove it again:
> we can introduce migrate_precopy_initial_data() in patch 2, returning
> constantly false, then move patch 1 (the QAPI interface) after the current
> patch 2, at which point migrate_precopy_initial_data() starts returning
> true. That saves a few lines to remove, and a patch dedicated to removing
> them. But fundamentally it's similar indeed.
>
> In case you'd like to keep this as is, feel free to add:
>
> Reviewed-by: Peter Xu <peterx@redhat.com>
I think I will keep it as is; it seems more natural to me.
However, if someone insists, I don't mind changing it.
Thanks!
Thread overview: 15+ messages (newest: 2023-05-18 16:57 UTC)
2023-05-17 15:52 [PATCH v2 0/7] migration: Add precopy initial data capability and VFIO precopy support Avihai Horon
2023-05-17 15:52 ` [PATCH v2 1/7] migration: Add precopy initial data capability Avihai Horon
2023-05-17 15:52 ` [PATCH v2 2/7] migration: Implement precopy initial data logic Avihai Horon
2023-05-17 16:39 ` Peter Xu
2023-05-18 7:45 ` Avihai Horon
2023-05-17 15:52 ` [PATCH v2 3/7] migration: Enable precopy initial data capability Avihai Horon
2023-05-17 16:07 ` Peter Xu
2023-05-18 7:26 ` Avihai Horon
2023-05-18 13:42 ` Peter Xu
2023-05-18 16:51 ` Avihai Horon
2023-05-17 15:52 ` [PATCH v2 4/7] tests: Add migration precopy initial data capability test Avihai Horon
2023-05-17 16:40 ` Peter Xu
2023-05-17 15:52 ` [PATCH v2 5/7] vfio/migration: Refactor vfio_save_block() to return saved data size Avihai Horon
2023-05-17 15:52 ` [PATCH v2 6/7] vfio/migration: Add VFIO migration pre-copy support Avihai Horon
2023-05-17 15:52 ` [PATCH v2 7/7] vfio/migration: Add support for precopy initial data capability Avihai Horon