* [PATCH V4 0/8] Live update: cpr-exec
@ 2025-09-22 13:49 Steve Sistare
  2025-09-22 13:49 ` [PATCH V4 1/8] migration: multi-mode notifier Steve Sistare
                   ` (8 more replies)
  0 siblings, 9 replies; 30+ messages in thread
From: Steve Sistare @ 2025-09-22 13:49 UTC
  To: qemu-devel
  Cc: Fabiano Rosas, Peter Xu, Markus Armbruster, Paolo Bonzini,
	Eric Blake, Dr. David Alan Gilbert, Cedric Le Goater,
	Alex Williamson, Steve Sistare
This patch series adds the live migration cpr-exec mode.
The new user-visible interfaces are:
  * cpr-exec (MigMode migration parameter)
  * cpr-exec-command (migration parameter)
cpr-exec mode is similar in most respects to cpr-transfer mode, with the
primary difference being that old QEMU directly exec's new QEMU.  The user
specifies the command to exec new QEMU in the migration parameter
cpr-exec-command.
Why?
In a containerized QEMU environment, cpr-exec reuses an existing QEMU
container and its assigned resources.  By contrast, cpr-transfer mode
requires a new container to be created on the same host as the target of
the CPR operation.  Resources must be reserved for the new container, while
the old container still reserves resources until the operation completes.
Avoiding overcommitment requires extra work in the management layer.
This is one reason why a cloud provider may prefer cpr-exec.  A second reason
is that the container may include agents with their own connections to the
outside world, and such connections remain intact if the container is reused.
How?
cpr-exec preserves descriptors across exec by clearing the CLOEXEC flag,
and by sending the unique name and value of each descriptor to new QEMU
via CPR state.
CPR state cannot be sent over the normal migration channel, because devices
and backends are created prior to reading the channel, so this mode sends
CPR state over a second migration channel that is not visible to the user.
New QEMU reads the second channel prior to creating devices or backends.
The exec itself is trivial.  After writing to the migration channels, the
migration code calls a new main-loop hook to perform the exec.
Example:
In this example, we simply restart the same version of QEMU, but in
a real scenario one would use a new QEMU binary path in cpr-exec-command.
  # qemu-kvm -monitor stdio
  -object memory-backend-memfd,id=ram0,size=1G
  -machine memory-backend=ram0 -machine aux-ram-share=on ...
  QEMU 10.1.50 monitor - type 'help' for more information
  (qemu) info status
  VM status: running
  (qemu) migrate_set_parameter mode cpr-exec
  (qemu) migrate_set_parameter cpr-exec-command qemu-kvm ... -incoming file:vm.state
  (qemu) migrate -d file:vm.state
  (qemu) QEMU 10.1.50 monitor - type 'help' for more information
  (qemu) info status
  VM status: running
Changes in V2:
  * dropped patch "helper to request exec" and use a BH to exec
  * used g_shell_parse_argv for cpr-exec-command parameter
  * fixed check for channel in cpr_state_load
  * tweaked QAPI docs, developer docs, and code comments
  * fixed doc: rename cpr-exec-args -> cpr-exec-command
Steve Sistare (8):
  migration: multi-mode notifier
  migration: add cpr_walk_fd
  oslib: qemu_clear_cloexec
  migration: cpr-exec-command parameter
  migration: cpr-exec save and load
  migration: cpr-exec mode
  migration: cpr-exec docs
  vfio: cpr-exec mode
 docs/devel/migration/CPR.rst   | 106 +++++++++++++++++++++++-
 qapi/migration.json            |  46 ++++++++++-
 include/migration/cpr.h        |   9 +++
 include/migration/misc.h       |  12 +++
 include/qemu/osdep.h           |   9 +++
 hw/vfio/container.c            |   3 +-
 hw/vfio/cpr-iommufd.c          |   3 +-
 hw/vfio/cpr-legacy.c           |   9 ++-
 hw/vfio/cpr.c                  |  13 +--
 migration/cpr-exec.c           | 178 +++++++++++++++++++++++++++++++++++++++++
 migration/cpr.c                |  41 +++++++++-
 migration/migration-hmp-cmds.c |  30 +++++++
 migration/migration.c          |  70 ++++++++++++----
 migration/options.c            |  14 ++++
 migration/ram.c                |   1 +
 migration/vmstate-types.c      |   8 ++
 system/vl.c                    |   4 +-
 util/oslib-posix.c             |   9 +++
 util/oslib-win32.c             |   4 +
 hmp-commands.hx                |   2 +-
 migration/meson.build          |   1 +
 migration/trace-events         |   1 +
 22 files changed, 538 insertions(+), 35 deletions(-)
 create mode 100644 migration/cpr-exec.c
base-commit: e7c1e8043a69c5a8efa39d4f9d111f7c72c076e6
-- 
1.8.3.1
* [PATCH V4 1/8] migration: multi-mode notifier
  2025-09-22 13:49 [PATCH V4 0/8] Live update: cpr-exec Steve Sistare
@ 2025-09-22 13:49 ` Steve Sistare
  2025-09-22 15:18   ` Cédric Le Goater
  2025-09-22 13:49 ` [PATCH V4 2/8] migration: add cpr_walk_fd Steve Sistare
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 30+ messages in thread
From: Steve Sistare @ 2025-09-22 13:49 UTC
  To: qemu-devel
  Cc: Fabiano Rosas, Peter Xu, Markus Armbruster, Paolo Bonzini,
	Eric Blake, Dr. David Alan Gilbert, Cedric Le Goater,
	Alex Williamson, Steve Sistare
Allow a notifier to be added for multiple migration modes.
To allow a notifier to appear on multiple per-mode lists, use
a generic list type.  We can no longer use NotifierWithReturnList,
because it shoehorns the notifier onto a single list.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
---
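
The idea of one notifier appearing on several per-mode lists can be
sketched without GLib as follows.  This is only an illustration under
stated assumptions: the enum, the Notifier and Node types, and every
function name are stand-ins rather than the QEMU definitions, and the
handling of the "all modes" terminator is a guess at the intent of the
header comment, since get_modes() is not reproduced here.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdlib.h>

/* Illustrative stand-ins; these are not the QEMU types. */
enum { MODE_NORMAL, MODE_REBOOT, MODE_TRANSFER, MODE_MAX,
       MODE_ALL = MODE_MAX };

typedef struct Notifier {
    int (*notify)(struct Notifier *n);
    int calls;
} Notifier;

typedef struct Node {
    Notifier *notifier;
    struct Node *next;
} Node;

/* One list head per mode; a notifier may be referenced from several. */
Node *lists[MODE_MAX];

/* Fold an argument list terminated by -1 or MODE_ALL into a bitmask. */
unsigned modes_to_mask(int mode, va_list ap)
{
    unsigned mask = 0;

    while (mode != -1 && mode != MODE_ALL) {
        assert(mode >= 0 && mode < MODE_MAX);
        mask |= 1u << mode;
        mode = va_arg(ap, int);
    }
    if (mode == MODE_ALL) {
        mask = (1u << MODE_MAX) - 1;
    }
    return mask;
}

void add_notifier_modes(Notifier *n, int (*fn)(Notifier *), int mode, ...)
{
    va_list ap;
    unsigned mask;

    va_start(ap, mode);
    mask = modes_to_mask(mode, ap);
    va_end(ap);

    n->notify = fn;
    for (int m = 0; m < MODE_MAX; m++) {
        if (mask & (1u << m)) {
            Node *node = malloc(sizeof(*node));
            assert(node);
            node->notifier = n;
            node->next = lists[m];      /* prepend */
            lists[m] = node;
        }
    }
}

/* Unlink the notifier from every per-mode list it appears on. */
void remove_notifier(Notifier *n)
{
    for (int m = 0; m < MODE_MAX; m++) {
        Node **pp = &lists[m];
        while (*pp) {
            if ((*pp)->notifier == n) {
                Node *dead = *pp;
                *pp = dead->next;
                free(dead);
            } else {
                pp = &(*pp)->next;
            }
        }
    }
    n->notify = NULL;
}

/* Call each notifier registered for @mode; stop at the first failure. */
int call_notifiers(int mode)
{
    Node *next;

    for (Node *node = lists[mode]; node; node = next) {
        next = node->next;              /* tolerate removal in callback */
        int ret = node->notifier->notify(node->notifier);
        if (ret) {
            return ret;
        }
    }
    return 0;
}

/* Example callback: count invocations. */
int count_call(Notifier *n)
{
    n->calls++;
    return 0;
}
```

Because each list node only points at the notifier, the notifier itself
needs no embedded list linkage, which is what lets it sit on more than
one list at a time.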
 include/migration/misc.h | 12 ++++++++++
 migration/migration.c    | 60 +++++++++++++++++++++++++++++++++++++-----------
 2 files changed, 59 insertions(+), 13 deletions(-)
diff --git a/include/migration/misc.h b/include/migration/misc.h
index a261f99..592b930 100644
--- a/include/migration/misc.h
+++ b/include/migration/misc.h
@@ -95,7 +95,19 @@ void migration_add_notifier(NotifierWithReturn *notify,
 void migration_add_notifier_mode(NotifierWithReturn *notify,
                                  MigrationNotifyFunc func, MigMode mode);
 
+/*
+ * Same as migration_add_notifier, but applies to all @mode in the argument
+ * list.  The list is terminated by -1 or MIG_MODE_ALL.  For the latter,
+ * the notifier is added for all modes.
+ */
+void migration_add_notifier_modes(NotifierWithReturn *notify,
+                                  MigrationNotifyFunc func, MigMode mode, ...);
+
+/*
+ * Remove a notifier from all modes.
+ */
 void migration_remove_notifier(NotifierWithReturn *notify);
+
 void migration_file_set_error(int ret, Error *err);
 
 /* True if incoming migration entered POSTCOPY_INCOMING_DISCARD */
diff --git a/migration/migration.c b/migration/migration.c
index 10c216d..08a98f7 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -74,11 +74,7 @@
 
 #define INMIGRATE_DEFAULT_EXIT_ON_ERROR true
 
-static NotifierWithReturnList migration_state_notifiers[] = {
-    NOTIFIER_ELEM_INIT(migration_state_notifiers, MIG_MODE_NORMAL),
-    NOTIFIER_ELEM_INIT(migration_state_notifiers, MIG_MODE_CPR_REBOOT),
-    NOTIFIER_ELEM_INIT(migration_state_notifiers, MIG_MODE_CPR_TRANSFER),
-};
+static GSList *migration_state_notifiers[MIG_MODE__MAX];
 
 /* Messages sent on the return path from destination to source */
 enum mig_rp_message_type {
@@ -1665,23 +1661,51 @@ void migration_cancel(void)
     }
 }
 
+static int get_modes(MigMode mode, va_list ap);
+
+static void add_notifiers(NotifierWithReturn *notify, int modes)
+{
+    for (MigMode mode = 0; mode < MIG_MODE__MAX; mode++) {
+        if (modes & BIT(mode)) {
+            migration_state_notifiers[mode] =
+                g_slist_prepend(migration_state_notifiers[mode], notify);
+        }
+    }
+}
+
+void migration_add_notifier_modes(NotifierWithReturn *notify,
+                                  MigrationNotifyFunc func, MigMode mode, ...)
+{
+    int modes;
+    va_list ap;
+
+    va_start(ap, mode);
+    modes = get_modes(mode, ap);
+    va_end(ap);
+
+    notify->notify = (NotifierWithReturnFunc)func;
+    add_notifiers(notify, modes);
+}
+
 void migration_add_notifier_mode(NotifierWithReturn *notify,
                                  MigrationNotifyFunc func, MigMode mode)
 {
-    notify->notify = (NotifierWithReturnFunc)func;
-    notifier_with_return_list_add(&migration_state_notifiers[mode], notify);
+    migration_add_notifier_modes(notify, func, mode, -1);
 }
 
 void migration_add_notifier(NotifierWithReturn *notify,
                             MigrationNotifyFunc func)
 {
-    migration_add_notifier_mode(notify, func, MIG_MODE_NORMAL);
+    migration_add_notifier_modes(notify, func, MIG_MODE_NORMAL, -1);
 }
 
 void migration_remove_notifier(NotifierWithReturn *notify)
 {
     if (notify->notify) {
-        notifier_with_return_remove(notify);
+        for (MigMode mode = 0; mode < MIG_MODE__MAX; mode++) {
+            migration_state_notifiers[mode] =
+                g_slist_remove(migration_state_notifiers[mode], notify);
+        }
         notify->notify = NULL;
     }
 }
@@ -1691,13 +1715,23 @@ int migration_call_notifiers(MigrationState *s, MigrationEventType type,
 {
     MigMode mode = s->parameters.mode;
     MigrationEvent e;
+    NotifierWithReturn *notifier;
+    GSList *elem, *next;
     int ret;
 
     e.type = type;
-    ret = notifier_with_return_list_notify(&migration_state_notifiers[mode],
-                                           &e, errp);
-    assert(!ret || type == MIG_EVENT_PRECOPY_SETUP);
-    return ret;
+
+    for (elem = migration_state_notifiers[mode]; elem; elem = next) {
+        next = elem->next;
+        notifier = (NotifierWithReturn *)elem->data;
+        ret = notifier->notify(notifier, &e, errp);
+        if (ret) {
+            assert(type == MIG_EVENT_PRECOPY_SETUP);
+            return ret;
+        }
+    }
+
+    return 0;
 }
 
 bool migration_has_failed(MigrationState *s)
-- 
1.8.3.1
* [PATCH V4 2/8] migration: add cpr_walk_fd
  2025-09-22 13:49 [PATCH V4 0/8] Live update: cpr-exec Steve Sistare
  2025-09-22 13:49 ` [PATCH V4 1/8] migration: multi-mode notifier Steve Sistare
@ 2025-09-22 13:49 ` Steve Sistare
  2025-09-22 13:49 ` [PATCH V4 3/8] oslib: qemu_clear_cloexec Steve Sistare
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 30+ messages in thread
From: Steve Sistare @ 2025-09-22 13:49 UTC
  To: qemu-devel
  Cc: Fabiano Rosas, Peter Xu, Markus Armbruster, Paolo Bonzini,
	Eric Blake, Dr. David Alan Gilbert, Cedric Le Goater,
	Alex Williamson, Steve Sistare
Add a helper to walk all CPR fd's and run a callback for each.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
---
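
The walk-with-callback contract (visit each descriptor, stop as soon as
a callback returns false) can be sketched as below.  The fixed array
and the names walk_fds, visit_all and stop_at_nine are illustrative
assumptions; the real helper iterates the CPR state list instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef bool (*walk_fd_cb)(int fd);

/* Illustrative stand-in for the list of saved descriptors. */
static const int saved_fds[] = { 5, 9, 12 };

/* Number of callback invocations, for demonstration only. */
int visited;

/* Run cb on each descriptor; return false at the first cb failure. */
bool walk_fds(walk_fd_cb cb)
{
    for (size_t i = 0; i < sizeof(saved_fds) / sizeof(saved_fds[0]); i++) {
        if (!cb(saved_fds[i])) {
            return false;
        }
    }
    return true;
}

/* Callback that accepts every descriptor. */
bool visit_all(int fd)
{
    (void)fd;
    visited++;
    return true;
}

/* Callback that aborts the walk when it sees descriptor 9. */
bool stop_at_nine(int fd)
{
    visited++;
    return fd != 9;
}
```

A caller that must not leave work half done can use the boolean result
to detect an aborted walk and undo what the earlier callbacks did.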
 include/migration/cpr.h |  3 +++
 migration/cpr.c         | 13 +++++++++++++
 2 files changed, 16 insertions(+)
diff --git a/include/migration/cpr.h b/include/migration/cpr.h
index 3fc19a7..2b074d7 100644
--- a/include/migration/cpr.h
+++ b/include/migration/cpr.h
@@ -34,6 +34,9 @@ void cpr_resave_fd(const char *name, int id, int fd);
 int cpr_open_fd(const char *path, int flags, const char *name, int id,
                 Error **errp);
 
+typedef bool (*cpr_walk_fd_cb)(int fd);
+bool cpr_walk_fd(cpr_walk_fd_cb cb);
+
 MigMode cpr_get_incoming_mode(void);
 void cpr_set_incoming_mode(MigMode mode);
 bool cpr_is_incoming(void);
diff --git a/migration/cpr.c b/migration/cpr.c
index 42ad0b0..d3e370e 100644
--- a/migration/cpr.c
+++ b/migration/cpr.c
@@ -121,6 +121,19 @@ int cpr_open_fd(const char *path, int flags, const char *name, int id,
     return fd;
 }
 
+bool cpr_walk_fd(cpr_walk_fd_cb cb)
+{
+    CprFd *elem;
+
+    QLIST_FOREACH(elem, &cpr_state.fds, next) {
+        g_assert(elem->fd >= 0);
+        if (!cb(elem->fd)) {
+            return false;
+        }
+    }
+    return true;
+}
+
 /*************************************************************************/
 static const VMStateDescription vmstate_cpr_state = {
     .name = CPR_STATE,
-- 
1.8.3.1
* [PATCH V4 3/8] oslib: qemu_clear_cloexec
  2025-09-22 13:49 [PATCH V4 0/8] Live update: cpr-exec Steve Sistare
  2025-09-22 13:49 ` [PATCH V4 1/8] migration: multi-mode notifier Steve Sistare
  2025-09-22 13:49 ` [PATCH V4 2/8] migration: add cpr_walk_fd Steve Sistare
@ 2025-09-22 13:49 ` Steve Sistare
  2025-09-22 13:49 ` [PATCH V4 4/8] migration: cpr-exec-command parameter Steve Sistare
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 30+ messages in thread
From: Steve Sistare @ 2025-09-22 13:49 UTC
  To: qemu-devel
  Cc: Fabiano Rosas, Peter Xu, Markus Armbruster, Paolo Bonzini,
	Eric Blake, Dr. David Alan Gilbert, Cedric Le Goater,
	Alex Williamson, Steve Sistare
Define qemu_clear_cloexec, analogous to qemu_set_cloexec.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
---
 include/qemu/osdep.h | 9 +++++++++
 util/oslib-posix.c   | 9 +++++++++
 util/oslib-win32.c   | 4 ++++
 3 files changed, 22 insertions(+)
diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
index be3460b..8dac4ed 100644
--- a/include/qemu/osdep.h
+++ b/include/qemu/osdep.h
@@ -688,6 +688,15 @@ ssize_t qemu_write_full(int fd, const void *buf, size_t count)
 
 void qemu_set_cloexec(int fd);
 
+/*
+ * Clear FD_CLOEXEC for a descriptor.
+ *
+ * The caller must guarantee that no other fork+exec's occur before the
+ * exec that is intended to inherit this descriptor, eg by suspending CPUs
+ * and blocking monitor commands.
+ */
+void qemu_clear_cloexec(int fd);
+
 /* Return a dynamically allocated directory path that is appropriate for storing
  * local state.
  *
diff --git a/util/oslib-posix.c b/util/oslib-posix.c
index 4ff577e..4c04658 100644
--- a/util/oslib-posix.c
+++ b/util/oslib-posix.c
@@ -307,6 +307,15 @@ int qemu_socketpair(int domain, int type, int protocol, int sv[2])
     return ret;
 }
 
+void qemu_clear_cloexec(int fd)
+{
+    int f;
+    f = fcntl(fd, F_GETFD);
+    assert(f != -1);
+    f = fcntl(fd, F_SETFD, f & ~FD_CLOEXEC);
+    assert(f != -1);
+}
+
 char *
 qemu_get_local_state_dir(void)
 {
diff --git a/util/oslib-win32.c b/util/oslib-win32.c
index b735163..843a901 100644
--- a/util/oslib-win32.c
+++ b/util/oslib-win32.c
@@ -222,6 +222,10 @@ void qemu_set_cloexec(int fd)
 {
 }
 
+void qemu_clear_cloexec(int fd)
+{
+}
+
 int qemu_get_thread_id(void)
 {
     return GetCurrentThreadId();
-- 
1.8.3.1
* [PATCH V4 4/8] migration: cpr-exec-command parameter
  2025-09-22 13:49 [PATCH V4 0/8] Live update: cpr-exec Steve Sistare
                   ` (2 preceding siblings ...)
  2025-09-22 13:49 ` [PATCH V4 3/8] oslib: qemu_clear_cloexec Steve Sistare
@ 2025-09-22 13:49 ` Steve Sistare
  2025-09-22 13:49 ` [PATCH V4 5/8] migration: cpr-exec save and load Steve Sistare
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 30+ messages in thread
From: Steve Sistare @ 2025-09-22 13:49 UTC
  To: qemu-devel
  Cc: Fabiano Rosas, Peter Xu, Markus Armbruster, Paolo Bonzini,
	Eric Blake, Dr. David Alan Gilbert, Cedric Le Goater,
	Alex Williamson, Steve Sistare
Create the cpr-exec-command migration parameter, defined as a list of
strings.  It will be used for cpr-exec migration mode in a subsequent
patch, and contains forward references to cpr-exec mode in the qapi
doc.
No functional change, except that cpr-exec-command is shown by the
'info migrate' command.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Acked-by: Markus Armbruster <armbru@redhat.com>
---
 qapi/migration.json            | 21 ++++++++++++++++++---
 migration/migration-hmp-cmds.c | 30 ++++++++++++++++++++++++++++++
 migration/options.c            | 14 ++++++++++++++
 hmp-commands.hx                |  2 +-
 4 files changed, 63 insertions(+), 4 deletions(-)
diff --git a/qapi/migration.json b/qapi/migration.json
index 2387c21..2be8fa1 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -924,6 +924,10 @@
 #     only has effect if the @mapped-ram capability is enabled.
 #     (Since 9.1)
 #
+# @cpr-exec-command: Command to start the new QEMU process when @mode
+#     is @cpr-exec.  The first list element is the program's filename,
+#     the remainder its arguments.  (Since 10.2)
+#
 # Features:
 #
 # @unstable: Members @x-checkpoint-delay and
@@ -950,7 +954,8 @@
            'vcpu-dirty-limit',
            'mode',
            'zero-page-detection',
-           'direct-io'] }
+           'direct-io',
+           'cpr-exec-command'] }
 
 ##
 # @MigrateSetParameters:
@@ -1105,6 +1110,10 @@
 #     only has effect if the @mapped-ram capability is enabled.
 #     (Since 9.1)
 #
+# @cpr-exec-command: Command to start the new QEMU process when @mode
+#     is @cpr-exec.  The first list element is the program's filename,
+#     the remainder its arguments.  (Since 10.2)
+#
 # Features:
 #
 # @unstable: Members @x-checkpoint-delay and
@@ -1146,7 +1155,8 @@
             '*vcpu-dirty-limit': 'uint64',
             '*mode': 'MigMode',
             '*zero-page-detection': 'ZeroPageDetection',
-            '*direct-io': 'bool' } }
+            '*direct-io': 'bool',
+            '*cpr-exec-command': [ 'str' ]} }
 
 ##
 # @migrate-set-parameters:
@@ -1315,6 +1325,10 @@
 #     only has effect if the @mapped-ram capability is enabled.
 #     (Since 9.1)
 #
+# @cpr-exec-command: Command to start the new QEMU process when @mode
+#     is @cpr-exec.  The first list element is the program's filename,
+#     the remainder its arguments.  (Since 10.2)
+#
 # Features:
 #
 # @unstable: Members @x-checkpoint-delay and
@@ -1353,7 +1367,8 @@
             '*vcpu-dirty-limit': 'uint64',
             '*mode': 'MigMode',
             '*zero-page-detection': 'ZeroPageDetection',
-            '*direct-io': 'bool' } }
+            '*direct-io': 'bool',
+            '*cpr-exec-command': [ 'str' ]} }
 
 ##
 # @query-migrate-parameters:
diff --git a/migration/migration-hmp-cmds.c b/migration/migration-hmp-cmds.c
index 0fc21f0..54df615 100644
--- a/migration/migration-hmp-cmds.c
+++ b/migration/migration-hmp-cmds.c
@@ -306,6 +306,18 @@ void hmp_info_migrate_capabilities(Monitor *mon, const QDict *qdict)
     qapi_free_MigrationCapabilityStatusList(caps);
 }
 
+static void monitor_print_cpr_exec_command(Monitor *mon, strList *args)
+{
+    monitor_printf(mon, "%s:",
+        MigrationParameter_str(MIGRATION_PARAMETER_CPR_EXEC_COMMAND));
+
+    while (args) {
+        monitor_printf(mon, " %s", args->value);
+        args = args->next;
+    }
+    monitor_printf(mon, "\n");
+}
+
 void hmp_info_migrate_parameters(Monitor *mon, const QDict *qdict)
 {
     MigrationParameters *params;
@@ -435,6 +447,9 @@ void hmp_info_migrate_parameters(Monitor *mon, const QDict *qdict)
                                MIGRATION_PARAMETER_DIRECT_IO),
                            params->direct_io ? "on" : "off");
         }
+
+        assert(params->has_cpr_exec_command);
+        monitor_print_cpr_exec_command(mon, params->cpr_exec_command);
     }
 
     qapi_free_MigrationParameters(params);
@@ -716,6 +731,21 @@ void hmp_migrate_set_parameter(Monitor *mon, const QDict *qdict)
         p->has_direct_io = true;
         visit_type_bool(v, param, &p->direct_io, &err);
         break;
+    case MIGRATION_PARAMETER_CPR_EXEC_COMMAND: {
+        g_autofree char **strv = NULL;
+        g_autoptr(GError) gerr = NULL;
+        strList **tail = &p->cpr_exec_command;
+
+        if (!g_shell_parse_argv(valuestr, NULL, &strv, &gerr)) {
+            error_setg(&err, "%s", gerr->message);
+            break;
+        }
+        for (int i = 0; strv[i]; i++) {
+            QAPI_LIST_APPEND(tail, strv[i]);
+        }
+        p->has_cpr_exec_command = true;
+        break;
+    }
     default:
         g_assert_not_reached();
     }
diff --git a/migration/options.c b/migration/options.c
index 4e923a2..5183112 100644
--- a/migration/options.c
+++ b/migration/options.c
@@ -959,6 +959,9 @@ MigrationParameters *qmp_query_migrate_parameters(Error **errp)
     params->zero_page_detection = s->parameters.zero_page_detection;
     params->has_direct_io = true;
     params->direct_io = s->parameters.direct_io;
+    params->has_cpr_exec_command = true;
+    params->cpr_exec_command = QAPI_CLONE(strList,
+                                          s->parameters.cpr_exec_command);
 
     return params;
 }
@@ -993,6 +996,7 @@ void migrate_params_init(MigrationParameters *params)
     params->has_mode = true;
     params->has_zero_page_detection = true;
     params->has_direct_io = true;
+    params->has_cpr_exec_command = true;
 }
 
 /*
@@ -1297,6 +1301,10 @@ static void migrate_params_test_apply(MigrateSetParameters *params,
     if (params->has_direct_io) {
         dest->direct_io = params->direct_io;
     }
+
+    if (params->has_cpr_exec_command) {
+        dest->cpr_exec_command = params->cpr_exec_command;
+    }
 }
 
 static void migrate_params_apply(MigrateSetParameters *params, Error **errp)
@@ -1429,6 +1437,12 @@ static void migrate_params_apply(MigrateSetParameters *params, Error **errp)
     if (params->has_direct_io) {
         s->parameters.direct_io = params->direct_io;
     }
+
+    if (params->has_cpr_exec_command) {
+        qapi_free_strList(s->parameters.cpr_exec_command);
+        s->parameters.cpr_exec_command =
+            QAPI_CLONE(strList, params->cpr_exec_command);
+    }
 }
 
 void qmp_migrate_set_parameters(MigrateSetParameters *params, Error **errp)
diff --git a/hmp-commands.hx b/hmp-commands.hx
index d0e4f35..3cace8f 100644
--- a/hmp-commands.hx
+++ b/hmp-commands.hx
@@ -1009,7 +1009,7 @@ ERST
 
     {
         .name       = "migrate_set_parameter",
-        .args_type  = "parameter:s,value:s",
+        .args_type  = "parameter:s,value:S",
         .params     = "parameter value",
         .help       = "Set the parameter for migration",
         .cmd        = hmp_migrate_set_parameter,
-- 
1.8.3.1
* [PATCH V4 5/8] migration: cpr-exec save and load
  2025-09-22 13:49 [PATCH V4 0/8] Live update: cpr-exec Steve Sistare
                   ` (3 preceding siblings ...)
  2025-09-22 13:49 ` [PATCH V4 4/8] migration: cpr-exec-command parameter Steve Sistare
@ 2025-09-22 13:49 ` Steve Sistare
  2025-09-22 16:00   ` Cédric Le Goater
  2025-09-22 13:49 ` [PATCH V4 6/8] migration: cpr-exec mode Steve Sistare
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 30+ messages in thread
From: Steve Sistare @ 2025-09-22 13:49 UTC
  To: qemu-devel
  Cc: Fabiano Rosas, Peter Xu, Markus Armbruster, Paolo Bonzini,
	Eric Blake, Dr. David Alan Gilbert, Cedric Le Goater,
	Alex Williamson, Steve Sistare
To preserve CPR state across exec, create a QEMUFile based on a memfd, and
keep the memfd open across exec.  Save the value of the memfd in an
environment variable so post-exec QEMU can find it.
These new functions are called in a subsequent patch.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
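
The memfd-plus-environment-variable handoff can be tried in isolation
with the standalone sketch below.  It runs both halves in one process
and does not exec, so it only shows the bookkeeping.  Assumptions: the
variable name DEMO_CPR_EXEC_STATE and the functions persist_state and
load_state are made up for illustration, and a Linux glibc that
provides memfd_create() is required.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* for memfd_create() */
#endif
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define STATE_ENV "DEMO_CPR_EXEC_STATE"     /* hypothetical name */

/*
 * Save side: write payload to a memfd and publish the descriptor number
 * in the environment.  Returns the descriptor, or -1 on failure.
 */
int persist_state(const char *payload)
{
    char val[16];
    size_t len = strlen(payload);
    int mfd = memfd_create(STATE_ENV, 0);   /* no MFD_CLOEXEC */

    if (mfd < 0) {
        return -1;
    }
    if (write(mfd, payload, len) != (ssize_t)len) {
        close(mfd);
        return -1;
    }
    snprintf(val, sizeof(val), "%d", mfd);
    if (setenv(STATE_ENV, val, 1) != 0) {
        close(mfd);
        return -1;
    }
    return mfd;
}

/*
 * Load side: recover the descriptor number from the environment, rewind,
 * and read the payload back as a string.  Returns bytes read, or -1.
 */
ssize_t load_state(char *buf, size_t size)
{
    const char *val = getenv(STATE_ENV);
    char *end;
    long mfd;
    ssize_t n;

    if (!val || size == 0) {
        return -1;
    }
    mfd = strtol(val, &end, 10);            /* parse before unsetenv */
    if (end == val || *end != '\0' || mfd < 0) {
        return -1;
    }
    unsetenv(STATE_ENV);

    if (lseek((int)mfd, 0, SEEK_SET) < 0) {
        close((int)mfd);
        return -1;
    }
    n = read((int)mfd, buf, size - 1);
    if (n >= 0) {
        buf[n] = '\0';
    }
    close((int)mfd);
    return n;
}
```

The rewind matters because the writer leaves the file offset at the end
of the data, and the offset is shared by whoever inherits the
descriptor.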
 include/migration/cpr.h |  5 +++
 migration/cpr-exec.c    | 94 +++++++++++++++++++++++++++++++++++++++++++++++++
 migration/meson.build   |  1 +
 3 files changed, 100 insertions(+)
 create mode 100644 migration/cpr-exec.c
diff --git a/include/migration/cpr.h b/include/migration/cpr.h
index 2b074d7..b84389f 100644
--- a/include/migration/cpr.h
+++ b/include/migration/cpr.h
@@ -53,4 +53,9 @@ int cpr_get_fd_param(const char *name, const char *fdname, int index,
 QEMUFile *cpr_transfer_output(MigrationChannel *channel, Error **errp);
 QEMUFile *cpr_transfer_input(MigrationChannel *channel, Error **errp);
 
+QEMUFile *cpr_exec_output(Error **errp);
+QEMUFile *cpr_exec_input(Error **errp);
+void cpr_exec_persist_state(QEMUFile *f);
+bool cpr_exec_has_state(void);
+void cpr_exec_unpersist_state(void);
 #endif
diff --git a/migration/cpr-exec.c b/migration/cpr-exec.c
new file mode 100644
index 0000000..2c32e9c
--- /dev/null
+++ b/migration/cpr-exec.c
@@ -0,0 +1,94 @@
+/*
+ * Copyright (c) 2021-2025 Oracle and/or its affiliates.
+ *
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/cutils.h"
+#include "qemu/memfd.h"
+#include "qapi/error.h"
+#include "io/channel-file.h"
+#include "io/channel-socket.h"
+#include "migration/cpr.h"
+#include "migration/qemu-file.h"
+#include "migration/misc.h"
+#include "migration/vmstate.h"
+#include "system/runstate.h"
+
+#define CPR_EXEC_STATE_NAME "QEMU_CPR_EXEC_STATE"
+
+static QEMUFile *qemu_file_new_fd_input(int fd, const char *name)
+{
+    g_autoptr(QIOChannelFile) fioc = qio_channel_file_new_fd(fd);
+    QIOChannel *ioc = QIO_CHANNEL(fioc);
+    qio_channel_set_name(ioc, name);
+    return qemu_file_new_input(ioc);
+}
+
+static QEMUFile *qemu_file_new_fd_output(int fd, const char *name)
+{
+    g_autoptr(QIOChannelFile) fioc = qio_channel_file_new_fd(fd);
+    QIOChannel *ioc = QIO_CHANNEL(fioc);
+    qio_channel_set_name(ioc, name);
+    return qemu_file_new_output(ioc);
+}
+
+void cpr_exec_persist_state(QEMUFile *f)
+{
+    QIOChannelFile *fioc = QIO_CHANNEL_FILE(qemu_file_get_ioc(f));
+    int mfd = dup(fioc->fd);
+    char val[16];
+
+    /* Remember mfd in environment for post-exec load */
+    qemu_clear_cloexec(mfd);
+    snprintf(val, sizeof(val), "%d", mfd);
+    g_setenv(CPR_EXEC_STATE_NAME, val, 1);
+}
+
+static int cpr_exec_find_state(void)
+{
+    const char *val = g_getenv(CPR_EXEC_STATE_NAME);
+    int mfd;
+
+    assert(val);
+    g_unsetenv(CPR_EXEC_STATE_NAME);
+    assert(!qemu_strtoi(val, NULL, 10, &mfd));
+    return mfd;
+}
+
+bool cpr_exec_has_state(void)
+{
+    return g_getenv(CPR_EXEC_STATE_NAME) != NULL;
+}
+
+void cpr_exec_unpersist_state(void)
+{
+    int mfd;
+    const char *val = g_getenv(CPR_EXEC_STATE_NAME);
+
+    g_unsetenv(CPR_EXEC_STATE_NAME);
+    assert(val);
+    assert(!qemu_strtoi(val, NULL, 10, &mfd));
+    close(mfd);
+}
+
+QEMUFile *cpr_exec_output(Error **errp)
+{
+    int mfd = memfd_create(CPR_EXEC_STATE_NAME, 0);
+
+    if (mfd < 0) {
+        error_setg_errno(errp, errno, "memfd_create failed");
+        return NULL;
+    }
+
+    return qemu_file_new_fd_output(mfd, CPR_EXEC_STATE_NAME);
+}
+
+QEMUFile *cpr_exec_input(Error **errp)
+{
+    int mfd = cpr_exec_find_state();
+
+    lseek(mfd, 0, SEEK_SET);
+    return qemu_file_new_fd_input(mfd, CPR_EXEC_STATE_NAME);
+}
diff --git a/migration/meson.build b/migration/meson.build
index 0f71544..16909d5 100644
--- a/migration/meson.build
+++ b/migration/meson.build
@@ -16,6 +16,7 @@ system_ss.add(files(
   'channel-block.c',
   'cpr.c',
   'cpr-transfer.c',
+  'cpr-exec.c',
   'cpu-throttle.c',
   'dirtyrate.c',
   'exec.c',
-- 
1.8.3.1
* [PATCH V4 6/8] migration: cpr-exec mode
  2025-09-22 13:49 [PATCH V4 0/8] Live update: cpr-exec Steve Sistare
                   ` (4 preceding siblings ...)
  2025-09-22 13:49 ` [PATCH V4 5/8] migration: cpr-exec save and load Steve Sistare
@ 2025-09-22 13:49 ` Steve Sistare
  2025-09-22 15:28   ` Cédric Le Goater
  2025-09-30 16:39   ` Peter Xu
  2025-09-22 13:49 ` [PATCH V4 7/8] migration: cpr-exec docs Steve Sistare
                   ` (2 subsequent siblings)
  8 siblings, 2 replies; 30+ messages in thread
From: Steve Sistare @ 2025-09-22 13:49 UTC
  To: qemu-devel
  Cc: Fabiano Rosas, Peter Xu, Markus Armbruster, Paolo Bonzini,
	Eric Blake, Dr. David Alan Gilbert, Cedric Le Goater,
	Alex Williamson, Steve Sistare
Add the cpr-exec migration mode.  Usage:
  qemu-system-$arch -machine aux-ram-share=on ...
  migrate_set_parameter mode cpr-exec
  migrate_set_parameter cpr-exec-command \
    <arg1> <arg2> ... -incoming <uri-1>
  migrate -d <uri-1>
The migrate command stops the VM, saves state to uri-1,
directly exec's a new version of QEMU on the same host,
replacing the original process while retaining its PID, and
loads state from uri-1.  Guest RAM is preserved in place,
albeit with new virtual addresses.
The new QEMU process is started by exec'ing the command
specified by the @cpr-exec-command parameter.  The first word of
the command is the binary, and the remaining words are its
arguments.  The command may be a direct invocation of new QEMU,
or may be a non-QEMU command that exec's the new QEMU binary.
This mode creates a second migration channel that is not visible
to the user.  At the start of migration, old QEMU saves CPR state
to the second channel, and at the end of migration, it tells the
main loop to call cpr_exec.  New QEMU loads CPR state early, before
objects are created.
Because old QEMU terminates when new QEMU starts, one cannot
stream data between the two, so uri-1 must be a type,
such as a file, that accepts all data before old QEMU exits.
Otherwise, old QEMU may quietly block writing to the channel.
Memory-backend objects must have the share=on attribute, but
memory-backend-epc is not supported.  The VM must be started with
the '-machine aux-ram-share=on' option, which allows anonymous
memory to be transferred in place to the new process.  The memfds
are kept open across exec by clearing the close-on-exec flag, their
values are saved in CPR state, and they are mmap'd in new QEMU.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Acked-by: Markus Armbruster <armbru@redhat.com>
---
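
As a rough standalone illustration of the exec step and its failure
path: clear close-on-exec, attempt the exec, and if it returns, record
errno at once, restore close-on-exec, and build an error message.  The
names set_cloexec and exec_preserving_fd are hypothetical, and the
sketch handles a single descriptor rather than walking the CPR list.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Set (on != 0) or clear (on == 0) FD_CLOEXEC; returns 0 on success. */
int set_cloexec(int fd, int on)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags == -1) {
        return -1;
    }
    flags = on ? (flags | FD_CLOEXEC) : (flags & ~FD_CLOEXEC);
    return fcntl(fd, F_SETFD, flags) == -1 ? -1 : 0;
}

/*
 * Try to exec argv with fd preserved.  This only returns if the exec
 * failed: close-on-exec is restored on fd, a message is written to msg,
 * and the errno value from the failure is returned.
 */
int exec_preserving_fd(char *const argv[], int fd, char *msg, size_t msglen)
{
    int saved_errno;

    if (set_cloexec(fd, 0) != 0) {
        return errno;
    }

    execvp(argv[0], argv);

    /* Capture errno before any later call has a chance to change it. */
    saved_errno = errno;
    set_cloexec(fd, 1);
    snprintf(msg, msglen, "execvp %s failed: %s", argv[0],
             strerror(saved_errno));
    return saved_errno;
}
```

Restoring the flag on failure keeps the descriptor from leaking into
unrelated child processes if the original process carries on running.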
 qapi/migration.json       | 25 +++++++++++++-
 include/migration/cpr.h   |  1 +
 migration/cpr-exec.c      | 84 +++++++++++++++++++++++++++++++++++++++++++++++
 migration/cpr.c           | 28 ++++++++++++++--
 migration/migration.c     | 10 +++++-
 migration/ram.c           |  1 +
 migration/vmstate-types.c |  8 +++++
 system/vl.c               |  4 ++-
 migration/trace-events    |  1 +
 9 files changed, 157 insertions(+), 5 deletions(-)
diff --git a/qapi/migration.json b/qapi/migration.json
index 2be8fa1..be0f3fc 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -694,9 +694,32 @@
 #     until you issue the `migrate-incoming` command.
 #
 #     (since 10.0)
+#
+# @cpr-exec: The migrate command stops the VM, saves state to the
+#     migration channel, directly exec's a new version of QEMU on the
+#     same host, replacing the original process while retaining its
+#     PID, and loads state from the channel.  Guest RAM is preserved
+#     in place.  Devices and their pinned pages are also preserved for
+#     VFIO and IOMMUFD.
+#
+#     Old QEMU starts new QEMU by exec'ing the command specified by
+#     the @cpr-exec-command parameter.  The command may be a direct
+#     invocation of new QEMU, or may be a wrapper that exec's the new
+#     QEMU binary.
+#
+#     Because old QEMU terminates when new QEMU starts, one cannot
+#     stream data between the two, so the channel must be a type,
+#     such as a file, that accepts all data before old QEMU exits.
+#     Otherwise, old QEMU may quietly block writing to the channel.
+#
+#     Memory-backend objects must have the share=on attribute, but
+#     memory-backend-epc is not supported.  The VM must be started
+#     with the '-machine aux-ram-share=on' option.
+#
+#     (since 10.2)
 ##
 { 'enum': 'MigMode',
-  'data': [ 'normal', 'cpr-reboot', 'cpr-transfer' ] }
+  'data': [ 'normal', 'cpr-reboot', 'cpr-transfer', 'cpr-exec' ] }
 
 ##
 # @ZeroPageDetection:
diff --git a/include/migration/cpr.h b/include/migration/cpr.h
index b84389f..beed392 100644
--- a/include/migration/cpr.h
+++ b/include/migration/cpr.h
@@ -53,6 +53,7 @@ int cpr_get_fd_param(const char *name, const char *fdname, int index,
 QEMUFile *cpr_transfer_output(MigrationChannel *channel, Error **errp);
 QEMUFile *cpr_transfer_input(MigrationChannel *channel, Error **errp);
 
+void cpr_exec_init(void);
 QEMUFile *cpr_exec_output(Error **errp);
 QEMUFile *cpr_exec_input(Error **errp);
 void cpr_exec_persist_state(QEMUFile *f);
diff --git a/migration/cpr-exec.c b/migration/cpr-exec.c
index 2c32e9c..8cf55a3 100644
--- a/migration/cpr-exec.c
+++ b/migration/cpr-exec.c
@@ -6,15 +6,21 @@
 
 #include "qemu/osdep.h"
 #include "qemu/cutils.h"
+#include "qemu/error-report.h"
 #include "qemu/memfd.h"
 #include "qapi/error.h"
+#include "qapi/type-helpers.h"
 #include "io/channel-file.h"
 #include "io/channel-socket.h"
+#include "block/block-global-state.h"
+#include "qemu/main-loop.h"
 #include "migration/cpr.h"
 #include "migration/qemu-file.h"
+#include "migration/migration.h"
 #include "migration/misc.h"
 #include "migration/vmstate.h"
 #include "system/runstate.h"
+#include "trace.h"
 
 #define CPR_EXEC_STATE_NAME "QEMU_CPR_EXEC_STATE"
 
@@ -92,3 +98,81 @@ QEMUFile *cpr_exec_input(Error **errp)
     lseek(mfd, 0, SEEK_SET);
     return qemu_file_new_fd_input(mfd, CPR_EXEC_STATE_NAME);
 }
+
+static bool preserve_fd(int fd)
+{
+    qemu_clear_cloexec(fd);
+    return true;
+}
+
+static bool unpreserve_fd(int fd)
+{
+    qemu_set_cloexec(fd);
+    return true;
+}
+
+static void cpr_exec_cb(void *opaque)
+{
+    MigrationState *s = migrate_get_current();
+    char **argv = strv_from_str_list(s->parameters.cpr_exec_command);
+    Error *err = NULL;
+
+    /*
+     * Clear the close-on-exec flag for all preserved fd's.  We cannot do so
+     * earlier because they should not persist across miscellaneous fork and
+     * exec calls that are performed during normal operation.
+     */
+    cpr_walk_fd(preserve_fd);
+
+    trace_cpr_exec();
+    execvp(argv[0], argv);
+
+    /*
+     * exec should only fail if argv[0] is bogus, or has a permissions problem,
+     * or the system is very short on resources.
+     */
+    error_setg_errno(&err, errno, "execvp %s failed", argv[0]);
+    g_strfreev(argv);
+    cpr_walk_fd(unpreserve_fd);
+
+    error_report_err(error_copy(err));
+    migrate_set_state(&s->state, s->state, MIGRATION_STATUS_FAILED);
+    migrate_set_error(s, err);
+
+    migration_call_notifiers(s, MIG_EVENT_PRECOPY_FAILED, NULL);
+
+    err = NULL;
+    if (!migration_block_activate(&err)) {
+        /* error was already reported */
+        return;
+    }
+
+    if (runstate_is_live(s->vm_old_state)) {
+        vm_start();
+    }
+}
+
+static int cpr_exec_notifier(NotifierWithReturn *notifier, MigrationEvent *e,
+                             Error **errp)
+{
+    MigrationState *s = migrate_get_current();
+
+    if (e->type == MIG_EVENT_PRECOPY_DONE) {
+        QEMUBH *cpr_exec_bh = qemu_bh_new(cpr_exec_cb, NULL);
+        assert(s->state == MIGRATION_STATUS_COMPLETED);
+        qemu_bh_schedule(cpr_exec_bh);
+        qemu_notify_event();
+
+    } else if (e->type == MIG_EVENT_PRECOPY_FAILED) {
+        cpr_exec_unpersist_state();
+    }
+    return 0;
+}
+
+void cpr_exec_init(void)
+{
+    static NotifierWithReturn exec_notifier;
+
+    migration_add_notifier_mode(&exec_notifier, cpr_exec_notifier,
+                                MIG_MODE_CPR_EXEC);
+}
diff --git a/migration/cpr.c b/migration/cpr.c
index d3e370e..eea3773 100644
--- a/migration/cpr.c
+++ b/migration/cpr.c
@@ -185,6 +185,8 @@ int cpr_state_save(MigrationChannel *channel, Error **errp)
     if (mode == MIG_MODE_CPR_TRANSFER) {
         g_assert(channel);
         f = cpr_transfer_output(channel, errp);
+    } else if (mode == MIG_MODE_CPR_EXEC) {
+        f = cpr_exec_output(errp);
     } else {
         return 0;
     }
@@ -202,6 +204,10 @@ int cpr_state_save(MigrationChannel *channel, Error **errp)
         return ret;
     }
 
+    if (migrate_mode() == MIG_MODE_CPR_EXEC) {
+        cpr_exec_persist_state(f);
+    }
+
     /*
      * Close the socket only partially so we can later detect when the other
      * end closes by getting a HUP event.
@@ -213,6 +219,12 @@ int cpr_state_save(MigrationChannel *channel, Error **errp)
     return 0;
 }
 
+static bool unpreserve_fd(int fd)
+{
+    qemu_set_cloexec(fd);
+    return true;
+}
+
 int cpr_state_load(MigrationChannel *channel, Error **errp)
 {
     int ret;
@@ -220,7 +232,13 @@ int cpr_state_load(MigrationChannel *channel, Error **errp)
     QEMUFile *f;
     MigMode mode = 0;
 
-    if (channel) {
+    if (cpr_exec_has_state()) {
+        mode = MIG_MODE_CPR_EXEC;
+        f = cpr_exec_input(errp);
+        if (channel) {
+            warn_report("ignoring cpr channel for migration mode cpr-exec");
+        }
+    } else if (channel) {
         mode = MIG_MODE_CPR_TRANSFER;
         cpr_set_incoming_mode(mode);
         f = cpr_transfer_input(channel, errp);
@@ -232,6 +250,7 @@ int cpr_state_load(MigrationChannel *channel, Error **errp)
     }
 
     trace_cpr_state_load(MigMode_str(mode));
+    cpr_set_incoming_mode(mode);
 
     v = qemu_get_be32(f);
     if (v != QEMU_CPR_FILE_MAGIC) {
@@ -253,6 +272,11 @@ int cpr_state_load(MigrationChannel *channel, Error **errp)
         return ret;
     }
 
+    if (migrate_mode() == MIG_MODE_CPR_EXEC) {
+        /* Set cloexec to prevent fd leaks from fork until the next cpr-exec */
+        cpr_walk_fd(unpreserve_fd);
+    }
+
     /*
      * Let the caller decide when to close the socket (and generate a HUP event
      * for the sending side).
@@ -273,7 +297,7 @@ void cpr_state_close(void)
 bool cpr_incoming_needed(void *opaque)
 {
     MigMode mode = migrate_mode();
-    return mode == MIG_MODE_CPR_TRANSFER;
+    return mode == MIG_MODE_CPR_TRANSFER || mode == MIG_MODE_CPR_EXEC;
 }
 
 /*
diff --git a/migration/migration.c b/migration/migration.c
index 08a98f7..2515bec 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -333,6 +333,7 @@ void migration_object_init(void)
 
     ram_mig_init();
     dirty_bitmap_mig_init();
+    cpr_exec_init();
 
     /* Initialize cpu throttle timers */
     cpu_throttle_init();
@@ -1796,7 +1797,8 @@ bool migrate_mode_is_cpr(MigrationState *s)
 {
     MigMode mode = s->parameters.mode;
     return mode == MIG_MODE_CPR_REBOOT ||
-           mode == MIG_MODE_CPR_TRANSFER;
+           mode == MIG_MODE_CPR_TRANSFER ||
+           mode == MIG_MODE_CPR_EXEC;
 }
 
 int migrate_init(MigrationState *s, Error **errp)
@@ -2145,6 +2147,12 @@ static bool migrate_prepare(MigrationState *s, bool resume, Error **errp)
         return false;
     }
 
+    if (migrate_mode() == MIG_MODE_CPR_EXEC &&
+        !s->parameters.has_cpr_exec_command) {
+        error_setg(errp, "cpr-exec mode requires setting cpr-exec-command");
+        return false;
+    }
+
     if (migration_is_blocked(errp)) {
         return false;
     }
diff --git a/migration/ram.c b/migration/ram.c
index 7208bc1..6730a41 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -228,6 +228,7 @@ bool migrate_ram_is_ignored(RAMBlock *block)
     MigMode mode = migrate_mode();
     return !qemu_ram_is_migratable(block) ||
            mode == MIG_MODE_CPR_TRANSFER ||
+           mode == MIG_MODE_CPR_EXEC ||
            (migrate_ignore_shared() && qemu_ram_is_shared(block)
                                     && qemu_ram_is_named_file(block));
 }
diff --git a/migration/vmstate-types.c b/migration/vmstate-types.c
index 741a588..1aa0573 100644
--- a/migration/vmstate-types.c
+++ b/migration/vmstate-types.c
@@ -321,6 +321,10 @@ static int get_fd(QEMUFile *f, void *pv, size_t size,
                   const VMStateField *field)
 {
     int32_t *v = pv;
+    if (migrate_mode() == MIG_MODE_CPR_EXEC) {
+        qemu_get_sbe32s(f, v);
+        return 0;
+    }
     *v = qemu_file_get_fd(f);
     return 0;
 }
@@ -329,6 +333,10 @@ static int put_fd(QEMUFile *f, void *pv, size_t size,
                   const VMStateField *field, JSONWriter *vmdesc)
 {
     int32_t *v = pv;
+    if (migrate_mode() == MIG_MODE_CPR_EXEC) {
+        qemu_put_sbe32s(f, v);
+        return 0;
+    }
     return qemu_file_put_fd(f, *v);
 }
 
diff --git a/system/vl.c b/system/vl.c
index 4c24073..f395d04 100644
--- a/system/vl.c
+++ b/system/vl.c
@@ -3867,6 +3867,8 @@ void qemu_init(int argc, char **argv)
     }
     qemu_init_displays();
     accel_setup_post(current_machine);
-    os_setup_post();
+    if (migrate_mode() != MIG_MODE_CPR_EXEC) {
+        os_setup_post();
+    }
     resume_mux_open();
 }
diff --git a/migration/trace-events b/migration/trace-events
index 706db97..e8edd1f 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -354,6 +354,7 @@ cpr_state_save(const char *mode) "%s mode"
 cpr_state_load(const char *mode) "%s mode"
 cpr_transfer_input(const char *path) "%s"
 cpr_transfer_output(const char *path) "%s"
+cpr_exec(void) ""
 
 # block-dirty-bitmap.c
 send_bitmap_header_enter(void) ""
-- 
1.8.3.1
* [PATCH V4 7/8] migration: cpr-exec docs
  2025-09-22 13:49 [PATCH V4 0/8] Live update: cpr-exec Steve Sistare
                   ` (5 preceding siblings ...)
  2025-09-22 13:49 ` [PATCH V4 6/8] migration: cpr-exec mode Steve Sistare
@ 2025-09-22 13:49 ` Steve Sistare
  2025-09-22 13:49 ` [PATCH V4 8/8] vfio: cpr-exec mode Steve Sistare
  2025-09-30 15:28 ` [PATCH V4 0/8] Live update: cpr-exec Steven Sistare
  8 siblings, 0 replies; 30+ messages in thread
From: Steve Sistare @ 2025-09-22 13:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: Fabiano Rosas, Peter Xu, Markus Armbruster, Paolo Bonzini,
	Eric Blake, Dr. David Alan Gilbert, Cedric Le Goater,
	Alex Williamson, Steve Sistare
Update developer documentation for cpr-exec mode.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
---
 docs/devel/migration/CPR.rst | 106 ++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 105 insertions(+), 1 deletion(-)
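
The exec step that these docs describe only returns to old QEMU on failure.
A minimal sketch of that control flow follows.  It assumes a hypothetical
helper try_exec() that is not part of QEMU, and it captures errno immediately
after execvp() returns so that later cleanup cannot overwrite it.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Try to replace the current process with the program named by argv[0].
 * On success this function never returns.  On failure it reports the
 * error and returns the errno value left by execvp(), so that the caller
 * can resume the old program.
 */
int try_exec(char *const argv[])
{
    int saved_errno;

    assert(argv && argv[0]);
    execvp(argv[0], argv);

    /* Reached only on failure.  Save errno before any cleanup runs. */
    saved_errno = errno;
    fprintf(stderr, "execvp %s failed: %s\n", argv[0], strerror(saved_errno));
    return saved_errno;
}
```

A non-zero return tells the caller that the old process is still alive, which
is the case in which old QEMU restores its state and restarts the VM.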
diff --git a/docs/devel/migration/CPR.rst b/docs/devel/migration/CPR.rst
index 0a0fd4f..77fdbdd 100644
--- a/docs/devel/migration/CPR.rst
+++ b/docs/devel/migration/CPR.rst
@@ -5,7 +5,7 @@ CPR is the umbrella name for a set of migration modes in which the
 VM is migrated to a new QEMU instance on the same host.  It is
 intended for use when the goal is to update host software components
 that run the VM, such as QEMU or even the host kernel.  At this time,
-the cpr-reboot and cpr-transfer modes are available.
+the cpr-reboot, cpr-transfer, and cpr-exec modes are available.
 
 Because QEMU is restarted on the same host, with access to the same
 local devices, CPR is allowed in certain cases where normal migration
@@ -324,3 +324,107 @@ descriptors from old to new QEMU.  In the future, descriptors for
 vhost, and char devices could be transferred,
 preserving those devices and their kernel state without interruption,
 even if they do not explicitly support live migration.
+
+cpr-exec mode
+-------------
+
+In this mode, QEMU stops the VM, writes VM state to the migration
+URI, and directly exec's a new version of QEMU on the same host,
+replacing the original process while retaining its PID.  Guest RAM is
+preserved in place, albeit with new virtual addresses.  The user
+completes the migration by specifying the ``-incoming`` option, and
+by issuing the ``migrate-incoming`` command if necessary; see details
+below.
+
+This mode supports VFIO/IOMMUFD devices by preserving device
+descriptors and hence kernel state across the exec, even for devices
+that do not support live migration.
+
+Because the old and new QEMU instances are not active concurrently,
+the URI cannot be a type that streams data from one instance to the
+other.
+
+Usage
+^^^^^
+
+Arguments for the new QEMU process are taken from the
+``cpr-exec-command`` parameter.  The first argument should be the
+path of a new QEMU binary, or a prefix command that exec's the
+new QEMU binary, and the arguments should include the ``-incoming``
+option.
+
+Memory backend objects must have the ``share=on`` attribute.
+The VM must be started with the ``-machine aux-ram-share=on`` option.
+
+Outgoing:
+  * Set the migration mode parameter to ``cpr-exec``.
+  * Set the ``cpr-exec-command`` parameter.
+  * Issue the ``migrate`` command.  It is recommended that the URI be
+    a ``file`` type, but one can use other types such as ``exec``,
+    provided the command captures all the data from the outgoing side,
+    and provides all the data to the incoming side.
+
+Incoming:
+  * You do not need to explicitly start new QEMU.  It is started as
+    a side effect of the migrate command above.
+  * If the VM was running when the outgoing ``migrate`` command was
+    issued, then QEMU automatically resumes VM execution.
+
+Example 1: incoming URI
+^^^^^^^^^^^^^^^^^^^^^^^
+
+In these examples, we simply restart the same version of QEMU, but in
+a real scenario one would set a new QEMU binary path in
+cpr-exec-command.
+
+::
+
+  # qemu-kvm -monitor stdio
+  -object memory-backend-memfd,id=ram0,size=4G
+  -machine memory-backend=ram0
+  -machine aux-ram-share=on
+  ...
+
+  QEMU 10.2.50 monitor - type 'help' for more information
+  (qemu) info status
+  VM status: running
+  (qemu) migrate_set_parameter mode cpr-exec
+  (qemu) migrate_set_parameter cpr-exec-command qemu-kvm ... -incoming file:vm.state
+  (qemu) migrate -d file:vm.state
+  (qemu) QEMU 10.2.50 monitor - type 'help' for more information
+  (qemu) info status
+  VM status: running
+
+Example 2: incoming defer
+^^^^^^^^^^^^^^^^^^^^^^^^^
+::
+
+  # qemu-kvm -monitor stdio
+  -object memory-backend-memfd,id=ram0,size=4G
+  -machine memory-backend=ram0
+  -machine aux-ram-share=on
+  ...
+
+  QEMU 10.2.50 monitor - type 'help' for more information
+  (qemu) info status
+  VM status: running
+  (qemu) migrate_set_parameter mode cpr-exec
+  (qemu) migrate_set_parameter cpr-exec-command qemu-kvm ... -incoming defer
+  (qemu) migrate -d file:vm.state
+  (qemu) QEMU 10.2.50 monitor - type 'help' for more information
+  (qemu) info status
+  VM status: paused (inmigrate)
+  (qemu) migrate_incoming file:vm.state
+  (qemu) info status
+  VM status: running
+
+Caveats
+^^^^^^^
+
+cpr-exec mode may not be used with postcopy, background-snapshot,
+or COLO.
+
+cpr-exec mode requires permission to use the exec system call, which
+is denied by certain sandbox options, such as spawn.
+
+The guest pause time increases for large guest RAM backed by small pages.
-- 
1.8.3.1
* [PATCH V4 8/8] vfio: cpr-exec mode
  2025-09-22 13:49 [PATCH V4 0/8] Live update: cpr-exec Steve Sistare
                   ` (6 preceding siblings ...)
  2025-09-22 13:49 ` [PATCH V4 7/8] migration: cpr-exec docs Steve Sistare
@ 2025-09-22 13:49 ` Steve Sistare
  2025-09-22 15:28   ` Cédric Le Goater
  2025-09-30 15:28 ` [PATCH V4 0/8] Live update: cpr-exec Steven Sistare
  8 siblings, 1 reply; 30+ messages in thread
From: Steve Sistare @ 2025-09-22 13:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: Fabiano Rosas, Peter Xu, Markus Armbruster, Paolo Bonzini,
	Eric Blake, Dr. David Alan Gilbert, Cedric Le Goater,
	Alex Williamson, Steve Sistare
All blockers and notifiers for cpr-transfer mode also apply to cpr-exec.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 hw/vfio/container.c   |  3 ++-
 hw/vfio/cpr-iommufd.c |  3 ++-
 hw/vfio/cpr-legacy.c  |  9 +++++----
 hw/vfio/cpr.c         | 13 +++++++------
 4 files changed, 16 insertions(+), 12 deletions(-)
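
The calls changed below pass a list of modes terminated by -1.  One way such
a list can be folded into a per-mode bitmask is sketched here.  This is an
assumption about the general technique, not the actual get_modes() helper,
whose body is not shown in this series excerpt; the enum and the function
name demo_modes_to_mask are hypothetical, and the MIG_MODE_ALL terminator is
deliberately left out.

```c
#include <assert.h>
#include <stdarg.h>

/* Hypothetical stand-ins for the MigMode enum values. */
enum demo_mode {
    DEMO_MODE_NORMAL,
    DEMO_MODE_CPR_REBOOT,
    DEMO_MODE_CPR_TRANSFER,
    DEMO_MODE_CPR_EXEC,
    DEMO_MODE__MAX,
};

/*
 * Fold a list of modes, terminated by -1, into a bitmask with one bit
 * per mode.  Listing a mode twice simply sets the same bit again.
 */
unsigned demo_modes_to_mask(int mode, ...)
{
    unsigned mask = 0;
    va_list ap;

    va_start(ap, mode);
    while (mode != -1) {
        assert(mode >= 0 && mode < DEMO_MODE__MAX);
        mask |= 1u << mode;
        mode = va_arg(ap, int);
    }
    va_end(ap);
    return mask;
}
```

With a mask in hand, a registration function can walk the modes and add the
same notifier or blocker to each list whose bit is set.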
diff --git a/hw/vfio/container.c b/hw/vfio/container.c
index 030c6d3..935f14d 100644
--- a/hw/vfio/container.c
+++ b/hw/vfio/container.c
@@ -988,7 +988,8 @@ static bool vfio_legacy_attach_device(const char *name, VFIODevice *vbasedev,
         error_setg(&vbasedev->cpr.mdev_blocker,
                    "CPR does not support vfio mdev %s", vbasedev->name);
         if (migrate_add_blocker_modes(&vbasedev->cpr.mdev_blocker, errp,
-                                      MIG_MODE_CPR_TRANSFER, -1) < 0) {
+                                      MIG_MODE_CPR_TRANSFER, MIG_MODE_CPR_EXEC,
+                                      -1) < 0) {
             goto hiod_unref_exit;
         }
     }
diff --git a/hw/vfio/cpr-iommufd.c b/hw/vfio/cpr-iommufd.c
index 148a06d..e1f1854 100644
--- a/hw/vfio/cpr-iommufd.c
+++ b/hw/vfio/cpr-iommufd.c
@@ -159,7 +159,8 @@ bool vfio_iommufd_cpr_register_iommufd(IOMMUFDBackend *be, Error **errp)
 
     if (!vfio_cpr_supported(be, cpr_blocker)) {
         return migrate_add_blocker_modes(cpr_blocker, errp,
-                                         MIG_MODE_CPR_TRANSFER, -1) == 0;
+                                         MIG_MODE_CPR_TRANSFER,
+                                         MIG_MODE_CPR_EXEC, -1) == 0;
     }
 
     vmstate_register(NULL, -1, &iommufd_cpr_vmstate, be);
diff --git a/hw/vfio/cpr-legacy.c b/hw/vfio/cpr-legacy.c
index 8f43719..eebb3bf 100644
--- a/hw/vfio/cpr-legacy.c
+++ b/hw/vfio/cpr-legacy.c
@@ -176,16 +176,17 @@ bool vfio_legacy_cpr_register_container(VFIOContainer *container, Error **errp)
 
     if (!vfio_cpr_supported(container, cpr_blocker)) {
         return migrate_add_blocker_modes(cpr_blocker, errp,
-                                         MIG_MODE_CPR_TRANSFER, -1) == 0;
+                                         MIG_MODE_CPR_TRANSFER,
+                                         MIG_MODE_CPR_EXEC, -1) == 0;
     }
 
     vfio_cpr_add_kvm_notifier();
 
     vmstate_register(NULL, -1, &vfio_container_vmstate, container);
 
-    migration_add_notifier_mode(&container->cpr.transfer_notifier,
-                                vfio_cpr_fail_notifier,
-                                MIG_MODE_CPR_TRANSFER);
+    migration_add_notifier_modes(&container->cpr.transfer_notifier,
+                                 vfio_cpr_fail_notifier,
+                                 MIG_MODE_CPR_TRANSFER, MIG_MODE_CPR_EXEC, -1);
     return true;
 }
 
diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c
index 2c71fc1..db462aa 100644
--- a/hw/vfio/cpr.c
+++ b/hw/vfio/cpr.c
@@ -195,9 +195,10 @@ static int vfio_cpr_kvm_close_notifier(NotifierWithReturn *notifier,
 void vfio_cpr_add_kvm_notifier(void)
 {
     if (!kvm_close_notifier.notify) {
-        migration_add_notifier_mode(&kvm_close_notifier,
-                                    vfio_cpr_kvm_close_notifier,
-                                    MIG_MODE_CPR_TRANSFER);
+        migration_add_notifier_modes(&kvm_close_notifier,
+                                     vfio_cpr_kvm_close_notifier,
+                                     MIG_MODE_CPR_TRANSFER, MIG_MODE_CPR_EXEC,
+                                     -1);
     }
 }
 
@@ -282,9 +283,9 @@ static int vfio_cpr_pci_notifier(NotifierWithReturn *notifier,
 
 void vfio_cpr_pci_register_device(VFIOPCIDevice *vdev)
 {
-    migration_add_notifier_mode(&vdev->cpr.transfer_notifier,
-                                vfio_cpr_pci_notifier,
-                                MIG_MODE_CPR_TRANSFER);
+    migration_add_notifier_modes(&vdev->cpr.transfer_notifier,
+                                 vfio_cpr_pci_notifier,
+                                 MIG_MODE_CPR_TRANSFER, MIG_MODE_CPR_EXEC, -1);
 }
 
 void vfio_cpr_pci_unregister_device(VFIOPCIDevice *vdev)
-- 
1.8.3.1
* Re: [PATCH V4 1/8] migration: multi-mode notifier
  2025-09-22 13:49 ` [PATCH V4 1/8] migration: multi-mode notifier Steve Sistare
@ 2025-09-22 15:18   ` Cédric Le Goater
  2025-09-24 18:15     ` Steven Sistare
  0 siblings, 1 reply; 30+ messages in thread
From: Cédric Le Goater @ 2025-09-22 15:18 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel
  Cc: Fabiano Rosas, Peter Xu, Markus Armbruster, Paolo Bonzini,
	Eric Blake, Dr. David Alan Gilbert, Alex Williamson
On 9/22/25 15:49, Steve Sistare wrote:
> Allow a notifier to be added for multiple migration modes.
> To allow a notifier to appear on multiple per-mode lists, use
> a generic list type.  We can no longer use NotifierWithReturnList,
> because it shoehorns the notifier onto a single list.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> Reviewed-by: Fabiano Rosas <farosas@suse.de>
> ---
>   include/migration/misc.h | 12 ++++++++++
>   migration/migration.c    | 60 +++++++++++++++++++++++++++++++++++++-----------
>   2 files changed, 59 insertions(+), 13 deletions(-)
> 
> diff --git a/include/migration/misc.h b/include/migration/misc.h
> index a261f99..592b930 100644
> --- a/include/migration/misc.h
> +++ b/include/migration/misc.h
> @@ -95,7 +95,19 @@ void migration_add_notifier(NotifierWithReturn *notify,
>   void migration_add_notifier_mode(NotifierWithReturn *notify,
>                                    MigrationNotifyFunc func, MigMode mode);
>   
> +/*
> + * Same as migration_add_notifier, but applies to all @mode in the argument
> + * list.  The list is terminated by -1 or MIG_MODE_ALL.  For the latter,
> + * the notifier is added for all modes.
> + */
> +void migration_add_notifier_modes(NotifierWithReturn *notify,
> +                                  MigrationNotifyFunc func, MigMode mode, ...);
> +
> +/*
> + * Remove a notifier from all modes.
> + */
>   void migration_remove_notifier(NotifierWithReturn *notify);
> +
>   void migration_file_set_error(int ret, Error *err);
I think the include/migration/misc.h file should be updated with
proper documentation, like that found in include/migration/blocker.h.
>   
>   /* True if incoming migration entered POSTCOPY_INCOMING_DISCARD */
> diff --git a/migration/migration.c b/migration/migration.c
> index 10c216d..08a98f7 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -74,11 +74,7 @@
>   
>   #define INMIGRATE_DEFAULT_EXIT_ON_ERROR true
>   
> -static NotifierWithReturnList migration_state_notifiers[] = {
> -    NOTIFIER_ELEM_INIT(migration_state_notifiers, MIG_MODE_NORMAL),
> -    NOTIFIER_ELEM_INIT(migration_state_notifiers, MIG_MODE_CPR_REBOOT),
> -    NOTIFIER_ELEM_INIT(migration_state_notifiers, MIG_MODE_CPR_TRANSFER),
> -};
> +static GSList *migration_state_notifiers[MIG_MODE__MAX];
>   
>   /* Messages sent on the return path from destination to source */
>   enum mig_rp_message_type {
> @@ -1665,23 +1661,51 @@ void migration_cancel(void)
>       }
>   }
>   
> +static int get_modes(MigMode mode, va_list ap);
> +
> +static void add_notifiers(NotifierWithReturn *notify, int modes)
> +{
> +    for (MigMode mode = 0; mode < MIG_MODE__MAX; mode++) {
> +        if (modes & BIT(mode)) {
> +            migration_state_notifiers[mode] =
> +                g_slist_prepend(migration_state_notifiers[mode], notify);
> +        }
> +    }
> +}
> +
> +void migration_add_notifier_modes(NotifierWithReturn *notify,
> +                                  MigrationNotifyFunc func, MigMode mode, ...)
> +{
> +    int modes;
> +    va_list ap;
> +
> +    va_start(ap, mode);
> +    modes = get_modes(mode, ap);
> +    va_end(ap);
No sanity check needed? Could we have conflicting modes? Just asking.
Thanks,
C.
> +    notify->notify = (NotifierWithReturnFunc)func;
> +    add_notifiers(notify, modes);
> +}
> +
>   void migration_add_notifier_mode(NotifierWithReturn *notify,
>                                    MigrationNotifyFunc func, MigMode mode)
>   {
> -    notify->notify = (NotifierWithReturnFunc)func;
> -    notifier_with_return_list_add(&migration_state_notifiers[mode], notify);
> +    migration_add_notifier_modes(notify, func, mode, -1);
>   }
>   
>   void migration_add_notifier(NotifierWithReturn *notify,
>                               MigrationNotifyFunc func)
>   {
> -    migration_add_notifier_mode(notify, func, MIG_MODE_NORMAL);
> +    migration_add_notifier_modes(notify, func, MIG_MODE_NORMAL, -1);
>   }
>   
>   void migration_remove_notifier(NotifierWithReturn *notify)
>   {
>       if (notify->notify) {
> -        notifier_with_return_remove(notify);
> +        for (MigMode mode = 0; mode < MIG_MODE__MAX; mode++) {
> +            migration_state_notifiers[mode] =
> +                g_slist_remove(migration_state_notifiers[mode], notify);
> +        }
>           notify->notify = NULL;
>       }
>   }
> @@ -1691,13 +1715,23 @@ int migration_call_notifiers(MigrationState *s, MigrationEventType type,
>   {
>       MigMode mode = s->parameters.mode;
>       MigrationEvent e;
> +    NotifierWithReturn *notifier;
> +    GSList *elem, *next;
>       int ret;
>   
>       e.type = type;
> -    ret = notifier_with_return_list_notify(&migration_state_notifiers[mode],
> -                                           &e, errp);
> -    assert(!ret || type == MIG_EVENT_PRECOPY_SETUP);
> -    return ret;
> +
> +    for (elem = migration_state_notifiers[mode]; elem; elem = next) {
> +        next = elem->next;
> +        notifier = (NotifierWithReturn *)elem->data;
> +        ret = notifier->notify(notifier, &e, errp);
> +        if (ret) {
> +            assert(type == MIG_EVENT_PRECOPY_SETUP);
> +            return ret;
> +        }
> +    }
> +
> +    return 0;
>   }
>   
>   bool migration_has_failed(MigrationState *s)
* Re: [PATCH V4 6/8] migration: cpr-exec mode
  2025-09-22 13:49 ` [PATCH V4 6/8] migration: cpr-exec mode Steve Sistare
@ 2025-09-22 15:28   ` Cédric Le Goater
  2025-09-24 18:16     ` Steven Sistare
  2025-09-30 16:39   ` Peter Xu
  1 sibling, 1 reply; 30+ messages in thread
From: Cédric Le Goater @ 2025-09-22 15:28 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel
  Cc: Fabiano Rosas, Peter Xu, Markus Armbruster, Paolo Bonzini,
	Eric Blake, Dr. David Alan Gilbert, Alex Williamson
On 9/22/25 15:49, Steve Sistare wrote:
> Add the cpr-exec migration mode.  Usage:
>    qemu-system-$arch -machine aux-ram-share=on ...
>    migrate_set_parameter mode cpr-exec
>    migrate_set_parameter cpr-exec-command \
>      <arg1> <arg2> ... -incoming <uri-1>
>    migrate -d <uri-1>
> 
> The migrate command stops the VM, saves state to uri-1,
> directly exec's a new version of QEMU on the same host,
> replacing the original process while retaining its PID, and
> loads state from uri-1.  Guest RAM is preserved in place,
> albeit with new virtual addresses.
> 
> The new QEMU process is started by exec'ing the command
> specified by the @cpr-exec-command parameter.  The first word of
> the command is the binary, and the remaining words are its
> arguments.  The command may be a direct invocation of new QEMU,
> or may be a non-QEMU command that exec's the new QEMU binary.
> 
> This mode creates a second migration channel that is not visible
> to the user.  At the start of migration, old QEMU saves CPR state
> to the second channel, and at the end of migration, it tells the
> main loop to call cpr_exec.  New QEMU loads CPR state early, before
> objects are created.
> 
> Because old QEMU terminates when new QEMU starts, one cannot
> stream data between the two, so uri-1 must be a type,
> such as a file, that accepts all data before old QEMU exits.
> Otherwise, old QEMU may quietly block writing to the channel.
> 
> Memory-backend objects must have the share=on attribute, but
> memory-backend-epc is not supported.  The VM must be started with
> the '-machine aux-ram-share=on' option, which allows anonymous
> memory to be transferred in place to the new process.  The memfds
> are kept open across exec by clearing the close-on-exec flag, their
> values are saved in CPR state, and they are mmap'd in new QEMU.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> Acked-by: Markus Armbruster <armbru@redhat.com>
> ---
>   qapi/migration.json       | 25 +++++++++++++-
>   include/migration/cpr.h   |  1 +
>   migration/cpr-exec.c      | 84 +++++++++++++++++++++++++++++++++++++++++++++++
>   migration/cpr.c           | 28 ++++++++++++++--
>   migration/migration.c     | 10 +++++-
>   migration/ram.c           |  1 +
>   migration/vmstate-types.c |  8 +++++
>   system/vl.c               |  4 ++-
>   migration/trace-events    |  1 +
>   9 files changed, 157 insertions(+), 5 deletions(-)
> 
> diff --git a/qapi/migration.json b/qapi/migration.json
> index 2be8fa1..be0f3fc 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -694,9 +694,32 @@
>   #     until you issue the `migrate-incoming` command.
>   #
>   #     (since 10.0)
> +#
> +# @cpr-exec: The migrate command stops the VM, saves state to the
> +#     migration channel, directly exec's a new version of QEMU on the
> +#     same host, replacing the original process while retaining its
> +#     PID, and loads state from the channel.  Guest RAM is preserved
> +#     in place.  Devices and their pinned pages are also preserved for
> +#     VFIO and IOMMUFD.
> +#
> +#     Old QEMU starts new QEMU by exec'ing the command specified by
> +#     the @cpr-exec-command parameter.  The command may be a direct
> +#     invocation of new QEMU, or may be a wrapper that exec's the new
> +#     QEMU binary.
> +#
> +#     Because old QEMU terminates when new QEMU starts, one cannot
> +#     stream data between the two, so the channel must be a type,
> +#     such as a file, that accepts all data before old QEMU exits.
> +#     Otherwise, old QEMU may quietly block writing to the channel.
> +#
> +#     Memory-backend objects must have the share=on attribute, but
> +#     memory-backend-epc is not supported.  The VM must be started
> +#     with the '-machine aux-ram-share=on' option.
> +#
> +#     (since 10.2)
>   ##
>   { 'enum': 'MigMode',
> -  'data': [ 'normal', 'cpr-reboot', 'cpr-transfer' ] }
> +  'data': [ 'normal', 'cpr-reboot', 'cpr-transfer', 'cpr-exec' ] }
>   
>   ##
>   # @ZeroPageDetection:
> diff --git a/include/migration/cpr.h b/include/migration/cpr.h
> index b84389f..beed392 100644
> --- a/include/migration/cpr.h
> +++ b/include/migration/cpr.h
> @@ -53,6 +53,7 @@ int cpr_get_fd_param(const char *name, const char *fdname, int index,
>   QEMUFile *cpr_transfer_output(MigrationChannel *channel, Error **errp);
>   QEMUFile *cpr_transfer_input(MigrationChannel *channel, Error **errp);
>   
> +void cpr_exec_init(void);
>   QEMUFile *cpr_exec_output(Error **errp);
>   QEMUFile *cpr_exec_input(Error **errp);
>   void cpr_exec_persist_state(QEMUFile *f);
> diff --git a/migration/cpr-exec.c b/migration/cpr-exec.c
> index 2c32e9c..8cf55a3 100644
> --- a/migration/cpr-exec.c
> +++ b/migration/cpr-exec.c
> @@ -6,15 +6,21 @@
>   
>   #include "qemu/osdep.h"
>   #include "qemu/cutils.h"
> +#include "qemu/error-report.h"
>   #include "qemu/memfd.h"
>   #include "qapi/error.h"
> +#include "qapi/type-helpers.h"
>   #include "io/channel-file.h"
>   #include "io/channel-socket.h"
> +#include "block/block-global-state.h"
> +#include "qemu/main-loop.h"
>   #include "migration/cpr.h"
>   #include "migration/qemu-file.h"
> +#include "migration/migration.h"
>   #include "migration/misc.h"
>   #include "migration/vmstate.h"
>   #include "system/runstate.h"
> +#include "trace.h"
>   
>   #define CPR_EXEC_STATE_NAME "QEMU_CPR_EXEC_STATE"
>   
> @@ -92,3 +98,81 @@ QEMUFile *cpr_exec_input(Error **errp)
>       lseek(mfd, 0, SEEK_SET);
>       return qemu_file_new_fd_input(mfd, CPR_EXEC_STATE_NAME);
>   }
> +
> +static bool preserve_fd(int fd)
> +{
> +    qemu_clear_cloexec(fd);
> +    return true;
> +}
> +
> +static bool unpreserve_fd(int fd)
> +{
> +    qemu_set_cloexec(fd);
> +    return true;
> +}
> +
> +static void cpr_exec_cb(void *opaque)
> +{
> +    MigrationState *s = migrate_get_current();
> +    char **argv = strv_from_str_list(s->parameters.cpr_exec_command);
> +    Error *err = NULL;
> +
> +    /*
> +     * Clear the close-on-exec flag for all preserved fd's.  We cannot do so
> +     * earlier because they should not persist across miscellaneous fork and
> +     * exec calls that are performed during normal operation.
> +     */
> +    cpr_walk_fd(preserve_fd);
> +
> +    trace_cpr_exec();
> +    execvp(argv[0], argv);
> +
> +    /*
> +     * exec should only fail if argv[0] is bogus, or has a permissions problem,
> +     * or the system is very short on resources.
> +     */
> +    error_setg_errno(&err, errno, "execvp %s failed", argv[0]);
> +    g_strfreev(argv);
> +    cpr_walk_fd(unpreserve_fd);
> +
> +    error_report_err(error_copy(err));
> +    migrate_set_state(&s->state, s->state, MIGRATION_STATUS_FAILED);
> +    migrate_set_error(s, err);
> +
> +    migration_call_notifiers(s, MIG_EVENT_PRECOPY_FAILED, NULL);
> +
> +    err = NULL;
> +    if (!migration_block_activate(&err)) {
> +        /* error was already reported */
> +        return;
> +    }
> +
> +    if (runstate_is_live(s->vm_old_state)) {
> +        vm_start();
> +    }
> +}
> +
> +static int cpr_exec_notifier(NotifierWithReturn *notifier, MigrationEvent *e,
> +                             Error **errp)
> +{
> +    MigrationState *s = migrate_get_current();
> +
> +    if (e->type == MIG_EVENT_PRECOPY_DONE) {
> +        QEMUBH *cpr_exec_bh = qemu_bh_new(cpr_exec_cb, NULL);
> +        assert(s->state == MIGRATION_STATUS_COMPLETED);
> +        qemu_bh_schedule(cpr_exec_bh);
> +        qemu_notify_event();
> +
> +    } else if (e->type == MIG_EVENT_PRECOPY_FAILED) {
> +        cpr_exec_unpersist_state();
> +    }
> +    return 0;
> +}
> +
> +void cpr_exec_init(void)
> +{
> +    static NotifierWithReturn exec_notifier;
> +
> +    migration_add_notifier_mode(&exec_notifier, cpr_exec_notifier,
> +                                MIG_MODE_CPR_EXEC);
> +}
> diff --git a/migration/cpr.c b/migration/cpr.c
> index d3e370e..eea3773 100644
> --- a/migration/cpr.c
> +++ b/migration/cpr.c
> @@ -185,6 +185,8 @@ int cpr_state_save(MigrationChannel *channel, Error **errp)
>       if (mode == MIG_MODE_CPR_TRANSFER) {
>           g_assert(channel);
>           f = cpr_transfer_output(channel, errp);
> +    } else if (mode == MIG_MODE_CPR_EXEC) {
> +        f = cpr_exec_output(errp);
>       } else {
>           return 0;
>       }
> @@ -202,6 +204,10 @@ int cpr_state_save(MigrationChannel *channel, Error **errp)
>           return ret;
>       }
>   
> +    if (migrate_mode() == MIG_MODE_CPR_EXEC) {
> +        cpr_exec_persist_state(f);
> +    }
> +
>       /*
>        * Close the socket only partially so we can later detect when the other
>        * end closes by getting a HUP event.
> @@ -213,6 +219,12 @@ int cpr_state_save(MigrationChannel *channel, Error **errp)
>       return 0;
>   }
>   
> +static bool unpreserve_fd(int fd)
> +{
> +    qemu_set_cloexec(fd);
> +    return true;
> +}
> +
>   int cpr_state_load(MigrationChannel *channel, Error **errp)
>   {
>       int ret;
> @@ -220,7 +232,13 @@ int cpr_state_load(MigrationChannel *channel, Error **errp)
>       QEMUFile *f;
>       MigMode mode = 0;
>   
> -    if (channel) {
> +    if (cpr_exec_has_state()) {
> +        mode = MIG_MODE_CPR_EXEC;
> +        f = cpr_exec_input(errp);
> +        if (channel) {
> +            warn_report("ignoring cpr channel for migration mode cpr-exec");
migration/cpr.c does not include "qemu/error-report.h"
C.
> +        }
> +    } else if (channel) {
>           mode = MIG_MODE_CPR_TRANSFER;
>           cpr_set_incoming_mode(mode);
>           f = cpr_transfer_input(channel, errp);
> @@ -232,6 +250,7 @@ int cpr_state_load(MigrationChannel *channel, Error **errp)
>       }
>   
>       trace_cpr_state_load(MigMode_str(mode));
> +    cpr_set_incoming_mode(mode);
>   
>       v = qemu_get_be32(f);
>       if (v != QEMU_CPR_FILE_MAGIC) {
> @@ -253,6 +272,11 @@ int cpr_state_load(MigrationChannel *channel, Error **errp)
>           return ret;
>       }
>   
> +    if (migrate_mode() == MIG_MODE_CPR_EXEC) {
> +        /* Set cloexec to prevent fd leaks from fork until the next cpr-exec */
> +        cpr_walk_fd(unpreserve_fd);
> +    }
> +
>       /*
>        * Let the caller decide when to close the socket (and generate a HUP event
>        * for the sending side).
> @@ -273,7 +297,7 @@ void cpr_state_close(void)
>   bool cpr_incoming_needed(void *opaque)
>   {
>       MigMode mode = migrate_mode();
> -    return mode == MIG_MODE_CPR_TRANSFER;
> +    return mode == MIG_MODE_CPR_TRANSFER || mode == MIG_MODE_CPR_EXEC;
>   }
>   
>   /*
> diff --git a/migration/migration.c b/migration/migration.c
> index 08a98f7..2515bec 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -333,6 +333,7 @@ void migration_object_init(void)
>   
>       ram_mig_init();
>       dirty_bitmap_mig_init();
> +    cpr_exec_init();
>   
>       /* Initialize cpu throttle timers */
>       cpu_throttle_init();
> @@ -1796,7 +1797,8 @@ bool migrate_mode_is_cpr(MigrationState *s)
>   {
>       MigMode mode = s->parameters.mode;
>       return mode == MIG_MODE_CPR_REBOOT ||
> -           mode == MIG_MODE_CPR_TRANSFER;
> +           mode == MIG_MODE_CPR_TRANSFER ||
> +           mode == MIG_MODE_CPR_EXEC;
>   }
>   
>   int migrate_init(MigrationState *s, Error **errp)
> @@ -2145,6 +2147,12 @@ static bool migrate_prepare(MigrationState *s, bool resume, Error **errp)
>           return false;
>       }
>   
> +    if (migrate_mode() == MIG_MODE_CPR_EXEC &&
> +        !s->parameters.has_cpr_exec_command) {
> +        error_setg(errp, "cpr-exec mode requires setting cpr-exec-command");
> +        return false;
> +    }
> +
>       if (migration_is_blocked(errp)) {
>           return false;
>       }
> diff --git a/migration/ram.c b/migration/ram.c
> index 7208bc1..6730a41 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -228,6 +228,7 @@ bool migrate_ram_is_ignored(RAMBlock *block)
>       MigMode mode = migrate_mode();
>       return !qemu_ram_is_migratable(block) ||
>              mode == MIG_MODE_CPR_TRANSFER ||
> +           mode == MIG_MODE_CPR_EXEC ||
>              (migrate_ignore_shared() && qemu_ram_is_shared(block)
>                                       && qemu_ram_is_named_file(block));
>   }
> diff --git a/migration/vmstate-types.c b/migration/vmstate-types.c
> index 741a588..1aa0573 100644
> --- a/migration/vmstate-types.c
> +++ b/migration/vmstate-types.c
> @@ -321,6 +321,10 @@ static int get_fd(QEMUFile *f, void *pv, size_t size,
>                     const VMStateField *field)
>   {
>       int32_t *v = pv;
> +    if (migrate_mode() == MIG_MODE_CPR_EXEC) {
> +        qemu_get_sbe32s(f, v);
> +        return 0;
> +    }
>       *v = qemu_file_get_fd(f);
>       return 0;
>   }
> @@ -329,6 +333,10 @@ static int put_fd(QEMUFile *f, void *pv, size_t size,
>                     const VMStateField *field, JSONWriter *vmdesc)
>   {
>       int32_t *v = pv;
> +    if (migrate_mode() == MIG_MODE_CPR_EXEC) {
> +        qemu_put_sbe32s(f, v);
> +        return 0;
> +    }
>       return qemu_file_put_fd(f, *v);
>   }
>   
> diff --git a/system/vl.c b/system/vl.c
> index 4c24073..f395d04 100644
> --- a/system/vl.c
> +++ b/system/vl.c
> @@ -3867,6 +3867,8 @@ void qemu_init(int argc, char **argv)
>       }
>       qemu_init_displays();
>       accel_setup_post(current_machine);
> -    os_setup_post();
> +    if (migrate_mode() != MIG_MODE_CPR_EXEC) {
> +        os_setup_post();
> +    }
>       resume_mux_open();
>   }
> diff --git a/migration/trace-events b/migration/trace-events
> index 706db97..e8edd1f 100644
> --- a/migration/trace-events
> +++ b/migration/trace-events
> @@ -354,6 +354,7 @@ cpr_state_save(const char *mode) "%s mode"
>   cpr_state_load(const char *mode) "%s mode"
>   cpr_transfer_input(const char *path) "%s"
>   cpr_transfer_output(const char *path) "%s"
> +cpr_exec(void) ""
>   
>   # block-dirty-bitmap.c
>   send_bitmap_header_enter(void) ""
* Re: [PATCH V4 8/8] vfio: cpr-exec mode
  2025-09-22 13:49 ` [PATCH V4 8/8] vfio: cpr-exec mode Steve Sistare
@ 2025-09-22 15:28   ` Cédric Le Goater
  0 siblings, 0 replies; 30+ messages in thread
From: Cédric Le Goater @ 2025-09-22 15:28 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel
  Cc: Fabiano Rosas, Peter Xu, Markus Armbruster, Paolo Bonzini,
	Eric Blake, Dr. David Alan Gilbert, Alex Williamson
On 9/22/25 15:49, Steve Sistare wrote:
> All blockers and notifiers for cpr-transfer mode also apply to cpr-exec.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Acked-by: Cédric Le Goater <clg@redhat.com>
Thanks,
C.
> ---
>   hw/vfio/container.c   |  3 ++-
>   hw/vfio/cpr-iommufd.c |  3 ++-
>   hw/vfio/cpr-legacy.c  |  9 +++++----
>   hw/vfio/cpr.c         | 13 +++++++------
>   4 files changed, 16 insertions(+), 12 deletions(-)
> 
> diff --git a/hw/vfio/container.c b/hw/vfio/container.c
> index 030c6d3..935f14d 100644
> --- a/hw/vfio/container.c
> +++ b/hw/vfio/container.c
> @@ -988,7 +988,8 @@ static bool vfio_legacy_attach_device(const char *name, VFIODevice *vbasedev,
>           error_setg(&vbasedev->cpr.mdev_blocker,
>                      "CPR does not support vfio mdev %s", vbasedev->name);
>           if (migrate_add_blocker_modes(&vbasedev->cpr.mdev_blocker, errp,
> -                                      MIG_MODE_CPR_TRANSFER, -1) < 0) {
> +                                      MIG_MODE_CPR_TRANSFER, MIG_MODE_CPR_EXEC,
> +                                      -1) < 0) {
>               goto hiod_unref_exit;
>           }
>       }
> diff --git a/hw/vfio/cpr-iommufd.c b/hw/vfio/cpr-iommufd.c
> index 148a06d..e1f1854 100644
> --- a/hw/vfio/cpr-iommufd.c
> +++ b/hw/vfio/cpr-iommufd.c
> @@ -159,7 +159,8 @@ bool vfio_iommufd_cpr_register_iommufd(IOMMUFDBackend *be, Error **errp)
>   
>       if (!vfio_cpr_supported(be, cpr_blocker)) {
>           return migrate_add_blocker_modes(cpr_blocker, errp,
> -                                         MIG_MODE_CPR_TRANSFER, -1) == 0;
> +                                         MIG_MODE_CPR_TRANSFER,
> +                                         MIG_MODE_CPR_EXEC, -1) == 0;
>       }
>   
>       vmstate_register(NULL, -1, &iommufd_cpr_vmstate, be);
> diff --git a/hw/vfio/cpr-legacy.c b/hw/vfio/cpr-legacy.c
> index 8f43719..eebb3bf 100644
> --- a/hw/vfio/cpr-legacy.c
> +++ b/hw/vfio/cpr-legacy.c
> @@ -176,16 +176,17 @@ bool vfio_legacy_cpr_register_container(VFIOContainer *container, Error **errp)
>   
>       if (!vfio_cpr_supported(container, cpr_blocker)) {
>           return migrate_add_blocker_modes(cpr_blocker, errp,
> -                                         MIG_MODE_CPR_TRANSFER, -1) == 0;
> +                                         MIG_MODE_CPR_TRANSFER,
> +                                         MIG_MODE_CPR_EXEC, -1) == 0;
>       }
>   
>       vfio_cpr_add_kvm_notifier();
>   
>       vmstate_register(NULL, -1, &vfio_container_vmstate, container);
>   
> -    migration_add_notifier_mode(&container->cpr.transfer_notifier,
> -                                vfio_cpr_fail_notifier,
> -                                MIG_MODE_CPR_TRANSFER);
> +    migration_add_notifier_modes(&container->cpr.transfer_notifier,
> +                                 vfio_cpr_fail_notifier,
> +                                 MIG_MODE_CPR_TRANSFER, MIG_MODE_CPR_EXEC, -1);
>       return true;
>   }
>   
> diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c
> index 2c71fc1..db462aa 100644
> --- a/hw/vfio/cpr.c
> +++ b/hw/vfio/cpr.c
> @@ -195,9 +195,10 @@ static int vfio_cpr_kvm_close_notifier(NotifierWithReturn *notifier,
>   void vfio_cpr_add_kvm_notifier(void)
>   {
>       if (!kvm_close_notifier.notify) {
> -        migration_add_notifier_mode(&kvm_close_notifier,
> -                                    vfio_cpr_kvm_close_notifier,
> -                                    MIG_MODE_CPR_TRANSFER);
> +        migration_add_notifier_modes(&kvm_close_notifier,
> +                                     vfio_cpr_kvm_close_notifier,
> +                                     MIG_MODE_CPR_TRANSFER, MIG_MODE_CPR_EXEC,
> +                                     -1);
>       }
>   }
>   
> @@ -282,9 +283,9 @@ static int vfio_cpr_pci_notifier(NotifierWithReturn *notifier,
>   
>   void vfio_cpr_pci_register_device(VFIOPCIDevice *vdev)
>   {
> -    migration_add_notifier_mode(&vdev->cpr.transfer_notifier,
> -                                vfio_cpr_pci_notifier,
> -                                MIG_MODE_CPR_TRANSFER);
> +    migration_add_notifier_modes(&vdev->cpr.transfer_notifier,
> +                                 vfio_cpr_pci_notifier,
> +                                 MIG_MODE_CPR_TRANSFER, MIG_MODE_CPR_EXEC, -1);
>   }
>   
>   void vfio_cpr_pci_unregister_device(VFIOPCIDevice *vdev)
* Re: [PATCH V4 5/8] migration: cpr-exec save and load
  2025-09-22 13:49 ` [PATCH V4 5/8] migration: cpr-exec save and load Steve Sistare
@ 2025-09-22 16:00   ` Cédric Le Goater
  2025-09-24 18:16     ` Steven Sistare
  0 siblings, 1 reply; 30+ messages in thread
From: Cédric Le Goater @ 2025-09-22 16:00 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel
  Cc: Fabiano Rosas, Peter Xu, Markus Armbruster, Paolo Bonzini,
	Eric Blake, Dr. David Alan Gilbert, Alex Williamson
On 9/22/25 15:49, Steve Sistare wrote:
> To preserve CPR state across exec, create a QEMUFile based on a memfd, and
> keep the memfd open across exec.  Save the value of the memfd in an
> environment variable so post-exec QEMU can find it.
Couldn't we preserve some memory to hand off to QEMU, as firmware does?
An environment variable is a limited method.
Thanks,
C.
That's a short-term hack, right? It's not even documented. I am sure
you have something else in mind.
> These new functions are called in a subsequent patch.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>   include/migration/cpr.h |  5 +++
>   migration/cpr-exec.c    | 94 +++++++++++++++++++++++++++++++++++++++++++++++++
>   migration/meson.build   |  1 +
>   3 files changed, 100 insertions(+)
>   create mode 100644 migration/cpr-exec.c
> 
> diff --git a/include/migration/cpr.h b/include/migration/cpr.h
> index 2b074d7..b84389f 100644
> --- a/include/migration/cpr.h
> +++ b/include/migration/cpr.h
> @@ -53,4 +53,9 @@ int cpr_get_fd_param(const char *name, const char *fdname, int index,
>   QEMUFile *cpr_transfer_output(MigrationChannel *channel, Error **errp);
>   QEMUFile *cpr_transfer_input(MigrationChannel *channel, Error **errp);
>   
> +QEMUFile *cpr_exec_output(Error **errp);
> +QEMUFile *cpr_exec_input(Error **errp);
> +void cpr_exec_persist_state(QEMUFile *f);
> +bool cpr_exec_has_state(void);
> +void cpr_exec_unpersist_state(void);
>   #endif
> diff --git a/migration/cpr-exec.c b/migration/cpr-exec.c
> new file mode 100644
> index 0000000..2c32e9c
> --- /dev/null
> +++ b/migration/cpr-exec.c
> @@ -0,0 +1,94 @@
> +/*
> + * Copyright (c) 2021-2025 Oracle and/or its affiliates.
> + *
> + * SPDX-License-Identifier: GPL-2.0-or-later
> + */
> +
> +#include "qemu/osdep.h"
> +#include "qemu/cutils.h"
> +#include "qemu/memfd.h"
> +#include "qapi/error.h"
> +#include "io/channel-file.h"
> +#include "io/channel-socket.h"
> +#include "migration/cpr.h"
> +#include "migration/qemu-file.h"
> +#include "migration/misc.h"
> +#include "migration/vmstate.h"
> +#include "system/runstate.h"
> +
> +#define CPR_EXEC_STATE_NAME "QEMU_CPR_EXEC_STATE"
> +
> +static QEMUFile *qemu_file_new_fd_input(int fd, const char *name)
> +{
> +    g_autoptr(QIOChannelFile) fioc = qio_channel_file_new_fd(fd);
> +    QIOChannel *ioc = QIO_CHANNEL(fioc);
> +    qio_channel_set_name(ioc, name);
> +    return qemu_file_new_input(ioc);
> +}
> +
> +static QEMUFile *qemu_file_new_fd_output(int fd, const char *name)
> +{
> +    g_autoptr(QIOChannelFile) fioc = qio_channel_file_new_fd(fd);
> +    QIOChannel *ioc = QIO_CHANNEL(fioc);
> +    qio_channel_set_name(ioc, name);
> +    return qemu_file_new_output(ioc);
> +}
> +
> +void cpr_exec_persist_state(QEMUFile *f)
> +{
> +    QIOChannelFile *fioc = QIO_CHANNEL_FILE(qemu_file_get_ioc(f));
> +    int mfd = dup(fioc->fd);
> +    char val[16];
> +
> +    /* Remember mfd in environment for post-exec load */
> +    qemu_clear_cloexec(mfd);
> +    snprintf(val, sizeof(val), "%d", mfd);
> +    g_setenv(CPR_EXEC_STATE_NAME, val, 1);
> +}
> +
> +static int cpr_exec_find_state(void)
> +{
> +    const char *val = g_getenv(CPR_EXEC_STATE_NAME);
> +    int mfd;
> +
> +    assert(val);
> +    g_unsetenv(CPR_EXEC_STATE_NAME);
> +    assert(!qemu_strtoi(val, NULL, 10, &mfd));
> +    return mfd;
> +}
> +
> +bool cpr_exec_has_state(void)
> +{
> +    return g_getenv(CPR_EXEC_STATE_NAME) != NULL;
> +}
> +
> +void cpr_exec_unpersist_state(void)
> +{
> +    int mfd;
> +    const char *val = g_getenv(CPR_EXEC_STATE_NAME);
> +
> +    g_unsetenv(CPR_EXEC_STATE_NAME);
> +    assert(val);
> +    assert(!qemu_strtoi(val, NULL, 10, &mfd));
> +    close(mfd);
> +}
> +
> +QEMUFile *cpr_exec_output(Error **errp)
> +{
> +    int mfd = memfd_create(CPR_EXEC_STATE_NAME, 0);
The build should be adjusted for Linux only.
Thanks,
C.
> +
> +    if (mfd < 0) {
> +        error_setg_errno(errp, errno, "memfd_create failed");
> +        return NULL;
> +    }
> +
> +    return qemu_file_new_fd_output(mfd, CPR_EXEC_STATE_NAME);
> +}
> +
> +QEMUFile *cpr_exec_input(Error **errp)
> +{
> +    int mfd = cpr_exec_find_state();
> +
> +    lseek(mfd, 0, SEEK_SET);
> +    return qemu_file_new_fd_input(mfd, CPR_EXEC_STATE_NAME);
> +}
> diff --git a/migration/meson.build b/migration/meson.build
> index 0f71544..16909d5 100644
> --- a/migration/meson.build
> +++ b/migration/meson.build
> @@ -16,6 +16,7 @@ system_ss.add(files(
>     'channel-block.c',
>     'cpr.c',
>     'cpr-transfer.c',
> +  'cpr-exec.c',
>     'cpu-throttle.c',
>     'dirtyrate.c',
>     'exec.c',
* Re: [PATCH V4 1/8] migration: multi-mode notifier
  2025-09-22 15:18   ` Cédric Le Goater
@ 2025-09-24 18:15     ` Steven Sistare
  0 siblings, 0 replies; 30+ messages in thread
From: Steven Sistare @ 2025-09-24 18:15 UTC (permalink / raw)
  To: Cédric Le Goater, qemu-devel
  Cc: Fabiano Rosas, Peter Xu, Markus Armbruster, Paolo Bonzini,
	Eric Blake, Dr. David Alan Gilbert, Alex Williamson
On 9/22/2025 11:18 AM, Cédric Le Goater wrote:
> On 9/22/25 15:49, Steve Sistare wrote:
>> Allow a notifier to be added for multiple migration modes.
>> To allow a notifier to appear on multiple per-mode lists, use
>> a generic list type.  We can no longer use NotifierWithReturnList,
>> because it shoehorns the notifier onto a single list.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> Reviewed-by: Fabiano Rosas <farosas@suse.de>
>> ---
>>   include/migration/misc.h | 12 ++++++++++
>>   migration/migration.c    | 60 +++++++++++++++++++++++++++++++++++++-----------
>>   2 files changed, 59 insertions(+), 13 deletions(-)
>>
>> diff --git a/include/migration/misc.h b/include/migration/misc.h
>> index a261f99..592b930 100644
>> --- a/include/migration/misc.h
>> +++ b/include/migration/misc.h
>> @@ -95,7 +95,19 @@ void migration_add_notifier(NotifierWithReturn *notify,
>>   void migration_add_notifier_mode(NotifierWithReturn *notify,
>>                                    MigrationNotifyFunc func, MigMode mode);
>> +/*
>> + * Same as migration_add_notifier, but applies to all @mode in the argument
>> + * list.  The list is terminated by -1 or MIG_MODE_ALL.  For the latter,
>> + * the notifier is added for all modes.
>> + */
>> +void migration_add_notifier_modes(NotifierWithReturn *notify,
>> +                                  MigrationNotifyFunc func, MigMode mode, ...);
>> +
>> +/*
>> + * Remove a notifier from all modes.
>> + */
>>   void migration_remove_notifier(NotifierWithReturn *notify);
>> +
>>   void migration_file_set_error(int ret, Error *err);
> 
> I think the include/migration/misc.h file should be updated with
> proper documentation, like found in include/migration/blocker.h.
> 
>>   /* True if incoming migration entered POSTCOPY_INCOMING_DISCARD */
>> diff --git a/migration/migration.c b/migration/migration.c
>> index 10c216d..08a98f7 100644
>> --- a/migration/migration.c
>> +++ b/migration/migration.c
>> @@ -74,11 +74,7 @@
>>   #define INMIGRATE_DEFAULT_EXIT_ON_ERROR true
>> -static NotifierWithReturnList migration_state_notifiers[] = {
>> -    NOTIFIER_ELEM_INIT(migration_state_notifiers, MIG_MODE_NORMAL),
>> -    NOTIFIER_ELEM_INIT(migration_state_notifiers, MIG_MODE_CPR_REBOOT),
>> -    NOTIFIER_ELEM_INIT(migration_state_notifiers, MIG_MODE_CPR_TRANSFER),
>> -};
>> +static GSList *migration_state_notifiers[MIG_MODE__MAX];
>>   /* Messages sent on the return path from destination to source */
>>   enum mig_rp_message_type {
>> @@ -1665,23 +1661,51 @@ void migration_cancel(void)
>>       }
>>   }
>> +static int get_modes(MigMode mode, va_list ap);
>> +
>> +static void add_notifiers(NotifierWithReturn *notify, int modes)
>> +{
>> +    for (MigMode mode = 0; mode < MIG_MODE__MAX; mode++) {
>> +        if (modes & BIT(mode)) {
>> +            migration_state_notifiers[mode] =
>> +                g_slist_prepend(migration_state_notifiers[mode], notify);
>> +        }
>> +    }
>> +}
>> +
>> +void migration_add_notifier_modes(NotifierWithReturn *notify,
>> +                                  MigrationNotifyFunc func, MigMode mode, ...)
>> +{
>> +    int modes;
>> +    va_list ap;
>> +
>> +    va_start(ap, mode);
>> +    modes = get_modes(mode, ap);
>> +    va_end(ap);
> 
> No sanity check needed ? Could we have conflicting modes ? Just asking.
No conflicts.  A notifier can apply to one or more modes.  Only the caller
knows what is necessary.
- Steve
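
For illustration only (this is not code from the series): a minimal sketch of how a mode list terminated by -1 or an "all modes" marker could be folded into a bitmask, which is what the call to get_modes() above appears to rely on. The enum values, MODE_ALL and modes_to_mask() are hypothetical stand-ins, not the QEMU definitions.

```c
#include <assert.h>
#include <stdarg.h>

/* Hypothetical stand-ins for the MigMode values; not the QEMU enum. */
enum {
    MODE_NORMAL,
    MODE_CPR_REBOOT,
    MODE_CPR_TRANSFER,
    MODE_CPR_EXEC,
    MODE__MAX
};

/* Assumed marker meaning "every mode", analogous to MIG_MODE_ALL. */
#define MODE_ALL MODE__MAX

/*
 * Fold a list of modes, terminated by -1 or MODE_ALL, into a bitmask
 * with one bit per mode.  MODE_ALL selects every mode.
 */
int modes_to_mask(int mode, ...)
{
    int mask = 0;
    va_list ap;

    va_start(ap, mode);
    while (mode != -1 && mode != MODE_ALL) {
        assert(mode >= 0 && mode < MODE__MAX);
        mask |= 1 << mode;
        mode = va_arg(ap, int);
    }
    if (mode == MODE_ALL) {
        mask = (1 << MODE__MAX) - 1;
    }
    va_end(ap);
    return mask;
}
```

Since each mode only sets its own bit, listing several modes cannot conflict; the notifier simply ends up on one list per set bit.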
>> +    notify->notify = (NotifierWithReturnFunc)func;
>> +    add_notifiers(notify, modes);
>> +}
>> +
>>   void migration_add_notifier_mode(NotifierWithReturn *notify,
>>                                    MigrationNotifyFunc func, MigMode mode)
>>   {
>> -    notify->notify = (NotifierWithReturnFunc)func;
>> -    notifier_with_return_list_add(&migration_state_notifiers[mode], notify);
>> +    migration_add_notifier_modes(notify, func, mode, -1);
>>   }
>>   void migration_add_notifier(NotifierWithReturn *notify,
>>                               MigrationNotifyFunc func)
>>   {
>> -    migration_add_notifier_mode(notify, func, MIG_MODE_NORMAL);
>> +    migration_add_notifier_modes(notify, func, MIG_MODE_NORMAL, -1);
>>   }
>>   void migration_remove_notifier(NotifierWithReturn *notify)
>>   {
>>       if (notify->notify) {
>> -        notifier_with_return_remove(notify);
>> +        for (MigMode mode = 0; mode < MIG_MODE__MAX; mode++) {
>> +            migration_state_notifiers[mode] =
>> +                g_slist_remove(migration_state_notifiers[mode], notify);
>> +        }
>>           notify->notify = NULL;
>>       }
>>   }
>> @@ -1691,13 +1715,23 @@ int migration_call_notifiers(MigrationState *s, MigrationEventType type,
>>   {
>>       MigMode mode = s->parameters.mode;
>>       MigrationEvent e;
>> +    NotifierWithReturn *notifier;
>> +    GSList *elem, *next;
>>       int ret;
>>       e.type = type;
>> -    ret = notifier_with_return_list_notify(&migration_state_notifiers[mode],
>> -                                           &e, errp);
>> -    assert(!ret || type == MIG_EVENT_PRECOPY_SETUP);
>> -    return ret;
>> +
>> +    for (elem = migration_state_notifiers[mode]; elem; elem = next) {
>> +        next = elem->next;
>> +        notifier = (NotifierWithReturn *)elem->data;
>> +        ret = notifier->notify(notifier, &e, errp);
>> +        if (ret) {
>> +            assert(type == MIG_EVENT_PRECOPY_SETUP);
>> +            return ret;
>> +        }
>> +    }
>> +
>> +    return 0;
>>   }
>>   bool migration_has_failed(MigrationState *s)
> 
* Re: [PATCH V4 5/8] migration: cpr-exec save and load
  2025-09-22 16:00   ` Cédric Le Goater
@ 2025-09-24 18:16     ` Steven Sistare
  2025-09-25  7:11       ` Cédric Le Goater
  0 siblings, 1 reply; 30+ messages in thread
From: Steven Sistare @ 2025-09-24 18:16 UTC (permalink / raw)
  To: Cédric Le Goater, qemu-devel
  Cc: Fabiano Rosas, Peter Xu, Markus Armbruster, Paolo Bonzini,
	Eric Blake, Dr. David Alan Gilbert, Alex Williamson
On 9/22/2025 12:00 PM, Cédric Le Goater wrote:
> On 9/22/25 15:49, Steve Sistare wrote:
>> To preserve CPR state across exec, create a QEMUFile based on a memfd, and
>> keep the memfd open across exec.  Save the value of the memfd in an
>> environment variable so post-exec QEMU can find it.
> 
> Couldn't we preserve some memory to hand off to QEMU, as firmware does?
> An environment variable is a limited method.
There is no upside in making this more complicated.  We only need to
pass one tidbit of information -- the file descriptor number of the memfd
that contains all other information.
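
For illustration only, a minimal sketch of that handoff using plain POSIX calls instead of the glib and QEMU helpers in the patch. The variable name EXAMPLE_CPR_STATE_FD and the functions persist_fd() and find_fd() are hypothetical; the sketch also returns -1 on a missing or malformed value, where the patch asserts.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical variable name, chosen for this sketch only. */
#define STATE_ENV "EXAMPLE_CPR_STATE_FD"

/*
 * Before exec: make sure fd stays open across exec, then record its
 * number in the environment.  Returns 0 on success, -1 on error.
 */
int persist_fd(int fd)
{
    char val[16];
    int flags = fcntl(fd, F_GETFD);

    if (flags < 0 || fcntl(fd, F_SETFD, flags & ~FD_CLOEXEC) < 0) {
        return -1;
    }
    snprintf(val, sizeof(val), "%d", fd);
    return setenv(STATE_ENV, val, 1);
}

/*
 * After exec: recover the fd number and drop the variable so it is not
 * inherited again.  Returns -1 if the variable is absent or malformed.
 */
int find_fd(void)
{
    const char *val = getenv(STATE_ENV);
    char *end;
    long fd;

    if (!val) {
        return -1;
    }
    /* Parse before unsetting, so val is never read after removal. */
    fd = strtol(val, &end, 10);
    if (end == val || *end != '\0' || fd < 0) {
        fd = -1;
    }
    unsetenv(STATE_ENV);
    return (int)fd;
}
```

The environment carries only the descriptor number; everything else is read from the descriptor itself after exec.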
> Thanks,
> 
> C.
> 
> That's a short-term hack, right? It's not even documented.
It is an implementation detail, known only to the matched saving
and loading functions inside qemu.  No one else needs to know, so
no documentation.
- Steve
> I am sure
> you have something else in mind.
> 
>> These new functions are called in a subsequent patch.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>>   include/migration/cpr.h |  5 +++
>>   migration/cpr-exec.c    | 94 +++++++++++++++++++++++++++++++++++++++++++++++++
>>   migration/meson.build   |  1 +
>>   3 files changed, 100 insertions(+)
>>   create mode 100644 migration/cpr-exec.c
>>
>> diff --git a/include/migration/cpr.h b/include/migration/cpr.h
>> index 2b074d7..b84389f 100644
>> --- a/include/migration/cpr.h
>> +++ b/include/migration/cpr.h
>> @@ -53,4 +53,9 @@ int cpr_get_fd_param(const char *name, const char *fdname, int index,
>>   QEMUFile *cpr_transfer_output(MigrationChannel *channel, Error **errp);
>>   QEMUFile *cpr_transfer_input(MigrationChannel *channel, Error **errp);
>> +QEMUFile *cpr_exec_output(Error **errp);
>> +QEMUFile *cpr_exec_input(Error **errp);
>> +void cpr_exec_persist_state(QEMUFile *f);
>> +bool cpr_exec_has_state(void);
>> +void cpr_exec_unpersist_state(void);
>>   #endif
>> diff --git a/migration/cpr-exec.c b/migration/cpr-exec.c
>> new file mode 100644
>> index 0000000..2c32e9c
>> --- /dev/null
>> +++ b/migration/cpr-exec.c
>> @@ -0,0 +1,94 @@
>> +/*
>> + * Copyright (c) 2021-2025 Oracle and/or its affiliates.
>> + *
>> + * SPDX-License-Identifier: GPL-2.0-or-later
>> + */
>> +
>> +#include "qemu/osdep.h"
>> +#include "qemu/cutils.h"
>> +#include "qemu/memfd.h"
>> +#include "qapi/error.h"
>> +#include "io/channel-file.h"
>> +#include "io/channel-socket.h"
>> +#include "migration/cpr.h"
>> +#include "migration/qemu-file.h"
>> +#include "migration/misc.h"
>> +#include "migration/vmstate.h"
>> +#include "system/runstate.h"
>> +
>> +#define CPR_EXEC_STATE_NAME "QEMU_CPR_EXEC_STATE"
>> +
>> +static QEMUFile *qemu_file_new_fd_input(int fd, const char *name)
>> +{
>> +    g_autoptr(QIOChannelFile) fioc = qio_channel_file_new_fd(fd);
>> +    QIOChannel *ioc = QIO_CHANNEL(fioc);
>> +    qio_channel_set_name(ioc, name);
>> +    return qemu_file_new_input(ioc);
>> +}
>> +
>> +static QEMUFile *qemu_file_new_fd_output(int fd, const char *name)
>> +{
>> +    g_autoptr(QIOChannelFile) fioc = qio_channel_file_new_fd(fd);
>> +    QIOChannel *ioc = QIO_CHANNEL(fioc);
>> +    qio_channel_set_name(ioc, name);
>> +    return qemu_file_new_output(ioc);
>> +}
>> +
>> +void cpr_exec_persist_state(QEMUFile *f)
>> +{
>> +    QIOChannelFile *fioc = QIO_CHANNEL_FILE(qemu_file_get_ioc(f));
>> +    int mfd = dup(fioc->fd);
>> +    char val[16];
>> +
>> +    /* Remember mfd in environment for post-exec load */
>> +    qemu_clear_cloexec(mfd);
>> +    snprintf(val, sizeof(val), "%d", mfd);
>> +    g_setenv(CPR_EXEC_STATE_NAME, val, 1);
>> +}
>> +
>> +static int cpr_exec_find_state(void)
>> +{
>> +    const char *val = g_getenv(CPR_EXEC_STATE_NAME);
>> +    int mfd;
>> +
>> +    assert(val);
>> +    g_unsetenv(CPR_EXEC_STATE_NAME);
>> +    assert(!qemu_strtoi(val, NULL, 10, &mfd));
>> +    return mfd;
>> +}
>> +
>> +bool cpr_exec_has_state(void)
>> +{
>> +    return g_getenv(CPR_EXEC_STATE_NAME) != NULL;
>> +}
>> +
>> +void cpr_exec_unpersist_state(void)
>> +{
>> +    int mfd;
>> +    const char *val = g_getenv(CPR_EXEC_STATE_NAME);
>> +
>> +    g_unsetenv(CPR_EXEC_STATE_NAME);
>> +    assert(val);
>> +    assert(!qemu_strtoi(val, NULL, 10, &mfd));
>> +    close(mfd);
>> +}
>> +
>> +QEMUFile *cpr_exec_output(Error **errp)
>> +{
>> +    int mfd = memfd_create(CPR_EXEC_STATE_NAME, 0);
> 
> The build should be adjusted for Linux only.
> 
> Thanks,
> 
> C.
> 
> 
> 
>> +
>> +    if (mfd < 0) {
>> +        error_setg_errno(errp, errno, "memfd_create failed");
>> +        return NULL;
>> +    }
>> +
>> +    return qemu_file_new_fd_output(mfd, CPR_EXEC_STATE_NAME);
>> +}
>> +
>> +QEMUFile *cpr_exec_input(Error **errp)
>> +{
>> +    int mfd = cpr_exec_find_state();
>> +
>> +    lseek(mfd, 0, SEEK_SET);
>> +    return qemu_file_new_fd_input(mfd, CPR_EXEC_STATE_NAME);
>> +}
>> diff --git a/migration/meson.build b/migration/meson.build
>> index 0f71544..16909d5 100644
>> --- a/migration/meson.build
>> +++ b/migration/meson.build
>> @@ -16,6 +16,7 @@ system_ss.add(files(
>>     'channel-block.c',
>>     'cpr.c',
>>     'cpr-transfer.c',
>> +  'cpr-exec.c',
>>     'cpu-throttle.c',
>>     'dirtyrate.c',
>>     'exec.c',
> 
* Re: [PATCH V4 6/8] migration: cpr-exec mode
  2025-09-22 15:28   ` Cédric Le Goater
@ 2025-09-24 18:16     ` Steven Sistare
  2025-09-25  7:12       ` Cédric Le Goater
  0 siblings, 1 reply; 30+ messages in thread
From: Steven Sistare @ 2025-09-24 18:16 UTC (permalink / raw)
  To: Cédric Le Goater, qemu-devel
  Cc: Fabiano Rosas, Peter Xu, Markus Armbruster, Paolo Bonzini,
	Eric Blake, Dr. David Alan Gilbert, Alex Williamson
On 9/22/2025 11:28 AM, Cédric Le Goater wrote:
> On 9/22/25 15:49, Steve Sistare wrote:
>> Add the cpr-exec migration mode.  Usage:
>>    qemu-system-$arch -machine aux-ram-share=on ...
>>    migrate_set_parameter mode cpr-exec
>>    migrate_set_parameter cpr-exec-command \
>>      <arg1> <arg2> ... -incoming <uri-1> \
>>    migrate -d <uri-1>
>>
>> The migrate command stops the VM, saves state to uri-1,
>> directly exec's a new version of QEMU on the same host,
>> replacing the original process while retaining its PID, and
>> loads state from uri-1.  Guest RAM is preserved in place,
>> albeit with new virtual addresses.
>>
>> The new QEMU process is started by exec'ing the command
>> specified by the @cpr-exec-command parameter.  The first word of
>> the command is the binary, and the remaining words are its
>> arguments.  The command may be a direct invocation of new QEMU,
>> or may be a non-QEMU command that exec's the new QEMU binary.
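
For illustration only, a minimal sketch of exec'ing such a command from an argv-style word list and of reporting a failed exec. exec_command() is a hypothetical name. The sketch saves errno immediately after execvp() returns, since any later cleanup may overwrite it.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/*
 * Replace the current process with argv[0], passing the remaining words
 * as its arguments.  On success this function never returns.  On
 * failure it returns the errno value set by execvp().
 */
int exec_command(char *const argv[])
{
    int saved_errno;

    execvp(argv[0], argv);

    /* Only reached if exec failed; capture errno before any cleanup. */
    saved_errno = errno;

    /* ... undo pre-exec preparation here, e.g. restore close-on-exec ... */

    return saved_errno;
}
```

Called with a word list such as { "/nonexistent/example-qemu", "-incoming", "file:/tmp/state", NULL }, the function would be expected to return ENOENT, because the first word does not name an existing binary.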
>>
>> This mode creates a second migration channel that is not visible
>> to the user.  At the start of migration, old QEMU saves CPR state
>> to the second channel, and at the end of migration, it tells the
>> main loop to call cpr_exec.  New QEMU loads CPR state early, before
>> objects are created.
>>
>> Because old QEMU terminates when new QEMU starts, one cannot
>> stream data between the two, so uri-1 must be a type,
>> such as a file, that accepts all data before old QEMU exits.
>> Otherwise, old QEMU may quietly block writing to the channel.
>>
>> Memory-backend objects must have the share=on attribute, but
>> memory-backend-epc is not supported.  The VM must be started with
>> the '-machine aux-ram-share=on' option, which allows anonymous
>> memory to be transferred in place to the new process.  The memfds
>> are kept open across exec by clearing the close-on-exec flag, their
>> values are saved in CPR state, and they are mmap'd in new QEMU.
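
For illustration only, a minimal sketch of toggling close-on-exec over a set of preserved descriptors with a callback walk, in the spirit of the cpr_walk_fd() calls in this patch. The names example_preserve_fd(), example_unpreserve_fd() and walk_fds() are hypothetical, and a plain array stands in for the CPR state list. Unlike the callbacks in the patch, these report fcntl() failures.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stddef.h>

typedef bool (*fd_cb)(int fd);

/* Clear FD_CLOEXEC so that fd stays open across exec. */
bool example_preserve_fd(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    return flags >= 0 && fcntl(fd, F_SETFD, flags & ~FD_CLOEXEC) == 0;
}

/* Set FD_CLOEXEC so that fd does not leak into unrelated children. */
bool example_unpreserve_fd(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    return flags >= 0 && fcntl(fd, F_SETFD, flags | FD_CLOEXEC) == 0;
}

/* Apply cb to each descriptor in turn; stop at the first failure. */
bool walk_fds(const int *fds, size_t n, fd_cb cb)
{
    for (size_t i = 0; i < n; i++) {
        if (!cb(fds[i])) {
            return false;
        }
    }
    return true;
}
```

The flag would be cleared just before exec and set again either in the new process after loading state or in the old process if exec fails.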
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> Acked-by: Markus Armbruster <armbru@redhat.com>
>> ---
>>   qapi/migration.json       | 25 +++++++++++++-
>>   include/migration/cpr.h   |  1 +
>>   migration/cpr-exec.c      | 84 +++++++++++++++++++++++++++++++++++++++++++++++
>>   migration/cpr.c           | 28 ++++++++++++++--
>>   migration/migration.c     | 10 +++++-
>>   migration/ram.c           |  1 +
>>   migration/vmstate-types.c |  8 +++++
>>   system/vl.c               |  4 ++-
>>   migration/trace-events    |  1 +
>>   9 files changed, 157 insertions(+), 5 deletions(-)
>>
>> diff --git a/qapi/migration.json b/qapi/migration.json
>> index 2be8fa1..be0f3fc 100644
>> --- a/qapi/migration.json
>> +++ b/qapi/migration.json
>> @@ -694,9 +694,32 @@
>>   #     until you issue the `migrate-incoming` command.
>>   #
>>   #     (since 10.0)
>> +#
>> +# @cpr-exec: The migrate command stops the VM, saves state to the
>> +#     migration channel, directly exec's a new version of QEMU on the
>> +#     same host, replacing the original process while retaining its
>> +#     PID, and loads state from the channel.  Guest RAM is preserved
>> +#     in place.  Devices and their pinned pages are also preserved for
>> +#     VFIO and IOMMUFD.
>> +#
>> +#     Old QEMU starts new QEMU by exec'ing the command specified by
>> +#     the @cpr-exec-command parameter.  The command may be a direct
>> +#     invocation of new QEMU, or may be a wrapper that exec's the new
>> +#     QEMU binary.
>> +#
>> +#     Because old QEMU terminates when new QEMU starts, one cannot
>> +#     stream data between the two, so the channel must be a type,
>> +#     such as a file, that accepts all data before old QEMU exits.
>> +#     Otherwise, old QEMU may quietly block writing to the channel.
>> +#
>> +#     Memory-backend objects must have the share=on attribute, but
>> +#     memory-backend-epc is not supported.  The VM must be started
>> +#     with the '-machine aux-ram-share=on' option.
>> +#
>> +#     (since 10.2)
>>   ##
>>   { 'enum': 'MigMode',
>> -  'data': [ 'normal', 'cpr-reboot', 'cpr-transfer' ] }
>> +  'data': [ 'normal', 'cpr-reboot', 'cpr-transfer', 'cpr-exec' ] }
>>   ##
>>   # @ZeroPageDetection:
>> diff --git a/include/migration/cpr.h b/include/migration/cpr.h
>> index b84389f..beed392 100644
>> --- a/include/migration/cpr.h
>> +++ b/include/migration/cpr.h
>> @@ -53,6 +53,7 @@ int cpr_get_fd_param(const char *name, const char *fdname, int index,
>>   QEMUFile *cpr_transfer_output(MigrationChannel *channel, Error **errp);
>>   QEMUFile *cpr_transfer_input(MigrationChannel *channel, Error **errp);
>> +void cpr_exec_init(void);
>>   QEMUFile *cpr_exec_output(Error **errp);
>>   QEMUFile *cpr_exec_input(Error **errp);
>>   void cpr_exec_persist_state(QEMUFile *f);
>> diff --git a/migration/cpr-exec.c b/migration/cpr-exec.c
>> index 2c32e9c..8cf55a3 100644
>> --- a/migration/cpr-exec.c
>> +++ b/migration/cpr-exec.c
>> @@ -6,15 +6,21 @@
>>   #include "qemu/osdep.h"
>>   #include "qemu/cutils.h"
>> +#include "qemu/error-report.h"
>>   #include "qemu/memfd.h"
>>   #include "qapi/error.h"
>> +#include "qapi/type-helpers.h"
>>   #include "io/channel-file.h"
>>   #include "io/channel-socket.h"
>> +#include "block/block-global-state.h"
>> +#include "qemu/main-loop.h"
>>   #include "migration/cpr.h"
>>   #include "migration/qemu-file.h"
>> +#include "migration/migration.h"
>>   #include "migration/misc.h"
>>   #include "migration/vmstate.h"
>>   #include "system/runstate.h"
>> +#include "trace.h"
>>   #define CPR_EXEC_STATE_NAME "QEMU_CPR_EXEC_STATE"
>> @@ -92,3 +98,81 @@ QEMUFile *cpr_exec_input(Error **errp)
>>       lseek(mfd, 0, SEEK_SET);
>>       return qemu_file_new_fd_input(mfd, CPR_EXEC_STATE_NAME);
>>   }
>> +
>> +static bool preserve_fd(int fd)
>> +{
>> +    qemu_clear_cloexec(fd);
>> +    return true;
>> +}
>> +
>> +static bool unpreserve_fd(int fd)
>> +{
>> +    qemu_set_cloexec(fd);
>> +    return true;
>> +}
>> +
>> +static void cpr_exec_cb(void *opaque)
>> +{
>> +    MigrationState *s = migrate_get_current();
>> +    char **argv = strv_from_str_list(s->parameters.cpr_exec_command);
>> +    Error *err = NULL;
>> +
>> +    /*
>> +     * Clear the close-on-exec flag for all preserved fd's.  We cannot do so
>> +     * earlier because they should not persist across miscellaneous fork and
>> +     * exec calls that are performed during normal operation.
>> +     */
>> +    cpr_walk_fd(preserve_fd);
>> +
>> +    trace_cpr_exec();
>> +    execvp(argv[0], argv);
>> +
>> +    /*
>> +     * exec should only fail if argv[0] is bogus, or has a permissions problem,
>> +     * or the system is very short on resources.
>> +     */
>> +    error_setg_errno(&err, errno, "execvp %s failed", argv[0]);
>> +    g_strfreev(argv);
>> +    cpr_walk_fd(unpreserve_fd);
>> +
>> +    error_report_err(error_copy(err));
>> +    migrate_set_state(&s->state, s->state, MIGRATION_STATUS_FAILED);
>> +    migrate_set_error(s, err);
>> +
>> +    migration_call_notifiers(s, MIG_EVENT_PRECOPY_FAILED, NULL);
>> +
>> +    err = NULL;
>> +    if (!migration_block_activate(&err)) {
>> +        /* error was already reported */
>> +        return;
>> +    }
>> +
>> +    if (runstate_is_live(s->vm_old_state)) {
>> +        vm_start();
>> +    }
>> +}
>> +
>> +static int cpr_exec_notifier(NotifierWithReturn *notifier, MigrationEvent *e,
>> +                             Error **errp)
>> +{
>> +    MigrationState *s = migrate_get_current();
>> +
>> +    if (e->type == MIG_EVENT_PRECOPY_DONE) {
>> +        QEMUBH *cpr_exec_bh = qemu_bh_new(cpr_exec_cb, NULL);
>> +        assert(s->state == MIGRATION_STATUS_COMPLETED);
>> +        qemu_bh_schedule(cpr_exec_bh);
>> +        qemu_notify_event();
>> +
>> +    } else if (e->type == MIG_EVENT_PRECOPY_FAILED) {
>> +        cpr_exec_unpersist_state();
>> +    }
>> +    return 0;
>> +}
>> +
>> +void cpr_exec_init(void)
>> +{
>> +    static NotifierWithReturn exec_notifier;
>> +
>> +    migration_add_notifier_mode(&exec_notifier, cpr_exec_notifier,
>> +                                MIG_MODE_CPR_EXEC);
>> +}
>> diff --git a/migration/cpr.c b/migration/cpr.c
>> index d3e370e..eea3773 100644
>> --- a/migration/cpr.c
>> +++ b/migration/cpr.c
>> @@ -185,6 +185,8 @@ int cpr_state_save(MigrationChannel *channel, Error **errp)
>>       if (mode == MIG_MODE_CPR_TRANSFER) {
>>           g_assert(channel);
>>           f = cpr_transfer_output(channel, errp);
>> +    } else if (mode == MIG_MODE_CPR_EXEC) {
>> +        f = cpr_exec_output(errp);
>>       } else {
>>           return 0;
>>       }
>> @@ -202,6 +204,10 @@ int cpr_state_save(MigrationChannel *channel, Error **errp)
>>           return ret;
>>       }
>> +    if (migrate_mode() == MIG_MODE_CPR_EXEC) {
>> +        cpr_exec_persist_state(f);
>> +    }
>> +
>>       /*
>>        * Close the socket only partially so we can later detect when the other
>>        * end closes by getting a HUP event.
>> @@ -213,6 +219,12 @@ int cpr_state_save(MigrationChannel *channel, Error **errp)
>>       return 0;
>>   }
>> +static bool unpreserve_fd(int fd)
>> +{
>> +    qemu_set_cloexec(fd);
>> +    return true;
>> +}
>> +
>>   int cpr_state_load(MigrationChannel *channel, Error **errp)
>>   {
>>       int ret;
>> @@ -220,7 +232,13 @@ int cpr_state_load(MigrationChannel *channel, Error **errp)
>>       QEMUFile *f;
>>       MigMode mode = 0;
>> -    if (channel) {
>> +    if (cpr_exec_has_state()) {
>> +        mode = MIG_MODE_CPR_EXEC;
>> +        f = cpr_exec_input(errp);
>> +        if (channel) {
>> +            warn_report("ignoring cpr channel for migration mode cpr-exec");
> 
> migration/cpr.c does not include "qemu/error-report.h"
It builds just fine because it is included indirectly, but I will include it
directly.
- Steve
>> +        }
>> +    } else if (channel) {
>>           mode = MIG_MODE_CPR_TRANSFER;
>>           cpr_set_incoming_mode(mode);
>>           f = cpr_transfer_input(channel, errp);
>> @@ -232,6 +250,7 @@ int cpr_state_load(MigrationChannel *channel, Error **errp)
>>       }
>>       trace_cpr_state_load(MigMode_str(mode));
>> +    cpr_set_incoming_mode(mode);
>>       v = qemu_get_be32(f);
>>       if (v != QEMU_CPR_FILE_MAGIC) {
>> @@ -253,6 +272,11 @@ int cpr_state_load(MigrationChannel *channel, Error **errp)
>>           return ret;
>>       }
>> +    if (migrate_mode() == MIG_MODE_CPR_EXEC) {
>> +        /* Set cloexec to prevent fd leaks from fork until the next cpr-exec */
>> +        cpr_walk_fd(unpreserve_fd);
>> +    }
>> +
>>       /*
>>        * Let the caller decide when to close the socket (and generate a HUP event
>>        * for the sending side).
>> @@ -273,7 +297,7 @@ void cpr_state_close(void)
>>   bool cpr_incoming_needed(void *opaque)
>>   {
>>       MigMode mode = migrate_mode();
>> -    return mode == MIG_MODE_CPR_TRANSFER;
>> +    return mode == MIG_MODE_CPR_TRANSFER || mode == MIG_MODE_CPR_EXEC;
>>   }
>>   /*
>> diff --git a/migration/migration.c b/migration/migration.c
>> index 08a98f7..2515bec 100644
>> --- a/migration/migration.c
>> +++ b/migration/migration.c
>> @@ -333,6 +333,7 @@ void migration_object_init(void)
>>       ram_mig_init();
>>       dirty_bitmap_mig_init();
>> +    cpr_exec_init();
>>       /* Initialize cpu throttle timers */
>>       cpu_throttle_init();
>> @@ -1796,7 +1797,8 @@ bool migrate_mode_is_cpr(MigrationState *s)
>>   {
>>       MigMode mode = s->parameters.mode;
>>       return mode == MIG_MODE_CPR_REBOOT ||
>> -           mode == MIG_MODE_CPR_TRANSFER;
>> +           mode == MIG_MODE_CPR_TRANSFER ||
>> +           mode == MIG_MODE_CPR_EXEC;
>>   }
>>   int migrate_init(MigrationState *s, Error **errp)
>> @@ -2145,6 +2147,12 @@ static bool migrate_prepare(MigrationState *s, bool resume, Error **errp)
>>           return false;
>>       }
>> +    if (migrate_mode() == MIG_MODE_CPR_EXEC &&
>> +        !s->parameters.has_cpr_exec_command) {
>> +        error_setg(errp, "cpr-exec mode requires setting cpr-exec-command");
>> +        return false;
>> +    }
>> +
>>       if (migration_is_blocked(errp)) {
>>           return false;
>>       }
>> diff --git a/migration/ram.c b/migration/ram.c
>> index 7208bc1..6730a41 100644
>> --- a/migration/ram.c
>> +++ b/migration/ram.c
>> @@ -228,6 +228,7 @@ bool migrate_ram_is_ignored(RAMBlock *block)
>>       MigMode mode = migrate_mode();
>>       return !qemu_ram_is_migratable(block) ||
>>              mode == MIG_MODE_CPR_TRANSFER ||
>> +           mode == MIG_MODE_CPR_EXEC ||
>>              (migrate_ignore_shared() && qemu_ram_is_shared(block)
>>                                       && qemu_ram_is_named_file(block));
>>   }
>> diff --git a/migration/vmstate-types.c b/migration/vmstate-types.c
>> index 741a588..1aa0573 100644
>> --- a/migration/vmstate-types.c
>> +++ b/migration/vmstate-types.c
>> @@ -321,6 +321,10 @@ static int get_fd(QEMUFile *f, void *pv, size_t size,
>>                     const VMStateField *field)
>>   {
>>       int32_t *v = pv;
>> +    if (migrate_mode() == MIG_MODE_CPR_EXEC) {
>> +        qemu_get_sbe32s(f, v);
>> +        return 0;
>> +    }
>>       *v = qemu_file_get_fd(f);
>>       return 0;
>>   }
>> @@ -329,6 +333,10 @@ static int put_fd(QEMUFile *f, void *pv, size_t size,
>>                     const VMStateField *field, JSONWriter *vmdesc)
>>   {
>>       int32_t *v = pv;
>> +    if (migrate_mode() == MIG_MODE_CPR_EXEC) {
>> +        qemu_put_sbe32s(f, v);
>> +        return 0;
>> +    }
>>       return qemu_file_put_fd(f, *v);
>>   }
>> diff --git a/system/vl.c b/system/vl.c
>> index 4c24073..f395d04 100644
>> --- a/system/vl.c
>> +++ b/system/vl.c
>> @@ -3867,6 +3867,8 @@ void qemu_init(int argc, char **argv)
>>       }
>>       qemu_init_displays();
>>       accel_setup_post(current_machine);
>> -    os_setup_post();
>> +    if (migrate_mode() != MIG_MODE_CPR_EXEC) {
>> +        os_setup_post();
>> +    }
>>       resume_mux_open();
>>   }
>> diff --git a/migration/trace-events b/migration/trace-events
>> index 706db97..e8edd1f 100644
>> --- a/migration/trace-events
>> +++ b/migration/trace-events
>> @@ -354,6 +354,7 @@ cpr_state_save(const char *mode) "%s mode"
>>   cpr_state_load(const char *mode) "%s mode"
>>   cpr_transfer_input(const char *path) "%s"
>>   cpr_transfer_output(const char *path) "%s"
>> +cpr_exec(void) ""
>>   # block-dirty-bitmap.c
>>   send_bitmap_header_enter(void) ""
> 
* Re: [PATCH V4 5/8] migration: cpr-exec save and load
  2025-09-24 18:16     ` Steven Sistare
@ 2025-09-25  7:11       ` Cédric Le Goater
  2025-09-25 20:38         ` Steven Sistare
  2025-09-30 16:19         ` Peter Xu
  0 siblings, 2 replies; 30+ messages in thread
From: Cédric Le Goater @ 2025-09-25  7:11 UTC (permalink / raw)
  To: Steven Sistare, qemu-devel
  Cc: Fabiano Rosas, Peter Xu, Markus Armbruster, Paolo Bonzini,
	Eric Blake, Dr. David Alan Gilbert, Alex Williamson
On 9/24/25 20:16, Steven Sistare wrote:
> On 9/22/2025 12:00 PM, Cédric Le Goater wrote:
>> On 9/22/25 15:49, Steve Sistare wrote:
>>> To preserve CPR state across exec, create a QEMUFile based on a memfd, and
>>> keep the memfd open across exec.  Save the value of the memfd in an
>>> environment variable so post-exec QEMU can find it.
>>
>> Couldn't we preserve some memory to hand off to QEMU, as firmware does?
>> An environment variable is a limited method.
> 
> There is no upside in making this more complicated.  We only need to
> pass one tidbit of information -- the file descriptor number of the memfd
> that contains all other information.
Please adjust the build for Windows; memfd is Linux only.
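To illustrate the handoff being discussed, here is a minimal Linux-only sketch, assuming nothing beyond libc. The names SKETCH_CPR_STATE, state_persist and state_load are hypothetical and this is not the cpr-exec.c code. It does not actually exec; in real use the loading side would run in the exec'd program, which inherits both the descriptor and the environment.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

#define STATE_ENV "SKETCH_CPR_STATE"    /* hypothetical variable name */

/* Saving side: stash data in a memfd and advertise its number in the env. */
static int state_persist(const void *data, size_t len)
{
    char val[16];
    /* Raw syscall, so this also builds against a libc without the wrapper. */
    int mfd = (int)syscall(SYS_memfd_create, STATE_ENV, 0);

    if (mfd < 0) {
        return -1;
    }
    if (write(mfd, data, len) != (ssize_t)len) {
        close(mfd);
        return -1;
    }
    /* MFD_CLOEXEC was not requested, so mfd stays open across exec. */
    snprintf(val, sizeof(val), "%d", mfd);
    if (setenv(STATE_ENV, val, 1) != 0) {
        close(mfd);
        return -1;
    }
    return mfd;
}

/* Loading side: find the memfd, rewind it, and read up to len bytes. */
static ssize_t state_load(void *buf, size_t len)
{
    const char *val = getenv(STATE_ENV);
    char *end;
    long fd;

    if (!val) {
        return -1;
    }
    /* Parse before unsetting: val points into the environment. */
    fd = strtol(val, &end, 10);
    if (end == val || *end != '\0' || fd < 0) {
        return -1;
    }
    unsetenv(STATE_ENV);
    if (lseek((int)fd, 0, SEEK_SET) < 0) {
        return -1;
    }
    return read((int)fd, buf, len);
}
```

Only the descriptor number travels through the environment; everything else lives in the memfd, which is why a single variable is enough.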
>> Thanks,
>>
>> C.
>>
>> That's a short-term hack, right?  It's not even documented.
> 
> It is an implementation detail, known only to the matched saving
> and loading functions inside qemu.  No one else needs to know, so
> no documentation.
ok. Fair enough.
Thanks,
C.
> 
> - Steve
> 
>> I am sure
>> you have something else in mind.
>>
>>> These new functions are called in a subsequent patch.
>>>
>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>> ---
>>>   include/migration/cpr.h |  5 +++
>>>   migration/cpr-exec.c    | 94 +++++++++++++++++++++++++++++++++++++++++++++++++
>>>   migration/meson.build   |  1 +
>>>   3 files changed, 100 insertions(+)
>>>   create mode 100644 migration/cpr-exec.c
>>>
>>> diff --git a/include/migration/cpr.h b/include/migration/cpr.h
>>> index 2b074d7..b84389f 100644
>>> --- a/include/migration/cpr.h
>>> +++ b/include/migration/cpr.h
>>> @@ -53,4 +53,9 @@ int cpr_get_fd_param(const char *name, const char *fdname, int index,
>>>   QEMUFile *cpr_transfer_output(MigrationChannel *channel, Error **errp);
>>>   QEMUFile *cpr_transfer_input(MigrationChannel *channel, Error **errp);
>>> +QEMUFile *cpr_exec_output(Error **errp);
>>> +QEMUFile *cpr_exec_input(Error **errp);
>>> +void cpr_exec_persist_state(QEMUFile *f);
>>> +bool cpr_exec_has_state(void);
>>> +void cpr_exec_unpersist_state(void);
>>>   #endif
>>> diff --git a/migration/cpr-exec.c b/migration/cpr-exec.c
>>> new file mode 100644
>>> index 0000000..2c32e9c
>>> --- /dev/null
>>> +++ b/migration/cpr-exec.c
>>> @@ -0,0 +1,94 @@
>>> +/*
>>> + * Copyright (c) 2021-2025 Oracle and/or its affiliates.
>>> + *
>>> + * SPDX-License-Identifier: GPL-2.0-or-later
>>> + */
>>> +
>>> +#include "qemu/osdep.h"
>>> +#include "qemu/cutils.h"
>>> +#include "qemu/memfd.h"
>>> +#include "qapi/error.h"
>>> +#include "io/channel-file.h"
>>> +#include "io/channel-socket.h"
>>> +#include "migration/cpr.h"
>>> +#include "migration/qemu-file.h"
>>> +#include "migration/misc.h"
>>> +#include "migration/vmstate.h"
>>> +#include "system/runstate.h"
>>> +
>>> +#define CPR_EXEC_STATE_NAME "QEMU_CPR_EXEC_STATE"
>>> +
>>> +static QEMUFile *qemu_file_new_fd_input(int fd, const char *name)
>>> +{
>>> +    g_autoptr(QIOChannelFile) fioc = qio_channel_file_new_fd(fd);
>>> +    QIOChannel *ioc = QIO_CHANNEL(fioc);
>>> +    qio_channel_set_name(ioc, name);
>>> +    return qemu_file_new_input(ioc);
>>> +}
>>> +
>>> +static QEMUFile *qemu_file_new_fd_output(int fd, const char *name)
>>> +{
>>> +    g_autoptr(QIOChannelFile) fioc = qio_channel_file_new_fd(fd);
>>> +    QIOChannel *ioc = QIO_CHANNEL(fioc);
>>> +    qio_channel_set_name(ioc, name);
>>> +    return qemu_file_new_output(ioc);
>>> +}
>>> +
>>> +void cpr_exec_persist_state(QEMUFile *f)
>>> +{
>>> +    QIOChannelFile *fioc = QIO_CHANNEL_FILE(qemu_file_get_ioc(f));
>>> +    int mfd = dup(fioc->fd);
>>> +    char val[16];
>>> +
>>> +    /* Remember mfd in environment for post-exec load */
>>> +    qemu_clear_cloexec(mfd);
>>> +    snprintf(val, sizeof(val), "%d", mfd);
>>> +    g_setenv(CPR_EXEC_STATE_NAME, val, 1);
>>> +}
>>> +
>>> +static int cpr_exec_find_state(void)
>>> +{
>>> +    const char *val = g_getenv(CPR_EXEC_STATE_NAME);
>>> +    int mfd;
>>> +
>>> +    assert(val);
>>> +    assert(!qemu_strtoi(val, NULL, 10, &mfd));
>>> +    g_unsetenv(CPR_EXEC_STATE_NAME);
>>> +    return mfd;
>>> +}
>>> +
>>> +bool cpr_exec_has_state(void)
>>> +{
>>> +    return g_getenv(CPR_EXEC_STATE_NAME) != NULL;
>>> +}
>>> +
>>> +void cpr_exec_unpersist_state(void)
>>> +{
>>> +    int mfd;
>>> +    const char *val = g_getenv(CPR_EXEC_STATE_NAME);
>>> +
>>> +    assert(val);
>>> +    assert(!qemu_strtoi(val, NULL, 10, &mfd));
>>> +    g_unsetenv(CPR_EXEC_STATE_NAME);
>>> +    close(mfd);
>>> +}
>>> +
>>> +QEMUFile *cpr_exec_output(Error **errp)
>>> +{
>>> +    int mfd = memfd_create(CPR_EXEC_STATE_NAME, 0);
>>
>> The build should be adjusted for Linux only.
>>
>> Thanks,
>>
>> C.
>>
>>
>>
>>> +
>>> +    if (mfd < 0) {
>>> +        error_setg_errno(errp, errno, "memfd_create failed");
>>> +        return NULL;
>>> +    }
>>> +
>>> +    return qemu_file_new_fd_output(mfd, CPR_EXEC_STATE_NAME);
>>> +}
>>> +
>>> +QEMUFile *cpr_exec_input(Error **errp)
>>> +{
>>> +    int mfd = cpr_exec_find_state();
>>> +
>>> +    lseek(mfd, 0, SEEK_SET);
>>> +    return qemu_file_new_fd_input(mfd, CPR_EXEC_STATE_NAME);
>>> +}
>>> diff --git a/migration/meson.build b/migration/meson.build
>>> index 0f71544..16909d5 100644
>>> --- a/migration/meson.build
>>> +++ b/migration/meson.build
>>> @@ -16,6 +16,7 @@ system_ss.add(files(
>>>     'channel-block.c',
>>>     'cpr.c',
>>>     'cpr-transfer.c',
>>> +  'cpr-exec.c',
>>>     'cpu-throttle.c',
>>>     'dirtyrate.c',
>>>     'exec.c',
>>
> 
* Re: [PATCH V4 6/8] migration: cpr-exec mode
  2025-09-24 18:16     ` Steven Sistare
@ 2025-09-25  7:12       ` Cédric Le Goater
  0 siblings, 0 replies; 30+ messages in thread
From: Cédric Le Goater @ 2025-09-25  7:12 UTC (permalink / raw)
  To: Steven Sistare, qemu-devel
  Cc: Fabiano Rosas, Peter Xu, Markus Armbruster, Paolo Bonzini,
	Eric Blake, Dr. David Alan Gilbert, Alex Williamson
On 9/24/25 20:16, Steven Sistare wrote:
> On 9/22/2025 11:28 AM, Cédric Le Goater wrote:
>> On 9/22/25 15:49, Steve Sistare wrote:
>>> Add the cpr-exec migration mode.  Usage:
>>>    qemu-system-$arch -machine aux-ram-share=on ...
>>>    migrate_set_parameter mode cpr-exec
>>>    migrate_set_parameter cpr-exec-command \
>>>      <arg1> <arg2> ... -incoming <uri-1>
>>>    migrate -d <uri-1>
>>>
>>> The migrate command stops the VM, saves state to uri-1,
>>> directly exec's a new version of QEMU on the same host,
>>> replacing the original process while retaining its PID, and
>>> loads state from uri-1.  Guest RAM is preserved in place,
>>> albeit with new virtual addresses.
>>>
>>> The new QEMU process is started by exec'ing the command
>>> specified by the @cpr-exec-command parameter.  The first word of
>>> the command is the binary, and the remaining words are its
>>> arguments.  The command may be a direct invocation of new QEMU,
>>> or may be a non-QEMU command that exec's the new QEMU binary.
>>>
>>> This mode creates a second migration channel that is not visible
>>> to the user.  At the start of migration, old QEMU saves CPR state
>>> to the second channel, and at the end of migration, it tells the
>>> main loop to call cpr_exec.  New QEMU loads CPR state early, before
>>> objects are created.
>>>
>>> Because old QEMU terminates when new QEMU starts, one cannot
>>> stream data between the two, so uri-1 must be a type,
>>> such as a file, that accepts all data before old QEMU exits.
>>> Otherwise, old QEMU may quietly block writing to the channel.
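The difference between a channel that accepts all data and one that needs a concurrent reader can be shown with a small standalone sketch. This is not QEMU code, and accepts_all, pipe_accepts_all and file_accepts_all are hypothetical names. The pipe is put in non-blocking mode only so that the sketch reports failure instead of hanging the way a blocking writer would.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Try to write total bytes to fd; return true only if all were accepted. */
static bool accepts_all(int fd, size_t total)
{
    char chunk[4096];
    size_t done = 0;

    memset(chunk, 0, sizeof(chunk));
    while (done < total) {
        size_t left = total - done;
        size_t want = left < sizeof(chunk) ? left : sizeof(chunk);
        ssize_t n = write(fd, chunk, want);

        if (n < 0 && errno == EINTR) {
            continue;
        }
        if (n <= 0) {
            /* EAGAIN on a full pipe: a blocking writer would stall here */
            return false;
        }
        done += (size_t)n;
    }
    return true;
}

/* A pipe with no reader only buffers a limited amount of data. */
static bool pipe_accepts_all(size_t total)
{
    int p[2];
    bool ok;

    if (pipe(p) < 0) {
        return false;
    }
    ok = fcntl(p[1], F_SETFL, O_NONBLOCK) == 0 && accepts_all(p[1], total);
    close(p[0]);
    close(p[1]);
    return ok;
}

/* A regular file takes everything without needing a reader. */
static bool file_accepts_all(size_t total)
{
    FILE *f = tmpfile();
    bool ok = f != NULL && accepts_all(fileno(f), total);

    if (f) {
        fclose(f);
    }
    return ok;
}
```

With a few megabytes of state, the pipe case gives up once the kernel buffer fills, while the file case completes, which matches the requirement that uri-1 accept all data before old QEMU exits.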
>>>
>>> Memory-backend objects must have the share=on attribute, but
>>> memory-backend-epc is not supported.  The VM must be started with
>>> the '-machine aux-ram-share=on' option, which allows anonymous
>>> memory to be transferred in place to the new process.  The memfds
>>> are kept open across exec by clearing the close-on-exec flag, their
>>> values are saved in CPR state, and they are mmap'd in new QEMU.
>>>
>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>> Acked-by: Markus Armbruster <armbru@redhat.com>
>>> ---
>>>   qapi/migration.json       | 25 +++++++++++++-
>>>   include/migration/cpr.h   |  1 +
>>>   migration/cpr-exec.c      | 84 +++++++++++++++++++++++++++++++++++++++++++++++
>>>   migration/cpr.c           | 28 ++++++++++++++--
>>>   migration/migration.c     | 10 +++++-
>>>   migration/ram.c           |  1 +
>>>   migration/vmstate-types.c |  8 +++++
>>>   system/vl.c               |  4 ++-
>>>   migration/trace-events    |  1 +
>>>   9 files changed, 157 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/qapi/migration.json b/qapi/migration.json
>>> index 2be8fa1..be0f3fc 100644
>>> --- a/qapi/migration.json
>>> +++ b/qapi/migration.json
>>> @@ -694,9 +694,32 @@
>>>   #     until you issue the `migrate-incoming` command.
>>>   #
>>>   #     (since 10.0)
>>> +#
>>> +# @cpr-exec: The migrate command stops the VM, saves state to the
>>> +#     migration channel, directly exec's a new version of QEMU on the
>>> +#     same host, replacing the original process while retaining its
>>> +#     PID, and loads state from the channel.  Guest RAM is preserved
>>> +#     in place.  Devices and their pinned pages are also preserved for
>>> +#     VFIO and IOMMUFD.
>>> +#
>>> +#     Old QEMU starts new QEMU by exec'ing the command specified by
>>> +#     the @cpr-exec-command parameter.  The command may be a direct
>>> +#     invocation of new QEMU, or may be a wrapper that exec's the new
>>> +#     QEMU binary.
>>> +#
>>> +#     Because old QEMU terminates when new QEMU starts, one cannot
>>> +#     stream data between the two, so the channel must be a type,
>>> +#     such as a file, that accepts all data before old QEMU exits.
>>> +#     Otherwise, old QEMU may quietly block writing to the channel.
>>> +#
>>> +#     Memory-backend objects must have the share=on attribute, but
>>> +#     memory-backend-epc is not supported.  The VM must be started
>>> +#     with the '-machine aux-ram-share=on' option.
>>> +#
>>> +#     (since 10.2)
>>>   ##
>>>   { 'enum': 'MigMode',
>>> -  'data': [ 'normal', 'cpr-reboot', 'cpr-transfer' ] }
>>> +  'data': [ 'normal', 'cpr-reboot', 'cpr-transfer', 'cpr-exec' ] }
>>>   ##
>>>   # @ZeroPageDetection:
>>> diff --git a/include/migration/cpr.h b/include/migration/cpr.h
>>> index b84389f..beed392 100644
>>> --- a/include/migration/cpr.h
>>> +++ b/include/migration/cpr.h
>>> @@ -53,6 +53,7 @@ int cpr_get_fd_param(const char *name, const char *fdname, int index,
>>>   QEMUFile *cpr_transfer_output(MigrationChannel *channel, Error **errp);
>>>   QEMUFile *cpr_transfer_input(MigrationChannel *channel, Error **errp);
>>> +void cpr_exec_init(void);
>>>   QEMUFile *cpr_exec_output(Error **errp);
>>>   QEMUFile *cpr_exec_input(Error **errp);
>>>   void cpr_exec_persist_state(QEMUFile *f);
>>> diff --git a/migration/cpr-exec.c b/migration/cpr-exec.c
>>> index 2c32e9c..8cf55a3 100644
>>> --- a/migration/cpr-exec.c
>>> +++ b/migration/cpr-exec.c
>>> @@ -6,15 +6,21 @@
>>>   #include "qemu/osdep.h"
>>>   #include "qemu/cutils.h"
>>> +#include "qemu/error-report.h"
>>>   #include "qemu/memfd.h"
>>>   #include "qapi/error.h"
>>> +#include "qapi/type-helpers.h"
>>>   #include "io/channel-file.h"
>>>   #include "io/channel-socket.h"
>>> +#include "block/block-global-state.h"
>>> +#include "qemu/main-loop.h"
>>>   #include "migration/cpr.h"
>>>   #include "migration/qemu-file.h"
>>> +#include "migration/migration.h"
>>>   #include "migration/misc.h"
>>>   #include "migration/vmstate.h"
>>>   #include "system/runstate.h"
>>> +#include "trace.h"
>>>   #define CPR_EXEC_STATE_NAME "QEMU_CPR_EXEC_STATE"
>>> @@ -92,3 +98,81 @@ QEMUFile *cpr_exec_input(Error **errp)
>>>       lseek(mfd, 0, SEEK_SET);
>>>       return qemu_file_new_fd_input(mfd, CPR_EXEC_STATE_NAME);
>>>   }
>>> +
>>> +static bool preserve_fd(int fd)
>>> +{
>>> +    qemu_clear_cloexec(fd);
>>> +    return true;
>>> +}
>>> +
>>> +static bool unpreserve_fd(int fd)
>>> +{
>>> +    qemu_set_cloexec(fd);
>>> +    return true;
>>> +}
>>> +
>>> +static void cpr_exec_cb(void *opaque)
>>> +{
>>> +    MigrationState *s = migrate_get_current();
>>> +    char **argv = strv_from_str_list(s->parameters.cpr_exec_command);
>>> +    Error *err = NULL;
>>> +
>>> +    /*
>>> +     * Clear the close-on-exec flag for all preserved fd's.  We cannot do so
>>> +     * earlier because they should not persist across miscellaneous fork and
>>> +     * exec calls that are performed during normal operation.
>>> +     */
>>> +    cpr_walk_fd(preserve_fd);
>>> +
>>> +    trace_cpr_exec();
>>> +    execvp(argv[0], argv);
>>> +
>>> +    /*
>>> +     * exec should only fail if argv[0] is bogus, or has a permissions problem,
>>> +     * or the system is very short on resources.
>>> +     */
>>> +    error_setg_errno(&err, errno, "execvp %s failed", argv[0]);
>>> +    g_strfreev(argv);
>>> +    cpr_walk_fd(unpreserve_fd);
>>> +
>>> +    error_report_err(error_copy(err));
>>> +    migrate_set_state(&s->state, s->state, MIGRATION_STATUS_FAILED);
>>> +    migrate_set_error(s, err);
>>> +
>>> +    migration_call_notifiers(s, MIG_EVENT_PRECOPY_FAILED, NULL);
>>> +
>>> +    err = NULL;
>>> +    if (!migration_block_activate(&err)) {
>>> +        /* error was already reported */
>>> +        return;
>>> +    }
>>> +
>>> +    if (runstate_is_live(s->vm_old_state)) {
>>> +        vm_start();
>>> +    }
>>> +}
>>> +
>>> +static int cpr_exec_notifier(NotifierWithReturn *notifier, MigrationEvent *e,
>>> +                             Error **errp)
>>> +{
>>> +    MigrationState *s = migrate_get_current();
>>> +
>>> +    if (e->type == MIG_EVENT_PRECOPY_DONE) {
>>> +        QEMUBH *cpr_exec_bh = qemu_bh_new(cpr_exec_cb, NULL);
>>> +        assert(s->state == MIGRATION_STATUS_COMPLETED);
>>> +        qemu_bh_schedule(cpr_exec_bh);
>>> +        qemu_notify_event();
>>> +
>>> +    } else if (e->type == MIG_EVENT_PRECOPY_FAILED) {
>>> +        cpr_exec_unpersist_state();
>>> +    }
>>> +    return 0;
>>> +}
>>> +
>>> +void cpr_exec_init(void)
>>> +{
>>> +    static NotifierWithReturn exec_notifier;
>>> +
>>> +    migration_add_notifier_mode(&exec_notifier, cpr_exec_notifier,
>>> +                                MIG_MODE_CPR_EXEC);
>>> +}
>>> diff --git a/migration/cpr.c b/migration/cpr.c
>>> index d3e370e..eea3773 100644
>>> --- a/migration/cpr.c
>>> +++ b/migration/cpr.c
>>> @@ -185,6 +185,8 @@ int cpr_state_save(MigrationChannel *channel, Error **errp)
>>>       if (mode == MIG_MODE_CPR_TRANSFER) {
>>>           g_assert(channel);
>>>           f = cpr_transfer_output(channel, errp);
>>> +    } else if (mode == MIG_MODE_CPR_EXEC) {
>>> +        f = cpr_exec_output(errp);
>>>       } else {
>>>           return 0;
>>>       }
>>> @@ -202,6 +204,10 @@ int cpr_state_save(MigrationChannel *channel, Error **errp)
>>>           return ret;
>>>       }
>>> +    if (migrate_mode() == MIG_MODE_CPR_EXEC) {
>>> +        cpr_exec_persist_state(f);
>>> +    }
>>> +
>>>       /*
>>>        * Close the socket only partially so we can later detect when the other
>>>        * end closes by getting a HUP event.
>>> @@ -213,6 +219,12 @@ int cpr_state_save(MigrationChannel *channel, Error **errp)
>>>       return 0;
>>>   }
>>> +static bool unpreserve_fd(int fd)
>>> +{
>>> +    qemu_set_cloexec(fd);
>>> +    return true;
>>> +}
>>> +
>>>   int cpr_state_load(MigrationChannel *channel, Error **errp)
>>>   {
>>>       int ret;
>>> @@ -220,7 +232,13 @@ int cpr_state_load(MigrationChannel *channel, Error **errp)
>>>       QEMUFile *f;
>>>       MigMode mode = 0;
>>> -    if (channel) {
>>> +    if (cpr_exec_has_state()) {
>>> +        mode = MIG_MODE_CPR_EXEC;
>>> +        f = cpr_exec_input(errp);
>>> +        if (channel) {
>>> +            warn_report("ignoring cpr channel for migration mode cpr-exec");
>>
>> migration/cpr.c does not include "qemu/error-report.h"
> 
> It builds just fine because it is included indirectly, but I will include it
> directly.
The build broke on my tree but I have other patches moving code. It is
better to be explicit.
Thanks,
C.
> 
> - Steve
>>> +        }
>>> +    } else if (channel) {
>>>           mode = MIG_MODE_CPR_TRANSFER;
>>>           cpr_set_incoming_mode(mode);
>>>           f = cpr_transfer_input(channel, errp);
>>> @@ -232,6 +250,7 @@ int cpr_state_load(MigrationChannel *channel, Error **errp)
>>>       }
>>>       trace_cpr_state_load(MigMode_str(mode));
>>> +    cpr_set_incoming_mode(mode);
>>>       v = qemu_get_be32(f);
>>>       if (v != QEMU_CPR_FILE_MAGIC) {
>>> @@ -253,6 +272,11 @@ int cpr_state_load(MigrationChannel *channel, Error **errp)
>>>           return ret;
>>>       }
>>> +    if (migrate_mode() == MIG_MODE_CPR_EXEC) {
>>> +        /* Set cloexec to prevent fd leaks from fork until the next cpr-exec */
>>> +        cpr_walk_fd(unpreserve_fd);
>>> +    }
>>> +
>>>       /*
>>>        * Let the caller decide when to close the socket (and generate a HUP event
>>>        * for the sending side).
>>> @@ -273,7 +297,7 @@ void cpr_state_close(void)
>>>   bool cpr_incoming_needed(void *opaque)
>>>   {
>>>       MigMode mode = migrate_mode();
>>> -    return mode == MIG_MODE_CPR_TRANSFER;
>>> +    return mode == MIG_MODE_CPR_TRANSFER || mode == MIG_MODE_CPR_EXEC;
>>>   }
>>>   /*
>>> diff --git a/migration/migration.c b/migration/migration.c
>>> index 08a98f7..2515bec 100644
>>> --- a/migration/migration.c
>>> +++ b/migration/migration.c
>>> @@ -333,6 +333,7 @@ void migration_object_init(void)
>>>       ram_mig_init();
>>>       dirty_bitmap_mig_init();
>>> +    cpr_exec_init();
>>>       /* Initialize cpu throttle timers */
>>>       cpu_throttle_init();
>>> @@ -1796,7 +1797,8 @@ bool migrate_mode_is_cpr(MigrationState *s)
>>>   {
>>>       MigMode mode = s->parameters.mode;
>>>       return mode == MIG_MODE_CPR_REBOOT ||
>>> -           mode == MIG_MODE_CPR_TRANSFER;
>>> +           mode == MIG_MODE_CPR_TRANSFER ||
>>> +           mode == MIG_MODE_CPR_EXEC;
>>>   }
>>>   int migrate_init(MigrationState *s, Error **errp)
>>> @@ -2145,6 +2147,12 @@ static bool migrate_prepare(MigrationState *s, bool resume, Error **errp)
>>>           return false;
>>>       }
>>> +    if (migrate_mode() == MIG_MODE_CPR_EXEC &&
>>> +        !s->parameters.has_cpr_exec_command) {
>>> +        error_setg(errp, "cpr-exec mode requires setting cpr-exec-command");
>>> +        return false;
>>> +    }
>>> +
>>>       if (migration_is_blocked(errp)) {
>>>           return false;
>>>       }
>>> diff --git a/migration/ram.c b/migration/ram.c
>>> index 7208bc1..6730a41 100644
>>> --- a/migration/ram.c
>>> +++ b/migration/ram.c
>>> @@ -228,6 +228,7 @@ bool migrate_ram_is_ignored(RAMBlock *block)
>>>       MigMode mode = migrate_mode();
>>>       return !qemu_ram_is_migratable(block) ||
>>>              mode == MIG_MODE_CPR_TRANSFER ||
>>> +           mode == MIG_MODE_CPR_EXEC ||
>>>              (migrate_ignore_shared() && qemu_ram_is_shared(block)
>>>                                       && qemu_ram_is_named_file(block));
>>>   }
>>> diff --git a/migration/vmstate-types.c b/migration/vmstate-types.c
>>> index 741a588..1aa0573 100644
>>> --- a/migration/vmstate-types.c
>>> +++ b/migration/vmstate-types.c
>>> @@ -321,6 +321,10 @@ static int get_fd(QEMUFile *f, void *pv, size_t size,
>>>                     const VMStateField *field)
>>>   {
>>>       int32_t *v = pv;
>>> +    if (migrate_mode() == MIG_MODE_CPR_EXEC) {
>>> +        qemu_get_sbe32s(f, v);
>>> +        return 0;
>>> +    }
>>>       *v = qemu_file_get_fd(f);
>>>       return 0;
>>>   }
>>> @@ -329,6 +333,10 @@ static int put_fd(QEMUFile *f, void *pv, size_t size,
>>>                     const VMStateField *field, JSONWriter *vmdesc)
>>>   {
>>>       int32_t *v = pv;
>>> +    if (migrate_mode() == MIG_MODE_CPR_EXEC) {
>>> +        qemu_put_sbe32s(f, v);
>>> +        return 0;
>>> +    }
>>>       return qemu_file_put_fd(f, *v);
>>>   }
>>> diff --git a/system/vl.c b/system/vl.c
>>> index 4c24073..f395d04 100644
>>> --- a/system/vl.c
>>> +++ b/system/vl.c
>>> @@ -3867,6 +3867,8 @@ void qemu_init(int argc, char **argv)
>>>       }
>>>       qemu_init_displays();
>>>       accel_setup_post(current_machine);
>>> -    os_setup_post();
>>> +    if (migrate_mode() != MIG_MODE_CPR_EXEC) {
>>> +        os_setup_post();
>>> +    }
>>>       resume_mux_open();
>>>   }
>>> diff --git a/migration/trace-events b/migration/trace-events
>>> index 706db97..e8edd1f 100644
>>> --- a/migration/trace-events
>>> +++ b/migration/trace-events
>>> @@ -354,6 +354,7 @@ cpr_state_save(const char *mode) "%s mode"
>>>   cpr_state_load(const char *mode) "%s mode"
>>>   cpr_transfer_input(const char *path) "%s"
>>>   cpr_transfer_output(const char *path) "%s"
>>> +cpr_exec(void) ""
>>>   # block-dirty-bitmap.c
>>>   send_bitmap_header_enter(void) ""
>>
> 
* Re: [PATCH V4 5/8] migration: cpr-exec save and load
  2025-09-25  7:11       ` Cédric Le Goater
@ 2025-09-25 20:38         ` Steven Sistare
  2025-09-30 16:19         ` Peter Xu
  1 sibling, 0 replies; 30+ messages in thread
From: Steven Sistare @ 2025-09-25 20:38 UTC (permalink / raw)
  To: Cédric Le Goater, qemu-devel
  Cc: Fabiano Rosas, Peter Xu, Markus Armbruster, Paolo Bonzini,
	Eric Blake, Dr. David Alan Gilbert, Alex Williamson
On 9/25/2025 3:11 AM, Cédric Le Goater wrote:
> On 9/24/25 20:16, Steven Sistare wrote:
>> On 9/22/2025 12:00 PM, Cédric Le Goater wrote:
>>> On 9/22/25 15:49, Steve Sistare wrote:
>>>> To preserve CPR state across exec, create a QEMUFile based on a memfd, and
>>>> keep the memfd open across exec.  Save the value of the memfd in an
>>>> environment variable so post-exec QEMU can find it.
>>>
>>> Couldn't we preserve some memory to hand off to QEMU, like firmware does?
>>> An environment variable is a limited method.
>>
>> There is no upside in making this more complicated.  We only need to
>> pass one tidbit of information -- the file descriptor number of the memfd
>> that contains all other information.
> 
> Please adjust the build for windows, memfd is Linux only.
Will do, thanks.  I will call qemu_memfd_create instead of memfd_create.  It is
defined for POSIX and Windows, but returns an error on the latter.
- Steve
>>> That's a short term hack right ? it's not even documented. 
>>
>> It is an implementation detail, known only to the matched saving
>> and loading functions inside qemu.  No one else needs to know, so
>> no documentation.
> 
> ok. Fair enough.
> 
> Thanks,
> 
> C.
> 
> 
>>
>> - Steve
>>
>>> I am sure
>>> you have something else in mind.
>>>
>>>> These new functions are called in a subsequent patch.
>>>>
>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>> ---
>>>>   include/migration/cpr.h |  5 +++
>>>>   migration/cpr-exec.c    | 94 +++++++++++++++++++++++++++++++++++++++++++++++++
>>>>   migration/meson.build   |  1 +
>>>>   3 files changed, 100 insertions(+)
>>>>   create mode 100644 migration/cpr-exec.c
>>>>
>>>> diff --git a/include/migration/cpr.h b/include/migration/cpr.h
>>>> index 2b074d7..b84389f 100644
>>>> --- a/include/migration/cpr.h
>>>> +++ b/include/migration/cpr.h
>>>> @@ -53,4 +53,9 @@ int cpr_get_fd_param(const char *name, const char *fdname, int index,
>>>>   QEMUFile *cpr_transfer_output(MigrationChannel *channel, Error **errp);
>>>>   QEMUFile *cpr_transfer_input(MigrationChannel *channel, Error **errp);
>>>> +QEMUFile *cpr_exec_output(Error **errp);
>>>> +QEMUFile *cpr_exec_input(Error **errp);
>>>> +void cpr_exec_persist_state(QEMUFile *f);
>>>> +bool cpr_exec_has_state(void);
>>>> +void cpr_exec_unpersist_state(void);
>>>>   #endif
>>>> diff --git a/migration/cpr-exec.c b/migration/cpr-exec.c
>>>> new file mode 100644
>>>> index 0000000..2c32e9c
>>>> --- /dev/null
>>>> +++ b/migration/cpr-exec.c
>>>> @@ -0,0 +1,94 @@
>>>> +/*
>>>> + * Copyright (c) 2021-2025 Oracle and/or its affiliates.
>>>> + *
>>>> + * SPDX-License-Identifier: GPL-2.0-or-later
>>>> + */
>>>> +
>>>> +#include "qemu/osdep.h"
>>>> +#include "qemu/cutils.h"
>>>> +#include "qemu/memfd.h"
>>>> +#include "qapi/error.h"
>>>> +#include "io/channel-file.h"
>>>> +#include "io/channel-socket.h"
>>>> +#include "migration/cpr.h"
>>>> +#include "migration/qemu-file.h"
>>>> +#include "migration/misc.h"
>>>> +#include "migration/vmstate.h"
>>>> +#include "system/runstate.h"
>>>> +
>>>> +#define CPR_EXEC_STATE_NAME "QEMU_CPR_EXEC_STATE"
>>>> +
>>>> +static QEMUFile *qemu_file_new_fd_input(int fd, const char *name)
>>>> +{
>>>> +    g_autoptr(QIOChannelFile) fioc = qio_channel_file_new_fd(fd);
>>>> +    QIOChannel *ioc = QIO_CHANNEL(fioc);
>>>> +    qio_channel_set_name(ioc, name);
>>>> +    return qemu_file_new_input(ioc);
>>>> +}
>>>> +
>>>> +static QEMUFile *qemu_file_new_fd_output(int fd, const char *name)
>>>> +{
>>>> +    g_autoptr(QIOChannelFile) fioc = qio_channel_file_new_fd(fd);
>>>> +    QIOChannel *ioc = QIO_CHANNEL(fioc);
>>>> +    qio_channel_set_name(ioc, name);
>>>> +    return qemu_file_new_output(ioc);
>>>> +}
>>>> +
>>>> +void cpr_exec_persist_state(QEMUFile *f)
>>>> +{
>>>> +    QIOChannelFile *fioc = QIO_CHANNEL_FILE(qemu_file_get_ioc(f));
>>>> +    int mfd = dup(fioc->fd);
>>>> +    char val[16];
>>>> +
>>>> +    /* Remember mfd in environment for post-exec load */
>>>> +    qemu_clear_cloexec(mfd);
>>>> +    snprintf(val, sizeof(val), "%d", mfd);
>>>> +    g_setenv(CPR_EXEC_STATE_NAME, val, 1);
>>>> +}
>>>> +
>>>> +static int cpr_exec_find_state(void)
>>>> +{
>>>> +    const char *val = g_getenv(CPR_EXEC_STATE_NAME);
>>>> +    int mfd;
>>>> +
>>>> +    assert(val);
>>>> +    g_unsetenv(CPR_EXEC_STATE_NAME);
>>>> +    assert(!qemu_strtoi(val, NULL, 10, &mfd));
>>>> +    return mfd;
>>>> +}
>>>> +
>>>> +bool cpr_exec_has_state(void)
>>>> +{
>>>> +    return g_getenv(CPR_EXEC_STATE_NAME) != NULL;
>>>> +}
>>>> +
>>>> +void cpr_exec_unpersist_state(void)
>>>> +{
>>>> +    int mfd;
>>>> +    const char *val = g_getenv(CPR_EXEC_STATE_NAME);
>>>> +
>>>> +    g_unsetenv(CPR_EXEC_STATE_NAME);
>>>> +    assert(val);
>>>> +    assert(!qemu_strtoi(val, NULL, 10, &mfd));
>>>> +    close(mfd);
>>>> +}
>>>> +
>>>> +QEMUFile *cpr_exec_output(Error **errp)
>>>> +{
>>>> +    int mfd = memfd_create(CPR_EXEC_STATE_NAME, 0);
>>>
>>> The build should be adjusted for Linux only.
>>>
>>> Thanks,
>>>
>>> C.
>>>
>>>
>>>
>>>> +
>>>> +    if (mfd < 0) {
>>>> +        error_setg_errno(errp, errno, "memfd_create failed");
>>>> +        return NULL;
>>>> +    }
>>>> +
>>>> +    return qemu_file_new_fd_output(mfd, CPR_EXEC_STATE_NAME);
>>>> +}
>>>> +
>>>> +QEMUFile *cpr_exec_input(Error **errp)
>>>> +{
>>>> +    int mfd = cpr_exec_find_state();
>>>> +
>>>> +    lseek(mfd, 0, SEEK_SET);
>>>> +    return qemu_file_new_fd_input(mfd, CPR_EXEC_STATE_NAME);
>>>> +}
>>>> diff --git a/migration/meson.build b/migration/meson.build
>>>> index 0f71544..16909d5 100644
>>>> --- a/migration/meson.build
>>>> +++ b/migration/meson.build
>>>> @@ -16,6 +16,7 @@ system_ss.add(files(
>>>>     'channel-block.c',
>>>>     'cpr.c',
>>>>     'cpr-transfer.c',
>>>> +  'cpr-exec.c',
>>>>     'cpu-throttle.c',
>>>>     'dirtyrate.c',
>>>>     'exec.c',
>>>
>>
> 
* Re: [PATCH V4 0/8] Live update: cpr-exec
  2025-09-22 13:49 [PATCH V4 0/8] Live update: cpr-exec Steve Sistare
                   ` (7 preceding siblings ...)
  2025-09-22 13:49 ` [PATCH V4 8/8] vfio: cpr-exec mode Steve Sistare
@ 2025-09-30 15:28 ` Steven Sistare
  2025-09-30 16:42   ` Peter Xu
  8 siblings, 1 reply; 30+ messages in thread
From: Steven Sistare @ 2025-09-30 15:28 UTC (permalink / raw)
  To: qemu-devel
  Cc: Fabiano Rosas, Peter Xu, Markus Armbruster, Paolo Bonzini,
	Eric Blake, Dr. David Alan Gilbert, Cedric Le Goater,
	Alex Williamson
On 9/22/2025 9:49 AM, Steve Sistare wrote:
> This patch series adds the live migration cpr-exec mode.
I have received Acks or RB's for most of these, thank you all!
Just a reminder, these patches still need review from Peter and/or Fabiano:
   Patch 5/8: migration: cpr-exec save and load
   Patch 6/8: migration: cpr-exec mode
(And many thanks to Fabiano for reviewing the related qtest patches).
- Steve
> The new user-visible interfaces are:
>    * cpr-exec (MigMode migration parameter)
>    * cpr-exec-command (migration parameter)
> 
> cpr-exec mode is similar in most respects to cpr-transfer mode, with the
> primary difference being that old QEMU directly exec's new QEMU.  The user
> specifies the command to exec new QEMU in the migration parameter
> cpr-exec-command.
> 
> Why?
> 
> In a containerized QEMU environment, cpr-exec reuses an existing QEMU
> container and its assigned resources.  By contrast, cpr-transfer mode
> requires a new container to be created on the same host as the target of
> the CPR operation.  Resources must be reserved for the new container, while
> the old container still reserves resources until the operation completes.
> Avoiding over commitment requires extra work in the management layer.
> This is one reason why a cloud provider may prefer cpr-exec.  A second reason
> is that the container may include agents with their own connections to the
> outside world, and such connections remain intact if the container is reused.
> 
> How?
> 
> cpr-exec preserves descriptors across exec by clearing the CLOEXEC flag,
> and by sending the unique name and value of each descriptor to new QEMU
> via CPR state.
> 
> CPR state cannot be sent over the normal migration channel, because devices
> and backends are created prior to reading the channel, so this mode sends
> CPR state over a second migration channel that is not visible to the user.
> New QEMU reads the second channel prior to creating devices or backends.
> 
> The exec itself is trivial.  After writing to the migration channels, the
> migration code calls a new main-loop hook to perform the exec.
> 
> Example:
> 
> In this example, we simply restart the same version of QEMU, but in
> a real scenario one would use a new QEMU binary path in cpr-exec-command.
> 
>    # qemu-kvm -monitor stdio
>    -object memory-backend-memfd,id=ram0,size=1G
>    -machine memory-backend=ram0 -machine aux-ram-share=on ...
> 
>    QEMU 10.1.50 monitor - type 'help' for more information
>    (qemu) info status
>    VM status: running
>    (qemu) migrate_set_parameter mode cpr-exec
>    (qemu) migrate_set_parameter cpr-exec-command qemu-kvm ... -incoming file:vm.state
>    (qemu) migrate -d file:vm.state
>    (qemu) QEMU 10.1.50 monitor - type 'help' for more information
>    (qemu) info status
>    VM status: running
> 
> Changes in V2:
>    * dropped patch "helper to request exec" and use a BH to exec
>    * used g_shell_parse_argv for cpr-exec-command parameter
>    * fixed check for channel in cpr_state_load
>    * tweaked QAPI docs, developer docs, and code comments
>    * fixed doc: rename cpr-exec-args -> cpr-exec-command
> 
> Steve Sistare (8):
>    migration: multi-mode notifier
>    migration: add cpr_walk_fd
>    oslib: qemu_clear_cloexec
>    migration: cpr-exec-command parameter
>    migration: cpr-exec save and load
>    migration: cpr-exec mode
>    migration: cpr-exec docs
>    vfio: cpr-exec mode
> 
>   docs/devel/migration/CPR.rst   | 106 +++++++++++++++++++++++-
>   qapi/migration.json            |  46 ++++++++++-
>   include/migration/cpr.h        |   9 +++
>   include/migration/misc.h       |  12 +++
>   include/qemu/osdep.h           |   9 +++
>   hw/vfio/container.c            |   3 +-
>   hw/vfio/cpr-iommufd.c          |   3 +-
>   hw/vfio/cpr-legacy.c           |   9 ++-
>   hw/vfio/cpr.c                  |  13 +--
>   migration/cpr-exec.c           | 178 +++++++++++++++++++++++++++++++++++++++++
>   migration/cpr.c                |  41 +++++++++-
>   migration/migration-hmp-cmds.c |  30 +++++++
>   migration/migration.c          |  70 ++++++++++++----
>   migration/options.c            |  14 ++++
>   migration/ram.c                |   1 +
>   migration/vmstate-types.c      |   8 ++
>   system/vl.c                    |   4 +-
>   util/oslib-posix.c             |   9 +++
>   util/oslib-win32.c             |   4 +
>   hmp-commands.hx                |   2 +-
>   migration/meson.build          |   1 +
>   migration/trace-events         |   1 +
>   22 files changed, 538 insertions(+), 35 deletions(-)
>   create mode 100644 migration/cpr-exec.c
> 
> base-commit: e7c1e8043a69c5a8efa39d4f9d111f7c72c076e6
* Re: [PATCH V4 5/8] migration: cpr-exec save and load
  2025-09-25  7:11       ` Cédric Le Goater
  2025-09-25 20:38         ` Steven Sistare
@ 2025-09-30 16:19         ` Peter Xu
  2025-09-30 16:39           ` Steven Sistare
  1 sibling, 1 reply; 30+ messages in thread
From: Peter Xu @ 2025-09-30 16:19 UTC (permalink / raw)
  To: Cédric Le Goater
  Cc: Steven Sistare, qemu-devel, Fabiano Rosas, Markus Armbruster,
	Paolo Bonzini, Eric Blake, Dr. David Alan Gilbert,
	Alex Williamson
On Thu, Sep 25, 2025 at 09:11:33AM +0200, Cédric Le Goater wrote:
> > > That's a short term hack right ? it's not even documented.
> > 
> > It is an implementation detail, known only to the matched saving
> > and loading functions inside qemu.  No one else needs to know, so
> > no documentation.
> 
> ok. Fair enough.
IMHO Cedric's ask is fair.  At the least, people reading the doc may get
confused about why a cpr channel isn't needed for the exec mode in its API.
Could we still add a one-liner to the doc to describe it?  Something that
mentions a temp memfd and passing it over using an environment variable would
help.
Thanks,
-- 
Peter Xu
* Re: [PATCH V4 6/8] migration: cpr-exec mode
  2025-09-22 13:49 ` [PATCH V4 6/8] migration: cpr-exec mode Steve Sistare
  2025-09-22 15:28   ` Cédric Le Goater
@ 2025-09-30 16:39   ` Peter Xu
  2025-09-30 17:18     ` Steven Sistare
  1 sibling, 1 reply; 30+ messages in thread
From: Peter Xu @ 2025-09-30 16:39 UTC (permalink / raw)
  To: Steve Sistare
  Cc: qemu-devel, Fabiano Rosas, Markus Armbruster, Paolo Bonzini,
	Eric Blake, Dr. David Alan Gilbert, Cedric Le Goater,
	Alex Williamson
On Mon, Sep 22, 2025 at 06:49:43AM -0700, Steve Sistare wrote:
> Add the cpr-exec migration mode.  Usage:
>   qemu-system-$arch -machine aux-ram-share=on ...
>   migrate_set_parameter mode cpr-exec
>   migrate_set_parameter cpr-exec-command \
>     <arg1> <arg2> ... -incoming <uri-1>
>   migrate -d <uri-1>
> 
> The migrate command stops the VM, saves state to uri-1,
> directly exec's a new version of QEMU on the same host,
> replacing the original process while retaining its PID, and
> loads state from uri-1.  Guest RAM is preserved in place,
> albeit with new virtual addresses.
> 
> The new QEMU process is started by exec'ing the command
> specified by the @cpr-exec-command parameter.  The first word of
> the command is the binary, and the remaining words are its
> arguments.  The command may be a direct invocation of new QEMU,
> or may be a non-QEMU command that exec's the new QEMU binary.
> 
> This mode creates a second migration channel that is not visible
> to the user.  At the start of migration, old QEMU saves CPR state
> to the second channel, and at the end of migration, it tells the
> main loop to call cpr_exec.  New QEMU loads CPR state early, before
> objects are created.
> 
> Because old QEMU terminates when new QEMU starts, one cannot
> stream data between the two, so uri-1 must be a type,
> such as a file, that accepts all data before old QEMU exits.
> Otherwise, old QEMU may quietly block writing to the channel.
> 
> Memory-backend objects must have the share=on attribute, but
> memory-backend-epc is not supported.  The VM must be started with
> the '-machine aux-ram-share=on' option, which allows anonymous
> memory to be transferred in place to the new process.  The memfds
> are kept open across exec by clearing the close-on-exec flag, their
> values are saved in CPR state, and they are mmap'd in new QEMU.
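The close-on-exec handling described above can be sketched with plain fcntl
calls.  This is a standalone illustration, not the series' qemu_clear_cloexec;
the helper names fd_set_cloexec and fd_is_cloexec are made up for the example:

```c
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <unistd.h>

/* Set or clear FD_CLOEXEC on fd, leaving other descriptor flags alone. */
int fd_set_cloexec(int fd, bool on)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags < 0) {
        return -1;
    }
    flags = on ? (flags | FD_CLOEXEC) : (flags & ~FD_CLOEXEC);
    return fcntl(fd, F_SETFD, flags);
}

/* Report whether fd would be closed by a successful exec. */
bool fd_is_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    assert(flags >= 0);
    return (flags & FD_CLOEXEC) != 0;
}
```

In the patch, preserve_fd clears the flag on each preserved fd just before
exec, and unpreserve_fd sets it again if exec fails or once new QEMU has
loaded CPR state.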
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> Acked-by: Markus Armbruster <armbru@redhat.com>
> ---
>  qapi/migration.json       | 25 +++++++++++++-
>  include/migration/cpr.h   |  1 +
>  migration/cpr-exec.c      | 84 +++++++++++++++++++++++++++++++++++++++++++++++
>  migration/cpr.c           | 28 ++++++++++++++--
>  migration/migration.c     | 10 +++++-
>  migration/ram.c           |  1 +
>  migration/vmstate-types.c |  8 +++++
>  system/vl.c               |  4 ++-
>  migration/trace-events    |  1 +
>  9 files changed, 157 insertions(+), 5 deletions(-)
> 
> diff --git a/qapi/migration.json b/qapi/migration.json
> index 2be8fa1..be0f3fc 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -694,9 +694,32 @@
>  #     until you issue the `migrate-incoming` command.
>  #
>  #     (since 10.0)
> +#
> +# @cpr-exec: The migrate command stops the VM, saves state to the
> +#     migration channel, directly exec's a new version of QEMU on the
> +#     same host, replacing the original process while retaining its
> +#     PID, and loads state from the channel.  Guest RAM is preserved
> +#     in place.  Devices and their pinned pages are also preserved for
> +#     VFIO and IOMMUFD.
> +#
> +#     Old QEMU starts new QEMU by exec'ing the command specified by
> +#     the @cpr-exec-command parameter.  The command may be a direct
> +#     invocation of new QEMU, or may be a wrapper that exec's the new
> +#     QEMU binary.
> +#
> +#     Because old QEMU terminates when new QEMU starts, one cannot
> +#     stream data between the two, so the channel must be a type,
> +#     such as a file, that accepts all data before old QEMU exits.
> +#     Otherwise, old QEMU may quietly block writing to the channel.
> +#
> +#     Memory-backend objects must have the share=on attribute, but
> +#     memory-backend-epc is not supported.  The VM must be started
> +#     with the '-machine aux-ram-share=on' option.
> +#
> +#     (since 10.2)
>  ##
>  { 'enum': 'MigMode',
> -  'data': [ 'normal', 'cpr-reboot', 'cpr-transfer' ] }
> +  'data': [ 'normal', 'cpr-reboot', 'cpr-transfer', 'cpr-exec' ] }
>  
>  ##
>  # @ZeroPageDetection:
> diff --git a/include/migration/cpr.h b/include/migration/cpr.h
> index b84389f..beed392 100644
> --- a/include/migration/cpr.h
> +++ b/include/migration/cpr.h
> @@ -53,6 +53,7 @@ int cpr_get_fd_param(const char *name, const char *fdname, int index,
>  QEMUFile *cpr_transfer_output(MigrationChannel *channel, Error **errp);
>  QEMUFile *cpr_transfer_input(MigrationChannel *channel, Error **errp);
>  
> +void cpr_exec_init(void);
>  QEMUFile *cpr_exec_output(Error **errp);
>  QEMUFile *cpr_exec_input(Error **errp);
>  void cpr_exec_persist_state(QEMUFile *f);
> diff --git a/migration/cpr-exec.c b/migration/cpr-exec.c
> index 2c32e9c..8cf55a3 100644
> --- a/migration/cpr-exec.c
> +++ b/migration/cpr-exec.c
> @@ -6,15 +6,21 @@
>  
>  #include "qemu/osdep.h"
>  #include "qemu/cutils.h"
> +#include "qemu/error-report.h"
>  #include "qemu/memfd.h"
>  #include "qapi/error.h"
> +#include "qapi/type-helpers.h"
>  #include "io/channel-file.h"
>  #include "io/channel-socket.h"
> +#include "block/block-global-state.h"
> +#include "qemu/main-loop.h"
>  #include "migration/cpr.h"
>  #include "migration/qemu-file.h"
> +#include "migration/migration.h"
>  #include "migration/misc.h"
>  #include "migration/vmstate.h"
>  #include "system/runstate.h"
> +#include "trace.h"
>  
>  #define CPR_EXEC_STATE_NAME "QEMU_CPR_EXEC_STATE"
>  
> @@ -92,3 +98,81 @@ QEMUFile *cpr_exec_input(Error **errp)
>      lseek(mfd, 0, SEEK_SET);
>      return qemu_file_new_fd_input(mfd, CPR_EXEC_STATE_NAME);
>  }
> +
> +static bool preserve_fd(int fd)
> +{
> +    qemu_clear_cloexec(fd);
> +    return true;
> +}
> +
> +static bool unpreserve_fd(int fd)
> +{
> +    qemu_set_cloexec(fd);
> +    return true;
> +}
> +
> +static void cpr_exec_cb(void *opaque)
> +{
> +    MigrationState *s = migrate_get_current();
> +    char **argv = strv_from_str_list(s->parameters.cpr_exec_command);
> +    Error *err = NULL;
> +
> +    /*
> +     * Clear the close-on-exec flag for all preserved fd's.  We cannot do so
> +     * earlier because they should not persist across miscellaneous fork and
> +     * exec calls that are performed during normal operation.
> +     */
> +    cpr_walk_fd(preserve_fd);
> +
> +    trace_cpr_exec();
> +    execvp(argv[0], argv);
> +
> +    /*
> +     * exec should only fail if argv[0] is bogus, or has a permissions problem,
> +     * or the system is very short on resources.
> +     */
> +    error_setg_errno(&err, errno, "execvp %s failed", argv[0]);
> +    g_strfreev(argv);
> +    cpr_walk_fd(unpreserve_fd);
> +
> +    error_report_err(error_copy(err));
> +    migrate_set_state(&s->state, s->state, MIGRATION_STATUS_FAILED);
I believe this is the only place where the state machine can go from
COMPLETED->FAILED.  It's pretty hacky.  Maybe add a quick comment?
> +    migrate_set_error(s, err);
> +
> +    migration_call_notifiers(s, MIG_EVENT_PRECOPY_FAILED, NULL);
> +
> +    err = NULL;
> +    if (!migration_block_activate(&err)) {
> +        /* error was already reported */
> +        return;
> +    }
> +
> +    if (runstate_is_live(s->vm_old_state)) {
> +        vm_start();
> +    }
We have rollback logic in migration_iteration_finish().  Make a small
helper and reuse the code?
> +}
> +
> +static int cpr_exec_notifier(NotifierWithReturn *notifier, MigrationEvent *e,
> +                             Error **errp)
> +{
> +    MigrationState *s = migrate_get_current();
> +
> +    if (e->type == MIG_EVENT_PRECOPY_DONE) {
> +        QEMUBH *cpr_exec_bh = qemu_bh_new(cpr_exec_cb, NULL);
> +        assert(s->state == MIGRATION_STATUS_COMPLETED);
> +        qemu_bh_schedule(cpr_exec_bh);
> +        qemu_notify_event();
> +
Newline can be dropped.
> +    } else if (e->type == MIG_EVENT_PRECOPY_FAILED) {
> +        cpr_exec_unpersist_state();
> +    }
> +    return 0;
> +}
> +
> +void cpr_exec_init(void)
> +{
> +    static NotifierWithReturn exec_notifier;
> +
> +    migration_add_notifier_mode(&exec_notifier, cpr_exec_notifier,
> +                                MIG_MODE_CPR_EXEC);
> +}
> diff --git a/migration/cpr.c b/migration/cpr.c
> index d3e370e..eea3773 100644
> --- a/migration/cpr.c
> +++ b/migration/cpr.c
> @@ -185,6 +185,8 @@ int cpr_state_save(MigrationChannel *channel, Error **errp)
>      if (mode == MIG_MODE_CPR_TRANSFER) {
>          g_assert(channel);
>          f = cpr_transfer_output(channel, errp);
> +    } else if (mode == MIG_MODE_CPR_EXEC) {
> +        f = cpr_exec_output(errp);
>      } else {
>          return 0;
>      }
> @@ -202,6 +204,10 @@ int cpr_state_save(MigrationChannel *channel, Error **errp)
>          return ret;
>      }
>  
> +    if (migrate_mode() == MIG_MODE_CPR_EXEC) {
> +        cpr_exec_persist_state(f);
> +    }
> +
>      /*
>       * Close the socket only partially so we can later detect when the other
>       * end closes by getting a HUP event.
> @@ -213,6 +219,12 @@ int cpr_state_save(MigrationChannel *channel, Error **errp)
>      return 0;
>  }
>  
> +static bool unpreserve_fd(int fd)
> +{
> +    qemu_set_cloexec(fd);
> +    return true;
> +}
Is this function defined twice?
> +
>  int cpr_state_load(MigrationChannel *channel, Error **errp)
>  {
>      int ret;
> @@ -220,7 +232,13 @@ int cpr_state_load(MigrationChannel *channel, Error **errp)
>      QEMUFile *f;
>      MigMode mode = 0;
>  
> -    if (channel) {
> +    if (cpr_exec_has_state()) {
> +        mode = MIG_MODE_CPR_EXEC;
> +        f = cpr_exec_input(errp);
> +        if (channel) {
> +            warn_report("ignoring cpr channel for migration mode cpr-exec");
> +        }
> +    } else if (channel) {
>          mode = MIG_MODE_CPR_TRANSFER;
>          cpr_set_incoming_mode(mode);
>          f = cpr_transfer_input(channel, errp);
> @@ -232,6 +250,7 @@ int cpr_state_load(MigrationChannel *channel, Error **errp)
>      }
>  
>      trace_cpr_state_load(MigMode_str(mode));
> +    cpr_set_incoming_mode(mode);
>  
>      v = qemu_get_be32(f);
>      if (v != QEMU_CPR_FILE_MAGIC) {
> @@ -253,6 +272,11 @@ int cpr_state_load(MigrationChannel *channel, Error **errp)
>          return ret;
>      }
>  
> +    if (migrate_mode() == MIG_MODE_CPR_EXEC) {
> +        /* Set cloexec to prevent fd leaks from fork until the next cpr-exec */
> +        cpr_walk_fd(unpreserve_fd);
> +    }
> +
>      /*
>       * Let the caller decide when to close the socket (and generate a HUP event
>       * for the sending side).
> @@ -273,7 +297,7 @@ void cpr_state_close(void)
>  bool cpr_incoming_needed(void *opaque)
>  {
>      MigMode mode = migrate_mode();
> -    return mode == MIG_MODE_CPR_TRANSFER;
> +    return mode == MIG_MODE_CPR_TRANSFER || mode == MIG_MODE_CPR_EXEC;
>  }
>  
>  /*
> diff --git a/migration/migration.c b/migration/migration.c
> index 08a98f7..2515bec 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -333,6 +333,7 @@ void migration_object_init(void)
>  
>      ram_mig_init();
>      dirty_bitmap_mig_init();
> +    cpr_exec_init();
>  
>      /* Initialize cpu throttle timers */
>      cpu_throttle_init();
> @@ -1796,7 +1797,8 @@ bool migrate_mode_is_cpr(MigrationState *s)
>  {
>      MigMode mode = s->parameters.mode;
>      return mode == MIG_MODE_CPR_REBOOT ||
> -           mode == MIG_MODE_CPR_TRANSFER;
> +           mode == MIG_MODE_CPR_TRANSFER ||
> +           mode == MIG_MODE_CPR_EXEC;
>  }
>  
>  int migrate_init(MigrationState *s, Error **errp)
> @@ -2145,6 +2147,12 @@ static bool migrate_prepare(MigrationState *s, bool resume, Error **errp)
>          return false;
>      }
>  
> +    if (migrate_mode() == MIG_MODE_CPR_EXEC &&
> +        !s->parameters.has_cpr_exec_command) {
> +        error_setg(errp, "cpr-exec mode requires setting cpr-exec-command");
> +        return false;
> +    }
> +
>      if (migration_is_blocked(errp)) {
>          return false;
>      }
> diff --git a/migration/ram.c b/migration/ram.c
> index 7208bc1..6730a41 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -228,6 +228,7 @@ bool migrate_ram_is_ignored(RAMBlock *block)
>      MigMode mode = migrate_mode();
>      return !qemu_ram_is_migratable(block) ||
>             mode == MIG_MODE_CPR_TRANSFER ||
> +           mode == MIG_MODE_CPR_EXEC ||
>             (migrate_ignore_shared() && qemu_ram_is_shared(block)
>                                      && qemu_ram_is_named_file(block));
>  }
> diff --git a/migration/vmstate-types.c b/migration/vmstate-types.c
> index 741a588..1aa0573 100644
> --- a/migration/vmstate-types.c
> +++ b/migration/vmstate-types.c
> @@ -321,6 +321,10 @@ static int get_fd(QEMUFile *f, void *pv, size_t size,
>                    const VMStateField *field)
>  {
>      int32_t *v = pv;
> +    if (migrate_mode() == MIG_MODE_CPR_EXEC) {
> +        qemu_get_sbe32s(f, v);
> +        return 0;
> +    }
>      *v = qemu_file_get_fd(f);
>      return 0;
>  }
> @@ -329,6 +333,10 @@ static int put_fd(QEMUFile *f, void *pv, size_t size,
>                    const VMStateField *field, JSONWriter *vmdesc)
>  {
>      int32_t *v = pv;
> +    if (migrate_mode() == MIG_MODE_CPR_EXEC) {
> +        qemu_put_sbe32s(f, v);
> +        return 0;
> +    }
>      return qemu_file_put_fd(f, *v);
>  }
>  
> diff --git a/system/vl.c b/system/vl.c
> index 4c24073..f395d04 100644
> --- a/system/vl.c
> +++ b/system/vl.c
> @@ -3867,6 +3867,8 @@ void qemu_init(int argc, char **argv)
>      }
>      qemu_init_displays();
>      accel_setup_post(current_machine);
> -    os_setup_post();
> +    if (migrate_mode() != MIG_MODE_CPR_EXEC) {
> +        os_setup_post();
> +    }
>      resume_mux_open();
>  }
> diff --git a/migration/trace-events b/migration/trace-events
> index 706db97..e8edd1f 100644
> --- a/migration/trace-events
> +++ b/migration/trace-events
> @@ -354,6 +354,7 @@ cpr_state_save(const char *mode) "%s mode"
>  cpr_state_load(const char *mode) "%s mode"
>  cpr_transfer_input(const char *path) "%s"
>  cpr_transfer_output(const char *path) "%s"
> +cpr_exec(void) ""
>  
>  # block-dirty-bitmap.c
>  send_bitmap_header_enter(void) ""
> -- 
> 1.8.3.1
> 
-- 
Peter Xu
* Re: [PATCH V4 5/8] migration: cpr-exec save and load
  2025-09-30 16:19         ` Peter Xu
@ 2025-09-30 16:39           ` Steven Sistare
  0 siblings, 0 replies; 30+ messages in thread
From: Steven Sistare @ 2025-09-30 16:39 UTC (permalink / raw)
  To: Peter Xu, Cédric Le Goater
  Cc: qemu-devel, Fabiano Rosas, Markus Armbruster, Paolo Bonzini,
	Eric Blake, Dr. David Alan Gilbert, Alex Williamson
On 9/30/2025 12:19 PM, Peter Xu wrote:
> On Thu, Sep 25, 2025 at 09:11:33AM +0200, Cédric Le Goater wrote:
>>>> That's a short term hack right ? it's not even documented.
>>>
>>> It is an implementation detail, known only to the matched saving
>>> and loading functions inside qemu.  No one else needs to know, so
>>> no documentation.
>>
>> ok. Fair enough.
> 
> IMHO Cedric's ask is fair.  At least when people reading the doc may get
> confused of why cpr channel isn't needed for the exec mode in its API.
> 
> Could we still add one liner into the doc to describe it?  Something that
> would mention a temp memfd and passing it over using environment vars would
> help.
Sure.  I will add to CPR.rst:
This mode does not require a channel of type ``cpr``.  The information
that is passed over that channel for cpr-transfer mode is instead
serialized to a memfd, and the number of the fd is saved in the
QEMU_CPR_EXEC_STATE environment variable during the exec of new QEMU.
New QEMU then mmaps the memfd.
- Steve
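The hand-off described in the CPR.rst text above can be sketched as a
standalone C program fragment.  This is an illustration under assumptions, not
the series' code: the helper names state_persist and state_load are
hypothetical, the state is an opaque byte buffer, and the loader reads the
memfd after an lseek (as cpr_exec_input does in the V4 patch) rather than
mapping it.  It assumes Linux with glibc 2.27 or later:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#define STATE_ENV "QEMU_CPR_EXEC_STATE"   /* variable name from the patch */

/* Pre-exec side: write state into a memfd and publish its fd number. */
int state_persist(const void *buf, size_t len)
{
    char val[16];
    int mfd = memfd_create(STATE_ENV, 0);   /* no MFD_CLOEXEC: survives exec */

    if (mfd < 0) {
        return -1;
    }
    if (write(mfd, buf, len) != (ssize_t)len) {
        close(mfd);
        return -1;
    }
    snprintf(val, sizeof(val), "%d", mfd);
    if (setenv(STATE_ENV, val, 1) < 0) {
        close(mfd);
        return -1;
    }
    return mfd;
}

/* Post-exec side: find the memfd, rewind it, and read the state back. */
ssize_t state_load(void *buf, size_t len)
{
    const char *val = getenv(STATE_ENV);
    char *end;
    long fd;
    ssize_t n;

    if (!val) {
        return -1;
    }
    errno = 0;
    fd = strtol(val, &end, 10);
    if (errno || end == val || *end != '\0' || fd < 0) {
        return -1;
    }
    /* Parse first: unsetenv may invalidate the string getenv returned. */
    unsetenv(STATE_ENV);
    if (lseek((int)fd, 0, SEEK_SET) < 0) {
        close((int)fd);
        return -1;
    }
    n = read((int)fd, buf, len);
    close((int)fd);
    return n;
}
```

In real use state_persist runs in old QEMU before exec and state_load runs
early in new QEMU.  The environment variable carries only the fd number, and
everything else travels in the memfd.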
* Re: [PATCH V4 0/8] Live update: cpr-exec
  2025-09-30 15:28 ` [PATCH V4 0/8] Live update: cpr-exec Steven Sistare
@ 2025-09-30 16:42   ` Peter Xu
  2025-09-30 16:52     ` Steven Sistare
  2025-09-30 19:49     ` Steven Sistare
  0 siblings, 2 replies; 30+ messages in thread
From: Peter Xu @ 2025-09-30 16:42 UTC (permalink / raw)
  To: Steven Sistare
  Cc: qemu-devel, Fabiano Rosas, Markus Armbruster, Paolo Bonzini,
	Eric Blake, Dr. David Alan Gilbert, Cedric Le Goater,
	Alex Williamson
On Tue, Sep 30, 2025 at 11:28:58AM -0400, Steven Sistare wrote:
> Just a reminder, these patches still need review from Peter and/or Fabiano:
> 
>   Patch 5/8: migration: cpr-exec save and load
>   Patch 6/8: migration: cpr-exec mode
I read them and left comments where I had any.  For patch 5 please
remember to include the header that Cedric pointed out, because it does
break the builds.
Other than that the series looks OK.  I suggest that when you repost, you
include the testcases together.  I saw Fabiano queued most of the test
patches, but it shouldn't be an issue no matter which lands first.
Thanks,
-- 
Peter Xu
* Re: [PATCH V4 0/8] Live update: cpr-exec
  2025-09-30 16:42   ` Peter Xu
@ 2025-09-30 16:52     ` Steven Sistare
  2025-09-30 19:49     ` Steven Sistare
  1 sibling, 0 replies; 30+ messages in thread
From: Steven Sistare @ 2025-09-30 16:52 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Fabiano Rosas, Markus Armbruster, Paolo Bonzini,
	Eric Blake, Dr. David Alan Gilbert, Cedric Le Goater,
	Alex Williamson
On 9/30/2025 12:42 PM, Peter Xu wrote:
> On Tue, Sep 30, 2025 at 11:28:58AM -0400, Steven Sistare wrote:
>> Just a reminder, these patches still need review from Peter and/or Fabiano:
>>
>>    Patch 5/8: migration: cpr-exec save and load
>>    Patch 6/8: migration: cpr-exec mode
> 
> I read them and left some comments where I have.  For patch 5 please
> remember to include the header that Cedric pointed out, because it does
> break the builds.
> 
> Other than that the series looks OK.  I suggest when you repost, have the
> testcases be together.  I saw Fabiano queued most of the test patches, but
> it shouldn't be an issue no matter which lands first.
Thanks very much Peter.  I will finish responding to your comments and then post V5.
- Steve
^ permalink raw reply	[flat|nested] 30+ messages in thread
* Re: [PATCH V4 6/8] migration: cpr-exec mode
  2025-09-30 16:39   ` Peter Xu
@ 2025-09-30 17:18     ` Steven Sistare
  2025-09-30 18:20       ` Peter Xu
  0 siblings, 1 reply; 30+ messages in thread
From: Steven Sistare @ 2025-09-30 17:18 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Fabiano Rosas, Markus Armbruster, Paolo Bonzini,
	Eric Blake, Dr. David Alan Gilbert, Cedric Le Goater,
	Alex Williamson
On 9/30/2025 12:39 PM, Peter Xu wrote:
> On Mon, Sep 22, 2025 at 06:49:43AM -0700, Steve Sistare wrote:
>> Add the cpr-exec migration mode.  Usage:
>>    qemu-system-$arch -machine aux-ram-share=on ...
>>    migrate_set_parameter mode cpr-exec
>>    migrate_set_parameter cpr-exec-command \
>>      <arg1> <arg2> ... -incoming <uri-1> \
>>    migrate -d <uri-1>
>>
>> The migrate command stops the VM, saves state to uri-1,
>> directly exec's a new version of QEMU on the same host,
>> replacing the original process while retaining its PID, and
>> loads state from uri-1.  Guest RAM is preserved in place,
>> albeit with new virtual addresses.
>>
>> The new QEMU process is started by exec'ing the command
>> specified by the @cpr-exec-command parameter.  The first word of
>> the command is the binary, and the remaining words are its
>> arguments.  The command may be a direct invocation of new QEMU,
>> or may be a non-QEMU command that exec's the new QEMU binary.
>>
>> This mode creates a second migration channel that is not visible
>> to the user.  At the start of migration, old QEMU saves CPR state
>> to the second channel, and at the end of migration, it tells the
>> main loop to call cpr_exec.  New QEMU loads CPR state early, before
>> objects are created.
>>
>> Because old QEMU terminates when new QEMU starts, one cannot
>> stream data between the two, so uri-1 must be a type,
>> such as a file, that accepts all data before old QEMU exits.
>> Otherwise, old QEMU may quietly block writing to the channel.
>>
>> Memory-backend objects must have the share=on attribute, but
>> memory-backend-epc is not supported.  The VM must be started with
>> the '-machine aux-ram-share=on' option, which allows anonymous
>> memory to be transferred in place to the new process.  The memfds
>> are kept open across exec by clearing the close-on-exec flag, their
>> values are saved in CPR state, and they are mmap'd in new QEMU.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> Acked-by: Markus Armbruster <armbru@redhat.com>
>> ---
>>   qapi/migration.json       | 25 +++++++++++++-
>>   include/migration/cpr.h   |  1 +
>>   migration/cpr-exec.c      | 84 +++++++++++++++++++++++++++++++++++++++++++++++
>>   migration/cpr.c           | 28 ++++++++++++++--
>>   migration/migration.c     | 10 +++++-
>>   migration/ram.c           |  1 +
>>   migration/vmstate-types.c |  8 +++++
>>   system/vl.c               |  4 ++-
>>   migration/trace-events    |  1 +
>>   9 files changed, 157 insertions(+), 5 deletions(-)
>>
>> diff --git a/qapi/migration.json b/qapi/migration.json
>> index 2be8fa1..be0f3fc 100644
>> --- a/qapi/migration.json
>> +++ b/qapi/migration.json
>> @@ -694,9 +694,32 @@
>>   #     until you issue the `migrate-incoming` command.
>>   #
>>   #     (since 10.0)
>> +#
>> +# @cpr-exec: The migrate command stops the VM, saves state to the
>> +#     migration channel, directly exec's a new version of QEMU on the
>> +#     same host, replacing the original process while retaining its
>> +#     PID, and loads state from the channel.  Guest RAM is preserved
>> +#     in place.  Devices and their pinned pages are also preserved for
>> +#     VFIO and IOMMUFD.
>> +#
>> +#     Old QEMU starts new QEMU by exec'ing the command specified by
>> +#     the @cpr-exec-command parameter.  The command may be a direct
>> +#     invocation of new QEMU, or may be a wrapper that exec's the new
>> +#     QEMU binary.
>> +#
>> +#     Because old QEMU terminates when new QEMU starts, one cannot
>> +#     stream data between the two, so the channel must be a type,
>> +#     such as a file, that accepts all data before old QEMU exits.
>> +#     Otherwise, old QEMU may quietly block writing to the channel.
>> +#
>> +#     Memory-backend objects must have the share=on attribute, but
>> +#     memory-backend-epc is not supported.  The VM must be started
>> +#     with the '-machine aux-ram-share=on' option.
>> +#
>> +#     (since 10.2)
>>   ##
>>   { 'enum': 'MigMode',
>> -  'data': [ 'normal', 'cpr-reboot', 'cpr-transfer' ] }
>> +  'data': [ 'normal', 'cpr-reboot', 'cpr-transfer', 'cpr-exec' ] }
>>   
>>   ##
>>   # @ZeroPageDetection:
>> diff --git a/include/migration/cpr.h b/include/migration/cpr.h
>> index b84389f..beed392 100644
>> --- a/include/migration/cpr.h
>> +++ b/include/migration/cpr.h
>> @@ -53,6 +53,7 @@ int cpr_get_fd_param(const char *name, const char *fdname, int index,
>>   QEMUFile *cpr_transfer_output(MigrationChannel *channel, Error **errp);
>>   QEMUFile *cpr_transfer_input(MigrationChannel *channel, Error **errp);
>>   
>> +void cpr_exec_init(void);
>>   QEMUFile *cpr_exec_output(Error **errp);
>>   QEMUFile *cpr_exec_input(Error **errp);
>>   void cpr_exec_persist_state(QEMUFile *f);
>> diff --git a/migration/cpr-exec.c b/migration/cpr-exec.c
>> index 2c32e9c..8cf55a3 100644
>> --- a/migration/cpr-exec.c
>> +++ b/migration/cpr-exec.c
>> @@ -6,15 +6,21 @@
>>   
>>   #include "qemu/osdep.h"
>>   #include "qemu/cutils.h"
>> +#include "qemu/error-report.h"
>>   #include "qemu/memfd.h"
>>   #include "qapi/error.h"
>> +#include "qapi/type-helpers.h"
>>   #include "io/channel-file.h"
>>   #include "io/channel-socket.h"
>> +#include "block/block-global-state.h"
>> +#include "qemu/main-loop.h"
>>   #include "migration/cpr.h"
>>   #include "migration/qemu-file.h"
>> +#include "migration/migration.h"
>>   #include "migration/misc.h"
>>   #include "migration/vmstate.h"
>>   #include "system/runstate.h"
>> +#include "trace.h"
>>   
>>   #define CPR_EXEC_STATE_NAME "QEMU_CPR_EXEC_STATE"
>>   
>> @@ -92,3 +98,81 @@ QEMUFile *cpr_exec_input(Error **errp)
>>       lseek(mfd, 0, SEEK_SET);
>>       return qemu_file_new_fd_input(mfd, CPR_EXEC_STATE_NAME);
>>   }
>> +
>> +static bool preserve_fd(int fd)
>> +{
>> +    qemu_clear_cloexec(fd);
>> +    return true;
>> +}
>> +
>> +static bool unpreserve_fd(int fd)
>> +{
>> +    qemu_set_cloexec(fd);
>> +    return true;
>> +}
>> +
>> +static void cpr_exec_cb(void *opaque)
>> +{
>> +    MigrationState *s = migrate_get_current();
>> +    char **argv = strv_from_str_list(s->parameters.cpr_exec_command);
>> +    Error *err = NULL;
>> +
>> +    /*
>> +     * Clear the close-on-exec flag for all preserved fd's.  We cannot do so
>> +     * earlier because they should not persist across miscellaneous fork and
>> +     * exec calls that are performed during normal operation.
>> +     */
>> +    cpr_walk_fd(preserve_fd);
>> +
>> +    trace_cpr_exec();
>> +    execvp(argv[0], argv);
>> +
>> +    /*
>> +     * exec should only fail if argv[0] is bogus, or has a permissions problem,
>> +     * or the system is very short on resources.
>> +     */
>> +    g_strfreev(argv);
>> +    cpr_walk_fd(unpreserve_fd);
>> +
>> +    error_setg_errno(&err, errno, "execvp %s failed", argv[0]);
>> +    error_report_err(error_copy(err));
>> +    migrate_set_state(&s->state, s->state, MIGRATION_STATUS_FAILED);
> 
> I believe this is the only place we can have the state machine from
> COMPLETED->FAILED.  It's pretty hacky.  Maybe add a quick comment?
OK.
>> +    migrate_set_error(s, err);
>> +
>> +    migration_call_notifiers(s, MIG_EVENT_PRECOPY_FAILED, NULL);
>> +
>> +    err = NULL;
>> +    if (!migration_block_activate(&err)) {
>> +        /* error was already reported */
>> +        return;
>> +    }
>> +
>> +    if (runstate_is_live(s->vm_old_state)) {
>> +        vm_start();
>> +    }
> 
> We have rollback logic in migration_iteration_finish().  Make a small
> helper and reuse the code?
Hmm.  There is some overlap, but also subtle differences.  For so little
code, it does not feel worth any risk of regression (or worth the time to
test and verify all conditions).
>> +}
>> +
>> +static int cpr_exec_notifier(NotifierWithReturn *notifier, MigrationEvent *e,
>> +                             Error **errp)
>> +{
>> +    MigrationState *s = migrate_get_current();
>> +
>> +    if (e->type == MIG_EVENT_PRECOPY_DONE) {
>> +        QEMUBH *cpr_exec_bh = qemu_bh_new(cpr_exec_cb, NULL);
>> +        assert(s->state == MIGRATION_STATUS_COMPLETED);
>> +        qemu_bh_schedule(cpr_exec_bh);
>> +        qemu_notify_event();
>> +
> 
> Newline can be dropped.
OK.
>> +    } else if (e->type == MIG_EVENT_PRECOPY_FAILED) {
>> +        cpr_exec_unpersist_state();
>> +    }
>> +    return 0;
>> +}
>> +
>> +void cpr_exec_init(void)
>> +{
>> +    static NotifierWithReturn exec_notifier;
>> +
>> +    migration_add_notifier_mode(&exec_notifier, cpr_exec_notifier,
>> +                                MIG_MODE_CPR_EXEC);
>> +}
>> diff --git a/migration/cpr.c b/migration/cpr.c
>> index d3e370e..eea3773 100644
>> --- a/migration/cpr.c
>> +++ b/migration/cpr.c
>> @@ -185,6 +185,8 @@ int cpr_state_save(MigrationChannel *channel, Error **errp)
>>       if (mode == MIG_MODE_CPR_TRANSFER) {
>>           g_assert(channel);
>>           f = cpr_transfer_output(channel, errp);
>> +    } else if (mode == MIG_MODE_CPR_EXEC) {
>> +        f = cpr_exec_output(errp);
>>       } else {
>>           return 0;
>>       }
>> @@ -202,6 +204,10 @@ int cpr_state_save(MigrationChannel *channel, Error **errp)
>>           return ret;
>>       }
>>   
>> +    if (migrate_mode() == MIG_MODE_CPR_EXEC) {
>> +        cpr_exec_persist_state(f);
>> +    }
>> +
>>       /*
>>        * Close the socket only partially so we can later detect when the other
>>        * end closes by getting a HUP event.
>> @@ -213,6 +219,12 @@ int cpr_state_save(MigrationChannel *channel, Error **errp)
>>       return 0;
>>   }
>>   
>> +static bool unpreserve_fd(int fd)
>> +{
>> +    qemu_set_cloexec(fd);
>> +    return true;
>> +}
> 
> Is this function defined twice?
Yes, since it is tiny.  I judged that defining this small helper twice, near each
of its call sites, was better for the reader.
- Steve
>> +
>>   int cpr_state_load(MigrationChannel *channel, Error **errp)
>>   {
>>       int ret;
>> @@ -220,7 +232,13 @@ int cpr_state_load(MigrationChannel *channel, Error **errp)
>>       QEMUFile *f;
>>       MigMode mode = 0;
>>   
>> -    if (channel) {
>> +    if (cpr_exec_has_state()) {
>> +        mode = MIG_MODE_CPR_EXEC;
>> +        f = cpr_exec_input(errp);
>> +        if (channel) {
>> +            warn_report("ignoring cpr channel for migration mode cpr-exec");
>> +        }
>> +    } else if (channel) {
>>           mode = MIG_MODE_CPR_TRANSFER;
>>           cpr_set_incoming_mode(mode);
>>           f = cpr_transfer_input(channel, errp);
>> @@ -232,6 +250,7 @@ int cpr_state_load(MigrationChannel *channel, Error **errp)
>>       }
>>   
>>       trace_cpr_state_load(MigMode_str(mode));
>> +    cpr_set_incoming_mode(mode);
>>   
>>       v = qemu_get_be32(f);
>>       if (v != QEMU_CPR_FILE_MAGIC) {
>> @@ -253,6 +272,11 @@ int cpr_state_load(MigrationChannel *channel, Error **errp)
>>           return ret;
>>       }
>>   
>> +    if (migrate_mode() == MIG_MODE_CPR_EXEC) {
>> +        /* Set cloexec to prevent fd leaks from fork until the next cpr-exec */
>> +        cpr_walk_fd(unpreserve_fd);
>> +    }
>> +
>>       /*
>>        * Let the caller decide when to close the socket (and generate a HUP event
>>        * for the sending side).
>> @@ -273,7 +297,7 @@ void cpr_state_close(void)
>>   bool cpr_incoming_needed(void *opaque)
>>   {
>>       MigMode mode = migrate_mode();
>> -    return mode == MIG_MODE_CPR_TRANSFER;
>> +    return mode == MIG_MODE_CPR_TRANSFER || mode == MIG_MODE_CPR_EXEC;
>>   }
>>   
>>   /*
>> diff --git a/migration/migration.c b/migration/migration.c
>> index 08a98f7..2515bec 100644
>> --- a/migration/migration.c
>> +++ b/migration/migration.c
>> @@ -333,6 +333,7 @@ void migration_object_init(void)
>>   
>>       ram_mig_init();
>>       dirty_bitmap_mig_init();
>> +    cpr_exec_init();
>>   
>>       /* Initialize cpu throttle timers */
>>       cpu_throttle_init();
>> @@ -1796,7 +1797,8 @@ bool migrate_mode_is_cpr(MigrationState *s)
>>   {
>>       MigMode mode = s->parameters.mode;
>>       return mode == MIG_MODE_CPR_REBOOT ||
>> -           mode == MIG_MODE_CPR_TRANSFER;
>> +           mode == MIG_MODE_CPR_TRANSFER ||
>> +           mode == MIG_MODE_CPR_EXEC;
>>   }
>>   
>>   int migrate_init(MigrationState *s, Error **errp)
>> @@ -2145,6 +2147,12 @@ static bool migrate_prepare(MigrationState *s, bool resume, Error **errp)
>>           return false;
>>       }
>>   
>> +    if (migrate_mode() == MIG_MODE_CPR_EXEC &&
>> +        !s->parameters.has_cpr_exec_command) {
>> +        error_setg(errp, "cpr-exec mode requires setting cpr-exec-command");
>> +        return false;
>> +    }
>> +
>>       if (migration_is_blocked(errp)) {
>>           return false;
>>       }
>> diff --git a/migration/ram.c b/migration/ram.c
>> index 7208bc1..6730a41 100644
>> --- a/migration/ram.c
>> +++ b/migration/ram.c
>> @@ -228,6 +228,7 @@ bool migrate_ram_is_ignored(RAMBlock *block)
>>       MigMode mode = migrate_mode();
>>       return !qemu_ram_is_migratable(block) ||
>>              mode == MIG_MODE_CPR_TRANSFER ||
>> +           mode == MIG_MODE_CPR_EXEC ||
>>              (migrate_ignore_shared() && qemu_ram_is_shared(block)
>>                                       && qemu_ram_is_named_file(block));
>>   }
>> diff --git a/migration/vmstate-types.c b/migration/vmstate-types.c
>> index 741a588..1aa0573 100644
>> --- a/migration/vmstate-types.c
>> +++ b/migration/vmstate-types.c
>> @@ -321,6 +321,10 @@ static int get_fd(QEMUFile *f, void *pv, size_t size,
>>                     const VMStateField *field)
>>   {
>>       int32_t *v = pv;
>> +    if (migrate_mode() == MIG_MODE_CPR_EXEC) {
>> +        qemu_get_sbe32s(f, v);
>> +        return 0;
>> +    }
>>       *v = qemu_file_get_fd(f);
>>       return 0;
>>   }
>> @@ -329,6 +333,10 @@ static int put_fd(QEMUFile *f, void *pv, size_t size,
>>                     const VMStateField *field, JSONWriter *vmdesc)
>>   {
>>       int32_t *v = pv;
>> +    if (migrate_mode() == MIG_MODE_CPR_EXEC) {
>> +        qemu_put_sbe32s(f, v);
>> +        return 0;
>> +    }
>>       return qemu_file_put_fd(f, *v);
>>   }
>>   
>> diff --git a/system/vl.c b/system/vl.c
>> index 4c24073..f395d04 100644
>> --- a/system/vl.c
>> +++ b/system/vl.c
>> @@ -3867,6 +3867,8 @@ void qemu_init(int argc, char **argv)
>>       }
>>       qemu_init_displays();
>>       accel_setup_post(current_machine);
>> -    os_setup_post();
>> +    if (migrate_mode() != MIG_MODE_CPR_EXEC) {
>> +        os_setup_post();
>> +    }
>>       resume_mux_open();
>>   }
>> diff --git a/migration/trace-events b/migration/trace-events
>> index 706db97..e8edd1f 100644
>> --- a/migration/trace-events
>> +++ b/migration/trace-events
>> @@ -354,6 +354,7 @@ cpr_state_save(const char *mode) "%s mode"
>>   cpr_state_load(const char *mode) "%s mode"
>>   cpr_transfer_input(const char *path) "%s"
>>   cpr_transfer_output(const char *path) "%s"
>> +cpr_exec(void) ""
>>   
>>   # block-dirty-bitmap.c
>>   send_bitmap_header_enter(void) ""
>> -- 
>> 1.8.3.1
>>
> 
^ permalink raw reply	[flat|nested] 30+ messages in thread
* Re: [PATCH V4 6/8] migration: cpr-exec mode
  2025-09-30 17:18     ` Steven Sistare
@ 2025-09-30 18:20       ` Peter Xu
  2025-09-30 18:29         ` Steven Sistare
  0 siblings, 1 reply; 30+ messages in thread
From: Peter Xu @ 2025-09-30 18:20 UTC (permalink / raw)
  To: Steven Sistare
  Cc: qemu-devel, Fabiano Rosas, Markus Armbruster, Paolo Bonzini,
	Eric Blake, Dr. David Alan Gilbert, Cedric Le Goater,
	Alex Williamson
On Tue, Sep 30, 2025 at 01:18:34PM -0400, Steven Sistare wrote:
> On 9/30/2025 12:39 PM, Peter Xu wrote:
> > On Mon, Sep 22, 2025 at 06:49:43AM -0700, Steve Sistare wrote:
> > > Add the cpr-exec migration mode.  Usage:
> > >    qemu-system-$arch -machine aux-ram-share=on ...
> > >    migrate_set_parameter mode cpr-exec
> > >    migrate_set_parameter cpr-exec-command \
> > >      <arg1> <arg2> ... -incoming <uri-1> \
> > >    migrate -d <uri-1>
> > > 
> > > The migrate command stops the VM, saves state to uri-1,
> > > directly exec's a new version of QEMU on the same host,
> > > replacing the original process while retaining its PID, and
> > > loads state from uri-1.  Guest RAM is preserved in place,
> > > albeit with new virtual addresses.
> > > 
> > > The new QEMU process is started by exec'ing the command
> > > specified by the @cpr-exec-command parameter.  The first word of
> > > the command is the binary, and the remaining words are its
> > > arguments.  The command may be a direct invocation of new QEMU,
> > > or may be a non-QEMU command that exec's the new QEMU binary.
> > > 
> > > This mode creates a second migration channel that is not visible
> > > to the user.  At the start of migration, old QEMU saves CPR state
> > > to the second channel, and at the end of migration, it tells the
> > > main loop to call cpr_exec.  New QEMU loads CPR state early, before
> > > objects are created.
> > > 
> > > Because old QEMU terminates when new QEMU starts, one cannot
> > > stream data between the two, so uri-1 must be a type,
> > > such as a file, that accepts all data before old QEMU exits.
> > > Otherwise, old QEMU may quietly block writing to the channel.
> > > 
> > > Memory-backend objects must have the share=on attribute, but
> > > memory-backend-epc is not supported.  The VM must be started with
> > > the '-machine aux-ram-share=on' option, which allows anonymous
> > > memory to be transferred in place to the new process.  The memfds
> > > are kept open across exec by clearing the close-on-exec flag, their
> > > values are saved in CPR state, and they are mmap'd in new QEMU.
> > > 
> > > Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> > > Acked-by: Markus Armbruster <armbru@redhat.com>
> > > ---
> > >   qapi/migration.json       | 25 +++++++++++++-
> > >   include/migration/cpr.h   |  1 +
> > >   migration/cpr-exec.c      | 84 +++++++++++++++++++++++++++++++++++++++++++++++
> > >   migration/cpr.c           | 28 ++++++++++++++--
> > >   migration/migration.c     | 10 +++++-
> > >   migration/ram.c           |  1 +
> > >   migration/vmstate-types.c |  8 +++++
> > >   system/vl.c               |  4 ++-
> > >   migration/trace-events    |  1 +
> > >   9 files changed, 157 insertions(+), 5 deletions(-)
> > > 
> > > diff --git a/qapi/migration.json b/qapi/migration.json
> > > index 2be8fa1..be0f3fc 100644
> > > --- a/qapi/migration.json
> > > +++ b/qapi/migration.json
> > > @@ -694,9 +694,32 @@
> > >   #     until you issue the `migrate-incoming` command.
> > >   #
> > >   #     (since 10.0)
> > > +#
> > > +# @cpr-exec: The migrate command stops the VM, saves state to the
> > > +#     migration channel, directly exec's a new version of QEMU on the
> > > +#     same host, replacing the original process while retaining its
> > > +#     PID, and loads state from the channel.  Guest RAM is preserved
> > > +#     in place.  Devices and their pinned pages are also preserved for
> > > +#     VFIO and IOMMUFD.
> > > +#
> > > +#     Old QEMU starts new QEMU by exec'ing the command specified by
> > > +#     the @cpr-exec-command parameter.  The command may be a direct
> > > +#     invocation of new QEMU, or may be a wrapper that exec's the new
> > > +#     QEMU binary.
> > > +#
> > > +#     Because old QEMU terminates when new QEMU starts, one cannot
> > > +#     stream data between the two, so the channel must be a type,
> > > +#     such as a file, that accepts all data before old QEMU exits.
> > > +#     Otherwise, old QEMU may quietly block writing to the channel.
> > > +#
> > > +#     Memory-backend objects must have the share=on attribute, but
> > > +#     memory-backend-epc is not supported.  The VM must be started
> > > +#     with the '-machine aux-ram-share=on' option.
> > > +#
> > > +#     (since 10.2)
> > >   ##
> > >   { 'enum': 'MigMode',
> > > -  'data': [ 'normal', 'cpr-reboot', 'cpr-transfer' ] }
> > > +  'data': [ 'normal', 'cpr-reboot', 'cpr-transfer', 'cpr-exec' ] }
> > >   ##
> > >   # @ZeroPageDetection:
> > > diff --git a/include/migration/cpr.h b/include/migration/cpr.h
> > > index b84389f..beed392 100644
> > > --- a/include/migration/cpr.h
> > > +++ b/include/migration/cpr.h
> > > @@ -53,6 +53,7 @@ int cpr_get_fd_param(const char *name, const char *fdname, int index,
> > >   QEMUFile *cpr_transfer_output(MigrationChannel *channel, Error **errp);
> > >   QEMUFile *cpr_transfer_input(MigrationChannel *channel, Error **errp);
> > > +void cpr_exec_init(void);
> > >   QEMUFile *cpr_exec_output(Error **errp);
> > >   QEMUFile *cpr_exec_input(Error **errp);
> > >   void cpr_exec_persist_state(QEMUFile *f);
> > > diff --git a/migration/cpr-exec.c b/migration/cpr-exec.c
> > > index 2c32e9c..8cf55a3 100644
> > > --- a/migration/cpr-exec.c
> > > +++ b/migration/cpr-exec.c
> > > @@ -6,15 +6,21 @@
> > >   #include "qemu/osdep.h"
> > >   #include "qemu/cutils.h"
> > > +#include "qemu/error-report.h"
> > >   #include "qemu/memfd.h"
> > >   #include "qapi/error.h"
> > > +#include "qapi/type-helpers.h"
> > >   #include "io/channel-file.h"
> > >   #include "io/channel-socket.h"
> > > +#include "block/block-global-state.h"
> > > +#include "qemu/main-loop.h"
> > >   #include "migration/cpr.h"
> > >   #include "migration/qemu-file.h"
> > > +#include "migration/migration.h"
> > >   #include "migration/misc.h"
> > >   #include "migration/vmstate.h"
> > >   #include "system/runstate.h"
> > > +#include "trace.h"
> > >   #define CPR_EXEC_STATE_NAME "QEMU_CPR_EXEC_STATE"
> > > @@ -92,3 +98,81 @@ QEMUFile *cpr_exec_input(Error **errp)
> > >       lseek(mfd, 0, SEEK_SET);
> > >       return qemu_file_new_fd_input(mfd, CPR_EXEC_STATE_NAME);
> > >   }
> > > +
> > > +static bool preserve_fd(int fd)
> > > +{
> > > +    qemu_clear_cloexec(fd);
> > > +    return true;
> > > +}
> > > +
> > > +static bool unpreserve_fd(int fd)
> > > +{
> > > +    qemu_set_cloexec(fd);
> > > +    return true;
> > > +}
> > > +
> > > +static void cpr_exec_cb(void *opaque)
> > > +{
> > > +    MigrationState *s = migrate_get_current();
> > > +    char **argv = strv_from_str_list(s->parameters.cpr_exec_command);
> > > +    Error *err = NULL;
> > > +
> > > +    /*
> > > +     * Clear the close-on-exec flag for all preserved fd's.  We cannot do so
> > > +     * earlier because they should not persist across miscellaneous fork and
> > > +     * exec calls that are performed during normal operation.
> > > +     */
> > > +    cpr_walk_fd(preserve_fd);
> > > +
> > > +    trace_cpr_exec();
> > > +    execvp(argv[0], argv);
> > > +
> > > +    /*
> > > +     * exec should only fail if argv[0] is bogus, or has a permissions problem,
> > > +     * or the system is very short on resources.
> > > +     */
> > > +    g_strfreev(argv);
> > > +    cpr_walk_fd(unpreserve_fd);
> > > +
> > > +    error_setg_errno(&err, errno, "execvp %s failed", argv[0]);
> > > +    error_report_err(error_copy(err));
> > > +    migrate_set_state(&s->state, s->state, MIGRATION_STATUS_FAILED);
> > 
> > I believe this is the only place we can have the state machine from
> > COMPLETED->FAILED.  It's pretty hacky.  Maybe add a quick comment?
> 
> OK.
> > > +    migrate_set_error(s, err);
> > > +
> > > +    migration_call_notifiers(s, MIG_EVENT_PRECOPY_FAILED, NULL);
> > > +
> > > +    err = NULL;
> > > +    if (!migration_block_activate(&err)) {
> > > +        /* error was already reported */
> > > +        return;
> > > +    }
> > > +
> > > +    if (runstate_is_live(s->vm_old_state)) {
> > > +        vm_start();
> > > +    }
> > 
> > We have rollback logic in migration_iteration_finish().  Make a small
> > helper and reuse the code?
> Hmm.  There is some overlap, but also subtle differences.  For so little
> code, it does not feel worth any risk of regression (or worth the time to
> test and verify all conditions).
We have a fix that has not landed yet but should likely land soon, one way
or another:
https://lore.kernel.org/all/20250915115918.3520735-2-jmarcin@redhat.com/
That should close one gap.
There are definitely reasons for sharing code, e.g. when we fix the path we
fix it for all users, not just one.  We also avoid making a mistake in one
path but not in the other.  One solid example is, I feel like err is leaked
above..
I'm fine if you prefer landing this first, but I'll still suggest a cleanup
on top after above patch lands.
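To illustrate the ownership concern, here is a minimal self-contained
sketch.  It is not QEMU code: SketchError and every sketch_* function are
hypothetical stand-ins, and it assumes that the "report" call consumes the
error it is given while the "set" call stores its own copy.  Under those
assumptions, the caller still owns the original error and must free it.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical error object; counts live instances to expose leaks. */
typedef struct {
    char *msg;
} SketchError;

static int live_errors;
static SketchError *stored_error;

SketchError *sketch_error_new(const char *msg)
{
    SketchError *e = malloc(sizeof(*e));
    assert(e);
    e->msg = malloc(strlen(msg) + 1);
    assert(e->msg);
    strcpy(e->msg, msg);
    live_errors++;
    return e;
}

SketchError *sketch_error_copy(const SketchError *e)
{
    return sketch_error_new(e->msg);
}

void sketch_error_free(SketchError *e)
{
    if (e) {
        free(e->msg);
        free(e);
        live_errors--;
    }
}

/* Assumption: reporting consumes (frees) the error passed in. */
void sketch_report_err(SketchError *e)
{
    sketch_error_free(e);
}

/* Assumption: the setter keeps a private copy, not the caller's pointer. */
void sketch_set_error(const SketchError *e)
{
    sketch_error_free(stored_error);
    stored_error = sketch_error_copy(e);
}

/* Errors that are alive but not reachable through stored_error. */
int sketch_outstanding(void)
{
    return live_errors - (stored_error ? 1 : 0);
}

/* Mirrors the shape of the failure path; optionally frees the original. */
int sketch_fail_path(int free_original)
{
    SketchError *err = sketch_error_new("exec failed");

    sketch_report_err(sketch_error_copy(err));  /* the copy is consumed */
    sketch_set_error(err);                      /* another copy is stored */
    if (free_original) {
        sketch_error_free(err);                 /* caller drops its own */
    }
    return sketch_outstanding();
}
```

With a shared rollback helper, the final free would live in one place
rather than in each caller.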
> > > +}
> > > +
> > > +static int cpr_exec_notifier(NotifierWithReturn *notifier, MigrationEvent *e,
> > > +                             Error **errp)
> > > +{
> > > +    MigrationState *s = migrate_get_current();
> > > +
> > > +    if (e->type == MIG_EVENT_PRECOPY_DONE) {
> > > +        QEMUBH *cpr_exec_bh = qemu_bh_new(cpr_exec_cb, NULL);
> > > +        assert(s->state == MIGRATION_STATUS_COMPLETED);
> > > +        qemu_bh_schedule(cpr_exec_bh);
> > > +        qemu_notify_event();
> > > +
> > 
> > Newline can be dropped.
> OK.
> 
> > > +    } else if (e->type == MIG_EVENT_PRECOPY_FAILED) {
> > > +        cpr_exec_unpersist_state();
> > > +    }
> > > +    return 0;
> > > +}
> > > +
> > > +void cpr_exec_init(void)
> > > +{
> > > +    static NotifierWithReturn exec_notifier;
> > > +
> > > +    migration_add_notifier_mode(&exec_notifier, cpr_exec_notifier,
> > > +                                MIG_MODE_CPR_EXEC);
> > > +}
> > > diff --git a/migration/cpr.c b/migration/cpr.c
> > > index d3e370e..eea3773 100644
> > > --- a/migration/cpr.c
> > > +++ b/migration/cpr.c
> > > @@ -185,6 +185,8 @@ int cpr_state_save(MigrationChannel *channel, Error **errp)
> > >       if (mode == MIG_MODE_CPR_TRANSFER) {
> > >           g_assert(channel);
> > >           f = cpr_transfer_output(channel, errp);
> > > +    } else if (mode == MIG_MODE_CPR_EXEC) {
> > > +        f = cpr_exec_output(errp);
> > >       } else {
> > >           return 0;
> > >       }
> > > @@ -202,6 +204,10 @@ int cpr_state_save(MigrationChannel *channel, Error **errp)
> > >           return ret;
> > >       }
> > > +    if (migrate_mode() == MIG_MODE_CPR_EXEC) {
> > > +        cpr_exec_persist_state(f);
> > > +    }
> > > +
> > >       /*
> > >        * Close the socket only partially so we can later detect when the other
> > >        * end closes by getting a HUP event.
> > > @@ -213,6 +219,12 @@ int cpr_state_save(MigrationChannel *channel, Error **errp)
> > >       return 0;
> > >   }
> > > +static bool unpreserve_fd(int fd)
> > > +{
> > > +    qemu_set_cloexec(fd);
> > > +    return true;
> > > +}
> > 
> > Is this function defined twice?
> 
> Yes, since it is tiny.  I judged that defining this small helper twice, near each
> of its call sites, was better for the reader.
I still think we should avoid doing that.
Btw, I even think this helper should be removed in both places, because
it is used almost only in a cpr_walk_fd() context.  Instead, it looks like
we need cpr_unpreserve_fds(), which does:
        cpr_walk_fd(unpreserve_fd);
Then it can be defined once in cpr.c and exported in cpr.h.  Would that be
better?
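As a rough, self-contained sketch of that shape (not the QEMU code:
walk_fds(), track_fd() and the sketch_* names are hypothetical stand-ins
for cpr_walk_fd() and the proposed cpr_unpreserve_fds(), and the fd list
here is a plain array rather than CPR state), the per-fd callbacks stay
private to one file and only the wrappers are exported:

```c
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical stand-in for the list of preserved descriptors. */
#define MAX_TRACKED 16
static int tracked_fds[MAX_TRACKED];
static size_t n_tracked;

void track_fd(int fd)
{
    assert(n_tracked < MAX_TRACKED);
    tracked_fds[n_tracked++] = fd;
}

typedef bool (*fd_walk_cb)(int fd);

/* Call cb on every tracked fd; stop early if cb returns false. */
bool walk_fds(fd_walk_cb cb)
{
    for (size_t i = 0; i < n_tracked; i++) {
        if (!cb(tracked_fds[i])) {
            return false;
        }
    }
    return true;
}

/* Set or clear FD_CLOEXEC while keeping the other descriptor flags. */
static bool change_cloexec(int fd, bool on)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags < 0) {
        return false;
    }
    flags = on ? (flags | FD_CLOEXEC) : (flags & ~FD_CLOEXEC);
    return fcntl(fd, F_SETFD, flags) == 0;
}

/* Private callbacks: defined once, never used outside this file. */
static bool preserve_fd(int fd)
{
    return change_cloexec(fd, false);
}

static bool unpreserve_fd(int fd)
{
    return change_cloexec(fd, true);
}

/* Exported wrappers: the only entry points callers would need. */
bool sketch_preserve_fds(void)
{
    return walk_fds(preserve_fd);
}

bool sketch_unpreserve_fds(void)
{
    return walk_fds(unpreserve_fd);
}

/* Helper for checking the result. */
bool fd_is_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    return flags >= 0 && (flags & FD_CLOEXEC);
}
```

The exec failure path and the load path would then both call the single
wrapper instead of each carrying a private copy of the callback.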
-- 
Peter Xu
^ permalink raw reply	[flat|nested] 30+ messages in thread
* Re: [PATCH V4 6/8] migration: cpr-exec mode
  2025-09-30 18:20       ` Peter Xu
@ 2025-09-30 18:29         ` Steven Sistare
  0 siblings, 0 replies; 30+ messages in thread
From: Steven Sistare @ 2025-09-30 18:29 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Fabiano Rosas, Markus Armbruster, Paolo Bonzini,
	Eric Blake, Dr. David Alan Gilbert, Cedric Le Goater,
	Alex Williamson
On 9/30/2025 2:20 PM, Peter Xu wrote:
> On Tue, Sep 30, 2025 at 01:18:34PM -0400, Steven Sistare wrote:
>> On 9/30/2025 12:39 PM, Peter Xu wrote:
>>> On Mon, Sep 22, 2025 at 06:49:43AM -0700, Steve Sistare wrote:
>>>> Add the cpr-exec migration mode.  Usage:
>>>>     qemu-system-$arch -machine aux-ram-share=on ...
>>>>     migrate_set_parameter mode cpr-exec
>>>>     migrate_set_parameter cpr-exec-command \
>>>>       <arg1> <arg2> ... -incoming <uri-1> \
>>>>     migrate -d <uri-1>
>>>>
>>>> The migrate command stops the VM, saves state to uri-1,
>>>> directly exec's a new version of QEMU on the same host,
>>>> replacing the original process while retaining its PID, and
>>>> loads state from uri-1.  Guest RAM is preserved in place,
>>>> albeit with new virtual addresses.
>>>>
>>>> The new QEMU process is started by exec'ing the command
>>>> specified by the @cpr-exec-command parameter.  The first word of
>>>> the command is the binary, and the remaining words are its
>>>> arguments.  The command may be a direct invocation of new QEMU,
>>>> or may be a non-QEMU command that exec's the new QEMU binary.
>>>>
>>>> This mode creates a second migration channel that is not visible
>>>> to the user.  At the start of migration, old QEMU saves CPR state
>>>> to the second channel, and at the end of migration, it tells the
>>>> main loop to call cpr_exec.  New QEMU loads CPR state early, before
>>>> objects are created.
>>>>
>>>> Because old QEMU terminates when new QEMU starts, one cannot
>>>> stream data between the two, so uri-1 must be a type,
>>>> such as a file, that accepts all data before old QEMU exits.
>>>> Otherwise, old QEMU may quietly block writing to the channel.
>>>>
>>>> Memory-backend objects must have the share=on attribute, but
>>>> memory-backend-epc is not supported.  The VM must be started with
>>>> the '-machine aux-ram-share=on' option, which allows anonymous
>>>> memory to be transferred in place to the new process.  The memfds
>>>> are kept open across exec by clearing the close-on-exec flag, their
>>>> values are saved in CPR state, and they are mmap'd in new QEMU.
>>>>
>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>> Acked-by: Markus Armbruster <armbru@redhat.com>
>>>> ---
>>>>    qapi/migration.json       | 25 +++++++++++++-
>>>>    include/migration/cpr.h   |  1 +
>>>>    migration/cpr-exec.c      | 84 +++++++++++++++++++++++++++++++++++++++++++++++
>>>>    migration/cpr.c           | 28 ++++++++++++++--
>>>>    migration/migration.c     | 10 +++++-
>>>>    migration/ram.c           |  1 +
>>>>    migration/vmstate-types.c |  8 +++++
>>>>    system/vl.c               |  4 ++-
>>>>    migration/trace-events    |  1 +
>>>>    9 files changed, 157 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/qapi/migration.json b/qapi/migration.json
>>>> index 2be8fa1..be0f3fc 100644
>>>> --- a/qapi/migration.json
>>>> +++ b/qapi/migration.json
>>>> @@ -694,9 +694,32 @@
>>>>    #     until you issue the `migrate-incoming` command.
>>>>    #
>>>>    #     (since 10.0)
>>>> +#
>>>> +# @cpr-exec: The migrate command stops the VM, saves state to the
>>>> +#     migration channel, directly exec's a new version of QEMU on the
>>>> +#     same host, replacing the original process while retaining its
>>>> +#     PID, and loads state from the channel.  Guest RAM is preserved
>>>> +#     in place.  Devices and their pinned pages are also preserved for
>>>> +#     VFIO and IOMMUFD.
>>>> +#
>>>> +#     Old QEMU starts new QEMU by exec'ing the command specified by
>>>> +#     the @cpr-exec-command parameter.  The command may be a direct
>>>> +#     invocation of new QEMU, or may be a wrapper that exec's the new
>>>> +#     QEMU binary.
>>>> +#
>>>> +#     Because old QEMU terminates when new QEMU starts, one cannot
>>>> +#     stream data between the two, so the channel must be a type,
>>>> +#     such as a file, that accepts all data before old QEMU exits.
>>>> +#     Otherwise, old QEMU may quietly block writing to the channel.
>>>> +#
>>>> +#     Memory-backend objects must have the share=on attribute, but
>>>> +#     memory-backend-epc is not supported.  The VM must be started
>>>> +#     with the '-machine aux-ram-share=on' option.
>>>> +#
>>>> +#     (since 10.2)
>>>>    ##
>>>>    { 'enum': 'MigMode',
>>>> -  'data': [ 'normal', 'cpr-reboot', 'cpr-transfer' ] }
>>>> +  'data': [ 'normal', 'cpr-reboot', 'cpr-transfer', 'cpr-exec' ] }
>>>>    ##
>>>>    # @ZeroPageDetection:
>>>> diff --git a/include/migration/cpr.h b/include/migration/cpr.h
>>>> index b84389f..beed392 100644
>>>> --- a/include/migration/cpr.h
>>>> +++ b/include/migration/cpr.h
>>>> @@ -53,6 +53,7 @@ int cpr_get_fd_param(const char *name, const char *fdname, int index,
>>>>    QEMUFile *cpr_transfer_output(MigrationChannel *channel, Error **errp);
>>>>    QEMUFile *cpr_transfer_input(MigrationChannel *channel, Error **errp);
>>>> +void cpr_exec_init(void);
>>>>    QEMUFile *cpr_exec_output(Error **errp);
>>>>    QEMUFile *cpr_exec_input(Error **errp);
>>>>    void cpr_exec_persist_state(QEMUFile *f);
>>>> diff --git a/migration/cpr-exec.c b/migration/cpr-exec.c
>>>> index 2c32e9c..8cf55a3 100644
>>>> --- a/migration/cpr-exec.c
>>>> +++ b/migration/cpr-exec.c
>>>> @@ -6,15 +6,21 @@
>>>>    #include "qemu/osdep.h"
>>>>    #include "qemu/cutils.h"
>>>> +#include "qemu/error-report.h"
>>>>    #include "qemu/memfd.h"
>>>>    #include "qapi/error.h"
>>>> +#include "qapi/type-helpers.h"
>>>>    #include "io/channel-file.h"
>>>>    #include "io/channel-socket.h"
>>>> +#include "block/block-global-state.h"
>>>> +#include "qemu/main-loop.h"
>>>>    #include "migration/cpr.h"
>>>>    #include "migration/qemu-file.h"
>>>> +#include "migration/migration.h"
>>>>    #include "migration/misc.h"
>>>>    #include "migration/vmstate.h"
>>>>    #include "system/runstate.h"
>>>> +#include "trace.h"
>>>>    #define CPR_EXEC_STATE_NAME "QEMU_CPR_EXEC_STATE"
>>>> @@ -92,3 +98,81 @@ QEMUFile *cpr_exec_input(Error **errp)
>>>>        lseek(mfd, 0, SEEK_SET);
>>>>        return qemu_file_new_fd_input(mfd, CPR_EXEC_STATE_NAME);
>>>>    }
>>>> +
>>>> +static bool preserve_fd(int fd)
>>>> +{
>>>> +    qemu_clear_cloexec(fd);
>>>> +    return true;
>>>> +}
>>>> +
>>>> +static bool unpreserve_fd(int fd)
>>>> +{
>>>> +    qemu_set_cloexec(fd);
>>>> +    return true;
>>>> +}
>>>> +
>>>> +static void cpr_exec_cb(void *opaque)
>>>> +{
>>>> +    MigrationState *s = migrate_get_current();
>>>> +    char **argv = strv_from_str_list(s->parameters.cpr_exec_command);
>>>> +    Error *err = NULL;
>>>> +
>>>> +    /*
>>>> +     * Clear the close-on-exec flag for all preserved fd's.  We cannot do so
>>>> +     * earlier because they should not persist across miscellaneous fork and
>>>> +     * exec calls that are performed during normal operation.
>>>> +     */
>>>> +    cpr_walk_fd(preserve_fd);
>>>> +
>>>> +    trace_cpr_exec();
>>>> +    execvp(argv[0], argv);
>>>> +
>>>> +    /*
>>>> +     * exec should only fail if argv[0] is bogus, or has a permissions problem,
>>>> +     * or the system is very short on resources.
>>>> +     */
>>>> +    g_strfreev(argv);
>>>> +    cpr_walk_fd(unpreserve_fd);
>>>> +
>>>> +    error_setg_errno(&err, errno, "execvp %s failed", argv[0]);
>>>> +    error_report_err(error_copy(err));
>>>> +    migrate_set_state(&s->state, s->state, MIGRATION_STATUS_FAILED);
>>>
>>> I believe this is the only place where the state machine can go from
>>> COMPLETED->FAILED.  It's pretty hacky.  Maybe add a quick comment?
>>
>> OK.
>>>> +    migrate_set_error(s, err);
>>>> +
>>>> +    migration_call_notifiers(s, MIG_EVENT_PRECOPY_FAILED, NULL);
>>>> +
>>>> +    err = NULL;
>>>> +    if (!migration_block_activate(&err)) {
>>>> +        /* error was already reported */
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    if (runstate_is_live(s->vm_old_state)) {
>>>> +        vm_start();
>>>> +    }
>>>
>>> We have rollback logic in migration_iteration_finish().  Make a small
>>> helper and reuse the code?
>> Hmm.  There is some overlap, but also subtle differences.  For so little
>> code, it does not feel worth any risk of regression (or worth the time to
>> test and verify all conditions).
> 
> We have a fix not yet landed but should likely land soon one way or
> another:
> 
> https://lore.kernel.org/all/20250915115918.3520735-2-jmarcin@redhat.com/
> 
> That should close one gap.
> 
> There are definitely reasons for sharing code, e.g. when we fix the path we
> fix all users, not one.  We also avoid making a mistake in one path but not
> in the other.  One solid example: I feel like err is leaked above.
> 
> I'm fine if you prefer landing this first, but I'll still suggest a cleanup
> on top after above patch lands.
OK, let's do that.
>>>> +}
>>>> +
>>>> +static int cpr_exec_notifier(NotifierWithReturn *notifier, MigrationEvent *e,
>>>> +                             Error **errp)
>>>> +{
>>>> +    MigrationState *s = migrate_get_current();
>>>> +
>>>> +    if (e->type == MIG_EVENT_PRECOPY_DONE) {
>>>> +        QEMUBH *cpr_exec_bh = qemu_bh_new(cpr_exec_cb, NULL);
>>>> +        assert(s->state == MIGRATION_STATUS_COMPLETED);
>>>> +        qemu_bh_schedule(cpr_exec_bh);
>>>> +        qemu_notify_event();
>>>> +
>>>
>>> Newline can be dropped.
>> OK.
>>
>>>> +    } else if (e->type == MIG_EVENT_PRECOPY_FAILED) {
>>>> +        cpr_exec_unpersist_state();
>>>> +    }
>>>> +    return 0;
>>>> +}
>>>> +
>>>> +void cpr_exec_init(void)
>>>> +{
>>>> +    static NotifierWithReturn exec_notifier;
>>>> +
>>>> +    migration_add_notifier_mode(&exec_notifier, cpr_exec_notifier,
>>>> +                                MIG_MODE_CPR_EXEC);
>>>> +}
>>>> diff --git a/migration/cpr.c b/migration/cpr.c
>>>> index d3e370e..eea3773 100644
>>>> --- a/migration/cpr.c
>>>> +++ b/migration/cpr.c
>>>> @@ -185,6 +185,8 @@ int cpr_state_save(MigrationChannel *channel, Error **errp)
>>>>        if (mode == MIG_MODE_CPR_TRANSFER) {
>>>>            g_assert(channel);
>>>>            f = cpr_transfer_output(channel, errp);
>>>> +    } else if (mode == MIG_MODE_CPR_EXEC) {
>>>> +        f = cpr_exec_output(errp);
>>>>        } else {
>>>>            return 0;
>>>>        }
>>>> @@ -202,6 +204,10 @@ int cpr_state_save(MigrationChannel *channel, Error **errp)
>>>>            return ret;
>>>>        }
>>>> +    if (migrate_mode() == MIG_MODE_CPR_EXEC) {
>>>> +        cpr_exec_persist_state(f);
>>>> +    }
>>>> +
>>>>        /*
>>>>         * Close the socket only partially so we can later detect when the other
>>>>         * end closes by getting a HUP event.
>>>> @@ -213,6 +219,12 @@ int cpr_state_save(MigrationChannel *channel, Error **errp)
>>>>        return 0;
>>>>    }
>>>> +static bool unpreserve_fd(int fd)
>>>> +{
>>>> +    qemu_set_cloexec(fd);
>>>> +    return true;
>>>> +}
>>>
>>> Is this function defined twice?
>>
>> Yes, since it is tiny.  I judged that defining this small helper twice, near each
>> of its call sites, was better for the reader.
> 
> I still think we should avoid doing that.
> 
> Btw, I even think this helper should be removed in both places because
> it is used almost exclusively in a cpr_walk_fd() context, so instead it
> looks like we need cpr_unpreserve_fds(), which does:
> 
>          cpr_walk_fd(unpreserve_fd);
> 
> Then it can be defined once in cpr.c and exported in cpr.h.  Would that be
> better?
OK, I'll do that.
- Steve
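
For readers following the thread, the consolidation agreed above can be
sketched outside of QEMU as follows.  This is only an illustration under
stated assumptions, not the code that will appear in V5: walk_fds() and the
preserved_fds[] table are hypothetical stand-ins for cpr_walk_fd() and the
descriptors recorded in CPR state (the walker is assumed to stop when a
callback returns false), set_cloexec() stands in for qemu_set_cloexec() and
qemu_clear_cloexec() using plain fcntl(), and exec_preserving_fds() is an
invented name for the exec-then-roll-back step.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical stand-in for the fds that CPR state keeps track of. */
#define MAX_PRESERVED 16
static int preserved_fds[MAX_PRESERVED];
static size_t preserved_count;

typedef bool (*fd_walk_cb)(int fd);

/* Stand-in for cpr_walk_fd(): visit each fd, stop if cb returns false. */
static bool walk_fds(fd_walk_cb cb)
{
    for (size_t i = 0; i < preserved_count; i++) {
        if (!cb(preserved_fds[i])) {
            return false;
        }
    }
    return true;
}

/* Set or clear FD_CLOEXEC on one descriptor using plain POSIX fcntl(). */
static bool set_cloexec(int fd, bool on)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags < 0) {
        return false;
    }
    flags = on ? (flags | FD_CLOEXEC) : (flags & ~FD_CLOEXEC);
    return fcntl(fd, F_SETFD, flags) == 0;
}

static bool fd_is_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    return flags >= 0 && (flags & FD_CLOEXEC) != 0;
}

static bool preserve_fd(int fd)   { return set_cloexec(fd, false); }
static bool unpreserve_fd(int fd) { return set_cloexec(fd, true); }

/* Defined once and shared, in the spirit of cpr_unpreserve_fds(). */
static bool preserve_all_fds(void)   { return walk_fds(preserve_fd); }
static bool unpreserve_all_fds(void) { return walk_fds(unpreserve_fd); }

/*
 * Clear close-on-exec as late as possible, then exec.  This returns only
 * if execvp() fails: errno is saved immediately, close-on-exec is
 * restored, and the saved errno is handed back to the caller.
 */
static int exec_preserving_fds(char *const argv[])
{
    int saved_errno;

    preserve_all_fds();
    execvp(argv[0], argv);
    saved_errno = errno;    /* capture before later calls can change it */
    unpreserve_all_fds();
    return saved_errno;
}
```

A caller would fill preserved_fds[], call exec_preserving_fds(argv), and
treat any return as failure, reporting the returned errno while argv is
still valid.  Saving errno right after execvp() is a choice made for this
sketch, so that the rollback calls cannot overwrite the value that gets
reported.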
* Re: [PATCH V4 0/8] Live update: cpr-exec
  2025-09-30 16:42   ` Peter Xu
  2025-09-30 16:52     ` Steven Sistare
@ 2025-09-30 19:49     ` Steven Sistare
  2025-09-30 20:40       ` Peter Xu
  1 sibling, 1 reply; 30+ messages in thread
From: Steven Sistare @ 2025-09-30 19:49 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Fabiano Rosas, Markus Armbruster, Paolo Bonzini,
	Eric Blake, Dr. David Alan Gilbert, Cedric Le Goater,
	Alex Williamson
On 9/30/2025 12:42 PM, Peter Xu wrote:
> On Tue, Sep 30, 2025 at 11:28:58AM -0400, Steven Sistare wrote:
>> Just a reminder, these patches still need review from Peter and/or Fabiano:
>>
>>    Patch 5/8: migration: cpr-exec save and load
>>    Patch 6/8: migration: cpr-exec mode
> 
> I read them and left some comments where I have.  For patch 5 please
> remember to include the header that Cedric pointed out, because it does
> break the builds.
> 
> Other than that the series looks OK.  I suggest when you repost, have the
> testcases be together.  I saw Fabiano queued most of the test patches, but
> it shouldn't be an issue no matter which lands first.
If I post V5 with the patch "cpr-exec test", and it lands before the
other test patches, then V5 will not build.  "cpr-exec test" depends on
a handful of new test functions.
- Steve
* Re: [PATCH V4 0/8] Live update: cpr-exec
  2025-09-30 19:49     ` Steven Sistare
@ 2025-09-30 20:40       ` Peter Xu
  0 siblings, 0 replies; 30+ messages in thread
From: Peter Xu @ 2025-09-30 20:40 UTC (permalink / raw)
  To: Steven Sistare
  Cc: qemu-devel, Fabiano Rosas, Markus Armbruster, Paolo Bonzini,
	Eric Blake, Dr. David Alan Gilbert, Cedric Le Goater,
	Alex Williamson
On Tue, Sep 30, 2025 at 03:49:12PM -0400, Steven Sistare wrote:
> On 9/30/2025 12:42 PM, Peter Xu wrote:
> > On Tue, Sep 30, 2025 at 11:28:58AM -0400, Steven Sistare wrote:
> > > Just a reminder, these patches still need review from Peter and/or Fabiano:
> > > 
> > >    Patch 5/8: migration: cpr-exec save and load
> > >    Patch 6/8: migration: cpr-exec mode
> > 
> > I read them and left some comments where I have.  For patch 5 please
> > remember to include the header that Cedric pointed out, because it does
> > break the builds.
> > 
> > Other than that the series looks OK.  I suggest when you repost, have the
> > testcases be together.  I saw Fabiano queued most of the test patches, but
> > it shouldn't be an issue no matter which lands first.
> 
> If I post V5 with the patch "cpr-exec test", and it lands before the
> other test patches, then V5 will not build.  "cpr-exec test" depends on
> a handful of new test functions.
I meant the repost needs to include all the dependency test patches too, no
matter whether Fabiano will already have sent a pull.
-- 
Peter Xu
Thread overview: 30+ messages
2025-09-22 13:49 [PATCH V4 0/8] Live update: cpr-exec Steve Sistare
2025-09-22 13:49 ` [PATCH V4 1/8] migration: multi-mode notifier Steve Sistare
2025-09-22 15:18   ` Cédric Le Goater
2025-09-24 18:15     ` Steven Sistare
2025-09-22 13:49 ` [PATCH V4 2/8] migration: add cpr_walk_fd Steve Sistare
2025-09-22 13:49 ` [PATCH V4 3/8] oslib: qemu_clear_cloexec Steve Sistare
2025-09-22 13:49 ` [PATCH V4 4/8] migration: cpr-exec-command parameter Steve Sistare
2025-09-22 13:49 ` [PATCH V4 5/8] migration: cpr-exec save and load Steve Sistare
2025-09-22 16:00   ` Cédric Le Goater
2025-09-24 18:16     ` Steven Sistare
2025-09-25  7:11       ` Cédric Le Goater
2025-09-25 20:38         ` Steven Sistare
2025-09-30 16:19         ` Peter Xu
2025-09-30 16:39           ` Steven Sistare
2025-09-22 13:49 ` [PATCH V4 6/8] migration: cpr-exec mode Steve Sistare
2025-09-22 15:28   ` Cédric Le Goater
2025-09-24 18:16     ` Steven Sistare
2025-09-25  7:12       ` Cédric Le Goater
2025-09-30 16:39   ` Peter Xu
2025-09-30 17:18     ` Steven Sistare
2025-09-30 18:20       ` Peter Xu
2025-09-30 18:29         ` Steven Sistare
2025-09-22 13:49 ` [PATCH V4 7/8] migration: cpr-exec docs Steve Sistare
2025-09-22 13:49 ` [PATCH V4 8/8] vfio: cpr-exec mode Steve Sistare
2025-09-22 15:28   ` Cédric Le Goater
2025-09-30 15:28 ` [PATCH V4 0/8] Live update: cpr-exec Steven Sistare
2025-09-30 16:42   ` Peter Xu
2025-09-30 16:52     ` Steven Sistare
2025-09-30 19:49     ` Steven Sistare
2025-09-30 20:40       ` Peter Xu