[PATCH V3 0/9] Live update: cpr-exec

* [PATCH V3 0/9] Live update: cpr-exec
@ 2025-08-14 17:17 Steve Sistare
  2025-08-14 17:17 ` [PATCH V3 1/9] migration: multi-mode notifier Steve Sistare
                   ` (10 more replies)
  0 siblings, 11 replies; 47+ messages in thread
From: Steve Sistare @ 2025-08-14 17:17 UTC (permalink / raw)
  To: qemu-devel
  Cc: Fabiano Rosas, Peter Xu, Markus Armbruster, Paolo Bonzini,
	Eric Blake, Dr. David Alan Gilbert, Steve Sistare

This patch series adds the live migration cpr-exec mode.  

The new user-visible interfaces are:
  * cpr-exec (MigMode migration parameter)
  * cpr-exec-command (migration parameter)

cpr-exec mode is similar in most respects to cpr-transfer mode, with the 
primary difference being that old QEMU directly exec's new QEMU.  The user
specifies the command to exec new QEMU in the migration parameter
cpr-exec-command.

Why?

In a containerized QEMU environment, cpr-exec reuses an existing QEMU
container and its assigned resources.  By contrast, cpr-transfer mode
requires a new container to be created on the same host as the target of
the CPR operation.  Resources must be reserved for the new container, while
the old container still reserves resources until the operation completes.
Avoiding overcommitment requires extra work in the management layer.
This is one reason why a cloud provider may prefer cpr-exec.  A second reason
is that the container may include agents with their own connections to the
outside world, and such connections remain intact if the container is reused.

How?

cpr-exec preserves descriptors across exec by clearing the CLOEXEC flag,
and by sending the unique name and value of each descriptor to new QEMU
via CPR state.

CPR state cannot be sent over the normal migration channel, because devices
and backends are created prior to reading the channel, so this mode sends
CPR state over a second migration channel that is not visible to the user.
New QEMU reads the second channel prior to creating devices or backends.

The exec itself is trivial.  After writing to the migration channels, the
migration code calls a new main-loop hook to perform the exec.
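
The descriptor-preservation step can be sketched in plain POSIX C.  This is
an illustration only, not code from the series: make_cloexec_pipe() and
has_cloexec() are hypothetical helpers, and clear_cloexec() merely mirrors
what the patch "oslib: qemu_clear_cloexec" does with fcntl().

```c
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <unistd.h>

/* Clear FD_CLOEXEC so that fd stays open across a later exec(). */
void clear_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);
    assert(flags != -1);
    flags = fcntl(fd, F_SETFD, flags & ~FD_CLOEXEC);
    assert(flags != -1);
}

/* Return true if fd would be closed by exec(). */
bool has_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);
    assert(flags != -1);
    return (flags & FD_CLOEXEC) != 0;
}

/* Hypothetical helper: create a pipe with both ends marked close-on-exec. */
int make_cloexec_pipe(int fds[2])
{
    if (pipe(fds) < 0) {
        return -1;
    }
    if (fcntl(fds[0], F_SETFD, FD_CLOEXEC) < 0 ||
        fcntl(fds[1], F_SETFD, FD_CLOEXEC) < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    return 0;
}
```

Only descriptors that were explicitly cleared survive the exec; all others
keep FD_CLOEXEC and are closed by the kernel.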

Example:

In this example, we simply restart the same version of QEMU, but in
a real scenario one would use a new QEMU binary path in cpr-exec-command.

  # qemu-kvm -monitor stdio
  -object memory-backend-memfd,id=ram0,size=1G
  -machine memory-backend=ram0 -machine aux-ram-share=on ...

  QEMU 10.1.50 monitor - type 'help' for more information
  (qemu) info status
  VM status: running
  (qemu) migrate_set_parameter mode cpr-exec
  (qemu) migrate_set_parameter cpr-exec-command qemu-kvm ... -incoming file:vm.state
  (qemu) migrate -d file:vm.state
  (qemu) QEMU 10.1.50 monitor - type 'help' for more information
  (qemu) info status
  VM status: running

Steve Sistare (9):
  migration: multi-mode notifier
  migration: add cpr_walk_fd
  oslib: qemu_clear_cloexec
  vl: helper to request exec
  migration: cpr-exec-command parameter
  migration: cpr-exec save and load
  migration: cpr-exec mode
  migration: cpr-exec docs
  vfio: cpr-exec mode

 docs/devel/migration/CPR.rst   | 103 ++++++++++++++++++++++++-
 qapi/migration.json            |  46 ++++++++++-
 include/migration/cpr.h        |   9 +++
 include/migration/misc.h       |  12 +++
 include/qemu/osdep.h           |   9 +++
 include/system/runstate.h      |   3 +
 hw/vfio/container.c            |   3 +-
 hw/vfio/cpr-iommufd.c          |   3 +-
 hw/vfio/cpr-legacy.c           |   9 ++-
 hw/vfio/cpr.c                  |  13 ++--
 migration/cpr-exec.c           | 168 +++++++++++++++++++++++++++++++++++++++++
 migration/cpr.c                |  39 +++++++++-
 migration/migration-hmp-cmds.c |  25 ++++++
 migration/migration.c          |  70 +++++++++++++----
 migration/options.c            |  14 ++++
 migration/ram.c                |   1 +
 migration/vmstate-types.c      |   8 ++
 system/runstate.c              |  29 +++++++
 util/oslib-posix.c             |   9 +++
 util/oslib-win32.c             |   4 +
 hmp-commands.hx                |   2 +-
 migration/meson.build          |   1 +
 migration/trace-events         |   1 +
 23 files changed, 548 insertions(+), 33 deletions(-)
 create mode 100644 migration/cpr-exec.c

-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH V3 1/9] migration: multi-mode notifier
  2025-08-14 17:17 [PATCH V3 0/9] Live update: cpr-exec Steve Sistare
@ 2025-08-14 17:17 ` Steve Sistare
  2025-08-19 13:09   ` Fabiano Rosas
  2025-09-09 15:43   ` Peter Xu
  2025-08-14 17:17 ` [PATCH V3 2/9] migration: add cpr_walk_fd Steve Sistare
                   ` (9 subsequent siblings)
  10 siblings, 2 replies; 47+ messages in thread
From: Steve Sistare @ 2025-08-14 17:17 UTC (permalink / raw)
  To: qemu-devel
  Cc: Fabiano Rosas, Peter Xu, Markus Armbruster, Paolo Bonzini,
	Eric Blake, Dr. David Alan Gilbert, Steve Sistare

Allow a notifier to be added for multiple migration modes.
To allow a notifier to appear on multiple per-mode lists, use
a generic list type.  We can no longer use NotifierWithReturnList,
because it shoehorns the notifier onto a single list.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 include/migration/misc.h | 12 ++++++++++
 migration/migration.c    | 60 +++++++++++++++++++++++++++++++++++++-----------
 2 files changed, 59 insertions(+), 13 deletions(-)

diff --git a/include/migration/misc.h b/include/migration/misc.h
index a261f99..592b930 100644
--- a/include/migration/misc.h
+++ b/include/migration/misc.h
@@ -95,7 +95,19 @@ void migration_add_notifier(NotifierWithReturn *notify,
 void migration_add_notifier_mode(NotifierWithReturn *notify,
                                  MigrationNotifyFunc func, MigMode mode);
 
+/*
+ * Same as migration_add_notifier, but applies to all @mode in the argument
+ * list.  The list is terminated by -1 or MIG_MODE_ALL.  For the latter,
+ * the notifier is added for all modes.
+ */
+void migration_add_notifier_modes(NotifierWithReturn *notify,
+                                  MigrationNotifyFunc func, MigMode mode, ...);
+
+/*
+ * Remove a notifier from all modes.
+ */
 void migration_remove_notifier(NotifierWithReturn *notify);
+
 void migration_file_set_error(int ret, Error *err);
 
 /* True if incoming migration entered POSTCOPY_INCOMING_DISCARD */
diff --git a/migration/migration.c b/migration/migration.c
index 49d1e7d..271c521 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -74,11 +74,7 @@
 
 #define INMIGRATE_DEFAULT_EXIT_ON_ERROR true
 
-static NotifierWithReturnList migration_state_notifiers[] = {
-    NOTIFIER_ELEM_INIT(migration_state_notifiers, MIG_MODE_NORMAL),
-    NOTIFIER_ELEM_INIT(migration_state_notifiers, MIG_MODE_CPR_REBOOT),
-    NOTIFIER_ELEM_INIT(migration_state_notifiers, MIG_MODE_CPR_TRANSFER),
-};
+static GSList *migration_state_notifiers[MIG_MODE__MAX];
 
 /* Messages sent on the return path from destination to source */
 enum mig_rp_message_type {
@@ -1666,23 +1662,51 @@ void migration_cancel(void)
     }
 }
 
+static int get_modes(MigMode mode, va_list ap);
+
+static void add_notifiers(NotifierWithReturn *notify, int modes)
+{
+    for (MigMode mode = 0; mode < MIG_MODE__MAX; mode++) {
+        if (modes & BIT(mode)) {
+            migration_state_notifiers[mode] =
+                g_slist_prepend(migration_state_notifiers[mode], notify);
+        }
+    }
+}
+
+void migration_add_notifier_modes(NotifierWithReturn *notify,
+                                  MigrationNotifyFunc func, MigMode mode, ...)
+{
+    int modes;
+    va_list ap;
+
+    va_start(ap, mode);
+    modes = get_modes(mode, ap);
+    va_end(ap);
+
+    notify->notify = (NotifierWithReturnFunc)func;
+    add_notifiers(notify, modes);
+}
+
 void migration_add_notifier_mode(NotifierWithReturn *notify,
                                  MigrationNotifyFunc func, MigMode mode)
 {
-    notify->notify = (NotifierWithReturnFunc)func;
-    notifier_with_return_list_add(&migration_state_notifiers[mode], notify);
+    migration_add_notifier_modes(notify, func, mode, -1);
 }
 
 void migration_add_notifier(NotifierWithReturn *notify,
                             MigrationNotifyFunc func)
 {
-    migration_add_notifier_mode(notify, func, MIG_MODE_NORMAL);
+    migration_add_notifier_modes(notify, func, MIG_MODE_NORMAL, -1);
 }
 
 void migration_remove_notifier(NotifierWithReturn *notify)
 {
     if (notify->notify) {
-        notifier_with_return_remove(notify);
+        for (MigMode mode = 0; mode < MIG_MODE__MAX; mode++) {
+            migration_state_notifiers[mode] =
+                g_slist_remove(migration_state_notifiers[mode], notify);
+        }
         notify->notify = NULL;
     }
 }
@@ -1692,13 +1716,23 @@ int migration_call_notifiers(MigrationState *s, MigrationEventType type,
 {
     MigMode mode = s->parameters.mode;
     MigrationEvent e;
+    NotifierWithReturn *notifier;
+    GSList *elem, *next;
     int ret;
 
     e.type = type;
-    ret = notifier_with_return_list_notify(&migration_state_notifiers[mode],
-                                           &e, errp);
-    assert(!ret || type == MIG_EVENT_PRECOPY_SETUP);
-    return ret;
+
+    for (elem = migration_state_notifiers[mode]; elem; elem = next) {
+        next = elem->next;
+        notifier = (NotifierWithReturn *)elem->data;
+        ret = notifier->notify(notifier, &e, errp);
+        if (ret) {
+            assert(type == MIG_EVENT_PRECOPY_SETUP);
+            return ret;
+        }
+    }
+
+    return 0;
 }
 
 bool migration_has_failed(MigrationState *s)
-- 
1.8.3.1




* Re: [PATCH V3 1/9] migration: multi-mode notifier
  2025-08-14 17:17 ` [PATCH V3 1/9] migration: multi-mode notifier Steve Sistare
@ 2025-08-19 13:09   ` Fabiano Rosas
  2025-09-09 15:43   ` Peter Xu
  1 sibling, 0 replies; 47+ messages in thread
From: Fabiano Rosas @ 2025-08-19 13:09 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel
  Cc: Peter Xu, Markus Armbruster, Paolo Bonzini, Eric Blake,
	Dr. David Alan Gilbert, Steve Sistare

Steve Sistare <steven.sistare@oracle.com> writes:

> Allow a notifier to be added for multiple migration modes.
> To allow a notifier to appear on multiple per-node lists, use
> a generic list type.  We can no longer use NotifierWithReturnList,
> because it shoe horns the notifier onto a single list.
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>

Reviewed-by: Fabiano Rosas <farosas@suse.de>



* Re: [PATCH V3 1/9] migration: multi-mode notifier
  2025-08-14 17:17 ` [PATCH V3 1/9] migration: multi-mode notifier Steve Sistare
  2025-08-19 13:09   ` Fabiano Rosas
@ 2025-09-09 15:43   ` Peter Xu
  2025-09-09 16:40     ` Steven Sistare
  1 sibling, 1 reply; 47+ messages in thread
From: Peter Xu @ 2025-09-09 15:43 UTC (permalink / raw)
  To: Steve Sistare
  Cc: qemu-devel, Fabiano Rosas, Markus Armbruster, Paolo Bonzini,
	Eric Blake, Dr. David Alan Gilbert

On Thu, Aug 14, 2025 at 10:17:15AM -0700, Steve Sistare wrote:
> Allow a notifier to be added for multiple migration modes.
> To allow a notifier to appear on multiple per-node lists, use
> a generic list type.  We can no longer use NotifierWithReturnList,
> because it shoe horns the notifier onto a single list.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  include/migration/misc.h | 12 ++++++++++
>  migration/migration.c    | 60 +++++++++++++++++++++++++++++++++++++-----------
>  2 files changed, 59 insertions(+), 13 deletions(-)
> 
> diff --git a/include/migration/misc.h b/include/migration/misc.h
> index a261f99..592b930 100644
> --- a/include/migration/misc.h
> +++ b/include/migration/misc.h
> @@ -95,7 +95,19 @@ void migration_add_notifier(NotifierWithReturn *notify,
>  void migration_add_notifier_mode(NotifierWithReturn *notify,
>                                   MigrationNotifyFunc func, MigMode mode);
>  
> +/*
> + * Same as migration_add_notifier, but applies to all @mode in the argument
> + * list.  The list is terminated by -1 or MIG_MODE_ALL.  For the latter,
> + * the notifier is added for all modes.
> + */
> +void migration_add_notifier_modes(NotifierWithReturn *notify,
> +                                  MigrationNotifyFunc func, MigMode mode, ...);

Would it be more common to pass in a bitmask instead (rather than n
parameters, plus an ending -1)?

-- 
Peter Xu




* Re: [PATCH V3 1/9] migration: multi-mode notifier
  2025-09-09 15:43   ` Peter Xu
@ 2025-09-09 16:40     ` Steven Sistare
  0 siblings, 0 replies; 47+ messages in thread
From: Steven Sistare @ 2025-09-09 16:40 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Fabiano Rosas, Markus Armbruster, Paolo Bonzini,
	Eric Blake, Dr. David Alan Gilbert

On 9/9/2025 11:43 AM, Peter Xu wrote:
> On Thu, Aug 14, 2025 at 10:17:15AM -0700, Steve Sistare wrote:
>> Allow a notifier to be added for multiple migration modes.
>> To allow a notifier to appear on multiple per-node lists, use
>> a generic list type.  We can no longer use NotifierWithReturnList,
>> because it shoe horns the notifier onto a single list.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>>   include/migration/misc.h | 12 ++++++++++
>>   migration/migration.c    | 60 +++++++++++++++++++++++++++++++++++++-----------
>>   2 files changed, 59 insertions(+), 13 deletions(-)
>>
>> diff --git a/include/migration/misc.h b/include/migration/misc.h
>> index a261f99..592b930 100644
>> --- a/include/migration/misc.h
>> +++ b/include/migration/misc.h
>> @@ -95,7 +95,19 @@ void migration_add_notifier(NotifierWithReturn *notify,
>>   void migration_add_notifier_mode(NotifierWithReturn *notify,
>>                                    MigrationNotifyFunc func, MigMode mode);
>>   
>> +/*
>> + * Same as migration_add_notifier, but applies to all @mode in the argument
>> + * list.  The list is terminated by -1 or MIG_MODE_ALL.  For the latter,
>> + * the notifier is added for all modes.
>> + */
>> +void migration_add_notifier_modes(NotifierWithReturn *notify,
>> +                                  MigrationNotifyFunc func, MigMode mode, ...);
> 
> Would it be more common to pass in a bitmask instead (rather than n
> parameters, plus a ending -1)?

Yes, but I defined it this way to avoid the common error of passing a bit position
rather than a mask, eg:

   A = 10
   B = 7
   migration_add_notifier_modes(A | B)               WRONG
   migration_add_notifier_modes(BIT(A) | BIT(B))     CORRECT

and because IMO passing A, B is slightly more readable than passing BIT(A) | BIT(B).

Note the blocker functions also take modes using varargs, so using a bitmask for
the notifiers would give us two different representations.
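
For illustration, a terminated varargs list can be folded into a bitmask
roughly as follows.  This is a sketch under assumptions, not the get_modes()
from the patch: the enum values, MODE_ALL, and modes_to_mask() are stand-ins
invented for this example.

```c
#include <assert.h>
#include <stdarg.h>

/* Illustrative stand-ins; not QEMU's real MigMode definitions. */
enum { MODE_NORMAL, MODE_REBOOT, MODE_TRANSFER, MODE_EXEC, MODE__MAX };
#define MODE_ALL MODE__MAX      /* hypothetical sentinel: every mode */
#define BIT(n)   (1u << (n))

/*
 * Convert a list of mode numbers into a bitmask.  The list ends with -1,
 * or with MODE_ALL, which selects every mode.
 */
unsigned modes_to_mask(int mode, ...)
{
    unsigned mask = 0;
    va_list ap;

    va_start(ap, mode);
    while (mode != -1 && mode != MODE_ALL) {
        assert(mode >= 0 && mode < MODE__MAX);
        mask |= BIT(mode);
        mode = va_arg(ap, int);
    }
    va_end(ap);

    if (mode == MODE_ALL) {
        mask = BIT(MODE__MAX) - 1;
    }
    return mask;
}
```

With these stand-in values, MODE_REBOOT | MODE_EXEC evaluates to 3, which as
a mask would select MODE_NORMAL and MODE_REBOOT.  That is the bit-position
versus mask mistake the varargs form avoids.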

- Steve





* [PATCH V3 2/9] migration: add cpr_walk_fd
  2025-08-14 17:17 [PATCH V3 0/9] Live update: cpr-exec Steve Sistare
  2025-08-14 17:17 ` [PATCH V3 1/9] migration: multi-mode notifier Steve Sistare
@ 2025-08-14 17:17 ` Steve Sistare
  2025-09-09 15:45   ` Peter Xu
  2025-08-14 17:17 ` [PATCH V3 3/9] oslib: qemu_clear_cloexec Steve Sistare
                   ` (8 subsequent siblings)
  10 siblings, 1 reply; 47+ messages in thread
From: Steve Sistare @ 2025-08-14 17:17 UTC (permalink / raw)
  To: qemu-devel
  Cc: Fabiano Rosas, Peter Xu, Markus Armbruster, Paolo Bonzini,
	Eric Blake, Dr. David Alan Gilbert, Steve Sistare

Add a helper to walk all CPR fd's and run a callback for each.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 include/migration/cpr.h |  3 +++
 migration/cpr.c         | 13 +++++++++++++
 2 files changed, 16 insertions(+)

diff --git a/include/migration/cpr.h b/include/migration/cpr.h
index baff57f..f4fc5ca 100644
--- a/include/migration/cpr.h
+++ b/include/migration/cpr.h
@@ -35,6 +35,9 @@ void cpr_resave_fd(const char *name, int id, int fd);
 int cpr_open_fd(const char *path, int flags, const char *name, int id,
                 Error **errp);
 
+typedef bool (*cpr_walk_fd_cb)(int fd);
+bool cpr_walk_fd(cpr_walk_fd_cb cb);
+
 MigMode cpr_get_incoming_mode(void);
 void cpr_set_incoming_mode(MigMode mode);
 bool cpr_is_incoming(void);
diff --git a/migration/cpr.c b/migration/cpr.c
index 6d01b8c..021bd6a 100644
--- a/migration/cpr.c
+++ b/migration/cpr.c
@@ -134,6 +134,19 @@ int cpr_open_fd(const char *path, int flags, const char *name, int id,
     return fd;
 }
 
+bool cpr_walk_fd(cpr_walk_fd_cb cb)
+{
+    CprFd *elem;
+
+    QLIST_FOREACH(elem, &cpr_state.fds, next) {
+        g_assert(elem->fd >= 0);
+        if (!cb(elem->fd)) {
+            return false;
+        }
+    }
+    return true;
+}
+
 /*************************************************************************/
 static const VMStateDescription vmstate_cpr_state = {
     .name = CPR_STATE,
-- 
1.8.3.1




* Re: [PATCH V3 2/9] migration: add cpr_walk_fd
  2025-08-14 17:17 ` [PATCH V3 2/9] migration: add cpr_walk_fd Steve Sistare
@ 2025-09-09 15:45   ` Peter Xu
  0 siblings, 0 replies; 47+ messages in thread
From: Peter Xu @ 2025-09-09 15:45 UTC (permalink / raw)
  To: Steve Sistare
  Cc: qemu-devel, Fabiano Rosas, Markus Armbruster, Paolo Bonzini,
	Eric Blake, Dr. David Alan Gilbert

On Thu, Aug 14, 2025 at 10:17:16AM -0700, Steve Sistare wrote:
> Add a helper to walk all CPR fd's and run a callback for each.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>

Reviewed-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu




* [PATCH V3 3/9] oslib: qemu_clear_cloexec
  2025-08-14 17:17 [PATCH V3 0/9] Live update: cpr-exec Steve Sistare
  2025-08-14 17:17 ` [PATCH V3 1/9] migration: multi-mode notifier Steve Sistare
  2025-08-14 17:17 ` [PATCH V3 2/9] migration: add cpr_walk_fd Steve Sistare
@ 2025-08-14 17:17 ` Steve Sistare
  2025-08-14 17:17 ` [PATCH V3 4/9] vl: helper to request exec Steve Sistare
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 47+ messages in thread
From: Steve Sistare @ 2025-08-14 17:17 UTC (permalink / raw)
  To: qemu-devel
  Cc: Fabiano Rosas, Peter Xu, Markus Armbruster, Paolo Bonzini,
	Eric Blake, Dr. David Alan Gilbert, Steve Sistare

Define qemu_clear_cloexec, analogous to qemu_set_cloexec.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
---
 include/qemu/osdep.h | 9 +++++++++
 util/oslib-posix.c   | 9 +++++++++
 util/oslib-win32.c   | 4 ++++
 3 files changed, 22 insertions(+)

diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
index 96fe51b..30136ea 100644
--- a/include/qemu/osdep.h
+++ b/include/qemu/osdep.h
@@ -680,6 +680,15 @@ ssize_t qemu_write_full(int fd, const void *buf, size_t count)
 
 void qemu_set_cloexec(int fd);
 
+/*
+ * Clear FD_CLOEXEC for a descriptor.
+ *
+ * The caller must guarantee that no other fork+exec's occur before the
+ * exec that is intended to inherit this descriptor, eg by suspending CPUs
+ * and blocking monitor commands.
+ */
+void qemu_clear_cloexec(int fd);
+
 /* Return a dynamically allocated directory path that is appropriate for storing
  * local state.
  *
diff --git a/util/oslib-posix.c b/util/oslib-posix.c
index 4ff577e..4c04658 100644
--- a/util/oslib-posix.c
+++ b/util/oslib-posix.c
@@ -307,6 +307,15 @@ int qemu_socketpair(int domain, int type, int protocol, int sv[2])
     return ret;
 }
 
+void qemu_clear_cloexec(int fd)
+{
+    int f;
+    f = fcntl(fd, F_GETFD);
+    assert(f != -1);
+    f = fcntl(fd, F_SETFD, f & ~FD_CLOEXEC);
+    assert(f != -1);
+}
+
 char *
 qemu_get_local_state_dir(void)
 {
diff --git a/util/oslib-win32.c b/util/oslib-win32.c
index b735163..843a901 100644
--- a/util/oslib-win32.c
+++ b/util/oslib-win32.c
@@ -222,6 +222,10 @@ void qemu_set_cloexec(int fd)
 {
 }
 
+void qemu_clear_cloexec(int fd)
+{
+}
+
 int qemu_get_thread_id(void)
 {
     return GetCurrentThreadId();
-- 
1.8.3.1




* [PATCH V3 4/9] vl: helper to request exec
  2025-08-14 17:17 [PATCH V3 0/9] Live update: cpr-exec Steve Sistare
                   ` (2 preceding siblings ...)
  2025-08-14 17:17 ` [PATCH V3 3/9] oslib: qemu_clear_cloexec Steve Sistare
@ 2025-08-14 17:17 ` Steve Sistare
  2025-09-09 15:51   ` Peter Xu
  2025-08-14 17:17 ` [PATCH V3 5/9] migration: cpr-exec-command parameter Steve Sistare
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 47+ messages in thread
From: Steve Sistare @ 2025-08-14 17:17 UTC (permalink / raw)
  To: qemu-devel
  Cc: Fabiano Rosas, Peter Xu, Markus Armbruster, Paolo Bonzini,
	Eric Blake, Dr. David Alan Gilbert, Steve Sistare

Add a qemu_system_exec_request() hook that causes the main loop to exit and
exec a command using the specified arguments.  This will be used during CPR
to exec a new version of QEMU.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 include/system/runstate.h |  3 +++
 system/runstate.c         | 29 +++++++++++++++++++++++++++++
 2 files changed, 32 insertions(+)

diff --git a/include/system/runstate.h b/include/system/runstate.h
index 929379a..c005f49 100644
--- a/include/system/runstate.h
+++ b/include/system/runstate.h
@@ -128,6 +128,8 @@ typedef enum WakeupReason {
     QEMU_WAKEUP_REASON_OTHER,
 } WakeupReason;
 
+typedef void (*qemu_exec_func)(char **exec_argv);
+
 void qemu_system_reset_request(ShutdownCause reason);
 void qemu_system_suspend_request(void);
 void qemu_register_suspend_notifier(Notifier *notifier);
@@ -139,6 +141,7 @@ void qemu_register_wakeup_support(void);
 void qemu_system_shutdown_request_with_code(ShutdownCause reason,
                                             int exit_code);
 void qemu_system_shutdown_request(ShutdownCause reason);
+void qemu_system_exec_request(qemu_exec_func func, const strList *args);
 void qemu_system_powerdown_request(void);
 void qemu_register_powerdown_notifier(Notifier *notifier);
 void qemu_register_shutdown_notifier(Notifier *notifier);
diff --git a/system/runstate.c b/system/runstate.c
index 6178b00..b4980ff 100644
--- a/system/runstate.c
+++ b/system/runstate.c
@@ -41,6 +41,7 @@
 #include "qapi/error.h"
 #include "qapi/qapi-commands-run-state.h"
 #include "qapi/qapi-events-run-state.h"
+#include "qapi/type-helpers.h"
 #include "qemu/accel.h"
 #include "qemu/error-report.h"
 #include "qemu/job.h"
@@ -422,6 +423,8 @@ static NotifierList wakeup_notifiers =
 static NotifierList shutdown_notifiers =
     NOTIFIER_LIST_INITIALIZER(shutdown_notifiers);
 static uint32_t wakeup_reason_mask = ~(1 << QEMU_WAKEUP_REASON_NONE);
+qemu_exec_func exec_func;
+static char **exec_argv;
 
 ShutdownCause qemu_shutdown_requested_get(void)
 {
@@ -443,6 +446,11 @@ static int qemu_shutdown_requested(void)
     return qatomic_xchg(&shutdown_requested, SHUTDOWN_CAUSE_NONE);
 }
 
+static int qemu_exec_requested(void)
+{
+    return exec_argv != NULL;
+}
+
 static void qemu_kill_report(void)
 {
     if (!qtest_driver() && shutdown_signal) {
@@ -803,6 +811,23 @@ void qemu_system_shutdown_request(ShutdownCause reason)
     qemu_notify_event();
 }
 
+static void qemu_system_exec(void)
+{
+    exec_func(exec_argv);
+
+    /* exec failed */
+    g_strfreev(exec_argv);
+    exec_argv = NULL;
+    exec_func = NULL;
+}
+
+void qemu_system_exec_request(qemu_exec_func func, const strList *args)
+{
+    exec_func = func;
+    exec_argv = strv_from_str_list(args);
+    qemu_notify_event();
+}
+
 static void qemu_system_powerdown(void)
 {
     qapi_event_send_powerdown();
@@ -849,6 +874,10 @@ static bool main_loop_should_exit(int *status)
     if (qemu_suspend_requested()) {
         qemu_system_suspend();
     }
+    if (qemu_exec_requested()) {
+        qemu_system_exec();
+        return false;
+    }
     request = qemu_shutdown_requested();
     if (request) {
         qemu_kill_report();
-- 
1.8.3.1




* Re: [PATCH V3 4/9] vl: helper to request exec
  2025-08-14 17:17 ` [PATCH V3 4/9] vl: helper to request exec Steve Sistare
@ 2025-09-09 15:51   ` Peter Xu
  2025-09-12 14:49     ` Steven Sistare
  0 siblings, 1 reply; 47+ messages in thread
From: Peter Xu @ 2025-09-09 15:51 UTC (permalink / raw)
  To: Steve Sistare
  Cc: qemu-devel, Fabiano Rosas, Markus Armbruster, Paolo Bonzini,
	Eric Blake, Dr. David Alan Gilbert

On Thu, Aug 14, 2025 at 10:17:18AM -0700, Steve Sistare wrote:
> Add a qemu_system_exec_request() hook that causes the main loop to exit and
> exec a command using the specified arguments.  This will be used during CPR
> to exec a new version of QEMU.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  include/system/runstate.h |  3 +++
>  system/runstate.c         | 29 +++++++++++++++++++++++++++++
>  2 files changed, 32 insertions(+)
> 
> diff --git a/include/system/runstate.h b/include/system/runstate.h
> index 929379a..c005f49 100644
> --- a/include/system/runstate.h
> +++ b/include/system/runstate.h
> @@ -128,6 +128,8 @@ typedef enum WakeupReason {
>      QEMU_WAKEUP_REASON_OTHER,
>  } WakeupReason;
>  
> +typedef void (*qemu_exec_func)(char **exec_argv);
> +
>  void qemu_system_reset_request(ShutdownCause reason);
>  void qemu_system_suspend_request(void);
>  void qemu_register_suspend_notifier(Notifier *notifier);
> @@ -139,6 +141,7 @@ void qemu_register_wakeup_support(void);
>  void qemu_system_shutdown_request_with_code(ShutdownCause reason,
>                                              int exit_code);
>  void qemu_system_shutdown_request(ShutdownCause reason);
> +void qemu_system_exec_request(qemu_exec_func func, const strList *args);
>  void qemu_system_powerdown_request(void);
>  void qemu_register_powerdown_notifier(Notifier *notifier);
>  void qemu_register_shutdown_notifier(Notifier *notifier);
> diff --git a/system/runstate.c b/system/runstate.c
> index 6178b00..b4980ff 100644
> --- a/system/runstate.c
> +++ b/system/runstate.c
> @@ -41,6 +41,7 @@
>  #include "qapi/error.h"
>  #include "qapi/qapi-commands-run-state.h"
>  #include "qapi/qapi-events-run-state.h"
> +#include "qapi/type-helpers.h"
>  #include "qemu/accel.h"
>  #include "qemu/error-report.h"
>  #include "qemu/job.h"
> @@ -422,6 +423,8 @@ static NotifierList wakeup_notifiers =
>  static NotifierList shutdown_notifiers =
>      NOTIFIER_LIST_INITIALIZER(shutdown_notifiers);
>  static uint32_t wakeup_reason_mask = ~(1 << QEMU_WAKEUP_REASON_NONE);
> +qemu_exec_func exec_func;
> +static char **exec_argv;
>  
>  ShutdownCause qemu_shutdown_requested_get(void)
>  {
> @@ -443,6 +446,11 @@ static int qemu_shutdown_requested(void)
>      return qatomic_xchg(&shutdown_requested, SHUTDOWN_CAUSE_NONE);
>  }
>  
> +static int qemu_exec_requested(void)
> +{
> +    return exec_argv != NULL;
> +}
> +
>  static void qemu_kill_report(void)
>  {
>      if (!qtest_driver() && shutdown_signal) {
> @@ -803,6 +811,23 @@ void qemu_system_shutdown_request(ShutdownCause reason)
>      qemu_notify_event();
>  }
>  
> +static void qemu_system_exec(void)
> +{
> +    exec_func(exec_argv);
> +
> +    /* exec failed */
> +    g_strfreev(exec_argv);
> +    exec_argv = NULL;
> +    exec_func = NULL;

Would this really happen?

If so, do we at least want to dump something?

> +}
> +
> +void qemu_system_exec_request(qemu_exec_func func, const strList *args)
> +{
> +    exec_func = func;
> +    exec_argv = strv_from_str_list(args);
> +    qemu_notify_event();
> +}
> +
>  static void qemu_system_powerdown(void)
>  {
>      qapi_event_send_powerdown();
> @@ -849,6 +874,10 @@ static bool main_loop_should_exit(int *status)
>      if (qemu_suspend_requested()) {
>          qemu_system_suspend();
>      }
> +    if (qemu_exec_requested()) {
> +        qemu_system_exec();
> +        return false;
> +    }

Some explanation of why it needs to be done explicitly here would be
helpful.  E.g., can we do exec inside a BH scheduled for the main thread?
What if we exec() directly in another thread (rather than the main loop
thread)?

>      request = qemu_shutdown_requested();
>      if (request) {
>          qemu_kill_report();
> -- 
> 1.8.3.1
> 

-- 
Peter Xu




* Re: [PATCH V3 4/9] vl: helper to request exec
  2025-09-09 15:51   ` Peter Xu
@ 2025-09-12 14:49     ` Steven Sistare
  2025-09-15 16:35       ` Peter Xu
  0 siblings, 1 reply; 47+ messages in thread
From: Steven Sistare @ 2025-09-12 14:49 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Fabiano Rosas, Markus Armbruster, Paolo Bonzini,
	Eric Blake, Dr. David Alan Gilbert

On 9/9/2025 11:51 AM, Peter Xu wrote:
> On Thu, Aug 14, 2025 at 10:17:18AM -0700, Steve Sistare wrote:
>> Add a qemu_system_exec_request() hook that causes the main loop to exit and
>> exec a command using the specified arguments.  This will be used during CPR
>> to exec a new version of QEMU.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>>   include/system/runstate.h |  3 +++
>>   system/runstate.c         | 29 +++++++++++++++++++++++++++++
>>   2 files changed, 32 insertions(+)
>>
>> diff --git a/include/system/runstate.h b/include/system/runstate.h
>> index 929379a..c005f49 100644
>> --- a/include/system/runstate.h
>> +++ b/include/system/runstate.h
>> @@ -128,6 +128,8 @@ typedef enum WakeupReason {
>>       QEMU_WAKEUP_REASON_OTHER,
>>   } WakeupReason;
>>   
>> +typedef void (*qemu_exec_func)(char **exec_argv);
>> +
>>   void qemu_system_reset_request(ShutdownCause reason);
>>   void qemu_system_suspend_request(void);
>>   void qemu_register_suspend_notifier(Notifier *notifier);
>> @@ -139,6 +141,7 @@ void qemu_register_wakeup_support(void);
>>   void qemu_system_shutdown_request_with_code(ShutdownCause reason,
>>                                               int exit_code);
>>   void qemu_system_shutdown_request(ShutdownCause reason);
>> +void qemu_system_exec_request(qemu_exec_func func, const strList *args);
>>   void qemu_system_powerdown_request(void);
>>   void qemu_register_powerdown_notifier(Notifier *notifier);
>>   void qemu_register_shutdown_notifier(Notifier *notifier);
>> diff --git a/system/runstate.c b/system/runstate.c
>> index 6178b00..b4980ff 100644
>> --- a/system/runstate.c
>> +++ b/system/runstate.c
>> @@ -41,6 +41,7 @@
>>   #include "qapi/error.h"
>>   #include "qapi/qapi-commands-run-state.h"
>>   #include "qapi/qapi-events-run-state.h"
>> +#include "qapi/type-helpers.h"
>>   #include "qemu/accel.h"
>>   #include "qemu/error-report.h"
>>   #include "qemu/job.h"
>> @@ -422,6 +423,8 @@ static NotifierList wakeup_notifiers =
>>   static NotifierList shutdown_notifiers =
>>       NOTIFIER_LIST_INITIALIZER(shutdown_notifiers);
>>   static uint32_t wakeup_reason_mask = ~(1 << QEMU_WAKEUP_REASON_NONE);
>> +qemu_exec_func exec_func;
>> +static char **exec_argv;
>>   
>>   ShutdownCause qemu_shutdown_requested_get(void)
>>   {
>> @@ -443,6 +446,11 @@ static int qemu_shutdown_requested(void)
>>       return qatomic_xchg(&shutdown_requested, SHUTDOWN_CAUSE_NONE);
>>   }
>>   
>> +static int qemu_exec_requested(void)
>> +{
>> +    return exec_argv != NULL;
>> +}
>> +
>>   static void qemu_kill_report(void)
>>   {
>>       if (!qtest_driver() && shutdown_signal) {
>> @@ -803,6 +811,23 @@ void qemu_system_shutdown_request(ShutdownCause reason)
>>       qemu_notify_event();
>>   }
>>   
>> +static void qemu_system_exec(void)
>> +{
>> +    exec_func(exec_argv);
>> +
>> +    /* exec failed */
>> +    g_strfreev(exec_argv);
>> +    exec_argv = NULL;
>> +    exec_func = NULL;
> 
> Would this really happen?
> 
> If so, do we at least want to dump something?
> 
>> +}
>> +
>> +void qemu_system_exec_request(qemu_exec_func func, const strList *args)
>> +{
>> +    exec_func = func;
>> +    exec_argv = strv_from_str_list(args);
>> +    qemu_notify_event();
>> +}
>> +
>>   static void qemu_system_powerdown(void)
>>   {
>>       qapi_event_send_powerdown();
>> @@ -849,6 +874,10 @@ static bool main_loop_should_exit(int *status)
>>       if (qemu_suspend_requested()) {
>>           qemu_system_suspend();
>>       }
>> +    if (qemu_exec_requested()) {
>> +        qemu_system_exec();
>> +        return false;
>> +    }
> 
> Some explanation of why it needs to be done explicitly here would be
> helpful.  E.g., can we do exec inside a BH scheduled for the main thread?
> What if we exec() directly in another thread (rather than the main loop
> thread)?

A BH is a good idea, thanks.
It only requires a few lines of code, and no globals.
I will drop this patch and add a BH to patch "migration: cpr-exec mode".

- Steve




* Re: [PATCH V3 4/9] vl: helper to request exec
  2025-09-12 14:49     ` Steven Sistare
@ 2025-09-15 16:35       ` Peter Xu
  2025-09-19 15:27         ` Steven Sistare
  0 siblings, 1 reply; 47+ messages in thread
From: Peter Xu @ 2025-09-15 16:35 UTC (permalink / raw)
  To: Steven Sistare
  Cc: qemu-devel, Fabiano Rosas, Markus Armbruster, Paolo Bonzini,
	Eric Blake, Dr. David Alan Gilbert

On Fri, Sep 12, 2025 at 10:49:23AM -0400, Steven Sistare wrote:
> On 9/9/2025 11:51 AM, Peter Xu wrote:
> > On Thu, Aug 14, 2025 at 10:17:18AM -0700, Steve Sistare wrote:
> > > Add a qemu_system_exec_request() hook that causes the main loop to exit and
> > > exec a command using the specified arguments.  This will be used during CPR
> > > to exec a new version of QEMU.
> > > 
> > > Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> > > ---
> > >   include/system/runstate.h |  3 +++
> > >   system/runstate.c         | 29 +++++++++++++++++++++++++++++
> > >   2 files changed, 32 insertions(+)
> > > 
> > > diff --git a/include/system/runstate.h b/include/system/runstate.h
> > > index 929379a..c005f49 100644
> > > --- a/include/system/runstate.h
> > > +++ b/include/system/runstate.h
> > > @@ -128,6 +128,8 @@ typedef enum WakeupReason {
> > >       QEMU_WAKEUP_REASON_OTHER,
> > >   } WakeupReason;
> > > +typedef void (*qemu_exec_func)(char **exec_argv);
> > > +
> > >   void qemu_system_reset_request(ShutdownCause reason);
> > >   void qemu_system_suspend_request(void);
> > >   void qemu_register_suspend_notifier(Notifier *notifier);
> > > @@ -139,6 +141,7 @@ void qemu_register_wakeup_support(void);
> > >   void qemu_system_shutdown_request_with_code(ShutdownCause reason,
> > >                                               int exit_code);
> > >   void qemu_system_shutdown_request(ShutdownCause reason);
> > > +void qemu_system_exec_request(qemu_exec_func func, const strList *args);
> > >   void qemu_system_powerdown_request(void);
> > >   void qemu_register_powerdown_notifier(Notifier *notifier);
> > >   void qemu_register_shutdown_notifier(Notifier *notifier);
> > > diff --git a/system/runstate.c b/system/runstate.c
> > > index 6178b00..b4980ff 100644
> > > --- a/system/runstate.c
> > > +++ b/system/runstate.c
> > > @@ -41,6 +41,7 @@
> > >   #include "qapi/error.h"
> > >   #include "qapi/qapi-commands-run-state.h"
> > >   #include "qapi/qapi-events-run-state.h"
> > > +#include "qapi/type-helpers.h"
> > >   #include "qemu/accel.h"
> > >   #include "qemu/error-report.h"
> > >   #include "qemu/job.h"
> > > @@ -422,6 +423,8 @@ static NotifierList wakeup_notifiers =
> > >   static NotifierList shutdown_notifiers =
> > >       NOTIFIER_LIST_INITIALIZER(shutdown_notifiers);
> > >   static uint32_t wakeup_reason_mask = ~(1 << QEMU_WAKEUP_REASON_NONE);
> > > +qemu_exec_func exec_func;
> > > +static char **exec_argv;
> > >   ShutdownCause qemu_shutdown_requested_get(void)
> > >   {
> > > @@ -443,6 +446,11 @@ static int qemu_shutdown_requested(void)
> > >       return qatomic_xchg(&shutdown_requested, SHUTDOWN_CAUSE_NONE);
> > >   }
> > > +static int qemu_exec_requested(void)
> > > +{
> > > +    return exec_argv != NULL;
> > > +}
> > > +
> > >   static void qemu_kill_report(void)
> > >   {
> > >       if (!qtest_driver() && shutdown_signal) {
> > > @@ -803,6 +811,23 @@ void qemu_system_shutdown_request(ShutdownCause reason)
> > >       qemu_notify_event();
> > >   }
> > > +static void qemu_system_exec(void)
> > > +{
> > > +    exec_func(exec_argv);
> > > +
> > > +    /* exec failed */
> > > +    g_strfreev(exec_argv);
> > > +    exec_argv = NULL;
> > > +    exec_func = NULL;
> > 
> > Would this really happen?
> > 
> > If so, do we at least want to dump something?
> > 
> > > +}
> > > +
> > > +void qemu_system_exec_request(qemu_exec_func func, const strList *args)
> > > +{
> > > +    exec_func = func;
> > > +    exec_argv = strv_from_str_list(args);
> > > +    qemu_notify_event();
> > > +}
> > > +
> > >   static void qemu_system_powerdown(void)
> > >   {
> > >       qapi_event_send_powerdown();
> > > @@ -849,6 +874,10 @@ static bool main_loop_should_exit(int *status)
> > >       if (qemu_suspend_requested()) {
> > >           qemu_system_suspend();
> > >       }
> > > +    if (qemu_exec_requested()) {
> > > +        qemu_system_exec();
> > > +        return false;
> > > +    }
> > 
> > Some explanation of why it needs to be done explicitly here would be
> > helpful.  E.g., can we do exec inside a BH scheduled for the main thread?
> > What if we exec() directly in another thread (rather than the main loop
> > thread)?
> 
> A BH is a good idea, thanks.
> It only requires a few lines of code, and no globals.
> I will drop this patch and add a BH to patch "migration: cpr-exec mode".

That would be better, thanks.

Then, what happens if we exec() in the migration thread directly?  IOW, is
a BH required (to serialize with something happening in the main thread), or
does it just look slightly cleaner when the exec happens in the main thread?

This info would be great to put into the commit message, but if there's no
obvious reason, IMHO we _could_ exec() directly in the migration thread, as
I don't see anything that needs to be synchronized with the main thread.
Anyway, if we want to exec(), IMHO it would be best to keep it as
straightforward as possible.

Thanks,

-- 
Peter Xu




* Re: [PATCH V3 4/9] vl: helper to request exec
  2025-09-15 16:35       ` Peter Xu
@ 2025-09-19 15:27         ` Steven Sistare
  0 siblings, 0 replies; 47+ messages in thread
From: Steven Sistare @ 2025-09-19 15:27 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Fabiano Rosas, Markus Armbruster, Paolo Bonzini,
	Eric Blake, Dr. David Alan Gilbert

On 9/15/2025 12:35 PM, Peter Xu wrote:
> On Fri, Sep 12, 2025 at 10:49:23AM -0400, Steven Sistare wrote:
>> On 9/9/2025 11:51 AM, Peter Xu wrote:
>>> On Thu, Aug 14, 2025 at 10:17:18AM -0700, Steve Sistare wrote:
>>>> Add a qemu_system_exec_request() hook that causes the main loop to exit and
>>>> exec a command using the specified arguments.  This will be used during CPR
>>>> to exec a new version of QEMU.
>>>>
>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>> ---
>>>>    include/system/runstate.h |  3 +++
>>>>    system/runstate.c         | 29 +++++++++++++++++++++++++++++
>>>>    2 files changed, 32 insertions(+)
>>>>
>>>> diff --git a/include/system/runstate.h b/include/system/runstate.h
>>>> index 929379a..c005f49 100644
>>>> --- a/include/system/runstate.h
>>>> +++ b/include/system/runstate.h
>>>> @@ -128,6 +128,8 @@ typedef enum WakeupReason {
>>>>        QEMU_WAKEUP_REASON_OTHER,
>>>>    } WakeupReason;
>>>> +typedef void (*qemu_exec_func)(char **exec_argv);
>>>> +
>>>>    void qemu_system_reset_request(ShutdownCause reason);
>>>>    void qemu_system_suspend_request(void);
>>>>    void qemu_register_suspend_notifier(Notifier *notifier);
>>>> @@ -139,6 +141,7 @@ void qemu_register_wakeup_support(void);
>>>>    void qemu_system_shutdown_request_with_code(ShutdownCause reason,
>>>>                                                int exit_code);
>>>>    void qemu_system_shutdown_request(ShutdownCause reason);
>>>> +void qemu_system_exec_request(qemu_exec_func func, const strList *args);
>>>>    void qemu_system_powerdown_request(void);
>>>>    void qemu_register_powerdown_notifier(Notifier *notifier);
>>>>    void qemu_register_shutdown_notifier(Notifier *notifier);
>>>> diff --git a/system/runstate.c b/system/runstate.c
>>>> index 6178b00..b4980ff 100644
>>>> --- a/system/runstate.c
>>>> +++ b/system/runstate.c
>>>> @@ -41,6 +41,7 @@
>>>>    #include "qapi/error.h"
>>>>    #include "qapi/qapi-commands-run-state.h"
>>>>    #include "qapi/qapi-events-run-state.h"
>>>> +#include "qapi/type-helpers.h"
>>>>    #include "qemu/accel.h"
>>>>    #include "qemu/error-report.h"
>>>>    #include "qemu/job.h"
>>>> @@ -422,6 +423,8 @@ static NotifierList wakeup_notifiers =
>>>>    static NotifierList shutdown_notifiers =
>>>>        NOTIFIER_LIST_INITIALIZER(shutdown_notifiers);
>>>>    static uint32_t wakeup_reason_mask = ~(1 << QEMU_WAKEUP_REASON_NONE);
>>>> +qemu_exec_func exec_func;
>>>> +static char **exec_argv;
>>>>    ShutdownCause qemu_shutdown_requested_get(void)
>>>>    {
>>>> @@ -443,6 +446,11 @@ static int qemu_shutdown_requested(void)
>>>>        return qatomic_xchg(&shutdown_requested, SHUTDOWN_CAUSE_NONE);
>>>>    }
>>>> +static int qemu_exec_requested(void)
>>>> +{
>>>> +    return exec_argv != NULL;
>>>> +}
>>>> +
>>>>    static void qemu_kill_report(void)
>>>>    {
>>>>        if (!qtest_driver() && shutdown_signal) {
>>>> @@ -803,6 +811,23 @@ void qemu_system_shutdown_request(ShutdownCause reason)
>>>>        qemu_notify_event();
>>>>    }
>>>> +static void qemu_system_exec(void)
>>>> +{
>>>> +    exec_func(exec_argv);
>>>> +
>>>> +    /* exec failed */
>>>> +    g_strfreev(exec_argv);
>>>> +    exec_argv = NULL;
>>>> +    exec_func = NULL;
>>>
>>> Would this really happen?
>>>
>>> If so, do we at least want to dump something?
>>>
>>>> +}
>>>> +
>>>> +void qemu_system_exec_request(qemu_exec_func func, const strList *args)
>>>> +{
>>>> +    exec_func = func;
>>>> +    exec_argv = strv_from_str_list(args);
>>>> +    qemu_notify_event();
>>>> +}
>>>> +
>>>>    static void qemu_system_powerdown(void)
>>>>    {
>>>>        qapi_event_send_powerdown();
>>>> @@ -849,6 +874,10 @@ static bool main_loop_should_exit(int *status)
>>>>        if (qemu_suspend_requested()) {
>>>>            qemu_system_suspend();
>>>>        }
>>>> +    if (qemu_exec_requested()) {
>>>> +        qemu_system_exec();
>>>> +        return false;
>>>> +    }
>>>
>>> Some explanation of why it needs to be done explicitly here would be
>>> helpful.  E.g., can we do exec inside a BH scheduled for the main thread?
>>> What if we exec() directly in another thread (rather than the main loop
>>> thread)?
>>
>> A BH is a good idea, thanks.
>> It only requires a few lines of code, and no globals.
>> I will drop this patch and add a BH to patch "migration: cpr-exec mode".
> 
> That would be better, thanks.
> 
> Then, what happens if we exec() in the migration thread directly?  IOW, is
> a BH required (to serialize with something happening in the main thread), or
> does it just look slightly cleaner when the exec happens in the main thread?
> 
> This info would be great to put into the commit message, but if there's no
> obvious reason, IMHO we _could_ exec() directly in the migration thread, as
> I don't see anything that needs to be synchronized with the main thread.
> Anyway, if we want to exec(), IMHO it would be best to keep it as
> straightforward as possible.

A direct exec is not desirable for two reasons.

First, cpr_exec_notifier is called in the middle of processing the
MIG_EVENT_PRECOPY_DONE notifiers.  Some notifiers may fall in the list after
cpr_exec_notifier, and they would never be called.

Second, cpr_exec_notifier is not called in migration_thread.  It is called
from a bh:
   migration_bh_dispatch_bh -> migration_cleanup -> migration_call_notifiers -> cpr_exec_notifier
so it is cleanest to post another bh to do the actual exec.

I will add those notes to the commit message.

- Steve




* [PATCH V3 5/9] migration: cpr-exec-command parameter
  2025-08-14 17:17 [PATCH V3 0/9] Live update: cpr-exec Steve Sistare
                   ` (3 preceding siblings ...)
  2025-08-14 17:17 ` [PATCH V3 4/9] vl: helper to request exec Steve Sistare
@ 2025-08-14 17:17 ` Steve Sistare
  2025-09-08 16:07   ` Daniel P. Berrangé
  2025-09-11 15:10   ` Markus Armbruster
  2025-08-14 17:17 ` [PATCH V3 6/9] migration: cpr-exec save and load Steve Sistare
                   ` (5 subsequent siblings)
  10 siblings, 2 replies; 47+ messages in thread
From: Steve Sistare @ 2025-08-14 17:17 UTC (permalink / raw)
  To: qemu-devel
  Cc: Fabiano Rosas, Peter Xu, Markus Armbruster, Paolo Bonzini,
	Eric Blake, Dr. David Alan Gilbert, Steve Sistare

Create the cpr-exec-command migration parameter, defined as a list of
strings.  It will be used for cpr-exec migration mode in a subsequent
patch, and contains forward references to cpr-exec mode in the qapi
doc.

No functional change, except that cpr-exec-command is shown by the
'info migrate' command.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 qapi/migration.json            | 21 ++++++++++++++++++---
 migration/migration-hmp-cmds.c | 25 +++++++++++++++++++++++++
 migration/options.c            | 14 ++++++++++++++
 hmp-commands.hx                |  2 +-
 4 files changed, 58 insertions(+), 4 deletions(-)

diff --git a/qapi/migration.json b/qapi/migration.json
index 2387c21..ea410fd 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -924,6 +924,10 @@
 #     only has effect if the @mapped-ram capability is enabled.
 #     (Since 9.1)
 #
+# @cpr-exec-command: Command to start the new QEMU process when @mode
+#     is @cpr-exec.  The first list element is the program's filename,
+#     the remainder its arguments. (Since 10.2)
+#
 # Features:
 #
 # @unstable: Members @x-checkpoint-delay and
@@ -950,7 +954,8 @@
            'vcpu-dirty-limit',
            'mode',
            'zero-page-detection',
-           'direct-io'] }
+           'direct-io',
+           'cpr-exec-command'] }
 
 ##
 # @MigrateSetParameters:
@@ -1105,6 +1110,10 @@
 #     only has effect if the @mapped-ram capability is enabled.
 #     (Since 9.1)
 #
+# @cpr-exec-command: Command to start the new QEMU process when @mode
+#     is @cpr-exec.  The first list element is the program's filename,
+#     the remainder its arguments. (Since 10.2)
+#
 # Features:
 #
 # @unstable: Members @x-checkpoint-delay and
@@ -1146,7 +1155,8 @@
             '*vcpu-dirty-limit': 'uint64',
             '*mode': 'MigMode',
             '*zero-page-detection': 'ZeroPageDetection',
-            '*direct-io': 'bool' } }
+            '*direct-io': 'bool',
+            '*cpr-exec-command': [ 'str' ]} }
 
 ##
 # @migrate-set-parameters:
@@ -1315,6 +1325,10 @@
 #     only has effect if the @mapped-ram capability is enabled.
 #     (Since 9.1)
 #
+# @cpr-exec-command: Command to start the new QEMU process when @mode
+#     is @cpr-exec.  The first list element is the program's filename,
+#     the remainder its arguments. (Since 10.2)
+#
 # Features:
 #
 # @unstable: Members @x-checkpoint-delay and
@@ -1353,7 +1367,8 @@
             '*vcpu-dirty-limit': 'uint64',
             '*mode': 'MigMode',
             '*zero-page-detection': 'ZeroPageDetection',
-            '*direct-io': 'bool' } }
+            '*direct-io': 'bool',
+            '*cpr-exec-command': [ 'str' ]} }
 
 ##
 # @query-migrate-parameters:
diff --git a/migration/migration-hmp-cmds.c b/migration/migration-hmp-cmds.c
index 0fc21f0..79aa528 100644
--- a/migration/migration-hmp-cmds.c
+++ b/migration/migration-hmp-cmds.c
@@ -306,6 +306,18 @@ void hmp_info_migrate_capabilities(Monitor *mon, const QDict *qdict)
     qapi_free_MigrationCapabilityStatusList(caps);
 }
 
+static void monitor_print_cpr_exec_command(Monitor *mon, strList *args)
+{
+    monitor_printf(mon, "%s:",
+        MigrationParameter_str(MIGRATION_PARAMETER_CPR_EXEC_COMMAND));
+
+    while (args) {
+        monitor_printf(mon, " %s", args->value);
+        args = args->next;
+    }
+    monitor_printf(mon, "\n");
+}
+
 void hmp_info_migrate_parameters(Monitor *mon, const QDict *qdict)
 {
     MigrationParameters *params;
@@ -435,6 +447,9 @@ void hmp_info_migrate_parameters(Monitor *mon, const QDict *qdict)
                                MIGRATION_PARAMETER_DIRECT_IO),
                            params->direct_io ? "on" : "off");
         }
+
+        assert(params->has_cpr_exec_command);
+        monitor_print_cpr_exec_command(mon, params->cpr_exec_command);
     }
 
     qapi_free_MigrationParameters(params);
@@ -716,6 +731,16 @@ void hmp_migrate_set_parameter(Monitor *mon, const QDict *qdict)
         p->has_direct_io = true;
         visit_type_bool(v, param, &p->direct_io, &err);
         break;
+    case MIGRATION_PARAMETER_CPR_EXEC_COMMAND: {
+        g_autofree char **strv = g_strsplit(valuestr ?: "", " ", -1);
+        strList **tail = &p->cpr_exec_command;
+
+        for (int i = 0; strv[i]; i++) {
+            QAPI_LIST_APPEND(tail, strv[i]);
+        }
+        p->has_cpr_exec_command = true;
+        break;
+    }
     default:
         g_assert_not_reached();
     }
diff --git a/migration/options.c b/migration/options.c
index 4e923a2..5183112 100644
--- a/migration/options.c
+++ b/migration/options.c
@@ -959,6 +959,9 @@ MigrationParameters *qmp_query_migrate_parameters(Error **errp)
     params->zero_page_detection = s->parameters.zero_page_detection;
     params->has_direct_io = true;
     params->direct_io = s->parameters.direct_io;
+    params->has_cpr_exec_command = true;
+    params->cpr_exec_command = QAPI_CLONE(strList,
+                                          s->parameters.cpr_exec_command);
 
     return params;
 }
@@ -993,6 +996,7 @@ void migrate_params_init(MigrationParameters *params)
     params->has_mode = true;
     params->has_zero_page_detection = true;
     params->has_direct_io = true;
+    params->has_cpr_exec_command = true;
 }
 
 /*
@@ -1297,6 +1301,10 @@ static void migrate_params_test_apply(MigrateSetParameters *params,
     if (params->has_direct_io) {
         dest->direct_io = params->direct_io;
     }
+
+    if (params->has_cpr_exec_command) {
+        dest->cpr_exec_command = params->cpr_exec_command;
+    }
 }
 
 static void migrate_params_apply(MigrateSetParameters *params, Error **errp)
@@ -1429,6 +1437,12 @@ static void migrate_params_apply(MigrateSetParameters *params, Error **errp)
     if (params->has_direct_io) {
         s->parameters.direct_io = params->direct_io;
     }
+
+    if (params->has_cpr_exec_command) {
+        qapi_free_strList(s->parameters.cpr_exec_command);
+        s->parameters.cpr_exec_command =
+            QAPI_CLONE(strList, params->cpr_exec_command);
+    }
 }
 
 void qmp_migrate_set_parameters(MigrateSetParameters *params, Error **errp)
diff --git a/hmp-commands.hx b/hmp-commands.hx
index d0e4f35..3cace8f 100644
--- a/hmp-commands.hx
+++ b/hmp-commands.hx
@@ -1009,7 +1009,7 @@ ERST
 
     {
         .name       = "migrate_set_parameter",
-        .args_type  = "parameter:s,value:s",
+        .args_type  = "parameter:s,value:S",
         .params     = "parameter value",
         .help       = "Set the parameter for migration",
         .cmd        = hmp_migrate_set_parameter,
-- 
1.8.3.1




* Re: [PATCH V3 5/9] migration: cpr-exec-command parameter
  2025-08-14 17:17 ` [PATCH V3 5/9] migration: cpr-exec-command parameter Steve Sistare
@ 2025-09-08 16:07   ` Daniel P. Berrangé
  2025-09-09 15:22     ` Steven Sistare
  2025-09-11 15:10   ` Markus Armbruster
  1 sibling, 1 reply; 47+ messages in thread
From: Daniel P. Berrangé @ 2025-09-08 16:07 UTC (permalink / raw)
  To: Steve Sistare
  Cc: qemu-devel, Fabiano Rosas, Peter Xu, Markus Armbruster,
	Paolo Bonzini, Eric Blake, Dr. David Alan Gilbert

On Thu, Aug 14, 2025 at 10:17:19AM -0700, Steve Sistare wrote:
> Create the cpr-exec-command migration parameter, defined as a list of
> strings.  It will be used for cpr-exec migration mode in a subsequent
> patch, and contains forward references to cpr-exec mode in the qapi
> doc.
> 
> No functional change, except that cpr-exec-command is shown by the
> 'info migrate' command.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  qapi/migration.json            | 21 ++++++++++++++++++---
>  migration/migration-hmp-cmds.c | 25 +++++++++++++++++++++++++
>  migration/options.c            | 14 ++++++++++++++
>  hmp-commands.hx                |  2 +-
>  4 files changed, 58 insertions(+), 4 deletions(-)

> diff --git a/migration/migration-hmp-cmds.c b/migration/migration-hmp-cmds.c
> index 0fc21f0..79aa528 100644
> --- a/migration/migration-hmp-cmds.c
> +++ b/migration/migration-hmp-cmds.c
> @@ -306,6 +306,18 @@ void hmp_info_migrate_capabilities(Monitor *mon, const QDict *qdict)
>      qapi_free_MigrationCapabilityStatusList(caps);
>  }
>  
> +static void monitor_print_cpr_exec_command(Monitor *mon, strList *args)
> +{
> +    monitor_printf(mon, "%s:",
> +        MigrationParameter_str(MIGRATION_PARAMETER_CPR_EXEC_COMMAND));
> +
> +    while (args) {
> +        monitor_printf(mon, " %s", args->value);
> +        args = args->next;
> +    }
> +    monitor_printf(mon, "\n");
> +}
> +
>  void hmp_info_migrate_parameters(Monitor *mon, const QDict *qdict)
>  {
>      MigrationParameters *params;
> @@ -435,6 +447,9 @@ void hmp_info_migrate_parameters(Monitor *mon, const QDict *qdict)
>                                 MIGRATION_PARAMETER_DIRECT_IO),
>                             params->direct_io ? "on" : "off");
>          }
> +
> +        assert(params->has_cpr_exec_command);
> +        monitor_print_cpr_exec_command(mon, params->cpr_exec_command);
>      }
>  
>      qapi_free_MigrationParameters(params);
> @@ -716,6 +731,16 @@ void hmp_migrate_set_parameter(Monitor *mon, const QDict *qdict)
>          p->has_direct_io = true;
>          visit_type_bool(v, param, &p->direct_io, &err);
>          break;
> +    case MIGRATION_PARAMETER_CPR_EXEC_COMMAND: {
> +        g_autofree char **strv = g_strsplit(valuestr ?: "", " ", -1);


Perhaps we should use g_shell_parse_argv() in the HMP case?  IIUC,
it should handle quoting for args containing whitespace (as long as
HMP itself has not already mangled that).

> +        strList **tail = &p->cpr_exec_command;
> +
> +        for (int i = 0; strv[i]; i++) {
> +            QAPI_LIST_APPEND(tail, strv[i]);
> +        }
> +        p->has_cpr_exec_command = true;
> +        break;
> +    }
>      default:
>          g_assert_not_reached();
>      }

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|




* Re: [PATCH V3 5/9] migration: cpr-exec-command parameter
  2025-09-08 16:07   ` Daniel P. Berrangé
@ 2025-09-09 15:22     ` Steven Sistare
  0 siblings, 0 replies; 47+ messages in thread
From: Steven Sistare @ 2025-09-09 15:22 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: qemu-devel, Fabiano Rosas, Peter Xu, Markus Armbruster,
	Paolo Bonzini, Eric Blake, Dr. David Alan Gilbert

On 9/8/2025 12:07 PM, Daniel P. Berrangé wrote:
> On Thu, Aug 14, 2025 at 10:17:19AM -0700, Steve Sistare wrote:
>> Create the cpr-exec-command migration parameter, defined as a list of
>> strings.  It will be used for cpr-exec migration mode in a subsequent
>> patch, and contains forward references to cpr-exec mode in the qapi
>> doc.
>>
>> No functional change, except that cpr-exec-command is shown by the
>> 'info migrate' command.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>>   qapi/migration.json            | 21 ++++++++++++++++++---
>>   migration/migration-hmp-cmds.c | 25 +++++++++++++++++++++++++
>>   migration/options.c            | 14 ++++++++++++++
>>   hmp-commands.hx                |  2 +-
>>   4 files changed, 58 insertions(+), 4 deletions(-)
> 
>> diff --git a/migration/migration-hmp-cmds.c b/migration/migration-hmp-cmds.c
>> index 0fc21f0..79aa528 100644
>> --- a/migration/migration-hmp-cmds.c
>> +++ b/migration/migration-hmp-cmds.c
>> @@ -306,6 +306,18 @@ void hmp_info_migrate_capabilities(Monitor *mon, const QDict *qdict)
>>       qapi_free_MigrationCapabilityStatusList(caps);
>>   }
>>   
>> +static void monitor_print_cpr_exec_command(Monitor *mon, strList *args)
>> +{
>> +    monitor_printf(mon, "%s:",
>> +        MigrationParameter_str(MIGRATION_PARAMETER_CPR_EXEC_COMMAND));
>> +
>> +    while (args) {
>> +        monitor_printf(mon, " %s", args->value);
>> +        args = args->next;
>> +    }
>> +    monitor_printf(mon, "\n");
>> +}
>> +
>>   void hmp_info_migrate_parameters(Monitor *mon, const QDict *qdict)
>>   {
>>       MigrationParameters *params;
>> @@ -435,6 +447,9 @@ void hmp_info_migrate_parameters(Monitor *mon, const QDict *qdict)
>>                                  MIGRATION_PARAMETER_DIRECT_IO),
>>                              params->direct_io ? "on" : "off");
>>           }
>> +
>> +        assert(params->has_cpr_exec_command);
>> +        monitor_print_cpr_exec_command(mon, params->cpr_exec_command);
>>       }
>>   
>>       qapi_free_MigrationParameters(params);
>> @@ -716,6 +731,16 @@ void hmp_migrate_set_parameter(Monitor *mon, const QDict *qdict)
>>           p->has_direct_io = true;
>>           visit_type_bool(v, param, &p->direct_io, &err);
>>           break;
>> +    case MIGRATION_PARAMETER_CPR_EXEC_COMMAND: {
>> +        g_autofree char **strv = g_strsplit(valuestr ?: "", " ", -1);
> 
> 
> Perhaps we should use g_shell_parse_argv() in the HMP case?  IIUC,
> it should handle quoting for args containing whitespace (as long as
> HMP itself has not already mangled that).

Thank you, Daniel, that is a good idea.
I verified it works with HMP:

$ build/qemu-system-x86_64 -display none -monitor stdio
(qemu) migrate_set_parameter cpr-exec-command 'a b' c
[0] = a b
[1] = c
(qemu) migrate_set_parameter cpr-exec-command "a b" c
[0] = a b
[1] = c

- Steve




* Re: [PATCH V3 5/9] migration: cpr-exec-command parameter
  2025-08-14 17:17 ` [PATCH V3 5/9] migration: cpr-exec-command parameter Steve Sistare
  2025-09-08 16:07   ` Daniel P. Berrangé
@ 2025-09-11 15:10   ` Markus Armbruster
  2025-09-12 14:48     ` Steven Sistare
  1 sibling, 1 reply; 47+ messages in thread
From: Markus Armbruster @ 2025-09-11 15:10 UTC (permalink / raw)
  To: Steve Sistare
  Cc: qemu-devel, Fabiano Rosas, Peter Xu, Paolo Bonzini, Eric Blake,
	Dr. David Alan Gilbert

Steve Sistare <steven.sistare@oracle.com> writes:

> Create the cpr-exec-command migration parameter, defined as a list of
> strings.  It will be used for cpr-exec migration mode in a subsequent
> patch, and contains forward references to cpr-exec mode in the qapi
> doc.
>
> No functional change, except that cpr-exec-command is shown by the
> 'info migrate' command.
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  qapi/migration.json            | 21 ++++++++++++++++++---
>  migration/migration-hmp-cmds.c | 25 +++++++++++++++++++++++++
>  migration/options.c            | 14 ++++++++++++++
>  hmp-commands.hx                |  2 +-
>  4 files changed, 58 insertions(+), 4 deletions(-)
>
> diff --git a/qapi/migration.json b/qapi/migration.json
> index 2387c21..ea410fd 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -924,6 +924,10 @@
>  #     only has effect if the @mapped-ram capability is enabled.
>  #     (Since 9.1)
>  #
> +# @cpr-exec-command: Command to start the new QEMU process when @mode
> +#     is @cpr-exec.  The first list element is the program's filename,
> +#     the remainder its arguments. (Since 10.2)

Please add a second space in ". (" for all three copies.

> +#
>  # Features:
>  #
>  # @unstable: Members @x-checkpoint-delay and
> @@ -950,7 +954,8 @@
>             'vcpu-dirty-limit',
>             'mode',
>             'zero-page-detection',
> -           'direct-io'] }
> +           'direct-io',
> +           'cpr-exec-command'] }
>  
>  ##
>  # @MigrateSetParameters:
> @@ -1105,6 +1110,10 @@
>  #     only has effect if the @mapped-ram capability is enabled.
>  #     (Since 9.1)
>  #
> +# @cpr-exec-command: Command to start the new QEMU process when @mode
> +#     is @cpr-exec.  The first list element is the program's filename,
> +#     the remainder its arguments. (Since 10.2)
> +#
>  # Features:
>  #
>  # @unstable: Members @x-checkpoint-delay and
> @@ -1146,7 +1155,8 @@
>              '*vcpu-dirty-limit': 'uint64',
>              '*mode': 'MigMode',
>              '*zero-page-detection': 'ZeroPageDetection',
> -            '*direct-io': 'bool' } }
> +            '*direct-io': 'bool',
> +            '*cpr-exec-command': [ 'str' ]} }
>  
>  ##
>  # @migrate-set-parameters:
> @@ -1315,6 +1325,10 @@
>  #     only has effect if the @mapped-ram capability is enabled.
>  #     (Since 9.1)
>  #
> +# @cpr-exec-command: Command to start the new QEMU process when @mode
> +#     is @cpr-exec.  The first list element is the program's filename,
> +#     the remainder its arguments. (Since 10.2)
> +#
>  # Features:
>  #
>  # @unstable: Members @x-checkpoint-delay and
> @@ -1353,7 +1367,8 @@
>              '*vcpu-dirty-limit': 'uint64',
>              '*mode': 'MigMode',
>              '*zero-page-detection': 'ZeroPageDetection',
> -            '*direct-io': 'bool' } }
> +            '*direct-io': 'bool',
> +            '*cpr-exec-command': [ 'str' ]} }
>  
>  ##
>  # @query-migrate-parameters:

Acked-by: Markus Armbruster <armbru@redhat.com>


[...]




* Re: [PATCH V3 5/9] migration: cpr-exec-command parameter
  2025-09-11 15:10   ` Markus Armbruster
@ 2025-09-12 14:48     ` Steven Sistare
  0 siblings, 0 replies; 47+ messages in thread
From: Steven Sistare @ 2025-09-12 14:48 UTC (permalink / raw)
  To: Markus Armbruster
  Cc: qemu-devel, Fabiano Rosas, Peter Xu, Paolo Bonzini, Eric Blake,
	Dr. David Alan Gilbert

On 9/11/2025 11:10 AM, Markus Armbruster wrote:
> Steve Sistare <steven.sistare@oracle.com> writes:
> 
>> Create the cpr-exec-command migration parameter, defined as a list of
>> strings.  It will be used for cpr-exec migration mode in a subsequent
>> patch, and contains forward references to cpr-exec mode in the qapi
>> doc.
>>
>> No functional change, except that cpr-exec-command is shown by the
>> 'info migrate' command.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>>   qapi/migration.json            | 21 ++++++++++++++++++---
>>   migration/migration-hmp-cmds.c | 25 +++++++++++++++++++++++++
>>   migration/options.c            | 14 ++++++++++++++
>>   hmp-commands.hx                |  2 +-
>>   4 files changed, 58 insertions(+), 4 deletions(-)
>>
>> diff --git a/qapi/migration.json b/qapi/migration.json
>> index 2387c21..ea410fd 100644
>> --- a/qapi/migration.json
>> +++ b/qapi/migration.json
>> @@ -924,6 +924,10 @@
>>   #     only has effect if the @mapped-ram capability is enabled.
>>   #     (Since 9.1)
>>   #
>> +# @cpr-exec-command: Command to start the new QEMU process when @mode
>> +#     is @cpr-exec.  The first list element is the program's filename,
>> +#     the remainder its arguments. (Since 10.2)
> 
> Please add a second space in ". (" for all three copies.

Sure.  Thanks for reviewing.

- Steve

>> +#
>>   # Features:
>>   #
>>   # @unstable: Members @x-checkpoint-delay and
>> @@ -950,7 +954,8 @@
>>              'vcpu-dirty-limit',
>>              'mode',
>>              'zero-page-detection',
>> -           'direct-io'] }
>> +           'direct-io',
>> +           'cpr-exec-command'] }
>>   
>>   ##
>>   # @MigrateSetParameters:
>> @@ -1105,6 +1110,10 @@
>>   #     only has effect if the @mapped-ram capability is enabled.
>>   #     (Since 9.1)
>>   #
>> +# @cpr-exec-command: Command to start the new QEMU process when @mode
>> +#     is @cpr-exec.  The first list element is the program's filename,
>> +#     the remainder its arguments. (Since 10.2)
>> +#
>>   # Features:
>>   #
>>   # @unstable: Members @x-checkpoint-delay and
>> @@ -1146,7 +1155,8 @@
>>               '*vcpu-dirty-limit': 'uint64',
>>               '*mode': 'MigMode',
>>               '*zero-page-detection': 'ZeroPageDetection',
>> -            '*direct-io': 'bool' } }
>> +            '*direct-io': 'bool',
>> +            '*cpr-exec-command': [ 'str' ]} }
>>   
>>   ##
>>   # @migrate-set-parameters:
>> @@ -1315,6 +1325,10 @@
>>   #     only has effect if the @mapped-ram capability is enabled.
>>   #     (Since 9.1)
>>   #
>> +# @cpr-exec-command: Command to start the new QEMU process when @mode
>> +#     is @cpr-exec.  The first list element is the program's filename,
>> +#     the remainder its arguments. (Since 10.2)
>> +#
>>   # Features:
>>   #
>>   # @unstable: Members @x-checkpoint-delay and
>> @@ -1353,7 +1367,8 @@
>>               '*vcpu-dirty-limit': 'uint64',
>>               '*mode': 'MigMode',
>>               '*zero-page-detection': 'ZeroPageDetection',
>> -            '*direct-io': 'bool' } }
>> +            '*direct-io': 'bool',
>> +            '*cpr-exec-command': [ 'str' ]} }
>>   
>>   ##
>>   # @query-migrate-parameters:
> 
> Acked-by: Markus Armbruster <armbru@redhat.com>
> 
> 
> [...]
> 




* [PATCH V3 6/9] migration: cpr-exec save and load
  2025-08-14 17:17 [PATCH V3 0/9] Live update: cpr-exec Steve Sistare
                   ` (4 preceding siblings ...)
  2025-08-14 17:17 ` [PATCH V3 5/9] migration: cpr-exec-command parameter Steve Sistare
@ 2025-08-14 17:17 ` Steve Sistare
  2025-09-19 15:35   ` Steven Sistare
  2025-08-14 17:17 ` [PATCH V3 7/9] migration: cpr-exec mode Steve Sistare
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 47+ messages in thread
From: Steve Sistare @ 2025-08-14 17:17 UTC (permalink / raw)
  To: qemu-devel
  Cc: Fabiano Rosas, Peter Xu, Markus Armbruster, Paolo Bonzini,
	Eric Blake, Dr. David Alan Gilbert, Steve Sistare

To preserve CPR state across exec, create a QEMUFile based on a memfd, and
keep the memfd open across exec.  Save the value of the memfd in an
environment variable so post-exec QEMU can find it.

These new functions are called in a subsequent patch.
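
The hand-off described above can be sketched outside QEMU roughly as follows.
This is a minimal illustration, not the patch's code: the function names and
the DEMO_CPR_EXEC_STATE variable are hypothetical, and both "sides" run in
one process, so no exec actually takes place.  It assumes Linux with a glibc
that provides memfd_create().

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical name for this demo; the patch uses QEMU_CPR_EXEC_STATE. */
#define DEMO_STATE_ENV "DEMO_CPR_EXEC_STATE"

/* "Old process" side: write state to a memfd and publish the fd number. */
int demo_persist_state(const char *state)
{
    char val[16];
    size_t len = strlen(state);
    int flags;
    int mfd = memfd_create(DEMO_STATE_ENV, MFD_CLOEXEC);

    if (mfd < 0) {
        return -1;
    }
    if (write(mfd, state, len) != (ssize_t)len) {
        close(mfd);
        return -1;
    }
    /* Clear close-on-exec so the descriptor would survive an exec. */
    flags = fcntl(mfd, F_GETFD);
    if (flags < 0 || fcntl(mfd, F_SETFD, flags & ~FD_CLOEXEC) < 0) {
        close(mfd);
        return -1;
    }
    snprintf(val, sizeof(val), "%d", mfd);
    if (setenv(DEMO_STATE_ENV, val, 1) < 0) {
        close(mfd);
        return -1;
    }
    return mfd;
}

/* "New process" side: find the fd via the environment, rewind, and read. */
ssize_t demo_load_state(char *buf, size_t len)
{
    const char *val = getenv(DEMO_STATE_ENV);
    char *end;
    long fd;
    ssize_t n;

    assert(len > 0);
    if (!val) {
        return -1;
    }
    /* Parse before unsetenv(), which may invalidate the val pointer. */
    errno = 0;
    fd = strtol(val, &end, 10);
    if (errno || end == val || *end != '\0' || fd < 0 || fd > INT_MAX) {
        return -1;
    }
    unsetenv(DEMO_STATE_ENV);

    /* The writer left the file offset at the end of the data. */
    if (lseek((int)fd, 0, SEEK_SET) < 0) {
        close((int)fd);
        return -1;
    }
    n = read((int)fd, buf, len - 1);
    close((int)fd);
    if (n < 0) {
        return -1;
    }
    buf[n] = '\0';
    return n;
}
```

Calling demo_persist_state() and then demo_load_state() returns the bytes
that were written.  The sketch parses the environment value before unsetting
the variable and keeps the parse outside any assert(); those are choices made
for the demo, not a claim about what the patch must do.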

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 include/migration/cpr.h |  5 +++
 migration/cpr-exec.c    | 94 +++++++++++++++++++++++++++++++++++++++++++++++++
 migration/meson.build   |  1 +
 3 files changed, 100 insertions(+)
 create mode 100644 migration/cpr-exec.c

diff --git a/include/migration/cpr.h b/include/migration/cpr.h
index f4fc5ca..aaeec02 100644
--- a/include/migration/cpr.h
+++ b/include/migration/cpr.h
@@ -54,4 +54,9 @@ int cpr_get_fd_param(const char *name, const char *fdname, int index, bool cpr,
 QEMUFile *cpr_transfer_output(MigrationChannel *channel, Error **errp);
 QEMUFile *cpr_transfer_input(MigrationChannel *channel, Error **errp);
 
+QEMUFile *cpr_exec_output(Error **errp);
+QEMUFile *cpr_exec_input(Error **errp);
+void cpr_exec_persist_state(QEMUFile *f);
+bool cpr_exec_has_state(void);
+void cpr_exec_unpersist_state(void);
 #endif
diff --git a/migration/cpr-exec.c b/migration/cpr-exec.c
new file mode 100644
index 0000000..2c32e9c
--- /dev/null
+++ b/migration/cpr-exec.c
@@ -0,0 +1,94 @@
+/*
+ * Copyright (c) 2021-2025 Oracle and/or its affiliates.
+ *
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/cutils.h"
+#include "qemu/memfd.h"
+#include "qapi/error.h"
+#include "io/channel-file.h"
+#include "io/channel-socket.h"
+#include "migration/cpr.h"
+#include "migration/qemu-file.h"
+#include "migration/misc.h"
+#include "migration/vmstate.h"
+#include "system/runstate.h"
+
+#define CPR_EXEC_STATE_NAME "QEMU_CPR_EXEC_STATE"
+
+static QEMUFile *qemu_file_new_fd_input(int fd, const char *name)
+{
+    g_autoptr(QIOChannelFile) fioc = qio_channel_file_new_fd(fd);
+    QIOChannel *ioc = QIO_CHANNEL(fioc);
+    qio_channel_set_name(ioc, name);
+    return qemu_file_new_input(ioc);
+}
+
+static QEMUFile *qemu_file_new_fd_output(int fd, const char *name)
+{
+    g_autoptr(QIOChannelFile) fioc = qio_channel_file_new_fd(fd);
+    QIOChannel *ioc = QIO_CHANNEL(fioc);
+    qio_channel_set_name(ioc, name);
+    return qemu_file_new_output(ioc);
+}
+
+void cpr_exec_persist_state(QEMUFile *f)
+{
+    QIOChannelFile *fioc = QIO_CHANNEL_FILE(qemu_file_get_ioc(f));
+    int mfd = dup(fioc->fd);
+    char val[16];
+
+    /* Remember mfd in environment for post-exec load */
+    qemu_clear_cloexec(mfd);
+    snprintf(val, sizeof(val), "%d", mfd);
+    g_setenv(CPR_EXEC_STATE_NAME, val, 1);
+}
+
+static int cpr_exec_find_state(void)
+{
+    const char *val = g_getenv(CPR_EXEC_STATE_NAME);
+    int mfd;
+
+    assert(val);
+    g_unsetenv(CPR_EXEC_STATE_NAME);
+    assert(!qemu_strtoi(val, NULL, 10, &mfd));
+    return mfd;
+}
+
+bool cpr_exec_has_state(void)
+{
+    return g_getenv(CPR_EXEC_STATE_NAME) != NULL;
+}
+
+void cpr_exec_unpersist_state(void)
+{
+    int mfd;
+    const char *val = g_getenv(CPR_EXEC_STATE_NAME);
+
+    g_unsetenv(CPR_EXEC_STATE_NAME);
+    assert(val);
+    assert(!qemu_strtoi(val, NULL, 10, &mfd));
+    close(mfd);
+}
+
+QEMUFile *cpr_exec_output(Error **errp)
+{
+    int mfd = memfd_create(CPR_EXEC_STATE_NAME, 0);
+
+    if (mfd < 0) {
+        error_setg_errno(errp, errno, "memfd_create failed");
+        return NULL;
+    }
+
+    return qemu_file_new_fd_output(mfd, CPR_EXEC_STATE_NAME);
+}
+
+QEMUFile *cpr_exec_input(Error **errp)
+{
+    int mfd = cpr_exec_find_state();
+
+    lseek(mfd, 0, SEEK_SET);
+    return qemu_file_new_fd_input(mfd, CPR_EXEC_STATE_NAME);
+}
diff --git a/migration/meson.build b/migration/meson.build
index 276da3b..6087ccc 100644
--- a/migration/meson.build
+++ b/migration/meson.build
@@ -16,6 +16,7 @@ system_ss.add(files(
   'channel-block.c',
   'cpr.c',
   'cpr-transfer.c',
+  'cpr-exec.c',
   'cpu-throttle.c',
   'dirtyrate.c',
   'exec.c',
-- 
1.8.3.1




* Re: [PATCH V3 6/9] migration: cpr-exec save and load
  2025-08-14 17:17 ` [PATCH V3 6/9] migration: cpr-exec save and load Steve Sistare
@ 2025-09-19 15:35   ` Steven Sistare
  0 siblings, 0 replies; 47+ messages in thread
From: Steven Sistare @ 2025-09-19 15:35 UTC (permalink / raw)
  To: Fabiano Rosas, Peter Xu
  Cc: Markus Armbruster, Paolo Bonzini, Eric Blake,
	Dr. David Alan Gilbert

This still needs review - steve

On 8/14/2025 1:17 PM, Steve Sistare wrote:
> To preserve CPR state across exec, create a QEMUFile based on a memfd, and
> keep the memfd open across exec.  Save the value of the memfd in an
> environment variable so post-exec QEMU can find it.
> 
> These new functions are called in a subsequent patch.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>   include/migration/cpr.h |  5 +++
>   migration/cpr-exec.c    | 94 +++++++++++++++++++++++++++++++++++++++++++++++++
>   migration/meson.build   |  1 +
>   3 files changed, 100 insertions(+)
>   create mode 100644 migration/cpr-exec.c
> 
> diff --git a/include/migration/cpr.h b/include/migration/cpr.h
> index f4fc5ca..aaeec02 100644
> --- a/include/migration/cpr.h
> +++ b/include/migration/cpr.h
> @@ -54,4 +54,9 @@ int cpr_get_fd_param(const char *name, const char *fdname, int index, bool cpr,
>   QEMUFile *cpr_transfer_output(MigrationChannel *channel, Error **errp);
>   QEMUFile *cpr_transfer_input(MigrationChannel *channel, Error **errp);
>   
> +QEMUFile *cpr_exec_output(Error **errp);
> +QEMUFile *cpr_exec_input(Error **errp);
> +void cpr_exec_persist_state(QEMUFile *f);
> +bool cpr_exec_has_state(void);
> +void cpr_exec_unpersist_state(void);
>   #endif
> diff --git a/migration/cpr-exec.c b/migration/cpr-exec.c
> new file mode 100644
> index 0000000..2c32e9c
> --- /dev/null
> +++ b/migration/cpr-exec.c
> @@ -0,0 +1,94 @@
> +/*
> + * Copyright (c) 2021-2025 Oracle and/or its affiliates.
> + *
> + * SPDX-License-Identifier: GPL-2.0-or-later
> + */
> +
> +#include "qemu/osdep.h"
> +#include "qemu/cutils.h"
> +#include "qemu/memfd.h"
> +#include "qapi/error.h"
> +#include "io/channel-file.h"
> +#include "io/channel-socket.h"
> +#include "migration/cpr.h"
> +#include "migration/qemu-file.h"
> +#include "migration/misc.h"
> +#include "migration/vmstate.h"
> +#include "system/runstate.h"
> +
> +#define CPR_EXEC_STATE_NAME "QEMU_CPR_EXEC_STATE"
> +
> +static QEMUFile *qemu_file_new_fd_input(int fd, const char *name)
> +{
> +    g_autoptr(QIOChannelFile) fioc = qio_channel_file_new_fd(fd);
> +    QIOChannel *ioc = QIO_CHANNEL(fioc);
> +    qio_channel_set_name(ioc, name);
> +    return qemu_file_new_input(ioc);
> +}
> +
> +static QEMUFile *qemu_file_new_fd_output(int fd, const char *name)
> +{
> +    g_autoptr(QIOChannelFile) fioc = qio_channel_file_new_fd(fd);
> +    QIOChannel *ioc = QIO_CHANNEL(fioc);
> +    qio_channel_set_name(ioc, name);
> +    return qemu_file_new_output(ioc);
> +}
> +
> +void cpr_exec_persist_state(QEMUFile *f)
> +{
> +    QIOChannelFile *fioc = QIO_CHANNEL_FILE(qemu_file_get_ioc(f));
> +    int mfd = dup(fioc->fd);
> +    char val[16];
> +
> +    /* Remember mfd in environment for post-exec load */
> +    qemu_clear_cloexec(mfd);
> +    snprintf(val, sizeof(val), "%d", mfd);
> +    g_setenv(CPR_EXEC_STATE_NAME, val, 1);
> +}
> +
> +static int cpr_exec_find_state(void)
> +{
> +    const char *val = g_getenv(CPR_EXEC_STATE_NAME);
> +    int mfd;
> +
> +    assert(val);
> +    g_unsetenv(CPR_EXEC_STATE_NAME);
> +    assert(!qemu_strtoi(val, NULL, 10, &mfd));
> +    return mfd;
> +}
> +
> +bool cpr_exec_has_state(void)
> +{
> +    return g_getenv(CPR_EXEC_STATE_NAME) != NULL;
> +}
> +
> +void cpr_exec_unpersist_state(void)
> +{
> +    int mfd;
> +    const char *val = g_getenv(CPR_EXEC_STATE_NAME);
> +
> +    g_unsetenv(CPR_EXEC_STATE_NAME);
> +    assert(val);
> +    assert(!qemu_strtoi(val, NULL, 10, &mfd));
> +    close(mfd);
> +}
> +
> +QEMUFile *cpr_exec_output(Error **errp)
> +{
> +    int mfd = memfd_create(CPR_EXEC_STATE_NAME, 0);
> +
> +    if (mfd < 0) {
> +        error_setg_errno(errp, errno, "memfd_create failed");
> +        return NULL;
> +    }
> +
> +    return qemu_file_new_fd_output(mfd, CPR_EXEC_STATE_NAME);
> +}
> +
> +QEMUFile *cpr_exec_input(Error **errp)
> +{
> +    int mfd = cpr_exec_find_state();
> +
> +    lseek(mfd, 0, SEEK_SET);
> +    return qemu_file_new_fd_input(mfd, CPR_EXEC_STATE_NAME);
> +}
> diff --git a/migration/meson.build b/migration/meson.build
> index 276da3b..6087ccc 100644
> --- a/migration/meson.build
> +++ b/migration/meson.build
> @@ -16,6 +16,7 @@ system_ss.add(files(
>     'channel-block.c',
>     'cpr.c',
>     'cpr-transfer.c',
> +  'cpr-exec.c',
>     'cpu-throttle.c',
>     'dirtyrate.c',
>     'exec.c',




* [PATCH V3 7/9] migration: cpr-exec mode
  2025-08-14 17:17 [PATCH V3 0/9] Live update: cpr-exec Steve Sistare
                   ` (5 preceding siblings ...)
  2025-08-14 17:17 ` [PATCH V3 6/9] migration: cpr-exec save and load Steve Sistare
@ 2025-08-14 17:17 ` Steve Sistare
  2025-09-09 16:32   ` Peter Xu
  2025-09-11 15:09   ` Markus Armbruster
  2025-08-14 17:17 ` [PATCH V3 8/9] migration: cpr-exec docs Steve Sistare
                   ` (3 subsequent siblings)
  10 siblings, 2 replies; 47+ messages in thread
From: Steve Sistare @ 2025-08-14 17:17 UTC (permalink / raw)
  To: qemu-devel
  Cc: Fabiano Rosas, Peter Xu, Markus Armbruster, Paolo Bonzini,
	Eric Blake, Dr. David Alan Gilbert, Steve Sistare

Add the cpr-exec migration mode.  Usage:
  qemu-system-$arch -machine aux-ram-share=on ...
  migrate_set_parameter mode cpr-exec
  migrate_set_parameter cpr-exec-command \
    <arg1> <arg2> ... -incoming <uri-1>
  migrate -d <uri-1>

The migrate command stops the VM, saves state to uri-1,
directly exec's a new version of QEMU on the same host,
replacing the original process while retaining its PID, and
loads state from uri-1.  Guest RAM is preserved in place,
albeit with new virtual addresses.

The new QEMU process is started by exec'ing the command
specified by the @cpr-exec-command parameter.  The first word of
the command is the binary, and the remaining words are its
arguments.  The command may be a direct invocation of new QEMU,
or may be a non-QEMU command that exec's the new QEMU binary.
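
As a rough illustration of "exec the command, keeping selected descriptors
open", the sketch below clears close-on-exec on one descriptor, calls
execvp() with argv[0] as the program, and rolls the flag back if the exec
fails.  The function names are hypothetical and only a single descriptor is
handled; the patch itself walks all preserved descriptors with cpr_walk_fd().

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Set (on != 0) or clear (on == 0) FD_CLOEXEC; returns 0 or -1. */
int demo_set_cloexec(int fd, int on)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags < 0) {
        return -1;
    }
    flags = on ? (flags | FD_CLOEXEC) : (flags & ~FD_CLOEXEC);
    return fcntl(fd, F_SETFD, flags);
}

/*
 * Replace the current process with argv[0], keeping fd open across the
 * exec.  Returns only on failure: -1, with errno describing the exec
 * error and close-on-exec restored on fd.
 */
int demo_exec_preserving_fd(int fd, char *const argv[])
{
    int saved_errno;

    assert(argv && argv[0]);
    if (demo_set_cloexec(fd, 0) < 0) {
        return -1;
    }

    execvp(argv[0], argv);

    /* Only reached if execvp() failed. */
    saved_errno = errno;
    demo_set_cloexec(fd, 1);    /* best-effort rollback */
    errno = saved_errno;
    return -1;
}
```

The sketch saves errno before the rollback because intervening library calls
are allowed to modify it; whether that matters in practice for the fcntl()
calls involved is not something this example establishes.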

This mode creates a second migration channel that is not visible
to the user.  At the start of migration, old QEMU saves CPR state
to the second channel, and at the end of migration, it tells the
main loop to call cpr_exec.  New QEMU loads CPR state early, before
objects are created.

Because old QEMU terminates when new QEMU starts, one cannot
stream data between the two, so uri-1 must be a type,
such as a file, that accepts all data before old QEMU exits.
Otherwise, old QEMU may quietly block writing to the channel.
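
The difference between a streaming channel and a file can be seen with a
small experiment.  The hypothetical helper below writes to a descriptor that
nobody is reading from; it sets O_NONBLOCK only so that "would block" shows
up as EAGAIN and the demo terminates.  On Linux a pipe accepts a bounded
amount (its buffer size), while a regular file accepts everything.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <unistd.h>

/*
 * Try to write `total` bytes to fd while nothing drains it.
 * Returns the number of bytes accepted before the write would have
 * blocked, or -1 on an unexpected error.
 */
long demo_bytes_accepted(int fd, size_t total)
{
    char chunk[4096];
    size_t done = 0;
    int flags = fcntl(fd, F_GETFL);

    assert(total > 0);
    if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0) {
        return -1;
    }
    memset(chunk, 0, sizeof(chunk));

    while (done < total) {
        size_t want = total - done;
        ssize_t n;

        if (want > sizeof(chunk)) {
            want = sizeof(chunk);
        }
        n = write(fd, chunk, want);
        if (n < 0) {
            if (errno == EINTR) {
                continue;
            }
            if (errno == EAGAIN || errno == EWOULDBLOCK) {
                break;      /* a blocking writer would stall here */
            }
            return -1;
        }
        done += (size_t)n;
    }
    return (long)done;
}
```

With a blocking descriptor, the point where this helper stops is where the
writer would wait for a reader, and in cpr-exec mode that reader does not
exist until the writer has exec'd.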

Memory-backend objects must have the share=on attribute, but
memory-backend-epc is not supported.  The VM must be started with
the '-machine aux-ram-share=on' option, which allows anonymous
memory to be transferred in place to the new process.  The memfds
are kept open across exec by clearing the close-on-exec flag, their
values are saved in CPR state, and they are mmap'd in new QEMU.
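
A minimal sketch of why memfd-backed, shared mappings allow RAM to be
"preserved in place, albeit with new virtual addresses": the pages belong to
the memfd, so any later MAP_SHARED mapping of the same descriptor sees the
same contents.  The helper names are hypothetical, and the demo remaps within
one process rather than across an exec.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Create a memfd of `size` bytes to stand in for guest RAM; -1 on error. */
int demo_ram_create(size_t size)
{
    int fd;

    assert(size > 0);
    fd = memfd_create("demo-ram", MFD_CLOEXEC);
    if (fd < 0) {
        return -1;
    }
    if (ftruncate(fd, (off_t)size) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/*
 * Map the memfd with MAP_SHARED so that stores land in the memfd's pages
 * rather than in a private copy.  Returns NULL on failure.
 */
void *demo_ram_map(int fd, size_t size)
{
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    return p == MAP_FAILED ? NULL : p;
}
```

Writing through one mapping, unmapping it, and mapping the descriptor again
yields the same bytes, possibly at a different address.  A MAP_PRIVATE
mapping would not have this property, which is presumably why share=on is
required.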

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 qapi/migration.json       | 25 +++++++++++++++-
 include/migration/cpr.h   |  1 +
 migration/cpr-exec.c      | 74 +++++++++++++++++++++++++++++++++++++++++++++++
 migration/cpr.c           | 26 ++++++++++++++++-
 migration/migration.c     | 10 ++++++-
 migration/ram.c           |  1 +
 migration/vmstate-types.c |  8 +++++
 migration/trace-events    |  1 +
 8 files changed, 143 insertions(+), 3 deletions(-)

diff --git a/qapi/migration.json b/qapi/migration.json
index ea410fd..cbc90e8 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -694,9 +694,32 @@
 #     until you issue the `migrate-incoming` command.
 #
 #     (since 10.0)
+#
+# @cpr-exec: The migrate command stops the VM, saves state to the
+#     migration channel, directly exec's a new version of QEMU on the
+#     same host, replacing the original process while retaining its
+#     PID, and loads state from the channel.  Guest RAM is preserved
+#     in place.  Devices and their pinned pages are also preserved for
+#     VFIO and IOMMUFD.
+#
+#     Old QEMU starts new QEMU by exec'ing the command specified by
+#     the @cpr-exec-command parameter.  The command may be a direct
+#     invocation of new QEMU, or may be a non-QEMU command that exec's
+#     the new QEMU binary.
+#
+#     Because old QEMU terminates when new QEMU starts, one cannot
+#     stream data between the two, so the channel must be a type,
+#     such as a file, that accepts all data before old QEMU exits.
+#     Otherwise, old QEMU may quietly block writing to the channel.
+#
+#     Memory-backend objects must have the share=on attribute, but
+#     memory-backend-epc is not supported.  The VM must be started
+#     with the '-machine aux-ram-share=on' option.
+#
+#     (since 10.2)
 ##
 { 'enum': 'MigMode',
-  'data': [ 'normal', 'cpr-reboot', 'cpr-transfer' ] }
+  'data': [ 'normal', 'cpr-reboot', 'cpr-transfer', 'cpr-exec' ] }
 
 ##
 # @ZeroPageDetection:
diff --git a/include/migration/cpr.h b/include/migration/cpr.h
index aaeec02..e99e48e 100644
--- a/include/migration/cpr.h
+++ b/include/migration/cpr.h
@@ -54,6 +54,7 @@ int cpr_get_fd_param(const char *name, const char *fdname, int index, bool cpr,
 QEMUFile *cpr_transfer_output(MigrationChannel *channel, Error **errp);
 QEMUFile *cpr_transfer_input(MigrationChannel *channel, Error **errp);
 
+void cpr_exec_init(void);
 QEMUFile *cpr_exec_output(Error **errp);
 QEMUFile *cpr_exec_input(Error **errp);
 void cpr_exec_persist_state(QEMUFile *f);
diff --git a/migration/cpr-exec.c b/migration/cpr-exec.c
index 2c32e9c..7d0429f 100644
--- a/migration/cpr-exec.c
+++ b/migration/cpr-exec.c
@@ -6,15 +6,20 @@
 
 #include "qemu/osdep.h"
 #include "qemu/cutils.h"
+#include "qemu/error-report.h"
 #include "qemu/memfd.h"
 #include "qapi/error.h"
 #include "io/channel-file.h"
 #include "io/channel-socket.h"
+#include "block/block-global-state.h"
+#include "qemu/main-loop.h"
 #include "migration/cpr.h"
 #include "migration/qemu-file.h"
+#include "migration/migration.h"
 #include "migration/misc.h"
 #include "migration/vmstate.h"
 #include "system/runstate.h"
+#include "trace.h"
 
 #define CPR_EXEC_STATE_NAME "QEMU_CPR_EXEC_STATE"
 
@@ -92,3 +97,72 @@ QEMUFile *cpr_exec_input(Error **errp)
     lseek(mfd, 0, SEEK_SET);
     return qemu_file_new_fd_input(mfd, CPR_EXEC_STATE_NAME);
 }
+
+static bool preserve_fd(int fd)
+{
+    qemu_clear_cloexec(fd);
+    return true;
+}
+
+static bool unpreserve_fd(int fd)
+{
+    qemu_set_cloexec(fd);
+    return true;
+}
+
+static void cpr_exec(char **argv)
+{
+    MigrationState *s = migrate_get_current();
+    Error *err = NULL;
+
+    /*
+     * Clear the close-on-exec flag for all preserved fd's.  We cannot do so
+     * earlier because they should not persist across miscellaneous fork and
+     * exec calls that are performed during normal operation.
+     */
+    cpr_walk_fd(preserve_fd);
+
+    trace_cpr_exec();
+    execvp(argv[0], argv);
+
+    cpr_walk_fd(unpreserve_fd);
+
+    error_setg_errno(&err, errno, "execvp %s failed", argv[0]);
+    error_report_err(error_copy(err));
+    migrate_set_state(&s->state, s->state, MIGRATION_STATUS_FAILED);
+    migrate_set_error(s, err);
+
+    migration_call_notifiers(s, MIG_EVENT_PRECOPY_FAILED, NULL);
+
+    err = NULL;
+    if (!migration_block_activate(&err)) {
+        /* error was already reported */
+        return;
+    }
+
+    if (runstate_is_live(s->vm_old_state)) {
+        vm_start();
+    }
+}
+
+static int cpr_exec_notifier(NotifierWithReturn *notifier, MigrationEvent *e,
+                             Error **errp)
+{
+    MigrationState *s = migrate_get_current();
+
+    if (e->type == MIG_EVENT_PRECOPY_DONE) {
+        assert(s->state == MIGRATION_STATUS_COMPLETED);
+        qemu_system_exec_request(cpr_exec, s->parameters.cpr_exec_command);
+    } else if (e->type == MIG_EVENT_PRECOPY_FAILED) {
+        cpr_exec_unpersist_state();
+    }
+    return 0;
+}
+
+void cpr_exec_init(void)
+{
+    static NotifierWithReturn exec_notifier;
+
+    migration_add_notifier_mode(&exec_notifier, cpr_exec_notifier,
+                                MIG_MODE_CPR_EXEC);
+}
diff --git a/migration/cpr.c b/migration/cpr.c
index 021bd6a..2078d05 100644
--- a/migration/cpr.c
+++ b/migration/cpr.c
@@ -198,6 +198,8 @@ int cpr_state_save(MigrationChannel *channel, Error **errp)
     if (mode == MIG_MODE_CPR_TRANSFER) {
         g_assert(channel);
         f = cpr_transfer_output(channel, errp);
+    } else if (mode == MIG_MODE_CPR_EXEC) {
+        f = cpr_exec_output(errp);
     } else {
         return 0;
     }
@@ -215,6 +217,10 @@ int cpr_state_save(MigrationChannel *channel, Error **errp)
         return ret;
     }
 
+    if (migrate_mode() == MIG_MODE_CPR_EXEC) {
+        cpr_exec_persist_state(f);
+    }
+
     /*
      * Close the socket only partially so we can later detect when the other
      * end closes by getting a HUP event.
@@ -226,6 +232,12 @@ int cpr_state_save(MigrationChannel *channel, Error **errp)
     return 0;
 }
 
+static bool unpreserve_fd(int fd)
+{
+    qemu_set_cloexec(fd);
+    return true;
+}
+
 int cpr_state_load(MigrationChannel *channel, Error **errp)
 {
     int ret;
@@ -237,6 +249,12 @@ int cpr_state_load(MigrationChannel *channel, Error **errp)
         mode = MIG_MODE_CPR_TRANSFER;
         cpr_set_incoming_mode(mode);
         f = cpr_transfer_input(channel, errp);
+    } else if (cpr_exec_has_state()) {
+        mode = MIG_MODE_CPR_EXEC;
+        f = cpr_exec_input(errp);
+        if (channel) {
+            warn_report("ignoring cpr channel for migration mode cpr-exec");
+        }
     } else {
         return 0;
     }
@@ -245,6 +263,7 @@ int cpr_state_load(MigrationChannel *channel, Error **errp)
     }
 
     trace_cpr_state_load(MigMode_str(mode));
+    cpr_set_incoming_mode(mode);
 
     v = qemu_get_be32(f);
     if (v != QEMU_CPR_FILE_MAGIC) {
@@ -266,6 +285,11 @@ int cpr_state_load(MigrationChannel *channel, Error **errp)
         return ret;
     }
 
+    if (migrate_mode() == MIG_MODE_CPR_EXEC) {
+        /* Set cloexec to prevent fd leaks from fork until the next cpr-exec */
+        cpr_walk_fd(unpreserve_fd);
+    }
+
     /*
      * Let the caller decide when to close the socket (and generate a HUP event
      * for the sending side).
@@ -286,7 +310,7 @@ void cpr_state_close(void)
 bool cpr_incoming_needed(void *opaque)
 {
     MigMode mode = migrate_mode();
-    return mode == MIG_MODE_CPR_TRANSFER;
+    return mode == MIG_MODE_CPR_TRANSFER || mode == MIG_MODE_CPR_EXEC;
 }
 
 /*
diff --git a/migration/migration.c b/migration/migration.c
index 271c521..d604284 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -333,6 +333,7 @@ void migration_object_init(void)
 
     ram_mig_init();
     dirty_bitmap_mig_init();
+    cpr_exec_init();
 
     /* Initialize cpu throttle timers */
     cpu_throttle_init();
@@ -1797,7 +1798,8 @@ bool migrate_mode_is_cpr(MigrationState *s)
 {
     MigMode mode = s->parameters.mode;
     return mode == MIG_MODE_CPR_REBOOT ||
-           mode == MIG_MODE_CPR_TRANSFER;
+           mode == MIG_MODE_CPR_TRANSFER ||
+           mode == MIG_MODE_CPR_EXEC;
 }
 
 int migrate_init(MigrationState *s, Error **errp)
@@ -2146,6 +2148,12 @@ static bool migrate_prepare(MigrationState *s, bool resume, Error **errp)
         return false;
     }
 
+    if (migrate_mode() == MIG_MODE_CPR_EXEC &&
+        !s->parameters.has_cpr_exec_command) {
+        error_setg(errp, "cpr-exec mode requires setting cpr-exec-command");
+        return false;
+    }
+
     if (migration_is_blocked(errp)) {
         return false;
     }
diff --git a/migration/ram.c b/migration/ram.c
index 7208bc1..6730a41 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -228,6 +228,7 @@ bool migrate_ram_is_ignored(RAMBlock *block)
     MigMode mode = migrate_mode();
     return !qemu_ram_is_migratable(block) ||
            mode == MIG_MODE_CPR_TRANSFER ||
+           mode == MIG_MODE_CPR_EXEC ||
            (migrate_ignore_shared() && qemu_ram_is_shared(block)
                                     && qemu_ram_is_named_file(block));
 }
diff --git a/migration/vmstate-types.c b/migration/vmstate-types.c
index 741a588..1aa0573 100644
--- a/migration/vmstate-types.c
+++ b/migration/vmstate-types.c
@@ -321,6 +321,10 @@ static int get_fd(QEMUFile *f, void *pv, size_t size,
                   const VMStateField *field)
 {
     int32_t *v = pv;
+    if (migrate_mode() == MIG_MODE_CPR_EXEC) {
+        qemu_get_sbe32s(f, v);
+        return 0;
+    }
     *v = qemu_file_get_fd(f);
     return 0;
 }
@@ -329,6 +333,10 @@ static int put_fd(QEMUFile *f, void *pv, size_t size,
                   const VMStateField *field, JSONWriter *vmdesc)
 {
     int32_t *v = pv;
+    if (migrate_mode() == MIG_MODE_CPR_EXEC) {
+        qemu_put_sbe32s(f, v);
+        return 0;
+    }
     return qemu_file_put_fd(f, *v);
 }
 
diff --git a/migration/trace-events b/migration/trace-events
index 706db97..e8edd1f 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -354,6 +354,7 @@ cpr_state_save(const char *mode) "%s mode"
 cpr_state_load(const char *mode) "%s mode"
 cpr_transfer_input(const char *path) "%s"
 cpr_transfer_output(const char *path) "%s"
+cpr_exec(void) ""
 
 # block-dirty-bitmap.c
 send_bitmap_header_enter(void) ""
-- 
1.8.3.1
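
In cpr-exec mode, the get_fd/put_fd hunks above save the descriptor number
itself as a signed 32-bit big-endian value instead of passing the descriptor
over the channel; the number stays meaningful because the descriptor is kept
open across exec.  A standalone sketch of that encoding, with hypothetical
helper names that are not QEMU's qemu_put_sbe32s()/qemu_get_sbe32s():

```c
#include <assert.h>
#include <stdint.h>

/* Store v as four big-endian bytes (most significant byte first). */
void demo_put_sbe32(uint8_t out[4], int32_t v)
{
    uint32_t u = (uint32_t)v;   /* well-defined modular conversion */

    assert(out);
    out[0] = (uint8_t)(u >> 24);
    out[1] = (uint8_t)(u >> 16);
    out[2] = (uint8_t)(u >> 8);
    out[3] = (uint8_t)u;
}

/* Load four big-endian bytes back into a signed 32-bit value. */
int32_t demo_get_sbe32(const uint8_t in[4])
{
    uint32_t u;

    assert(in);
    u = ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16) |
        ((uint32_t)in[2] << 8) | (uint32_t)in[3];
    /*
     * Converting values above INT32_MAX back to int32_t is
     * implementation-defined before C23; mainstream compilers produce
     * the two's-complement result.
     */
    return (int32_t)u;
}
```

A descriptor value such as 42 round-trips unchanged, and so does -1, the
usual "no descriptor" value.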




* Re: [PATCH V3 7/9] migration: cpr-exec mode
  2025-08-14 17:17 ` [PATCH V3 7/9] migration: cpr-exec mode Steve Sistare
@ 2025-09-09 16:32   ` Peter Xu
  2025-09-09 18:10     ` Steven Sistare
  2025-09-11 15:09   ` Markus Armbruster
  1 sibling, 1 reply; 47+ messages in thread
From: Peter Xu @ 2025-09-09 16:32 UTC (permalink / raw)
  To: Steve Sistare
  Cc: qemu-devel, Fabiano Rosas, Markus Armbruster, Paolo Bonzini,
	Eric Blake, Dr. David Alan Gilbert

On Thu, Aug 14, 2025 at 10:17:21AM -0700, Steve Sistare wrote:
> Add the cpr-exec migration mode.  Usage:
>   qemu-system-$arch -machine aux-ram-share=on ...
>   migrate_set_parameter mode cpr-exec
>   migrate_set_parameter cpr-exec-command \
>     <arg1> <arg2> ... -incoming <uri-1>
>   migrate -d <uri-1>
> 
> The migrate command stops the VM, saves state to uri-1,
> directly exec's a new version of QEMU on the same host,
> replacing the original process while retaining its PID, and
> loads state from uri-1.  Guest RAM is preserved in place,
> albeit with new virtual addresses.
> 
> The new QEMU process is started by exec'ing the command
> specified by the @cpr-exec-command parameter.  The first word of
> the command is the binary, and the remaining words are its
> arguments.  The command may be a direct invocation of new QEMU,
> or may be a non-QEMU command that exec's the new QEMU binary.
> 
> This mode creates a second migration channel that is not visible
> to the user.  At the start of migration, old QEMU saves CPR state
> to the second channel, and at the end of migration, it tells the
> main loop to call cpr_exec.  New QEMU loads CPR state early, before
> objects are created.
> 
> Because old QEMU terminates when new QEMU starts, one cannot
> stream data between the two, so uri-1 must be a type,
> such as a file, that accepts all data before old QEMU exits.
> Otherwise, old QEMU may quietly block writing to the channel.
> 
> Memory-backend objects must have the share=on attribute, but
> memory-backend-epc is not supported.  The VM must be started with
> the '-machine aux-ram-share=on' option, which allows anonymous
> memory to be transferred in place to the new process.  The memfds
> are kept open across exec by clearing the close-on-exec flag, their
> values are saved in CPR state, and they are mmap'd in new QEMU.

Some generic questions around exec..

How do we know we can already safely kill all threads?

IIUC vcpu threads must all be stopped.  I wonder if we want to assert that
in the exec helper below.

What about the rest of the threads?  RCU threads should only be freeing
resources, so it looks OK to ignore them.  But what about the others?

Or would process states still matter in some cases? e.g. when QEMU is
talking to another vhost-user, or vfio-user, or virtio-fs, or ... whatever
other process, then suddenly the other process doesn't recognize this QEMU
anymore?

What about file locks or similar shared locks that may be held by an
iothread?  Is it possible that old QEMU took some shared locks, then
suddenly QEMU exec()s, and the locks are never released?

> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  qapi/migration.json       | 25 +++++++++++++++-
>  include/migration/cpr.h   |  1 +
>  migration/cpr-exec.c      | 74 +++++++++++++++++++++++++++++++++++++++++++++++
>  migration/cpr.c           | 26 ++++++++++++++++-
>  migration/migration.c     | 10 ++++++-
>  migration/ram.c           |  1 +
>  migration/vmstate-types.c |  8 +++++
>  migration/trace-events    |  1 +
>  8 files changed, 143 insertions(+), 3 deletions(-)
> 
> diff --git a/qapi/migration.json b/qapi/migration.json
> index ea410fd..cbc90e8 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -694,9 +694,32 @@
>  #     until you issue the `migrate-incoming` command.
>  #
>  #     (since 10.0)
> +#
> +# @cpr-exec: The migrate command stops the VM, saves state to the
> +#     migration channel, directly exec's a new version of QEMU on the
> +#     same host, replacing the original process while retaining its
> +#     PID, and loads state from the channel.  Guest RAM is preserved
> +#     in place.  Devices and their pinned pages are also preserved for
> +#     VFIO and IOMMUFD.
> +#
> +#     Old QEMU starts new QEMU by exec'ing the command specified by
> +#     the @cpr-exec-command parameter.  The command may be a direct
> +#     invocation of new QEMU, or may be a non-QEMU command that exec's
> +#     the new QEMU binary.
> +#
> +#     Because old QEMU terminates when new QEMU starts, one cannot
> +#     stream data between the two, so the channel must be a type,
> +#     such as a file, that accepts all data before old QEMU exits.
> +#     Otherwise, old QEMU may quietly block writing to the channel.

The CPR channel (in case of exec mode) is persisted via env var.  Why not
do that too for the main migration stream?

Does it have something to do with the size of the binary chunk to store all
device states (and some private mem)?  Or other concerns?

It feels like it would be cleaner for cpr-exec not to need -incoming XXX at
all.  Since the series already uses an envvar anyway, we could use that too,
so new QEMU would know it's a cpr-exec incoming migration without any
-incoming parameter.

> +#
> +#     Memory-backend objects must have the share=on attribute, but
> +#     memory-backend-epc is not supported.  The VM must be started
> +#     with the '-machine aux-ram-share=on' option.
> +#
> +#     (since 10.2)
>  ##
>  { 'enum': 'MigMode',
> -  'data': [ 'normal', 'cpr-reboot', 'cpr-transfer' ] }
> +  'data': [ 'normal', 'cpr-reboot', 'cpr-transfer', 'cpr-exec' ] }
>  
>  ##
>  # @ZeroPageDetection:
> diff --git a/include/migration/cpr.h b/include/migration/cpr.h
> index aaeec02..e99e48e 100644
> --- a/include/migration/cpr.h
> +++ b/include/migration/cpr.h
> @@ -54,6 +54,7 @@ int cpr_get_fd_param(const char *name, const char *fdname, int index, bool cpr,
>  QEMUFile *cpr_transfer_output(MigrationChannel *channel, Error **errp);
>  QEMUFile *cpr_transfer_input(MigrationChannel *channel, Error **errp);
>  
> +void cpr_exec_init(void);
>  QEMUFile *cpr_exec_output(Error **errp);
>  QEMUFile *cpr_exec_input(Error **errp);
>  void cpr_exec_persist_state(QEMUFile *f);
> diff --git a/migration/cpr-exec.c b/migration/cpr-exec.c
> index 2c32e9c..7d0429f 100644
> --- a/migration/cpr-exec.c
> +++ b/migration/cpr-exec.c
> @@ -6,15 +6,20 @@
>  
>  #include "qemu/osdep.h"
>  #include "qemu/cutils.h"
> +#include "qemu/error-report.h"
>  #include "qemu/memfd.h"
>  #include "qapi/error.h"
>  #include "io/channel-file.h"
>  #include "io/channel-socket.h"
> +#include "block/block-global-state.h"
> +#include "qemu/main-loop.h"
>  #include "migration/cpr.h"
>  #include "migration/qemu-file.h"
> +#include "migration/migration.h"
>  #include "migration/misc.h"
>  #include "migration/vmstate.h"
>  #include "system/runstate.h"
> +#include "trace.h"
>  
>  #define CPR_EXEC_STATE_NAME "QEMU_CPR_EXEC_STATE"
>  
> @@ -92,3 +97,72 @@ QEMUFile *cpr_exec_input(Error **errp)
>      lseek(mfd, 0, SEEK_SET);
>      return qemu_file_new_fd_input(mfd, CPR_EXEC_STATE_NAME);
>  }
> +
> +static bool preserve_fd(int fd)
> +{
> +    qemu_clear_cloexec(fd);
> +    return true;
> +}
> +
> +static bool unpreserve_fd(int fd)
> +{
> +    qemu_set_cloexec(fd);
> +    return true;
> +}
> +
> +static void cpr_exec(char **argv)
> +{
> +    MigrationState *s = migrate_get_current();
> +    Error *err = NULL;
> +
> +    /*
> +     * Clear the close-on-exec flag for all preserved fd's.  We cannot do so
> +     * earlier because they should not persist across miscellaneous fork and
> +     * exec calls that are performed during normal operation.
> +     */
> +    cpr_walk_fd(preserve_fd);
> +
> +    trace_cpr_exec();
> +    execvp(argv[0], argv);
> +
> +    cpr_walk_fd(unpreserve_fd);
> +
> +    error_setg_errno(&err, errno, "execvp %s failed", argv[0]);
> +    error_report_err(error_copy(err));

Feel free to ignore my question in the other patch, so we dump some errors
here.. which makes sense.

> +    migrate_set_state(&s->state, s->state, MIGRATION_STATUS_FAILED);

This is indeed FAILED migration, however it seems to imply it can catch
whatever possible failures that incoming could have.  Strictly speaking
this is not migration failure, but exec failure..  Maybe we need a comment
above this one explaining that we won't be able to capture any migration
issues, it's too late after exec() succeeded, so there's higher risk of
crashing the VM.

Luckily we are still on the same host, so things like mismatched kernel
versions at least won't crash this migration, i.e. it is not as easy to
fail as a cross-host migration. But still, I'd say I agree with Vladimir
that this is a major flaw of the design if so.

> +    migrate_set_error(s, err);
> +
> +    migration_call_notifiers(s, MIG_EVENT_PRECOPY_FAILED, NULL);
> +
> +    err = NULL;
> +    if (!migration_block_activate(&err)) {
> +        /* error was already reported */
> +        return;
> +    }
> +
> +    if (runstate_is_live(s->vm_old_state)) {
> +        vm_start();
> +    }
> +}
> +
> +static int cpr_exec_notifier(NotifierWithReturn *notifier, MigrationEvent *e,
> +                             Error **errp)
> +{
> +    MigrationState *s = migrate_get_current();
> +
> +    if (e->type == MIG_EVENT_PRECOPY_DONE) {
> +        assert(s->state == MIGRATION_STATUS_COMPLETED);
> +        qemu_system_exec_request(cpr_exec, s->parameters.cpr_exec_command);
> +    } else if (e->type == MIG_EVENT_PRECOPY_FAILED) {
> +        cpr_exec_unpersist_state();
> +    }
> +    return 0;
> +}
> +
> +void cpr_exec_init(void)
> +{
> +    static NotifierWithReturn exec_notifier;
> +
> +    migration_add_notifier_mode(&exec_notifier, cpr_exec_notifier,
> +                                MIG_MODE_CPR_EXEC);

Why using a notifier?  IMHO exec() is something important enough to not be
hiding in a notifier..  and CPR is already a major part of migration in the
framework, IMHO it'll be cleaner to invoke any CPR request in the migration
subsystem.  AFAIU notifiers are normally only for outside migration/ purposes.

> +}
> diff --git a/migration/cpr.c b/migration/cpr.c
> index 021bd6a..2078d05 100644
> --- a/migration/cpr.c
> +++ b/migration/cpr.c
> @@ -198,6 +198,8 @@ int cpr_state_save(MigrationChannel *channel, Error **errp)
>      if (mode == MIG_MODE_CPR_TRANSFER) {
>          g_assert(channel);
>          f = cpr_transfer_output(channel, errp);
> +    } else if (mode == MIG_MODE_CPR_EXEC) {
> +        f = cpr_exec_output(errp);
>      } else {
>          return 0;
>      }
> @@ -215,6 +217,10 @@ int cpr_state_save(MigrationChannel *channel, Error **errp)
>          return ret;
>      }
>  
> +    if (migrate_mode() == MIG_MODE_CPR_EXEC) {
> +        cpr_exec_persist_state(f);
> +    }
> +
>      /*
>       * Close the socket only partially so we can later detect when the other
>       * end closes by getting a HUP event.
> @@ -226,6 +232,12 @@ int cpr_state_save(MigrationChannel *channel, Error **errp)
>      return 0;
>  }
>  
> +static bool unpreserve_fd(int fd)
> +{
> +    qemu_set_cloexec(fd);
> +    return true;
> +}
> +
>  int cpr_state_load(MigrationChannel *channel, Error **errp)
>  {
>      int ret;
> @@ -237,6 +249,12 @@ int cpr_state_load(MigrationChannel *channel, Error **errp)
>          mode = MIG_MODE_CPR_TRANSFER;
>          cpr_set_incoming_mode(mode);
>          f = cpr_transfer_input(channel, errp);
> +    } else if (cpr_exec_has_state()) {
> +        mode = MIG_MODE_CPR_EXEC;
> +        f = cpr_exec_input(errp);
> +        if (channel) {
> +            warn_report("ignoring cpr channel for migration mode cpr-exec");

This looks like dead code?  channel can't be set when reaching here, AFAIU..

> +        }
>      } else {
>          return 0;
>      }
> @@ -245,6 +263,7 @@ int cpr_state_load(MigrationChannel *channel, Error **errp)
>      }
>  
>      trace_cpr_state_load(MigMode_str(mode));
> +    cpr_set_incoming_mode(mode);
>  
>      v = qemu_get_be32(f);
>      if (v != QEMU_CPR_FILE_MAGIC) {
> @@ -266,6 +285,11 @@ int cpr_state_load(MigrationChannel *channel, Error **errp)
>          return ret;
>      }
>  
> +    if (migrate_mode() == MIG_MODE_CPR_EXEC) {
> +        /* Set cloexec to prevent fd leaks from fork until the next cpr-exec */
> +        cpr_walk_fd(unpreserve_fd);
> +    }
> +
>      /*
>       * Let the caller decide when to close the socket (and generate a HUP event
>       * for the sending side).
> @@ -286,7 +310,7 @@ void cpr_state_close(void)
>  bool cpr_incoming_needed(void *opaque)
>  {
>      MigMode mode = migrate_mode();
> -    return mode == MIG_MODE_CPR_TRANSFER;
> +    return mode == MIG_MODE_CPR_TRANSFER || mode == MIG_MODE_CPR_EXEC;
>  }
>  
>  /*
> diff --git a/migration/migration.c b/migration/migration.c
> index 271c521..d604284 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -333,6 +333,7 @@ void migration_object_init(void)
>  
>      ram_mig_init();
>      dirty_bitmap_mig_init();
> +    cpr_exec_init();
>  
>      /* Initialize cpu throttle timers */
>      cpu_throttle_init();
> @@ -1797,7 +1798,8 @@ bool migrate_mode_is_cpr(MigrationState *s)
>  {
>      MigMode mode = s->parameters.mode;
>      return mode == MIG_MODE_CPR_REBOOT ||
> -           mode == MIG_MODE_CPR_TRANSFER;
> +           mode == MIG_MODE_CPR_TRANSFER ||
> +           mode == MIG_MODE_CPR_EXEC;
>  }
>  
>  int migrate_init(MigrationState *s, Error **errp)
> @@ -2146,6 +2148,12 @@ static bool migrate_prepare(MigrationState *s, bool resume, Error **errp)
>          return false;
>      }
>  
> +    if (migrate_mode() == MIG_MODE_CPR_EXEC &&
> +        !s->parameters.has_cpr_exec_command) {
> +        error_setg(errp, "cpr-exec mode requires setting cpr-exec-command");
> +        return false;
> +    }
> +
>      if (migration_is_blocked(errp)) {
>          return false;
>      }
> diff --git a/migration/ram.c b/migration/ram.c
> index 7208bc1..6730a41 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -228,6 +228,7 @@ bool migrate_ram_is_ignored(RAMBlock *block)
>      MigMode mode = migrate_mode();
>      return !qemu_ram_is_migratable(block) ||
>             mode == MIG_MODE_CPR_TRANSFER ||
> +           mode == MIG_MODE_CPR_EXEC ||
>             (migrate_ignore_shared() && qemu_ram_is_shared(block)
>                                      && qemu_ram_is_named_file(block));
>  }
> diff --git a/migration/vmstate-types.c b/migration/vmstate-types.c
> index 741a588..1aa0573 100644
> --- a/migration/vmstate-types.c
> +++ b/migration/vmstate-types.c
> @@ -321,6 +321,10 @@ static int get_fd(QEMUFile *f, void *pv, size_t size,
>                    const VMStateField *field)
>  {
>      int32_t *v = pv;
> +    if (migrate_mode() == MIG_MODE_CPR_EXEC) {
> +        qemu_get_sbe32s(f, v);
> +        return 0;
> +    }
>      *v = qemu_file_get_fd(f);
>      return 0;
>  }
> @@ -329,6 +333,10 @@ static int put_fd(QEMUFile *f, void *pv, size_t size,
>                    const VMStateField *field, JSONWriter *vmdesc)
>  {
>      int32_t *v = pv;
> +    if (migrate_mode() == MIG_MODE_CPR_EXEC) {
> +        qemu_put_sbe32s(f, v);
> +        return 0;
> +    }
>      return qemu_file_put_fd(f, *v);
>  }
>  
> diff --git a/migration/trace-events b/migration/trace-events
> index 706db97..e8edd1f 100644
> --- a/migration/trace-events
> +++ b/migration/trace-events
> @@ -354,6 +354,7 @@ cpr_state_save(const char *mode) "%s mode"
>  cpr_state_load(const char *mode) "%s mode"
>  cpr_transfer_input(const char *path) "%s"
>  cpr_transfer_output(const char *path) "%s"
> +cpr_exec(void) ""
>  
>  # block-dirty-bitmap.c
>  send_bitmap_header_enter(void) ""
> -- 
> 1.8.3.1
> 

-- 
Peter Xu




* Re: [PATCH V3 7/9] migration: cpr-exec mode
  2025-09-09 16:32   ` Peter Xu
@ 2025-09-09 18:10     ` Steven Sistare
  2025-09-09 19:27       ` Peter Xu
  0 siblings, 1 reply; 47+ messages in thread
From: Steven Sistare @ 2025-09-09 18:10 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Fabiano Rosas, Markus Armbruster, Paolo Bonzini,
	Eric Blake, Dr. David Alan Gilbert

On 9/9/2025 12:32 PM, Peter Xu wrote:
> On Thu, Aug 14, 2025 at 10:17:21AM -0700, Steve Sistare wrote:
>> Add the cpr-exec migration mode.  Usage:
>>    qemu-system-$arch -machine aux-ram-share=on ...
>>    migrate_set_parameter mode cpr-exec
>>    migrate_set_parameter cpr-exec-command \
>>      <arg1> <arg2> ... -incoming <uri-1> \
>>    migrate -d <uri-1>
>>
>> The migrate command stops the VM, saves state to uri-1,
>> directly exec's a new version of QEMU on the same host,
>> replacing the original process while retaining its PID, and
>> loads state from uri-1.  Guest RAM is preserved in place,
>> albeit with new virtual addresses.
>>
>> The new QEMU process is started by exec'ing the command
>> specified by the @cpr-exec-command parameter.  The first word of
>> the command is the binary, and the remaining words are its
>> arguments.  The command may be a direct invocation of new QEMU,
>> or may be a non-QEMU command that exec's the new QEMU binary.
>>
>> This mode creates a second migration channel that is not visible
>> to the user.  At the start of migration, old QEMU saves CPR state
>> to the second channel, and at the end of migration, it tells the
>> main loop to call cpr_exec.  New QEMU loads CPR state early, before
>> objects are created.
>>
>> Because old QEMU terminates when new QEMU starts, one cannot
>> stream data between the two, so uri-1 must be a type,
>> such as a file, that accepts all data before old QEMU exits.
>> Otherwise, old QEMU may quietly block writing to the channel.
>>
>> Memory-backend objects must have the share=on attribute, but
>> memory-backend-epc is not supported.  The VM must be started with
>> the '-machine aux-ram-share=on' option, which allows anonymous
>> memory to be transferred in place to the new process.  The memfds
>> are kept open across exec by clearing the close-on-exec flag, their
>> values are saved in CPR state, and they are mmap'd in new QEMU.
> 
> Some generic questions around exec..
> 
> How do we know we can already safely kill all threads?
> 
> IIUC vcpu threads must be all stopped.  I wonder if we want to assert that
> in the exec helper below.
> 
> What about rest threads?  RCU threads should be for freeing resources,
> looks ok if to be ignored.  But others?

These threads are dormant, just as they are in the post migration state.
There is no difference.  They can be safely killed, just as they can be
post migration.

> Or would process states still matter in some cases? e.g. when QEMU is
> talking to another vhost-user, or vfio-user, or virtio-fs, or ... whatever
> other process, then suddenly the other process doesn't recognize this QEMU
> anymore?

These cases need more development to work with cpr.  The external process
can be used by new qemu if the socket connection (fd) is preserved in new QEMU.

> What about file locks or similar shared locks that can be running in an
> iothread?  Is it possible that old QEMU took some shared locks, suddenly
> qemu exec(), then the lock is never released?

Same as the post-migrate state.

>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>>   qapi/migration.json       | 25 +++++++++++++++-
>>   include/migration/cpr.h   |  1 +
>>   migration/cpr-exec.c      | 74 +++++++++++++++++++++++++++++++++++++++++++++++
>>   migration/cpr.c           | 26 ++++++++++++++++-
>>   migration/migration.c     | 10 ++++++-
>>   migration/ram.c           |  1 +
>>   migration/vmstate-types.c |  8 +++++
>>   migration/trace-events    |  1 +
>>   8 files changed, 143 insertions(+), 3 deletions(-)
>>
>> diff --git a/qapi/migration.json b/qapi/migration.json
>> index ea410fd..cbc90e8 100644
>> --- a/qapi/migration.json
>> +++ b/qapi/migration.json
>> @@ -694,9 +694,32 @@
>>   #     until you issue the `migrate-incoming` command.
>>   #
>>   #     (since 10.0)
>> +#
>> +# @cpr-exec: The migrate command stops the VM, saves state to the
>> +#     migration channel, directly exec's a new version of QEMU on the
>> +#     same host, replacing the original process while retaining its
>> +#     PID, and loads state from the channel.  Guest RAM is preserved
>> +#     in place.  Devices and their pinned pages are also preserved for
>> +#     VFIO and IOMMUFD.
>> +#
>> +#     Old QEMU starts new QEMU by exec'ing the command specified by
>> +#     the @cpr-exec-command parameter.  The command may be a direct
>> +#     invocation of new QEMU, or may be a non-QEMU command that exec's
>> +#     the new QEMU binary.
>> +#
>> +#     Because old QEMU terminates when new QEMU starts, one cannot
>> +#     stream data between the two, so the channel must be a type,
>> +#     such as a file, that accepts all data before old QEMU exits.
>> +#     Otherwise, old QEMU may quietly block writing to the channel.
> 
> The CPR channel (in case of exec mode) is persisted via env var.  Why not
> do that too for the main migration stream?
> 
> Does it has something to do with the size of the binary chunk to store all
> device states (and some private mem)?  Or other concerns?

It was not necessary to add code for a new way to move migration data for
the main stream when the existing code and interface works just fine.  One
of the design principles pushed on me was to make cpr look as much like live
migration as possible, and cpr-exec does that.  It has no issues juggling
2 streams, and no delayed start of the monitor. cpr-transfer is actually the
oddball.
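
For readers following the env-var point raised in the question above, here is a minimal sketch of how a state descriptor can be handed to the exec'd process by recording its number in an environment variable. It is an illustration only, not the code from this series: the variable name DEMO_CPR_EXEC_STATE and all four helper names are hypothetical, and the caller is assumed to supply an already-open descriptor (the patch itself uses a memfd) whose close-on-exec flag is cleared before exec.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical variable name, chosen for this sketch only. */
#define STATE_ENV "DEMO_CPR_EXEC_STATE"

/* Old process: record the descriptor number so it survives exec. */
static int persist_state_fd(int fd)
{
    char val[16];

    snprintf(val, sizeof(val), "%d", fd);
    return setenv(STATE_ENV, val, 1);
}

/* New process: was a state descriptor handed over? */
static int has_state(void)
{
    return getenv(STATE_ENV) != NULL;
}

/* New process: recover the descriptor and rewind it for reading. */
static int recover_state_fd(void)
{
    const char *val = getenv(STATE_ENV);
    char *end;
    long fd;

    if (!val) {
        return -1;
    }
    fd = strtol(val, &end, 10);
    if (end == val || *end != '\0' || fd < 0) {
        return -1;
    }
    if (lseek((int)fd, 0, SEEK_SET) < 0) {
        return -1;
    }
    return (int)fd;
}

/* Failure path: forget the hand-off so a later start is not confused. */
static void unpersist_state(void)
{
    unsetenv(STATE_ENV);
}
```

The environment carries only a small integer, so the size of the saved state does not matter for this hand-off; the data itself stays behind the descriptor.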

> It just feels like it would look cleaner for cpr-exec to not need -incoming
> XXX at all, e.g. if the series already used envvar anyway, we can use that
> too so new QEMU would know it's cpr-exec incoming migration, without
> -incoming parameter at all.
>
>> +#
>> +#     Memory-backend objects must have the share=on attribute, but
>> +#     memory-backend-epc is not supported.  The VM must be started
>> +#     with the '-machine aux-ram-share=on' option.
>> +#
>> +#     (since 10.2)
>>   ##
>>   { 'enum': 'MigMode',
>> -  'data': [ 'normal', 'cpr-reboot', 'cpr-transfer' ] }
>> +  'data': [ 'normal', 'cpr-reboot', 'cpr-transfer', 'cpr-exec' ] }
>>   
>>   ##
>>   # @ZeroPageDetection:
>> diff --git a/include/migration/cpr.h b/include/migration/cpr.h
>> index aaeec02..e99e48e 100644
>> --- a/include/migration/cpr.h
>> +++ b/include/migration/cpr.h
>> @@ -54,6 +54,7 @@ int cpr_get_fd_param(const char *name, const char *fdname, int index, bool cpr,
>>   QEMUFile *cpr_transfer_output(MigrationChannel *channel, Error **errp);
>>   QEMUFile *cpr_transfer_input(MigrationChannel *channel, Error **errp);
>>   
>> +void cpr_exec_init(void);
>>   QEMUFile *cpr_exec_output(Error **errp);
>>   QEMUFile *cpr_exec_input(Error **errp);
>>   void cpr_exec_persist_state(QEMUFile *f);
>> diff --git a/migration/cpr-exec.c b/migration/cpr-exec.c
>> index 2c32e9c..7d0429f 100644
>> --- a/migration/cpr-exec.c
>> +++ b/migration/cpr-exec.c
>> @@ -6,15 +6,20 @@
>>   
>>   #include "qemu/osdep.h"
>>   #include "qemu/cutils.h"
>> +#include "qemu/error-report.h"
>>   #include "qemu/memfd.h"
>>   #include "qapi/error.h"
>>   #include "io/channel-file.h"
>>   #include "io/channel-socket.h"
>> +#include "block/block-global-state.h"
>> +#include "qemu/main-loop.h"
>>   #include "migration/cpr.h"
>>   #include "migration/qemu-file.h"
>> +#include "migration/migration.h"
>>   #include "migration/misc.h"
>>   #include "migration/vmstate.h"
>>   #include "system/runstate.h"
>> +#include "trace.h"
>>   
>>   #define CPR_EXEC_STATE_NAME "QEMU_CPR_EXEC_STATE"
>>   
>> @@ -92,3 +97,72 @@ QEMUFile *cpr_exec_input(Error **errp)
>>       lseek(mfd, 0, SEEK_SET);
>>       return qemu_file_new_fd_input(mfd, CPR_EXEC_STATE_NAME);
>>   }
>> +
>> +static bool preserve_fd(int fd)
>> +{
>> +    qemu_clear_cloexec(fd);
>> +    return true;
>> +}
>> +
>> +static bool unpreserve_fd(int fd)
>> +{
>> +    qemu_set_cloexec(fd);
>> +    return true;
>> +}
>> +
>> +static void cpr_exec(char **argv)
>> +{
>> +    MigrationState *s = migrate_get_current();
>> +    Error *err = NULL;
>> +
>> +    /*
>> +     * Clear the close-on-exec flag for all preserved fd's.  We cannot do so
>> +     * earlier because they should not persist across miscellaneous fork and
>> +     * exec calls that are performed during normal operation.
>> +     */
>> +    cpr_walk_fd(preserve_fd);
>> +
>> +    trace_cpr_exec();
>> +    execvp(argv[0], argv);
>> +
>> +    cpr_walk_fd(unpreserve_fd);
>> +
>> +    error_setg_errno(&err, errno, "execvp %s failed", argv[0]);
>> +    error_report_err(error_copy(err));
> 
> Feel free to ignore my question in the other patch, so we dump some errors
> here.. which makes sense.
> 
>> +    migrate_set_state(&s->state, s->state, MIGRATION_STATUS_FAILED);
> 
> This is indeed FAILED migration, however it seems to imply it can catch
> whatever possible failures that incoming could have.  Strictly speaking
> this is not migration failure, but exec failure..  Maybe we need a comment
> above this one explaining that we won't be able to capture any migration
> issues, it's too late after exec() succeeded, so there's higher risk of
> crashing the VM.

exec() can fail if the user provided a bogus cpr-exec-command, in which case
recovery is possible.  exec() should never fail for valid exec arguments,
unless the system is very sick and running out of resources, in which case
all bets are off.
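
The recoverable case can be sketched as follows. This is a simplified, standalone illustration under stated assumptions, not the code in the patch: set_cloexec and exec_preserving_fds are hypothetical helpers, the list of preserved descriptors is passed in explicitly rather than walked from CPR state, and the caller is assumed to restart the VM when the function returns.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Set (on != 0) or clear FD_CLOEXEC on fd; returns 0 on success, -1 on error. */
static int set_cloexec(int fd, int on)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags < 0) {
        return -1;
    }
    flags = on ? (flags | FD_CLOEXEC) : (flags & ~FD_CLOEXEC);
    return fcntl(fd, F_SETFD, flags);
}

/*
 * Exec argv with the given descriptors kept open.  This only returns if
 * execvp fails, e.g. because the command does not exist.  In that case the
 * descriptors are marked close-on-exec again and the errno from execvp is
 * returned, so the caller can report the error and resume.
 */
static int exec_preserving_fds(char **argv, const int *fds, int nfds)
{
    int i, saved_errno;

    assert(argv && argv[0]);

    /* Clear close-on-exec as late as possible, right before the exec. */
    for (i = 0; i < nfds; i++) {
        set_cloexec(fds[i], 0);
    }

    execvp(argv[0], argv);
    saved_errno = errno;    /* save before fcntl can overwrite it */

    /* Exec failed: do not leak the descriptors into later fork+exec calls. */
    for (i = 0; i < nfds; i++) {
        set_cloexec(fds[i], 1);
    }
    fprintf(stderr, "execvp %s failed: %s\n", argv[0], strerror(saved_errno));
    return saved_errno;
}
```

Once execvp succeeds there is no old process left to fall back to, so only failures of the exec call itself are recoverable this way.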

> Luckily we are still on the same host, so things like mismatched kernel
> versions at least won't crash this migration, i.e. it is not as easy to
> fail as a cross-host migration. But still, I'd say I agree with Vladimir
> that this is a major flaw of the design if so.
> 
>> +    migrate_set_error(s, err);
>> +
>> +    migration_call_notifiers(s, MIG_EVENT_PRECOPY_FAILED, NULL);
>> +
>> +    err = NULL;
>> +    if (!migration_block_activate(&err)) {
>> +        /* error was already reported */
>> +        return;
>> +    }
>> +
>> +    if (runstate_is_live(s->vm_old_state)) {
>> +        vm_start();
>> +    }
>> +}
>> +
>> +static int cpr_exec_notifier(NotifierWithReturn *notifier, MigrationEvent *e,
>> +                             Error **errp)
>> +{
>> +    MigrationState *s = migrate_get_current();
>> +
>> +    if (e->type == MIG_EVENT_PRECOPY_DONE) {
>> +        assert(s->state == MIGRATION_STATUS_COMPLETED);
>> +        qemu_system_exec_request(cpr_exec, s->parameters.cpr_exec_command);
>> +    } else if (e->type == MIG_EVENT_PRECOPY_FAILED) {
>> +        cpr_exec_unpersist_state();
>> +    }
>> +    return 0;
>> +}
>> +
>> +void cpr_exec_init(void)
>> +{
>> +    static NotifierWithReturn exec_notifier;
>> +
>> +    migration_add_notifier_mode(&exec_notifier, cpr_exec_notifier,
>> +                                MIG_MODE_CPR_EXEC);
> 
> Why using a notifier?  IMHO exec() is something important enough to not be
> hiding in a notifier..  and CPR is already a major part of migration in the
> framework, IMHO it'll be cleaner to invoke any CPR request in the migration
> subsystem.  AFAIU notifiers are normally only for outside migration/ purposes.

This minimizes the number of control flow conditionals in the core migration code.
That's a good thing, and I thought you would like it.

The alternative is to add code right after notifiers are called to check the
mode, and call cpr_exec_notifier.  Seems silly when we have this generic
mechanism to define callouts to occur at well-defined points during execution.

Note that cpr_exec_notifier does not directly call exec.  It posts the exec
request.  It also recovers if cpr failed.
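
To make the "post, don't call" distinction concrete, here is a small standalone sketch of a mode-filtered notifier that records a request for the main loop to run later. Every type and function name in it (Notifier, add_notifier_mode, call_notifiers, main_loop_iteration and so on) is a hypothetical stand-in invented for this illustration; none of it is the QEMU notifier API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for migration modes and events. */
typedef enum { MODE_NORMAL, MODE_CPR_EXEC } Mode;
typedef enum { EVENT_DONE, EVENT_FAILED } Event;

typedef struct Notifier {
    Mode mode;                                   /* only fire in this mode */
    int (*notify)(struct Notifier *n, Event e);
    struct Notifier *next;
} Notifier;

static Notifier *notifiers;
static void (*pending_request)(void);            /* run later by the main loop */
static int exec_count, cleanup_count;

static void add_notifier_mode(Notifier *n,
                              int (*fn)(Notifier *, Event), Mode mode)
{
    n->mode = mode;
    n->notify = fn;
    n->next = notifiers;
    notifiers = n;
}

/* Core code calls this once; it needs no per-mode conditionals. */
static void call_notifiers(Mode current, Event e)
{
    Notifier *n;

    for (n = notifiers; n; n = n->next) {
        if (n->mode == current) {
            n->notify(n, e);
        }
    }
}

static void do_exec(void)
{
    exec_count++;                                /* stands in for the exec */
}

static int exec_notifier(Notifier *n, Event e)
{
    (void)n;
    if (e == EVENT_DONE) {
        pending_request = do_exec;               /* post the request only */
    } else if (e == EVENT_FAILED) {
        cleanup_count++;                         /* undo persisted state */
    }
    return 0;
}

static void main_loop_iteration(void)
{
    if (pending_request) {
        void (*fn)(void) = pending_request;

        pending_request = NULL;
        fn();
    }
}
```

In this sketch the notifier never runs the exec itself; the request only takes effect on the next main-loop iteration, after the notifier chain has finished.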

>> +}
>> diff --git a/migration/cpr.c b/migration/cpr.c
>> index 021bd6a..2078d05 100644
>> --- a/migration/cpr.c
>> +++ b/migration/cpr.c
>> @@ -198,6 +198,8 @@ int cpr_state_save(MigrationChannel *channel, Error **errp)
>>       if (mode == MIG_MODE_CPR_TRANSFER) {
>>           g_assert(channel);
>>           f = cpr_transfer_output(channel, errp);
>> +    } else if (mode == MIG_MODE_CPR_EXEC) {
>> +        f = cpr_exec_output(errp);
>>       } else {
>>           return 0;
>>       }
>> @@ -215,6 +217,10 @@ int cpr_state_save(MigrationChannel *channel, Error **errp)
>>           return ret;
>>       }
>>   
>> +    if (migrate_mode() == MIG_MODE_CPR_EXEC) {
>> +        cpr_exec_persist_state(f);
>> +    }
>> +
>>       /*
>>        * Close the socket only partially so we can later detect when the other
>>        * end closes by getting a HUP event.
>> @@ -226,6 +232,12 @@ int cpr_state_save(MigrationChannel *channel, Error **errp)
>>       return 0;
>>   }
>>   
>> +static bool unpreserve_fd(int fd)
>> +{
>> +    qemu_set_cloexec(fd);
>> +    return true;
>> +}
>> +
>>   int cpr_state_load(MigrationChannel *channel, Error **errp)
>>   {
>>       int ret;
>> @@ -237,6 +249,12 @@ int cpr_state_load(MigrationChannel *channel, Error **errp)
>>           mode = MIG_MODE_CPR_TRANSFER;
>>           cpr_set_incoming_mode(mode);
>>           f = cpr_transfer_input(channel, errp);
>> +    } else if (cpr_exec_has_state()) {
>> +        mode = MIG_MODE_CPR_EXEC;
>> +        f = cpr_exec_input(errp);
>> +        if (channel) {
>> +            warn_report("ignoring cpr channel for migration mode cpr-exec");
> 
> This looks like dead code?  channel can't be set when reaching here, AFAIU..

The user could define a cpr channel in qemu command line arguments, and it would
reach here.  In that case the user is confused, but I warn instead of abort, to
keep new QEMU alive.  I perform this sanity check here, rather than at top level,
because I have localized awareness of cpr_exec state to here.

- Steve
>> +        }
>>       } else {
>>           return 0;
>>       }
>> @@ -245,6 +263,7 @@ int cpr_state_load(MigrationChannel *channel, Error **errp)
>>       }
>>   
>>       trace_cpr_state_load(MigMode_str(mode));
>> +    cpr_set_incoming_mode(mode);
>>   
>>       v = qemu_get_be32(f);
>>       if (v != QEMU_CPR_FILE_MAGIC) {
>> @@ -266,6 +285,11 @@ int cpr_state_load(MigrationChannel *channel, Error **errp)
>>           return ret;
>>       }
>>   
>> +    if (migrate_mode() == MIG_MODE_CPR_EXEC) {
>> +        /* Set cloexec to prevent fd leaks from fork until the next cpr-exec */
>> +        cpr_walk_fd(unpreserve_fd);
>> +    }
>> +
>>       /*
>>        * Let the caller decide when to close the socket (and generate a HUP event
>>        * for the sending side).
>> @@ -286,7 +310,7 @@ void cpr_state_close(void)
>>   bool cpr_incoming_needed(void *opaque)
>>   {
>>       MigMode mode = migrate_mode();
>> -    return mode == MIG_MODE_CPR_TRANSFER;
>> +    return mode == MIG_MODE_CPR_TRANSFER || mode == MIG_MODE_CPR_EXEC;
>>   }
>>   
>>   /*
>> diff --git a/migration/migration.c b/migration/migration.c
>> index 271c521..d604284 100644
>> --- a/migration/migration.c
>> +++ b/migration/migration.c
>> @@ -333,6 +333,7 @@ void migration_object_init(void)
>>   
>>       ram_mig_init();
>>       dirty_bitmap_mig_init();
>> +    cpr_exec_init();
>>   
>>       /* Initialize cpu throttle timers */
>>       cpu_throttle_init();
>> @@ -1797,7 +1798,8 @@ bool migrate_mode_is_cpr(MigrationState *s)
>>   {
>>       MigMode mode = s->parameters.mode;
>>       return mode == MIG_MODE_CPR_REBOOT ||
>> -           mode == MIG_MODE_CPR_TRANSFER;
>> +           mode == MIG_MODE_CPR_TRANSFER ||
>> +           mode == MIG_MODE_CPR_EXEC;
>>   }
>>   
>>   int migrate_init(MigrationState *s, Error **errp)
>> @@ -2146,6 +2148,12 @@ static bool migrate_prepare(MigrationState *s, bool resume, Error **errp)
>>           return false;
>>       }
>>   
>> +    if (migrate_mode() == MIG_MODE_CPR_EXEC &&
>> +        !s->parameters.has_cpr_exec_command) {
>> +        error_setg(errp, "cpr-exec mode requires setting cpr-exec-command");
>> +        return false;
>> +    }
>> +
>>       if (migration_is_blocked(errp)) {
>>           return false;
>>       }
>> diff --git a/migration/ram.c b/migration/ram.c
>> index 7208bc1..6730a41 100644
>> --- a/migration/ram.c
>> +++ b/migration/ram.c
>> @@ -228,6 +228,7 @@ bool migrate_ram_is_ignored(RAMBlock *block)
>>       MigMode mode = migrate_mode();
>>       return !qemu_ram_is_migratable(block) ||
>>              mode == MIG_MODE_CPR_TRANSFER ||
>> +           mode == MIG_MODE_CPR_EXEC ||
>>              (migrate_ignore_shared() && qemu_ram_is_shared(block)
>>                                       && qemu_ram_is_named_file(block));
>>   }
>> diff --git a/migration/vmstate-types.c b/migration/vmstate-types.c
>> index 741a588..1aa0573 100644
>> --- a/migration/vmstate-types.c
>> +++ b/migration/vmstate-types.c
>> @@ -321,6 +321,10 @@ static int get_fd(QEMUFile *f, void *pv, size_t size,
>>                     const VMStateField *field)
>>   {
>>       int32_t *v = pv;
>> +    if (migrate_mode() == MIG_MODE_CPR_EXEC) {
>> +        qemu_get_sbe32s(f, v);
>> +        return 0;
>> +    }
>>       *v = qemu_file_get_fd(f);
>>       return 0;
>>   }
>> @@ -329,6 +333,10 @@ static int put_fd(QEMUFile *f, void *pv, size_t size,
>>                     const VMStateField *field, JSONWriter *vmdesc)
>>   {
>>       int32_t *v = pv;
>> +    if (migrate_mode() == MIG_MODE_CPR_EXEC) {
>> +        qemu_put_sbe32s(f, v);
>> +        return 0;
>> +    }
>>       return qemu_file_put_fd(f, *v);
>>   }
>>   
>> diff --git a/migration/trace-events b/migration/trace-events
>> index 706db97..e8edd1f 100644
>> --- a/migration/trace-events
>> +++ b/migration/trace-events
>> @@ -354,6 +354,7 @@ cpr_state_save(const char *mode) "%s mode"
>>   cpr_state_load(const char *mode) "%s mode"
>>   cpr_transfer_input(const char *path) "%s"
>>   cpr_transfer_output(const char *path) "%s"
>> +cpr_exec(void) ""
>>   
>>   # block-dirty-bitmap.c
>>   send_bitmap_header_enter(void) ""
>> -- 
>> 1.8.3.1





* Re: [PATCH V3 7/9] migration: cpr-exec mode
  2025-09-09 18:10     ` Steven Sistare
@ 2025-09-09 19:27       ` Peter Xu
  2025-09-12 14:49         ` Steven Sistare
  0 siblings, 1 reply; 47+ messages in thread
From: Peter Xu @ 2025-09-09 19:27 UTC (permalink / raw)
  To: Steven Sistare
  Cc: qemu-devel, Fabiano Rosas, Markus Armbruster, Paolo Bonzini,
	Eric Blake, Dr. David Alan Gilbert

On Tue, Sep 09, 2025 at 02:10:14PM -0400, Steven Sistare wrote:
> On 9/9/2025 12:32 PM, Peter Xu wrote:
> > On Thu, Aug 14, 2025 at 10:17:21AM -0700, Steve Sistare wrote:
> > > Add the cpr-exec migration mode.  Usage:
> > >    qemu-system-$arch -machine aux-ram-share=on ...
> > >    migrate_set_parameter mode cpr-exec
> > >    migrate_set_parameter cpr-exec-command \
> > >      <arg1> <arg2> ... -incoming <uri-1> \
> > >    migrate -d <uri-1>
> > > 
> > > The migrate command stops the VM, saves state to uri-1,
> > > directly exec's a new version of QEMU on the same host,
> > > replacing the original process while retaining its PID, and
> > > loads state from uri-1.  Guest RAM is preserved in place,
> > > albeit with new virtual addresses.
> > > 
> > > The new QEMU process is started by exec'ing the command
> > > specified by the @cpr-exec-command parameter.  The first word of
> > > the command is the binary, and the remaining words are its
> > > arguments.  The command may be a direct invocation of new QEMU,
> > > or may be a non-QEMU command that exec's the new QEMU binary.
> > > 
> > > This mode creates a second migration channel that is not visible
> > > to the user.  At the start of migration, old QEMU saves CPR state
> > > to the second channel, and at the end of migration, it tells the
> > > main loop to call cpr_exec.  New QEMU loads CPR state early, before
> > > objects are created.
> > > 
> > > Because old QEMU terminates when new QEMU starts, one cannot
> > > stream data between the two, so uri-1 must be a type,
> > > such as a file, that accepts all data before old QEMU exits.
> > > Otherwise, old QEMU may quietly block writing to the channel.
> > > 
> > > Memory-backend objects must have the share=on attribute, but
> > > memory-backend-epc is not supported.  The VM must be started with
> > > the '-machine aux-ram-share=on' option, which allows anonymous
> > > memory to be transferred in place to the new process.  The memfds
> > > are kept open across exec by clearing the close-on-exec flag, their
> > > values are saved in CPR state, and they are mmap'd in new QEMU.
> > 
> > Some generic questions around exec..
> > 
> > How do we know we can already safely kill all threads?
> > 
> > IIUC vcpu threads must be all stopped.  I wonder if we want to assert that
> > in the exec helper below.
> > 
> > What about rest threads?  RCU threads should be for freeing resources,
> > looks ok if to be ignored.  But others?
> 
> These threads are dormant, just as they are in the post migration state.
> There is no difference.  They can be safely killed, just as they can be
> post migration.
> 
> > Or would process states still matter in some cases? e.g. when QEMU is
> > talking to another vhost-user, or vfio-user, or virtio-fs, or ... whatever
> > other process, then suddenly the other process doesn't recognize this QEMU
> > anymore?
> 
> These cases need more development to work with cpr.  The external process
> can be used by new qemu if the socket connection (fd) is preserved in new QEMU.
> 
> > What about file locks or similar shared locks that can be running in an
> > iothread?  Is it possible that old QEMU took some shared locks, suddenly
> > qemu exec(), then the lock is never released?
> 
> Same as the post-migrate state.

IIUC the difference is that "migrate" for cpr-transfer only triggers the
migration; a separate "quit" from mgmt is required to gracefully stop the
src QEMU instance.  But for cpr-exec, the exec is attached to migration
cleanup, so cleanup -> exec happen in a row.

I'm not sure whether anything can be missed within that period.  For example,
libvirt may have logic making sure "quit" runs only after dest QEMU emits
some event.  But I confess I don't have an explicit example of what would
cause issues, so it's a pure question.

> > > Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> > > ---
> > >   qapi/migration.json       | 25 +++++++++++++++-
> > >   include/migration/cpr.h   |  1 +
> > >   migration/cpr-exec.c      | 74 +++++++++++++++++++++++++++++++++++++++++++++++
> > >   migration/cpr.c           | 26 ++++++++++++++++-
> > >   migration/migration.c     | 10 ++++++-
> > >   migration/ram.c           |  1 +
> > >   migration/vmstate-types.c |  8 +++++
> > >   migration/trace-events    |  1 +
> > >   8 files changed, 143 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/qapi/migration.json b/qapi/migration.json
> > > index ea410fd..cbc90e8 100644
> > > --- a/qapi/migration.json
> > > +++ b/qapi/migration.json
> > > @@ -694,9 +694,32 @@
> > >   #     until you issue the `migrate-incoming` command.
> > >   #
> > >   #     (since 10.0)
> > > +#
> > > +# @cpr-exec: The migrate command stops the VM, saves state to the
> > > +#     migration channel, directly exec's a new version of QEMU on the
> > > +#     same host, replacing the original process while retaining its
> > > +#     PID, and loads state from the channel.  Guest RAM is preserved
> > > +#     in place.  Devices and their pinned pages are also preserved for
> > > +#     VFIO and IOMMUFD.
> > > +#
> > > +#     Old QEMU starts new QEMU by exec'ing the command specified by
> > > +#     the @cpr-exec-command parameter.  The command may be a direct
> > > +#     invocation of new QEMU, or may be a non-QEMU command that exec's
> > > +#     the new QEMU binary.
> > > +#
> > > +#     Because old QEMU terminates when new QEMU starts, one cannot
> > > +#     stream data between the two, so the channel must be a type,
> > > +#     such as a file, that accepts all data before old QEMU exits.
> > > +#     Otherwise, old QEMU may quietly block writing to the channel.
> > 
> > The CPR channel (in case of exec mode) is persisted via env var.  Why not
> > do that too for the main migration stream?
> > 
> > Does it have something to do with the size of the binary chunk to store all
> > device states (and some private mem)?  Or other concerns?
> 
> It was not necessary to add code for a new way to move migration data for
> the main stream when the existing code and interface works just fine.  One
> of the design principles pushed on me was to make cpr look as much like live
> migration as possible, and cpr-exec does that.  It has no issues juggling
> 2 streams, and no delayed start of the monitor. cpr-transfer is actually the
> oddball.
> 
> > It just feels like it would look cleaner for cpr-exec to not need -incoming
> > XXX at all, e.g. if the series already used envvar anyway, we can use that
> > too so new QEMU would know it's cpr-exec incoming migration, without
> > -incoming parameter at all.
> > 
> > > +#
> > > +#     Memory-backend objects must have the share=on attribute, but
> > > +#     memory-backend-epc is not supported.  The VM must be started
> > > +#     with the '-machine aux-ram-share=on' option.
> > > +#
> > > +#     (since 10.2)
> > >   ##
> > >   { 'enum': 'MigMode',
> > > -  'data': [ 'normal', 'cpr-reboot', 'cpr-transfer' ] }
> > > +  'data': [ 'normal', 'cpr-reboot', 'cpr-transfer', 'cpr-exec' ] }
> > >   ##
> > >   # @ZeroPageDetection:
> > > diff --git a/include/migration/cpr.h b/include/migration/cpr.h
> > > index aaeec02..e99e48e 100644
> > > --- a/include/migration/cpr.h
> > > +++ b/include/migration/cpr.h
> > > @@ -54,6 +54,7 @@ int cpr_get_fd_param(const char *name, const char *fdname, int index, bool cpr,
> > >   QEMUFile *cpr_transfer_output(MigrationChannel *channel, Error **errp);
> > >   QEMUFile *cpr_transfer_input(MigrationChannel *channel, Error **errp);
> > > +void cpr_exec_init(void);
> > >   QEMUFile *cpr_exec_output(Error **errp);
> > >   QEMUFile *cpr_exec_input(Error **errp);
> > >   void cpr_exec_persist_state(QEMUFile *f);
> > > diff --git a/migration/cpr-exec.c b/migration/cpr-exec.c
> > > index 2c32e9c..7d0429f 100644
> > > --- a/migration/cpr-exec.c
> > > +++ b/migration/cpr-exec.c
> > > @@ -6,15 +6,20 @@
> > >   #include "qemu/osdep.h"
> > >   #include "qemu/cutils.h"
> > > +#include "qemu/error-report.h"
> > >   #include "qemu/memfd.h"
> > >   #include "qapi/error.h"
> > >   #include "io/channel-file.h"
> > >   #include "io/channel-socket.h"
> > > +#include "block/block-global-state.h"
> > > +#include "qemu/main-loop.h"
> > >   #include "migration/cpr.h"
> > >   #include "migration/qemu-file.h"
> > > +#include "migration/migration.h"
> > >   #include "migration/misc.h"
> > >   #include "migration/vmstate.h"
> > >   #include "system/runstate.h"
> > > +#include "trace.h"
> > >   #define CPR_EXEC_STATE_NAME "QEMU_CPR_EXEC_STATE"
> > > @@ -92,3 +97,72 @@ QEMUFile *cpr_exec_input(Error **errp)
> > >       lseek(mfd, 0, SEEK_SET);
> > >       return qemu_file_new_fd_input(mfd, CPR_EXEC_STATE_NAME);
> > >   }
> > > +
> > > +static bool preserve_fd(int fd)
> > > +{
> > > +    qemu_clear_cloexec(fd);
> > > +    return true;
> > > +}
> > > +
> > > +static bool unpreserve_fd(int fd)
> > > +{
> > > +    qemu_set_cloexec(fd);
> > > +    return true;
> > > +}
> > > +
> > > +static void cpr_exec(char **argv)
> > > +{
> > > +    MigrationState *s = migrate_get_current();
> > > +    Error *err = NULL;
> > > +
> > > +    /*
> > > +     * Clear the close-on-exec flag for all preserved fd's.  We cannot do so
> > > +     * earlier because they should not persist across miscellaneous fork and
> > > +     * exec calls that are performed during normal operation.
> > > +     */
> > > +    cpr_walk_fd(preserve_fd);
> > > +
> > > +    trace_cpr_exec();
> > > +    execvp(argv[0], argv);
> > > +
> > > +    cpr_walk_fd(unpreserve_fd);
> > > +
> > > +    error_setg_errno(&err, errno, "execvp %s failed", argv[0]);
> > > +    error_report_err(error_copy(err));
> > 
> > Feel free to ignore my question in the other patch, so we dump some errors
> > here.. which makes sense.
> > 
> > > +    migrate_set_state(&s->state, s->state, MIGRATION_STATUS_FAILED);
> > 
> > This is indeed FAILED migration, however it seems to imply it can catch
> > whatever possible failures that incoming could have.  Strictly speaking
> > this is not migration failure, but exec failure..  Maybe we need a comment
> > above this one explaining that we won't be able to capture any migration
> > issues, it's too late after exec() succeeded, so there's higher risk of
> > crashing the VM.
> 
> exec() can fail if the user provided a bogus cpr-exec-command, in which case
> recovery is possible.  exec() should never fail for valid exec arguments,
> unless the system is very sick and running out of resources, in which case
> all bets are off.

I really don't expect that to fail... bogus cpr-exec-command is more or
less a programming bug.  After all, I don't expect normal QEMU users would
use cpr-exec without a proper mgmt providing cpr-exec-command.

How about adding a comment here on what FAILED can capture (and what it
cannot)?
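
To make the recoverable case concrete, here is a minimal standalone sketch of
the pattern used in cpr_exec(): clear close-on-exec, attempt execvp(), and if
it returns, restore the flag and report errno.  It uses plain POSIX calls
rather than QEMU's helpers, and set_cloexec() and try_exec_preserving() are
hypothetical names for illustration only, not part of the patch.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Set (on != 0) or clear (on == 0) FD_CLOEXEC; returns 0 on success, -1 on error. */
static int set_cloexec(int fd, int on)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags < 0) {
        return -1;
    }
    flags = on ? (flags | FD_CLOEXEC) : (flags & ~FD_CLOEXEC);
    return fcntl(fd, F_SETFD, flags);
}

/*
 * Try to exec cmd with fd kept open in the new image.  This returns only
 * if the exec did not happen; in that case FD_CLOEXEC is restored on fd and
 * the errno value of the failure is returned.
 */
static int try_exec_preserving(int fd, char **cmd)
{
    int saved;

    if (set_cloexec(fd, 0) < 0) {
        return errno;
    }

    execvp(cmd[0], cmd);

    /* Only reached when execvp failed, e.g. for a bogus command. */
    saved = errno;
    set_cloexec(fd, 1);
    fprintf(stderr, "execvp %s failed: %s\n", cmd[0], strerror(saved));
    return saved;
}
```

With a command path that does not exist, the call returns ENOENT and the
descriptor has FD_CLOEXEC set again.  That is the only kind of failure the
caller can observe; anything that goes wrong after a successful exec happens
in the new process image and cannot be reported by the old one.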

> 
> > Luckily we still are on the same host, so things like mismatched kernel
> > versions at least won't crash this migration.. aka it is not as easy to
> > fail as a cross-host migration indeed. But still, I'd say I agree with
> > Vladimir that this is a major flaw of the design if so.
> > 
> > > +    migrate_set_error(s, err);
> > > +
> > > +    migration_call_notifiers(s, MIG_EVENT_PRECOPY_FAILED, NULL);
> > > +
> > > +    err = NULL;
> > > +    if (!migration_block_activate(&err)) {
> > > +        /* error was already reported */
> > > +        return;
> > > +    }
> > > +
> > > +    if (runstate_is_live(s->vm_old_state)) {
> > > +        vm_start();
> > > +    }
> > > +}
> > > +
> > > +static int cpr_exec_notifier(NotifierWithReturn *notifier, MigrationEvent *e,
> > > +                             Error **errp)
> > > +{
> > > +    MigrationState *s = migrate_get_current();
> > > +
> > > +    if (e->type == MIG_EVENT_PRECOPY_DONE) {
> > > +        assert(s->state == MIGRATION_STATUS_COMPLETED);
> > > +        qemu_system_exec_request(cpr_exec, s->parameters.cpr_exec_command);
> > > +    } else if (e->type == MIG_EVENT_PRECOPY_FAILED) {
> > > +        cpr_exec_unpersist_state();
> > > +    }
> > > +    return 0;
> > > +}
> > > +
> > > +void cpr_exec_init(void)
> > > +{
> > > +    static NotifierWithReturn exec_notifier;
> > > +
> > > +    migration_add_notifier_mode(&exec_notifier, cpr_exec_notifier,
> > > +                                MIG_MODE_CPR_EXEC);
> > 
> > Why using a notifier?  IMHO exec() is something important enough to not be
> > hiding in a notifier..  and CPR is already a major part of migration in the
> > framework, IMHO it'll be cleaner to invoke any CPR request in the migration
> > subsystem.  AFAIU notifiers are normally only for outside migration/ purposes.
> 
> This minimizes the number of control flow conditionals in the core migration code.
> That's a good thing, and I thought you would like it.
> 
> The alternative is to add code right after notifiers are called to check the
> mode, and call cpr_exec_notifier.  Seems silly when we have this generic
> mechanism to define callouts to occur at well-defined points during execution.
> 
> Note that cpr_exec_notifier does not directly call exec.  It posts the exec
> request.  It also recovers if cpr failed.

OK, I don't think I feel strongly on this one.

Initially I was concerned that at least some of the notifiers would not be
invoked, which looks to be completely random.  But I kind of agree you chose
the spot late enough that whatever really should have been done before an
exec() should hopefully be processed already, maybe during or around the
vm_stop() phase.

Feel free to keep it then if nobody else asks.
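
For readers following along, the "post the request, do not call exec directly"
flow can be sketched in isolation as below.  This is only an illustration
under my own assumptions: the event enum, post_exec_request(),
main_loop_iteration() and fake_exec() are hypothetical stand-ins, not the
QEMU notifier or main-loop API.

```c
#include <stdbool.h>
#include <stddef.h>

enum event { EV_PRECOPY_DONE, EV_PRECOPY_FAILED };

typedef void (*exec_fn)(char **cmd);

/* A single pending request slot, consumed later by the main loop. */
static exec_fn pending_fn;
static char **pending_cmd;

static int exec_calls;      /* how many times the exec callback ran */
static int cleanup_calls;   /* how many times failure cleanup ran */

/* Stand-in for the real exec helper: only counts invocations. */
static void fake_exec(char **cmd)
{
    (void)cmd;
    exec_calls++;
}

/* Record the request; nothing is executed at this point. */
static void post_exec_request(exec_fn fn, char **cmd)
{
    pending_fn = fn;
    pending_cmd = cmd;
}

/* Run a pending request, if any.  Returns true if one was run. */
static bool main_loop_iteration(void)
{
    exec_fn fn = pending_fn;

    if (!fn) {
        return false;
    }
    pending_fn = NULL;
    fn(pending_cmd);
    return true;
}

/* The notifier posts the request on success and cleans up on failure. */
static int notifier(enum event e, char **cmd)
{
    if (e == EV_PRECOPY_DONE) {
        post_exec_request(fake_exec, cmd);
    } else if (e == EV_PRECOPY_FAILED) {
        cleanup_calls++;
    }
    return 0;
}
```

The point of the split is that the notifier returns normally, so any remaining
notifiers and migration cleanup still run, and the exec itself happens later
from a well-defined place in the main loop.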

> 
> > > +}
> > > diff --git a/migration/cpr.c b/migration/cpr.c
> > > index 021bd6a..2078d05 100644
> > > --- a/migration/cpr.c
> > > +++ b/migration/cpr.c
> > > @@ -198,6 +198,8 @@ int cpr_state_save(MigrationChannel *channel, Error **errp)
> > >       if (mode == MIG_MODE_CPR_TRANSFER) {
> > >           g_assert(channel);
> > >           f = cpr_transfer_output(channel, errp);
> > > +    } else if (mode == MIG_MODE_CPR_EXEC) {
> > > +        f = cpr_exec_output(errp);
> > >       } else {
> > >           return 0;
> > >       }
> > > @@ -215,6 +217,10 @@ int cpr_state_save(MigrationChannel *channel, Error **errp)
> > >           return ret;
> > >       }
> > > +    if (migrate_mode() == MIG_MODE_CPR_EXEC) {
> > > +        cpr_exec_persist_state(f);
> > > +    }
> > > +
> > >       /*
> > >        * Close the socket only partially so we can later detect when the other
> > >        * end closes by getting a HUP event.
> > > @@ -226,6 +232,12 @@ int cpr_state_save(MigrationChannel *channel, Error **errp)
> > >       return 0;
> > >   }
> > > +static bool unpreserve_fd(int fd)
> > > +{
> > > +    qemu_set_cloexec(fd);
> > > +    return true;
> > > +}
> > > +
> > >   int cpr_state_load(MigrationChannel *channel, Error **errp)
> > >   {
> > >       int ret;
> > > @@ -237,6 +249,12 @@ int cpr_state_load(MigrationChannel *channel, Error **errp)
> > >           mode = MIG_MODE_CPR_TRANSFER;
> > >           cpr_set_incoming_mode(mode);
> > >           f = cpr_transfer_input(channel, errp);
> > > +    } else if (cpr_exec_has_state()) {
> > > +        mode = MIG_MODE_CPR_EXEC;
> > > +        f = cpr_exec_input(errp);
> > > +        if (channel) {
> > > +            warn_report("ignoring cpr channel for migration mode cpr-exec");
> > 
> > This looks like dead code?  channel can't be set when reaching here, AFAIU..
> 
> The user could define a cpr channel in qemu command line arguments, and it would
> reach here.  In that case the user is confused, but I warn instead of abort, to
> keep new QEMU alive.  I perform this sanity check here, rather than at top level,
> because I have localized awareness of cpr_exec state to here.

The code (after this patch applied) looks like this:

    if (channel) {                                            <------- [*]
        mode = MIG_MODE_CPR_TRANSFER;
        cpr_set_incoming_mode(mode);
        f = cpr_transfer_input(channel, errp);
    } else if (cpr_exec_has_state()) {
        mode = MIG_MODE_CPR_EXEC;
        f = cpr_exec_input(errp);
        if (channel) {
            warn_report("ignoring cpr channel for migration mode cpr-exec");
        }
    } else {
        return 0;
    }

IIUC [*] will capture any channel!=NULL case.

-- 
Peter Xu




* Re: [PATCH V3 7/9] migration: cpr-exec mode
  2025-09-09 19:27       ` Peter Xu
@ 2025-09-12 14:49         ` Steven Sistare
  0 siblings, 0 replies; 47+ messages in thread
From: Steven Sistare @ 2025-09-12 14:49 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Fabiano Rosas, Markus Armbruster, Paolo Bonzini,
	Eric Blake, Dr. David Alan Gilbert

On 9/9/2025 3:27 PM, Peter Xu wrote:
> On Tue, Sep 09, 2025 at 02:10:14PM -0400, Steven Sistare wrote:
>> On 9/9/2025 12:32 PM, Peter Xu wrote:
>>> On Thu, Aug 14, 2025 at 10:17:21AM -0700, Steve Sistare wrote:
>>>> Add the cpr-exec migration mode.  Usage:
>>>>     qemu-system-$arch -machine aux-ram-share=on ...
>>>>     migrate_set_parameter mode cpr-exec
>>>>     migrate_set_parameter cpr-exec-command \
>>>>       <arg1> <arg2> ... -incoming <uri-1>
>>>>     migrate -d <uri-1>
>>>>
>>>> The migrate command stops the VM, saves state to uri-1,
>>>> directly exec's a new version of QEMU on the same host,
>>>> replacing the original process while retaining its PID, and
>>>> loads state from uri-1.  Guest RAM is preserved in place,
>>>> albeit with new virtual addresses.
>>>>
>>>> The new QEMU process is started by exec'ing the command
>>>> specified by the @cpr-exec-command parameter.  The first word of
>>>> the command is the binary, and the remaining words are its
>>>> arguments.  The command may be a direct invocation of new QEMU,
>>>> or may be a non-QEMU command that exec's the new QEMU binary.
>>>>
>>>> This mode creates a second migration channel that is not visible
>>>> to the user.  At the start of migration, old QEMU saves CPR state
>>>> to the second channel, and at the end of migration, it tells the
>>>> main loop to call cpr_exec.  New QEMU loads CPR state early, before
>>>> objects are created.
>>>>
>>>> Because old QEMU terminates when new QEMU starts, one cannot
>>>> stream data between the two, so uri-1 must be a type,
>>>> such as a file, that accepts all data before old QEMU exits.
>>>> Otherwise, old QEMU may quietly block writing to the channel.
>>>>
>>>> Memory-backend objects must have the share=on attribute, but
>>>> memory-backend-epc is not supported.  The VM must be started with
>>>> the '-machine aux-ram-share=on' option, which allows anonymous
>>>> memory to be transferred in place to the new process.  The memfds
>>>> are kept open across exec by clearing the close-on-exec flag, their
>>>> values are saved in CPR state, and they are mmap'd in new QEMU.
>>>
>>> Some generic questions around exec..
>>>
>>> How do we know we can already safely kill all threads?
>>>
>>> IIUC vcpu threads must be all stopped.  I wonder if we want to assert that
>>> in the exec helper below.
>>>
>>> What about the rest of the threads?  RCU threads should be for freeing
>>> resources, so they look ok to ignore.  But others?
>>
>> These threads are dormant, just as they are in the post migration state.
>> There is no difference.  They can be safely killed, just as they can be
>> post migration.
>>
>>> Or would process states still matter in some cases? e.g. when QEMU is
>>> talking to another vhost-user, or vfio-user, or virtio-fs, or ... whatever
>>> other process, then suddenly the other process doesn't recognize this QEMU
>>> anymore?
>>
>> These cases need more development to work with cpr.  The external process
>> can be used by new qemu if the socket connection (fd) is preserved in new QEMU.
>>
>>> What about file locks or similar shared locks that can be running in an
>>> iothread?  Is it possible that old QEMU took some shared locks, suddenly
>>> qemu exec(), then the lock is never released?
>>
>> Same as the post-migrate state.
> 
> IIUC the difference is "migrate" for cpr-transfer triggers migration only;
> another "quit" required to gracefully stop the src QEMU instance from mgmt.
> But for cpr-exec, it's attached to migration cleanup -> exec in one go.
> 
> I'm not sure whether something could be missed within that period.  For
> example, libvirt may have logic making sure "quit" runs only after dest QEMU
> emits some event.  But I confess I don't have an explicit example of what
> would cause issues, so it's purely a question.
> 
>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>> ---
>>>>    qapi/migration.json       | 25 +++++++++++++++-
>>>>    include/migration/cpr.h   |  1 +
>>>>    migration/cpr-exec.c      | 74 +++++++++++++++++++++++++++++++++++++++++++++++
>>>>    migration/cpr.c           | 26 ++++++++++++++++-
>>>>    migration/migration.c     | 10 ++++++-
>>>>    migration/ram.c           |  1 +
>>>>    migration/vmstate-types.c |  8 +++++
>>>>    migration/trace-events    |  1 +
>>>>    8 files changed, 143 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/qapi/migration.json b/qapi/migration.json
>>>> index ea410fd..cbc90e8 100644
>>>> --- a/qapi/migration.json
>>>> +++ b/qapi/migration.json
>>>> @@ -694,9 +694,32 @@
>>>>    #     until you issue the `migrate-incoming` command.
>>>>    #
>>>>    #     (since 10.0)
>>>> +#
>>>> +# @cpr-exec: The migrate command stops the VM, saves state to the
>>>> +#     migration channel, directly exec's a new version of QEMU on the
>>>> +#     same host, replacing the original process while retaining its
>>>> +#     PID, and loads state from the channel.  Guest RAM is preserved
>>>> +#     in place.  Devices and their pinned pages are also preserved for
>>>> +#     VFIO and IOMMUFD.
>>>> +#
>>>> +#     Old QEMU starts new QEMU by exec'ing the command specified by
>>>> +#     the @cpr-exec-command parameter.  The command may be a direct
>>>> +#     invocation of new QEMU, or may be a non-QEMU command that exec's
>>>> +#     the new QEMU binary.
>>>> +#
>>>> +#     Because old QEMU terminates when new QEMU starts, one cannot
>>>> +#     stream data between the two, so the channel must be a type,
>>>> +#     such as a file, that accepts all data before old QEMU exits.
>>>> +#     Otherwise, old QEMU may quietly block writing to the channel.
>>>
>>> The CPR channel (in case of exec mode) is persisted via env var.  Why not
>>> do that too for the main migration stream?
>>>
>>> Does it have something to do with the size of the binary chunk to store all
>>> device states (and some private mem)?  Or other concerns?
>>
>> It was not necessary to add code for a new way to move migration data for
>> the main stream when the existing code and interface works just fine.  One
>> of the design principles pushed on me was to make cpr look as much like live
>> migration as possible, and cpr-exec does that.  It has no issues juggling
>> 2 streams, and no delayed start of the monitor. cpr-transfer is actually the
>> oddball.
>>
>>> It just feels like it would look cleaner for cpr-exec to not need -incoming
>>> XXX at all, e.g. if the series already used envvar anyway, we can use that
>>> too so new QEMU would know it's cpr-exec incoming migration, without
>>> -incoming parameter at all.
>>>
>>>> +#
>>>> +#     Memory-backend objects must have the share=on attribute, but
>>>> +#     memory-backend-epc is not supported.  The VM must be started
>>>> +#     with the '-machine aux-ram-share=on' option.
>>>> +#
>>>> +#     (since 10.2)
>>>>    ##
>>>>    { 'enum': 'MigMode',
>>>> -  'data': [ 'normal', 'cpr-reboot', 'cpr-transfer' ] }
>>>> +  'data': [ 'normal', 'cpr-reboot', 'cpr-transfer', 'cpr-exec' ] }
>>>>    ##
>>>>    # @ZeroPageDetection:
>>>> diff --git a/include/migration/cpr.h b/include/migration/cpr.h
>>>> index aaeec02..e99e48e 100644
>>>> --- a/include/migration/cpr.h
>>>> +++ b/include/migration/cpr.h
>>>> @@ -54,6 +54,7 @@ int cpr_get_fd_param(const char *name, const char *fdname, int index, bool cpr,
>>>>    QEMUFile *cpr_transfer_output(MigrationChannel *channel, Error **errp);
>>>>    QEMUFile *cpr_transfer_input(MigrationChannel *channel, Error **errp);
>>>> +void cpr_exec_init(void);
>>>>    QEMUFile *cpr_exec_output(Error **errp);
>>>>    QEMUFile *cpr_exec_input(Error **errp);
>>>>    void cpr_exec_persist_state(QEMUFile *f);
>>>> diff --git a/migration/cpr-exec.c b/migration/cpr-exec.c
>>>> index 2c32e9c..7d0429f 100644
>>>> --- a/migration/cpr-exec.c
>>>> +++ b/migration/cpr-exec.c
>>>> @@ -6,15 +6,20 @@
>>>>    #include "qemu/osdep.h"
>>>>    #include "qemu/cutils.h"
>>>> +#include "qemu/error-report.h"
>>>>    #include "qemu/memfd.h"
>>>>    #include "qapi/error.h"
>>>>    #include "io/channel-file.h"
>>>>    #include "io/channel-socket.h"
>>>> +#include "block/block-global-state.h"
>>>> +#include "qemu/main-loop.h"
>>>>    #include "migration/cpr.h"
>>>>    #include "migration/qemu-file.h"
>>>> +#include "migration/migration.h"
>>>>    #include "migration/misc.h"
>>>>    #include "migration/vmstate.h"
>>>>    #include "system/runstate.h"
>>>> +#include "trace.h"
>>>>    #define CPR_EXEC_STATE_NAME "QEMU_CPR_EXEC_STATE"
>>>> @@ -92,3 +97,72 @@ QEMUFile *cpr_exec_input(Error **errp)
>>>>        lseek(mfd, 0, SEEK_SET);
>>>>        return qemu_file_new_fd_input(mfd, CPR_EXEC_STATE_NAME);
>>>>    }
>>>> +
>>>> +static bool preserve_fd(int fd)
>>>> +{
>>>> +    qemu_clear_cloexec(fd);
>>>> +    return true;
>>>> +}
>>>> +
>>>> +static bool unpreserve_fd(int fd)
>>>> +{
>>>> +    qemu_set_cloexec(fd);
>>>> +    return true;
>>>> +}
>>>> +
>>>> +static void cpr_exec(char **argv)
>>>> +{
>>>> +    MigrationState *s = migrate_get_current();
>>>> +    Error *err = NULL;
>>>> +
>>>> +    /*
>>>> +     * Clear the close-on-exec flag for all preserved fd's.  We cannot do so
>>>> +     * earlier because they should not persist across miscellaneous fork and
>>>> +     * exec calls that are performed during normal operation.
>>>> +     */
>>>> +    cpr_walk_fd(preserve_fd);
>>>> +
>>>> +    trace_cpr_exec();
>>>> +    execvp(argv[0], argv);
>>>> +
>>>> +    cpr_walk_fd(unpreserve_fd);
>>>> +
>>>> +    error_setg_errno(&err, errno, "execvp %s failed", argv[0]);
>>>> +    error_report_err(error_copy(err));
>>>
>>> Feel free to ignore my question in the other patch, so we dump some errors
>>> here.. which makes sense.
>>>
>>>> +    migrate_set_state(&s->state, s->state, MIGRATION_STATUS_FAILED);
>>>
>>> This is indeed FAILED migration, however it seems to imply it can catch
>>> whatever possible failures that incoming could have.  Strictly speaking
>>> this is not migration failure, but exec failure..  Maybe we need a comment
>>> above this one explaining that we won't be able to capture any migration
>>> issues, it's too late after exec() succeeded, so there's higher risk of
>>> crashing the VM.
>>
>> exec() can fail if the user provided a bogus cpr-exec-command, in which case
>> recovery is possible.  exec() should never fail for valid exec arguments,
>> unless the system is very sick and running out of resources, in which case
>> all bets are off.
> 
> I really don't expect that to fail... bogus cpr-exec-command is more or
> less a programming bug.  After all, I don't expect normal QEMU users would
> use cpr-exec without a proper mgmt providing cpr-exec-command.
> 
> How about adding a comment here on what FAILED can capture (and what it
> cannot)?

Will do.

>>> Luckily we still are on the same host, so things like mismatched kernel
>>> versions at least won't crash this migration.. aka it is not as easy to
>>> fail as a cross-host migration indeed. But still, I'd say I agree with
>>> Vladimir that this is a major flaw of the design if so.
>>>
>>>> +    migrate_set_error(s, err);
>>>> +
>>>> +    migration_call_notifiers(s, MIG_EVENT_PRECOPY_FAILED, NULL);
>>>> +
>>>> +    err = NULL;
>>>> +    if (!migration_block_activate(&err)) {
>>>> +        /* error was already reported */
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    if (runstate_is_live(s->vm_old_state)) {
>>>> +        vm_start();
>>>> +    }
>>>> +}
>>>> +
>>>> +static int cpr_exec_notifier(NotifierWithReturn *notifier, MigrationEvent *e,
>>>> +                             Error **errp)
>>>> +{
>>>> +    MigrationState *s = migrate_get_current();
>>>> +
>>>> +    if (e->type == MIG_EVENT_PRECOPY_DONE) {
>>>> +        assert(s->state == MIGRATION_STATUS_COMPLETED);
>>>> +        qemu_system_exec_request(cpr_exec, s->parameters.cpr_exec_command);
>>>> +    } else if (e->type == MIG_EVENT_PRECOPY_FAILED) {
>>>> +        cpr_exec_unpersist_state();
>>>> +    }
>>>> +    return 0;
>>>> +}
>>>> +
>>>> +void cpr_exec_init(void)
>>>> +{
>>>> +    static NotifierWithReturn exec_notifier;
>>>> +
>>>> +    migration_add_notifier_mode(&exec_notifier, cpr_exec_notifier,
>>>> +                                MIG_MODE_CPR_EXEC);
>>>
>>> Why using a notifier?  IMHO exec() is something important enough to not be
>>> hiding in a notifier..  and CPR is already a major part of migration in the
>>> framework, IMHO it'll be cleaner to invoke any CPR request in the migration
>>> subsystem.  AFAIU notifiers are normally only for outside migration/ purposes.
>>
>> This minimizes the number of control flow conditionals in the core migration code.
>> That's a good thing, and I thought you would like it.
>>
>> The alternative is to add code right after notifiers are called to check the
>> mode, and call cpr_exec_notifier.  Seems silly when we have this generic
>> mechanism to define callouts to occur at well-defined points during execution.
>>
>> Note that cpr_exec_notifier does not directly call exec.  It posts the exec
>> request.  It also recovers if cpr failed.
> 
> OK, I don't think I feel strongly on this one.
> 
> Initially I was concerned that at least some of the notifiers would not be
> invoked, which looks to be completely random.  But I kind of agree you chose
> the spot late enough that whatever really should have been done before an
> exec() should hopefully be processed already, maybe during or around the
> vm_stop() phase.
> 
> Feel free to keep it then if nobody else asks.
> 
>>
>>>> +}
>>>> diff --git a/migration/cpr.c b/migration/cpr.c
>>>> index 021bd6a..2078d05 100644
>>>> --- a/migration/cpr.c
>>>> +++ b/migration/cpr.c
>>>> @@ -198,6 +198,8 @@ int cpr_state_save(MigrationChannel *channel, Error **errp)
>>>>        if (mode == MIG_MODE_CPR_TRANSFER) {
>>>>            g_assert(channel);
>>>>            f = cpr_transfer_output(channel, errp);
>>>> +    } else if (mode == MIG_MODE_CPR_EXEC) {
>>>> +        f = cpr_exec_output(errp);
>>>>        } else {
>>>>            return 0;
>>>>        }
>>>> @@ -215,6 +217,10 @@ int cpr_state_save(MigrationChannel *channel, Error **errp)
>>>>            return ret;
>>>>        }
>>>> +    if (migrate_mode() == MIG_MODE_CPR_EXEC) {
>>>> +        cpr_exec_persist_state(f);
>>>> +    }
>>>> +
>>>>        /*
>>>>         * Close the socket only partially so we can later detect when the other
>>>>         * end closes by getting a HUP event.
>>>> @@ -226,6 +232,12 @@ int cpr_state_save(MigrationChannel *channel, Error **errp)
>>>>        return 0;
>>>>    }
>>>> +static bool unpreserve_fd(int fd)
>>>> +{
>>>> +    qemu_set_cloexec(fd);
>>>> +    return true;
>>>> +}
>>>> +
>>>>    int cpr_state_load(MigrationChannel *channel, Error **errp)
>>>>    {
>>>>        int ret;
>>>> @@ -237,6 +249,12 @@ int cpr_state_load(MigrationChannel *channel, Error **errp)
>>>>            mode = MIG_MODE_CPR_TRANSFER;
>>>>            cpr_set_incoming_mode(mode);
>>>>            f = cpr_transfer_input(channel, errp);
>>>> +    } else if (cpr_exec_has_state()) {
>>>> +        mode = MIG_MODE_CPR_EXEC;
>>>> +        f = cpr_exec_input(errp);
>>>> +        if (channel) {
>>>> +            warn_report("ignoring cpr channel for migration mode cpr-exec");
>>>
>>> This looks like dead code?  channel can't be set when reaching here, AFAIU..
>>
>> The user could define a cpr channel in qemu command line arguments, and it would
>> reach here.  In that case the user is confused, but I warn instead of abort, to
>> keep new QEMU alive.  I perform this sanity check here, rather than at top level,
>> because I have localized awareness of cpr_exec state to here.
> 
> The code (after this patch applied) looks like this:
> 
>      if (channel) {                                            <------- [*]
>          mode = MIG_MODE_CPR_TRANSFER;
>          cpr_set_incoming_mode(mode);
>          f = cpr_transfer_input(channel, errp);
>      } else if (cpr_exec_has_state()) {
>          mode = MIG_MODE_CPR_EXEC;
>          f = cpr_exec_input(errp);
>          if (channel) {
>              warn_report("ignoring cpr channel for migration mode cpr-exec");
>          }
>      } else {
>          return 0;
>      }
> 
> IIUC [*] will capture any channel!=NULL case.

Oops, my bad, thanks.  I will re-arrange it:

     if (cpr_exec_has_state()) {
         mode = MIG_MODE_CPR_EXEC;
         f = cpr_exec_input(errp);
         if (channel) {
             warn_report("ignoring cpr channel for migration mode cpr-exec");
         }
     } else if (channel) {
         mode = MIG_MODE_CPR_TRANSFER;
     ...
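
As a quick sanity check of the two orderings, here is a small standalone
sketch.  It is not the QEMU code: pick_original() and pick_reordered() are
hypothetical stand-ins that model only the branch order, with a bool in place
of cpr_exec_has_state() and a flag in place of warn_report().

```c
#include <stdbool.h>
#include <stddef.h>

enum mode { MODE_NONE, MODE_CPR_TRANSFER, MODE_CPR_EXEC };

struct pick {
    enum mode mode;
    bool warned;    /* true if the "ignoring cpr channel" warning would fire */
};

/* Ordering in the posted patch: channel is tested first. */
static struct pick pick_original(const void *channel, bool exec_has_state)
{
    struct pick p = { MODE_NONE, false };

    if (channel) {
        p.mode = MODE_CPR_TRANSFER;
    } else if (exec_has_state) {
        p.mode = MODE_CPR_EXEC;
        if (channel) {
            p.warned = true;    /* never reached: channel is NULL here */
        }
    }
    return p;
}

/* Re-arranged ordering: exec state is tested first. */
static struct pick pick_reordered(const void *channel, bool exec_has_state)
{
    struct pick p = { MODE_NONE, false };

    if (exec_has_state) {
        p.mode = MODE_CPR_EXEC;
        if (channel) {
            p.warned = true;    /* reachable: user also passed a cpr channel */
        }
    } else if (channel) {
        p.mode = MODE_CPR_TRANSFER;
    }
    return p;
}
```

With both a channel and exec state present, the original ordering picks
cpr-transfer and stays silent, while the re-arranged ordering picks cpr-exec
and warns.  The other combinations behave the same in both versions.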

- Steve





* Re: [PATCH V3 7/9] migration: cpr-exec mode
  2025-08-14 17:17 ` [PATCH V3 7/9] migration: cpr-exec mode Steve Sistare
  2025-09-09 16:32   ` Peter Xu
@ 2025-09-11 15:09   ` Markus Armbruster
  2025-09-12 14:49     ` Steven Sistare
  1 sibling, 1 reply; 47+ messages in thread
From: Markus Armbruster @ 2025-09-11 15:09 UTC (permalink / raw)
  To: Steve Sistare
  Cc: qemu-devel, Fabiano Rosas, Peter Xu, Paolo Bonzini, Eric Blake,
	Dr. David Alan Gilbert

Steve Sistare <steven.sistare@oracle.com> writes:

> Add the cpr-exec migration mode.  Usage:
>   qemu-system-$arch -machine aux-ram-share=on ...
>   migrate_set_parameter mode cpr-exec
>   migrate_set_parameter cpr-exec-command \
>     <arg1> <arg2> ... -incoming <uri-1>
>   migrate -d <uri-1>
>
> The migrate command stops the VM, saves state to uri-1,
> directly exec's a new version of QEMU on the same host,
> replacing the original process while retaining its PID, and
> loads state from uri-1.  Guest RAM is preserved in place,
> albeit with new virtual addresses.
>
> The new QEMU process is started by exec'ing the command
> specified by the @cpr-exec-command parameter.  The first word of
> the command is the binary, and the remaining words are its
> arguments.  The command may be a direct invocation of new QEMU,
> or may be a non-QEMU command that exec's the new QEMU binary.
>
> This mode creates a second migration channel that is not visible
> to the user.  At the start of migration, old QEMU saves CPR state
> to the second channel, and at the end of migration, it tells the
> main loop to call cpr_exec.  New QEMU loads CPR state early, before
> objects are created.
>
> Because old QEMU terminates when new QEMU starts, one cannot
> stream data between the two, so uri-1 must be a type,
> such as a file, that accepts all data before old QEMU exits.
> Otherwise, old QEMU may quietly block writing to the channel.
>
> Memory-backend objects must have the share=on attribute, but
> memory-backend-epc is not supported.  The VM must be started with
> the '-machine aux-ram-share=on' option, which allows anonymous
> memory to be transferred in place to the new process.  The memfds
> are kept open across exec by clearing the close-on-exec flag, their
> values are saved in CPR state, and they are mmap'd in new QEMU.
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  qapi/migration.json       | 25 +++++++++++++++-
>  include/migration/cpr.h   |  1 +
>  migration/cpr-exec.c      | 74 +++++++++++++++++++++++++++++++++++++++++++++++
>  migration/cpr.c           | 26 ++++++++++++++++-
>  migration/migration.c     | 10 ++++++-
>  migration/ram.c           |  1 +
>  migration/vmstate-types.c |  8 +++++
>  migration/trace-events    |  1 +
>  8 files changed, 143 insertions(+), 3 deletions(-)
>
> diff --git a/qapi/migration.json b/qapi/migration.json
> index ea410fd..cbc90e8 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -694,9 +694,32 @@
>  #     until you issue the `migrate-incoming` command.
>  #
>  #     (since 10.0)
> +#
> +# @cpr-exec: The migrate command stops the VM, saves state to the
> +#     migration channel, directly exec's a new version of QEMU on the
> +#     same host, replacing the original process while retaining its
> +#     PID, and loads state from the channel.  Guest RAM is preserved
> +#     in place.  Devices and their pinned pages are also preserved for
> +#     VFIO and IOMMUFD.
> +#
> +#     Old QEMU starts new QEMU by exec'ing the command specified by
> +#     the @cpr-exec-command parameter.  The command may be a direct
> +#     invocation of new QEMU, or may be a non-QEMU command that exec's
> +#     the new QEMU binary.

Not sure we need the last sentence.

If we keep it, maybe say "a wrapper script" instead of "a non-QEMU
command".

> +#
> +#     Because old QEMU terminates when new QEMU starts, one cannot
> +#     stream data between the two, so the channel must be a type,
> +#     such as a file, that accepts all data before old QEMU exits.
> +#     Otherwise, old QEMU may quietly block writing to the channel.
> +#
> +#     Memory-backend objects must have the share=on attribute, but
> +#     memory-backend-epc is not supported.  The VM must be started
> +#     with the '-machine aux-ram-share=on' option.

I assume violations of this constraint fail cleanly.

> +#
> +#     (since 10.2)
>  ##
>  { 'enum': 'MigMode',
> -  'data': [ 'normal', 'cpr-reboot', 'cpr-transfer' ] }
> +  'data': [ 'normal', 'cpr-reboot', 'cpr-transfer', 'cpr-exec' ] }
>  
>  ##
>  # @ZeroPageDetection:

Acked-by: Markus Armbruster <armbru@redhat.com>

[...]




* Re: [PATCH V3 7/9] migration: cpr-exec mode
  2025-09-11 15:09   ` Markus Armbruster
@ 2025-09-12 14:49     ` Steven Sistare
  0 siblings, 0 replies; 47+ messages in thread
From: Steven Sistare @ 2025-09-12 14:49 UTC (permalink / raw)
  To: Markus Armbruster
  Cc: qemu-devel, Fabiano Rosas, Peter Xu, Paolo Bonzini, Eric Blake,
	Dr. David Alan Gilbert

On 9/11/2025 11:09 AM, Markus Armbruster wrote:
> Steve Sistare <steven.sistare@oracle.com> writes:
> 
>> Add the cpr-exec migration mode.  Usage:
>>    qemu-system-$arch -machine aux-ram-share=on ...
>>    migrate_set_parameter mode cpr-exec
>>    migrate_set_parameter cpr-exec-command \
>>      <arg1> <arg2> ... -incoming <uri-1> \
>>    migrate -d <uri-1>
>>
>> The migrate command stops the VM, saves state to uri-1,
>> directly exec's a new version of QEMU on the same host,
>> replacing the original process while retaining its PID, and
>> loads state from uri-1.  Guest RAM is preserved in place,
>> albeit with new virtual addresses.
>>
>> The new QEMU process is started by exec'ing the command
>> specified by the @cpr-exec-command parameter.  The first word of
>> the command is the binary, and the remaining words are its
>> arguments.  The command may be a direct invocation of new QEMU,
>> or may be a non-QEMU command that exec's the new QEMU binary.
>>
>> This mode creates a second migration channel that is not visible
>> to the user.  At the start of migration, old QEMU saves CPR state
>> to the second channel, and at the end of migration, it tells the
>> main loop to call cpr_exec.  New QEMU loads CPR state early, before
>> objects are created.
>>
>> Because old QEMU terminates when new QEMU starts, one cannot
>> stream data between the two, so uri-1 must be a type,
>> such as a file, that accepts all data before old QEMU exits.
>> Otherwise, old QEMU may quietly block writing to the channel.
>>
>> Memory-backend objects must have the share=on attribute, but
>> memory-backend-epc is not supported.  The VM must be started with
>> the '-machine aux-ram-share=on' option, which allows anonymous
>> memory to be transferred in place to the new process.  The memfds
>> are kept open across exec by clearing the close-on-exec flag, their
>> values are saved in CPR state, and they are mmap'd in new QEMU.
>>
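
As a rough illustration of the descriptor handling described above (this
is not the code from this series), keeping a descriptor open across exec
comes down to clearing FD_CLOEXEC with fcntl().  The helper names below
are hypothetical, and a pipe stands in for a memfd so that the sketch
needs only POSIX calls.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Clear the close-on-exec flag so that fd stays open in the exec'd
 * program.  Returns 0 on success, -1 on error (errno set by fcntl).
 * Hypothetical helper, not a QEMU function.
 */
static int preserve_fd_across_exec(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags < 0) {
        return -1;
    }
    return fcntl(fd, F_SETFD, flags & ~FD_CLOEXEC) < 0 ? -1 : 0;
}

/*
 * Returns 1 if fd would be inherited across exec, 0 if it would be
 * closed, and -1 if fd is not a valid descriptor.
 */
static int fd_survives_exec(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    return flags < 0 ? -1 : !(flags & FD_CLOEXEC);
}

/* Small self-check; a pipe end stands in for a guest RAM memfd. */
static void self_check(void)
{
    int fds[2];

    assert(pipe(fds) == 0);
    assert(fcntl(fds[0], F_SETFD, FD_CLOEXEC) != -1);
    assert(fd_survives_exec(fds[0]) == 0);
    assert(preserve_fd_across_exec(fds[0]) == 0);
    assert(fd_survives_exec(fds[0]) == 1);
    close(fds[0]);
    close(fds[1]);
}
```

The numeric value of the descriptor is unchanged by exec, which is why
recording that value in CPR state is enough for new QEMU to find and
mmap the same memory.
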
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>>   qapi/migration.json       | 25 +++++++++++++++-
>>   include/migration/cpr.h   |  1 +
>>   migration/cpr-exec.c      | 74 +++++++++++++++++++++++++++++++++++++++++++++++
>>   migration/cpr.c           | 26 ++++++++++++++++-
>>   migration/migration.c     | 10 ++++++-
>>   migration/ram.c           |  1 +
>>   migration/vmstate-types.c |  8 +++++
>>   migration/trace-events    |  1 +
>>   8 files changed, 143 insertions(+), 3 deletions(-)
>>
>> diff --git a/qapi/migration.json b/qapi/migration.json
>> index ea410fd..cbc90e8 100644
>> --- a/qapi/migration.json
>> +++ b/qapi/migration.json
>> @@ -694,9 +694,32 @@
>>   #     until you issue the `migrate-incoming` command.
>>   #
>>   #     (since 10.0)
>> +#
>> +# @cpr-exec: The migrate command stops the VM, saves state to the
>> +#     migration channel, directly exec's a new version of QEMU on the
>> +#     same host, replacing the original process while retaining its
>> +#     PID, and loads state from the channel.  Guest RAM is preserved
>> +#     in place.  Devices and their pinned pages are also preserved for
>> +#     VFIO and IOMMUFD.
>> +#
>> +#     Old QEMU starts new QEMU by exec'ing the command specified by
>> +#     the @cpr-exec-command parameter.  The command may be a direct
>> +#     invocation of new QEMU, or may be a non-QEMU command that exec's
>> +#     the new QEMU binary.
> 
> Not sure we need the last sentence.
> 
> If we keep it, maybe say "a wrapper script" instead of "a non-QEMU
> command".

I prefer to keep it because the point is not obvious, and I had some
discussions about it in previous versions of the series.  I will
rewrite as:
   or may be a wrapper that exec's the new QEMU binary.
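
For illustration only, such a wrapper could be as small as the C sketch
below.  The function name is hypothetical, and any setup a real wrapper
performs before the exec is left as a comment.

```c
#include <errno.h>
#include <unistd.h>

/*
 * Replace the current process with the command in argv[1..], keeping
 * the same PID.  argv must be NULL-terminated, as main()'s argv is.
 * Returns only on failure: -1 with errno set.
 * Hypothetical helper, not part of QEMU.
 */
static int exec_wrapped_command(int argc, char **argv)
{
    if (argc < 2) {
        errno = EINVAL;
        return -1;
    }

    /* Wrapper-specific setup (environment, limits, logging) goes here. */

    execvp(argv[1], &argv[1]);
    return -1;  /* execvp returns only if the exec failed */
}
```

A wrapper's main() would pass its own argc and argv to this function and
exit non-zero if it returns.  Descriptors whose close-on-exec flag is
clear stay open through the intermediate exec, so they are still open
when the wrapper exec's new QEMU.
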

>> +#
>> +#     Because old QEMU terminates when new QEMU starts, one cannot
>> +#     stream data between the two, so the channel must be a type,
>> +#     such as a file, that accepts all data before old QEMU exits.
>> +#     Otherwise, old QEMU may quietly block writing to the channel.
>> +#
>> +#     Memory-backend objects must have the share=on attribute, but
>> +#     memory-backend-epc is not supported.  The VM must be started
>> +#     with the '-machine aux-ram-share=on' option.
> 
> I assume violations of this constraint fail cleanly.

Yes. Migration blockers are added, and print a clear message.

>> +#
>> +#     (since 10.2)
>>   ##
>>   { 'enum': 'MigMode',
>> -  'data': [ 'normal', 'cpr-reboot', 'cpr-transfer' ] }
>> +  'data': [ 'normal', 'cpr-reboot', 'cpr-transfer', 'cpr-exec' ] }
>>   
>>   ##
>>   # @ZeroPageDetection:
> 
> Acked-by: Markus Armbruster <armbru@redhat.com>

Thanks! - steve




* [PATCH V3 8/9] migration: cpr-exec docs
  2025-08-14 17:17 [PATCH V3 0/9] Live update: cpr-exec Steve Sistare
                   ` (6 preceding siblings ...)
  2025-08-14 17:17 ` [PATCH V3 7/9] migration: cpr-exec mode Steve Sistare
@ 2025-08-14 17:17 ` Steve Sistare
  2025-09-15 20:36   ` Fabiano Rosas
  2025-08-14 17:17 ` [PATCH V3 9/9] vfio: cpr-exec mode Steve Sistare
                   ` (2 subsequent siblings)
  10 siblings, 1 reply; 47+ messages in thread
From: Steve Sistare @ 2025-08-14 17:17 UTC (permalink / raw)
  To: qemu-devel
  Cc: Fabiano Rosas, Peter Xu, Markus Armbruster, Paolo Bonzini,
	Eric Blake, Dr. David Alan Gilbert, Steve Sistare

Update developer documentation for cpr-exec mode.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 docs/devel/migration/CPR.rst | 103 ++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 102 insertions(+), 1 deletion(-)

diff --git a/docs/devel/migration/CPR.rst b/docs/devel/migration/CPR.rst
index 0a0fd4f..abc9a90 100644
--- a/docs/devel/migration/CPR.rst
+++ b/docs/devel/migration/CPR.rst
@@ -5,7 +5,7 @@ CPR is the umbrella name for a set of migration modes in which the
 VM is migrated to a new QEMU instance on the same host.  It is
 intended for use when the goal is to update host software components
 that run the VM, such as QEMU or even the host kernel.  At this time,
-the cpr-reboot and cpr-transfer modes are available.
+the cpr-reboot, cpr-transfer, and cpr-exec modes are available.
 
 Because QEMU is restarted on the same host, with access to the same
 local devices, CPR is allowed in certain cases where normal migration
@@ -324,3 +324,104 @@ descriptors from old to new QEMU.  In the future, descriptors for
 vhost, and char devices could be transferred,
 preserving those devices and their kernel state without interruption,
 even if they do not explicitly support live migration.
+
+cpr-exec mode
+-------------
+
+In this mode, QEMU stops the VM, writes VM state to the migration
+URI, and directly exec's a new version of QEMU on the same host,
+replacing the original process while retaining its PID.  Guest RAM is
+preserved in place, albeit with new virtual addresses.  The user
+completes the migration by specifying the ``-incoming`` option, and
+by issuing the ``migrate-incoming`` command if necessary; see details
+below.
+
+This mode supports VFIO/IOMMUFD devices by preserving device descriptors
+and hence kernel state across the exec, even for devices that do not
+support live migration.
+
+Because the old and new QEMU instances are not active concurrently,
+the URI cannot be a type that streams data from one instance to the
+other.
+
+Usage
+^^^^^
+
+Arguments for the new QEMU process are taken from the
+@cpr-exec-command parameter.  The first argument should be the
+path of a new QEMU binary, or a prefix command that exec's the
+new QEMU binary, and the arguments should include the ``-incoming``
+option.
+
+Memory backend objects must have the ``share=on`` attribute.
+The VM must be started with the ``-machine aux-ram-share=on`` option.
+
+Outgoing:
+  * Set the migration mode parameter to ``cpr-exec``.
+  * Set the ``cpr-exec-command`` parameter.
+  * Issue the ``migrate`` command.  It is recommended that the URI be
+    a ``file`` type, but one can use other types such as ``exec``,
+    provided the command captures all the data from the outgoing side,
+    and provides all the data to the incoming side.
+
+Incoming:
+  * You do not need to explicitly start new QEMU.  It is started as
+    a side effect of the migrate command above.
+  * If the VM was running when the outgoing ``migrate`` command was
+    issued, then QEMU automatically resumes VM execution.
+
+Example 1: incoming URI
+^^^^^^^^^^^^^^^^^^^^^^^
+
+In these examples, we simply restart the same version of QEMU, but in
+a real scenario one would set a new QEMU binary path in cpr-exec-command.
+
+::
+
+  # qemu-kvm -monitor stdio
+  -object memory-backend-memfd,id=ram0,size=4G
+  -machine memory-backend=ram0
+  -machine aux-ram-share=on
+  ...
+
+  QEMU 10.2.50 monitor - type 'help' for more information
+  (qemu) info status
+  VM status: running
+  (qemu) migrate_set_parameter mode cpr-exec
+  (qemu) migrate_set_parameter cpr-exec-command qemu-kvm ... -incoming file:vm.state
+  (qemu) migrate -d file:vm.state
+  (qemu) QEMU 10.2.50 monitor - type 'help' for more information
+  (qemu) info status
+  VM status: running
+
+Example 2: incoming defer
+^^^^^^^^^^^^^^^^^^^^^^^^^
+::
+
+  # qemu-kvm -monitor stdio
+  -object memory-backend-memfd,id=ram0,size=4G
+  -machine memory-backend=ram0
+  -machine aux-ram-share=on
+  ...
+
+  QEMU 10.2.50 monitor - type 'help' for more information
+  (qemu) info status
+  VM status: running
+  (qemu) migrate_set_parameter mode cpr-exec
+  (qemu) migrate_set_parameter cpr-exec-command qemu-kvm ... -incoming defer
+  (qemu) migrate -d file:vm.state
+  (qemu) QEMU 10.2.50 monitor - type 'help' for more information
+  (qemu) info status
+  VM status: paused (inmigrate)
+  (qemu) migrate_incoming file:vm.state
+  (qemu) info status
+  VM status: running
+
+Caveats
+^^^^^^^
+
+cpr-exec mode may not be used with postcopy, background-snapshot,
+or COLO.
+
+cpr-exec mode requires permission to use the exec system call, which
+is denied by certain sandbox options, such as spawn.
-- 
1.8.3.1




* Re: [PATCH V3 8/9] migration: cpr-exec docs
  2025-08-14 17:17 ` [PATCH V3 8/9] migration: cpr-exec docs Steve Sistare
@ 2025-09-15 20:36   ` Fabiano Rosas
  2025-09-19 15:28     ` Steven Sistare
  0 siblings, 1 reply; 47+ messages in thread
From: Fabiano Rosas @ 2025-09-15 20:36 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel
  Cc: Peter Xu, Markus Armbruster, Paolo Bonzini, Eric Blake,
	Dr. David Alan Gilbert, Steve Sistare

Steve Sistare <steven.sistare@oracle.com> writes:

> Update developer documentation for cpr-exec mode.
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>

Reviewed-by: Fabiano Rosas <farosas@suse.de>

Just a typo below.

> ---
>  docs/devel/migration/CPR.rst | 103 ++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 102 insertions(+), 1 deletion(-)
>
> diff --git a/docs/devel/migration/CPR.rst b/docs/devel/migration/CPR.rst
> index 0a0fd4f..abc9a90 100644
> --- a/docs/devel/migration/CPR.rst
> +++ b/docs/devel/migration/CPR.rst
> @@ -5,7 +5,7 @@ CPR is the umbrella name for a set of migration modes in which the
>  VM is migrated to a new QEMU instance on the same host.  It is
>  intended for use when the goal is to update host software components
>  that run the VM, such as QEMU or even the host kernel.  At this time,
> -the cpr-reboot and cpr-transfer modes are available.
> +the cpr-reboot, cpr-transfer, and cpr-exec modes are available.
>  
>  Because QEMU is restarted on the same host, with access to the same
>  local devices, CPR is allowed in certain cases where normal migration
> @@ -324,3 +324,104 @@ descriptors from old to new QEMU.  In the future, descriptors for
>  vhost, and char devices could be transferred,
>  preserving those devices and their kernel state without interruption,
>  even if they do not explicitly support live migration.
> +
> +cpr-exec mode
> +-------------
> +
> +In this mode, QEMU stops the VM, writes VM state to the migration
> +URI, and directly exec's a new version of QEMU on the same host,
> +replacing the original process while retaining its PID.  Guest RAM is
> +preserved in place, albeit with new virtual addresses.  The user
> +completes the migration by specifying the ``-incoming`` option, and
> +by issuing the ``migrate-incoming`` command if necessary; see details
> +below.
> +
> +This mode supports VFIO/IOMMUFD devices by preserving device descriptors
> +and hence kernel state across the exec, even for devices that do not
> +support live migration.
> +
> +Because the old and new QEMU instances are not active concurrently,
> +the URI cannot be a type that streams data from one instance to the
> +other.
> +
> +Usage
> +^^^^^
> +
> +Arguments for the new QEMU process are taken from the
> +@cpr-exec-args parameter.  The first argument should be the
> +path of a new QEMU binary, or a prefix command that exec's the
> +new QEMU binary, and the arguments should include the ''-incoming''
> +option.
> +
> +Memory backend objects must have the ``share=on`` attribute.
> +The VM must be started with the ``-machine aux-ram-share=on`` option.
> +
> +Outgoing:
> +  * Set the migration mode parameter to ``cpr-exec``.
> +  * Set the ``cpr-exec-args`` parameter.
> +  * Issue the ``migrate`` command.  It is recommended the the URI be

s/the the/that the/

> +    a ``file`` type, but one can use other types such as ``exec``,
> +    provided the command captures all the data from the outgoing side,
> +    and provides all the data to the incoming side.
> +
> +Incoming:
> +  * You do not need to explicitly start new QEMU.  It is started as
> +    a side effect of the migrate command above.
> +  * If the VM was running when the outgoing ``migrate`` command was
> +    issued, then QEMU automatically resumes VM execution.
> +
> +Example 1: incoming URI
> +^^^^^^^^^^^^^^^^^^^^^^^
> +
> +In these examples, we simply restart the same version of QEMU, but in
> +a real scenario one would set a new QEMU binary path in cpr-exec-args.
> +
> +::
> +
> +  # qemu-kvm -monitor stdio
> +  -object memory-backend-memfd,id=ram0,size=4G
> +  -machine memory-backend=ram0
> +  -machine aux-ram-share=on
> +  ...
> +
> +  QEMU 10.2.50 monitor - type 'help' for more information
> +  (qemu) info status
> +  VM status: running
> +  (qemu) migrate_set_parameter mode cpr-exec
> +  (qemu) migrate_set_parameter cpr-exec-args qemu-kvm ... -incoming file:vm.state
> +  (qemu) migrate -d file:vm.state
> +  (qemu) QEMU 10.2.50 monitor - type 'help' for more information
> +  (qemu) info status
> +  VM status: running
> +
> +Example 2: incoming defer
> +^^^^^^^^^^^^^^^^^^^^^^^^^
> +::
> +
> +  # qemu-kvm -monitor stdio
> +  -object memory-backend-memfd,id=ram0,size=4G
> +  -machine memory-backend=ram0
> +  -machine aux-ram-share=on
> +  ...
> +
> +  QEMU 10.2.50 monitor - type 'help' for more information
> +  (qemu) info status
> +  VM status: running
> +  (qemu) migrate_set_parameter mode cpr-exec
> +  (qemu) migrate_set_parameter cpr-exec-args qemu-kvm ... -incoming defer
> +  (qemu) migrate -d file:vm.state
> +  (qemu) QEMU 10.2.50 monitor - type 'help' for more information
> +  (qemu) info status
> +  status: paused (inmigrate)
> +  (qemu) migrate_incoming file:vm.state
> +  (qemu) info status
> +  VM status: running
> +
> +Caveats
> +^^^^^^^
> +
> +cpr-exec mode may not be used with postcopy, background-snapshot,
> +or COLO.
> +
> +cpr-exec mode requires permission to use the exec system call, which
> +is denied by certain sandbox options, such as spawn.



* Re: [PATCH V3 8/9] migration: cpr-exec docs
  2025-09-15 20:36   ` Fabiano Rosas
@ 2025-09-19 15:28     ` Steven Sistare
  0 siblings, 0 replies; 47+ messages in thread
From: Steven Sistare @ 2025-09-19 15:28 UTC (permalink / raw)
  To: Fabiano Rosas, qemu-devel
  Cc: Peter Xu, Markus Armbruster, Paolo Bonzini, Eric Blake,
	Dr. David Alan Gilbert

On 9/15/2025 4:36 PM, Fabiano Rosas wrote:
> Steve Sistare <steven.sistare@oracle.com> writes:
> 
>> Update developer documentation for cpr-exec mode.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> 
> Reviewed-by: Fabiano Rosas <farosas@suse.de>
> 
> Just a typo below.

Thanks.
I will also fix: cpr-exec-args should be cpr-exec-command.

- Steve

>> ---
>>   docs/devel/migration/CPR.rst | 103 ++++++++++++++++++++++++++++++++++++++++++-
>>   1 file changed, 102 insertions(+), 1 deletion(-)
>>
>> diff --git a/docs/devel/migration/CPR.rst b/docs/devel/migration/CPR.rst
>> index 0a0fd4f..abc9a90 100644
>> --- a/docs/devel/migration/CPR.rst
>> +++ b/docs/devel/migration/CPR.rst
>> @@ -5,7 +5,7 @@ CPR is the umbrella name for a set of migration modes in which the
>>   VM is migrated to a new QEMU instance on the same host.  It is
>>   intended for use when the goal is to update host software components
>>   that run the VM, such as QEMU or even the host kernel.  At this time,
>> -the cpr-reboot and cpr-transfer modes are available.
>> +the cpr-reboot, cpr-transfer, and cpr-exec modes are available.
>>   
>>   Because QEMU is restarted on the same host, with access to the same
>>   local devices, CPR is allowed in certain cases where normal migration
>> @@ -324,3 +324,104 @@ descriptors from old to new QEMU.  In the future, descriptors for
>>   vhost, and char devices could be transferred,
>>   preserving those devices and their kernel state without interruption,
>>   even if they do not explicitly support live migration.
>> +
>> +cpr-exec mode
>> +-------------
>> +
>> +In this mode, QEMU stops the VM, writes VM state to the migration
>> +URI, and directly exec's a new version of QEMU on the same host,
>> +replacing the original process while retaining its PID.  Guest RAM is
>> +preserved in place, albeit with new virtual addresses.  The user
>> +completes the migration by specifying the ``-incoming`` option, and
>> +by issuing the ``migrate-incoming`` command if necessary; see details
>> +below.
>> +
>> +This mode supports VFIO/IOMMUFD devices by preserving device descriptors
>> +and hence kernel state across the exec, even for devices that do not
>> +support live migration.
>> +
>> +Because the old and new QEMU instances are not active concurrently,
>> +the URI cannot be a type that streams data from one instance to the
>> +other.
>> +
>> +Usage
>> +^^^^^
>> +
>> +Arguments for the new QEMU process are taken from the
>> +@cpr-exec-args parameter.  The first argument should be the
>> +path of a new QEMU binary, or a prefix command that exec's the
>> +new QEMU binary, and the arguments should include the ''-incoming''
>> +option.
>> +
>> +Memory backend objects must have the ``share=on`` attribute.
>> +The VM must be started with the ``-machine aux-ram-share=on`` option.
>> +
>> +Outgoing:
>> +  * Set the migration mode parameter to ``cpr-exec``.
>> +  * Set the ``cpr-exec-args`` parameter.
>> +  * Issue the ``migrate`` command.  It is recommended the the URI be
> 
> s/the the/that the/
> 
>> +    a ``file`` type, but one can use other types such as ``exec``,
>> +    provided the command captures all the data from the outgoing side,
>> +    and provides all the data to the incoming side.
>> +
>> +Incoming:
>> +  * You do not need to explicitly start new QEMU.  It is started as
>> +    a side effect of the migrate command above.
>> +  * If the VM was running when the outgoing ``migrate`` command was
>> +    issued, then QEMU automatically resumes VM execution.
>> +
>> +Example 1: incoming URI
>> +^^^^^^^^^^^^^^^^^^^^^^^
>> +
>> +In these examples, we simply restart the same version of QEMU, but in
>> +a real scenario one would set a new QEMU binary path in cpr-exec-args.
>> +
>> +::
>> +
>> +  # qemu-kvm -monitor stdio
>> +  -object memory-backend-memfd,id=ram0,size=4G
>> +  -machine memory-backend=ram0
>> +  -machine aux-ram-share=on
>> +  ...
>> +
>> +  QEMU 10.2.50 monitor - type 'help' for more information
>> +  (qemu) info status
>> +  VM status: running
>> +  (qemu) migrate_set_parameter mode cpr-exec
>> +  (qemu) migrate_set_parameter cpr-exec-args qemu-kvm ... -incoming file:vm.state
>> +  (qemu) migrate -d file:vm.state
>> +  (qemu) QEMU 10.2.50 monitor - type 'help' for more information
>> +  (qemu) info status
>> +  VM status: running
>> +
>> +Example 2: incoming defer
>> +^^^^^^^^^^^^^^^^^^^^^^^^^
>> +::
>> +
>> +  # qemu-kvm -monitor stdio
>> +  -object memory-backend-memfd,id=ram0,size=4G
>> +  -machine memory-backend=ram0
>> +  -machine aux-ram-share=on
>> +  ...
>> +
>> +  QEMU 10.2.50 monitor - type 'help' for more information
>> +  (qemu) info status
>> +  VM status: running
>> +  (qemu) migrate_set_parameter mode cpr-exec
>> +  (qemu) migrate_set_parameter cpr-exec-args qemu-kvm ... -incoming defer
>> +  (qemu) migrate -d file:vm.state
>> +  (qemu) QEMU 10.2.50 monitor - type 'help' for more information
>> +  (qemu) info status
>> +  status: paused (inmigrate)
>> +  (qemu) migrate_incoming file:vm.state
>> +  (qemu) info status
>> +  VM status: running
>> +
>> +Caveats
>> +^^^^^^^
>> +
>> +cpr-exec mode may not be used with postcopy, background-snapshot,
>> +or COLO.
>> +
>> +cpr-exec mode requires permission to use the exec system call, which
>> +is denied by certain sandbox options, such as spawn.




* [PATCH V3 9/9] vfio: cpr-exec mode
  2025-08-14 17:17 [PATCH V3 0/9] Live update: cpr-exec Steve Sistare
                   ` (7 preceding siblings ...)
  2025-08-14 17:17 ` [PATCH V3 8/9] migration: cpr-exec docs Steve Sistare
@ 2025-08-14 17:17 ` Steve Sistare
  2025-08-14 17:20   ` Steven Sistare
  2025-09-05 16:48 ` [PATCH V3 0/9] Live update: cpr-exec Peter Xu
  2025-09-08 17:02 ` Vladimir Sementsov-Ogievskiy
  10 siblings, 1 reply; 47+ messages in thread
From: Steve Sistare @ 2025-08-14 17:17 UTC (permalink / raw)
  To: qemu-devel
  Cc: Fabiano Rosas, Peter Xu, Markus Armbruster, Paolo Bonzini,
	Eric Blake, Dr. David Alan Gilbert, Steve Sistare

All blockers and notifiers for cpr-transfer mode also apply to cpr-exec.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 hw/vfio/container.c   |  3 ++-
 hw/vfio/cpr-iommufd.c |  3 ++-
 hw/vfio/cpr-legacy.c  |  9 +++++----
 hw/vfio/cpr.c         | 13 +++++++------
 4 files changed, 16 insertions(+), 12 deletions(-)

diff --git a/hw/vfio/container.c b/hw/vfio/container.c
index 3e13fea..735b769 100644
--- a/hw/vfio/container.c
+++ b/hw/vfio/container.c
@@ -993,7 +993,8 @@ static bool vfio_legacy_attach_device(const char *name, VFIODevice *vbasedev,
         error_setg(&vbasedev->cpr.mdev_blocker,
                    "CPR does not support vfio mdev %s", vbasedev->name);
         if (migrate_add_blocker_modes(&vbasedev->cpr.mdev_blocker, errp,
-                                      MIG_MODE_CPR_TRANSFER, -1) < 0) {
+                                      MIG_MODE_CPR_TRANSFER, MIG_MODE_CPR_EXEC,
+                                      -1) < 0) {
             goto hiod_unref_exit;
         }
     }
diff --git a/hw/vfio/cpr-iommufd.c b/hw/vfio/cpr-iommufd.c
index 148a06d..e1f1854 100644
--- a/hw/vfio/cpr-iommufd.c
+++ b/hw/vfio/cpr-iommufd.c
@@ -159,7 +159,8 @@ bool vfio_iommufd_cpr_register_iommufd(IOMMUFDBackend *be, Error **errp)
 
     if (!vfio_cpr_supported(be, cpr_blocker)) {
         return migrate_add_blocker_modes(cpr_blocker, errp,
-                                         MIG_MODE_CPR_TRANSFER, -1) == 0;
+                                         MIG_MODE_CPR_TRANSFER,
+                                         MIG_MODE_CPR_EXEC, -1) == 0;
     }
 
     vmstate_register(NULL, -1, &iommufd_cpr_vmstate, be);
diff --git a/hw/vfio/cpr-legacy.c b/hw/vfio/cpr-legacy.c
index 553b203..7c73439 100644
--- a/hw/vfio/cpr-legacy.c
+++ b/hw/vfio/cpr-legacy.c
@@ -176,16 +176,17 @@ bool vfio_legacy_cpr_register_container(VFIOContainer *container, Error **errp)
 
     if (!vfio_cpr_supported(container, cpr_blocker)) {
         return migrate_add_blocker_modes(cpr_blocker, errp,
-                                         MIG_MODE_CPR_TRANSFER, -1) == 0;
+                                         MIG_MODE_CPR_TRANSFER,
+                                         MIG_MODE_CPR_EXEC, -1) == 0;
     }
 
     vfio_cpr_add_kvm_notifier();
 
     vmstate_register(NULL, -1, &vfio_container_vmstate, container);
 
-    migration_add_notifier_mode(&container->cpr.transfer_notifier,
-                                vfio_cpr_fail_notifier,
-                                MIG_MODE_CPR_TRANSFER);
+    migration_add_notifier_modes(&container->cpr.transfer_notifier,
+                                 vfio_cpr_fail_notifier,
+                                 MIG_MODE_CPR_TRANSFER, MIG_MODE_CPR_EXEC, -1);
     return true;
 }
 
diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c
index a831243..a176971 100644
--- a/hw/vfio/cpr.c
+++ b/hw/vfio/cpr.c
@@ -195,9 +195,10 @@ static int vfio_cpr_kvm_close_notifier(NotifierWithReturn *notifier,
 void vfio_cpr_add_kvm_notifier(void)
 {
     if (!kvm_close_notifier.notify) {
-        migration_add_notifier_mode(&kvm_close_notifier,
-                                    vfio_cpr_kvm_close_notifier,
-                                    MIG_MODE_CPR_TRANSFER);
+        migration_add_notifier_modes(&kvm_close_notifier,
+                                     vfio_cpr_kvm_close_notifier,
+                                     MIG_MODE_CPR_TRANSFER, MIG_MODE_CPR_EXEC,
+                                     -1);
     }
 }
 
@@ -282,9 +283,9 @@ static int vfio_cpr_pci_notifier(NotifierWithReturn *notifier,
 
 void vfio_cpr_pci_register_device(VFIOPCIDevice *vdev)
 {
-    migration_add_notifier_mode(&vdev->cpr.transfer_notifier,
-                                vfio_cpr_pci_notifier,
-                                MIG_MODE_CPR_TRANSFER);
+    migration_add_notifier_modes(&vdev->cpr.transfer_notifier,
+                                 vfio_cpr_pci_notifier,
+                                 MIG_MODE_CPR_TRANSFER, MIG_MODE_CPR_EXEC, -1);
 }
 
 void vfio_cpr_pci_unregister_device(VFIOPCIDevice *vdev)
-- 
1.8.3.1




* Re: [PATCH V3 9/9] vfio: cpr-exec mode
  2025-08-14 17:17 ` [PATCH V3 9/9] vfio: cpr-exec mode Steve Sistare
@ 2025-08-14 17:20   ` Steven Sistare
  2025-09-19 15:35     ` Steven Sistare
  0 siblings, 1 reply; 47+ messages in thread
From: Steven Sistare @ 2025-08-14 17:20 UTC (permalink / raw)
  To: qemu-devel
  Cc: Fabiano Rosas, Peter Xu, Markus Armbruster, Paolo Bonzini,
	Eric Blake, Dr. David Alan Gilbert, Alex Williamson,
	Cedric Le Goater

cc Cedric and Alex.

This is the only patch of the series "Live update: cpr-exec" that touches vfio.

- Steve

On 8/14/2025 1:17 PM, Steve Sistare wrote:
> All blockers and notifiers for cpr-transfer mode also apply to cpr-exec.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>   hw/vfio/container.c   |  3 ++-
>   hw/vfio/cpr-iommufd.c |  3 ++-
>   hw/vfio/cpr-legacy.c  |  9 +++++----
>   hw/vfio/cpr.c         | 13 +++++++------
>   4 files changed, 16 insertions(+), 12 deletions(-)
> 
> diff --git a/hw/vfio/container.c b/hw/vfio/container.c
> index 3e13fea..735b769 100644
> --- a/hw/vfio/container.c
> +++ b/hw/vfio/container.c
> @@ -993,7 +993,8 @@ static bool vfio_legacy_attach_device(const char *name, VFIODevice *vbasedev,
>           error_setg(&vbasedev->cpr.mdev_blocker,
>                      "CPR does not support vfio mdev %s", vbasedev->name);
>           if (migrate_add_blocker_modes(&vbasedev->cpr.mdev_blocker, errp,
> -                                      MIG_MODE_CPR_TRANSFER, -1) < 0) {
> +                                      MIG_MODE_CPR_TRANSFER, MIG_MODE_CPR_EXEC,
> +                                      -1) < 0) {
>               goto hiod_unref_exit;
>           }
>       }
> diff --git a/hw/vfio/cpr-iommufd.c b/hw/vfio/cpr-iommufd.c
> index 148a06d..e1f1854 100644
> --- a/hw/vfio/cpr-iommufd.c
> +++ b/hw/vfio/cpr-iommufd.c
> @@ -159,7 +159,8 @@ bool vfio_iommufd_cpr_register_iommufd(IOMMUFDBackend *be, Error **errp)
>   
>       if (!vfio_cpr_supported(be, cpr_blocker)) {
>           return migrate_add_blocker_modes(cpr_blocker, errp,
> -                                         MIG_MODE_CPR_TRANSFER, -1) == 0;
> +                                         MIG_MODE_CPR_TRANSFER,
> +                                         MIG_MODE_CPR_EXEC, -1) == 0;
>       }
>   
>       vmstate_register(NULL, -1, &iommufd_cpr_vmstate, be);
> diff --git a/hw/vfio/cpr-legacy.c b/hw/vfio/cpr-legacy.c
> index 553b203..7c73439 100644
> --- a/hw/vfio/cpr-legacy.c
> +++ b/hw/vfio/cpr-legacy.c
> @@ -176,16 +176,17 @@ bool vfio_legacy_cpr_register_container(VFIOContainer *container, Error **errp)
>   
>       if (!vfio_cpr_supported(container, cpr_blocker)) {
>           return migrate_add_blocker_modes(cpr_blocker, errp,
> -                                         MIG_MODE_CPR_TRANSFER, -1) == 0;
> +                                         MIG_MODE_CPR_TRANSFER,
> +                                         MIG_MODE_CPR_EXEC, -1) == 0;
>       }
>   
>       vfio_cpr_add_kvm_notifier();
>   
>       vmstate_register(NULL, -1, &vfio_container_vmstate, container);
>   
> -    migration_add_notifier_mode(&container->cpr.transfer_notifier,
> -                                vfio_cpr_fail_notifier,
> -                                MIG_MODE_CPR_TRANSFER);
> +    migration_add_notifier_modes(&container->cpr.transfer_notifier,
> +                                 vfio_cpr_fail_notifier,
> +                                 MIG_MODE_CPR_TRANSFER, MIG_MODE_CPR_EXEC, -1);
>       return true;
>   }
>   
> diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c
> index a831243..a176971 100644
> --- a/hw/vfio/cpr.c
> +++ b/hw/vfio/cpr.c
> @@ -195,9 +195,10 @@ static int vfio_cpr_kvm_close_notifier(NotifierWithReturn *notifier,
>   void vfio_cpr_add_kvm_notifier(void)
>   {
>       if (!kvm_close_notifier.notify) {
> -        migration_add_notifier_mode(&kvm_close_notifier,
> -                                    vfio_cpr_kvm_close_notifier,
> -                                    MIG_MODE_CPR_TRANSFER);
> +        migration_add_notifier_modes(&kvm_close_notifier,
> +                                     vfio_cpr_kvm_close_notifier,
> +                                     MIG_MODE_CPR_TRANSFER, MIG_MODE_CPR_EXEC,
> +                                     -1);
>       }
>   }
>   
> @@ -282,9 +283,9 @@ static int vfio_cpr_pci_notifier(NotifierWithReturn *notifier,
>   
>   void vfio_cpr_pci_register_device(VFIOPCIDevice *vdev)
>   {
> -    migration_add_notifier_mode(&vdev->cpr.transfer_notifier,
> -                                vfio_cpr_pci_notifier,
> -                                MIG_MODE_CPR_TRANSFER);
> +    migration_add_notifier_modes(&vdev->cpr.transfer_notifier,
> +                                 vfio_cpr_pci_notifier,
> +                                 MIG_MODE_CPR_TRANSFER, MIG_MODE_CPR_EXEC, -1);
>   }
>   
>   void vfio_cpr_pci_unregister_device(VFIOPCIDevice *vdev)




* Re: [PATCH V3 9/9] vfio: cpr-exec mode
  2025-08-14 17:20   ` Steven Sistare
@ 2025-09-19 15:35     ` Steven Sistare
  2025-09-19 16:30       ` Cédric Le Goater
  0 siblings, 1 reply; 47+ messages in thread
From: Steven Sistare @ 2025-09-19 15:35 UTC (permalink / raw)
  To: Cedric Le Goater
  Cc: Fabiano Rosas, Peter Xu, Markus Armbruster, Paolo Bonzini,
	Eric Blake, Dr. David Alan Gilbert, Alex Williamson, qemu-devel

This still needs review - steve

On 8/14/2025 1:20 PM, Steven Sistare wrote:
> cc Cedric and Alex.
> 
> This is the only patch of the series "Live update: cpr-exec" that touches vfio.
> 
> - Steve
> 
> On 8/14/2025 1:17 PM, Steve Sistare wrote:
>> All blockers and notifiers for cpr-transfer mode also apply to cpr-exec.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>>   hw/vfio/container.c   |  3 ++-
>>   hw/vfio/cpr-iommufd.c |  3 ++-
>>   hw/vfio/cpr-legacy.c  |  9 +++++----
>>   hw/vfio/cpr.c         | 13 +++++++------
>>   4 files changed, 16 insertions(+), 12 deletions(-)
>>
>> diff --git a/hw/vfio/container.c b/hw/vfio/container.c
>> index 3e13fea..735b769 100644
>> --- a/hw/vfio/container.c
>> +++ b/hw/vfio/container.c
>> @@ -993,7 +993,8 @@ static bool vfio_legacy_attach_device(const char *name, VFIODevice *vbasedev,
>>           error_setg(&vbasedev->cpr.mdev_blocker,
>>                      "CPR does not support vfio mdev %s", vbasedev->name);
>>           if (migrate_add_blocker_modes(&vbasedev->cpr.mdev_blocker, errp,
>> -                                      MIG_MODE_CPR_TRANSFER, -1) < 0) {
>> +                                      MIG_MODE_CPR_TRANSFER, MIG_MODE_CPR_EXEC,
>> +                                      -1) < 0) {
>>               goto hiod_unref_exit;
>>           }
>>       }
>> diff --git a/hw/vfio/cpr-iommufd.c b/hw/vfio/cpr-iommufd.c
>> index 148a06d..e1f1854 100644
>> --- a/hw/vfio/cpr-iommufd.c
>> +++ b/hw/vfio/cpr-iommufd.c
>> @@ -159,7 +159,8 @@ bool vfio_iommufd_cpr_register_iommufd(IOMMUFDBackend *be, Error **errp)
>>       if (!vfio_cpr_supported(be, cpr_blocker)) {
>>           return migrate_add_blocker_modes(cpr_blocker, errp,
>> -                                         MIG_MODE_CPR_TRANSFER, -1) == 0;
>> +                                         MIG_MODE_CPR_TRANSFER,
>> +                                         MIG_MODE_CPR_EXEC, -1) == 0;
>>       }
>>       vmstate_register(NULL, -1, &iommufd_cpr_vmstate, be);
>> diff --git a/hw/vfio/cpr-legacy.c b/hw/vfio/cpr-legacy.c
>> index 553b203..7c73439 100644
>> --- a/hw/vfio/cpr-legacy.c
>> +++ b/hw/vfio/cpr-legacy.c
>> @@ -176,16 +176,17 @@ bool vfio_legacy_cpr_register_container(VFIOContainer *container, Error **errp)
>>       if (!vfio_cpr_supported(container, cpr_blocker)) {
>>           return migrate_add_blocker_modes(cpr_blocker, errp,
>> -                                         MIG_MODE_CPR_TRANSFER, -1) == 0;
>> +                                         MIG_MODE_CPR_TRANSFER,
>> +                                         MIG_MODE_CPR_EXEC, -1) == 0;
>>       }
>>       vfio_cpr_add_kvm_notifier();
>>       vmstate_register(NULL, -1, &vfio_container_vmstate, container);
>> -    migration_add_notifier_mode(&container->cpr.transfer_notifier,
>> -                                vfio_cpr_fail_notifier,
>> -                                MIG_MODE_CPR_TRANSFER);
>> +    migration_add_notifier_modes(&container->cpr.transfer_notifier,
>> +                                 vfio_cpr_fail_notifier,
>> +                                 MIG_MODE_CPR_TRANSFER, MIG_MODE_CPR_EXEC, -1);
>>       return true;
>>   }
>> diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c
>> index a831243..a176971 100644
>> --- a/hw/vfio/cpr.c
>> +++ b/hw/vfio/cpr.c
>> @@ -195,9 +195,10 @@ static int vfio_cpr_kvm_close_notifier(NotifierWithReturn *notifier,
>>   void vfio_cpr_add_kvm_notifier(void)
>>   {
>>       if (!kvm_close_notifier.notify) {
>> -        migration_add_notifier_mode(&kvm_close_notifier,
>> -                                    vfio_cpr_kvm_close_notifier,
>> -                                    MIG_MODE_CPR_TRANSFER);
>> +        migration_add_notifier_modes(&kvm_close_notifier,
>> +                                     vfio_cpr_kvm_close_notifier,
>> +                                     MIG_MODE_CPR_TRANSFER, MIG_MODE_CPR_EXEC,
>> +                                     -1);
>>       }
>>   }
>> @@ -282,9 +283,9 @@ static int vfio_cpr_pci_notifier(NotifierWithReturn *notifier,
>>   void vfio_cpr_pci_register_device(VFIOPCIDevice *vdev)
>>   {
>> -    migration_add_notifier_mode(&vdev->cpr.transfer_notifier,
>> -                                vfio_cpr_pci_notifier,
>> -                                MIG_MODE_CPR_TRANSFER);
>> +    migration_add_notifier_modes(&vdev->cpr.transfer_notifier,
>> +                                 vfio_cpr_pci_notifier,
>> +                                 MIG_MODE_CPR_TRANSFER, MIG_MODE_CPR_EXEC, -1);
>>   }
>>   void vfio_cpr_pci_unregister_device(VFIOPCIDevice *vdev)
> 




* Re: [PATCH V3 9/9] vfio: cpr-exec mode
  2025-09-19 15:35     ` Steven Sistare
@ 2025-09-19 16:30       ` Cédric Le Goater
  0 siblings, 0 replies; 47+ messages in thread
From: Cédric Le Goater @ 2025-09-19 16:30 UTC (permalink / raw)
  To: Steven Sistare
  Cc: Fabiano Rosas, Peter Xu, Markus Armbruster, Paolo Bonzini,
	Eric Blake, Dr. David Alan Gilbert, Alex Williamson, qemu-devel

On 9/19/25 17:35, Steven Sistare wrote:
> This still needs review - steve

Steve,

Please CC: us on the whole series next time. I will catch up on the
emails next week. That said, I don't see any blockers.

Thanks,

C.



> 
> On 8/14/2025 1:20 PM, Steven Sistare wrote:
>> cc Cedric and Alex.
>>
>> This is the only patch of the series "Live update: cpr-exec" that touches vfio.
>>
>> - Steve
>>
>> On 8/14/2025 1:17 PM, Steve Sistare wrote:
>>> All blockers and notifiers for cpr-transfer mode also apply to cpr-exec.
>>>
>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>> ---
>>>   hw/vfio/container.c   |  3 ++-
>>>   hw/vfio/cpr-iommufd.c |  3 ++-
>>>   hw/vfio/cpr-legacy.c  |  9 +++++----
>>>   hw/vfio/cpr.c         | 13 +++++++------
>>>   4 files changed, 16 insertions(+), 12 deletions(-)
>>>
>>> diff --git a/hw/vfio/container.c b/hw/vfio/container.c
>>> index 3e13fea..735b769 100644
>>> --- a/hw/vfio/container.c
>>> +++ b/hw/vfio/container.c
>>> @@ -993,7 +993,8 @@ static bool vfio_legacy_attach_device(const char *name, VFIODevice *vbasedev,
>>>           error_setg(&vbasedev->cpr.mdev_blocker,
>>>                      "CPR does not support vfio mdev %s", vbasedev->name);
>>>           if (migrate_add_blocker_modes(&vbasedev->cpr.mdev_blocker, errp,
>>> -                                      MIG_MODE_CPR_TRANSFER, -1) < 0) {
>>> +                                      MIG_MODE_CPR_TRANSFER, MIG_MODE_CPR_EXEC,
>>> +                                      -1) < 0) {
>>>               goto hiod_unref_exit;
>>>           }
>>>       }
>>> diff --git a/hw/vfio/cpr-iommufd.c b/hw/vfio/cpr-iommufd.c
>>> index 148a06d..e1f1854 100644
>>> --- a/hw/vfio/cpr-iommufd.c
>>> +++ b/hw/vfio/cpr-iommufd.c
>>> @@ -159,7 +159,8 @@ bool vfio_iommufd_cpr_register_iommufd(IOMMUFDBackend *be, Error **errp)
>>>       if (!vfio_cpr_supported(be, cpr_blocker)) {
>>>           return migrate_add_blocker_modes(cpr_blocker, errp,
>>> -                                         MIG_MODE_CPR_TRANSFER, -1) == 0;
>>> +                                         MIG_MODE_CPR_TRANSFER,
>>> +                                         MIG_MODE_CPR_EXEC, -1) == 0;
>>>       }
>>>       vmstate_register(NULL, -1, &iommufd_cpr_vmstate, be);
>>> diff --git a/hw/vfio/cpr-legacy.c b/hw/vfio/cpr-legacy.c
>>> index 553b203..7c73439 100644
>>> --- a/hw/vfio/cpr-legacy.c
>>> +++ b/hw/vfio/cpr-legacy.c
>>> @@ -176,16 +176,17 @@ bool vfio_legacy_cpr_register_container(VFIOContainer *container, Error **errp)
>>>       if (!vfio_cpr_supported(container, cpr_blocker)) {
>>>           return migrate_add_blocker_modes(cpr_blocker, errp,
>>> -                                         MIG_MODE_CPR_TRANSFER, -1) == 0;
>>> +                                         MIG_MODE_CPR_TRANSFER,
>>> +                                         MIG_MODE_CPR_EXEC, -1) == 0;
>>>       }
>>>       vfio_cpr_add_kvm_notifier();
>>>       vmstate_register(NULL, -1, &vfio_container_vmstate, container);
>>> -    migration_add_notifier_mode(&container->cpr.transfer_notifier,
>>> -                                vfio_cpr_fail_notifier,
>>> -                                MIG_MODE_CPR_TRANSFER);
>>> +    migration_add_notifier_modes(&container->cpr.transfer_notifier,
>>> +                                 vfio_cpr_fail_notifier,
>>> +                                 MIG_MODE_CPR_TRANSFER, MIG_MODE_CPR_EXEC, -1);
>>>       return true;
>>>   }
>>> diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c
>>> index a831243..a176971 100644
>>> --- a/hw/vfio/cpr.c
>>> +++ b/hw/vfio/cpr.c
>>> @@ -195,9 +195,10 @@ static int vfio_cpr_kvm_close_notifier(NotifierWithReturn *notifier,
>>>   void vfio_cpr_add_kvm_notifier(void)
>>>   {
>>>       if (!kvm_close_notifier.notify) {
>>> -        migration_add_notifier_mode(&kvm_close_notifier,
>>> -                                    vfio_cpr_kvm_close_notifier,
>>> -                                    MIG_MODE_CPR_TRANSFER);
>>> +        migration_add_notifier_modes(&kvm_close_notifier,
>>> +                                     vfio_cpr_kvm_close_notifier,
>>> +                                     MIG_MODE_CPR_TRANSFER, MIG_MODE_CPR_EXEC,
>>> +                                     -1);
>>>       }
>>>   }
>>> @@ -282,9 +283,9 @@ static int vfio_cpr_pci_notifier(NotifierWithReturn *notifier,
>>>   void vfio_cpr_pci_register_device(VFIOPCIDevice *vdev)
>>>   {
>>> -    migration_add_notifier_mode(&vdev->cpr.transfer_notifier,
>>> -                                vfio_cpr_pci_notifier,
>>> -                                MIG_MODE_CPR_TRANSFER);
>>> +    migration_add_notifier_modes(&vdev->cpr.transfer_notifier,
>>> +                                 vfio_cpr_pci_notifier,
>>> +                                 MIG_MODE_CPR_TRANSFER, MIG_MODE_CPR_EXEC, -1);
>>>   }
>>>   void vfio_cpr_pci_unregister_device(VFIOPCIDevice *vdev)
>>
> 




* Re: [PATCH V3 0/9] Live update: cpr-exec
  2025-08-14 17:17 [PATCH V3 0/9] Live update: cpr-exec Steve Sistare
                   ` (8 preceding siblings ...)
  2025-08-14 17:17 ` [PATCH V3 9/9] vfio: cpr-exec mode Steve Sistare
@ 2025-09-05 16:48 ` Peter Xu
  2025-09-05 17:09   ` Dr. David Alan Gilbert
  2025-09-09 14:36   ` Steven Sistare
  2025-09-08 17:02 ` Vladimir Sementsov-Ogievskiy
  10 siblings, 2 replies; 47+ messages in thread
From: Peter Xu @ 2025-09-05 16:48 UTC (permalink / raw)
  To: Steve Sistare
  Cc: qemu-devel, Fabiano Rosas, Markus Armbruster, Paolo Bonzini,
	Eric Blake, Dr. David Alan Gilbert, Vladimir Sementsov-Ogievskiy,
	Daniel P. Berrangé

Add Vladimir and Dan.

On Thu, Aug 14, 2025 at 10:17:14AM -0700, Steve Sistare wrote:
> This patch series adds the live migration cpr-exec mode.  
> 
> The new user-visible interfaces are:
>   * cpr-exec (MigMode migration parameter)
>   * cpr-exec-command (migration parameter)
> 
> cpr-exec mode is similar in most respects to cpr-transfer mode, with the 
> primary difference being that old QEMU directly exec's new QEMU.  The user
> specifies the command to exec new QEMU in the migration parameter
> cpr-exec-command.
> 
> Why?
> 
> In a containerized QEMU environment, cpr-exec reuses an existing QEMU
> container and its assigned resources.  By contrast, cpr-transfer mode
> requires a new container to be created on the same host as the target of
> the CPR operation.  Resources must be reserved for the new container, while
> the old container still reserves resources until the operation completes.
> Avoiding over commitment requires extra work in the management layer.

Can we spell out what these resources are?

CPR definitely relies on completely shared memory.  That's already not a
concern.

CPR handles resources that are bound to devices like VFIO by passing FDs
over, so these are not overcommitted either.

Is it accounting for QEMU/KVM process overhead?  That would really be trivial,
IMHO, but maybe something else?

> This is one reason why a cloud provider may prefer cpr-exec.  A second reason
> is that the container may include agents with their own connections to the
> outside world, and such connections remain intact if the container is reused.

We discussed this one.  Personally I still cannot understand why this
is a concern if the agents can be trivially started as a new instance.  But
I admit I may not know the whole picture.  To me, the above point is more
persuasive, but I'll need to understand which part is overcommitted and
why that can be a problem.

After all, cloud hosts should reserve some extra memory anyway so that
dynamic resource allocations can succeed at any time (e.g., when live
migration starts, KVM pgtables can grow drastically if huge pages are
enabled, for PAGE_SIZE tracking).  I assumed the overcommitted portion
should be less than that, and since it is also temporary (src QEMU will
release all resources after live upgrade), it looks manageable.

> 
> How?
> 
> cpr-exec preserves descriptors across exec by clearing the CLOEXEC flag,
> and by sending the unique name and value of each descriptor to new QEMU
> via CPR state.
> 
> CPR state cannot be sent over the normal migration channel, because devices
> and backends are created prior to reading the channel, so this mode sends
> CPR state over a second migration channel that is not visible to the user.
> New QEMU reads the second channel prior to creating devices or backends.
> 
> The exec itself is trivial.  After writing to the migration channels, the
> migration code calls a new main-loop hook to perform the exec.
> 
> Example:
> 
> In this example, we simply restart the same version of QEMU, but in
> a real scenario one would use a new QEMU binary path in cpr-exec-command.
> 
>   # qemu-kvm -monitor stdio
>   -object memory-backend-memfd,id=ram0,size=1G
>   -machine memory-backend=ram0 -machine aux-ram-share=on ...
> 
>   QEMU 10.1.50 monitor - type 'help' for more information
>   (qemu) info status
>   VM status: running
>   (qemu) migrate_set_parameter mode cpr-exec
>   (qemu) migrate_set_parameter cpr-exec-command qemu-kvm ... -incoming file:vm.state
>   (qemu) migrate -d file:vm.state
>   (qemu) QEMU 10.1.50 monitor - type 'help' for more information
>   (qemu) info status
>   VM status: running
> 
> Steve Sistare (9):
>   migration: multi-mode notifier
>   migration: add cpr_walk_fd
>   oslib: qemu_clear_cloexec
>   vl: helper to request exec
>   migration: cpr-exec-command parameter
>   migration: cpr-exec save and load
>   migration: cpr-exec mode
>   migration: cpr-exec docs
>   vfio: cpr-exec mode

The other thing is, as Vladimir is working on (what looks like) a cleaner way
of passing FDs that relies fully on unix sockets, I want to better understand
the relationship between his work and the exec model.

I still personally think we should always stick with unix sockets, but I'm
open to being convinced about the above limitations.  If exec is better than
cpr-transfer in any way, the hope is more people can and should adopt it.

We also have no answer yet on how cpr-exec can work in a container world
where seccomp forbids exec.  I guess that's a no-go.  It's definitely a
downside.  Better to mention that in the cover letter.

Thanks,

-- 
Peter Xu


* Re: [PATCH V3 0/9] Live update: cpr-exec
  2025-09-05 16:48 ` [PATCH V3 0/9] Live update: cpr-exec Peter Xu
@ 2025-09-05 17:09   ` Dr. David Alan Gilbert
  2025-09-05 17:48     ` Peter Xu
  2025-09-09 14:36   ` Steven Sistare
  1 sibling, 1 reply; 47+ messages in thread
From: Dr. David Alan Gilbert @ 2025-09-05 17:09 UTC (permalink / raw)
  To: Peter Xu
  Cc: Steve Sistare, qemu-devel, Fabiano Rosas, Markus Armbruster,
	Paolo Bonzini, Eric Blake, Vladimir Sementsov-Ogievskiy,
	Daniel P. Berrangé

* Peter Xu (peterx@redhat.com) wrote:
> Add Vladimir and Dan.
> 
> On Thu, Aug 14, 2025 at 10:17:14AM -0700, Steve Sistare wrote:
> > This patch series adds the live migration cpr-exec mode.  
> > 
> > The new user-visible interfaces are:
> >   * cpr-exec (MigMode migration parameter)
> >   * cpr-exec-command (migration parameter)
> > 
> > cpr-exec mode is similar in most respects to cpr-transfer mode, with the 
> > primary difference being that old QEMU directly exec's new QEMU.  The user
> > specifies the command to exec new QEMU in the migration parameter
> > cpr-exec-command.
> > 
> > Why?
> > 
> > In a containerized QEMU environment, cpr-exec reuses an existing QEMU
> > container and its assigned resources.  By contrast, cpr-transfer mode
> > requires a new container to be created on the same host as the target of
> > the CPR operation.  Resources must be reserved for the new container, while
> > the old container still reserves resources until the operation completes.
> > Avoiding over commitment requires extra work in the management layer.
> 
> Can we spell out what are these resources?
> 
> CPR definitely relies on completely shared memory.  That's already not a
> concern.
> 
> CPR resolves resources that are bound to devices like VFIO by passing over
> FDs, these are not over commited either.
> 
> Is it accounting QEMU/KVM process overhead?  That would really be trivial,
> IMHO, but maybe something else?
> 
> > This is one reason why a cloud provider may prefer cpr-exec.  A second reason
> > is that the container may include agents with their own connections to the
> > outside world, and such connections remain intact if the container is reused.
> 
> We discussed about this one.  Personally I still cannot understand why this
> is a concern if the agents can be trivially started as a new instance.  But
> I admit I may not know the whole picture.  To me, the above point is more
> persuasive, but I'll need to understand which part that is over-commited
> that can be a problem.

> After all, cloud hosts should preserve some extra memory anyway to make
> sure dynamic resources allocations all the time (e.g., when live migration
> starts, KVM pgtables can drastically increase if huge pages are enabled,
> for PAGE_SIZE trackings), I assumed the over-commit portion should be less
> that those.. and when it's also temporary (src QEMU will release all
> resources after live upgrade) then it looks manageable.

k8s used to find it very hard to change the amount of memory allocated to a
container after launch (although I heard that's getting fixed); so you'd
need more excess at the start even if your peak during handover is only
very short.

Dave
> 
> > 
> > How?
> > 
> > cpr-exec preserves descriptors across exec by clearing the CLOEXEC flag,
> > and by sending the unique name and value of each descriptor to new QEMU
> > via CPR state.
> > 
> > CPR state cannot be sent over the normal migration channel, because devices
> > and backends are created prior to reading the channel, so this mode sends
> > CPR state over a second migration channel that is not visible to the user.
> > New QEMU reads the second channel prior to creating devices or backends.
> > 
> > The exec itself is trivial.  After writing to the migration channels, the
> > migration code calls a new main-loop hook to perform the exec.
> > 
> > Example:
> > 
> > In this example, we simply restart the same version of QEMU, but in
> > a real scenario one would use a new QEMU binary path in cpr-exec-command.
> > 
> >   # qemu-kvm -monitor stdio
> >   -object memory-backend-memfd,id=ram0,size=1G
> >   -machine memory-backend=ram0 -machine aux-ram-share=on ...
> > 
> >   QEMU 10.1.50 monitor - type 'help' for more information
> >   (qemu) info status
> >   VM status: running
> >   (qemu) migrate_set_parameter mode cpr-exec
> >   (qemu) migrate_set_parameter cpr-exec-command qemu-kvm ... -incoming file:vm.state
> >   (qemu) migrate -d file:vm.state
> >   (qemu) QEMU 10.1.50 monitor - type 'help' for more information
> >   (qemu) info status
> >   VM status: running
> > 
> > Steve Sistare (9):
> >   migration: multi-mode notifier
> >   migration: add cpr_walk_fd
> >   oslib: qemu_clear_cloexec
> >   vl: helper to request exec
> >   migration: cpr-exec-command parameter
> >   migration: cpr-exec save and load
> >   migration: cpr-exec mode
> >   migration: cpr-exec docs
> >   vfio: cpr-exec mode
> 
> The other thing is, as Vladimir is working on (looks like) a cleaner way of
> passing FDs fully relying on unix sockets, I want to understand better on
> the relationships of his work and the exec model.
> 
> I still personally think we should always stick with unix sockets, but I'm
> open to be convinced on above limitations.  If exec is better than
> cpr-transfer in any way, the hope is more people can and should adopt it.
> 
> We also have no answer yet on how cpr-exec can resolve container world with
> seccomp forbidding exec.  I guess that's a no-go.  It's definitely a
> downside instead.  Better mention that in the cover letter.
> 
> Thanks,
> 
> -- 
> Peter Xu
> 
-- 
 -----Open up your eyes, open up your mind, open up your code -------   
/ Dr. David Alan Gilbert    |       Running GNU/Linux       | Happy  \ 
\        dave @ treblig.org |                               | In Hex /
 \ _________________________|_____ http://www.treblig.org   |_______/



* Re: [PATCH V3 0/9] Live update: cpr-exec
  2025-09-05 17:09   ` Dr. David Alan Gilbert
@ 2025-09-05 17:48     ` Peter Xu
  0 siblings, 0 replies; 47+ messages in thread
From: Peter Xu @ 2025-09-05 17:48 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Steve Sistare, qemu-devel, Fabiano Rosas, Markus Armbruster,
	Paolo Bonzini, Eric Blake, Vladimir Sementsov-Ogievskiy,
	Daniel P. Berrangé

On Fri, Sep 05, 2025 at 05:09:05PM +0000, Dr. David Alan Gilbert wrote:
> k8s used to find it very hard to change the amount of memory allocated to a
> container after launch (although I heard that's getting fixed); so you'd
> need more excess at the start even if your peek during hand over is only
> very short.

When kubevirt needs to support CPR, it will have to do live migration as
usual, normally by creating a separate container for the dest QEMU.  So the
hope is there's no need to change the memory setup.

I think it's not yet possible to start two QEMUs in one container after
all, because QEMU, in the case of kubevirt, is always paired with a libvirt
instance.  And AFAICT libvirt still doesn't support two instances appearing
in the same container.  So another container should be required to trigger a
live migration, for CPR or not.

PS: I never fully understood why that's a challenge btw, especially for
growing memory rather than shrinking it.  For CPU resources we have the same
issue: CPU resources cannot easily be hot plugged into a container, which
made multifd almost useless for kubevirt when people use a dedicated CPU
topology.  It means all multifd threads will either run on one physical
core (together with all the rest of the QEMU mgmt threads, like the main
thread), or run directly on vCPU threads, which is even worse.

-- 
Peter Xu


* Re: [PATCH V3 0/9] Live update: cpr-exec
  2025-09-05 16:48 ` [PATCH V3 0/9] Live update: cpr-exec Peter Xu
  2025-09-05 17:09   ` Dr. David Alan Gilbert
@ 2025-09-09 14:36   ` Steven Sistare
  2025-09-09 15:24     ` Peter Xu
  1 sibling, 1 reply; 47+ messages in thread
From: Steven Sistare @ 2025-09-09 14:36 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Fabiano Rosas, Markus Armbruster, Paolo Bonzini,
	Eric Blake, Dr. David Alan Gilbert, Vladimir Sementsov-Ogievskiy,
	Daniel P. Berrangé

On 9/5/2025 12:48 PM, Peter Xu wrote:
> Add Vladimir and Dan.
> 
> On Thu, Aug 14, 2025 at 10:17:14AM -0700, Steve Sistare wrote:
>> This patch series adds the live migration cpr-exec mode.
>>
>> The new user-visible interfaces are:
>>    * cpr-exec (MigMode migration parameter)
>>    * cpr-exec-command (migration parameter)
>>
>> cpr-exec mode is similar in most respects to cpr-transfer mode, with the
>> primary difference being that old QEMU directly exec's new QEMU.  The user
>> specifies the command to exec new QEMU in the migration parameter
>> cpr-exec-command.
>>
>> Why?
>>
>> In a containerized QEMU environment, cpr-exec reuses an existing QEMU
>> container and its assigned resources.  By contrast, cpr-transfer mode
>> requires a new container to be created on the same host as the target of
>> the CPR operation.  Resources must be reserved for the new container, while
>> the old container still reserves resources until the operation completes.
>> Avoiding over commitment requires extra work in the management layer.
> 
> Can we spell out what are these resources?
> 
> CPR definitely relies on completely shared memory.  That's already not a
> concern.
> 
> CPR resolves resources that are bound to devices like VFIO by passing over
> FDs, these are not over commited either.
> 
> Is it accounting QEMU/KVM process overhead?  That would really be trivial,
> IMHO, but maybe something else?

Accounting is one issue, and it is not trivial.  Another is arranging exclusive
use of a set of CPUs, the same set for the old and new container, concurrently.
Another is avoiding namespace conflicts, the kind that make localhost migration
difficult.

>> This is one reason why a cloud provider may prefer cpr-exec.  A second reason
>> is that the container may include agents with their own connections to the
>> outside world, and such connections remain intact if the container is reused.
> 
> We discussed about this one.  Personally I still cannot understand why this
> is a concern if the agents can be trivially started as a new instance.  But
> I admit I may not know the whole picture.  To me, the above point is more
> persuasive, but I'll need to understand which part that is over-commited
> that can be a problem.

Agents can be restarted, but that would sever the connection to the outside
world.  With cpr-transfer or any local migration, you would need agents
outside of old and new containers that persist.

With cpr-exec, connections can be preserved without requiring the end user
to reconnect, and can be done trivially, by preserving chardevs.  With that
support in qemu, the management layer does nothing extra to preserve them.
chardev support is not part of this series but is part of my vision,
and makes exec mode even more compelling.

Management layers have a lot of code and complexity to manage live migration,
resources, and connections.  It requires modification to support cpr-transfer.
All that can be bypassed with exec mode.  Less complexity, less maintenance,
and fewer points of failure.  I know this because I implemented exec mode in
OCI at Oracle, and we use it in production.

> After all, cloud hosts should preserve some extra memory anyway to make
> sure dynamic resources allocations all the time (e.g., when live migration
> starts, KVM pgtables can drastically increase if huge pages are enabled,
> for PAGE_SIZE trackings), I assumed the over-commit portion should be less
> that those.. and when it's also temporary (src QEMU will release all
> resources after live upgrade) then it looks manageable.
>>
>> How?
>>
>> cpr-exec preserves descriptors across exec by clearing the CLOEXEC flag,
>> and by sending the unique name and value of each descriptor to new QEMU
>> via CPR state.
>>
>> CPR state cannot be sent over the normal migration channel, because devices
>> and backends are created prior to reading the channel, so this mode sends
>> CPR state over a second migration channel that is not visible to the user.
>> New QEMU reads the second channel prior to creating devices or backends.
>>
>> The exec itself is trivial.  After writing to the migration channels, the
>> migration code calls a new main-loop hook to perform the exec.
>>
>> Example:
>>
>> In this example, we simply restart the same version of QEMU, but in
>> a real scenario one would use a new QEMU binary path in cpr-exec-command.
>>
>>    # qemu-kvm -monitor stdio
>>    -object memory-backend-memfd,id=ram0,size=1G
>>    -machine memory-backend=ram0 -machine aux-ram-share=on ...
>>
>>    QEMU 10.1.50 monitor - type 'help' for more information
>>    (qemu) info status
>>    VM status: running
>>    (qemu) migrate_set_parameter mode cpr-exec
>>    (qemu) migrate_set_parameter cpr-exec-command qemu-kvm ... -incoming file:vm.state
>>    (qemu) migrate -d file:vm.state
>>    (qemu) QEMU 10.1.50 monitor - type 'help' for more information
>>    (qemu) info status
>>    VM status: running
>>
>> Steve Sistare (9):
>>    migration: multi-mode notifier
>>    migration: add cpr_walk_fd
>>    oslib: qemu_clear_cloexec
>>    vl: helper to request exec
>>    migration: cpr-exec-command parameter
>>    migration: cpr-exec save and load
>>    migration: cpr-exec mode
>>    migration: cpr-exec docs
>>    vfio: cpr-exec mode
> 
> The other thing is, as Vladimir is working on (looks like) a cleaner way of
> passing FDs fully relying on unix sockets, I want to understand better on
> the relationships of his work and the exec model.

His work is based on my work -- the ability to embed a file descriptor in a
migration stream with a VMSTATE_FD declaration -- so it is compatible.

The cpr-exec series preserves VMSTATE_FD across exec by remembering the fd
integer and embedding that in the data stream.  See the changes in vmstate-types.c
in [PATCH V3 7/9] migration: cpr-exec mode.

Thus cpr-exec will still preserve tap devices via Vladimir's code.
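
To make the descriptor handling concrete, here is a minimal sketch of the
CLOEXEC half of the mechanism, assuming a POSIX system.  The demo_* names are
hypothetical.  The series has its own helper (patch 3/9, qemu_clear_cloexec)
whose body is not shown in this message, so this should not be read as its
implementation.  Passing the fd number to new QEMU via CPR state is not shown.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: clear FD_CLOEXEC on fd so the descriptor stays
 * open across exec.  Returns 0 on success, -1 on error (errno is set).
 */
int demo_clear_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags < 0) {
        return -1;
    }
    return fcntl(fd, F_SETFD, flags & ~FD_CLOEXEC) < 0 ? -1 : 0;
}

/* Report whether fd would be closed on exec: 1 yes, 0 no, -1 on error. */
int demo_is_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    return flags < 0 ? -1 : !!(flags & FD_CLOEXEC);
}

/* Usage sketch: returns 0 if the flag round trip behaves as expected. */
int demo_self_check(void)
{
    int fd = open("/dev/null", O_RDONLY);

    assert(fd >= 0);
    assert(fcntl(fd, F_SETFD, FD_CLOEXEC) == 0);  /* start as close-on-exec */
    assert(demo_is_cloexec(fd) == 1);
    assert(demo_clear_cloexec(fd) == 0);          /* now inherited by exec */
    assert(demo_is_cloexec(fd) == 0);
    return close(fd);
}
```

Because the fd integer keeps the same value in the exec'd process, recording
that integer in the data stream is enough for the new process to find the
descriptor again, provided the flag was cleared before the exec.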

> I still personally think we should always stick with unix sockets, but I'm
> open to be convinced on above limitations.  If exec is better than
> cpr-transfer in any way, the hope is more people can and should adopt it.

Various people and companies have expressed interest in CPR and want to explore
cpr-exec.  Vladimir was one; he chose transfer instead, and that is fine, but
we should give people the option.  And Oracle continues to use cpr-exec mode.

There is no downside to supporting cpr-exec mode.  It is astonishing how much
code is shared by the cpr-transfer and cpr-exec modes.  Most of the code in
this series is factored into specific cpr-exec files and functions, code that
will never run for any other reason.  There are very few conditionals in common
code that do something different for exec mode.

> We also have no answer yet on how cpr-exec can resolve container world with
> seccomp forbidding exec.  I guess that's a no-go.  It's definitely a
> downside instead.  Better mention that in the cover letter.

The key is limiting the contents of the container, so exec only has a limited
and known safe set of things to target.  I'll add that to the cover letter.

- Steve




* Re: [PATCH V3 0/9] Live update: cpr-exec
  2025-09-09 14:36   ` Steven Sistare
@ 2025-09-09 15:24     ` Peter Xu
  2025-09-09 16:03       ` Steven Sistare
  2025-09-09 16:41       ` Vladimir Sementsov-Ogievskiy
  0 siblings, 2 replies; 47+ messages in thread
From: Peter Xu @ 2025-09-09 15:24 UTC (permalink / raw)
  To: Steven Sistare
  Cc: qemu-devel, Fabiano Rosas, Markus Armbruster, Paolo Bonzini,
	Eric Blake, Dr. David Alan Gilbert, Vladimir Sementsov-Ogievskiy,
	Daniel P. Berrangé

On Tue, Sep 09, 2025 at 10:36:16AM -0400, Steven Sistare wrote:
> On 9/5/2025 12:48 PM, Peter Xu wrote:
> > Add Vladimir and Dan.
> > 
> > On Thu, Aug 14, 2025 at 10:17:14AM -0700, Steve Sistare wrote:
> > > This patch series adds the live migration cpr-exec mode.
> > > 
> > > The new user-visible interfaces are:
> > >    * cpr-exec (MigMode migration parameter)
> > >    * cpr-exec-command (migration parameter)
> > > 
> > > cpr-exec mode is similar in most respects to cpr-transfer mode, with the
> > > primary difference being that old QEMU directly exec's new QEMU.  The user
> > > specifies the command to exec new QEMU in the migration parameter
> > > cpr-exec-command.
> > > 
> > > Why?
> > > 
> > > In a containerized QEMU environment, cpr-exec reuses an existing QEMU
> > > container and its assigned resources.  By contrast, cpr-transfer mode
> > > requires a new container to be created on the same host as the target of
> > > the CPR operation.  Resources must be reserved for the new container, while
> > > the old container still reserves resources until the operation completes.
> > > Avoiding over commitment requires extra work in the management layer.
> > 
> > Can we spell out what are these resources?
> > 
> > CPR definitely relies on completely shared memory.  That's already not a
> > concern.
> > 
> > CPR resolves resources that are bound to devices like VFIO by passing over
> > FDs, these are not over commited either.
> > 
> > Is it accounting QEMU/KVM process overhead?  That would really be trivial,
> > IMHO, but maybe something else?
> 
> Accounting is one issue, and it is not trivial.  Another is arranging exclusive
> use of a set of CPUs, the same set for the old and new container, concurrently.
> Another is avoiding namespace conflicts, the kind that make localhost migration
> difficult.
> 
> > > This is one reason why a cloud provider may prefer cpr-exec.  A second reason
> > > is that the container may include agents with their own connections to the
> > > outside world, and such connections remain intact if the container is reused.
> > 
> > We discussed about this one.  Personally I still cannot understand why this
> > is a concern if the agents can be trivially started as a new instance.  But
> > I admit I may not know the whole picture.  To me, the above point is more
> > persuasive, but I'll need to understand which part that is over-commited
> > that can be a problem.
> 
> Agents can be restarted, but that would sever the connection to the outside
> world.  With cpr-transfer or any local migration, you would need agents
> outside of old and new containers that persist.
> 
> With cpr-exec, connections can be preserved without requiring the end user
> to reconnect, and can be done trivially, by preserving chardevs.  With that
> support in qemu, the management layer does nothing extra to preserve them.
> chardev support is not part of this series but is part of my vision,
> and makes exec mode even more compelling.
> 
> Management layers have a lot of code and complexity to manage live migration,
> resources, and connections.  It requires modification to support cpr-transfer.
> All that can be bypassed with exec mode.  Less complexity, less maintainance,
> and  fewer points of failure.  I know this because I implemented exec mode in
> OCI at Oracle, and we use it in production.

I wonder how this part works in Vladimir's use case.

> > After all, cloud hosts should preserve some extra memory anyway to make
> > sure dynamic resources allocations all the time (e.g., when live migration
> > starts, KVM pgtables can drastically increase if huge pages are enabled,
> > for PAGE_SIZE trackings), I assumed the over-commit portion should be less
> > that those.. and when it's also temporary (src QEMU will release all
> > > resources after live upgrade) then it looks manageable.
> > >
> > > How?
> > > 
> > > cpr-exec preserves descriptors across exec by clearing the CLOEXEC flag,
> > > and by sending the unique name and value of each descriptor to new QEMU
> > > via CPR state.
> > > 
> > > CPR state cannot be sent over the normal migration channel, because devices
> > > and backends are created prior to reading the channel, so this mode sends
> > > CPR state over a second migration channel that is not visible to the user.
> > > New QEMU reads the second channel prior to creating devices or backends.
> > > 
> > > The exec itself is trivial.  After writing to the migration channels, the
> > > migration code calls a new main-loop hook to perform the exec.
> > > 
> > > Example:
> > > 
> > > In this example, we simply restart the same version of QEMU, but in
> > > a real scenario one would use a new QEMU binary path in cpr-exec-command.
> > > 
> > >    # qemu-kvm -monitor stdio
> > >    -object memory-backend-memfd,id=ram0,size=1G
> > >    -machine memory-backend=ram0 -machine aux-ram-share=on ...
> > > 
> > >    QEMU 10.1.50 monitor - type 'help' for more information
> > >    (qemu) info status
> > >    VM status: running
> > >    (qemu) migrate_set_parameter mode cpr-exec
> > >    (qemu) migrate_set_parameter cpr-exec-command qemu-kvm ... -incoming file:vm.state
> > >    (qemu) migrate -d file:vm.state
> > >    (qemu) QEMU 10.1.50 monitor - type 'help' for more information
> > >    (qemu) info status
> > >    VM status: running
> > > 
> > > Steve Sistare (9):
> > >    migration: multi-mode notifier
> > >    migration: add cpr_walk_fd
> > >    oslib: qemu_clear_cloexec
> > >    vl: helper to request exec
> > >    migration: cpr-exec-command parameter
> > >    migration: cpr-exec save and load
> > >    migration: cpr-exec mode
> > >    migration: cpr-exec docs
> > >    vfio: cpr-exec mode
> > 
> > The other thing is, as Vladimir is working on (looks like) a cleaner way of
> > passing FDs fully relying on unix sockets, I want to understand better on
> > the relationships of his work and the exec model.
> 
> His work is based on my work -- the ability to embed a file descriptor in a
> migration stream with a VMSTATE_FD declaration -- so it is compatible.
> 
> The cpr-exec series preserves VMSTATE_FD across exec by remembering the fd
> integer and embedding that in the data stream.  See the changes in vmstate-types.c
> in [PATCH V3 7/9] migration: cpr-exec mode.
> 
> > Thus cpr-exec will still preserve tap devices via Vladimir's code.
> >
> > I still personally think we should always stick with unix sockets, but I'm
> > open to be convinced on above limitations.  If exec is better than
> > cpr-transfer in any way, the hope is more people can and should adopt it.
> 
> Various people and companies have expressed interest in CPR and want to explore
> cpr-exec.  Vladimir was one, he chose transfer instead, and that is fine, but
> give people the option.  And Oracle continues to use cpr-exec mode.

How does cpr-exec guarantee that everything will go smoothly, with no failure
after the exec?  Essentially, this is Vladimir's question 1.  Feel free to
answer there, because there's also question 2 (which we have covered to some
extent before, but maybe not in as much depth).

The other thing, which I don't remember whether we discussed, is how cpr-exec
manages device hotplug.  Say, what happens if devices are hot plugged (via
QMP) and then a cpr-exec migration happens?

Does the cpr-exec cmdline need to convert all QMP hot-plugged devices into
cmdline options and append them?  How do we guarantee that the src/dst device
topology matches exactly with the new cmdline?

> 
> There is no downside to supporting cpr-exec mode.  It is astonishing how much
> code is shared by the cpr-transfer and cpr-exec modes.  Most of the code in
> this series is factored into specific cpr-exec files and functions, code that
> will never run for any other reason.  There are very few conditionals in common
> code that do something different for exec mode.
>
> > We also have no answer yet on how cpr-exec can resolve container world with
> > seccomp forbidding exec.  I guess that's a no-go.  It's definitely a
> > downside instead.  Better mention that in the cover letter.
>
> The key is limiting the contents of the container, so exec only has a limited
> and known safe set of things to target.  I'll add that to the cover letter.

Thanks.

-- 
Peter Xu




* Re: [PATCH V3 0/9] Live update: cpr-exec
  2025-09-09 15:24     ` Peter Xu
@ 2025-09-09 16:03       ` Steven Sistare
  2025-09-09 18:37         ` Peter Xu
  2025-09-09 16:41       ` Vladimir Sementsov-Ogievskiy
  1 sibling, 1 reply; 47+ messages in thread
From: Steven Sistare @ 2025-09-09 16:03 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Fabiano Rosas, Markus Armbruster, Paolo Bonzini,
	Eric Blake, Dr. David Alan Gilbert, Vladimir Sementsov-Ogievskiy,
	Daniel P. Berrangé

On 9/9/2025 11:24 AM, Peter Xu wrote:
> On Tue, Sep 09, 2025 at 10:36:16AM -0400, Steven Sistare wrote:
>> On 9/5/2025 12:48 PM, Peter Xu wrote:
>>> Add Vladimir and Dan.
>>>
>>> On Thu, Aug 14, 2025 at 10:17:14AM -0700, Steve Sistare wrote:
>>>> This patch series adds the live migration cpr-exec mode.
>>>>
>>>> The new user-visible interfaces are:
>>>>     * cpr-exec (MigMode migration parameter)
>>>>     * cpr-exec-command (migration parameter)
>>>>
>>>> cpr-exec mode is similar in most respects to cpr-transfer mode, with the
>>>> primary difference being that old QEMU directly exec's new QEMU.  The user
>>>> specifies the command to exec new QEMU in the migration parameter
>>>> cpr-exec-command.
>>>>
>>>> Why?
>>>>
>>>> In a containerized QEMU environment, cpr-exec reuses an existing QEMU
>>>> container and its assigned resources.  By contrast, cpr-transfer mode
>>>> requires a new container to be created on the same host as the target of
>>>> the CPR operation.  Resources must be reserved for the new container, while
>>>> the old container still reserves resources until the operation completes.
>>>> Avoiding over commitment requires extra work in the management layer.
>>>
>>> Can we spell out what are these resources?
>>>
>>> CPR definitely relies on completely shared memory.  That's already not a
>>> concern.
>>>
>>> CPR resolves resources that are bound to devices like VFIO by passing over
>>> FDs, these are not over commited either.
>>>
>>> Is it accounting QEMU/KVM process overhead?  That would really be trivial,
>>> IMHO, but maybe something else?
>>
>> Accounting is one issue, and it is not trivial.  Another is arranging exclusive
>> use of a set of CPUs, the same set for the old and new container, concurrently.
>> Another is avoiding namespace conflicts, the kind that make localhost migration
>> difficult.
>>
>>>> This is one reason why a cloud provider may prefer cpr-exec.  A second reason
>>>> is that the container may include agents with their own connections to the
>>>> outside world, and such connections remain intact if the container is reused.
>>>
>>> We discussed about this one.  Personally I still cannot understand why this
>>> is a concern if the agents can be trivially started as a new instance.  But
>>> I admit I may not know the whole picture.  To me, the above point is more
>>> persuasive, but I'll need to understand which part that is over-commited
>>> that can be a problem.
>>
>> Agents can be restarted, but that would sever the connection to the outside
>> world.  With cpr-transfer or any local migration, you would need agents
>> outside of old and new containers that persist.
>>
>> With cpr-exec, connections can be preserved without requiring the end user
>> to reconnect, and can be done trivially, by preserving chardevs.  With that
>> support in qemu, the management layer does nothing extra to preserve them.
>> chardev support is not part of this series but is part of my vision,
>> and makes exec mode even more compelling.
>>
>> Management layers have a lot of code and complexity to manage live migration,
>> resources, and connections.  It requires modification to support cpr-transfer.
>> All that can be bypassed with exec mode.  Less complexity, less maintainance,
>> and  fewer points of failure.  I know this because I implemented exec mode in
>> OCI at Oracle, and we use it in production.
> 
> I wonders how this part works in Vladimir's use case.
> 
>>> After all, cloud hosts should preserve some extra memory anyway to make
>>> sure dynamic resources allocations all the time (e.g., when live migration
>>> starts, KVM pgtables can drastically increase if huge pages are enabled,
>>> for PAGE_SIZE trackings), I assumed the over-commit portion should be less
>>> that those.. and when it's also temporary (src QEMU will release all
>>> resources after live upgrade) then it looks manageable.
>>>
>>>> How?
>>>>
>>>> cpr-exec preserves descriptors across exec by clearing the CLOEXEC flag,
>>>> and by sending the unique name and value of each descriptor to new QEMU
>>>> via CPR state.
>>>>
>>>> CPR state cannot be sent over the normal migration channel, because devices
>>>> and backends are created prior to reading the channel, so this mode sends
>>>> CPR state over a second migration channel that is not visible to the user.
>>>> New QEMU reads the second channel prior to creating devices or backends.
>>>>
>>>> The exec itself is trivial.  After writing to the migration channels, the
>>>> migration code calls a new main-loop hook to perform the exec.
>>>>
>>>> Example:
>>>>
>>>> In this example, we simply restart the same version of QEMU, but in
>>>> a real scenario one would use a new QEMU binary path in cpr-exec-command.
>>>>
>>>>     # qemu-kvm -monitor stdio
>>>>     -object memory-backend-memfd,id=ram0,size=1G
>>>>     -machine memory-backend=ram0 -machine aux-ram-share=on ...
>>>>
>>>>     QEMU 10.1.50 monitor - type 'help' for more information
>>>>     (qemu) info status
>>>>     VM status: running
>>>>     (qemu) migrate_set_parameter mode cpr-exec
>>>>     (qemu) migrate_set_parameter cpr-exec-command qemu-kvm ... -incoming file:vm.state
>>>>     (qemu) migrate -d file:vm.state
>>>>     (qemu) QEMU 10.1.50 monitor - type 'help' for more information
>>>>     (qemu) info status
>>>>     VM status: running
>>>>
>>>> Steve Sistare (9):
>>>>     migration: multi-mode notifier
>>>>     migration: add cpr_walk_fd
>>>>     oslib: qemu_clear_cloexec
>>>>     vl: helper to request exec
>>>>     migration: cpr-exec-command parameter
>>>>     migration: cpr-exec save and load
>>>>     migration: cpr-exec mode
>>>>     migration: cpr-exec docs
>>>>     vfio: cpr-exec mode
>>>
>>> The other thing is, as Vladimir is working on (looks like) a cleaner way of
>>> passing FDs fully relying on unix sockets, I want to understand better on
>>> the relationships of his work and the exec model.
>>
>> His work is based on my work -- the ability to embed a file descriptor in a
>> migration stream with a VMSTATE_FD declaration -- so it is compatible.
>>
>> The cpr-exec series preserves VMSTATE_FD across exec by remembering the fd
>> integer and embedding that in the data stream.  See the changes in vmstate-types.c
>> in [PATCH V3 7/9] migration: cpr-exec mode.
>>
>> Thus cpr-exec will still preserve tap devices via Vladimir's code.
>>> I still personally think we should always stick with unix sockets, but I'm
>>> open to be convinced on above limitations.  If exec is better than
>>> cpr-transfer in any way, the hope is more people can and should adopt it.
>>
>> Various people and companies have expressed interest in CPR and want to explore
>> cpr-exec.  Vladimir was one, he chose transfer instead, and that is fine, but
>> give people the option.  And Oracle continues to use cpr-exec mode.
> 
> How does cpr-exec guarantees everything will go smoothly with no failure
> after the exec?  Essentially, this is Vladimir's question 1.  

Live migration can fail if dirty memory copy does not converge.  CPR does not.
cpr-transfer can fail if it fails to create a new container.  cpr-exec cannot.
cpr-transfer can fail to allocate resources.  cpr-exec needs less.

cpr-exec failure is almost always due to a QEMU bug.  For example, a new feature
has been added to new QEMU, and is *not* forced to false in a compatibility entry
for the old machine model. We do our best to find and fix those before going into
production. In production, the success rate is high. That is one reason I like the
mode so much.

> Feel free to
> answer there, because there's also question 2 (which we used to cover some
> but maybe not as much).

Question 2 is about minimizing downtime by starting new QEMU while old QEMU
is still running.  That is true, but the savings are small.

> The other thing I don't remember if we discussed, on how cpr-exec manages
> device hotplugs. Say, what happens if there are devices hot plugged (via
> QMP) then cpr-exec migration happens?

One method: start new qemu with the original command-line arguments plus -S,
then mgmt re-sends the hot plug commands to the qemu monitor.  Same as for
live migration.

> Does cpr-exec cmdline needs to convert all QMP hot-plugged devices into
> cmdlines and append them?

That also works, and is a technique I have used to reduce guest pause time.

> How to guarantee src/dst device topology match
> exactly the same with the new cmdline?

That is up to the mgmt layer, to know how QEMU was originally started, and
what has been hot plugged afterwards.  The fast qom-list-get command that
I recently added can help here.
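
To illustrate that bookkeeping, here is a minimal sketch in Python of a
hypothetical mgmt-side record.  It is not code from this series: the VmRecord
class, its method names, and device_option() are made up for illustration.
It assumes mgmt sees every device_add/device_del it issues over QMP, and that
the recorded arguments already carry whatever bus/addr properties are needed
to reproduce the topology.  Hot-plugged netdevs and objects would need the
same treatment and are not covered here.

```python
def device_option(qmp_args):
    """Render device_add arguments as the value of a -device option."""
    args = dict(qmp_args)
    parts = [args.pop("driver")]
    for key, value in args.items():
        if isinstance(value, bool):
            value = "on" if value else "off"
        # QEMU option syntax escapes a literal comma by doubling it.
        parts.append(f"{key}={str(value).replace(',', ',,')}")
    return ",".join(parts)


class VmRecord:
    """Hypothetical mgmt-side record of how a VM was started and changed."""

    def __init__(self, argv):
        self.argv = list(argv)      # original QEMU command line
        self.hotplugged = {}        # device id -> device_add arguments

    def on_device_add(self, qmp_args):
        self.hotplugged[qmp_args["id"]] = dict(qmp_args)

    def on_device_del(self, dev_id):
        self.hotplugged.pop(dev_id, None)

    def new_qemu_argv(self, new_binary, incoming_uri):
        """Original arguments, plus hot-plugged devices, plus -incoming."""
        argv = [new_binary] + self.argv[1:]
        for qmp_args in self.hotplugged.values():
            argv += ["-device", device_option(qmp_args)]
        return argv + ["-incoming", incoming_uri]
```

The resulting argument list is what mgmt would turn into cpr-exec-command.
How the list is quoted for the monitor is left to the caller.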

- Steve

>> There is no downside to supporting cpr-exec mode.  It is astonishing how much
>> code is shared by the cpr-transfer and cpr-exec modes.  Most of the code in
>> this series is factored into specific cpr-exec files and functions, code that
>> will never run for any other reason.  There are very few conditionals in common
>> code that do something different for exec mode.
>>
>>> We also have no answer yet on how cpr-exec can resolve container world with
>>> seccomp forbidding exec.  I guess that's a no-go.  It's definitely a
>>> downside instead.  Better mention that in the cover letter.
>>
>> The key is limiting the contents of the container, so exec only has a limited
>> and known safe set of things to target.  I'll add that to the cover letter.
> 
> Thanks.



* Re: [PATCH V3 0/9] Live update: cpr-exec
  2025-09-09 16:03       ` Steven Sistare
@ 2025-09-09 18:37         ` Peter Xu
  2025-09-12 14:50           ` Steven Sistare
  0 siblings, 1 reply; 47+ messages in thread
From: Peter Xu @ 2025-09-09 18:37 UTC (permalink / raw)
  To: Steven Sistare
  Cc: qemu-devel, Fabiano Rosas, Markus Armbruster, Paolo Bonzini,
	Eric Blake, Dr. David Alan Gilbert, Vladimir Sementsov-Ogievskiy,
	Daniel P. Berrangé

On Tue, Sep 09, 2025 at 12:03:11PM -0400, Steven Sistare wrote:
> On 9/9/2025 11:24 AM, Peter Xu wrote:
> > On Tue, Sep 09, 2025 at 10:36:16AM -0400, Steven Sistare wrote:
> > > On 9/5/2025 12:48 PM, Peter Xu wrote:
> > > > Add Vladimir and Dan.
> > > > 
> > > > On Thu, Aug 14, 2025 at 10:17:14AM -0700, Steve Sistare wrote:
> > > > > This patch series adds the live migration cpr-exec mode.
> > > > > 
> > > > > The new user-visible interfaces are:
> > > > >     * cpr-exec (MigMode migration parameter)
> > > > >     * cpr-exec-command (migration parameter)
> > > > > 
> > > > > cpr-exec mode is similar in most respects to cpr-transfer mode, with the
> > > > > primary difference being that old QEMU directly exec's new QEMU.  The user
> > > > > specifies the command to exec new QEMU in the migration parameter
> > > > > cpr-exec-command.
> > > > > 
> > > > > Why?
> > > > > 
> > > > > In a containerized QEMU environment, cpr-exec reuses an existing QEMU
> > > > > container and its assigned resources.  By contrast, cpr-transfer mode
> > > > > requires a new container to be created on the same host as the target of
> > > > > the CPR operation.  Resources must be reserved for the new container, while
> > > > > the old container still reserves resources until the operation completes.
> > > > > Avoiding over commitment requires extra work in the management layer.
> > > > 
> > > > Can we spell out what are these resources?
> > > > 
> > > > CPR definitely relies on completely shared memory.  That's already not a
> > > > concern.
> > > > 
> > > > CPR resolves resources that are bound to devices like VFIO by passing over
> > > > FDs, these are not over commited either.
> > > > 
> > > > Is it accounting QEMU/KVM process overhead?  That would really be trivial,
> > > > IMHO, but maybe something else?
> > > 
> > > Accounting is one issue, and it is not trivial.  Another is arranging exclusive
> > > use of a set of CPUs, the same set for the old and new container, concurrently.
> > > Another is avoiding namespace conflicts, the kind that make localhost migration
> > > difficult.
> > > 
> > > > > This is one reason why a cloud provider may prefer cpr-exec.  A second reason
> > > > > is that the container may include agents with their own connections to the
> > > > > outside world, and such connections remain intact if the container is reused.
> > > > 
> > > > We discussed about this one.  Personally I still cannot understand why this
> > > > is a concern if the agents can be trivially started as a new instance.  But
> > > > I admit I may not know the whole picture.  To me, the above point is more
> > > > persuasive, but I'll need to understand which part that is over-commited
> > > > that can be a problem.
> > > 
> > > Agents can be restarted, but that would sever the connection to the outside
> > > world.  With cpr-transfer or any local migration, you would need agents
> > > outside of old and new containers that persist.
> > > 
> > > With cpr-exec, connections can be preserved without requiring the end user
> > > to reconnect, and can be done trivially, by preserving chardevs.  With that
> > > support in qemu, the management layer does nothing extra to preserve them.
> > > chardev support is not part of this series but is part of my vision,
> > > and makes exec mode even more compelling.
> > > 
> > > Management layers have a lot of code and complexity to manage live migration,
> > > resources, and connections.  It requires modification to support cpr-transfer.
> > > All that can be bypassed with exec mode.  Less complexity, less maintainance,
> > > and  fewer points of failure.  I know this because I implemented exec mode in
> > > OCI at Oracle, and we use it in production.
> > 
> > I wonders how this part works in Vladimir's use case.
> > 
> > > > After all, cloud hosts should preserve some extra memory anyway to make
> > > > sure dynamic resources allocations all the time (e.g., when live migration
> > > > starts, KVM pgtables can drastically increase if huge pages are enabled,
> > > > for PAGE_SIZE trackings), I assumed the over-commit portion should be less
> > > > that those.. and when it's also temporary (src QEMU will release all
> > > > > resources after live upgrade) then it looks manageable.
> > > > >
> > > > > How?
> > > > > 
> > > > > cpr-exec preserves descriptors across exec by clearing the CLOEXEC flag,
> > > > > and by sending the unique name and value of each descriptor to new QEMU
> > > > > via CPR state.
> > > > > 
> > > > > CPR state cannot be sent over the normal migration channel, because devices
> > > > > and backends are created prior to reading the channel, so this mode sends
> > > > > CPR state over a second migration channel that is not visible to the user.
> > > > > New QEMU reads the second channel prior to creating devices or backends.
> > > > > 
> > > > > The exec itself is trivial.  After writing to the migration channels, the
> > > > > migration code calls a new main-loop hook to perform the exec.
> > > > > 
> > > > > Example:
> > > > > 
> > > > > In this example, we simply restart the same version of QEMU, but in
> > > > > a real scenario one would use a new QEMU binary path in cpr-exec-command.
> > > > > 
> > > > >     # qemu-kvm -monitor stdio
> > > > >     -object memory-backend-memfd,id=ram0,size=1G
> > > > >     -machine memory-backend=ram0 -machine aux-ram-share=on ...
> > > > > 
> > > > >     QEMU 10.1.50 monitor - type 'help' for more information
> > > > >     (qemu) info status
> > > > >     VM status: running
> > > > >     (qemu) migrate_set_parameter mode cpr-exec
> > > > >     (qemu) migrate_set_parameter cpr-exec-command qemu-kvm ... -incoming file:vm.state
> > > > >     (qemu) migrate -d file:vm.state
> > > > >     (qemu) QEMU 10.1.50 monitor - type 'help' for more information
> > > > >     (qemu) info status
> > > > >     VM status: running
> > > > > 
> > > > > Steve Sistare (9):
> > > > >     migration: multi-mode notifier
> > > > >     migration: add cpr_walk_fd
> > > > >     oslib: qemu_clear_cloexec
> > > > >     vl: helper to request exec
> > > > >     migration: cpr-exec-command parameter
> > > > >     migration: cpr-exec save and load
> > > > >     migration: cpr-exec mode
> > > > >     migration: cpr-exec docs
> > > > >     vfio: cpr-exec mode
> > > > 
> > > > The other thing is, as Vladimir is working on (looks like) a cleaner way of
> > > > passing FDs fully relying on unix sockets, I want to understand better on
> > > > the relationships of his work and the exec model.
> > > 
> > > His work is based on my work -- the ability to embed a file descriptor in a
> > > migration stream with a VMSTATE_FD declaration -- so it is compatible.
> > > 
> > > The cpr-exec series preserves VMSTATE_FD across exec by remembering the fd
> > > integer and embedding that in the data stream.  See the changes in vmstate-types.c
> > > in [PATCH V3 7/9] migration: cpr-exec mode.
> > > 
> > > Thus cpr-exec will still preserve tap devices via Vladimir's code.
> > > > I still personally think we should always stick with unix sockets, but I'm
> > > > open to be convinced on above limitations.  If exec is better than
> > > > cpr-transfer in any way, the hope is more people can and should adopt it.
> > > 
> > > Various people and companies have expressed interest in CPR and want to explore
> > > cpr-exec.  Vladimir was one, he chose transfer instead, and that is fine, but
> > > give people the option.  And Oracle continues to use cpr-exec mode.
> > 
> > How does cpr-exec guarantees everything will go smoothly with no failure
> > after the exec?  Essentially, this is Vladimir's question 1.
> 
> Live migration can fail if dirty memory copy does not converge.  CPR does not.

As we're comparing cpr-transfer and cpr-exec, this one doesn't really count, AFAIU.

> cpr-transfer can fail if it fails to create a new container.  cpr-exec cannot.
> cpr-transfer can fail to allocate resources.  cpr-exec needs less.

These two could indeed happen on very occupied hosts, but is it really that
common an issue when the whole guest memory section is ignored, after all?

> 
> cpr-exec failure is almost always due to a QEMU bug.  For example, a new feature
> has been added to new QEMU, and is *not* forced to false in a compatibility entry
> for the old machine model. We do our best to find and fix those before going into
> production. In production, the success rate is high. That is one reason I like the
> mode so much.

Yes, but this is still a major issue.  The problem is I don't think we have
a good way to provide 100% coverage of the code base for all kinds of
migrations.

After all, we have tons of needed() fields in VMSDs, so we always need to be
prepared for the migration stream to change from time to time even with
exactly the same device setup, and some of them may be prone to put()
failures on the other side.

Live migration was designed to be fine with that, so at least the VM
won't crash on src if anything happens.

Precopy always does that, and we're trying to make postcopy do the same,
which Juraj is working on, so that postcopy can FAIL and roll back to src too
if the device state doesn't apply cleanly.

It's still not uncommon for a guest OS / driver behavior change to cause
corner-case migration failures that only show up when applying the states.

That's IMHO a high risk, even if the probability is low.

> 
> > Feel free to
> > answer there, because there's also question 2 (which we used to cover some
> > but maybe not as much).
> 
> Question 2 is about minimizing downtime by starting new QEMU while old QEMU
> is still running.  That is true, but the savings are small.

I thought we discussed this, and that at least the two major items below are
known to increase downtime (either directly accounted as downtime, or by
slowing down vcpus later)?

  - Process pgtable, aka, QEMU's view of guest mem
  - EPT pgtable, aka, vCPU's view of guest mem

Populating these normally takes time when the VM becomes huge, while
cpr-transfer can still benefit from pre-population before switchover.

IIUC that's a known issue, but please correct me if I remember it wrong.
I think it means this issue is more severe with larger VMs, which is a
trade-off.  It's just that I don't know what else might be relevant.

Personally I don't think this is a blocker for cpr-exec, but we should IMHO
record the differences.  It would be best, IMHO, to have a section in
cpr.rst to discuss this, helping users decide which to choose when both
benefit from CPR in general.

Meanwhile, just to mention that a unit test for cpr-exec is still missing.

> > The other thing I don't remember if we discussed, on how cpr-exec manages
> > device hotplugs. Say, what happens if there are devices hot plugged (via
> > QMP) then cpr-exec migration happens?
>
> One method: start new qemu with the original command-line arguments plus -S, then
> mgmt re-sends the hot plug commands to the qemu monitor.  Same as for live
> migration.
>
> > Does cpr-exec cmdline needs to convert all QMP hot-plugged devices into
> > cmdlines and append them?
>
> That also works, and is a technique I have used to reduce guest pause time.
> 
> > How to guarantee src/dst device topology match
> > exactly the same with the new cmdline?
> 
> That is up to the mgmt layer, to know how QEMU was originally started, and
> what has been hot plugged afterwards.  The fast qom-list-get command that
> I recently added can help here.

I see.  If you think that is the best way to consume cpr-exec, would you
add a small section to the doc patch for it as well?

-- 
Peter Xu




* Re: [PATCH V3 0/9] Live update: cpr-exec
  2025-09-09 18:37         ` Peter Xu
@ 2025-09-12 14:50           ` Steven Sistare
  2025-09-12 15:44             ` Peter Xu
  0 siblings, 1 reply; 47+ messages in thread
From: Steven Sistare @ 2025-09-12 14:50 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Fabiano Rosas, Markus Armbruster, Paolo Bonzini,
	Eric Blake, Dr. David Alan Gilbert, Vladimir Sementsov-Ogievskiy,
	Daniel P. Berrangé

On 9/9/2025 2:37 PM, Peter Xu wrote:
> On Tue, Sep 09, 2025 at 12:03:11PM -0400, Steven Sistare wrote:
>> On 9/9/2025 11:24 AM, Peter Xu wrote:
>>> On Tue, Sep 09, 2025 at 10:36:16AM -0400, Steven Sistare wrote:
>>>> On 9/5/2025 12:48 PM, Peter Xu wrote:
>>>>> Add Vladimir and Dan.
>>>>>
>>>>> On Thu, Aug 14, 2025 at 10:17:14AM -0700, Steve Sistare wrote:
>>>>>> This patch series adds the live migration cpr-exec mode.
>>>>>>
>>>>>> The new user-visible interfaces are:
>>>>>>      * cpr-exec (MigMode migration parameter)
>>>>>>      * cpr-exec-command (migration parameter)
>>>>>>
>>>>>> cpr-exec mode is similar in most respects to cpr-transfer mode, with the
>>>>>> primary difference being that old QEMU directly exec's new QEMU.  The user
>>>>>> specifies the command to exec new QEMU in the migration parameter
>>>>>> cpr-exec-command.
>>>>>>
>>>>>> Why?
>>>>>>
>>>>>> In a containerized QEMU environment, cpr-exec reuses an existing QEMU
>>>>>> container and its assigned resources.  By contrast, cpr-transfer mode
>>>>>> requires a new container to be created on the same host as the target of
>>>>>> the CPR operation.  Resources must be reserved for the new container, while
>>>>>> the old container still reserves resources until the operation completes.
>>>>>> Avoiding over commitment requires extra work in the management layer.
>>>>>
>>>>> Can we spell out what are these resources?
>>>>>
>>>>> CPR definitely relies on completely shared memory.  That's already not a
>>>>> concern.
>>>>>
>>>>> CPR resolves resources that are bound to devices like VFIO by passing over
>>>>> FDs, these are not over commited either.
>>>>>
>>>>> Is it accounting QEMU/KVM process overhead?  That would really be trivial,
>>>>> IMHO, but maybe something else?
>>>>
>>>> Accounting is one issue, and it is not trivial.  Another is arranging exclusive
>>>> use of a set of CPUs, the same set for the old and new container, concurrently.
>>>> Another is avoiding namespace conflicts, the kind that make localhost migration
>>>> difficult.
>>>>
>>>>>> This is one reason why a cloud provider may prefer cpr-exec.  A second reason
>>>>>> is that the container may include agents with their own connections to the
>>>>>> outside world, and such connections remain intact if the container is reused.
>>>>>
>>>>> We discussed about this one.  Personally I still cannot understand why this
>>>>> is a concern if the agents can be trivially started as a new instance.  But
>>>>> I admit I may not know the whole picture.  To me, the above point is more
>>>>> persuasive, but I'll need to understand which part that is over-commited
>>>>> that can be a problem.
>>>>
>>>> Agents can be restarted, but that would sever the connection to the outside
>>>> world.  With cpr-transfer or any local migration, you would need agents
>>>> outside of old and new containers that persist.
>>>>
>>>> With cpr-exec, connections can be preserved without requiring the end user
>>>> to reconnect, and can be done trivially, by preserving chardevs.  With that
>>>> support in qemu, the management layer does nothing extra to preserve them.
>>>> chardev support is not part of this series but is part of my vision,
>>>> and makes exec mode even more compelling.
>>>>
>>>> Management layers have a lot of code and complexity to manage live migration,
>>>> resources, and connections.  It requires modification to support cpr-transfer.
>>>> All that can be bypassed with exec mode.  Less complexity, less maintenance,
>>>> and fewer points of failure.  I know this because I implemented exec mode in
>>>> OCI at Oracle, and we use it in production.
>>>
>>> I wonder how this part works in Vladimir's use case.
>>>
>>>>> After all, cloud hosts should preserve some extra memory anyway to make
>>>>> sure dynamic resources allocations all the time (e.g., when live migration
>>>>> starts, KVM pgtables can drastically increase if huge pages are enabled,
>>>>> for PAGE_SIZE trackings), I assumed the over-commit portion should be less
>>>>> than those.. and when it's also temporary (src QEMU will release all
>>>>> resources after live upgrade) then it looks manageable.
>>>>>
>>>>>> How?
>>>>>>
>>>>>> cpr-exec preserves descriptors across exec by clearing the CLOEXEC flag,
>>>>>> and by sending the unique name and value of each descriptor to new QEMU
>>>>>> via CPR state.
>>>>>>
>>>>>> CPR state cannot be sent over the normal migration channel, because devices
>>>>>> and backends are created prior to reading the channel, so this mode sends
>>>>>> CPR state over a second migration channel that is not visible to the user.
>>>>>> New QEMU reads the second channel prior to creating devices or backends.
>>>>>>
>>>>>> The exec itself is trivial.  After writing to the migration channels, the
>>>>>> migration code calls a new main-loop hook to perform the exec.
>>>>>>
>>>>>> Example:
>>>>>>
>>>>>> In this example, we simply restart the same version of QEMU, but in
>>>>>> a real scenario one would use a new QEMU binary path in cpr-exec-command.
>>>>>>
>>>>>>      # qemu-kvm -monitor stdio
>>>>>>      -object memory-backend-memfd,id=ram0,size=1G
>>>>>>      -machine memory-backend=ram0 -machine aux-ram-share=on ...
>>>>>>
>>>>>>      QEMU 10.1.50 monitor - type 'help' for more information
>>>>>>      (qemu) info status
>>>>>>      VM status: running
>>>>>>      (qemu) migrate_set_parameter mode cpr-exec
>>>>>>      (qemu) migrate_set_parameter cpr-exec-command qemu-kvm ... -incoming file:vm.state
>>>>>>      (qemu) migrate -d file:vm.state
>>>>>>      (qemu) QEMU 10.1.50 monitor - type 'help' for more information
>>>>>>      (qemu) info status
>>>>>>      VM status: running
>>>>>>
>>>>>> Steve Sistare (9):
>>>>>>      migration: multi-mode notifier
>>>>>>      migration: add cpr_walk_fd
>>>>>>      oslib: qemu_clear_cloexec
>>>>>>      vl: helper to request exec
>>>>>>      migration: cpr-exec-command parameter
>>>>>>      migration: cpr-exec save and load
>>>>>>      migration: cpr-exec mode
>>>>>>      migration: cpr-exec docs
>>>>>>      vfio: cpr-exec mode
>>>>>
>>>>> The other thing is, as Vladimir is working on (looks like) a cleaner way of
>>>>> passing FDs fully relying on unix sockets, I want to understand better on
>>>>> the relationships of his work and the exec model.
>>>>
>>>> His work is based on my work -- the ability to embed a file descriptor in a
>>>> migration stream with a VMSTATE_FD declaration -- so it is compatible.
>>>>
>>>> The cpr-exec series preserves VMSTATE_FD across exec by remembering the fd
>>>> integer and embedding that in the data stream.  See the changes in vmstate-types.c
>>>> in [PATCH V3 7/9] migration: cpr-exec mode.
>>>>
>>>> Thus cpr-exec will still preserve tap devices via Vladimir's code.
>>>>
>>>>> I still personally think we should always stick with unix sockets, but I'm
>>>>> open to be convinced on above limitations.  If exec is better than
>>>>> cpr-transfer in any way, the hope is more people can and should adopt it.
>>>>
>>>> Various people and companies have expressed interest in CPR and want to explore
>>>> cpr-exec.  Vladimir was one, he chose transfer instead, and that is fine, but
>>>> give people the option.  And Oracle continues to use cpr-exec mode.
>>>
>>> How does cpr-exec guarantees everything will go smoothly with no failure
>>> after the exec?  Essentially, this is Vladimir's question 1.
>>
>> Live migration can fail if dirty memory copy does not converge.  CPR does not.
> 
> As we're comparing cpr-transfer and cpr-exec, this one doesn't really count, AFAIU.
> 
>> cpr-transfer can fail if it fails to create a new container.  cpr-exec cannot.
>> cpr-transfer can fail to allocate resources.  cpr-exec needs less.
> 
> These two could happen in very occupied hosts indeed, but is it really that
> common an issue when ignoring the whole guest memory section after all?

Conventional wisdom holds that in migration scenarios, we must have the option
to fall back to the source if the target fails.  In all the above, I point out
the reasons behind this wisdom, and that many of those reasons do not apply
for cpr-exec.

>> cpr-exec failure is almost always due to a QEMU bug.  For example, a new feature
>> has been added to new QEMU, and is *not* forced to false in a compatibility entry
>> for the old machine model. We do our best to find and fix those before going into
>> production. In production, the success rate is high. That is one reason I like the
>> mode so much.
> 
> Yes, but this is still a major issue.  The problem is I don't think we have
> good way to provide 100% coverage on the code base covering all kinds of
> migrations.
> 
> After all, we have tons of needed() fields in VMSD, we need to always be
> prepared that the migration stream can change from time to time with
> exactly the same device setup, and some of them may be prone to put() failures
> on the other side.
> 
> After all, live migration was designed to be fine with such, so at least VM
> won't crash on src if anything happens.
> 
> Precopy always does that, we're trying to make postcopy do the same, which
> Juraj is working on, so that postcopy can FAIL and rollback to src too if
> device state doesn't apply all fine.
> 
> It's still not uncommon to have guest OS / driver behavior change causing
> some corner case migration failures but only when applying the states.
> 
> That's IMHO a high risk even if low possibility.

No question, bugs are a risk and will occur.

>>> Feel free to
>>> answer there, because there's also question 2 (which we used to cover some
>>> but maybe not as much).
>>
>> Question 2 is about minimizing downtime by starting new QEMU while old QEMU
>> is still running.  That is true, but the savings are small.
> 
> I thought we discussed about this, and it should be known to have at least
> below two major part of things that will increase downtime (either directly
> accounted into downtime, or slow down vcpus later)?
> 
>    - Process pgtable, aka, QEMU's view of guest mem
>    - EPT pgtable, aka, vCPU's view of guest mem
> 
> Populating these should normally take time when VM becomes huge, while
> cpr-transfer can still benefit on pre-populations before switchover.
> 
> IIUC that's a known issue, but please correct me if I remembered it wrong.
> I think it means this issue is more severe with larger VMs, which is a
> trade-off.  It's just that I don't know what else might be relevant.
> 
> Personally I don't think this is a blocker for cpr-exec, but we should IMHO
> record the differences.  It would be best, IMHO, to have a section in
> cpr.rst to discuss this, helping user decide which to choose when both
> benefits from CPR in general.

Will do.

> Meanwhile, just to mention unit test for cpr-exec is still missing.

I will rebase and post it after receiving all comments for V3.
I think we are almost there.

>>> The other thing I don't remember if we discussed, on how cpr-exec manages
>>> device hotplugs. Say, what happens if there are devices hot plugged (via
>>> QMP) then cpr-exec migration happens?
>>
>> One method: start new qemu with the original command-line arguments plus -S, then
>> mgmt re-sends the hot plug commands to the qemu monitor.  Same as for live
>> migration.
>>
>>> Does cpr-exec cmdline needs to convert all QMP hot-plugged devices into
>>> cmdlines and append them?
>>
>> That also works, and is a technique I have used to reduce guest pause time.
>>
>>> How to guarantee src/dst device topology match
>>> exactly the same with the new cmdline?
>>
>> That is up to the mgmt layer, to know how QEMU was originally started, and
>> what has been hot plugged afterwards.  The fast qom-list-get command that
>> I recently added can help here.
> 
> I see.  If you think that is the best way to consume cpr-exec, would you
> add a small section into the doc patch for it as well?

It is not related to cpr-exec.  It is related to hot plug, for any migration
type scenario, so it does not fit in the cpr-exec docs.

- Steve


* Re: [PATCH V3 0/9] Live update: cpr-exec
  2025-09-12 14:50           ` Steven Sistare
@ 2025-09-12 15:44             ` Peter Xu
  2025-09-19 17:16               ` Steven Sistare
  0 siblings, 1 reply; 47+ messages in thread
From: Peter Xu @ 2025-09-12 15:44 UTC (permalink / raw)
  To: Steven Sistare
  Cc: qemu-devel, Fabiano Rosas, Markus Armbruster, Paolo Bonzini,
	Eric Blake, Dr. David Alan Gilbert, Vladimir Sementsov-Ogievskiy,
	Daniel P. Berrangé

On Fri, Sep 12, 2025 at 10:50:34AM -0400, Steven Sistare wrote:
> > > > How to guarantee src/dst device topology match
> > > > exactly the same with the new cmdline?
> > > 
> > > That is up to the mgmt layer, to know how QEMU was originally started, and
> > > what has been hot plugged afterwards.  The fast qom-list-get command that
> > > I recently added can help here.
> > 
> > I see.  If you think that is the best way to consume cpr-exec, would you
> > add a small section into the doc patch for it as well?
> 
> It is not related to cpr-exec.  It is related to hot plug, for any migration
> type scenario, so it does not fit in the cpr-exec docs.

IMHO it matters.. With cpr-transfer, QMP hot plugs work and will not
contribute to downtime.  With cpr-exec they also work, but will contribute to
downtime.

We could, in the comparison section between cpr-exec v.s. cpr-transfer,
mention the potential difference on device hot plugs (out of many other
differences), then also mention that there's an option to reduce downtime
for cpr-exec due to hot-plug by converting QMP hot plugs into cmdlines
leveraging qom-list-get and other facilities.  From there we could further
link to a special small section describing the usage of qom-list-get, or
stop there.

Thanks,

-- 
Peter Xu

* Re: [PATCH V3 0/9] Live update: cpr-exec
  2025-09-12 15:44             ` Peter Xu
@ 2025-09-19 17:16               ` Steven Sistare
  2025-09-23 14:37                 ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 47+ messages in thread
From: Steven Sistare @ 2025-09-19 17:16 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Fabiano Rosas, Markus Armbruster, Paolo Bonzini,
	Eric Blake, Dr. David Alan Gilbert, Vladimir Sementsov-Ogievskiy,
	Daniel P. Berrangé

On 9/12/2025 11:44 AM, Peter Xu wrote:
> On Fri, Sep 12, 2025 at 10:50:34AM -0400, Steven Sistare wrote:
>>>>> How to guarantee src/dst device topology match
>>>>> exactly the same with the new cmdline?
>>>>
>>>> That is up to the mgmt layer, to know how QEMU was originally started, and
>>>> what has been hot plugged afterwards.  The fast qom-list-get command that
>>>> I recently added can help here.
>>>
>>> I see.  If you think that is the best way to consume cpr-exec, would you
>>> add a small section into the doc patch for it as well?
>>
>> It is not related to cpr-exec.  It is related to hot plug, for any migration
>> type scenario, so it does not fit in the cpr-exec docs.
> 
> IMHO it matters.. With cpr-transfer, QMP hot plugs works and will not
> contribute to downtime.

I don't follow.  The guest is not resumed until after all devices that were
present in old QEMU are hot plugged in new QEMU, regardless of mode.

> cpr-exec also works, but will contribute to
> downtime.
> 
> We could, in the comparison section between cpr-exec v.s. cpr-transfer,
> mention the potential difference on device hot plugs (out of many other
> differences), then also mention that there's an option to reduce downtime
> for cpr-exec due to hot-plug by converting QMP hot plugs into cmdlines
> leveraging qom-list-get and other facilities.  From there we could further
> link to a special small section describing the usage of qom-list-get, or
> stop there.

To hot plug a device, *or* to add it to the new QEMU command line, the manager
must know that the device was added sometime after old QEMU started, and
qom-list-get can help with that, by examining old QEMU initially and again
immediately before the update, then performing a diff.  But again, this
is independent of mode.
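
A minimal sketch of that diff step, for illustration only.  It assumes the
manager has already reduced each qom-list-get result to a plain list of QOM
paths; that snapshot format, the helper name diff_devices, and the example
paths are assumptions, not the actual output of qom-list-get.

```python
def diff_devices(initial, current):
    """Return (added, removed) paths between two device snapshots."""
    before = set(initial)
    after = set(current)
    return sorted(after - before), sorted(before - after)


# Hypothetical snapshots: one taken right after old QEMU started,
# one taken immediately before the update.
at_start = ["/machine/peripheral/disk0", "/machine/peripheral/net0"]
before_update = [
    "/machine/peripheral/disk0",
    "/machine/peripheral/net0",
    "/machine/peripheral/net1",  # hot plugged after startup
]

added, removed = diff_devices(at_start, before_update)
print("hot plugged since start:", added)
print("unplugged since start:", removed)
```

The "added" list is what the manager would either replay as hot plug commands
or fold into the new QEMU command line; the same list serves either choice.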

- Steve


* Re: [PATCH V3 0/9] Live update: cpr-exec
  2025-09-19 17:16               ` Steven Sistare
@ 2025-09-23 14:37                 ` Vladimir Sementsov-Ogievskiy
  0 siblings, 0 replies; 47+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2025-09-23 14:37 UTC (permalink / raw)
  To: Steven Sistare, Peter Xu
  Cc: qemu-devel, Fabiano Rosas, Markus Armbruster, Paolo Bonzini,
	Eric Blake, Dr. David Alan Gilbert, Daniel P. Berrangé

On 19.09.25 20:16, Steven Sistare wrote:
> On 9/12/2025 11:44 AM, Peter Xu wrote:
>> On Fri, Sep 12, 2025 at 10:50:34AM -0400, Steven Sistare wrote:
>>>>>> How to guarantee src/dst device topology match
>>>>>> exactly the same with the new cmdline?
>>>>>
>>>>> That is up to the mgmt layer, to know how QEMU was originally started, and
>>>>> what has been hot plugged afterwards.  The fast qom-list-get command that
>>>>> I recently added can help here.
>>>>
>>>> I see.  If you think that is the best way to consume cpr-exec, would you
>>>> add a small section into the doc patch for it as well?
>>>
>>> It is not related to cpr-exec.  It is related to hot plug, for any migration
>>> type scenario, so it does not fit in the cpr-exec docs.
>>
>> IMHO it matters.. With cpr-transfer, QMP hot plugs works and will not
>> contribute to downtime.
> 
> I don't follow.  The guest is not resumed until after all devices that were
> present in old QEMU are hot plugged in new QEMU, regardless of mode.

Yes, but in the case of cpr-transfer, the source is still running while we add
devices to the target through QMP. So downtime does not start until we say "migrate-incoming".

> 
>> cpr-exec also works, but will contribute to
>> downtime.
>>
>> We could, in the comparison section between cpr-exec v.s. cpr-transfer,
>> mention the potential difference on device hot plugs (out of many other
>> differences), then also mention that there's an option to reduce downtime
>> for cpr-exec due to hot-plug by converting QMP hot plugs into cmdlines
>> leveraging qom-list-get and other facilities.  From there we could further
>> link to a special small section describing the usage of qom-list-get, or
>> stop there.
> 
> To hot plug a device, *or* to add it to the new QEMU command line, the manager
> must know that the device was added sometime after old QEMU started, and
> qom-list-get can help with that, by examining old QEMU initially and again
> immediately before the update, then performing a diff.  But again, this
> is independent of mode.
> 
> - Steve


-- 
Best regards,
Vladimir


* Re: [PATCH V3 0/9] Live update: cpr-exec
  2025-09-09 15:24     ` Peter Xu
  2025-09-09 16:03       ` Steven Sistare
@ 2025-09-09 16:41       ` Vladimir Sementsov-Ogievskiy
  1 sibling, 0 replies; 47+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2025-09-09 16:41 UTC (permalink / raw)
  To: Peter Xu, Steven Sistare
  Cc: qemu-devel, Fabiano Rosas, Markus Armbruster, Paolo Bonzini,
	Eric Blake, Dr. David Alan Gilbert, Daniel P. Berrangé

On 09.09.25 18:24, Peter Xu wrote:
> On Tue, Sep 09, 2025 at 10:36:16AM -0400, Steven Sistare wrote:
>> On 9/5/2025 12:48 PM, Peter Xu wrote:
>>> Add Vladimir and Dan.
>>>
>>> On Thu, Aug 14, 2025 at 10:17:14AM -0700, Steve Sistare wrote:
>>>> This patch series adds the live migration cpr-exec mode.
>>>>
>>>> The new user-visible interfaces are:
>>>>     * cpr-exec (MigMode migration parameter)
>>>>     * cpr-exec-command (migration parameter)
>>>>
>>>> cpr-exec mode is similar in most respects to cpr-transfer mode, with the
>>>> primary difference being that old QEMU directly exec's new QEMU.  The user
>>>> specifies the command to exec new QEMU in the migration parameter
>>>> cpr-exec-command.
>>>>
>>>> Why?
>>>>
>>>> In a containerized QEMU environment, cpr-exec reuses an existing QEMU
>>>> container and its assigned resources.  By contrast, cpr-transfer mode
>>>> requires a new container to be created on the same host as the target of
>>>> the CPR operation.  Resources must be reserved for the new container, while
>>>> the old container still reserves resources until the operation completes.
>>>> Avoiding over commitment requires extra work in the management layer.
>>>
>>> Can we spell out what are these resources?
>>>
>>> CPR definitely relies on completely shared memory.  That's already not a
>>> concern.
>>>
>>> CPR resolves resources that are bound to devices like VFIO by passing over
>>> FDs, these are not over commited either.
>>>
>>> Is it accounting QEMU/KVM process overhead?  That would really be trivial,
>>> IMHO, but maybe something else?
>>
>> Accounting is one issue, and it is not trivial.  Another is arranging exclusive
>> use of a set of CPUs, the same set for the old and new container, concurrently.
>> Another is avoiding namespace conflicts, the kind that make localhost migration
>> difficult.
>>
>>>> This is one reason why a cloud provider may prefer cpr-exec.  A second reason
>>>> is that the container may include agents with their own connections to the
>>>> outside world, and such connections remain intact if the container is reused.
>>>
>>> We discussed about this one.  Personally I still cannot understand why this
>>> is a concern if the agents can be trivially started as a new instance.  But
>>> I admit I may not know the whole picture.  To me, the above point is more
>>> persuasive, but I'll need to understand which part that is over-commited
>>> that can be a problem.
>>
>> Agents can be restarted, but that would sever the connection to the outside
>> world.  With cpr-transfer or any local migration, you would need agents
>> outside of old and new containers that persist.
>>
>> With cpr-exec, connections can be preserved without requiring the end user
>> to reconnect, and can be done trivially, by preserving chardevs.  With that
>> support in qemu, the management layer does nothing extra to preserve them.
>> chardev support is not part of this series but is part of my vision,
>> and makes exec mode even more compelling.
>>
>> Management layers have a lot of code and complexity to manage live migration,
>> resources, and connections.  It requires modification to support cpr-transfer.
>> All that can be bypassed with exec mode.  Less complexity, less maintenance,
>> and fewer points of failure.  I know this because I implemented exec mode in
>> OCI at Oracle, and we use it in production.
> 
> I wonder how this part works in Vladimir's use case.


For now, we don't have live-update with fd-passing, I'm working on it. But we do
have working live-update with starting second QEMU process.

I hope, that finally support for fd-passing in management layer will only
need three steps:

- use unix-socket as migration channel
- enable new migration capability (and probably some options to enable feature per device)
- opt-out the code [1], which implements logic of switching TAP and disk for new QEMU instance

And I don't think we want to remove this logic [1] completely, as we may want to do
normal migration without fds at some moment, for example to change the backend, or
to work around some theoretical future problems with fd passing (it's a new experimental
feature, so there may be bugs, or even future incompatible changes, until it becomes stable).

cpr-transfer needs additional steps:

- more complex interface to setup two migration channels
- tricky logic about unavailable QMP for target process at start

Still, that's possible.

> 
>>> After all, cloud hosts should preserve some extra memory anyway to make
>>> sure dynamic resources allocations all the time (e.g., when live migration
>>> starts, KVM pgtables can drastically increase if huge pages are enabled,
>>> for PAGE_SIZE trackings), I assumed the over-commit portion should be less
>>> than those.. and when it's also temporary (src QEMU will release all
>>> resources after live upgrade) then it looks manageable.
>>>
>>>> How?
>>>>
>>>> cpr-exec preserves descriptors across exec by clearing the CLOEXEC flag,
>>>> and by sending the unique name and value of each descriptor to new QEMU
>>>> via CPR state.
>>>>
>>>> CPR state cannot be sent over the normal migration channel, because devices
>>>> and backends are created prior to reading the channel, so this mode sends
>>>> CPR state over a second migration channel that is not visible to the user.
>>>> New QEMU reads the second channel prior to creating devices or backends.
>>>>
>>>> The exec itself is trivial.  After writing to the migration channels, the
>>>> migration code calls a new main-loop hook to perform the exec.
>>>>
>>>> Example:
>>>>
>>>> In this example, we simply restart the same version of QEMU, but in
>>>> a real scenario one would use a new QEMU binary path in cpr-exec-command.
>>>>
>>>>     # qemu-kvm -monitor stdio
>>>>     -object memory-backend-memfd,id=ram0,size=1G
>>>>     -machine memory-backend=ram0 -machine aux-ram-share=on ...
>>>>
>>>>     QEMU 10.1.50 monitor - type 'help' for more information
>>>>     (qemu) info status
>>>>     VM status: running
>>>>     (qemu) migrate_set_parameter mode cpr-exec
>>>>     (qemu) migrate_set_parameter cpr-exec-command qemu-kvm ... -incoming file:vm.state
>>>>     (qemu) migrate -d file:vm.state
>>>>     (qemu) QEMU 10.1.50 monitor - type 'help' for more information
>>>>     (qemu) info status
>>>>     VM status: running
>>>>
>>>> Steve Sistare (9):
>>>>     migration: multi-mode notifier
>>>>     migration: add cpr_walk_fd
>>>>     oslib: qemu_clear_cloexec
>>>>     vl: helper to request exec
>>>>     migration: cpr-exec-command parameter
>>>>     migration: cpr-exec save and load
>>>>     migration: cpr-exec mode
>>>>     migration: cpr-exec docs
>>>>     vfio: cpr-exec mode
>>>
>>> The other thing is, as Vladimir is working on (looks like) a cleaner way of
>>> passing FDs fully relying on unix sockets, I want to understand better on
>>> the relationships of his work and the exec model.
>>
>> His work is based on my work -- the ability to embed a file descriptor in a
>> migration stream with a VMSTATE_FD declaration -- so it is compatible.
>>
>> The cpr-exec series preserves VMSTATE_FD across exec by remembering the fd
>> integer and embedding that in the data stream.  See the changes in vmstate-types.c
>> in [PATCH V3 7/9] migration: cpr-exec mode.
>>
>> Thus cpr-exec will still preserve tap devices via Vladimir's code.
>>
>>> I still personally think we should always stick with unix sockets, but I'm
>>> open to be convinced on above limitations.  If exec is better than
>>> cpr-transfer in any way, the hope is more people can and should adopt it.
>>
>> Various people and companies have expressed interest in CPR and want to explore
>> cpr-exec.  Vladimir was one, he chose transfer instead, and that is fine, but
>> give people the option.  And Oracle continues to use cpr-exec mode.
> 
> How does cpr-exec guarantees everything will go smoothly with no failure
> after the exec?  Essentially, this is Vladimir's question 1.  Feel free to
> answer there, because there's also question 2 (which we used to cover some
> but maybe not as much).
> 
> The other thing I don't remember if we discussed, on how cpr-exec manages
> device hotplugs. Say, what happens if there are devices hot plugged (via
> QMP) then cpr-exec migration happens?
> 
> Does cpr-exec cmdline needs to convert all QMP hot-plugged devices into
> cmdlines and append them?  How to guarantee src/dst device topology match
> exactly the same with the new cmdline?
> 

Seems, we discussed.

As I understand, it should work the same way as for normal migration:
we add -incoming defer to cpr-exec-command, and after exec we can add
our infrastructure through the QMP interface, and then run "migrate-incoming".
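
A rough sketch of that command sequence, as it would be sent to the new QEMU
monitor after exec.  The helper name, the device (virtio-net-pci, id net1) and
the URI are placeholders chosen for illustration; only the QMP command names
qmp_capabilities, device_add and migrate-incoming are taken as given, and the
sketch only builds the messages, it does not talk to a real monitor.

```python
import json


def incoming_defer_sequence(hotplugged, uri):
    """Build the QMP commands to replay when new QEMU runs with -incoming defer."""
    cmds = [{"execute": "qmp_capabilities"}]
    # Re-create everything that was hot plugged in old QEMU.
    for dev in hotplugged:
        cmds.append({"execute": "device_add", "arguments": dict(dev)})
    # Only now start loading the migration stream.
    cmds.append({"execute": "migrate-incoming", "arguments": {"uri": uri}})
    return cmds


# Placeholder device list and URI, for illustration only.
devices = [{"driver": "virtio-net-pci", "id": "net1"}]
for cmd in incoming_defer_sequence(devices, "file:vm.state"):
    print(json.dumps(cmd))
```

In exec mode every command before migrate-incoming runs while the guest is
already paused, which is the downtime cost discussed in this thread.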

Still, that would be done during downtime, unlike cpr-transfer, where source
is still running during target QMP setup.

So exec mode works more like migrating to a file, and then restoring from it.

Maybe we could have a mediator program, which gets the migration stream
together with fds from source QEMU; then we start a new QEMU process in the
same container, and it gets the incoming migration stream together with fds
from the mediator?

Probably target QEMU itself may be used as this mediator, but we'll need an
option to somehow buffer the incoming migration state (together with fds),
and only start to apply it when source QEMU is closed.
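
For reference, the fd hand-off such a mediator would rely on is ordinary fd
passing over a unix socket.  The sketch below is illustrative only and uses
nothing from QEMU: a socketpair stands in for the source/mediator connection,
a pipe stands in for a device fd, and the label "tap0" and the helper names
are made up.

```python
import os
import socket


def send_stream_fd(sock, fd, label):
    """Send one descriptor plus a short name over a unix socket."""
    socket.send_fds(sock, [label.encode()], [fd])


def recv_stream_fd(sock):
    """Receive the name and a new descriptor referring to the same open file."""
    data, fds, _flags, _addr = socket.recv_fds(sock, 1024, 1)
    return data.decode(), fds[0]


# Stand-ins: "source" plays old QEMU, "mediator" plays the receiving side.
source, mediator = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
read_end, write_end = os.pipe()

send_stream_fd(source, write_end, "tap0")
label, received = recv_stream_fd(mediator)

# The received descriptor refers to the same pipe as write_end.
os.write(received, b"ok")
print(label, os.read(read_end, 2))
```

This needs Python 3.9 or later for socket.send_fds and socket.recv_fds.  The
receiver gets a new descriptor number, which is why a name has to travel with
each fd.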

How much exec mode would differ from such setup?

>>
>> There is no downside to supporting cpr-exec mode.  It is astonishing how much
>> code is shared by the cpr-transfer and cpr-exec modes.  Most of the code in
>> this series is factored into specific cpr-exec files and functions, code that
>> will never run for any other reason.  There are very few conditionals in common
>> code that do something different for exec mode.
>>
>>> We also have no answer yet on how cpr-exec can resolve container world with
>>> seccomp forbidding exec.  I guess that's a no-go.  It's definitely a
>>> downside instead.  Better mention that in the cover letter.
>>
>> The key is limiting the contents of the container, so exec only has a limited
>> and known safe set of things to target.  I'll add that to the cover letter.
> 
> Thanks.
> 


-- 
Best regards,
Vladimir


* Re: [PATCH V3 0/9] Live update: cpr-exec
  2025-08-14 17:17 [PATCH V3 0/9] Live update: cpr-exec Steve Sistare
                   ` (9 preceding siblings ...)
  2025-09-05 16:48 ` [PATCH V3 0/9] Live update: cpr-exec Peter Xu
@ 2025-09-08 17:02 ` Vladimir Sementsov-Ogievskiy
  10 siblings, 0 replies; 47+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2025-09-08 17:02 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel
  Cc: Fabiano Rosas, Peter Xu, Markus Armbruster, Paolo Bonzini,
	Eric Blake, Dr. David Alan Gilbert

On 14.08.25 20:17, Steve Sistare wrote:
> This patch series adds the live migration cpr-exec mode.
> 
> The new user-visible interfaces are:
>    * cpr-exec (MigMode migration parameter)
>    * cpr-exec-command (migration parameter)
> 
> cpr-exec mode is similar in most respects to cpr-transfer mode, with the
> primary difference being that old QEMU directly exec's new QEMU.  The user
> specifies the command to exec new QEMU in the migration parameter
> cpr-exec-command.
> 
> Why?
> 
> In a containerized QEMU environment, cpr-exec reuses an existing QEMU
> container and its assigned resources.  By contrast, cpr-transfer mode
> requires a new container to be created on the same host as the target of
> the CPR operation.  Resources must be reserved for the new container, while
> the old container still reserves resources until the operation completes.
> Avoiding over commitment requires extra work in the management layer.
> This is one reason why a cloud provider may prefer cpr-exec.  A second reason
> is that the container may include agents with their own connections to the
> outside world, and such connections remain intact if the container is reused.
> 

My two cents:

We considered a possibility to switch to cpr-exec, and even more,
we thought about some kind of loading new version of QEMU binary to running
QEMU process (like library) and switching to it. But finally decided to
keep our current approach (starting new QEMU in a separate process) and
use CPR transfer (and finally come to my current in-list proposals of
just migrating all fds in main migration channel).

First, we don't run QEMU in docker, so probably we don't encounter some
problems around it. The real problem for us is migration downtime for
switching network and disk.

Still, why don't we want cpr-exec? Two reasons:

1. It seems that the current approach is safer against different errors during
migration: we have more chances just to say "cont" on the source process if something
goes wrong.

2. It seems that with a second process we have more possibilities to minimize
downtime, as we can do some initialization in the new QEMU process _before_ migration
(when the second process starts, the first is still running).

I also wondered: could we do a kind of "exec", but still be able to avoid [2]?
This leads to the idea of loading the new qemu binary into the running process (like a library),
and .. starting to execute it in parallel with the old one? But that looks like trying
to reinvent processes again, which is obviously a bad idea.

-- 
Best regards,
Vladimir

Thread overview: 47+ messages
2025-08-14 17:17 [PATCH V3 0/9] Live update: cpr-exec Steve Sistare
2025-08-14 17:17 ` [PATCH V3 1/9] migration: multi-mode notifier Steve Sistare
2025-08-19 13:09   ` Fabiano Rosas
2025-09-09 15:43   ` Peter Xu
2025-09-09 16:40     ` Steven Sistare
2025-08-14 17:17 ` [PATCH V3 2/9] migration: add cpr_walk_fd Steve Sistare
2025-09-09 15:45   ` Peter Xu
2025-08-14 17:17 ` [PATCH V3 3/9] oslib: qemu_clear_cloexec Steve Sistare
2025-08-14 17:17 ` [PATCH V3 4/9] vl: helper to request exec Steve Sistare
2025-09-09 15:51   ` Peter Xu
2025-09-12 14:49     ` Steven Sistare
2025-09-15 16:35       ` Peter Xu
2025-09-19 15:27         ` Steven Sistare
2025-08-14 17:17 ` [PATCH V3 5/9] migration: cpr-exec-command parameter Steve Sistare
2025-09-08 16:07   ` Daniel P. Berrangé
2025-09-09 15:22     ` Steven Sistare
2025-09-11 15:10   ` Markus Armbruster
2025-09-12 14:48     ` Steven Sistare
2025-08-14 17:17 ` [PATCH V3 6/9] migration: cpr-exec save and load Steve Sistare
2025-09-19 15:35   ` Steven Sistare
2025-08-14 17:17 ` [PATCH V3 7/9] migration: cpr-exec mode Steve Sistare
2025-09-09 16:32   ` Peter Xu
2025-09-09 18:10     ` Steven Sistare
2025-09-09 19:27       ` Peter Xu
2025-09-12 14:49         ` Steven Sistare
2025-09-11 15:09   ` Markus Armbruster
2025-09-12 14:49     ` Steven Sistare
2025-08-14 17:17 ` [PATCH V3 8/9] migration: cpr-exec docs Steve Sistare
2025-09-15 20:36   ` Fabiano Rosas
2025-09-19 15:28     ` Steven Sistare
2025-08-14 17:17 ` [PATCH V3 9/9] vfio: cpr-exec mode Steve Sistare
2025-08-14 17:20   ` Steven Sistare
2025-09-19 15:35     ` Steven Sistare
2025-09-19 16:30       ` Cédric Le Goater
2025-09-05 16:48 ` [PATCH V3 0/9] Live update: cpr-exec Peter Xu
2025-09-05 17:09   ` Dr. David Alan Gilbert
2025-09-05 17:48     ` Peter Xu
2025-09-09 14:36   ` Steven Sistare
2025-09-09 15:24     ` Peter Xu
2025-09-09 16:03       ` Steven Sistare
2025-09-09 18:37         ` Peter Xu
2025-09-12 14:50           ` Steven Sistare
2025-09-12 15:44             ` Peter Xu
2025-09-19 17:16               ` Steven Sistare
2025-09-23 14:37                 ` Vladimir Sementsov-Ogievskiy
2025-09-09 16:41       ` Vladimir Sementsov-Ogievskiy
2025-09-08 17:02 ` Vladimir Sementsov-Ogievskiy
