qemu-devel.nongnu.org archive mirror
* [PATCH V1 00/26] Live update: cpr-exec
@ 2024-04-29 15:55 Steve Sistare
  2024-04-29 15:55 ` [PATCH V1 01/26] oslib: qemu_clear_cloexec Steve Sistare
                   ` (29 more replies)
  0 siblings, 30 replies; 122+ messages in thread
From: Steve Sistare @ 2024-04-29 15:55 UTC (permalink / raw)
  To: qemu-devel
  Cc: Peter Xu, Fabiano Rosas, David Hildenbrand, Igor Mammedov,
	Eduardo Habkost, Marcel Apfelbaum, Philippe Mathieu-Daude,
	Paolo Bonzini, Daniel P. Berrange, Markus Armbruster,
	Steve Sistare

This patch series adds the live migration cpr-exec mode.  In this mode, QEMU
stops the VM, writes VM state to the migration URI, and directly exec's a
new version of QEMU on the same host, replacing the original process while
retaining its PID.  Guest RAM is preserved in place, albeit with new virtual
addresses.  The user completes the migration by specifying the -incoming
option, and by issuing the migrate-incoming command if necessary.  This
saves and restores VM state, with minimal guest pause time, so that QEMU may
be updated to a new version in between.

The new interfaces are:
  * cpr-exec (MigMode migration parameter)
  * cpr-exec-args (migration parameter)
  * memfd-alloc=on (command-line option for -machine)
  * only-migratable-modes (command-line argument)

The caller sets the mode parameter before invoking the migrate command.

Arguments for the new QEMU process are taken from the cpr-exec-args parameter.
The first argument should be the path of a new QEMU binary, or a prefix
command that exec's the new QEMU binary, and the arguments should include
the -incoming option.

Memory backend objects must have the share=on attribute, and must be mmap'able
in the new QEMU process.  For example, memory-backend-file is acceptable,
but memory-backend-ram is not.

QEMU must be started with the '-machine memfd-alloc=on' option.  This causes
implicit RAM blocks (those not explicitly described by a memory-backend
object) to be allocated by mmap'ing a memfd.  Examples include VGA, ROM,
and even guest RAM when it is specified without reference to a
memory-backend object.  The memfds are kept open across exec, their values
are saved in vmstate which is retrieved after exec, and they are re-mmap'd.
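
As a rough illustration of why a memfd-backed block can be preserved in
place, the standalone sketch below (not code from this series) writes through
one MAP_SHARED mapping of a memfd, drops that mapping, and maps the same
descriptor again.  The munmap stands in for exec discarding the old address
space; the helper names and the 4096-byte size are assumptions made for the
example.  Linux-only, since it relies on memfd_create.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Create an anonymous memfd of the given size; returns the fd or -1. */
static int ram_memfd_create(size_t size)
{
    int fd = memfd_create("ram-sketch", 0);

    if (fd < 0) {
        return -1;
    }
    if (ftruncate(fd, (off_t)size) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Map the memfd shared, so writes land in the memfd's pages. */
static void *ram_map(int fd, size_t size)
{
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    return p == MAP_FAILED ? NULL : p;
}

/*
 * Write msg via one mapping, unmap it, map again, and compare.
 * Returns 1 if the contents survived, 0 if not, -1 on error.
 */
static int ram_survives_remap(const char *msg)
{
    const size_t size = 4096;
    size_t len;
    char *old_map, *new_map;
    int fd, same;

    assert(msg != NULL);
    len = strlen(msg) + 1;
    if (len > size) {
        return -1;
    }
    fd = ram_memfd_create(size);
    if (fd < 0) {
        return -1;
    }
    old_map = ram_map(fd, size);
    if (!old_map) {
        close(fd);
        return -1;
    }
    memcpy(old_map, msg, len);
    munmap(old_map, size);      /* stands in for exec dropping all mappings */

    new_map = ram_map(fd, size);    /* may land at a different address */
    if (!new_map) {
        close(fd);
        return -1;
    }
    same = memcmp(new_map, msg, len) == 0;
    munmap(new_map, size);
    close(fd);
    return same;
}
```

The pages belong to the memfd rather than to any one mapping, so only the
descriptor has to survive for the contents to be reachable again.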

The '-only-migratable-modes cpr-exec' option guarantees that the
configuration supports cpr-exec.  QEMU will exit at start time if not.

Example:

In this example, we simply restart the same version of QEMU, but in
a real scenario one would set a new QEMU binary path in cpr-exec-args.

  # qemu-kvm -monitor stdio -object \
      memory-backend-file,id=ram0,size=4G,mem-path=/dev/shm/ram0,share=on \
      -m 4G -machine memfd-alloc=on ...

  QEMU 9.1.50 monitor - type 'help' for more information
  (qemu) info status
  VM status: running
  (qemu) migrate_set_parameter mode cpr-exec
  (qemu) migrate_set_parameter cpr-exec-args qemu-kvm ... -incoming file:vm.state
  (qemu) migrate -d file:vm.state
  (qemu) QEMU 9.1.50 monitor - type 'help' for more information
  (qemu) info status
  VM status: running

cpr-exec mode preserves attributes of outgoing devices that must be known
before the device is created on the incoming side, such as the memfd descriptor
number, but currently the migration stream is read after all devices are
created.  To solve this problem, I add two VMStateDescription options:
precreate and factory.  precreate objects are saved to their own migration
stream, distinct from the main stream, and are read early by incoming QEMU,
before devices are created.  Factory objects are allocated on demand, without
relying on a pre-registered object's opaque address, which is necessary
because the devices to which the state will apply have not been created yet
and hence have not registered an opaque address to receive the state.
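
The ordering problem can be pictured with a small standalone sketch.  It is
conceptual only and does not use the VMStateDescription interfaces from this
series: every name in it (early_state, early_state_load, early_state_claim,
the fixed-size table) is hypothetical.  State that arrives before its device
exists is parked under a name, and the device claims it by name once it is
created.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define MAX_EARLY 8

/* State that arrived before its owning device existed (hypothetical). */
struct early_state {
    char name[32];
    int memfd;
    int claimed;
};

static struct early_state early[MAX_EARLY];
static size_t n_early;

/*
 * Load side: hand out a slot on demand instead of requiring a
 * pre-registered opaque pointer.  Returns NULL if the table is full.
 */
static struct early_state *early_state_load(const char *name, int memfd)
{
    struct early_state *s;

    assert(name != NULL);
    if (n_early == MAX_EARLY) {
        return NULL;
    }
    s = &early[n_early++];
    snprintf(s->name, sizeof(s->name), "%s", name);
    s->memfd = memfd;
    s->claimed = 0;
    return s;
}

/* Creation side: a device looks up its state by name once it exists. */
static struct early_state *early_state_claim(const char *name)
{
    for (size_t i = 0; i < n_early; i++) {
        if (!early[i].claimed && strcmp(early[i].name, name) == 0) {
            early[i].claimed = 1;
            return &early[i];
        }
    }
    return NULL;
}
```

Keying on a name rather than an address is what lets the load happen before
any device has had a chance to register itself.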

This patch series implements a minimal version of cpr-exec.  Future series
will add support for:
  * vfio
  * chardev's without loss of connectivity
  * vhost
  * fine-grained seccomp controls
  * hostmem-memfd
  * cpr-exec migration test


Steve Sistare (26):
  oslib: qemu_clear_cloexec
  vl: helper to request re-exec
  migration: SAVEVM_FOREACH
  migration: delete unused parameter mis
  migration: precreate vmstate
  migration: precreate vmstate for exec
  migration: VMStateId
  migration: vmstate_info_void_ptr
  migration: vmstate_register_named
  migration: vmstate_unregister_named
  migration: vmstate_register at init time
  migration: vmstate factory object
  physmem: ram_block_create
  physmem: hoist guest_memfd creation
  physmem: hoist host memory allocation
  physmem: set ram block idstr earlier
  machine: memfd-alloc option
  migration: cpr-exec-args parameter
  physmem: preserve ram blocks for cpr
  migration: cpr-exec mode
  migration: migrate_add_blocker_mode
  migration: ram block cpr-exec blockers
  migration: misc cpr-exec blockers
  seccomp: cpr-exec blocker
  migration: fix mismatched GPAs during cpr-exec
  migration: only-migratable-modes

 accel/xen/xen-all.c            |   5 +
 backends/hostmem-epc.c         |  12 +-
 hmp-commands.hx                |   2 +-
 hw/core/machine.c              |  22 +++
 hw/core/qdev.c                 |   1 +
 hw/intc/apic_common.c          |   2 +-
 hw/vfio/migration.c            |   3 +-
 include/exec/cpu-common.h      |   3 +-
 include/exec/memory.h          |  15 ++
 include/exec/ramblock.h        |  10 +-
 include/hw/boards.h            |   1 +
 include/migration/blocker.h    |   7 +
 include/migration/cpr.h        |  14 ++
 include/migration/misc.h       |  11 ++
 include/migration/vmstate.h    | 133 +++++++++++++++-
 include/qemu/osdep.h           |   9 ++
 include/sysemu/runstate.h      |   3 +
 include/sysemu/seccomp.h       |   1 +
 include/sysemu/sysemu.h        |   1 -
 migration/cpr.c                | 131 ++++++++++++++++
 migration/meson.build          |   3 +
 migration/migration-hmp-cmds.c |  50 +++++-
 migration/migration.c          |  48 +++++-
 migration/migration.h          |   5 +-
 migration/options.c            |  13 ++
 migration/precreate.c          | 139 +++++++++++++++++
 migration/ram.c                |  16 +-
 migration/savevm.c             | 306 +++++++++++++++++++++++++++++-------
 migration/savevm.h             |   3 +
 migration/trace-events         |   7 +
 migration/vmstate-factory.c    |  78 ++++++++++
 migration/vmstate-types.c      |  24 +++
 migration/vmstate.c            |   3 +-
 qapi/migration.json            |  48 +++++-
 qemu-options.hx                |  22 ++-
 replay/replay.c                |   6 +
 stubs/migr-blocker.c           |   5 +
 stubs/vmstate.c                |  13 ++
 system/globals.c               |   1 -
 system/memory.c                |  19 ++-
 system/physmem.c               | 346 +++++++++++++++++++++++++++--------------
 system/qemu-seccomp.c          |  10 +-
 system/runstate.c              |  29 ++++
 system/trace-events            |   4 +
 system/vl.c                    |  26 +++-
 target/s390x/cpu_models.c      |   4 +-
 util/oslib-posix.c             |   9 ++
 util/oslib-win32.c             |   4 +
 48 files changed, 1417 insertions(+), 210 deletions(-)
 create mode 100644 include/migration/cpr.h
 create mode 100644 migration/cpr.c
 create mode 100644 migration/precreate.c
 create mode 100644 migration/vmstate-factory.c

-- 
1.8.3.1




* [PATCH V1 01/26] oslib: qemu_clear_cloexec
  2024-04-29 15:55 [PATCH V1 00/26] Live update: cpr-exec Steve Sistare
@ 2024-04-29 15:55 ` Steve Sistare
  2024-05-06 23:27   ` Fabiano Rosas
  2024-04-29 15:55 ` [PATCH V1 02/26] vl: helper to request re-exec Steve Sistare
                   ` (28 subsequent siblings)
  29 siblings, 1 reply; 122+ messages in thread
From: Steve Sistare @ 2024-04-29 15:55 UTC (permalink / raw)
  To: qemu-devel
  Cc: Peter Xu, Fabiano Rosas, David Hildenbrand, Igor Mammedov,
	Eduardo Habkost, Marcel Apfelbaum, Philippe Mathieu-Daude,
	Paolo Bonzini, Daniel P. Berrange, Markus Armbruster,
	Steve Sistare

Define qemu_clear_cloexec, analogous to qemu_set_cloexec.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
---
 include/qemu/osdep.h | 9 +++++++++
 util/oslib-posix.c   | 9 +++++++++
 util/oslib-win32.c   | 4 ++++
 3 files changed, 22 insertions(+)
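
For readers unfamiliar with the flag, here is a small standalone sketch (not
part of the patch) of the same F_GETFD/F_SETFD read-modify-write, plus a
helper to observe the result.  Unlike the patch, which asserts on failure,
this variant returns an error to the caller; the names fd_has_cloexec and
clear_cloexec are made up for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Returns 1 if FD_CLOEXEC is set on fd, 0 if clear, -1 on error. */
static int fd_has_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    return flags == -1 ? -1 : !!(flags & FD_CLOEXEC);
}

/*
 * Clear FD_CLOEXEC so the descriptor stays open across exec.
 * Returns 0 on success, -1 on error.
 */
static int clear_cloexec(int fd)
{
    int flags;

    assert(fd >= 0);
    flags = fcntl(fd, F_GETFD);
    if (flags == -1) {
        return -1;
    }
    /* Preserve any other descriptor flags; drop only FD_CLOEXEC. */
    return fcntl(fd, F_SETFD, flags & ~FD_CLOEXEC) == -1 ? -1 : 0;
}
```

A descriptor opened with O_CLOEXEC would report 1 from fd_has_cloexec before
the call and 0 after it.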

diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
index c7053cd..b58f312 100644
--- a/include/qemu/osdep.h
+++ b/include/qemu/osdep.h
@@ -660,6 +660,15 @@ ssize_t qemu_write_full(int fd, const void *buf, size_t count)
 
 void qemu_set_cloexec(int fd);
 
+/*
+ * Clear FD_CLOEXEC for a descriptor.
+ *
+ * The caller must guarantee that no other fork+exec's occur before the
+ * exec that is intended to inherit this descriptor, eg by suspending CPUs
+ * and blocking monitor commands.
+ */
+void qemu_clear_cloexec(int fd);
+
 /* Return a dynamically allocated directory path that is appropriate for storing
  * local state.
  *
diff --git a/util/oslib-posix.c b/util/oslib-posix.c
index e764416..614c3e5 100644
--- a/util/oslib-posix.c
+++ b/util/oslib-posix.c
@@ -272,6 +272,15 @@ int qemu_socketpair(int domain, int type, int protocol, int sv[2])
     return ret;
 }
 
+void qemu_clear_cloexec(int fd)
+{
+    int f;
+    f = fcntl(fd, F_GETFD);
+    assert(f != -1);
+    f = fcntl(fd, F_SETFD, f & ~FD_CLOEXEC);
+    assert(f != -1);
+}
+
 char *
 qemu_get_local_state_dir(void)
 {
diff --git a/util/oslib-win32.c b/util/oslib-win32.c
index b623830..c3e969a 100644
--- a/util/oslib-win32.c
+++ b/util/oslib-win32.c
@@ -222,6 +222,10 @@ void qemu_set_cloexec(int fd)
 {
 }
 
+void qemu_clear_cloexec(int fd)
+{
+}
+
 int qemu_get_thread_id(void)
 {
     return GetCurrentThreadId();
-- 
1.8.3.1




* [PATCH V1 02/26] vl: helper to request re-exec
  2024-04-29 15:55 [PATCH V1 00/26] Live update: cpr-exec Steve Sistare
  2024-04-29 15:55 ` [PATCH V1 01/26] oslib: qemu_clear_cloexec Steve Sistare
@ 2024-04-29 15:55 ` Steve Sistare
  2024-04-29 15:55 ` [PATCH V1 03/26] migration: SAVEVM_FOREACH Steve Sistare
                   ` (27 subsequent siblings)
  29 siblings, 0 replies; 122+ messages in thread
From: Steve Sistare @ 2024-04-29 15:55 UTC (permalink / raw)
  To: qemu-devel
  Cc: Peter Xu, Fabiano Rosas, David Hildenbrand, Igor Mammedov,
	Eduardo Habkost, Marcel Apfelbaum, Philippe Mathieu-Daude,
	Paolo Bonzini, Daniel P. Berrange, Markus Armbruster,
	Steve Sistare

Add a qemu_system_exec_request() hook that causes the main loop to exit and
re-exec QEMU using the specified arguments.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 include/sysemu/runstate.h |  3 +++
 system/runstate.c         | 29 +++++++++++++++++++++++++++++
 2 files changed, 32 insertions(+)
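
The request/poll shape of the hook can be sketched in isolation as below.
This is not the patch's code: it uses a plain NULL-terminated string array
instead of strList, and exec_request, exec_poll and fake_exec are names
invented for the example.  fake_exec stands in for a callback whose exec
call failed, which is the only case in which such a callback returns.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

typedef void (*exec_func_t)(char **argv);

static exec_func_t pending_func;
static char **pending_argv;

static void argv_free(char **argv)
{
    if (!argv) {
        return;
    }
    for (size_t i = 0; argv[i]; i++) {
        free(argv[i]);
    }
    free(argv);
}

/* Deep-copy a NULL-terminated argument vector; NULL on allocation failure. */
static char **argv_dup(const char *const *args)
{
    size_t n = 0;
    char **copy;

    while (args[n]) {
        n++;
    }
    copy = calloc(n + 1, sizeof(*copy));
    if (!copy) {
        return NULL;
    }
    for (size_t i = 0; i < n; i++) {
        copy[i] = strdup(args[i]);
        if (!copy[i]) {
            argv_free(copy);
            return NULL;
        }
    }
    return copy;
}

/* Record the request; the main loop acts on it later. */
static void exec_request(exec_func_t func, const char *const *args)
{
    assert(func != NULL && args != NULL);
    pending_func = func;
    pending_argv = argv_dup(args);
}

/* Called from the main loop; returns true if a request was handled. */
static bool exec_poll(void)
{
    if (!pending_argv) {
        return false;
    }
    pending_func(pending_argv);     /* returns only if exec failed */
    argv_free(pending_argv);
    pending_argv = NULL;
    pending_func = NULL;
    return true;
}

/* Stand-in callback: counts its arguments and "fails" by returning. */
static int fake_exec_argc;
static void fake_exec(char **argv)
{
    fake_exec_argc = 0;
    while (argv[fake_exec_argc]) {
        fake_exec_argc++;
    }
}
```

Deferring the exec to the main loop means the caller that issues the request
can finish its own work first, and a failed exec leaves the process running
with the pending request cleared.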

diff --git a/include/sysemu/runstate.h b/include/sysemu/runstate.h
index 0117d24..cb669cf 100644
--- a/include/sysemu/runstate.h
+++ b/include/sysemu/runstate.h
@@ -80,6 +80,8 @@ typedef enum WakeupReason {
     QEMU_WAKEUP_REASON_OTHER,
 } WakeupReason;
 
+typedef void (*qemu_exec_func)(char **exec_argv);
+
 void qemu_system_reset_request(ShutdownCause reason);
 void qemu_system_suspend_request(void);
 void qemu_register_suspend_notifier(Notifier *notifier);
@@ -91,6 +93,7 @@ void qemu_register_wakeup_support(void);
 void qemu_system_shutdown_request_with_code(ShutdownCause reason,
                                             int exit_code);
 void qemu_system_shutdown_request(ShutdownCause reason);
+void qemu_system_exec_request(qemu_exec_func func, const strList *args);
 void qemu_system_powerdown_request(void);
 void qemu_register_powerdown_notifier(Notifier *notifier);
 void qemu_register_shutdown_notifier(Notifier *notifier);
diff --git a/system/runstate.c b/system/runstate.c
index cb4905a..0de0c6e 100644
--- a/system/runstate.c
+++ b/system/runstate.c
@@ -40,6 +40,7 @@
 #include "qapi/error.h"
 #include "qapi/qapi-commands-run-state.h"
 #include "qapi/qapi-events-run-state.h"
+#include "qapi/type-helpers.h"
 #include "qemu/accel.h"
 #include "qemu/error-report.h"
 #include "qemu/job.h"
@@ -401,6 +402,8 @@ static NotifierList wakeup_notifiers =
 static NotifierList shutdown_notifiers =
     NOTIFIER_LIST_INITIALIZER(shutdown_notifiers);
 static uint32_t wakeup_reason_mask = ~(1 << QEMU_WAKEUP_REASON_NONE);
+qemu_exec_func exec_func;
+static char **exec_argv;
 
 ShutdownCause qemu_shutdown_requested_get(void)
 {
@@ -417,6 +420,11 @@ static int qemu_shutdown_requested(void)
     return qatomic_xchg(&shutdown_requested, SHUTDOWN_CAUSE_NONE);
 }
 
+static int qemu_exec_requested(void)
+{
+    return exec_argv != NULL;
+}
+
 static void qemu_kill_report(void)
 {
     if (!qtest_driver() && shutdown_signal) {
@@ -694,6 +702,23 @@ void qemu_system_shutdown_request(ShutdownCause reason)
     qemu_notify_event();
 }
 
+static void qemu_system_exec(void)
+{
+    exec_func(exec_argv);
+
+    /* exec failed */
+    g_strfreev(exec_argv);
+    exec_argv = NULL;
+    exec_func = NULL;
+}
+
+void qemu_system_exec_request(qemu_exec_func func, const strList *args)
+{
+    exec_func = func;
+    exec_argv = strv_from_str_list(args);
+    qemu_notify_event();
+}
+
 static void qemu_system_powerdown(void)
 {
     qapi_event_send_powerdown();
@@ -740,6 +765,10 @@ static bool main_loop_should_exit(int *status)
     if (qemu_suspend_requested()) {
         qemu_system_suspend();
     }
+    if (qemu_exec_requested()) {
+        qemu_system_exec();
+        return false;
+    }
     request = qemu_shutdown_requested();
     if (request) {
         qemu_kill_report();
-- 
1.8.3.1




* [PATCH V1 03/26] migration: SAVEVM_FOREACH
  2024-04-29 15:55 [PATCH V1 00/26] Live update: cpr-exec Steve Sistare
  2024-04-29 15:55 ` [PATCH V1 01/26] oslib: qemu_clear_cloexec Steve Sistare
  2024-04-29 15:55 ` [PATCH V1 02/26] vl: helper to request re-exec Steve Sistare
@ 2024-04-29 15:55 ` Steve Sistare
  2024-05-06 23:17   ` Fabiano Rosas
  2024-04-29 15:55 ` [PATCH V1 04/26] migration: delete unused parameter mis Steve Sistare
                   ` (26 subsequent siblings)
  29 siblings, 1 reply; 122+ messages in thread
From: Steve Sistare @ 2024-04-29 15:55 UTC (permalink / raw)
  To: qemu-devel
  Cc: Peter Xu, Fabiano Rosas, David Hildenbrand, Igor Mammedov,
	Eduardo Habkost, Marcel Apfelbaum, Philippe Mathieu-Daude,
	Paolo Bonzini, Daniel P. Berrange, Markus Armbruster,
	Steve Sistare

Define an abstraction SAVEVM_FOREACH to loop over all savevm state
handlers, and use it in place of QTAILQ_FOREACH.  Also define _ALL
variants, so that a subsequent patch can distinguish looping over all
handlers from looping over a subset of them.  For now the two are
identical.  No functional change.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 migration/savevm.c | 55 +++++++++++++++++++++++++++++++-----------------------
 1 file changed, 32 insertions(+), 23 deletions(-)

diff --git a/migration/savevm.c b/migration/savevm.c
index 4509482..6829ba3 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -237,6 +237,15 @@ static SaveState savevm_state = {
     .global_section_id = 0,
 };
 
+#define SAVEVM_FOREACH(se, entry)                                    \
+    QTAILQ_FOREACH(se, &savevm_state.handlers, entry)                \
+
+#define SAVEVM_FOREACH_ALL(se, entry)                                \
+    QTAILQ_FOREACH(se, &savevm_state.handlers, entry)
+
+#define SAVEVM_FOREACH_SAFE_ALL(se, entry, new_se)                   \
+    QTAILQ_FOREACH_SAFE(se, &savevm_state.handlers, entry, new_se)
+
 static SaveStateEntry *find_se(const char *idstr, uint32_t instance_id);
 
 static bool should_validate_capability(int capability)
@@ -674,7 +683,7 @@ static uint32_t calculate_new_instance_id(const char *idstr)
     SaveStateEntry *se;
     uint32_t instance_id = 0;
 
-    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+    SAVEVM_FOREACH_ALL(se, entry) {
         if (strcmp(idstr, se->idstr) == 0
             && instance_id <= se->instance_id) {
             instance_id = se->instance_id + 1;
@@ -690,7 +699,7 @@ static int calculate_compat_instance_id(const char *idstr)
     SaveStateEntry *se;
     int instance_id = 0;
 
-    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+    SAVEVM_FOREACH(se, entry) {
         if (!se->compat) {
             continue;
         }
@@ -816,7 +825,7 @@ void unregister_savevm(VMStateIf *obj, const char *idstr, void *opaque)
     }
     pstrcat(id, sizeof(id), idstr);
 
-    QTAILQ_FOREACH_SAFE(se, &savevm_state.handlers, entry, new_se) {
+    SAVEVM_FOREACH_SAFE_ALL(se, entry, new_se) {
         if (strcmp(se->idstr, id) == 0 && se->opaque == opaque) {
             savevm_state_handler_remove(se);
             g_free(se->compat);
@@ -939,7 +948,7 @@ void vmstate_unregister(VMStateIf *obj, const VMStateDescription *vmsd,
 {
     SaveStateEntry *se, *new_se;
 
-    QTAILQ_FOREACH_SAFE(se, &savevm_state.handlers, entry, new_se) {
+    SAVEVM_FOREACH_SAFE_ALL(se, entry, new_se) {
         if (se->vmsd == vmsd && se->opaque == opaque) {
             savevm_state_handler_remove(se);
             g_free(se->compat);
@@ -1223,7 +1232,7 @@ bool qemu_savevm_state_blocked(Error **errp)
 {
     SaveStateEntry *se;
 
-    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+    SAVEVM_FOREACH(se, entry) {
         if (se->vmsd && se->vmsd->unmigratable) {
             error_setg(errp, "State blocked by non-migratable device '%s'",
                        se->idstr);
@@ -1237,7 +1246,7 @@ void qemu_savevm_non_migratable_list(strList **reasons)
 {
     SaveStateEntry *se;
 
-    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+    SAVEVM_FOREACH(se, entry) {
         if (se->vmsd && se->vmsd->unmigratable) {
             QAPI_LIST_PREPEND(*reasons,
                               g_strdup_printf("non-migratable device: %s",
@@ -1276,7 +1285,7 @@ bool qemu_savevm_state_guest_unplug_pending(void)
 {
     SaveStateEntry *se;
 
-    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+    SAVEVM_FOREACH(se, entry) {
         if (se->vmsd && se->vmsd->dev_unplug_pending &&
             se->vmsd->dev_unplug_pending(se->opaque)) {
             return true;
@@ -1291,7 +1300,7 @@ int qemu_savevm_state_prepare(Error **errp)
     SaveStateEntry *se;
     int ret;
 
-    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+    SAVEVM_FOREACH(se, entry) {
         if (!se->ops || !se->ops->save_prepare) {
             continue;
         }
@@ -1321,7 +1330,7 @@ int qemu_savevm_state_setup(QEMUFile *f, Error **errp)
     json_writer_start_array(ms->vmdesc, "devices");
 
     trace_savevm_state_setup();
-    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+    SAVEVM_FOREACH(se, entry) {
         if (se->vmsd && se->vmsd->early_setup) {
             ret = vmstate_save(f, se, ms->vmdesc, errp);
             if (ret) {
@@ -1365,7 +1374,7 @@ int qemu_savevm_state_resume_prepare(MigrationState *s)
 
     trace_savevm_state_resume_prepare();
 
-    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+    SAVEVM_FOREACH(se, entry) {
         if (!se->ops || !se->ops->resume_prepare) {
             continue;
         }
@@ -1396,7 +1405,7 @@ int qemu_savevm_state_iterate(QEMUFile *f, bool postcopy)
     int ret;
 
     trace_savevm_state_iterate();
-    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+    SAVEVM_FOREACH(se, entry) {
         if (!se->ops || !se->ops->save_live_iterate) {
             continue;
         }
@@ -1461,7 +1470,7 @@ void qemu_savevm_state_complete_postcopy(QEMUFile *f)
     SaveStateEntry *se;
     int ret;
 
-    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+    SAVEVM_FOREACH(se, entry) {
         if (!se->ops || !se->ops->save_live_complete_postcopy) {
             continue;
         }
@@ -1495,7 +1504,7 @@ int qemu_savevm_state_complete_precopy_iterable(QEMUFile *f, bool in_postcopy)
     SaveStateEntry *se;
     int ret;
 
-    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+    SAVEVM_FOREACH(se, entry) {
         if (!se->ops ||
             (in_postcopy && se->ops->has_postcopy &&
              se->ops->has_postcopy(se->opaque)) ||
@@ -1543,7 +1552,7 @@ int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
     Error *local_err = NULL;
     int ret;
 
-    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+    SAVEVM_FOREACH(se, entry) {
         if (se->vmsd && se->vmsd->early_setup) {
             /* Already saved during qemu_savevm_state_setup(). */
             continue;
@@ -1649,7 +1658,7 @@ void qemu_savevm_state_pending_estimate(uint64_t *must_precopy,
     *must_precopy = 0;
     *can_postcopy = 0;
 
-    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+    SAVEVM_FOREACH(se, entry) {
         if (!se->ops || !se->ops->state_pending_estimate) {
             continue;
         }
@@ -1670,7 +1679,7 @@ void qemu_savevm_state_pending_exact(uint64_t *must_precopy,
     *must_precopy = 0;
     *can_postcopy = 0;
 
-    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+    SAVEVM_FOREACH(se, entry) {
         if (!se->ops || !se->ops->state_pending_exact) {
             continue;
         }
@@ -1693,7 +1702,7 @@ void qemu_savevm_state_cleanup(void)
     }
 
     trace_savevm_state_cleanup();
-    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+    SAVEVM_FOREACH(se, entry) {
         if (se->ops && se->ops->save_cleanup) {
             se->ops->save_cleanup(se->opaque);
         }
@@ -1778,7 +1787,7 @@ int qemu_save_device_state(QEMUFile *f)
     }
     cpu_synchronize_all_states();
 
-    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+    SAVEVM_FOREACH(se, entry) {
         int ret;
 
         if (se->is_ram) {
@@ -1801,7 +1810,7 @@ static SaveStateEntry *find_se(const char *idstr, uint32_t instance_id)
 {
     SaveStateEntry *se;
 
-    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+    SAVEVM_FOREACH_ALL(se, entry) {
         if (!strcmp(se->idstr, idstr) &&
             (instance_id == se->instance_id ||
              instance_id == se->alias_id))
@@ -2680,7 +2689,7 @@ qemu_loadvm_section_part_end(QEMUFile *f, MigrationIncomingState *mis,
     }
 
     trace_qemu_loadvm_state_section_partend(section_id);
-    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+    SAVEVM_FOREACH(se, entry) {
         if (se->load_section_id == section_id) {
             break;
         }
@@ -2755,7 +2764,7 @@ static void qemu_loadvm_state_switchover_ack_needed(MigrationIncomingState *mis)
 {
     SaveStateEntry *se;
 
-    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+    SAVEVM_FOREACH(se, entry) {
         if (!se->ops || !se->ops->switchover_ack_needed) {
             continue;
         }
@@ -2775,7 +2784,7 @@ static int qemu_loadvm_state_setup(QEMUFile *f, Error **errp)
     int ret;
 
     trace_loadvm_state_setup();
-    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+    SAVEVM_FOREACH(se, entry) {
         if (!se->ops || !se->ops->load_setup) {
             continue;
         }
@@ -2801,7 +2810,7 @@ void qemu_loadvm_state_cleanup(void)
     SaveStateEntry *se;
 
     trace_loadvm_state_cleanup();
-    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+    SAVEVM_FOREACH(se, entry) {
         if (se->ops && se->ops->load_cleanup) {
             se->ops->load_cleanup(se->opaque);
         }
-- 
1.8.3.1




* [PATCH V1 04/26] migration: delete unused parameter mis
  2024-04-29 15:55 [PATCH V1 00/26] Live update: cpr-exec Steve Sistare
                   ` (2 preceding siblings ...)
  2024-04-29 15:55 ` [PATCH V1 03/26] migration: SAVEVM_FOREACH Steve Sistare
@ 2024-04-29 15:55 ` Steve Sistare
  2024-05-06 21:50   ` Fabiano Rosas
  2024-05-27 18:02   ` Peter Xu
  2024-04-29 15:55 ` [PATCH V1 05/26] migration: precreate vmstate Steve Sistare
                   ` (25 subsequent siblings)
  29 siblings, 2 replies; 122+ messages in thread
From: Steve Sistare @ 2024-04-29 15:55 UTC (permalink / raw)
  To: qemu-devel
  Cc: Peter Xu, Fabiano Rosas, David Hildenbrand, Igor Mammedov,
	Eduardo Habkost, Marcel Apfelbaum, Philippe Mathieu-Daude,
	Paolo Bonzini, Daniel P. Berrange, Markus Armbruster,
	Steve Sistare

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 migration/savevm.c | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/migration/savevm.c b/migration/savevm.c
index 6829ba3..9789823 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -2591,8 +2591,7 @@ static bool check_section_footer(QEMUFile *f, SaveStateEntry *se)
 }
 
 static int
-qemu_loadvm_section_start_full(QEMUFile *f, MigrationIncomingState *mis,
-                               uint8_t type)
+qemu_loadvm_section_start_full(QEMUFile *f, uint8_t type)
 {
     bool trace_downtime = (type == QEMU_VM_SECTION_FULL);
     uint32_t instance_id, version_id, section_id;
@@ -2670,8 +2669,7 @@ qemu_loadvm_section_start_full(QEMUFile *f, MigrationIncomingState *mis,
 }
 
 static int
-qemu_loadvm_section_part_end(QEMUFile *f, MigrationIncomingState *mis,
-                             uint8_t type)
+qemu_loadvm_section_part_end(QEMUFile *f, uint8_t type)
 {
     bool trace_downtime = (type == QEMU_VM_SECTION_END);
     int64_t start_ts, end_ts;
@@ -2906,14 +2904,14 @@ retry:
         switch (section_type) {
         case QEMU_VM_SECTION_START:
         case QEMU_VM_SECTION_FULL:
-            ret = qemu_loadvm_section_start_full(f, mis, section_type);
+            ret = qemu_loadvm_section_start_full(f, section_type);
             if (ret < 0) {
                 goto out;
             }
             break;
         case QEMU_VM_SECTION_PART:
         case QEMU_VM_SECTION_END:
-            ret = qemu_loadvm_section_part_end(f, mis, section_type);
+            ret = qemu_loadvm_section_part_end(f, section_type);
             if (ret < 0) {
                 goto out;
             }
-- 
1.8.3.1




* [PATCH V1 05/26] migration: precreate vmstate
  2024-04-29 15:55 [PATCH V1 00/26] Live update: cpr-exec Steve Sistare
                   ` (3 preceding siblings ...)
  2024-04-29 15:55 ` [PATCH V1 04/26] migration: delete unused parameter mis Steve Sistare
@ 2024-04-29 15:55 ` Steve Sistare
  2024-05-07 21:02   ` Fabiano Rosas
                     ` (2 more replies)
  2024-04-29 15:55 ` [PATCH V1 06/26] migration: precreate vmstate for exec Steve Sistare
                   ` (24 subsequent siblings)
  29 siblings, 3 replies; 122+ messages in thread
From: Steve Sistare @ 2024-04-29 15:55 UTC (permalink / raw)
  To: qemu-devel
  Cc: Peter Xu, Fabiano Rosas, David Hildenbrand, Igor Mammedov,
	Eduardo Habkost, Marcel Apfelbaum, Philippe Mathieu-Daude,
	Paolo Bonzini, Daniel P. Berrange, Markus Armbruster,
	Steve Sistare

Provide the VMStateDescription precreate field to mark objects that must
be loaded on the incoming side before devices have been created, because
they provide properties that will be needed at creation time.  They will
be saved to and loaded from their own QEMUFile, via
qemu_savevm_precreate_save and qemu_savevm_precreate_load, but these
functions are not yet called in this patch.  Allow them to be called
before or after normal migration is active, when current_migration and
current_incoming are not valid.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 include/migration/vmstate.h |  6 ++++
 migration/savevm.c          | 69 +++++++++++++++++++++++++++++++++++++++++----
 migration/savevm.h          |  3 ++
 3 files changed, 73 insertions(+), 5 deletions(-)
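
The filtering in this patch relies on a loop macro that ends in a bare "if".
A standalone sketch of that idiom, using a hand-rolled singly linked list
rather than QTAILQ, is shown below.  The struct and macro names are
assumptions for the example and are not taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct entry {
    const char *name;
    bool precreate;
    struct entry *next;
};

/* Visit every entry. */
#define ENTRY_FOREACH_ALL(e, head)                                   \
    for ((e) = (head); (e) != NULL; (e) = (e)->next)

/* Visit only entries that are not marked precreate. */
#define ENTRY_FOREACH(e, head)                                       \
    ENTRY_FOREACH_ALL(e, head)                                       \
        if (!(e)->precreate)

static int count_main_stream(struct entry *head)
{
    struct entry *e;
    int n = 0;

    ENTRY_FOREACH(e, head) {
        n++;
    }
    return n;
}

static int count_all(struct entry *head)
{
    struct entry *e;
    int n = 0;

    ENTRY_FOREACH_ALL(e, head) {
        n++;
    }
    return n;
}
```

Because the caller's block becomes the body of the hidden "if", break and
continue still act on the enclosing for loop.  The usual caveat applies: an
"else" written directly after the loop body would bind to the macro's "if".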

diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
index 294d2d8..4691334 100644
--- a/include/migration/vmstate.h
+++ b/include/migration/vmstate.h
@@ -198,6 +198,12 @@ struct VMStateDescription {
      * a QEMU_VM_SECTION_START section.
      */
     bool early_setup;
+
+    /*
+     * Send/receive this object in the precreate migration stream.
+     */
+    bool precreate;
+
     int version_id;
     int minimum_version_id;
     MigrationPriority priority;
diff --git a/migration/savevm.c b/migration/savevm.c
index 9789823..a30bcd9 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -239,6 +239,7 @@ static SaveState savevm_state = {
 
 #define SAVEVM_FOREACH(se, entry)                                    \
     QTAILQ_FOREACH(se, &savevm_state.handlers, entry)                \
+        if (!se->vmsd || !se->vmsd->precreate)
 
 #define SAVEVM_FOREACH_ALL(se, entry)                                \
     QTAILQ_FOREACH(se, &savevm_state.handlers, entry)
@@ -1006,13 +1007,19 @@ static void save_section_header(QEMUFile *f, SaveStateEntry *se,
     }
 }
 
+static bool send_section_footer(SaveStateEntry *se)
+{
+    return (se->vmsd && se->vmsd->precreate) ||
+           migrate_get_current()->send_section_footer;
+}
+
 /*
  * Write a footer onto device sections that catches cases misformatted device
  * sections.
  */
 static void save_section_footer(QEMUFile *f, SaveStateEntry *se)
 {
-    if (migrate_get_current()->send_section_footer) {
+    if (send_section_footer(se)) {
         qemu_put_byte(f, QEMU_VM_SECTION_FOOTER);
         qemu_put_be32(f, se->section_id);
     }
@@ -1319,6 +1326,52 @@ int qemu_savevm_state_prepare(Error **errp)
     return 0;
 }
 
+int qemu_savevm_precreate_save(QEMUFile *f, Error **errp)
+{
+    int ret;
+    SaveStateEntry *se;
+
+    qemu_put_be32(f, QEMU_VM_FILE_MAGIC);
+    qemu_put_be32(f, QEMU_VM_FILE_VERSION);
+
+    SAVEVM_FOREACH_ALL(se, entry) {
+        if (se->vmsd && se->vmsd->precreate) {
+            ret = vmstate_save(f, se, NULL, errp);
+            if (ret) {
+                qemu_file_set_error(f, ret);
+                return ret;
+            }
+        }
+    }
+    qemu_fflush(f);
+    return 0;
+}
+
+int qemu_savevm_precreate_load(QEMUFile *f, Error **errp)
+{
+    unsigned int v;
+    int ret;
+
+    v = qemu_get_be32(f);
+    if (v != QEMU_VM_FILE_MAGIC) {
+        error_setg(errp, "Not a migration stream");
+        return -EINVAL;
+    }
+
+    v = qemu_get_be32(f);
+    if (v != QEMU_VM_FILE_VERSION) {
+        error_setg(errp, "Unsupported migration stream version");
+        return -ENOTSUP;
+    }
+
+    ret = qemu_loadvm_state_main(f, NULL);
+    if (ret) {
+        error_setg_errno(errp, -ret, "qemu_savevm_precreate_load");
+    }
+
+    return ret;
+}
+
 int qemu_savevm_state_setup(QEMUFile *f, Error **errp)
 {
     ERRP_GUARD();
@@ -2559,7 +2612,7 @@ static bool check_section_footer(QEMUFile *f, SaveStateEntry *se)
     uint8_t read_mark;
     uint32_t read_section_id;
 
-    if (!migrate_get_current()->send_section_footer) {
+    if (!send_section_footer(se)) {
         /* No footer to check */
         return true;
     }
@@ -2895,9 +2948,12 @@ retry:
     while (true) {
         section_type = qemu_get_byte(f);
 
-        ret = qemu_file_get_error_obj_any(f, mis->postcopy_qemufile_dst, NULL);
-        if (ret) {
-            break;
+        if (mis) {
+            ret = qemu_file_get_error_obj_any(f, mis->postcopy_qemufile_dst,
+                                              NULL);
+            if (ret) {
+                break;
+            }
         }
 
         trace_qemu_loadvm_state_section(section_type);
@@ -2936,6 +2992,9 @@ retry:
 out:
     if (ret < 0) {
         qemu_file_set_error(f, ret);
+        if (!mis) {
+            return ret;
+        }
 
         /* Cancel bitmaps incoming regardless of recovery */
         dirty_bitmap_mig_cancel_incoming();
diff --git a/migration/savevm.h b/migration/savevm.h
index 9ec96a9..6f207b5 100644
--- a/migration/savevm.h
+++ b/migration/savevm.h
@@ -70,4 +70,7 @@ int qemu_loadvm_approve_switchover(void);
 int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
         bool in_postcopy, bool inactivate_disks);
 
+int qemu_savevm_precreate_save(QEMUFile *f, Error **errp);
+int qemu_savevm_precreate_load(QEMUFile *f, Error **errp);
+
 #endif
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 122+ messages in thread

* [PATCH V1 06/26] migration: precreate vmstate for exec
  2024-04-29 15:55 [PATCH V1 00/26] Live update: cpr-exec Steve Sistare
                   ` (4 preceding siblings ...)
  2024-04-29 15:55 ` [PATCH V1 05/26] migration: precreate vmstate Steve Sistare
@ 2024-04-29 15:55 ` Steve Sistare
  2024-05-06 23:34   ` Fabiano Rosas
  2024-04-29 15:55 ` [PATCH V1 07/26] migration: VMStateId Steve Sistare
                   ` (23 subsequent siblings)
  29 siblings, 1 reply; 122+ messages in thread
From: Steve Sistare @ 2024-04-29 15:55 UTC (permalink / raw)
  To: qemu-devel
  Cc: Peter Xu, Fabiano Rosas, David Hildenbrand, Igor Mammedov,
	Eduardo Habkost, Marcel Apfelbaum, Philippe Mathieu-Daude,
	Paolo Bonzini, Daniel P. Berrange, Markus Armbruster,
	Steve Sistare

Provide migration_precreate_save for saving precreate vmstate across exec.
Create a memfd, save its file descriptor number in the environment, and
serialize state to it.  Reverse the process in migration_precreate_load.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 include/migration/misc.h |   5 ++
 migration/meson.build    |   1 +
 migration/precreate.c    | 139 +++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 145 insertions(+)
 create mode 100644 migration/precreate.c

diff --git a/include/migration/misc.h b/include/migration/misc.h
index c9e200f..cf30351 100644
--- a/include/migration/misc.h
+++ b/include/migration/misc.h
@@ -56,6 +56,11 @@ AnnounceParameters *migrate_announce_params(void);
 
 void dump_vmstate_json_to_file(FILE *out_fp);
 
+/* migration/precreate.c */
+int migration_precreate_save(Error **errp);
+void migration_precreate_unsave(void);
+int migration_precreate_load(Error **errp);
+
 /* migration/migration.c */
 void migration_object_init(void);
 void migration_shutdown(void);
diff --git a/migration/meson.build b/migration/meson.build
index f76b1ba..50e7cb2 100644
--- a/migration/meson.build
+++ b/migration/meson.build
@@ -26,6 +26,7 @@ system_ss.add(files(
   'ram-compress.c',
   'options.c',
   'postcopy-ram.c',
+  'precreate.c',
   'savevm.c',
   'socket.c',
   'tls.c',
diff --git a/migration/precreate.c b/migration/precreate.c
new file mode 100644
index 0000000..0bf5e1f
--- /dev/null
+++ b/migration/precreate.c
@@ -0,0 +1,139 @@
+/*
+ * Copyright (c) 2022, 2024 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/cutils.h"
+#include "qemu/memfd.h"
+#include "qapi/error.h"
+#include "io/channel-file.h"
+#include "migration/misc.h"
+#include "migration/qemu-file.h"
+#include "migration/savevm.h"
+
+#define PRECREATE_STATE_NAME "QEMU_PRECREATE_STATE"
+
+static QEMUFile *qemu_file_new_fd_input(int fd, const char *name)
+{
+    g_autoptr(QIOChannelFile) fioc = qio_channel_file_new_fd(fd);
+    QIOChannel *ioc = QIO_CHANNEL(fioc);
+    qio_channel_set_name(ioc, name);
+    return qemu_file_new_input(ioc);
+}
+
+static QEMUFile *qemu_file_new_fd_output(int fd, const char *name)
+{
+    g_autoptr(QIOChannelFile) fioc = qio_channel_file_new_fd(fd);
+    QIOChannel *ioc = QIO_CHANNEL(fioc);
+    qio_channel_set_name(ioc, name);
+    return qemu_file_new_output(ioc);
+}
+
+static int memfd_create_named(const char *name, Error **errp)
+{
+    int mfd;
+    char val[16];
+
+    mfd = memfd_create(name, 0);
+    if (mfd < 0) {
+        error_setg_errno(errp, errno, "memfd_create failed");
+        return -1;
+    }
+
+    /* Remember mfd in environment for post-exec load */
+    qemu_clear_cloexec(mfd);
+    snprintf(val, sizeof(val), "%d", mfd);
+    g_setenv(name, val, 1);
+
+    return mfd;
+}
+
+static int memfd_find_named(const char *name, int *mfd_p, Error **errp)
+{
+    const char *val = g_getenv(name);
+
+    if (!val) {
+        *mfd_p = -1;
+        return 0;       /* No memfd was created, not an error */
+    }
+    g_unsetenv(name);
+    if (qemu_strtoi(val, NULL, 10, mfd_p)) {
+        error_setg(errp, "Bad %s env value %s", name, val);
+        return -1;
+    }
+    lseek(*mfd_p, 0, SEEK_SET);
+    return 0;
+}
+
+static void memfd_delete_named(const char *name)
+{
+    int mfd;
+    const char *val = g_getenv(name);
+
+    if (val) {
+        g_unsetenv(name);
+        if (!qemu_strtoi(val, NULL, 10, &mfd)) {
+            close(mfd);
+        }
+    }
+}
+
+static QEMUFile *qemu_file_new_memfd_output(const char *name, Error **errp)
+{
+    int mfd = memfd_create_named(name, errp);
+
+    if (mfd < 0) {
+        return NULL;
+    }
+
+    return qemu_file_new_fd_output(mfd, name);
+}
+
+static QEMUFile *qemu_file_new_memfd_input(const char *name, Error **errp)
+{
+    int ret, mfd;
+
+    ret = memfd_find_named(name, &mfd, errp);
+    if (ret || mfd < 0) {
+        return NULL;
+    }
+
+    return qemu_file_new_fd_input(mfd, name);
+}
+
+int migration_precreate_save(Error **errp)
+{
+    QEMUFile *f = qemu_file_new_memfd_output(PRECREATE_STATE_NAME, errp);
+
+    if (!f) {
+        return -1;
+    } else if (qemu_savevm_precreate_save(f, errp)) {
+        memfd_delete_named(PRECREATE_STATE_NAME);
+        return -1;
+    } else {
+        /* Do not close f, as mfd must remain open. */
+        return 0;
+    }
+}
+
+void migration_precreate_unsave(void)
+{
+    memfd_delete_named(PRECREATE_STATE_NAME);
+}
+
+int migration_precreate_load(Error **errp)
+{
+    int ret;
+    QEMUFile *f = qemu_file_new_memfd_input(PRECREATE_STATE_NAME, errp);
+
+    if (!f) {
+        return -1;
+    }
+    ret = qemu_savevm_precreate_load(f, errp);
+    qemu_fclose(f);
+    g_unsetenv(PRECREATE_STATE_NAME);
+    return ret;
+}
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 122+ messages in thread
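
For readers who want to try the mechanism from patch 06/26 outside of QEMU,
here is a minimal standalone sketch, assuming Linux with glibc 2.27 or later
for memfd_create().  The demo_state_save and demo_state_load helpers and the
DEMO_PRECREATE_STATE variable are hypothetical names and are not part of the
series.  The sketch writes raw bytes rather than a vmstate stream, and it
parses the environment value before unsetting the variable, because the
string returned by getenv() points into the environment.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not QEMU code): write len bytes to a new memfd and
 * publish its descriptor number in the environment variable `name`.
 * Returns the descriptor, which is deliberately left open, or -1 on error.
 */
static int demo_state_save(const char *name, const void *data, size_t len)
{
    char val[16];
    int flags;
    int mfd = memfd_create(name, 0);

    if (mfd < 0) {
        return -1;
    }

    /* The descriptor must survive exec, so make sure FD_CLOEXEC is clear. */
    flags = fcntl(mfd, F_GETFD);
    if (flags < 0 || fcntl(mfd, F_SETFD, flags & ~FD_CLOEXEC) < 0 ||
        write(mfd, data, len) != (ssize_t)len) {
        close(mfd);
        return -1;
    }

    snprintf(val, sizeof(val), "%d", mfd);
    if (setenv(name, val, 1) < 0) {
        close(mfd);
        return -1;
    }
    return mfd;
}

/*
 * Hypothetical helper: find the descriptor published under `name`, rewind
 * it, and read up to len bytes.  Returns the byte count, 0 if no state was
 * published, or -1 on error.
 */
static ssize_t demo_state_load(const char *name, void *buf, size_t len)
{
    const char *val = getenv(name);
    char *end;
    long fd;
    bool bad;
    ssize_t n;

    if (!val) {
        return 0;
    }

    /* Parse first: val points into the environment that unsetenv() edits. */
    errno = 0;
    fd = strtol(val, &end, 10);
    bad = errno || end == val || *end != '\0' || fd < 0 || fd > INT_MAX;
    unsetenv(name);
    if (bad) {
        return -1;
    }

    /* The writer left the file offset at the end, so rewind before reading. */
    if (lseek((int)fd, 0, SEEK_SET) < 0) {
        close((int)fd);
        return -1;
    }
    n = read((int)fd, buf, len);
    close((int)fd);
    return n;
}
```

In a real exec flow the save side would run before execv() and the load side
in the new image.  Calling both in one process, as a test would, only
exercises the environment and rewind logic.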

* [PATCH V1 07/26] migration: VMStateId
  2024-04-29 15:55 [PATCH V1 00/26] Live update: cpr-exec Steve Sistare
                   ` (5 preceding siblings ...)
  2024-04-29 15:55 ` [PATCH V1 06/26] migration: precreate vmstate for exec Steve Sistare
@ 2024-04-29 15:55 ` Steve Sistare
  2024-05-07 21:03   ` Fabiano Rosas
  2024-05-27 18:20   ` Peter Xu
  2024-04-29 15:55 ` [PATCH V1 08/26] migration: vmstate_info_void_ptr Steve Sistare
                   ` (22 subsequent siblings)
  29 siblings, 2 replies; 122+ messages in thread
From: Steve Sistare @ 2024-04-29 15:55 UTC (permalink / raw)
  To: qemu-devel
  Cc: Peter Xu, Fabiano Rosas, David Hildenbrand, Igor Mammedov,
	Eduardo Habkost, Marcel Apfelbaum, Philippe Mathieu-Daude,
	Paolo Bonzini, Daniel P. Berrange, Markus Armbruster,
	Steve Sistare

Define a type for the 256-byte id string to guarantee that the same length
is used and enforced everywhere.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 include/exec/ramblock.h     | 3 ++-
 include/migration/vmstate.h | 2 ++
 migration/savevm.c          | 8 ++++----
 migration/vmstate.c         | 3 ++-
 4 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/include/exec/ramblock.h b/include/exec/ramblock.h
index 0babd10..61deefe 100644
--- a/include/exec/ramblock.h
+++ b/include/exec/ramblock.h
@@ -23,6 +23,7 @@
 #include "cpu-common.h"
 #include "qemu/rcu.h"
 #include "exec/ramlist.h"
+#include "migration/vmstate.h"
 
 struct RAMBlock {
     struct rcu_head rcu;
@@ -35,7 +36,7 @@ struct RAMBlock {
     void (*resized)(const char*, uint64_t length, void *host);
     uint32_t flags;
     /* Protected by the BQL.  */
-    char idstr[256];
+    VMStateId idstr;
     /* RCU-enabled, writes protected by the ramlist lock */
     QLIST_ENTRY(RAMBlock) next;
     QLIST_HEAD(, RAMBlockNotifier) ramblock_notifiers;
diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
index 4691334..a39c0e6 100644
--- a/include/migration/vmstate.h
+++ b/include/migration/vmstate.h
@@ -1210,6 +1210,8 @@ int vmstate_save_state_v(QEMUFile *f, const VMStateDescription *vmsd,
 
 bool vmstate_section_needed(const VMStateDescription *vmsd, void *opaque);
 
+typedef char (VMStateId)[256];
+
 #define  VMSTATE_INSTANCE_ID_ANY  -1
 
 /* Returns: 0 on success, -1 on failure */
diff --git a/migration/savevm.c b/migration/savevm.c
index a30bcd9..9b1a335 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -197,13 +197,13 @@ const VMStateInfo vmstate_info_timer = {
 
 
 typedef struct CompatEntry {
-    char idstr[256];
+    VMStateId idstr;
     int instance_id;
 } CompatEntry;
 
 typedef struct SaveStateEntry {
     QTAILQ_ENTRY(SaveStateEntry) entry;
-    char idstr[256];
+    VMStateId idstr;
     uint32_t instance_id;
     int alias_id;
     int version_id;
@@ -814,7 +814,7 @@ int register_savevm_live(const char *idstr,
 void unregister_savevm(VMStateIf *obj, const char *idstr, void *opaque)
 {
     SaveStateEntry *se, *new_se;
-    char id[256] = "";
+    VMStateId id = "";
 
     if (obj) {
         char *oid = vmstate_if_get_id(obj);
@@ -2650,7 +2650,7 @@ qemu_loadvm_section_start_full(QEMUFile *f, uint8_t type)
     uint32_t instance_id, version_id, section_id;
     int64_t start_ts, end_ts;
     SaveStateEntry *se;
-    char idstr[256];
+    VMStateId idstr;
     int ret;
 
     /* Read section start */
diff --git a/migration/vmstate.c b/migration/vmstate.c
index ef26f26..437f156 100644
--- a/migration/vmstate.c
+++ b/migration/vmstate.c
@@ -471,7 +471,8 @@ static int vmstate_subsection_load(QEMUFile *f, const VMStateDescription *vmsd,
     trace_vmstate_subsection_load(vmsd->name);
 
     while (qemu_peek_byte(f, 0) == QEMU_VM_SUBSECTION) {
-        char idstr[256], *idstr_ret;
+        VMStateId idstr;
+        char *idstr_ret;
         int ret;
         uint8_t version_id, len, size;
         const VMStateDescription *sub_vmsd;
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 122+ messages in thread
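
One property of an array typedef such as the one in patch 07/26 is worth
keeping in mind: a parameter declared with the array type decays to a
pointer, so the buffer size has to come from the type and not from the
parameter.  The sketch below illustrates this with a hypothetical DemoId
type and demo_set_id helper, neither of which is in the series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Same shape as the VMStateId typedef, under a hypothetical name. */
typedef char (DemoId)[256];

struct demo_entry {
    DemoId idstr;       /* sizeof(entry.idstr) is 256 here */
    int instance_id;
};

/*
 * `dst` is really a char * inside this function, so sizeof(dst) would be
 * the size of a pointer.  Size the copy with sizeof(DemoId) instead.
 * Returns false if src did not fit and was truncated.
 */
static bool demo_set_id(DemoId dst, const char *src)
{
    int n = snprintf(dst, sizeof(DemoId), "%s", src);

    return n >= 0 && (size_t)n < sizeof(DemoId);
}
```

Struct members and local variables of the typedef'd type keep their full
array size, which is why replacing char idstr[256] declarations with the
typedef does not change their layout.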

* [PATCH V1 08/26] migration: vmstate_info_void_ptr
  2024-04-29 15:55 [PATCH V1 00/26] Live update: cpr-exec Steve Sistare
                   ` (6 preceding siblings ...)
  2024-04-29 15:55 ` [PATCH V1 07/26] migration: VMStateId Steve Sistare
@ 2024-04-29 15:55 ` Steve Sistare
  2024-05-07 21:33   ` Fabiano Rosas
  2024-05-27 18:31   ` Peter Xu
  2024-04-29 15:55 ` [PATCH V1 09/26] migration: vmstate_register_named Steve Sistare
                   ` (21 subsequent siblings)
  29 siblings, 2 replies; 122+ messages in thread
From: Steve Sistare @ 2024-04-29 15:55 UTC (permalink / raw)
  To: qemu-devel
  Cc: Peter Xu, Fabiano Rosas, David Hildenbrand, Igor Mammedov,
	Eduardo Habkost, Marcel Apfelbaum, Philippe Mathieu-Daude,
	Paolo Bonzini, Daniel P. Berrange, Markus Armbruster,
	Steve Sistare

Define VMSTATE_VOID_PTR so the value of a pointer (but not its target)
can be saved in the migration stream.  This will be needed for CPR.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 include/migration/vmstate.h | 15 +++++++++++++++
 migration/vmstate-types.c   | 24 ++++++++++++++++++++++++
 2 files changed, 39 insertions(+)

diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
index a39c0e6..bb885d9 100644
--- a/include/migration/vmstate.h
+++ b/include/migration/vmstate.h
@@ -236,6 +236,7 @@ extern const VMStateInfo vmstate_info_uint8;
 extern const VMStateInfo vmstate_info_uint16;
 extern const VMStateInfo vmstate_info_uint32;
 extern const VMStateInfo vmstate_info_uint64;
+extern const VMStateInfo vmstate_info_void_ptr;
 
 /** Put this in the stream when migrating a null pointer.*/
 #define VMS_NULLPTR_MARKER (0x30U) /* '0' */
@@ -326,6 +327,17 @@ extern const VMStateInfo vmstate_info_qlist;
     .offset       = vmstate_offset_value(_state, _field, _type),     \
 }
 
+#define VMSTATE_SINGLE_TEST_NO_CHECK(_field, _state, _test,          \
+                                     _version, _info, _type) {       \
+    .name         = (stringify(_field)),                             \
+    .version_id   = (_version),                                      \
+    .field_exists = (_test),                                         \
+    .size         = sizeof(_type),                                   \
+    .info         = &(_info),                                        \
+    .flags        = VMS_SINGLE,                                      \
+    .offset       = offsetof(_state, _field)                         \
+}
+
 #define VMSTATE_SINGLE_FULL(_field, _state, _test, _version, _info,  \
                             _type, _err_hint) {                      \
     .name         = (stringify(_field)),                             \
@@ -952,6 +964,9 @@ extern const VMStateInfo vmstate_info_qlist;
 #define VMSTATE_UINT64(_f, _s)                                        \
     VMSTATE_UINT64_V(_f, _s, 0)
 
+#define VMSTATE_VOID_PTR(_f, _s)                                      \
+    VMSTATE_SINGLE_TEST_NO_CHECK(_f, _s, NULL, 0, vmstate_info_void_ptr, void *)
+
 #ifdef CONFIG_LINUX
 
 #define VMSTATE_U8(_f, _s)                                         \
diff --git a/migration/vmstate-types.c b/migration/vmstate-types.c
index e83bfcc..097ecad 100644
--- a/migration/vmstate-types.c
+++ b/migration/vmstate-types.c
@@ -314,6 +314,30 @@ const VMStateInfo vmstate_info_uint64 = {
     .put  = put_uint64,
 };
 
+/* 64 bit pointer */
+
+static int get_void_ptr(QEMUFile *f, void *pv, size_t size,
+                        const VMStateField *field)
+{
+    void **v = pv;
+    qemu_get_be64s(f, (uint64_t *)v);
+    return 0;
+}
+
+static int put_void_ptr(QEMUFile *f, void *pv, size_t size,
+                        const VMStateField *field, JSONWriter *vmdesc)
+{
+    void **v = pv;
+    qemu_put_be64s(f, (uint64_t *)v);
+    return 0;
+}
+
+const VMStateInfo vmstate_info_void_ptr = {
+    .name = "void_ptr",
+    .get  = get_void_ptr,
+    .put  = put_void_ptr,
+};
+
 static int get_nullptr(QEMUFile *f, void *pv, size_t size,
                        const VMStateField *field)
 
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 122+ messages in thread
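
To illustrate what saving "the value of a pointer (but not its target)"
amounts to in patch 08/26, here is a standalone sketch that encodes a
pointer as eight big-endian bytes and decodes it again.  The function names
are hypothetical, and a plain byte buffer stands in for the QEMUFile.
Unlike the patch, which casts the void ** to uint64_t *, the sketch goes
through uintptr_t, an assumption made here so that it also behaves on hosts
where pointers are narrower than 64 bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: store the pointer's numeric value, big-endian. */
static void demo_put_ptr_be64(uint8_t out[8], const void *p)
{
    uint64_t v = (uint64_t)(uintptr_t)p;

    for (int i = 0; i < 8; i++) {
        out[i] = (uint8_t)(v >> (56 - 8 * i));
    }
}

/* Hypothetical helper: rebuild the pointer value from eight bytes. */
static void *demo_get_ptr_be64(const uint8_t in[8])
{
    uint64_t v = 0;

    for (int i = 0; i < 8; i++) {
        v = (v << 8) | in[i];
    }
    return (void *)(uintptr_t)v;
}
```

Only the address travels through the stream.  Whether it may be dereferenced
on the receiving side depends entirely on the caller.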

* [PATCH V1 09/26] migration: vmstate_register_named
  2024-04-29 15:55 [PATCH V1 00/26] Live update: cpr-exec Steve Sistare
                   ` (7 preceding siblings ...)
  2024-04-29 15:55 ` [PATCH V1 08/26] migration: vmstate_info_void_ptr Steve Sistare
@ 2024-04-29 15:55 ` Steve Sistare
  2024-05-09 14:19   ` Fabiano Rosas
  2024-04-29 15:55 ` [PATCH V1 10/26] migration: vmstate_unregister_named Steve Sistare
                   ` (20 subsequent siblings)
  29 siblings, 1 reply; 122+ messages in thread
From: Steve Sistare @ 2024-04-29 15:55 UTC (permalink / raw)
  To: qemu-devel
  Cc: Peter Xu, Fabiano Rosas, David Hildenbrand, Igor Mammedov,
	Eduardo Habkost, Marcel Apfelbaum, Philippe Mathieu-Daude,
	Paolo Bonzini, Daniel P. Berrange, Markus Armbruster,
	Steve Sistare

Define vmstate_register_named which takes the instance name as its first
parameter, instead of generating the name from the VMStateIf of the Object.
This will be needed to register objects that are not Objects.  Pass the
new name parameter to vmstate_register_with_alias_id.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 hw/core/qdev.c              |  1 +
 hw/intc/apic_common.c       |  2 +-
 include/migration/vmstate.h | 19 +++++++++++++++++--
 migration/savevm.c          | 35 ++++++++++++++++++++++++-----------
 migration/trace-events      |  1 +
 stubs/vmstate.c             |  1 +
 6 files changed, 45 insertions(+), 14 deletions(-)

diff --git a/hw/core/qdev.c b/hw/core/qdev.c
index 00efaf1..b352e8a 100644
--- a/hw/core/qdev.c
+++ b/hw/core/qdev.c
@@ -535,6 +535,7 @@ static void device_set_realized(Object *obj, bool value, Error **errp)
                                                qdev_get_vmsd(dev), dev,
                                                dev->instance_id_alias,
                                                dev->alias_required_for_version,
+                                               NULL,
                                                &local_err) < 0) {
                 goto post_realize_fail;
             }
diff --git a/hw/intc/apic_common.c b/hw/intc/apic_common.c
index d8fc1e2..d6cd293 100644
--- a/hw/intc/apic_common.c
+++ b/hw/intc/apic_common.c
@@ -298,7 +298,7 @@ static void apic_common_realize(DeviceState *dev, Error **errp)
         instance_id = VMSTATE_INSTANCE_ID_ANY;
     }
     vmstate_register_with_alias_id(NULL, instance_id, &vmstate_apic_common,
-                                   s, -1, 0, NULL);
+                                   s, -1, 0, NULL, NULL);
 
     /* APIC LDR in x2APIC mode */
     s->extended_log_dest = ((s->initial_apic_id >> 4) << 16) |
diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
index bb885d9..22aa3c6 100644
--- a/include/migration/vmstate.h
+++ b/include/migration/vmstate.h
@@ -1234,6 +1234,7 @@ int vmstate_register_with_alias_id(VMStateIf *obj, uint32_t instance_id,
                                    const VMStateDescription *vmsd,
                                    void *base, int alias_id,
                                    int required_for_version,
+                                   const char *instance_name,
                                    Error **errp);
 
 /**
@@ -1250,7 +1251,7 @@ static inline int vmstate_register(VMStateIf *obj, int instance_id,
                                    void *opaque)
 {
     return vmstate_register_with_alias_id(obj, instance_id, vmsd,
-                                          opaque, -1, 0, NULL);
+                                          opaque, -1, 0, NULL, NULL);
 }
 
 /**
@@ -1278,7 +1279,21 @@ static inline int vmstate_register_any(VMStateIf *obj,
                                        void *opaque)
 {
     return vmstate_register_with_alias_id(obj, VMSTATE_INSTANCE_ID_ANY, vmsd,
-                                          opaque, -1, 0, NULL);
+                                          opaque, -1, 0, NULL, NULL);
+}
+
+/**
+ * vmstate_register_named() - pass an instance_name explicitly instead of
+ * implicitly via VMStateIf get_id().  Needed to register an instance-specific
+ * VMSD for objects that are not Objects.
+ */
+static inline int vmstate_register_named(const char *instance_name,
+                                         int instance_id,
+                                         const VMStateDescription *vmsd,
+                                         void *opaque)
+{
+    return vmstate_register_with_alias_id(NULL, instance_id, vmsd, opaque,
+                                          -1, 0, instance_name, NULL);
 }
 
 void vmstate_unregister(VMStateIf *obj, const VMStateDescription *vmsd,
diff --git a/migration/savevm.c b/migration/savevm.c
index 9b1a335..86b4c87 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -889,10 +889,20 @@ int vmstate_replace_hack_for_ppc(VMStateIf *obj, int instance_id,
     return vmstate_register(obj, instance_id, vmsd, opaque);
 }
 
+static bool make_new_idstr(VMStateId idstr, const char *id, Error **errp)
+{
+    if (snprintf(idstr, sizeof(VMStateId), "%s/", id) >= sizeof(VMStateId)) {
+        error_setg(errp, "Path too long for VMState (%s)", id);
+        return false;
+    }
+    return true;
+}
+
 int vmstate_register_with_alias_id(VMStateIf *obj, uint32_t instance_id,
                                    const VMStateDescription *vmsd,
                                    void *opaque, int alias_id,
                                    int required_for_version,
+                                   const char *instance_name,
                                    Error **errp)
 {
     SaveStateEntry *se;
@@ -907,19 +917,17 @@ int vmstate_register_with_alias_id(VMStateIf *obj, uint32_t instance_id,
     se->vmsd = vmsd;
     se->alias_id = alias_id;
 
-    if (obj) {
-        char *id = vmstate_if_get_id(obj);
+    if (instance_name) {
+        if (!make_new_idstr(se->idstr, instance_name, errp)) {
+            goto err;
+        }
+
+    } else if (obj) {
+        g_autofree char *id = vmstate_if_get_id(obj);
         if (id) {
-            if (snprintf(se->idstr, sizeof(se->idstr), "%s/", id) >=
-                sizeof(se->idstr)) {
-                error_setg(errp, "Path too long for VMState (%s)", id);
-                g_free(id);
-                g_free(se);
-
-                return -1;
+            if (!make_new_idstr(se->idstr, id, errp)) {
+                goto err;
             }
-            g_free(id);
-
             se->compat = g_new0(CompatEntry, 1);
             pstrcpy(se->compat->idstr, sizeof(se->compat->idstr), vmsd->name);
             se->compat->instance_id = instance_id == VMSTATE_INSTANCE_ID_ANY ?
@@ -941,7 +949,12 @@ int vmstate_register_with_alias_id(VMStateIf *obj, uint32_t instance_id,
     }
     assert(!se->compat || se->instance_id == 0);
     savevm_state_handler_insert(se);
+    trace_vmstate_register(se->idstr, se->instance_id, (void *)vmsd, opaque);
     return 0;
+
+err:
+    g_free(se);
+    return -1;
 }
 
 void vmstate_unregister(VMStateIf *obj, const VMStateDescription *vmsd,
diff --git a/migration/trace-events b/migration/trace-events
index f0e1cb8..8647147 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -53,6 +53,7 @@ vmstate_downtime_checkpoint(const char *checkpoint) "%s"
 postcopy_pause_incoming(void) ""
 postcopy_pause_incoming_continued(void) ""
 postcopy_page_req_sync(void *host_addr) "sync page req %p"
+vmstate_register(const char *idstr, int id, void *vmsd, void *opaque) "%s, %d, vmsd %p, opaque %p"
 
 # vmstate.c
 vmstate_load_field_error(const char *field, int ret) "field \"%s\" load failed, ret = %d"
diff --git a/stubs/vmstate.c b/stubs/vmstate.c
index 8513d92..d67506e 100644
--- a/stubs/vmstate.c
+++ b/stubs/vmstate.c
@@ -6,6 +6,7 @@ int vmstate_register_with_alias_id(VMStateIf *obj,
                                    const VMStateDescription *vmsd,
                                    void *base, int alias_id,
                                    int required_for_version,
+                                   const char *instance_name,
                                    Error **errp)
 {
     return 0;
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 122+ messages in thread
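
The naming rule that patch 09/26 adds to vmstate_register_with_alias_id can
be summarized in a standalone sketch: an explicit instance name takes
precedence over an id derived from the object, and the resulting "<id>/"
prefix is rejected if it does not fit.  The demo_make_prefix function and
DEMO_ID_SIZE below are hypothetical names, and the sketch leaves out the
compat entry handling and the rest of the registration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define DEMO_ID_SIZE 256    /* assumed to match the id string length */

/*
 * Hypothetical helper: build the "<id>/" prefix of an id string.
 * instance_name wins over obj_id; with neither, the prefix is empty.
 * Returns false if the prefix would not fit in `size` bytes.
 */
static bool demo_make_prefix(char *idstr, size_t size,
                             const char *instance_name, const char *obj_id)
{
    const char *id = instance_name ? instance_name : obj_id;
    int n;

    if (!id) {
        idstr[0] = '\0';
        return true;
    }

    /* snprintf returns the untruncated length, so >= size means overflow. */
    n = snprintf(idstr, size, "%s/", id);
    return n >= 0 && (size_t)n < size;
}
```

The truncation test mirrors the one in make_new_idstr: the return value of
snprintf is compared against the buffer size, not against the string that
was actually written.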

* [PATCH V1 10/26] migration: vmstate_unregister_named
  2024-04-29 15:55 [PATCH V1 00/26] Live update: cpr-exec Steve Sistare
                   ` (8 preceding siblings ...)
  2024-04-29 15:55 ` [PATCH V1 09/26] migration: vmstate_register_named Steve Sistare
@ 2024-04-29 15:55 ` Steve Sistare
  2024-04-29 15:55 ` [PATCH V1 11/26] migration: vmstate_register at init time Steve Sistare
                   ` (19 subsequent siblings)
  29 siblings, 0 replies; 122+ messages in thread
From: Steve Sistare @ 2024-04-29 15:55 UTC (permalink / raw)
  To: qemu-devel
  Cc: Peter Xu, Fabiano Rosas, David Hildenbrand, Igor Mammedov,
	Eduardo Habkost, Marcel Apfelbaum, Philippe Mathieu-Daude,
	Paolo Bonzini, Daniel P. Berrange, Markus Armbruster,
	Steve Sistare

Define a function that finds a vmstate handler by name and id and
unregisters it.  This is needed to unregister a specific instance of an
object that is not an Object, since it lacks the VMStateIf get_id hook.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 include/migration/vmstate.h |  9 +++++++++
 migration/savevm.c          | 27 +++++++++++++++++++++++++++
 migration/trace-events      |  1 +
 stubs/vmstate.c             |  6 ++++++
 4 files changed, 43 insertions(+)

diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
index 22aa3c6..3d71b34 100644
--- a/include/migration/vmstate.h
+++ b/include/migration/vmstate.h
@@ -1299,6 +1299,15 @@ static inline int vmstate_register_named(const char *instance_name,
 void vmstate_unregister(VMStateIf *obj, const VMStateDescription *vmsd,
                         void *opaque);
 
+/**
+ * Delete the VMSD handler for the object with name "vmsd_name/instance_name"
+ * and matching instance_id.  If instance_id is VMSTATE_INSTANCE_ID_ANY,
+ * delete all instances matching name.
+ */
+void vmstate_unregister_named(const char *vmsd_name,
+                              const char *instance_name,
+                              int instance_id);
+
 void vmstate_register_ram(struct MemoryRegion *memory, DeviceState *dev);
 void vmstate_unregister_ram(struct MemoryRegion *memory, DeviceState *dev);
 void vmstate_register_ram_global(struct MemoryRegion *memory);
diff --git a/migration/savevm.c b/migration/savevm.c
index 86b4c87..cd2eabe 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -964,6 +964,8 @@ void vmstate_unregister(VMStateIf *obj, const VMStateDescription *vmsd,
 
     SAVEVM_FOREACH_SAFE_ALL(se, entry, new_se) {
         if (se->vmsd == vmsd && se->opaque == opaque) {
+            trace_vmstate_unregister(se->idstr, se->instance_id, (void *)vmsd,
+                                     opaque);
             savevm_state_handler_remove(se);
             g_free(se->compat);
             g_free(se);
@@ -971,6 +973,31 @@ void vmstate_unregister(VMStateIf *obj, const VMStateDescription *vmsd,
     }
 }
 
+void vmstate_unregister_named(const char *vmsd_name,
+                              const char *instance_name,
+                              int instance_id)
+{
+    SaveStateEntry *se, *new_se;
+    VMStateId idstr;
+
+    snprintf(idstr, sizeof(idstr), "%s/%s", vmsd_name, instance_name);
+
+    SAVEVM_FOREACH_SAFE_ALL(se, entry, new_se) {
+        if (!strcmp(se->idstr, idstr) &&
+            (instance_id == VMSTATE_INSTANCE_ID_ANY ||
+             se->instance_id == instance_id)) {
+            trace_vmstate_unregister(idstr, se->instance_id, (void *)se->vmsd,
+                                     se->opaque);
+            savevm_state_handler_remove(se);
+            g_free(se->compat);
+            g_free(se);
+            if (instance_id != VMSTATE_INSTANCE_ID_ANY) {
+                return;
+            }
+        }
+    }
+}
+
 static int vmstate_load(QEMUFile *f, SaveStateEntry *se)
 {
     trace_vmstate_load(se->idstr, se->vmsd ? se->vmsd->name : "(old)");
diff --git a/migration/trace-events b/migration/trace-events
index 8647147..1e23238 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -54,6 +54,7 @@ postcopy_pause_incoming(void) ""
 postcopy_pause_incoming_continued(void) ""
 postcopy_page_req_sync(void *host_addr) "sync page req %p"
 vmstate_register(const char *idstr, int id, void *vmsd, void *opaque) "%s, %d, vmsd %p, opaque %p"
+vmstate_unregister(const char *idstr, int id, void *vmsd, void *opaque) "%s, %d, vmsd %p, opaque %p"
 
 # vmstate.c
 vmstate_load_field_error(const char *field, int ret) "field \"%s\" load failed, ret = %d"
diff --git a/stubs/vmstate.c b/stubs/vmstate.c
index d67506e..eff8be4 100644
--- a/stubs/vmstate.c
+++ b/stubs/vmstate.c
@@ -18,6 +18,12 @@ void vmstate_unregister(VMStateIf *obj,
 {
 }
 
+void vmstate_unregister_named(const char *vmsd_name,
+                              const char *instance_name,
+                              int instance_id)
+{
+}
+
 bool vmstate_check_only_migratable(const VMStateDescription *vmsd)
 {
     return true;
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 122+ messages in thread
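
The matching rule in patch 10/26 (exact id string, plus either an exact
instance id or the ANY wildcard, which removes every match) can be shown on
a plain singly linked list.  This is a sketch under stated assumptions: the
demo_* names and DEMO_INSTANCE_ID_ANY are hypothetical, the list stands in
for the savevm handler list, and the return value is an addition made here
so that the result can be checked.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_INSTANCE_ID_ANY (-1)   /* stand-in for VMSTATE_INSTANCE_ID_ANY */

struct demo_handler {
    char idstr[256];
    int instance_id;
    struct demo_handler *next;
};

static struct demo_handler *demo_handlers;

/* Hypothetical helper: add a handler at the head of the list. */
static struct demo_handler *demo_register(const char *idstr, int instance_id)
{
    struct demo_handler *h = calloc(1, sizeof(*h));

    if (!h) {
        return NULL;
    }
    snprintf(h->idstr, sizeof(h->idstr), "%s", idstr);
    h->instance_id = instance_id;
    h->next = demo_handlers;
    demo_handlers = h;
    return h;
}

/*
 * Remove handlers whose idstr matches.  A specific instance_id removes at
 * most one handler; DEMO_INSTANCE_ID_ANY removes all of them.
 * Returns the number of handlers removed.
 */
static int demo_unregister_named(const char *idstr, int instance_id)
{
    struct demo_handler **pp = &demo_handlers;
    int removed = 0;

    while (*pp) {
        struct demo_handler *h = *pp;

        if (!strcmp(h->idstr, idstr) &&
            (instance_id == DEMO_INSTANCE_ID_ANY ||
             h->instance_id == instance_id)) {
            *pp = h->next;      /* unlink before freeing */
            free(h);
            removed++;
            if (instance_id != DEMO_INSTANCE_ID_ANY) {
                break;
            }
        } else {
            pp = &h->next;
        }
    }
    return removed;
}
```

The pointer-to-pointer walk plays the role that SAVEVM_FOREACH_SAFE_ALL
plays in the patch: the current element can be freed without losing the
position in the list.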

* [PATCH V1 11/26] migration: vmstate_register at init time
  2024-04-29 15:55 [PATCH V1 00/26] Live update: cpr-exec Steve Sistare
                   ` (9 preceding siblings ...)
  2024-04-29 15:55 ` [PATCH V1 10/26] migration: vmstate_unregister_named Steve Sistare
@ 2024-04-29 15:55 ` Steve Sistare
  2024-04-29 15:55 ` [PATCH V1 12/26] migration: vmstate factory object Steve Sistare
                   ` (18 subsequent siblings)
  29 siblings, 0 replies; 122+ messages in thread
From: Steve Sistare @ 2024-04-29 15:55 UTC (permalink / raw)
  To: qemu-devel
  Cc: Peter Xu, Fabiano Rosas, David Hildenbrand, Igor Mammedov,
	Eduardo Habkost, Marcel Apfelbaum, Philippe Mathieu-Daude,
	Paolo Bonzini, Daniel P. Berrange, Markus Armbruster,
	Steve Sistare

Define vmstate_register_init to statically declare that a vmstate object
should be registered during qemu initialization, specifically, in the
call to vmstate_register_init_all.  This is needed to register objects
that are not Objects (and hence cannot use the DeviceClass vmsd hook),
without requiring that qemu call an object-specific initialization function.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 include/migration/vmstate.h | 18 ++++++++++++++++++
 migration/savevm.c          | 32 ++++++++++++++++++++++++++++++++
 system/vl.c                 |  3 +++
 3 files changed, 53 insertions(+)

diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
index 3d71b34..8cb3d2b 100644
--- a/include/migration/vmstate.h
+++ b/include/migration/vmstate.h
@@ -1255,6 +1255,24 @@ static inline int vmstate_register(VMStateIf *obj, int instance_id,
 }
 
 /**
+ * vmstate_register_init() - statically declare a VMSD to be registered when
+ * QEMU calls vmstate_register_init_all.  This is useful for registering
+ * objects that are not Objects (and hence cannot use the DeviceClass vmsd
+ * hook).
+ */
+#define vmstate_register_init(_obj, _id, _vmsd, _opaque)                    \
+static void __attribute__((constructor)) vmstate_register_ ## _vmsd(void)   \
+{                                                                           \
+    vmstate_register_init_add(_obj, _id, &_vmsd, _opaque);                  \
+}
+
+void vmstate_register_init_add(VMStateIf *obj, int instance_id,
+                               const VMStateDescription *vmsd, void *opaque);
+
+void vmstate_register_init_all(void);
+
+
+/**
  * vmstate_replace_hack_for_ppc() - ppc used to abuse vmstate_register
  *
  * Don't even think about using this function in new code.
diff --git a/migration/savevm.c b/migration/savevm.c
index cd2eabe..ec48da9 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -957,6 +957,38 @@ err:
     return -1;
 }
 
+typedef struct VMStateInit {
+    VMStateIf *obj;
+    int instance_id;
+    const VMStateDescription *vmsd;
+    void *opaque;
+    QLIST_ENTRY(VMStateInit) next;
+} VMStateInit;
+
+static QLIST_HEAD(, VMStateInit) vmstate_inits;
+
+void vmstate_register_init_add(VMStateIf *obj, int instance_id,
+                               const VMStateDescription *vmsd, void *opaque)
+{
+    VMStateInit *v = g_new0(VMStateInit, 1);
+
+    v->obj = obj;
+    v->instance_id = instance_id;
+    v->vmsd = vmsd;
+    v->opaque = opaque;
+    QLIST_INSERT_HEAD(&vmstate_inits, v, next);
+}
+
+void vmstate_register_init_all(void)
+{
+    VMStateInit *v, *tmp;
+
+    QLIST_FOREACH_SAFE(v, &vmstate_inits, next, tmp) {
+        vmstate_register(v->obj, v->instance_id, v->vmsd, v->opaque);
+        QLIST_REMOVE(v, next);
+    }
+}
+
 void vmstate_unregister(VMStateIf *obj, const VMStateDescription *vmsd,
                         void *opaque)
 {
diff --git a/system/vl.c b/system/vl.c
index c644222..7797206 100644
--- a/system/vl.c
+++ b/system/vl.c
@@ -78,6 +78,7 @@
 #include "hw/i386/pc.h"
 #include "migration/misc.h"
 #include "migration/snapshot.h"
+#include "migration/vmstate.h"
 #include "sysemu/tpm.h"
 #include "sysemu/dma.h"
 #include "hw/audio/soundhw.h"
@@ -3663,6 +3664,8 @@ void qemu_init(int argc, char **argv)
 
     qemu_create_machine(machine_opts_dict);
 
+    vmstate_register_init_all();
+
     suspend_mux_open();
 
     qemu_disable_default_devices();
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 122+ messages in thread
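
The deferred registration pattern in patch 11/26 relies on constructor
functions that run before main() and only queue a request, with the real
work done later by one explicit call.  The sketch below shows the pattern in
isolation, assuming GCC or Clang for __attribute__((constructor)).  All
demo_* names are hypothetical, and a counter stands in for the call to
vmstate_register.

```c
#include <assert.h>
#include <stddef.h>

struct demo_init {
    const char *name;
    struct demo_init *next;
};

static struct demo_init *demo_pending;  /* filled in by constructors */
static int demo_registered;             /* stand-in for real registration */

static void demo_register_init_add(struct demo_init *v)
{
    v->next = demo_pending;
    demo_pending = v;
}

/* Declare one statically allocated request and queue it before main(). */
#define DEMO_REGISTER_INIT(_name)                                        \
static struct demo_init demo_init_##_name = { .name = #_name };          \
static void __attribute__((constructor)) demo_register_##_name(void)     \
{                                                                        \
    demo_register_init_add(&demo_init_##_name);                          \
}

DEMO_REGISTER_INIT(alpha)
DEMO_REGISTER_INIT(beta)

static int demo_pending_count(void)
{
    int n = 0;

    for (struct demo_init *v = demo_pending; v; v = v->next) {
        n++;
    }
    return n;
}

/* Drain the queue once the program is ready to register handlers. */
static void demo_register_init_all(void)
{
    while (demo_pending) {
        struct demo_init *v = demo_pending;

        demo_pending = v->next;
        v->next = NULL;
        demo_registered++;
    }
}
```

The sketch uses statically allocated nodes, so draining the queue leaves
nothing to free.  The patch as posted allocates each VMStateInit with
g_new0 and removes it from the list in vmstate_register_init_all without a
visible g_free, which may be worth a look during review.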

* [PATCH V1 12/26] migration: vmstate factory object
  2024-04-29 15:55 [PATCH V1 00/26] Live update: cpr-exec Steve Sistare
                   ` (10 preceding siblings ...)
  2024-04-29 15:55 ` [PATCH V1 11/26] migration: vmstate_register at init time Steve Sistare
@ 2024-04-29 15:55 ` Steve Sistare
  2024-04-29 15:55 ` [PATCH V1 13/26] physmem: ram_block_create Steve Sistare
                   ` (17 subsequent siblings)
  29 siblings, 0 replies; 122+ messages in thread
From: Steve Sistare @ 2024-04-29 15:55 UTC (permalink / raw)
  To: qemu-devel
  Cc: Peter Xu, Fabiano Rosas, David Hildenbrand, Igor Mammedov,
	Eduardo Habkost, Marcel Apfelbaum, Philippe Mathieu-Daude,
	Paolo Bonzini, Daniel P. Berrange, Markus Armbruster,
	Steve Sistare

During the precreate phase, we will load migration state that will be used
to create objects, but those objects have not been created and have not
registered a vmstate handler yet.  We don't know how many objects will be
created, or what their names and ids will be, so we don't know what vmstate
handlers to register.

To solve this problem, define the factory object.  A factory object
is added to the outgoing migration stream as usual, by registering a
vmsd and opaque pointer.  During incoming migration, it is allocated
on demand rather than loaded into a pre-registered object's opaque
address.  To make this possible, register a factory that knows the
object's size.  loadvm receives an idstr which contains a factory name,
finds the factory, allocates the object, then loads the fields as usual.
The object is added to a factory_objects list, tagged by name and id, to
be found and claimed later by object-specific code.

A factory is a registered VMStateDescription with factory=true and
instance_id VMSTATE_INSTANCE_ID_FACTORY.

A factory object is registered using the same VMStateDescription, but with
its own instance_name and instance_id.
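
The incoming flow described above can be sketched in plain C, outside of
QEMU.  This is only an illustration under stated assumptions: Factory,
LoadedObject, dup_str, load_factory_object and claim_factory_object are
hypothetical stand-ins rather than the functions added by this patch, and
the sketch zero-fills the new object where loadvm would load its fields
from the stream.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a registered factory: a name plus object size. */
typedef struct Factory {
    const char *name;   /* the part of idstr after the last '/' */
    size_t size;        /* size of the object to allocate on load */
} Factory;

/* Hypothetical stand-in for an entry on the factory_objects list. */
typedef struct LoadedObject {
    char *factory_name;
    char *instance_name;
    int instance_id;
    void *opaque;
    struct LoadedObject *next;
} LoadedObject;

static LoadedObject *loaded_objects;

char *dup_str(const char *s)
{
    char *copy = malloc(strlen(s) + 1);

    assert(copy);
    return strcpy(copy, s);
}

/*
 * Split "instance_name/factory_name" at the last '/', find the factory,
 * allocate a zeroed object of the factory's size, and record it for a
 * later claim.  Returns the object, or NULL if no factory matches.
 */
void *load_factory_object(const Factory *factories, size_t nfactories,
                          const char *idstr, int instance_id)
{
    char *instance_name = dup_str(idstr);
    char *slash = strrchr(instance_name, '/');
    const Factory *factory = NULL;
    LoadedObject *obj;
    size_t i;

    if (slash) {
        *slash = '\0';              /* instance_name now ends at the '/' */
        for (i = 0; i < nfactories; i++) {
            if (!strcmp(factories[i].name, slash + 1)) {
                factory = &factories[i];
                break;
            }
        }
    }
    if (!factory) {
        free(instance_name);
        return NULL;
    }

    obj = calloc(1, sizeof(*obj));
    assert(obj);
    obj->factory_name = dup_str(factory->name);
    obj->instance_name = instance_name;     /* list entry owns the string */
    obj->instance_id = instance_id;
    obj->opaque = calloc(1, factory->size);
    assert(obj->opaque);
    obj->next = loaded_objects;
    loaded_objects = obj;
    return obj->opaque;
}

/* Find a loaded object, unlink and free its list entry, return the object. */
void *claim_factory_object(const char *factory_name,
                           const char *instance_name, int instance_id)
{
    LoadedObject **link;

    for (link = &loaded_objects; *link; link = &(*link)->next) {
        LoadedObject *obj = *link;

        if (!strcmp(obj->factory_name, factory_name) &&
            !strcmp(obj->instance_name, instance_name) &&
            obj->instance_id == instance_id) {
            void *opaque = obj->opaque;

            *link = obj->next;
            free(obj->factory_name);
            free(obj->instance_name);
            free(obj);
            return opaque;
        }
    }
    return NULL;
}
```

In this sketch the claimer takes ownership of the returned object and
frees it when done; a second claim with the same key returns NULL.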

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 include/migration/vmstate.h | 66 +++++++++++++++++++++++++++++++++++++-
 migration/meson.build       |  1 +
 migration/savevm.c          | 66 ++++++++++++++++++++++++++++++++++++--
 migration/trace-events      |  5 +++
 migration/vmstate-factory.c | 78 +++++++++++++++++++++++++++++++++++++++++++++
 stubs/vmstate.c             |  6 ++++
 6 files changed, 218 insertions(+), 4 deletions(-)
 create mode 100644 migration/vmstate-factory.c

diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
index 8cb3d2b..00ad864 100644
--- a/include/migration/vmstate.h
+++ b/include/migration/vmstate.h
@@ -204,6 +204,11 @@ struct VMStateDescription {
      */
     bool precreate;
 
+    /*
+     * This VMSD is a factory or a factory object.
+     */
+    bool factory;
+
     int version_id;
     int minimum_version_id;
     MigrationPriority priority;
@@ -1228,6 +1233,17 @@ bool vmstate_section_needed(const VMStateDescription *vmsd, void *opaque);
 typedef char (VMStateId)[256];
 
 #define  VMSTATE_INSTANCE_ID_ANY  -1
+#define VMSTATE_INSTANCE_ID_FACTORY -2
+
+#include "qemu/queue.h"
+
+typedef struct FactoryObject {
+    char *factory_name;
+    char *instance_name;
+    int instance_id;
+    void *opaque;
+    QLIST_ENTRY(FactoryObject) next;
+} FactoryObject;
 
 /* Returns: 0 on success, -1 on failure */
 int vmstate_register_with_alias_id(VMStateIf *obj, uint32_t instance_id,
@@ -1266,6 +1282,10 @@ static void __attribute__((constructor)) vmstate_register_ ## _vmsd(void)   \
     vmstate_register_init_add(_obj, _id, &_vmsd, _opaque);                  \
 }
 
+#define vmstate_register_init_factory(_vmsd, _type)                         \
+    vmstate_register_init(NULL, VMSTATE_INSTANCE_ID_FACTORY,                \
+                          _vmsd, (void *)sizeof(_type))
+
 void vmstate_register_init_add(VMStateIf *obj, int instance_id,
                                const VMStateDescription *vmsd, void *opaque);
 
@@ -1301,8 +1321,20 @@ static inline int vmstate_register_any(VMStateIf *obj,
 }
 
 /**
+ * vmstate_register_factory() - register a factory name and size, needed to
+ * recognize incoming factory objects.
+ */
+static inline int vmstate_register_factory(const VMStateDescription *vmsd,
+                                           long size)
+{
+    return vmstate_register_with_alias_id(NULL, VMSTATE_INSTANCE_ID_FACTORY,
+                                          vmsd, (void *)size, -1, 0, NULL,
+                                          NULL);
+}
+
+/**
  * vmstate_register_named() - pass an instance_name explicitly instead of
- * implicitly via VMStateIf get_id().  Needed to register a instance-specific
+ * implicitly via VMStateIf get_id().  Needed to register an instance-specific
  * VMSD for objects that are not Objects.
  */
 static inline int vmstate_register_named(const char *instance_name,
@@ -1332,4 +1364,36 @@ void vmstate_register_ram_global(struct MemoryRegion *memory);
 
 bool vmstate_check_only_migratable(const VMStateDescription *vmsd);
 
+/*
+ * Add to the factory object list, called during loadvm.
+ */
+void vmstate_add_factory_object(const char *factory_name,
+                                const char *instance_name,
+                                int instance_id,
+                                void *opaque);
+
+/*
+ * Search for and return a factory object.
+ */
+void *vmstate_find_factory_object(const char *factory_name,
+                                  const char *instance_name,
+                                  int instance_id);
+
+/*
+ * Search for and return a factory object, removing it from the list.
+ */
+void *vmstate_claim_factory_object(const char *factory_name,
+                                   const char *instance_name,
+                                   int instance_id);
+
+typedef int (*vmstate_walk_factory_cb)(FactoryObject *obj, void *opaque);
+
+/*
+ * Search for registered factory objects (i.e., outgoing)
+ * and call cb passing opaque.
+ */
+int vmstate_walk_factory_outgoing(const char *factory_name,
+                                  vmstate_walk_factory_cb cb,
+                                  void *opaque);
+
 #endif
diff --git a/migration/meson.build b/migration/meson.build
index 50e7cb2..e667b40 100644
--- a/migration/meson.build
+++ b/migration/meson.build
@@ -5,6 +5,7 @@ migration_files = files(
   'xbzrle.c',
   'vmstate-types.c',
   'vmstate.c',
+  'vmstate-factory.c',
   'qemu-file.c',
   'yank_functions.c',
 )
diff --git a/migration/savevm.c b/migration/savevm.c
index ec48da9..01ed78c 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -1407,7 +1407,8 @@ int qemu_savevm_precreate_save(QEMUFile *f, Error **errp)
     qemu_put_be32(f, QEMU_VM_FILE_VERSION);
 
     SAVEVM_FOREACH_ALL(se, entry) {
-        if (se->vmsd && se->vmsd->precreate) {
+        if (se->vmsd && se->vmsd->precreate &&
+            se->instance_id != VMSTATE_INSTANCE_ID_FACTORY) {
             ret = vmstate_save(f, se, NULL, errp);
             if (ret) {
                 qemu_file_set_error(f, ret);
@@ -1951,6 +1952,45 @@ static SaveStateEntry *find_se(const char *idstr, uint32_t instance_id)
     return NULL;
 }
 
+int vmstate_walk_factory_outgoing(const char *factory_name,
+                                  vmstate_walk_factory_cb cb, void *cb_data)
+{
+    SaveStateEntry *se, *new_se;
+    int ret, instance_len;
+    FactoryObject obj;
+    VMStateId idstr;
+    char *se_factory_name;
+
+    SAVEVM_FOREACH_SAFE_ALL(se, entry, new_se) {
+        if (!se->vmsd || !se->vmsd->factory) {
+            continue;
+        }
+        if (se->instance_id == VMSTATE_INSTANCE_ID_FACTORY) {
+            /* This is the factory itself, not a generated instance */
+            continue;
+        }
+
+        se_factory_name = strrchr(se->idstr, '/');
+        if (factory_name && strcmp(se_factory_name + 1, factory_name)) {
+            continue;
+        }
+
+        strcpy(idstr, se->idstr);
+        instance_len = se_factory_name - se->idstr;
+        idstr[instance_len] = 0;
+        obj.factory_name = idstr + instance_len + 1;
+        obj.instance_name = idstr;
+        obj.instance_id = se->instance_id;
+        obj.opaque = se->opaque;
+
+        ret = cb(&obj, cb_data);
+        if (ret) {
+            return ret;
+        }
+    }
+    return 0;
+}
+
 enum LoadVMExitCodes {
     /* Allow a command to quit all layers of nested loadvm loops */
     LOADVM_QUIT     =  1,
@@ -2721,8 +2761,9 @@ qemu_loadvm_section_start_full(QEMUFile *f, uint8_t type)
     bool trace_downtime = (type == QEMU_VM_SECTION_FULL);
     uint32_t instance_id, version_id, section_id;
     int64_t start_ts, end_ts;
-    SaveStateEntry *se;
-    VMStateId idstr;
+    SaveStateEntry *se, new_se;
+    VMStateId idstr, instance_name;
+    char *factory_name = NULL;
     int ret;
 
     /* Read section start */
@@ -2744,8 +2785,22 @@ qemu_loadvm_section_start_full(QEMUFile *f, uint8_t type)
 
     trace_qemu_loadvm_state_section_startfull(section_id, idstr,
             instance_id, version_id);
+
     /* Find savevm section */
     se = find_se(idstr, instance_id);
+
+    if (se == NULL) {
+        pstrcpy(instance_name, sizeof(idstr), idstr);
+        factory_name = strrchr(instance_name, '/');
+        if (factory_name) {
+            *factory_name++ = 0;
+            se = find_se(factory_name, VMSTATE_INSTANCE_ID_FACTORY);
+            new_se = *se;
+            new_se.opaque = g_malloc((long)se->opaque);
+            se = &new_se;
+        }
+    }
+
     if (se == NULL) {
         error_report("Unknown savevm section or instance '%s' %"PRIu32". "
                      "Make sure that your current VM setup matches your "
@@ -2780,6 +2835,11 @@ qemu_loadvm_section_start_full(QEMUFile *f, uint8_t type)
         return ret;
     }
 
+    if (factory_name) {
+        vmstate_add_factory_object(factory_name, instance_name, instance_id,
+                                   se->opaque);
+    }
+
     if (trace_downtime) {
         end_ts = qemu_clock_get_us(QEMU_CLOCK_REALTIME);
         trace_vmstate_downtime_load("non-iterable", se->idstr,
diff --git a/migration/trace-events b/migration/trace-events
index 1e23238..3b9c292 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -72,6 +72,11 @@ vmstate_subsection_save_loop(const char *name, const char *sub) "%s/%s"
 vmstate_subsection_save_top(const char *idstr) "%s"
 vmstate_field_exists(const char *vmsd, const char *name, int field_version, int version, int result) "%s:%s field_version %d version %d result %d"
 
+# vmstate-factory.c
+vmstate_add_factory_object(const char *factory_name, const char *idstr, int instance_id, void *opaque) "%s, %s, %d, %p"
+vmstate_find_factory_object(const char *factory_name, const char *instance_name, int instance_id, void *opaque) "%s, %s, %d -> %p"
+vmstate_claim_factory_object(const char *factory_name, const char *instance_name, int instance_id, void *opaque) "%s, %s, %d -> %p"
+
 # vmstate-types.c
 get_qtailq(const char *name, int version_id) "%s v%d"
 get_qtailq_end(const char *name, const char *reason, int val) "%s %s/%d"
diff --git a/migration/vmstate-factory.c b/migration/vmstate-factory.c
new file mode 100644
index 0000000..e425666
--- /dev/null
+++ b/migration/vmstate-factory.c
@@ -0,0 +1,78 @@
+/*
+ * Copyright (c) 2024 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+#include "qemu/osdep.h"
+#include "migration/vmstate.h"
+#include "trace.h"
+
+static QLIST_HEAD(, FactoryObject) factory_objects;
+
+void vmstate_add_factory_object(const char *factory_name,
+                                const char *instance_name,
+                                int instance_id,
+                                void *opaque)
+{
+    FactoryObject *obj = g_new0(FactoryObject, 1);
+
+    obj->opaque = opaque;
+    obj->factory_name = g_strdup(factory_name);
+    obj->instance_name = g_strdup(instance_name);
+    obj->instance_id = instance_id;
+    QLIST_INSERT_HEAD(&factory_objects, obj, next);
+    trace_vmstate_add_factory_object(factory_name, instance_name, instance_id,
+                                     opaque);
+
+}
+
+#define object_match(obj, _factory_name, _instance_name, _instance_id) \
+    (!strcmp(obj->factory_name, _factory_name) &&                      \
+     !strcmp(obj->instance_name, _instance_name) &&                    \
+     obj->instance_id == _instance_id)
+
+static FactoryObject *find_object(const char *factory_name,
+                                  const char *instance_name,
+                                  int instance_id)
+{
+    FactoryObject *obj;
+
+    QLIST_FOREACH(obj, &factory_objects, next) {
+        if (object_match(obj, factory_name, instance_name, instance_id)) {
+            return obj;
+        }
+    }
+
+    return NULL;
+}
+
+void *vmstate_find_factory_object(const char *factory_name,
+                                  const char *instance_name,
+                                  int instance_id)
+{
+    FactoryObject *obj = find_object(factory_name, instance_name, instance_id);
+    void *opaque = obj ? obj->opaque : NULL;
+
+    trace_vmstate_find_factory_object(factory_name, instance_name, instance_id,
+                                      opaque);
+    return opaque;
+}
+
+void *vmstate_claim_factory_object(const char *factory_name,
+                                   const char *instance_name,
+                                   int instance_id)
+{
+    FactoryObject *obj = find_object(factory_name, instance_name, instance_id);
+    void *opaque = obj ? obj->opaque : NULL;
+
+    if (obj) {
+        g_free(obj->factory_name);
+        g_free(obj->instance_name);
+        QLIST_REMOVE(obj, next);
+    }
+
+    trace_vmstate_claim_factory_object(factory_name, instance_name, instance_id,
+                                       opaque);
+    return opaque;
+}
diff --git a/stubs/vmstate.c b/stubs/vmstate.c
index eff8be4..0e977e2 100644
--- a/stubs/vmstate.c
+++ b/stubs/vmstate.c
@@ -24,6 +24,12 @@ void vmstate_unregister_named(const char *vmsd_name,
 {
 }
 
+int vmstate_walk_factory_outgoing(const char *factory_name,
+                                  vmstate_walk_factory_cb cb, void *cb_data)
+{
+    return 1;
+}
+
 bool vmstate_check_only_migratable(const VMStateDescription *vmsd)
 {
     return true;
-- 
1.8.3.1




* [PATCH V1 13/26] physmem: ram_block_create
  2024-04-29 15:55 [PATCH V1 00/26] Live update: cpr-exec Steve Sistare
                   ` (11 preceding siblings ...)
  2024-04-29 15:55 ` [PATCH V1 12/26] migration: vmstate factory object Steve Sistare
@ 2024-04-29 15:55 ` Steve Sistare
  2024-05-13 18:37   ` Fabiano Rosas
  2024-04-29 15:55 ` [PATCH V1 14/26] physmem: hoist guest_memfd creation Steve Sistare
                   ` (16 subsequent siblings)
  29 siblings, 1 reply; 122+ messages in thread
From: Steve Sistare @ 2024-04-29 15:55 UTC (permalink / raw)
  To: qemu-devel
  Cc: Peter Xu, Fabiano Rosas, David Hildenbrand, Igor Mammedov,
	Eduardo Habkost, Marcel Apfelbaum, Philippe Mathieu-Daude,
	Paolo Bonzini, Daniel P. Berrange, Markus Armbruster,
	Steve Sistare

Create a common subroutine to allocate a RAMBlock, de-duping the code to
populate its common fields.  Add a trace point for good measure.
No functional change.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 system/physmem.c    | 47 ++++++++++++++++++++++++++---------------------
 system/trace-events |  3 +++
 2 files changed, 29 insertions(+), 21 deletions(-)

diff --git a/system/physmem.c b/system/physmem.c
index c3d04ca..6216b14 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -52,6 +52,7 @@
 #include "sysemu/hw_accel.h"
 #include "sysemu/xen-mapcache.h"
 #include "trace/trace-root.h"
+#include "trace.h"
 
 #ifdef CONFIG_FALLOCATE_PUNCH_HOLE
 #include <linux/falloc.h>
@@ -1918,11 +1919,29 @@ out_free:
     }
 }
 
+static RAMBlock *ram_block_create(MemoryRegion *mr, ram_addr_t size,
+                                  ram_addr_t max_size, uint32_t ram_flags)
+{
+    RAMBlock *rb = g_malloc0(sizeof(*rb));
+
+    rb->used_length = size;
+    rb->max_length = max_size;
+    rb->fd = -1;
+    rb->flags = ram_flags;
+    rb->page_size = qemu_real_host_page_size();
+    rb->mr = mr;
+    rb->guest_memfd = -1;
+    trace_ram_block_create(rb->idstr, rb->flags, rb->fd, rb->used_length,
+                           rb->max_length, mr->align);
+    return rb;
+}
+
 #ifdef CONFIG_POSIX
 RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, MemoryRegion *mr,
                                  uint32_t ram_flags, int fd, off_t offset,
                                  Error **errp)
 {
+    void *host;
     RAMBlock *new_block;
     Error *local_err = NULL;
     int64_t file_size, file_align;
@@ -1962,19 +1981,14 @@ RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, MemoryRegion *mr,
         return NULL;
     }
 
-    new_block = g_malloc0(sizeof(*new_block));
-    new_block->mr = mr;
-    new_block->used_length = size;
-    new_block->max_length = size;
-    new_block->flags = ram_flags;
-    new_block->guest_memfd = -1;
-    new_block->host = file_ram_alloc(new_block, size, fd, !file_size, offset,
-                                     errp);
-    if (!new_block->host) {
+    new_block = ram_block_create(mr, size, size, ram_flags);
+    host = file_ram_alloc(new_block, size, fd, !file_size, offset, errp);
+    if (!host) {
         g_free(new_block);
         return NULL;
     }
 
+    new_block->host = host;
     ram_block_add(new_block, &local_err);
     if (local_err) {
         g_free(new_block);
@@ -1982,7 +1996,6 @@ RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, MemoryRegion *mr,
         return NULL;
     }
     return new_block;
-
 }
 
 
@@ -2054,18 +2067,10 @@ RAMBlock *qemu_ram_alloc_internal(ram_addr_t size, ram_addr_t max_size,
     align = MAX(align, TARGET_PAGE_SIZE);
     size = ROUND_UP(size, align);
     max_size = ROUND_UP(max_size, align);
-
-    new_block = g_malloc0(sizeof(*new_block));
-    new_block->mr = mr;
-    new_block->resized = resized;
-    new_block->used_length = size;
-    new_block->max_length = max_size;
     assert(max_size >= size);
-    new_block->fd = -1;
-    new_block->guest_memfd = -1;
-    new_block->page_size = qemu_real_host_page_size();
-    new_block->host = host;
-    new_block->flags = ram_flags;
+    new_block = ram_block_create(mr, size, max_size, ram_flags);
+    new_block->resized = resized;
+
     ram_block_add(new_block, &local_err);
     if (local_err) {
         g_free(new_block);
diff --git a/system/trace-events b/system/trace-events
index 69c9044..f0a80ba 100644
--- a/system/trace-events
+++ b/system/trace-events
@@ -38,3 +38,6 @@ dirtylimit_state_finalize(void)
 dirtylimit_throttle_pct(int cpu_index, uint64_t pct, int64_t time_us) "CPU[%d] throttle percent: %" PRIu64 ", throttle adjust time %"PRIi64 " us"
 dirtylimit_set_vcpu(int cpu_index, uint64_t quota) "CPU[%d] set dirty page rate limit %"PRIu64
 dirtylimit_vcpu_execute(int cpu_index, int64_t sleep_time_us) "CPU[%d] sleep %"PRIi64 " us"
+
+# physmem.c
+ram_block_create(const char *name, uint32_t flags, int fd, size_t used_length, size_t max_length, size_t align) "%s, flags %u, fd %d, len %zu, maxlen %zu, align %zu"
-- 
1.8.3.1




* [PATCH V1 14/26] physmem: hoist guest_memfd creation
  2024-04-29 15:55 [PATCH V1 00/26] Live update: cpr-exec Steve Sistare
                   ` (12 preceding siblings ...)
  2024-04-29 15:55 ` [PATCH V1 13/26] physmem: ram_block_create Steve Sistare
@ 2024-04-29 15:55 ` Steve Sistare
  2024-04-29 15:55 ` [PATCH V1 15/26] physmem: hoist host memory allocation Steve Sistare
                   ` (15 subsequent siblings)
  29 siblings, 0 replies; 122+ messages in thread
From: Steve Sistare @ 2024-04-29 15:55 UTC (permalink / raw)
  To: qemu-devel
  Cc: Peter Xu, Fabiano Rosas, David Hildenbrand, Igor Mammedov,
	Eduardo Habkost, Marcel Apfelbaum, Philippe Mathieu-Daude,
	Paolo Bonzini, Daniel P. Berrange, Markus Armbruster,
	Steve Sistare

Do not modify RAMBlocks in ram_block_add.  The block should be fully
formed before calling ram_block_add to add it to the block list.  This
simplifies error handling and makes the code more modular.

Start by hoisting guest_memfd creation to the call sites.
No functional change.
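
The intended shape, "create the block fully, then add it", can be sketched
in plain C as follows.  This is a generic illustration and not QEMU code:
Block, acquire_resource, block_create, block_destroy_resource and block_add
are hypothetical names, and the integer resource only mimics how guest_memfd
is held (-1 when absent).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

typedef struct Block {
    int resource;           /* -1 when not held, like guest_memfd */
    struct Block *next;
} Block;

static Block *block_list;
static int resources_held;

/* Hypothetical resource acquisition; fails when 'fail' is set. */
int acquire_resource(bool fail)
{
    if (fail) {
        return -1;
    }
    return ++resources_held;
}

/* Release the resource if the block holds one. */
void block_destroy_resource(Block *b)
{
    if (b->resource >= 0) {
        resources_held--;
        b->resource = -1;
    }
}

/*
 * Build a fully formed block.  Every step that can fail happens here, so
 * the caller sees either a complete block or NULL with nothing to undo.
 */
Block *block_create(bool want_resource, bool fail)
{
    Block *b = calloc(1, sizeof(*b));

    if (!b) {
        return NULL;
    }
    b->resource = -1;
    if (want_resource) {
        b->resource = acquire_resource(fail);
        if (b->resource < 0) {
            free(b);
            return NULL;
        }
    }
    return b;
}

/* Only links the block into the list; it does not modify the block's state. */
void block_add(Block *b)
{
    b->next = block_list;
    block_list = b;
}
```

With this split, a call site that fails after block_create only has to call
block_destroy_resource and free the block, and the add step needs no
resource-creation error path of its own.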

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 system/physmem.c | 85 ++++++++++++++++++++++++++++++++------------------------
 1 file changed, 48 insertions(+), 37 deletions(-)

diff --git a/system/physmem.c b/system/physmem.c
index 6216b14..ffcf012 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -1803,13 +1803,34 @@ static void dirty_memory_extend(ram_addr_t old_ram_size,
     }
 }
 
+static int ram_block_create_guest_memfd(RAMBlock *rb, Error **errp)
+{
+    assert(kvm_enabled());
+
+    if (ram_block_discard_require(true) < 0) {
+        error_setg_errno(errp, errno,
+            "cannot set up private guest memory: discard currently blocked");
+        error_append_hint(errp, "Are you using assigned devices?\n");
+        return -1;
+    }
+
+    return kvm_create_guest_memfd(rb->max_length, 0, errp);
+}
+
+static void ram_block_destroy_guest_memfd(RAMBlock *rb)
+{
+    if (rb->guest_memfd >= 0) {
+        close(rb->guest_memfd);
+        ram_block_discard_require(false);
+    }
+}
+
 static void ram_block_add(RAMBlock *new_block, Error **errp)
 {
     const bool noreserve = qemu_ram_is_noreserve(new_block);
     const bool shared = qemu_ram_is_shared(new_block);
     RAMBlock *block;
     RAMBlock *last_block = NULL;
-    bool free_on_error = false;
     ram_addr_t old_ram_size, new_ram_size;
     Error *err = NULL;
 
@@ -1839,26 +1860,6 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
                 return;
             }
             memory_try_enable_merging(new_block->host, new_block->max_length);
-            free_on_error = true;
-        }
-    }
-
-    if (new_block->flags & RAM_GUEST_MEMFD) {
-        assert(kvm_enabled());
-        assert(new_block->guest_memfd < 0);
-
-        if (ram_block_discard_require(true) < 0) {
-            error_setg_errno(errp, errno,
-                             "cannot set up private guest memory: discard currently blocked");
-            error_append_hint(errp, "Are you using assigned devices?\n");
-            goto out_free;
-        }
-
-        new_block->guest_memfd = kvm_create_guest_memfd(new_block->max_length,
-                                                        0, errp);
-        if (new_block->guest_memfd < 0) {
-            qemu_mutex_unlock_ramlist();
-            goto out_free;
         }
     }
 
@@ -1910,17 +1911,11 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
         ram_block_notify_add(new_block->host, new_block->used_length,
                              new_block->max_length);
     }
-    return;
-
-out_free:
-    if (free_on_error) {
-        qemu_anon_ram_free(new_block->host, new_block->max_length);
-        new_block->host = NULL;
-    }
 }
 
 static RAMBlock *ram_block_create(MemoryRegion *mr, ram_addr_t size,
-                                  ram_addr_t max_size, uint32_t ram_flags)
+                                  ram_addr_t max_size, uint32_t ram_flags,
+                                  Error **errp)
 {
     RAMBlock *rb = g_malloc0(sizeof(*rb));
 
@@ -1930,7 +1925,17 @@ static RAMBlock *ram_block_create(MemoryRegion *mr, ram_addr_t size,
     rb->flags = ram_flags;
     rb->page_size = qemu_real_host_page_size();
     rb->mr = mr;
-    rb->guest_memfd = -1;
+
+    if (ram_flags & RAM_GUEST_MEMFD) {
+        rb->guest_memfd = ram_block_create_guest_memfd(rb, errp);
+        if (rb->guest_memfd < 0) {
+            g_free(rb);
+            return NULL;
+        }
+    } else {
+        rb->guest_memfd = -1;
+    }
+
     trace_ram_block_create(rb->idstr, rb->flags, rb->fd, rb->used_length,
                            rb->max_length, mr->align);
     return rb;
@@ -1981,9 +1986,14 @@ RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, MemoryRegion *mr,
         return NULL;
     }
 
-    new_block = ram_block_create(mr, size, size, ram_flags);
+    new_block = ram_block_create(mr, size, size, ram_flags, errp);
+    if (!new_block) {
+        return NULL;
+    }
+
     host = file_ram_alloc(new_block, size, fd, !file_size, offset, errp);
     if (!host) {
+        ram_block_destroy_guest_memfd(new_block);
         g_free(new_block);
         return NULL;
     }
@@ -2068,11 +2078,16 @@ RAMBlock *qemu_ram_alloc_internal(ram_addr_t size, ram_addr_t max_size,
     size = ROUND_UP(size, align);
     max_size = ROUND_UP(max_size, align);
     assert(max_size >= size);
-    new_block = ram_block_create(mr, size, max_size, ram_flags);
+    new_block = ram_block_create(mr, size, max_size, ram_flags, errp);
+    if (!new_block) {
+        return NULL;
+    }
     new_block->resized = resized;
 
+    new_block->host = host;
     ram_block_add(new_block, &local_err);
     if (local_err) {
+        ram_block_destroy_guest_memfd(new_block);
         g_free(new_block);
         error_propagate(errp, local_err);
         return NULL;
@@ -2119,11 +2134,7 @@ static void reclaim_ramblock(RAMBlock *block)
         qemu_anon_ram_free(block->host, block->max_length);
     }
 
-    if (block->guest_memfd >= 0) {
-        close(block->guest_memfd);
-        ram_block_discard_require(false);
-    }
-
+    ram_block_destroy_guest_memfd(block);
     g_free(block);
 }
 
-- 
1.8.3.1




* [PATCH V1 15/26] physmem: hoist host memory allocation
  2024-04-29 15:55 [PATCH V1 00/26] Live update: cpr-exec Steve Sistare
                   ` (13 preceding siblings ...)
  2024-04-29 15:55 ` [PATCH V1 14/26] physmem: hoist guest_memfd creation Steve Sistare
@ 2024-04-29 15:55 ` Steve Sistare
  2024-04-29 15:55 ` [PATCH V1 16/26] physmem: set ram block idstr earlier Steve Sistare
                   ` (14 subsequent siblings)
  29 siblings, 0 replies; 122+ messages in thread
From: Steve Sistare @ 2024-04-29 15:55 UTC (permalink / raw)
  To: qemu-devel
  Cc: Peter Xu, Fabiano Rosas, David Hildenbrand, Igor Mammedov,
	Eduardo Habkost, Marcel Apfelbaum, Philippe Mathieu-Daude,
	Paolo Bonzini, Daniel P. Berrange, Markus Armbruster,
	Steve Sistare

Hoist host memory allocation from ram_block_add.
No functional change.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 system/physmem.c | 80 +++++++++++++++++++++++++-------------------------------
 1 file changed, 36 insertions(+), 44 deletions(-)

diff --git a/system/physmem.c b/system/physmem.c
index ffcf012..b57462d 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -1825,44 +1825,40 @@ static void ram_block_destroy_guest_memfd(RAMBlock *rb)
     }
 }
 
-static void ram_block_add(RAMBlock *new_block, Error **errp)
+static void *ram_block_alloc_host(RAMBlock *rb, Error **errp)
+{
+    struct MemoryRegion *mr = rb->mr;
+    uint8_t *host = NULL;
+
+    if (xen_enabled()) {
+        xen_ram_alloc(rb->offset, rb->max_length, mr, errp);
+
+    } else {
+        host = qemu_anon_ram_alloc(rb->max_length, &mr->align,
+                                   qemu_ram_is_shared(rb),
+                                   qemu_ram_is_noreserve(rb));
+        if (!host) {
+            error_setg_errno(errp, errno, "cannot set up guest memory '%s'",
+                             rb->idstr);
+        }
+    }
+
+    if (host) {
+        memory_try_enable_merging(host, rb->max_length);
+    }
+    return host;
+}
+
+static void ram_block_add(RAMBlock *new_block)
 {
-    const bool noreserve = qemu_ram_is_noreserve(new_block);
-    const bool shared = qemu_ram_is_shared(new_block);
     RAMBlock *block;
     RAMBlock *last_block = NULL;
     ram_addr_t old_ram_size, new_ram_size;
-    Error *err = NULL;
-
     old_ram_size = last_ram_page();
 
     qemu_mutex_lock_ramlist();
     new_block->offset = find_ram_offset(new_block->max_length);
 
-    if (!new_block->host) {
-        if (xen_enabled()) {
-            xen_ram_alloc(new_block->offset, new_block->max_length,
-                          new_block->mr, &err);
-            if (err) {
-                error_propagate(errp, err);
-                qemu_mutex_unlock_ramlist();
-                return;
-            }
-        } else {
-            new_block->host = qemu_anon_ram_alloc(new_block->max_length,
-                                                  &new_block->mr->align,
-                                                  shared, noreserve);
-            if (!new_block->host) {
-                error_setg_errno(errp, errno,
-                                 "cannot set up guest memory '%s'",
-                                 memory_region_name(new_block->mr));
-                qemu_mutex_unlock_ramlist();
-                return;
-            }
-            memory_try_enable_merging(new_block->host, new_block->max_length);
-        }
-    }
-
     new_ram_size = MAX(old_ram_size,
               (new_block->offset + new_block->max_length) >> TARGET_PAGE_BITS);
     if (new_ram_size > old_ram_size) {
@@ -1948,7 +1944,6 @@ RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, MemoryRegion *mr,
 {
     void *host;
     RAMBlock *new_block;
-    Error *local_err = NULL;
     int64_t file_size, file_align;
 
     /* Just support these ram flags by now. */
@@ -1999,12 +1994,7 @@ RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, MemoryRegion *mr,
     }
 
     new_block->host = host;
-    ram_block_add(new_block, &local_err);
-    if (local_err) {
-        g_free(new_block);
-        error_propagate(errp, local_err);
-        return NULL;
-    }
+    ram_block_add(new_block);
     return new_block;
 }
 
@@ -2066,7 +2056,6 @@ RAMBlock *qemu_ram_alloc_internal(ram_addr_t size, ram_addr_t max_size,
                                   MemoryRegion *mr, Error **errp)
 {
     RAMBlock *new_block;
-    Error *local_err = NULL;
     int align;
 
     assert((ram_flags & ~(RAM_SHARED | RAM_RESIZEABLE | RAM_PREALLOC |
@@ -2084,14 +2073,17 @@ RAMBlock *qemu_ram_alloc_internal(ram_addr_t size, ram_addr_t max_size,
     }
     new_block->resized = resized;
 
-    new_block->host = host;
-    ram_block_add(new_block, &local_err);
-    if (local_err) {
-        ram_block_destroy_guest_memfd(new_block);
-        g_free(new_block);
-        error_propagate(errp, local_err);
-        return NULL;
+    if (!host) {
+        host = ram_block_alloc_host(new_block, errp);
+        if (!host) {
+            ram_block_destroy_guest_memfd(new_block);
+            g_free(new_block);
+            return NULL;
+        }
     }
+
+    new_block->host = host;
+    ram_block_add(new_block);
     return new_block;
 }
 
-- 
1.8.3.1




* [PATCH V1 16/26] physmem: set ram block idstr earlier
  2024-04-29 15:55 [PATCH V1 00/26] Live update: cpr-exec Steve Sistare
                   ` (14 preceding siblings ...)
  2024-04-29 15:55 ` [PATCH V1 15/26] physmem: hoist host memory allocation Steve Sistare
@ 2024-04-29 15:55 ` Steve Sistare
  2024-04-29 15:55 ` [PATCH V1 17/26] machine: memfd-alloc option Steve Sistare
                   ` (13 subsequent siblings)
  29 siblings, 0 replies; 122+ messages in thread
From: Steve Sistare @ 2024-04-29 15:55 UTC (permalink / raw)
  To: qemu-devel
  Cc: Peter Xu, Fabiano Rosas, David Hildenbrand, Igor Mammedov,
	Eduardo Habkost, Marcel Apfelbaum, Philippe Mathieu-Daude,
	Paolo Bonzini, Daniel P. Berrange, Markus Armbruster,
	Steve Sistare

Set the idstr for a ram block earlier, prior to calling ram_block_add,
so it can be used in a subsequent patch to find CPR attributes for the
block before it is created.

The id depends on the block's device path and its mr.  As a sanity check,
verify that the id has not changed (due to these dependencies changing)
by the time vmstate_register_ram is called (where the id was previously
assigned).
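
The "compute early, verify later" idea can be sketched in plain C as
follows.  This is an illustration only: make_idstr and verify_idstr are
hypothetical helpers, not the functions in this patch, and the id format
"<dev path>/<name>" (or just "<name>" without a device) is an assumption
drawn from the snprintf call in the code being replaced.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define IDSTR_LEN 256

/* Build "<dev_path>/<name>", or just "<name>" when there is no device path. */
void make_idstr(char *idstr, size_t len, const char *dev_path,
                const char *name)
{
    if (dev_path) {
        snprintf(idstr, len, "%s/%s", dev_path, name);
    } else {
        snprintf(idstr, len, "%s", name);
    }
}

/*
 * Recompute the id from the current device path and name, and compare it
 * with the id stored at creation time.  Returns 1 if it is unchanged.
 */
int verify_idstr(const char *stored, const char *dev_path, const char *name)
{
    char now[IDSTR_LEN];

    make_idstr(now, sizeof(now), dev_path, name);
    return strcmp(stored, now) == 0;
}
```

A caller would store the result of make_idstr when the block is created and
check verify_idstr at registration time; a mismatch means one of the inputs
changed in between.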

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 include/exec/cpu-common.h |  3 +--
 migration/savevm.c        |  4 +---
 system/physmem.c          | 46 +++++++++++++++++++++++-----------------------
 3 files changed, 25 insertions(+), 28 deletions(-)

diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
index 6d53188..ffab5d9 100644
--- a/include/exec/cpu-common.h
+++ b/include/exec/cpu-common.h
@@ -82,8 +82,7 @@ RAMBlock *qemu_ram_block_by_name(const char *name);
 RAMBlock *qemu_ram_block_from_host(void *ptr, bool round_offset,
                                    ram_addr_t *offset);
 ram_addr_t qemu_ram_block_host_offset(RAMBlock *rb, void *host);
-void qemu_ram_set_idstr(RAMBlock *block, const char *name, DeviceState *dev);
-void qemu_ram_unset_idstr(RAMBlock *block);
+void qemu_ram_verify_idstr(RAMBlock *block, DeviceState *dev);
 const char *qemu_ram_get_idstr(RAMBlock *rb);
 void *qemu_ram_get_host_addr(RAMBlock *rb);
 ram_addr_t qemu_ram_get_offset(RAMBlock *rb);
diff --git a/migration/savevm.c b/migration/savevm.c
index 01ed78c..8463ddf 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -3566,14 +3566,12 @@ bool delete_snapshot(const char *name, bool has_devices,
 
 void vmstate_register_ram(MemoryRegion *mr, DeviceState *dev)
 {
-    qemu_ram_set_idstr(mr->ram_block,
-                       memory_region_name(mr), dev);
+    qemu_ram_verify_idstr(mr->ram_block, dev);
     qemu_ram_set_migratable(mr->ram_block);
 }
 
 void vmstate_unregister_ram(MemoryRegion *mr, DeviceState *dev)
 {
-    qemu_ram_unset_idstr(mr->ram_block);
     qemu_ram_unset_migratable(mr->ram_block);
 }
 
diff --git a/system/physmem.c b/system/physmem.c
index b57462d..c736af5 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -1597,35 +1597,20 @@ int qemu_ram_get_fd(RAMBlock *rb)
 }
 
 /* Called with the BQL held.  */
-void qemu_ram_set_idstr(RAMBlock *new_block, const char *name, DeviceState *dev)
+static void qemu_ram_set_idstr(char *idstr, MemoryRegion *mr, DeviceState *dev)
 {
-    RAMBlock *block;
+    const char *name = memory_region_name(mr);
+    g_autofree char *id = dev ? qdev_get_dev_path(dev) : NULL;
 
-    assert(new_block);
-    assert(!new_block->idstr[0]);
-
-    if (dev) {
-        char *id = qdev_get_dev_path(dev);
-        if (id) {
-            snprintf(new_block->idstr, sizeof(new_block->idstr), "%s/", id);
-            g_free(id);
-        }
-    }
-    pstrcat(new_block->idstr, sizeof(new_block->idstr), name);
-
-    RCU_READ_LOCK_GUARD();
-    RAMBLOCK_FOREACH(block) {
-        if (block != new_block &&
-            !strcmp(block->idstr, new_block->idstr)) {
-            fprintf(stderr, "RAMBlock \"%s\" already registered, abort!\n",
-                    new_block->idstr);
-            abort();
-        }
+    if (id) {
+        snprintf(idstr, sizeof(VMStateId), "%s/%s", id, name);
+    } else {
+        pstrcpy(idstr, sizeof(VMStateId), name);
     }
 }
 
 /* Called with the BQL held.  */
-void qemu_ram_unset_idstr(RAMBlock *block)
+static void qemu_ram_unset_idstr(RAMBlock *block)
 {
     /* FIXME: arch_init.c assumes that this is not called throughout
      * migration.  Ignore the problem since hot-unplug during migration
@@ -1636,6 +1621,13 @@ void qemu_ram_unset_idstr(RAMBlock *block)
     }
 }
 
+void qemu_ram_verify_idstr(RAMBlock *new_block, DeviceState *dev)
+{
+    VMStateId idstr;
+    qemu_ram_set_idstr(idstr, new_block->mr, dev);
+    assert(!strcmp(new_block->idstr, idstr));
+}
+
 size_t qemu_ram_pagesize(RAMBlock *rb)
 {
     return rb->page_size;
@@ -1869,6 +1861,12 @@ static void ram_block_add(RAMBlock *new_block)
      * tail, so save the last element in last_block.
      */
     RAMBLOCK_FOREACH(block) {
+        if (!strcmp(block->idstr, new_block->idstr)) {
+            fprintf(stderr, "RAMBlock \"%s\" already added, abort!\n",
+                    new_block->idstr);
+            abort();
+        }
+
         last_block = block;
         if (block->max_length < new_block->max_length) {
             break;
@@ -1915,6 +1913,7 @@ static RAMBlock *ram_block_create(MemoryRegion *mr, ram_addr_t size,
 {
     RAMBlock *rb = g_malloc0(sizeof(*rb));
 
+    qemu_ram_set_idstr(rb->idstr, mr, mr->dev);
     rb->used_length = size;
     rb->max_length = max_size;
     rb->fd = -1;
@@ -2142,6 +2141,7 @@ void qemu_ram_free(RAMBlock *block)
     }
 
     qemu_mutex_lock_ramlist();
+    qemu_ram_unset_idstr(block);
     QLIST_REMOVE_RCU(block, next);
     ram_list.mru_block = NULL;
     /* Write list before version */
-- 
1.8.3.1




* [PATCH V1 17/26] machine: memfd-alloc option
  2024-04-29 15:55 [PATCH V1 00/26] Live update: cpr-exec Steve Sistare
                   ` (15 preceding siblings ...)
  2024-04-29 15:55 ` [PATCH V1 16/26] physmem: set ram block idstr earlier Steve Sistare
@ 2024-04-29 15:55 ` Steve Sistare
  2024-05-28 21:12   ` Peter Xu
  2024-04-29 15:55 ` [PATCH V1 18/26] migration: cpr-exec-args parameter Steve Sistare
                   ` (12 subsequent siblings)
  29 siblings, 1 reply; 122+ messages in thread
From: Steve Sistare @ 2024-04-29 15:55 UTC (permalink / raw)
  To: qemu-devel
  Cc: Peter Xu, Fabiano Rosas, David Hildenbrand, Igor Mammedov,
	Eduardo Habkost, Marcel Apfelbaum, Philippe Mathieu-Daude,
	Paolo Bonzini, Daniel P. Berrange, Markus Armbruster,
	Steve Sistare

Allocate anonymous memory using memfd_create if the memfd-alloc machine
option is set.
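
As a rough, Linux-only illustration of what "anonymous memory using
memfd_create" means, the sketch below creates a memfd, sizes it, and
maps it shared.  It is not the QEMU code path: memfd_ram_alloc is a
hypothetical helper, it ignores alignment and error reporting, and it
passes MFD_CLOEXEC directly where the patch calls qemu_memfd_create and
then qemu_set_cloexec.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Create an anonymous memfd of the given size and map it shared.
 * On success, return the mapping and store the fd in *fd_out.
 * On failure, return NULL.
 */
static void *memfd_ram_alloc(const char *name, size_t size, int *fd_out)
{
    int fd = memfd_create(name, MFD_CLOEXEC);
    void *host;

    if (fd < 0) {
        return NULL;
    }
    if (ftruncate(fd, (off_t)size) < 0) {
        close(fd);
        return NULL;
    }
    host = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (host == MAP_FAILED) {
        close(fd);
        return NULL;
    }
    *fd_out = fd;
    return host;
}
```

Because the pages belong to the memfd rather than to a private
mapping, any later MAP_SHARED mapping of the same fd sees the same
contents, which is the property cpr-exec relies on.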

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 hw/core/machine.c   | 22 ++++++++++++++++++++++
 include/hw/boards.h |  1 +
 qemu-options.hx     |  6 ++++++
 system/memory.c     |  9 ++++++---
 system/physmem.c    | 18 +++++++++++++++++-
 system/trace-events |  1 +
 6 files changed, 53 insertions(+), 4 deletions(-)

diff --git a/hw/core/machine.c b/hw/core/machine.c
index 582c2df..9567b97 100644
--- a/hw/core/machine.c
+++ b/hw/core/machine.c
@@ -443,6 +443,20 @@ static void machine_set_mem_merge(Object *obj, bool value, Error **errp)
     ms->mem_merge = value;
 }
 
+static bool machine_get_memfd_alloc(Object *obj, Error **errp)
+{
+    MachineState *ms = MACHINE(obj);
+
+    return ms->memfd_alloc;
+}
+
+static void machine_set_memfd_alloc(Object *obj, bool value, Error **errp)
+{
+    MachineState *ms = MACHINE(obj);
+
+    ms->memfd_alloc = value;
+}
+
 static bool machine_get_usb(Object *obj, Error **errp)
 {
     MachineState *ms = MACHINE(obj);
@@ -1044,6 +1058,11 @@ static void machine_class_init(ObjectClass *oc, void *data)
     object_class_property_set_description(oc, "mem-merge",
         "Enable/disable memory merge support");
 
+    object_class_property_add_bool(oc, "memfd-alloc",
+        machine_get_memfd_alloc, machine_set_memfd_alloc);
+    object_class_property_set_description(oc, "memfd-alloc",
+        "Enable/disable allocating anonymous memory using memfd_create");
+
     object_class_property_add_bool(oc, "usb",
         machine_get_usb, machine_set_usb);
     object_class_property_set_description(oc, "usb",
@@ -1387,6 +1406,9 @@ static bool create_default_memdev(MachineState *ms, const char *path, Error **er
     if (!object_property_set_int(obj, "size", ms->ram_size, errp)) {
         goto out;
     }
+    if (!object_property_set_bool(obj, "share", ms->memfd_alloc, errp)) {
+        goto out;
+    }
     object_property_add_child(object_get_objects_root(), mc->default_ram_id,
                               obj);
     /* Ensure backend's memory region name is equal to mc->default_ram_id */
diff --git a/include/hw/boards.h b/include/hw/boards.h
index 69c1ba4..96259c3 100644
--- a/include/hw/boards.h
+++ b/include/hw/boards.h
@@ -372,6 +372,7 @@ struct MachineState {
     bool dump_guest_core;
     bool mem_merge;
     bool require_guest_memfd;
+    bool memfd_alloc;
     bool usb;
     bool usb_disabled;
     char *firmware;
diff --git a/qemu-options.hx b/qemu-options.hx
index cf61f6b..f0dfda5 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -32,6 +32,7 @@ DEF("machine", HAS_ARG, QEMU_OPTION_machine, \
     "                vmport=on|off|auto controls emulation of vmport (default: auto)\n"
     "                dump-guest-core=on|off include guest memory in a core dump (default=on)\n"
     "                mem-merge=on|off controls memory merge support (default: on)\n"
+    "                memfd-alloc=on|off controls allocating anonymous guest RAM using memfd_create (default: off)\n"
     "                aes-key-wrap=on|off controls support for AES key wrapping (default=on)\n"
     "                dea-key-wrap=on|off controls support for DEA key wrapping (default=on)\n"
     "                suppress-vmdesc=on|off disables self-describing migration (default=off)\n"
@@ -79,6 +80,11 @@ SRST
         supported by the host, de-duplicates identical memory pages
         among VMs instances (enabled by default).
 
+    ``memfd-alloc=on|off``
+        Enables or disables allocation of anonymous guest RAM using
+        memfd_create.  Any associated memory-backend objects are created with
+        share=on.  The memfd-alloc default is off.
+
     ``aes-key-wrap=on|off``
         Enables or disables AES key wrapping support on s390-ccw hosts.
         This feature controls whether AES wrapping keys will be created
diff --git a/system/memory.c b/system/memory.c
index 49f1cb2..ca04a0e 100644
--- a/system/memory.c
+++ b/system/memory.c
@@ -1552,8 +1552,9 @@ bool memory_region_init_ram_nomigrate(MemoryRegion *mr,
                                       uint64_t size,
                                       Error **errp)
 {
+    uint32_t flags = current_machine->memfd_alloc ? RAM_SHARED : 0;
     return memory_region_init_ram_flags_nomigrate(mr, owner, name,
-                                                  size, 0, errp);
+                                                  size, flags, errp);
 }
 
 bool memory_region_init_ram_flags_nomigrate(MemoryRegion *mr,
@@ -1713,8 +1714,9 @@ bool memory_region_init_rom_nomigrate(MemoryRegion *mr,
                                       uint64_t size,
                                       Error **errp)
 {
+    uint32_t flags = current_machine->memfd_alloc ? RAM_SHARED : 0;
     if (!memory_region_init_ram_flags_nomigrate(mr, owner, name,
-                                                size, 0, errp)) {
+                                                size, flags, errp)) {
          return false;
     }
     mr->readonly = true;
@@ -1731,6 +1733,7 @@ bool memory_region_init_rom_device_nomigrate(MemoryRegion *mr,
                                              Error **errp)
 {
     Error *err = NULL;
+    uint32_t flags = current_machine->memfd_alloc ? RAM_SHARED : 0;
     assert(ops);
     memory_region_init(mr, owner, name, size);
     mr->ops = ops;
@@ -1738,7 +1741,7 @@ bool memory_region_init_rom_device_nomigrate(MemoryRegion *mr,
     mr->terminates = true;
     mr->rom_device = true;
     mr->destructor = memory_region_destructor_ram;
-    mr->ram_block = qemu_ram_alloc(size, 0, mr, &err);
+    mr->ram_block = qemu_ram_alloc(size, flags, mr, &err);
     if (err) {
         mr->size = int128_zero();
         object_unparent(OBJECT(mr));
diff --git a/system/physmem.c b/system/physmem.c
index c736af5..36d97ec 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -45,6 +45,7 @@
 #include "qemu/qemu-print.h"
 #include "qemu/log.h"
 #include "qemu/memalign.h"
+#include "qemu/memfd.h"
 #include "exec/memory.h"
 #include "exec/ioport.h"
 #include "sysemu/dma.h"
@@ -1825,6 +1826,19 @@ static void *ram_block_alloc_host(RAMBlock *rb, Error **errp)
     if (xen_enabled()) {
         xen_ram_alloc(rb->offset, rb->max_length, mr, errp);
 
+    } else if (rb->flags & RAM_SHARED) {
+        if (rb->fd == -1) {
+            mr->align = QEMU_VMALLOC_ALIGN;
+            rb->fd = qemu_memfd_create(rb->idstr, rb->max_length + mr->align,
+                                       0, 0, 0, errp);
+        }
+        if (rb->fd >= 0) {
+            int mfd = rb->fd;
+            qemu_set_cloexec(mfd);
+            host = file_ram_alloc(rb, rb->max_length, mfd, false, 0, errp);
+            trace_qemu_anon_memfd_alloc(rb->idstr, rb->max_length, mfd, host);
+        }
+
     } else {
         host = qemu_anon_ram_alloc(rb->max_length, &mr->align,
                                    qemu_ram_is_shared(rb),
@@ -2106,8 +2120,10 @@ RAMBlock *qemu_ram_alloc_resizeable(ram_addr_t size, ram_addr_t maxsz,
                                                      void *host),
                                      MemoryRegion *mr, Error **errp)
 {
+    uint32_t flags = current_machine->memfd_alloc ? RAM_SHARED : 0;
+    flags |= RAM_RESIZEABLE;
     return qemu_ram_alloc_internal(size, maxsz, resized, NULL,
-                                   RAM_RESIZEABLE, mr, errp);
+                                   flags, mr, errp);
 }
 
 static void reclaim_ramblock(RAMBlock *block)
diff --git a/system/trace-events b/system/trace-events
index f0a80ba..0092734 100644
--- a/system/trace-events
+++ b/system/trace-events
@@ -41,3 +41,4 @@ dirtylimit_vcpu_execute(int cpu_index, int64_t sleep_time_us) "CPU[%d] sleep %"P
 
 # physmem.c
 ram_block_create(const char *name, uint32_t flags, int fd, size_t used_length, size_t max_length, size_t align) "%s, flags %u, fd %d, len %lu, maxlen %lu, align %lu"
+qemu_anon_memfd_alloc(const char *name, size_t size, int fd, void *ptr) "%s size %zu fd %d -> %p"
-- 
1.8.3.1




* [PATCH V1 18/26] migration: cpr-exec-args parameter
  2024-04-29 15:55 [PATCH V1 00/26] Live update: cpr-exec Steve Sistare
                   ` (16 preceding siblings ...)
  2024-04-29 15:55 ` [PATCH V1 17/26] machine: memfd-alloc option Steve Sistare
@ 2024-04-29 15:55 ` Steve Sistare
  2024-05-02 12:23   ` Markus Armbruster
  2024-05-21  8:13   ` Daniel P. Berrangé
  2024-04-29 15:55 ` [PATCH V1 19/26] physmem: preserve ram blocks for cpr Steve Sistare
                   ` (11 subsequent siblings)
  29 siblings, 2 replies; 122+ messages in thread
From: Steve Sistare @ 2024-04-29 15:55 UTC (permalink / raw)
  To: qemu-devel
  Cc: Peter Xu, Fabiano Rosas, David Hildenbrand, Igor Mammedov,
	Eduardo Habkost, Marcel Apfelbaum, Philippe Mathieu-Daude,
	Paolo Bonzini, Daniel P. Berrange, Markus Armbruster,
	Steve Sistare

Create the cpr-exec-args migration parameter, defined as a list of
strings.  It will be used for cpr-exec migration mode in a subsequent
patch.

No functional change, except that cpr-exec-args is shown by the
'info migrate' command.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 hmp-commands.hx                |  2 +-
 migration/migration-hmp-cmds.c | 24 ++++++++++++++++++++++++
 migration/options.c            | 13 +++++++++++++
 qapi/migration.json            | 18 +++++++++++++++---
 4 files changed, 53 insertions(+), 4 deletions(-)

diff --git a/hmp-commands.hx b/hmp-commands.hx
index 2e2a3bc..39954ae 100644
--- a/hmp-commands.hx
+++ b/hmp-commands.hx
@@ -1012,7 +1012,7 @@ ERST
 
     {
         .name       = "migrate_set_parameter",
-        .args_type  = "parameter:s,value:s",
+        .args_type  = "parameter:s,value:S",
         .params     = "parameter value",
         .help       = "Set the parameter for migration",
         .cmd        = hmp_migrate_set_parameter,
diff --git a/migration/migration-hmp-cmds.c b/migration/migration-hmp-cmds.c
index 7e96ae6..414c7e8 100644
--- a/migration/migration-hmp-cmds.c
+++ b/migration/migration-hmp-cmds.c
@@ -255,6 +255,18 @@ void hmp_info_migrate_capabilities(Monitor *mon, const QDict *qdict)
     qapi_free_MigrationCapabilityStatusList(caps);
 }
 
+static void monitor_print_cpr_exec_args(Monitor *mon, strList *args)
+{
+    monitor_printf(mon, "%s:",
+        MigrationParameter_str(MIGRATION_PARAMETER_CPR_EXEC_ARGS));
+
+    while (args) {
+        monitor_printf(mon, " %s", args->value);
+        args = args->next;
+    }
+    monitor_printf(mon, "\n");
+}
+
 void hmp_info_migrate_parameters(Monitor *mon, const QDict *qdict)
 {
     MigrationParameters *params;
@@ -397,6 +409,8 @@ void hmp_info_migrate_parameters(Monitor *mon, const QDict *qdict)
         monitor_printf(mon, "%s: %s\n",
             MigrationParameter_str(MIGRATION_PARAMETER_MODE),
             qapi_enum_lookup(&MigMode_lookup, params->mode));
+        assert(params->has_cpr_exec_args);
+        monitor_print_cpr_exec_args(mon, params->cpr_exec_args);
     }
 
     qapi_free_MigrationParameters(params);
@@ -690,6 +704,16 @@ void hmp_migrate_set_parameter(Monitor *mon, const QDict *qdict)
         p->has_mode = true;
         visit_type_MigMode(v, param, &p->mode, &err);
         break;
+    case MIGRATION_PARAMETER_CPR_EXEC_ARGS: {
+        g_autofree char **strv = g_strsplit(valuestr ?: "", " ", -1);
+        strList **tail = &p->cpr_exec_args;
+
+        for (int i = 0; strv[i]; i++) {
+            QAPI_LIST_APPEND(tail, strv[i]);
+        }
+        p->has_cpr_exec_args = true;
+        break;
+    }
     default:
         assert(0);
     }
diff --git a/migration/options.c b/migration/options.c
index 239f5ec..89082cc 100644
--- a/migration/options.c
+++ b/migration/options.c
@@ -1060,6 +1060,8 @@ MigrationParameters *qmp_query_migrate_parameters(Error **errp)
     params->mode = s->parameters.mode;
     params->has_zero_page_detection = true;
     params->zero_page_detection = s->parameters.zero_page_detection;
+    params->has_cpr_exec_args = true;
+    params->cpr_exec_args = QAPI_CLONE(strList, s->parameters.cpr_exec_args);
 
     return params;
 }
@@ -1097,6 +1099,7 @@ void migrate_params_init(MigrationParameters *params)
     params->has_vcpu_dirty_limit = true;
     params->has_mode = true;
     params->has_zero_page_detection = true;
+    params->has_cpr_exec_args = true;
 }
 
 /*
@@ -1416,6 +1419,10 @@ static void migrate_params_test_apply(MigrateSetParameters *params,
     if (params->has_zero_page_detection) {
         dest->zero_page_detection = params->zero_page_detection;
     }
+
+    if (params->has_cpr_exec_args) {
+        dest->cpr_exec_args = params->cpr_exec_args;
+    }
 }
 
 static void migrate_params_apply(MigrateSetParameters *params, Error **errp)
@@ -1570,6 +1577,12 @@ static void migrate_params_apply(MigrateSetParameters *params, Error **errp)
     if (params->has_zero_page_detection) {
         s->parameters.zero_page_detection = params->zero_page_detection;
     }
+
+    if (params->has_cpr_exec_args) {
+        qapi_free_strList(s->parameters.cpr_exec_args);
+        s->parameters.cpr_exec_args =
+            QAPI_CLONE(strList, params->cpr_exec_args);
+    }
 }
 
 void qmp_migrate_set_parameters(MigrateSetParameters *params, Error **errp)
diff --git a/qapi/migration.json b/qapi/migration.json
index 8c65b90..49710e7 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -914,6 +914,9 @@
 #     See description in @ZeroPageDetection.  Default is 'multifd'.
 #     (since 9.0)
 #
+# @cpr-exec-args: Arguments passed to new QEMU for @cpr-exec mode.
+#    See @cpr-exec for details.  (Since 9.1)
+#
 # Features:
 #
 # @deprecated: Member @block-incremental is deprecated.  Use
@@ -948,7 +951,8 @@
            { 'name': 'x-vcpu-dirty-limit-period', 'features': ['unstable'] },
            'vcpu-dirty-limit',
            'mode',
-           'zero-page-detection'] }
+           'zero-page-detection',
+           'cpr-exec-args'] }
 
 ##
 # @MigrateSetParameters:
@@ -1122,6 +1126,9 @@
 #     See description in @ZeroPageDetection.  Default is 'multifd'.
 #     (since 9.0)
 #
+# @cpr-exec-args: Arguments passed to new QEMU for @cpr-exec mode.
+#    See @cpr-exec for details.  (Since 9.1)
+#
 # Features:
 #
 # @deprecated: Member @block-incremental is deprecated.  Use
@@ -1176,7 +1183,8 @@
                                             'features': [ 'unstable' ] },
             '*vcpu-dirty-limit': 'uint64',
             '*mode': 'MigMode',
-            '*zero-page-detection': 'ZeroPageDetection'} }
+            '*zero-page-detection': 'ZeroPageDetection',
+            '*cpr-exec-args': [ 'str' ]} }
 
 ##
 # @migrate-set-parameters:
@@ -1354,6 +1362,9 @@
 #     See description in @ZeroPageDetection.  Default is 'multifd'.
 #     (since 9.0)
 #
+# @cpr-exec-args: Arguments passed to new QEMU for @cpr-exec mode.
+#    See @cpr-exec for details.  (Since 9.1)
+#
 # Features:
 #
 # @deprecated: Member @block-incremental is deprecated.  Use
@@ -1405,7 +1416,8 @@
                                             'features': [ 'unstable' ] },
             '*vcpu-dirty-limit': 'uint64',
             '*mode': 'MigMode',
-            '*zero-page-detection': 'ZeroPageDetection'} }
+            '*zero-page-detection': 'ZeroPageDetection',
+            '*cpr-exec-args': [ 'str' ]} }
 
 ##
 # @query-migrate-parameters:
-- 
1.8.3.1




* [PATCH V1 19/26] physmem: preserve ram blocks for cpr
  2024-04-29 15:55 [PATCH V1 00/26] Live update: cpr-exec Steve Sistare
                   ` (17 preceding siblings ...)
  2024-04-29 15:55 ` [PATCH V1 18/26] migration: cpr-exec-args parameter Steve Sistare
@ 2024-04-29 15:55 ` Steve Sistare
  2024-05-28 21:44   ` Peter Xu
  2024-04-29 15:55 ` [PATCH V1 20/26] migration: cpr-exec mode Steve Sistare
                   ` (10 subsequent siblings)
  29 siblings, 1 reply; 122+ messages in thread
From: Steve Sistare @ 2024-04-29 15:55 UTC (permalink / raw)
  To: qemu-devel
  Cc: Peter Xu, Fabiano Rosas, David Hildenbrand, Igor Mammedov,
	Eduardo Habkost, Marcel Apfelbaum, Philippe Mathieu-Daude,
	Paolo Bonzini, Daniel P. Berrange, Markus Armbruster,
	Steve Sistare

During CPR, preserve the fields of RAMBlocks that allocate their host memory,
so the RAM allocation can be recovered.  Mirror the mr->align field in the
RAMBlock to simplify the vmstate.  Preserve the old host address, even
though it is immediately discarded, as it will be needed in the future for
CPR with iommufd.  Preserve guest_memfd, even though CPR does not yet
support it, to maintain vmstate compatibility when it becomes supported.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 include/exec/ramblock.h |  6 ++++++
 system/physmem.c        | 40 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 46 insertions(+)

diff --git a/include/exec/ramblock.h b/include/exec/ramblock.h
index 61deefe..b492d89 100644
--- a/include/exec/ramblock.h
+++ b/include/exec/ramblock.h
@@ -44,6 +44,7 @@ struct RAMBlock {
     uint64_t fd_offset;
     int guest_memfd;
     size_t page_size;
+    uint64_t align;
     /* dirty bitmap used during migration */
     unsigned long *bmap;
 
@@ -91,5 +92,10 @@ struct RAMBlock {
      */
     ram_addr_t postcopy_length;
 };
+
+#define RAM_BLOCK "RAMBlock"
+
+extern const VMStateDescription vmstate_ram_block;
+
 #endif
 #endif
diff --git a/system/physmem.c b/system/physmem.c
index 36d97ec..3019284 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -1398,6 +1398,7 @@ static void *file_ram_alloc(RAMBlock *block,
         block->mr->align = MAX(block->mr->align, QEMU_VMALLOC_ALIGN);
     }
 #endif
+    block->align = block->mr->align;
 
     if (memory < block->page_size) {
         error_setg(errp, "memory size 0x" RAM_ADDR_FMT " must be equal to "
@@ -1848,6 +1849,7 @@ static void *ram_block_alloc_host(RAMBlock *rb, Error **errp)
                              rb->idstr);
         }
     }
+    rb->align = mr->align;
 
     if (host) {
         memory_try_enable_merging(host, rb->max_length);
@@ -1934,6 +1936,7 @@ static RAMBlock *ram_block_create(MemoryRegion *mr, ram_addr_t size,
     rb->flags = ram_flags;
     rb->page_size = qemu_real_host_page_size();
     rb->mr = mr;
+    rb->align = mr->align;
 
     if (ram_flags & RAM_GUEST_MEMFD) {
         rb->guest_memfd = ram_block_create_guest_memfd(rb, errp);
@@ -2060,6 +2063,26 @@ RAMBlock *qemu_ram_alloc_from_file(ram_addr_t size, MemoryRegion *mr,
 }
 #endif
 
+const VMStateDescription vmstate_ram_block = {
+    .name = RAM_BLOCK,
+    .version_id = 1,
+    .minimum_version_id = 1,
+    .precreate = true,
+    .factory = true,
+    .fields = (VMStateField[]) {
+        VMSTATE_UINT64(align, RAMBlock),
+        VMSTATE_VOID_PTR(host, RAMBlock),
+        VMSTATE_INT32(fd, RAMBlock),
+        VMSTATE_INT32(guest_memfd, RAMBlock),
+        VMSTATE_UINT32(flags, RAMBlock),
+        VMSTATE_UINT64(used_length, RAMBlock),
+        VMSTATE_UINT64(max_length, RAMBlock),
+        VMSTATE_END_OF_LIST()
+    }
+};
+
+vmstate_register_init_factory(vmstate_ram_block, RAMBlock);
+
 static
 RAMBlock *qemu_ram_alloc_internal(ram_addr_t size, ram_addr_t max_size,
                                   void (*resized)(const char*,
@@ -2070,6 +2093,7 @@ RAMBlock *qemu_ram_alloc_internal(ram_addr_t size, ram_addr_t max_size,
 {
     RAMBlock *new_block;
     int align;
+    g_autofree RAMBlock *preserved = NULL;
 
     assert((ram_flags & ~(RAM_SHARED | RAM_RESIZEABLE | RAM_PREALLOC |
                           RAM_NORESERVE | RAM_GUEST_MEMFD)) == 0);
@@ -2086,6 +2110,17 @@ RAMBlock *qemu_ram_alloc_internal(ram_addr_t size, ram_addr_t max_size,
     }
     new_block->resized = resized;
 
+    preserved = vmstate_claim_factory_object(RAM_BLOCK, new_block->idstr, 0);
+    if (preserved) {
+        assert(mr->align <= preserved->align);
+        mr->align = mr->align ?: preserved->align;
+        new_block->align = preserved->align;
+        new_block->fd = preserved->fd;
+        new_block->flags = preserved->flags;
+        new_block->used_length = preserved->used_length;
+        new_block->max_length = preserved->max_length;
+    }
+
     if (!host) {
         host = ram_block_alloc_host(new_block, errp);
         if (!host) {
@@ -2093,6 +2128,10 @@ RAMBlock *qemu_ram_alloc_internal(ram_addr_t size, ram_addr_t max_size,
             g_free(new_block);
             return NULL;
         }
+        if (!(ram_flags & RAM_GUEST_MEMFD)) {
+            vmstate_register_named(new_block->idstr, 0, &vmstate_ram_block,
+                                   new_block);
+        }
     }
 
     new_block->host = host;
@@ -2157,6 +2196,7 @@ void qemu_ram_free(RAMBlock *block)
     }
 
     qemu_mutex_lock_ramlist();
+    vmstate_unregister_named(RAM_BLOCK, block->idstr, 0);
     qemu_ram_unset_idstr(block);
     QLIST_REMOVE_RCU(block, next);
     ram_list.mru_block = NULL;
-- 
1.8.3.1




* [PATCH V1 20/26] migration: cpr-exec mode
  2024-04-29 15:55 [PATCH V1 00/26] Live update: cpr-exec Steve Sistare
                   ` (18 preceding siblings ...)
  2024-04-29 15:55 ` [PATCH V1 19/26] physmem: preserve ram blocks for cpr Steve Sistare
@ 2024-04-29 15:55 ` Steve Sistare
  2024-05-02 12:23   ` Markus Armbruster
                     ` (2 more replies)
  2024-04-29 15:55 ` [PATCH V1 21/26] migration: migrate_add_blocker_mode Steve Sistare
                   ` (9 subsequent siblings)
  29 siblings, 3 replies; 122+ messages in thread
From: Steve Sistare @ 2024-04-29 15:55 UTC (permalink / raw)
  To: qemu-devel
  Cc: Peter Xu, Fabiano Rosas, David Hildenbrand, Igor Mammedov,
	Eduardo Habkost, Marcel Apfelbaum, Philippe Mathieu-Daude,
	Paolo Bonzini, Daniel P. Berrange, Markus Armbruster,
	Steve Sistare

Add the cpr-exec migration mode.  Usage:
  qemu-system-$arch -machine memfd-alloc=on ...
  migrate_set_parameter mode cpr-exec
  migrate_set_parameter cpr-exec-args \
    <arg1> <arg2> ... -incoming <uri>
  migrate -d <uri>

The migrate command stops the VM, saves state to the URI,
directly exec's a new version of QEMU on the same host,
replacing the original process while retaining its PID, and
loads state from the URI.  Guest RAM is preserved in place,
albeit with new virtual addresses.

Arguments for the new QEMU process are taken from the
@cpr-exec-args parameter.  The first argument should be the
path of a new QEMU binary, or a prefix command that exec's the
new QEMU binary.

Because old QEMU terminates when new QEMU starts, one cannot
stream data between the two, so the URI must be of a type, such as
a file, that accepts all data before old QEMU exits.

Memory backend objects must have the share=on attribute, and
must be mmap'able in the new QEMU process.  For example,
memory-backend-file is acceptable, but memory-backend-ram is
not.

The VM must be started with the '-machine memfd-alloc=on'
option.  This causes implicit ram blocks (those not explicitly
described by a memory-backend object) to be allocated by
mmap'ing a memfd.  Examples include VGA, ROM, and even guest
RAM when it is specified without a memory-backend object.

The implementation saves precreate vmstate at the end of normal
migration in migrate_fd_cleanup, and tells the main loop to call
cpr_exec.  Incoming QEMU loads precreate state early, before objects
are created.  The memfds are kept open across exec by clearing the
close-on-exec flag, their values are saved in precreate vmstate,
and they are mmap'd in new QEMU.
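
The close-on-exec handling can be sketched with plain fcntl calls.
The helper names fd_clear_cloexec and fd_set_cloexec are hypothetical;
they only approximate what qemu_clear_cloexec and qemu_set_cloexec are
used for in this series, and they operate on a single descriptor
rather than walking the preserved RAM blocks.

```c
#include <assert.h>
#include <fcntl.h>

/* Clear FD_CLOEXEC so the descriptor stays open across exec. */
static int fd_clear_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags < 0) {
        return -1;
    }
    return fcntl(fd, F_SETFD, flags & ~FD_CLOEXEC);
}

/* Set FD_CLOEXEC again, for example when the exec is abandoned. */
static int fd_set_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags < 0) {
        return -1;
    }
    return fcntl(fd, F_SETFD, flags | FD_CLOEXEC);
}
```

The flag is cleared only immediately before execvp, so the descriptors
do not leak into unrelated fork and exec calls made during normal
operation, and it is set again if the migration fails.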

Note that the memfd-alloc option is not related to memory-backend-memfd.
Later patches add support for memory-backend-memfd, and for additional
devices, including vfio, chardev, and more.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 include/migration/cpr.h  |  14 +++++
 include/migration/misc.h |   3 ++
 migration/cpr.c          | 131 +++++++++++++++++++++++++++++++++++++++++++++++
 migration/meson.build    |   1 +
 migration/migration.c    |  21 ++++++++
 migration/migration.h    |   5 +-
 migration/ram.c          |   1 +
 qapi/migration.json      |  30 ++++++++++-
 system/physmem.c         |   2 +
 system/vl.c              |   4 ++
 10 files changed, 210 insertions(+), 2 deletions(-)
 create mode 100644 include/migration/cpr.h
 create mode 100644 migration/cpr.c

diff --git a/include/migration/cpr.h b/include/migration/cpr.h
new file mode 100644
index 0000000..aa8316d
--- /dev/null
+++ b/include/migration/cpr.h
@@ -0,0 +1,14 @@
+/*
+ * Copyright (c) 2021, 2024 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#ifndef MIGRATION_CPR_H
+#define MIGRATION_CPR_H
+
+bool cpr_needed_for_exec(void *opaque);
+void cpr_unpreserve_fds(void);
+
+#endif
diff --git a/include/migration/misc.h b/include/migration/misc.h
index cf30351..5b963ba 100644
--- a/include/migration/misc.h
+++ b/include/migration/misc.h
@@ -122,4 +122,7 @@ bool migration_in_bg_snapshot(void);
 /* migration/block-dirty-bitmap.c */
 void dirty_bitmap_mig_init(void);
 
+/* migration/cpr.c */
+void cpr_exec(char **argv);
+
 #endif
diff --git a/migration/cpr.c b/migration/cpr.c
new file mode 100644
index 0000000..d4703e1
--- /dev/null
+++ b/migration/cpr.c
@@ -0,0 +1,131 @@
+/*
+ * Copyright (c) 2021-2024 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "qapi/error.h"
+#include "exec/ramblock.h"
+#include "migration/cpr.h"
+#include "migration/migration.h"
+#include "migration/misc.h"
+#include "migration/vmstate.h"
+#include "sysemu/runstate.h"
+#include "trace.h"
+
+/*************************************************************************/
+#define CPR_STATE "CprState"
+
+typedef struct CprState {
+    MigMode mode;
+} CprState;
+
+static CprState cpr_state = {
+    .mode = MIG_MODE_NORMAL,
+};
+
+static int cpr_state_presave(void *opaque)
+{
+    cpr_state.mode = migrate_mode();
+    return 0;
+}
+
+bool cpr_needed_for_exec(void *opaque)
+{
+    return migrate_mode() == MIG_MODE_CPR_EXEC;
+}
+
+static const VMStateDescription vmstate_cpr_state = {
+    .name = CPR_STATE,
+    .version_id = 1,
+    .minimum_version_id = 1,
+    .needed = cpr_needed_for_exec,
+    .pre_save = cpr_state_presave,
+    .precreate = true,
+    .fields = (VMStateField[]) {
+        VMSTATE_UINT32(mode, CprState),
+        VMSTATE_END_OF_LIST()
+    }
+};
+
+vmstate_register_init(NULL, 0, vmstate_cpr_state, &cpr_state);
+
+/*************************************************************************/
+
+typedef int (*cpr_walk_fd_cb)(int fd);
+
+static int walk_ramblock(FactoryObject *obj, void *opaque)
+{
+    RAMBlock *rb = obj->opaque;
+    cpr_walk_fd_cb cb = opaque;
+    return cb(rb->fd);
+}
+
+static int cpr_walk_fd(cpr_walk_fd_cb cb)
+{
+    int ret = vmstate_walk_factory_outgoing(RAM_BLOCK, walk_ramblock, cb);
+    return ret;
+}
+
+static int preserve_fd(int fd)
+{
+    qemu_clear_cloexec(fd);
+    return 0;
+}
+
+static int unpreserve_fd(int fd)
+{
+    qemu_set_cloexec(fd);
+    return 0;
+}
+
+static void cpr_preserve_fds(void)
+{
+    cpr_walk_fd(preserve_fd);
+}
+
+void cpr_unpreserve_fds(void)
+{
+    cpr_walk_fd(unpreserve_fd);
+}
+
+static int cpr_fd_notifier_func(NotifierWithReturn *notifier,
+                                 MigrationEvent *e, Error **errp)
+{
+    if (migrate_mode() == MIG_MODE_CPR_EXEC &&
+        e->type == MIG_EVENT_PRECOPY_FAILED) {
+        cpr_unpreserve_fds();
+    }
+    return 0;
+}
+
+void cpr_mig_init(void)
+{
+    static NotifierWithReturn cpr_fd_notifier;
+
+    migrate_get_current()->parameters.mode = cpr_state.mode;
+    migration_add_notifier(&cpr_fd_notifier, cpr_fd_notifier_func);
+}
+
+void cpr_exec(char **argv)
+{
+    MigrationState *s = migrate_get_current();
+    Error *err = NULL;
+
+    /*
+     * Clear the close-on-exec flag for all preserved fd's.  We cannot do so
+     * earlier because they should not persist across miscellaneous fork and
+     * exec calls that are performed during normal operation.
+     */
+    cpr_preserve_fds();
+
+    execvp(argv[0], argv);
+
+    error_setg_errno(&err, errno, "execvp %s failed", argv[0]);
+    migrate_set_state(&s->state, s->state, MIGRATION_STATUS_FAILED);
+    migrate_set_error(s, err);
+    error_report_err(err);
+    migration_precreate_unsave();
+}
diff --git a/migration/meson.build b/migration/meson.build
index e667b40..d9e9c60 100644
--- a/migration/meson.build
+++ b/migration/meson.build
@@ -14,6 +14,7 @@ system_ss.add(files(
   'block-dirty-bitmap.c',
   'channel.c',
   'channel-block.c',
+  'cpr.c',
   'dirtyrate.c',
   'exec.c',
   'fd.c',
diff --git a/migration/migration.c b/migration/migration.c
index b5af6b5..0d91531 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -239,6 +239,7 @@ void migration_object_init(void)
     blk_mig_init();
     ram_mig_init();
     dirty_bitmap_mig_init();
+    cpr_mig_init();
 }
 
 typedef struct {
@@ -1395,6 +1396,15 @@ static void migrate_fd_cleanup(MigrationState *s)
         qemu_fclose(tmp);
     }
 
+    if (migrate_mode() == MIG_MODE_CPR_EXEC) {
+        Error *err = NULL;
+        if (migration_precreate_save(&err)) {
+            migrate_set_error(s, err);
+            error_report_err(err);
+            migrate_set_state(&s->state, s->state, MIGRATION_STATUS_FAILED);
+        }
+    }
+
     assert(!migration_is_active());
 
     if (s->state == MIGRATION_STATUS_CANCELLING) {
@@ -1410,6 +1420,11 @@ static void migrate_fd_cleanup(MigrationState *s)
                                      MIG_EVENT_PRECOPY_DONE;
     migration_call_notifiers(s, type, NULL);
     block_cleanup_parameters();
+
+    if (migrate_mode() == MIG_MODE_CPR_EXEC && !migration_has_failed(s)) {
+        assert(s->state == MIGRATION_STATUS_COMPLETED);
+        qemu_system_exec_request(cpr_exec, s->parameters.cpr_exec_args);
+    }
     yank_unregister_instance(MIGRATION_YANK_INSTANCE);
 }
 
@@ -1977,6 +1992,12 @@ static bool migrate_prepare(MigrationState *s, bool blk, bool blk_inc,
         return false;
     }
 
+    if (migrate_mode() == MIG_MODE_CPR_EXEC &&
+        !s->parameters.has_cpr_exec_args) {
+        error_setg(errp, "cpr-exec mode requires setting cpr-exec-args");
+        return false;
+    }
+
     if (migration_is_blocked(errp)) {
         return false;
     }
diff --git a/migration/migration.h b/migration/migration.h
index 8045e39..2ad2163 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -490,7 +490,6 @@ bool migration_in_postcopy(void);
 bool migration_postcopy_is_alive(int state);
 MigrationState *migrate_get_current(void);
 bool migration_has_failed(MigrationState *);
-bool migrate_mode_is_cpr(MigrationState *);
 
 uint64_t ram_get_total_transferred_pages(void);
 
@@ -544,4 +543,8 @@ int migration_rp_wait(MigrationState *s);
  */
 void migration_rp_kick(MigrationState *s);
 
+/* CPR */
+bool migrate_mode_is_cpr(MigrationState *);
+void cpr_mig_init(void);
+
 #endif
diff --git a/migration/ram.c b/migration/ram.c
index a975c5a..add285b 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -219,6 +219,7 @@ static bool postcopy_preempt_active(void)
 bool migrate_ram_is_ignored(RAMBlock *block)
 {
     return !qemu_ram_is_migratable(block) ||
+           migrate_mode() == MIG_MODE_CPR_EXEC ||
            (migrate_ignore_shared() && qemu_ram_is_shared(block)
                                     && qemu_ram_is_named_file(block));
 }
diff --git a/qapi/migration.json b/qapi/migration.json
index 49710e7..7c5f45f 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -665,9 +665,37 @@
 #     or COLO.
 #
 #     (since 8.2)
+#
+# @cpr-exec: The migrate command stops the VM, saves state to the URI,
+#     directly exec's a new version of QEMU on the same host,
+#     replacing the original process while retaining its PID, and
+#     loads state from the URI.  Guest RAM is preserved in place,
+#     albeit with new virtual addresses.
+#
+#     Arguments for the new QEMU process are taken from the
+#     @cpr-exec-args parameter.  The first argument should be the
+#     path of a new QEMU binary, or a prefix command that exec's the
+#     new QEMU binary.
+#
+#     Because old QEMU terminates when new QEMU starts, one cannot
+#     stream data between the two, so the URI must be a type, such as
+#     a file, that reads all data before old QEMU exits.
+#
+#     Memory backend objects must have the share=on attribute, and
+#     must be mmap'able in the new QEMU process.  For example,
+#     memory-backend-file is acceptable, but memory-backend-ram is
+#     not.
+#
+#     The VM must be started with the '-machine memfd-alloc=on'
+#     option.  This causes implicit ram blocks -- those not explicitly
+#     described by a memory-backend object -- to be allocated by
+#     mmap'ing a memfd.  Examples include VGA, ROM, and even guest
+#     RAM when it is specified without a memory-backend object.
+#
+#     (since 9.1)
 ##
 { 'enum': 'MigMode',
-  'data': [ 'normal', 'cpr-reboot' ] }
+  'data': [ 'normal', 'cpr-reboot', 'cpr-exec' ] }
 
 ##
 # @ZeroPageDetection:
diff --git a/system/physmem.c b/system/physmem.c
index 3019284..87ad441 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -69,6 +69,7 @@
 
 #include "qemu/pmem.h"
 
+#include "migration/cpr.h"
 #include "migration/vmstate.h"
 
 #include "qemu/range.h"
@@ -2069,6 +2070,7 @@ const VMStateDescription vmstate_ram_block = {
     .minimum_version_id = 1,
     .precreate = true,
     .factory = true,
+    .needed = cpr_needed_for_exec,
     .fields = (VMStateField[]) {
         VMSTATE_UINT64(align, RAMBlock),
         VMSTATE_VOID_PTR(host, RAMBlock),
diff --git a/system/vl.c b/system/vl.c
index 7797206..7252100 100644
--- a/system/vl.c
+++ b/system/vl.c
@@ -76,6 +76,7 @@
 #include "hw/block/block.h"
 #include "hw/i386/x86.h"
 #include "hw/i386/pc.h"
+#include "migration/cpr.h"
 #include "migration/misc.h"
 #include "migration/snapshot.h"
 #include "migration/vmstate.h"
@@ -3665,6 +3666,9 @@ void qemu_init(int argc, char **argv)
     qemu_create_machine(machine_opts_dict);
 
     vmstate_register_init_all();
+    migration_precreate_load(&error_fatal);
+    /* Set cloexec to prevent fd leaks from fork until the next cpr-exec */
+    cpr_unpreserve_fds();
 
     suspend_mux_open();
 
-- 
1.8.3.1




* [PATCH V1 21/26] migration: migrate_add_blocker_mode
  2024-04-29 15:55 [PATCH V1 00/26] Live update: cpr-exec Steve Sistare
                   ` (19 preceding siblings ...)
  2024-04-29 15:55 ` [PATCH V1 20/26] migration: cpr-exec mode Steve Sistare
@ 2024-04-29 15:55 ` Steve Sistare
  2024-05-09 17:47   ` Fabiano Rosas
  2024-04-29 15:55 ` [PATCH V1 22/26] migration: ram block cpr-exec blockers Steve Sistare
                   ` (8 subsequent siblings)
  29 siblings, 1 reply; 122+ messages in thread
From: Steve Sistare @ 2024-04-29 15:55 UTC (permalink / raw)
  To: qemu-devel
  Cc: Peter Xu, Fabiano Rosas, David Hildenbrand, Igor Mammedov,
	Eduardo Habkost, Marcel Apfelbaum, Philippe Mathieu-Daude,
	Paolo Bonzini, Daniel P. Berrange, Markus Armbruster,
	Steve Sistare

Define a convenience function to add a migration blocker for a single mode.
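
The wrapper only has to supply the -1 terminator that the variadic form
expects.  A minimal standalone sketch of that calling convention follows; the
toy_* names and the bitmask return value are illustrative assumptions, not the
QEMU implementation:

```c
#include <assert.h>
#include <stdarg.h>

/* Toy stand-ins for MigMode values; illustrative only. */
enum { TOY_MODE_NORMAL, TOY_MODE_CPR_REBOOT, TOY_MODE_CPR_EXEC, TOY_MODE__MAX };

/* Collect a -1 terminated list of modes into a bitmask. */
static unsigned toy_blocker_modes(int mode, ...)
{
    unsigned mask = 0;
    va_list ap;

    va_start(ap, mode);
    while (mode != -1) {
        assert(mode >= 0 && mode < TOY_MODE__MAX);
        mask |= 1u << mode;
        mode = va_arg(ap, int);
    }
    va_end(ap);
    return mask;
}

/* Single-mode convenience wrapper: supplies the terminator for the caller. */
static unsigned toy_blocker_mode(int mode)
{
    return toy_blocker_modes(mode, -1);
}
```

Callers that block a single mode then cannot forget the terminator, which the
compiler would not diagnose for a variadic function.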

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 include/migration/blocker.h | 7 +++++++
 migration/migration.c       | 5 +++++
 stubs/migr-blocker.c        | 5 +++++
 3 files changed, 17 insertions(+)

diff --git a/include/migration/blocker.h b/include/migration/blocker.h
index a687ac0..5c2e5d4 100644
--- a/include/migration/blocker.h
+++ b/include/migration/blocker.h
@@ -94,4 +94,11 @@ int migrate_add_blocker_normal(Error **reasonp, Error **errp);
  */
 int migrate_add_blocker_modes(Error **reasonp, Error **errp, MigMode mode, ...);
 
+/**
+ * @migrate_add_blocker_mode - prevent a mode of migration from proceeding
+ *
+ * Like migrate_add_blocker_modes, but for a single mode.
+ */
+int migrate_add_blocker_mode(Error **reasonp, MigMode mode, Error **errp);
+
 #endif
diff --git a/migration/migration.c b/migration/migration.c
index 0d91531..4984dee 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1769,6 +1769,11 @@ int migrate_add_blocker_normal(Error **reasonp, Error **errp)
     return migrate_add_blocker_modes(reasonp, errp, MIG_MODE_NORMAL, -1);
 }
 
+int migrate_add_blocker_mode(Error **reasonp, MigMode mode, Error **errp)
+{
+    return migrate_add_blocker_modes(reasonp, errp, mode, -1);
+}
+
 int migrate_add_blocker_modes(Error **reasonp, Error **errp, MigMode mode, ...)
 {
     int modes;
diff --git a/stubs/migr-blocker.c b/stubs/migr-blocker.c
index 11cbff2..150eb62 100644
--- a/stubs/migr-blocker.c
+++ b/stubs/migr-blocker.c
@@ -16,6 +16,11 @@ int migrate_add_blocker_modes(Error **reasonp, Error **errp, MigMode mode, ...)
     return 0;
 }
 
+int migrate_add_blocker_mode(Error **reasonp, MigMode mode, Error **errp)
+{
+    return 0;
+}
+
 void migrate_del_blocker(Error **reasonp)
 {
 }
-- 
1.8.3.1




* [PATCH V1 22/26] migration: ram block cpr-exec blockers
  2024-04-29 15:55 [PATCH V1 00/26] Live update: cpr-exec Steve Sistare
                   ` (20 preceding siblings ...)
  2024-04-29 15:55 ` [PATCH V1 21/26] migration: migrate_add_blocker_mode Steve Sistare
@ 2024-04-29 15:55 ` Steve Sistare
  2024-05-09 18:01   ` Fabiano Rosas
  2024-04-29 15:55 ` [PATCH V1 23/26] migration: misc " Steve Sistare
                   ` (7 subsequent siblings)
  29 siblings, 1 reply; 122+ messages in thread
From: Steve Sistare @ 2024-04-29 15:55 UTC (permalink / raw)
  To: qemu-devel
  Cc: Peter Xu, Fabiano Rosas, David Hildenbrand, Igor Mammedov,
	Eduardo Habkost, Marcel Apfelbaum, Philippe Mathieu-Daude,
	Paolo Bonzini, Daniel P. Berrange, Markus Armbruster,
	Steve Sistare

Unlike cpr-reboot mode, cpr-exec mode cannot save volatile ram blocks in the
migration stream file and recreate them later, because the physical memory for
the blocks is pinned and registered for vfio.  Add an exec-mode blocker for
volatile ram blocks.

Also add a blocker for RAM_GUEST_MEMFD.  Preserving guest_memfd may be
sufficient for cpr-exec, but it has not been tested yet.
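
As a rough standalone model of the volatility test, assuming a flattened
struct whose fields stand in for the memory_region_*() and qemu_ram_*()
queries used in the patch (the toy_* names are hypothetical):

```c
#include <stdbool.h>

/* Hypothetical, flattened view of the properties the predicate consults. */
struct toy_ramblock {
    bool is_ram;          /* stands in for memory_region_is_ram() */
    bool is_ram_device;   /* stands in for memory_region_is_ram_device() */
    bool shared;          /* stands in for qemu_ram_is_shared() */
    bool named_file;      /* stands in for qemu_ram_is_named_file() */
    bool migratable;      /* stands in for qemu_ram_is_migratable() */
    int fd;               /* negative when no file descriptor backs the block */
};

/* True if the block's contents would be lost across exec. */
static bool toy_ram_is_volatile(const struct toy_ramblock *rb)
{
    return rb->is_ram &&
           !rb->is_ram_device &&
           (!rb->shared || !rb->named_file) &&
           rb->migratable &&
           rb->fd < 0;
}
```

In this model an anonymous, migratable RAM block with no fd is volatile, while
any block backed by an fd is not, because the fd can be remapped in the new
process.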


Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 include/exec/memory.h   |  3 +++
 include/exec/ramblock.h |  1 +
 migration/savevm.c      |  2 ++
 system/physmem.c        | 52 ++++++++++++++++++++++++++++++++++++++++++++++++-
 4 files changed, 57 insertions(+), 1 deletion(-)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index dbb1bad..d337737 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -3182,6 +3182,9 @@ bool ram_block_discard_is_disabled(void);
  */
 bool ram_block_discard_is_required(void);
 
+void ram_block_add_cpr_blocker(RAMBlock *rb, Error **errp);
+void ram_block_del_cpr_blocker(RAMBlock *rb);
+
 #endif
 
 #endif
diff --git a/include/exec/ramblock.h b/include/exec/ramblock.h
index b492d89..b70ec0c 100644
--- a/include/exec/ramblock.h
+++ b/include/exec/ramblock.h
@@ -40,6 +40,7 @@ struct RAMBlock {
     /* RCU-enabled, writes protected by the ramlist lock */
     QLIST_ENTRY(RAMBlock) next;
     QLIST_HEAD(, RAMBlockNotifier) ramblock_notifiers;
+    Error *cpr_blocker;
     int fd;
     uint64_t fd_offset;
     int guest_memfd;
diff --git a/migration/savevm.c b/migration/savevm.c
index 8463ddf..6087c3a 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -3568,11 +3568,13 @@ void vmstate_register_ram(MemoryRegion *mr, DeviceState *dev)
 {
     qemu_ram_verify_idstr(mr->ram_block, dev);
     qemu_ram_set_migratable(mr->ram_block);
+    ram_block_add_cpr_blocker(mr->ram_block, &error_fatal);
 }
 
 void vmstate_unregister_ram(MemoryRegion *mr, DeviceState *dev)
 {
     qemu_ram_unset_migratable(mr->ram_block);
+    ram_block_del_cpr_blocker(mr->ram_block);
 }
 
 void vmstate_register_ram_global(MemoryRegion *mr)
diff --git a/system/physmem.c b/system/physmem.c
index 87ad441..9d44b41 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -69,6 +69,7 @@
 
 #include "qemu/pmem.h"
 
+#include "migration/blocker.h"
 #include "migration/cpr.h"
 #include "migration/vmstate.h"
 
@@ -2130,7 +2131,14 @@ RAMBlock *qemu_ram_alloc_internal(ram_addr_t size, ram_addr_t max_size,
             g_free(new_block);
             return NULL;
         }
-        if (!(ram_flags & RAM_GUEST_MEMFD)) {
+        if (ram_flags & RAM_GUEST_MEMFD) {
+            error_setg(&new_block->cpr_blocker,
+                       "Memory region %s uses guest_memfd, "
+                       "which is not supported with CPR.",
+                       memory_region_name(mr));
+            migrate_add_blocker_mode(&new_block->cpr_blocker, MIG_MODE_CPR_EXEC,
+                                     errp);
+        } else {
             vmstate_register_named(new_block->idstr, 0, &vmstate_ram_block,
                                    new_block);
         }
@@ -3997,3 +4005,45 @@ bool ram_block_discard_is_required(void)
     return qatomic_read(&ram_block_discard_required_cnt) ||
            qatomic_read(&ram_block_coordinated_discard_required_cnt);
 }
+
+/*
+ * Return true if ram contents would be lost during cpr for MIG_MODE_CPR_EXEC.
+ * Return false for ram_device because it is remapped after exec.  Do not
+ * exclude rom, even though it is readonly, because the rom file could change
+ * in the new qemu.  Return false for non-migratable blocks.  They are either
+ * re-created after exec, or are handled specially, or are covered by a
+ * device-level cpr blocker.  Return false for an fd, because it is visible and
+ * can be remapped in the new process.
+ */
+static bool ram_is_volatile(RAMBlock *rb)
+{
+    MemoryRegion *mr = rb->mr;
+
+    return mr &&
+        memory_region_is_ram(mr) &&
+        !memory_region_is_ram_device(mr) &&
+        (!qemu_ram_is_shared(rb) || !qemu_ram_is_named_file(rb)) &&
+        qemu_ram_is_migratable(rb) &&
+        rb->fd < 0;
+}
+
+/*
+ * Add a MIG_MODE_CPR_EXEC blocker for each volatile ram block.
+ */
+void ram_block_add_cpr_blocker(RAMBlock *rb, Error **errp)
+{
+    if (!ram_is_volatile(rb)) {
+        return;
+    }
+
+    error_setg(&rb->cpr_blocker,
+               "Memory region %s is volatile. A memory-backend-memfd or "
+               "memory-backend-file with share=on is required.",
+               memory_region_name(rb->mr));
+    migrate_add_blocker_mode(&rb->cpr_blocker, MIG_MODE_CPR_EXEC, errp);
+}
+
+void ram_block_del_cpr_blocker(RAMBlock *rb)
+{
+    migrate_del_blocker(&rb->cpr_blocker);
+}
-- 
1.8.3.1




* [PATCH V1 23/26] migration: misc cpr-exec blockers
  2024-04-29 15:55 [PATCH V1 00/26] Live update: cpr-exec Steve Sistare
                   ` (21 preceding siblings ...)
  2024-04-29 15:55 ` [PATCH V1 22/26] migration: ram block cpr-exec blockers Steve Sistare
@ 2024-04-29 15:55 ` Steve Sistare
  2024-05-09 18:05   ` Fabiano Rosas
  2024-05-24 12:40   ` Fabiano Rosas
  2024-04-29 15:55 ` [PATCH V1 24/26] seccomp: cpr-exec blocker Steve Sistare
                   ` (6 subsequent siblings)
  29 siblings, 2 replies; 122+ messages in thread
From: Steve Sistare @ 2024-04-29 15:55 UTC (permalink / raw)
  To: qemu-devel
  Cc: Peter Xu, Fabiano Rosas, David Hildenbrand, Igor Mammedov,
	Eduardo Habkost, Marcel Apfelbaum, Philippe Mathieu-Daude,
	Paolo Bonzini, Daniel P. Berrange, Markus Armbruster,
	Steve Sistare

Add blockers for cpr-exec migration mode for devices and options that do
not support it.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 accel/xen/xen-all.c    |  5 +++++
 backends/hostmem-epc.c | 12 ++++++++++--
 hw/vfio/migration.c    |  3 ++-
 replay/replay.c        |  6 ++++++
 4 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/accel/xen/xen-all.c b/accel/xen/xen-all.c
index 0bdefce..9a7ed0f 100644
--- a/accel/xen/xen-all.c
+++ b/accel/xen/xen-all.c
@@ -78,6 +78,7 @@ static void xen_setup_post(MachineState *ms, AccelState *accel)
 static int xen_init(MachineState *ms)
 {
     MachineClass *mc = MACHINE_GET_CLASS(ms);
+    Error *blocker = NULL;
 
     xen_xc = xc_interface_open(0, 0, 0);
     if (xen_xc == NULL) {
@@ -112,6 +113,10 @@ static int xen_init(MachineState *ms)
     mc->default_ram_id = NULL;
 
     xen_mode = XEN_ATTACH;
+
+    error_setg(&blocker, "xen does not support cpr exec");
+    migrate_add_blocker_mode(&blocker, MIG_MODE_CPR_EXEC, &error_fatal);
+
     return 0;
 }
 
diff --git a/backends/hostmem-epc.c b/backends/hostmem-epc.c
index 735e2e1..837300f 100644
--- a/backends/hostmem-epc.c
+++ b/backends/hostmem-epc.c
@@ -15,6 +15,7 @@
 #include "qom/object_interfaces.h"
 #include "qapi/error.h"
 #include "sysemu/hostmem.h"
+#include "migration/blocker.h"
 #include "hw/i386/hostmem-epc.h"
 
 static bool
@@ -23,6 +24,7 @@ sgx_epc_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
     g_autofree char *name = NULL;
     uint32_t ram_flags;
     int fd;
+    Error *blocker = NULL;
 
     if (!backend->size) {
         error_setg(errp, "can't create backend with size 0");
@@ -38,8 +40,14 @@ sgx_epc_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
 
     name = object_get_canonical_path(OBJECT(backend));
     ram_flags = (backend->share ? RAM_SHARED : 0) | RAM_PROTECTED;
-    return memory_region_init_ram_from_fd(&backend->mr, OBJECT(backend), name,
-                                          backend->size, ram_flags, fd, 0, errp);
+    if (!memory_region_init_ram_from_fd(&backend->mr, OBJECT(backend),
+                                        name, backend->size, ram_flags,
+                                        fd, 0, errp)) {
+        return false;
+    }
+    error_setg(&blocker, "memory-backend-epc does not support cpr exec");
+    migrate_add_blocker_mode(&blocker, MIG_MODE_CPR_EXEC, &error_fatal);
+    return true;
 }
 
 static void sgx_epc_backend_instance_init(Object *obj)
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 06ae409..b9cd783 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -898,7 +898,8 @@ static int vfio_block_migration(VFIODevice *vbasedev, Error *err, Error **errp)
     vbasedev->migration_blocker = error_copy(err);
     error_free(err);
 
-    return migrate_add_blocker_normal(&vbasedev->migration_blocker, errp);
+    return migrate_add_blocker_modes(&vbasedev->migration_blocker, errp,
+                                     MIG_MODE_NORMAL, MIG_MODE_CPR_EXEC, -1);
 }
 
 /* ---------------------------------------------------------------------- */
diff --git a/replay/replay.c b/replay/replay.c
index a2c576c..1bf3f38 100644
--- a/replay/replay.c
+++ b/replay/replay.c
@@ -19,6 +19,7 @@
 #include "qemu/option.h"
 #include "sysemu/cpus.h"
 #include "qemu/error-report.h"
+#include "migration/blocker.h"
 
 /* Current version of the replay mechanism.
    Increase it when file format changes. */
@@ -339,6 +340,11 @@ G_NORETURN void replay_sync_error(const char *error)
 static void replay_enable(const char *fname, int mode)
 {
     const char *fmode = NULL;
+    Error *blocker = NULL;
+
+    error_setg(&blocker, "replay is not compatible with cpr");
+    migrate_add_blocker_mode(&blocker, MIG_MODE_CPR_EXEC, &error_fatal);
+
     assert(!replay_file);
 
     switch (mode) {
-- 
1.8.3.1




* [PATCH V1 24/26] seccomp: cpr-exec blocker
  2024-04-29 15:55 [PATCH V1 00/26] Live update: cpr-exec Steve Sistare
                   ` (22 preceding siblings ...)
  2024-04-29 15:55 ` [PATCH V1 23/26] migration: misc " Steve Sistare
@ 2024-04-29 15:55 ` Steve Sistare
  2024-05-09 18:16   ` Fabiano Rosas
  2024-05-10  7:54   ` Daniel P. Berrangé
  2024-04-29 15:55 ` [PATCH V1 25/26] migration: fix mismatched GPAs during cpr-exec Steve Sistare
                   ` (5 subsequent siblings)
  29 siblings, 2 replies; 122+ messages in thread
From: Steve Sistare @ 2024-04-29 15:55 UTC (permalink / raw)
  To: qemu-devel
  Cc: Peter Xu, Fabiano Rosas, David Hildenbrand, Igor Mammedov,
	Eduardo Habkost, Marcel Apfelbaum, Philippe Mathieu-Daude,
	Paolo Bonzini, Daniel P. Berrange, Markus Armbruster,
	Steve Sistare

cpr-exec mode needs permission to exec.  Add a cpr-exec blocker if the
-sandbox options deny spawning (QEMU_SECCOMP_SET_SPAWN).

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 include/sysemu/seccomp.h |  1 +
 system/qemu-seccomp.c    | 10 ++++++++--
 system/vl.c              |  6 ++++++
 3 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/include/sysemu/seccomp.h b/include/sysemu/seccomp.h
index fe85989..023c0a1 100644
--- a/include/sysemu/seccomp.h
+++ b/include/sysemu/seccomp.h
@@ -22,5 +22,6 @@
 #define QEMU_SECCOMP_SET_RESOURCECTL (1 << 4)
 
 int parse_sandbox(void *opaque, QemuOpts *opts, Error **errp);
+uint32_t qemu_seccomp_get_opts(void);
 
 #endif
diff --git a/system/qemu-seccomp.c b/system/qemu-seccomp.c
index 5c20ac0..0d2a561 100644
--- a/system/qemu-seccomp.c
+++ b/system/qemu-seccomp.c
@@ -360,12 +360,18 @@ static int seccomp_start(uint32_t seccomp_opts, Error **errp)
     return rc < 0 ? -1 : 0;
 }
 
+static uint32_t seccomp_opts;
+
+uint32_t qemu_seccomp_get_opts(void)
+{
+    return seccomp_opts;
+}
+
 int parse_sandbox(void *opaque, QemuOpts *opts, Error **errp)
 {
     if (qemu_opt_get_bool(opts, "enable", false)) {
-        uint32_t seccomp_opts = QEMU_SECCOMP_SET_DEFAULT
-                | QEMU_SECCOMP_SET_OBSOLETE;
         const char *value = NULL;
+        seccomp_opts = QEMU_SECCOMP_SET_DEFAULT | QEMU_SECCOMP_SET_OBSOLETE;
 
         value = qemu_opt_get(opts, "obsolete");
         if (value) {
diff --git a/system/vl.c b/system/vl.c
index 7252100..b76881e 100644
--- a/system/vl.c
+++ b/system/vl.c
@@ -76,6 +76,7 @@
 #include "hw/block/block.h"
 #include "hw/i386/x86.h"
 #include "hw/i386/pc.h"
+#include "migration/blocker.h"
 #include "migration/cpr.h"
 #include "migration/misc.h"
 #include "migration/snapshot.h"
@@ -2493,6 +2494,11 @@ static void qemu_process_early_options(void)
     QemuOptsList *olist = qemu_find_opts_err("sandbox", NULL);
     if (olist) {
         qemu_opts_foreach(olist, parse_sandbox, NULL, &error_fatal);
+        if (qemu_seccomp_get_opts() & QEMU_SECCOMP_SET_SPAWN) {
+            Error *blocker = NULL;
+            error_setg(&blocker, "-sandbox denies exec for cpr-exec");
+            migrate_add_blocker_mode(&blocker, MIG_MODE_CPR_EXEC, &error_fatal);
+        }
     }
 #endif
 
-- 
1.8.3.1




* [PATCH V1 25/26] migration: fix mismatched GPAs during cpr-exec
  2024-04-29 15:55 [PATCH V1 00/26] Live update: cpr-exec Steve Sistare
                   ` (23 preceding siblings ...)
  2024-04-29 15:55 ` [PATCH V1 24/26] seccomp: cpr-exec blocker Steve Sistare
@ 2024-04-29 15:55 ` Steve Sistare
  2024-05-09 18:39   ` Fabiano Rosas
  2024-04-29 15:55 ` [PATCH V1 26/26] migration: only-migratable-modes Steve Sistare
                   ` (4 subsequent siblings)
  29 siblings, 1 reply; 122+ messages in thread
From: Steve Sistare @ 2024-04-29 15:55 UTC (permalink / raw)
  To: qemu-devel
  Cc: Peter Xu, Fabiano Rosas, David Hildenbrand, Igor Mammedov,
	Eduardo Habkost, Marcel Apfelbaum, Philippe Mathieu-Daude,
	Paolo Bonzini, Daniel P. Berrange, Markus Armbruster,
	Steve Sistare

For cpr-exec mode, migrate_ram_is_ignored() always returns true, and the
address of each migrated memory region must match the address of the statically
initialized region on the target.  However, for a PCI rom block, the region
address is set when the guest writes to a BAR on the source, which does not
occur on the target, causing a "Mismatched GPAs" error during cpr-exec
migration.

To fix, unconditionally set the target's address to the source's address
if the region does not have an address yet.
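
The incoming-side decision can be sketched in isolation as follows.  This is a
toy model, assuming a two-field struct in place of MemoryRegion; the toy_*
names are hypothetical:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the two MemoryRegion fields involved. */
struct toy_region {
    uint64_t addr;
    bool has_addr;
};

/* Record an address without any other side effects. */
static void toy_set_address_only(struct toy_region *mr, uint64_t addr)
{
    mr->addr = addr;
    mr->has_addr = true;
}

/*
 * Adopt the source address when the region has none yet, otherwise
 * require a match.  Returns 0 on success, -1 on mismatch.
 */
static int toy_check_gpa(struct toy_region *mr, uint64_t src_addr)
{
    if (!mr->has_addr) {
        toy_set_address_only(mr, src_addr);
        return 0;
    }
    return mr->addr == src_addr ? 0 : -1;
}
```

Tracking has_addr separately avoids treating address 0 as "unset", since 0 is
a valid offset within a container region.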

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 include/exec/memory.h | 12 ++++++++++++
 migration/ram.c       | 15 +++++++++------
 system/memory.c       | 10 ++++++++--
 3 files changed, 29 insertions(+), 8 deletions(-)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index d337737..4f654b0 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -801,6 +801,7 @@ struct MemoryRegion {
     bool unmergeable;
     uint8_t dirty_log_mask;
     bool is_iommu;
+    bool has_addr;
     RAMBlock *ram_block;
     Object *owner;
     /* owner as TYPE_DEVICE. Used for re-entrancy checks in MR access hotpath */
@@ -2402,6 +2403,17 @@ void memory_region_set_enabled(MemoryRegion *mr, bool enabled);
 void memory_region_set_address(MemoryRegion *mr, hwaddr addr);
 
 /*
+ * memory_region_set_address_only: set the address of a region.
+ *
+ * Same as memory_region_set_address, but without causing transaction side
+ * effects.
+ *
+ * @mr: the region to be updated
+ * @addr: new address, relative to container region
+ */
+void memory_region_set_address_only(MemoryRegion *mr, hwaddr addr);
+
+/*
  * memory_region_set_size: dynamically update the size of a region.
  *
  * Dynamically updates the size of a region.
diff --git a/migration/ram.c b/migration/ram.c
index add285b..7b8d7f6 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -4196,12 +4196,15 @@ static int parse_ramblock(QEMUFile *f, RAMBlock *block, ram_addr_t length)
     }
     if (migrate_ignore_shared()) {
         hwaddr addr = qemu_get_be64(f);
-        if (migrate_ram_is_ignored(block) &&
-            block->mr->addr != addr) {
-            error_report("Mismatched GPAs for block %s "
-                         "%" PRId64 "!= %" PRId64, block->idstr,
-                         (uint64_t)addr, (uint64_t)block->mr->addr);
-            return -EINVAL;
+        if (migrate_ram_is_ignored(block)) {
+            if (!block->mr->has_addr) {
+                memory_region_set_address_only(block->mr, addr);
+            } else if (block->mr->addr != addr) {
+                error_report("Mismatched GPAs for block %s "
+                             "%" PRId64 "!= %" PRId64, block->idstr,
+                             (uint64_t)addr, (uint64_t)block->mr->addr);
+                return -EINVAL;
+            }
         }
     }
     ret = rdma_block_notification_handle(f, block->idstr);
diff --git a/system/memory.c b/system/memory.c
index ca04a0e..3c72504 100644
--- a/system/memory.c
+++ b/system/memory.c
@@ -2665,7 +2665,7 @@ static void memory_region_add_subregion_common(MemoryRegion *mr,
     for (alias = subregion->alias; alias; alias = alias->alias) {
         alias->mapped_via_alias++;
     }
-    subregion->addr = offset;
+    memory_region_set_address_only(subregion, offset);
     memory_region_update_container_subregions(subregion);
 }
 
@@ -2745,10 +2745,16 @@ static void memory_region_readd_subregion(MemoryRegion *mr)
     }
 }
 
+void memory_region_set_address_only(MemoryRegion *mr, hwaddr addr)
+{
+    mr->addr = addr;
+    mr->has_addr = true;
+}
+
 void memory_region_set_address(MemoryRegion *mr, hwaddr addr)
 {
     if (addr != mr->addr) {
-        mr->addr = addr;
+        memory_region_set_address_only(mr, addr);
         memory_region_readd_subregion(mr);
     }
 }
-- 
1.8.3.1




* [PATCH V1 26/26] migration: only-migratable-modes
  2024-04-29 15:55 [PATCH V1 00/26] Live update: cpr-exec Steve Sistare
                   ` (24 preceding siblings ...)
  2024-04-29 15:55 ` [PATCH V1 25/26] migration: fix mismatched GPAs during cpr-exec Steve Sistare
@ 2024-04-29 15:55 ` Steve Sistare
  2024-05-09 19:14   ` Fabiano Rosas
  2024-05-21  8:05   ` Daniel P. Berrangé
  2024-05-02 16:13 ` cpr-exec doc (was Re: [PATCH V1 00/26] Live update: cpr-exec) Steven Sistare
                   ` (3 subsequent siblings)
  29 siblings, 2 replies; 122+ messages in thread
From: Steve Sistare @ 2024-04-29 15:55 UTC (permalink / raw)
  To: qemu-devel
  Cc: Peter Xu, Fabiano Rosas, David Hildenbrand, Igor Mammedov,
	Eduardo Habkost, Marcel Apfelbaum, Philippe Mathieu-Daude,
	Paolo Bonzini, Daniel P. Berrange, Markus Armbruster,
	Steve Sistare

Add the only-migratable-modes option as a generalization of only-migratable.
Only devices that support all requested modes are allowed.
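
The bookkeeping amounts to a bitmask of required modes that is intersected
with the modes a blocker covers.  A minimal standalone sketch, using
hypothetical toy_* names rather than the QEMU symbols:

```c
#include <stdbool.h>

/* Toy stand-ins for MigMode values; illustrative only. */
enum { TOY_MODE_NORMAL, TOY_MODE_CPR_REBOOT, TOY_MODE_CPR_EXEC };

/* One bit per mode that the user requires to stay migratable. */
static unsigned toy_modes_required;

static void toy_set_required_mode(int mode)
{
    toy_modes_required |= 1u << mode;
}

static bool toy_mode_required(int mode)
{
    return (toy_modes_required & (1u << mode)) != 0;
}

/* A blocker is rejected if it covers any mode in the required set. */
static bool toy_blocker_rejected(unsigned blocked_modes)
{
    return (toy_modes_required & blocked_modes) != 0;
}
```

With only cpr-exec required, a blocker that covers just normal mode is still
accepted, which is the generalization over a single only-migratable flag.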

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 include/migration/misc.h       |  3 +++
 include/sysemu/sysemu.h        |  1 -
 migration/migration-hmp-cmds.c | 26 +++++++++++++++++++++++++-
 migration/migration.c          | 22 +++++++++++++++++-----
 migration/savevm.c             |  2 +-
 qemu-options.hx                | 16 ++++++++++++++--
 system/globals.c               |  1 -
 system/vl.c                    | 13 ++++++++++++-
 target/s390x/cpu_models.c      |  4 +++-
 9 files changed, 75 insertions(+), 13 deletions(-)

diff --git a/include/migration/misc.h b/include/migration/misc.h
index 5b963ba..3ad2cd9 100644
--- a/include/migration/misc.h
+++ b/include/migration/misc.h
@@ -119,6 +119,9 @@ bool migration_incoming_postcopy_advised(void);
 /* True if background snapshot is active */
 bool migration_in_bg_snapshot(void);
 
+void migration_set_required_mode(MigMode mode);
+bool migration_mode_required(MigMode mode);
+
 /* migration/block-dirty-bitmap.c */
 void dirty_bitmap_mig_init(void);
 
diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
index 5b4397e..0a9c4b4 100644
--- a/include/sysemu/sysemu.h
+++ b/include/sysemu/sysemu.h
@@ -8,7 +8,6 @@
 
 /* vl.c */
 
-extern int only_migratable;
 extern const char *qemu_name;
 extern QemuUUID qemu_uuid;
 extern bool qemu_uuid_set;
diff --git a/migration/migration-hmp-cmds.c b/migration/migration-hmp-cmds.c
index 414c7e8..ca913b7 100644
--- a/migration/migration-hmp-cmds.c
+++ b/migration/migration-hmp-cmds.c
@@ -16,6 +16,7 @@
 #include "qemu/osdep.h"
 #include "block/qapi.h"
 #include "migration/snapshot.h"
+#include "migration/misc.h"
 #include "monitor/hmp.h"
 #include "monitor/monitor.h"
 #include "qapi/error.h"
@@ -33,6 +34,28 @@
 #include "options.h"
 #include "migration.h"
 
+static void migration_dump_modes(Monitor *mon)
+{
+    int mode, n = 0;
+
+    monitor_printf(mon, "only-migratable-modes: ");
+
+    for (mode = 0; mode < MIG_MODE__MAX; mode++) {
+        if (migration_mode_required(mode)) {
+            if (n++) {
+                monitor_printf(mon, ",");
+            }
+            monitor_printf(mon, "%s", MigMode_str(mode));
+        }
+    }
+
+    if (!n) {
+        monitor_printf(mon, "none\n");
+    } else {
+        monitor_printf(mon, "\n");
+    }
+}
+
 static void migration_global_dump(Monitor *mon)
 {
     MigrationState *ms = migrate_get_current();
@@ -41,7 +64,7 @@ static void migration_global_dump(Monitor *mon)
     monitor_printf(mon, "store-global-state: %s\n",
                    ms->store_global_state ? "on" : "off");
     monitor_printf(mon, "only-migratable: %s\n",
-                   only_migratable ? "on" : "off");
+                   migration_mode_required(MIG_MODE_NORMAL) ? "on" : "off");
     monitor_printf(mon, "send-configuration: %s\n",
                    ms->send_configuration ? "on" : "off");
     monitor_printf(mon, "send-section-footer: %s\n",
@@ -50,6 +73,7 @@ static void migration_global_dump(Monitor *mon)
                    ms->decompress_error_check ? "on" : "off");
     monitor_printf(mon, "clear-bitmap-shift: %u\n",
                    ms->clear_bitmap_shift);
+    migration_dump_modes(mon);
 }
 
 void hmp_info_migrate(Monitor *mon, const QDict *qdict)
diff --git a/migration/migration.c b/migration/migration.c
index 4984dee..5535b84 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1719,17 +1719,29 @@ static bool is_busy(Error **reasonp, Error **errp)
     return false;
 }
 
-static bool is_only_migratable(Error **reasonp, Error **errp, int modes)
+static int migration_modes_required;
+
+void migration_set_required_mode(MigMode mode)
+{
+    migration_modes_required |= BIT(mode);
+}
+
+bool migration_mode_required(MigMode mode)
+{
+    return !!(migration_modes_required & BIT(mode));
+}
+
+static bool modes_are_required(Error **reasonp, Error **errp, int modes)
 {
     ERRP_GUARD();
 
-    if (only_migratable && (modes & BIT(MIG_MODE_NORMAL))) {
+    if (migration_modes_required & modes) {
         error_propagate_prepend(errp, *reasonp,
-                                "disallowing migration blocker "
-                                "(--only-migratable) for: ");
+                                "-only-migratable{-modes} specified, but: ");
         *reasonp = NULL;
         return true;
     }
+
     return false;
 }
 
@@ -1783,7 +1795,7 @@ int migrate_add_blocker_modes(Error **reasonp, Error **errp, MigMode mode, ...)
     modes = get_modes(mode, ap);
     va_end(ap);
 
-    if (is_only_migratable(reasonp, errp, modes)) {
+    if (modes_are_required(reasonp, errp, modes)) {
         return -EACCES;
     } else if (is_busy(reasonp, errp)) {
         return -EBUSY;
diff --git a/migration/savevm.c b/migration/savevm.c
index 6087c3a..e53ac84 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -3585,7 +3585,7 @@ void vmstate_register_ram_global(MemoryRegion *mr)
 bool vmstate_check_only_migratable(const VMStateDescription *vmsd)
 {
     /* check needed if --only-migratable is specified */
-    if (!only_migratable) {
+    if (!migration_mode_required(MIG_MODE_NORMAL)) {
         return true;
     }
 
diff --git a/qemu-options.hx b/qemu-options.hx
index f0dfda5..946d731 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -4807,8 +4807,20 @@ DEF("only-migratable", 0, QEMU_OPTION_only_migratable, \
     "-only-migratable     allow only migratable devices\n", QEMU_ARCH_ALL)
 SRST
 ``-only-migratable``
-    Only allow migratable devices. Devices will not be allowed to enter
-    an unmigratable state.
+    Only allow devices that can migrate using normal mode. Devices will not
+    be allowed to enter an unmigratable state.
+ERST
+
+DEF("only-migratable-modes", HAS_ARG, QEMU_OPTION_only_migratable_modes, \
+    "-only-migratable-modes mode1[,...]\n"
+    "                allow only devices that are migratable using mode(s)\n",
+    QEMU_ARCH_ALL)
+SRST
+``-only-migratable-modes mode1[,...]``
+    Only allow devices which are migratable using all modes in the list,
+    which guarantees that migration will not fail due to a blocker.
+    If both only-migratable-modes and only-migratable are specified,
+    or are specified multiple times, then the required modes accumulate.
 ERST
 
 DEF("nodefaults", 0, QEMU_OPTION_nodefaults, \
diff --git a/system/globals.c b/system/globals.c
index e353584..fdc263e 100644
--- a/system/globals.c
+++ b/system/globals.c
@@ -48,7 +48,6 @@ const char *qemu_name;
 unsigned int nb_prom_envs;
 const char *prom_envs[MAX_PROM_ENVS];
 uint8_t *boot_splash_filedata;
-int only_migratable; /* turn it off unless user states otherwise */
 int icount_align_option;
 
 /* The bytes in qemu_uuid are in the order specified by RFC4122, _not_ in the
diff --git a/system/vl.c b/system/vl.c
index b76881e..7e73be9 100644
--- a/system/vl.c
+++ b/system/vl.c
@@ -3458,7 +3458,18 @@ void qemu_init(int argc, char **argv)
                 incoming = optarg;
                 break;
             case QEMU_OPTION_only_migratable:
-                only_migratable = 1;
+                migration_set_required_mode(MIG_MODE_NORMAL);
+                break;
+            case QEMU_OPTION_only_migratable_modes:
+                {
+                    int i, mode;
+                    g_auto(GStrv) words = g_strsplit(optarg, ",", -1);
+                    for (i = 0; words[i]; i++) {
+                        mode = qapi_enum_parse(&MigMode_lookup, words[i], -1,
+                                               &error_fatal);
+                        migration_set_required_mode(mode);
+                    }
+                }
                 break;
             case QEMU_OPTION_nodefaults:
                 has_defaults = 0;
diff --git a/target/s390x/cpu_models.c b/target/s390x/cpu_models.c
index 8ed3bb6..42ad160 100644
--- a/target/s390x/cpu_models.c
+++ b/target/s390x/cpu_models.c
@@ -16,6 +16,7 @@
 #include "kvm/kvm_s390x.h"
 #include "sysemu/kvm.h"
 #include "sysemu/tcg.h"
+#include "migration/misc.h"
 #include "qapi/error.h"
 #include "qemu/error-report.h"
 #include "qapi/visitor.h"
@@ -526,7 +527,8 @@ static void check_compatibility(const S390CPUModel *max_model,
     }
 
 #ifndef CONFIG_USER_ONLY
-    if (only_migratable && test_bit(S390_FEAT_UNPACK, model->features)) {
+    if (migration_mode_required(MIG_MODE_NORMAL) &&
+        test_bit(S390_FEAT_UNPACK, model->features)) {
         error_setg(errp, "The unpack facility is not compatible with "
                    "the --only-migratable option. You must remove either "
                    "the 'unpack' facility or the --only-migratable option");
-- 
1.8.3.1
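
For readers following the patch above, the way required modes accumulate and are then checked against a blocker can be sketched outside QEMU.  This is a minimal standalone sketch, not the patch's code: the Mode enum, the name table, and all helper names are stand-ins assumed for illustration (the patch itself uses MigMode, qapi_enum_parse() and BIT()).

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for QEMU's MigMode enum (assumption: same order as the QAPI enum). */
typedef enum { MODE_NORMAL, MODE_CPR_REBOOT, MODE_CPR_EXEC, MODE__MAX } Mode;

static const char *const mode_names[MODE__MAX] = {
    "normal", "cpr-reboot", "cpr-exec",
};

/* One bit per mode; bits are only ever added, so repeated options accumulate. */
static unsigned modes_required;

static void set_required_mode(Mode m)
{
    modes_required |= 1u << m;
}

static bool mode_required(Mode m)
{
    return (modes_required & (1u << m)) != 0;
}

/*
 * Parse a comma-separated list such as "normal,cpr-exec".
 * Returns false on an unknown name; modes parsed before the bad name stay
 * set (QEMU instead exits at that point via error_fatal).
 */
static bool parse_required_modes(const char *arg)
{
    while (*arg) {
        size_t len = strcspn(arg, ",");
        int m;

        for (m = 0; m < MODE__MAX; m++) {
            if (strlen(mode_names[m]) == len &&
                !strncmp(arg, mode_names[m], len)) {
                break;
            }
        }
        if (m == MODE__MAX) {
            return false;
        }
        set_required_mode((Mode)m);
        arg += len;
        if (*arg == ',') {
            arg++;
        }
    }
    return true;
}

/* A blocker that covers any required mode must be refused. */
static bool blocker_rejected(unsigned blocked_modes)
{
    return (modes_required & blocked_modes) != 0;
}
```

With this shape, -only-migratable reduces to set_required_mode(MODE_NORMAL), and a blocker registered only for cpr-reboot is still accepted when only cpr-exec is required.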




* Re: [PATCH V1 18/26] migration: cpr-exec-args parameter
  2024-04-29 15:55 ` [PATCH V1 18/26] migration: cpr-exec-args parameter Steve Sistare
@ 2024-05-02 12:23   ` Markus Armbruster
  2024-05-02 16:00     ` Steven Sistare
  2024-05-21  8:13   ` Daniel P. Berrangé
  1 sibling, 1 reply; 122+ messages in thread
From: Markus Armbruster @ 2024-05-02 12:23 UTC (permalink / raw)
  To: Steve Sistare
  Cc: qemu-devel, Peter Xu, Fabiano Rosas, David Hildenbrand,
	Igor Mammedov, Eduardo Habkost, Marcel Apfelbaum,
	Philippe Mathieu-Daude, Paolo Bonzini, Daniel P. Berrange

Steve Sistare <steven.sistare@oracle.com> writes:

> Create the cpr-exec-args migration parameter, defined as a list of
> strings.  It will be used for cpr-exec migration mode in a subsequent
> patch.
>
> No functional change, except that cpr-exec-args is shown by the
> 'info migrate' command.
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>

[...]

> diff --git a/qapi/migration.json b/qapi/migration.json
> index 8c65b90..49710e7 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -914,6 +914,9 @@
>  #     See description in @ZeroPageDetection.  Default is 'multifd'.
>  #     (since 9.0)
>  #
> +# @cpr-exec-args: Arguments passed to new QEMU for @cpr-exec mode.
> +#    See @cpr-exec for details.  (Since 9.1)
> +#

You mean migration mode @cpr-exec, don't you?

If yes, dangling reference until PATCH 20 adds it.  Okay, but worth a
mention in the commit message.

Suggest "See MigMode @cpr-exec for details."

>  # Features:
>  #
>  # @deprecated: Member @block-incremental is deprecated.  Use
> @@ -948,7 +951,8 @@
>             { 'name': 'x-vcpu-dirty-limit-period', 'features': ['unstable'] },
>             'vcpu-dirty-limit',
>             'mode',
> -           'zero-page-detection'] }
> +           'zero-page-detection',
> +           'cpr-exec-args'] }
>  
>  ##
>  # @MigrateSetParameters:
> @@ -1122,6 +1126,9 @@
>  #     See description in @ZeroPageDetection.  Default is 'multifd'.
>  #     (since 9.0)
>  #
> +# @cpr-exec-args: Arguments passed to new QEMU for @cpr-exec mode.
> +#    See @cpr-exec for details.  (Since 9.1)
> +#
>  # Features:
>  #
>  # @deprecated: Member @block-incremental is deprecated.  Use
> @@ -1176,7 +1183,8 @@
>                                              'features': [ 'unstable' ] },
>              '*vcpu-dirty-limit': 'uint64',
>              '*mode': 'MigMode',
> -            '*zero-page-detection': 'ZeroPageDetection'} }
> +            '*zero-page-detection': 'ZeroPageDetection',
> +            '*cpr-exec-args': [ 'str' ]} }
>  
>  ##
>  # @migrate-set-parameters:
> @@ -1354,6 +1362,9 @@
>  #     See description in @ZeroPageDetection.  Default is 'multifd'.
>  #     (since 9.0)
>  #
> +# @cpr-exec-args: Arguments passed to new QEMU for @cpr-exec mode.
> +#    See @cpr-exec for details.  (Since 9.1)
> +#
>  # Features:
>  #
>  # @deprecated: Member @block-incremental is deprecated.  Use
> @@ -1405,7 +1416,8 @@
>                                              'features': [ 'unstable' ] },
>              '*vcpu-dirty-limit': 'uint64',
>              '*mode': 'MigMode',
> -            '*zero-page-detection': 'ZeroPageDetection'} }
> +            '*zero-page-detection': 'ZeroPageDetection',
> +            '*cpr-exec-args': [ 'str' ]} }
>  
>  ##
>  # @query-migrate-parameters:




* Re: [PATCH V1 20/26] migration: cpr-exec mode
  2024-04-29 15:55 ` [PATCH V1 20/26] migration: cpr-exec mode Steve Sistare
@ 2024-05-02 12:23   ` Markus Armbruster
  2024-05-02 16:00     ` Steven Sistare
  2024-05-21  8:20   ` Daniel P. Berrangé
  2024-05-24 14:58   ` Fabiano Rosas
  2 siblings, 1 reply; 122+ messages in thread
From: Markus Armbruster @ 2024-05-02 12:23 UTC (permalink / raw)
  To: Steve Sistare
  Cc: qemu-devel, Peter Xu, Fabiano Rosas, David Hildenbrand,
	Igor Mammedov, Eduardo Habkost, Marcel Apfelbaum,
	Philippe Mathieu-Daude, Paolo Bonzini, Daniel P. Berrange

Steve Sistare <steven.sistare@oracle.com> writes:

> Add the cpr-exec migration mode.  Usage:
>   qemu-system-$arch -machine memfd-alloc=on ...
>   migrate_set_parameter mode cpr-exec
>   migrate_set_parameter cpr-exec-args \
>     <arg1> <arg2> ... -incoming <uri>
>   migrate -d <uri>
>
> The migrate command stops the VM, saves state to the URI,
> directly exec's a new version of QEMU on the same host,
> replacing the original process while retaining its PID, and
> loads state from the URI.  Guest RAM is preserved in place,
> albeit with new virtual addresses.
>
> Arguments for the new QEMU process are taken from the
> @cpr-exec-args parameter.  The first argument should be the
> path of a new QEMU binary, or a prefix command that exec's the
> new QEMU binary.
>
> Because old QEMU terminates when new QEMU starts, one cannot
> stream data between the two, so the URI must be a type, such as
> a file, that reads all data before old QEMU exits.
>
> Memory backend objects must have the share=on attribute, and
> must be mmap'able in the new QEMU process.  For example,
> memory-backend-file is acceptable, but memory-backend-ram is
> not.
>
> The VM must be started with the '-machine memfd-alloc=on'
> option.  This causes implicit ram blocks (those not explicitly
> described by a memory-backend object) to be allocated by
> mmap'ing a memfd.  Examples include VGA, ROM, and even guest
> RAM when it is specified without a memory-backend object.
>
> The implementation saves precreate vmstate at the end of normal
> migration in migrate_fd_cleanup, and tells the main loop to call
> cpr_exec.  Incoming qemu loads precreate state early, before objects
> are created.  The memfds are kept open across exec by clearing the
> close-on-exec flag, their values are saved in precreate vmstate,
> and they are mmap'd in new qemu.
>
> Note that the memfd-alloc option is not related to memory-backend-memfd.
> Later patches add support for memory-backend-memfd, and for additional
> devices, including vfio, chardev, and more.
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>

[...]

> diff --git a/qapi/migration.json b/qapi/migration.json
> index 49710e7..7c5f45f 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -665,9 +665,37 @@
>  #     or COLO.
>  #
>  #     (since 8.2)
> +#
> +# @cpr-exec: The migrate command stops the VM, saves state to the URI,
> +#     directly exec's a new version of QEMU on the same host,
> +#     replacing the original process while retaining its PID, and
> +#     loads state from the URI.  Guest RAM is preserved in place,
> +#     albeit with new virtual addresses.

Do you mean the virtual addresses of guest RAM may differ between old and
new QEMU process?

> +#
> +#     Arguments for the new QEMU process are taken from the
> +#     @cpr-exec-args parameter.  The first argument should be the
> +#     path of a new QEMU binary, or a prefix command that exec's the
> +#     new QEMU binary.

What's a "prefix command"?  A wrapper script, perhaps?

> +#
> +#     Because old QEMU terminates when new QEMU starts, one cannot
> +#     stream data between the two, so the URI must be a type, such as
> +#     a file, that reads all data before old QEMU exits.

What happens when you specify a URI that doesn't?

> +#
> +#     Memory backend objects must have the share=on attribute, and
> +#     must be mmap'able in the new QEMU process.  For example,
> +#     memory-backend-file is acceptable, but memory-backend-ram is
> +#     not.
> +#
> +#     The VM must be started with the '-machine memfd-alloc=on'

What happens when you don't?

> +#     option.  This causes implicit ram blocks -- those not explicitly
> +#     described by a memory-backend object -- to be allocated by
> +#     mmap'ing a memfd.  Examples include VGA, ROM, and even guest
> +#     RAM when it is specified without a memory-backend object.
> +#
> +#     (since 9.1)
>  ##
>  { 'enum': 'MigMode',
> -  'data': [ 'normal', 'cpr-reboot' ] }
> +  'data': [ 'normal', 'cpr-reboot', 'cpr-exec' ] }
>  
>  ##
>  # @ZeroPageDetection:

[...]




* Re: [PATCH V1 18/26] migration: cpr-exec-args parameter
  2024-05-02 12:23   ` Markus Armbruster
@ 2024-05-02 16:00     ` Steven Sistare
  0 siblings, 0 replies; 122+ messages in thread
From: Steven Sistare @ 2024-05-02 16:00 UTC (permalink / raw)
  To: Markus Armbruster
  Cc: qemu-devel, Peter Xu, Fabiano Rosas, David Hildenbrand,
	Igor Mammedov, Eduardo Habkost, Marcel Apfelbaum,
	Philippe Mathieu-Daude, Paolo Bonzini, Daniel P. Berrange

On 5/2/2024 8:23 AM, Markus Armbruster wrote:
> Steve Sistare <steven.sistare@oracle.com> writes:
> 
>> Create the cpr-exec-args migration parameter, defined as a list of
>> strings.  It will be used for cpr-exec migration mode in a subsequent
>> patch.
>>
>> No functional change, except that cpr-exec-args is shown by the
>> 'info migrate' command.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> 
> [...]
> 
>> diff --git a/qapi/migration.json b/qapi/migration.json
>> index 8c65b90..49710e7 100644
>> --- a/qapi/migration.json
>> +++ b/qapi/migration.json
>> @@ -914,6 +914,9 @@
>>   #     See description in @ZeroPageDetection.  Default is 'multifd'.
>>   #     (since 9.0)
>>   #
>> +# @cpr-exec-args: Arguments passed to new QEMU for @cpr-exec mode.
>> +#    See @cpr-exec for details.  (Since 9.1)
>> +#
> 
> You mean migration mode @cpr-exec, don't you?
> 
> If yes, dangling reference until PATCH 20 adds it.  Okay, but worth a
> mention in the commit message.
> 
> Suggest "See MigMode @cpr-exec for details."

Yes to all.  Will update as you suggest.

- Steve

>>   # Features:
>>   #
>>   # @deprecated: Member @block-incremental is deprecated.  Use
>> @@ -948,7 +951,8 @@
>>              { 'name': 'x-vcpu-dirty-limit-period', 'features': ['unstable'] },
>>              'vcpu-dirty-limit',
>>              'mode',
>> -           'zero-page-detection'] }
>> +           'zero-page-detection',
>> +           'cpr-exec-args'] }
>>   
>>   ##
>>   # @MigrateSetParameters:
>> @@ -1122,6 +1126,9 @@
>>   #     See description in @ZeroPageDetection.  Default is 'multifd'.
>>   #     (since 9.0)
>>   #
>> +# @cpr-exec-args: Arguments passed to new QEMU for @cpr-exec mode.
>> +#    See @cpr-exec for details.  (Since 9.1)
>> +#
>>   # Features:
>>   #
>>   # @deprecated: Member @block-incremental is deprecated.  Use
>> @@ -1176,7 +1183,8 @@
>>                                               'features': [ 'unstable' ] },
>>               '*vcpu-dirty-limit': 'uint64',
>>               '*mode': 'MigMode',
>> -            '*zero-page-detection': 'ZeroPageDetection'} }
>> +            '*zero-page-detection': 'ZeroPageDetection',
>> +            '*cpr-exec-args': [ 'str' ]} }
>>   
>>   ##
>>   # @migrate-set-parameters:
>> @@ -1354,6 +1362,9 @@
>>   #     See description in @ZeroPageDetection.  Default is 'multifd'.
>>   #     (since 9.0)
>>   #
>> +# @cpr-exec-args: Arguments passed to new QEMU for @cpr-exec mode.
>> +#    See @cpr-exec for details.  (Since 9.1)
>> +#
>>   # Features:
>>   #
>>   # @deprecated: Member @block-incremental is deprecated.  Use
>> @@ -1405,7 +1416,8 @@
>>                                               'features': [ 'unstable' ] },
>>               '*vcpu-dirty-limit': 'uint64',
>>               '*mode': 'MigMode',
>> -            '*zero-page-detection': 'ZeroPageDetection'} }
>> +            '*zero-page-detection': 'ZeroPageDetection',
>> +            '*cpr-exec-args': [ 'str' ]} }
>>   
>>   ##
>>   # @query-migrate-parameters:
> 



* Re: [PATCH V1 20/26] migration: cpr-exec mode
  2024-05-02 12:23   ` Markus Armbruster
@ 2024-05-02 16:00     ` Steven Sistare
  2024-05-03  6:26       ` Markus Armbruster
  0 siblings, 1 reply; 122+ messages in thread
From: Steven Sistare @ 2024-05-02 16:00 UTC (permalink / raw)
  To: Markus Armbruster
  Cc: qemu-devel, Peter Xu, Fabiano Rosas, David Hildenbrand,
	Igor Mammedov, Eduardo Habkost, Marcel Apfelbaum,
	Philippe Mathieu-Daude, Paolo Bonzini, Daniel P. Berrange

On 5/2/2024 8:23 AM, Markus Armbruster wrote:
> Steve Sistare <steven.sistare@oracle.com> writes:
> 
>> Add the cpr-exec migration mode.  Usage:
>>    qemu-system-$arch -machine memfd-alloc=on ...
>>    migrate_set_parameter mode cpr-exec
>>    migrate_set_parameter cpr-exec-args \
>>      <arg1> <arg2> ... -incoming <uri>
>>    migrate -d <uri>
>>
>> The migrate command stops the VM, saves state to the URI,
>> directly exec's a new version of QEMU on the same host,
>> replacing the original process while retaining its PID, and
>> loads state from the URI.  Guest RAM is preserved in place,
>> albeit with new virtual addresses.
>>
>> Arguments for the new QEMU process are taken from the
>> @cpr-exec-args parameter.  The first argument should be the
>> path of a new QEMU binary, or a prefix command that exec's the
>> new QEMU binary.
>>
>> Because old QEMU terminates when new QEMU starts, one cannot
>> stream data between the two, so the URI must be a type, such as
>> a file, that reads all data before old QEMU exits.
>>
>> Memory backend objects must have the share=on attribute, and
>> must be mmap'able in the new QEMU process.  For example,
>> memory-backend-file is acceptable, but memory-backend-ram is
>> not.
>>
>> The VM must be started with the '-machine memfd-alloc=on'
>> option.  This causes implicit ram blocks (those not explicitly
>> described by a memory-backend object) to be allocated by
>> mmap'ing a memfd.  Examples include VGA, ROM, and even guest
>> RAM when it is specified without a memory-backend object.
>>
>> The implementation saves precreate vmstate at the end of normal
>> migration in migrate_fd_cleanup, and tells the main loop to call
>> cpr_exec.  Incoming qemu loads precreate state early, before objects
>> are created.  The memfds are kept open across exec by clearing the
>> close-on-exec flag, their values are saved in precreate vmstate,
>> and they are mmap'd in new qemu.
>>
>> Note that the memfd-alloc option is not related to memory-backend-memfd.
>> Later patches add support for memory-backend-memfd, and for additional
>> devices, including vfio, chardev, and more.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> 
> [...]
> 
>> diff --git a/qapi/migration.json b/qapi/migration.json
>> index 49710e7..7c5f45f 100644
>> --- a/qapi/migration.json
>> +++ b/qapi/migration.json
>> @@ -665,9 +665,37 @@
>>   #     or COLO.
>>   #
>>   #     (since 8.2)
>> +#
>> +# @cpr-exec: The migrate command stops the VM, saves state to the URI,
>> +#     directly exec's a new version of QEMU on the same host,
>> +#     replacing the original process while retaining its PID, and
>> +#     loads state from the URI.  Guest RAM is preserved in place,
>> +#     albeit with new virtual addresses.
> 
> Do you mean the virtual addresses of guest RAM may differ between old and
> new QEMU process?

The VA at which a guest RAM segment is mapped in the QEMU process
changes.  The end user would not notice or care, so I'll drop that
detail here.
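
The mechanism behind this can be sketched in isolation.  The following is a rough sketch under stated assumptions, not code from the series: the helper names are hypothetical, and syscall() is used so that it also builds against C libraries without a memfd_create() wrapper.  It shows a memfd created close-on-exec, the flag being cleared so the descriptor would survive an exec, and the same memory being mapped again, possibly at a different virtual address.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/memfd.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Create an anonymous memfd of the given size, initially close-on-exec. */
static int ram_memfd_create(const char *name, size_t size)
{
    int fd = (int)syscall(SYS_memfd_create, name, MFD_CLOEXEC);

    if (fd < 0) {
        return -1;
    }
    if (ftruncate(fd, (off_t)size) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Clear FD_CLOEXEC so the descriptor stays open across exec. */
static int fd_clear_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags < 0) {
        return -1;
    }
    return fcntl(fd, F_SETFD, flags & ~FD_CLOEXEC);
}

/*
 * Map the memfd shared.  The kernel chooses the address, so a second
 * mapping (for example in the exec'd process) may land elsewhere while
 * still referring to the same pages.
 */
static void *ram_map(int fd, size_t size)
{
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    return p == MAP_FAILED ? NULL : p;
}
```

In the real flow the descriptor number is what gets recorded for new QEMU, which then calls the equivalent of ram_map() on it; here the second mapping in the same process only stands in for that step.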

>> +#
>> +#     Arguments for the new QEMU process are taken from the
>> +#     @cpr-exec-args parameter.  The first argument should be the
>> +#     path of a new QEMU binary, or a prefix command that exec's the
>> +#     new QEMU binary.
> 
> What's a "prefix command"?  A wrapper script, perhaps?

A prefix command is any command of the form:
   command1 command1-args command2 command2-args
where command1 performs some set up before exec'ing command2.
However, I will drop the word "prefix", it adds no meaning here.
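
To make the command1/command2 form concrete, here is a minimal sketch of such a wrapper.  The function name and the CPR_WRAPPED environment variable are hypothetical and exist only for illustration; the point is that the wrapper finishes with an exec rather than a fork, so the PID and any descriptors without close-on-exec carry through to the final command.

```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * argv[0] is the wrapper itself, argv[1] the command to run (for example
 * the new QEMU binary), and argv[2..] that command's arguments.
 * Returns only if the exec could not be performed.
 */
static int prefix_exec(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s command [args...]\n", argv[0]);
        return 2;
    }

    /* Hypothetical setup step; a real wrapper might adjust limits or env. */
    if (setenv("CPR_WRAPPED", "1", 1) < 0) {
        perror("setenv");
        return 1;
    }

    /* Replace this process image; the PID and inherited fds are kept. */
    execvp(argv[1], &argv[1]);
    perror("execvp");
    return 127;
}
```

A real wrapper's main() would simply return prefix_exec(argc, argv).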

>> +#
>> +#     Because old QEMU terminates when new QEMU starts, one cannot
>> +#     stream data between the two, so the URI must be a type, such as
>> +#     a file, that reads all data before old QEMU exits.
> 
> What happens when you specify a URI that doesn't?

Old QEMU will quietly block indefinitely writing to the URI.
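
As an illustration of why that happens, assume the URI is backed by a pipe-like channel whose reader has not started yet.  The sketch below (hypothetical helper name, not QEMU code) fills a pipe that nobody reads.  It sets O_NONBLOCK only so that the stall is reported as EAGAIN instead of hanging; a blocking writer would stop at the same point and wait.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Returns the number of bytes a pipe accepts before a further write would
 * block because nothing is draining it, or -1 on error.
 */
static long pipe_capacity_without_reader(void)
{
    int fds[2];
    char chunk[4096];
    long total = 0;
    int fl;

    if (pipe(fds) < 0) {
        return -1;
    }

    /* Non-blocking, so the sketch observes the stall instead of hanging. */
    fl = fcntl(fds[1], F_GETFL);
    if (fl < 0 || fcntl(fds[1], F_SETFL, fl | O_NONBLOCK) < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    memset(chunk, 0, sizeof(chunk));
    for (;;) {
        ssize_t n = write(fds[1], chunk, sizeof(chunk));

        if (n < 0) {
            if (errno != EAGAIN && errno != EWOULDBLOCK) {
                total = -1;     /* unexpected error */
            }
            break;              /* buffer is full: a blocking writer stalls here */
        }
        total += n;
    }

    close(fds[0]);
    close(fds[1]);
    return total;
}
```

VM state is normally far larger than such a buffer, which is why the URI has to be a type that absorbs all of the data before old QEMU exits.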

>> +#
>> +#     Memory backend objects must have the share=on attribute, and
>> +#     must be mmap'able in the new QEMU process.  For example,
>> +#     memory-backend-file is acceptable, but memory-backend-ram is
>> +#     not.
>> +#
>> +#     The VM must be started with the '-machine memfd-alloc=on'
> 
> What happens when you don't?

If '-only-migratable-modes cpr-exec' is specified, then QEMU will fail
to start, and print a clear error message.

Otherwise, a blocker is registered and any attempt to cpr-exec will fail
with a clear error message.

- Steve

>> +#     option.  This causes implicit ram blocks -- those not explicitly
>> +#     described by a memory-backend object -- to be allocated by
>> +#     mmap'ing a memfd.  Examples include VGA, ROM, and even guest
>> +#     RAM when it is specified without a memory-backend object.
>> +#
>> +#     (since 9.1)
>>   ##
>>   { 'enum': 'MigMode',
>> -  'data': [ 'normal', 'cpr-reboot' ] }
>> +  'data': [ 'normal', 'cpr-reboot', 'cpr-exec' ] }
>>   
>>   ##
>>   # @ZeroPageDetection:
> 
> [...]
> 



* cpr-exec doc (was Re: [PATCH V1 00/26] Live update: cpr-exec)
  2024-04-29 15:55 [PATCH V1 00/26] Live update: cpr-exec Steve Sistare
                   ` (25 preceding siblings ...)
  2024-04-29 15:55 ` [PATCH V1 26/26] migration: only-migratable-modes Steve Sistare
@ 2024-05-02 16:13 ` Steven Sistare
  2024-05-02 18:15   ` Peter Xu
  2024-05-20 18:30 ` [PATCH V1 00/26] Live update: cpr-exec Steven Sistare
                   ` (2 subsequent siblings)
  29 siblings, 1 reply; 122+ messages in thread
From: Steven Sistare @ 2024-05-02 16:13 UTC (permalink / raw)
  To: qemu-devel
  Cc: Peter Xu, Fabiano Rosas, David Hildenbrand, Igor Mammedov,
	Eduardo Habkost, Marcel Apfelbaum, Philippe Mathieu-Daude,
	Paolo Bonzini, Daniel P. Berrange, Markus Armbruster

On 4/29/2024 11:55 AM, Steve Sistare wrote:
> This patch series adds the live migration cpr-exec mode.

Here is the text I plan to add to docs/devel/migration/CPR.rst.  It is
premature for me to submit this as a patch, because it includes all
the functionality I plan to add in this and future series, but it may
help you while reviewing this series.

- Steve

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

cpr-exec mode
---------------

In this mode, QEMU stops the VM, writes VM state to the migration
URI, and directly exec's a new version of QEMU on the same host,
replacing the original process while retaining its PID.  Guest RAM is
preserved in place, albeit with new virtual addresses.  The user
completes the migration by specifying the ``-incoming`` option, and
by issuing the ``migrate-incoming`` command if necessary; see details
below.

This mode supports vfio devices by preserving device descriptors and
hence kernel state across the exec, even for devices that do not
support live migration, and preserves tap and vhost descriptors.

cpr-exec also preserves descriptors for a subset of chardevs,
including socket, file, parallel, pipe, serial, pty, stdio, and null.
Chardevs that support cpr-exec have the QEMU_CHAR_FEATURE_CPR feature set in
the Chardev object.  The client side of a preserved chardev sees no
loss of connectivity during cpr-exec.  More chardevs could be
preserved with additional development.

All chardevs have a ``reopen-on-cpr`` option which causes the chardev
to be closed and reopened during cpr-exec.  This can be set to allow
cpr-exec when the configuration includes a chardev (such as vc) that
does not have QEMU_CHAR_FEATURE_CPR.

Because the old and new QEMU instances are not active concurrently,
the URI cannot be a type that streams data from one instance to the
other.

Usage
^^^^^

Arguments for the new QEMU process are taken from the
``cpr-exec-args`` parameter.  The first argument should be the
path of a new QEMU binary, or a prefix command that exec's the
new QEMU binary, and the arguments should include the ``-incoming``
option.

Memory backend objects must have the ``share=on`` attribute, and
must be mmap'able in the new QEMU process.  For example,
memory-backend-file is acceptable, but memory-backend-ram is
not.

The VM must be started with the ``-machine memfd-alloc=on``
option.  This causes implicit RAM blocks (those not explicitly
described by a memory-backend object) to be allocated by
mmap'ing a memfd.  Examples include VGA, ROM, and even guest
RAM when it is specified without reference to a
memory-backend object.

Add the ``-only-migratable-modes cpr-exec`` option to guarantee that
the configuration supports cpr-exec.  QEMU will exit at start time
if not.

Outgoing:
   * Set the migration mode parameter to ``cpr-exec``.
   * Set the ``cpr-exec-args`` parameter.
   * Issue the ``migrate`` command.  It is recommended that the URI be
     a ``file`` type, but one can use other types such as ``exec``,
     provided the command captures all the data from the outgoing side,
     and provides all the data to the incoming side.

Incoming:
   * You do not need to explicitly start new QEMU.  It is started as
     a side effect of the migrate command above.
   * If the VM was running when the outgoing ``migrate`` command was
     issued, then QEMU automatically resumes VM execution.

Example 1: incoming URI
^^^^^^^^^^^^^^^^^^^^^^^

In these examples, we simply restart the same version of QEMU, but in
a real scenario one would set a new QEMU binary path in cpr-exec-args.

::

   # qemu-kvm -monitor stdio
   -object memory-backend-file,id=ram0,size=4G,mem-path=/dev/shm/ram0,share=on -m 4G
   -machine memfd-alloc=on
   ...

   QEMU 9.1.50 monitor - type 'help' for more information
   (qemu) info status
   VM status: running
   (qemu) migrate_set_parameter mode cpr-exec
   (qemu) migrate_set_parameter cpr-exec-args qemu-kvm ... -incoming file:vm.state
   (qemu) migrate -d file:vm.state
   (qemu) QEMU 9.1.50 monitor - type 'help' for more information
   (qemu) info status
   VM status: running

Example 2: incoming defer
^^^^^^^^^^^^^^^^^^^^^^^^^
::

   # qemu-kvm -monitor stdio
   -object memory-backend-file,id=ram0,size=4G,mem-path=/dev/shm/ram0,share=on -m 4G
   -machine memfd-alloc=on
   ...

   QEMU 9.1.50 monitor - type 'help' for more information
   (qemu) info status
   VM status: running
   (qemu) migrate_set_parameter mode cpr-exec
   (qemu) migrate_set_parameter cpr-exec-args qemu-kvm ... -incoming defer
   (qemu) migrate -d file:vm.state
   (qemu) QEMU 9.1.50 monitor - type 'help' for more information
   (qemu) info status
   VM status: paused (inmigrate)
   (qemu) migrate_incoming file:vm.state
   (qemu) info status
   VM status: running


Caveats
^^^^^^^

cpr-exec mode may not be used with postcopy, background-snapshot,
or COLO.

cpr-exec mode requires permission to use the exec system call, which
is denied by certain sandbox options, such as spawn.  Use finer-grained
controls to allow exec, e.g.:
``-sandbox on,fork=deny,ns=deny,exec=allow``

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::



* Re: cpr-exec doc (was Re: [PATCH V1 00/26] Live update: cpr-exec)
  2024-05-02 16:13 ` cpr-exec doc (was Re: [PATCH V1 00/26] Live update: cpr-exec) Steven Sistare
@ 2024-05-02 18:15   ` Peter Xu
  0 siblings, 0 replies; 122+ messages in thread
From: Peter Xu @ 2024-05-02 18:15 UTC (permalink / raw)
  To: Steven Sistare
  Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Igor Mammedov,
	Eduardo Habkost, Marcel Apfelbaum, Philippe Mathieu-Daude,
	Paolo Bonzini, Daniel P. Berrange, Markus Armbruster

On Thu, May 02, 2024 at 12:13:17PM -0400, Steven Sistare wrote:
> On 4/29/2024 11:55 AM, Steve Sistare wrote:
> > This patch series adds the live migration cpr-exec mode.
> 
> Here is the text I plan to add to docs/devel/migration/CPR.rst.  It is
> premature for me to submit this as a patch, because it includes all
> the functionality I plan to add in this and future series, but it may
> help you while reviewing this series.

I haven't reached this series at all yet but thanks for sending this,
definitely helpful for reviews.  I almost tried to ask for it. :)

I don't think it's an issue to send doc updates without full
implementations ready.  We can still mark things as TBD even in doc IMHO,
and it may help to provide a better picture of the whole thing if e.g. this
series only implemented part of them, to either reviewers or users (for the
latter, if the partially impl feature can already be consumed).

Thanks,

-- 
Peter Xu




* Re: [PATCH V1 20/26] migration: cpr-exec mode
  2024-05-02 16:00     ` Steven Sistare
@ 2024-05-03  6:26       ` Markus Armbruster
  0 siblings, 0 replies; 122+ messages in thread
From: Markus Armbruster @ 2024-05-03  6:26 UTC (permalink / raw)
  To: Steven Sistare
  Cc: qemu-devel, Peter Xu, Fabiano Rosas, David Hildenbrand,
	Igor Mammedov, Eduardo Habkost, Marcel Apfelbaum,
	Philippe Mathieu-Daude, Paolo Bonzini, Daniel P. Berrange

Steven Sistare <steven.sistare@oracle.com> writes:

> On 5/2/2024 8:23 AM, Markus Armbruster wrote:
>> Steve Sistare <steven.sistare@oracle.com> writes:
>> 
>>> Add the cpr-exec migration mode.  Usage:
>>>    qemu-system-$arch -machine memfd-alloc=on ...
>>>    migrate_set_parameter mode cpr-exec
>>>    migrate_set_parameter cpr-exec-args \
>>>      <arg1> <arg2> ... -incoming <uri>
>>>    migrate -d <uri>
>>>
>>> The migrate command stops the VM, saves state to the URI,
>>> directly exec's a new version of QEMU on the same host,
>>> replacing the original process while retaining its PID, and
>>> loads state from the URI.  Guest RAM is preserved in place,
>>> albeit with new virtual addresses.
>>>
>>> Arguments for the new QEMU process are taken from the
>>> @cpr-exec-args parameter.  The first argument should be the
>>> path of a new QEMU binary, or a prefix command that exec's the
>>> new QEMU binary.
>>>
>>> Because old QEMU terminates when new QEMU starts, one cannot
>>> stream data between the two, so the URI must be a type, such as
>>> a file, that reads all data before old QEMU exits.
>>>
>>> Memory backend objects must have the share=on attribute, and
>>> must be mmap'able in the new QEMU process.  For example,
>>> memory-backend-file is acceptable, but memory-backend-ram is
>>> not.
>>>
>>> The VM must be started with the '-machine memfd-alloc=on'
>>> option.  This causes implicit ram blocks (those not explicitly
>>> described by a memory-backend object) to be allocated by
>>> mmap'ing a memfd.  Examples include VGA, ROM, and even guest
>>> RAM when it is specified without a memory-backend object.
>>>
>>> The implementation saves precreate vmstate at the end of normal
>>> migration in migrate_fd_cleanup, and tells the main loop to call
>>> cpr_exec.  Incoming qemu loads precreate state early, before objects
>>> are created.  The memfds are kept open across exec by clearing the
>>> close-on-exec flag, their values are saved in precreate vmstate,
>>> and they are mmap'd in new qemu.
>>>
>>> Note that the memfd-alloc option is not related to memory-backend-memfd.
>>> Later patches add support for memory-backend-memfd, and for additional
>>> devices, including vfio, chardev, and more.
>>>
>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> 
>> [...]
>> 
>>> diff --git a/qapi/migration.json b/qapi/migration.json
>>> index 49710e7..7c5f45f 100644
>>> --- a/qapi/migration.json
>>> +++ b/qapi/migration.json
>>> @@ -665,9 +665,37 @@
>>>  #     or COLO.
>>>  #
>>>  #     (since 8.2)
>>> +#
>>> +# @cpr-exec: The migrate command stops the VM, saves state to the URI,

What URI?  I know you mean the migration URI, but will readers know?
Elsewhere, we use "migration URI".

Hmm.  That's no good, either: we may not *have* a migration URI since
commit 074dbce5fcce (migration: New migrate and migrate-incoming
argument 'channels') and its fixup commit 57fd4b4e1075 made command
migrate argument @uri optional and mutually exclusive with @channels.

I think we had better use more generic terminology here.  Let's have a look
at migrate's documentation for inspiration:

    ##
    # @migrate:
    #
    # Migrates the current running guest to another Virtual Machine.
    #
    # @uri: the Uniform Resource Identifier of the destination VM
    #
    # @channels: list of migration stream channels with each stream in the
    #     list connected to a destination interface endpoint.
    #
    [...]
    # Notes:
    [...]
    #     4. The uri argument should have the Uniform Resource Identifier
    #        of default destination VM. This connection will be bound to
    #        default network.
    #
    #     5. For now, number of migration streams is restricted to one,
    #        i.e. number of items in 'channels' list is just 1.
    #
    #     6. The 'uri' and 'channels' arguments are mutually exclusive;
    #        exactly one of the two should be present.

Perhaps "saves the state to the migration destination"?

>>> +#     directly exec's a new version of QEMU on the same host,
>>> +#     replacing the original process while retaining its PID, and
>>> +#     loads state from the URI.  Guest RAM is preserved in place,

"loads the state from the migration destination"?

We should also fix up existing uses of "migration URI": @mapped-ram,
@cpr-reboot, @tls-hostname.  Not this series' job.  I'll report it
separately.

>>> +#     albeit with new virtual addresses.
>> 
>> Do you mean the virtual addresses of guest RAM may differ between old and
>> new QEMU process?
>
> The VA at which a guest RAM segment is mapped in the QEMU process
> changes.  The end user would not notice or care, so I'll drop that
> detail here.
>
>>> +#
>>> +#     Arguments for the new QEMU process are taken from the
>>> +#     @cpr-exec-args parameter.  The first argument should be the
>>> +#     path of a new QEMU binary, or a prefix command that exec's the
>>> +#     new QEMU binary.
>> 
>> What's a "prefix command"?  A wrapper script, perhaps?
>
> A prefix command is any command of the form:
>    command1 command1-args command2 command2-args
> where command1 performs some set up before exec'ing command2.
> However, I will drop the word "prefix", it adds no meaning here.

Maybe "the command to start the new QEMU process"?

Hmm.  @cpr-exec-args is documented like this:

    # @cpr-exec-args: Arguments passed to new QEMU for @cpr-exec mode.
    #    See @cpr-exec for details.  (Since 9.1)

Is it a good idea to keep the details with @cpr-exec?  Let me try not
to.  Replace the "Arguments for the new QEMU process..." paragraph by

    #     The new QEMU process is started according to migration parameter
    #     @cpr-exec-args.

Then document cpr-exec-args like

    # @cpr-exec-args: Command to start the new QEMU process for MigMode
    # @cpr-exec.  The first list element is the program's filename, the
    # remainder its arguments.

What do you think?

Naming the thing "-args" feels questionable.  It's program and
arguments.

For what it's worth, QGA command guest-exec has them separate:

    # @path: path or executable name to execute
    #
    # @arg: argument list to pass to executable

The name @path is poorly chosen.

qmp_guest_exec() then prepends @path to @arg to make the argv[] for the
execve() wrapper it uses.

I figure you'd rather not have them separate, to keep migration
parameters simpler.  Name it @cpr-exec-command?
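
To make the "program and arguments" point concrete, here is a minimal
sketch of the prepend step described above: it builds a NULL-terminated
argv[] from a separate path and argument list, in the shape execv()
expects.  The helper name make_argv and its signature are hypothetical;
this is not the QGA or migration code.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper, not QEMU code: build a NULL-terminated argv[]
 * by prepending the program path to its argument list.
 * Returns NULL on allocation failure.  The caller frees only the array;
 * the strings stay owned by the caller.
 */
static char **make_argv(const char *path, const char *const *args,
                        size_t nargs)
{
    char **argv = malloc((nargs + 2) * sizeof(*argv));
    size_t i;

    if (!argv) {
        return NULL;
    }
    argv[0] = (char *)path;             /* program comes first */
    for (i = 0; i < nargs; i++) {
        argv[i + 1] = (char *)args[i];  /* then its arguments, in order */
    }
    argv[nargs + 1] = NULL;             /* terminator required by execv() */
    return argv;
}
```

With a single combined list, as in the parameter discussed here, the
first element already plays the role of argv[0] and no prepend step is
needed, which is the argument for a name like "command" over "args".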

>>> +#
>>> +#     Because old QEMU terminates when new QEMU starts, one cannot
>>> +#     stream data between the two, so the URI must be a type, such as
>>> +#     a file, that reads all data before old QEMU exits.
>> 
>> What happens when you specify a URI that doesn't?
>
> Old QEMU will quietly block indefinitely writing to the URI.

Worth spelling that out in the doc comment?

>>> +#
>>> +#     Memory backend objects must have the share=on attribute, and
>>> +#     must be mmap'able in the new QEMU process.  For example,
>>> +#     memory-backend-file is acceptable, but memory-backend-ram is
>>> +#     not.
>>> +#
>>> +#     The VM must be started with the '-machine memfd-alloc=on'
>> 
>> What happens when you don't?
>
> If '-only-migratable-modes cpr-exec' is specified, then QEMU will fail
> to start, and print a clear error message.
>
> Otherwise, a blocker is registered and any attempt to cpr-exec will fail
> with a clear error message.

With clear errors, no further documentation is needed.  Good :)

> - Steve
>
>>> +#     option.  This causes implicit ram blocks -- those not explicitly
>>> +#     described by a memory-backend object -- to be allocated by
>>> +#     mmap'ing a memfd.  Examples include VGA, ROM, and even guest
>>> +#     RAM when it is specified without a memory-backend object.
>>> +#
>>> +#     (since 9.1)
>>>   ##
>>>   { 'enum': 'MigMode',
>>> -  'data': [ 'normal', 'cpr-reboot' ] }
>>> +  'data': [ 'normal', 'cpr-reboot', 'cpr-exec' ] }
>>>   
>>>   ##
>>>   # @ZeroPageDetection:
>> 
>> [...]
>> 




* Re: [PATCH V1 04/26] migration: delete unused parameter mis
  2024-04-29 15:55 ` [PATCH V1 04/26] migration: delete unused parameter mis Steve Sistare
@ 2024-05-06 21:50   ` Fabiano Rosas
  2024-05-27 18:02   ` Peter Xu
  1 sibling, 0 replies; 122+ messages in thread
From: Fabiano Rosas @ 2024-05-06 21:50 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel
  Cc: Peter Xu, David Hildenbrand, Igor Mammedov, Eduardo Habkost,
	Marcel Apfelbaum, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster, Steve Sistare

Steve Sistare <steven.sistare@oracle.com> writes:

> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>

Reviewed-by: Fabiano Rosas <farosas@suse.de>



* Re: [PATCH V1 03/26] migration: SAVEVM_FOREACH
  2024-04-29 15:55 ` [PATCH V1 03/26] migration: SAVEVM_FOREACH Steve Sistare
@ 2024-05-06 23:17   ` Fabiano Rosas
  2024-05-13 19:27     ` Steven Sistare
  0 siblings, 1 reply; 122+ messages in thread
From: Fabiano Rosas @ 2024-05-06 23:17 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel
  Cc: Peter Xu, David Hildenbrand, Igor Mammedov, Eduardo Habkost,
	Marcel Apfelbaum, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster, Steve Sistare

Steve Sistare <steven.sistare@oracle.com> writes:

> Define an abstraction SAVEVM_FOREACH to loop over all savevm state
> handlers, and replace QTAILQ_FOREACH.  Define variants for ALL so
> we can loop over all handlers vs a subset of handlers in a subsequent
> patch, but at this time there is no distinction between the two.
> No functional change.
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  migration/savevm.c | 55 +++++++++++++++++++++++++++++++-----------------------
>  1 file changed, 32 insertions(+), 23 deletions(-)
>
> diff --git a/migration/savevm.c b/migration/savevm.c
> index 4509482..6829ba3 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -237,6 +237,15 @@ static SaveState savevm_state = {
>      .global_section_id = 0,
>  };
>  
> +#define SAVEVM_FOREACH(se, entry)                                    \
> +    QTAILQ_FOREACH(se, &savevm_state.handlers, entry)                \
> +
> +#define SAVEVM_FOREACH_ALL(se, entry)                                \
> +    QTAILQ_FOREACH(se, &savevm_state.handlers, entry)

This feels worse than SAVEVM_FOREACH_NOT_PRECREATED. We'll have to keep
coming back to the definition to figure out which FOREACH is the real
deal.
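
To illustrate the naming concern, a minimal sketch of a pair of iterators
where the filtered one says what it filters.  Everything here is an
assumption for illustration: the Handler struct, its precreate flag, and
the macro bodies are hypothetical, and a plain singly linked list stands
in for QTAILQ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for SaveStateEntry; only the fields needed here. */
typedef struct Handler {
    const char *idstr;
    bool precreate;          /* assumed flag, not taken from the series */
    struct Handler *next;
} Handler;

/* Visit every handler on the list. */
#define HANDLER_FOREACH_ALL(se, head) \
    for ((se) = (head); (se); (se) = (se)->next)

/* Visit only handlers whose precreate flag is clear. */
#define HANDLER_FOREACH_NOT_PRECREATED(se, head) \
    HANDLER_FOREACH_ALL(se, head)                \
        if ((se)->precreate) { } else

/* Count all handlers. */
static int count_all(Handler *head)
{
    Handler *se;
    int n = 0;

    HANDLER_FOREACH_ALL(se, head) {
        n++;
    }
    return n;
}

/* Count handlers that the filtered iterator visits. */
static int count_not_precreated(Handler *head)
{
    Handler *se;
    int n = 0;

    HANDLER_FOREACH_NOT_PRECREATED(se, head) {
        n++;
    }
    return n;
}
```

With names like these, a reader at the call site can tell which loop is
filtered without going back to the macro definitions.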

> +
> +#define SAVEVM_FOREACH_SAFE_ALL(se, entry, new_se)                   \
> +    QTAILQ_FOREACH_SAFE(se, &savevm_state.handlers, entry, new_se)
> +
>  static SaveStateEntry *find_se(const char *idstr, uint32_t instance_id);
>  
>  static bool should_validate_capability(int capability)
> @@ -674,7 +683,7 @@ static uint32_t calculate_new_instance_id(const char *idstr)
>      SaveStateEntry *se;
>      uint32_t instance_id = 0;
>  
> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> +    SAVEVM_FOREACH_ALL(se, entry) {

In this patch we can't have both instances...

>          if (strcmp(idstr, se->idstr) == 0
>              && instance_id <= se->instance_id) {
>              instance_id = se->instance_id + 1;
> @@ -690,7 +699,7 @@ static int calculate_compat_instance_id(const char *idstr)
>      SaveStateEntry *se;
>      int instance_id = 0;
>  
> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> +    SAVEVM_FOREACH(se, entry) {

...otherwise one of the two changes will go undocumented because the
actual reason for it will only be described in the next patch.

>          if (!se->compat) {
>              continue;
>          }
> @@ -816,7 +825,7 @@ void unregister_savevm(VMStateIf *obj, const char *idstr, void *opaque)
>      }
>      pstrcat(id, sizeof(id), idstr);
>  
> -    QTAILQ_FOREACH_SAFE(se, &savevm_state.handlers, entry, new_se) {
> +    SAVEVM_FOREACH_SAFE_ALL(se, entry, new_se) {
>          if (strcmp(se->idstr, id) == 0 && se->opaque == opaque) {
>              savevm_state_handler_remove(se);
>              g_free(se->compat);
> @@ -939,7 +948,7 @@ void vmstate_unregister(VMStateIf *obj, const VMStateDescription *vmsd,
>  {
>      SaveStateEntry *se, *new_se;
>  
> -    QTAILQ_FOREACH_SAFE(se, &savevm_state.handlers, entry, new_se) {
> +    SAVEVM_FOREACH_SAFE_ALL(se, entry, new_se) {
>          if (se->vmsd == vmsd && se->opaque == opaque) {
>              savevm_state_handler_remove(se);
>              g_free(se->compat);
> @@ -1223,7 +1232,7 @@ bool qemu_savevm_state_blocked(Error **errp)
>  {
>      SaveStateEntry *se;
>  
> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> +    SAVEVM_FOREACH(se, entry) {
>          if (se->vmsd && se->vmsd->unmigratable) {
>              error_setg(errp, "State blocked by non-migratable device '%s'",
>                         se->idstr);
> @@ -1237,7 +1246,7 @@ void qemu_savevm_non_migratable_list(strList **reasons)
>  {
>      SaveStateEntry *se;
>  
> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> +    SAVEVM_FOREACH(se, entry) {
>          if (se->vmsd && se->vmsd->unmigratable) {
>              QAPI_LIST_PREPEND(*reasons,
>                                g_strdup_printf("non-migratable device: %s",
> @@ -1276,7 +1285,7 @@ bool qemu_savevm_state_guest_unplug_pending(void)
>  {
>      SaveStateEntry *se;
>  
> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> +    SAVEVM_FOREACH(se, entry) {
>          if (se->vmsd && se->vmsd->dev_unplug_pending &&
>              se->vmsd->dev_unplug_pending(se->opaque)) {
>              return true;
> @@ -1291,7 +1300,7 @@ int qemu_savevm_state_prepare(Error **errp)
>      SaveStateEntry *se;
>      int ret;
>  
> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> +    SAVEVM_FOREACH(se, entry) {
>          if (!se->ops || !se->ops->save_prepare) {
>              continue;
>          }
> @@ -1321,7 +1330,7 @@ int qemu_savevm_state_setup(QEMUFile *f, Error **errp)
>      json_writer_start_array(ms->vmdesc, "devices");
>  
>      trace_savevm_state_setup();
> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> +    SAVEVM_FOREACH(se, entry) {
>          if (se->vmsd && se->vmsd->early_setup) {
>              ret = vmstate_save(f, se, ms->vmdesc, errp);
>              if (ret) {
> @@ -1365,7 +1374,7 @@ int qemu_savevm_state_resume_prepare(MigrationState *s)
>  
>      trace_savevm_state_resume_prepare();
>  
> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> +    SAVEVM_FOREACH(se, entry) {
>          if (!se->ops || !se->ops->resume_prepare) {
>              continue;
>          }
> @@ -1396,7 +1405,7 @@ int qemu_savevm_state_iterate(QEMUFile *f, bool postcopy)
>      int ret;
>  
>      trace_savevm_state_iterate();
> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> +    SAVEVM_FOREACH(se, entry) {
>          if (!se->ops || !se->ops->save_live_iterate) {
>              continue;
>          }
> @@ -1461,7 +1470,7 @@ void qemu_savevm_state_complete_postcopy(QEMUFile *f)
>      SaveStateEntry *se;
>      int ret;
>  
> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> +    SAVEVM_FOREACH(se, entry) {
>          if (!se->ops || !se->ops->save_live_complete_postcopy) {
>              continue;
>          }
> @@ -1495,7 +1504,7 @@ int qemu_savevm_state_complete_precopy_iterable(QEMUFile *f, bool in_postcopy)
>      SaveStateEntry *se;
>      int ret;
>  
> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> +    SAVEVM_FOREACH(se, entry) {
>          if (!se->ops ||
>              (in_postcopy && se->ops->has_postcopy &&
>               se->ops->has_postcopy(se->opaque)) ||
> @@ -1543,7 +1552,7 @@ int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
>      Error *local_err = NULL;
>      int ret;
>  
> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> +    SAVEVM_FOREACH(se, entry) {
>          if (se->vmsd && se->vmsd->early_setup) {
>              /* Already saved during qemu_savevm_state_setup(). */
>              continue;
> @@ -1649,7 +1658,7 @@ void qemu_savevm_state_pending_estimate(uint64_t *must_precopy,
>      *must_precopy = 0;
>      *can_postcopy = 0;
>  
> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> +    SAVEVM_FOREACH(se, entry) {
>          if (!se->ops || !se->ops->state_pending_estimate) {
>              continue;
>          }
> @@ -1670,7 +1679,7 @@ void qemu_savevm_state_pending_exact(uint64_t *must_precopy,
>      *must_precopy = 0;
>      *can_postcopy = 0;
>  
> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> +    SAVEVM_FOREACH(se, entry) {
>          if (!se->ops || !se->ops->state_pending_exact) {
>              continue;
>          }
> @@ -1693,7 +1702,7 @@ void qemu_savevm_state_cleanup(void)
>      }
>  
>      trace_savevm_state_cleanup();
> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> +    SAVEVM_FOREACH(se, entry) {
>          if (se->ops && se->ops->save_cleanup) {
>              se->ops->save_cleanup(se->opaque);
>          }
> @@ -1778,7 +1787,7 @@ int qemu_save_device_state(QEMUFile *f)
>      }
>      cpu_synchronize_all_states();
>  
> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> +    SAVEVM_FOREACH(se, entry) {
>          int ret;
>  
>          if (se->is_ram) {
> @@ -1801,7 +1810,7 @@ static SaveStateEntry *find_se(const char *idstr, uint32_t instance_id)
>  {
>      SaveStateEntry *se;
>  
> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> +    SAVEVM_FOREACH_ALL(se, entry) {
>          if (!strcmp(se->idstr, idstr) &&
>              (instance_id == se->instance_id ||
>               instance_id == se->alias_id))
> @@ -2680,7 +2689,7 @@ qemu_loadvm_section_part_end(QEMUFile *f, MigrationIncomingState *mis,
>      }
>  
>      trace_qemu_loadvm_state_section_partend(section_id);
> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> +    SAVEVM_FOREACH(se, entry) {
>          if (se->load_section_id == section_id) {
>              break;
>          }
> @@ -2755,7 +2764,7 @@ static void qemu_loadvm_state_switchover_ack_needed(MigrationIncomingState *mis)
>  {
>      SaveStateEntry *se;
>  
> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> +    SAVEVM_FOREACH(se, entry) {
>          if (!se->ops || !se->ops->switchover_ack_needed) {
>              continue;
>          }
> @@ -2775,7 +2784,7 @@ static int qemu_loadvm_state_setup(QEMUFile *f, Error **errp)
>      int ret;
>  
>      trace_loadvm_state_setup();
> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> +    SAVEVM_FOREACH(se, entry) {
>          if (!se->ops || !se->ops->load_setup) {
>              continue;
>          }
> @@ -2801,7 +2810,7 @@ void qemu_loadvm_state_cleanup(void)
>      SaveStateEntry *se;
>  
>      trace_loadvm_state_cleanup();
> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> +    SAVEVM_FOREACH(se, entry) {
>          if (se->ops && se->ops->load_cleanup) {
>              se->ops->load_cleanup(se->opaque);
>          }



* Re: [PATCH V1 01/26] oslib: qemu_clear_cloexec
  2024-04-29 15:55 ` [PATCH V1 01/26] oslib: qemu_clear_cloexec Steve Sistare
@ 2024-05-06 23:27   ` Fabiano Rosas
  2024-05-07  8:56     ` Daniel P. Berrangé
  0 siblings, 1 reply; 122+ messages in thread
From: Fabiano Rosas @ 2024-05-06 23:27 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel
  Cc: Peter Xu, David Hildenbrand, Igor Mammedov, Eduardo Habkost,
	Marcel Apfelbaum, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster, Steve Sistare,
	Dr. David Alan Gilbert, Marc-André Lureau

Steve Sistare <steven.sistare@oracle.com> writes:

+cc dgilbert, marcandre

> Define qemu_clear_cloexec, analogous to qemu_set_cloexec.
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>

A v1 patch with two reviews already, from people from another company
and they're not in CC. Looks suspicious. =)

Here's a fresh one, hopefully it won't spend another 4 years in the
drawer:

Reviewed-by: Fabiano Rosas <farosas@suse.de>

> ---
>  include/qemu/osdep.h | 9 +++++++++
>  util/oslib-posix.c   | 9 +++++++++
>  util/oslib-win32.c   | 4 ++++
>  3 files changed, 22 insertions(+)
>
> diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
> index c7053cd..b58f312 100644
> --- a/include/qemu/osdep.h
> +++ b/include/qemu/osdep.h
> @@ -660,6 +660,15 @@ ssize_t qemu_write_full(int fd, const void *buf, size_t count)
>  
>  void qemu_set_cloexec(int fd);
>  
> +/*
> + * Clear FD_CLOEXEC for a descriptor.
> + *
> + * The caller must guarantee that no other fork+exec's occur before the
> + * exec that is intended to inherit this descriptor, eg by suspending CPUs
> + * and blocking monitor commands.
> + */
> +void qemu_clear_cloexec(int fd);
> +
>  /* Return a dynamically allocated directory path that is appropriate for storing
>   * local state.
>   *
> diff --git a/util/oslib-posix.c b/util/oslib-posix.c
> index e764416..614c3e5 100644
> --- a/util/oslib-posix.c
> +++ b/util/oslib-posix.c
> @@ -272,6 +272,15 @@ int qemu_socketpair(int domain, int type, int protocol, int sv[2])
>      return ret;
>  }
>  
> +void qemu_clear_cloexec(int fd)
> +{
> +    int f;
> +    f = fcntl(fd, F_GETFD);
> +    assert(f != -1);
> +    f = fcntl(fd, F_SETFD, f & ~FD_CLOEXEC);
> +    assert(f != -1);
> +}
> +
>  char *
>  qemu_get_local_state_dir(void)
>  {
> diff --git a/util/oslib-win32.c b/util/oslib-win32.c
> index b623830..c3e969a 100644
> --- a/util/oslib-win32.c
> +++ b/util/oslib-win32.c
> @@ -222,6 +222,10 @@ void qemu_set_cloexec(int fd)
>  {
>  }
>  
> +void qemu_clear_cloexec(int fd)
> +{
> +}
> +
>  int qemu_get_thread_id(void)
>  {
>      return GetCurrentThreadId();
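
For readers who want to try the fcntl() sequence from the POSIX hunk
outside of QEMU, here is a minimal standalone sketch.  It is a variant for
illustration, not a proposed change: clear_cloexec and has_cloexec are
hypothetical names, and the sketch returns an error where the patch
asserts.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Illustrative variant of the patch's helper: clear FD_CLOEXEC on fd.
 * Returns 0 on success, -1 on error with errno set by fcntl().
 */
static int clear_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags == -1) {
        return -1;
    }
    return fcntl(fd, F_SETFD, flags & ~FD_CLOEXEC) == -1 ? -1 : 0;
}

/* Returns 1 if FD_CLOEXEC is set on fd, 0 if it is clear, -1 on error. */
static int has_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    return flags == -1 ? -1 : !!(flags & FD_CLOEXEC);
}
```

A descriptor with the flag cleared stays open across exec, which is why
the header comment asks the caller to rule out unrelated fork+exec calls
in the meantime.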



* Re: [PATCH V1 06/26] migration: precreate vmstate for exec
  2024-04-29 15:55 ` [PATCH V1 06/26] migration: precreate vmstate for exec Steve Sistare
@ 2024-05-06 23:34   ` Fabiano Rosas
  2024-05-13 19:28     ` Steven Sistare
  0 siblings, 1 reply; 122+ messages in thread
From: Fabiano Rosas @ 2024-05-06 23:34 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel
  Cc: Peter Xu, David Hildenbrand, Igor Mammedov, Eduardo Habkost,
	Marcel Apfelbaum, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster, Steve Sistare

Steve Sistare <steven.sistare@oracle.com> writes:

> Provide migration_precreate_save for saving precreate vmstate across exec.
> Create a memfd, save its value in the environment, and serialize state
> to it.  Reverse the process in migration_precreate_load.
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  include/migration/misc.h |   5 ++
>  migration/meson.build    |   1 +
>  migration/precreate.c    | 139 +++++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 145 insertions(+)
>  create mode 100644 migration/precreate.c
>
> diff --git a/include/migration/misc.h b/include/migration/misc.h
> index c9e200f..cf30351 100644
> --- a/include/migration/misc.h
> +++ b/include/migration/misc.h
> @@ -56,6 +56,11 @@ AnnounceParameters *migrate_announce_params(void);
>  
>  void dump_vmstate_json_to_file(FILE *out_fp);
>  
> +/* migration/precreate.c */
> +int migration_precreate_save(Error **errp);
> +void migration_precreate_unsave(void);
> +int migration_precreate_load(Error **errp);
> +
>  /* migration/migration.c */
>  void migration_object_init(void);
>  void migration_shutdown(void);
> diff --git a/migration/meson.build b/migration/meson.build
> index f76b1ba..50e7cb2 100644
> --- a/migration/meson.build
> +++ b/migration/meson.build
> @@ -26,6 +26,7 @@ system_ss.add(files(
>    'ram-compress.c',
>    'options.c',
>    'postcopy-ram.c',
> +  'precreate.c',
>    'savevm.c',
>    'socket.c',
>    'tls.c',
> diff --git a/migration/precreate.c b/migration/precreate.c
> new file mode 100644
> index 0000000..0bf5e1f
> --- /dev/null
> +++ b/migration/precreate.c
> @@ -0,0 +1,139 @@
> +/*
> + * Copyright (c) 2022, 2024 Oracle and/or its affiliates.
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + */
> +
> +#include "qemu/osdep.h"
> +#include "qemu/cutils.h"
> +#include "qemu/memfd.h"
> +#include "qapi/error.h"
> +#include "io/channel-file.h"
> +#include "migration/misc.h"
> +#include "migration/qemu-file.h"
> +#include "migration/savevm.h"
> +
> +#define PRECREATE_STATE_NAME "QEMU_PRECREATE_STATE"
> +
> +static QEMUFile *qemu_file_new_fd_input(int fd, const char *name)
> +{
> +    g_autoptr(QIOChannelFile) fioc = qio_channel_file_new_fd(fd);
> +    QIOChannel *ioc = QIO_CHANNEL(fioc);
> +    qio_channel_set_name(ioc, name);
> +    return qemu_file_new_input(ioc);
> +}
> +
> +static QEMUFile *qemu_file_new_fd_output(int fd, const char *name)
> +{
> +    g_autoptr(QIOChannelFile) fioc = qio_channel_file_new_fd(fd);
> +    QIOChannel *ioc = QIO_CHANNEL(fioc);
> +    qio_channel_set_name(ioc, name);
> +    return qemu_file_new_output(ioc);
> +}
> +
> +static int memfd_create_named(const char *name, Error **errp)
> +{
> +    int mfd;
> +    char val[16];
> +
> +    mfd = memfd_create(name, 0);
> +    if (mfd < 0) {
> +        error_setg_errno(errp, errno, "memfd_create failed");
> +        return -1;
> +    }
> +
> +    /* Remember mfd in environment for post-exec load */
> +    qemu_clear_cloexec(mfd);
> +    snprintf(val, sizeof(val), "%d", mfd);
> +    g_setenv(name, val, 1);
> +
> +    return mfd;
> +}
> +
> +static int memfd_find_named(const char *name, int *mfd_p, Error **errp)
> +{
> +    const char *val = g_getenv(name);
> +
> +    if (!val) {
> +        *mfd_p = -1;
> +        return 0;       /* No memfd was created, not an error */
> +    }
> +    g_unsetenv(name);
> +    if (qemu_strtoi(val, NULL, 10, mfd_p)) {
> +        error_setg(errp, "Bad %s env value %s", PRECREATE_STATE_NAME, val);
> +        return -1;
> +    }
> +    lseek(*mfd_p, 0, SEEK_SET);
> +    return 0;
> +}
> +
> +static void memfd_delete_named(const char *name)
> +{
> +    int mfd;
> +    const char *val = g_getenv(name);
> +
> +    if (val) {
> +        g_unsetenv(name);
> +        if (!qemu_strtoi(val, NULL, 10, &mfd)) {
> +            close(mfd);
> +        }
> +    }
> +}
> +
> +static QEMUFile *qemu_file_new_memfd_output(const char *name, Error **errp)
> +{
> +    int mfd = memfd_create_named(name, errp);
> +
> +    if (mfd < 0) {
> +        return NULL;
> +    }
> +
> +    return qemu_file_new_fd_output(mfd, name);
> +}
> +
> +static QEMUFile *qemu_file_new_memfd_input(const char *name, Error **errp)
> +{
> +    int ret, mfd;
> +
> +    ret = memfd_find_named(name, &mfd, errp);
> +    if (ret || mfd < 0) {
> +        return NULL;
> +    }
> +
> +    return qemu_file_new_fd_input(mfd, name);
> +}
> +
> +int migration_precreate_save(Error **errp)
> +{
> +    QEMUFile *f = qemu_file_new_memfd_output(PRECREATE_STATE_NAME, errp);
> +
> +    if (!f) {
> +        return -1;
> +    } else if (qemu_savevm_precreate_save(f, errp)) {
> +        memfd_delete_named(PRECREATE_STATE_NAME);
> +        return -1;
> +    } else {
> +        /* Do not close f, as mfd must remain open. */
> +        return 0;
> +    }
> +}
> +
> +void migration_precreate_unsave(void)
> +{
> +    memfd_delete_named(PRECREATE_STATE_NAME);
> +}
> +
> +int migration_precreate_load(Error **errp)
> +{
> +    int ret;
> +    QEMUFile *f = qemu_file_new_memfd_input(PRECREATE_STATE_NAME, errp);

Can we avoid the QEMUFile? I don't see it being exported from this file.

> +
> +    if (!f) {
> +        return -1;
> +    }
> +    ret = qemu_savevm_precreate_load(f, errp);
> +    qemu_fclose(f);
> +    g_unsetenv(PRECREATE_STATE_NAME);
> +    return ret;
> +}
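
As a reading aid for the save/load pairing in the commit message, here is
a minimal standalone sketch of handing a descriptor number across exec
through the environment.  The function names and the variable name used
with them are hypothetical, plain libc calls stand in for the glib and
QEMU helpers, and any open descriptor stands in for the memfd.  One
deliberate difference from the patch: the sketch parses the value before
unsetting the variable, so it does not depend on the string staying valid
after unsetenv().

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

/* Record fd under the environment variable name.  Returns 0 or -1. */
static int fd_save_in_env(const char *name, int fd)
{
    char val[16];

    snprintf(val, sizeof(val), "%d", fd);
    return setenv(name, val, 1);
}

/*
 * Look up the descriptor recorded under name and remove the variable.
 * Returns 0 and sets *fd_p (to -1 if the variable is absent), or returns
 * -1 if the value is not a valid non-negative integer.
 */
static int fd_find_in_env(const char *name, int *fd_p)
{
    const char *val = getenv(name);
    char *end;
    long n;
    int bad;

    *fd_p = -1;
    if (!val) {
        return 0;       /* nothing was saved, not an error */
    }
    errno = 0;
    n = strtol(val, &end, 10);
    bad = errno || end == val || *end != '\0' || n < 0 || n > INT_MAX;
    unsetenv(name);     /* val and end must not be used after this point */
    if (bad) {
        return -1;
    }
    *fd_p = (int)n;
    return 0;
}
```

In the patch's flow the descriptor also has to survive exec, so the save
side clears FD_CLOEXEC and the load side rewinds the file offset before
reading; both steps are left out of this sketch.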



* Re: [PATCH V1 01/26] oslib: qemu_clear_cloexec
  2024-05-06 23:27   ` Fabiano Rosas
@ 2024-05-07  8:56     ` Daniel P. Berrangé
  2024-05-07 13:54       ` Fabiano Rosas
  0 siblings, 1 reply; 122+ messages in thread
From: Daniel P. Berrangé @ 2024-05-07  8:56 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: Steve Sistare, qemu-devel, Peter Xu, David Hildenbrand,
	Igor Mammedov, Eduardo Habkost, Marcel Apfelbaum,
	Philippe Mathieu-Daude, Paolo Bonzini, Markus Armbruster,
	Dr. David Alan Gilbert, Marc-André Lureau

On Mon, May 06, 2024 at 08:27:15PM -0300, Fabiano Rosas wrote:
> Steve Sistare <steven.sistare@oracle.com> writes:
> 
> +cc dgilbert, marcandre
> 
> > Define qemu_clear_cloexec, analogous to qemu_set_cloexec.
> >
> > Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> > Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
> 
> A v1 patch with two reviews already, from people from another company
> and they're not in CC. Looks suspicious. =)

It is ok in this case - the cpr work has been going on for a long
time, and the original series that got partial reviews has been
split up somewhat. So it's "v1" of this series of patches, but
not "v1" of what we've seen posted on qemu-devel in the past.

> 
> Here's a fresh one, hopefully it won't spend another 4 years in the
> drawer:
> 
> Reviewed-by: Fabiano Rosas <farosas@suse.de>
> 
> > ---
> >  include/qemu/osdep.h | 9 +++++++++
> >  util/oslib-posix.c   | 9 +++++++++
> >  util/oslib-win32.c   | 4 ++++
> >  3 files changed, 22 insertions(+)
> >
> > diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
> > index c7053cd..b58f312 100644
> > --- a/include/qemu/osdep.h
> > +++ b/include/qemu/osdep.h
> > @@ -660,6 +660,15 @@ ssize_t qemu_write_full(int fd, const void *buf, size_t count)
> >  
> >  void qemu_set_cloexec(int fd);
> >  
> > +/*
> > + * Clear FD_CLOEXEC for a descriptor.
> > + *
> > + * The caller must guarantee that no other fork+exec's occur before the
> > + * exec that is intended to inherit this descriptor, eg by suspending CPUs
> > + * and blocking monitor commands.
> > + */
> > +void qemu_clear_cloexec(int fd);
> > +
> >  /* Return a dynamically allocated directory path that is appropriate for storing
> >   * local state.
> >   *
> > diff --git a/util/oslib-posix.c b/util/oslib-posix.c
> > index e764416..614c3e5 100644
> > --- a/util/oslib-posix.c
> > +++ b/util/oslib-posix.c
> > @@ -272,6 +272,15 @@ int qemu_socketpair(int domain, int type, int protocol, int sv[2])
> >      return ret;
> >  }
> >  
> > +void qemu_clear_cloexec(int fd)
> > +{
> > +    int f;
> > +    f = fcntl(fd, F_GETFD);
> > +    assert(f != -1);
> > +    f = fcntl(fd, F_SETFD, f & ~FD_CLOEXEC);
> > +    assert(f != -1);
> > +}
> > +
> >  char *
> >  qemu_get_local_state_dir(void)
> >  {
> > diff --git a/util/oslib-win32.c b/util/oslib-win32.c
> > index b623830..c3e969a 100644
> > --- a/util/oslib-win32.c
> > +++ b/util/oslib-win32.c
> > @@ -222,6 +222,10 @@ void qemu_set_cloexec(int fd)
> >  {
> >  }
> >  
> > +void qemu_clear_cloexec(int fd)
> > +{
> > +}
> > +
> >  int qemu_get_thread_id(void)
> >  {
> >      return GetCurrentThreadId();
> 

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|




* Re: [PATCH V1 01/26] oslib: qemu_clear_cloexec
  2024-05-07  8:56     ` Daniel P. Berrangé
@ 2024-05-07 13:54       ` Fabiano Rosas
  0 siblings, 0 replies; 122+ messages in thread
From: Fabiano Rosas @ 2024-05-07 13:54 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Steve Sistare, qemu-devel, Peter Xu, David Hildenbrand,
	Igor Mammedov, Eduardo Habkost, Marcel Apfelbaum,
	Philippe Mathieu-Daude, Paolo Bonzini, Markus Armbruster,
	Dr. David Alan Gilbert, Marc-André Lureau

Daniel P. Berrangé <berrange@redhat.com> writes:

> On Mon, May 06, 2024 at 08:27:15PM -0300, Fabiano Rosas wrote:
>> Steve Sistare <steven.sistare@oracle.com> writes:
>> 
>> +cc dgilbert, marcandre
>> 
>> > Define qemu_clear_cloexec, analogous to qemu_set_cloexec.
>> >
>> > Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> > Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
>> > Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
>> 
>> A v1 patch with two reviews already, from people from another company
>> and they're not in CC. Looks suspicious. =)
>
> It is ok in this case - the cpr work has been going on a long
> time and the original series that got partial reviews has been
> split up somewhat. So it's "v1" of this series of patches, but
> not "v1" of what we've seen posted on qemu-devel in the past

I know =) I searched the archives to make sure those r-bs were actually
provided by them and I also remember the series from back then.

On that topic, but not related to this patch at all, I would prefer if
we had a no-preexisting r-bs rule. I don't see any value in an r-b that
already comes present in v1 and has not been provided through the
list. There's the obvious concern about bad faith, but also that we
might lose track of a series and maintainers/tools might take those r-bs
as a sign of the code actually being reviewed properly (which may or may
not be true).

In the case where people develop a series inside the company and then post
it to the list, an s-o-b seems adequate.

>
>> 
>> Here's a fresh one, hopefully it won't spend another 4 years in the
>> drawer:
>> 
>> Reviewed-by: Fabiano Rosas <farosas@suse.de>
>> 
>> > ---
>> >  include/qemu/osdep.h | 9 +++++++++
>> >  util/oslib-posix.c   | 9 +++++++++
>> >  util/oslib-win32.c   | 4 ++++
>> >  3 files changed, 22 insertions(+)
>> >
>> > diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
>> > index c7053cd..b58f312 100644
>> > --- a/include/qemu/osdep.h
>> > +++ b/include/qemu/osdep.h
>> > @@ -660,6 +660,15 @@ ssize_t qemu_write_full(int fd, const void *buf, size_t count)
>> >  
>> >  void qemu_set_cloexec(int fd);
>> >  
>> > +/*
>> > + * Clear FD_CLOEXEC for a descriptor.
>> > + *
>> > + * The caller must guarantee that no other fork+exec's occur before the
>> > + * exec that is intended to inherit this descriptor, eg by suspending CPUs
>> > + * and blocking monitor commands.
>> > + */
>> > +void qemu_clear_cloexec(int fd);
>> > +
>> >  /* Return a dynamically allocated directory path that is appropriate for storing
>> >   * local state.
>> >   *
>> > diff --git a/util/oslib-posix.c b/util/oslib-posix.c
>> > index e764416..614c3e5 100644
>> > --- a/util/oslib-posix.c
>> > +++ b/util/oslib-posix.c
>> > @@ -272,6 +272,15 @@ int qemu_socketpair(int domain, int type, int protocol, int sv[2])
>> >      return ret;
>> >  }
>> >  
>> > +void qemu_clear_cloexec(int fd)
>> > +{
>> > +    int f;
>> > +    f = fcntl(fd, F_GETFD);
>> > +    assert(f != -1);
>> > +    f = fcntl(fd, F_SETFD, f & ~FD_CLOEXEC);
>> > +    assert(f != -1);
>> > +}
>> > +
>> >  char *
>> >  qemu_get_local_state_dir(void)
>> >  {
>> > diff --git a/util/oslib-win32.c b/util/oslib-win32.c
>> > index b623830..c3e969a 100644
>> > --- a/util/oslib-win32.c
>> > +++ b/util/oslib-win32.c
>> > @@ -222,6 +222,10 @@ void qemu_set_cloexec(int fd)
>> >  {
>> >  }
>> >  
>> > +void qemu_clear_cloexec(int fd)
>> > +{
>> > +}
>> > +
>> >  int qemu_get_thread_id(void)
>> >  {
>> >      return GetCurrentThreadId();
>> 
>
> With regards,
> Daniel
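
For readers who want to try the fcntl pattern from the quoted patch outside
of QEMU, here is a minimal standalone sketch. It is not the patch's code: the
names clear_cloexec and has_cloexec are made up for illustration, and the
sketch returns an error where the patch asserts.

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper, modelled on the quoted qemu_clear_cloexec:
 * clear FD_CLOEXEC so the descriptor survives a later exec.
 * Returns 0 on success, -1 on failure with errno set by fcntl.
 */
int clear_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags == -1) {
        return -1;
    }
    return fcntl(fd, F_SETFD, flags & ~FD_CLOEXEC);
}

/*
 * Hypothetical helper: 1 if the descriptor would be closed on exec,
 * 0 if it would be inherited, -1 if the flags cannot be read.
 */
int has_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags == -1) {
        return -1;
    }
    return (flags & FD_CLOEXEC) ? 1 : 0;
}
```

As the comment in the patch points out, clearing the flag is only safe if no
unrelated fork+exec can happen before the exec that is meant to inherit the
descriptor.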



* Re: [PATCH V1 05/26] migration: precreate vmstate
  2024-04-29 15:55 ` [PATCH V1 05/26] migration: precreate vmstate Steve Sistare
@ 2024-05-07 21:02   ` Fabiano Rosas
  2024-05-13 19:28     ` Steven Sistare
  2024-05-24 13:56   ` Fabiano Rosas
  2024-05-27 18:16   ` Peter Xu
  2 siblings, 1 reply; 122+ messages in thread
From: Fabiano Rosas @ 2024-05-07 21:02 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel
  Cc: Peter Xu, David Hildenbrand, Igor Mammedov, Eduardo Habkost,
	Marcel Apfelbaum, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster, Steve Sistare

Steve Sistare <steven.sistare@oracle.com> writes:

> Provide the VMStateDescription precreate field to mark objects that must
> be loaded on the incoming side before devices have been created, because
> they provide properties that will be needed at creation time.  They will
> be saved to and loaded from their own QEMUFile, via

It's not obvious to me what the reason is to have a separate
QEMUFile. Could you expand on this?

> qemu_savevm_precreate_save and qemu_savevm_precreate_load, but these
> functions are not yet called in this patch.  Allow them to be called
> before or after normal migration is active, when current_migration and
> current_incoming are not valid.
>



* Re: [PATCH V1 07/26] migration: VMStateId
  2024-04-29 15:55 ` [PATCH V1 07/26] migration: VMStateId Steve Sistare
@ 2024-05-07 21:03   ` Fabiano Rosas
  2024-05-27 18:20   ` Peter Xu
  1 sibling, 0 replies; 122+ messages in thread
From: Fabiano Rosas @ 2024-05-07 21:03 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel
  Cc: Peter Xu, David Hildenbrand, Igor Mammedov, Eduardo Habkost,
	Marcel Apfelbaum, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster, Steve Sistare

Steve Sistare <steven.sistare@oracle.com> writes:

> Define a type for the 256 byte id string to guarantee the same length is
> used and enforced everywhere.
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>

Reviewed-by: Fabiano Rosas <farosas@suse.de>



* Re: [PATCH V1 08/26] migration: vmstate_info_void_ptr
  2024-04-29 15:55 ` [PATCH V1 08/26] migration: vmstate_info_void_ptr Steve Sistare
@ 2024-05-07 21:33   ` Fabiano Rosas
  2024-05-27 18:31   ` Peter Xu
  1 sibling, 0 replies; 122+ messages in thread
From: Fabiano Rosas @ 2024-05-07 21:33 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel
  Cc: Peter Xu, David Hildenbrand, Igor Mammedov, Eduardo Habkost,
	Marcel Apfelbaum, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster, Steve Sistare

Steve Sistare <steven.sistare@oracle.com> writes:

> Define VMSTATE_VOID_PTR so the value of a pointer (but not its target)
> can be saved in the migration stream.  This will be needed for CPR.
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>

Reviewed-by: Fabiano Rosas <farosas@suse.de>



* Re: [PATCH V1 09/26] migration: vmstate_register_named
  2024-04-29 15:55 ` [PATCH V1 09/26] migration: vmstate_register_named Steve Sistare
@ 2024-05-09 14:19   ` Fabiano Rosas
  2024-05-09 14:32     ` Fabiano Rosas
  0 siblings, 1 reply; 122+ messages in thread
From: Fabiano Rosas @ 2024-05-09 14:19 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel
  Cc: Peter Xu, David Hildenbrand, Igor Mammedov, Eduardo Habkost,
	Marcel Apfelbaum, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster, Steve Sistare

Steve Sistare <steven.sistare@oracle.com> writes:

> Define vmstate_register_named which takes the instance name as its first
> parameter, instead of generating the name from VMStateIf of the Object.
> This will be needed to register objects that are not Objects.  Pass the
> new name parameter to vmstate_register_with_alias_id.
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>

Reviewed-by: Fabiano Rosas <farosas@suse.de>



* Re: [PATCH V1 09/26] migration: vmstate_register_named
  2024-05-09 14:19   ` Fabiano Rosas
@ 2024-05-09 14:32     ` Fabiano Rosas
  2024-05-13 19:29       ` Steven Sistare
  0 siblings, 1 reply; 122+ messages in thread
From: Fabiano Rosas @ 2024-05-09 14:32 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel
  Cc: Peter Xu, David Hildenbrand, Igor Mammedov, Eduardo Habkost,
	Marcel Apfelbaum, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster, Steve Sistare

Fabiano Rosas <farosas@suse.de> writes:

> Steve Sistare <steven.sistare@oracle.com> writes:
>
>> Define vmstate_register_named which takes the instance name as its first
>> parameter, instead of generating the name from VMStateIf of the Object.
>> This will be needed to register objects that are not Objects.  Pass the
>> new name parameter to vmstate_register_with_alias_id.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>
> Reviewed-by: Fabiano Rosas <farosas@suse.de>

Actually, can't we define a wrapper type just for this purpose? For
example, looking at dbus-vmstate.c:

static void dbus_vmstate_class_init(ObjectClass *oc, void *data)
{
...
    VMStateIfClass *vc = VMSTATE_IF_CLASS(oc);

    vc->get_id = dbus_vmstate_get_id;
...
}

static const TypeInfo dbus_vmstate_info = {
    .name = TYPE_DBUS_VMSTATE,
    .parent = TYPE_OBJECT,
    .instance_size = sizeof(DBusVMState),
    .instance_finalize = dbus_vmstate_finalize,
    .class_init = dbus_vmstate_class_init,
    .interfaces = (InterfaceInfo[]) {
        { TYPE_USER_CREATABLE },   // without this one
        { TYPE_VMSTATE_IF },
        { }
    }
};

static void register_types(void)
{
    type_register_static(&dbus_vmstate_info);
}
type_init(register_types);



* Re: [PATCH V1 21/26] migration: migrate_add_blocker_mode
  2024-04-29 15:55 ` [PATCH V1 21/26] migration: migrate_add_blocker_mode Steve Sistare
@ 2024-05-09 17:47   ` Fabiano Rosas
  0 siblings, 0 replies; 122+ messages in thread
From: Fabiano Rosas @ 2024-05-09 17:47 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel
  Cc: Peter Xu, David Hildenbrand, Igor Mammedov, Eduardo Habkost,
	Marcel Apfelbaum, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster, Steve Sistare

Steve Sistare <steven.sistare@oracle.com> writes:

> Define a convenience function to add a migration blocker for a single mode.
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>

Reviewed-by: Fabiano Rosas <farosas@suse.de>



* Re: [PATCH V1 22/26] migration: ram block cpr-exec blockers
  2024-04-29 15:55 ` [PATCH V1 22/26] migration: ram block cpr-exec blockers Steve Sistare
@ 2024-05-09 18:01   ` Fabiano Rosas
  2024-05-13 19:29     ` Steven Sistare
  0 siblings, 1 reply; 122+ messages in thread
From: Fabiano Rosas @ 2024-05-09 18:01 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel
  Cc: Peter Xu, David Hildenbrand, Igor Mammedov, Eduardo Habkost,
	Marcel Apfelbaum, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster, Steve Sistare

Steve Sistare <steven.sistare@oracle.com> writes:

> Unlike cpr-reboot mode, cpr-exec mode cannot save volatile ram blocks in the
> migration stream file and recreate them later, because the physical memory for
> the blocks is pinned and registered for vfio.  Add an exec-mode blocker for
> volatile ram blocks.
>
> Also add a blocker for RAM_GUEST_MEMFD.  Preserving guest_memfd may be
> sufficient for cpr-exec, but it has not been tested yet.
>
> - Steve

Extra text here: the "- Steve" line above should be dropped from the commit
message.

>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>

Reviewed-by: Fabiano Rosas <farosas@suse.de>




* Re: [PATCH V1 23/26] migration: misc cpr-exec blockers
  2024-04-29 15:55 ` [PATCH V1 23/26] migration: misc " Steve Sistare
@ 2024-05-09 18:05   ` Fabiano Rosas
  2024-05-24 12:40   ` Fabiano Rosas
  1 sibling, 0 replies; 122+ messages in thread
From: Fabiano Rosas @ 2024-05-09 18:05 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel
  Cc: Peter Xu, David Hildenbrand, Igor Mammedov, Eduardo Habkost,
	Marcel Apfelbaum, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster, Steve Sistare

Steve Sistare <steven.sistare@oracle.com> writes:

> Add blockers for cpr-exec migration mode for devices and options that do
> not support it.
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>

Reviewed-by: Fabiano Rosas <farosas@suse.de>


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH V1 24/26] seccomp: cpr-exec blocker
  2024-04-29 15:55 ` [PATCH V1 24/26] seccomp: cpr-exec blocker Steve Sistare
@ 2024-05-09 18:16   ` Fabiano Rosas
  2024-05-10  7:54   ` Daniel P. Berrangé
  1 sibling, 0 replies; 122+ messages in thread
From: Fabiano Rosas @ 2024-05-09 18:16 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel
  Cc: Peter Xu, David Hildenbrand, Igor Mammedov, Eduardo Habkost,
	Marcel Apfelbaum, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster, Steve Sistare

Steve Sistare <steven.sistare@oracle.com> writes:

> cpr-exec mode needs permission to exec.  Block it if permission is denied.
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>

Reviewed-by: Fabiano Rosas <farosas@suse.de>



* Re: [PATCH V1 25/26] migration: fix mismatched GPAs during cpr-exec
  2024-04-29 15:55 ` [PATCH V1 25/26] migration: fix mismatched GPAs during cpr-exec Steve Sistare
@ 2024-05-09 18:39   ` Fabiano Rosas
  0 siblings, 0 replies; 122+ messages in thread
From: Fabiano Rosas @ 2024-05-09 18:39 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel
  Cc: Peter Xu, David Hildenbrand, Igor Mammedov, Eduardo Habkost,
	Marcel Apfelbaum, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster, Steve Sistare

Steve Sistare <steven.sistare@oracle.com> writes:

> For cpr-exec mode, ramblock_is_ignored is always true, and the address of
> each migrated memory region must match the address of the statically
> initialized region on the target.  However, for a PCI rom block, the region
> address is set when the guest writes to a BAR on the source, which does not
> occur on the target, causing a "Mismatched GPAs" error during cpr-exec
> migration.
>
> To fix, unconditionally set the target's address to the source's address
> if the region does not have an address yet.
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>

Just a detail below.

Reviewed-by: Fabiano Rosas <farosas@suse.de>

> ---
>  include/exec/memory.h | 12 ++++++++++++
>  migration/ram.c       | 15 +++++++++------
>  system/memory.c       | 10 ++++++++--
>  3 files changed, 29 insertions(+), 8 deletions(-)
>
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index d337737..4f654b0 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -801,6 +801,7 @@ struct MemoryRegion {
>      bool unmergeable;
>      uint8_t dirty_log_mask;
>      bool is_iommu;
> +    bool has_addr;

This field is not used during memory access, so maybe move it further down
to preserve the hole for future use.
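
To illustrate the point about the hole, here is a small sketch with two
made-up structs. They are deliberately simplified and do not reflect the
real MemoryRegion layout; the field names are borrowed only for readability.
The helpers measure the padding left in front of the first pointer of the
"hot" group, depending on where the new flag is placed.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: the new flag sits among the hot fields. */
struct region_hot_placement {
    bool unmergeable;
    uint8_t dirty_log_mask;
    bool is_iommu;
    bool has_addr;          /* cold flag, consumes one byte of the hole */
    void *ram_block;
    uint64_t addr;
    bool enabled;
    void *owner;
};

/* Hypothetical layout: the new flag sits next to other cold fields. */
struct region_cold_placement {
    bool unmergeable;
    uint8_t dirty_log_mask;
    bool is_iommu;
    void *ram_block;
    uint64_t addr;
    bool enabled;
    bool has_addr;          /* cold flag, reuses padding after 'enabled' */
    void *owner;
};

/* Padding bytes between the last hot flag and ram_block, hot placement. */
size_t hole_flag_in_hot_group(void)
{
    return offsetof(struct region_hot_placement, ram_block) -
           (offsetof(struct region_hot_placement, has_addr) + sizeof(bool));
}

/* Padding bytes between the last hot flag and ram_block, cold placement. */
size_t hole_flag_in_cold_group(void)
{
    return offsetof(struct region_cold_placement, ram_block) -
           (offsetof(struct region_cold_placement, is_iommu) + sizeof(bool));
}
```

On common ABIs both structs end up with the same size, but the second one
keeps one more free byte in the hot group for a future field.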

>      RAMBlock *ram_block;
>      Object *owner;
>      /* owner as TYPE_DEVICE. Used for re-entrancy checks in MR access hotpath */
> @@ -2402,6 +2403,17 @@ void memory_region_set_enabled(MemoryRegion *mr, bool enabled);
>  void memory_region_set_address(MemoryRegion *mr, hwaddr addr);
>  
>  /*
> + * memory_region_set_address_only: set the address of a region.
> + *
> + * Same as memory_region_set_address, but without causing transaction side
> + * effects.
> + *
> + * @mr: the region to be updated
> + * @addr: new address, relative to container region
> + */
> +void memory_region_set_address_only(MemoryRegion *mr, hwaddr addr);
> +
> +/*
>   * memory_region_set_size: dynamically update the size of a region.
>   *
>   * Dynamically updates the size of a region.
> diff --git a/migration/ram.c b/migration/ram.c
> index add285b..7b8d7f6 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -4196,12 +4196,15 @@ static int parse_ramblock(QEMUFile *f, RAMBlock *block, ram_addr_t length)
>      }
>      if (migrate_ignore_shared()) {
>          hwaddr addr = qemu_get_be64(f);
> -        if (migrate_ram_is_ignored(block) &&
> -            block->mr->addr != addr) {
> -            error_report("Mismatched GPAs for block %s "
> -                         "%" PRId64 "!= %" PRId64, block->idstr,
> -                         (uint64_t)addr, (uint64_t)block->mr->addr);
> -            return -EINVAL;
> +        if (migrate_ram_is_ignored(block)) {
> +            if (!block->mr->has_addr) {
> +                memory_region_set_address_only(block->mr, addr);
> +            } else if (block->mr->addr != addr) {
> +                error_report("Mismatched GPAs for block %s "
> +                             "%" PRId64 "!= %" PRId64, block->idstr,
> +                             (uint64_t)addr, (uint64_t)block->mr->addr);
> +                return -EINVAL;
> +            }
>          }
>      }
>      ret = rdma_block_notification_handle(f, block->idstr);
> diff --git a/system/memory.c b/system/memory.c
> index ca04a0e..3c72504 100644
> --- a/system/memory.c
> +++ b/system/memory.c
> @@ -2665,7 +2665,7 @@ static void memory_region_add_subregion_common(MemoryRegion *mr,
>      for (alias = subregion->alias; alias; alias = alias->alias) {
>          alias->mapped_via_alias++;
>      }
> -    subregion->addr = offset;
> +    memory_region_set_address_only(subregion, offset);
>      memory_region_update_container_subregions(subregion);
>  }
>  
> @@ -2745,10 +2745,16 @@ static void memory_region_readd_subregion(MemoryRegion *mr)
>      }
>  }
>  
> +void memory_region_set_address_only(MemoryRegion *mr, hwaddr addr)
> +{
> +    mr->addr = addr;
> +    mr->has_addr = true;
> +}
> +
>  void memory_region_set_address(MemoryRegion *mr, hwaddr addr)
>  {
>      if (addr != mr->addr) {
> -        mr->addr = addr;
> +        memory_region_set_address_only(mr, addr);
>          memory_region_readd_subregion(mr);
>      }
>  }
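
The address handling in the parse_ramblock hunk above can be modelled in
isolation as follows. This is only a sketch under simplifying assumptions:
struct region, region_set_address_only and region_check_incoming_addr are
invented names, and the real code reports the mismatch through error_report
and returns -EINVAL. An explicit flag is used presumably because address 0 is
itself valid and cannot serve as a "not set yet" marker.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the two MemoryRegion fields involved. */
struct region {
    uint64_t addr;      /* address relative to the container */
    bool has_addr;      /* true once addr has been assigned */
};

/* Record the address without any other side effects. */
void region_set_address_only(struct region *r, uint64_t addr)
{
    r->addr = addr;
    r->has_addr = true;
}

/*
 * Apply the rule from the patch: adopt the source's address if the
 * target region has none yet, otherwise require both to match.
 * Returns 0 on success and -1 on a mismatch.
 */
int region_check_incoming_addr(struct region *r, uint64_t src_addr)
{
    if (!r->has_addr) {
        region_set_address_only(r, src_addr);
        return 0;
    }
    return r->addr == src_addr ? 0 : -1;
}
```

A region that was never placed on the target, such as the PCI rom block from
the commit message, takes the first branch; every other region is still
checked as before.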



* Re: [PATCH V1 26/26] migration: only-migratable-modes
  2024-04-29 15:55 ` [PATCH V1 26/26] migration: only-migratable-modes Steve Sistare
@ 2024-05-09 19:14   ` Fabiano Rosas
  2024-05-13 19:48     ` Steven Sistare
  2024-05-21  8:05   ` Daniel P. Berrangé
  1 sibling, 1 reply; 122+ messages in thread
From: Fabiano Rosas @ 2024-05-09 19:14 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel
  Cc: Peter Xu, David Hildenbrand, Igor Mammedov, Eduardo Habkost,
	Marcel Apfelbaum, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster, Steve Sistare

Steve Sistare <steven.sistare@oracle.com> writes:

> Add the only-migratable-modes option as a generalization of only-migratable.
> Only devices that support all requested modes are allowed.
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  include/migration/misc.h       |  3 +++
>  include/sysemu/sysemu.h        |  1 -
>  migration/migration-hmp-cmds.c | 26 +++++++++++++++++++++++++-
>  migration/migration.c          | 22 +++++++++++++++++-----
>  migration/savevm.c             |  2 +-
>  qemu-options.hx                | 16 ++++++++++++++--
>  system/globals.c               |  1 -
>  system/vl.c                    | 13 ++++++++++++-
>  target/s390x/cpu_models.c      |  4 +++-
>  9 files changed, 75 insertions(+), 13 deletions(-)
>
> diff --git a/include/migration/misc.h b/include/migration/misc.h
> index 5b963ba..3ad2cd9 100644
> --- a/include/migration/misc.h
> +++ b/include/migration/misc.h
> @@ -119,6 +119,9 @@ bool migration_incoming_postcopy_advised(void);
>  /* True if background snapshot is active */
>  bool migration_in_bg_snapshot(void);
>  
> +void migration_set_required_mode(MigMode mode);
> +bool migration_mode_required(MigMode mode);
> +
>  /* migration/block-dirty-bitmap.c */
>  void dirty_bitmap_mig_init(void);
>  
> diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
> index 5b4397e..0a9c4b4 100644
> --- a/include/sysemu/sysemu.h
> +++ b/include/sysemu/sysemu.h
> @@ -8,7 +8,6 @@
>  
>  /* vl.c */
>  
> -extern int only_migratable;
>  extern const char *qemu_name;
>  extern QemuUUID qemu_uuid;
>  extern bool qemu_uuid_set;
> diff --git a/migration/migration-hmp-cmds.c b/migration/migration-hmp-cmds.c
> index 414c7e8..ca913b7 100644
> --- a/migration/migration-hmp-cmds.c
> +++ b/migration/migration-hmp-cmds.c
> @@ -16,6 +16,7 @@
>  #include "qemu/osdep.h"
>  #include "block/qapi.h"
>  #include "migration/snapshot.h"
> +#include "migration/misc.h"
>  #include "monitor/hmp.h"
>  #include "monitor/monitor.h"
>  #include "qapi/error.h"
> @@ -33,6 +34,28 @@
>  #include "options.h"
>  #include "migration.h"
>  
> +static void migration_dump_modes(Monitor *mon)
> +{
> +    int mode, n = 0;
> +
> +    monitor_printf(mon, "only-migratable-modes: ");
> +
> +    for (mode = 0; mode < MIG_MODE__MAX; mode++) {
> +        if (migration_mode_required(mode)) {
> +            if (n++) {
> +                monitor_printf(mon, ",");
> +            }
> +            monitor_printf(mon, "%s", MigMode_str(mode));
> +        }
> +    }
> +
> +    if (!n) {
> +        monitor_printf(mon, "none\n");
> +    } else {
> +        monitor_printf(mon, "\n");
> +    }
> +}
> +
>  static void migration_global_dump(Monitor *mon)
>  {
>      MigrationState *ms = migrate_get_current();
> @@ -41,7 +64,7 @@ static void migration_global_dump(Monitor *mon)
>      monitor_printf(mon, "store-global-state: %s\n",
>                     ms->store_global_state ? "on" : "off");
>      monitor_printf(mon, "only-migratable: %s\n",
> -                   only_migratable ? "on" : "off");
> +                   migration_mode_required(MIG_MODE_NORMAL) ? "on" : "off");
>      monitor_printf(mon, "send-configuration: %s\n",
>                     ms->send_configuration ? "on" : "off");
>      monitor_printf(mon, "send-section-footer: %s\n",
> @@ -50,6 +73,7 @@ static void migration_global_dump(Monitor *mon)
>                     ms->decompress_error_check ? "on" : "off");
>      monitor_printf(mon, "clear-bitmap-shift: %u\n",
>                     ms->clear_bitmap_shift);
> +    migration_dump_modes(mon);
>  }
>  
>  void hmp_info_migrate(Monitor *mon, const QDict *qdict)
> diff --git a/migration/migration.c b/migration/migration.c
> index 4984dee..5535b84 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -1719,17 +1719,29 @@ static bool is_busy(Error **reasonp, Error **errp)
>      return false;
>  }
>  
> -static bool is_only_migratable(Error **reasonp, Error **errp, int modes)
> +static int migration_modes_required;
> +
> +void migration_set_required_mode(MigMode mode)
> +{
> +    migration_modes_required |= BIT(mode);
> +}
> +
> +bool migration_mode_required(MigMode mode)
> +{
> +    return !!(migration_modes_required & BIT(mode));
> +}
> +
> +static bool modes_are_required(Error **reasonp, Error **errp, int modes)
>  {
>      ERRP_GUARD();
>  
> -    if (only_migratable && (modes & BIT(MIG_MODE_NORMAL))) {
> +    if (migration_modes_required & modes) {
>          error_propagate_prepend(errp, *reasonp,
> -                                "disallowing migration blocker "
> -                                "(--only-migratable) for: ");
> +                                "-only-migratable{-modes}  specified, but: ");

extra space before 'specified'

>          *reasonp = NULL;
>          return true;
>      }
> +
>      return false;
>  }
>  
> @@ -1783,7 +1795,7 @@ int migrate_add_blocker_modes(Error **reasonp, Error **errp, MigMode mode, ...)
>      modes = get_modes(mode, ap);
>      va_end(ap);
>  
> -    if (is_only_migratable(reasonp, errp, modes)) {
> +    if (modes_are_required(reasonp, errp, modes)) {
>          return -EACCES;
>      } else if (is_busy(reasonp, errp)) {
>          return -EBUSY;
> diff --git a/migration/savevm.c b/migration/savevm.c
> index 6087c3a..e53ac84 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -3585,7 +3585,7 @@ void vmstate_register_ram_global(MemoryRegion *mr)
>  bool vmstate_check_only_migratable(const VMStateDescription *vmsd)
>  {
>      /* check needed if --only-migratable is specified */
> -    if (!only_migratable) {
> +    if (!migration_mode_required(MIG_MODE_NORMAL)) {
>          return true;
>      }
>  
> diff --git a/qemu-options.hx b/qemu-options.hx
> index f0dfda5..946d731 100644
> --- a/qemu-options.hx
> +++ b/qemu-options.hx
> @@ -4807,8 +4807,20 @@ DEF("only-migratable", 0, QEMU_OPTION_only_migratable, \
>      "-only-migratable     allow only migratable devices\n", QEMU_ARCH_ALL)
>  SRST
>  ``-only-migratable``
> -    Only allow migratable devices. Devices will not be allowed to enter
> -    an unmigratable state.
> +    Only allow devices that can migrate using normal mode. Devices will not
> +    be allowed to enter an unmigratable state.

People will ask what "normal" mode means. I don't think we need to
expose this. This option never had anything to do with "modes" and I
think we can keep it that way. See below...

> +ERST
> +
> +DEF("only-migratable-modes", HAS_ARG, QEMU_OPTION_only_migratable_modes, \
> +    "-only-migratable-modes mode1[,...]\n"
> +    "                allow only devices that are migratable using mode(s)\n",
> +    QEMU_ARCH_ALL)
> +SRST
> +``-only-migratable-modes mode1[,...]``
> +    Only allow devices which are migratable using all modes in the list,
> +    which guarantees that migration will not fail due to a blocker.
> +    If both only-migratable-modes and only-migratable are specified,
> +    or are specified multiple times, then the required modes accumulate.
>  ERST
>  
>  DEF("nodefaults", 0, QEMU_OPTION_nodefaults, \
> diff --git a/system/globals.c b/system/globals.c
> index e353584..fdc263e 100644
> --- a/system/globals.c
> +++ b/system/globals.c
> @@ -48,7 +48,6 @@ const char *qemu_name;
>  unsigned int nb_prom_envs;
>  const char *prom_envs[MAX_PROM_ENVS];
>  uint8_t *boot_splash_filedata;
> -int only_migratable; /* turn it off unless user states otherwise */
>  int icount_align_option;
>  
>  /* The bytes in qemu_uuid are in the order specified by RFC4122, _not_ in the
> diff --git a/system/vl.c b/system/vl.c
> index b76881e..7e73be9 100644
> --- a/system/vl.c
> +++ b/system/vl.c
> @@ -3458,7 +3458,18 @@ void qemu_init(int argc, char **argv)
>                  incoming = optarg;
>                  break;
>              case QEMU_OPTION_only_migratable:
> -                only_migratable = 1;
> +                migration_set_required_mode(MIG_MODE_NORMAL);

...from the point of view of user intent, I think this should be
MIG_MODE_ALL. If I have this option set I never want to see a blocker,
period. That's not a change in behavior because the mode has to be
explicitly selected anyway.

> +                break;
> +            case QEMU_OPTION_only_migratable_modes:
> +                {
> +                    int i, mode;
> +                    g_autofree char **words = g_strsplit(optarg, ",", -1);
> +                    for (i = 0; words[i]; i++) {
> +                        mode = qapi_enum_parse(&MigMode_lookup, words[i], -1,
> +                                               &error_fatal);
> +                        migration_set_required_mode(mode);

This option can be used to refine the modes being considered, so it should
take precedence if both are present.
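
One possible reading of the two suggestions above, as a standalone sketch.
This is not what the patch does (the patch accumulates the modes from both
options). The enum, struct cli_opts and both functions are hypothetical and
only model the semantics: -only-migratable alone requires every mode, and an
explicit mode list, when present, replaces that default.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the MigMode values. */
enum mode { MODE_NORMAL, MODE_CPR_REBOOT, MODE_CPR_EXEC, MODE_COUNT };

#define MODE_BIT(m) (1u << (m))
#define MODE_ALL    (MODE_BIT(MODE_COUNT) - 1u)

/* What the command line asked for. */
struct cli_opts {
    bool only_migratable;   /* -only-migratable was given */
    bool have_mode_list;    /* -only-migratable-modes was given */
    unsigned mode_list;     /* bitmask parsed from the mode list */
};

/* The explicit list takes precedence; the plain flag means all modes. */
unsigned required_modes(const struct cli_opts *o)
{
    if (o->have_mode_list) {
        return o->mode_list;
    }
    return o->only_migratable ? MODE_ALL : 0u;
}

/* A blocker is refused if it would block any required mode. */
bool blocker_rejected(unsigned required, unsigned blocker_modes)
{
    return (required & blocker_modes) != 0u;
}
```

With only -only-migratable set, a cpr-exec blocker would be refused; adding
a mode list that names just the normal mode would allow it again.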

> +                    }
> +                }
>                  break;
>              case QEMU_OPTION_nodefaults:
>                  has_defaults = 0;
> diff --git a/target/s390x/cpu_models.c b/target/s390x/cpu_models.c
> index 8ed3bb6..42ad160 100644
> --- a/target/s390x/cpu_models.c
> +++ b/target/s390x/cpu_models.c
> @@ -16,6 +16,7 @@
>  #include "kvm/kvm_s390x.h"
>  #include "sysemu/kvm.h"
>  #include "sysemu/tcg.h"
> +#include "migration/misc.h"
>  #include "qapi/error.h"
>  #include "qemu/error-report.h"
>  #include "qapi/visitor.h"
> @@ -526,7 +527,8 @@ static void check_compatibility(const S390CPUModel *max_model,
>      }
>  
>  #ifndef CONFIG_USER_ONLY
> -    if (only_migratable && test_bit(S390_FEAT_UNPACK, model->features)) {
> +    if (migration_mode_required(MIG_MODE_NORMAL) &&
> +        test_bit(S390_FEAT_UNPACK, model->features)) {
>          error_setg(errp, "The unpack facility is not compatible with "
>                     "the --only-migratable option. You must remove either "
>                     "the 'unpack' facility or the --only-migratable option");



* Re: [PATCH V1 24/26] seccomp: cpr-exec blocker
  2024-04-29 15:55 ` [PATCH V1 24/26] seccomp: cpr-exec blocker Steve Sistare
  2024-05-09 18:16   ` Fabiano Rosas
@ 2024-05-10  7:54   ` Daniel P. Berrangé
  2024-05-13 19:29     ` Steven Sistare
  1 sibling, 1 reply; 122+ messages in thread
From: Daniel P. Berrangé @ 2024-05-10  7:54 UTC (permalink / raw)
  To: Steve Sistare
  Cc: qemu-devel, Peter Xu, Fabiano Rosas, David Hildenbrand,
	Igor Mammedov, Eduardo Habkost, Marcel Apfelbaum,
	Philippe Mathieu-Daude, Paolo Bonzini, Markus Armbruster

On Mon, Apr 29, 2024 at 08:55:33AM -0700, Steve Sistare wrote:
> cpr-exec mode needs permission to exec.  Block it if permission is denied.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  include/sysemu/seccomp.h |  1 +
>  system/qemu-seccomp.c    | 10 ++++++++--
>  system/vl.c              |  6 ++++++
>  3 files changed, 15 insertions(+), 2 deletions(-)
> 
> diff --git a/include/sysemu/seccomp.h b/include/sysemu/seccomp.h
> index fe85989..023c0a1 100644
> --- a/include/sysemu/seccomp.h
> +++ b/include/sysemu/seccomp.h
> @@ -22,5 +22,6 @@
>  #define QEMU_SECCOMP_SET_RESOURCECTL (1 << 4)
>  
>  int parse_sandbox(void *opaque, QemuOpts *opts, Error **errp);
> +uint32_t qemu_seccomp_get_opts(void);
>  
>  #endif
> diff --git a/system/qemu-seccomp.c b/system/qemu-seccomp.c
> index 5c20ac0..0d2a561 100644
> --- a/system/qemu-seccomp.c
> +++ b/system/qemu-seccomp.c
> @@ -360,12 +360,18 @@ static int seccomp_start(uint32_t seccomp_opts, Error **errp)
>      return rc < 0 ? -1 : 0;
>  }
>  
> +static uint32_t seccomp_opts;
> +
> +uint32_t qemu_seccomp_get_opts(void)
> +{
> +    return seccomp_opts;
> +}
> +
>  int parse_sandbox(void *opaque, QemuOpts *opts, Error **errp)
>  {
>      if (qemu_opt_get_bool(opts, "enable", false)) {
> -        uint32_t seccomp_opts = QEMU_SECCOMP_SET_DEFAULT
> -                | QEMU_SECCOMP_SET_OBSOLETE;
>          const char *value = NULL;
> +        seccomp_opts = QEMU_SECCOMP_SET_DEFAULT | QEMU_SECCOMP_SET_OBSOLETE;
>  
>          value = qemu_opt_get(opts, "obsolete");
>          if (value) {
> diff --git a/system/vl.c b/system/vl.c
> index 7252100..b76881e 100644
> --- a/system/vl.c
> +++ b/system/vl.c
> @@ -76,6 +76,7 @@
>  #include "hw/block/block.h"
>  #include "hw/i386/x86.h"
>  #include "hw/i386/pc.h"
> +#include "migration/blocker.h"
>  #include "migration/cpr.h"
>  #include "migration/misc.h"
>  #include "migration/snapshot.h"
> @@ -2493,6 +2494,11 @@ static void qemu_process_early_options(void)
>      QemuOptsList *olist = qemu_find_opts_err("sandbox", NULL);
>      if (olist) {
>          qemu_opts_foreach(olist, parse_sandbox, NULL, &error_fatal);
> +        if (qemu_seccomp_get_opts() & QEMU_SECCOMP_SET_SPAWN) {
> +            Error *blocker = NULL;
> +            error_setg(&blocker, "-sandbox denies exec for cpr-exec");
> +            migrate_add_blocker_mode(&blocker, MIG_MODE_CPR_EXEC, &error_fatal);
> +        }
>      }
>  #endif

There are a whole pile of features that get blocked when -sandbox is
used. I'm not convinced we should be adding code to check for specific
blocked features, as such a list will always be incomplete at best, and
incorrectly block things at worst.

I view this primarily as a documentation task for the cpr-exec command.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|




* Re: [PATCH V1 13/26] physmem: ram_block_create
  2024-04-29 15:55 ` [PATCH V1 13/26] physmem: ram_block_create Steve Sistare
@ 2024-05-13 18:37   ` Fabiano Rosas
  2024-05-13 19:30     ` Steven Sistare
  0 siblings, 1 reply; 122+ messages in thread
From: Fabiano Rosas @ 2024-05-13 18:37 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel
  Cc: Peter Xu, David Hildenbrand, Igor Mammedov, Eduardo Habkost,
	Marcel Apfelbaum, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster, Steve Sistare

Steve Sistare <steven.sistare@oracle.com> writes:

> Create a common subroutine to allocate a RAMBlock, de-duping the code to
> populate its common fields.  Add a trace point for good measure.
> No functional change.
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  system/physmem.c    | 47 ++++++++++++++++++++++++++---------------------
>  system/trace-events |  3 +++
>  2 files changed, 29 insertions(+), 21 deletions(-)
>
> diff --git a/system/physmem.c b/system/physmem.c
> index c3d04ca..6216b14 100644
> --- a/system/physmem.c
> +++ b/system/physmem.c
> @@ -52,6 +52,7 @@
>  #include "sysemu/hw_accel.h"
>  #include "sysemu/xen-mapcache.h"
>  #include "trace/trace-root.h"
> +#include "trace.h"
>  
>  #ifdef CONFIG_FALLOCATE_PUNCH_HOLE
>  #include <linux/falloc.h>
> @@ -1918,11 +1919,29 @@ out_free:
>      }
>  }
>  
> +static RAMBlock *ram_block_create(MemoryRegion *mr, ram_addr_t size,
> +                                  ram_addr_t max_size, uint32_t ram_flags)
> +{
> +    RAMBlock *rb = g_malloc0(sizeof(*rb));
> +
> +    rb->used_length = size;
> +    rb->max_length = max_size;
> +    rb->fd = -1;
> +    rb->flags = ram_flags;
> +    rb->page_size = qemu_real_host_page_size();
> +    rb->mr = mr;
> +    rb->guest_memfd = -1;
> +    trace_ram_block_create(rb->idstr, rb->flags, rb->fd, rb->used_length,

There's no idstr at this point, is there? I think this needs to be
memory_region_name(mr).

> +                           rb->max_length, mr->align);
> +    return rb;
> +}
> +
>  #ifdef CONFIG_POSIX
>  RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, MemoryRegion *mr,
>                                   uint32_t ram_flags, int fd, off_t offset,
>                                   Error **errp)
>  {
> +    void *host;
>      RAMBlock *new_block;
>      Error *local_err = NULL;
>      int64_t file_size, file_align;
> @@ -1962,19 +1981,14 @@ RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, MemoryRegion *mr,
>          return NULL;
>      }
>  
> -    new_block = g_malloc0(sizeof(*new_block));
> -    new_block->mr = mr;
> -    new_block->used_length = size;
> -    new_block->max_length = size;
> -    new_block->flags = ram_flags;
> -    new_block->guest_memfd = -1;
> -    new_block->host = file_ram_alloc(new_block, size, fd, !file_size, offset,
> -                                     errp);
> -    if (!new_block->host) {
> +    new_block = ram_block_create(mr, size, size, ram_flags);
> +    host = file_ram_alloc(new_block, size, fd, !file_size, offset, errp);
> +    if (!host) {
>          g_free(new_block);
>          return NULL;
>      }
>  
> +    new_block->host = host;
>      ram_block_add(new_block, &local_err);
>      if (local_err) {
>          g_free(new_block);
> @@ -1982,7 +1996,6 @@ RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, MemoryRegion *mr,
>          return NULL;
>      }
>      return new_block;
> -
>  }
>  
>  
> @@ -2054,18 +2067,10 @@ RAMBlock *qemu_ram_alloc_internal(ram_addr_t size, ram_addr_t max_size,
>      align = MAX(align, TARGET_PAGE_SIZE);
>      size = ROUND_UP(size, align);
>      max_size = ROUND_UP(max_size, align);
> -
> -    new_block = g_malloc0(sizeof(*new_block));
> -    new_block->mr = mr;
> -    new_block->resized = resized;
> -    new_block->used_length = size;
> -    new_block->max_length = max_size;
>      assert(max_size >= size);
> -    new_block->fd = -1;
> -    new_block->guest_memfd = -1;
> -    new_block->page_size = qemu_real_host_page_size();
> -    new_block->host = host;
> -    new_block->flags = ram_flags;
> +    new_block = ram_block_create(mr, size, max_size, ram_flags);
> +    new_block->resized = resized;
> +
>      ram_block_add(new_block, &local_err);
>      if (local_err) {
>          g_free(new_block);
> diff --git a/system/trace-events b/system/trace-events
> index 69c9044..f0a80ba 100644
> --- a/system/trace-events
> +++ b/system/trace-events
> @@ -38,3 +38,6 @@ dirtylimit_state_finalize(void)
>  dirtylimit_throttle_pct(int cpu_index, uint64_t pct, int64_t time_us) "CPU[%d] throttle percent: %" PRIu64 ", throttle adjust time %"PRIi64 " us"
>  dirtylimit_set_vcpu(int cpu_index, uint64_t quota) "CPU[%d] set dirty page rate limit %"PRIu64
>  dirtylimit_vcpu_execute(int cpu_index, int64_t sleep_time_us) "CPU[%d] sleep %"PRIi64 " us"
> +
> +# physmem.c
> +ram_block_create(const char *name, uint32_t flags, int fd, size_t used_length, size_t max_length, size_t align) "%s, flags %u, fd %d, len %lu, maxlen %lu, align %lu"



* Re: [PATCH V1 03/26] migration: SAVEVM_FOREACH
  2024-05-06 23:17   ` Fabiano Rosas
@ 2024-05-13 19:27     ` Steven Sistare
  2024-05-27 18:14       ` Peter Xu
  0 siblings, 1 reply; 122+ messages in thread
From: Steven Sistare @ 2024-05-13 19:27 UTC (permalink / raw)
  To: Fabiano Rosas, qemu-devel, Peter Xu
  Cc: David Hildenbrand, Igor Mammedov, Eduardo Habkost,
	Marcel Apfelbaum, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster

On 5/6/2024 7:17 PM, Fabiano Rosas wrote:
> Steve Sistare <steven.sistare@oracle.com> writes:
> 
>> Define an abstraction SAVEVM_FOREACH to loop over all savevm state
>> handlers, and replace QTAILQ_FOREACH.  Define variants for ALL so
>> we can loop over all handlers vs a subset of handlers in a subsequent
>> patch, but at this time there is no distinction between the two.
>> No functional change.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>>   migration/savevm.c | 55 +++++++++++++++++++++++++++++++-----------------------
>>   1 file changed, 32 insertions(+), 23 deletions(-)
>>
>> diff --git a/migration/savevm.c b/migration/savevm.c
>> index 4509482..6829ba3 100644
>> --- a/migration/savevm.c
>> +++ b/migration/savevm.c
>> @@ -237,6 +237,15 @@ static SaveState savevm_state = {
>>       .global_section_id = 0,
>>   };
>>   
>> +#define SAVEVM_FOREACH(se, entry)                                    \
>> +    QTAILQ_FOREACH(se, &savevm_state.handlers, entry)                \
>> +
>> +#define SAVEVM_FOREACH_ALL(se, entry)                                \
>> +    QTAILQ_FOREACH(se, &savevm_state.handlers, entry)
> 
> This feels worse than SAVEVM_FOREACH_NOT_PRECREATED. We'll have to keep
> coming back to the definition to figure out which FOREACH is the real
> deal.

I take your point, but the majority of the loops do not care about precreated
objects, so it seems backwards to make them more verbose with
SAVEVM_FOREACH_NOT_PRECREATED.  I can go either way, but we need
Peter's opinion also.
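
To make the naming trade-off concrete, here is a minimal sketch of how an
unfiltered and a filtered loop macro could sit side by side.  It is not the
code from this series: the Entry type, its precreate flag, and both macro
names are hypothetical stand-ins, and it uses TAILQ from <sys/queue.h>
rather than QEMU's QTAILQ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical stand-in for SaveStateEntry; only what the sketch needs. */
typedef struct Entry {
    const char *idstr;
    bool precreate;                 /* assumed flag, not taken from the patch */
    TAILQ_ENTRY(Entry) entry;
} Entry;

typedef TAILQ_HEAD(EntryList, Entry) EntryList;

/* Visit every handler. */
#define FOREACH_ALL(se, head)             \
    TAILQ_FOREACH(se, head, entry)

/* Visit only handlers that are not precreated.  The caller's loop body
 * becomes the else branch, so precreated entries are skipped. */
#define FOREACH_NOT_PRECREATED(se, head)  \
    TAILQ_FOREACH(se, head, entry)        \
        if ((se)->precreate) {} else

static size_t count_all(EntryList *head)
{
    Entry *se;
    size_t n = 0;

    assert(head != NULL);
    FOREACH_ALL(se, head) {
        n++;
    }
    return n;
}

static size_t count_not_precreated(EntryList *head)
{
    Entry *se;
    size_t n = 0;

    assert(head != NULL);
    FOREACH_NOT_PRECREATED(se, head) {
        n++;
    }
    return n;
}
```

With explicit names on both variants, a reader can tell at the call site
which set is being walked, at the cost of longer names in the common case.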

>> +
>> +#define SAVEVM_FOREACH_SAFE_ALL(se, entry, new_se)                   \
>> +    QTAILQ_FOREACH_SAFE(se, &savevm_state.handlers, entry, new_se)
>> +
>>   static SaveStateEntry *find_se(const char *idstr, uint32_t instance_id);
>>   
>>   static bool should_validate_capability(int capability)
>> @@ -674,7 +683,7 @@ static uint32_t calculate_new_instance_id(const char *idstr)
>>       SaveStateEntry *se;
>>       uint32_t instance_id = 0;
>>   
>> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
>> +    SAVEVM_FOREACH_ALL(se, entry) {
> 
> In this patch we can't have both instances...
> 
>>           if (strcmp(idstr, se->idstr) == 0
>>               && instance_id <= se->instance_id) {
>>               instance_id = se->instance_id + 1;
>> @@ -690,7 +699,7 @@ static int calculate_compat_instance_id(const char *idstr)
>>       SaveStateEntry *se;
>>       int instance_id = 0;
>>   
>> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
>> +    SAVEVM_FOREACH(se, entry) {
> 
> ...otherwise one of the two changes will go undocumented because the
> actual reason for it will only be described in the next patch.

Sure, I'll move this to the precreate patch.

- Steve

>>           if (!se->compat) {
>>               continue;
>>           }
>> @@ -816,7 +825,7 @@ void unregister_savevm(VMStateIf *obj, const char *idstr, void *opaque)
>>       }
>>       pstrcat(id, sizeof(id), idstr);
>>   
>> -    QTAILQ_FOREACH_SAFE(se, &savevm_state.handlers, entry, new_se) {
>> +    SAVEVM_FOREACH_SAFE_ALL(se, entry, new_se) {
>>           if (strcmp(se->idstr, id) == 0 && se->opaque == opaque) {
>>               savevm_state_handler_remove(se);
>>               g_free(se->compat);
>> @@ -939,7 +948,7 @@ void vmstate_unregister(VMStateIf *obj, const VMStateDescription *vmsd,
>>   {
>>       SaveStateEntry *se, *new_se;
>>   
>> -    QTAILQ_FOREACH_SAFE(se, &savevm_state.handlers, entry, new_se) {
>> +    SAVEVM_FOREACH_SAFE_ALL(se, entry, new_se) {
>>           if (se->vmsd == vmsd && se->opaque == opaque) {
>>               savevm_state_handler_remove(se);
>>               g_free(se->compat);
>> @@ -1223,7 +1232,7 @@ bool qemu_savevm_state_blocked(Error **errp)
>>   {
>>       SaveStateEntry *se;
>>   
>> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
>> +    SAVEVM_FOREACH(se, entry) {
>>           if (se->vmsd && se->vmsd->unmigratable) {
>>               error_setg(errp, "State blocked by non-migratable device '%s'",
>>                          se->idstr);
>> @@ -1237,7 +1246,7 @@ void qemu_savevm_non_migratable_list(strList **reasons)
>>   {
>>       SaveStateEntry *se;
>>   
>> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
>> +    SAVEVM_FOREACH(se, entry) {
>>           if (se->vmsd && se->vmsd->unmigratable) {
>>               QAPI_LIST_PREPEND(*reasons,
>>                                 g_strdup_printf("non-migratable device: %s",
>> @@ -1276,7 +1285,7 @@ bool qemu_savevm_state_guest_unplug_pending(void)
>>   {
>>       SaveStateEntry *se;
>>   
>> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
>> +    SAVEVM_FOREACH(se, entry) {
>>           if (se->vmsd && se->vmsd->dev_unplug_pending &&
>>               se->vmsd->dev_unplug_pending(se->opaque)) {
>>               return true;
>> @@ -1291,7 +1300,7 @@ int qemu_savevm_state_prepare(Error **errp)
>>       SaveStateEntry *se;
>>       int ret;
>>   
>> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
>> +    SAVEVM_FOREACH(se, entry) {
>>           if (!se->ops || !se->ops->save_prepare) {
>>               continue;
>>           }
>> @@ -1321,7 +1330,7 @@ int qemu_savevm_state_setup(QEMUFile *f, Error **errp)
>>       json_writer_start_array(ms->vmdesc, "devices");
>>   
>>       trace_savevm_state_setup();
>> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
>> +    SAVEVM_FOREACH(se, entry) {
>>           if (se->vmsd && se->vmsd->early_setup) {
>>               ret = vmstate_save(f, se, ms->vmdesc, errp);
>>               if (ret) {
>> @@ -1365,7 +1374,7 @@ int qemu_savevm_state_resume_prepare(MigrationState *s)
>>   
>>       trace_savevm_state_resume_prepare();
>>   
>> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
>> +    SAVEVM_FOREACH(se, entry) {
>>           if (!se->ops || !se->ops->resume_prepare) {
>>               continue;
>>           }
>> @@ -1396,7 +1405,7 @@ int qemu_savevm_state_iterate(QEMUFile *f, bool postcopy)
>>       int ret;
>>   
>>       trace_savevm_state_iterate();
>> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
>> +    SAVEVM_FOREACH(se, entry) {
>>           if (!se->ops || !se->ops->save_live_iterate) {
>>               continue;
>>           }
>> @@ -1461,7 +1470,7 @@ void qemu_savevm_state_complete_postcopy(QEMUFile *f)
>>       SaveStateEntry *se;
>>       int ret;
>>   
>> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
>> +    SAVEVM_FOREACH(se, entry) {
>>           if (!se->ops || !se->ops->save_live_complete_postcopy) {
>>               continue;
>>           }
>> @@ -1495,7 +1504,7 @@ int qemu_savevm_state_complete_precopy_iterable(QEMUFile *f, bool in_postcopy)
>>       SaveStateEntry *se;
>>       int ret;
>>   
>> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
>> +    SAVEVM_FOREACH(se, entry) {
>>           if (!se->ops ||
>>               (in_postcopy && se->ops->has_postcopy &&
>>                se->ops->has_postcopy(se->opaque)) ||
>> @@ -1543,7 +1552,7 @@ int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
>>       Error *local_err = NULL;
>>       int ret;
>>   
>> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
>> +    SAVEVM_FOREACH(se, entry) {
>>           if (se->vmsd && se->vmsd->early_setup) {
>>               /* Already saved during qemu_savevm_state_setup(). */
>>               continue;
>> @@ -1649,7 +1658,7 @@ void qemu_savevm_state_pending_estimate(uint64_t *must_precopy,
>>       *must_precopy = 0;
>>       *can_postcopy = 0;
>>   
>> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
>> +    SAVEVM_FOREACH(se, entry) {
>>           if (!se->ops || !se->ops->state_pending_estimate) {
>>               continue;
>>           }
>> @@ -1670,7 +1679,7 @@ void qemu_savevm_state_pending_exact(uint64_t *must_precopy,
>>       *must_precopy = 0;
>>       *can_postcopy = 0;
>>   
>> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
>> +    SAVEVM_FOREACH(se, entry) {
>>           if (!se->ops || !se->ops->state_pending_exact) {
>>               continue;
>>           }
>> @@ -1693,7 +1702,7 @@ void qemu_savevm_state_cleanup(void)
>>       }
>>   
>>       trace_savevm_state_cleanup();
>> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
>> +    SAVEVM_FOREACH(se, entry) {
>>           if (se->ops && se->ops->save_cleanup) {
>>               se->ops->save_cleanup(se->opaque);
>>           }
>> @@ -1778,7 +1787,7 @@ int qemu_save_device_state(QEMUFile *f)
>>       }
>>       cpu_synchronize_all_states();
>>   
>> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
>> +    SAVEVM_FOREACH(se, entry) {
>>           int ret;
>>   
>>           if (se->is_ram) {
>> @@ -1801,7 +1810,7 @@ static SaveStateEntry *find_se(const char *idstr, uint32_t instance_id)
>>   {
>>       SaveStateEntry *se;
>>   
>> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
>> +    SAVEVM_FOREACH_ALL(se, entry) {
>>           if (!strcmp(se->idstr, idstr) &&
>>               (instance_id == se->instance_id ||
>>                instance_id == se->alias_id))
>> @@ -2680,7 +2689,7 @@ qemu_loadvm_section_part_end(QEMUFile *f, MigrationIncomingState *mis,
>>       }
>>   
>>       trace_qemu_loadvm_state_section_partend(section_id);
>> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
>> +    SAVEVM_FOREACH(se, entry) {
>>           if (se->load_section_id == section_id) {
>>               break;
>>           }
>> @@ -2755,7 +2764,7 @@ static void qemu_loadvm_state_switchover_ack_needed(MigrationIncomingState *mis)
>>   {
>>       SaveStateEntry *se;
>>   
>> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
>> +    SAVEVM_FOREACH(se, entry) {
>>           if (!se->ops || !se->ops->switchover_ack_needed) {
>>               continue;
>>           }
>> @@ -2775,7 +2784,7 @@ static int qemu_loadvm_state_setup(QEMUFile *f, Error **errp)
>>       int ret;
>>   
>>       trace_loadvm_state_setup();
>> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
>> +    SAVEVM_FOREACH(se, entry) {
>>           if (!se->ops || !se->ops->load_setup) {
>>               continue;
>>           }
>> @@ -2801,7 +2810,7 @@ void qemu_loadvm_state_cleanup(void)
>>       SaveStateEntry *se;
>>   
>>       trace_loadvm_state_cleanup();
>> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
>> +    SAVEVM_FOREACH(se, entry) {
>>           if (se->ops && se->ops->load_cleanup) {
>>               se->ops->load_cleanup(se->opaque);
>>           }



* Re: [PATCH V1 05/26] migration: precreate vmstate
  2024-05-07 21:02   ` Fabiano Rosas
@ 2024-05-13 19:28     ` Steven Sistare
  0 siblings, 0 replies; 122+ messages in thread
From: Steven Sistare @ 2024-05-13 19:28 UTC (permalink / raw)
  To: Fabiano Rosas, qemu-devel
  Cc: Peter Xu, David Hildenbrand, Igor Mammedov, Eduardo Habkost,
	Marcel Apfelbaum, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster

On 5/7/2024 5:02 PM, Fabiano Rosas wrote:
> Steve Sistare <steven.sistare@oracle.com> writes:
> 
>> Provide the VMStateDescription precreate field to mark objects that must
>> be loaded on the incoming side before devices have been created, because
>> they provide properties that will be needed at creation time.  They will
>> be saved to and loaded from their own QEMUFile, via
> 
> It's not obvious to me what the reason is to have a separate
> QEMUFile. Could you expand on this?

The migration stream is read in the calling sequence at B below, but precreate
state is needed at A before chardev and memory backends are created.

main()
   qemu_init()
     A:
     qemu_create_early_backends()
     qemu_create_late_backends()
     migration_object_init()
     qmp_x_exit_preconfig()
       qmp_migrate_incoming()

   qemu_default_main()
     qemu_main_loop()
       fd_accept_incoming_migration()
         migration_channel_process_incoming()
           migration_ioc_process_incoming()
             migration_incoming_process()
               process_incoming_migration_co()
                 B:
                 qemu_loadvm_state()

Precreate objects could be emitted first in the existing migration stream and
read at A, but this requires untangling numerous ordering dependencies amongst
migration_object_init, qemu_create_machine, configure_accelerators, monitor
init, and the main loop.

- Steve



* Re: [PATCH V1 06/26] migration: precreate vmstate for exec
  2024-05-06 23:34   ` Fabiano Rosas
@ 2024-05-13 19:28     ` Steven Sistare
  2024-05-13 21:21       ` Fabiano Rosas
  0 siblings, 1 reply; 122+ messages in thread
From: Steven Sistare @ 2024-05-13 19:28 UTC (permalink / raw)
  To: Fabiano Rosas, qemu-devel
  Cc: Peter Xu, David Hildenbrand, Igor Mammedov, Eduardo Habkost,
	Marcel Apfelbaum, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster

On 5/6/2024 7:34 PM, Fabiano Rosas wrote:
> Steve Sistare <steven.sistare@oracle.com> writes:
> 
>> Provide migration_precreate_save for saving precreate vmstate across exec.
>> Create a memfd, save its value in the environment, and serialize state
>> to it.  Reverse the process in migration_precreate_load.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>>   include/migration/misc.h |   5 ++
>>   migration/meson.build    |   1 +
>>   migration/precreate.c    | 139 +++++++++++++++++++++++++++++++++++++++++++++++
>>   3 files changed, 145 insertions(+)
>>   create mode 100644 migration/precreate.c
>>
>> diff --git a/include/migration/misc.h b/include/migration/misc.h
>> index c9e200f..cf30351 100644
>> --- a/include/migration/misc.h
>> +++ b/include/migration/misc.h
>> @@ -56,6 +56,11 @@ AnnounceParameters *migrate_announce_params(void);
>>   
>>   void dump_vmstate_json_to_file(FILE *out_fp);
>>   
>> +/* migration/precreate.c */
>> +int migration_precreate_save(Error **errp);
>> +void migration_precreate_unsave(void);
>> +int migration_precreate_load(Error **errp);
>> +
>>   /* migration/migration.c */
>>   void migration_object_init(void);
>>   void migration_shutdown(void);
>> diff --git a/migration/meson.build b/migration/meson.build
>> index f76b1ba..50e7cb2 100644
>> --- a/migration/meson.build
>> +++ b/migration/meson.build
>> @@ -26,6 +26,7 @@ system_ss.add(files(
>>     'ram-compress.c',
>>     'options.c',
>>     'postcopy-ram.c',
>> +  'precreate.c',
>>     'savevm.c',
>>     'socket.c',
>>     'tls.c',
>> diff --git a/migration/precreate.c b/migration/precreate.c
>> new file mode 100644
>> index 0000000..0bf5e1f
>> --- /dev/null
>> +++ b/migration/precreate.c
>> @@ -0,0 +1,139 @@
>> +/*
>> + * Copyright (c) 2022, 2024 Oracle and/or its affiliates.
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
>> + * See the COPYING file in the top-level directory.
>> + */
>> +
>> +#include "qemu/osdep.h"
>> +#include "qemu/cutils.h"
>> +#include "qemu/memfd.h"
>> +#include "qapi/error.h"
>> +#include "io/channel-file.h"
>> +#include "migration/misc.h"
>> +#include "migration/qemu-file.h"
>> +#include "migration/savevm.h"
>> +
>> +#define PRECREATE_STATE_NAME "QEMU_PRECREATE_STATE"
>> +
>> +static QEMUFile *qemu_file_new_fd_input(int fd, const char *name)
>> +{
>> +    g_autoptr(QIOChannelFile) fioc = qio_channel_file_new_fd(fd);
>> +    QIOChannel *ioc = QIO_CHANNEL(fioc);
>> +    qio_channel_set_name(ioc, name);
>> +    return qemu_file_new_input(ioc);
>> +}
>> +
>> +static QEMUFile *qemu_file_new_fd_output(int fd, const char *name)
>> +{
>> +    g_autoptr(QIOChannelFile) fioc = qio_channel_file_new_fd(fd);
>> +    QIOChannel *ioc = QIO_CHANNEL(fioc);
>> +    qio_channel_set_name(ioc, name);
>> +    return qemu_file_new_output(ioc);
>> +}
>> +
>> +static int memfd_create_named(const char *name, Error **errp)
>> +{
>> +    int mfd;
>> +    char val[16];
>> +
>> +    mfd = memfd_create(name, 0);
>> +    if (mfd < 0) {
>> +        error_setg_errno(errp, errno, "memfd_create failed");
>> +        return -1;
>> +    }
>> +
>> +    /* Remember mfd in environment for post-exec load */
>> +    qemu_clear_cloexec(mfd);
>> +    snprintf(val, sizeof(val), "%d", mfd);
>> +    g_setenv(name, val, 1);
>> +
>> +    return mfd;
>> +}
>> +
>> +static int memfd_find_named(const char *name, int *mfd_p, Error **errp)
>> +{
>> +    const char *val = g_getenv(name);
>> +
>> +    if (!val) {
>> +        *mfd_p = -1;
>> +        return 0;       /* No memfd was created, not an error */
>> +    }
>> +    g_unsetenv(name);
>> +    if (qemu_strtoi(val, NULL, 10, mfd_p)) {
>> +        error_setg(errp, "Bad %s env value %s", PRECREATE_STATE_NAME, val);
>> +        return -1;
>> +    }
>> +    lseek(*mfd_p, 0, SEEK_SET);
>> +    return 0;
>> +}
>> +
>> +static void memfd_delete_named(const char *name)
>> +{
>> +    int mfd;
>> +    const char *val = g_getenv(name);
>> +
>> +    if (val) {
>> +        g_unsetenv(name);
>> +        if (!qemu_strtoi(val, NULL, 10, &mfd)) {
>> +            close(mfd);
>> +        }
>> +    }
>> +}
>> +
>> +static QEMUFile *qemu_file_new_memfd_output(const char *name, Error **errp)
>> +{
>> +    int mfd = memfd_create_named(name, errp);
>> +
>> +    if (mfd < 0) {
>> +        return NULL;
>> +    }
>> +
>> +    return qemu_file_new_fd_output(mfd, name);
>> +}
>> +
>> +static QEMUFile *qemu_file_new_memfd_input(const char *name, Error **errp)
>> +{
>> +    int ret, mfd;
>> +
>> +    ret = memfd_find_named(name, &mfd, errp);
>> +    if (ret || mfd < 0) {
>> +        return NULL;
>> +    }
>> +
>> +    return qemu_file_new_fd_input(mfd, name);
>> +}
>> +
>> +int migration_precreate_save(Error **errp)
>> +{
>> +    QEMUFile *f = qemu_file_new_memfd_output(PRECREATE_STATE_NAME, errp);
>> +
>> +    if (!f) {
>> +        return -1;
>> +    } else if (qemu_savevm_precreate_save(f, errp)) {
>> +        memfd_delete_named(PRECREATE_STATE_NAME);
>> +        return -1;
>> +    } else {
>> +        /* Do not close f, as mfd must remain open. */
>> +        return 0;
>> +    }
>> +}
>> +
>> +void migration_precreate_unsave(void)
>> +{
>> +    memfd_delete_named(PRECREATE_STATE_NAME);
>> +}
>> +
>> +int migration_precreate_load(Error **errp)
>> +{
>> +    int ret;
>> +    QEMUFile *f = qemu_file_new_memfd_input(PRECREATE_STATE_NAME, errp);
> 
> Can we avoid the QEMUFile? I don't see it being exported from this file.

It is not exported, but within this file, it is the basis for all read and
write operations, via the existing functions qemu_file_new_input() and
qemu_file_new_output().
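
For reference, the underlying mechanism (create a memfd, record its number
in the environment, find and rewind it later) can be sketched without
QEMUFile at all.  This is a simplified illustration, not the patch: it uses
plain read/write and libc setenv/getenv, state_save and state_load are
hypothetical names, and the exec step in between is left out.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Create a memfd, write the state to it, and remember its descriptor
 * number in the environment so a post-exec process can find it.
 * Returns the descriptor, or -1 on error. */
static int state_save(const char *name, const void *buf, size_t len)
{
    char val[16];
    int flags;
    int mfd = memfd_create(name, 0);

    if (mfd < 0) {
        return -1;
    }
    /* Make sure the descriptor is inherited across exec. */
    flags = fcntl(mfd, F_GETFD);
    if (flags < 0 || fcntl(mfd, F_SETFD, flags & ~FD_CLOEXEC) < 0 ||
        write(mfd, buf, len) != (ssize_t)len) {
        close(mfd);
        return -1;
    }
    snprintf(val, sizeof(val), "%d", mfd);
    if (setenv(name, val, 1) < 0) {
        close(mfd);
        return -1;
    }
    return mfd;     /* deliberately left open */
}

/* Look up the descriptor recorded under name, rewind it, and read the
 * state back.  Returns bytes read, 0 if nothing was saved, -1 on error. */
static ssize_t state_load(const char *name, void *buf, size_t cap)
{
    const char *val = getenv(name);
    char *end;
    long fd;
    ssize_t n;

    if (!val) {
        return 0;   /* no memfd was created, not an error */
    }
    fd = strtol(val, &end, 10);     /* parse before the value is unset */
    if (end == val || *end != '\0' || fd < 0 || fd > INT_MAX) {
        unsetenv(name);
        return -1;
    }
    unsetenv(name);
    if (lseek((int)fd, 0, SEEK_SET) < 0) {
        close((int)fd);
        return -1;
    }
    n = read((int)fd, buf, cap);
    close((int)fd);
    return n;
}
```

In the real flow state_save would run before exec and state_load in the new
process; calling both in one process is enough to exercise the round trip.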

- Steve

>> +
>> +    if (!f) {
>> +        return -1;
>> +    }
>> +    ret = qemu_savevm_precreate_load(f, errp);
>> +    qemu_fclose(f);
>> +    g_unsetenv(PRECREATE_STATE_NAME);
>> +    return ret;
>> +}



* Re: [PATCH V1 09/26] migration: vmstate_register_named
  2024-05-09 14:32     ` Fabiano Rosas
@ 2024-05-13 19:29       ` Steven Sistare
  0 siblings, 0 replies; 122+ messages in thread
From: Steven Sistare @ 2024-05-13 19:29 UTC (permalink / raw)
  To: Fabiano Rosas, qemu-devel
  Cc: Peter Xu, David Hildenbrand, Igor Mammedov, Eduardo Habkost,
	Marcel Apfelbaum, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster

On 5/9/2024 10:32 AM, Fabiano Rosas wrote:
> Fabiano Rosas <farosas@suse.de> writes:
> 
>> Steve Sistare <steven.sistare@oracle.com> writes:
>>
>>> Define vmstate_register_named which takes the instance name as its first
>>> parameter, instead of generating the name from VMStateIf of the Object.
>>> This will be needed to register objects that are not Objects.  Pass the
>>> new name parameter to vmstate_register_with_alias_id.
>>>
>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>
>> Reviewed-by: Fabiano Rosas <farosas@suse.de>
> 
> Actually, can't we define a wrapper type just for this purpose? For
> example, looking at dbus-vmstate.c:

One would need to provide a separate wrapper for each struct to be registered
as vmstate.  This patch set only has RAMBlock, but there are more coming in
my next patch sets.  vmstate_register_named avoids adding such boilerplate,
and makes it easier to add more cpr state in the future.

- Steve

> static void dbus_vmstate_class_init(ObjectClass *oc, void *data)
> {
> ...
>      VMStateIfClass *vc = VMSTATE_IF_CLASS(oc);
> 
>      vc->get_id = dbus_vmstate_get_id;
> ...
> }
> 
> static const TypeInfo dbus_vmstate_info = {
>      .name = TYPE_DBUS_VMSTATE,
>      .parent = TYPE_OBJECT,
>      .instance_size = sizeof(DBusVMState),
>      .instance_finalize = dbus_vmstate_finalize,
>      .class_init = dbus_vmstate_class_init,
>      .interfaces = (InterfaceInfo[]) {
>          { TYPE_USER_CREATABLE },   // without this one
>          { TYPE_VMSTATE_IF },
>          { }
>      }
> };
> 
> static void register_types(void)
> {
>      type_register_static(&dbus_vmstate_info);
> }
> type_init(register_types);



* Re: [PATCH V1 22/26] migration: ram block cpr-exec blockers
  2024-05-09 18:01   ` Fabiano Rosas
@ 2024-05-13 19:29     ` Steven Sistare
  0 siblings, 0 replies; 122+ messages in thread
From: Steven Sistare @ 2024-05-13 19:29 UTC (permalink / raw)
  To: Fabiano Rosas, qemu-devel
  Cc: Peter Xu, David Hildenbrand, Igor Mammedov, Eduardo Habkost,
	Marcel Apfelbaum, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster

On 5/9/2024 2:01 PM, Fabiano Rosas wrote:
> Steve Sistare <steven.sistare@oracle.com> writes:
> 
>> Unlike cpr-reboot mode, cpr-exec mode cannot save volatile ram blocks in the
>> migration stream file and recreate them later, because the physical memory for
>> the blocks is pinned and registered for vfio.  Add an exec-mode blocker for
>> volatile ram blocks.
>>
>> Also add a blocker for RAM_GUEST_MEMFD.  Preserving guest_memfd may be
>> sufficient for cpr-exec, but it has not been tested yet.
>>
>> - Steve
> 
> extra text here

Will fix, thanks - steve

>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> 
> Reviewed-by: Fabiano Rosas <farosas@suse.de>
> 



* Re: [PATCH V1 24/26] seccomp: cpr-exec blocker
  2024-05-10  7:54   ` Daniel P. Berrangé
@ 2024-05-13 19:29     ` Steven Sistare
  2024-05-21  7:14       ` Daniel P. Berrangé
  0 siblings, 1 reply; 122+ messages in thread
From: Steven Sistare @ 2024-05-13 19:29 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: qemu-devel, Peter Xu, Fabiano Rosas, David Hildenbrand,
	Igor Mammedov, Eduardo Habkost, Marcel Apfelbaum,
	Philippe Mathieu-Daude, Paolo Bonzini, Markus Armbruster

On 5/10/2024 3:54 AM, Daniel P. Berrangé wrote:
> On Mon, Apr 29, 2024 at 08:55:33AM -0700, Steve Sistare wrote:
>> cpr-exec mode needs permission to exec.  Block it if permission is denied.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>>   include/sysemu/seccomp.h |  1 +
>>   system/qemu-seccomp.c    | 10 ++++++++--
>>   system/vl.c              |  6 ++++++
>>   3 files changed, 15 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/sysemu/seccomp.h b/include/sysemu/seccomp.h
>> index fe85989..023c0a1 100644
>> --- a/include/sysemu/seccomp.h
>> +++ b/include/sysemu/seccomp.h
>> @@ -22,5 +22,6 @@
>>   #define QEMU_SECCOMP_SET_RESOURCECTL (1 << 4)
>>   
>>   int parse_sandbox(void *opaque, QemuOpts *opts, Error **errp);
>> +uint32_t qemu_seccomp_get_opts(void);
>>   
>>   #endif
>> diff --git a/system/qemu-seccomp.c b/system/qemu-seccomp.c
>> index 5c20ac0..0d2a561 100644
>> --- a/system/qemu-seccomp.c
>> +++ b/system/qemu-seccomp.c
>> @@ -360,12 +360,18 @@ static int seccomp_start(uint32_t seccomp_opts, Error **errp)
>>       return rc < 0 ? -1 : 0;
>>   }
>>   
>> +static uint32_t seccomp_opts;
>> +
>> +uint32_t qemu_seccomp_get_opts(void)
>> +{
>> +    return seccomp_opts;
>> +}
>> +
>>   int parse_sandbox(void *opaque, QemuOpts *opts, Error **errp)
>>   {
>>       if (qemu_opt_get_bool(opts, "enable", false)) {
>> -        uint32_t seccomp_opts = QEMU_SECCOMP_SET_DEFAULT
>> -                | QEMU_SECCOMP_SET_OBSOLETE;
>>           const char *value = NULL;
>> +        seccomp_opts = QEMU_SECCOMP_SET_DEFAULT | QEMU_SECCOMP_SET_OBSOLETE;
>>   
>>           value = qemu_opt_get(opts, "obsolete");
>>           if (value) {
>> diff --git a/system/vl.c b/system/vl.c
>> index 7252100..b76881e 100644
>> --- a/system/vl.c
>> +++ b/system/vl.c
>> @@ -76,6 +76,7 @@
>>   #include "hw/block/block.h"
>>   #include "hw/i386/x86.h"
>>   #include "hw/i386/pc.h"
>> +#include "migration/blocker.h"
>>   #include "migration/cpr.h"
>>   #include "migration/misc.h"
>>   #include "migration/snapshot.h"
>> @@ -2493,6 +2494,11 @@ static void qemu_process_early_options(void)
>>       QemuOptsList *olist = qemu_find_opts_err("sandbox", NULL);
>>       if (olist) {
>>           qemu_opts_foreach(olist, parse_sandbox, NULL, &error_fatal);
>> +        if (qemu_seccomp_get_opts() & QEMU_SECCOMP_SET_SPAWN) {
>> +            Error *blocker = NULL;
>> +            error_setg(&blocker, "-sandbox denies exec for cpr-exec");
>> +            migrate_add_blocker_mode(&blocker, MIG_MODE_CPR_EXEC, &error_fatal);
>> +        }
>>       }
>>   #endif
> 
> There are a whole pile of features that get blocked when -sandbox is
> used. I'm not convinced we should be adding code to check for specific
> blocked features, as such a list will always be incomplete at best, and
> incorrectly block things at worst.
> 
> I view this primarily as a documentation task for the cpr-exec command.

For cpr and live migration, we do our best to prevent breaking the guest
for cases we know will fail.  Independently, a clear error message here
will reduce error reports for this new cpr feature.

Would it be more palatable if I move this blocker's creation to cpr_mig_init?
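
To illustrate what a mode-specific blocker buys the user, here is a
simplified sketch.  It is not QEMU's blocker API: the Mode enum, the
SANDBOX_SPAWN bit, and every function name are hypothetical, and only one
reason per mode is stored for brevity.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical migration modes; names do not match QEMU's MigMode. */
typedef enum {
    MODE_NORMAL,
    MODE_CPR_REBOOT,
    MODE_CPR_EXEC,
    MODE_COUNT
} Mode;

/* Hypothetical sandbox option bit meaning "spawn/exec is denied". */
#define SANDBOX_SPAWN (1u << 3)

/* One blocking reason per mode; NULL means the mode is allowed. */
static const char *blockers[MODE_COUNT];

static void add_blocker_mode(Mode mode, const char *reason)
{
    assert(mode < MODE_COUNT);
    if (!blockers[mode]) {
        blockers[mode] = reason;
    }
}

/* Returns the reason the mode is blocked, or NULL if it may proceed. */
static const char *mode_blocked(Mode mode)
{
    assert(mode < MODE_COUNT);
    return blockers[mode];
}

/* Register the blocker once, at option-parsing time, so a later request
 * for the exec mode fails up front with a clear message. */
static void check_sandbox(unsigned sandbox_opts)
{
    if (sandbox_opts & SANDBOX_SPAWN) {
        add_blocker_mode(MODE_CPR_EXEC, "-sandbox denies exec for cpr-exec");
    }
}
```

Only the exec mode is affected; other modes stay unblocked, which is the
behavior the user sees either way, but with a reason attached.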

- Steve



* Re: [PATCH V1 13/26] physmem: ram_block_create
  2024-05-13 18:37   ` Fabiano Rosas
@ 2024-05-13 19:30     ` Steven Sistare
  0 siblings, 0 replies; 122+ messages in thread
From: Steven Sistare @ 2024-05-13 19:30 UTC (permalink / raw)
  To: Fabiano Rosas, qemu-devel
  Cc: Peter Xu, David Hildenbrand, Igor Mammedov, Eduardo Habkost,
	Marcel Apfelbaum, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster

On 5/13/2024 2:37 PM, Fabiano Rosas wrote:
> Steve Sistare <steven.sistare@oracle.com> writes:
> 
>> Create a common subroutine to allocate a RAMBlock, de-duping the code to
>> populate its common fields.  Add a trace point for good measure.
>> No functional change.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>>   system/physmem.c    | 47 ++++++++++++++++++++++++++---------------------
>>   system/trace-events |  3 +++
>>   2 files changed, 29 insertions(+), 21 deletions(-)
>>
>> diff --git a/system/physmem.c b/system/physmem.c
>> index c3d04ca..6216b14 100644
>> --- a/system/physmem.c
>> +++ b/system/physmem.c
>> @@ -52,6 +52,7 @@
>>   #include "sysemu/hw_accel.h"
>>   #include "sysemu/xen-mapcache.h"
>>   #include "trace/trace-root.h"
>> +#include "trace.h"
>>   
>>   #ifdef CONFIG_FALLOCATE_PUNCH_HOLE
>>   #include <linux/falloc.h>
>> @@ -1918,11 +1919,29 @@ out_free:
>>       }
>>   }
>>   
>> +static RAMBlock *ram_block_create(MemoryRegion *mr, ram_addr_t size,
>> +                                  ram_addr_t max_size, uint32_t ram_flags)
>> +{
>> +    RAMBlock *rb = g_malloc0(sizeof(*rb));
>> +
>> +    rb->used_length = size;
>> +    rb->max_length = max_size;
>> +    rb->fd = -1;
>> +    rb->flags = ram_flags;
>> +    rb->page_size = qemu_real_host_page_size();
>> +    rb->mr = mr;
>> +    rb->guest_memfd = -1;
>> +    trace_ram_block_create(rb->idstr, rb->flags, rb->fd, rb->used_length,
> 
> There's no idstr at this point, is there? I think this needs to be
> memory_region_name(mr).

Thanks, will fix. That is a bug in my patch factoring.  I add the call to
qemu_ram_set_idstr in patch "physmem: set ram block idstr earlier".
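
To illustrate the point outside of QEMU, here is a minimal standalone sketch.
DemoRegion and DemoBlock are hypothetical stand-ins, not the real MemoryRegion
and RAMBlock, and demo_block_trace_name() is an invented helper.  It only
shows why a zero-allocated block has an empty idstr at creation time, so a
trace call there would have to take its name from the memory region.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for MemoryRegion and RAMBlock; not the QEMU types. */
typedef struct DemoRegion {
    const char *name;
} DemoRegion;

typedef struct DemoBlock {
    char idstr[256];
    DemoRegion *mr;
} DemoBlock;

/* Zero-allocate a block, similar in spirit to g_malloc0() in the patch. */
static DemoBlock *demo_block_create(DemoRegion *mr)
{
    DemoBlock *rb = calloc(1, sizeof(*rb));

    assert(rb != NULL);
    rb->mr = mr;
    return rb;
}

/*
 * Pick a name for tracing.  Right after creation idstr is still all zeroes,
 * so fall back to the name of the owning region.
 */
static const char *demo_block_trace_name(const DemoBlock *rb)
{
    return rb->idstr[0] != '\0' ? rb->idstr : rb->mr->name;
}
```

With a region named "pc.ram", demo_block_trace_name() returns the region name
until something fills in idstr.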

- Steve

>> +                           rb->max_length, mr->align);
>> +    return rb;
>> +}
>> +
>>   #ifdef CONFIG_POSIX
>>   RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, MemoryRegion *mr,
>>                                    uint32_t ram_flags, int fd, off_t offset,
>>                                    Error **errp)
>>   {
>> +    void *host;
>>       RAMBlock *new_block;
>>       Error *local_err = NULL;
>>       int64_t file_size, file_align;
>> @@ -1962,19 +1981,14 @@ RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, MemoryRegion *mr,
>>           return NULL;
>>       }
>>   
>> -    new_block = g_malloc0(sizeof(*new_block));
>> -    new_block->mr = mr;
>> -    new_block->used_length = size;
>> -    new_block->max_length = size;
>> -    new_block->flags = ram_flags;
>> -    new_block->guest_memfd = -1;
>> -    new_block->host = file_ram_alloc(new_block, size, fd, !file_size, offset,
>> -                                     errp);
>> -    if (!new_block->host) {
>> +    new_block = ram_block_create(mr, size, size, ram_flags);
>> +    host = file_ram_alloc(new_block, size, fd, !file_size, offset, errp);
>> +    if (!host) {
>>           g_free(new_block);
>>           return NULL;
>>       }
>>   
>> +    new_block->host = host;
>>       ram_block_add(new_block, &local_err);
>>       if (local_err) {
>>           g_free(new_block);
>> @@ -1982,7 +1996,6 @@ RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, MemoryRegion *mr,
>>           return NULL;
>>       }
>>       return new_block;
>> -
>>   }
>>   
>>   
>> @@ -2054,18 +2067,10 @@ RAMBlock *qemu_ram_alloc_internal(ram_addr_t size, ram_addr_t max_size,
>>       align = MAX(align, TARGET_PAGE_SIZE);
>>       size = ROUND_UP(size, align);
>>       max_size = ROUND_UP(max_size, align);
>> -
>> -    new_block = g_malloc0(sizeof(*new_block));
>> -    new_block->mr = mr;
>> -    new_block->resized = resized;
>> -    new_block->used_length = size;
>> -    new_block->max_length = max_size;
>>       assert(max_size >= size);
>> -    new_block->fd = -1;
>> -    new_block->guest_memfd = -1;
>> -    new_block->page_size = qemu_real_host_page_size();
>> -    new_block->host = host;
>> -    new_block->flags = ram_flags;
>> +    new_block = ram_block_create(mr, size, max_size, ram_flags);
>> +    new_block->resized = resized;
>> +
>>       ram_block_add(new_block, &local_err);
>>       if (local_err) {
>>           g_free(new_block);
>> diff --git a/system/trace-events b/system/trace-events
>> index 69c9044..f0a80ba 100644
>> --- a/system/trace-events
>> +++ b/system/trace-events
>> @@ -38,3 +38,6 @@ dirtylimit_state_finalize(void)
>>   dirtylimit_throttle_pct(int cpu_index, uint64_t pct, int64_t time_us) "CPU[%d] throttle percent: %" PRIu64 ", throttle adjust time %"PRIi64 " us"
>>   dirtylimit_set_vcpu(int cpu_index, uint64_t quota) "CPU[%d] set dirty page rate limit %"PRIu64
>>   dirtylimit_vcpu_execute(int cpu_index, int64_t sleep_time_us) "CPU[%d] sleep %"PRIi64 " us"
>> +
>> +# physmem.c
>> +ram_block_create(const char *name, uint32_t flags, int fd, size_t used_length, size_t max_length, size_t align) "%s, flags %u, fd %d, len %lu, maxlen %lu, align %lu"



* Re: [PATCH V1 26/26] migration: only-migratable-modes
  2024-05-09 19:14   ` Fabiano Rosas
@ 2024-05-13 19:48     ` Steven Sistare
  2024-05-13 21:57       ` Fabiano Rosas
  0 siblings, 1 reply; 122+ messages in thread
From: Steven Sistare @ 2024-05-13 19:48 UTC (permalink / raw)
  To: Fabiano Rosas, qemu-devel
  Cc: Peter Xu, David Hildenbrand, Igor Mammedov, Eduardo Habkost,
	Marcel Apfelbaum, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster

On 5/9/2024 3:14 PM, Fabiano Rosas wrote:
> Steve Sistare <steven.sistare@oracle.com> writes:
> 
>> Add the only-migratable-modes option as a generalization of only-migratable.
>> Only devices that support all requested modes are allowed.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>>   include/migration/misc.h       |  3 +++
>>   include/sysemu/sysemu.h        |  1 -
>>   migration/migration-hmp-cmds.c | 26 +++++++++++++++++++++++++-
>>   migration/migration.c          | 22 +++++++++++++++++-----
>>   migration/savevm.c             |  2 +-
>>   qemu-options.hx                | 16 ++++++++++++++--
>>   system/globals.c               |  1 -
>>   system/vl.c                    | 13 ++++++++++++-
>>   target/s390x/cpu_models.c      |  4 +++-
>>   9 files changed, 75 insertions(+), 13 deletions(-)
>>
>> diff --git a/include/migration/misc.h b/include/migration/misc.h
>> index 5b963ba..3ad2cd9 100644
>> --- a/include/migration/misc.h
>> +++ b/include/migration/misc.h
>> @@ -119,6 +119,9 @@ bool migration_incoming_postcopy_advised(void);
>>   /* True if background snapshot is active */
>>   bool migration_in_bg_snapshot(void);
>>   
>> +void migration_set_required_mode(MigMode mode);
>> +bool migration_mode_required(MigMode mode);
>> +
>>   /* migration/block-dirty-bitmap.c */
>>   void dirty_bitmap_mig_init(void);
>>   
>> diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
>> index 5b4397e..0a9c4b4 100644
>> --- a/include/sysemu/sysemu.h
>> +++ b/include/sysemu/sysemu.h
>> @@ -8,7 +8,6 @@
>>   
>>   /* vl.c */
>>   
>> -extern int only_migratable;
>>   extern const char *qemu_name;
>>   extern QemuUUID qemu_uuid;
>>   extern bool qemu_uuid_set;
>> diff --git a/migration/migration-hmp-cmds.c b/migration/migration-hmp-cmds.c
>> index 414c7e8..ca913b7 100644
>> --- a/migration/migration-hmp-cmds.c
>> +++ b/migration/migration-hmp-cmds.c
>> @@ -16,6 +16,7 @@
>>   #include "qemu/osdep.h"
>>   #include "block/qapi.h"
>>   #include "migration/snapshot.h"
>> +#include "migration/misc.h"
>>   #include "monitor/hmp.h"
>>   #include "monitor/monitor.h"
>>   #include "qapi/error.h"
>> @@ -33,6 +34,28 @@
>>   #include "options.h"
>>   #include "migration.h"
>>   
>> +static void migration_dump_modes(Monitor *mon)
>> +{
>> +    int mode, n = 0;
>> +
>> +    monitor_printf(mon, "only-migratable-modes: ");
>> +
>> +    for (mode = 0; mode < MIG_MODE__MAX; mode++) {
>> +        if (migration_mode_required(mode)) {
>> +            if (n++) {
>> +                monitor_printf(mon, ",");
>> +            }
>> +            monitor_printf(mon, "%s", MigMode_str(mode));
>> +        }
>> +    }
>> +
>> +    if (!n) {
>> +        monitor_printf(mon, "none\n");
>> +    } else {
>> +        monitor_printf(mon, "\n");
>> +    }
>> +}
>> +
>>   static void migration_global_dump(Monitor *mon)
>>   {
>>       MigrationState *ms = migrate_get_current();
>> @@ -41,7 +64,7 @@ static void migration_global_dump(Monitor *mon)
>>       monitor_printf(mon, "store-global-state: %s\n",
>>                      ms->store_global_state ? "on" : "off");
>>       monitor_printf(mon, "only-migratable: %s\n",
>> -                   only_migratable ? "on" : "off");
>> +                   migration_mode_required(MIG_MODE_NORMAL) ? "on" : "off");
>>       monitor_printf(mon, "send-configuration: %s\n",
>>                      ms->send_configuration ? "on" : "off");
>>       monitor_printf(mon, "send-section-footer: %s\n",
>> @@ -50,6 +73,7 @@ static void migration_global_dump(Monitor *mon)
>>                      ms->decompress_error_check ? "on" : "off");
>>       monitor_printf(mon, "clear-bitmap-shift: %u\n",
>>                      ms->clear_bitmap_shift);
>> +    migration_dump_modes(mon);
>>   }
>>   
>>   void hmp_info_migrate(Monitor *mon, const QDict *qdict)
>> diff --git a/migration/migration.c b/migration/migration.c
>> index 4984dee..5535b84 100644
>> --- a/migration/migration.c
>> +++ b/migration/migration.c
>> @@ -1719,17 +1719,29 @@ static bool is_busy(Error **reasonp, Error **errp)
>>       return false;
>>   }
>>   
>> -static bool is_only_migratable(Error **reasonp, Error **errp, int modes)
>> +static int migration_modes_required;
>> +
>> +void migration_set_required_mode(MigMode mode)
>> +{
>> +    migration_modes_required |= BIT(mode);
>> +}
>> +
>> +bool migration_mode_required(MigMode mode)
>> +{
>> +    return !!(migration_modes_required & BIT(mode));
>> +}
>> +
>> +static bool modes_are_required(Error **reasonp, Error **errp, int modes)
>>   {
>>       ERRP_GUARD();
>>   
>> -    if (only_migratable && (modes & BIT(MIG_MODE_NORMAL))) {
>> +    if (migration_modes_required & modes) {
>>           error_propagate_prepend(errp, *reasonp,
>> -                                "disallowing migration blocker "
>> -                                "(--only-migratable) for: ");
>> +                                "-only-migratable{-modes}  specified, but: ");
> 
> extra space before 'specified'

Will fix, thanks.

>>           *reasonp = NULL;
>>           return true;
>>       }
>> +
>>       return false;
>>   }
>>   
>> @@ -1783,7 +1795,7 @@ int migrate_add_blocker_modes(Error **reasonp, Error **errp, MigMode mode, ...)
>>       modes = get_modes(mode, ap);
>>       va_end(ap);
>>   
>> -    if (is_only_migratable(reasonp, errp, modes)) {
>> +    if (modes_are_required(reasonp, errp, modes)) {
>>           return -EACCES;
>>       } else if (is_busy(reasonp, errp)) {
>>           return -EBUSY;
>> diff --git a/migration/savevm.c b/migration/savevm.c
>> index 6087c3a..e53ac84 100644
>> --- a/migration/savevm.c
>> +++ b/migration/savevm.c
>> @@ -3585,7 +3585,7 @@ void vmstate_register_ram_global(MemoryRegion *mr)
>>   bool vmstate_check_only_migratable(const VMStateDescription *vmsd)
>>   {
>>       /* check needed if --only-migratable is specified */
>> -    if (!only_migratable) {
>> +    if (!migration_mode_required(MIG_MODE_NORMAL)) {
>>           return true;
>>       }
>>   
>> diff --git a/qemu-options.hx b/qemu-options.hx
>> index f0dfda5..946d731 100644
>> --- a/qemu-options.hx
>> +++ b/qemu-options.hx
>> @@ -4807,8 +4807,20 @@ DEF("only-migratable", 0, QEMU_OPTION_only_migratable, \
>>       "-only-migratable     allow only migratable devices\n", QEMU_ARCH_ALL)
>>   SRST
>>   ``-only-migratable``
>> -    Only allow migratable devices. Devices will not be allowed to enter
>> -    an unmigratable state.
>> +    Only allow devices that can migrate using normal mode. Devices will not
>> +    be allowed to enter an unmigratable state.
> 
> What's a "normal" mode is what people will ask. I don't think we need to
> expose this. This option never had anything to do with "modes" and I
> think we can keep it this way. See below...

We now have a mode parameter and enum MigMode which includes normal, and is
documented in qapi.

>> +ERST
>> +
>> +DEF("only-migratable-modes", HAS_ARG, QEMU_OPTION_only_migratable_modes, \
>> +    "-only-migratable-modes mode1[,...]\n"
>> +    "                allow only devices that are migratable using mode(s)\n",
>> +    QEMU_ARCH_ALL)
>> +SRST
>> +``-only-migratable-modes mode1[,...]``
>> +    Only allow devices which are migratable using all modes in the list,
>> +    which guarantees that migration will not fail due to a blocker.
>> +    If both only-migratable-modes and only-migratable are specified,
>> +    or are specified multiple times, then the required modes accumulate.
>>   ERST
>>   
>>   DEF("nodefaults", 0, QEMU_OPTION_nodefaults, \
>> diff --git a/system/globals.c b/system/globals.c
>> index e353584..fdc263e 100644
>> --- a/system/globals.c
>> +++ b/system/globals.c
>> @@ -48,7 +48,6 @@ const char *qemu_name;
>>   unsigned int nb_prom_envs;
>>   const char *prom_envs[MAX_PROM_ENVS];
>>   uint8_t *boot_splash_filedata;
>> -int only_migratable; /* turn it off unless user states otherwise */
>>   int icount_align_option;
>>   
>>   /* The bytes in qemu_uuid are in the order specified by RFC4122, _not_ in the
>> diff --git a/system/vl.c b/system/vl.c
>> index b76881e..7e73be9 100644
>> --- a/system/vl.c
>> +++ b/system/vl.c
>> @@ -3458,7 +3458,18 @@ void qemu_init(int argc, char **argv)
>>                   incoming = optarg;
>>                   break;
>>               case QEMU_OPTION_only_migratable:
>> -                only_migratable = 1;
>> +                migration_set_required_mode(MIG_MODE_NORMAL);
> 
> ...from the point of view of user intent, I think this should be
> MIG_MODE_ALL. 
If only-migratable applies to all modes, then:
If a user only intends to use mode A, then a blocker for mode B will terminate
qemu.  Not good.

Defining only-migratable to apply to normal mode is the backwards-compatible
solution.
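
The accumulate-and-check behavior can be sketched in isolation as follows.
This is a simplified model, not the patch itself: DemoMode, DEMO_BIT and the
demo_* functions are hypothetical names, and the enum values are illustrative.
It shows that a blocker is rejected only when it covers a mode the user
actually required.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical mirror of the MigMode enum; values are illustrative only. */
typedef enum DemoMode {
    DEMO_MODE_NORMAL,
    DEMO_MODE_CPR_EXEC,
    DEMO_MODE__MAX,
} DemoMode;

#define DEMO_BIT(n) (1u << (n))

/* Bitmask of modes the user asked to keep migratable. */
static unsigned int demo_modes_required;

/* Add one mode to the required set; repeated calls accumulate. */
static void demo_set_required_mode(DemoMode mode)
{
    assert(mode < DEMO_MODE__MAX);
    demo_modes_required |= DEMO_BIT(mode);
}

static bool demo_mode_required(DemoMode mode)
{
    return (demo_modes_required & DEMO_BIT(mode)) != 0;
}

/*
 * A blocker that applies to blocker_modes is rejected only if it overlaps
 * at least one required mode.
 */
static bool demo_blocker_rejected(unsigned int blocker_modes)
{
    return (demo_modes_required & blocker_modes) != 0;
}
```

In this model, requiring only the normal mode leaves a cpr-exec-only blocker
acceptable, which is the backwards-compatible behavior described above.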

- Steve

> If I have this option set I never want to see a blocker,
> period. That's not a change in behavior because the mode has to be
> explicitly selected anyway.
> 
>> +                break;
>> +            case QEMU_OPTION_only_migratable_modes:
>> +                {
>> +                    int i, mode;
>> +                    g_autofree char **words = g_strsplit(optarg, ",", -1);
>> +                    for (i = 0; words[i]; i++) {
>> +                        mode = qapi_enum_parse(&MigMode_lookup, words[i], -1,
>> +                                               &error_fatal);
>> +                        migration_set_required_mode(mode);
> 
> This option can be used to refine the modes being considered, it should
> take precedence if both are present.
> 
>> +                    }
>> +                }
>>                   break;
>>               case QEMU_OPTION_nodefaults:
>>                   has_defaults = 0;
>> diff --git a/target/s390x/cpu_models.c b/target/s390x/cpu_models.c
>> index 8ed3bb6..42ad160 100644
>> --- a/target/s390x/cpu_models.c
>> +++ b/target/s390x/cpu_models.c
>> @@ -16,6 +16,7 @@
>>   #include "kvm/kvm_s390x.h"
>>   #include "sysemu/kvm.h"
>>   #include "sysemu/tcg.h"
>> +#include "migration/misc.h"
>>   #include "qapi/error.h"
>>   #include "qemu/error-report.h"
>>   #include "qapi/visitor.h"
>> @@ -526,7 +527,8 @@ static void check_compatibility(const S390CPUModel *max_model,
>>       }
>>   
>>   #ifndef CONFIG_USER_ONLY
>> -    if (only_migratable && test_bit(S390_FEAT_UNPACK, model->features)) {
>> +    if (migration_mode_required(MIG_MODE_NORMAL) &&
>> +        test_bit(S390_FEAT_UNPACK, model->features)) {
>>           error_setg(errp, "The unpack facility is not compatible with "
>>                      "the --only-migratable option. You must remove either "
>>                      "the 'unpack' facility or the --only-migratable option");



* Re: [PATCH V1 06/26] migration: precreate vmstate for exec
  2024-05-13 19:28     ` Steven Sistare
@ 2024-05-13 21:21       ` Fabiano Rosas
  0 siblings, 0 replies; 122+ messages in thread
From: Fabiano Rosas @ 2024-05-13 21:21 UTC (permalink / raw)
  To: Steven Sistare, qemu-devel
  Cc: Peter Xu, David Hildenbrand, Igor Mammedov, Eduardo Habkost,
	Marcel Apfelbaum, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster

Steven Sistare <steven.sistare@oracle.com> writes:

> On 5/6/2024 7:34 PM, Fabiano Rosas wrote:
>> Steve Sistare <steven.sistare@oracle.com> writes:
>> 
>>> Provide migration_precreate_save for saving precreate vmstate across exec.
>>> Create a memfd, save its value in the environment, and serialize state
>>> to it.  Reverse the process in migration_precreate_load.
>>>
>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>> ---
>>>   include/migration/misc.h |   5 ++
>>>   migration/meson.build    |   1 +
>>>   migration/precreate.c    | 139 +++++++++++++++++++++++++++++++++++++++++++++++
>>>   3 files changed, 145 insertions(+)
>>>   create mode 100644 migration/precreate.c
>>>
>>> diff --git a/include/migration/misc.h b/include/migration/misc.h
>>> index c9e200f..cf30351 100644
>>> --- a/include/migration/misc.h
>>> +++ b/include/migration/misc.h
>>> @@ -56,6 +56,11 @@ AnnounceParameters *migrate_announce_params(void);
>>>   
>>>   void dump_vmstate_json_to_file(FILE *out_fp);
>>>   
>>> +/* migration/precreate.c */
>>> +int migration_precreate_save(Error **errp);
>>> +void migration_precreate_unsave(void);
>>> +int migration_precreate_load(Error **errp);
>>> +
>>>   /* migration/migration.c */
>>>   void migration_object_init(void);
>>>   void migration_shutdown(void);
>>> diff --git a/migration/meson.build b/migration/meson.build
>>> index f76b1ba..50e7cb2 100644
>>> --- a/migration/meson.build
>>> +++ b/migration/meson.build
>>> @@ -26,6 +26,7 @@ system_ss.add(files(
>>>     'ram-compress.c',
>>>     'options.c',
>>>     'postcopy-ram.c',
>>> +  'precreate.c',
>>>     'savevm.c',
>>>     'socket.c',
>>>     'tls.c',
>>> diff --git a/migration/precreate.c b/migration/precreate.c
>>> new file mode 100644
>>> index 0000000..0bf5e1f
>>> --- /dev/null
>>> +++ b/migration/precreate.c
>>> @@ -0,0 +1,139 @@
>>> +/*
>>> + * Copyright (c) 2022, 2024 Oracle and/or its affiliates.
>>> + *
>>> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
>>> + * See the COPYING file in the top-level directory.
>>> + */
>>> +
>>> +#include "qemu/osdep.h"
>>> +#include "qemu/cutils.h"
>>> +#include "qemu/memfd.h"
>>> +#include "qapi/error.h"
>>> +#include "io/channel-file.h"
>>> +#include "migration/misc.h"
>>> +#include "migration/qemu-file.h"
>>> +#include "migration/savevm.h"
>>> +
>>> +#define PRECREATE_STATE_NAME "QEMU_PRECREATE_STATE"
>>> +
>>> +static QEMUFile *qemu_file_new_fd_input(int fd, const char *name)
>>> +{
>>> +    g_autoptr(QIOChannelFile) fioc = qio_channel_file_new_fd(fd);
>>> +    QIOChannel *ioc = QIO_CHANNEL(fioc);
>>> +    qio_channel_set_name(ioc, name);
>>> +    return qemu_file_new_input(ioc);
>>> +}
>>> +
>>> +static QEMUFile *qemu_file_new_fd_output(int fd, const char *name)
>>> +{
>>> +    g_autoptr(QIOChannelFile) fioc = qio_channel_file_new_fd(fd);
>>> +    QIOChannel *ioc = QIO_CHANNEL(fioc);
>>> +    qio_channel_set_name(ioc, name);
>>> +    return qemu_file_new_output(ioc);
>>> +}
>>> +
>>> +static int memfd_create_named(const char *name, Error **errp)
>>> +{
>>> +    int mfd;
>>> +    char val[16];
>>> +
>>> +    mfd = memfd_create(name, 0);
>>> +    if (mfd < 0) {
>>> +        error_setg_errno(errp, errno, "memfd_create failed");
>>> +        return -1;
>>> +    }
>>> +
>>> +    /* Remember mfd in environment for post-exec load */
>>> +    qemu_clear_cloexec(mfd);
>>> +    snprintf(val, sizeof(val), "%d", mfd);
>>> +    g_setenv(name, val, 1);
>>> +
>>> +    return mfd;
>>> +}
>>> +
>>> +static int memfd_find_named(const char *name, int *mfd_p, Error **errp)
>>> +{
>>> +    const char *val = g_getenv(name);
>>> +
>>> +    if (!val) {
>>> +        *mfd_p = -1;
>>> +        return 0;       /* No memfd was created, not an error */
>>> +    }
>>> +    g_unsetenv(name);
>>> +    if (qemu_strtoi(val, NULL, 10, mfd_p)) {
>>> +        error_setg(errp, "Bad %s env value %s", PRECREATE_STATE_NAME, val);
>>> +        return -1;
>>> +    }
>>> +    lseek(*mfd_p, 0, SEEK_SET);
>>> +    return 0;
>>> +}
>>> +
>>> +static void memfd_delete_named(const char *name)
>>> +{
>>> +    int mfd;
>>> +    const char *val = g_getenv(name);
>>> +
>>> +    if (val) {
>>> +        g_unsetenv(name);
>>> +        if (!qemu_strtoi(val, NULL, 10, &mfd)) {
>>> +            close(mfd);
>>> +        }
>>> +    }
>>> +}
>>> +
>>> +static QEMUFile *qemu_file_new_memfd_output(const char *name, Error **errp)
>>> +{
>>> +    int mfd = memfd_create_named(name, errp);
>>> +
>>> +    if (mfd < 0) {
>>> +        return NULL;
>>> +    }
>>> +
>>> +    return qemu_file_new_fd_output(mfd, name);
>>> +}
>>> +
>>> +static QEMUFile *qemu_file_new_memfd_input(const char *name, Error **errp)
>>> +{
>>> +    int ret, mfd;
>>> +
>>> +    ret = memfd_find_named(name, &mfd, errp);
>>> +    if (ret || mfd < 0) {
>>> +        return NULL;
>>> +    }
>>> +
>>> +    return qemu_file_new_fd_input(mfd, name);
>>> +}
>>> +
>>> +int migration_precreate_save(Error **errp)
>>> +{
>>> +    QEMUFile *f = qemu_file_new_memfd_output(PRECREATE_STATE_NAME, errp);
>>> +
>>> +    if (!f) {
>>> +        return -1;
>>> +    } else if (qemu_savevm_precreate_save(f, errp)) {
>>> +        memfd_delete_named(PRECREATE_STATE_NAME);
>>> +        return -1;
>>> +    } else {
>>> +        /* Do not close f, as mfd must remain open. */
>>> +        return 0;
>>> +    }
>>> +}
>>> +
>>> +void migration_precreate_unsave(void)
>>> +{
>>> +    memfd_delete_named(PRECREATE_STATE_NAME);
>>> +}
>>> +
>>> +int migration_precreate_load(Error **errp)
>>> +{
>>> +    int ret;
>>> +    QEMUFile *f = qemu_file_new_memfd_input(PRECREATE_STATE_NAME, errp);
>> 
>> Can we avoid the QEMUFile? I don't see it being exported from this file.
>
> It is not exported, but within this file, it is the basis for all read and
> write operations, via the existing functions qemu_file_new_input() and 
> qemu_file_new_output().

Right, that's the easy part, we could just use the QIOChannel
directly. But I missed the fact that you also need to interop with the
existing code that needs the QEMUFile.
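
For readers following along, the memfd handoff that the commit message
describes can be sketched with plain POSIX/Linux calls.  This is only an
approximation under stated assumptions: the demo_* names are hypothetical,
fcntl() stands in for qemu_clear_cloexec(), and no QEMUFile or QIOChannel is
involved.  Unlike the patch, the sketch parses the value before unsetting
the variable.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* for memfd_create() */
#endif
#include <assert.h>
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Create a memfd that stays open across exec and record its descriptor
 * number in the environment variable called name.  Returns the fd or -1.
 */
static int demo_memfd_create_named(const char *name)
{
    char val[16];
    int flags;
    int mfd;

    assert(name != NULL);
    mfd = memfd_create(name, 0);
    if (mfd < 0) {
        return -1;
    }

    /* Make sure close-on-exec is clear so the fd survives exec. */
    flags = fcntl(mfd, F_GETFD);
    if (flags < 0 || fcntl(mfd, F_SETFD, flags & ~FD_CLOEXEC) < 0) {
        close(mfd);
        return -1;
    }

    snprintf(val, sizeof(val), "%d", mfd);
    if (setenv(name, val, 1) < 0) {
        close(mfd);
        return -1;
    }
    return mfd;
}

/*
 * Recover the descriptor recorded under name, remove the variable, and
 * rewind the fd to offset 0.  Returns the fd, or -1 if absent or invalid.
 */
static int demo_memfd_find_named(const char *name)
{
    const char *val = getenv(name);
    char *end;
    long fd;

    if (!val) {
        return -1;
    }
    /* Parse before unsetenv(), which may invalidate val. */
    fd = strtol(val, &end, 10);
    if (end == val || *end != '\0' || fd < 0 || fd > INT_MAX) {
        fd = -1;
    }
    unsetenv(name);
    if (fd < 0 || lseek((int)fd, 0, SEEK_SET) < 0) {
        return -1;
    }
    return (int)fd;
}
```

In a real exec flow the first function would run before exec and the second
in the new process image; within one process they simply round-trip the same
descriptor.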

Reviewed-by: Fabiano Rosas <farosas@suse.de>



* Re: [PATCH V1 26/26] migration: only-migratable-modes
  2024-05-13 19:48     ` Steven Sistare
@ 2024-05-13 21:57       ` Fabiano Rosas
  0 siblings, 0 replies; 122+ messages in thread
From: Fabiano Rosas @ 2024-05-13 21:57 UTC (permalink / raw)
  To: Steven Sistare, qemu-devel
  Cc: Peter Xu, David Hildenbrand, Igor Mammedov, Eduardo Habkost,
	Marcel Apfelbaum, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster

Steven Sistare <steven.sistare@oracle.com> writes:

> On 5/9/2024 3:14 PM, Fabiano Rosas wrote:
>> Steve Sistare <steven.sistare@oracle.com> writes:
>> 
>>> Add the only-migratable-modes option as a generalization of only-migratable.
>>> Only devices that support all requested modes are allowed.
>>>
>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>> ---
>>>   include/migration/misc.h       |  3 +++
>>>   include/sysemu/sysemu.h        |  1 -
>>>   migration/migration-hmp-cmds.c | 26 +++++++++++++++++++++++++-
>>>   migration/migration.c          | 22 +++++++++++++++++-----
>>>   migration/savevm.c             |  2 +-
>>>   qemu-options.hx                | 16 ++++++++++++++--
>>>   system/globals.c               |  1 -
>>>   system/vl.c                    | 13 ++++++++++++-
>>>   target/s390x/cpu_models.c      |  4 +++-
>>>   9 files changed, 75 insertions(+), 13 deletions(-)
>>>
>>> diff --git a/include/migration/misc.h b/include/migration/misc.h
>>> index 5b963ba..3ad2cd9 100644
>>> --- a/include/migration/misc.h
>>> +++ b/include/migration/misc.h
>>> @@ -119,6 +119,9 @@ bool migration_incoming_postcopy_advised(void);
>>>   /* True if background snapshot is active */
>>>   bool migration_in_bg_snapshot(void);
>>>   
>>> +void migration_set_required_mode(MigMode mode);
>>> +bool migration_mode_required(MigMode mode);
>>> +
>>>   /* migration/block-dirty-bitmap.c */
>>>   void dirty_bitmap_mig_init(void);
>>>   
>>> diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
>>> index 5b4397e..0a9c4b4 100644
>>> --- a/include/sysemu/sysemu.h
>>> +++ b/include/sysemu/sysemu.h
>>> @@ -8,7 +8,6 @@
>>>   
>>>   /* vl.c */
>>>   
>>> -extern int only_migratable;
>>>   extern const char *qemu_name;
>>>   extern QemuUUID qemu_uuid;
>>>   extern bool qemu_uuid_set;
>>> diff --git a/migration/migration-hmp-cmds.c b/migration/migration-hmp-cmds.c
>>> index 414c7e8..ca913b7 100644
>>> --- a/migration/migration-hmp-cmds.c
>>> +++ b/migration/migration-hmp-cmds.c
>>> @@ -16,6 +16,7 @@
>>>   #include "qemu/osdep.h"
>>>   #include "block/qapi.h"
>>>   #include "migration/snapshot.h"
>>> +#include "migration/misc.h"
>>>   #include "monitor/hmp.h"
>>>   #include "monitor/monitor.h"
>>>   #include "qapi/error.h"
>>> @@ -33,6 +34,28 @@
>>>   #include "options.h"
>>>   #include "migration.h"
>>>   
>>> +static void migration_dump_modes(Monitor *mon)
>>> +{
>>> +    int mode, n = 0;
>>> +
>>> +    monitor_printf(mon, "only-migratable-modes: ");
>>> +
>>> +    for (mode = 0; mode < MIG_MODE__MAX; mode++) {
>>> +        if (migration_mode_required(mode)) {
>>> +            if (n++) {
>>> +                monitor_printf(mon, ",");
>>> +            }
>>> +            monitor_printf(mon, "%s", MigMode_str(mode));
>>> +        }
>>> +    }
>>> +
>>> +    if (!n) {
>>> +        monitor_printf(mon, "none\n");
>>> +    } else {
>>> +        monitor_printf(mon, "\n");
>>> +    }
>>> +}
>>> +
>>>   static void migration_global_dump(Monitor *mon)
>>>   {
>>>       MigrationState *ms = migrate_get_current();
>>> @@ -41,7 +64,7 @@ static void migration_global_dump(Monitor *mon)
>>>       monitor_printf(mon, "store-global-state: %s\n",
>>>                      ms->store_global_state ? "on" : "off");
>>>       monitor_printf(mon, "only-migratable: %s\n",
>>> -                   only_migratable ? "on" : "off");
>>> +                   migration_mode_required(MIG_MODE_NORMAL) ? "on" : "off");
>>>       monitor_printf(mon, "send-configuration: %s\n",
>>>                      ms->send_configuration ? "on" : "off");
>>>       monitor_printf(mon, "send-section-footer: %s\n",
>>> @@ -50,6 +73,7 @@ static void migration_global_dump(Monitor *mon)
>>>                      ms->decompress_error_check ? "on" : "off");
>>>       monitor_printf(mon, "clear-bitmap-shift: %u\n",
>>>                      ms->clear_bitmap_shift);
>>> +    migration_dump_modes(mon);
>>>   }
>>>   
>>>   void hmp_info_migrate(Monitor *mon, const QDict *qdict)
>>> diff --git a/migration/migration.c b/migration/migration.c
>>> index 4984dee..5535b84 100644
>>> --- a/migration/migration.c
>>> +++ b/migration/migration.c
>>> @@ -1719,17 +1719,29 @@ static bool is_busy(Error **reasonp, Error **errp)
>>>       return false;
>>>   }
>>>   
>>> -static bool is_only_migratable(Error **reasonp, Error **errp, int modes)
>>> +static int migration_modes_required;
>>> +
>>> +void migration_set_required_mode(MigMode mode)
>>> +{
>>> +    migration_modes_required |= BIT(mode);
>>> +}
>>> +
>>> +bool migration_mode_required(MigMode mode)
>>> +{
>>> +    return !!(migration_modes_required & BIT(mode));
>>> +}
>>> +
>>> +static bool modes_are_required(Error **reasonp, Error **errp, int modes)
>>>   {
>>>       ERRP_GUARD();
>>>   
>>> -    if (only_migratable && (modes & BIT(MIG_MODE_NORMAL))) {
>>> +    if (migration_modes_required & modes) {
>>>           error_propagate_prepend(errp, *reasonp,
>>> -                                "disallowing migration blocker "
>>> -                                "(--only-migratable) for: ");
>>> +                                "-only-migratable{-modes}  specified, but: ");
>> 
>> extra space before 'specified'
>
> Will fix, thanks.
>
>>>           *reasonp = NULL;
>>>           return true;
>>>       }
>>> +
>>>       return false;
>>>   }
>>>   
>>> @@ -1783,7 +1795,7 @@ int migrate_add_blocker_modes(Error **reasonp, Error **errp, MigMode mode, ...)
>>>       modes = get_modes(mode, ap);
>>>       va_end(ap);
>>>   
>>> -    if (is_only_migratable(reasonp, errp, modes)) {
>>> +    if (modes_are_required(reasonp, errp, modes)) {
>>>           return -EACCES;
>>>       } else if (is_busy(reasonp, errp)) {
>>>           return -EBUSY;
>>> diff --git a/migration/savevm.c b/migration/savevm.c
>>> index 6087c3a..e53ac84 100644
>>> --- a/migration/savevm.c
>>> +++ b/migration/savevm.c
>>> @@ -3585,7 +3585,7 @@ void vmstate_register_ram_global(MemoryRegion *mr)
>>>   bool vmstate_check_only_migratable(const VMStateDescription *vmsd)
>>>   {
>>>       /* check needed if --only-migratable is specified */
>>> -    if (!only_migratable) {
>>> +    if (!migration_mode_required(MIG_MODE_NORMAL)) {
>>>           return true;
>>>       }
>>>   
>>> diff --git a/qemu-options.hx b/qemu-options.hx
>>> index f0dfda5..946d731 100644
>>> --- a/qemu-options.hx
>>> +++ b/qemu-options.hx
>>> @@ -4807,8 +4807,20 @@ DEF("only-migratable", 0, QEMU_OPTION_only_migratable, \
>>>       "-only-migratable     allow only migratable devices\n", QEMU_ARCH_ALL)
>>>   SRST
>>>   ``-only-migratable``
>>> -    Only allow migratable devices. Devices will not be allowed to enter
>>> -    an unmigratable state.
>>> +    Only allow devices that can migrate using normal mode. Devices will not
>>> +    be allowed to enter an unmigratable state.
>> 
>> What's a "normal" mode is what people will ask. I don't think we need to
>> expose this. This option never had anything to do with "modes" and I
>> think we can keep it this way. See below...
>
> We now have a mode parameter and enum MigMode which includes normal, and is
> documented in qapi.
>

Alright, I take your point below. We could declare -only-migratable
superseded by -only-migratable-modes then:

``-only-migratable``
   Only allow devices that can migrate using normal mode. Devices will not
   be allowed to enter an unmigratable state. Same as
   -only-migratable-modes normal. Kept for backward-compatibility.
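
The equivalence suggested above can be sketched as a small standalone parser.
This is a hypothetical model, not the code in vl.c: the demo_opt_* names are
invented, and the mode-name table is an assumption for illustration rather
than the real MigMode lookup.

```c
#include <assert.h>
#include <string.h>

/* Illustrative mode names; the index of each name is its bit position. */
static const char *const demo_opt_mode_names[] = {
    "normal", "cpr-reboot", "cpr-exec",
};
#define DEMO_OPT_MODE_COUNT \
    (sizeof(demo_opt_mode_names) / sizeof(demo_opt_mode_names[0]))

/* Return the bit index for a mode name of length len, or -1 if unknown. */
static int demo_opt_parse_mode(const char *word, size_t len)
{
    size_t i;

    for (i = 0; i < DEMO_OPT_MODE_COUNT; i++) {
        const char *name = demo_opt_mode_names[i];

        if (strlen(name) == len && strncmp(name, word, len) == 0) {
            return (int)i;
        }
    }
    return -1;
}

/*
 * Parse a comma-separated list such as "normal,cpr-exec" and OR each mode
 * into *required.  Returns 0 on success, -1 on the first unknown mode;
 * modes parsed before the failure remain set.
 */
static int demo_opt_add_modes(const char *list, unsigned int *required)
{
    const char *p = list;

    assert(list != NULL && required != NULL);
    while (*p) {
        size_t len = strcspn(p, ",");
        int mode = demo_opt_parse_mode(p, len);

        if (mode < 0) {
            return -1;
        }
        *required |= 1u << mode;
        p += len;
        if (*p == ',') {
            p++;
        }
    }
    return 0;
}
```

Under this model, -only-migratable would amount to calling
demo_opt_add_modes("normal", &required), and repeated options simply
accumulate bits in the same mask.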


>>> +ERST
>>> +
>>> +DEF("only-migratable-modes", HAS_ARG, QEMU_OPTION_only_migratable_modes, \
>>> +    "-only-migratable-modes mode1[,...]\n"
>>> +    "                allow only devices that are migratable using mode(s)\n",
>>> +    QEMU_ARCH_ALL)
>>> +SRST
>>> +``-only-migratable-modes mode1[,...]``
>>> +    Only allow devices which are migratable using all modes in the list,
>>> +    which guarantees that migration will not fail due to a blocker.
>>> +    If both only-migratable-modes and only-migratable are specified,
>>> +    or are specified multiple times, then the required modes accumulate.
>>>   ERST
>>>   
>>>   DEF("nodefaults", 0, QEMU_OPTION_nodefaults, \
>>> diff --git a/system/globals.c b/system/globals.c
>>> index e353584..fdc263e 100644
>>> --- a/system/globals.c
>>> +++ b/system/globals.c
>>> @@ -48,7 +48,6 @@ const char *qemu_name;
>>>   unsigned int nb_prom_envs;
>>>   const char *prom_envs[MAX_PROM_ENVS];
>>>   uint8_t *boot_splash_filedata;
>>> -int only_migratable; /* turn it off unless user states otherwise */
>>>   int icount_align_option;
>>>   
>>>   /* The bytes in qemu_uuid are in the order specified by RFC4122, _not_ in the
>>> diff --git a/system/vl.c b/system/vl.c
>>> index b76881e..7e73be9 100644
>>> --- a/system/vl.c
>>> +++ b/system/vl.c
>>> @@ -3458,7 +3458,18 @@ void qemu_init(int argc, char **argv)
>>>                   incoming = optarg;
>>>                   break;
>>>               case QEMU_OPTION_only_migratable:
>>> -                only_migratable = 1;
>>> +                migration_set_required_mode(MIG_MODE_NORMAL);
>> 
>> ...from the point of view of user intent, I think this should be
>> MIG_MODE_ALL. 
> If only-migratable applies to all modes, then:
> If a user only intends to use mode A, then a blocker for mode B will terminate
> qemu.  Not good.

Ok, we can't know the mode before migration time, never mind.

>
> Defining only-migratable to apply to normal mode is the backwards-compatible
> solution.
>
> - Steve
>
>> If I have this option set I never want to see a blocker,
>> period. That's not a change in behavior because the mode has to be
>> explicitly selected anyway.
>> 
>>> +                break;
>>> +            case QEMU_OPTION_only_migratable_modes:
>>> +                {
>>> +                    int i, mode;
>>> +                    g_autofree char **words = g_strsplit(optarg, ",", -1);
>>> +                    for (i = 0; words[i]; i++) {
>>> +                        mode = qapi_enum_parse(&MigMode_lookup, words[i], -1,
>>> +                                               &error_fatal);
>>> +                        migration_set_required_mode(mode);
>> 
>> This option can be used to refine the modes being considered, it should
>> take precedence if both are present.
>> 
>>> +                    }
>>> +                }
>>>                   break;
>>>               case QEMU_OPTION_nodefaults:
>>>                   has_defaults = 0;
>>> diff --git a/target/s390x/cpu_models.c b/target/s390x/cpu_models.c
>>> index 8ed3bb6..42ad160 100644
>>> --- a/target/s390x/cpu_models.c
>>> +++ b/target/s390x/cpu_models.c
>>> @@ -16,6 +16,7 @@
>>>   #include "kvm/kvm_s390x.h"
>>>   #include "sysemu/kvm.h"
>>>   #include "sysemu/tcg.h"
>>> +#include "migration/misc.h"
>>>   #include "qapi/error.h"
>>>   #include "qemu/error-report.h"
>>>   #include "qapi/visitor.h"
>>> @@ -526,7 +527,8 @@ static void check_compatibility(const S390CPUModel *max_model,
>>>       }
>>>   
>>>   #ifndef CONFIG_USER_ONLY
>>> -    if (only_migratable && test_bit(S390_FEAT_UNPACK, model->features)) {
>>> +    if (migration_mode_required(MIG_MODE_NORMAL) &&
>>> +        test_bit(S390_FEAT_UNPACK, model->features)) {
>>>           error_setg(errp, "The unpack facility is not compatible with "
>>>                      "the --only-migratable option. You must remove either "
>>>                      "the 'unpack' facility or the --only-migratable option");


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH V1 00/26] Live update: cpr-exec
  2024-04-29 15:55 [PATCH V1 00/26] Live update: cpr-exec Steve Sistare
                   ` (26 preceding siblings ...)
  2024-05-02 16:13 ` cpr-exec doc (was Re: [PATCH V1 00/26] Live update: cpr-exec) Steven Sistare
@ 2024-05-20 18:30 ` Steven Sistare
  2024-05-20 22:28   ` Fabiano Rosas
  2024-05-24 13:02 ` Fabiano Rosas
  2024-05-27 18:07 ` Peter Xu
  29 siblings, 1 reply; 122+ messages in thread
From: Steven Sistare @ 2024-05-20 18:30 UTC (permalink / raw)
  To: qemu-devel, Peter Xu, Fabiano Rosas

Hi Peter, Hi Fabiano,
   Will you have time to review the migration guts of this series any time soon?
In particular:

[PATCH V1 05/26] migration: precreate vmstate
[PATCH V1 06/26] migration: precreate vmstate for exec
[PATCH V1 12/26] migration: vmstate factory object
[PATCH V1 18/26] migration: cpr-exec-args parameter
[PATCH V1 20/26] migration: cpr-exec mode

- Steve

On 4/29/2024 11:55 AM, Steve Sistare wrote:
> This patch series adds the live migration cpr-exec mode.  In this mode, QEMU
> stops the VM, writes VM state to the migration URI, and directly exec's a
> new version of QEMU on the same host, replacing the original process while
> retaining its PID.  Guest RAM is preserved in place, albeit with new virtual
> addresses.  The user completes the migration by specifying the -incoming
> option, and by issuing the migrate-incoming command if necessary.  This
> saves and restores VM state, with minimal guest pause time, so that QEMU may
> be updated to a new version in between.
> 
> The new interfaces are:
>    * cpr-exec (MigMode migration parameter)
>    * cpr-exec-args (migration parameter)
>    * memfd-alloc=on (command-line option for -machine)
>    * only-migratable-modes (command-line argument)
> 
> The caller sets the mode parameter before invoking the migrate command.
> 
> Arguments for the new QEMU process are taken from the cpr-exec-args parameter.
> The first argument should be the path of a new QEMU binary, or a prefix
> command that exec's the new QEMU binary, and the arguments should include
> the -incoming option.
> 
> Memory backend objects must have the share=on attribute, and must be mmap'able
> in the new QEMU process.  For example, memory-backend-file is acceptable,
> but memory-backend-ram is not.
> 
> QEMU must be started with the '-machine memfd-alloc=on' option.  This causes
> implicit RAM blocks (those not explicitly described by a memory-backend
> object) to be allocated by mmap'ing a memfd.  Examples include VGA, ROM,
> and even guest RAM when it is specified without reference to a
> memory-backend object.   The memfds are kept open across exec, their values
> are saved in vmstate which is retrieved after exec, and they are re-mmap'd.
> 
> The '-only-migratable-modes cpr-exec' option guarantees that the
> configuration supports cpr-exec.  QEMU will exit at start time if not.
> 
> Example:
> 
> In this example, we simply restart the same version of QEMU, but in
> a real scenario one would set a new QEMU binary path in cpr-exec-args.
> 
>    # qemu-kvm -monitor stdio -object
>    memory-backend-file,id=ram0,size=4G,mem-path=/dev/shm/ram0,share=on
>    -m 4G -machine memfd-alloc=on ...
> 
>    QEMU 9.1.50 monitor - type 'help' for more information
>    (qemu) info status
>    VM status: running
>    (qemu) migrate_set_parameter mode cpr-exec
>    (qemu) migrate_set_parameter cpr-exec-args qemu-kvm ... -incoming file:vm.state
>    (qemu) migrate -d file:vm.state
>    (qemu) QEMU 9.1.50 monitor - type 'help' for more information
>    (qemu) info status
>    VM status: running
> 
> cpr-exec mode preserves attributes of outgoing devices that must be known
> before the device is created on the incoming side, such as the memfd descriptor
> number, but currently the migration stream is read after all devices are
> created.  To solve this problem, I add two VMStateDescription options:
> precreate and factory.  precreate objects are saved to their own migration
> stream, distinct from the main stream, and are read early by incoming QEMU,
> before devices are created.  Factory objects are allocated on demand, without
> relying on a pre-registered object's opaque address, which is necessary
> because the devices to which the state will apply have not been created yet
> and hence have not registered an opaque address to receive the state.
> 
> This patch series implements a minimal version of cpr-exec.  Future series
> will add support for:
>    * vfio
>    * chardev's without loss of connectivity
>    * vhost
>    * fine-grained seccomp controls
>    * hostmem-memfd
>    * cpr-exec migration test
> 
> 
> Steve Sistare (26):
>    oslib: qemu_clear_cloexec
>    vl: helper to request re-exec
>    migration: SAVEVM_FOREACH
>    migration: delete unused parameter mis
>    migration: precreate vmstate
>    migration: precreate vmstate for exec
>    migration: VMStateId
>    migration: vmstate_info_void_ptr
>    migration: vmstate_register_named
>    migration: vmstate_unregister_named
>    migration: vmstate_register at init time
>    migration: vmstate factory object
>    physmem: ram_block_create
>    physmem: hoist guest_memfd creation
>    physmem: hoist host memory allocation
>    physmem: set ram block idstr earlier
>    machine: memfd-alloc option
>    migration: cpr-exec-args parameter
>    physmem: preserve ram blocks for cpr
>    migration: cpr-exec mode
>    migration: migrate_add_blocker_mode
>    migration: ram block cpr-exec blockers
>    migration: misc cpr-exec blockers
>    seccomp: cpr-exec blocker
>    migration: fix mismatched GPAs during cpr-exec
>    migration: only-migratable-modes
> 
>   accel/xen/xen-all.c            |   5 +
>   backends/hostmem-epc.c         |  12 +-
>   hmp-commands.hx                |   2 +-
>   hw/core/machine.c              |  22 +++
>   hw/core/qdev.c                 |   1 +
>   hw/intc/apic_common.c          |   2 +-
>   hw/vfio/migration.c            |   3 +-
>   include/exec/cpu-common.h      |   3 +-
>   include/exec/memory.h          |  15 ++
>   include/exec/ramblock.h        |  10 +-
>   include/hw/boards.h            |   1 +
>   include/migration/blocker.h    |   7 +
>   include/migration/cpr.h        |  14 ++
>   include/migration/misc.h       |  11 ++
>   include/migration/vmstate.h    | 133 +++++++++++++++-
>   include/qemu/osdep.h           |   9 ++
>   include/sysemu/runstate.h      |   3 +
>   include/sysemu/seccomp.h       |   1 +
>   include/sysemu/sysemu.h        |   1 -
>   migration/cpr.c                | 131 ++++++++++++++++
>   migration/meson.build          |   3 +
>   migration/migration-hmp-cmds.c |  50 +++++-
>   migration/migration.c          |  48 +++++-
>   migration/migration.h          |   5 +-
>   migration/options.c            |  13 ++
>   migration/precreate.c          | 139 +++++++++++++++++
>   migration/ram.c                |  16 +-
>   migration/savevm.c             | 306 +++++++++++++++++++++++++++++-------
>   migration/savevm.h             |   3 +
>   migration/trace-events         |   7 +
>   migration/vmstate-factory.c    |  78 ++++++++++
>   migration/vmstate-types.c      |  24 +++
>   migration/vmstate.c            |   3 +-
>   qapi/migration.json            |  48 +++++-
>   qemu-options.hx                |  22 ++-
>   replay/replay.c                |   6 +
>   stubs/migr-blocker.c           |   5 +
>   stubs/vmstate.c                |  13 ++
>   system/globals.c               |   1 -
>   system/memory.c                |  19 ++-
>   system/physmem.c               | 346 +++++++++++++++++++++++++++--------------
>   system/qemu-seccomp.c          |  10 +-
>   system/runstate.c              |  29 ++++
>   system/trace-events            |   4 +
>   system/vl.c                    |  26 +++-
>   target/s390x/cpu_models.c      |   4 +-
>   util/oslib-posix.c             |   9 ++
>   util/oslib-win32.c             |   4 +
>   48 files changed, 1417 insertions(+), 210 deletions(-)
>   create mode 100644 include/migration/cpr.h
>   create mode 100644 migration/cpr.c
>   create mode 100644 migration/precreate.c
>   create mode 100644 migration/vmstate-factory.c
> 
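
For readers following the quoted cover letter: the memfds survive exec only
if their close-on-exec flag is cleared first. A minimal sketch of that step
using plain fcntl() is below. The series adds qemu_clear_cloexec for this,
but its body is not shown in this message, so the sketch_* helpers are
illustrative assumptions rather than the patch's code.

```c
#include <assert.h>
#include <fcntl.h>

/*
 * Clear FD_CLOEXEC so that fd stays open across exec.
 * Returns 0 on success, -1 on failure with errno set by fcntl().
 */
static int sketch_clear_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags < 0) {
        return -1;
    }
    return fcntl(fd, F_SETFD, flags & ~FD_CLOEXEC) < 0 ? -1 : 0;
}

/* Return 1 if exec would close fd, 0 if it would not, -1 if fd is invalid. */
static int sketch_is_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    return flags < 0 ? -1 : !!(flags & FD_CLOEXEC);
}
```

Doing this only immediately before the exec keeps the descriptors from
leaking into unrelated fork and exec calls made during normal operation.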


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH V1 00/26] Live update: cpr-exec
  2024-05-20 18:30 ` [PATCH V1 00/26] Live update: cpr-exec Steven Sistare
@ 2024-05-20 22:28   ` Fabiano Rosas
  2024-05-21  2:31     ` Peter Xu
  0 siblings, 1 reply; 122+ messages in thread
From: Fabiano Rosas @ 2024-05-20 22:28 UTC (permalink / raw)
  To: Steven Sistare, qemu-devel, Peter Xu

Steven Sistare <steven.sistare@oracle.com> writes:

> Hi Peter, Hi Fabiano,
>    Will you have time to review the migration guts of this series any time soon?
> In particular:
>
> [PATCH V1 05/26] migration: precreate vmstate
> [PATCH V1 06/26] migration: precreate vmstate for exec
> [PATCH V1 12/26] migration: vmstate factory object
> [PATCH V1 18/26] migration: cpr-exec-args parameter
> [PATCH V1 20/26] migration: cpr-exec mode
>

I'll get to them this week. I'm trying to make some progress with my own
code before I forget how to program. I'm also trying to find some time
to implement the device options in the migration tests so we can stop
these virtio-* breakages that have been popping up.


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH V1 00/26] Live update: cpr-exec
  2024-05-20 22:28   ` Fabiano Rosas
@ 2024-05-21  2:31     ` Peter Xu
  2024-05-21 11:46       ` Steven Sistare
  0 siblings, 1 reply; 122+ messages in thread
From: Peter Xu @ 2024-05-21  2:31 UTC (permalink / raw)
  To: Fabiano Rosas; +Cc: Steven Sistare, QEMU Developers

[-- Attachment #1: Type: text/plain, Size: 1083 bytes --]

I was at a conference and then on PTO until today, so tomorrow will be my
first working day after those. Sorry Steve, I will try my best to read it
before next week. I haven't dared to read much of my inbox yet.  A bit
scared, but I need to face it tomorrow.

On Mon, May 20, 2024, 6:28 p.m. Fabiano Rosas <farosas@suse.de> wrote:

> Steven Sistare <steven.sistare@oracle.com> writes:
>
> > Hi Peter, Hi Fabiano,
> >    Will you have time to review the migration guts of this series any
> time soon?
> > In particular:
> >
> > [PATCH V1 05/26] migration: precreate vmstate
> > [PATCH V1 06/26] migration: precreate vmstate for exec
> > [PATCH V1 12/26] migration: vmstate factory object
> > [PATCH V1 18/26] migration: cpr-exec-args parameter
> > [PATCH V1 20/26] migration: cpr-exec mode
> >
>
> I'll get to them this week. I'm trying to make some progress with my own
> code before I forget how to program. I'm also trying to find some time
> to implement the device options in the migration tests so we can stop
> these virtio-* breakages that have been popping up.
>
>

[-- Attachment #2: Type: text/html, Size: 1503 bytes --]

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH V1 24/26] seccomp: cpr-exec blocker
  2024-05-13 19:29     ` Steven Sistare
@ 2024-05-21  7:14       ` Daniel P. Berrangé
  0 siblings, 0 replies; 122+ messages in thread
From: Daniel P. Berrangé @ 2024-05-21  7:14 UTC (permalink / raw)
  To: Steven Sistare
  Cc: qemu-devel, Peter Xu, Fabiano Rosas, David Hildenbrand,
	Igor Mammedov, Eduardo Habkost, Marcel Apfelbaum,
	Philippe Mathieu-Daude, Paolo Bonzini, Markus Armbruster

On Mon, May 13, 2024 at 03:29:48PM -0400, Steven Sistare wrote:
> On 5/10/2024 3:54 AM, Daniel P. Berrangé wrote:
> > On Mon, Apr 29, 2024 at 08:55:33AM -0700, Steve Sistare wrote:
> > > cpr-exec mode needs permission to exec.  Block it if permission is denied.
> > > 
> > > Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> > > ---
> > >   include/sysemu/seccomp.h |  1 +
> > >   system/qemu-seccomp.c    | 10 ++++++++--
> > >   system/vl.c              |  6 ++++++
> > >   3 files changed, 15 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/include/sysemu/seccomp.h b/include/sysemu/seccomp.h
> > > index fe85989..023c0a1 100644
> > > --- a/include/sysemu/seccomp.h
> > > +++ b/include/sysemu/seccomp.h
> > > @@ -22,5 +22,6 @@
> > >   #define QEMU_SECCOMP_SET_RESOURCECTL (1 << 4)
> > >   int parse_sandbox(void *opaque, QemuOpts *opts, Error **errp);
> > > +uint32_t qemu_seccomp_get_opts(void);
> > >   #endif
> > > diff --git a/system/qemu-seccomp.c b/system/qemu-seccomp.c
> > > index 5c20ac0..0d2a561 100644
> > > --- a/system/qemu-seccomp.c
> > > +++ b/system/qemu-seccomp.c
> > > @@ -360,12 +360,18 @@ static int seccomp_start(uint32_t seccomp_opts, Error **errp)
> > >       return rc < 0 ? -1 : 0;
> > >   }
> > > +static uint32_t seccomp_opts;
> > > +
> > > +uint32_t qemu_seccomp_get_opts(void)
> > > +{
> > > +    return seccomp_opts;
> > > +}
> > > +
> > >   int parse_sandbox(void *opaque, QemuOpts *opts, Error **errp)
> > >   {
> > >       if (qemu_opt_get_bool(opts, "enable", false)) {
> > > -        uint32_t seccomp_opts = QEMU_SECCOMP_SET_DEFAULT
> > > -                | QEMU_SECCOMP_SET_OBSOLETE;
> > >           const char *value = NULL;
> > > +        seccomp_opts = QEMU_SECCOMP_SET_DEFAULT | QEMU_SECCOMP_SET_OBSOLETE;
> > >           value = qemu_opt_get(opts, "obsolete");
> > >           if (value) {
> > > diff --git a/system/vl.c b/system/vl.c
> > > index 7252100..b76881e 100644
> > > --- a/system/vl.c
> > > +++ b/system/vl.c
> > > @@ -76,6 +76,7 @@
> > >   #include "hw/block/block.h"
> > >   #include "hw/i386/x86.h"
> > >   #include "hw/i386/pc.h"
> > > +#include "migration/blocker.h"
> > >   #include "migration/cpr.h"
> > >   #include "migration/misc.h"
> > >   #include "migration/snapshot.h"
> > > @@ -2493,6 +2494,11 @@ static void qemu_process_early_options(void)
> > >       QemuOptsList *olist = qemu_find_opts_err("sandbox", NULL);
> > >       if (olist) {
> > >           qemu_opts_foreach(olist, parse_sandbox, NULL, &error_fatal);
> > > +        if (qemu_seccomp_get_opts() & QEMU_SECCOMP_SET_SPAWN) {
> > > +            Error *blocker = NULL;
> > > +            error_setg(&blocker, "-sandbox denies exec for cpr-exec");
> > > +            migrate_add_blocker_mode(&blocker, MIG_MODE_CPR_EXEC, &error_fatal);
> > > +        }
> > >       }
> > >   #endif
> > 
> > There are a whole pile of features that get blocked when -sandbox is
> > used. I'm not convinced we should be adding code to check for specific
> > blocked features, as such a list will always be incomplete at best, and
> > incorrectly block things at worst.
> > 
> > I view this primarily as a documentation task for the cpr-exec command.
> 
> For cpr and live migration, we do our best to prevent breaking the guest
> for cases we know will fail.  Independently, a clear error message here
> will reduce error reports for this new cpr feature.

I would expect the QMP command that triggers the sandbox to report a
clear error when getting EPERM.

> Would it be more palatable if I move this blocker's creation to cpr_mig_init?

Not particularly, as I don't think other code in QEMU should be looking
at the sandbox impl.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH V1 26/26] migration: only-migratable-modes
  2024-04-29 15:55 ` [PATCH V1 26/26] migration: only-migratable-modes Steve Sistare
  2024-05-09 19:14   ` Fabiano Rosas
@ 2024-05-21  8:05   ` Daniel P. Berrangé
  1 sibling, 0 replies; 122+ messages in thread
From: Daniel P. Berrangé @ 2024-05-21  8:05 UTC (permalink / raw)
  To: Steve Sistare
  Cc: qemu-devel, Peter Xu, Fabiano Rosas, David Hildenbrand,
	Igor Mammedov, Eduardo Habkost, Marcel Apfelbaum,
	Philippe Mathieu-Daude, Paolo Bonzini, Markus Armbruster

On Mon, Apr 29, 2024 at 08:55:35AM -0700, Steve Sistare wrote:
> Add the only-migratable-modes option as a generalization of only-migratable.
> Only devices that support all requested modes are allowed.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  include/migration/misc.h       |  3 +++
>  include/sysemu/sysemu.h        |  1 -
>  migration/migration-hmp-cmds.c | 26 +++++++++++++++++++++++++-
>  migration/migration.c          | 22 +++++++++++++++++-----
>  migration/savevm.c             |  2 +-
>  qemu-options.hx                | 16 ++++++++++++++--
>  system/globals.c               |  1 -
>  system/vl.c                    | 13 ++++++++++++-
>  target/s390x/cpu_models.c      |  4 +++-
>  9 files changed, 75 insertions(+), 13 deletions(-)

> diff --git a/qemu-options.hx b/qemu-options.hx
> index f0dfda5..946d731 100644
> --- a/qemu-options.hx
> +++ b/qemu-options.hx
> @@ -4807,8 +4807,20 @@ DEF("only-migratable", 0, QEMU_OPTION_only_migratable, \
>      "-only-migratable     allow only migratable devices\n", QEMU_ARCH_ALL)
>  SRST
>  ``-only-migratable``
> -    Only allow migratable devices. Devices will not be allowed to enter
> -    an unmigratable state.
> +    Only allow devices that can migrate using normal mode. Devices will not
> +    be allowed to enter an unmigratable state.
> +ERST
> +
> +DEF("only-migratable-modes", HAS_ARG, QEMU_OPTION_only_migratable_modes, \
> +    "-only-migratable-modes mode1[,...]\n"
> +    "                allow only devices that are migratable using mode(s)\n",
> +    QEMU_ARCH_ALL)
> +SRST
> +``-only-migratable-modes mode1[,...]``
> +    Only allow devices which are migratable using all modes in the list,
> +    which guarantees that migration will not fail due to a blocker.
> +    If both only-migratable-modes and only-migratable are specified,
> +    or are specified multiple times, then the required modes accumulate.
>  ERST

Adding new top-level CLI options is not something we much like doing
these days. Also, it is preferable to define args using QAPI rather
than creating hand-written parsers.

The pre-existing -only-migratable flag isn't ideal either, as a random
top level flag on its own.

I tend to think we should probably make both these arguments become
properties in the Machine class, and thus settable with the existing
-machine argument, since they're describing a requirement of the
machine and devices added to it.

The existing -only-migratable can be deprecated and simply made to
set the corresponding machine property.

> diff --git a/system/globals.c b/system/globals.c
> index e353584..fdc263e 100644
> --- a/system/globals.c
> +++ b/system/globals.c
> @@ -48,7 +48,6 @@ const char *qemu_name;
>  unsigned int nb_prom_envs;
>  const char *prom_envs[MAX_PROM_ENVS];
>  uint8_t *boot_splash_filedata;
> -int only_migratable; /* turn it off unless user states otherwise */
>  int icount_align_option;
>  
>  /* The bytes in qemu_uuid are in the order specified by RFC4122, _not_ in the
> diff --git a/system/vl.c b/system/vl.c
> index b76881e..7e73be9 100644
> --- a/system/vl.c
> +++ b/system/vl.c
> @@ -3458,7 +3458,18 @@ void qemu_init(int argc, char **argv)
>                  incoming = optarg;
>                  break;
>              case QEMU_OPTION_only_migratable:
> -                only_migratable = 1;
> +                migration_set_required_mode(MIG_MODE_NORMAL);
> +                break;
> +            case QEMU_OPTION_only_migratable_modes:
> +                {
> +                    int i, mode;
> +                    g_autofree char **words = g_strsplit(optarg, ",", -1);
> +                    for (i = 0; words[i]; i++) {
> +                        mode = qapi_enum_parse(&MigMode_lookup, words[i], -1,
> +                                               &error_fatal);
> +                        migration_set_required_mode(mode);
> +                    }
> +                }
>                  break;
>              case QEMU_OPTION_nodefaults:
>                  has_defaults = 0;
> diff --git a/target/s390x/cpu_models.c b/target/s390x/cpu_models.c
> index 8ed3bb6..42ad160 100644
> --- a/target/s390x/cpu_models.c
> +++ b/target/s390x/cpu_models.c
> @@ -16,6 +16,7 @@
>  #include "kvm/kvm_s390x.h"
>  #include "sysemu/kvm.h"
>  #include "sysemu/tcg.h"
> +#include "migration/misc.h"
>  #include "qapi/error.h"
>  #include "qemu/error-report.h"
>  #include "qapi/visitor.h"
> @@ -526,7 +527,8 @@ static void check_compatibility(const S390CPUModel *max_model,
>      }
>  
>  #ifndef CONFIG_USER_ONLY
> -    if (only_migratable && test_bit(S390_FEAT_UNPACK, model->features)) {
> +    if (migration_mode_required(MIG_MODE_NORMAL) &&
> +        test_bit(S390_FEAT_UNPACK, model->features)) {
>          error_setg(errp, "The unpack facility is not compatible with "
>                     "the --only-migratable option. You must remove either "
>                     "the 'unpack' facility or the --only-migratable option");
> -- 
> 1.8.3.1
> 

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH V1 18/26] migration: cpr-exec-args parameter
  2024-04-29 15:55 ` [PATCH V1 18/26] migration: cpr-exec-args parameter Steve Sistare
  2024-05-02 12:23   ` Markus Armbruster
@ 2024-05-21  8:13   ` Daniel P. Berrangé
  1 sibling, 0 replies; 122+ messages in thread
From: Daniel P. Berrangé @ 2024-05-21  8:13 UTC (permalink / raw)
  To: Steve Sistare
  Cc: qemu-devel, Peter Xu, Fabiano Rosas, David Hildenbrand,
	Igor Mammedov, Eduardo Habkost, Marcel Apfelbaum,
	Philippe Mathieu-Daude, Paolo Bonzini, Markus Armbruster

On Mon, Apr 29, 2024 at 08:55:27AM -0700, Steve Sistare wrote:
> Create the cpr-exec-args migration parameter, defined as a list of
> strings.  It will be used for cpr-exec migration mode in a subsequent
> patch.
> 
> No functional change, except that cpr-exec-args is shown by the
> 'info migrate' command.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  hmp-commands.hx                |  2 +-
>  migration/migration-hmp-cmds.c | 24 ++++++++++++++++++++++++
>  migration/options.c            | 13 +++++++++++++
>  qapi/migration.json            | 18 +++++++++++++++---
>  4 files changed, 53 insertions(+), 4 deletions(-)
> 
> diff --git a/hmp-commands.hx b/hmp-commands.hx
> index 2e2a3bc..39954ae 100644
> --- a/hmp-commands.hx
> +++ b/hmp-commands.hx
> @@ -1012,7 +1012,7 @@ ERST
>  
>      {
>          .name       = "migrate_set_parameter",
> -        .args_type  = "parameter:s,value:s",
> +        .args_type  = "parameter:s,value:S",
>          .params     = "parameter value",
>          .help       = "Set the parameter for migration",
>          .cmd        = hmp_migrate_set_parameter,
> diff --git a/migration/migration-hmp-cmds.c b/migration/migration-hmp-cmds.c
> index 7e96ae6..414c7e8 100644
> --- a/migration/migration-hmp-cmds.c
> +++ b/migration/migration-hmp-cmds.c
> @@ -255,6 +255,18 @@ void hmp_info_migrate_capabilities(Monitor *mon, const QDict *qdict)
>      qapi_free_MigrationCapabilityStatusList(caps);
>  }
>  
> +static void monitor_print_cpr_exec_args(Monitor *mon, strList *args)
> +{
> +    monitor_printf(mon, "%s:",
> +        MigrationParameter_str(MIGRATION_PARAMETER_CPR_EXEC_ARGS));
> +
> +    while (args) {
> +        monitor_printf(mon, " %s", args->value);
> +        args = args->next;
> +    }
> +    monitor_printf(mon, "\n");
> +}
> +
>  void hmp_info_migrate_parameters(Monitor *mon, const QDict *qdict)
>  {
>      MigrationParameters *params;
> @@ -397,6 +409,8 @@ void hmp_info_migrate_parameters(Monitor *mon, const QDict *qdict)
>          monitor_printf(mon, "%s: %s\n",
>              MigrationParameter_str(MIGRATION_PARAMETER_MODE),
>              qapi_enum_lookup(&MigMode_lookup, params->mode));
> +        assert(params->has_cpr_exec_args);
> +        monitor_print_cpr_exec_args(mon, params->cpr_exec_args);
>      }
>  
>      qapi_free_MigrationParameters(params);
> @@ -690,6 +704,16 @@ void hmp_migrate_set_parameter(Monitor *mon, const QDict *qdict)
>          p->has_mode = true;
>          visit_type_MigMode(v, param, &p->mode, &err);
>          break;
> +    case MIGRATION_PARAMETER_CPR_EXEC_ARGS: {
> +        g_autofree char **strv = g_strsplit(valuestr ?: "", " ", -1);

Splitting on whitespace means it'll break with any arguments containing
quoted whitespace. If we use g_shell_parse_argv then it should support
quoting in the normal shell manner.
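
To illustrate the difference, here is a small self-contained sketch. It is
not g_shell_parse_argv and does not use GLib. It is a simplified, hypothetical
tokenizer (no backslash escapes) that only shows why splitting at every
space gives the wrong argument count once a value contains quoted whitespace.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_MAX_ARGS 16
#define SKETCH_MAX_LEN  128

/* Number of fields produced by splitting at every single space. */
static int sketch_count_space_fields(const char *line)
{
    int fields = (*line != '\0');

    for (const char *p = strchr(line, ' '); p; p = strchr(p + 1, ' ')) {
        fields++;
    }
    return fields;
}

/*
 * Split line into words on spaces, treating text inside '...' or "..."
 * as part of the current word and dropping the quote characters.
 * Returns the number of words, or -1 on an unterminated quote or overflow.
 */
static int sketch_split_quoted(const char *line,
                               char words[SKETCH_MAX_ARGS][SKETCH_MAX_LEN])
{
    int nwords = 0;
    size_t len = 0;
    char quote = '\0';          /* active quote character, or 0 if none */
    int in_word = 0;

    assert(line != NULL);
    for (const char *p = line; ; p++) {
        char c = *p;

        if (quote) {
            if (c == '\0') {
                return -1;      /* unterminated quote */
            }
            if (c == quote) {
                quote = '\0';   /* closing quote, stay in the same word */
                continue;
            }
        } else if (c == '\'' || c == '"') {
            quote = c;          /* opening quote starts or extends a word */
            in_word = 1;
            continue;
        } else if (c == ' ' || c == '\0') {
            if (in_word) {
                if (nwords >= SKETCH_MAX_ARGS) {
                    return -1;
                }
                words[nwords++][len] = '\0';
                len = 0;
                in_word = 0;
            }
            if (c == '\0') {
                return nwords;
            }
            continue;
        }

        /* Ordinary character (or quoted space): append to current word. */
        if (nwords >= SKETCH_MAX_ARGS || len + 1 >= SKETCH_MAX_LEN) {
            return -1;
        }
        words[nwords][len++] = c;
        in_word = 1;
    }
}
```

For an input such as -name "my vm" -incoming file:vm.state, splitting at
spaces yields five fields with the quotes left in place, while the
quote-aware pass yields four words, the second being my vm.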

> +        strList **tail = &p->cpr_exec_args;
> +
> +        for (int i = 0; strv[i]; i++) {
> +            QAPI_LIST_APPEND(tail, strv[i]);
> +        }
> +        p->has_cpr_exec_args = true;
> +        break;
> +    }
>      default:
>          assert(0);
>      }

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH V1 20/26] migration: cpr-exec mode
  2024-04-29 15:55 ` [PATCH V1 20/26] migration: cpr-exec mode Steve Sistare
  2024-05-02 12:23   ` Markus Armbruster
@ 2024-05-21  8:20   ` Daniel P. Berrangé
  2024-05-24 14:58   ` Fabiano Rosas
  2 siblings, 0 replies; 122+ messages in thread
From: Daniel P. Berrangé @ 2024-05-21  8:20 UTC (permalink / raw)
  To: Steve Sistare
  Cc: qemu-devel, Peter Xu, Fabiano Rosas, David Hildenbrand,
	Igor Mammedov, Eduardo Habkost, Marcel Apfelbaum,
	Philippe Mathieu-Daude, Paolo Bonzini, Markus Armbruster

On Mon, Apr 29, 2024 at 08:55:29AM -0700, Steve Sistare wrote:
> Add the cpr-exec migration mode.  Usage:
>   qemu-system-$arch -machine memfd-alloc=on ...
>   migrate_set_parameter mode cpr-exec
>   migrate_set_parameter cpr-exec-args \
>     <arg1> <arg2> ... -incoming <uri>
>   migrate -d <uri>
> 
> The migrate command stops the VM, saves state to the URI,
> directly exec's a new version of QEMU on the same host,
> replacing the original process while retaining its PID, and
> loads state from the URI.  Guest RAM is preserved in place,
> albeit with new virtual addresses.
> 
> Arguments for the new QEMU process are taken from the
> @cpr-exec-args parameter.  The first argument should be the
> path of a new QEMU binary, or a prefix command that exec's the
> new QEMU binary.
> 
> Because old QEMU terminates when new QEMU starts, one cannot
> stream data between the two, so the URI must be a type, such as
> a file, that reads all data before old QEMU exits.
> 
> Memory backend objects must have the share=on attribute, and
> must be mmap'able in the new QEMU process.  For example,
> memory-backend-file is acceptable, but memory-backend-ram is
> not.
> 
> The VM must be started with the '-machine memfd-alloc=on'
> option.  This causes implicit ram blocks (those not explicitly
> described by a memory-backend object) to be allocated by
> mmap'ing a memfd.  Examples include VGA, ROM, and even guest
> RAM when it is specified without a memory-backend object.
> 
> The implementation saves precreate vmstate at the end of normal
> migration in migrate_fd_cleanup, and tells the main loop to call
> cpr_exec.  Incoming qemu loads precreate state early, before objects
> are created.  The memfds are kept open across exec by clearing the
> close-on-exec flag, their values are saved in precreate vmstate,
> and they are mmap'd in new qemu.
> 
> Note that the memfd-alloc option is not related to memory-backend-memfd.
> Later patches add support for memory-backend-memfd, and for additional
> devices, including vfio, chardev, and more.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  include/migration/cpr.h  |  14 +++++
>  include/migration/misc.h |   3 ++
>  migration/cpr.c          | 131 +++++++++++++++++++++++++++++++++++++++++++++++
>  migration/meson.build    |   1 +
>  migration/migration.c    |  21 ++++++++
>  migration/migration.h    |   5 +-
>  migration/ram.c          |   1 +
>  qapi/migration.json      |  30 ++++++++++-
>  system/physmem.c         |   2 +
>  system/vl.c              |   4 ++
>  10 files changed, 210 insertions(+), 2 deletions(-)
>  create mode 100644 include/migration/cpr.h
>  create mode 100644 migration/cpr.c
> 

> +
> +void cpr_exec(char **argv)
> +{
> +    MigrationState *s = migrate_get_current();
> +    Error *err = NULL;
> +
> +    /*
> +     * Clear the close-on-exec flag for all preserved fd's.  We cannot do so
> +     * earlier because they should not persist across miscellaneous fork and
> +     * exec calls that are performed during normal operation.
> +     */
> +    cpr_preserve_fds();
> +
> +    execvp(argv[0], argv);
> +
> +    error_setg_errno(&err, errno, "execvp %s failed", argv[0]);

This is where you could give a more direct message about the sandbox.
eg

   if (errno == EPERM) {
      error_append_hint(&err, "sandbox is blocking ability to exec\n");
   }

this would also benefit the case where an external sandbox is
used, rather than qemu's built-in sandbox.
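
As a standalone illustration of the same idea outside QEMU's Error API, the
sketch below formats an exec failure message and adds a sandbox hint only
for EPERM. The function name and message wording are hypothetical and are
not taken from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Format a message for a failed exec of path into buf.  errnum is the
 * errno value left by the failed call.  For EPERM, append a hint that a
 * sandbox (built-in or external) may be denying exec.
 * Returns the value of snprintf().
 */
static int sketch_exec_error(char *buf, size_t size,
                             const char *path, int errnum)
{
    const char *hint = "";

    assert(buf != NULL && path != NULL);
    if (errnum == EPERM) {
        hint = " (a sandbox may be blocking the ability to exec)";
    }
    return snprintf(buf, size, "exec of %s failed: %s%s",
                    path, strerror(errnum), hint);
}
```

Because the check keys off errno at the failing call, it does not need to
know which sandbox implementation denied the request.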

> +    error_report_err(err);
> +    migrate_set_state(&s->state, s->state, MIGRATION_STATUS_FAILED);
> +    migrate_set_error(s, err);
> +    migration_precreate_unsave();
> +}

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH V1 00/26] Live update: cpr-exec
  2024-05-21  2:31     ` Peter Xu
@ 2024-05-21 11:46       ` Steven Sistare
  2024-05-27 17:45         ` Peter Xu
  0 siblings, 1 reply; 122+ messages in thread
From: Steven Sistare @ 2024-05-21 11:46 UTC (permalink / raw)
  To: Peter Xu, Fabiano Rosas; +Cc: QEMU Developers

I understand, thanks.  If I can help with any of your todo list,
just ask - steve

On 5/20/2024 10:31 PM, Peter Xu wrote:
> Conference, then PTO until today, so tomorrow will be my first working day 
> after those. Sorry Steve, will try my best to read it before next week. I haven't 
> dared to read much of my inbox yet.  A bit scared but need to face it tomorrow.
> 
> On Mon, May 20, 2024, 6:28 p.m. Fabiano Rosas <farosas@suse.de 
> <mailto:farosas@suse.de>> wrote:
> 
>     Steven Sistare <steven.sistare@oracle.com
>     <mailto:steven.sistare@oracle.com>> writes:
> 
>      > Hi Peter, Hi Fabiano,
>      >    Will you have time to review the migration guts of this series any
>     time soon?
>      > In particular:
>      >
>      > [PATCH V1 05/26] migration: precreate vmstate
>      > [PATCH V1 06/26] migration: precreate vmstate for exec
>      > [PATCH V1 12/26] migration: vmstate factory object
>      > [PATCH V1 18/26] migration: cpr-exec-args parameter
>      > [PATCH V1 20/26] migration: cpr-exec mode
>      >
> 
>     I'll get to them this week. I'm trying to make some progress with my own
>     code before I forget how to program. I'm also trying to find some time
>     to implement the device options in the migration tests so we can stop
>     these virtio-* breakages that have been popping up.
> 



* Re: [PATCH V1 23/26] migration: misc cpr-exec blockers
  2024-04-29 15:55 ` [PATCH V1 23/26] migration: misc " Steve Sistare
  2024-05-09 18:05   ` Fabiano Rosas
@ 2024-05-24 12:40   ` Fabiano Rosas
  2024-05-27 19:02     ` Steven Sistare via
  1 sibling, 1 reply; 122+ messages in thread
From: Fabiano Rosas @ 2024-05-24 12:40 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel
  Cc: Peter Xu, David Hildenbrand, Igor Mammedov, Eduardo Habkost,
	Marcel Apfelbaum, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster, Steve Sistare

Steve Sistare <steven.sistare@oracle.com> writes:

> Add blockers for cpr-exec migration mode for devices and options that do
> not support it.
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  accel/xen/xen-all.c    |  5 +++++
>  backends/hostmem-epc.c | 12 ++++++++++--
>  hw/vfio/migration.c    |  3 ++-
>  replay/replay.c        |  6 ++++++
>  4 files changed, 23 insertions(+), 3 deletions(-)
>
> diff --git a/accel/xen/xen-all.c b/accel/xen/xen-all.c
> index 0bdefce..9a7ed0f 100644
> --- a/accel/xen/xen-all.c
> +++ b/accel/xen/xen-all.c

This file is missing the migration/blocker.h include.




* Re: [PATCH V1 00/26] Live update: cpr-exec
  2024-04-29 15:55 [PATCH V1 00/26] Live update: cpr-exec Steve Sistare
                   ` (27 preceding siblings ...)
  2024-05-20 18:30 ` [PATCH V1 00/26] Live update: cpr-exec Steven Sistare
@ 2024-05-24 13:02 ` Fabiano Rosas
  2024-05-24 14:07   ` Steven Sistare
  2024-05-27 18:07 ` Peter Xu
  29 siblings, 1 reply; 122+ messages in thread
From: Fabiano Rosas @ 2024-05-24 13:02 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel
  Cc: Peter Xu, David Hildenbrand, Igor Mammedov, Eduardo Habkost,
	Marcel Apfelbaum, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster, Steve Sistare

Steve Sistare <steven.sistare@oracle.com> writes:

> This patch series adds the live migration cpr-exec mode.  In this mode, QEMU
> stops the VM, writes VM state to the migration URI, and directly exec's a
> new version of QEMU on the same host, replacing the original process while
> retaining its PID.  Guest RAM is preserved in place, albeit with new virtual
> addresses.  The user completes the migration by specifying the -incoming
> option, and by issuing the migrate-incoming command if necessary.  This
> saves and restores VM state, with minimal guest pause time, so that QEMU may
> be updated to a new version in between.
>
> The new interfaces are:
>   * cpr-exec (MigMode migration parameter)
>   * cpr-exec-args (migration parameter)
>   * memfd-alloc=on (command-line option for -machine)
>   * only-migratable-modes (command-line argument)
>
> The caller sets the mode parameter before invoking the migrate command.
>
> Arguments for the new QEMU process are taken from the cpr-exec-args parameter.
> The first argument should be the path of a new QEMU binary, or a prefix
> command that exec's the new QEMU binary, and the arguments should include
> the -incoming option.
>
> Memory backend objects must have the share=on attribute, and must be mmap'able
> in the new QEMU process.  For example, memory-backend-file is acceptable,
> but memory-backend-ram is not.
>
> QEMU must be started with the '-machine memfd-alloc=on' option.  This causes
> implicit RAM blocks (those not explicitly described by a memory-backend
> object) to be allocated by mmap'ing a memfd.  Examples include VGA, ROM,
> and even guest RAM when it is specified without reference to a
> memory-backend object.   The memfds are kept open across exec, their values
> are saved in vmstate which is retrieved after exec, and they are re-mmap'd.
>
> The '-only-migratable-modes cpr-exec' option guarantees that the
> configuration supports cpr-exec.  QEMU will exit at start time if not.
>
> Example:
>
> In this example, we simply restart the same version of QEMU, but in
> a real scenario one would set a new QEMU binary path in cpr-exec-args.
>
>   # qemu-kvm -monitor stdio -object
>   memory-backend-file,id=ram0,size=4G,mem-path=/dev/shm/ram0,share=on
>   -m 4G -machine memfd-alloc=on ...
>
>   QEMU 9.1.50 monitor - type 'help' for more information
>   (qemu) info status
>   VM status: running
>   (qemu) migrate_set_parameter mode cpr-exec
>   (qemu) migrate_set_parameter cpr-exec-args qemu-kvm ... -incoming file:vm.state
>   (qemu) migrate -d file:vm.state
>   (qemu) QEMU 9.1.50 monitor - type 'help' for more information
>   (qemu) info status
>   VM status: running
>
> cpr-exec mode preserves attributes of outgoing devices that must be known
> before the device is created on the incoming side, such as the memfd descriptor
> number, but currently the migration stream is read after all devices are
> created.  To solve this problem, I add two VMStateDescription options:
> precreate and factory.  precreate objects are saved to their own migration
> stream, distinct from the main stream, and are read early by incoming QEMU,
> before devices are created.  Factory objects are allocated on demand, without
> relying on a pre-registered object's opaque address, which is necessary
> because the devices to which the state will apply have not been created yet
> and hence have not registered an opaque address to receive the state.
>
> This patch series implements a minimal version of cpr-exec.  Future series
> will add support for:
>   * vfio
>   * chardev's without loss of connectivity
>   * vhost
>   * fine-grained seccomp controls
>   * hostmem-memfd
>   * cpr-exec migration test
>
>
> Steve Sistare (26):
>   oslib: qemu_clear_cloexec
>   vl: helper to request re-exec
>   migration: SAVEVM_FOREACH
>   migration: delete unused parameter mis
>   migration: precreate vmstate
>   migration: precreate vmstate for exec
>   migration: VMStateId
>   migration: vmstate_info_void_ptr
>   migration: vmstate_register_named
>   migration: vmstate_unregister_named
>   migration: vmstate_register at init time
>   migration: vmstate factory object
>   physmem: ram_block_create
>   physmem: hoist guest_memfd creation
>   physmem: hoist host memory allocation
>   physmem: set ram block idstr earlier
>   machine: memfd-alloc option
>   migration: cpr-exec-args parameter
>   physmem: preserve ram blocks for cpr
>   migration: cpr-exec mode
>   migration: migrate_add_blocker_mode
>   migration: ram block cpr-exec blockers
>   migration: misc cpr-exec blockers
>   seccomp: cpr-exec blocker
>   migration: fix mismatched GPAs during cpr-exec
>   migration: only-migratable-modes
>
>  accel/xen/xen-all.c            |   5 +
>  backends/hostmem-epc.c         |  12 +-
>  hmp-commands.hx                |   2 +-
>  hw/core/machine.c              |  22 +++
>  hw/core/qdev.c                 |   1 +
>  hw/intc/apic_common.c          |   2 +-
>  hw/vfio/migration.c            |   3 +-
>  include/exec/cpu-common.h      |   3 +-
>  include/exec/memory.h          |  15 ++
>  include/exec/ramblock.h        |  10 +-
>  include/hw/boards.h            |   1 +
>  include/migration/blocker.h    |   7 +
>  include/migration/cpr.h        |  14 ++
>  include/migration/misc.h       |  11 ++
>  include/migration/vmstate.h    | 133 +++++++++++++++-
>  include/qemu/osdep.h           |   9 ++
>  include/sysemu/runstate.h      |   3 +
>  include/sysemu/seccomp.h       |   1 +
>  include/sysemu/sysemu.h        |   1 -
>  migration/cpr.c                | 131 ++++++++++++++++
>  migration/meson.build          |   3 +
>  migration/migration-hmp-cmds.c |  50 +++++-
>  migration/migration.c          |  48 +++++-
>  migration/migration.h          |   5 +-
>  migration/options.c            |  13 ++
>  migration/precreate.c          | 139 +++++++++++++++++
>  migration/ram.c                |  16 +-
>  migration/savevm.c             | 306 +++++++++++++++++++++++++++++-------
>  migration/savevm.h             |   3 +
>  migration/trace-events         |   7 +
>  migration/vmstate-factory.c    |  78 ++++++++++
>  migration/vmstate-types.c      |  24 +++
>  migration/vmstate.c            |   3 +-
>  qapi/migration.json            |  48 +++++-
>  qemu-options.hx                |  22 ++-
>  replay/replay.c                |   6 +
>  stubs/migr-blocker.c           |   5 +
>  stubs/vmstate.c                |  13 ++
>  system/globals.c               |   1 -
>  system/memory.c                |  19 ++-
>  system/physmem.c               | 346 +++++++++++++++++++++++++++--------------
>  system/qemu-seccomp.c          |  10 +-
>  system/runstate.c              |  29 ++++
>  system/trace-events            |   4 +
>  system/vl.c                    |  26 +++-
>  target/s390x/cpu_models.c      |   4 +-
>  util/oslib-posix.c             |   9 ++
>  util/oslib-win32.c             |   4 +
>  48 files changed, 1417 insertions(+), 210 deletions(-)
>  create mode 100644 include/migration/cpr.h
>  create mode 100644 migration/cpr.c
>  create mode 100644 migration/precreate.c
>  create mode 100644 migration/vmstate-factory.c
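
To make the "preserved in place, albeit with new virtual addresses" point
from the cover letter concrete, here is a minimal Linux-only sketch.  It is
not code from the series: the function name is hypothetical, the munmap()
and second mmap() merely stand in for the address space being replaced at
exec, and memfd_create is invoked through syscall() so the example does not
depend on a particular glibc wrapper.

```c
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical demo, Linux only.  Returns 0 if bytes written through one
 * mapping of a memfd are visible through a second, independent mapping of
 * the same fd, and -1 on any error or mismatch.
 */
int memfd_remap_roundtrip(void)
{
    static const char msg[] = "guest ram";
    const size_t len = 4096;
    char *old_map, *new_map;
    int ret = -1;

    /* Raw syscall so the sketch does not depend on a glibc wrapper. */
    int fd = (int)syscall(SYS_memfd_create, "ram0", 0u);
    if (fd < 0) {
        return -1;
    }
    if (ftruncate(fd, (off_t)len) < 0) {
        goto out;
    }

    old_map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (old_map == MAP_FAILED) {
        goto out;
    }
    memcpy(old_map, msg, sizeof(msg));
    /* Stand-in for exec: the old mapping goes away, the fd stays open. */
    munmap(old_map, len);

    /* "New QEMU": map the same fd again, possibly at a different address. */
    new_map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    if (new_map == MAP_FAILED) {
        goto out;
    }
    ret = memcmp(new_map, msg, sizeof(msg)) == 0 ? 0 : -1;
    munmap(new_map, len);

out:
    close(fd);
    return ret;
}
```

The contents survive because they belong to the memfd, not to any one
mapping; only the fd number has to be carried over to the new process.
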

Hi Steve,

make check is failing. I applied the series on top of master @
70581940ca (Merge tag 'pull-tcg-20240523' of
https://gitlab.com/rth7680/qemu into staging, 2024-05-23).

$ QTEST_QEMU_BINARY=./qemu-system-x86_64 ./tests/qtest/ivshmem-test
...
qemu-system-x86_64: ../system/physmem.c:1634: qemu_ram_verify_idstr:
Assertion `!strcmp(new_block->idstr, idstr)' failed.

$ QTEST_QEMU_BINARY=./qemu-system-x86_64 \
./tests/qtest/test-x86-cpuid-compat -p \
/x86_64/x86/cpuid/auto-level/pc-2.7
...
qemu-system-x86_64: ../system/physmem.c:1634: qemu_ram_verify_idstr:
Assertion `!strcmp(new_block->idstr, idstr)' failed.

$ QTEST_QEMU_BINARY=./qemu-system-x86_64 ./tests/qtest/qmp-cmd-test -p \
/x86_64/qmp/object-add-failure-modes
...
savevm_state_handler_insert: Detected duplicate SaveStateEntry:
id=ram1/RAMBlock, instance_id=0x0



* Re: [PATCH V1 05/26] migration: precreate vmstate
  2024-04-29 15:55 ` [PATCH V1 05/26] migration: precreate vmstate Steve Sistare
  2024-05-07 21:02   ` Fabiano Rosas
@ 2024-05-24 13:56   ` Fabiano Rosas
  2024-05-27 18:16   ` Peter Xu
  2 siblings, 0 replies; 122+ messages in thread
From: Fabiano Rosas @ 2024-05-24 13:56 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel
  Cc: Peter Xu, David Hildenbrand, Igor Mammedov, Eduardo Habkost,
	Marcel Apfelbaum, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster, Steve Sistare

Steve Sistare <steven.sistare@oracle.com> writes:

> Provide the VMStateDescription precreate field to mark objects that must
> be loaded on the incoming side before devices have been created, because
> they provide properties that will be needed at creation time.  They will
> be saved to and loaded from their own QEMUFile, via
> qemu_savevm_precreate_save and qemu_savevm_precreate_load, but these
> functions are not yet called in this patch.  Allow them to be called
> before or after normal migration is active, when current_migration and
> current_incoming are not valid.
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>

Reviewed-by: Fabiano Rosas <farosas@suse.de>



* Re: [PATCH V1 00/26] Live update: cpr-exec
  2024-05-24 13:02 ` Fabiano Rosas
@ 2024-05-24 14:07   ` Steven Sistare
  0 siblings, 0 replies; 122+ messages in thread
From: Steven Sistare @ 2024-05-24 14:07 UTC (permalink / raw)
  To: Fabiano Rosas, qemu-devel
  Cc: Peter Xu, David Hildenbrand, Igor Mammedov, Eduardo Habkost,
	Marcel Apfelbaum, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster

On 5/24/2024 9:02 AM, Fabiano Rosas wrote:
> Hi Steve,
> 
> make check is failing. I applied the series on top of master @
> 70581940ca (Merge tag 'pull-tcg-20240523' of
> https://gitlab.com/rth7680/qemu into staging, 2024-05-23).
> 
> $ QTEST_QEMU_BINARY=./qemu-system-x86_64 ./tests/qtest/ivshmem-test
> ...
> qemu-system-x86_64: ../system/physmem.c:1634: qemu_ram_verify_idstr:
> Assertion `!strcmp(new_block->idstr, idstr)' failed.
> 
> $ QTEST_QEMU_BINARY=./qemu-system-x86_64 \
> ./tests/qtest/test-x86-cpuid-compat -p \
> /x86_64/x86/cpuid/auto-level/pc-2.7
> ...
> qemu-system-x86_64: ../system/physmem.c:1634: qemu_ram_verify_idstr:
> Assertion `!strcmp(new_block->idstr, idstr)' failed.
> 
> $ QTEST_QEMU_BINARY=./qemu-system-x86_64 ./tests/qtest/qmp-cmd-test -p \
> /x86_64/qmp/object-add-failure-modes
> ...
> savevm_state_handler_insert: Detected duplicate SaveStateEntry:
> id=ram1/RAMBlock, instance_id=0x0

Thank you very much, I will investigate.

I suspect the vmstate dup error is due to this bug which I hit after
posting the patches:
-------------------------------
diff --git a/migration/savevm.c b/migration/savevm.c
index bb7fd9f..54aa233 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -1012,7 +1012,7 @@ void vmstate_unregister_named(const char *vmsd_name,
      SaveStateEntry *se, *new_se;
      VMStateId idstr;

-    snprintf(idstr, sizeof(idstr), "%s/%s", vmsd_name, instance_name);
+    snprintf(idstr, sizeof(idstr), "%s/%s", instance_name, vmsd_name);

      SAVEVM_FOREACH_SAFE_ALL(se, entry, new_se) {
          if (!strcmp(se->idstr, idstr) &&
-----------------------------------
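
For illustration only, a standalone sketch of why the argument order matters.
It assumes the key has the form "<instance>/<vmsd>", which is what the
"id=ram1/RAMBlock" text in the failure report suggests; IdStr, make_idstr and
idstr_matches are hypothetical names for this example, not functions from
the series.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the series' VMStateId buffer type. */
typedef char IdStr[256];

/* Build the "<instance>/<vmsd>" key, as in "id=ram1/RAMBlock". */
void make_idstr(IdStr out, const char *instance_name, const char *vmsd_name)
{
    snprintf(out, sizeof(IdStr), "%s/%s", instance_name, vmsd_name);
}

/* Return 1 if the key for (instance_name, vmsd_name) equals registered. */
int idstr_matches(const char *registered, const char *instance_name,
                  const char *vmsd_name)
{
    IdStr key;

    make_idstr(key, instance_name, vmsd_name);
    return strcmp(registered, key) == 0;
}
```

With the two names swapped, the lookup builds "RAMBlock/ram1" and never
matches the registered "ram1/RAMBlock", so the entry would stay registered
and a later registration under the same name would look like a duplicate.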

- Steve



* Re: [PATCH V1 20/26] migration: cpr-exec mode
  2024-04-29 15:55 ` [PATCH V1 20/26] migration: cpr-exec mode Steve Sistare
  2024-05-02 12:23   ` Markus Armbruster
  2024-05-21  8:20   ` Daniel P. Berrangé
@ 2024-05-24 14:58   ` Fabiano Rosas
  2024-05-27 18:54     ` Steven Sistare via
  2 siblings, 1 reply; 122+ messages in thread
From: Fabiano Rosas @ 2024-05-24 14:58 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel
  Cc: Peter Xu, David Hildenbrand, Igor Mammedov, Eduardo Habkost,
	Marcel Apfelbaum, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster, Steve Sistare

Steve Sistare <steven.sistare@oracle.com> writes:

> Add the cpr-exec migration mode.  Usage:
>   qemu-system-$arch -machine memfd-alloc=on ...
>   migrate_set_parameter mode cpr-exec
>   migrate_set_parameter cpr-exec-args \
>     <arg1> <arg2> ... -incoming <uri>
>   migrate -d <uri>
>
> The migrate command stops the VM, saves state to the URI,
> directly exec's a new version of QEMU on the same host,
> replacing the original process while retaining its PID, and
> loads state from the URI.  Guest RAM is preserved in place,
> albeit with new virtual addresses.
>
> Arguments for the new QEMU process are taken from the
> @cpr-exec-args parameter.  The first argument should be the
> path of a new QEMU binary, or a prefix command that exec's the
> new QEMU binary.
>
> Because old QEMU terminates when new QEMU starts, one cannot
> stream data between the two, so the URI must be a type, such as
> a file, that reads all data before old QEMU exits.
>
> Memory backend objects must have the share=on attribute, and
> must be mmap'able in the new QEMU process.  For example,
> memory-backend-file is acceptable, but memory-backend-ram is
> not.
>
> The VM must be started with the '-machine memfd-alloc=on'
> option.  This causes implicit ram blocks (those not explicitly
> described by a memory-backend object) to be allocated by
> mmap'ing a memfd.  Examples include VGA, ROM, and even guest
> RAM when it is specified without a memory-backend object.
>
> The implementation saves precreate vmstate at the end of normal
> migration in migrate_fd_cleanup, and tells the main loop to call
> cpr_exec.  Incoming qemu loads precreate state early, before objects
> are created.  The memfds are kept open across exec by clearing the
> close-on-exec flag, their values are saved in precreate vmstate,
> and they are mmap'd in new qemu.
>
> Note that the memfd-alloc option is not related to memory-backend-memfd.
> Later patches add support for memory-backend-memfd, and for additional
> devices, including vfio, chardev, and more.
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  include/migration/cpr.h  |  14 +++++
>  include/migration/misc.h |   3 ++
>  migration/cpr.c          | 131 +++++++++++++++++++++++++++++++++++++++++++++++
>  migration/meson.build    |   1 +
>  migration/migration.c    |  21 ++++++++
>  migration/migration.h    |   5 +-
>  migration/ram.c          |   1 +
>  qapi/migration.json      |  30 ++++++++++-
>  system/physmem.c         |   2 +
>  system/vl.c              |   4 ++
>  10 files changed, 210 insertions(+), 2 deletions(-)
>  create mode 100644 include/migration/cpr.h
>  create mode 100644 migration/cpr.c
>
> diff --git a/include/migration/cpr.h b/include/migration/cpr.h
> new file mode 100644
> index 0000000..aa8316d
> --- /dev/null
> +++ b/include/migration/cpr.h
> @@ -0,0 +1,14 @@
> +/*
> + * Copyright (c) 2021, 2024 Oracle and/or its affiliates.
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + */
> +
> +#ifndef MIGRATION_CPR_H
> +#define MIGRATION_CPR_H
> +
> +bool cpr_needed_for_exec(void *opaque);
> +void cpr_unpreserve_fds(void);
> +
> +#endif
> diff --git a/include/migration/misc.h b/include/migration/misc.h
> index cf30351..5b963ba 100644
> --- a/include/migration/misc.h
> +++ b/include/migration/misc.h
> @@ -122,4 +122,7 @@ bool migration_in_bg_snapshot(void);
>  /* migration/block-dirty-bitmap.c */
>  void dirty_bitmap_mig_init(void);
>  
> +/* migration/cpr.c */
> +void cpr_exec(char **argv);
> +
>  #endif
> diff --git a/migration/cpr.c b/migration/cpr.c
> new file mode 100644
> index 0000000..d4703e1
> --- /dev/null
> +++ b/migration/cpr.c
> @@ -0,0 +1,131 @@
> +/*
> + * Copyright (c) 2021-2024 Oracle and/or its affiliates.
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + */
> +
> +#include "qemu/osdep.h"
> +#include "qapi/error.h"
> +#include "exec/ramblock.h"
> +#include "migration/cpr.h"
> +#include "migration/migration.h"
> +#include "migration/misc.h"
> +#include "migration/vmstate.h"
> +#include "sysemu/runstate.h"
> +#include "trace.h"
> +
> +/*************************************************************************/
> +#define CPR_STATE "CprState"
> +
> +typedef struct CprState {
> +    MigMode mode;
> +} CprState;
> +
> +static CprState cpr_state = {
> +    .mode = MIG_MODE_NORMAL,
> +};
> +
> +static int cpr_state_presave(void *opaque)
> +{
> +    cpr_state.mode = migrate_mode();
> +    return 0;
> +}
> +
> +bool cpr_needed_for_exec(void *opaque)
> +{
> +    return migrate_mode() == MIG_MODE_CPR_EXEC;
> +}
> +
> +static const VMStateDescription vmstate_cpr_state = {
> +    .name = CPR_STATE,
> +    .version_id = 1,
> +    .minimum_version_id = 1,
> +    .needed = cpr_needed_for_exec,
> +    .pre_save = cpr_state_presave,
> +    .precreate = true,
> +    .fields = (VMStateField[]) {
> +        VMSTATE_UINT32(mode, CprState),
> +        VMSTATE_END_OF_LIST()
> +    }
> +};
> +
> +vmstate_register_init(NULL, 0, vmstate_cpr_state, &cpr_state);
> +
> +/*************************************************************************/
> +
> +typedef int (*cpr_walk_fd_cb)(int fd);
> +
> +static int walk_ramblock(FactoryObject *obj, void *opaque)
> +{
> +    RAMBlock *rb = obj->opaque;
> +    cpr_walk_fd_cb cb = opaque;
> +    return cb(rb->fd);
> +}
> +
> +static int cpr_walk_fd(cpr_walk_fd_cb cb)
> +{
> +    int ret = vmstate_walk_factory_outgoing(RAM_BLOCK, walk_ramblock, cb);
> +    return ret;
> +}
> +
> +static int preserve_fd(int fd)
> +{
> +    qemu_clear_cloexec(fd);
> +    return 0;
> +}
> +
> +static int unpreserve_fd(int fd)
> +{
> +    qemu_set_cloexec(fd);
> +    return 0;
> +}
> +
> +static void cpr_preserve_fds(void)
> +{
> +    cpr_walk_fd(preserve_fd);
> +}
> +
> +void cpr_unpreserve_fds(void)
> +{
> +    cpr_walk_fd(unpreserve_fd);
> +}
> +
> +static int cpr_fd_notifier_func(NotifierWithReturn *notifier,
> +                                 MigrationEvent *e, Error **errp)
> +{
> +    if (migrate_mode() == MIG_MODE_CPR_EXEC &&
> +        e->type == MIG_EVENT_PRECOPY_FAILED) {
> +        cpr_unpreserve_fds();
> +    }
> +    return 0;
> +}
> +
> +void cpr_mig_init(void)
> +{
> +    static NotifierWithReturn cpr_fd_notifier;
> +
> +    migrate_get_current()->parameters.mode = cpr_state.mode;
> +    migration_add_notifier(&cpr_fd_notifier, cpr_fd_notifier_func);
> +}
> +
> +void cpr_exec(char **argv)
> +{
> +    MigrationState *s = migrate_get_current();
> +    Error *err = NULL;
> +
> +    /*
> +     * Clear the close-on-exec flag for all preserved fd's.  We cannot do so
> +     * earlier because they should not persist across miscellaneous fork and
> +     * exec calls that are performed during normal operation.
> +     */
> +    cpr_preserve_fds();
> +
> +    execvp(argv[0], argv);
> +
> +    error_setg_errno(&err, errno, "execvp %s failed", argv[0]);
> +    error_report_err(err);
> +    migrate_set_state(&s->state, s->state, MIGRATION_STATUS_FAILED);
> +    migrate_set_error(s, err);
> +    migration_precreate_unsave();
> +}
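
For readers following the fd handling, a minimal sketch of toggling the
close-on-exec flag with fcntl().  The bodies of qemu_clear_cloexec and
qemu_set_cloexec are not shown in this message, so this is only an assumed
equivalent; fd_set_cloexec is a hypothetical name.

```c
#include <fcntl.h>

/*
 * Hypothetical helper: set (on != 0) or clear (on == 0) FD_CLOEXEC on fd
 * while leaving any other descriptor flags untouched.
 * Returns 0 on success, -1 with errno set on failure.
 */
int fd_set_cloexec(int fd, int on)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags < 0) {
        return -1;
    }
    flags = on ? (flags | FD_CLOEXEC) : (flags & ~FD_CLOEXEC);
    return fcntl(fd, F_SETFD, flags) < 0 ? -1 : 0;
}
```

Clearing the flag just before execvp() and setting it again if the exec or
the migration fails mirrors the preserve/unpreserve pairing in the patch.
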
> diff --git a/migration/meson.build b/migration/meson.build
> index e667b40..d9e9c60 100644
> --- a/migration/meson.build
> +++ b/migration/meson.build
> @@ -14,6 +14,7 @@ system_ss.add(files(
>    'block-dirty-bitmap.c',
>    'channel.c',
>    'channel-block.c',
> +  'cpr.c',
>    'dirtyrate.c',
>    'exec.c',
>    'fd.c',
> diff --git a/migration/migration.c b/migration/migration.c
> index b5af6b5..0d91531 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -239,6 +239,7 @@ void migration_object_init(void)
>      blk_mig_init();
>      ram_mig_init();
>      dirty_bitmap_mig_init();
> +    cpr_mig_init();
>  }
>  
>  typedef struct {
> @@ -1395,6 +1396,15 @@ static void migrate_fd_cleanup(MigrationState *s)
>          qemu_fclose(tmp);
>      }
>  
> +    if (migrate_mode() == MIG_MODE_CPR_EXEC) {
> +        Error *err = NULL;
> +        if (migration_precreate_save(&err)) {
> +            migrate_set_error(s, err);
> +            error_report_err(err);

There's an error_report_err() call already a few lines down.

> +            migrate_set_state(&s->state, s->state, MIGRATION_STATUS_FAILED);
> +        }
> +    }

Not a fan of saving state in the middle of the cleanup function. This
adds extra restrictions to migrate_fd_cleanup() which already tends to
be the source of a bunch of bugs.

Can this be done either entirely before or after migrate_fd_cleanup()?
There's only one callsite from which you actually want to do cpr-exec,
migration_iteration_finish(). It's no big deal if we have to play a bit
with the notifier call placement.

static void migration_iteration_finish(MigrationState *s)
{
...
    migration_bh_schedule(migrate_cpr_exec_bh, s);
    migration_bh_schedule(migrate_fd_cleanup_bh, s);
    bql_unlock();
}

IIUC, the BQL ensures the ordering here, but if that doesn't work we
could just call the cpr function right at the end of
migrate_fd_cleanup(). That would already be better than interleaving.

static void migrate_cpr_exec_bh(void *opaque)
{
    MigrationState *s = opaque;
    Error *err = NULL;    

    if (migration_has_failed(s) || migrate_mode() != MIG_MODE_CPR_EXEC) {
        return;
    }

    assert(s->state == MIGRATION_STATUS_COMPLETED);

    if (migration_precreate_save(&err)) {
        migrate_set_error(s, err);
        error_report_err(err);
        migrate_set_state(&s->state, s->state, MIGRATION_STATUS_FAILED);
        migration_call_notifiers(s, MIG_EVENT_PRECOPY_FAILED, NULL);

        return;
    }

    qemu_system_exec_request(cpr_exec, s->parameters.cpr_exec_args);
}

> +
>      assert(!migration_is_active());
>  
>      if (s->state == MIGRATION_STATUS_CANCELLING) {
> @@ -1410,6 +1420,11 @@ static void migrate_fd_cleanup(MigrationState *s)
>                                       MIG_EVENT_PRECOPY_DONE;
>      migration_call_notifiers(s, type, NULL);
>      block_cleanup_parameters();
> +
> +    if (migrate_mode() == MIG_MODE_CPR_EXEC && !migration_has_failed(s)) {
> +        assert(s->state == MIGRATION_STATUS_COMPLETED);
> +        qemu_system_exec_request(cpr_exec, s->parameters.cpr_exec_args);
> +    }
>      yank_unregister_instance(MIGRATION_YANK_INSTANCE);
>  }
>  
> @@ -1977,6 +1992,12 @@ static bool migrate_prepare(MigrationState *s, bool blk, bool blk_inc,
>          return false;
>      }
>  
> +    if (migrate_mode() == MIG_MODE_CPR_EXEC &&
> +        !s->parameters.has_cpr_exec_args) {
> +        error_setg(errp, "cpr-exec mode requires setting cpr-exec-args");
> +        return false;
> +    }
> +
>      if (migration_is_blocked(errp)) {
>          return false;
>      }
> diff --git a/migration/migration.h b/migration/migration.h
> index 8045e39..2ad2163 100644
> --- a/migration/migration.h
> +++ b/migration/migration.h
> @@ -490,7 +490,6 @@ bool migration_in_postcopy(void);
>  bool migration_postcopy_is_alive(int state);
>  MigrationState *migrate_get_current(void);
>  bool migration_has_failed(MigrationState *);
> -bool migrate_mode_is_cpr(MigrationState *);
>  
>  uint64_t ram_get_total_transferred_pages(void);
>  
> @@ -544,4 +543,8 @@ int migration_rp_wait(MigrationState *s);
>   */
>  void migration_rp_kick(MigrationState *s);
>  
> +/* CPR */
> +bool migrate_mode_is_cpr(MigrationState *);
> +void cpr_mig_init(void);
> +
>  #endif
> diff --git a/migration/ram.c b/migration/ram.c
> index a975c5a..add285b 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -219,6 +219,7 @@ static bool postcopy_preempt_active(void)
>  bool migrate_ram_is_ignored(RAMBlock *block)
>  {
>      return !qemu_ram_is_migratable(block) ||
> +           migrate_mode() == MIG_MODE_CPR_EXEC ||
>             (migrate_ignore_shared() && qemu_ram_is_shared(block)
>                                      && qemu_ram_is_named_file(block));
>  }
> diff --git a/qapi/migration.json b/qapi/migration.json
> index 49710e7..7c5f45f 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -665,9 +665,37 @@
>  #     or COLO.
>  #
>  #     (since 8.2)
> +#
> +# @cpr-exec: The migrate command stops the VM, saves state to the URI,
> +#     directly exec's a new version of QEMU on the same host,
> +#     replacing the original process while retaining its PID, and
> +#     loads state from the URI.  Guest RAM is preserved in place,
> +#     albeit with new virtual addresses.
> +#
> +#     Arguments for the new QEMU process are taken from the
> +#     @cpr-exec-args parameter.  The first argument should be the
> +#     path of a new QEMU binary, or a prefix command that exec's the
> +#     new QEMU binary.
> +#
> +#     Because old QEMU terminates when new QEMU starts, one cannot
> +#     stream data between the two, so the URI must be a type, such as
> +#     a file, that reads all data before old QEMU exits.
> +#
> +#     Memory backend objects must have the share=on attribute, and
> +#     must be mmap'able in the new QEMU process.  For example,
> +#     memory-backend-file is acceptable, but memory-backend-ram is
> +#     not.
> +#
> +#     The VM must be started with the '-machine memfd-alloc=on'
> +#     option.  This causes implicit ram blocks -- those not explicitly
> +#     described by a memory-backend object -- to be allocated by
> +#     mmap'ing a memfd.  Examples include VGA, ROM, and even guest
> +#     RAM when it is specified without a memory-backend object.
> +#
> +#     (since 9.1)
>  ##
>  { 'enum': 'MigMode',
> -  'data': [ 'normal', 'cpr-reboot' ] }
> +  'data': [ 'normal', 'cpr-reboot', 'cpr-exec' ] }
>  
>  ##
>  # @ZeroPageDetection:
> diff --git a/system/physmem.c b/system/physmem.c
> index 3019284..87ad441 100644
> --- a/system/physmem.c
> +++ b/system/physmem.c
> @@ -69,6 +69,7 @@
>  
>  #include "qemu/pmem.h"
>  
> +#include "migration/cpr.h"
>  #include "migration/vmstate.h"
>  
>  #include "qemu/range.h"
> @@ -2069,6 +2070,7 @@ const VMStateDescription vmstate_ram_block = {
>      .minimum_version_id = 1,
>      .precreate = true,
>      .factory = true,
> +    .needed = cpr_needed_for_exec,
>      .fields = (VMStateField[]) {
>          VMSTATE_UINT64(align, RAMBlock),
>          VMSTATE_VOID_PTR(host, RAMBlock),
> diff --git a/system/vl.c b/system/vl.c
> index 7797206..7252100 100644
> --- a/system/vl.c
> +++ b/system/vl.c
> @@ -76,6 +76,7 @@
>  #include "hw/block/block.h"
>  #include "hw/i386/x86.h"
>  #include "hw/i386/pc.h"
> +#include "migration/cpr.h"
>  #include "migration/misc.h"
>  #include "migration/snapshot.h"
>  #include "migration/vmstate.h"
> @@ -3665,6 +3666,9 @@ void qemu_init(int argc, char **argv)
>      qemu_create_machine(machine_opts_dict);
>  
>      vmstate_register_init_all();
> +    migration_precreate_load(&error_fatal);
> +    /* Set cloexec to prevent fd leaks from fork until the next cpr-exec */
> +    cpr_unpreserve_fds();
>  
>      suspend_mux_open();
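
For readers following the fd handling in the hunk above: keeping a
descriptor open across exec, and later restoring the default, comes down
to toggling FD_CLOEXEC with fcntl().  A minimal sketch, assuming only
POSIX; the toy_* helpers are hypothetical names for illustration and are
not the series' qemu_clear_cloexec() or cpr_unpreserve_fds():

```c
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>

/*
 * Hypothetical helper (not from the series): set or clear FD_CLOEXEC.
 * Returns 0 on success, -1 on failure with errno set by fcntl().
 */
static int toy_fd_set_cloexec(int fd, bool on)
{
    int flags;

    assert(fd >= 0);
    flags = fcntl(fd, F_GETFD);
    if (flags < 0) {
        return -1;
    }
    flags = on ? (flags | FD_CLOEXEC) : (flags & ~FD_CLOEXEC);
    return fcntl(fd, F_SETFD, flags) < 0 ? -1 : 0;
}

/* Returns 1 if FD_CLOEXEC is set, 0 if clear, -1 on error. */
static int toy_fd_is_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags < 0) {
        return -1;
    }
    return (flags & FD_CLOEXEC) ? 1 : 0;
}
```

In this model, the old process would call toy_fd_set_cloexec(fd, false)
before exec so the descriptor survives, and the new process would call
toy_fd_set_cloexec(fd, true) once it has recovered the fd, so that later
fork+exec children do not inherit it.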


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH V1 00/26] Live update: cpr-exec
  2024-05-21 11:46       ` Steven Sistare
@ 2024-05-27 17:45         ` Peter Xu
  2024-05-28 15:10           ` Steven Sistare via
  0 siblings, 1 reply; 122+ messages in thread
From: Peter Xu @ 2024-05-27 17:45 UTC (permalink / raw)
  To: Steven Sistare; +Cc: Fabiano Rosas, QEMU Developers

On Tue, May 21, 2024 at 07:46:12AM -0400, Steven Sistare wrote:
> I understand, thanks.  If I can help with any of your todo list,
> just ask - steve

Thanks for offering to help, Steve.  I started looking at this today, and
found that I'm missing something high-level.  Let me ask here, and let me
apologize in advance for throwing multiple questions at you..

IIUC the whole idea of this patchset is to allow efficient QEMU upgrade, in
this case not host kernel but QEMU-only, and/or upper.

Is there any justification for why the complexity is needed here?  This
looks more involved than cpr-reboot, so I'm wondering how much we gain from
the complexity, and whether it's worthwhile.  1000+ LOC is just the minimal
support, and if we expect even more to come, that question is really
important, IMHO.

For example, what's the major motivation of this whole work?  Is it more
about performance, or is it more about supporting special devices like VFIO
which we used to not support, or something else?  I can't find it in any of
the cover letters I looked at, including this one.

Firstly, regarding performance, IMHO it'd always be nice to share even some
very basic downtime measurements comparing the new exec mode vs. the old
migration-based ways of upgrading the QEMU binary.  Do you perhaps have some
numbers on hand from when you started working on this feature years ago?
Or maybe some old links on the list would help too, as I haven't followed
this work since the start.

On VFIO, IIUC you started out this project without VFIO migration being
there.  Now we have VFIO migration so not sure how much it would work for
the upgrade use case. Even with current VFIO migration, we may not want to
migrate device states for a local upgrade I suppose, as that can be a lot
depending on the type of device assigned.  However it'll be nice to discuss
this too if this is the major purpose of the series.

I think one other challenge on QEMU upgrade with VFIO devices is that the
dest QEMU won't be able to open the VFIO device when the src QEMU is still
using it as the owner.  IIUC this is a similar condition where QEMU wants
to have proper ownership transfer of a shared block device, and AFAIR right
now we resolved that issue using some form of file lock on the image file.
In this case it won't easily apply to a VFIO dev fd, but maybe we still
have other approaches, not sure whether you investigated any.  E.g. could
the VFIO handle be passed over using unix scm rights?  I think this might
remove one dependency of using exec which can cause quite some difference
v.s. a generic migration (from which regard, cpr-reboot is still a pretty
generic migration).

You also mentioned vhost/tap, is that also a major goal of this series in
the follow up patchsets?  Is this a problem only because this solution will
do exec?  Can it work if either the exec()ed qemu or the dst qemu creates
the vhost/tap fds at boot?

Meanwhile, could you elaborate a bit on the implication on chardevs?  From
what I read in the doc update it looks like a major part of work in the
future, but I don't yet understand the issue..  Is it also relevant to the
exec() approach?

In all cases, some discussion of these points would be really appreciated.
And if you considered other approaches to solve this problem, it'd be great
to mention why you chose this one.  Considering how many things this work
contains, it'd be nice if such discussion could start with the fundamentals,
e.g. on why exec() is a must.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH V1 04/26] migration: delete unused parameter mis
  2024-04-29 15:55 ` [PATCH V1 04/26] migration: delete unused parameter mis Steve Sistare
  2024-05-06 21:50   ` Fabiano Rosas
@ 2024-05-27 18:02   ` Peter Xu
  1 sibling, 0 replies; 122+ messages in thread
From: Peter Xu @ 2024-05-27 18:02 UTC (permalink / raw)
  To: Steve Sistare
  Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Igor Mammedov,
	Eduardo Habkost, Marcel Apfelbaum, Philippe Mathieu-Daude,
	Paolo Bonzini, Daniel P. Berrange, Markus Armbruster

On Mon, Apr 29, 2024 at 08:55:13AM -0700, Steve Sistare wrote:
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>

Reviewed-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH V1 00/26] Live update: cpr-exec
  2024-04-29 15:55 [PATCH V1 00/26] Live update: cpr-exec Steve Sistare
                   ` (28 preceding siblings ...)
  2024-05-24 13:02 ` Fabiano Rosas
@ 2024-05-27 18:07 ` Peter Xu
  29 siblings, 0 replies; 122+ messages in thread
From: Peter Xu @ 2024-05-27 18:07 UTC (permalink / raw)
  To: Steve Sistare
  Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Igor Mammedov,
	Eduardo Habkost, Marcel Apfelbaum, Philippe Mathieu-Daude,
	Paolo Bonzini, Daniel P. Berrange, Markus Armbruster

On Mon, Apr 29, 2024 at 08:55:09AM -0700, Steve Sistare wrote:
> This patch series implements a minimal version of cpr-exec.  Future series
> will add support for:
>   * vfio
>   * chardev's without loss of connectivity
>   * vhost
>   * fine-grained seccomp controls
>   * hostmem-memfd
>   * cpr-exec migration test

Another request besides the questions I threw already.. could you push a
tree that has everything?  That may also help with reviewing the minimal set.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH V1 03/26] migration: SAVEVM_FOREACH
  2024-05-13 19:27     ` Steven Sistare
@ 2024-05-27 18:14       ` Peter Xu
  0 siblings, 0 replies; 122+ messages in thread
From: Peter Xu @ 2024-05-27 18:14 UTC (permalink / raw)
  To: Steven Sistare
  Cc: Fabiano Rosas, qemu-devel, David Hildenbrand, Igor Mammedov,
	Eduardo Habkost, Marcel Apfelbaum, Philippe Mathieu-Daude,
	Paolo Bonzini, Daniel P. Berrange, Markus Armbruster

On Mon, May 13, 2024 at 03:27:30PM -0400, Steven Sistare wrote:
> On 5/6/2024 7:17 PM, Fabiano Rosas wrote:
> > Steve Sistare <steven.sistare@oracle.com> writes:
> > 
> > > Define an abstraction SAVEVM_FOREACH to loop over all savevm state
> > > handlers, and replace QTAILQ_FOREACH.  Define variants for ALL so
> > > we can loop over all handlers vs a subset of handlers in a subsequent
> > > patch, but at this time there is no distinction between the two.
> > > No functional change.
> > > 
> > > Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> > > ---
> > >   migration/savevm.c | 55 +++++++++++++++++++++++++++++++-----------------------
> > >   1 file changed, 32 insertions(+), 23 deletions(-)
> > > 
> > > diff --git a/migration/savevm.c b/migration/savevm.c
> > > index 4509482..6829ba3 100644
> > > --- a/migration/savevm.c
> > > +++ b/migration/savevm.c
> > > @@ -237,6 +237,15 @@ static SaveState savevm_state = {
> > >       .global_section_id = 0,
> > >   };
> > > +#define SAVEVM_FOREACH(se, entry)                                    \
> > > +    QTAILQ_FOREACH(se, &savevm_state.handlers, entry)                \
> > > +
> > > +#define SAVEVM_FOREACH_ALL(se, entry)                                \
> > > +    QTAILQ_FOREACH(se, &savevm_state.handlers, entry)
> > 
> > This feels worse than SAVEVM_FOREACH_NOT_PRECREATED. We'll have to keep
> > coming back to the definition to figure out which FOREACH is the real
> > deal.
> 
> I take your point, but the majority of the loops do not care about precreated
> objects, so it seems backwards to make them more verbose with
> SAVEVM_FOREACH_NOT_PRECREATE.  I can go either way, but we need
> Peter's opinion also.

I don't have a strong opinion yet on the name.  I think it'll indeed be
clearer if the _ALL() variant (or whatever other name) is only introduced
together with its real users.

OTOH, besides the name (which is much more trivial..) the precreated idea
in general is a bit scary to me.. if it is a "workaround" for some new
ordering issue caused by newly added dependencies for exec() support.
Maybe I'll understand better once I know more about the whole design.

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH V1 05/26] migration: precreate vmstate
  2024-04-29 15:55 ` [PATCH V1 05/26] migration: precreate vmstate Steve Sistare
  2024-05-07 21:02   ` Fabiano Rosas
  2024-05-24 13:56   ` Fabiano Rosas
@ 2024-05-27 18:16   ` Peter Xu
  2024-05-28 15:09     ` Steven Sistare via
  2 siblings, 1 reply; 122+ messages in thread
From: Peter Xu @ 2024-05-27 18:16 UTC (permalink / raw)
  To: Steve Sistare
  Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Igor Mammedov,
	Eduardo Habkost, Marcel Apfelbaum, Philippe Mathieu-Daude,
	Paolo Bonzini, Daniel P. Berrange, Markus Armbruster

On Mon, Apr 29, 2024 at 08:55:14AM -0700, Steve Sistare wrote:
> Provide the VMStateDescription precreate field to mark objects that must
> be loaded on the incoming side before devices have been created, because
> they provide properties that will be needed at creation time.  They will
> be saved to and loaded from their own QEMUFile, via
> qemu_savevm_precreate_save and qemu_savevm_precreate_load, but these
> functions are not yet called in this patch.  Allow them to be called
> before or after normal migration is active, when current_migration and
> current_incoming are not valid.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  include/migration/vmstate.h |  6 ++++
>  migration/savevm.c          | 69 +++++++++++++++++++++++++++++++++++++++++----
>  migration/savevm.h          |  3 ++
>  3 files changed, 73 insertions(+), 5 deletions(-)
> 
> diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
> index 294d2d8..4691334 100644
> --- a/include/migration/vmstate.h
> +++ b/include/migration/vmstate.h
> @@ -198,6 +198,12 @@ struct VMStateDescription {
>       * a QEMU_VM_SECTION_START section.
>       */
>      bool early_setup;
> +
> +    /*
> +     * Send/receive this object in the precreate migration stream.
> +     */
> +    bool precreate;
> +
>      int version_id;
>      int minimum_version_id;
>      MigrationPriority priority;
> diff --git a/migration/savevm.c b/migration/savevm.c
> index 9789823..a30bcd9 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -239,6 +239,7 @@ static SaveState savevm_state = {
>  
>  #define SAVEVM_FOREACH(se, entry)                                    \
>      QTAILQ_FOREACH(se, &savevm_state.handlers, entry)                \
> +        if (!se->vmsd || !se->vmsd->precreate)
>  
>  #define SAVEVM_FOREACH_ALL(se, entry)                                \
>      QTAILQ_FOREACH(se, &savevm_state.handlers, entry)
> @@ -1006,13 +1007,19 @@ static void save_section_header(QEMUFile *f, SaveStateEntry *se,
>      }
>  }
>  
> +static bool send_section_footer(SaveStateEntry *se)
> +{
> +    return (se->vmsd && se->vmsd->precreate) ||
> +           migrate_get_current()->send_section_footer;
> +}

Does the precreate vmsd "require" the footer?  Or would it also work
without one?  IMHO it's less optimal to bind features together without good
reasons.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH V1 07/26] migration: VMStateId
  2024-04-29 15:55 ` [PATCH V1 07/26] migration: VMStateId Steve Sistare
  2024-05-07 21:03   ` Fabiano Rosas
@ 2024-05-27 18:20   ` Peter Xu
  2024-05-28 15:10     ` Steven Sistare via
  1 sibling, 1 reply; 122+ messages in thread
From: Peter Xu @ 2024-05-27 18:20 UTC (permalink / raw)
  To: Steve Sistare
  Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Igor Mammedov,
	Eduardo Habkost, Marcel Apfelbaum, Philippe Mathieu-Daude,
	Paolo Bonzini, Daniel P. Berrange, Markus Armbruster

On Mon, Apr 29, 2024 at 08:55:16AM -0700, Steve Sistare wrote:
> Define a type for the 256 byte id string to guarantee the same length is
> used and enforced everywhere.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  include/exec/ramblock.h     | 3 ++-
>  include/migration/vmstate.h | 2 ++
>  migration/savevm.c          | 8 ++++----
>  migration/vmstate.c         | 3 ++-
>  4 files changed, 10 insertions(+), 6 deletions(-)
> 
> diff --git a/include/exec/ramblock.h b/include/exec/ramblock.h
> index 0babd10..61deefe 100644
> --- a/include/exec/ramblock.h
> +++ b/include/exec/ramblock.h
> @@ -23,6 +23,7 @@
>  #include "cpu-common.h"
>  #include "qemu/rcu.h"
>  #include "exec/ramlist.h"
> +#include "migration/vmstate.h"
>  
>  struct RAMBlock {
>      struct rcu_head rcu;
> @@ -35,7 +36,7 @@ struct RAMBlock {
>      void (*resized)(const char*, uint64_t length, void *host);
>      uint32_t flags;
>      /* Protected by the BQL.  */
> -    char idstr[256];
> +    VMStateId idstr;
>      /* RCU-enabled, writes protected by the ramlist lock */
>      QLIST_ENTRY(RAMBlock) next;
>      QLIST_HEAD(, RAMBlockNotifier) ramblock_notifiers;

Hmm.. It doesn't look like a good idea to include a migration header in
ramblock.h.  Is this ramblock change needed for this work?

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH V1 08/26] migration: vmstate_info_void_ptr
  2024-04-29 15:55 ` [PATCH V1 08/26] migration: vmstate_info_void_ptr Steve Sistare
  2024-05-07 21:33   ` Fabiano Rosas
@ 2024-05-27 18:31   ` Peter Xu
  2024-05-28 15:10     ` Steven Sistare via
  1 sibling, 1 reply; 122+ messages in thread
From: Peter Xu @ 2024-05-27 18:31 UTC (permalink / raw)
  To: Steve Sistare
  Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Igor Mammedov,
	Eduardo Habkost, Marcel Apfelbaum, Philippe Mathieu-Daude,
	Paolo Bonzini, Daniel P. Berrange, Markus Armbruster

On Mon, Apr 29, 2024 at 08:55:17AM -0700, Steve Sistare wrote:
> Define VMSTATE_VOID_PTR so the value of a pointer (but not its target)
> can be saved in the migration stream.  This will be needed for CPR.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>

This is really tricky.

At first glance, I don't think migrating a VA is valid at all for
migration, even with exec.. and it looks insane to me for a cross-process
migration, which this seems to allow since it is a generic VMSD helper.
The VA space is the barrier between different processes, and I think that
normally applies even across a generic execve(), yet we're trying to
jailbreak it for some reason..

It definitely won't work for any generic migration as sizeof(void*) can be
different afaict between hosts, e.g. 32bit -> 64bit migrations.
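
To make the width concern concrete, here is a minimal sketch, assuming a
stream that always carries 8 big-endian bytes per pointer value.  The toy_*
names are hypothetical and this is not how VMSTATE_VOID_PTR is implemented
in the series.  Note that a fixed width only fixes the stream layout: the
value itself is still meaningless in a different address space.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical encoder: always emit 8 bytes, whatever sizeof(void *) is. */
static void toy_put_ptr_be64(uint8_t out[8], const void *p)
{
    uint64_t v = (uint64_t)(uintptr_t)p;
    int i;

    assert(out != NULL);
    for (i = 0; i < 8; i++) {
        out[i] = (uint8_t)(v >> (56 - 8 * i));
    }
}

/* Decode the 8-byte big-endian value written by toy_put_ptr_be64(). */
static uint64_t toy_get_be64(const uint8_t in[8])
{
    uint64_t v = 0;
    int i;

    assert(in != NULL);
    for (i = 0; i < 8; i++) {
        v = (v << 8) | in[i];
    }
    return v;
}

/* A 64-bit value from the stream may not fit a 32-bit host's pointer. */
static bool toy_ptr_value_fits(uint64_t v)
{
    return v <= UINTPTR_MAX;
}
```

A receiver would reject the stream when toy_ptr_value_fits() returns false,
which is the 64bit -> 32bit direction of the problem described above.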

Some description would be really helpful in this commit message,
e.g. explain the users and why.  Do we need to poison that for generic VMSD
use (perhaps with prefixed underscores)?  I think I'll need to read on the
rest to tell..

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH V1 20/26] migration: cpr-exec mode
  2024-05-24 14:58   ` Fabiano Rosas
@ 2024-05-27 18:54     ` Steven Sistare via
  0 siblings, 0 replies; 122+ messages in thread
From: Steven Sistare via @ 2024-05-27 18:54 UTC (permalink / raw)
  To: Fabiano Rosas, qemu-devel
  Cc: Peter Xu, David Hildenbrand, Igor Mammedov, Eduardo Habkost,
	Marcel Apfelbaum, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster

On 5/24/2024 10:58 AM, Fabiano Rosas wrote:
> Steve Sistare <steven.sistare@oracle.com> writes:
> 
>> Add the cpr-exec migration mode.  Usage:
>>    qemu-system-$arch -machine memfd-alloc=on ...
>>    migrate_set_parameter mode cpr-exec
>>    migrate_set_parameter cpr-exec-args \
>>      <arg1> <arg2> ... -incoming <uri>
>>    migrate -d <uri>
>>
>> The migrate command stops the VM, saves state to the URI,
>> directly exec's a new version of QEMU on the same host,
>> replacing the original process while retaining its PID, and
>> loads state from the URI.  Guest RAM is preserved in place,
>> albeit with new virtual addresses.
>>
>> Arguments for the new QEMU process are taken from the
>> @cpr-exec-args parameter.  The first argument should be the
>> path of a new QEMU binary, or a prefix command that exec's the
>> new QEMU binary.
>>
>> Because old QEMU terminates when new QEMU starts, one cannot
>> stream data between the two, so the URI must be a type, such as
>> a file, that reads all data before old QEMU exits.
>>
>> Memory backend objects must have the share=on attribute, and
>> must be mmap'able in the new QEMU process.  For example,
>> memory-backend-file is acceptable, but memory-backend-ram is
>> not.
>>
>> The VM must be started with the '-machine memfd-alloc=on'
>> option.  This causes implicit ram blocks (those not explicitly
>> described by a memory-backend object) to be allocated by
>> mmap'ing a memfd.  Examples include VGA, ROM, and even guest
>> RAM when it is specified without a memory-backend object.
>>
>> The implementation saves precreate vmstate at the end of normal
>> migration in migrate_fd_cleanup, and tells the main loop to call
>> cpr_exec.  Incoming qemu loads precreate state early, before objects
>> are created.  The memfds are kept open across exec by clearing the
>> close-on-exec flag, their values are saved in precreate vmstate,
>> and they are mmap'd in new qemu.
>>
>> Note that the memfd-alloc option is not related to memory-backend-memfd.
>> Later patches add support for memory-backend-memfd, and for additional
>> devices, including vfio, chardev, and more.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>>   include/migration/cpr.h  |  14 +++++
>>   include/migration/misc.h |   3 ++
>>   migration/cpr.c          | 131 +++++++++++++++++++++++++++++++++++++++++++++++
>>   migration/meson.build    |   1 +
>>   migration/migration.c    |  21 ++++++++
>>   migration/migration.h    |   5 +-
>>   migration/ram.c          |   1 +
>>   qapi/migration.json      |  30 ++++++++++-
>>   system/physmem.c         |   2 +
>>   system/vl.c              |   4 ++
>>   10 files changed, 210 insertions(+), 2 deletions(-)
>>   create mode 100644 include/migration/cpr.h
>>   create mode 100644 migration/cpr.c
>> [...]
>> diff --git a/migration/migration.c b/migration/migration.c
>> index b5af6b5..0d91531 100644
>> --- a/migration/migration.c
>> +++ b/migration/migration.c
>> @@ -239,6 +239,7 @@ void migration_object_init(void)
>>       blk_mig_init();
>>       ram_mig_init();
>>       dirty_bitmap_mig_init();
>> +    cpr_mig_init();
>>   }
>>   
>>   typedef struct {
>> @@ -1395,6 +1396,15 @@ static void migrate_fd_cleanup(MigrationState *s)
>>           qemu_fclose(tmp);
>>       }
>>   
>> +    if (migrate_mode() == MIG_MODE_CPR_EXEC) {
>> +        Error *err = NULL;
>> +        if (migration_precreate_save(&err)) {
>> +            migrate_set_error(s, err);
>> +            error_report_err(err);
> 
> There's an error_report_err() call already a few lines down.
> 
>> +            migrate_set_state(&s->state, s->state, MIGRATION_STATUS_FAILED);
>> +        }
>> +    }
> 
> Not a fan of saving state in the middle of the cleanup function. This
> adds extra restrictions to migrate_fd_cleanup() which already tends to
> be the source of a bunch of bugs.
> 
> Can this be done either entirely before or after migrate_fd_cleanup()?
> There's only one callsite from which you actually want to do cpr-exec,
> migration_iteration_finish(). It's no big deal if we have to play a bit
> with the notifier call placement.
> 
> static void migration_iteration_finish(MigrationState *s)
> {
> ...
>      migration_bh_schedule(migrate_cpr_exec_bh, s);
>      migration_bh_schedule(migrate_fd_cleanup_bh, s);
>      bql_unlock();
> }
> 
> IIUC, the BQL ensures the ordering here, but if that doesn't work we
> could just call the cpr function right at the end of
> migrate_fd_cleanup(). That would already be better than interleaving.
> 
> static void migrate_cpr_exec_bh(void *opaque)
> {
>      MigrationState *s = opaque;
>      Error *err = NULL;
> 
>      if (migration_has_failed(s) || migrate_mode() != MIG_MODE_CPR_EXEC) {
>          return;
>      }
> 
>      assert(s->state == MIGRATION_STATUS_COMPLETED);
> 
>      if (migration_precreate_save(&err)) {
>          migrate_set_error(s, err);
>          error_report_err(err);
>          migrate_set_state(&s->state, s->state, MIGRATION_STATUS_FAILED);
>          migration_call_notifiers(s, MIG_EVENT_PRECOPY_FAILED, NULL);
> 
>          return;
>      }
> 
>      qemu_system_exec_request(cpr_exec, s->parameters.cpr_exec_args);
> }

No problem, I can hoist the cpr exec logic out of migrate_fd_cleanup.
I'll call migration_precreate_save prior, and I'll register a notifier
for MIG_EVENT_PRECOPY_DONE that calls qemu_system_exec_request.

BTW the following does not work because the order of bh execution is not defined
by the qemu_bh_schedule API (and in fact the current implementation prepends
to bh_list, executing these in reverse order):

   migration_bh_schedule(migrate_cpr_exec_bh, s);
   migration_bh_schedule(migrate_fd_cleanup_bh, s);
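
To illustrate the prepend behavior, here is a toy model, assuming a
singly-linked list where scheduling inserts at the head and polling walks
from the head.  The ToyBH type and toy_* functions are hypothetical and
are not QEMU's actual bottom-half code:

```c
#include <assert.h>
#include <stddef.h>

/* Toy bottom half: a callback plus its argument, linked into a list. */
typedef struct ToyBH {
    void (*cb)(void *opaque);
    void *opaque;
    struct ToyBH *next;
} ToyBH;

static ToyBH *toy_bh_list;

/* Schedule by prepending, so the most recent entry ends up first. */
static void toy_bh_schedule(ToyBH *bh, void (*cb)(void *), void *opaque)
{
    assert(bh != NULL && cb != NULL);
    bh->cb = cb;
    bh->opaque = opaque;
    bh->next = toy_bh_list;
    toy_bh_list = bh;
}

/* Run and drain everything scheduled so far, walking from the head. */
static void toy_bh_poll(void)
{
    ToyBH *bh = toy_bh_list;

    toy_bh_list = NULL;
    while (bh) {
        ToyBH *next = bh->next;
        bh->cb(bh->opaque);
        bh = next;
    }
}

/* Records the order in which callbacks ran, one tag character each. */
static char toy_log[8];
static size_t toy_log_len;

static void toy_record(void *opaque)
{
    assert(toy_log_len < sizeof(toy_log));
    toy_log[toy_log_len++] = *(const char *)opaque;
}
```

In this model, scheduling an "exec" entry and then a "cleanup" entry makes
the cleanup callback run first, which is why the two-call sequence above
cannot be relied on for ordering.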

- Steve


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH V1 23/26] migration: misc cpr-exec blockers
  2024-05-24 12:40   ` Fabiano Rosas
@ 2024-05-27 19:02     ` Steven Sistare via
  0 siblings, 0 replies; 122+ messages in thread
From: Steven Sistare via @ 2024-05-27 19:02 UTC (permalink / raw)
  To: Fabiano Rosas, qemu-devel
  Cc: Peter Xu, David Hildenbrand, Igor Mammedov, Eduardo Habkost,
	Marcel Apfelbaum, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster

On 5/24/2024 8:40 AM, Fabiano Rosas wrote:
> Steve Sistare <steven.sistare@oracle.com> writes:
> 
>> Add blockers for cpr-exec migration mode for devices and options that do
>> not support it.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>>   accel/xen/xen-all.c    |  5 +++++
>>   backends/hostmem-epc.c | 12 ++++++++++--
>>   hw/vfio/migration.c    |  3 ++-
>>   replay/replay.c        |  6 ++++++
>>   4 files changed, 23 insertions(+), 3 deletions(-)
>>
>> diff --git a/accel/xen/xen-all.c b/accel/xen/xen-all.c
>> index 0bdefce..9a7ed0f 100644
>> --- a/accel/xen/xen-all.c
>> +++ b/accel/xen/xen-all.c
> 
> This file is missing the migration/blocker.h include.

Good eyes, will fix - steve


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH V1 05/26] migration: precreate vmstate
  2024-05-27 18:16   ` Peter Xu
@ 2024-05-28 15:09     ` Steven Sistare via
  2024-05-29 18:39       ` Peter Xu
  0 siblings, 1 reply; 122+ messages in thread
From: Steven Sistare via @ 2024-05-28 15:09 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Igor Mammedov,
	Eduardo Habkost, Marcel Apfelbaum, Philippe Mathieu-Daude,
	Paolo Bonzini, Daniel P. Berrange, Markus Armbruster

On 5/27/2024 2:16 PM, Peter Xu wrote:
> On Mon, Apr 29, 2024 at 08:55:14AM -0700, Steve Sistare wrote:
>> Provide the VMStateDescription precreate field to mark objects that must
>> be loaded on the incoming side before devices have been created, because
>> they provide properties that will be needed at creation time.  They will
>> be saved to and loaded from their own QEMUFile, via
>> qemu_savevm_precreate_save and qemu_savevm_precreate_load, but these
>> functions are not yet called in this patch.  Allow them to be called
>> before or after normal migration is active, when current_migration and
>> current_incoming are not valid.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>>   include/migration/vmstate.h |  6 ++++
>>   migration/savevm.c          | 69 +++++++++++++++++++++++++++++++++++++++++----
>>   migration/savevm.h          |  3 ++
>>   3 files changed, 73 insertions(+), 5 deletions(-)
>>
>> diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
>> index 294d2d8..4691334 100644
>> --- a/include/migration/vmstate.h
>> +++ b/include/migration/vmstate.h
>> @@ -198,6 +198,12 @@ struct VMStateDescription {
>>        * a QEMU_VM_SECTION_START section.
>>        */
>>       bool early_setup;
>> +
>> +    /*
>> +     * Send/receive this object in the precreate migration stream.
>> +     */
>> +    bool precreate;
>> +
>>       int version_id;
>>       int minimum_version_id;
>>       MigrationPriority priority;
>> diff --git a/migration/savevm.c b/migration/savevm.c
>> index 9789823..a30bcd9 100644
>> --- a/migration/savevm.c
>> +++ b/migration/savevm.c
>> @@ -239,6 +239,7 @@ static SaveState savevm_state = {
>>   
>>   #define SAVEVM_FOREACH(se, entry)                                    \
>>       QTAILQ_FOREACH(se, &savevm_state.handlers, entry)                \
>> +        if (!se->vmsd || !se->vmsd->precreate)
>>   
>>   #define SAVEVM_FOREACH_ALL(se, entry)                                \
>>       QTAILQ_FOREACH(se, &savevm_state.handlers, entry)
>> @@ -1006,13 +1007,19 @@ static void save_section_header(QEMUFile *f, SaveStateEntry *se,
>>       }
>>   }
>>   
>> +static bool send_section_footer(SaveStateEntry *se)
>> +{
>> +    return (se->vmsd && se->vmsd->precreate) ||
>> +           migrate_get_current()->send_section_footer;
>> +}
> 
> Does the precreate vmsd "require" the footer?  Or it should also work?
> IMHO it's less optimal to bind features without good reasons.

It is not required.  However, IMO we should not treat send-section-footer as
a fungible feature.  It is strictly an improvement, as it was added to catch
misformatted sections.  It is only registered as a feature for backwards
compatibility with qemu 2.3 and xen.

For a brand new data stream such as precreate, where we are not constrained
by backwards compatibility, we should unconditionally use the better protocol,
and always send the footer.
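
As a rough illustration of what the footer buys us, here is a sketch that
assumes the footer is a marker byte followed by the big-endian 32-bit
section id.  The 0x7e marker value and this layout are assumptions made
for the example, and the toy_* names are hypothetical:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_SECTION_FOOTER 0x7e   /* assumed marker value */
#define TOY_FOOTER_LEN     5      /* marker byte + 32-bit section id */

/* Append a footer to buf; returns the number of bytes written. */
static size_t toy_put_footer(uint8_t *buf, uint32_t section_id)
{
    assert(buf != NULL);
    buf[0] = TOY_SECTION_FOOTER;
    buf[1] = (uint8_t)(section_id >> 24);
    buf[2] = (uint8_t)(section_id >> 16);
    buf[3] = (uint8_t)(section_id >> 8);
    buf[4] = (uint8_t)section_id;
    return TOY_FOOTER_LEN;
}

/*
 * Verify that the bytes following a section's payload are a footer for
 * the section we think we just loaded.  A section whose load routine
 * consumed too few or too many bytes fails this check immediately,
 * instead of corrupting the parse of whatever comes next.
 */
static bool toy_check_footer(const uint8_t *buf, size_t len,
                             uint32_t expected_id)
{
    uint32_t id;

    assert(buf != NULL);
    if (len < TOY_FOOTER_LEN || buf[0] != TOY_SECTION_FOOTER) {
        return false;
    }
    id = ((uint32_t)buf[1] << 24) | ((uint32_t)buf[2] << 16) |
         ((uint32_t)buf[3] << 8) | (uint32_t)buf[4];
    return id == expected_id;
}
```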

- Steve



^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH V1 07/26] migration: VMStateId
  2024-05-27 18:20   ` Peter Xu
@ 2024-05-28 15:10     ` Steven Sistare via
  2024-05-28 17:44       ` Peter Xu
  0 siblings, 1 reply; 122+ messages in thread
From: Steven Sistare via @ 2024-05-28 15:10 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Igor Mammedov,
	Eduardo Habkost, Marcel Apfelbaum, Philippe Mathieu-Daude,
	Paolo Bonzini, Daniel P. Berrange, Markus Armbruster

On 5/27/2024 2:20 PM, Peter Xu wrote:
> On Mon, Apr 29, 2024 at 08:55:16AM -0700, Steve Sistare wrote:
>> Define a type for the 256 byte id string to guarantee the same length is
>> used and enforced everywhere.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>>   include/exec/ramblock.h     | 3 ++-
>>   include/migration/vmstate.h | 2 ++
>>   migration/savevm.c          | 8 ++++----
>>   migration/vmstate.c         | 3 ++-
>>   4 files changed, 10 insertions(+), 6 deletions(-)
>>
>> diff --git a/include/exec/ramblock.h b/include/exec/ramblock.h
>> index 0babd10..61deefe 100644
>> --- a/include/exec/ramblock.h
>> +++ b/include/exec/ramblock.h
>> @@ -23,6 +23,7 @@
>>   #include "cpu-common.h"
>>   #include "qemu/rcu.h"
>>   #include "exec/ramlist.h"
>> +#include "migration/vmstate.h"
>>   
>>   struct RAMBlock {
>>       struct rcu_head rcu;
>> @@ -35,7 +36,7 @@ struct RAMBlock {
>>       void (*resized)(const char*, uint64_t length, void *host);
>>       uint32_t flags;
>>       /* Protected by the BQL.  */
>> -    char idstr[256];
>> +    VMStateId idstr;
>>       /* RCU-enabled, writes protected by the ramlist lock */
>>       QLIST_ENTRY(RAMBlock) next;
>>       QLIST_HEAD(, RAMBlockNotifier) ramblock_notifiers;
> 
> Hmm.. Don't look like a good idea to include a migration header in
> ramblock.h?  Is this ramblock change needed for this work?

Well, entities that are migrated include migration headers, and now that
includes RAMBlock.  There is precedent:

0 include/exec/ramblock.h   26 #include "migration/vmstate.h"
1 include/hw/acpi/ich9_tco. 14 #include "migration/vmstate.h"
2 include/hw/display/ramfb.  4 #include "migration/vmstate.h"
3 include/hw/hyperv/vmbus.h 16 #include "migration/vmstate.h"
4 include/hw/input/pl050.h  14 #include "migration/vmstate.h"
5 include/hw/pci/shpc.h      7 #include "migration/vmstate.h"
6 include/hw/virtio/virtio. 20 #include "migration/vmstate.h"
7 include/migration/cpu.h    8 #include "migration/vmstate.h"

Granted, only some of the C files that include ramblock.h need all of vmstate.h.
I could define VMStateId in a smaller file such as migration/misc.h, or a
new file migration/vmstateid.h, and include that in ramblock.h.
Any preference?
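
For reference, the small-header variant might look roughly like the
sketch below.  The struct and helper names are hypothetical and exist only
for illustration; the one thing taken from the patch is that VMStateId is
a 256 byte character array type.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* What a small dedicated header could contain: just the id type. */
typedef char VMStateId[256];

/* Hypothetical stand-in for a structure that embeds an id string. */
struct SketchRAMBlock {
    VMStateId idstr;
};

/*
 * Copy name into id, truncating if necessary.  Returns 1 if the whole
 * name fit, 0 if it was truncated.  An array typedef decays to a pointer
 * when passed as a parameter, so the bound must come from
 * sizeof(VMStateId), not sizeof(id).
 */
static int sketch_set_id(VMStateId id, const char *name)
{
    int n;

    assert(id != NULL && name != NULL);
    n = snprintf(id, sizeof(VMStateId), "%s", name);
    return n >= 0 && (size_t)n < sizeof(VMStateId);
}
```

With the type in its own header, ramblock.h would pick up only this
typedef rather than all of vmstate.h.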

- Steve



* Re: [PATCH V1 08/26] migration: vmstate_info_void_ptr
  2024-05-27 18:31   ` Peter Xu
@ 2024-05-28 15:10     ` Steven Sistare via
  2024-05-28 18:21       ` Peter Xu
  0 siblings, 1 reply; 122+ messages in thread
From: Steven Sistare via @ 2024-05-28 15:10 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Igor Mammedov,
	Eduardo Habkost, Marcel Apfelbaum, Philippe Mathieu-Daude,
	Paolo Bonzini, Daniel P. Berrange, Markus Armbruster

On 5/27/2024 2:31 PM, Peter Xu wrote:
> On Mon, Apr 29, 2024 at 08:55:17AM -0700, Steve Sistare wrote:
>> Define VMSTATE_VOID_PTR so the value of a pointer (but not its target)
>> can be saved in the migration stream.  This will be needed for CPR.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> 
> This is really tricky.
> 
>  From a first glance, I don't think migrating a VA is valid at all for
> migration even if with exec.. and looks insane to me for a cross-process
> migration, which seems to be allowed to use as a generic VMSD helper.. as
> VA is the address space barrier for different processes and I think it
> normally even apply to generic execve(), and we're trying to jailbreak for
> some reason..
> 
> It definitely won't work for any generic migration as sizeof(void*) can be
> different afaict between hosts, e.g. 32bit -> 64bit migrations.
> 
> Some description would be really helpful in this commit message,
> e.g. explain the users and why.  Do we need to poison that for generic VMSD
> use (perhaps with prefixed underscores)?  I think I'll need to read on the
> rest to tell..

Short answer: we never dereference the void* in the new process.  And must not.

Longer answer:

During CPR for vfio, each mapped DMA region is re-registered in the new
process using the new VA.  The ioctl to re-register identifies the mapping
by IOVA and length.

The same requirement holds for CPR of iommufd devices.  However, in the
iommufd framework, IOVA does not uniquely identify a dma mapping, and we
need to use the old VA as the unique identifier.  The new process
re-registers each mapping, passing the old VA and new VA to the kernel.
The old VA is never dereferenced in the new process, we just need its value.

I suspected that the void* which must not be dereferenced might make people
uncomfortable.  I have an older version of my code which adds a uint64_t
field to RAMBlock for recording and migrating the old VA.  The saving and
loading code is slightly less elegant, but no big deal.  Would you prefer
that?

- Steve



* Re: [PATCH V1 00/26] Live update: cpr-exec
  2024-05-27 17:45         ` Peter Xu
@ 2024-05-28 15:10           ` Steven Sistare via
  2024-05-28 16:42             ` Peter Xu
  0 siblings, 1 reply; 122+ messages in thread
From: Steven Sistare via @ 2024-05-28 15:10 UTC (permalink / raw)
  To: Peter Xu; +Cc: Fabiano Rosas, QEMU Developers

On 5/27/2024 1:45 PM, Peter Xu wrote:
> On Tue, May 21, 2024 at 07:46:12AM -0400, Steven Sistare wrote:
>> I understand, thanks.  If I can help with any of your todo list,
>> just ask - steve
> 
> Thanks for offering the help, Steve.  Started looking at this today, then I
> found that I miss something high-level.  Let me ask here, and let me
> apologize already for starting to throw multiple questions..
> 
> IIUC the whole idea of this patchset is to allow efficient QEMU upgrade, in
> this case not host kernel but QEMU-only, and/or upper.
> 
> Is there any justification on why the complexity is needed here?  It looks
> to me this one is more involved than cpr-reboot, so I'm thinking how much
> we can get from the complexity, and whether it's worthwhile.  1000+ LOC is
> the min support, and if we even expect more to come, that's really
> important, IMHO.
> 
> For example, what's the major motivation of this whole work?  Is that more
> on performance, or is it more for supporting the special devices like VFIO
> which we used to not support, or something else?  I can't find them in
> whatever cover letter I can find, including this one.
> 
> Firstly, regarding performance, IMHO it'll be always nice to share even
> some very fundamental downtime measurement comparisons using the new exec
> mode v.s. the old migration ways to upgrade QEMU binary.  Do you perhaps
> have some number on hand when you started working on this feature years
> ago?  Or maybe some old links on the list would help too, as I didn't
> follow this work since the start.
> 
> On VFIO, IIUC you started out this project without VFIO migration being
> there.  Now we have VFIO migration so not sure how much it would work for
> the upgrade use case. Even with current VFIO migration, we may not want to
> migrate device states for a local upgrade I suppose, as that can be a lot
> depending on the type of device assigned.  However it'll be nice to discuss
> this too if this is the major purpose of the series.
> 
> I think one other challenge on QEMU upgrade with VFIO devices is that the
> dest QEMU won't be able to open the VFIO device when the src QEMU is still
> using it as the owner.  IIUC this is a similar condition where QEMU wants
> to have proper ownership transfer of a shared block device, and AFAIR right
> now we resolved that issue using some form of file lock on the image file.
> In this case it won't easily apply to a VFIO dev fd, but maybe we still
> have other approaches, not sure whether you investigated any.  E.g. could
> the VFIO handle be passed over using unix scm rights?  I think this might
> remove one dependency of using exec which can cause quite some difference
> v.s. a generic migration (from which regard, cpr-reboot is still a pretty
> generic migration).
> 
> You also mentioned vhost/tap, is that also a major goal of this series in
> the follow up patchsets?  Is this a problem only because this solution will
> do exec?  Can it work if either the exec()ed qemu or dst qemu create the
> vhost/tap fds when boot?
> 
> Meanwhile, could you elaborate a bit on the implication on chardevs?  From
> what I read in the doc update it looks like a major part of work in the
> future, but I don't yet understand the issue..  Is it also relevant to the
> exec() approach?
> 
> In all cases, some of such discussion would be really appreciated.  And if
> you used to consider other approaches to solve this problem it'll be great
> to mention how you chose this way.  Considering this work contains too many
> things, it'll be nice if such discussion can start with the fundamentals,
> e.g. on why exec() is a must.

The main goal of cpr-exec is providing a fast and reliable way to update
qemu. cpr-reboot is not fast enough or general enough.  It requires the
guest to support suspend and resume for all devices, and that takes seconds.
If one actually reboots the host, that adds more seconds, depending on
system services.  cpr-exec takes 0.1 secs, and works every time, unlike
live migration, which can fail to converge on a busy system.  Live migration
also consumes more system and network resources.  cpr-exec seamlessly
preserves client connections by preserving chardevs, and overall provides
a much nicer user experience.

chardev's are preserved by keeping their fd open across the exec, and
remembering the value of the fd in precreate vmstate so that new qemu
can associate the fd with the chardev rather than opening a new one.

The approach of preserving open file descriptors is very general and applicable
to all kinds of devices, regardless of whether they support live migration
in hardware.  Device fd's are preserved using the same mechanism as for
chardevs.
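
The fd-preservation step itself is small.  A minimal sketch, using
hypothetical sketch_* helper names and only standard fcntl calls: an fd
survives exec only if its close-on-exec flag is cleared first.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Clear FD_CLOEXEC so fd stays open across exec.  Returns 0 or -1. */
static int sketch_clear_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags < 0) {
        return -1;
    }
    return fcntl(fd, F_SETFD, flags & ~FD_CLOEXEC) < 0 ? -1 : 0;
}

/* Return 1 if fd is close-on-exec, 0 if not, -1 on error. */
static int sketch_is_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags < 0) {
        return -1;
    }
    return (flags & FD_CLOEXEC) != 0;
}
```

The fd number is unchanged by exec, so recording that number in the
precreate state is enough for the new process to find the descriptor.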

Devices that support live migration in hardware do not like to live migrate
in place to the same node.  It is not what they are designed for, and some
implementations will flat out fail because the source and target interfaces
are the same.

For vhost/tap, sometimes the management layer opens the dev and passes an
fd to qemu, and sometimes qemu opens the dev.  The upcoming vhost/tap support
allows both.  For the case where qemu opens the dev, the fd is preserved
using the same mechanism as for chardevs.

The fundamental requirements of this work are:
   - precreate vmstate
   - preserve open file descriptors

Direct exec from old to new qemu is not a hard requirement.   However,
it is simple, with few complications, and works with Oracle's cloud
containers, so it is the method I am most interested in finishing first.

I believe everything could also be made to work by using SCM_RIGHTS to
send fd's to a new qemu process that is started by some external means.
It would be requested with MIG_MODE_CPR_SCM (or some better name), and
would co-exist with MIG_MODE_CPR_EXEC.
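
For comparison, passing a descriptor over a unix socket looks roughly like
the sketch below.  The sketch_* names are hypothetical and nothing here is
taken from the series; it only shows the standard SCM_RIGHTS mechanics of
sending one fd along with one byte of ordinary data.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer sized and aligned for exactly one int-sized fd. */
union sketch_ctl {
    struct cmsghdr align;
    char buf[CMSG_SPACE(sizeof(int))];
};

/* Send fd over the unix socket sock.  Returns 0 on success, -1 on error. */
static int sketch_send_fd(int sock, int fd)
{
    char byte = 'F';   /* at least one data byte must accompany the fd */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union sketch_ctl u;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    memset(&u, 0, sizeof(u));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    assert(cmsg != NULL);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one fd from sock.  Returns the new fd, or -1 on error. */
static int sketch_recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union sketch_ctl u;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd;

    memset(&u, 0, sizeof(u));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    if (recvmsg(sock, &msg, 0) != 1) {
        return -1;
    }
    cmsg = CMSG_FIRSTHDR(&msg);
    if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int))) {
        return -1;
    }
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

Unlike the exec case, the receiver gets a new fd number, so the state sent
to new qemu would need to identify each fd by name rather than by value.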

- Steve



* Re: [PATCH V1 00/26] Live update: cpr-exec
  2024-05-28 15:10           ` Steven Sistare via
@ 2024-05-28 16:42             ` Peter Xu
  2024-05-30 17:17               ` Steven Sistare via
  0 siblings, 1 reply; 122+ messages in thread
From: Peter Xu @ 2024-05-28 16:42 UTC (permalink / raw)
  To: Steven Sistare; +Cc: Fabiano Rosas, QEMU Developers

On Tue, May 28, 2024 at 11:10:27AM -0400, Steven Sistare wrote:
> On 5/27/2024 1:45 PM, Peter Xu wrote:
> > On Tue, May 21, 2024 at 07:46:12AM -0400, Steven Sistare wrote:
> > > I understand, thanks.  If I can help with any of your todo list,
> > > just ask - steve
> > 
> > Thanks for offering the help, Steve.  Started looking at this today, then I
> > found that I miss something high-level.  Let me ask here, and let me
> > apologize already for starting to throw multiple questions..
> > 
> > IIUC the whole idea of this patchset is to allow efficient QEMU upgrade, in
> > this case not host kernel but QEMU-only, and/or upper.
> > 
> > Is there any justification on why the complexity is needed here?  It looks
> > to me this one is more involved than cpr-reboot, so I'm thinking how much
> > we can get from the complexity, and whether it's worthwhile.  1000+ LOC is
> > the min support, and if we even expect more to come, that's really
> > important, IMHO.
> > 
> > For example, what's the major motivation of this whole work?  Is that more
> > on performance, or is it more for supporting the special devices like VFIO
> > which we used to not support, or something else?  I can't find them in
> > whatever cover letter I can find, including this one.
> > 
> > Firstly, regarding performance, IMHO it'll be always nice to share even
> > some very fundamental downtime measurement comparisons using the new exec
> > mode v.s. the old migration ways to upgrade QEMU binary.  Do you perhaps
> > have some number on hand when you started working on this feature years
> > ago?  Or maybe some old links on the list would help too, as I didn't
> > follow this work since the start.
> > 
> > On VFIO, IIUC you started out this project without VFIO migration being
> > there.  Now we have VFIO migration so not sure how much it would work for
> > the upgrade use case. Even with current VFIO migration, we may not want to
> > migrate device states for a local upgrade I suppose, as that can be a lot
> > depending on the type of device assigned.  However it'll be nice to discuss
> > this too if this is the major purpose of the series.
> > 
> > I think one other challenge on QEMU upgrade with VFIO devices is that the
> > dest QEMU won't be able to open the VFIO device when the src QEMU is still
> > using it as the owner.  IIUC this is a similar condition where QEMU wants
> > to have proper ownership transfer of a shared block device, and AFAIR right
> > now we resolved that issue using some form of file lock on the image file.
> > In this case it won't easily apply to a VFIO dev fd, but maybe we still
> > have other approaches, not sure whether you investigated any.  E.g. could
> > the VFIO handle be passed over using unix scm rights?  I think this might
> > remove one dependency of using exec which can cause quite some difference
> > v.s. a generic migration (from which regard, cpr-reboot is still a pretty
> > generic migration).
> > 
> > You also mentioned vhost/tap, is that also a major goal of this series in
> > the follow up patchsets?  Is this a problem only because this solution will
> > do exec?  Can it work if either the exec()ed qemu or dst qemu create the
> > vhost/tap fds when boot?
> > 
> > Meanwhile, could you elaborate a bit on the implication on chardevs?  From
> > what I read in the doc update it looks like a major part of work in the
> > future, but I don't yet understand the issue..  Is it also relevant to the
> > exec() approach?
> > 
> > In all cases, some of such discussion would be really appreciated.  And if
> > you used to consider other approaches to solve this problem it'll be great
> > to mention how you chose this way.  Considering this work contains too many
> > things, it'll be nice if such discussion can start with the fundamentals,
> > e.g. on why exec() is a must.
> 
> The main goal of cpr-exec is providing a fast and reliable way to update
> qemu. cpr-reboot is not fast enough or general enough.  It requires the
> guest to support suspend and resume for all devices, and that takes seconds.
> If one actually reboots the host, that adds more seconds, depending on
> system services.  cpr-exec takes 0.1 secs, and works every time, unlike
> live migration, which can fail to converge on a busy system.  Live migration
> also consumes more system and network resources.

Right, but note that when I was thinking of a comparison between cpr-exec
v.s. normal migration, I didn't mean a "normal live migration".  I think
it's more of the case whether exec() can be avoided.  I had a feeling that
this exec() will cause a major part of work elsewhere but maybe I am wrong
as I didn't see the whole branch.

AFAIU, "cpr-exec takes 0.1 secs" is a conditional result.  I think it at
least should be relevant to what devices are attached to the VM, right?

E.g., I observed at least two things that can drastically enlarge the
blackout window:

  1) vcpu save/load sometimes can take a ridiculously long time, even if 99%
  of them are fine.  I still didn't spend time looking at this issue, but
  the outlier (of a single cpu save/load, while I don't remember whether
  it's save or load, both will contribute to the downtime anyway) can cause
  100+ms already for that single vcpu.  It'll already get more than 0.1sec.

  2) virtio device loads can be sometimes very slow due to virtqueue
  manipulations.  We used to have developers working in this area,
  e.g. this thread:

  https://lore.kernel.org/r/20230317081904.24389-1-xuchuangxclwt@bytedance.com

  I haven't yet had time to look further.  Since you mentioned vhost I was
  wondering whether you hit similar issues, and if not, why not.  IIRC it
  was only during VM loads so dest QEMU only.  Again that'll contribute to
  the overall downtime too and that can also be 100ms or more, but that may
  depend on VM memory topology and device setup.

When we compare the solutions, we definitely don't need to make it "live":
it could be a migration starting with VM paused already, skipping all dirty
tracking just like cpr-reboot, but in this case it can be a relatively
normal migration, so that we still invoke the new qemu binary and load that
on the fly, perhaps taking the fds via scm rights.  Then compare these two
solutions with/without exec().  Note that I'm not requesting such data;
it's not fair if it first takes a lot of work to implement such an
idea, but what I wanted to say is that it might be interesting to first
analyze what caused the downtime, and whether that can be logically
resolved too without exec(); hence the below question on "why exec()" in
the first place, as I still feel like that's somewhere we should avoid
unless extremely necessary..

> cpr-exec seamlessly preserves client connections by preserving chardevs,
> and overall provides a much nicer user experience.

I see.  However this is a common issue to migration, am I right?  I mean,
if we have some chardevs on src host, then we migrate the VM from src to
dst, then a reconnect will be needed anyway.  It looks to me that as long
as the old live migration is supported, there's already a solution and apps
are ok with reconnecting to the new ports.  From that POV, I am curious
whether this can be seen as (kind of separate) work besides cpr-exec,
though perhaps as a new feature that is only valid for cpr-exec?

Meanwhile, could you elaborate on what the major improvement in user
experience would be with the new solution?

> 
> chardev's are preserved by keeping their fd open across the exec, and
> remembering the value of the fd in precreate vmstate so that new qemu
> can associate the fd with the chardev rather than opening a new one.
> 
> The approach of preserving open file descriptors is very general and applicable
> to all kinds of devices, regardless of whether they support live migration
> in hardware.  Device fd's are preserved using the same mechanism as for
> chardevs.
> 
> Devices that support live migration in hardware do not like to live migrate
> in place to the same node.  It is not what they are designed for, and some
> implementations will flat out fail because the source and target interfaces
> are the same.
> 
> For vhost/tap, sometimes the management layer opens the dev and passes an
> fd to qemu, and sometimes qemu opens the dev.  The upcoming vhost/tap support
> allows both.  For the case where qemu opens the dev, the fd is preserved
> using the same mechanism as for chardevs.
> 
> The fundamental requirements of this work are:
>   - precreate vmstate
>   - preserve open file descriptors
> 
> Direct exec from old to new qemu is not a hard requirement.

Great to know..

> However, it is simple, with few complications, and works with Oracle's
> cloud containers, so it is the method I am most interested in finishing
> first.
> 
> I believe everything could also be made to work by using SCM_RIGHTS to
> send fd's to a new qemu process that is started by some external means.
> It would be requested with MIG_MODE_CPR_SCM (or some better name), and
> would co-exist with MIG_MODE_CPR_EXEC.

That sounds like a better thing to me, so that the live migration framework
is not changed as drastically.  I just still feel like exec() is too powerful, and
evil can reside, just like black magic in the fairy tales; magicians try to
avoid using it unless extremely necessary.

I think the next step for my review is to understand what is implied with
exec().  I'll wait for you to push your tree somewhere so maybe I can read
that and understand better.  A base commit would work too if you can share
so I can apply the series, as it doesn't seem to apply to master now.

Thanks,

-- 
Peter Xu




* Re: [PATCH V1 07/26] migration: VMStateId
  2024-05-28 15:10     ` Steven Sistare via
@ 2024-05-28 17:44       ` Peter Xu
  2024-05-29 17:30         ` Steven Sistare via
  0 siblings, 1 reply; 122+ messages in thread
From: Peter Xu @ 2024-05-28 17:44 UTC (permalink / raw)
  To: Steven Sistare
  Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Igor Mammedov,
	Eduardo Habkost, Marcel Apfelbaum, Philippe Mathieu-Daude,
	Paolo Bonzini, Daniel P. Berrange, Markus Armbruster

On Tue, May 28, 2024 at 11:10:03AM -0400, Steven Sistare via wrote:
> On 5/27/2024 2:20 PM, Peter Xu wrote:
> > On Mon, Apr 29, 2024 at 08:55:16AM -0700, Steve Sistare wrote:
> > > Define a type for the 256 byte id string to guarantee the same length is
> > > used and enforced everywhere.
> > > 
> > > Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> > > ---
> > >   include/exec/ramblock.h     | 3 ++-
> > >   include/migration/vmstate.h | 2 ++
> > >   migration/savevm.c          | 8 ++++----
> > >   migration/vmstate.c         | 3 ++-
> > >   4 files changed, 10 insertions(+), 6 deletions(-)
> > > 
> > > diff --git a/include/exec/ramblock.h b/include/exec/ramblock.h
> > > index 0babd10..61deefe 100644
> > > --- a/include/exec/ramblock.h
> > > +++ b/include/exec/ramblock.h
> > > @@ -23,6 +23,7 @@
> > >   #include "cpu-common.h"
> > >   #include "qemu/rcu.h"
> > >   #include "exec/ramlist.h"
> > > +#include "migration/vmstate.h"
> > >   struct RAMBlock {
> > >       struct rcu_head rcu;
> > > @@ -35,7 +36,7 @@ struct RAMBlock {
> > >       void (*resized)(const char*, uint64_t length, void *host);
> > >       uint32_t flags;
> > >       /* Protected by the BQL.  */
> > > -    char idstr[256];
> > > +    VMStateId idstr;
> > >       /* RCU-enabled, writes protected by the ramlist lock */
> > >       QLIST_ENTRY(RAMBlock) next;
> > >       QLIST_HEAD(, RAMBlockNotifier) ramblock_notifiers;
> > 
> > Hmm.. Don't look like a good idea to include a migration header in
> > ramblock.h?  Is this ramblock change needed for this work?
> 
> Well, entities that are migrated include migration headers, and now that
> includes RAMBlock.  There is precedent:
> 
> 0 include/exec/ramblock.h   26 #include "migration/vmstate.h"
> 1 include/hw/acpi/ich9_tco. 14 #include "migration/vmstate.h"
> 2 include/hw/display/ramfb.  4 #include "migration/vmstate.h"
> 3 include/hw/hyperv/vmbus.h 16 #include "migration/vmstate.h"
> 4 include/hw/input/pl050.h  14 #include "migration/vmstate.h"
> 5 include/hw/pci/shpc.h      7 #include "migration/vmstate.h"
> 6 include/hw/virtio/virtio. 20 #include "migration/vmstate.h"
> 7 include/migration/cpu.h    8 #include "migration/vmstate.h"
> 
> Granted, only some of the C files that include ramblock.h need all of vmstate.h.
> I could define VMStateId in a smaller file such as migration/misc.h, or a
> new file migration/vmstateid.h, and include that in ramblock.h.
> Any preference?

One issue here is currently the idstr[] of ramblock is a verbose name of
the ramblock, and logically it doesn't need to have anything to do with
vmstate.

I'll continue to read to see why we need VMStateId here for a ramblock.  So
if it is necessary, maybe the name VMStateId used here is misleading?
It'll also be nice to separate the changes, as ramblock.h doesn't belong to
the migration subsystem but to the memory API.

Thanks,

-- 
Peter Xu




* Re: [PATCH V1 08/26] migration: vmstate_info_void_ptr
  2024-05-28 15:10     ` Steven Sistare via
@ 2024-05-28 18:21       ` Peter Xu
  2024-05-29 17:30         ` Steven Sistare via
  0 siblings, 1 reply; 122+ messages in thread
From: Peter Xu @ 2024-05-28 18:21 UTC (permalink / raw)
  To: Steven Sistare
  Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Igor Mammedov,
	Eduardo Habkost, Marcel Apfelbaum, Philippe Mathieu-Daude,
	Paolo Bonzini, Daniel P. Berrange, Markus Armbruster

On Tue, May 28, 2024 at 11:10:16AM -0400, Steven Sistare wrote:
> On 5/27/2024 2:31 PM, Peter Xu wrote:
> > On Mon, Apr 29, 2024 at 08:55:17AM -0700, Steve Sistare wrote:
> > > Define VMSTATE_VOID_PTR so the value of a pointer (but not its target)
> > > can be saved in the migration stream.  This will be needed for CPR.
> > > 
> > > Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> > 
> > This is really tricky.
> > 
> >  From a first glance, I don't think migrating a VA is valid at all for
> > migration even if with exec.. and looks insane to me for a cross-process
> > migration, which seems to be allowed to use as a generic VMSD helper.. as
> > VA is the address space barrier for different processes and I think it
> > normally even apply to generic execve(), and we're trying to jailbreak for
> > some reason..
> > 
> > It definitely won't work for any generic migration as sizeof(void*) can be
> > different afaict between hosts, e.g. 32bit -> 64bit migrations.
> > 
> > Some description would be really helpful in this commit message,
> > e.g. explain the users and why.  Do we need to poison that for generic VMSD
> > use (perhaps with prefixed underscores)?  I think I'll need to read on the
> > rest to tell..
> 
> Short answer: we never dereference the void* in the new process.  And must not.
> 
> Longer answer:
> 
> During CPR for vfio, each mapped DMA region is re-registered in the new
> process using the new VA.  The ioctl to re-register identifies the mapping
> by IOVA and length.
> 
> The same requirement holds for CPR of iommufd devices.  However, in the
> iommufd framework, IOVA does not uniquely identify a dma mapping, and we
> need to use the old VA as the unique identifier.  The new process
> re-registers each mapping, passing the old VA and new VA to the kernel.
> The old VA is never dereferenced in the new process, we just need its value.
> 
> I suspected that the void* which must not be dereferenced might make people
> uncomfortable.  I have an older version of my code which adds a uint64_t
> field to RAMBlock for recording and migrating the old VA.  The saving and
> loading code is slightly less elegant, but no big deal.  Would you prefer
> that?

I see, thanks for explaining.  Yes that sounds better to me.  Re the
ugliness: is that about a pre_save() plus one extra uint64_t field?  In
that case it looks better compared to migrating "void*".

I'm trying to read some context on the vaddr remap thing from you, and I
found this:

https://lore.kernel.org/all/Y90bvBnrvRAcEQ%2F%2F@nvidia.com/

So it will work with iommufd now?  Meanwhile, what's the status for mdev?
Looks like it isn't supported yet for both.

Thanks,

-- 
Peter Xu




* Re: [PATCH V1 17/26] machine: memfd-alloc option
  2024-04-29 15:55 ` [PATCH V1 17/26] machine: memfd-alloc option Steve Sistare
@ 2024-05-28 21:12   ` Peter Xu
  2024-05-29 17:31     ` Steven Sistare via
  0 siblings, 1 reply; 122+ messages in thread
From: Peter Xu @ 2024-05-28 21:12 UTC (permalink / raw)
  To: Steve Sistare
  Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Igor Mammedov,
	Eduardo Habkost, Marcel Apfelbaum, Philippe Mathieu-Daude,
	Paolo Bonzini, Daniel P. Berrange, Markus Armbruster

On Mon, Apr 29, 2024 at 08:55:26AM -0700, Steve Sistare wrote:
> Allocate anonymous memory using memfd_create if the memfd-alloc machine
> option is set.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  hw/core/machine.c   | 22 ++++++++++++++++++++++
>  include/hw/boards.h |  1 +
>  qemu-options.hx     |  6 ++++++
>  system/memory.c     |  9 ++++++---
>  system/physmem.c    | 18 +++++++++++++++++-
>  system/trace-events |  1 +
>  6 files changed, 53 insertions(+), 4 deletions(-)
> 
> diff --git a/hw/core/machine.c b/hw/core/machine.c
> index 582c2df..9567b97 100644
> --- a/hw/core/machine.c
> +++ b/hw/core/machine.c
> @@ -443,6 +443,20 @@ static void machine_set_mem_merge(Object *obj, bool value, Error **errp)
>      ms->mem_merge = value;
>  }
>  
> +static bool machine_get_memfd_alloc(Object *obj, Error **errp)
> +{
> +    MachineState *ms = MACHINE(obj);
> +
> +    return ms->memfd_alloc;
> +}
> +
> +static void machine_set_memfd_alloc(Object *obj, bool value, Error **errp)
> +{
> +    MachineState *ms = MACHINE(obj);
> +
> +    ms->memfd_alloc = value;
> +}
> +
>  static bool machine_get_usb(Object *obj, Error **errp)
>  {
>      MachineState *ms = MACHINE(obj);
> @@ -1044,6 +1058,11 @@ static void machine_class_init(ObjectClass *oc, void *data)
>      object_class_property_set_description(oc, "mem-merge",
>          "Enable/disable memory merge support");
>  
> +    object_class_property_add_bool(oc, "memfd-alloc",
> +        machine_get_memfd_alloc, machine_set_memfd_alloc);
> +    object_class_property_set_description(oc, "memfd-alloc",
> +        "Enable/disable allocating anonymous memory using memfd_create");
> +
>      object_class_property_add_bool(oc, "usb",
>          machine_get_usb, machine_set_usb);
>      object_class_property_set_description(oc, "usb",
> @@ -1387,6 +1406,9 @@ static bool create_default_memdev(MachineState *ms, const char *path, Error **er
>      if (!object_property_set_int(obj, "size", ms->ram_size, errp)) {
>          goto out;
>      }
> +    if (!object_property_set_bool(obj, "share", ms->memfd_alloc, errp)) {
> +        goto out;
> +    }
>      object_property_add_child(object_get_objects_root(), mc->default_ram_id,
>                                obj);
>      /* Ensure backend's memory region name is equal to mc->default_ram_id */
> diff --git a/include/hw/boards.h b/include/hw/boards.h
> index 69c1ba4..96259c3 100644
> --- a/include/hw/boards.h
> +++ b/include/hw/boards.h
> @@ -372,6 +372,7 @@ struct MachineState {
>      bool dump_guest_core;
>      bool mem_merge;
>      bool require_guest_memfd;
> +    bool memfd_alloc;
>      bool usb;
>      bool usb_disabled;
>      char *firmware;
> diff --git a/qemu-options.hx b/qemu-options.hx
> index cf61f6b..f0dfda5 100644
> --- a/qemu-options.hx
> +++ b/qemu-options.hx
> @@ -32,6 +32,7 @@ DEF("machine", HAS_ARG, QEMU_OPTION_machine, \
>      "                vmport=on|off|auto controls emulation of vmport (default: auto)\n"
>      "                dump-guest-core=on|off include guest memory in a core dump (default=on)\n"
>      "                mem-merge=on|off controls memory merge support (default: on)\n"
> +    "                memfd-alloc=on|off controls allocating anonymous guest RAM using memfd_create (default: off)\n"
>      "                aes-key-wrap=on|off controls support for AES key wrapping (default=on)\n"
>      "                dea-key-wrap=on|off controls support for DEA key wrapping (default=on)\n"
>      "                suppress-vmdesc=on|off disables self-describing migration (default=off)\n"
> @@ -79,6 +80,11 @@ SRST
>          supported by the host, de-duplicates identical memory pages
>          among VMs instances (enabled by default).
>  
> +    ``memfd-alloc=on|off``
> +        Enables or disables allocation of anonymous guest RAM using
> +        memfd_create.  Any associated memory-backend objects are created with
> +        share=on.  The memfd-alloc default is off.
> +
>      ``aes-key-wrap=on|off``
>          Enables or disables AES key wrapping support on s390-ccw hosts.
>          This feature controls whether AES wrapping keys will be created
> diff --git a/system/memory.c b/system/memory.c
> index 49f1cb2..ca04a0e 100644
> --- a/system/memory.c
> +++ b/system/memory.c
> @@ -1552,8 +1552,9 @@ bool memory_region_init_ram_nomigrate(MemoryRegion *mr,
>                                        uint64_t size,
>                                        Error **errp)
>  {
> +    uint32_t flags = current_machine->memfd_alloc ? RAM_SHARED : 0;

If there's a machine option to "use memfd for allocations", then it's
shared mem... Hmm..

It is a bit confusing to me on quite a few levels:

  - Why is the memory allocation method defined by a machine property,
    when we have memory-backend-* which should cover everything?

  - Even if we have such a machine property, why does setting "memfd"
    always imply shared?  Why not private?  After all it's not called
    "memfd-shared-alloc", and we can create private mappings using
    e.g. memory-backend-memfd,share=off.

>      return memory_region_init_ram_flags_nomigrate(mr, owner, name,
> -                                                  size, 0, errp);
> +                                                  size, flags, errp);
>  }
>  
>  bool memory_region_init_ram_flags_nomigrate(MemoryRegion *mr,
> @@ -1713,8 +1714,9 @@ bool memory_region_init_rom_nomigrate(MemoryRegion *mr,
>                                        uint64_t size,
>                                        Error **errp)
>  {
> +    uint32_t flags = current_machine->memfd_alloc ? RAM_SHARED : 0;
>      if (!memory_region_init_ram_flags_nomigrate(mr, owner, name,
> -                                                size, 0, errp)) {
> +                                                size, flags, errp)) {
>           return false;
>      }
>      mr->readonly = true;
> @@ -1731,6 +1733,7 @@ bool memory_region_init_rom_device_nomigrate(MemoryRegion *mr,
>                                               Error **errp)
>  {
>      Error *err = NULL;
> +    uint32_t flags = current_machine->memfd_alloc ? RAM_SHARED : 0;
>      assert(ops);
>      memory_region_init(mr, owner, name, size);
>      mr->ops = ops;
> @@ -1738,7 +1741,7 @@ bool memory_region_init_rom_device_nomigrate(MemoryRegion *mr,
>      mr->terminates = true;
>      mr->rom_device = true;
>      mr->destructor = memory_region_destructor_ram;
> -    mr->ram_block = qemu_ram_alloc(size, 0, mr, &err);
> +    mr->ram_block = qemu_ram_alloc(size, flags, mr, &err);
>      if (err) {
>          mr->size = int128_zero();
>          object_unparent(OBJECT(mr));
> diff --git a/system/physmem.c b/system/physmem.c
> index c736af5..36d97ec 100644
> --- a/system/physmem.c
> +++ b/system/physmem.c
> @@ -45,6 +45,7 @@
>  #include "qemu/qemu-print.h"
>  #include "qemu/log.h"
>  #include "qemu/memalign.h"
> +#include "qemu/memfd.h"
>  #include "exec/memory.h"
>  #include "exec/ioport.h"
>  #include "sysemu/dma.h"
> @@ -1825,6 +1826,19 @@ static void *ram_block_alloc_host(RAMBlock *rb, Error **errp)
>      if (xen_enabled()) {
>          xen_ram_alloc(rb->offset, rb->max_length, mr, errp);
>  
> +    } else if (rb->flags & RAM_SHARED) {
> +        if (rb->fd == -1) {
> +            mr->align = QEMU_VMALLOC_ALIGN;
> +            rb->fd = qemu_memfd_create(rb->idstr, rb->max_length + mr->align,
> +                                       0, 0, 0, errp);
> +        }

We used to have a case where RAM_SHARED && rb->fd == -1, I think.

We have some code that explicitly checks rb->fd against -1 to know
whether a block is fd based.  I'm not sure whether this change will
affect that code.

Maybe it's mostly fine; OTOH I worry more about the whole idea.  I'm not
sure whether this is relevant to "we want to be able to share the mem with
the new process"; in that case, can we simply require the user to use file
based memory backends, rather than making such a change?

Thanks,

> +        if (rb->fd >= 0) {
> +            int mfd = rb->fd;
> +            qemu_set_cloexec(mfd);
> +            host = file_ram_alloc(rb, rb->max_length, mfd, false, 0, errp);
> +            trace_qemu_anon_memfd_alloc(rb->idstr, rb->max_length, mfd, host);
> +        }
> +
>      } else {
>          host = qemu_anon_ram_alloc(rb->max_length, &mr->align,
>                                     qemu_ram_is_shared(rb),
> @@ -2106,8 +2120,10 @@ RAMBlock *qemu_ram_alloc_resizeable(ram_addr_t size, ram_addr_t maxsz,
>                                                       void *host),
>                                       MemoryRegion *mr, Error **errp)
>  {
> +    uint32_t flags = current_machine->memfd_alloc ? RAM_SHARED : 0;
> +    flags |= RAM_RESIZEABLE;
>      return qemu_ram_alloc_internal(size, maxsz, resized, NULL,
> -                                   RAM_RESIZEABLE, mr, errp);
> +                                   flags, mr, errp);
>  }
>  
>  static void reclaim_ramblock(RAMBlock *block)
> diff --git a/system/trace-events b/system/trace-events
> index f0a80ba..0092734 100644
> --- a/system/trace-events
> +++ b/system/trace-events
> @@ -41,3 +41,4 @@ dirtylimit_vcpu_execute(int cpu_index, int64_t sleep_time_us) "CPU[%d] sleep %"P
>  
>  # physmem.c
>  ram_block_create(const char *name, uint32_t flags, int fd, size_t used_length, size_t max_length, size_t align) "%s, flags %u, fd %d, len %lu, maxlen %lu, align %lu"
> +qemu_anon_memfd_alloc(const char *name, size_t size, int fd, void *ptr) "%s size %zu fd %d -> %p"
> -- 
> 1.8.3.1
> 

-- 
Peter Xu




* Re: [PATCH V1 19/26] physmem: preserve ram blocks for cpr
  2024-04-29 15:55 ` [PATCH V1 19/26] physmem: preserve ram blocks for cpr Steve Sistare
@ 2024-05-28 21:44   ` Peter Xu
  2024-05-29 17:31     ` Steven Sistare via
  0 siblings, 1 reply; 122+ messages in thread
From: Peter Xu @ 2024-05-28 21:44 UTC (permalink / raw)
  To: Steve Sistare
  Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Igor Mammedov,
	Eduardo Habkost, Marcel Apfelbaum, Philippe Mathieu-Daude,
	Paolo Bonzini, Daniel P. Berrange, Markus Armbruster

On Mon, Apr 29, 2024 at 08:55:28AM -0700, Steve Sistare wrote:
> Preserve fields of RAMBlocks that allocate their host memory during CPR so
> the RAM allocation can be recovered.

This sentence by itself does not explain much, IMHO.  QEMU can already
share memory of all kinds using fd based memory; as long as the memory
backend is path-based, it can be shared by passing the same paths to dst.

This reads as very confusing as a generic concept.  I mean, QEMU migration
relies on so many things to work right.  We mostly ask users to "use
exactly the same cmdline for src/dst QEMU unless you know what you're
doing", otherwise many things can break.  That should also include the
ramblocks matching between src/dst due to the same cmdlines provided on
both sides.  It'll be confusing to mention this when we thought the
ramblocks also rely on that fact.

So IIUC this sentence should be dropped in the real patch, and I'll try to
guess the real reason below..

> Mirror the mr->align field in the RAMBlock to simplify the vmstate.
> Preserve the old host address, even though it is immediately discarded,
> as it will be needed in the future for CPR with iommufd.  Preserve
> guest_memfd, even though CPR does not yet support it, to maintain vmstate
> compatibility when it becomes supported.

.. It could be about the vfio vaddr update feature that you mentioned, and
only for iommufd (as IIUC vfio still relies on iova ranges, so it won't
help here)?

If so, IMHO this patch (or any variant of it) should go together with
your upcoming vfio support.  Keeping it around like this will make
the series harder to review.  Or is it needed even before VFIO?

Another thing to ask: does this idea also rely on some future
iommufd kernel support?  If there's anything that's not merged in current
Linux upstream, this series needs to be marked as RFC, so it's not a target
for merging.  This will also be true if this patch is "preparing" for that
work.  It means that if this patch only serves the iommufd purpose, even if
it doesn't require any kernel header to be referenced, we should only merge
it together with the full iommufd support that comes later (and that'll be
after the iommufd kernel support lands).

Thanks,

-- 
Peter Xu




* Re: [PATCH V1 07/26] migration: VMStateId
  2024-05-28 17:44       ` Peter Xu
@ 2024-05-29 17:30         ` Steven Sistare via
  2024-05-29 18:53           ` Peter Xu
  0 siblings, 1 reply; 122+ messages in thread
From: Steven Sistare via @ 2024-05-29 17:30 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Igor Mammedov,
	Eduardo Habkost, Marcel Apfelbaum, Philippe Mathieu-Daude,
	Paolo Bonzini, Daniel P. Berrange, Markus Armbruster

On 5/28/2024 1:44 PM, Peter Xu wrote:
> On Tue, May 28, 2024 at 11:10:03AM -0400, Steven Sistare via wrote:
>> On 5/27/2024 2:20 PM, Peter Xu wrote:
>>> On Mon, Apr 29, 2024 at 08:55:16AM -0700, Steve Sistare wrote:
>>>> Define a type for the 256 byte id string to guarantee the same length is
>>>> used and enforced everywhere.
>>>>
>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>> ---
>>>>    include/exec/ramblock.h     | 3 ++-
>>>>    include/migration/vmstate.h | 2 ++
>>>>    migration/savevm.c          | 8 ++++----
>>>>    migration/vmstate.c         | 3 ++-
>>>>    4 files changed, 10 insertions(+), 6 deletions(-)
>>>>
>>>> diff --git a/include/exec/ramblock.h b/include/exec/ramblock.h
>>>> index 0babd10..61deefe 100644
>>>> --- a/include/exec/ramblock.h
>>>> +++ b/include/exec/ramblock.h
>>>> @@ -23,6 +23,7 @@
>>>>    #include "cpu-common.h"
>>>>    #include "qemu/rcu.h"
>>>>    #include "exec/ramlist.h"
>>>> +#include "migration/vmstate.h"
>>>>    struct RAMBlock {
>>>>        struct rcu_head rcu;
>>>> @@ -35,7 +36,7 @@ struct RAMBlock {
>>>>        void (*resized)(const char*, uint64_t length, void *host);
>>>>        uint32_t flags;
>>>>        /* Protected by the BQL.  */
>>>> -    char idstr[256];
>>>> +    VMStateId idstr;
>>>>        /* RCU-enabled, writes protected by the ramlist lock */
>>>>        QLIST_ENTRY(RAMBlock) next;
>>>>        QLIST_HEAD(, RAMBlockNotifier) ramblock_notifiers;
>>>
>>> Hmm.. Don't look like a good idea to include a migration header in
>>> ramblock.h?  Is this ramblock change needed for this work?
>>
>> Well, entities that are migrated include migration headers, and now that
>> includes RAMBlock.  There is precedent:
>>
>> 0 include/exec/ramblock.h   26 #include "migration/vmstate.h"
>> 1 include/hw/acpi/ich9_tco. 14 #include "migration/vmstate.h"
>> 2 include/hw/display/ramfb.  4 #include "migration/vmstate.h"
>> 3 include/hw/hyperv/vmbus.h 16 #include "migration/vmstate.h"
>> 4 include/hw/input/pl050.h  14 #include "migration/vmstate.h"
>> 5 include/hw/pci/shpc.h      7 #include "migration/vmstate.h"
>> 6 include/hw/virtio/virtio. 20 #include "migration/vmstate.h"
>> 7 include/migration/cpu.h    8 #include "migration/vmstate.h"
>>
>> Granted, only some of the C files that include ramblock.h need all of vmstate.h.
>> I could define VMStateId in a smaller file such as migration/misc.h, or a
>> new file migration/vmstateid.h, and include that in ramblock.h.
>> Any preference?
> 
> One issue here is currently the idstr[] of ramblock is a verbose name of
> the ramblock, and logically it doesn't need to have anything to do with
> vmstate.
> 
> I'll continue to read to see why we need VMStateID here for a ramblock.  So
> if it is necessary, maybe the name VMStateID to be used here is misleading?
> It'll also be nice to separate the changes, as ramblock.h doesn't belong to
> migration subsystem but memory api.

cpr requires migrating vmstate for ramblocks.  See the physmem patches for
why.  vmstate requires a unique id, and ramblock already defines a unique
id, which is used to identify the block and dirty bitmap in the migration
stream.

How about a more general name for the type:

migration/misc.h
     typedef char (MigrationId)[256];

exec/ramblock.h
     struct RAMBlock {
         MigrationId idstr;

migration/savevm.c
     typedef struct CompatEntry {
         MigrationId idstr;

     typedef struct SaveStateEntry {
         MigrationId idstr;
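
To illustrate the point of the typedef, here is a minimal sketch, not
code from the series.  It assumes the proposed MigrationId name; the
RAMBlockSketch and SaveStateEntrySketch structs and the
migration_id_set() helper are hypothetical stand-ins.  It shows that a
shared array typedef makes the length identical everywhere, and that
helpers should size copies with sizeof(MigrationId), because an array
parameter decays to a pointer.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Stand-in for the typedef proposed for migration/misc.h. */
typedef char MigrationId[256];

/* Hypothetical, simplified structs; the real ones have many more fields. */
struct RAMBlockSketch {
    MigrationId idstr;
};

struct SaveStateEntrySketch {
    MigrationId idstr;
};

/* Both members share one array type, so their sizes agree at compile time. */
_Static_assert(sizeof(((struct RAMBlockSketch *)0)->idstr) ==
               sizeof(((struct SaveStateEntrySketch *)0)->idstr),
               "id strings must have the same length");

/*
 * Copy src into dst, truncating if needed.  dst decays to char * here,
 * so the bound must come from sizeof(MigrationId), not sizeof(dst).
 * Returns 0 on success, -1 if src did not fit.
 */
static int migration_id_set(MigrationId dst, const char *src)
{
    int n = snprintf(dst, sizeof(MigrationId), "%s", src);

    return (n < 0 || (size_t)n >= sizeof(MigrationId)) ? -1 : 0;
}
```

A caller would use it as migration_id_set(rb.idstr, "pc.ram") and check
the return value to detect truncation.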


- Steve



* Re: [PATCH V1 08/26] migration: vmstate_info_void_ptr
  2024-05-28 18:21       ` Peter Xu
@ 2024-05-29 17:30         ` Steven Sistare via
  0 siblings, 0 replies; 122+ messages in thread
From: Steven Sistare via @ 2024-05-29 17:30 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Igor Mammedov,
	Eduardo Habkost, Marcel Apfelbaum, Philippe Mathieu-Daude,
	Paolo Bonzini, Daniel P. Berrange, Markus Armbruster

On 5/28/2024 2:21 PM, Peter Xu wrote:
> On Tue, May 28, 2024 at 11:10:16AM -0400, Steven Sistare wrote:
>> On 5/27/2024 2:31 PM, Peter Xu wrote:
>>> On Mon, Apr 29, 2024 at 08:55:17AM -0700, Steve Sistare wrote:
>>>> Define VMSTATE_VOID_PTR so the value of a pointer (but not its target)
>>>> can be saved in the migration stream.  This will be needed for CPR.
>>>>
>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>
>>> This is really tricky.
>>>
>>>   From a first glance, I don't think migrating a VA is valid at all for
>>> migration even if with exec.. and looks insane to me for a cross-process
>>> migration, which seems to be allowed to use as a generic VMSD helper.. as
>>> VA is the address space barrier for different processes and I think it
>>> normally even apply to generic execve(), and we're trying to jailbreak for
>>> some reason..
>>>
>>> It definitely won't work for any generic migration as sizeof(void*) can be
>>> different afaict between hosts, e.g. 32bit -> 64bit migrations.
>>>
>>> Some description would be really helpful in this commit message,
>>> e.g. explain the users and why.  Do we need to poison that for generic VMSD
>>> use (perhaps with prefixed underscores)?  I think I'll need to read on the
>>> rest to tell..
>>
>> Short answer: we never dereference the void* in the new process.  And must not.
>>
>> Longer answer:
>>
>> During CPR for vfio, each mapped DMA region is re-registered in the new
>> process using the new VA.  The ioctl to re-register identifies the mapping
>> by IOVA and length.
>>
>> The same requirement holds for CPR of iommufd devices.  However, in the
>> iommufd framework, IOVA does not uniquely identify a dma mapping, and we
>> need to use the old VA as the unique identifier.  The new process
>> re-registers each mapping, passing the old VA and new VA to the kernel.
>> The old VA is never dereferenced in the new process, we just need its value.
>>
>> I suspected that the void* which must not be dereferenced might make people
>> uncomfortable.  I have an older version of my code which adds a uint64_t
>> field to RAMBlock for recording and migrating the old VA.  The saving and
>> loading code is slightly less elegant, but no big deal.  Would you prefer
>> that?
> 
> I see, thanks for explaining.  Yes that sounds better to me.  Re the
> ugliness: is that about a pre_save() plus one extra uint64_t field?  In
> that case it looks better comparing to migrating "void*".

Yes.  Will do.
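
As a rough sketch of the uint64_t alternative agreed on above (an
illustration under assumptions, not the code that will be posted): the
struct OldVaBlock, its old_host_va field, and both helpers are
hypothetical names.  The idea is that only the numeric value of the old
address is recorded, in a fixed-width field, and it is only ever compared,
never dereferenced.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified block; not the real RAMBlock layout. */
struct OldVaBlock {
    void *host;            /* current mapping in this process */
    uint64_t old_host_va;  /* numeric value of host in the old process */
};

/* Would run from a pre_save-style hook, before state is written out. */
static void block_record_old_va(struct OldVaBlock *b)
{
    /* Go through uintptr_t so the conversion is well defined. */
    b->old_host_va = (uint64_t)(uintptr_t)b->host;
}

/* Identity check only: compares numbers, never dereferences the old VA. */
static int block_matches_old_va(const struct OldVaBlock *b, uint64_t va)
{
    return b->old_host_va == va;
}
```

A fixed 64-bit field also sidesteps the sizeof(void *) concern raised
earlier, since the stream format no longer depends on the host pointer
width.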

> I'm trying to read some context on the vaddr remap thing from you, and I
> found this:
> 
> https://lore.kernel.org/all/Y90bvBnrvRAcEQ%2F%2F@nvidia.com/
> 
> So it will work with iommufd now?  Meanwhile, what's the status for mdev?
> Looks like it isn't supported yet for both.

I am currently working on the kernel and qemu code to support iommufd.
It is an RFE on top of this initial cpr-exec work that I will post separately
later.  I do know that it will require the old VA, so I am proposing to
preserve old VA now in the migration stream to avoid adding extra backwards
compatibility code later.

I have prototyped userland code that supports mdev, based on Jason's suggestion to
fork an extra process to handle mdev translations during the transition from old
to new qemu, but it is a work in progress and adds complexity, and I do not plan to
submit that any time soon.  Another RFE.  For now mdev is not supported.

- Steve




* Re: [PATCH V1 17/26] machine: memfd-alloc option
  2024-05-28 21:12   ` Peter Xu
@ 2024-05-29 17:31     ` Steven Sistare via
  2024-05-29 19:14       ` Peter Xu
  2024-06-03 10:17       ` Daniel P. Berrangé
  0 siblings, 2 replies; 122+ messages in thread
From: Steven Sistare via @ 2024-05-29 17:31 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Igor Mammedov,
	Eduardo Habkost, Marcel Apfelbaum, Philippe Mathieu-Daude,
	Paolo Bonzini, Daniel P. Berrange, Markus Armbruster

On 5/28/2024 5:12 PM, Peter Xu wrote:
> On Mon, Apr 29, 2024 at 08:55:26AM -0700, Steve Sistare wrote:
>> Allocate anonymous memory using memfd_create if the memfd-alloc machine
>> option is set.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>>   hw/core/machine.c   | 22 ++++++++++++++++++++++
>>   include/hw/boards.h |  1 +
>>   qemu-options.hx     |  6 ++++++
>>   system/memory.c     |  9 ++++++---
>>   system/physmem.c    | 18 +++++++++++++++++-
>>   system/trace-events |  1 +
>>   6 files changed, 53 insertions(+), 4 deletions(-)
>>
>> diff --git a/hw/core/machine.c b/hw/core/machine.c
>> index 582c2df..9567b97 100644
>> --- a/hw/core/machine.c
>> +++ b/hw/core/machine.c
>> @@ -443,6 +443,20 @@ static void machine_set_mem_merge(Object *obj, bool value, Error **errp)
>>       ms->mem_merge = value;
>>   }
>>   
>> +static bool machine_get_memfd_alloc(Object *obj, Error **errp)
>> +{
>> +    MachineState *ms = MACHINE(obj);
>> +
>> +    return ms->memfd_alloc;
>> +}
>> +
>> +static void machine_set_memfd_alloc(Object *obj, bool value, Error **errp)
>> +{
>> +    MachineState *ms = MACHINE(obj);
>> +
>> +    ms->memfd_alloc = value;
>> +}
>> +
>>   static bool machine_get_usb(Object *obj, Error **errp)
>>   {
>>       MachineState *ms = MACHINE(obj);
>> @@ -1044,6 +1058,11 @@ static void machine_class_init(ObjectClass *oc, void *data)
>>       object_class_property_set_description(oc, "mem-merge",
>>           "Enable/disable memory merge support");
>>   
>> +    object_class_property_add_bool(oc, "memfd-alloc",
>> +        machine_get_memfd_alloc, machine_set_memfd_alloc);
>> +    object_class_property_set_description(oc, "memfd-alloc",
>> +        "Enable/disable allocating anonymous memory using memfd_create");
>> +
>>       object_class_property_add_bool(oc, "usb",
>>           machine_get_usb, machine_set_usb);
>>       object_class_property_set_description(oc, "usb",
>> @@ -1387,6 +1406,9 @@ static bool create_default_memdev(MachineState *ms, const char *path, Error **er
>>       if (!object_property_set_int(obj, "size", ms->ram_size, errp)) {
>>           goto out;
>>       }
>> +    if (!object_property_set_bool(obj, "share", ms->memfd_alloc, errp)) {
>> +        goto out;
>> +    }
>>       object_property_add_child(object_get_objects_root(), mc->default_ram_id,
>>                                 obj);
>>       /* Ensure backend's memory region name is equal to mc->default_ram_id */
>> diff --git a/include/hw/boards.h b/include/hw/boards.h
>> index 69c1ba4..96259c3 100644
>> --- a/include/hw/boards.h
>> +++ b/include/hw/boards.h
>> @@ -372,6 +372,7 @@ struct MachineState {
>>       bool dump_guest_core;
>>       bool mem_merge;
>>       bool require_guest_memfd;
>> +    bool memfd_alloc;
>>       bool usb;
>>       bool usb_disabled;
>>       char *firmware;
>> diff --git a/qemu-options.hx b/qemu-options.hx
>> index cf61f6b..f0dfda5 100644
>> --- a/qemu-options.hx
>> +++ b/qemu-options.hx
>> @@ -32,6 +32,7 @@ DEF("machine", HAS_ARG, QEMU_OPTION_machine, \
>>       "                vmport=on|off|auto controls emulation of vmport (default: auto)\n"
>>       "                dump-guest-core=on|off include guest memory in a core dump (default=on)\n"
>>       "                mem-merge=on|off controls memory merge support (default: on)\n"
>> +    "                memfd-alloc=on|off controls allocating anonymous guest RAM using memfd_create (default: off)\n"
>>       "                aes-key-wrap=on|off controls support for AES key wrapping (default=on)\n"
>>       "                dea-key-wrap=on|off controls support for DEA key wrapping (default=on)\n"
>>       "                suppress-vmdesc=on|off disables self-describing migration (default=off)\n"
>> @@ -79,6 +80,11 @@ SRST
>>           supported by the host, de-duplicates identical memory pages
>>           among VMs instances (enabled by default).
>>   
>> +    ``memfd-alloc=on|off``
>> +        Enables or disables allocation of anonymous guest RAM using
>> +        memfd_create.  Any associated memory-backend objects are created with
>> +        share=on.  The memfd-alloc default is off.
>> +
>>       ``aes-key-wrap=on|off``
>>           Enables or disables AES key wrapping support on s390-ccw hosts.
>>           This feature controls whether AES wrapping keys will be created
>> diff --git a/system/memory.c b/system/memory.c
>> index 49f1cb2..ca04a0e 100644
>> --- a/system/memory.c
>> +++ b/system/memory.c
>> @@ -1552,8 +1552,9 @@ bool memory_region_init_ram_nomigrate(MemoryRegion *mr,
>>                                         uint64_t size,
>>                                         Error **errp)
>>   {
>> +    uint32_t flags = current_machine->memfd_alloc ? RAM_SHARED : 0;
> 
> If there's a machine option to "use memfd for allocations", then it's
> shared mem... Hmm..
>
> It is a bit confusing to me in quite a few levels:
> 
>    - Why memory allocation method will be defined by a machine property,
>      even if we have memory-backend-* which should cover everything?

Some memory regions are implicitly created, and have no explicit representation
on the qemu command line.  memfd-alloc affects those.

More generally, memfd-alloc affects all ramblock allocations that are
not explicitly represented by a memory-backend object.  Thus the simple
command line "qemu -m 1G" does not explicitly describe an object, so it
goes through the anonymous allocation path, and is affected by memfd-alloc.

Internally, create_default_memdev does create a memory-backend object.
That is what my doc comment above refers to:
   Any associated memory-backend objects are created with share=on

An explicit "qemu -object memory-backend-*" is not affected by memfd-alloc.

The qapi comments in patch "migration: cpr-exec mode" attempt to say all that:

+#     Memory backend objects must have the share=on attribute, and
+#     must be mmap'able in the new QEMU process.  For example,
+#     memory-backend-file is acceptable, but memory-backend-ram is
+#     not.
+#
+#     The VM must be started with the '-machine memfd-alloc=on'
+#     option.  This causes implicit ram blocks -- those not explicitly
+#     described by a memory-backend object -- to be allocated by
+#     mmap'ing a memfd.  Examples include VGA, ROM, and even guest
+#     RAM when it is specified without a memory-backend object.

>    - Even if we have such a machine property, why setting "memfd" will
>      always imply shared?  why not private?  After all it's not called
>      "memfd-shared-alloc", and we can create private mappings using
>      e.g. memory-backend-memfd,share=off.

There is no use case for memfd-alloc with share=off, so no point IMO in
making the option more verbose.  For cpr, the mapping with all its modifications
must be visible to new qemu when qemu mmaps it.
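
The share=on requirement can be demonstrated outside QEMU.  The following
is a minimal Linux-only sketch, assuming glibc's memfd_create(); the
helper name write_visible_via_fd() is hypothetical.  It writes through one
mapping of a memfd and then maps the same fd a second time, which stands
in for new qemu mmap'ing the inherited fd.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Write msg through a mapping of a fresh memfd created with map_flags
 * (MAP_SHARED or MAP_PRIVATE), then map the same fd again and report
 * whether the second mapping sees the write.
 * Returns 1 if visible, 0 if not, -1 on error.
 */
static int write_visible_via_fd(int map_flags)
{
    const size_t len = 4096;
    const char msg[] = "guest data";
    int visible = -1;

    int fd = memfd_create("sketch-ram", 0);
    if (fd < 0) {
        return -1;
    }
    if (ftruncate(fd, len) == 0) {
        char *a = mmap(NULL, len, PROT_READ | PROT_WRITE, map_flags, fd, 0);
        if (a != MAP_FAILED) {
            memcpy(a, msg, sizeof(msg));
            /* Stand-in for the new process mapping the inherited fd. */
            char *b = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
            if (b != MAP_FAILED) {
                visible = memcmp(b, msg, sizeof(msg)) == 0;
                munmap(b, len);
            }
            munmap(a, len);
        }
    }
    close(fd);
    return visible;
}
```

With MAP_SHARED the write lands in the memfd and the second mapping sees
it; with MAP_PRIVATE the write goes to a copy-on-write page and the memfd
contents stay zero, so a private mapping would lose guest modifications
across the exec.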

>>       return memory_region_init_ram_flags_nomigrate(mr, owner, name,
>> -                                                  size, 0, errp);
>> +                                                  size, flags, errp);
>>   }
>>   
>>   bool memory_region_init_ram_flags_nomigrate(MemoryRegion *mr,
>> @@ -1713,8 +1714,9 @@ bool memory_region_init_rom_nomigrate(MemoryRegion *mr,
>>                                         uint64_t size,
>>                                         Error **errp)
>>   {
>> +    uint32_t flags = current_machine->memfd_alloc ? RAM_SHARED : 0;
>>       if (!memory_region_init_ram_flags_nomigrate(mr, owner, name,
>> -                                                size, 0, errp)) {
>> +                                                size, flags, errp)) {
>>            return false;
>>       }
>>       mr->readonly = true;
>> @@ -1731,6 +1733,7 @@ bool memory_region_init_rom_device_nomigrate(MemoryRegion *mr,
>>                                                Error **errp)
>>   {
>>       Error *err = NULL;
>> +    uint32_t flags = current_machine->memfd_alloc ? RAM_SHARED : 0;
>>       assert(ops);
>>       memory_region_init(mr, owner, name, size);
>>       mr->ops = ops;
>> @@ -1738,7 +1741,7 @@ bool memory_region_init_rom_device_nomigrate(MemoryRegion *mr,
>>       mr->terminates = true;
>>       mr->rom_device = true;
>>       mr->destructor = memory_region_destructor_ram;
>> -    mr->ram_block = qemu_ram_alloc(size, 0, mr, &err);
>> +    mr->ram_block = qemu_ram_alloc(size, flags, mr, &err);
>>       if (err) {
>>           mr->size = int128_zero();
>>           object_unparent(OBJECT(mr));
>> diff --git a/system/physmem.c b/system/physmem.c
>> index c736af5..36d97ec 100644
>> --- a/system/physmem.c
>> +++ b/system/physmem.c
>> @@ -45,6 +45,7 @@
>>   #include "qemu/qemu-print.h"
>>   #include "qemu/log.h"
>>   #include "qemu/memalign.h"
>> +#include "qemu/memfd.h"
>>   #include "exec/memory.h"
>>   #include "exec/ioport.h"
>>   #include "sysemu/dma.h"
>> @@ -1825,6 +1826,19 @@ static void *ram_block_alloc_host(RAMBlock *rb, Error **errp)
>>       if (xen_enabled()) {
>>           xen_ram_alloc(rb->offset, rb->max_length, mr, errp);
>>   
>> +    } else if (rb->flags & RAM_SHARED) {
>> +        if (rb->fd == -1) {
>> +            mr->align = QEMU_VMALLOC_ALIGN;
>> +            rb->fd = qemu_memfd_create(rb->idstr, rb->max_length + mr->align,
>> +                                       0, 0, 0, errp);
>> +        }
> 
> We used to have case where RAM_SHARED && rb->fd==-1, I think.

file_ram_alloc is the only other function that assigns to rb->fd, and
hence the only function that cares about fd >= 0.  It does not
interfere with memfd_alloc.  All calls to create ram blocks funnel
through these two functions:

qemu_ram_alloc_from_fd()
   ram_block_create()
   file_ram_alloc()

qemu_ram_alloc_internal()
   ram_block_create()
   ram_block_alloc_host()
     if (rb->flags & RAM_SHARED) {
         if (rb->fd == -1) {
             rb->fd = qemu_memfd_create()
         }
         if (rb->fd >= 0) {
             file_ram_alloc(rb->fd)
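
The branch in ram_block_alloc_host() can be modeled in isolation.  This is
a simplified sketch, not the QEMU code: struct AllocBlock,
SKETCH_RAM_SHARED and block_alloc_host() are hypothetical stand-ins, and
plain memfd_create()/mmap() replace qemu_memfd_create()/file_ram_alloc().
It shows that after the shared path runs, the block has fd >= 0, while
the anonymous path leaves fd at -1.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

#define SKETCH_RAM_SHARED 0x1u  /* hypothetical; stands in for RAM_SHARED */

/* Hypothetical, simplified block; not the real RAMBlock. */
struct AllocBlock {
    unsigned flags;
    int fd;        /* -1 means "no fd yet" */
    size_t len;
    void *host;
};

/* Create a memfd only if the block has no fd, then map from the fd. */
static void *block_alloc_host(struct AllocBlock *b)
{
    void *host = MAP_FAILED;

    if (b->flags & SKETCH_RAM_SHARED) {
        if (b->fd == -1) {
            b->fd = memfd_create("sketch-block", 0);
            if (b->fd >= 0 && ftruncate(b->fd, b->len) != 0) {
                close(b->fd);
                b->fd = -1;
            }
        }
        if (b->fd >= 0) {
            host = mmap(NULL, b->len, PROT_READ | PROT_WRITE,
                        MAP_SHARED, b->fd, 0);
        }
    } else {
        /* Anonymous path: no fd is ever associated with the block. */
        host = mmap(NULL, b->len, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    }
    b->host = (host == MAP_FAILED) ? NULL : host;
    return b->host;
}
```

In this model, code that tests fd against -1 to decide whether a block is
fd based would treat a memfd-allocated block like any other fd based
block.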

> We have some code that checks explicitly on rb->fd against -1 to know
> whether it's a fd based.  I'm not sure whether there'll be implications to
> affect those codes.

That's OK, the memfd allocation completely acts like an fd based ramblock.
   rb->fd = qemu_memfd_create()

> Maybe it's mostly fine, OTOH I worry more on the whole idea.  I'm not sure
> whether this is relevant to "we want to be able to share the mem with the
> new process", in this case can we simply require the user to use file based
> memory backends, rather than such change?

That does not work for implicitly created memory regions.

- Steve

>> +        if (rb->fd >= 0) {
>> +            int mfd = rb->fd;
>> +            qemu_set_cloexec(mfd);
>> +            host = file_ram_alloc(rb, rb->max_length, mfd, false, 0, errp);
>> +            trace_qemu_anon_memfd_alloc(rb->idstr, rb->max_length, mfd, host);
>> +        }
>> +
>>       } else {
>>           host = qemu_anon_ram_alloc(rb->max_length, &mr->align,
>>                                      qemu_ram_is_shared(rb),
>> @@ -2106,8 +2120,10 @@ RAMBlock *qemu_ram_alloc_resizeable(ram_addr_t size, ram_addr_t maxsz,
>>                                                        void *host),
>>                                        MemoryRegion *mr, Error **errp)
>>   {
>> +    uint32_t flags = current_machine->memfd_alloc ? RAM_SHARED : 0;
>> +    flags |= RAM_RESIZEABLE;
>>       return qemu_ram_alloc_internal(size, maxsz, resized, NULL,
>> -                                   RAM_RESIZEABLE, mr, errp);
>> +                                   flags, mr, errp);
>>   }
>>   
>>   static void reclaim_ramblock(RAMBlock *block)
>> diff --git a/system/trace-events b/system/trace-events
>> index f0a80ba..0092734 100644
>> --- a/system/trace-events
>> +++ b/system/trace-events
>> @@ -41,3 +41,4 @@ dirtylimit_vcpu_execute(int cpu_index, int64_t sleep_time_us) "CPU[%d] sleep %"P
>>   
>>   # physmem.c
>>   ram_block_create(const char *name, uint32_t flags, int fd, size_t used_length, size_t max_length, size_t align) "%s, flags %u, fd %d, len %lu, maxlen %lu, align %lu"
>> +qemu_anon_memfd_alloc(const char *name, size_t size, int fd, void *ptr) "%s size %zu fd %d -> %p"
>> -- 
>> 1.8.3.1
>>
> 



* Re: [PATCH V1 19/26] physmem: preserve ram blocks for cpr
  2024-05-28 21:44   ` Peter Xu
@ 2024-05-29 17:31     ` Steven Sistare via
  2024-05-29 19:25       ` Peter Xu
  0 siblings, 1 reply; 122+ messages in thread
From: Steven Sistare via @ 2024-05-29 17:31 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Igor Mammedov,
	Eduardo Habkost, Marcel Apfelbaum, Philippe Mathieu-Daude,
	Paolo Bonzini, Daniel P. Berrange, Markus Armbruster

On 5/28/2024 5:44 PM, Peter Xu wrote:
> On Mon, Apr 29, 2024 at 08:55:28AM -0700, Steve Sistare wrote:
>> Preserve fields of RAMBlocks that allocate their host memory during CPR so
>> the RAM allocation can be recovered.
> 
> This sentence itself did not explain much, IMHO.  QEMU can share memory
> using fd based memory already of all kinds, as long as the memory backend
> is path-based it can be shared by sharing the same paths to dst.
> 
> This reads very confusing as a generic concept.  I mean, QEMU migration
> relies on so many things to work right.  We mostly asks the users to "use
> exactly the same cmdline for src/dst QEMU unless you know what you're
> doing", otherwise many things can break.  That should also include ramblock
> being matched between src/dst due to the same cmdlines provided on both
> sides.  It'll be confusing to mention this when we thought the ramblocks
> also rely on that fact.
> 
> So IIUC this sentence should be dropped in the real patch, and I'll try to
> guess the real reason with below..

The properties of the implicitly created ramblocks must be preserved.
The defaults can and do change between qemu releases, even when the command-line
parameters do not change for the explicit objects that cause these implicit
ramblocks to be created.

>> Mirror the mr->align field in the RAMBlock to simplify the vmstate.
>> Preserve the old host address, even though it is immediately discarded,
>> as it will be needed in the future for CPR with iommufd.  Preserve
>> guest_memfd, even though CPR does not yet support it, to maintain vmstate
>> compatibility when it becomes supported.
> 
> .. It could be about the vfio vaddr update feature that you mentioned and
> only for iommufd (as IIUC vfio still relies on iova ranges, then it won't
> help here)?
> 
> If so, IMHO we should have this patch (or any variance form) to be there
> for your upcoming vfio support.  Keeping this around like this will make
> the series harder to review.  Or is it needed even before VFIO?

This patch is needed independently of vfio or iommufd.

guest_memfd is independent of vfio or iommufd.  It is a recent addition
which I have not tried to support, but I added this placeholder field
so it can be supported in the future without adding a new field later
and maintaining backwards compatibility.

> Another thing to ask: does this idea also need to rely on some future
> iommufd kernel support?  If there's anything that's not merged in current
> Linux upstream, this series needs to be marked as RFC, so it's not target
> for merging.  This will also be true if this patch is "preparing" for that
> work.  It means if this patch only services iommufd purpose, even if it
> doesn't require any kernel header to be referenced, we should only merge it
> together with the full iommufd support comes later (and that'll be after
> iommufd kernel supports land).

It does not rely on future kernel support.

- Steve


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH V1 05/26] migration: precreate vmstate
  2024-05-28 15:09     ` Steven Sistare via
@ 2024-05-29 18:39       ` Peter Xu
  2024-05-30 17:04         ` Steven Sistare via
  0 siblings, 1 reply; 122+ messages in thread
From: Peter Xu @ 2024-05-29 18:39 UTC (permalink / raw)
  To: Steven Sistare
  Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Igor Mammedov,
	Eduardo Habkost, Marcel Apfelbaum, Philippe Mathieu-Daude,
	Paolo Bonzini, Daniel P. Berrange, Markus Armbruster

On Tue, May 28, 2024 at 11:09:49AM -0400, Steven Sistare wrote:
> On 5/27/2024 2:16 PM, Peter Xu wrote:
> > On Mon, Apr 29, 2024 at 08:55:14AM -0700, Steve Sistare wrote:
> > > Provide the VMStateDescription precreate field to mark objects that must
> > > be loaded on the incoming side before devices have been created, because
> > > they provide properties that will be needed at creation time.  They will
> > > be saved to and loaded from their own QEMUFile, via
> > > qemu_savevm_precreate_save and qemu_savevm_precreate_load, but these
> > > functions are not yet called in this patch.  Allow them to be called
> > > before or after normal migration is active, when current_migration and
> > > current_incoming are not valid.
> > > 
> > > Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> > > ---
> > >   include/migration/vmstate.h |  6 ++++
> > >   migration/savevm.c          | 69 +++++++++++++++++++++++++++++++++++++++++----
> > >   migration/savevm.h          |  3 ++
> > >   3 files changed, 73 insertions(+), 5 deletions(-)
> > > 
> > > diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
> > > index 294d2d8..4691334 100644
> > > --- a/include/migration/vmstate.h
> > > +++ b/include/migration/vmstate.h
> > > @@ -198,6 +198,12 @@ struct VMStateDescription {
> > >        * a QEMU_VM_SECTION_START section.
> > >        */
> > >       bool early_setup;
> > > +
> > > +    /*
> > > +     * Send/receive this object in the precreate migration stream.
> > > +     */
> > > +    bool precreate;
> > > +
> > >       int version_id;
> > >       int minimum_version_id;
> > >       MigrationPriority priority;
> > > diff --git a/migration/savevm.c b/migration/savevm.c
> > > index 9789823..a30bcd9 100644
> > > --- a/migration/savevm.c
> > > +++ b/migration/savevm.c
> > > @@ -239,6 +239,7 @@ static SaveState savevm_state = {
> > >   #define SAVEVM_FOREACH(se, entry)                                    \
> > >       QTAILQ_FOREACH(se, &savevm_state.handlers, entry)                \
> > > +        if (!se->vmsd || !se->vmsd->precreate)
> > >   #define SAVEVM_FOREACH_ALL(se, entry)                                \
> > >       QTAILQ_FOREACH(se, &savevm_state.handlers, entry)
> > > @@ -1006,13 +1007,19 @@ static void save_section_header(QEMUFile *f, SaveStateEntry *se,
> > >       }
> > >   }
> > > +static bool send_section_footer(SaveStateEntry *se)
> > > +{
> > > +    return (se->vmsd && se->vmsd->precreate) ||
> > > +           migrate_get_current()->send_section_footer;
> > > +}
> > 
> > Does the precreate vmsd "require" the footer?  Or it should also work?
> > IMHO it's less optimal to bind features without good reasons.
> 
> It is not required.  However, IMO we should not treat send-section-footer as
> a fungible feature.  It is strictly an improvement, as it was added to catch
> misformatted sections.  It is only registered as a feature for backwards
> compatibility with qemu 2.3 and xen.
> 
> For a brand new data stream such as precreate, where we are not constrained
> by backwards compatibility, we should unconditionally use the better protocol,
> and always send the footer.

I see your point, but I still don't think we should mangle these things.
It makes future justification harder on whether section footer should be
sent.

Take example of whatever new feature added for migration like mapped-ram,
we might also want to enforce it by adding "return migrate_mapped_ram() ||
..." but it means we keep growing this with no benefit.

What you worry about, "what if this is turned off", isn't a real concern:
nobody will turn it off!  We have started to deprecate machines, and I have a
feeling it will be enforced by default at some point.

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH V1 07/26] migration: VMStateId
  2024-05-29 17:30         ` Steven Sistare via
@ 2024-05-29 18:53           ` Peter Xu
  2024-05-30 17:11             ` Steven Sistare via
  0 siblings, 1 reply; 122+ messages in thread
From: Peter Xu @ 2024-05-29 18:53 UTC (permalink / raw)
  To: Steven Sistare
  Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Igor Mammedov,
	Eduardo Habkost, Marcel Apfelbaum, Philippe Mathieu-Daude,
	Paolo Bonzini, Daniel P. Berrange, Markus Armbruster

On Wed, May 29, 2024 at 01:30:18PM -0400, Steven Sistare wrote:
> How about a more general name for the type:
> 
> migration/misc.h
>     typedef char (MigrationId)[256];

How about qemu/typedefs.h?  Not sure whether it's applicable. Markus (in
the loop) may have a better idea.

Meanwhile, s/MigrationId/IDString/?

> 
> exec/ramblock.h
>     struct RAMBlock {
>         MigrationId idstr;
> 
> migration/savevm.c
>     typedef struct CompatEntry {
>         MigrationId idstr;
> 
>     typedef struct SaveStateEntry {
>         MigrationId idstr;
> 
> 
> - Steve
> 

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH V1 17/26] machine: memfd-alloc option
  2024-05-29 17:31     ` Steven Sistare via
@ 2024-05-29 19:14       ` Peter Xu
  2024-05-30 17:11         ` Steven Sistare via
  2024-06-03 10:17       ` Daniel P. Berrangé
  1 sibling, 1 reply; 122+ messages in thread
From: Peter Xu @ 2024-05-29 19:14 UTC (permalink / raw)
  To: Steven Sistare
  Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Igor Mammedov,
	Eduardo Habkost, Marcel Apfelbaum, Philippe Mathieu-Daude,
	Paolo Bonzini, Daniel P. Berrange, Markus Armbruster

On Wed, May 29, 2024 at 01:31:38PM -0400, Steven Sistare wrote:
> > > diff --git a/system/memory.c b/system/memory.c
> > > index 49f1cb2..ca04a0e 100644
> > > --- a/system/memory.c
> > > +++ b/system/memory.c
> > > @@ -1552,8 +1552,9 @@ bool memory_region_init_ram_nomigrate(MemoryRegion *mr,
> > >                                         uint64_t size,
> > >                                         Error **errp)
> > >   {
> > > +    uint32_t flags = current_machine->memfd_alloc ? RAM_SHARED : 0;
> > 
> > If there's a machine option to "use memfd for allocations", then it's
> > shared mem... Hmm..
> > 
> > It is a bit confusing to me in quite a few levels:
> > 
> >    - Why memory allocation method will be defined by a machine property,
> >      even if we have memory-backend-* which should cover everything?
> 
> Some memory regions are implicitly created, and have no explicit representation
> on the qemu command line.  memfd-alloc affects those.
> 
> More generally, memfd-alloc affects all ramblock allocations that are
> not explicitly represented by memory-backend object.  Thus the simple
> command line "qemu -m 1G" does not explicitly describe an object, so it
> goes through the anonymous allocation path, and is affected by memfd-alloc.

Can we simply now allow "qemu -m 1G" to work for cpr-exec?  AFAIU that's
what we do with cpr-reboot: we ask the user to specify the right things to
make other thing work.  Otherwise it won't.

> 
> Internally, create_default_memdev does create a memory-backend object.
> That is what my doc comment above refers to:
>   Any associated memory-backend objects are created with share=on
> 
> An explicit "qemu -object memory-backend-*" is not affected by memfd-alloc.
> 
> The qapi comments in patch "migration: cpr-exec mode" attempt to say all that:
> 
> +#     Memory backend objects must have the share=on attribute, and
> +#     must be mmap'able in the new QEMU process.  For example,
> +#     memory-backend-file is acceptable, but memory-backend-ram is
> +#     not.
> +#
> +#     The VM must be started with the '-machine memfd-alloc=on'
> +#     option.  This causes implicit ram blocks -- those not explicitly
> +#     described by a memory-backend object -- to be allocated by
> +#     mmap'ing a memfd.  Examples include VGA, ROM, and even guest
> +#     RAM when it is specified without a memory-backend object.

VGA is IIRC 16MB chunk, ROM is even smaller.  If the user specifies -object
memory-backend-file,share=on properly, these should be the only outliers?

Are these important enough for the downtime?  Can we put them into the
migrated image alongside with the rest device states?

> 
> >    - Even if we have such a machine property, why setting "memfd" will
> >      always imply shared?  why not private?  After all it's not called
> >      "memfd-shared-alloc", and we can create private mappings using
> >      e.g. memory-backend-memfd,share=off.
> 
> There is no use case for memfd-alloc with share=off, so no point IMO in
> making the option more verbose.

Unfortunately this fact doesn't make the property easier to understand. :-(

> For cpr, the mapping with all its modifications must be visible to new
> qemu when qemu mmaps it.

So this might be the important part - do you mean migrating
VGA/ROM/... small ramblocks won't work (besides any performance concerns)?
Could you elaborate?

Cpr-reboot already introduced lots of tricky knobs to QEMU.  We may need to
keep such special cases to a minimum and make the interface as clear as
possible, or maintainers (at least migration ones) will soon be scared away,
if the proposal is not shot down first.

In short, I hope that when we introduce new knobs for cpr, we don't keep only
the cpr-* modes in mind, but also consider whether the user can use them
without cpr-*.  I'm not sure whether that will always be possible, but we
should try.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH V1 19/26] physmem: preserve ram blocks for cpr
  2024-05-29 17:31     ` Steven Sistare via
@ 2024-05-29 19:25       ` Peter Xu
  2024-05-30 17:12         ` Steven Sistare via
  0 siblings, 1 reply; 122+ messages in thread
From: Peter Xu @ 2024-05-29 19:25 UTC (permalink / raw)
  To: Steven Sistare
  Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Igor Mammedov,
	Eduardo Habkost, Marcel Apfelbaum, Philippe Mathieu-Daude,
	Paolo Bonzini, Daniel P. Berrange, Markus Armbruster

On Wed, May 29, 2024 at 01:31:53PM -0400, Steven Sistare wrote:
> On 5/28/2024 5:44 PM, Peter Xu wrote:
> > On Mon, Apr 29, 2024 at 08:55:28AM -0700, Steve Sistare wrote:
> > > Preserve fields of RAMBlocks that allocate their host memory during CPR so
> > > the RAM allocation can be recovered.
> > 
> > This sentence itself did not explain much, IMHO.  QEMU can share memory
> > using fd based memory already of all kinds, as long as the memory backend
> > is path-based it can be shared by sharing the same paths to dst.
> > 
> > This reads very confusing as a generic concept.  I mean, QEMU migration
> > relies on so many things to work right.  We mostly asks the users to "use
> > exactly the same cmdline for src/dst QEMU unless you know what you're
> > doing", otherwise many things can break.  That should also include ramblock
> > being matched between src/dst due to the same cmdlines provided on both
> > sides.  It'll be confusing to mention this when we thought the ramblocks
> > also rely on that fact.
> > 
> > So IIUC this sentence should be dropped in the real patch, and I'll try to
> > guess the real reason with below..
> 
> The properties of the implicitly created ramblocks must be preserved.
> The defaults can and do change between qemu releases, even when the command-line
> parameters do not change for the explicit objects that cause these implicit
> ramblocks to be created.

AFAIU, QEMU relies on ramblocks to be the same before this series.  Do you
have an example?  Would that already cause issue when migrate?

> 
> > > Mirror the mr->align field in the RAMBlock to simplify the vmstate.
> > > Preserve the old host address, even though it is immediately discarded,
> > > as it will be needed in the future for CPR with iommufd.  Preserve
> > > guest_memfd, even though CPR does not yet support it, to maintain vmstate
> > > compatibility when it becomes supported.
> > 
> > .. It could be about the vfio vaddr update feature that you mentioned and
> > only for iommufd (as IIUC vfio still relies on iova ranges, then it won't
> > help here)?
> > 
> > If so, IMHO we should have this patch (or any variance form) to be there
> > for your upcoming vfio support.  Keeping this around like this will make
> > the series harder to review.  Or is it needed even before VFIO?
> 
> This patch is needed independently of vfio or iommufd.
> 
> guest_memfd is independent of vfio or iommufd.  It is a recent addition
> which I have not tried to support, but I added this placeholder field
> so it can be supported in the future without adding a new field later
> and maintaining backwards compatibility.

Is guest_memfd the only user so far, then?  If so, would it be possible we
split it as a separate effort on top of the base cpr-exec support?

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH V1 05/26] migration: precreate vmstate
  2024-05-29 18:39       ` Peter Xu
@ 2024-05-30 17:04         ` Steven Sistare via
  0 siblings, 0 replies; 122+ messages in thread
From: Steven Sistare via @ 2024-05-30 17:04 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Igor Mammedov,
	Eduardo Habkost, Marcel Apfelbaum, Philippe Mathieu-Daude,
	Paolo Bonzini, Daniel P. Berrange, Markus Armbruster

On 5/29/2024 2:39 PM, Peter Xu wrote:
> On Tue, May 28, 2024 at 11:09:49AM -0400, Steven Sistare wrote:
>> On 5/27/2024 2:16 PM, Peter Xu wrote:
>>> On Mon, Apr 29, 2024 at 08:55:14AM -0700, Steve Sistare wrote:
>>>> Provide the VMStateDescription precreate field to mark objects that must
>>>> be loaded on the incoming side before devices have been created, because
>>>> they provide properties that will be needed at creation time.  They will
>>>> be saved to and loaded from their own QEMUFile, via
>>>> qemu_savevm_precreate_save and qemu_savevm_precreate_load, but these
>>>> functions are not yet called in this patch.  Allow them to be called
>>>> before or after normal migration is active, when current_migration and
>>>> current_incoming are not valid.
>>>>
>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>> ---
>>>>    include/migration/vmstate.h |  6 ++++
>>>>    migration/savevm.c          | 69 +++++++++++++++++++++++++++++++++++++++++----
>>>>    migration/savevm.h          |  3 ++
>>>>    3 files changed, 73 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
>>>> index 294d2d8..4691334 100644
>>>> --- a/include/migration/vmstate.h
>>>> +++ b/include/migration/vmstate.h
>>>> @@ -198,6 +198,12 @@ struct VMStateDescription {
>>>>         * a QEMU_VM_SECTION_START section.
>>>>         */
>>>>        bool early_setup;
>>>> +
>>>> +    /*
>>>> +     * Send/receive this object in the precreate migration stream.
>>>> +     */
>>>> +    bool precreate;
>>>> +
>>>>        int version_id;
>>>>        int minimum_version_id;
>>>>        MigrationPriority priority;
>>>> diff --git a/migration/savevm.c b/migration/savevm.c
>>>> index 9789823..a30bcd9 100644
>>>> --- a/migration/savevm.c
>>>> +++ b/migration/savevm.c
>>>> @@ -239,6 +239,7 @@ static SaveState savevm_state = {
>>>>    #define SAVEVM_FOREACH(se, entry)                                    \
>>>>        QTAILQ_FOREACH(se, &savevm_state.handlers, entry)                \
>>>> +        if (!se->vmsd || !se->vmsd->precreate)
>>>>    #define SAVEVM_FOREACH_ALL(se, entry)                                \
>>>>        QTAILQ_FOREACH(se, &savevm_state.handlers, entry)
>>>> @@ -1006,13 +1007,19 @@ static void save_section_header(QEMUFile *f, SaveStateEntry *se,
>>>>        }
>>>>    }
>>>> +static bool send_section_footer(SaveStateEntry *se)
>>>> +{
>>>> +    return (se->vmsd && se->vmsd->precreate) ||
>>>> +           migrate_get_current()->send_section_footer;
>>>> +}
>>>
>>> Does the precreate vmsd "require" the footer?  Or it should also work?
>>> IMHO it's less optimal to bind features without good reasons.
>>
>> It is not required.  However, IMO we should not treat send-section-footer as
>> a fungible feature.  It is strictly an improvement, as it was added to catch
>> misformatted sections.  It is only registered as a feature for backwards
>> compatibility with qemu 2.3 and xen.
>>
>> For a brand new data stream such as precreate, where we are not constrained
>> by backwards compatibility, we should unconditionally use the better protocol,
>> and always send the footer.
> 
> I see your point, but I still don't think we should mangle these things.
> It makes future justification harder on whether section footer should be
> sent.
> 
> Take example of whatever new feature added for migration like mapped-ram,
> we might also want to enforce it by adding "return migrate_mapped_ram() ||
> ..." but it means we keep growing this with no benefit.
> 
> What you worry on "what if this is turned off" isn't a real one: nobody
> will turn it off!  We started to deprecate machines, and I had a feeling
> it will be enforced at some point by default.

That's fine, I'll delete the send_section_footer() function.

- Steve


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH V1 17/26] machine: memfd-alloc option
  2024-05-29 19:14       ` Peter Xu
@ 2024-05-30 17:11         ` Steven Sistare via
  2024-05-30 18:14           ` Peter Xu
  0 siblings, 1 reply; 122+ messages in thread
From: Steven Sistare via @ 2024-05-30 17:11 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Igor Mammedov,
	Eduardo Habkost, Marcel Apfelbaum, Philippe Mathieu-Daude,
	Paolo Bonzini, Daniel P. Berrange, Markus Armbruster

On 5/29/2024 3:14 PM, Peter Xu wrote:
> On Wed, May 29, 2024 at 01:31:38PM -0400, Steven Sistare wrote:
>>>> diff --git a/system/memory.c b/system/memory.c
>>>> index 49f1cb2..ca04a0e 100644
>>>> --- a/system/memory.c
>>>> +++ b/system/memory.c
>>>> @@ -1552,8 +1552,9 @@ bool memory_region_init_ram_nomigrate(MemoryRegion *mr,
>>>>                                          uint64_t size,
>>>>                                          Error **errp)
>>>>    {
>>>> +    uint32_t flags = current_machine->memfd_alloc ? RAM_SHARED : 0;
>>>
>>> If there's a machine option to "use memfd for allocations", then it's
>>> shared mem... Hmm..
>>>
>>> It is a bit confusing to me in quite a few levels:
>>>
>>>     - Why memory allocation method will be defined by a machine property,
>>>       even if we have memory-backend-* which should cover everything?
>>
>> Some memory regions are implicitly created, and have no explicit representation
>> on the qemu command line.  memfd-alloc affects those.
>>
>> More generally, memfd-alloc affects all ramblock allocations that are
>> not explicitly represented by memory-backend object.  Thus the simple
>> command line "qemu -m 1G" does not explicitly describe an object, so it
>> goes through the anonymous allocation path, and is affected by memfd-alloc.
> 
> Can we simply now allow "qemu -m 1G" to work for cpr-exec?  

I assume you meant "simply not allow".

Yes, I could do that, but I would need to explicitly add code to exclude this
case, and add a blocker.  Right now it "just works" for all paths that lead to
ram_block_alloc_host, without any special logic at the memory-backend level.
And, I'm not convinced that simplifies the docs, as now I would need to tell
the user that "-m 1G" and similar constructions do not work with cpr.

I can try to clarify the doc for memfd-alloc as currently defined.

> AFAIU that's
> what we do with cpr-reboot: we ask the user to specify the right things to
> make other thing work.  Otherwise it won't.
> 
>>
>> Internally, create_default_memdev does create a memory-backend object.
>> That is what my doc comment above refers to:
>>    Any associated memory-backend objects are created with share=on
>>
>> An explicit "qemu -object memory-backend-*" is not affected by memfd-alloc.
>>
>> The qapi comments in patch "migration: cpr-exec mode" attempt to say all that:
>>
>> +#     Memory backend objects must have the share=on attribute, and
>> +#     must be mmap'able in the new QEMU process.  For example,
>> +#     memory-backend-file is acceptable, but memory-backend-ram is
>> +#     not.
>> +#
>> +#     The VM must be started with the '-machine memfd-alloc=on'
>> +#     option.  This causes implicit ram blocks -- those not explicitly
>> +#     described by a memory-backend object -- to be allocated by
>> +#     mmap'ing a memfd.  Examples include VGA, ROM, and even guest
>> +#     RAM when it is specified without a memory-backend object.
> 
> VGA is IIRC 16MB chunk, ROM is even smaller.  If the user specifies -object
> memory-backend-file,share=on properly, these should be the only outliers?
> 
> Are these important enough for the downtime?  Can we put them into the
> migrated image alongside with the rest device states?

It's not about downtime.  vfio, vdpa, and iommufd pin all guest pages.
The pages must remain pinned during CPR to support ongoing DMA activity
which could target those pages (which we do not quiesce), and the same
physical pages must be used for the ramblocks in the new qemu process.
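
To illustrate the requirement (a sketch only, not code from this series): a
memfd mapped with MAP_SHARED is backed by the same pages however many times
it is mapped, which is what would let new qemu reach the pinned pages again
through a preserved fd.  The helper name below is hypothetical, and the second
mmap merely stands in for the mapping new qemu would create after exec;
memfd_create, ftruncate, and mmap are the standard Linux calls.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* for memfd_create() in <sys/mman.h> */
#endif
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical demo: write through one shared mapping of a memfd, then
 * map the same fd a second time and check that the data is visible there.
 * Returns 1 if the data is visible, 0 if it is not, -1 on error.
 */
static int memfd_shared_mapping_demo(size_t len)
{
    int ret = -1;
    int fd;
    char *old_map;
    char *new_map;

    assert(len > 0);

    /* No MFD_CLOEXEC, so the descriptor would stay open across exec. */
    fd = memfd_create("ramblock-demo", 0);
    if (fd < 0) {
        return -1;
    }
    if (ftruncate(fd, (off_t)len) < 0) {
        goto out_close;
    }

    old_map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (old_map == MAP_FAILED) {
        goto out_close;
    }
    new_map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (new_map == MAP_FAILED) {
        goto out_unmap_old;
    }

    /* Modify through the first mapping, observe through the second. */
    memset(old_map, 0x5a, len);
    ret = (old_map != new_map &&
           new_map[0] == 0x5a && new_map[len - 1] == 0x5a);

    munmap(new_map, len);
out_unmap_old:
    munmap(old_map, len);
out_close:
    close(fd);
    return ret;
}
```

With MAP_PRIVATE in place of MAP_SHARED, the write would go to a private copy
and the second mapping would not see it, which is the share=off case discussed
below.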

>>>     - Even if we have such a machine property, why setting "memfd" will
>>>       always imply shared?  why not private?  After all it's not called
>>>       "memfd-shared-alloc", and we can create private mappings using
>>>       e.g. memory-backend-memfd,share=off.
>>
>> There is no use case for memfd-alloc with share=off, so no point IMO in
>> making the option more verbose.
> 
> Unfortunately this fact doesn't make the property easier to understand. :-(
>
>> For cpr, the mapping with all its modifications must be visible to new
>> qemu when qemu mmaps it.
> 
> So this might be the important part - do you mean migrating
> VGA/ROM/... small ramblocks won't work (besides any performance concerns)?
> Could you elaborate?

Pinning.

> Cpr-reboot already introduced lots of tricky knobs to QEMU.  We may need to
> restrict that specialty to minimal, making the interfacing as clear as
> possible, or (at least migration) maintainers will start to be soon scared
> and running away, if such proposal was not shot down.
> 
> In short, I hope when we introduce new knobs for cpr, we shouldn't always
> keep cpr-* modes in mind, but consider whenever the user can use it without
> cpr-*.  I'm not sure whether it'll be always possible, but we should try.

I agree in principle.  FWIW, I have tried to generalize the functionality needed
by cpr so it can be used in other ways: per-mode blockers, per-mode notifiers,
precreate vmstate, factory objects; to base it on migration internals with
minimal change (vmstate); and to make minimal changes in the migration control
paths.

- Steve


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH V1 07/26] migration: VMStateId
  2024-05-29 18:53           ` Peter Xu
@ 2024-05-30 17:11             ` Steven Sistare via
  2024-05-30 18:03               ` Peter Xu
  0 siblings, 1 reply; 122+ messages in thread
From: Steven Sistare via @ 2024-05-30 17:11 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Igor Mammedov,
	Eduardo Habkost, Marcel Apfelbaum, Philippe Mathieu-Daude,
	Paolo Bonzini, Daniel P. Berrange, Markus Armbruster

On 5/29/2024 2:53 PM, Peter Xu wrote:
> On Wed, May 29, 2024 at 01:30:18PM -0400, Steven Sistare wrote:
>> How about a more general name for the type:
>>
>> migration/misc.h
>>      typedef char (MigrationId)[256];
> 
> How about qemu/typedefs.h?  Not sure whether it's applicable. Markus (in
> the loop) may have a better idea.
> 
> Meanwhile, s/MigrationId/IDString/?

typedefs.h has a different purpose: giving short names to types
defined in internal include files.

This id is specific to migration, so I still think its name should reflect
migration and it belongs in some include/migration/*.h file.

ramblocks and migration are already closely related.  There is nothing wrong
with including a migration header in ramblock.h so it can use a migration type.
We already have:
   include/hw/acpi/ich9_tco.h:#include "migration/vmstate.h"
   include/hw/display/ramfb.h:#include "migration/vmstate.h"
   include/hw/input/pl050.h:#include "migration/vmstate.h"
   include/hw/pci/shpc.h:#include "migration/vmstate.h"
   include/hw/virtio/virtio.h:#include "migration/vmstate.h"
   include/hw/hyperv/vmbus.h:#include "migration/vmstate.h"

The 256 byte magic length already appears in too many places, and my code
would add more occurrences, so I really think that abstracting this type
would be cleaner.
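
As a minimal sketch of the abstraction (not the actual patch): only the
MigrationId typedef comes from the proposal quoted below.  The two structs
are simplified stand-ins for the real ones, and migration_id_set is a
hypothetical helper, not an existing QEMU function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Proposed type: one definition replaces the repeated char[256]. */
typedef char (MigrationId)[256];

/* Simplified stand-ins for the real structs (assumption). */
struct RAMBlockSketch {
    MigrationId idstr;
};

struct SaveStateEntrySketch {
    MigrationId idstr;
};

/*
 * Hypothetical helper: copy src into id, always NUL-terminated.
 * Returns false if src had to be truncated.
 *
 * The parameter decays to char *, so the size must be taken from the
 * type, sizeof(MigrationId), not from sizeof(id).
 */
static bool migration_id_set(MigrationId id, const char *src)
{
    int n;

    assert(id && src);
    n = snprintf(id, sizeof(MigrationId), "%s", src);
    return n >= 0 && (size_t)n < sizeof(MigrationId);
}
```

Any code that needs the length can then write sizeof(MigrationId) instead of
repeating the literal 256.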

- Steve

>> exec/ramblock.h
>>      struct RAMBlock {
>>          MigrationId idstr;
>>
>> migration/savevm.c
>>      typedef struct CompatEntry {
>>          MigrationId idstr;
>>
>>      typedef struct SaveStateEntry {
>>          MigrationId idstr;
>>
>>
>> - Steve
>>
> 


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH V1 19/26] physmem: preserve ram blocks for cpr
  2024-05-29 19:25       ` Peter Xu
@ 2024-05-30 17:12         ` Steven Sistare via
  2024-05-30 18:39           ` Peter Xu
  0 siblings, 1 reply; 122+ messages in thread
From: Steven Sistare via @ 2024-05-30 17:12 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Igor Mammedov,
	Eduardo Habkost, Marcel Apfelbaum, Philippe Mathieu-Daude,
	Paolo Bonzini, Daniel P. Berrange, Markus Armbruster

On 5/29/2024 3:25 PM, Peter Xu wrote:
> On Wed, May 29, 2024 at 01:31:53PM -0400, Steven Sistare wrote:
>> On 5/28/2024 5:44 PM, Peter Xu wrote:
>>> On Mon, Apr 29, 2024 at 08:55:28AM -0700, Steve Sistare wrote:
>>>> Preserve fields of RAMBlocks that allocate their host memory during CPR so
>>>> the RAM allocation can be recovered.
>>>
>>> This sentence itself did not explain much, IMHO.  QEMU can share memory
>>> using fd based memory already of all kinds, as long as the memory backend
>>> is path-based it can be shared by sharing the same paths to dst.
>>>
>>> This reads very confusing as a generic concept.  I mean, QEMU migration
>>> relies on so many things to work right.  We mostly asks the users to "use
>>> exactly the same cmdline for src/dst QEMU unless you know what you're
>>> doing", otherwise many things can break.  That should also include ramblock
>>> being matched between src/dst due to the same cmdlines provided on both
>>> sides.  It'll be confusing to mention this when we thought the ramblocks
>>> also rely on that fact.
>>>
>>> So IIUC this sentence should be dropped in the real patch, and I'll try to
>>> guess the real reason with below..
>>
>> The properties of the implicitly created ramblocks must be preserved.
>> The defaults can and do change between qemu releases, even when the command-line
>> parameters do not change for the explicit objects that cause these implicit
>> ramblocks to be created.
> 
> AFAIU, QEMU relies on ramblocks to be the same before this series.  Do you
> have an example?  Would that already cause issue when migrate?

Alignment has changed, and used_length vs max_length changed when
resizeable ramblocks were introduced.  I have dealt with these issues
while supporting cpr for our internal use, and the lesson learned is to
explicitly communicate the creation-time parameters to new qemu.

These are not an issue for migration because the ramblock is re-created
and the data copied into the new memory.
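
As a rough illustration of that lesson (all names here are hypothetical, and
this is not the vmstate in this patch): old qemu records the parameters it
actually used when it created the block, and new qemu adopts the recorded
values instead of recomputing them from its own defaults.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of what old qemu used when creating a ramblock. */
typedef struct SavedBlockParams {
    uint64_t align;
    uint64_t used_length;
    uint64_t max_length;
} SavedBlockParams;

/*
 * Hypothetical helper: choose creation parameters in new qemu.
 *
 * If old qemu saved a record, use it verbatim so the existing memory is
 * mapped with the layout it already has.  Otherwise fall back to this
 * binary's defaults.  Returns false if the saved record is inconsistent.
 */
static bool choose_block_params(const SavedBlockParams *saved,
                                const SavedBlockParams *defaults,
                                SavedBlockParams *out)
{
    assert(defaults && out);

    if (!saved) {
        *out = *defaults;
        return true;
    }
    if (saved->used_length > saved->max_length) {
        return false;
    }
    *out = *saved;
    return true;
}
```

For ordinary migration the fallback branch is enough, since the destination
allocates fresh memory and copies the data in.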

>>>> Mirror the mr->align field in the RAMBlock to simplify the vmstate.
>>>> Preserve the old host address, even though it is immediately discarded,
>>>> as it will be needed in the future for CPR with iommufd.  Preserve
>>>> guest_memfd, even though CPR does not yet support it, to maintain vmstate
>>>> compatibility when it becomes supported.
>>>
>>> .. It could be about the vfio vaddr update feature that you mentioned and
>>> only for iommufd (as IIUC vfio still relies on iova ranges, then it won't
>>> help here)?
>>>
>>> If so, IMHO we should have this patch (or any variance form) to be there
>>> for your upcoming vfio support.  Keeping this around like this will make
>>> the series harder to review.  Or is it needed even before VFIO?
>>
>> This patch is needed independently of vfio or iommufd.
>>
>> guest_memfd is independent of vfio or iommufd.  It is a recent addition
>> which I have not tried to support, but I added this placeholder field
>> so it can be supported in the future without adding a new field later
>> and maintaining backwards compatibility.
> 
> Is guest_memfd the only user so far, then?  If so, would it be possible we
> split it as a separate effort on top of the base cpr-exec support?

I don't understand the question.  I am indeed deferring support for guest_memfd
to a future time.  For now, I am adding a blocker, and reserving a field for
it in the preserved ramblock attributes, to avoid adding a subsection later.

- Steve


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH V1 00/26] Live update: cpr-exec
  2024-05-28 16:42             ` Peter Xu
@ 2024-05-30 17:17               ` Steven Sistare via
  2024-05-30 19:23                 ` Peter Xu
  0 siblings, 1 reply; 122+ messages in thread
From: Steven Sistare via @ 2024-05-30 17:17 UTC (permalink / raw)
  To: Peter Xu; +Cc: Fabiano Rosas, QEMU Developers

On 5/28/2024 12:42 PM, Peter Xu wrote:
> On Tue, May 28, 2024 at 11:10:27AM -0400, Steven Sistare wrote:
>> On 5/27/2024 1:45 PM, Peter Xu wrote:
>>> On Tue, May 21, 2024 at 07:46:12AM -0400, Steven Sistare wrote:
>>>> I understand, thanks.  If I can help with any of your todo list,
>>>> just ask - steve
>>>
>>> Thanks for offering the help, Steve.  Started looking at this today, then I
>>> found that I miss something high-level.  Let me ask here, and let me
>>> apologize already for starting to throw multiple questions..
>>>
>>> IIUC the whole idea of this patchset is to allow efficient QEMU upgrade, in
>>> this case not host kernel but QEMU-only, and/or upper.
>>>
>>> Is there any justification on why the complexity is needed here?  It looks
>>> to me this one is more involved than cpr-reboot, so I'm thinking how much
>>> we can get from the complexity, and whether it's worthwhile.  1000+ LOC is
>>> the min support, and if we even expect more to come, that's really
>>> important, IMHO.
>>>
>>> For example, what's the major motivation of this whole work?  Is that more
>>> on performance, or is it more for supporting the special devices like VFIO
>>> which we used to not support, or something else?  I can't find them in
>>> whatever cover letter I can find, including this one.
>>>
>>> Firstly, regarding performance, IMHO it'll be always nice to share even
>>> some very fundamental downtime measurement comparisons using the new exec
>>> mode v.s. the old migration ways to upgrade QEMU binary.  Do you perhaps
>>> have some number on hand when you started working on this feature years
>>> ago?  Or maybe some old links on the list would help too, as I didn't
>>> follow this work since the start.
>>>
>>> On VFIO, IIUC you started out this project without VFIO migration being
>>> there.  Now we have VFIO migration so not sure how much it would work for
>>> the upgrade use case. Even with current VFIO migration, we may not want to
>>> migrate device states for a local upgrade I suppose, as that can be a lot
>>> depending on the type of device assigned.  However it'll be nice to discuss
>>> this too if this is the major purpose of the series.
>>>
>>> I think one other challenge on QEMU upgrade with VFIO devices is that the
>>> dest QEMU won't be able to open the VFIO device when the src QEMU is still
>>> using it as the owner.  IIUC this is a similar condition where QEMU wants
>>> to have proper ownership transfer of a shared block device, and AFAIR right
>>> now we resolved that issue using some form of file lock on the image file.
>>> In this case it won't easily apply to a VFIO dev fd, but maybe we still
>>> have other approaches, not sure whether you investigated any.  E.g. could
>>> the VFIO handle be passed over using unix scm rights?  I think this might
>>> remove one dependency of using exec which can cause quite some difference
>>> v.s. a generic migration (from which regard, cpr-reboot is still a pretty
>>> generic migration).
>>>
>>> You also mentioned vhost/tap, is that also a major goal of this series in
>>> the follow up patchsets?  Is this a problem only because this solution will
>>> do exec?  Can it work if either the exec()ed qemu or dst qemu create the
>>> vhost/tap fds when boot?
>>>
>>> Meanwhile, could you elaborate a bit on the implication on chardevs?  From
>>> what I read in the doc update it looks like a major part of work in the
>>> future, but I don't yet understand the issue..  Is it also relevant to the
>>> exec() approach?
>>>
>>> In all cases, some of such discussion would be really appreciated.  And if
>>> you used to consider other approaches to solve this problem it'll be great
>>> to mention how you chose this way.  Considering this work contains too many
>>> things, it'll be nice if such discussion can start with the fundamentals,
>>> e.g. on why exec() is a must.
>>
>> The main goal of cpr-exec is providing a fast and reliable way to update
>> qemu. cpr-reboot is not fast enough or general enough.  It requires the
>> guest to support suspend and resume for all devices, and that takes seconds.
>> If one actually reboots the host, that adds more seconds, depending on
>> system services.  cpr-exec takes 0.1 secs, and works every time, unlike
>> live migration, which can fail to converge on a busy system.  Live migration
>> also consumes more system and network resources.
> 
> Right, but note that when I was thinking of a comparison between cpr-exec
> v.s. normal migration, I didn't mean a "normal live migration".  I think
> it's more of the case whether exec() can be avoided.  I had a feeling that
> this exec() will cause a major part of work elsewhere but maybe I am wrong
> as I didn't see the whole branch.

The only parts of this work that are specific to exec are these patches
and the qemu_clear_cloexec() calls in cpr.c.
   vl: helper to request re-exec
   migration: precreate vmstate for exec

The rest would be the same if some other mechanism were used to start
new qemu.   Additional code would be needed for the new mechanism, such
as SCM_RIGHTS sends.
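
For readers unfamiliar with the exec-specific part: a descriptor survives
exec only if its close-on-exec flag is clear.  Below is a minimal sketch of
that idea, assuming plain POSIX fcntl().  The helper names clear_cloexec()
and survives_exec() are hypothetical; this is not the body of the series'
qemu_clear_cloexec().

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: clear FD_CLOEXEC so that fd stays open in the
 * program started by a later exec.  Returns 0 on success, -1 on failure.
 */
static int clear_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags < 0) {
        return -1;
    }
    return fcntl(fd, F_SETFD, flags & ~FD_CLOEXEC) < 0 ? -1 : 0;
}

/* Returns 1 if fd would survive exec, 0 if it would not, -1 on error. */
static int survives_exec(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    return flags < 0 ? -1 : !(flags & FD_CLOEXEC);
}

/* Usage: mark a pipe end close-on-exec, then clear the flag again. */
static void clear_cloexec_demo(void)
{
    int fds[2];
    int rc = pipe(fds);

    assert(rc == 0);
    rc = fcntl(fds[0], F_SETFD, FD_CLOEXEC);
    assert(rc == 0);
    assert(survives_exec(fds[0]) == 0);
    rc = clear_cloexec(fds[0]);
    assert(rc == 0);
    assert(survives_exec(fds[0]) == 1);
    close(fds[0]);
    close(fds[1]);
}
```

The fd number itself is unchanged by exec, which is why remembering the
number in saved state is enough for the new process to find it.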

> AFAIU, "cpr-exec takes 0.1 secs" is a conditional result.  I think it at
> least should be relevant to what devices are attached to the VM, right?
>
> E.g., I observed at least two things that can drastically enlarge the
> blackout window:
> 
>    1) vcpu save/load sometimes can take ridiculously long time, even if 99%
>    of them are fine.  I still didn't spend time looking at this issue, but
>    the outlier (of a single cpu save/load, while I don't remember whether
>    it's save or load, both will contribute to the downtime anyway) can cause
>    100+ms already for that single vcpu.  It'll already get more than 0.1sec.
> 
>    2) virtio device loads can be sometimes very slow due to virtqueue
>    manipulations.  We used to have developers working in this area,
>    e.g. this thread:
> 
>    https://lore.kernel.org/r/20230317081904.24389-1-xuchuangxclwt@bytedance.com
> 
>    I don't yet have time to further look.  Since you mentioned vhost I was
>    wondering whether you hit similar issues, and if not why yet.  IIRC it
>    was only during VM loads so dest QEMU only.  Again that'll contribute to
>    the overall downtime too and that can also be 100ms or more, but that may
>    depend on VM memory topology and device setup.

100 ms is not a promise, it is an order-of-magnitude characterization. A typical
result.

> When we compare the solutions, we definitely don't need to make it "live":

Agreed.  The key metric is guest blackout time.  In fact, the 100 ms I quote
is blackout time, not elapsed time, though the latter is not much longer.

> it could be a migration starting with VM paused already, skipping all dirty
> tracking just like cpr-reboot, but in this case it can be a relatively
> normal migration, so that we still invoke the new qemu binary and load that
> on the fly, perhaps taking the fds via scm rights.  Then compare these two
> solutions with/without exec().  Note that I'm not requesting for such data;
> it's not fair if that takes a lot of work already first to implement such
> idea, but what I wanted to say is that it might be interesting to first
> analyze what caused the downtime, and whether that can be logically
> resolved too without exec(); hence the below question on "why exec()" in
> the first place, as I still feel like that's somewhere we should avoid
> unless extremely necessary..

Exec is not a key requirement, but it works well.  Please give it fair
consideration.

>> cpr-exec seamlessly preserves client connections by preserving chardevs,
>> and overall provides a much nicer user experience.
> 
> I see.  However this is a common issue to migration, am I right?  I mean,
> if we have some chardevs on src host, then we migrate the VM from src to
> dst, then a reconnect will be needed anyway.  It looks to me that as long
> as the old live migration is supported, there's already a solution and apps
> are ok with reconnecting to the new ports.  

Apps may be OK with it, but I offer a better experience.
To be clear, chardev preservation is a nice feature that is easy to implement
with the cpr-exec framework, but is not the primary motivation for my work.

> From that POV, I am curious
> whether this can be seen as a (kind of separate) work besides the cpr-exec,
> though perhaps as a new feature that is only valid for cpr-exec?

You need much of the cpr-exec (or cpr-scm) framework to support it:
a mechanism to preserve the open descriptor, and precreate vmstate to
identify the descriptor for new qemu.
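
As an illustration of what the "cpr-scm" variant would need in place of
exec, here is a minimal sketch of passing one open descriptor between
processes with SCM_RIGHTS over an AF_UNIX socket.  It assumes a connected
socket and one fd per message; send_fd(), recv_fd(), and the demo are
hypothetical names and are not part of this series.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer sized and aligned for exactly one descriptor. */
union fd_cmsg {
    struct cmsghdr align;
    char buf[CMSG_SPACE(sizeof(int))];
};

/* Hypothetical helper: send fd over a connected AF_UNIX socket. */
static int send_fd(int sock, int fd)
{
    char byte = 0;                      /* one data byte carries the cmsg */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_cmsg u;
    struct msghdr msg = { 0 };
    struct cmsghdr *cmsg;

    memset(&u, 0, sizeof(u));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Hypothetical helper: receive one fd; returns it, or -1 on failure. */
static int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_cmsg u;
    struct msghdr msg = { 0 };
    struct cmsghdr *cmsg;
    int fd;

    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    if (recvmsg(sock, &msg, 0) != 1) {
        return -1;
    }
    cmsg = CMSG_FIRSTHDR(&msg);
    if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int))) {
        return -1;
    }
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}

/* Usage: pass the write end of a pipe across a socketpair, then use it. */
static void fd_passing_demo(void)
{
    int sv[2], p[2], fd, rc;
    char c = 0;

    rc = socketpair(AF_UNIX, SOCK_STREAM, 0, sv);
    assert(rc == 0);
    rc = pipe(p);
    assert(rc == 0);
    rc = send_fd(sv[0], p[1]);
    assert(rc == 0);
    fd = recv_fd(sv[1]);
    assert(fd >= 0);
    rc = (int)write(fd, "x", 1);        /* write through the received copy */
    assert(rc == 1);
    rc = (int)read(p[0], &c, 1);
    assert(rc == 1 && c == 'x');
}
```

Unlike the exec case, the receiver generally gets a different fd number, so
the saved state would have to be fixed up with the received value.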

> Meanwhile, is there some elaborations on what would be the major change of
> nicer user experience with the new solution?
> 
>>
>> chardev's are preserved by keeping their fd open across the exec, and
>> remembering the value of the fd in precreate vmstate so that new qemu
>> can associate the fd with the chardev rather than opening a new one.
>>
>> The approach of preserving open file descriptors is very general and applicable
>> to all kinds of devices, regardless of whether they support live migration
>> in hardware.  Device fd's are preserved using the same mechanism as for
>> chardevs.
>>
>> Devices that support live migration in hardware do not like to live migrate
>> in place to the same node.  It is not what they are designed for, and some
>> implementations will flat out fail because the source and target interfaces
>> are the same.
>>
>> For vhost/tap, sometimes the management layer opens the dev and passes an
>> fd to qemu, and sometimes qemu opens the dev.  The upcoming vhost/tap support
>> allows both.  For the case where qemu opens the dev, the fd is preserved
>> using the same mechanism as for chardevs.
>>
>> The fundamental requirements of this work are:
>>    - precreate vmstate
>>    - preserve open file descriptors
>>
>> Direct exec from old to new qemu is not a hard requirement.
> 
> Great to know..
> 
>> However, it is simple, with few complications, and works with Oracle's
>> cloud containers, so it is the method I am most interested in finishing
>> first.
>>
>> I believe everything could also be made to work by using SCM_RIGHTS to
>> send fd's to a new qemu process that is started by some external means.
>> It would be requested with MIG_MODE_CPR_SCM (or some better name), and
>> would co-exist with MIG_MODE_CPR_EXEC.
> 
> That sounds like a better thing to me, so that live migration framework is
> not changed as drastic.  I just still feel like exec() is too powerful, and
> evil can reside, just like black magic in the fairy tales; magicians try to
> avoid using it unless extremely necessary.

Fork is scarier; it preserves almost everything, with a few exceptions.
Exec destroys almost everything, with a few exceptions.
Please give it a chance.  The theorized cpr-scm would no doubt be useful
for some cloud vendors, but so is cpr-exec.  cpr-scm is intellectually
interesting to me, and I might work on it at some point, but cpr-exec is
what I need for our cloud.

> I think the next step for my review is to understand what is implied with
> exec().  I'll wait for you to push your tree somewhere so maybe I can read
> that and understand better.  A base commit would work too if you can share
> so I can apply the series, as it doesn't seem to apply to master now.

Try these tracepoints:
-trace enable=qemu_anon_memfd_alloc
-trace enable=ram_block_create
-trace enable='*factory*'
-trace enable='vmstate_*register'

I sent this to Peter already, but for others' benefit, this series applies to
commit 5da72194df36535d77.

- Steve


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH V1 07/26] migration: VMStateId
  2024-05-30 17:11             ` Steven Sistare via
@ 2024-05-30 18:03               ` Peter Xu
  0 siblings, 0 replies; 122+ messages in thread
From: Peter Xu @ 2024-05-30 18:03 UTC (permalink / raw)
  To: Steven Sistare
  Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Igor Mammedov,
	Eduardo Habkost, Marcel Apfelbaum, Philippe Mathieu-Daude,
	Paolo Bonzini, Daniel P. Berrange, Markus Armbruster

On Thu, May 30, 2024 at 01:11:26PM -0400, Steven Sistare wrote:
> On 5/29/2024 2:53 PM, Peter Xu wrote:
> > On Wed, May 29, 2024 at 01:30:18PM -0400, Steven Sistare wrote:
> > > How about a more general name for the type:
> > > 
> > > migration/misc.h
> > >      typedef char (MigrationId)[256];
> > 
> > How about qemu/typedefs.h?  Not sure whether it's applicable. Markus (in
> > the loop) may have a better idea.
> > 
> > Meanwhile, s/MigrationID/IDString/?
> 
> typedefs.h has a different purpose; giving short names to types
> defined in internal include files.
> 
> This id is specific to migration, so I still think its name should reflect
> migration and it belongs in some include/migration/*.h file.
> 
> ramblocks and migration are already closely related.  There is nothing wrong
> with including a migration header in ramblock.h so it can use a migration type.
> We already have:
>   include/hw/acpi/ich9_tco.h:#include "migration/vmstate.h"
>   include/hw/display/ramfb.h:#include "migration/vmstate.h"
>   include/hw/input/pl050.h:#include "migration/vmstate.h"
>   include/hw/pci/shpc.h:#include "migration/vmstate.h"
>   include/hw/virtio/virtio.h:#include "migration/vmstate.h"
>   include/hw/hyperv/vmbus.h:#include "migration/vmstate.h"
> 
> The 256 byte magic length already appears in too many places, and my code
> would add more occurrences, so I really think that abstracting this type
> would be cleaner.

I agree having a typedef is nicer, but I don't understand why it must be
migration related.  It can be the type QEMU uses to represent any string
based ID, and that's a generic concept to me.

Migration can have a wrapper to process that type, but then migration will
include the generic header in that case, it feels more natural that way?
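
To make the two options concrete, here is a minimal sketch of the generic
variant, assuming the IDString name floated above.  The ID_STRING_LEN macro,
the id_string_set() helper, and the DemoRamBlock struct are hypothetical and
exist only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical generic ID type replacing open-coded char[256] fields. */
#define ID_STRING_LEN 256
typedef char IDString[ID_STRING_LEN];

static_assert(sizeof(IDString) == ID_STRING_LEN, "IDString must be 256 bytes");

/*
 * Copy src into dst.  Returns false and empties dst if src does not fit.
 * Note that dst decays to char * here, so sizeof(dst) must not be used.
 */
static bool id_string_set(IDString dst, const char *src)
{
    int n = snprintf(dst, ID_STRING_LEN, "%s", src);

    if (n < 0 || n >= ID_STRING_LEN) {
        dst[0] = '\0';
        return false;
    }
    return true;
}

/* A migration-side user embeds the generic type (hypothetical struct). */
typedef struct DemoRamBlock {
    IDString idstr;
} DemoRamBlock;
```

With this layout a migration header would include the generic one, rather
than the other way around.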

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH V1 17/26] machine: memfd-alloc option
  2024-05-30 17:11         ` Steven Sistare via
@ 2024-05-30 18:14           ` Peter Xu
  2024-05-31 19:32             ` Steven Sistare via
  0 siblings, 1 reply; 122+ messages in thread
From: Peter Xu @ 2024-05-30 18:14 UTC (permalink / raw)
  To: Steven Sistare
  Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Igor Mammedov,
	Eduardo Habkost, Marcel Apfelbaum, Philippe Mathieu-Daude,
	Paolo Bonzini, Daniel P. Berrange, Markus Armbruster

On Thu, May 30, 2024 at 01:11:09PM -0400, Steven Sistare wrote:
> On 5/29/2024 3:14 PM, Peter Xu wrote:
> > On Wed, May 29, 2024 at 01:31:38PM -0400, Steven Sistare wrote:
> > > > > diff --git a/system/memory.c b/system/memory.c
> > > > > index 49f1cb2..ca04a0e 100644
> > > > > --- a/system/memory.c
> > > > > +++ b/system/memory.c
> > > > > @@ -1552,8 +1552,9 @@ bool memory_region_init_ram_nomigrate(MemoryRegion *mr,
> > > > >                                          uint64_t size,
> > > > >                                          Error **errp)
> > > > >    {
> > > > > +    uint32_t flags = current_machine->memfd_alloc ? RAM_SHARED : 0;
> > > > 
> > > > If there's a machine option to "use memfd for allocations", then it's
> > > > shared mem... Hmm..
> > > > 
> > > > It is a bit confusing to me in quite a few levels:
> > > > 
> > > >     - Why memory allocation method will be defined by a machine property,
> > > >       even if we have memory-backend-* which should cover everything?
> > > 
> > > Some memory regions are implicitly created, and have no explicit representation
> > > on the qemu command line.  memfd-alloc affects those.
> > > 
> > > More generally, memfd-alloc affects all ramblock allocations that are
> > > not explicitly represented by memory-backend object.  Thus the simple
> > > command line "qemu -m 1G" does not explicitly describe an object, so it
> > > goes through the anonymous allocation path, and is affected by memfd-alloc.
> > 
> > Can we simply now allow "qemu -m 1G" to work for cpr-exec?
> 
> I assume you meant "simply not allow".
> 
> Yes, I could do that, but I would need to explicitly add code to exclude this
> case, and add a blocker.  Right now it "just works" for all paths that lead to
> ram_block_alloc_host, without any special logic at the memory-backend level.
> And, I'm not convinced that simplifies the docs, as now I would need to tell
> the user that "-m 1G" and similar constructions do not work with cpr.
> 
> I can try to clarify the doc for -memfd-alloc as currently defined.

Why do we need to keep cpr working for existing qemu cmdlines?  We'll
need to add more new cmdline options anyway, right?

cpr-reboot wasn't doing it, and that made sense to me, so that new features
will require the user to opt-in for it, starting with changing its
cmdlines.

> 
> > AFAIU that's
> > what we do with cpr-reboot: we ask the user to specify the right things to
> > make other thing work.  Otherwise it won't.
> > 
> > > 
> > > Internally, create_default_memdev does create a memory-backend object.
> > > That is what my doc comment above refers to:
> > >    Any associated memory-backend objects are created with share=on
> > > 
> > > An explicit "qemu -object memory-backend-*" is not affected by memfd-alloc.
> > > 
> > > The qapi comments in patch "migration: cpr-exec mode" attempt to say all that:
> > > 
> > > +#     Memory backend objects must have the share=on attribute, and
> > > +#     must be mmap'able in the new QEMU process.  For example,
> > > +#     memory-backend-file is acceptable, but memory-backend-ram is
> > > +#     not.
> > > +#
> > > +#     The VM must be started with the '-machine memfd-alloc=on'
> > > +#     option.  This causes implicit ram blocks -- those not explicitly
> > > +#     described by a memory-backend object -- to be allocated by
> > > +#     mmap'ing a memfd.  Examples include VGA, ROM, and even guest
> > > +#     RAM when it is specified without a memory-backend object.
> > 
> > VGA is IIRC 16MB chunk, ROM is even smaller.  If the user specifies -object
> > > memory-backend-file,share=on properly, these should be the only outliers?
> > 
> > Are these important enough for the downtime?  Can we put them into the
> > migrated image alongside with the rest device states?
> 
> It's not about downtime.  vfio, vdpa, and iommufd pin all guest pages.
> The pages must remain pinned during CPR to support ongoing DMA activity
> which could target those pages (which we do not quiesce), and the same
> physical pages must be used for the ramblocks in the new qemu process.

Ah ok, yes DMA can happen on the fly.

Guest mem is definitely the major DMA target and that can be covered by
-object memory-backend-*,share=on cmdlines.
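
As a reminder of why share=on matters here, below is a minimal sketch,
assuming Linux memfd_create() and mmap().  It shows that two MAP_SHARED
mappings of one memfd see the same pages at different virtual addresses,
which is the property a new process relies on when it maps the preserved fd
again.  The helper names are hypothetical and this is not QEMU's allocation
path.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: create a memfd of the given size; fd or -1. */
static int ram_memfd_create(size_t size)
{
    /* No MFD_CLOEXEC: the fd is meant to stay open across exec. */
    int fd = memfd_create("demo-ram", 0);

    if (fd < 0) {
        return -1;
    }
    if (ftruncate(fd, (off_t)size) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Map the memfd shared, so stores reach the fd's pages, not a private copy. */
static void *ram_memfd_map(int fd, size_t size)
{
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    return p == MAP_FAILED ? NULL : p;
}

/* Usage: the second mapping stands in for the mapping made by new qemu. */
static void ram_memfd_demo(void)
{
    size_t size = 4096;
    int fd = ram_memfd_create(size);
    char *old_map, *new_map;

    assert(fd >= 0);
    old_map = ram_memfd_map(fd, size);
    new_map = ram_memfd_map(fd, size);
    assert(old_map && new_map && old_map != new_map);
    strcpy(old_map, "guest data");
    assert(strcmp(new_map, "guest data") == 0);
    munmap(old_map, size);
    munmap(new_map, size);
    close(fd);
}
```

With MAP_PRIVATE instead, stores would go to copy-on-write pages that a
later mapping of the same fd would not see.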

ROM is definitely not a DMA target.  So is VGA ram a target for, perhaps,
an assigned vGPU device?  Do we have a list of things that will need that?
Can we make them work somehow by sharing them like guest mem?

It'll be a complete tragedy if we introduced this whole thing only because
of some minority.  I want to understand whether there's any generic way to
solve this problem rather than this magical machine property.  IMHO it's
far from trivial to maintain.

> 
> > > >     - Even if we have such a machine property, why setting "memfd" will
> > > >       always imply shared?  why not private?  After all it's not called
> > > >       "memfd-shared-alloc", and we can create private mappings using
> > > >       e.g. memory-backend-memfd,share=off.
> > > 
> > > There is no use case for memfd-alloc with share=off, so no point IMO in
> > > making the option more verbose.
> > 
> > > Unfortunately this fact doesn't make the property easier to understand. :-(
> > > 
> > > > For cpr, the mapping with all its modifications must be visible to new
> > > > qemu when qemu mmaps it.
> > 
> > So this might be the important part - do you mean migrating
> > VGA/ROM/... small ramblocks won't work (besides any performance concerns)?
> > Could you elaborate?
> 
> Pinning.
> 
> > Cpr-reboot already introduced lots of tricky knobs to QEMU.  We may need to
> > restrict that specialty to minimal, making the interfacing as clear as
> > possible, or (at least migration) maintainers will start to be soon scared
> > and running away, if such proposal was not shot down.
> > 
> > In short, I hope when we introduce new knobs for cpr, we shouldn't always
> > keep cpr-* modes in mind, but consider whenever the user can use it without
> > cpr-*.  I'm not sure whether it'll be always possible, but we should try.
> 
> I agree in principle.  FWIW, I have tried to generalize the functionality needed
> by cpr so it can be used in other ways: per-mode blockers, per-mode notifiers,
> precreate vmstate, factory objects; to base it on migration internals with
> minimal change (vmstate); and to make minimal changes in the migration control
> paths.

Thanks.

For this one I think reusing -object interface (hopefully without
introducing a knob) would be a great step if that can fully describe what
cpr-exec is looking for.  E.g., when cpr-exec mode is enabled it can sanity
check the memory backends making sure all things satisfy its need, and fail
migration otherwise upfront.
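
A rough sketch of what such an upfront check could look like, using the two
requirements quoted from the qapi comment (share=on, and mappable again in
the new process).  The DemoBackend type and cpr_find_blocker() are
hypothetical simplifications, not QEMU types or functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified description of one memory backend. */
typedef struct DemoBackend {
    const char *id;
    bool share;         /* created with share=on */
    bool fd_backed;     /* file or memfd, so new qemu can mmap it again */
} DemoBackend;

/*
 * Return the first backend that cannot be preserved, or NULL if every
 * backend qualifies.  A caller would fail the migration upfront and name
 * the offending backend in the error message.
 */
static const DemoBackend *cpr_find_blocker(const DemoBackend *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (!b[i].share || !b[i].fd_backed) {
            return &b[i];
        }
    }
    return NULL;
}

/* Usage: a shared file backend passes, an anonymous RAM backend does not. */
static void cpr_blocker_demo(void)
{
    const DemoBackend ok[] = {
        { "ram0", true, true },     /* e.g. memory-backend-file,share=on */
    };
    const DemoBackend bad[] = {
        { "ram0", true, true },
        { "ram1", false, false },   /* e.g. memory-backend-ram */
    };

    assert(cpr_find_blocker(ok, 1) == NULL);
    assert(cpr_find_blocker(bad, 2) == &bad[1]);
}
```

This only covers explicit backends; the implicitly created ramblocks
discussed above would still need their own answer.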

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH V1 19/26] physmem: preserve ram blocks for cpr
  2024-05-30 17:12         ` Steven Sistare via
@ 2024-05-30 18:39           ` Peter Xu
  2024-05-31 19:32             ` Steven Sistare via
  0 siblings, 1 reply; 122+ messages in thread
From: Peter Xu @ 2024-05-30 18:39 UTC (permalink / raw)
  To: Steven Sistare
  Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Igor Mammedov,
	Eduardo Habkost, Marcel Apfelbaum, Philippe Mathieu-Daude,
	Paolo Bonzini, Daniel P. Berrange, Markus Armbruster

On Thu, May 30, 2024 at 01:12:40PM -0400, Steven Sistare wrote:
> On 5/29/2024 3:25 PM, Peter Xu wrote:
> > On Wed, May 29, 2024 at 01:31:53PM -0400, Steven Sistare wrote:
> > > On 5/28/2024 5:44 PM, Peter Xu wrote:
> > > > On Mon, Apr 29, 2024 at 08:55:28AM -0700, Steve Sistare wrote:
> > > > > Preserve fields of RAMBlocks that allocate their host memory during CPR so
> > > > > the RAM allocation can be recovered.
> > > > 
> > > > This sentence itself did not explain much, IMHO.  QEMU can share memory
> > > > using fd based memory already of all kinds, as long as the memory backend
> > > > is path-based it can be shared by sharing the same paths to dst.
> > > > 
> > > > This reads very confusing as a generic concept.  I mean, QEMU migration
> > > > > relies on so many things to work right.  We mostly ask the users to "use
> > > > exactly the same cmdline for src/dst QEMU unless you know what you're
> > > > doing", otherwise many things can break.  That should also include ramblock
> > > > being matched between src/dst due to the same cmdlines provided on both
> > > > sides.  It'll be confusing to mention this when we thought the ramblocks
> > > > also rely on that fact.
> > > > 
> > > > So IIUC this sentence should be dropped in the real patch, and I'll try to
> > > > guess the real reason with below..
> > > 
> > > The properties of the implicitly created ramblocks must be preserved.
> > > The defaults can and do change between qemu releases, even when the command-line
> > > parameters do not change for the explicit objects that cause these implicit
> > > ramblocks to be created.
> > 
> > AFAIU, QEMU relies on ramblocks to be the same before this series.  Do you
> > have an example?  Would that already cause issue when migrate?
> 
> Alignment has changed, and used_length vs max_length changed when
> resizeable ramblocks were introduced.  I have dealt with these issues
> while supporting cpr for our internal use, and the learned lesson is to
> explicitly communicate the creation-time parameters to new qemu.

Why can used_length change?  I'm looking at ram_mig_ram_block_resized():

    if (!migration_is_idle()) {
        /*
         * Precopy code on the source cannot deal with the size of RAM blocks
         * changing at random points in time - especially after sending the
         * RAM block sizes in the migration stream, they must no longer change.
         * Abort and indicate a proper reason.
         */
        error_setg(&err, "RAM block '%s' resized during precopy.", rb->idstr);
        migration_cancel(err);
        error_free(err);
    }

We sent used_length upfront of a migration during SETUP phase.  Looks like
what you're describing can be something different, though?

Regarding rb->align: isn't that mostly a constant, reflecting the MR's
alignment?  It's set when ramblock is created IIUC:

    rb->align = mr->align;

When will the alignment change?

> 
> These are not an issue for migration because the ramblock is re-created
> and the data copied into the new memory.
> 
> > > > > Mirror the mr->align field in the RAMBlock to simplify the vmstate.
> > > > > Preserve the old host address, even though it is immediately discarded,
> > > > > as it will be needed in the future for CPR with iommufd.  Preserve
> > > > > guest_memfd, even though CPR does not yet support it, to maintain vmstate
> > > > > compatibility when it becomes supported.
> > > > 
> > > > .. It could be about the vfio vaddr update feature that you mentioned and
> > > > only for iommufd (as IIUC vfio still relies on iova ranges, then it won't
> > > > help here)?
> > > > 
> > > > If so, IMHO we should have this patch (or any variance form) to be there
> > > > for your upcoming vfio support.  Keeping this around like this will make
> > > > the series harder to review.  Or is it needed even before VFIO?
> > > 
> > > This patch is needed independently of vfio or iommufd.
> > > 
> > > guest_memfd is independent of vfio or iommufd.  It is a recent addition
> > > which I have not tried to support, but I added this placeholder field
> > > > so it can be supported in the future without adding a new field later
> > > and maintaining backwards compatibility.
> > 
> > Is guest_memfd the only user so far, then?  If so, would it be possible we
> > split it as a separate effort on top of the base cpr-exec support?
> 
> I don't understand the question.  I am indeed deferring support for guest_memfd
> to a future time.  For now, I am adding a blocker, and reserving a field for
> it in the preserved ramblock attributes, to avoid adding a subsection later.

I meant I'm thinking whether the new ramblock vmsd may not be required for
the initial implementation.

E.g., IIUC vaddr is required by iommufd, and so far that's not part of the
initial support.

Then I think a major thing is about the fds to be managed that will need to
be shared.  If we put guest_memfd aside, it can be really, mostly, about
VFIO fds.  For that, I'm wondering whether you looked into something like
this:

commit da3e04b26fd8d15b344944504d5ffa9c5f20b54b
Author: Zhenzhong Duan <zhenzhong.duan@intel.com>
Date:   Tue Nov 21 16:44:10 2023 +0800

    vfio/pci: Make vfio cdev pre-openable by passing a file handle

I just noticed this when I was thinking of a way where it might be possible
to avoid QEMU vfio-pci open the device at all, then I found we have
something like that already..

Then if the mgmt wants, IIUC that fd can be passed down from Libvirt
cleanly to dest qemu in a no-exec context.  Would this work too, and
cleaner / reusing existing infrastructures?

I think it's nice to always have libvirt managing most, or possibly all,
fds that qemu uses, then we don't even need scm_rights.  But I didn't look
deeper into this, just a thought.

When thinking about this, I also wonder how cpr-exec handles the limited
environments like cgroups and especially seccomps.  I'm not sure what's the
status of that in most cloud environments, but I think exec() / fork() is
definitely not always on the seccomp whitelist, and I think that's also
another reason why we can think about avoiding them.

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH V1 00/26] Live update: cpr-exec
  2024-05-30 17:17               ` Steven Sistare via
@ 2024-05-30 19:23                 ` Peter Xu
  0 siblings, 0 replies; 122+ messages in thread
From: Peter Xu @ 2024-05-30 19:23 UTC (permalink / raw)
  To: Steven Sistare; +Cc: Fabiano Rosas, QEMU Developers

On Thu, May 30, 2024 at 01:17:05PM -0400, Steven Sistare wrote:
> On 5/28/2024 12:42 PM, Peter Xu wrote:
> > On Tue, May 28, 2024 at 11:10:27AM -0400, Steven Sistare wrote:
> > > On 5/27/2024 1:45 PM, Peter Xu wrote:
> > > > On Tue, May 21, 2024 at 07:46:12AM -0400, Steven Sistare wrote:
> > > > > I understand, thanks.  If I can help with any of your todo list,
> > > > > just ask - steve
> > > > 
> > > > Thanks for offering the help, Steve.  Started looking at this today, then I
> > > > found that I miss something high-level.  Let me ask here, and let me
> > > > apologize already for starting to throw multiple questions..
> > > > 
> > > > IIUC the whole idea of this patchset is to allow efficient QEMU upgrade, in
> > > > this case not host kernel but QEMU-only, and/or upper.
> > > > 
> > > > Is there any justification on why the complexity is needed here?  It looks
> > > > to me this one is more involved than cpr-reboot, so I'm thinking how much
> > > > we can get from the complexity, and whether it's worthwhile.  1000+ LOC is
> > > > the min support, and if we even expect more to come, that's really
> > > > important, IMHO.
> > > > 
> > > > For example, what's the major motivation of this whole work?  Is that more
> > > > on performance, or is it more for supporting the special devices like VFIO
> > > > which we used to not support, or something else?  I can't find them in
> > > > whatever cover letter I can find, including this one.
> > > > 
> > > > Firstly, regarding performance, IMHO it'll be always nice to share even
> > > > some very fundamental downtime measurement comparisons using the new exec
> > > > mode v.s. the old migration ways to upgrade QEMU binary.  Do you perhaps
> > > > have some number on hand when you started working on this feature years
> > > > ago?  Or maybe some old links on the list would help too, as I didn't
> > > > follow this work since the start.
> > > > 
> > > > On VFIO, IIUC you started out this project without VFIO migration being
> > > > there.  Now we have VFIO migration so not sure how much it would work for
> > > > the upgrade use case. Even with current VFIO migration, we may not want to
> > > > migrate device states for a local upgrade I suppose, as that can be a lot
> > > > depending on the type of device assigned.  However it'll be nice to discuss
> > > > this too if this is the major purpose of the series.
> > > > 
> > > > I think one other challenge on QEMU upgrade with VFIO devices is that the
> > > > dest QEMU won't be able to open the VFIO device when the src QEMU is still
> > > > using it as the owner.  IIUC this is a similar condition where QEMU wants
> > > > to have proper ownership transfer of a shared block device, and AFAIR right
> > > > now we resolved that issue using some form of file lock on the image file.
> > > > In this case it won't easily apply to a VFIO dev fd, but maybe we still
> > > > have other approaches, not sure whether you investigated any.  E.g. could
> > > > the VFIO handle be passed over using unix scm rights?  I think this might
> > > > remove one dependency of using exec which can cause quite some difference
> > > > v.s. a generic migration (from which regard, cpr-reboot is still a pretty
> > > > generic migration).
> > > > 
> > > > You also mentioned vhost/tap, is that also a major goal of this series in
> > > > the follow up patchsets?  Is this a problem only because this solution will
> > > > do exec?  Can it work if either the exec()ed qemu or dst qemu create the
> > > > vhost/tap fds when boot?
> > > > 
> > > > Meanwhile, could you elaborate a bit on the implication on chardevs?  From
> > > > what I read in the doc update it looks like a major part of work in the
> > > > future, but I don't yet understand the issue..  Is it also relevant to the
> > > > exec() approach?
> > > > 
> > > > In all cases, some of such discussion would be really appreciated.  And if
> > > > you used to consider other approaches to solve this problem it'll be great
> > > > to mention how you chose this way.  Considering this work contains too many
> > > > things, it'll be nice if such discussion can start with the fundamentals,
> > > > e.g. on why exec() is a must.
> > > 
> > > The main goal of cpr-exec is providing a fast and reliable way to update
> > > qemu. cpr-reboot is not fast enough or general enough.  It requires the
> > > guest to support suspend and resume for all devices, and that takes seconds.
> > > If one actually reboots the host, that adds more seconds, depending on
> > > system services.  cpr-exec takes 0.1 secs, and works every time, unlike
> > > live migration, which can fail to converge on a busy system.  Live migration
> > > also consumes more system and network resources.
> > 
> > Right, but note that when I was thinking of a comparison between cpr-exec
> > v.s. normal migration, I didn't mean a "normal live migration".  I think
> > it's more of the case whether exec() can be avoided.  I had a feeling that
> > this exec() will cause a major part of work elsewhere but maybe I am wrong
> > as I didn't see the whole branch.
> 
> The only parts of this work that are specific to exec are these patches
> and the qemu_clear_cloexec() calls in cpr.c.
>   vl: helper to request re-exec
>   migration: precreate vmstate for exec
> 
> The rest would be the same if some other mechanism were used to start
> new qemu.   Additional code would be needed for the new mechanism, such
> as SCM_RIGHTS sends.

Please see my other reply; I feel like there's a chance to avoid more, but I
don't think we finished the discussion on e.g. the vga ram implications, or the
vfio-pci fd reuse. So we can keep the discussion there.

> 
> > AFAIU, "cpr-exec takes 0.1 secs" is a conditional result.  I think it at
> > least should be relevant to what devices are attached to the VM, right?
> > 
> > E.g., I observed at least two things that can drastically enlarge the
> > blackout window:
> > 
> >    1) vcpu save/load sometimes can take ridiculously long time, even if 99%
> >    of them are fine.  I still didn't spend time looking at this issue, but
> >    the outlier (of a single cpu save/load, while I don't remember whether
> >    it's save or load, both will contribute to the downtime anyway) can cause
> >    100+ms already for that single vcpu.  It'll already get more than 0.1sec.
> > 
> >    2) virtio device loads can be sometimes very slow due to virtqueue
> >    manipulations.  We used to have developers working in this area,
> >    e.g. this thread:
> > 
> >    https://lore.kernel.org/r/20230317081904.24389-1-xuchuangxclwt@bytedance.com
> > 
> >    I don't yet have time to further look.  Since you mentioned vhost I was
> >    wondering whether you hit similar issues, and if not why yet.  IIRC it
> >    was only during VM loads so dest QEMU only.  Again that'll contribute to
> >    the overall downtime too and that can also be 100ms or more, but that may
> >    depend on VM memory topology and device setup.
> 
> 100 ms is not a promise, it is an order-of-magnitude characterization. A typical
> result.
> 
> > When we compare the solutions, we definitely don't need to make it "live":
> 
> Agreed.  The key metric is guest blackout time.  In fact, the 100 ms I quote
> is blackout time, not elapsed time, though the latter is not much longer.

Here I think what would be interesting is how exec() could help reduce the
blackout time compared to invoking another qemu.

The major device state save/load looks to be a cost shared by both approaches.
RAM sharing is also a shared attribute that can be leveraged without the
exec() approach.

FD passing is indeed another good point on reducing the blackout window
(including your vfio vaddr update work), but that also doesn't seem
relevant to exec().
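
For readers less familiar with the alternative being discussed, here is a
minimal sketch of fd passing with SCM_RIGHTS over an AF_UNIX socket.  It is
plain Linux/POSIX C, not code from this series, and send_fd() / recv_fd() are
hypothetical helper names chosen for illustration.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Send one open descriptor over a connected AF_UNIX socket; 0 on success. */
int send_fd(int sock, int fd)
{
    char byte = 0;                      /* one byte of ordinary data to carry the cmsg */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        struct cmsghdr align;           /* forces suitable alignment for buf */
        char buf[CMSG_SPACE(sizeof(int))];
    } u;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    assert(fd >= 0);
    memset(&u, 0, sizeof(u));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one descriptor; returns the new fd number, or -1 on failure. */
int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        struct cmsghdr align;
        char buf[CMSG_SPACE(sizeof(int))];
    } u;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd;

    memset(&u, 0, sizeof(u));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    if (recvmsg(sock, &msg, 0) != 1) {
        return -1;
    }
    cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int))) {
        return -1;
    }
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

The receiver gets a new descriptor number that refers to the same open file
description, so whatever state is associated with it in the kernel carries
over, independent of whether the receiving process was started by exec().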

> 
> > it could be a migration starting with VM paused already, skipping all dirty
> > tracking just like cpr-reboot, but in this case it can be a relatively
> > normal migration, so that we still invoke the new qemu binary and load that
> > on the fly, perhaps taking the fds via scm rights.  Then compare these two
> > solutions with/without exec().  Note that I'm not requesting for such data;
> > it's not fair if that takes a lot of work already first to implement such
> > idea, but what I wanted to say is that it might be interesting to first
> > analyze what caused the downtime, and whether that can be logically
> > resolved too without exec(); hence the below question on "why exec()" in
> > the first place, as I still feel like that's somewhere we should avoid
> > unless extremely necessary..
> 
> Exec is not a key requirement, but it works well.  Please give it fair
> consideration.

Right, I think I'm still trying to understand what it can bring.  Even
though I must confess personally I definitely prefer anything but it.. So
maybe I'll be convinced at some point, so far just not fully yet.

> 
> > > cpr-exec seamlessly preserves client connections by preserving chardevs,
> > > and overall provides a much nicer user experience.
> > 
> > I see.  However this is a common issue to migration, am I right?  I mean,
> > if we have some chardevs on src host, then we migrate the VM from src to
> > dst, then a reconnect will be needed anyway.  It looks to me that as long
> > as the old live migration is supported, there's already a solution and apps
> > are ok with reconnecting to the new ports.
> 
> Apps may be OK with it, but I offer a better experience.
> To be clear, chardev preservation is a nice feature that is easy to implement
> with the cpr-exec framework, but is not the primary motivation for my
> work.

E.g., libvirt used to have a connection to a chardev backend, with legacy
code it will need a reconnect?  Now libvirt can avoid that reconnect
operation.  Is that the case?

The issue is I still don't see why it's a major benefit if libvirt already
supports the reconnections; it can be another story if it didn't.  I don't
think chardev usages should be sensitive to performance / reconnects
either?

Meanwhile, do we need to modify all chardev call sites to support them one
by one?  Please bear with me if these are silly questions; I'm not
familiar with that area.

> 
> > From that POV, I am curious
> > whether this can be seen as a (kind of separate) work besides the cpr-exec,
> > though perhaps as a new feature that is only valid for cpr-exec?
> 
> You need much of the cpr-exec (or cpr-scm) framework to support it:
> a mechanism to preserve the open descriptor, and precreate vmstate to
> identify the descriptor for new qemu.

Let's see how you think about it when you read the vfio commit on the
pre-opened device.  I feel like it's a pretty good idea to provide such a
generic interface so that fds are more flexibly managed in QEMU.

I'd be more than glad if libvirt can manage all the fds, so that the
pre-create approach isn't required, maybe?  That's a major tricky part that
I feel nervous about in this series besides exec() itself.  I'm not sure whether
that can also extend to chardevs too, but there'll be similar question on
whether it'll be worthwhile to avoid the reconnection if major effort is
needed.

> 
> > Meanwhile, is there some elaborations on what would be the major change of
> > nicer user experience with the new solution?
> > 
> > > 
> > > chardev's are preserved by keeping their fd open across the exec, and
> > > remembering the value of the fd in precreate vmstate so that new qemu
> > > can associate the fd with the chardev rather than opening a new one.
> > > 
> > > The approach of preserving open file descriptors is very general and applicable
> > > to all kinds of devices, regardless of whether they support live migration
> > > in hardware.  Device fd's are preserved using the same mechanism as for
> > > chardevs.
> > > 
> > > Devices that support live migration in hardware do not like to live migrate
> > > in place to the same node.  It is not what they are designed for, and some
> > > implementations will flat out fail because the source and target interfaces
> > > are the same.
> > > 
> > > For vhost/tap, sometimes the management layer opens the dev and passes an
> > > fd to qemu, and sometimes qemu opens the dev.  The upcoming vhost/tap support
> > > allows both.  For the case where qemu opens the dev, the fd is preserved
> > > using the same mechanism as for chardevs.
> > > 
> > > The fundamental requirements of this work are:
> > >    - precreate vmstate
> > >    - preserve open file descriptors
> > > 
> > > Direct exec from old to new qemu is not a hard requirement.
> > 
> > Great to know..
> > 
> > > However, it is simple, with few complications, and works with Oracle's
> > > cloud containers, so it is the method I am most interested in finishing
> > > first.
> > > 
> > > I believe everything could also be made to work by using SCM_RIGHTS to
> > > send fd's to a new qemu process that is started by some external means.
> > > It would be requested with MIG_MODE_CPR_SCM (or some better name), and
> > > would co-exist with MIG_MODE_CPR_EXEC.
> > 
> > That sounds like a better thing to me, so that live migration framework is
> > not changed as drastic.  I just still feel like exec() is too powerful, and
> > evil can reside, just like black magic in the fairy tales; magicians try to
> > avoid using it unless extremely necessary.
> 
> Fork is scarier; it preserves almost everything, with a few exceptions.
> Exec destroys almost everything, with a few exceptions.

Hmm this is a very interesting angle to see the syscalls, thanks.  And
OTOH.. I'm definitely not suggesting fork()..

> Please give it a chance.  The theorized cpr-scm would no doubt be useful
> for some cloud vendors, but so is cpr-exec.  cpr-scm is intellectually
> interesting to me, and I might work on it at some point, but cpr-exec is
> what I need for our cloud.

I kind of understand, and as an individual who has worked with you on
multiple series I have my own personal feelings. You're definitely one of
the good developers I've been working with, if not in the great
category.  It's all about the hat, not the red one..

CPR was floating around for too long, and part of that was because there
weren't enough people reviewing, which I'd blame QEMU if that's a "person"
alone, and a person can "die" if he/she does too many wrong things.

However from a company's pov, that's the trade-off for upstreaming-first
approach, and company needs to make a decision irrelevant of community
behavior, I guess. While when a company decided to go further without
upstreaming there's the risk of "tech debt".

Please keep convincing me that cpr-exec is the best.  I don't think we have a
conclusion yet.

> 
> > I think the next step for my review is to understand what is implied with
> > exec().  I'll wait for you to push your tree somewhere so maybe I can read
> > that and understand better.  A base commit would work too if you can share
> > so I can apply the series, as it doesn't seem to apply to master now.
> 
> Try these tracepoints:
> -trace enable=qemu_anon_memfd_alloc
> -trace enable=ram_block_create
> -trace enable='*factory*'
> -trace enable='vmstate_*register'
> 
> I sent this to Peter already, but for others benefit, this series applies to
> commit 5da72194df36535d77.

Yes, thanks.

-- 
Peter Xu




* Re: [PATCH V1 17/26] machine: memfd-alloc option
  2024-05-30 18:14           ` Peter Xu
@ 2024-05-31 19:32             ` Steven Sistare via
  2024-06-03 21:48               ` Peter Xu
  0 siblings, 1 reply; 122+ messages in thread
From: Steven Sistare via @ 2024-05-31 19:32 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Igor Mammedov,
	Eduardo Habkost, Marcel Apfelbaum, Philippe Mathieu-Daude,
	Paolo Bonzini, Daniel P. Berrange, Markus Armbruster

On 5/30/2024 2:14 PM, Peter Xu wrote:
> On Thu, May 30, 2024 at 01:11:09PM -0400, Steven Sistare wrote:
>> On 5/29/2024 3:14 PM, Peter Xu wrote:
>>> On Wed, May 29, 2024 at 01:31:38PM -0400, Steven Sistare wrote:
>>>>>> diff --git a/system/memory.c b/system/memory.c
>>>>>> index 49f1cb2..ca04a0e 100644
>>>>>> --- a/system/memory.c
>>>>>> +++ b/system/memory.c
>>>>>> @@ -1552,8 +1552,9 @@ bool memory_region_init_ram_nomigrate(MemoryRegion *mr,
>>>>>>                                           uint64_t size,
>>>>>>                                           Error **errp)
>>>>>>     {
>>>>>> +    uint32_t flags = current_machine->memfd_alloc ? RAM_SHARED : 0;
>>>>>
>>>>> If there's a machine option to "use memfd for allocations", then it's
>>>>> shared mem... Hmm..
>>>>>
>>>>> It is a bit confusing to me in quite a few levels:
>>>>>
>>>>>      - Why memory allocation method will be defined by a machine property,
>>>>>        even if we have memory-backend-* which should cover everything?
>>>>
>>>> Some memory regions are implicitly created, and have no explicit representation
>>>> on the qemu command line.  memfd-alloc affects those.
>>>>
>>>> More generally, memfd-alloc affects all ramblock allocations that are
>>>> not explicitly represented by memory-backend object.  Thus the simple
>>>> command line "qemu -m 1G" does not explicitly describe an object, so it
>>>> goes through the anonymous allocation path, and is affected by memfd-alloc.
>>>
>>> Can we simply now allow "qemu -m 1G" to work for cpr-exec?
>>
>> I assume you meant "simply not allow".
>>
>> Yes, I could do that, but I would need to explicitly add code to exclude this
>> case, and add a blocker.  Right now it "just works" for all paths that lead to
>> ram_block_alloc_host, without any special logic at the memory-backend level.
>> And, I'm not convinced that simplifies the docs, as now I would need to tell
>> the user that "-m 1G" and similar constructions do not work with cpr.
>>
>> I can try to clarify the doc for -memfd-alloc as currently defined.
> 
> Why do we need to keep cpr working for existing qemu cmdlines?  We'll
> need to add more new cmdline options anyway, right?
> 
> cpr-reboot wasn't doing it, and that made sense to me, so that new features
> will require the user to opt-in for it, starting with changing its
> cmdlines.

I agree.  We need a new option to opt-in to cpr-friendly memory allocation, and I
am proposing -machine memfd-alloc. I am simply saying that I can try to do a better
job explaining the functionality in my proposed text for memfd-alloc, instead of
changing the functionality to exclude "-m 1G".  I believe excluding "-m 1G" is the
wrong approach, for the reasons I stated - messier implementation *and* documentation.

I am open to different syntax for opting in.

>>> AFAIU that's
>>> what we do with cpr-reboot: we ask the user to specify the right things to
>>> make other thing work.  Otherwise it won't.
>>>
>>>>
>>>> Internally, create_default_memdev does create a memory-backend object.
>>>> That is what my doc comment above refers to:
>>>>     Any associated memory-backend objects are created with share=on
>>>>
>>>> An explicit "qemu -object memory-backend-*" is not affected by memfd-alloc.
>>>>
>>>> The qapi comments in patch "migration: cpr-exec mode" attempt to say all that:
>>>>
>>>> +#     Memory backend objects must have the share=on attribute, and
>>>> +#     must be mmap'able in the new QEMU process.  For example,
>>>> +#     memory-backend-file is acceptable, but memory-backend-ram is
>>>> +#     not.
>>>> +#
>>>> +#     The VM must be started with the '-machine memfd-alloc=on'
>>>> +#     option.  This causes implicit ram blocks -- those not explicitly
>>>> +#     described by a memory-backend object -- to be allocated by
>>>> +#     mmap'ing a memfd.  Examples include VGA, ROM, and even guest
>>>> +#     RAM when it is specified without a memory-backend object.
>>>
>>> VGA is IIRC 16MB chunk, ROM is even smaller.  If the user specifies -object
>>> memory-backend-file,share=on properly, these should be the only outliers?
>>>
>>> Are these important enough for the downtime?  Can we put them into the
>>> migrated image alongside with the rest device states?
>>
>> It's not about downtime.  vfio, vdpa, and iommufd pin all guest pages.
>> The pages must remain pinned during CPR to support ongoing DMA activity
>> which could target those pages (which we do not quiesce), and the same
>> physical pages must be used for the ramblocks in the new qemu process.
> 
> Ah ok, yes DMA can happen on the fly.
> 
> Guest mem is definitely the major DMA target and that can be covered by
> -object memory-backend-*,share=on cmdlines.
> 
> ROM is definitely not a DMA target.  So is VGA ram a target for, perhaps,
> an assigned vGPU device?  Do we have a list of things that will need that?
> Can we make them work somehow by sharing them like guest mem?

The pass-through devices map and pin all memory accessible to the guest.
We cannot make exceptions based on our intuition of how the memory will
and will not be used.

Also, we cannot simply abandon the old pinned ramblocks, owned by an mm_struct
that will become a zombie.  We would actually need to write additional code
to call device ioctls to unmap the oddball ramblocks.  It is far cleaner
and more correct to preserve them all.

> It'll be a complete tragedy if we introduced this whole thing only because
> of some minority.  I want to understand whether there's any generic way to
> solve this problem rather than this magical machine property.  IMHO it's
> far from trivial to maintain.

The machine property is the generic way.

A single opt-in option to call memfd_create() is an elegant and effective solution.
The code is small and not hard to maintain.  This is the option patch.  Most of it
is the boilerplate that any option has, and the single code location that formerly
called qemu_anon_ram_alloc now optionally calls qemu_memfd_create:

   machine: memfd-alloc option             53 insertions(+), 4 deletions(-)

These patches are simply stylistic and modularity improvements for ramblock,
valuable in their own right, which allow the previous patch to be small and clean.

   physmem: ram_block_create               29 insertions(+), 21 deletions(-)
   physmem: hoist guest_memfd creation     48 insertions(+), 37 deletions(-)
   physmem: hoist host memory allocation   36 insertions(+), 44 deletions(-)
   physmem: set ram block idstr earlier    25 insertions(+), 28 deletions(-)
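
To make the mechanism concrete, here is a rough sketch of what "allocate
with memfd_create and map it shared" amounts to at the syscall level.  This is
not the patch itself: alloc_shared_memfd() and map_existing_memfd() are
made-up names, and memfd_create is invoked through syscall() only so that the
sketch does not depend on the C library providing a wrapper.

```c
#define _GNU_SOURCE             /* exposes syscall() under strict -std= modes */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical stand-in for an "anonymous" RAM allocation backed by a memfd.
 * On success returns the mapping and stores the memfd in *fd_out.
 */
void *alloc_shared_memfd(const char *name, size_t size, int *fd_out)
{
    void *ptr;
    int fd;

    assert(size > 0 && fd_out != NULL);

    /* flags == 0: MFD_CLOEXEC is not set, so the fd is inherited across exec() */
    fd = (int)syscall(SYS_memfd_create, name, 0U);
    if (fd < 0) {
        return NULL;
    }
    if (ftruncate(fd, (off_t)size) < 0) {
        close(fd);
        return NULL;
    }
    /* MAP_SHARED: stores go to the memfd's pages, not to a private copy */
    ptr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (ptr == MAP_FAILED) {
        close(fd);
        return NULL;
    }
    *fd_out = fd;
    return ptr;
}

/* What a successor process would do with the preserved fd and known size. */
void *map_existing_memfd(int fd, size_t size)
{
    void *ptr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    return ptr == MAP_FAILED ? NULL : ptr;
}
```

Mapping the same fd twice in one process stands in for old and new qemu: the
second mapping lands at a different virtual address but sees the same pages,
including all modifications made through the first.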

>>>>>      - Even if we have such a machine property, why setting "memfd" will
>>>>>        always imply shared?  why not private?  After all it's not called
>>>>>        "memfd-shared-alloc", and we can create private mappings using
>>>>>        e.g. memory-backend-memfd,share=off.
>>>>
>>>> There is no use case for memfd-alloc with share=off, so no point IMO in
>>>> making the option more verbose.
>>>
>>> Unfortunately this fact doesn't make the property easier to understand. :-(
>>>
>>>> For cpr, the mapping with all its modifications must be visible to new
>>>> qemu when qemu mmaps it.
>>>
>>> So this might be the important part - do you mean migrating
>>> VGA/ROM/... small ramblocks won't work (besides any performance concerns)?
>>> Could you elaborate?
>>
>> Pinning.
>>
>>> Cpr-reboot already introduced lots of tricky knobs to QEMU.  We may need to
>>> restrict that specialty to minimal, making the interfacing as clear as
>>> possible, or (at least migration) maintainers will start to be soon scared
>>> and running away, if such proposal was not shot down.
>>>
>>> In short, I hope when we introduce new knobs for cpr, we shouldn't always
>>> keep cpr-* modes in mind, but consider whenever the user can use it without
>>> cpr-*.  I'm not sure whether it'll be always possible, but we should try.
>>
>> I agree in principle.  FWIW, I have tried to generalize the functionality needed
>> by cpr so it can be used in other ways: per-mode blockers, per-mode notifiers,
>> precreate vmstate, factory objects; to base it on migration internals with
>> minimal change (vmstate); and to make minimal changes in the migration control
>> paths.
> 
> Thanks.
> 
> For this one I think reusing -object interface (hopefully without
> introducing a knob) would be a great step if that can fully describe what
> cpr-exec is looking for.  E.g., when cpr-exec mode enabled it can sanity
> check the memory backends making sure all things satisfy its need, and fail
> migration otherwise upfront.

For '-object memory-backend-*', I can tell whether cpr is allowed or not
without additional knobs.  See the blocker patches for examples where cpr
is blocked.

The problem is the implicit ramblocks that currently call qemu_ram_alloc_internal.

- Steve



* Re: [PATCH V1 19/26] physmem: preserve ram blocks for cpr
  2024-05-30 18:39           ` Peter Xu
@ 2024-05-31 19:32             ` Steven Sistare via
  2024-06-03 22:29               ` Peter Xu
  0 siblings, 1 reply; 122+ messages in thread
From: Steven Sistare via @ 2024-05-31 19:32 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Igor Mammedov,
	Eduardo Habkost, Marcel Apfelbaum, Philippe Mathieu-Daude,
	Paolo Bonzini, Daniel P. Berrange, Markus Armbruster

On 5/30/2024 2:39 PM, Peter Xu wrote:
> On Thu, May 30, 2024 at 01:12:40PM -0400, Steven Sistare wrote:
>> On 5/29/2024 3:25 PM, Peter Xu wrote:
>>> On Wed, May 29, 2024 at 01:31:53PM -0400, Steven Sistare wrote:
>>>> On 5/28/2024 5:44 PM, Peter Xu wrote:
>>>>> On Mon, Apr 29, 2024 at 08:55:28AM -0700, Steve Sistare wrote:
>>>>>> Preserve fields of RAMBlocks that allocate their host memory during CPR so
>>>>>> the RAM allocation can be recovered.
>>>>>
>>>>> This sentence itself did not explain much, IMHO.  QEMU can share memory
>>>>> using fd based memory already of all kinds, as long as the memory backend
>>>>> is path-based it can be shared by sharing the same paths to dst.
>>>>>
>>>>> This reads very confusing as a generic concept.  I mean, QEMU migration
>>>>> relies on so many things to work right.  We mostly ask the users to "use
>>>>> exactly the same cmdline for src/dst QEMU unless you know what you're
>>>>> doing", otherwise many things can break.  That should also include ramblock
>>>>> being matched between src/dst due to the same cmdlines provided on both
>>>>> sides.  It'll be confusing to mention this when we thought the ramblocks
>>>>> also rely on that fact.
>>>>>
>>>>> So IIUC this sentence should be dropped in the real patch, and I'll try to
>>>>> guess the real reason with below..
>>>>
>>>> The properties of the implicitly created ramblocks must be preserved.
>>>> The defaults can and do change between qemu releases, even when the command-line
>>>> parameters do not change for the explicit objects that cause these implicit
>>>> ramblocks to be created.
>>>
>>> AFAIU, QEMU relies on ramblocks to be the same before this series.  Do you
>>> have an example?  Would that already cause issue when migrate?
>>
>> Alignment has changed, and used_length vs max_length changed when
>> resizeable ramblocks were introduced.  I have dealt with these issues
>> while supporting cpr for our internal use, and the learned lesson is to
>> explicitly communicate the creation-time parameters to new qemu.
> 
> Why can used_length change?  I'm looking at ram_mig_ram_block_resized():
> 
>      if (!migration_is_idle()) {
>          /*
>           * Precopy code on the source cannot deal with the size of RAM blocks
>           * changing at random points in time - especially after sending the
>           * RAM block sizes in the migration stream, they must no longer change.
>           * Abort and indicate a proper reason.
>           */
>          error_setg(&err, "RAM block '%s' resized during precopy.", rb->idstr);
>          migration_cancel(err);
>          error_free(err);
>      }
> 
> We sent used_length upfront of a migration during SETUP phase.  Looks like
> what you're describing can be something different, though?

I was imprecise.  used_length did not change; it was introduced as being
different than max_length when resizeable ramblocks were introduced.

The max_length is not sent.  It is an implicit property of the implementation,
and can change.  It is the size of the memfd mapping, so we need to know it
and preserve it.

used_length is indeed sent during SETUP.  We could also send max_length
at that time, and store both in the struct ramblock, and *maybe* that would
be safe, but that is more fragile and less future proof than setting both
properties to the correct value when the ramblock struct is created.

And BTW, the ramblock properties are sent using ad-hoc code in setup.
I send them using nice clean vmstate.
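
As an illustration of the idea only, the creation-time attributes discussed
here can be pictured as a small fixed-layout record that old qemu writes and
new qemu reads before it creates the ramblock.  The sketch below is a plain
byte-buffer stand-in, not QEMU's VMState machinery; struct preserved_block,
its field widths, IDSTR_LEN, and the save/load helpers are all assumptions
made for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define IDSTR_LEN 64                            /* hypothetical name length */
#define PRESERVED_BLOCK_WIRE_SIZE ((size_t)(IDSTR_LEN + 4 + 3 * 8))

/* Hypothetical record of what new qemu must know to re-create a block. */
struct preserved_block {
    char     idstr[IDSTR_LEN];
    int32_t  fd;            /* descriptor kept open for the new process */
    uint64_t used_length;
    uint64_t max_length;    /* size of the memfd mapping */
    uint64_t align;
};

/* Store the low nbytes of v at p, most significant byte first. */
static void put_be(uint8_t *p, uint64_t v, int nbytes)
{
    for (int i = nbytes - 1; i >= 0; i--) {
        p[i] = (uint8_t)(v & 0xff);
        v >>= 8;
    }
}

/* Read an nbytes-wide big-endian integer from p. */
static uint64_t get_be(const uint8_t *p, int nbytes)
{
    uint64_t v = 0;

    for (int i = 0; i < nbytes; i++) {
        v = (v << 8) | p[i];
    }
    return v;
}

/* Serialize pb into buf (at least PRESERVED_BLOCK_WIRE_SIZE bytes). */
size_t preserved_block_save(const struct preserved_block *pb, uint8_t *buf)
{
    uint8_t *p = buf;

    assert(pb->used_length <= pb->max_length);
    memcpy(p, pb->idstr, IDSTR_LEN);        p += IDSTR_LEN;
    put_be(p, (uint32_t)pb->fd, 4);         p += 4;
    put_be(p, pb->used_length, 8);          p += 8;
    put_be(p, pb->max_length, 8);           p += 8;
    put_be(p, pb->align, 8);                p += 8;
    return (size_t)(p - buf);
}

/* Parse a record; returns 0 on success, -1 if truncated or inconsistent. */
int preserved_block_load(struct preserved_block *pb,
                         const uint8_t *buf, size_t len)
{
    const uint8_t *p = buf;

    if (len < PRESERVED_BLOCK_WIRE_SIZE) {
        return -1;
    }
    memcpy(pb->idstr, p, IDSTR_LEN);        p += IDSTR_LEN;
    pb->idstr[IDSTR_LEN - 1] = '\0';
    pb->fd = (int32_t)(uint32_t)get_be(p, 4);   p += 4;
    pb->used_length = get_be(p, 8);         p += 8;
    pb->max_length = get_be(p, 8);          p += 8;
    pb->align = get_be(p, 8);
    return pb->used_length <= pb->max_length ? 0 : -1;
}
```

The point of the exercise is that both lengths and the alignment arrive as
explicit values, so new qemu does not have to re-derive them from defaults
that may have changed between releases.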

> Regarding rb->align: isn't that mostly a constant, reflecting the MR's
> alignment?  It's set when ramblock is created IIUC:
> 
>      rb->align = mr->align;
> 
> When will the alignment change?

The alignment specified by the mr to allocate a new block is an implicit property
of the implementation, and has changed before, from one qemu release to another.
Not often, but it did, and could again in the future.  Communicating the alignment
from old qemu to new qemu is future proof.

>> These are not an issue for migration because the ramblock is re-created
>> and the data copied into the new memory.
>>
>>>>>> Mirror the mr->align field in the RAMBlock to simplify the vmstate.
>>>>>> Preserve the old host address, even though it is immediately discarded,
>>>>>> as it will be needed in the future for CPR with iommufd.  Preserve
>>>>>> guest_memfd, even though CPR does not yet support it, to maintain vmstate
>>>>>> compatibility when it becomes supported.
>>>>>
>>>>> .. It could be about the vfio vaddr update feature that you mentioned and
>>>>> only for iommufd (as IIUC vfio still relies on iova ranges, then it won't
>>>>> help here)?
>>>>>
>>>>> If so, IMHO we should have this patch (or any variance form) to be there
>>>>> for your upcoming vfio support.  Keeping this around like this will make
>>>>> the series harder to review.  Or is it needed even before VFIO?
>>>>
>>>> This patch is needed independently of vfio or iommufd.
>>>>
>>>> guest_memfd is independent of vfio or iommufd.  It is a recent addition
>>>> which I have not tried to support, but I added this placeholder field
>>>> so it can be supported in the future without adding a new field later
>>>> and maintaining backwards compatibility.
>>>
>>> Is guest_memfd the only user so far, then?  If so, would it be possible we
>>> split it as a separate effort on top of the base cpr-exec support?
>>
>> I don't understand the question.  I am indeed deferring support for guest_memfd
>> to a future time.  For now, I am adding a blocker, and reserving a field for
>> it in the preserved ramblock attributes, to avoid adding a subsection later.
> 
> I meant I'm thinking whether the new ramblock vmsd may not be required for
> the initial implementation.
> 
> E.g., IIUC vaddr is required by iommufd, and so far that's not part of the
> initial support.
> 
> Then I think a major thing is about the fds to be managed that will need to
> be shared.  If we put guest_memfd aside, it can be really, mostly, about
> VFIO fds.  

The block->fd must be preserved.  That is the fd of the memfd_create used
by cpr.

> For that, I'm wondering whether you looked into something like
> this:
> 
> commit da3e04b26fd8d15b344944504d5ffa9c5f20b54b
> Author: Zhenzhong Duan <zhenzhong.duan@intel.com>
> Date:   Tue Nov 21 16:44:10 2023 +0800
> 
>      vfio/pci: Make vfio cdev pre-openable by passing a file handle
> 
> I just notice this when I was thinking of a way where it might be possible
> to avoid QEMU vfio-pci open the device at all, then I found we have
> something like that already..
> 
> Then if the mgmt wants, IIUC that fd can be passed down from Libvirt
> cleanly to dest qemu in a no-exec context.  Would this work too, and
> cleaner / reusing existing infrastructures?

That capability as currently defined would not work for cpr.  The fd is
pre-created, but qemu still calls the kernel to configure it.  cpr skips
all kernel configuration calls.

> I think it's nice to always have libvirt managing most, or possibly all,
> fds that qemu uses, then we don't even need scm_rights.  But I didn't look
> deeper into this, just a thought.

One could imagine a solution where the manager extracts internal properties
of vfio, ramblock, etc and passes them as creation time parameters on the
new qemu command line.  And, the manager pre-creates all fd's so they
can be passed to old and new qemu. Lots of code required in qemu and in the
manager, and all implicitly created objects would need to be made explicit.
Yuck. The precreate vmstate approach is much simpler for all.

> When thinking about this, I also wonder how cpr-exec handles the limited
> environments like cgroups and especially seccomps.  I'm not sure what's the
> status of that in most cloud environments, but I think exec() / fork() is
> definitely not always on the seccomp whitelist, and I think that's also
> another reason why we can think about avoiding them.

Exec must be allowed to use cpr-exec mode.  Fork can remain blocked.   Currently
the qemu sandbox option can block 'spawn', which blocks both exec and fork. I have
a patch in my next series that makes this more fine grained, so one or the other
can be blocked. Those unwilling to allow exec can wait for cpr-scm mode :)

- Steve



* Re: [PATCH V1 17/26] machine: memfd-alloc option
  2024-05-29 17:31     ` Steven Sistare via
  2024-05-29 19:14       ` Peter Xu
@ 2024-06-03 10:17       ` Daniel P. Berrangé
  2024-06-03 11:59         ` Steven Sistare via
  1 sibling, 1 reply; 122+ messages in thread
From: Daniel P. Berrangé @ 2024-06-03 10:17 UTC (permalink / raw)
  To: Steven Sistare
  Cc: Peter Xu, qemu-devel, Fabiano Rosas, David Hildenbrand,
	Igor Mammedov, Eduardo Habkost, Marcel Apfelbaum,
	Philippe Mathieu-Daude, Paolo Bonzini, Markus Armbruster

On Wed, May 29, 2024 at 01:31:38PM -0400, Steven Sistare wrote:
> On 5/28/2024 5:12 PM, Peter Xu wrote:
> > On Mon, Apr 29, 2024 at 08:55:26AM -0700, Steve Sistare wrote:
> > > Allocate anonymous memory using memfd_create if the memfd-alloc machine
> > > option is set.
> > > 
> > > Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> > > ---
> > >   hw/core/machine.c   | 22 ++++++++++++++++++++++
> > >   include/hw/boards.h |  1 +
> > >   qemu-options.hx     |  6 ++++++
> > >   system/memory.c     |  9 ++++++---
> > >   system/physmem.c    | 18 +++++++++++++++++-
> > >   system/trace-events |  1 +
> > >   6 files changed, 53 insertions(+), 4 deletions(-)

> > > diff --git a/qemu-options.hx b/qemu-options.hx
> > > index cf61f6b..f0dfda5 100644
> > > --- a/qemu-options.hx
> > > +++ b/qemu-options.hx
> > > @@ -32,6 +32,7 @@ DEF("machine", HAS_ARG, QEMU_OPTION_machine, \
> > >       "                vmport=on|off|auto controls emulation of vmport (default: auto)\n"
> > >       "                dump-guest-core=on|off include guest memory in a core dump (default=on)\n"
> > >       "                mem-merge=on|off controls memory merge support (default: on)\n"
> > > +    "                memfd-alloc=on|off controls allocating anonymous guest RAM using memfd_create (default: off)\n"
> > >       "                aes-key-wrap=on|off controls support for AES key wrapping (default=on)\n"
> > >       "                dea-key-wrap=on|off controls support for DEA key wrapping (default=on)\n"
> > >       "                suppress-vmdesc=on|off disables self-describing migration (default=off)\n"
> > > @@ -79,6 +80,11 @@ SRST
> > >           supported by the host, de-duplicates identical memory pages
> > >           among VMs instances (enabled by default).
> > > +    ``memfd-alloc=on|off``
> > > +        Enables or disables allocation of anonymous guest RAM using
> > > +        memfd_create.  Any associated memory-backend objects are created with
> > > +        share=on.  The memfd-alloc default is off.
> > > +
> > >       ``aes-key-wrap=on|off``
> > >           Enables or disables AES key wrapping support on s390-ccw hosts.
> > >           This feature controls whether AES wrapping keys will be created
> > > diff --git a/system/memory.c b/system/memory.c
> > > index 49f1cb2..ca04a0e 100644
> > > --- a/system/memory.c
> > > +++ b/system/memory.c
> > > @@ -1552,8 +1552,9 @@ bool memory_region_init_ram_nomigrate(MemoryRegion *mr,
> > >                                         uint64_t size,
> > >                                         Error **errp)
> > >   {
> > > +    uint32_t flags = current_machine->memfd_alloc ? RAM_SHARED : 0;
> > 
> > If there's a machine option to "use memfd for allocations", then it's
> > shared mem... Hmm..
> > 
> > It is a bit confusing to me in quite a few levels:
> > 
> >    - Why memory allocation method will be defined by a machine property,
> >      even if we have memory-backend-* which should cover everything?
> 
> Some memory regions are implicitly created, and have no explicit representation
> on the qemu command line.  memfd-alloc affects those.
> 
> More generally, memfd-alloc affects all ramblock allocations that are
> not explicitly represented by memory-backend object.  Thus the simple
> command line "qemu -m 1G" does not explicitly describe an object, so it
> goes through the anonymous allocation path, and is affected by memfd-alloc.
> 
> Internally, create_default_memdev does create a memory-backend object.
> That is what my doc comment above refers to:
>   Any associated memory-backend objects are created with share=on
> 
> An explicit "qemu -object memory-backend-*" is not affected by memfd-alloc.
> 
> The qapi comments in patch "migration: cpr-exec mode" attempt to say all that:
> 
> +#     Memory backend objects must have the share=on attribute, and
> +#     must be mmap'able in the new QEMU process.  For example,
> +#     memory-backend-file is acceptable, but memory-backend-ram is
> +#     not.
> +#
> +#     The VM must be started with the '-machine memfd-alloc=on'
> +#     option.  This causes implicit ram blocks -- those not explicitly
> +#     described by a memory-backend object -- to be allocated by
> +#     mmap'ing a memfd.  Examples include VGA, ROM, and even guest
> +#     RAM when it is specified without a memory-backend object.
> 
> >    - Even if we have such a machine property, why setting "memfd" will
> >      always imply shared?  why not private?  After all it's not called
> >      "memfd-shared-alloc", and we can create private mappings using
> >      e.g. memory-backend-memfd,share=off.
> 
> There is no use case for memfd-alloc with share=off, so no point IMO in
> making the option more verbose.  For cpr, the mapping with all its modifications
> must be visible to new qemu when qemu mmaps it.


So IIUC, cpr doesn't care about the use of 'memfd' as the specific impl,
it only cares that the memory is share=on.

Rather than having a machine type option "memfd-alloc" which is named after
a Linux specific impl detail, how about having a machine type option
"mem-share=on", which just happens to trigger use of memfd internally on
Linux ? That gives us freedom to use non-memfd options if appropriate in
the future.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH V1 17/26] machine: memfd-alloc option
  2024-06-03 10:17       ` Daniel P. Berrangé
@ 2024-06-03 11:59         ` Steven Sistare via
  0 siblings, 0 replies; 122+ messages in thread
From: Steven Sistare via @ 2024-06-03 11:59 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Peter Xu, qemu-devel, Fabiano Rosas, David Hildenbrand,
	Igor Mammedov, Eduardo Habkost, Marcel Apfelbaum,
	Philippe Mathieu-Daude, Paolo Bonzini, Markus Armbruster

On 6/3/2024 6:17 AM, Daniel P. Berrangé wrote:
> On Wed, May 29, 2024 at 01:31:38PM -0400, Steven Sistare wrote:
>> On 5/28/2024 5:12 PM, Peter Xu wrote:
>>> On Mon, Apr 29, 2024 at 08:55:26AM -0700, Steve Sistare wrote:
>>>> Allocate anonymous memory using memfd_create if the memfd-alloc machine
>>>> option is set.
>>>>
>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>> ---
>>>>    hw/core/machine.c   | 22 ++++++++++++++++++++++
>>>>    include/hw/boards.h |  1 +
>>>>    qemu-options.hx     |  6 ++++++
>>>>    system/memory.c     |  9 ++++++---
>>>>    system/physmem.c    | 18 +++++++++++++++++-
>>>>    system/trace-events |  1 +
>>>>    6 files changed, 53 insertions(+), 4 deletions(-)
> 
>>>> diff --git a/qemu-options.hx b/qemu-options.hx
>>>> index cf61f6b..f0dfda5 100644
>>>> --- a/qemu-options.hx
>>>> +++ b/qemu-options.hx
>>>> @@ -32,6 +32,7 @@ DEF("machine", HAS_ARG, QEMU_OPTION_machine, \
>>>>        "                vmport=on|off|auto controls emulation of vmport (default: auto)\n"
>>>>        "                dump-guest-core=on|off include guest memory in a core dump (default=on)\n"
>>>>        "                mem-merge=on|off controls memory merge support (default: on)\n"
>>>> +    "                memfd-alloc=on|off controls allocating anonymous guest RAM using memfd_create (default: off)\n"
>>>>        "                aes-key-wrap=on|off controls support for AES key wrapping (default=on)\n"
>>>>        "                dea-key-wrap=on|off controls support for DEA key wrapping (default=on)\n"
>>>>        "                suppress-vmdesc=on|off disables self-describing migration (default=off)\n"
>>>> @@ -79,6 +80,11 @@ SRST
>>>>            supported by the host, de-duplicates identical memory pages
>>>>            among VMs instances (enabled by default).
>>>> +    ``memfd-alloc=on|off``
>>>> +        Enables or disables allocation of anonymous guest RAM using
>>>> +        memfd_create.  Any associated memory-backend objects are created with
>>>> +        share=on.  The memfd-alloc default is off.
>>>> +
>>>>        ``aes-key-wrap=on|off``
>>>>            Enables or disables AES key wrapping support on s390-ccw hosts.
>>>>            This feature controls whether AES wrapping keys will be created
>>>> diff --git a/system/memory.c b/system/memory.c
>>>> index 49f1cb2..ca04a0e 100644
>>>> --- a/system/memory.c
>>>> +++ b/system/memory.c
>>>> @@ -1552,8 +1552,9 @@ bool memory_region_init_ram_nomigrate(MemoryRegion *mr,
>>>>                                          uint64_t size,
>>>>                                          Error **errp)
>>>>    {
>>>> +    uint32_t flags = current_machine->memfd_alloc ? RAM_SHARED : 0;
>>>
>>> If there's a machine option to "use memfd for allocations", then it's
>>> shared mem... Hmm..
>>>
>>> It is a bit confusing to me in quite a few levels:
>>>
>>>     - Why memory allocation method will be defined by a machine property,
>>>       even if we have memory-backend-* which should cover everything?
>>
>> Some memory regions are implicitly created, and have no explicit representation
>> on the qemu command line.  memfd-alloc affects those.
>>
>> More generally, memfd-alloc affects all ramblock allocations that are
>> not explicitly represented by memory-backend object.  Thus the simple
>> command line "qemu -m 1G" does not explicitly describe an object, so it
>> goes through the anonymous allocation path, and is affected by memfd-alloc.
>>
>> Internally, create_default_memdev does create a memory-backend object.
>> That is what my doc comment above refers to:
>>    Any associated memory-backend objects are created with share=on
>>
>> An explicit "qemu -object memory-backend-*" is not affected by memfd-alloc.
>>
>> The qapi comments in patch "migration: cpr-exec mode" attempt to say all that:
>>
>> +#     Memory backend objects must have the share=on attribute, and
>> +#     must be mmap'able in the new QEMU process.  For example,
>> +#     memory-backend-file is acceptable, but memory-backend-ram is
>> +#     not.
>> +#
>> +#     The VM must be started with the '-machine memfd-alloc=on'
>> +#     option.  This causes implicit ram blocks -- those not explicitly
>> +#     described by a memory-backend object -- to be allocated by
>> +#     mmap'ing a memfd.  Examples include VGA, ROM, and even guest
>> +#     RAM when it is specified without a memory-backend object.
>>
>>>     - Even if we have such a machine property, why setting "memfd" will
>>>       always imply shared?  why not private?  After all it's not called
>>>       "memfd-shared-alloc", and we can create private mappings using
>>>       e.g. memory-backend-memfd,share=off.
>>
>> There is no use case for memfd-alloc with share=off, so no point IMO in
>> making the option more verbose.  For cpr, the mapping with all its modifications
>> must be visible to new qemu when qemu mmaps it.
> 
> 
> So IIUC, cpr doesn't care about the use of 'memfd' as the specific impl,
> it only cares that the memory is share=on.
> 
> Rather than having a machine type option "memfd-alloc" which is named after
> a Linux specific impl detail, how about having a machine type option
> "mem-share=on", which just happens to trigger use of memfd internally on
> Linux ? That gives us freedom to use non-memfd options if appropriate in
> the future.

That would be fine.  Internally we still need a mechanism to preserve the
memory and name it so qemu can mmap it post-exec, but in theory we could
invent some other mechanism to do so, such as creating /dev/shm files with
canonical names.
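
To make the "preserve the memory and mmap it again" idea concrete, here is a
minimal sketch, not the code from this series.  It assumes Linux and glibc's
memfd_create(); region_create, region_map and region_roundtrip are
hypothetical names used only for illustration.  Dropping the first mapping
and mapping the same fd again stands in for what new qemu would do post-exec:
the contents survive, at a new virtual address.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper, not a QEMU function: create a memfd of the given
 * size.  MFD_CLOEXEC is deliberately not set, so the fd stays open
 * across exec.
 */
static int region_create(const char *name, size_t size)
{
    int fd = memfd_create(name, 0);

    if (fd < 0) {
        return -1;
    }
    if (ftruncate(fd, (off_t)size) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Map the region MAP_SHARED, so stores land in the memfd's pages. */
static void *region_map(int fd, size_t size)
{
    void *p;

    assert(size > 0);   /* mmap() rejects zero-length mappings */
    p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    return p == MAP_FAILED ? NULL : p;
}

/*
 * Stand-in for "old qemu writes, new qemu re-maps": the first mapping is
 * dropped and a second one is created from the same fd.  Returns 1 if the
 * data written through the first mapping is visible through the second.
 */
static int region_roundtrip(void)
{
    const size_t size = 4096;
    int fd = region_create("demo-ram", size);
    char *old_map, *new_map;
    int ok;

    if (fd < 0) {
        return 0;
    }
    old_map = region_map(fd, size);
    if (!old_map) {
        close(fd);
        return 0;
    }
    strcpy(old_map, "guest data");
    munmap(old_map, size);

    new_map = region_map(fd, size);
    if (!new_map) {
        close(fd);
        return 0;
    }
    ok = strcmp(new_map, "guest data") == 0;
    munmap(new_map, size);
    close(fd);
    return ok;
}
```

A /dev/shm file with a canonical name would follow the same pattern, with
shm_open() in place of memfd_create() and the name taking over the job of
the inherited fd.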

- Steve


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH V1 17/26] machine: memfd-alloc option
  2024-05-31 19:32             ` Steven Sistare via
@ 2024-06-03 21:48               ` Peter Xu
  2024-06-04  7:13                 ` Daniel P. Berrangé
  0 siblings, 1 reply; 122+ messages in thread
From: Peter Xu @ 2024-06-03 21:48 UTC (permalink / raw)
  To: Steven Sistare
  Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Igor Mammedov,
	Eduardo Habkost, Marcel Apfelbaum, Philippe Mathieu-Daude,
	Paolo Bonzini, Daniel P. Berrange, Markus Armbruster

On Fri, May 31, 2024 at 03:32:05PM -0400, Steven Sistare wrote:
> On 5/30/2024 2:14 PM, Peter Xu wrote:
> > On Thu, May 30, 2024 at 01:11:09PM -0400, Steven Sistare wrote:
> > > On 5/29/2024 3:14 PM, Peter Xu wrote:
> > > > On Wed, May 29, 2024 at 01:31:38PM -0400, Steven Sistare wrote:
> > > > > > > diff --git a/system/memory.c b/system/memory.c
> > > > > > > index 49f1cb2..ca04a0e 100644
> > > > > > > --- a/system/memory.c
> > > > > > > +++ b/system/memory.c
> > > > > > > @@ -1552,8 +1552,9 @@ bool memory_region_init_ram_nomigrate(MemoryRegion *mr,
> > > > > > >                                           uint64_t size,
> > > > > > >                                           Error **errp)
> > > > > > >     {
> > > > > > > +    uint32_t flags = current_machine->memfd_alloc ? RAM_SHARED : 0;
> > > > > > 
> > > > > > If there's a machine option to "use memfd for allocations", then it's
> > > > > > shared mem... Hmm..
> > > > > > 
> > > > > > It is a bit confusing to me in quite a few levels:
> > > > > > 
> > > > > >      - Why memory allocation method will be defined by a machine property,
> > > > > >        even if we have memory-backend-* which should cover everything?
> > > > > 
> > > > > Some memory regions are implicitly created, and have no explicit representation
> > > > > on the qemu command line.  memfd-alloc affects those.
> > > > > 
> > > > > More generally, memfd-alloc affects all ramblock allocations that are
> > > > > not explicitly represented by memory-backend object.  Thus the simple
> > > > > command line "qemu -m 1G" does not explicitly describe an object, so it
> > > > > goes through the anonymous allocation path, and is affected by memfd-alloc.
> > > > 
> > > > Can we simply now allow "qemu -m 1G" to work for cpr-exec?
> > > 
> > > I assume you meant "simply not allow".
> > > 
> > > Yes, I could do that, but I would need to explicitly add code to exclude this
> > > case, and add a blocker.  Right now it "just works" for all paths that lead to
> > > ram_block_alloc_host, without any special logic at the memory-backend level.
> > > And, I'm not convinced that simplifies the docs, as now I would need to tell
> > > the user that "-m 1G" and similar constructions do not work with cpr.
> > > 
> > > I can try to clarify the doc for -memfd-alloc as currently defined.
> > 
> > Why do we need to keep cpr working for existing qemu cmdlines?  We'll
> > already need to add more new cmdline options already anyway, right?
> > 
> > cpr-reboot wasn't doing it, and that made sense to me, so that new features
> > will require the user to opt-in for it, starting with changing its
> > cmdlines.
> 
> I agree.  We need a new option to opt-in to cpr-friendly memory allocation, and I
> am proposing -machine memfd-alloc. I am simply saying that I can try to do a better
> job explaining the functionality in my proposed text for memfd-alloc, instead of
> changing the functionality to exclude "-m 1G".  I believe excluding "-m 1G" is the
> wrong approach, for the reasons I stated - messier implementation *and* documentation.
> 
> I am open to different syntax for opting in.

If the machine property is the only way to go then I agree that might be a
good idea, even though the name can be further discussed.  But I still want
to figure out something below.

> 
> > > > AFAIU that's
> > > > what we do with cpr-reboot: we ask the user to specify the right things to
> > > > make other thing work.  Otherwise it won't.
> > > > 
> > > > > 
> > > > > Internally, create_default_memdev does create a memory-backend object.
> > > > > That is what my doc comment above refers to:
> > > > >     Any associated memory-backend objects are created with share=on
> > > > > 
> > > > > An explicit "qemu -object memory-backend-*" is not affected by memfd-alloc.
> > > > > 
> > > > > The qapi comments in patch "migration: cpr-exec mode" attempt to say all that:
> > > > > 
> > > > > +#     Memory backend objects must have the share=on attribute, and
> > > > > +#     must be mmap'able in the new QEMU process.  For example,
> > > > > +#     memory-backend-file is acceptable, but memory-backend-ram is
> > > > > +#     not.
> > > > > +#
> > > > > +#     The VM must be started with the '-machine memfd-alloc=on'
> > > > > +#     option.  This causes implicit ram blocks -- those not explicitly
> > > > > +#     described by a memory-backend object -- to be allocated by
> > > > > +#     mmap'ing a memfd.  Examples include VGA, ROM, and even guest
> > > > > +#     RAM when it is specified without a memory-backend object.
> > > > 
> > > > VGA is IIRC 16MB chunk, ROM is even smaller.  If the user specifies -object
> > > > memory-backend-file,share=on properly, these should be the only outliers?
> > > > 
> > > > Are these important enough for the downtime?  Can we put them into the
> > > > migrated image alongside with the rest device states?
> > > 
> > > It's not about downtime.  vfio, vdpa, and iommufd pin all guest pages.
> > > The pages must remain pinned during CPR to support ongoing DMA activity
> > > which could target those pages (which we do not quiesce), and the same
> > > physical pages must be used for the ramblocks in the new qemu process.
> > 
> > Ah ok, yes DMA can happen on the fly.
> > 
> > Guest mem is definitely the major DMA target and that can be covered by
> > -object memory-backend-*,share=on cmdlines.
> > 
> > ROM is definitely not a DMA target.  So is VGA ram a target for, perhaps,
> > an assigned vGPU device?  Do we have a list of things that will need that?
> > Can we make them work somehow by sharing them like guest mem?
> 
> The pass-through devices map and pin all memory accessible to the guest.
> We cannot make exceptions based on our intuition of how the memory will
> and will not be used.

True, but my whole point is to see whether we can get rid of the machine
property completely, partly because I feel we are so close to getting rid
of it.

QEMU memory layout can be complicated across all the platforms, but I doubt
whether that's true for CPR's purpose and archs that it plans to support.
A generic VM's memory topology shouldn't be that complicated at all,
e.g. on x86:

              Block Name    PSize              Offset               Used              Total                HVA  RO
                  pc.ram    4 KiB  0x0000000000000000 0x0000000008000000 0x0000000008000000 0x00007f684fe00000  rw
   0000:00:02.0/vga.vram    4 KiB  0x0000000008080000 0x0000000001000000 0x0000000001000000 0x00007f684e800000  rw
    /rom@etc/acpi/tables    4 KiB  0x0000000009100000 0x0000000000020000 0x0000000000200000 0x00007f684dc00000  ro
                 pc.bios    4 KiB  0x0000000008000000 0x0000000000040000 0x0000000000040000 0x00007f68dc200000  ro
  0000:00:03.0/e1000.rom    4 KiB  0x00000000090c0000 0x0000000000040000 0x0000000000040000 0x00007f684e000000  ro
                  pc.rom    4 KiB  0x0000000008040000 0x0000000000020000 0x0000000000020000 0x00007f684fa00000  ro
    0000:00:02.0/vga.rom    4 KiB  0x0000000009080000 0x0000000000010000 0x0000000000010000 0x00007f684e400000  ro
   /rom@etc/table-loader    4 KiB  0x0000000009300000 0x0000000000001000 0x0000000000010000 0x00007f684d800000  ro
      /rom@etc/acpi/rsdp    4 KiB  0x0000000009340000 0x0000000000001000 0x0000000000001000 0x00007f684d400000  ro

It's simply the major ram, ROMs, and VGA.

I had hoped cpr could work without the property; I'll mention one more
reason below.

So if we're going to have this machine property, can I still request a
possible list of things that can be the target of this property besides
guest mem?  I just want to know where it can be used "for real".  I know it
might be complicated, but maybe not.  I really can only think of VGA, and I
doubt whether that should be DMAed at all.
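
As a rough way to produce such a list, something like the sketch below could
walk the ram blocks and report the ones a new process could not mmap again.
This is only an illustration under stated assumptions: struct blk_info is a
hypothetical summary, not QEMU's RAMBlock, and the example table assumes
pc.ram comes from an explicit share=on backend (the fd value 10 is made up)
while the other blocks are implicit anonymous allocations.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical summary of a ram block; this is not QEMU's RAMBlock. */
struct blk_info {
    const char *name;
    bool shared;    /* mapped MAP_SHARED */
    int fd;         /* backing fd, or -1 for plain anonymous memory */
};

/* A new process can only map the same pages again via a shared fd. */
static bool blk_preservable(const struct blk_info *b)
{
    return b->shared && b->fd >= 0;
}

/* Print and count the blocks that would need a blocker. */
static size_t count_unpreservable(const struct blk_info *blks, size_t n)
{
    size_t i, bad = 0;

    for (i = 0; i < n; i++) {
        if (!blk_preservable(&blks[i])) {
            printf("not preservable: %s\n", blks[i].name);
            bad++;
        }
    }
    return bad;
}

/* Assumed example data, using block names from the table above. */
static const struct blk_info example_blks[] = {
    { "pc.ram",                true,  10 },
    { "0000:00:02.0/vga.vram", false, -1 },
    { "pc.bios",               false, -1 },
};
```

With the example data, only vga.vram and pc.bios are reported, i.e. exactly
the kind of implicit blocks in question.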

> 
> Also, we cannot simply abandon the old pinned ramblocks, owned by an mm_struct
> that will become a zombie.  We would actually need to write additional code
> to call device ioctls to unmap the oddball ramblocks.  It is far cleaner
> and more correct to preserve them all.
> 
> > It'll be a complete tragedy if we introduced this whole thing only because
> > of some minority.  I want to understand whether there's any generic way to
> > solve this problem rather than this magical machine property.  IMHO it's
> > very not trivial to maintain.
> 
> The machine property is the generic way.
> 
> A single opt-in option to call memfd_create() is an elegant and effective solution.
> The code is small and not hard to maintain.  This is the option patch.  Most of it
> is the boiler plate that any option has, and the single code location that formerly
> called qemu_anon_ram_alloc now optionally calls qemu_memfd_create:
> 
>   machine: memfd-alloc option             53 insertions(+), 4 deletions(-)
> 
> These patches are simply stylistic and modularity improvements for ramblock,
> valuable in their own right, which allows the previous patch to be small and clean.
> 
>   physmem: ram_block_create               29 insertions(+), 21 deletions(-)
>   physmem: hoist guest_memfd creation     48 insertions(+), 37 deletions(-)
>   physmem: hoist host memory allocation   36 insertions(+), 44 deletions(-)
>   physmem: set ram block idstr earlier    25 insertions(+), 28 deletions(-)

Let me explain the other reason why I don't want to have that machine
property..

That property, regardless of what it is called (and I doubt whether Dan's
suggestion of "mem-share" is good, e.g. mmap(MAP_SHARED) doesn't have a
user-visible fd but it's shared RAM for sure..), is yet another way to specify
guest mem types.

What if the user specified this property but specified something else in
the -object parameters?  E.g. -machine mem-share=on -object
memory-backend-ram,share=off.  What should we do?

Fundamentally that's also why I "hope" we can avoid adding yet one more
place to configure guest mem, and stick with -object.

> 
> > > > > >      - Even if we have such a machine property, why setting "memfd" will
> > > > > >        always imply shared?  why not private?  After all it's not called
> > > > > >        "memfd-shared-alloc", and we can create private mappings using
> > > > > >        e.g. memory-backend-memfd,share=off.
> > > > > 
> > > > > There is no use case for memfd-alloc with share=off, so no point IMO in
> > > > > making the option more verbose.
> > > > 
> > > > Unfortunately this fact doesn't make the property easier to understand. :-(
> > > > > For cpr, the mapping with all its modifications must be visible to new
> > > > > qemu when qemu mmaps it.
> > > > 
> > > > So this might be the important part - do you mean migrating
> > > > VGA/ROM/... small ramblocks won't work (besides any performance concerns)?
> > > > Could you elaborate?
> > > 
> > > Pinning.
> > > 
> > > > Cpr-reboot already introduced lots of tricky knobs to QEMU.  We may need to
> > > > restrict that specialty to minimal, making the interfacing as clear as
> > > > possible, or (at least migration) maintainers will start to be soon scared
> > > > and running away, if such proposal was not shot down.
> > > > 
> > > > In short, I hope when we introduce new knobs for cpr, we shouldn't always
> > > > keep cpr-* modes in mind, but consider whenever the user can use it without
> > > > cpr-*.  I'm not sure whether it'll be always possible, but we should try.
> > > 
> > > I agree in principle.  FWIW, I have tried to generalize the functionality needed
> > > by cpr so it can be used in other ways: per-mode blockers, per-mode notifiers,
> > > precreate vmstate, factory objects; to base it on migration internals with
> > > minimal change (vmstate); and to make minimal changes in the migration control
> > > paths.
> > 
> > Thanks.
> > 
> > For this one I think reusing -object interface (hopefully without
> > introducing a knob) would be a great step if that can fully describe what
> > cpr-exec is looking for.  E.g., when cpr-exec mode enabled it can sanity
> > check the memory backends making sure all things satisfy its need, and fail
> > migration otherwise upfront.
> 
> For '-object memory-backend-*', I can tell whether cpr is allowed or not
> without additional knobs.  See the blocker patches for examples where cpr
> is blocked.
> 
> The problem is the implicit ramblocks that currently call qemu_ram_alloc_internal.

I hope we never end up in a situation where we introduced this property
because "we don't know", and later found that no real RAM regions use it
when -object is specified properly.  So I hope we at least know what we're
doing, and for exactly what reason..

Let me know if you think I'm asking too much (which is possible.. :).  But I
hope the real answer is interesting to you too.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH V1 19/26] physmem: preserve ram blocks for cpr
  2024-05-31 19:32             ` Steven Sistare via
@ 2024-06-03 22:29               ` Peter Xu
  0 siblings, 0 replies; 122+ messages in thread
From: Peter Xu @ 2024-06-03 22:29 UTC (permalink / raw)
  To: Steven Sistare
  Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Igor Mammedov,
	Eduardo Habkost, Marcel Apfelbaum, Philippe Mathieu-Daude,
	Paolo Bonzini, Daniel P. Berrange, Markus Armbruster

On Fri, May 31, 2024 at 03:32:11PM -0400, Steven Sistare wrote:
> On 5/30/2024 2:39 PM, Peter Xu wrote:
> > On Thu, May 30, 2024 at 01:12:40PM -0400, Steven Sistare wrote:
> > > On 5/29/2024 3:25 PM, Peter Xu wrote:
> > > > On Wed, May 29, 2024 at 01:31:53PM -0400, Steven Sistare wrote:
> > > > > On 5/28/2024 5:44 PM, Peter Xu wrote:
> > > > > > On Mon, Apr 29, 2024 at 08:55:28AM -0700, Steve Sistare wrote:
> > > > > > > Preserve fields of RAMBlocks that allocate their host memory during CPR so
> > > > > > > the RAM allocation can be recovered.
> > > > > > 
> > > > > > This sentence itself did not explain much, IMHO.  QEMU can share memory
> > > > > > using fd based memory already of all kinds, as long as the memory backend
> > > > > > is path-based it can be shared by sharing the same paths to dst.
> > > > > > 
> > > > > > This reads very confusing as a generic concept.  I mean, QEMU migration
> > > > > > relies on so many things to work right.  We mostly asks the users to "use
> > > > > > exactly the same cmdline for src/dst QEMU unless you know what you're
> > > > > > doing", otherwise many things can break.  That should also include ramblock
> > > > > > being matched between src/dst due to the same cmdlines provided on both
> > > > > > sides.  It'll be confusing to mention this when we thought the ramblocks
> > > > > > also rely on that fact.
> > > > > > 
> > > > > > So IIUC this sentence should be dropped in the real patch, and I'll try to
> > > > > > guess the real reason with below..
> > > > > 
> > > > > The properties of the implicitly created ramblocks must be preserved.
> > > > > The defaults can and do change between qemu releases, even when the command-line
> > > > > parameters do not change for the explicit objects that cause these implicit
> > > > > ramblocks to be created.
> > > > 
> > > > AFAIU, QEMU relies on ramblocks to be the same before this series.  Do you
> > > > have an example?  Would that already cause issue when migrate?
> > > 
> > > Alignment has changed, and used_length vs max_length changed when
> > > resizeable ramblocks were introduced.  I have dealt with these issues
> > > while supporting cpr for our internal use, and the learned lesson is to
> > > explicitly communicate the creation-time parameters to new qemu.
> > 
> > Why used_length can change?  I'm looking at ram_mig_ram_block_resized():
> > 
> >      if (!migration_is_idle()) {
> >          /*
> >           * Precopy code on the source cannot deal with the size of RAM blocks
> >           * changing at random points in time - especially after sending the
> >           * RAM block sizes in the migration stream, they must no longer change.
> >           * Abort and indicate a proper reason.
> >           */
> >          error_setg(&err, "RAM block '%s' resized during precopy.", rb->idstr);
> >          migration_cancel(err);
> >          error_free(err);
> >      }
> > 
> > We sent used_length upfront of a migration during SETUP phase.  Looks like
> > what you're describing can be something different, though?
> 
> I was imprecise.  used_length did not change; it was introduced as being
> different than max_length when resizeable ramblocks were introduced.
> 
> The max_length is not sent.  It is an implicit property of the implementation,
> and can change.  It is the size of the memfd mapping, so we need to know it
> and preserve it.
> 
> used_length is indeed sent during SETUP.  We could also send max_length
> at that time, and store both in the struct ramblock, and *maybe* that would
> be safe, but that is more fragile and less future proof than setting both
> properties to the correct value when the ramblock struct is created.
> 
> And BTW, the ramblock properties are sent using ad-hoc code in setup.
> I send them using nice clean vmstate.

Right, I agree that's not pretty at all... I wish we had something
better, but that code has been there for years.

When you said max_length can change, could you give an example?  I want to
know whether it means we have bug already, and bug fixing can even be done
before the rest.

Thinking about it now, maybe max_length is indeed fine to change across
migration?

Consider the fact that only used_length is used on both src/dst for
e.g. migration, dirty tracking, etc. purposes.  Basically we assumed that's
the "real size" of RAM regardless of "how large it used to be before
migration", or "how large it can grow after migration completes", while
max_length is the "possible max value" here but isn't really important for
migration.

E.g., mem resize can allow a larger range after migration if the user
specifies max_length on dest to be larger than src max_length somehow, and
logically migration should still work indeed.  I just don't know whether
there'll be people using it like that.
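
To spell out the two views of the size fields, here is a small sketch.  It is
an illustration only: struct blk_size and both predicates are hypothetical,
not QEMU code.  The first check is the argument above (only used_length has
to match for a copying migration); the second is the stricter requirement
described earlier in the thread, where max_length is the size of the memfd
mapping and therefore has to be preserved as well.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical pair of size fields; not QEMU's RAMBlock. */
struct blk_size {
    uint64_t used_length;   /* the "real size", sent during SETUP */
    uint64_t max_length;    /* how large the block may grow */
};

/*
 * Copying migration: the data is copied into a freshly created block, so
 * only used_length has to match, and dst must have room for it.
 */
static bool sizes_ok_for_migration(const struct blk_size *src,
                                   const struct blk_size *dst)
{
    return src->used_length == dst->used_length &&
           dst->max_length >= dst->used_length;
}

/*
 * In-place reuse of an existing mapping: the mapping was sized by
 * max_length, so both fields have to match.
 */
static bool sizes_ok_for_remap(const struct blk_size *src,
                               const struct blk_size *dst)
{
    return src->used_length == dst->used_length &&
           src->max_length == dst->max_length;
}

/* Assumed example: dst allows the block to grow further than src did. */
static const struct blk_size ex_src = { 128u << 20, 128u << 20 };
static const struct blk_size ex_dst = { 128u << 20, 256u << 20 };
```

With the example values the migration check passes while the remap check
fails, which is the difference between the two positions.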

> 
> > Regarding to rb->align: isn't that mostly a constant, reflecting the MR's
> > alignment?  It's set when ramblock is created IIUC:
> > 
> >      rb->align = mr->align;
> > 
> > When will the alignment change?
> 
> The alignment specified by the mr to allocate a new block is an implicit property
> of the implementation, and has changed before, from one qemu release to another.
> Not often, but it did, and could again in the future.  Communicating the alignment
> from old qemu to new qemu is future proof.

Same for this one; do you have examples you can share?

I hope we don't introduce things without good reasons.  If we're talking
about "alignment can change", it'll be very helpful to know what we're
fixing against (before CPR's need).

> 
> > > These are not an issue for migration because the ramblock is re-created
> > > and the data copied into the new memory.
> > > 
> > > > > > > Mirror the mr->align field in the RAMBlock to simplify the vmstate.
> > > > > > > Preserve the old host address, even though it is immediately discarded,
> > > > > > > as it will be needed in the future for CPR with iommufd.  Preserve
> > > > > > > guest_memfd, even though CPR does not yet support it, to maintain vmstate
> > > > > > > compatibility when it becomes supported.
> > > > > > 
> > > > > > .. It could be about the vfio vaddr update feature that you mentioned and
> > > > > > only for iommufd (as IIUC vfio still relies on iova ranges, then it won't
> > > > > > help here)?
> > > > > > 
> > > > > > If so, IMHO we should have this patch (or any variance form) to be there
> > > > > > for your upcoming vfio support.  Keeping this around like this will make
> > > > > > the series harder to review.  Or is it needed even before VFIO?
> > > > > 
> > > > > This patch is needed independently of vfio or iommufd.
> > > > > 
> > > > > guest_memfd is independent of vfio or iommufd.  It is a recent addition
> > > > > which I have not tried to support, but I added this placeholder field
> > > > > so it can be supported in the future without adding a new field later
> > > > > and maintaining backwards compatibility.
> > > > 
> > > > Is guest_memfd the only user so far, then?  If so, would it be possible we
> > > > split it as a separate effort on top of the base cpr-exec support?
> > > 
> > > I don't understand the question.  I am indeed deferring support for guest_memfd
> > > to a future time.  For now, I am adding a blocker, and reserving a field for
> > > it in the preserved ramblock attributes, to avoid adding a subsection later.
> > 
> > I meant I'm thinking whether the new ramblock vmsd may not be required for
> > the initial implementation.
> > 
> > E.g., IIUC vaddr is required by iommufd, and so far that's not part of the
> > initial support.
> > 
> > Then I think a major thing is about the fds to be managed that will need to
> > be shared.  If we put guest_memfd aside, it can be really, mostly, about
> > VFIO fds.
> 
> The block->fd must be preserved.  That is the fd of the memfd_create used
> by cpr.

Right, cpr needs all fds to be passed over and I think that's a great idea.

It could be a matter of how we mark those fds, how we pass them over,
and whether we need to manage them one by one, or in a batch.

E.g., in my mind now I'm picturing something, I probably shared it bit by
bit in my previous replies when trying to review your series, but in
general, a cleaner approach may look like this:

  - QEMU provides a fd-manager, managing all relevant fds.  It can be
    ramblock fds, vfio fds, vhost fds, or whatever fds.  We "name" these
    fds in some way, so that we know how to recover them on the other side.
    We don't differentiate them with different vmsds: no need to migrate a
    fd in ramblock vmsd, then a fd in vfio vmsd, then a fd in vhost vmsd.
    We migrate them all, then modules can try to fetch them on dest qemu,
    perhaps transparently (like qemu_open_internal() on /dev/fdsets), maybe
    not.  I haven't thought about those details.

  - FDs need to be passed over _before_ VM starts.  It might be easier to
    not attach that to a "pre" phase of "migration", but it might be doable
    this way: when cpr-xxx mode is supported, Libvirt can use a new
    QMP command to fetch all the FDs in one shot using scm rights (e.g.,
    "fd-manager-fetch"), then apply that list of fds _before_ dest QEMU
    tries to initialize, using another QMP command (e.g.,
    "fd-manager-apply").  QEMU src/dst don't talk at all about the FDs; they
    rely on Libvirt to set them up.  This will greatly simplify migration
    code for fd passovers, either using execve() or scm rights.

In this picture, neither execve() nor a new migration protocol change is
needed.  The migration stream stays just like a normal migration stream.
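
For reference, the scm rights part of that picture is just the standard
SCM_RIGHTS mechanism.  The sketch below sends one "named" fd over a Unix
socket and recovers it on the other end.  It is a minimal illustration, not
a proposed implementation: send_named_fd, recv_named_fd and fd_pass_demo are
hypothetical helpers, the name "ramblock/pc.ram" is an invented example of a
naming scheme, and a pipe stands in for a real ramblock or vfio fd.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer sized and aligned for exactly one fd. */
union fd_cmsg {
    char buf[CMSG_SPACE(sizeof(int))];
    struct cmsghdr align;
};

/* Send one fd, tagged with a NUL-terminated name, using SCM_RIGHTS. */
static int send_named_fd(int sock, const char *name, int fd)
{
    struct iovec iov = { .iov_base = (void *)name,
                         .iov_len = strlen(name) + 1 };
    union fd_cmsg u;
    struct msghdr msg = { 0 };
    struct cmsghdr *cmsg;

    memset(&u, 0, sizeof(u));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
}

/* Receive one named fd; returns the new fd, or -1 on failure. */
static int recv_named_fd(int sock, char *name, size_t name_len)
{
    struct iovec iov = { .iov_base = name, .iov_len = name_len };
    union fd_cmsg u;
    struct msghdr msg = { 0 };
    struct cmsghdr *cmsg;
    int fd = -1;

    memset(&u, 0, sizeof(u));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    if (name_len == 0 || recvmsg(sock, &msg, 0) <= 0) {
        return -1;
    }
    name[name_len - 1] = '\0';      /* guard against a truncated name */

    cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg && cmsg->cmsg_level == SOL_SOCKET &&
        cmsg->cmsg_type == SCM_RIGHTS &&
        cmsg->cmsg_len == CMSG_LEN(sizeof(int))) {
        memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    }
    return fd;
}

/* Pass the write end of a pipe across a socketpair and use the copy. */
static int fd_pass_demo(void)
{
    int sv[2], pfd[2], got, ok;
    char name[32], byte = 0;

    if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) < 0) {
        return 0;
    }
    if (pipe(pfd) < 0) {
        close(sv[0]);
        close(sv[1]);
        return 0;
    }
    ok = send_named_fd(sv[0], "ramblock/pc.ram", pfd[1]) == 0;
    got = ok ? recv_named_fd(sv[1], name, sizeof(name)) : -1;
    ok = got >= 0 && strcmp(name, "ramblock/pc.ram") == 0 &&
         write(got, "x", 1) == 1 && read(pfd[0], &byte, 1) == 1 &&
         byte == 'x';
    if (got >= 0) {
        close(got);
    }
    close(pfd[0]);
    close(pfd[1]);
    close(sv[0]);
    close(sv[1]);
    return ok;
}
```

The receiver gets a new fd number that refers to the same open file, so the
name carried in the message is what lets the destination match the fd to its
owner.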

> 
> > For that, I'm wondering whether you looked into something like
> > this:
> > 
> > commit da3e04b26fd8d15b344944504d5ffa9c5f20b54b
> > Author: Zhenzhong Duan <zhenzhong.duan@intel.com>
> > Date:   Tue Nov 21 16:44:10 2023 +0800
> > 
> >      vfio/pci: Make vfio cdev pre-openable by passing a file handle
> > 
> > I just notice this when I was thinking of a way where it might be possible
> > to avoid QEMU vfio-pci open the device at all, then I found we have
> > something like that already..
> > 
> > Then if the mgmt wants, IIUC that fd can be passed down from Libvirt
> > cleanly to dest qemu in a no-exec context.  Would this work too, and
> > cleaner / reusing existing infrastructures?
> 
> That capability as currently defined would not work for cpr.  The fd is
> pre-created, but qemu still calls the kernel to configure it.  cpr skips
> all kernel configuration calls.

It's just an idea.  I didn't look into the details of it, but I suppose
from this part it might be similar to what cpr-exec would need when using a
new fd-manager or similar approach.  Basically we allow fds to be passed
over too, not from original qemu using exec() but from libvirt.  Would that
work for us?

> 
> > I think it's nice to always have libvirt managing most, or possible, all
> > fds that qemu uses, then we don't even need scm_rights.  But I didn't look
> > deeper into this, just a thought.
> 
> One could imagine a solution where the manager extracts internal properties
> of vfio, ramblock, etc and passes them as creation time parameters on the
> new qemu command line.  And, the manager pre-creates all fd's so they
> can be passed to old and new qemu. Lots of code required in qemu and in the
> manager, and all implicitly created objects would need to be made explicit.
> Yuck. The precreate vmstate approach is much simpler for all.

So please correct me here if I misunderstood, but isn't this a shared
problem with/without precreate vmsd?

IIUC we always need a way to pass over the fds in this case, either by
exec() or scm rights or other approaches.  It looks to me that here
precreate is only the transport to deliver those fds, or am I wrong?

> 
> > When thinking about this, I also wonder how cpr-exec handles the limited
> > environments like cgroups and especially seccomps.  I'm not sure what's the
> > status of that in most cloud environments, but I think exec() / fork() is
> > definitely not always on the seccomp whitelist, and I think that's also
> > another reason why we can think about avoid using them.
> 
> Exec must be allowed to use cpr-exec mode.  Fork can remain blocked.   Currently
> the qemu sandbox option can block 'spawn', which blocks both exec and fork. I have
> a patch in my next series that makes this more fine grained, so one or the other
> can be blocked. Those unwilling to allow exec can wait for cpr-scm mode :)

The question is how cpr-scm will differ from cpr-exec, and whether
we'd like them both!  As a maintainer, I definitely want to maintain as
"little" as possible.. :-(

If they play a similar role, I suggest we stick with one for sure and discuss
the design.  If cpr-exec will be accepted, I hope it's because we decided
to give up seccomp, rather than waiting for cpr-scm. :)

Thanks,

-- 
Peter Xu




* Re: [PATCH V1 17/26] machine: memfd-alloc option
  2024-06-03 21:48               ` Peter Xu
@ 2024-06-04  7:13                 ` Daniel P. Berrangé
  2024-06-04 15:58                   ` Peter Xu
  0 siblings, 1 reply; 122+ messages in thread
From: Daniel P. Berrangé @ 2024-06-04  7:13 UTC (permalink / raw)
  To: Peter Xu
  Cc: Steven Sistare, qemu-devel, Fabiano Rosas, David Hildenbrand,
	Igor Mammedov, Eduardo Habkost, Marcel Apfelbaum,
	Philippe Mathieu-Daude, Paolo Bonzini, Markus Armbruster

On Mon, Jun 03, 2024 at 05:48:32PM -0400, Peter Xu wrote:
> That property, irrelevant of what it is called (and I doubt whether Dan's
> suggestion on "shared-ram" is good, e.g. mmap(MAP_SHARED) doesn't have user
> visible fd but it's shared-ram for sure..), is yet another way to specify
> guest mem types.
> 
> What if the user specified this property but specified something else in
> the -object parameters?  E.g. -machine share-ram=on -object
> memory-backend-ram,share=off.  What should we do?

The machine property would only apply to memory regions that are
*NOT* being created via -object. The memory-backend objects would
always honour their own share setting.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|




* Re: [PATCH V1 17/26] machine: memfd-alloc option
  2024-06-04  7:13                 ` Daniel P. Berrangé
@ 2024-06-04 15:58                   ` Peter Xu
  2024-06-04 16:14                     ` David Hildenbrand
  0 siblings, 1 reply; 122+ messages in thread
From: Peter Xu @ 2024-06-04 15:58 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Steven Sistare, qemu-devel, Fabiano Rosas, David Hildenbrand,
	Igor Mammedov, Eduardo Habkost, Marcel Apfelbaum,
	Philippe Mathieu-Daude, Paolo Bonzini, Markus Armbruster

On Tue, Jun 04, 2024 at 08:13:26AM +0100, Daniel P. Berrangé wrote:
> On Mon, Jun 03, 2024 at 05:48:32PM -0400, Peter Xu wrote:
> > That property, irrelevant of what it is called (and I doubt whether Dan's
> > suggestion on "shared-ram" is good, e.g. mmap(MAP_SHARED) doesn't have user
> > visible fd but it's shared-ram for sure..), is yet another way to specify
> > guest mem types.
> > 
> > What if the user specified this property but specified something else in
> > the -object parameters?  E.g. -machine share-ram=on -object
> > memory-backend-ram,share=off.  What should we do?
> 
> The machine property would only apply to memory regions that are
> *NOT* being created via -object. The memory-backend objects would
> always honour their own share setting.

In that case we may want to rename that to share-ram-by-default=on.
Otherwise it's not clear which one would take effect from a user POV, even
if we can define it like that in the code.

Even with share-ram-by-default=on, it can still be confusing in some form
or another. Consider this cmdline:

  -machine q35,share-ram-by-default=on -object memory-backend-ram,id=mem1

Then is mem1 shared or not?  From reading the cmdline, with
share-ram-by-default=on it should be ON since we don't specify it, but it's
actually off, because -object has its own default values.

IMHO fundamentally it's just controversial to have two ways to configure
guest memory.  If '-object' is the preferred and complete way to configure
it, I prefer sticking with it if possible and see what is missing.

I think I raised that as the other major reason too, that I think it's so
far only about the vram that is out of the picture here.  We don't and
shouldn't have complicated RW RAMs floating around that we want this
property to cover.  We should make sure we introduce this property with
some good reason, rather than "ok we don't know what happened, we don't
know what will leverage this, but maybe there is some floating RAMs..".
Once we introduce this property we need to carry it forever, and be
prepared for people to start using it with/without cpr, and that's some
maintenance cost that I want to see whether we can avoid.

Thanks,

-- 
Peter Xu




* Re: [PATCH V1 17/26] machine: memfd-alloc option
  2024-06-04 15:58                   ` Peter Xu
@ 2024-06-04 16:14                     ` David Hildenbrand
  2024-06-04 16:41                       ` Peter Xu
  0 siblings, 1 reply; 122+ messages in thread
From: David Hildenbrand @ 2024-06-04 16:14 UTC (permalink / raw)
  To: Peter Xu, Daniel P. Berrangé
  Cc: Steven Sistare, qemu-devel, Fabiano Rosas, Igor Mammedov,
	Eduardo Habkost, Marcel Apfelbaum, Philippe Mathieu-Daude,
	Paolo Bonzini, Markus Armbruster

On 04.06.24 17:58, Peter Xu wrote:
> On Tue, Jun 04, 2024 at 08:13:26AM +0100, Daniel P. Berrangé wrote:
>> On Mon, Jun 03, 2024 at 05:48:32PM -0400, Peter Xu wrote:
>>> That property, irrelevant of what it is called (and I doubt whether Dan's
>>> suggestion on "shared-ram" is good, e.g. mmap(MAP_SHARED) doesn't have user
>>> visible fd but it's shared-ram for sure..), is yet another way to specify
>>> guest mem types.
>>>
>>> What if the user specified this property but specified something else in
>>> the -object parameters?  E.g. -machine share-ram=on -object
>>> memory-backend-ram,share=off.  What should we do?
>>
>> The machine property would only apply to memory regions that are
>> *NOT* being created via -object. The memory-backend objects would
>> always honour their own share setting.
> 
> In that case we may want to rename that to share-ram-by-default=on.
> Otherwise it's not clear which one would take effect from an user POV, even
> if we can define it like that in the code.
> 
> Even with share-ram-by-default=on, it can be still confusing in some form
> or another. Consider this cmdline:
> 
>    -machine q35,share-ram-by-default=on -object memory-backend-ram,id=mem1
> 
> Then is mem1 shared or not?  From reading the cmdline, if share ram by
> default it should be ON if we don't specify it, but it's actually off?
> It's because -object has its own default values.

We do have something similar with "merge" and "dump" properties. See 
machine_mem_merge() / machine_dump_guest_core().

These correspond to the "mem-merge" and "dump-guest-core" machine 
properties.

But ...

> 
> IMHO fundamentally it's just controversial to have two ways to configure
> guest memory.  If '-object' is the preferred and complete way to configure
> it, I prefer sticking with it if possible and see what is missing.

... I agree with that. With vhost-user we also require a reasonable 
configuration (using proper fd-based shared memory) for it to work.

> 
> I think I raised that as the other major reason too, that I think it's so
> far only about the vram that is out of the picture here.  We don't and
> shouldn't have complicated RW RAMs floating around that we want this
> property to cover.

Agreed. And maybe we can still keep migration of any MAP_PRIVATE thing 
working by migrating that memory? CPR will be "slightly less fast".

But the biggest piece -- guest RAM -- will be migrated via the fd directly.
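
To make "migrated via the fd directly" concrete, here is a small sketch,
assuming Linux memfd_create() and a MAP_SHARED mapping.  The helper names are
hypothetical and this is not how QEMU allocates RAM blocks; it only shows that
the pages belong to the fd, so a second mapping (standing in for the new QEMU)
sees the same contents at a different virtual address without any copy.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: create an anonymous memory file of 'size' bytes.
 * MFD_CLOEXEC is deliberately not set, so the fd would survive exec. */
int ram_memfd_create(const char *name, size_t size)
{
    int fd = memfd_create(name, 0);

    if (fd < 0) {
        return -1;
    }
    if (ftruncate(fd, (off_t)size) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Hypothetical helper: map the memfd shared, so that writes go to the
 * file's pages rather than to private copies.  Returns NULL on error. */
void *ram_memfd_map(int fd, size_t size)
{
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    return p == MAP_FAILED ? NULL : p;
}
```

With MAP_PRIVATE instead, modified pages would be private copies that are not
reachable through the fd, which is why such memory would have to be migrated
through the stream.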

-- 
Cheers,

David / dhildenb




* Re: [PATCH V1 17/26] machine: memfd-alloc option
  2024-06-04 16:14                     ` David Hildenbrand
@ 2024-06-04 16:41                       ` Peter Xu
  2024-06-04 17:16                         ` David Hildenbrand
  0 siblings, 1 reply; 122+ messages in thread
From: Peter Xu @ 2024-06-04 16:41 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Daniel P. Berrangé, Steven Sistare, qemu-devel,
	Fabiano Rosas, Igor Mammedov, Eduardo Habkost, Marcel Apfelbaum,
	Philippe Mathieu-Daude, Paolo Bonzini, Markus Armbruster

On Tue, Jun 04, 2024 at 06:14:08PM +0200, David Hildenbrand wrote:
> On 04.06.24 17:58, Peter Xu wrote:
> > On Tue, Jun 04, 2024 at 08:13:26AM +0100, Daniel P. Berrangé wrote:
> > > On Mon, Jun 03, 2024 at 05:48:32PM -0400, Peter Xu wrote:
> > > > That property, irrelevant of what it is called (and I doubt whether Dan's
> > > > suggestion on "shared-ram" is good, e.g. mmap(MAP_SHARED) doesn't have user
> > > > visible fd but it's shared-ram for sure..), is yet another way to specify
> > > > guest mem types.
> > > > 
> > > > What if the user specified this property but specified something else in
> > > > the -object parameters?  E.g. -machine share-ram=on -object
> > > > memory-backend-ram,share=off.  What should we do?
> > > 
> > > The machine property would only apply to memory regions that are
> > > *NOT* being created via -object. The memory-backend objects would
> > > always honour their own share setting.
> > 
> > In that case we may want to rename that to share-ram-by-default=on.
> > Otherwise it's not clear which one would take effect from an user POV, even
> > if we can define it like that in the code.
> > 
> > Even with share-ram-by-default=on, it can be still confusing in some form
> > or another. Consider this cmdline:
> > 
> >    -machine q35,share-ram-by-default=on -object memory-backend-ram,id=mem1
> > 
> > Then is mem1 shared or not?  From reading the cmdline, if share ram by
> > default it should be ON if we don't specify it, but it's actually off?
> > It's because -object has its own default values.
> 
> We do have something similar with "merge" and "dump" properties. See
> machine_mem_merge() / machine_dump_guest_core().
> 
> These correspond to the "mem-merge" and "dump-guest-core" machine
> properties.

These look fine so far, as long as the -object cmdline doesn't allow
specifying the same thing again.

> 
> But ...
> 
> > 
> > IMHO fundamentally it's just controversial to have two ways to configure
> > guest memory.  If '-object' is the preferred and complete way to configure
> > it, I prefer sticking with it if possible and see what is missing.
> 
> ... I agree with that. With vhost-user we also require a reasonable
> configuration (using proper fd-based shared memory) for it to work.
> 
> > 
> > I think I raised that as the other major reason too, that I think it's so
> > far only about the vram that is out of the picture here.  We don't and
> > shouldn't have complicated RW RAMs floating around that we want this
> > property to cover.
> 
> Agreed. And maybe we can still keep migration of any MAP_PRIVATE thing
> working by migrating that memory? CPR will be "slightly less fast".
> 
> But the biggest piece -- guest RAM -- will be migrated via the fd directly.

I think it should work but only without VFIO.  With VFIO there must be
no private pages at all, or migrating is racy with concurrent DMAs
(yes, AFAICT CPR can run migration with DMA running..).

CPR has a pretty tricky way of using VFIO pgtables in that it requires the
PFNs to not change before/after migration.  Feel free to have a look at
VFIO_DMA_MAP_FLAG_VADDR in vfio.h to get a feeling for it.

Thanks,

-- 
Peter Xu




* Re: [PATCH V1 17/26] machine: memfd-alloc option
  2024-06-04 16:41                       ` Peter Xu
@ 2024-06-04 17:16                         ` David Hildenbrand
  0 siblings, 0 replies; 122+ messages in thread
From: David Hildenbrand @ 2024-06-04 17:16 UTC (permalink / raw)
  To: Peter Xu
  Cc: Daniel P. Berrangé, Steven Sistare, qemu-devel,
	Fabiano Rosas, Igor Mammedov, Eduardo Habkost, Marcel Apfelbaum,
	Philippe Mathieu-Daude, Paolo Bonzini, Markus Armbruster

On 04.06.24 18:41, Peter Xu wrote:
> On Tue, Jun 04, 2024 at 06:14:08PM +0200, David Hildenbrand wrote:
>> On 04.06.24 17:58, Peter Xu wrote:
>>> On Tue, Jun 04, 2024 at 08:13:26AM +0100, Daniel P. Berrangé wrote:
>>>> On Mon, Jun 03, 2024 at 05:48:32PM -0400, Peter Xu wrote:
>>>>> That property, irrelevant of what it is called (and I doubt whether Dan's
>>>>> suggestion on "shared-ram" is good, e.g. mmap(MAP_SHARED) doesn't have user
>>>>> visible fd but it's shared-ram for sure..), is yet another way to specify
>>>>> guest mem types.
>>>>>
>>>>> What if the user specified this property but specified something else in
>>>>> the -object parameters?  E.g. -machine share-ram=on -object
>>>>> memory-backend-ram,share=off.  What should we do?
>>>>
>>>> The machine property would only apply to memory regions that are
>>>> *NOT* being created via -object. The memory-backend objects would
>>>> always honour their own share setting.
>>>
>>> In that case we may want to rename that to share-ram-by-default=on.
>>> Otherwise it's not clear which one would take effect from an user POV, even
>>> if we can define it like that in the code.
>>>
>>> Even with share-ram-by-default=on, it can be still confusing in some form
>>> or another. Consider this cmdline:
>>>
>>>     -machine q35,share-ram-by-default=on -object memory-backend-ram,id=mem1
>>>
>>> Then is mem1 shared or not?  From reading the cmdline, if share ram by
>>> default it should be ON if we don't specify it, but it's actually off?
>>> It's because -object has its own default values.
>>
>> We do have something similar with "merge" and "dump" properties. See
>> machine_mem_merge() / machine_dump_guest_core().
>>
>> These correspond to the "mem-merge" and "dump-guest-core" machine
>> properties.
> 
> These look fine so far, as long as -object cmdline doesn't allow to specify
> the same thing again.
> 

You can. The mem-merge / dump-guest-core machine properties set the default
that can be overridden per memory backend (merge / dump properties).
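
The "machine default, per-backend override" pattern described here can be
sketched as a tri-state lookup.  The types and function below are hypothetical
and only illustrate the precedence; they are not taken from the QEMU sources.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical tri-state: was the property given on the backend or not? */
typedef enum {
    PROP_UNSET,
    PROP_OFF,
    PROP_ON,
} PropTri;

/* An explicit per-backend value wins; otherwise the machine-level
 * default applies. */
bool resolve_backend_prop(bool machine_default, PropTri backend_value)
{
    switch (backend_value) {
    case PROP_ON:
        return true;
    case PROP_OFF:
        return false;
    default:
        return machine_default;
    }
}
```

The concern raised earlier in the thread is that "share" on -object does not
behave like the unset case here, because the backend has its own default.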

>>
>> But ...
>>
>>>
>>> IMHO fundamentally it's just controversial to have two ways to configure
>>> guest memory.  If '-object' is the preferred and complete way to configure
>>> it, I prefer sticking with it if possible and see what is missing.
>>
>> ... I agree with that. With vhost-user we also require a reasonable
>> configuration (using proper fd-based shared memory) for it to work.
>>
>>>
>>> I think I raised that as the other major reason too, that I think it's so
>>> far only about the vram that is out of the picture here.  We don't and
>>> shouldn't have complicated RW RAMs floating around that we want this
>>> property to cover.
>>
>> Agreed. And maybe we can still keep migration of any MAP_PRIVATE thing
>> working by migrating that memory? CPR will be "slightly less fast".
>>
>> But the biggest piece -- guest RAM -- will be migrated via the fd directly.
> 
> I think it should work but only without VFIO.  When with VFIO there must
> have no private pages at all or migrating is racy with concurrent DMAs
> (yes, AFAICT CPR can run migration with DMA running..).

Understood. For these we could fail migration. Thanks for the pointer.

-- 
Cheers,

David / dhildenb





Thread overview: 122+ messages
2024-04-29 15:55 [PATCH V1 00/26] Live update: cpr-exec Steve Sistare
2024-04-29 15:55 ` [PATCH V1 01/26] oslib: qemu_clear_cloexec Steve Sistare
2024-05-06 23:27   ` Fabiano Rosas
2024-05-07  8:56     ` Daniel P. Berrangé
2024-05-07 13:54       ` Fabiano Rosas
2024-04-29 15:55 ` [PATCH V1 02/26] vl: helper to request re-exec Steve Sistare
2024-04-29 15:55 ` [PATCH V1 03/26] migration: SAVEVM_FOREACH Steve Sistare
2024-05-06 23:17   ` Fabiano Rosas
2024-05-13 19:27     ` Steven Sistare
2024-05-27 18:14       ` Peter Xu
2024-04-29 15:55 ` [PATCH V1 04/26] migration: delete unused parameter mis Steve Sistare
2024-05-06 21:50   ` Fabiano Rosas
2024-05-27 18:02   ` Peter Xu
2024-04-29 15:55 ` [PATCH V1 05/26] migration: precreate vmstate Steve Sistare
2024-05-07 21:02   ` Fabiano Rosas
2024-05-13 19:28     ` Steven Sistare
2024-05-24 13:56   ` Fabiano Rosas
2024-05-27 18:16   ` Peter Xu
2024-05-28 15:09     ` Steven Sistare via
2024-05-29 18:39       ` Peter Xu
2024-05-30 17:04         ` Steven Sistare via
2024-04-29 15:55 ` [PATCH V1 06/26] migration: precreate vmstate for exec Steve Sistare
2024-05-06 23:34   ` Fabiano Rosas
2024-05-13 19:28     ` Steven Sistare
2024-05-13 21:21       ` Fabiano Rosas
2024-04-29 15:55 ` [PATCH V1 07/26] migration: VMStateId Steve Sistare
2024-05-07 21:03   ` Fabiano Rosas
2024-05-27 18:20   ` Peter Xu
2024-05-28 15:10     ` Steven Sistare via
2024-05-28 17:44       ` Peter Xu
2024-05-29 17:30         ` Steven Sistare via
2024-05-29 18:53           ` Peter Xu
2024-05-30 17:11             ` Steven Sistare via
2024-05-30 18:03               ` Peter Xu
2024-04-29 15:55 ` [PATCH V1 08/26] migration: vmstate_info_void_ptr Steve Sistare
2024-05-07 21:33   ` Fabiano Rosas
2024-05-27 18:31   ` Peter Xu
2024-05-28 15:10     ` Steven Sistare via
2024-05-28 18:21       ` Peter Xu
2024-05-29 17:30         ` Steven Sistare via
2024-04-29 15:55 ` [PATCH V1 09/26] migration: vmstate_register_named Steve Sistare
2024-05-09 14:19   ` Fabiano Rosas
2024-05-09 14:32     ` Fabiano Rosas
2024-05-13 19:29       ` Steven Sistare
2024-04-29 15:55 ` [PATCH V1 10/26] migration: vmstate_unregister_named Steve Sistare
2024-04-29 15:55 ` [PATCH V1 11/26] migration: vmstate_register at init time Steve Sistare
2024-04-29 15:55 ` [PATCH V1 12/26] migration: vmstate factory object Steve Sistare
2024-04-29 15:55 ` [PATCH V1 13/26] physmem: ram_block_create Steve Sistare
2024-05-13 18:37   ` Fabiano Rosas
2024-05-13 19:30     ` Steven Sistare
2024-04-29 15:55 ` [PATCH V1 14/26] physmem: hoist guest_memfd creation Steve Sistare
2024-04-29 15:55 ` [PATCH V1 15/26] physmem: hoist host memory allocation Steve Sistare
2024-04-29 15:55 ` [PATCH V1 16/26] physmem: set ram block idstr earlier Steve Sistare
2024-04-29 15:55 ` [PATCH V1 17/26] machine: memfd-alloc option Steve Sistare
2024-05-28 21:12   ` Peter Xu
2024-05-29 17:31     ` Steven Sistare via
2024-05-29 19:14       ` Peter Xu
2024-05-30 17:11         ` Steven Sistare via
2024-05-30 18:14           ` Peter Xu
2024-05-31 19:32             ` Steven Sistare via
2024-06-03 21:48               ` Peter Xu
2024-06-04  7:13                 ` Daniel P. Berrangé
2024-06-04 15:58                   ` Peter Xu
2024-06-04 16:14                     ` David Hildenbrand
2024-06-04 16:41                       ` Peter Xu
2024-06-04 17:16                         ` David Hildenbrand
2024-06-03 10:17       ` Daniel P. Berrangé
2024-06-03 11:59         ` Steven Sistare via
2024-04-29 15:55 ` [PATCH V1 18/26] migration: cpr-exec-args parameter Steve Sistare
2024-05-02 12:23   ` Markus Armbruster
2024-05-02 16:00     ` Steven Sistare
2024-05-21  8:13   ` Daniel P. Berrangé
2024-04-29 15:55 ` [PATCH V1 19/26] physmem: preserve ram blocks for cpr Steve Sistare
2024-05-28 21:44   ` Peter Xu
2024-05-29 17:31     ` Steven Sistare via
2024-05-29 19:25       ` Peter Xu
2024-05-30 17:12         ` Steven Sistare via
2024-05-30 18:39           ` Peter Xu
2024-05-31 19:32             ` Steven Sistare via
2024-06-03 22:29               ` Peter Xu
2024-04-29 15:55 ` [PATCH V1 20/26] migration: cpr-exec mode Steve Sistare
2024-05-02 12:23   ` Markus Armbruster
2024-05-02 16:00     ` Steven Sistare
2024-05-03  6:26       ` Markus Armbruster
2024-05-21  8:20   ` Daniel P. Berrangé
2024-05-24 14:58   ` Fabiano Rosas
2024-05-27 18:54     ` Steven Sistare via
2024-04-29 15:55 ` [PATCH V1 21/26] migration: migrate_add_blocker_mode Steve Sistare
2024-05-09 17:47   ` Fabiano Rosas
2024-04-29 15:55 ` [PATCH V1 22/26] migration: ram block cpr-exec blockers Steve Sistare
2024-05-09 18:01   ` Fabiano Rosas
2024-05-13 19:29     ` Steven Sistare
2024-04-29 15:55 ` [PATCH V1 23/26] migration: misc " Steve Sistare
2024-05-09 18:05   ` Fabiano Rosas
2024-05-24 12:40   ` Fabiano Rosas
2024-05-27 19:02     ` Steven Sistare via
2024-04-29 15:55 ` [PATCH V1 24/26] seccomp: cpr-exec blocker Steve Sistare
2024-05-09 18:16   ` Fabiano Rosas
2024-05-10  7:54   ` Daniel P. Berrangé
2024-05-13 19:29     ` Steven Sistare
2024-05-21  7:14       ` Daniel P. Berrangé
2024-04-29 15:55 ` [PATCH V1 25/26] migration: fix mismatched GPAs during cpr-exec Steve Sistare
2024-05-09 18:39   ` Fabiano Rosas
2024-04-29 15:55 ` [PATCH V1 26/26] migration: only-migratable-modes Steve Sistare
2024-05-09 19:14   ` Fabiano Rosas
2024-05-13 19:48     ` Steven Sistare
2024-05-13 21:57       ` Fabiano Rosas
2024-05-21  8:05   ` Daniel P. Berrangé
2024-05-02 16:13 ` cpr-exec doc (was Re: [PATCH V1 00/26] Live update: cpr-exec) Steven Sistare
2024-05-02 18:15   ` Peter Xu
2024-05-20 18:30 ` [PATCH V1 00/26] Live update: cpr-exec Steven Sistare
2024-05-20 22:28   ` Fabiano Rosas
2024-05-21  2:31     ` Peter Xu
2024-05-21 11:46       ` Steven Sistare
2024-05-27 17:45         ` Peter Xu
2024-05-28 15:10           ` Steven Sistare via
2024-05-28 16:42             ` Peter Xu
2024-05-30 17:17               ` Steven Sistare via
2024-05-30 19:23                 ` Peter Xu
2024-05-24 13:02 ` Fabiano Rosas
2024-05-24 14:07   ` Steven Sistare
2024-05-27 18:07 ` Peter Xu