[PATCH V2 00/13] Live update: cpr-transfer

* [PATCH V2 00/13] Live update: cpr-transfer
@ 2024-09-30 19:40 Steve Sistare
  2024-09-30 19:40 ` [PATCH V2 01/13] machine: alloc-anon option Steve Sistare
                   ` (13 more replies)
  0 siblings, 14 replies; 79+ messages in thread
From: Steve Sistare @ 2024-09-30 19:40 UTC (permalink / raw)
  To: qemu-devel
  Cc: Peter Xu, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster, Steve Sistare

What?

This patch series adds the live migration cpr-transfer mode, which
allows the user to transfer a guest to a new QEMU instance on the same
host with minimal guest pause time, by preserving guest RAM in place,
albeit with new virtual addresses in new QEMU, and by preserving device
file descriptors.

The new user-visible interfaces are:
  * cpr-transfer (MigMode migration parameter)
  * cpr-uri (migration parameter)
  * cpr-uri (command-line argument)

The user sets the mode parameter before invoking the migrate command.
In this mode, the user starts new QEMU on the same host as old QEMU, with
the same arguments as old QEMU, plus the -incoming and the -cpr-uri options.
The user issues the migrate command to old QEMU, which stops the VM, saves
state to the migration channels, and enters the postmigrate state.  Execution
resumes in new QEMU.

Memory-backend objects must have the share=on attribute;
memory-backend-epc is not supported.  The VM must be started
with the '-machine anon-alloc=memfd' option, which allows anonymous
memory to be transferred in place to the new process.

This mode requires a second migration channel, specified by the cpr-uri
migration property on the outgoing side, and by the cpr-uri QEMU command-line
option on the incoming side.  The channel must be a type, such as unix socket,
that supports SCM_RIGHTS.

Why?

This mode has less impact on the guest than any other method of updating
in place.  The pause time is much lower, because devices need not be torn
down and recreated, DMA does not need to be drained and quiesced, and minimal
state is copied to new QEMU.  Further, there are no constraints on the guest.
By contrast, cpr-reboot mode requires the guest to support S3 suspend-to-ram,
and suspending plus resuming vfio devices adds multiple seconds to the
guest pause time.  Lastly, there is no loss of connectivity to the guest,
because chardev descriptors remain open and connected.

These benefits all derive from the core design principle of this mode,
which is preserving open descriptors.  This approach is very general and
can be used to support a wide variety of devices that do not have hardware
support for live migration, including but not limited to: vfio, chardev,
vhost, vdpa, and iommufd.  Some devices need new kernel software interfaces
to allow a descriptor to be used in a process that did not originally open it.

How?

All memory that is mapped by the guest is preserved in place.  Indeed,
it must be, because it may be the target of DMA requests, which are not
quiesced during cpr-transfer.  All such memory must be mmap'able in new QEMU.
This is easy for named memory-backend objects, as long as they are mapped
shared, because they are visible in the file system in both old and new QEMU.
Anonymous memory must be allocated using memfd_create rather than MAP_ANON,
so the memfd's can be sent to new QEMU.  Pages that were locked in memory
for DMA in old QEMU remain locked in new QEMU, because the descriptor of
the device that locked them remains open.
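
As a rough illustration of why memfd-backed memory can be preserved in
place, the sketch below (not code from this series; all function names are
hypothetical, and it assumes Linux with glibc 2.27 or later) maps one memfd
twice.  The second mapping stands in for new QEMU: it lands at a different
virtual address but sees the same contents.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a memfd of 'size' bytes and map it shared.
 * Returns the mapping or NULL; the descriptor is stored in *fd_out.
 */
void *memfd_ram_alloc(const char *name, size_t size, int *fd_out)
{
    int fd = memfd_create(name, MFD_CLOEXEC);
    void *p;

    if (fd < 0) {
        return NULL;
    }
    if (ftruncate(fd, size) < 0) {
        close(fd);
        return NULL;
    }
    p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return NULL;
    }
    *fd_out = fd;
    return p;
}

/* Hypothetical helper: map an existing memfd again, as a receiver would. */
void *memfd_ram_remap(int fd, size_t size)
{
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    return p == MAP_FAILED ? NULL : p;
}

/*
 * Returns 1 if data written through the first mapping is visible through
 * a second mapping at a different address, 0 if not, -1 on setup failure.
 */
int memfd_demo(void)
{
    const size_t size = 4096;
    int fd, same;
    char *old_map, *new_map;

    old_map = memfd_ram_alloc("demo-ram", size, &fd);
    if (!old_map) {
        return -1;
    }
    strcpy(old_map, "guest data");

    new_map = memfd_ram_remap(fd, size);
    if (!new_map) {
        munmap(old_map, size);
        close(fd);
        return -1;
    }
    same = new_map != old_map && !strcmp(new_map, "guest data");

    munmap(new_map, size);
    munmap(old_map, size);
    close(fd);
    return same;
}
```

Memory obtained with MAP_ANON has no descriptor that could be handed to
another process, which is why this mode needs memfd_create.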

cpr-transfer preserves descriptors by sending them to new QEMU over the
channel specified by cpr-uri, which must support SCM_RIGHTS, and by sending
the unique name and value of each descriptor to new QEMU via CPR state.
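
For readers unfamiliar with SCM_RIGHTS, here is a minimal standalone sketch
of passing a descriptor over a unix socket.  It is not the QEMUFile code
from this series; send_fd, recv_fd, and scm_rights_demo are hypothetical
names, and a socketpair stands in for the cpr-uri channel.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer sized and aligned for exactly one descriptor. */
union fd_cmsg {
    char buf[CMSG_SPACE(sizeof(int))];
    struct cmsghdr align;
};

/* Send one descriptor as SCM_RIGHTS ancillary data; 0 on success. */
int send_fd(int sock, int fd)
{
    char byte = 0;                      /* at least one data byte is sent */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_cmsg u;
    struct msghdr msg = { 0 };
    struct cmsghdr *cmsg;

    memset(&u, 0, sizeof(u));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one descriptor; returns the new fd number, or -1 on error. */
int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_cmsg u;
    struct msghdr msg = { 0 };
    struct cmsghdr *cmsg;
    int fd = -1;

    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    if (recvmsg(sock, &msg, 0) != 1) {
        return -1;
    }
    cmsg = CMSG_FIRSTHDR(&msg);
    if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int))) {
        return -1;
    }
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}

/*
 * Pass the write end of a pipe across a socketpair, then write through
 * the received descriptor.  Returns 1 if the byte arrives at the pipe's
 * read end, 0 if not, -1 on setup failure.
 */
int scm_rights_demo(void)
{
    int sv[2], pfd[2], newfd, ok;
    char c = 0;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) {
        return -1;
    }
    if (pipe(pfd) < 0) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }
    newfd = send_fd(sv[0], pfd[1]) == 0 ? recv_fd(sv[1]) : -1;
    ok = newfd >= 0 && write(newfd, "x", 1) == 1 &&
         read(pfd[0], &c, 1) == 1 && c == 'x';

    if (newfd >= 0) {
        close(newfd);
    }
    close(pfd[0]);
    close(pfd[1]);
    close(sv[0]);
    close(sv[1]);
    return ok;
}
```

The receiver generally gets a different descriptor number than the sender
used, so a descriptor has to be looked up by a stable key on the receiving
side rather than by its old number alone.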

For device descriptors, new QEMU reuses the descriptor when creating the
device, rather than opening it again.  For memfd descriptors, new QEMU
mmap's the preserved memfd when a ramblock is created.
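
The reuse pattern described above can be sketched as a find-or-open lookup
keyed by name and id.  This is a simplified standalone illustration, not the
cpr.c implementation: the demo_* names and the fixed-size table are
assumptions made for the example, and /dev/null stands in for a device node.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define DEMO_MAX_FDS 16

/* One preserved descriptor, identified by (name, id). */
struct demo_fd {
    const char *name;   /* not copied; caller keeps the string alive */
    int id;
    int fd;
};

static struct demo_fd demo_fds[DEMO_MAX_FDS];
static int demo_nfds;

/* Return the saved descriptor for (name, id), or -1 if none is saved. */
int demo_find_fd(const char *name, int id)
{
    for (int i = 0; i < demo_nfds; i++) {
        if (!strcmp(demo_fds[i].name, name) && demo_fds[i].id == id) {
            return demo_fds[i].fd;
        }
    }
    return -1;
}

/* Record a descriptor under (name, id); 0 on success, -1 if full. */
int demo_save_fd(const char *name, int id, int fd)
{
    if (demo_nfds == DEMO_MAX_FDS) {
        return -1;
    }
    demo_fds[demo_nfds++] = (struct demo_fd){ name, id, fd };
    return 0;
}

/*
 * Reuse a preserved descriptor if one exists; otherwise open the path
 * and save the result so that a later lookup finds it.
 */
int demo_open_device(const char *name, int id, const char *path)
{
    int fd = demo_find_fd(name, id);

    if (fd >= 0) {
        return fd;
    }
    fd = open(path, O_RDWR);
    if (fd >= 0 && demo_save_fd(name, id, fd) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

In this sketch a second call to demo_open_device with the same name and id
returns the descriptor saved by the first call instead of opening the path
again.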

CPR state cannot be sent over the normal migration channel, because devices
and backends are created prior to reading the channel, so this mode sends
CPR state over a second migration channel, specified by cpr-uri.  New QEMU
reads the second channel prior to creating devices or backends.

Example:

In this example, we simply restart the same version of QEMU, but in
a real scenario one would use a new QEMU binary path in terminal 2.

  Terminal 1: start old QEMU
  # qemu-kvm -monitor stdio -object \
      memory-backend-file,id=ram0,size=4G,mem-path=/dev/shm/ram0,share=on \
      -m 4G -machine anon-alloc=memfd ...

  Terminal 2: start new QEMU
  # qemu-kvm ... -incoming unix:vm.sock -cpr-uri unix:cpr.sock

  Terminal 1:
  QEMU 9.1.50 monitor - type 'help' for more information
  (qemu) info status
  VM status: running
  (qemu) migrate_set_parameter mode cpr-transfer
  (qemu) migrate_set_parameter cpr-uri unix:cpr.sock
  (qemu) migrate -d unix:vm.sock
  (qemu) info status
  VM status: paused (postmigrate)

  Terminal 2:
  QEMU 9.1.50 monitor - type 'help' for more information
  (qemu) info status
  VM status: running

This patch series implements a minimal version of cpr-transfer.  Additional
series are ready to be posted to deliver the complete vision described
above, including
  * vfio
  * chardev
  * vhost and tap
  * blockers
  * migration-test cases
  * cpr-exec mode

Works in progress include:
  * vdpa
  * iommufd

Changes in V2:
  * cpr-transfer is the first new mode proposed, and cpr-exec is deferred
  * anon-alloc does not apply to memory-backend-object
  * replaced hack with proper synchronization between source and target
  * defined QEMU_CPR_FILE_MAGIC
  * addressed misc review comments

The first 6 patches below are foundational and are needed for both cpr-transfer
mode and the proposed cpr-exec mode.  The last 7 patches are specific to
cpr-transfer and implement the mechanisms for sharing state across a socket
using SCM_RIGHTS.

Steve Sistare (13):
  machine: alloc-anon option
  migration: cpr-state
  migration: save cpr mode
  migration: stop vm earlier for cpr
  physmem: preserve ram blocks for cpr
  hostmem-memfd: preserve for cpr
  migration: SCM_RIGHTS for QEMUFile
  migration: VMSTATE_FD
  migration: cpr-transfer save and load
  migration: cpr-uri parameter
  migration: cpr-uri option
  migration: split qmp_migrate
  migration: cpr-transfer mode

 backends/hostmem-memfd.c       |  12 +-
 hw/core/machine.c              |  19 +++
 include/hw/boards.h            |   1 +
 include/migration/cpr.h        |  38 ++++++
 include/migration/vmstate.h    |   9 ++
 migration/cpr-transfer.c       |  81 +++++++++++++
 migration/cpr.c                | 269 +++++++++++++++++++++++++++++++++++++++++
 migration/meson.build          |   2 +
 migration/migration-hmp-cmds.c |  10 ++
 migration/migration.c          | 116 ++++++++++++++++--
 migration/migration.h          |   2 +
 migration/options.c            |  37 +++++-
 migration/options.h            |   1 +
 migration/qemu-file.c          |  83 ++++++++++++-
 migration/qemu-file.h          |   2 +
 migration/ram.c                |   2 +
 migration/trace-events         |   7 ++
 migration/vmstate-types.c      |  33 +++++
 qapi/machine.json              |  14 +++
 qapi/migration.json            |  45 ++++++-
 qemu-options.hx                |  19 +++
 stubs/vmstate.c                |   7 ++
 system/physmem.c               |  58 +++++++++
 system/trace-events            |   3 +
 system/vl.c                    |  10 ++
 25 files changed, 857 insertions(+), 23 deletions(-)
 create mode 100644 include/migration/cpr.h
 create mode 100644 migration/cpr-transfer.c
 create mode 100644 migration/cpr.c

-- 
1.8.3.1


* [PATCH V2 01/13] machine: alloc-anon option
  2024-09-30 19:40 [PATCH V2 00/13] Live update: cpr-transfer Steve Sistare
@ 2024-09-30 19:40 ` Steve Sistare
  2024-10-03 16:14   ` Peter Xu
  2024-10-07 15:36   ` Peter Xu
  2024-09-30 19:40 ` [PATCH V2 02/13] migration: cpr-state Steve Sistare
                   ` (12 subsequent siblings)
  13 siblings, 2 replies; 79+ messages in thread
From: Steve Sistare @ 2024-09-30 19:40 UTC (permalink / raw)
  To: qemu-devel
  Cc: Peter Xu, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster, Steve Sistare

Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
on the value of the anon-alloc machine property.  This option applies to
memory allocated as a side effect of creating various devices. It does
not apply to memory-backend-objects, whether explicitly specified on
the command line, or implicitly created by the -m command line option.

The memfd option is intended to support new migration modes, in which the
memory region can be transferred in place to a new QEMU process, by sending
the memfd file descriptor to the process.  Memory contents are preserved,
and if the mode also transfers device descriptors, then pages that are
locked in memory for DMA remain locked.  This behavior is a prerequisite
for supporting vfio, vdpa, and iommufd devices with the new modes.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 hw/core/machine.c   | 19 +++++++++++++++++++
 include/hw/boards.h |  1 +
 qapi/machine.json   | 14 ++++++++++++++
 qemu-options.hx     | 11 +++++++++++
 system/physmem.c    | 35 +++++++++++++++++++++++++++++++++++
 system/trace-events |  3 +++
 6 files changed, 83 insertions(+)

diff --git a/hw/core/machine.c b/hw/core/machine.c
index adaba17..a89a32b 100644
--- a/hw/core/machine.c
+++ b/hw/core/machine.c
@@ -460,6 +460,20 @@ static void machine_set_mem_merge(Object *obj, bool value, Error **errp)
     ms->mem_merge = value;
 }
 
+static int machine_get_anon_alloc(Object *obj, Error **errp)
+{
+    MachineState *ms = MACHINE(obj);
+
+    return ms->anon_alloc;
+}
+
+static void machine_set_anon_alloc(Object *obj, int value, Error **errp)
+{
+    MachineState *ms = MACHINE(obj);
+
+    ms->anon_alloc = value;
+}
+
 static bool machine_get_usb(Object *obj, Error **errp)
 {
     MachineState *ms = MACHINE(obj);
@@ -1078,6 +1092,11 @@ static void machine_class_init(ObjectClass *oc, void *data)
     object_class_property_set_description(oc, "mem-merge",
         "Enable/disable memory merge support");
 
+    object_class_property_add_enum(oc, "anon-alloc", "AnonAllocOption",
+                                   &AnonAllocOption_lookup,
+                                   machine_get_anon_alloc,
+                                   machine_set_anon_alloc);
+
     object_class_property_add_bool(oc, "usb",
         machine_get_usb, machine_set_usb);
     object_class_property_set_description(oc, "usb",
diff --git a/include/hw/boards.h b/include/hw/boards.h
index 5966069..5a87647 100644
--- a/include/hw/boards.h
+++ b/include/hw/boards.h
@@ -393,6 +393,7 @@ struct MachineState {
     bool enable_graphics;
     ConfidentialGuestSupport *cgs;
     HostMemoryBackend *memdev;
+    AnonAllocOption anon_alloc;
     /*
      * convenience alias to ram_memdev_id backend memory region
      * or to numa container memory region
diff --git a/qapi/machine.json b/qapi/machine.json
index a6b8795..d4a63f5 100644
--- a/qapi/machine.json
+++ b/qapi/machine.json
@@ -1898,3 +1898,17 @@
 { 'command': 'x-query-interrupt-controllers',
   'returns': 'HumanReadableText',
   'features': [ 'unstable' ]}
+
+##
+# @AnonAllocOption:
+#
+# An enumeration of the options for allocating anonymous guest memory.
+#
+# @mmap: allocate using mmap MAP_ANON
+#
+# @memfd: allocate using memfd_create
+#
+# Since: 9.2
+##
+{ 'enum': 'AnonAllocOption',
+  'data': [ 'mmap', 'memfd' ] }
diff --git a/qemu-options.hx b/qemu-options.hx
index d94e2cb..90ab943 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -38,6 +38,7 @@ DEF("machine", HAS_ARG, QEMU_OPTION_machine, \
     "                nvdimm=on|off controls NVDIMM support (default=off)\n"
     "                memory-encryption=@var{} memory encryption object to use (default=none)\n"
     "                hmat=on|off controls ACPI HMAT support (default=off)\n"
+    "                anon-alloc=mmap|memfd allocate anonymous guest RAM using mmap MAP_ANON or memfd_create (default: mmap)\n"
     "                memory-backend='backend-id' specifies explicitly provided backend for main RAM (default=none)\n"
     "                cxl-fmw.0.targets.0=firsttarget,cxl-fmw.0.targets.1=secondtarget,cxl-fmw.0.size=size[,cxl-fmw.0.interleave-granularity=granularity]\n",
     QEMU_ARCH_ALL)
@@ -101,6 +102,16 @@ SRST
         Enables or disables ACPI Heterogeneous Memory Attribute Table
         (HMAT) support. The default is off.
 
+    ``anon-alloc=mmap|memfd``
+        Allocate anonymous guest RAM using mmap MAP_ANON (the default)
+        or memfd_create.  This option applies to memory allocated as a
+        side effect of creating various devices. It does not apply to
+        memory-backend-objects, whether explicitly specified on the
+        command line, or implicitly created by the -m command line
+        option.
+
+        Some migration modes require anon-alloc=memfd.
+
     ``memory-backend='id'``
         An alternative to legacy ``-mem-path`` and ``mem-prealloc`` options.
         Allows to use a memory backend as main RAM.
diff --git a/system/physmem.c b/system/physmem.c
index dc1db3a..174f7e0 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -47,6 +47,7 @@
 #include "qemu/qemu-print.h"
 #include "qemu/log.h"
 #include "qemu/memalign.h"
+#include "qemu/memfd.h"
 #include "exec/memory.h"
 #include "exec/ioport.h"
 #include "sysemu/dma.h"
@@ -69,6 +70,8 @@
 
 #include "qemu/pmem.h"
 
+#include "qapi/qapi-types-migration.h"
+#include "migration/options.h"
 #include "migration/vmstate.h"
 
 #include "qemu/range.h"
@@ -1849,6 +1852,35 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
                 qemu_mutex_unlock_ramlist();
                 return;
             }
+
+        } else if (current_machine->anon_alloc == ANON_ALLOC_OPTION_MEMFD &&
+                   !object_dynamic_cast(new_block->mr->parent_obj.parent,
+                                        TYPE_MEMORY_BACKEND)) {
+            size_t max_length = new_block->max_length;
+            MemoryRegion *mr = new_block->mr;
+            const char *name = memory_region_name(mr);
+
+            new_block->mr->align = QEMU_VMALLOC_ALIGN;
+            new_block->flags |= RAM_SHARED;
+
+            if (new_block->fd == -1) {
+                new_block->fd = qemu_memfd_create(name, max_length + mr->align,
+                                                  0, 0, 0, errp);
+            }
+
+            if (new_block->fd >= 0) {
+                int mfd = new_block->fd;
+                qemu_set_cloexec(mfd);
+                new_block->host = file_ram_alloc(new_block, max_length, mfd,
+                                                 false, 0, errp);
+            }
+            if (!new_block->host) {
+                qemu_mutex_unlock_ramlist();
+                return;
+            }
+            memory_try_enable_merging(new_block->host, new_block->max_length);
+            free_on_error = true;
+
         } else {
             new_block->host = qemu_anon_ram_alloc(new_block->max_length,
                                                   &new_block->mr->align,
@@ -1932,6 +1964,9 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
         ram_block_notify_add(new_block->host, new_block->used_length,
                              new_block->max_length);
     }
+    trace_ram_block_add(memory_region_name(new_block->mr), new_block->flags,
+                        new_block->fd, new_block->used_length,
+                        new_block->max_length);
     return;
 
 out_free:
diff --git a/system/trace-events b/system/trace-events
index 074d001..4669411 100644
--- a/system/trace-events
+++ b/system/trace-events
@@ -47,3 +47,6 @@ dirtylimit_vcpu_execute(int cpu_index, int64_t sleep_time_us) "CPU[%d] sleep %"P
 
 # cpu-throttle.c
 cpu_throttle_set(int new_throttle_pct)  "set guest CPU throttled by %d%%"
+
+#physmem.c
+ram_block_add(const char *name, uint32_t flags, int fd, size_t used_length, size_t max_length) "%s, flags %u, fd %d, len %lu, maxlen %lu"
-- 
1.8.3.1




* [PATCH V2 02/13] migration: cpr-state
  2024-09-30 19:40 [PATCH V2 00/13] Live update: cpr-transfer Steve Sistare
  2024-09-30 19:40 ` [PATCH V2 01/13] machine: alloc-anon option Steve Sistare
@ 2024-09-30 19:40 ` Steve Sistare
  2024-10-07 14:14   ` Peter Xu
  2024-09-30 19:40 ` [PATCH V2 03/13] migration: save cpr mode Steve Sistare
                   ` (11 subsequent siblings)
  13 siblings, 1 reply; 79+ messages in thread
From: Steve Sistare @ 2024-09-30 19:40 UTC (permalink / raw)
  To: qemu-devel
  Cc: Peter Xu, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster, Steve Sistare

CPR must save state that is needed after QEMU is restarted, when devices
are realized.  Thus the extra state cannot be saved in the migration stream,
as objects must already exist before that stream can be loaded.  Instead,
define auxiliary state structures and vmstate descriptions, not associated
with any registered object, and serialize the aux state to a cpr-specific
stream in cpr_state_save.  Deserialize in cpr_state_load after QEMU
restarts, before devices are realized.

Provide accessors for clients to register file descriptors for saving.
The mechanism for passing the fd's to the new process will be specific
to each migration mode, and added in subsequent patches.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
---
 include/migration/cpr.h |  26 ++++++
 migration/cpr.c         | 217 ++++++++++++++++++++++++++++++++++++++++++++++++
 migration/meson.build   |   1 +
 migration/migration.c   |   6 ++
 migration/trace-events  |   5 ++
 system/vl.c             |   7 ++
 6 files changed, 262 insertions(+)
 create mode 100644 include/migration/cpr.h
 create mode 100644 migration/cpr.c

diff --git a/include/migration/cpr.h b/include/migration/cpr.h
new file mode 100644
index 0000000..e7b898b
--- /dev/null
+++ b/include/migration/cpr.h
@@ -0,0 +1,26 @@
+/*
+ * Copyright (c) 2021, 2024 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#ifndef MIGRATION_CPR_H
+#define MIGRATION_CPR_H
+
+#define QEMU_CPR_FILE_MAGIC     0x51435052
+#define QEMU_CPR_FILE_VERSION   0x00000001
+
+typedef int (*cpr_walk_fd_cb)(int fd);
+void cpr_save_fd(const char *name, int id, int fd);
+void cpr_delete_fd(const char *name, int id);
+int cpr_find_fd(const char *name, int id);
+int cpr_walk_fd(cpr_walk_fd_cb cb);
+void cpr_resave_fd(const char *name, int id, int fd);
+
+int cpr_state_save(Error **errp);
+int cpr_state_load(Error **errp);
+void cpr_state_close(void);
+struct QIOChannel *cpr_state_ioc(void);
+
+#endif
diff --git a/migration/cpr.c b/migration/cpr.c
new file mode 100644
index 0000000..e50fc75
--- /dev/null
+++ b/migration/cpr.c
@@ -0,0 +1,217 @@
+/*
+ * Copyright (c) 2021-2024 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "qapi/error.h"
+#include "migration/cpr.h"
+#include "migration/misc.h"
+#include "migration/qemu-file.h"
+#include "migration/savevm.h"
+#include "migration/vmstate.h"
+#include "sysemu/runstate.h"
+#include "trace.h"
+
+/*************************************************************************/
+/* cpr state container for all information to be saved. */
+
+typedef QLIST_HEAD(CprFdList, CprFd) CprFdList;
+
+typedef struct CprState {
+    CprFdList fds;
+} CprState;
+
+static CprState cpr_state;
+
+/****************************************************************************/
+
+typedef struct CprFd {
+    char *name;
+    unsigned int namelen;
+    int id;
+    int fd;
+    QLIST_ENTRY(CprFd) next;
+} CprFd;
+
+static const VMStateDescription vmstate_cpr_fd = {
+    .name = "cpr fd",
+    .version_id = 1,
+    .minimum_version_id = 1,
+    .fields = (VMStateField[]) {
+        VMSTATE_UINT32(namelen, CprFd),
+        VMSTATE_VBUFFER_ALLOC_UINT32(name, CprFd, 0, NULL, namelen),
+        VMSTATE_INT32(id, CprFd),
+        VMSTATE_INT32(fd, CprFd),
+        VMSTATE_END_OF_LIST()
+    }
+};
+
+void cpr_save_fd(const char *name, int id, int fd)
+{
+    CprFd *elem = g_new0(CprFd, 1);
+
+    trace_cpr_save_fd(name, id, fd);
+    elem->name = g_strdup(name);
+    elem->namelen = strlen(name) + 1;
+    elem->id = id;
+    elem->fd = fd;
+    QLIST_INSERT_HEAD(&cpr_state.fds, elem, next);
+}
+
+static CprFd *find_fd(CprFdList *head, const char *name, int id)
+{
+    CprFd *elem;
+
+    QLIST_FOREACH(elem, head, next) {
+        if (!strcmp(elem->name, name) && elem->id == id) {
+            return elem;
+        }
+    }
+    return NULL;
+}
+
+void cpr_delete_fd(const char *name, int id)
+{
+    CprFd *elem = find_fd(&cpr_state.fds, name, id);
+
+    if (elem) {
+        QLIST_REMOVE(elem, next);
+        g_free(elem->name);
+        g_free(elem);
+    }
+
+    trace_cpr_delete_fd(name, id);
+}
+
+int cpr_find_fd(const char *name, int id)
+{
+    CprFd *elem = find_fd(&cpr_state.fds, name, id);
+    int fd = elem ? elem->fd : -1;
+
+    trace_cpr_find_fd(name, id, fd);
+    return fd;
+}
+
+int cpr_walk_fd(cpr_walk_fd_cb cb)
+{
+    CprFd *elem;
+
+    QLIST_FOREACH(elem, &cpr_state.fds, next) {
+        if (elem->fd >= 0 && cb(elem->fd)) {
+            return 1;
+        }
+    }
+    return 0;
+}
+
+void cpr_resave_fd(const char *name, int id, int fd)
+{
+    CprFd *elem = find_fd(&cpr_state.fds, name, id);
+    int old_fd = elem ? elem->fd : -1;
+
+    if (old_fd < 0) {
+        cpr_save_fd(name, id, fd);
+    } else if (old_fd != fd) {
+        error_setg(&error_fatal,
+                   "internal error: cpr fd '%s' id %d value %d "
+                   "already saved with a different value %d",
+                   name, id, fd, old_fd);
+    }
+}
+/*************************************************************************/
+#define CPR_STATE "CprState"
+
+static const VMStateDescription vmstate_cpr_state = {
+    .name = CPR_STATE,
+    .version_id = 1,
+    .minimum_version_id = 1,
+    .fields = (VMStateField[]) {
+        VMSTATE_QLIST_V(fds, CprState, 1, vmstate_cpr_fd, CprFd, next),
+        VMSTATE_END_OF_LIST()
+    }
+};
+/*************************************************************************/
+
+static QEMUFile *cpr_state_file;
+
+QIOChannel *cpr_state_ioc(void)
+{
+    return qemu_file_get_ioc(cpr_state_file);
+}
+
+int cpr_state_save(Error **errp)
+{
+    int ret;
+    QEMUFile *f;
+
+    /* set f based on mode in a later patch in this series */
+    return 0;
+
+    qemu_put_be32(f, QEMU_CPR_FILE_MAGIC);
+    qemu_put_be32(f, QEMU_CPR_FILE_VERSION);
+
+    ret = vmstate_save_state(f, &vmstate_cpr_state, &cpr_state, 0);
+    if (ret) {
+        error_setg(errp, "vmstate_save_state error %d", ret);
+        qemu_fclose(f);
+        return ret;
+    }
+
+    /*
+     * Close the socket only partially so we can later detect when the other
+     * end closes by getting a HUP event.
+     */
+    qemu_fflush(f);
+    qio_channel_shutdown(qemu_file_get_ioc(f), QIO_CHANNEL_SHUTDOWN_WRITE,
+                         NULL);
+    cpr_state_file = f;
+    return 0;
+}
+
+int cpr_state_load(Error **errp)
+{
+    int ret;
+    uint32_t v;
+    QEMUFile *f;
+
+    /* set f based on mode in a later patch in this series */
+    return 0;
+
+    v = qemu_get_be32(f);
+    if (v != QEMU_CPR_FILE_MAGIC) {
+        error_setg(errp, "Not a migration stream (bad magic %x)", v);
+        qemu_fclose(f);
+        return -EINVAL;
+    }
+    v = qemu_get_be32(f);
+    if (v != QEMU_CPR_FILE_VERSION) {
+        error_setg(errp, "Unsupported migration stream version %d", v);
+        qemu_fclose(f);
+        return -ENOTSUP;
+    }
+
+    ret = vmstate_load_state(f, &vmstate_cpr_state, &cpr_state, 1);
+    if (ret) {
+        error_setg(errp, "vmstate_load_state error %d", ret);
+        qemu_fclose(f);
+        return ret;
+    }
+
+    /*
+     * Let the caller decide when to close the socket (and generate a HUP event
+     * for the sending side).
+     */
+    cpr_state_file = f;
+    return ret;
+}
+
+void cpr_state_close(void)
+{
+    if (cpr_state_file) {
+        qemu_fclose(cpr_state_file);
+        cpr_state_file = NULL;
+    }
+}
diff --git a/migration/meson.build b/migration/meson.build
index 66d3de8..e5f4211 100644
--- a/migration/meson.build
+++ b/migration/meson.build
@@ -13,6 +13,7 @@ system_ss.add(files(
   'block-dirty-bitmap.c',
   'channel.c',
   'channel-block.c',
+  'cpr.c',
   'dirtyrate.c',
   'exec.c',
   'fd.c',
diff --git a/migration/migration.c b/migration/migration.c
index ae2be31..834b0a2 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -27,6 +27,7 @@
 #include "sysemu/cpu-throttle.h"
 #include "rdma.h"
 #include "ram.h"
+#include "migration/cpr.h"
 #include "migration/global_state.h"
 #include "migration/misc.h"
 #include "migration.h"
@@ -2123,6 +2124,10 @@ void qmp_migrate(const char *uri, bool has_channels,
         }
     }
 
+    if (cpr_state_save(&local_err)) {
+        goto out;
+    }
+
     if (addr->transport == MIGRATION_ADDRESS_TYPE_SOCKET) {
         SocketAddress *saddr = &addr->u.socket;
         if (saddr->type == SOCKET_ADDRESS_TYPE_INET ||
@@ -2147,6 +2152,7 @@ void qmp_migrate(const char *uri, bool has_channels,
                           MIGRATION_STATUS_FAILED);
     }
 
+out:
     if (local_err) {
         if (!resume_requested) {
             yank_unregister_instance(MIGRATION_YANK_INSTANCE);
diff --git a/migration/trace-events b/migration/trace-events
index c65902f..5356fb5 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -341,6 +341,11 @@ colo_receive_message(const char *msg) "Receive '%s' message"
 # colo-failover.c
 colo_failover_set_state(const char *new_state) "new state %s"
 
+# cpr.c
+cpr_save_fd(const char *name, int id, int fd) "%s, id %d, fd %d"
+cpr_delete_fd(const char *name, int id) "%s, id %d"
+cpr_find_fd(const char *name, int id, int fd) "%s, id %d returns %d"
+
 # block-dirty-bitmap.c
 send_bitmap_header_enter(void) ""
 send_bitmap_bits(uint32_t flags, uint64_t start_sector, uint32_t nr_sectors, uint64_t data_size) "flags: 0x%x, start_sector: %" PRIu64 ", nr_sectors: %" PRIu32 ", data_size: %" PRIu64
diff --git a/system/vl.c b/system/vl.c
index 752a1da..565d932 100644
--- a/system/vl.c
+++ b/system/vl.c
@@ -77,6 +77,7 @@
 #include "hw/block/block.h"
 #include "hw/i386/x86.h"
 #include "hw/i386/pc.h"
+#include "migration/cpr.h"
 #include "migration/misc.h"
 #include "migration/snapshot.h"
 #include "sysemu/tpm.h"
@@ -3720,6 +3721,12 @@ void qemu_init(int argc, char **argv)
 
     qemu_create_machine(machine_opts_dict);
 
+    /*
+     * Load incoming CPR state before any devices are created, because it
+     * contains file descriptors that are needed in device initialization code.
+     */
+    cpr_state_load(&error_fatal);
+
     suspend_mux_open();
 
     qemu_disable_default_devices();
-- 
1.8.3.1




* [PATCH V2 03/13] migration: save cpr mode
  2024-09-30 19:40 [PATCH V2 00/13] Live update: cpr-transfer Steve Sistare
  2024-09-30 19:40 ` [PATCH V2 01/13] machine: alloc-anon option Steve Sistare
  2024-09-30 19:40 ` [PATCH V2 02/13] migration: cpr-state Steve Sistare
@ 2024-09-30 19:40 ` Steve Sistare
  2024-10-07 15:18   ` Peter Xu
  2024-09-30 19:40 ` [PATCH V2 04/13] migration: stop vm earlier for cpr Steve Sistare
                   ` (10 subsequent siblings)
  13 siblings, 1 reply; 79+ messages in thread
From: Steve Sistare @ 2024-09-30 19:40 UTC (permalink / raw)
  To: qemu-devel
  Cc: Peter Xu, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster, Steve Sistare

Save the mode in CPR state, so the user does not need to explicitly specify
it for the target.  Modify migrate_mode() so it returns the incoming mode on
the target.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 include/migration/cpr.h |  7 +++++++
 migration/cpr.c         | 23 ++++++++++++++++++++++-
 migration/migration.c   |  1 +
 migration/options.c     |  9 +++++++--
 4 files changed, 37 insertions(+), 3 deletions(-)

diff --git a/include/migration/cpr.h b/include/migration/cpr.h
index e7b898b..ac7a63e 100644
--- a/include/migration/cpr.h
+++ b/include/migration/cpr.h
@@ -8,9 +8,16 @@
 #ifndef MIGRATION_CPR_H
 #define MIGRATION_CPR_H
 
+#include "qapi/qapi-types-migration.h"
+
+#define MIG_MODE_NONE           -1
+
 #define QEMU_CPR_FILE_MAGIC     0x51435052
 #define QEMU_CPR_FILE_VERSION   0x00000001
 
+MigMode cpr_get_incoming_mode(void);
+void cpr_set_incoming_mode(MigMode mode);
+
 typedef int (*cpr_walk_fd_cb)(int fd);
 void cpr_save_fd(const char *name, int id, int fd);
 void cpr_delete_fd(const char *name, int id);
diff --git a/migration/cpr.c b/migration/cpr.c
index e50fc75..7514c4e 100644
--- a/migration/cpr.c
+++ b/migration/cpr.c
@@ -21,10 +21,23 @@
 typedef QLIST_HEAD(CprFdList, CprFd) CprFdList;
 
 typedef struct CprState {
+    MigMode mode;
     CprFdList fds;
 } CprState;
 
-static CprState cpr_state;
+static CprState cpr_state = {
+    .mode = MIG_MODE_NONE,
+};
+
+MigMode cpr_get_incoming_mode(void)
+{
+    return cpr_state.mode;
+}
+
+void cpr_set_incoming_mode(MigMode mode)
+{
+    cpr_state.mode = mode;
+}
 
 /****************************************************************************/
 
@@ -124,11 +137,19 @@ void cpr_resave_fd(const char *name, int id, int fd)
 /*************************************************************************/
 #define CPR_STATE "CprState"
 
+static int cpr_state_presave(void *opaque)
+{
+    cpr_state.mode = migrate_mode();
+    return 0;
+}
+
 static const VMStateDescription vmstate_cpr_state = {
     .name = CPR_STATE,
     .version_id = 1,
     .minimum_version_id = 1,
+    .pre_save = cpr_state_presave,
     .fields = (VMStateField[]) {
+        VMSTATE_UINT32(mode, CprState),
         VMSTATE_QLIST_V(fds, CprState, 1, vmstate_cpr_fd, CprFd, next),
         VMSTATE_END_OF_LIST()
     }
diff --git a/migration/migration.c b/migration/migration.c
index 834b0a2..df00e5c 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -416,6 +416,7 @@ void migration_incoming_state_destroy(void)
         mis->postcopy_qemufile_dst = NULL;
     }
 
+    cpr_set_incoming_mode(MIG_MODE_NONE);
     yank_unregister_instance(MIGRATION_YANK_INSTANCE);
 }
 
diff --git a/migration/options.c b/migration/options.c
index 147cd2b..cc85a84 100644
--- a/migration/options.c
+++ b/migration/options.c
@@ -22,6 +22,7 @@
 #include "qapi/qmp/qnull.h"
 #include "sysemu/runstate.h"
 #include "migration/colo.h"
+#include "migration/cpr.h"
 #include "migration/misc.h"
 #include "migration.h"
 #include "migration-stats.h"
@@ -768,8 +769,12 @@ uint64_t migrate_max_postcopy_bandwidth(void)
 
 MigMode migrate_mode(void)
 {
-    MigrationState *s = migrate_get_current();
-    MigMode mode = s->parameters.mode;
+    MigMode mode = cpr_get_incoming_mode();
+
+    if (mode == MIG_MODE_NONE) {
+        MigrationState *s = migrate_get_current();
+        mode = s->parameters.mode;
+    }
 
     assert(mode >= 0 && mode < MIG_MODE__MAX);
     return mode;
-- 
1.8.3.1




* [PATCH V2 04/13] migration: stop vm earlier for cpr
  2024-09-30 19:40 [PATCH V2 00/13] Live update: cpr-transfer Steve Sistare
                   ` (2 preceding siblings ...)
  2024-09-30 19:40 ` [PATCH V2 03/13] migration: save cpr mode Steve Sistare
@ 2024-09-30 19:40 ` Steve Sistare
  2024-10-07 15:27   ` Peter Xu
  2024-09-30 19:40 ` [PATCH V2 05/13] physmem: preserve ram blocks " Steve Sistare
                   ` (9 subsequent siblings)
  13 siblings, 1 reply; 79+ messages in thread
From: Steve Sistare @ 2024-09-30 19:40 UTC (permalink / raw)
  To: qemu-devel
  Cc: Peter Xu, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster, Steve Sistare

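Stop the VM earlier for cpr, in qmp_migrate before cpr_state_save is
called rather than in migrate_fd_connect, to guarantee consistent device
state when CPR state is saved.  If qmp_migrate subsequently fails, resume
the VM.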
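
For illustration only, the ordering can be sketched outside QEMU as
follows.  FakeVm, fake_stop_vm, fake_state_save and fake_migrate are
hypothetical stand-ins, not QEMU functions; the sketch only shows the
pattern of stopping first, saving state second, and resuming on failure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the VM and migration state; not a QEMU type. */
typedef struct {
    bool running;              /* is the guest currently executing? */
    bool cpr_mode;             /* does the migration mode require an early stop? */
    int save_result;           /* value the fake state-save step returns */
    bool saved_while_stopped;  /* set if state was saved with the guest stopped */
} FakeVm;

static int fake_stop_vm(FakeVm *vm)
{
    vm->running = false;
    return 0;
}

static int fake_state_save(FakeVm *vm)
{
    /* Record whether the state was captured while the guest was quiescent. */
    vm->saved_while_stopped = !vm->running;
    return vm->save_result;
}

/* Returns 0 on success, -1 on failure.  On failure the guest runs again. */
static int fake_migrate(FakeVm *vm)
{
    bool stopped = false;

    assert(vm != NULL);

    if (vm->cpr_mode) {
        if (fake_stop_vm(vm) < 0) {
            return -1;
        }
        stopped = true;
    }

    if (fake_state_save(vm) < 0) {
        if (stopped) {
            vm->running = true;    /* undo the early stop */
        }
        return -1;
    }
    return 0;
}
```

Calling fake_migrate on a FakeVm with cpr_mode set and save_result of -1
leaves running set to true again, which mirrors the intent of the
"stopped" flag in the patch.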

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 migration/migration.c | 22 +++++++++++++---------
 1 file changed, 13 insertions(+), 9 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index df00e5c..868bf0e 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2082,6 +2082,7 @@ void qmp_migrate(const char *uri, bool has_channels,
     MigrationState *s = migrate_get_current();
     g_autoptr(MigrationChannel) channel = NULL;
     MigrationAddress *addr = NULL;
+    bool stopped = false;
 
     /*
      * Having preliminary checks for uri and channel
@@ -2125,6 +2126,15 @@ void qmp_migrate(const char *uri, bool has_channels,
         }
     }
 
+    if (migrate_mode_is_cpr(s)) {
+        int ret = migration_stop_vm(s, RUN_STATE_FINISH_MIGRATE);
+        if (ret < 0) {
+            error_setg(&local_err, "migration_stop_vm failed, error %d", -ret);
+            goto out;
+        }
+        stopped = true;
+    }
+
     if (cpr_state_save(&local_err)) {
         goto out;
     }
@@ -2160,6 +2170,9 @@ out:
         }
         migrate_fd_error(s, local_err);
         error_propagate(errp, local_err);
+        if (stopped) {
+            vm_resume(s->vm_old_state);
+        }
         return;
     }
 }
@@ -3743,7 +3756,6 @@ void migrate_fd_connect(MigrationState *s, Error *error_in)
     Error *local_err = NULL;
     uint64_t rate_limit;
     bool resume = (s->state == MIGRATION_STATUS_POSTCOPY_RECOVER_SETUP);
-    int ret;
 
     /*
      * If there's a previous error, free it and prepare for another one.
@@ -3815,14 +3827,6 @@ void migrate_fd_connect(MigrationState *s, Error *error_in)
         return;
     }
 
-    if (migrate_mode_is_cpr(s)) {
-        ret = migration_stop_vm(s, RUN_STATE_FINISH_MIGRATE);
-        if (ret < 0) {
-            error_setg(&local_err, "migration_stop_vm failed, error %d", -ret);
-            goto fail;
-        }
-    }
-
     if (migrate_background_snapshot()) {
         qemu_thread_create(&s->thread, "mig/snapshot",
                 bg_migration_thread, s, QEMU_THREAD_JOINABLE);
-- 
1.8.3.1




* [PATCH V2 05/13] physmem: preserve ram blocks for cpr
  2024-09-30 19:40 [PATCH V2 00/13] Live update: cpr-transfer Steve Sistare
                   ` (3 preceding siblings ...)
  2024-09-30 19:40 ` [PATCH V2 04/13] migration: stop vm earlier for cpr Steve Sistare
@ 2024-09-30 19:40 ` Steve Sistare
  2024-10-07 15:49   ` Peter Xu
  2024-09-30 19:40 ` [PATCH V2 06/13] hostmem-memfd: preserve " Steve Sistare
                   ` (8 subsequent siblings)
  13 siblings, 1 reply; 79+ messages in thread
From: Steve Sistare @ 2024-09-30 19:40 UTC (permalink / raw)
  To: qemu-devel
  Cc: Peter Xu, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster, Steve Sistare

Save the memfd for anonymous ramblocks in CPR state, along with a name
that uniquely identifies it.  The block's idstr is not yet set, so it
cannot be used for this purpose.  Find the saved memfd in new QEMU when
creating a block.  QEMU hard-codes the length of some internally-created
blocks, so to guard against that length changing, use lseek to get the
actual length of an incoming memfd.
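
As a rough sketch of the lseek technique, assuming nothing about QEMU
internals: fd_length and block_length below are hypothetical helpers, and
any seekable descriptor (a memfd, or a temporary file when experimenting)
can stand in for the preserved memfd.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Return the current size in bytes of the object behind fd, or -1 on
 * error.  Note that this leaves the file offset at the end of the file.
 */
static off_t fd_length(int fd)
{
    return lseek(fd, 0, SEEK_END);
}

/*
 * Hypothetical helper: choose the length of a block.  A negative
 * inherited_fd means no descriptor was preserved, so the caller's
 * default applies.  Otherwise the descriptor's actual size wins, even
 * if the default has changed between program versions.
 */
static off_t block_length(int inherited_fd, off_t default_length)
{
    assert(default_length >= 0);

    if (inherited_fd < 0) {
        return default_length;
    }
    return fd_length(inherited_fd);
}
```

For example, with a descriptor that was sized to 8192 bytes by the old
process, block_length(fd, 4096) yields 8192 rather than the new default.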

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 system/physmem.c | 25 ++++++++++++++++++++++++-
 1 file changed, 24 insertions(+), 1 deletion(-)

diff --git a/system/physmem.c b/system/physmem.c
index 174f7e0..ddbeec9 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -72,6 +72,7 @@
 
 #include "qapi/qapi-types-migration.h"
 #include "migration/options.h"
+#include "migration/cpr.h"
 #include "migration/vmstate.h"
 
 #include "qemu/range.h"
@@ -1663,6 +1664,19 @@ void qemu_ram_unset_idstr(RAMBlock *block)
     }
 }
 
+static char *cpr_name(RAMBlock *block)
+{
+    MemoryRegion *mr = block->mr;
+    const char *mr_name = memory_region_name(mr);
+    g_autofree char *id = mr->dev ? qdev_get_dev_path(mr->dev) : NULL;
+
+    if (id) {
+        return g_strdup_printf("%s/%s", id, mr_name);
+    } else {
+        return g_strdup(mr_name);
+    }
+}
+
 size_t qemu_ram_pagesize(RAMBlock *rb)
 {
     return rb->page_size;
@@ -1858,14 +1872,18 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
                                         TYPE_MEMORY_BACKEND)) {
             size_t max_length = new_block->max_length;
             MemoryRegion *mr = new_block->mr;
-            const char *name = memory_region_name(mr);
+            g_autofree char *name = cpr_name(new_block);
 
             new_block->mr->align = QEMU_VMALLOC_ALIGN;
             new_block->flags |= RAM_SHARED;
+            new_block->fd = cpr_find_fd(name, 0);
 
             if (new_block->fd == -1) {
                 new_block->fd = qemu_memfd_create(name, max_length + mr->align,
                                                   0, 0, 0, errp);
+                cpr_save_fd(name, 0, new_block->fd);
+            } else {
+                new_block->max_length = lseek(new_block->fd, 0, SEEK_END);
             }
 
             if (new_block->fd >= 0) {
@@ -1875,6 +1893,7 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
                                                  false, 0, errp);
             }
             if (!new_block->host) {
+                cpr_delete_fd(name, 0);
                 qemu_mutex_unlock_ramlist();
                 return;
             }
@@ -2182,6 +2201,8 @@ static void reclaim_ramblock(RAMBlock *block)
 
 void qemu_ram_free(RAMBlock *block)
 {
+    g_autofree char *name = NULL;
+
     if (!block) {
         return;
     }
@@ -2192,6 +2213,8 @@ void qemu_ram_free(RAMBlock *block)
     }
 
     qemu_mutex_lock_ramlist();
+    name = cpr_name(block);
+    cpr_delete_fd(name, 0);
     QLIST_REMOVE_RCU(block, next);
     ram_list.mru_block = NULL;
     /* Write list before version */
-- 
1.8.3.1




* [PATCH V2 06/13] hostmem-memfd: preserve for cpr
  2024-09-30 19:40 [PATCH V2 00/13] Live update: cpr-transfer Steve Sistare
                   ` (4 preceding siblings ...)
  2024-09-30 19:40 ` [PATCH V2 05/13] physmem: preserve ram blocks " Steve Sistare
@ 2024-09-30 19:40 ` Steve Sistare
  2024-10-07 15:52   ` Peter Xu
  2024-09-30 19:40 ` [PATCH V2 07/13] migration: SCM_RIGHTS for QEMUFile Steve Sistare
                   ` (7 subsequent siblings)
  13 siblings, 1 reply; 79+ messages in thread
From: Steve Sistare @ 2024-09-30 19:40 UTC (permalink / raw)
  To: qemu-devel
  Cc: Peter Xu, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster, Steve Sistare

Preserve memory-backend-memfd memory objects during cpr-transfer.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 backends/hostmem-memfd.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/backends/hostmem-memfd.c b/backends/hostmem-memfd.c
index 6a3c89a..2740222 100644
--- a/backends/hostmem-memfd.c
+++ b/backends/hostmem-memfd.c
@@ -17,6 +17,7 @@
 #include "qemu/module.h"
 #include "qapi/error.h"
 #include "qom/object.h"
+#include "migration/cpr.h"
 
 #define TYPE_MEMORY_BACKEND_MEMFD "memory-backend-memfd"
 
@@ -35,15 +36,19 @@ static bool
 memfd_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
 {
     HostMemoryBackendMemfd *m = MEMORY_BACKEND_MEMFD(backend);
-    g_autofree char *name = NULL;
+    g_autofree char *name = host_memory_backend_get_name(backend);
+    int fd = cpr_find_fd(name, 0);
     uint32_t ram_flags;
-    int fd;
 
     if (!backend->size) {
         error_setg(errp, "can't create backend with size 0");
         return false;
     }
 
+    if (fd >= 0) {
+        goto have_fd;
+    }
+
     fd = qemu_memfd_create(TYPE_MEMORY_BACKEND_MEMFD, backend->size,
                            m->hugetlb, m->hugetlbsize, m->seal ?
                            F_SEAL_GROW | F_SEAL_SHRINK | F_SEAL_SEAL : 0,
@@ -51,9 +56,10 @@ memfd_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
     if (fd == -1) {
         return false;
     }
+    cpr_save_fd(name, 0, fd);
 
+have_fd:
     backend->aligned = true;
-    name = host_memory_backend_get_name(backend);
     ram_flags = backend->share ? RAM_SHARED : 0;
     ram_flags |= backend->reserve ? 0 : RAM_NORESERVE;
     ram_flags |= backend->guest_memfd ? RAM_GUEST_MEMFD : 0;
-- 
1.8.3.1




* [PATCH V2 07/13] migration: SCM_RIGHTS for QEMUFile
  2024-09-30 19:40 [PATCH V2 00/13] Live update: cpr-transfer Steve Sistare
                   ` (5 preceding siblings ...)
  2024-09-30 19:40 ` [PATCH V2 06/13] hostmem-memfd: preserve " Steve Sistare
@ 2024-09-30 19:40 ` Steve Sistare
  2024-10-07 16:06   ` Peter Xu
  2024-09-30 19:40 ` [PATCH V2 08/13] migration: VMSTATE_FD Steve Sistare
                   ` (6 subsequent siblings)
  13 siblings, 1 reply; 79+ messages in thread
From: Steve Sistare @ 2024-09-30 19:40 UTC (permalink / raw)
  To: qemu-devel
  Cc: Peter Xu, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster, Steve Sistare

Define functions to put/get file descriptors to/from a QEMUFile, for qio
channels that support SCM_RIGHTS.  Maintain ordering such that
  put(A), put(fd), put(B)
followed by
  get(A), get(fd), get(B)
always succeeds.  Other get orderings may succeed but are not guaranteed.
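
The underlying mechanism can be sketched with plain POSIX sockets,
independent of QEMUFile and the qio channel layer.  send_fd and recv_fd
are hypothetical names used only for this illustration; each descriptor
travels as SCM_RIGHTS ancillary data attached to a single dummy byte, so
it keeps its position relative to the ordinary data sent before and after
it.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send fd over a unix socket, attached to one dummy byte.  0 or -1. */
static int send_fd(int sock, int fd)
{
    char dummy = ' ';
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;           /* forces correct alignment */
    } u;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    assert(fd >= 0);
    memset(&u, 0, sizeof(u));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive the dummy byte and its descriptor.  Returns the fd or -1. */
static int recv_fd(int sock)
{
    char dummy;
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } u;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd = -1;

    memset(&u, 0, sizeof(u));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    if (recvmsg(sock, &msg, 0) != 1) {
        return -1;
    }
    cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg && cmsg->cmsg_level == SOL_SOCKET &&
        cmsg->cmsg_type == SCM_RIGHTS &&
        cmsg->cmsg_len == CMSG_LEN(sizeof(int))) {
        memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    }
    return fd;
}
```

On a socketpair(AF_UNIX, SOCK_STREAM, ...) pair, writing one byte A,
calling send_fd, then writing one byte B, and on the other end reading A,
calling recv_fd, then reading B, corresponds to the put/get ordering
described above.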

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 migration/qemu-file.c  | 83 +++++++++++++++++++++++++++++++++++++++++++++++---
 migration/qemu-file.h  |  2 ++
 migration/trace-events |  2 ++
 3 files changed, 83 insertions(+), 4 deletions(-)

diff --git a/migration/qemu-file.c b/migration/qemu-file.c
index b6d2f58..7f951ab 100644
--- a/migration/qemu-file.c
+++ b/migration/qemu-file.c
@@ -37,6 +37,11 @@
 #define IO_BUF_SIZE 32768
 #define MAX_IOV_SIZE MIN_CONST(IOV_MAX, 64)
 
+typedef struct FdEntry {
+    QTAILQ_ENTRY(FdEntry) entry;
+    int fd;
+} FdEntry;
+
 struct QEMUFile {
     QIOChannel *ioc;
     bool is_writable;
@@ -51,6 +56,9 @@ struct QEMUFile {
 
     int last_error;
     Error *last_error_obj;
+
+    bool fd_pass;
+    QTAILQ_HEAD(, FdEntry) fds;
 };
 
 /*
@@ -109,6 +117,8 @@ static QEMUFile *qemu_file_new_impl(QIOChannel *ioc, bool is_writable)
     object_ref(ioc);
     f->ioc = ioc;
     f->is_writable = is_writable;
+    f->fd_pass = qio_channel_has_feature(ioc, QIO_CHANNEL_FEATURE_FD_PASS);
+    QTAILQ_INIT(&f->fds);
 
     return f;
 }
@@ -310,6 +320,10 @@ static ssize_t coroutine_mixed_fn qemu_fill_buffer(QEMUFile *f)
     int len;
     int pending;
     Error *local_error = NULL;
+    g_autofree int *fds = NULL;
+    size_t nfd = 0;
+    int **pfds = f->fd_pass ? &fds : NULL;
+    size_t *pnfd = f->fd_pass ? &nfd : NULL;
 
     assert(!qemu_file_is_writable(f));
 
@@ -325,10 +339,9 @@ static ssize_t coroutine_mixed_fn qemu_fill_buffer(QEMUFile *f)
     }
 
     do {
-        len = qio_channel_read(f->ioc,
-                               (char *)f->buf + pending,
-                               IO_BUF_SIZE - pending,
-                               &local_error);
+        struct iovec iov = { f->buf + pending, IO_BUF_SIZE - pending };
+        len = qio_channel_readv_full(f->ioc, &iov, 1, pfds, pnfd, 0,
+                                     &local_error);
         if (len == QIO_CHANNEL_ERR_BLOCK) {
             if (qemu_in_coroutine()) {
                 qio_channel_yield(f->ioc, G_IO_IN);
@@ -348,9 +361,65 @@ static ssize_t coroutine_mixed_fn qemu_fill_buffer(QEMUFile *f)
         qemu_file_set_error_obj(f, len, local_error);
     }
 
+    for (int i = 0; i < nfd; i++) {
+        FdEntry *fde = g_new0(FdEntry, 1);
+        fde->fd = fds[i];
+        QTAILQ_INSERT_TAIL(&f->fds, fde, entry);
+    }
+
     return len;
 }
 
+int qemu_file_put_fd(QEMUFile *f, int fd)
+{
+    int ret = 0;
+    QIOChannel *ioc = qemu_file_get_ioc(f);
+    Error *err = NULL;
+    struct iovec iov = { (void *)" ", 1 };
+
+    /*
+     * Send a dummy byte so qemu_fill_buffer on the receiving side does not
+     * fail with a len=0 error.  Flush first to maintain ordering wrt other
+     * data.
+     */
+
+    qemu_fflush(f);
+    if (qio_channel_writev_full(ioc, &iov, 1, &fd, 1, 0, &err) < 1) {
+        error_report_err(error_copy(err));
+        qemu_file_set_error_obj(f, -EIO, err);
+        ret = -1;
+    }
+    trace_qemu_file_put_fd(f->ioc->name, fd, ret);
+    return ret;
+}
+
+int qemu_file_get_fd(QEMUFile *f)
+{
+    int fd = -1;
+    FdEntry *fde;
+
+    if (!f->fd_pass) {
+        Error *err = NULL;
+        error_setg(&err, "%s does not support fd passing", f->ioc->name);
+        error_report_err(error_copy(err));
+        qemu_file_set_error_obj(f, -EIO, err);
+        goto out;
+    }
+
+    /* Force the dummy byte and its fd passenger to appear. */
+    qemu_peek_byte(f, 0);
+
+    fde = QTAILQ_FIRST(&f->fds);
+    if (fde) {
+        qemu_get_byte(f);       /* Drop the dummy byte */
+        fd = fde->fd;
+        QTAILQ_REMOVE(&f->fds, fde, entry);
+    }
+out:
+    trace_qemu_file_get_fd(f->ioc->name, fd);
+    return fd;
+}
+
 /** Closes the file
  *
  * Returns negative error value if any error happened on previous operations or
@@ -361,11 +430,17 @@ static ssize_t coroutine_mixed_fn qemu_fill_buffer(QEMUFile *f)
  */
 int qemu_fclose(QEMUFile *f)
 {
+    FdEntry *fde, *next;
     int ret = qemu_fflush(f);
     int ret2 = qio_channel_close(f->ioc, NULL);
     if (ret >= 0) {
         ret = ret2;
     }
+    QTAILQ_FOREACH_SAFE(fde, &f->fds, entry, next) {
+        warn_report("qemu_fclose: received fd %d was never claimed", fde->fd);
+        close(fde->fd);
+        g_free(fde);
+    }
     g_clear_pointer(&f->ioc, object_unref);
     error_free(f->last_error_obj);
     g_free(f);
diff --git a/migration/qemu-file.h b/migration/qemu-file.h
index 11c2120..3e47a20 100644
--- a/migration/qemu-file.h
+++ b/migration/qemu-file.h
@@ -79,5 +79,7 @@ size_t qemu_get_buffer_at(QEMUFile *f, const uint8_t *buf, size_t buflen,
                           off_t pos);
 
 QIOChannel *qemu_file_get_ioc(QEMUFile *file);
+int qemu_file_put_fd(QEMUFile *f, int fd);
+int qemu_file_get_fd(QEMUFile *f);
 
 #endif
diff --git a/migration/trace-events b/migration/trace-events
index 5356fb5..345506b 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -88,6 +88,8 @@ put_qlist_end(const char *field_name, const char *vmsd_name) "%s(%s)"
 
 # qemu-file.c
 qemu_file_fclose(void) ""
+qemu_file_put_fd(const char *name, int fd, int ret) "ioc %s, fd %d -> status %d"
+qemu_file_get_fd(const char *name, int fd) "ioc %s -> fd %d"
 
 # ram.c
 get_queued_page(const char *block_name, uint64_t tmp_offset, unsigned long page_abs) "%s/0x%" PRIx64 " page_abs=0x%lx"
-- 
1.8.3.1




* [PATCH V2 08/13] migration: VMSTATE_FD
  2024-09-30 19:40 [PATCH V2 00/13] Live update: cpr-transfer Steve Sistare
                   ` (6 preceding siblings ...)
  2024-09-30 19:40 ` [PATCH V2 07/13] migration: SCM_RIGHTS for QEMUFile Steve Sistare
@ 2024-09-30 19:40 ` Steve Sistare
  2024-10-07 16:36   ` Peter Xu
  2024-09-30 19:40 ` [PATCH V2 09/13] migration: cpr-transfer save and load Steve Sistare
                   ` (5 subsequent siblings)
  13 siblings, 1 reply; 79+ messages in thread
From: Steve Sistare @ 2024-09-30 19:40 UTC (permalink / raw)
  To: qemu-devel
  Cc: Peter Xu, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster, Steve Sistare

Define VMSTATE_FD for declaring a file descriptor field in a
VMStateDescription.
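
As the diff below shows, the field is written to the stream as a signed
32-bit big-endian value, and a descriptor is passed only when that value
is non-negative.  A minimal standalone sketch of that encoding rule
follows; put_sbe32, get_sbe32 and fd_follows are hypothetical helpers
written for this illustration and are not the QEMUFile API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Encode v as 4 bytes, most significant byte first. */
static void put_sbe32(uint8_t out[4], int32_t v)
{
    uint32_t u = (uint32_t)v;

    assert(out != NULL);
    out[0] = (uint8_t)(u >> 24);
    out[1] = (uint8_t)(u >> 16);
    out[2] = (uint8_t)(u >> 8);
    out[3] = (uint8_t)u;
}

/* Decode 4 big-endian bytes back into a signed 32-bit value. */
static int32_t get_sbe32(const uint8_t in[4])
{
    uint32_t u = ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16) |
                 ((uint32_t)in[2] << 8) | (uint32_t)in[3];

    return (int32_t)u;
}

/* A descriptor accompanies the value only if the value is a valid fd. */
static bool fd_follows(int32_t v)
{
    return v >= 0;
}
```

A field holding -1 is therefore sent as four 0xff bytes with no
descriptor attached, and the receiver leaves the field at -1.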

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 include/migration/vmstate.h |  9 +++++++++
 migration/vmstate-types.c   | 32 ++++++++++++++++++++++++++++++++
 2 files changed, 41 insertions(+)

diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
index f313f2f..a1dfab4 100644
--- a/include/migration/vmstate.h
+++ b/include/migration/vmstate.h
@@ -230,6 +230,7 @@ extern const VMStateInfo vmstate_info_uint8;
 extern const VMStateInfo vmstate_info_uint16;
 extern const VMStateInfo vmstate_info_uint32;
 extern const VMStateInfo vmstate_info_uint64;
+extern const VMStateInfo vmstate_info_fd;
 
 /** Put this in the stream when migrating a null pointer.*/
 #define VMS_NULLPTR_MARKER (0x30U) /* '0' */
@@ -902,6 +903,9 @@ extern const VMStateInfo vmstate_info_qlist;
 #define VMSTATE_UINT64_V(_f, _s, _v)                                  \
     VMSTATE_SINGLE(_f, _s, _v, vmstate_info_uint64, uint64_t)
 
+#define VMSTATE_FD_V(_f, _s, _v)                                  \
+    VMSTATE_SINGLE(_f, _s, _v, vmstate_info_fd, int32_t)
+
 #ifdef CONFIG_LINUX
 
 #define VMSTATE_U8_V(_f, _s, _v)                                   \
@@ -936,6 +940,9 @@ extern const VMStateInfo vmstate_info_qlist;
 #define VMSTATE_UINT64(_f, _s)                                        \
     VMSTATE_UINT64_V(_f, _s, 0)
 
+#define VMSTATE_FD(_f, _s)                                            \
+    VMSTATE_FD_V(_f, _s, 0)
+
 #ifdef CONFIG_LINUX
 
 #define VMSTATE_U8(_f, _s)                                         \
@@ -1009,6 +1016,8 @@ extern const VMStateInfo vmstate_info_qlist;
 #define VMSTATE_UINT64_TEST(_f, _s, _t)                                  \
     VMSTATE_SINGLE_TEST(_f, _s, _t, 0, vmstate_info_uint64, uint64_t)
 
+#define VMSTATE_FD_TEST(_f, _s, _t)                                            \
+    VMSTATE_SINGLE_TEST(_f, _s, _t, 0, vmstate_info_fd, int32_t)
 
 #define VMSTATE_TIMER_PTR_TEST(_f, _s, _test)                             \
     VMSTATE_POINTER_TEST(_f, _s, _test, vmstate_info_timer, QEMUTimer *)
diff --git a/migration/vmstate-types.c b/migration/vmstate-types.c
index e83bfcc..6e45a4a 100644
--- a/migration/vmstate-types.c
+++ b/migration/vmstate-types.c
@@ -314,6 +314,38 @@ const VMStateInfo vmstate_info_uint64 = {
     .put  = put_uint64,
 };
 
+/* File descriptor communicated via SCM_RIGHTS */
+
+static int get_fd(QEMUFile *f, void *pv, size_t size,
+                  const VMStateField *field)
+{
+    int32_t *v = pv;
+    qemu_get_sbe32s(f, v);
+    if (*v < 0) {
+        return 0;
+    }
+    *v = qemu_file_get_fd(f);
+    return 0;
+}
+
+static int put_fd(QEMUFile *f, void *pv, size_t size,
+                  const VMStateField *field, JSONWriter *vmdesc)
+{
+    int32_t *v = pv;
+
+    qemu_put_sbe32s(f, v);
+    if (*v < 0) {
+        return 0;
+    }
+    return qemu_file_put_fd(f, *v);
+}
+
+const VMStateInfo vmstate_info_fd = {
+    .name = "fd",
+    .get  = get_fd,
+    .put  = put_fd,
+};
+
 static int get_nullptr(QEMUFile *f, void *pv, size_t size,
                        const VMStateField *field)
 
-- 
1.8.3.1




* [PATCH V2 09/13] migration: cpr-transfer save and load
  2024-09-30 19:40 [PATCH V2 00/13] Live update: cpr-transfer Steve Sistare
                   ` (7 preceding siblings ...)
  2024-09-30 19:40 ` [PATCH V2 08/13] migration: VMSTATE_FD Steve Sistare
@ 2024-09-30 19:40 ` Steve Sistare
  2024-10-07 16:47   ` Peter Xu
  2024-09-30 19:40 ` [PATCH V2 10/13] migration: cpr-uri parameter Steve Sistare
                   ` (4 subsequent siblings)
  13 siblings, 1 reply; 79+ messages in thread
From: Steve Sistare @ 2024-09-30 19:40 UTC (permalink / raw)
  To: qemu-devel
  Cc: Peter Xu, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster, Steve Sistare

Add functions to create a QEMUFile based on a unix URI, for saving or
loading, for use by cpr-transfer mode to preserve CPR state.
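
Both functions accept only unix socket URIs.  The patch relies on
migrate_uri_parse for this; the restriction itself can be sketched with a
simplified prefix check.  unix_uri_path is a hypothetical helper written
for this illustration and does not reproduce QEMU's URI parser.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Return the socket path of a "unix:PATH" URI, or NULL if the URI uses
 * another scheme or has an empty path.  The result points into uri.
 */
static const char *unix_uri_path(const char *uri)
{
    static const char prefix[] = "unix:";
    size_t n = sizeof(prefix) - 1;

    assert(uri != NULL);
    if (strncmp(uri, prefix, n) != 0 || uri[n] == '\0') {
        return NULL;
    }
    return uri + n;
}
```

With this check, "unix:/tmp/cpr.sock" yields "/tmp/cpr.sock", while a
URI such as "tcp:localhost:4444" is rejected.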

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 include/migration/cpr.h  |  3 ++
 migration/cpr-transfer.c | 81 ++++++++++++++++++++++++++++++++++++++++++++++++
 migration/meson.build    |  1 +
 3 files changed, 85 insertions(+)
 create mode 100644 migration/cpr-transfer.c

diff --git a/include/migration/cpr.h b/include/migration/cpr.h
index ac7a63e..51c19ed 100644
--- a/include/migration/cpr.h
+++ b/include/migration/cpr.h
@@ -30,4 +30,7 @@ int cpr_state_load(Error **errp);
 void cpr_state_close(void);
 struct QIOChannel *cpr_state_ioc(void);
 
+QEMUFile *cpr_transfer_output(const char *uri, Error **errp);
+QEMUFile *cpr_transfer_input(const char *uri, Error **errp);
+
 #endif
diff --git a/migration/cpr-transfer.c b/migration/cpr-transfer.c
new file mode 100644
index 0000000..fb9ecd8
--- /dev/null
+++ b/migration/cpr-transfer.c
@@ -0,0 +1,81 @@
+/*
+ * Copyright (c) 2022, 2024 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "qapi/error.h"
+#include "io/channel-file.h"
+#include "io/channel-socket.h"
+#include "io/net-listener.h"
+#include "migration/cpr.h"
+#include "migration/migration.h"
+#include "migration/savevm.h"
+#include "migration/qemu-file.h"
+#include "migration/vmstate.h"
+
+QEMUFile *cpr_transfer_output(const char *uri, Error **errp)
+{
+    g_autoptr(MigrationChannel) channel = NULL;
+    QIOChannel *ioc;
+
+    if (!migrate_uri_parse(uri, &channel, errp)) {
+        return NULL;
+    }
+
+    if (channel->addr->transport == MIGRATION_ADDRESS_TYPE_SOCKET &&
+        channel->addr->u.socket.type == SOCKET_ADDRESS_TYPE_UNIX) {
+
+        QIOChannelSocket *sioc = qio_channel_socket_new();
+        SocketAddress *saddr = &channel->addr->u.socket;
+
+        if (qio_channel_socket_connect_sync(sioc, saddr, errp)) {
+            object_unref(OBJECT(sioc));
+            return NULL;
+        }
+        ioc = QIO_CHANNEL(sioc);
+
+    } else {
+        error_setg(errp, "bad cpr-uri %s; must be unix:", uri);
+        return NULL;
+    }
+
+    qio_channel_set_name(ioc, "cpr-out");
+    return qemu_file_new_output(ioc);
+}
+
+QEMUFile *cpr_transfer_input(const char *uri, Error **errp)
+{
+    g_autoptr(MigrationChannel) channel = NULL;
+    QIOChannel *ioc;
+
+    if (!migrate_uri_parse(uri, &channel, errp)) {
+        return NULL;
+    }
+
+    if (channel->addr->transport == MIGRATION_ADDRESS_TYPE_SOCKET &&
+        channel->addr->u.socket.type == SOCKET_ADDRESS_TYPE_UNIX) {
+
+        QIOChannelSocket *sioc;
+        SocketAddress *saddr = &channel->addr->u.socket;
+        QIONetListener *listener = qio_net_listener_new();
+
+        qio_net_listener_set_name(listener, "cpr-socket-listener");
+        if (qio_net_listener_open_sync(listener, saddr, 1, errp) < 0) {
+            object_unref(OBJECT(listener));
+            return NULL;
+        }
+
+        sioc = qio_net_listener_wait_client(listener);
+        ioc = QIO_CHANNEL(sioc);
+
+    } else {
+        error_setg(errp, "bad cpr-uri %s; must be unix:", uri);
+        return NULL;
+    }
+
+    qio_channel_set_name(ioc, "cpr-in");
+    return qemu_file_new_input(ioc);
+}
diff --git a/migration/meson.build b/migration/meson.build
index e5f4211..684ba98 100644
--- a/migration/meson.build
+++ b/migration/meson.build
@@ -14,6 +14,7 @@ system_ss.add(files(
   'channel.c',
   'channel-block.c',
   'cpr.c',
+  'cpr-transfer.c',
   'dirtyrate.c',
   'exec.c',
   'fd.c',
-- 
1.8.3.1




* [PATCH V2 10/13] migration: cpr-uri parameter
  2024-09-30 19:40 [PATCH V2 00/13] Live update: cpr-transfer Steve Sistare
                   ` (8 preceding siblings ...)
  2024-09-30 19:40 ` [PATCH V2 09/13] migration: cpr-transfer save and load Steve Sistare
@ 2024-09-30 19:40 ` Steve Sistare
  2024-10-07 16:49   ` Peter Xu
  2024-09-30 19:40 ` [PATCH V2 11/13] migration: cpr-uri option Steve Sistare
                   ` (3 subsequent siblings)
  13 siblings, 1 reply; 79+ messages in thread
From: Steve Sistare @ 2024-09-30 19:40 UTC (permalink / raw)
  To: qemu-devel
  Cc: Peter Xu, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster, Steve Sistare

Define the cpr-uri migration parameter to specify the URI to which
CPR vmstate is saved for cpr-transfer mode.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 migration/migration-hmp-cmds.c | 10 ++++++++++
 migration/options.c            | 28 ++++++++++++++++++++++++++++
 migration/options.h            |  1 +
 qapi/migration.json            | 18 +++++++++++++++---
 4 files changed, 54 insertions(+), 3 deletions(-)

diff --git a/migration/migration-hmp-cmds.c b/migration/migration-hmp-cmds.c
index 20d1a6e..79d8c66 100644
--- a/migration/migration-hmp-cmds.c
+++ b/migration/migration-hmp-cmds.c
@@ -358,6 +358,11 @@ void hmp_info_migrate_parameters(Monitor *mon, const QDict *qdict)
                                MIGRATION_PARAMETER_DIRECT_IO),
                            params->direct_io ? "on" : "off");
         }
+
+        assert(params->cpr_uri);
+        monitor_printf(mon, "%s: '%s'\n",
+            MigrationParameter_str(MIGRATION_PARAMETER_CPR_URI),
+            params->cpr_uri);
     }
 
     qapi_free_MigrationParameters(params);
@@ -639,6 +644,11 @@ void hmp_migrate_set_parameter(Monitor *mon, const QDict *qdict)
         p->has_direct_io = true;
         visit_type_bool(v, param, &p->direct_io, &err);
         break;
+    case MIGRATION_PARAMETER_CPR_URI:
+        p->cpr_uri = g_new0(StrOrNull, 1);
+        p->cpr_uri->type = QTYPE_QSTRING;
+        visit_type_str(v, param, &p->cpr_uri->u.s, &err);
+        break;
     default:
         g_assert_not_reached();
     }
diff --git a/migration/options.c b/migration/options.c
index cc85a84..6e7fea7 100644
--- a/migration/options.c
+++ b/migration/options.c
@@ -173,6 +173,8 @@ Property migration_properties[] = {
     DEFINE_PROP_ZERO_PAGE_DETECTION("zero-page-detection", MigrationState,
                        parameters.zero_page_detection,
                        ZERO_PAGE_DETECTION_MULTIFD),
+    DEFINE_PROP_STRING("cpr-uri", MigrationState,
+                       parameters.cpr_uri),
 
     /* Migration capabilities */
     DEFINE_PROP_MIG_CAP("x-xbzrle", MIGRATION_CAPABILITY_XBZRLE),
@@ -865,6 +867,13 @@ ZeroPageDetection migrate_zero_page_detection(void)
     return s->parameters.zero_page_detection;
 }
 
+const char *migrate_cpr_uri(void)
+{
+    MigrationState *s = migrate_get_current();
+
+    return s->parameters.cpr_uri;
+}
+
 /* parameters helpers */
 
 AnnounceParameters *migrate_announce_params(void)
@@ -950,6 +959,7 @@ MigrationParameters *qmp_query_migrate_parameters(Error **errp)
     params->zero_page_detection = s->parameters.zero_page_detection;
     params->has_direct_io = true;
     params->direct_io = s->parameters.direct_io;
+    params->cpr_uri = g_strdup(s->parameters.cpr_uri);
 
     return params;
 }
@@ -984,6 +994,7 @@ void migrate_params_init(MigrationParameters *params)
     params->has_mode = true;
     params->has_zero_page_detection = true;
     params->has_direct_io = true;
+    params->cpr_uri = g_strdup("");
 }
 
 /*
@@ -1283,6 +1294,11 @@ static void migrate_params_test_apply(MigrateSetParameters *params,
     if (params->has_direct_io) {
         dest->direct_io = params->direct_io;
     }
+
+    if (params->cpr_uri) {
+        assert(params->cpr_uri->type == QTYPE_QSTRING);
+        dest->cpr_uri = params->cpr_uri->u.s;
+    }
 }
 
 static void migrate_params_apply(MigrateSetParameters *params, Error **errp)
@@ -1415,6 +1431,12 @@ static void migrate_params_apply(MigrateSetParameters *params, Error **errp)
     if (params->has_direct_io) {
         s->parameters.direct_io = params->direct_io;
     }
+
+    if (params->cpr_uri) {
+        g_free(s->parameters.cpr_uri);
+        assert(params->cpr_uri->type == QTYPE_QSTRING);
+        s->parameters.cpr_uri = g_strdup(params->cpr_uri->u.s);
+    }
 }
 
 void qmp_migrate_set_parameters(MigrateSetParameters *params, Error **errp)
@@ -1441,6 +1463,12 @@ void qmp_migrate_set_parameters(MigrateSetParameters *params, Error **errp)
         params->tls_authz->u.s = strdup("");
     }
 
+    if (params->cpr_uri && params->cpr_uri->type == QTYPE_QNULL) {
+        qobject_unref(params->cpr_uri->u.n);
+        params->cpr_uri->type = QTYPE_QSTRING;
+        params->cpr_uri->u.s = strdup("");
+    }
+
     migrate_params_test_apply(params, &tmp);
 
     if (!migrate_params_check(&tmp, errp)) {
diff --git a/migration/options.h b/migration/options.h
index a0bd6ed..efccb0e 100644
--- a/migration/options.h
+++ b/migration/options.h
@@ -86,6 +86,7 @@ const char *migrate_tls_creds(void);
 const char *migrate_tls_hostname(void);
 uint64_t migrate_xbzrle_cache_size(void);
 ZeroPageDetection migrate_zero_page_detection(void);
+const char *migrate_cpr_uri(void);
 
 /* parameters helpers */
 
diff --git a/qapi/migration.json b/qapi/migration.json
index b66cccf..c0d8bcc 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -841,6 +841,9 @@
 #     only has effect if the @mapped-ram capability is enabled.
 #     (Since 9.1)
 #
+# @cpr-uri: URI for an additional migration channel needed by
+#     @cpr-transfer mode. (Since 9.2)
+#
 # Features:
 #
 # @unstable: Members @x-checkpoint-delay and
@@ -867,7 +870,8 @@
            'vcpu-dirty-limit',
            'mode',
            'zero-page-detection',
-           'direct-io'] }
+           'direct-io',
+           'cpr-uri'] }
 
 ##
 # @MigrateSetParameters:
@@ -1022,6 +1026,9 @@
 #     only has effect if the @mapped-ram capability is enabled.
 #     (Since 9.1)
 #
+# @cpr-uri: URI for an additional migration channel needed by
+#     @cpr-transfer mode. (Since 9.2)
+#
 # Features:
 #
 # @unstable: Members @x-checkpoint-delay and
@@ -1063,7 +1070,8 @@
             '*vcpu-dirty-limit': 'uint64',
             '*mode': 'MigMode',
             '*zero-page-detection': 'ZeroPageDetection',
-            '*direct-io': 'bool' } }
+            '*direct-io': 'bool',
+            '*cpr-uri': 'StrOrNull' } }
 
 ##
 # @migrate-set-parameters:
@@ -1232,6 +1240,9 @@
 #     only has effect if the @mapped-ram capability is enabled.
 #     (Since 9.1)
 #
+# @cpr-uri: URI for an additional migration channel needed by
+#     @cpr-transfer mode. (Since 9.2)
+#
 # Features:
 #
 # @unstable: Members @x-checkpoint-delay and
@@ -1270,7 +1281,8 @@
             '*vcpu-dirty-limit': 'uint64',
             '*mode': 'MigMode',
             '*zero-page-detection': 'ZeroPageDetection',
-            '*direct-io': 'bool' } }
+            '*direct-io': 'bool',
+            '*cpr-uri': 'str' } }
 
 ##
 # @query-migrate-parameters:
-- 
1.8.3.1




* [PATCH V2 11/13] migration: cpr-uri option
  2024-09-30 19:40 [PATCH V2 00/13] Live update: cpr-transfer Steve Sistare
                   ` (9 preceding siblings ...)
  2024-09-30 19:40 ` [PATCH V2 10/13] migration: cpr-uri parameter Steve Sistare
@ 2024-09-30 19:40 ` Steve Sistare
  2024-10-07 16:50   ` Peter Xu
  2024-09-30 19:40 ` [PATCH V2 12/13] migration: split qmp_migrate Steve Sistare
                   ` (2 subsequent siblings)
  13 siblings, 1 reply; 79+ messages in thread
From: Steve Sistare @ 2024-09-30 19:40 UTC (permalink / raw)
  To: qemu-devel
  Cc: Peter Xu, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster, Steve Sistare

Define the cpr-uri QEMU command-line option to specify the URI from
which CPR vmstate is loaded for cpr-transfer mode.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 include/migration/cpr.h | 1 +
 migration/cpr.c         | 7 +++++++
 qemu-options.hx         | 8 ++++++++
 system/vl.c             | 3 +++
 4 files changed, 19 insertions(+)

diff --git a/include/migration/cpr.h b/include/migration/cpr.h
index 51c19ed..e886c98 100644
--- a/include/migration/cpr.h
+++ b/include/migration/cpr.h
@@ -25,6 +25,7 @@ int cpr_find_fd(const char *name, int id);
 int cpr_walk_fd(cpr_walk_fd_cb cb);
 void cpr_resave_fd(const char *name, int id, int fd);
 
+void cpr_set_cpr_uri(const char *uri);
 int cpr_state_save(Error **errp);
 int cpr_state_load(Error **errp);
 void cpr_state_close(void);
diff --git a/migration/cpr.c b/migration/cpr.c
index 7514c4e..86f66c1 100644
--- a/migration/cpr.c
+++ b/migration/cpr.c
@@ -163,6 +163,13 @@ QIOChannel *cpr_state_ioc(void)
     return qemu_file_get_ioc(cpr_state_file);
 }
 
+static char *cpr_uri;
+
+void cpr_set_cpr_uri(const char *uri)
+{
+    cpr_uri = g_strdup(uri);
+}
+
 int cpr_state_save(Error **errp)
 {
     int ret;
diff --git a/qemu-options.hx b/qemu-options.hx
index 90ab943..2c88229 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -4963,6 +4963,14 @@ SRST
 
 ERST
 
+DEF("cpr-uri", HAS_ARG, QEMU_OPTION_cpr_uri, \
+    "-cpr-uri unix:socketpath\n",
+    QEMU_ARCH_ALL)
+SRST
+``-cpr-uri unix:socketpath``
+    URI for incoming CPR state, for the cpr-transfer migration mode.
+ERST
+
 DEF("incoming", HAS_ARG, QEMU_OPTION_incoming, \
     "-incoming tcp:[host]:port[,to=maxport][,ipv4=on|off][,ipv6=on|off]\n" \
     "-incoming rdma:host:port[,ipv4=on|off][,ipv6=on|off]\n" \
diff --git a/system/vl.c b/system/vl.c
index 565d932..1ac6b0b 100644
--- a/system/vl.c
+++ b/system/vl.c
@@ -3490,6 +3490,9 @@ void qemu_init(int argc, char **argv)
                     exit(1);
                 }
                 break;
+            case QEMU_OPTION_cpr_uri:
+                cpr_set_cpr_uri(optarg);
+                break;
             case QEMU_OPTION_incoming:
                 if (!incoming) {
                     runstate_set(RUN_STATE_INMIGRATE);
-- 
1.8.3.1




* [PATCH V2 12/13] migration: split qmp_migrate
  2024-09-30 19:40 [PATCH V2 00/13] Live update: cpr-transfer Steve Sistare
                   ` (10 preceding siblings ...)
  2024-09-30 19:40 ` [PATCH V2 11/13] migration: cpr-uri option Steve Sistare
@ 2024-09-30 19:40 ` Steve Sistare
  2024-10-07 19:18   ` Peter Xu
  2024-09-30 19:40 ` [PATCH V2 13/13] migration: cpr-transfer mode Steve Sistare
  2024-10-08 14:33 ` [PATCH V2 00/13] Live update: cpr-transfer Vladimir Sementsov-Ogievskiy
  13 siblings, 1 reply; 79+ messages in thread
From: Steve Sistare @ 2024-09-30 19:40 UTC (permalink / raw)
  To: qemu-devel
  Cc: Peter Xu, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster, Steve Sistare

Split qmp_migrate into start and finish functions.  Finish will be
called asynchronously in a subsequent patch, but for now, call it
immediately.  No functional change.
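
For readers following the series: the later cpr-transfer patch defers the
finish step until the target closes the CPR state socket.  The sketch below
shows only that idea with plain poll().  It is not the QEMU implementation,
which attaches a GSource watch for G_IO_HUP; wait_for_peer_close and
demo_hup_handshake are hypothetical names, and a socketpair stands in for
the two QEMU processes.

```c
#include <assert.h>
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper: wait up to timeout_ms for the peer of a connected
 * stream socket to close its end.
 * Returns 1 on hangup, 0 on timeout, -1 on poll() error.
 */
int wait_for_peer_close(int sock, int timeout_ms)
{
    /* POLLHUP and POLLERR are always reported, so no events are requested. */
    struct pollfd pfd = { .fd = sock, .events = 0 };
    int n;

    assert(sock >= 0);
    n = poll(&pfd, 1, timeout_ms);
    if (n < 0) {
        return -1;
    }
    if (n == 0) {
        return 0;
    }
    return (pfd.revents & (POLLHUP | POLLERR)) ? 1 : 0;
}

/* Stand-in for the handshake: the "target" closes, the "source" notices. */
int demo_hup_handshake(void)
{
    int sv[2];
    int before, after;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) {
        return -1;
    }
    before = wait_for_peer_close(sv[0], 0);    /* target still loading state */
    close(sv[1]);                              /* target: "I am listening" */
    after = wait_for_peer_close(sv[0], 1000);  /* source may now connect */
    close(sv[0]);
    return (before == 0 && after == 1) ? 0 : -1;
}
```

A blocking wait like this would stall the main loop, which is presumably why
the series registers a watch and runs the finish function from a callback
instead.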

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 migration/migration.c | 36 ++++++++++++++++++++++++++++--------
 1 file changed, 28 insertions(+), 8 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index 868bf0e..3301583 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2073,6 +2073,9 @@ static bool migrate_prepare(MigrationState *s, bool resume, Error **errp)
     return true;
 }
 
+static void qmp_migrate_finish(MigrationAddress *addr, bool resume_requested,
+                               Error **errp);
+
 void qmp_migrate(const char *uri, bool has_channels,
                  MigrationChannelList *channels, bool has_detach, bool detach,
                  bool has_resume, bool resume, Error **errp)
@@ -2120,12 +2123,6 @@ void qmp_migrate(const char *uri, bool has_channels,
         return;
     }
 
-    if (!resume_requested) {
-        if (!yank_register_instance(MIGRATION_YANK_INSTANCE, errp)) {
-            return;
-        }
-    }
-
     if (migrate_mode_is_cpr(s)) {
         int ret = migration_stop_vm(s, RUN_STATE_FINISH_MIGRATE);
         if (ret < 0) {
@@ -2139,6 +2136,30 @@ void qmp_migrate(const char *uri, bool has_channels,
         goto out;
     }
 
+    qmp_migrate_finish(addr, resume_requested, errp);
+
+out:
+    if (local_err) {
+        migrate_fd_error(s, local_err);
+        error_propagate(errp, local_err);
+        if (stopped) {
+            vm_resume(s->vm_old_state);
+        }
+    }
+}
+
+static void qmp_migrate_finish(MigrationAddress *addr, bool resume_requested,
+                               Error **errp)
+{
+    MigrationState *s = migrate_get_current();
+    Error *local_err = NULL;
+
+    if (!resume_requested) {
+        if (!yank_register_instance(MIGRATION_YANK_INSTANCE, errp)) {
+            return;
+        }
+    }
+
     if (addr->transport == MIGRATION_ADDRESS_TYPE_SOCKET) {
         SocketAddress *saddr = &addr->u.socket;
         if (saddr->type == SOCKET_ADDRESS_TYPE_INET ||
@@ -2163,14 +2184,13 @@ void qmp_migrate(const char *uri, bool has_channels,
                           MIGRATION_STATUS_FAILED);
     }
 
-out:
     if (local_err) {
         if (!resume_requested) {
             yank_unregister_instance(MIGRATION_YANK_INSTANCE);
         }
         migrate_fd_error(s, local_err);
         error_propagate(errp, local_err);
-        if (stopped) {
+        if (migrate_mode_is_cpr(s)) {
             vm_resume(s->vm_old_state);
         }
         return;
-- 
1.8.3.1




* [PATCH V2 13/13] migration: cpr-transfer mode
  2024-09-30 19:40 [PATCH V2 00/13] Live update: cpr-transfer Steve Sistare
                   ` (11 preceding siblings ...)
  2024-09-30 19:40 ` [PATCH V2 12/13] migration: split qmp_migrate Steve Sistare
@ 2024-09-30 19:40 ` Steve Sistare
  2024-10-07 19:44   ` Peter Xu
  2024-10-08 14:33 ` [PATCH V2 00/13] Live update: cpr-transfer Vladimir Sementsov-Ogievskiy
  13 siblings, 1 reply; 79+ messages in thread
From: Steve Sistare @ 2024-09-30 19:40 UTC (permalink / raw)
  To: qemu-devel
  Cc: Peter Xu, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster, Steve Sistare

Add the cpr-transfer migration mode.  Usage:
  qemu-system-$arch -machine anon-alloc=memfd ...

  start new QEMU with "-incoming <uri-1> -cpr-uri <uri-2>"

  Issue commands to old QEMU:
  migrate_set_parameter mode cpr-transfer
  migrate_set_parameter cpr-uri <uri-2>
  migrate -d <uri-1>

The user starts new QEMU on the same host as old QEMU, with the same
arguments as old QEMU, plus the -incoming and -cpr-uri options.  The
migrate command stops the VM, saves CPR state to uri-2, saves normal
migration state to uri-1, and old QEMU enters the postmigrate state.
Guest RAM is preserved in place, albeit with new virtual addresses in
new QEMU.

This mode requires a second migration channel, specified by the
cpr-uri migration property on the outgoing side, and by the cpr-uri
QEMU command-line option on the incoming side.  The channel must
be a type, such as unix socket, that supports SCM_RIGHTS.
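
To illustrate why the channel type matters: SCM_RIGHTS is the unix-socket
ancillary message that lets one process hand an open descriptor to another.
The sketch below is generic POSIX code, not code from this series; send_fd,
recv_fd, and demo_fd_roundtrip are hypothetical names, and a socketpair
within one process stands in for old and new QEMU.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer sized and aligned for exactly one descriptor. */
union fd_cmsg {
    struct cmsghdr align;
    char buf[CMSG_SPACE(sizeof(int))];
};

/* Send fd over a connected unix socket.  Returns 0 on success, -1 on error. */
int send_fd(int sock, int fd)
{
    char byte = 'F';    /* one byte of ordinary data carries the message */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_cmsg u;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    assert(fd >= 0);
    memset(&msg, 0, sizeof(msg));
    memset(&u, 0, sizeof(u));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one descriptor.  Returns the new fd, or -1 on error. */
int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_cmsg u;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd;

    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    if (recvmsg(sock, &msg, 0) != 1) {
        return -1;
    }
    cmsg = CMSG_FIRSTHDR(&msg);
    if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int))) {
        return -1;
    }
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}

/* Pass the read end of a pipe across a socketpair and read through it. */
int demo_fd_roundtrip(void)
{
    int sv[2], p[2], got;
    char c = 0;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0 || pipe(p) < 0) {
        return -1;
    }
    if (send_fd(sv[0], p[0]) < 0 || (got = recv_fd(sv[1])) < 0) {
        return -1;
    }
    /* Data written to the pipe is readable through the received duplicate. */
    if (write(p[1], "x", 1) != 1 || read(got, &c, 1) != 1) {
        return -1;
    }
    return c == 'x' ? 0 : -1;
}
```

The receiver gets a new descriptor number that refers to the same open file,
which is why a file or TCP channel cannot serve as the cpr-uri.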

Memory-backend objects must have the share=on attribute;
memory-backend-epc is not supported.  The VM must be started with
the '-machine anon-alloc=memfd' option, which allows anonymous
memory to be transferred in place to the new process.  The memfds
are kept open by sending the descriptors to new QEMU via the
cpr-uri, which must support SCM_RIGHTS, and they are mmap'd
in new QEMU.
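
As a rough illustration of "preserved in place, albeit with new virtual
addresses": a memfd mapped MAP_SHARED twice yields two different addresses
backed by the same pages.  This is a standalone sketch assuming Linux with
glibc's memfd_create wrapper; ram_memfd_create, ram_memfd_map, and
demo_remap are hypothetical names, and both mappings live in one process
here rather than in old and new QEMU.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Create an anonymous memory file of len bytes.  Returns fd or -1. */
int ram_memfd_create(const char *name, size_t len)
{
    int fd;

    assert(len > 0);
    fd = memfd_create(name, MFD_CLOEXEC);
    if (fd < 0) {
        return -1;
    }
    if (ftruncate(fd, (off_t)len) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Map the memfd shared, so every mapping sees the same pages. */
void *ram_memfd_map(int fd, size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    return p == MAP_FAILED ? NULL : p;
}

/* Write through one mapping and read the value back through another. */
int demo_remap(void)
{
    size_t len = 4096;
    int fd = ram_memfd_create("guest-ram", len);
    unsigned char *old_map, *new_map;
    int ok;

    if (fd < 0) {
        return -1;
    }
    old_map = ram_memfd_map(fd, len);   /* mapping in "old QEMU" */
    new_map = ram_memfd_map(fd, len);   /* mapping in "new QEMU" */
    if (!old_map || !new_map) {
        return -1;
    }
    old_map[0] = 0x5a;
    ok = (old_map != new_map) && (new_map[0] == 0x5a);

    munmap(old_map, len);
    munmap(new_map, len);
    close(fd);
    return ok ? 0 : -1;
}
```

In the real flow the second mmap happens in a different process, after the
descriptor has arrived over the cpr-uri channel.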

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 include/migration/cpr.h   |  1 +
 migration/cpr.c           | 34 +++++++++++++++++++----
 migration/migration.c     | 69 +++++++++++++++++++++++++++++++++++++++++++++--
 migration/migration.h     |  2 ++
 migration/ram.c           |  2 ++
 migration/vmstate-types.c |  5 ++--
 qapi/migration.json       | 27 ++++++++++++++++++-
 stubs/vmstate.c           |  7 +++++
 8 files changed, 137 insertions(+), 10 deletions(-)

diff --git a/include/migration/cpr.h b/include/migration/cpr.h
index e886c98..5cd373f 100644
--- a/include/migration/cpr.h
+++ b/include/migration/cpr.h
@@ -30,6 +30,7 @@ int cpr_state_save(Error **errp);
 int cpr_state_load(Error **errp);
 void cpr_state_close(void);
 struct QIOChannel *cpr_state_ioc(void);
+bool cpr_needed_for_reuse(void *opaque);
 
 QEMUFile *cpr_transfer_output(const char *uri, Error **errp);
 QEMUFile *cpr_transfer_input(const char *uri, Error **errp);
diff --git a/migration/cpr.c b/migration/cpr.c
index 86f66c1..911b556 100644
--- a/migration/cpr.c
+++ b/migration/cpr.c
@@ -9,6 +9,7 @@
 #include "qapi/error.h"
 #include "migration/cpr.h"
 #include "migration/misc.h"
+#include "migration/options.h"
 #include "migration/qemu-file.h"
 #include "migration/savevm.h"
 #include "migration/vmstate.h"
@@ -57,7 +58,7 @@ static const VMStateDescription vmstate_cpr_fd = {
         VMSTATE_UINT32(namelen, CprFd),
         VMSTATE_VBUFFER_ALLOC_UINT32(name, CprFd, 0, NULL, namelen),
         VMSTATE_INT32(id, CprFd),
-        VMSTATE_INT32(fd, CprFd),
+        VMSTATE_FD(fd, CprFd),
         VMSTATE_END_OF_LIST()
     }
 };
@@ -174,9 +175,16 @@ int cpr_state_save(Error **errp)
 {
     int ret;
     QEMUFile *f;
+    MigMode mode = migrate_mode();
 
-    /* set f based on mode in a later patch in this series */
-    return 0;
+    if (mode == MIG_MODE_CPR_TRANSFER) {
+        f = cpr_transfer_output(migrate_cpr_uri(), errp);
+    } else {
+        return 0;
+    }
+    if (!f) {
+        return -1;
+    }
 
     qemu_put_be32(f, QEMU_CPR_FILE_MAGIC);
     qemu_put_be32(f, QEMU_CPR_FILE_VERSION);
@@ -205,8 +213,18 @@ int cpr_state_load(Error **errp)
     uint32_t v;
     QEMUFile *f;
 
-    /* set f based on mode in a later patch in this series */
-    return 0;
+    /*
+     * Mode will be loaded in CPR state, so cannot use it to decide which
+     * form of state to load.
+     */
+    if (cpr_uri) {
+        f = cpr_transfer_input(cpr_uri, errp);
+    } else {
+        return 0;
+    }
+    if (!f) {
+        return -1;
+    }
 
     v = qemu_get_be32(f);
     if (v != QEMU_CPR_FILE_MAGIC) {
@@ -243,3 +261,9 @@ void cpr_state_close(void)
         cpr_state_file = NULL;
     }
 }
+
+bool cpr_needed_for_reuse(void *opaque)
+{
+    MigMode mode = migrate_mode();
+    return mode == MIG_MODE_CPR_TRANSFER;
+}
diff --git a/migration/migration.c b/migration/migration.c
index 3301583..73b85aa 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -76,6 +76,7 @@
 static NotifierWithReturnList migration_state_notifiers[] = {
     NOTIFIER_ELEM_INIT(migration_state_notifiers, MIG_MODE_NORMAL),
     NOTIFIER_ELEM_INIT(migration_state_notifiers, MIG_MODE_CPR_REBOOT),
+    NOTIFIER_ELEM_INIT(migration_state_notifiers, MIG_MODE_CPR_TRANSFER),
 };
 
 /* Messages sent on the return path from destination to source */
@@ -109,6 +110,7 @@ static int migration_maybe_pause(MigrationState *s,
 static void migrate_fd_cancel(MigrationState *s);
 static bool close_return_path_on_source(MigrationState *s);
 static void migration_completion_end(MigrationState *s);
+static void migrate_hup_delete(MigrationState *s);
 
 static void migration_downtime_start(MigrationState *s)
 {
@@ -204,6 +206,12 @@ migration_channels_and_transport_compatible(MigrationAddress *addr,
         return false;
     }
 
+    if (migrate_mode() == MIG_MODE_CPR_TRANSFER &&
+        addr->transport == MIGRATION_ADDRESS_TYPE_FILE) {
+        error_setg(errp, "Migration requires streamable transport (eg unix)");
+        return false;
+    }
+
     return true;
 }
 
@@ -316,6 +324,7 @@ void migration_cancel(const Error *error)
         qmp_cancel_vcpu_dirty_limit(false, -1, NULL);
     }
     migrate_fd_cancel(current_migration);
+    migrate_hup_delete(current_migration);
 }
 
 void migration_shutdown(void)
@@ -718,6 +727,9 @@ static void qemu_start_incoming_migration(const char *uri, bool has_channels,
     } else {
         error_setg(errp, "unknown migration protocol: %s", uri);
     }
+
+    /* Close cpr socket to tell source that we are listening */
+    cpr_state_close();
 }
 
 static void process_incoming_migration_bh(void *opaque)
@@ -1414,6 +1426,8 @@ static void migrate_fd_cleanup(MigrationState *s)
     s->vmdesc = NULL;
 
     qemu_savevm_state_cleanup();
+    cpr_state_close();
+    migrate_hup_delete(s);
 
     close_return_path_on_source(s);
 
@@ -1698,7 +1712,9 @@ bool migration_thread_is_self(void)
 
 bool migrate_mode_is_cpr(MigrationState *s)
 {
-    return s->parameters.mode == MIG_MODE_CPR_REBOOT;
+    MigMode mode = s->parameters.mode;
+    return mode == MIG_MODE_CPR_REBOOT ||
+           mode == MIG_MODE_CPR_TRANSFER;
 }
 
 int migrate_init(MigrationState *s, Error **errp)
@@ -2033,6 +2049,12 @@ static bool migrate_prepare(MigrationState *s, bool resume, Error **errp)
         return false;
     }
 
+    if (migrate_mode() == MIG_MODE_CPR_TRANSFER &&
+        !s->parameters.cpr_uri) {
+        error_setg(errp, "cpr-transfer mode requires setting cpr-uri");
+        return false;
+    }
+
     if (migration_is_blocked(errp)) {
         return false;
     }
@@ -2076,6 +2098,37 @@ static bool migrate_prepare(MigrationState *s, bool resume, Error **errp)
 static void qmp_migrate_finish(MigrationAddress *addr, bool resume_requested,
                                Error **errp);
 
+static void migrate_hup_add(MigrationState *s, QIOChannel *ioc, GSourceFunc cb,
+                            void *opaque)
+{
+    s->hup_source = qio_channel_create_watch(ioc, G_IO_HUP);
+    g_source_set_callback(s->hup_source, cb, opaque, NULL);
+    g_source_attach(s->hup_source, NULL);
+}
+
+static void migrate_hup_delete(MigrationState *s)
+{
+    if (s->hup_source) {
+        g_source_destroy(s->hup_source);
+        g_source_unref(s->hup_source);
+        s->hup_source = NULL;
+    }
+}
+
+static gboolean qmp_migrate_finish_cb(QIOChannel *channel,
+                                      GIOCondition cond,
+                                      void *opaque)
+{
+    MigrationAddress *addr = opaque;
+
+    qmp_migrate_finish(addr, false, NULL);
+
+    cpr_state_close();
+    migrate_hup_delete(migrate_get_current());
+    qapi_free_MigrationAddress(addr);
+    return G_SOURCE_REMOVE;
+}
+
 void qmp_migrate(const char *uri, bool has_channels,
                  MigrationChannelList *channels, bool has_detach, bool detach,
                  bool has_resume, bool resume, Error **errp)
@@ -2136,7 +2189,19 @@ void qmp_migrate(const char *uri, bool has_channels,
         goto out;
     }
 
-    qmp_migrate_finish(addr, resume_requested, errp);
+    /*
+     * For cpr-transfer, the target may not be listening yet on the migration
+     * channel, because first it must finish cpr_state_load.  The target tells
+     * us it is listening by closing the cpr-state socket.  Wait for that HUP
+     * event before connecting in qmp_migrate_finish.
+     */
+    if (s->parameters.mode == MIG_MODE_CPR_TRANSFER) {
+        migrate_hup_add(s, cpr_state_ioc(), (GSourceFunc)qmp_migrate_finish_cb,
+                        QAPI_CLONE(MigrationAddress, addr));
+
+    } else {
+        qmp_migrate_finish(addr, resume_requested, errp);
+    }
 
 out:
     if (local_err) {
diff --git a/migration/migration.h b/migration/migration.h
index 38aa140..74c167b 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -457,6 +457,8 @@ struct MigrationState {
     bool switchover_acked;
     /* Is this a rdma migration */
     bool rdma_migration;
+
+    GSource *hup_source;
 };
 
 void migrate_set_state(MigrationStatus *state, MigrationStatus old_state,
diff --git a/migration/ram.c b/migration/ram.c
index 81eda27..e2cef50 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -216,7 +216,9 @@ static bool postcopy_preempt_active(void)
 
 bool migrate_ram_is_ignored(RAMBlock *block)
 {
+    MigMode mode = migrate_mode();
     return !qemu_ram_is_migratable(block) ||
+           mode == MIG_MODE_CPR_TRANSFER ||
            (migrate_ignore_shared() && qemu_ram_is_shared(block)
                                     && qemu_ram_is_named_file(block));
 }
diff --git a/migration/vmstate-types.c b/migration/vmstate-types.c
index 6e45a4a..b5a55b8 100644
--- a/migration/vmstate-types.c
+++ b/migration/vmstate-types.c
@@ -15,6 +15,7 @@
 #include "qemu-file.h"
 #include "migration.h"
 #include "migration/vmstate.h"
+#include "migration/client-options.h"
 #include "qemu/error-report.h"
 #include "qemu/queue.h"
 #include "trace.h"
@@ -321,7 +322,7 @@ static int get_fd(QEMUFile *f, void *pv, size_t size,
 {
     int32_t *v = pv;
     qemu_get_sbe32s(f, v);
-    if (*v < 0) {
+    if (*v < 0 || migrate_mode() != MIG_MODE_CPR_TRANSFER) {
         return 0;
     }
     *v = qemu_file_get_fd(f);
@@ -334,7 +335,7 @@ static int put_fd(QEMUFile *f, void *pv, size_t size,
     int32_t *v = pv;
 
     qemu_put_sbe32s(f, v);
-    if (*v < 0) {
+    if (*v < 0 || migrate_mode() != MIG_MODE_CPR_TRANSFER) {
         return 0;
     }
     return qemu_file_put_fd(f, *v);
diff --git a/qapi/migration.json b/qapi/migration.json
index c0d8bcc..f51b4cb 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -611,9 +611,34 @@
 #     or COLO.
 #
 #     (since 8.2)
+#
+# @cpr-transfer: This mode allows the user to transfer a guest to a
+#     new QEMU instance on the same host with minimal guest pause
+#     time, by preserving guest RAM in place, albeit with new virtual
+#     addresses in new QEMU.
+#
+#     The user starts new QEMU on the same host as old QEMU, with the
+#     same arguments as old QEMU, plus the -incoming option.  The
+#     user issues the migrate command to old QEMU, which stops the VM,
+#     saves state to the migration channels, and enters the
+#     postmigrate state.  Execution resumes in new QEMU.  Guest RAM is
+#     preserved in place, albeit with new virtual addresses in new
+#     QEMU.  The incoming migration channel cannot be a file type.
+#
+#     This mode requires a second migration channel, specified by the
+#     cpr-uri migration property on the outgoing side, and by
+#     the cpr-uri QEMU command-line option on the incoming
+#     side.  The channel must be a type, such as unix socket, that
+#     supports SCM_RIGHTS.
+#
+#     Memory-backend objects must have the share=on attribute, but
+#     memory-backend-epc is not supported.  The VM must be started
+#     with the '-machine anon-alloc=memfd' option.
+#
+#     (since 9.2)
 ##
 { 'enum': 'MigMode',
-  'data': [ 'normal', 'cpr-reboot' ] }
+  'data': [ 'normal', 'cpr-reboot', 'cpr-transfer' ] }
 
 ##
 # @ZeroPageDetection:
diff --git a/stubs/vmstate.c b/stubs/vmstate.c
index 8513d92..c190762 100644
--- a/stubs/vmstate.c
+++ b/stubs/vmstate.c
@@ -1,5 +1,7 @@
 #include "qemu/osdep.h"
 #include "migration/vmstate.h"
+#include "qapi/qapi-types-migration.h"
+#include "migration/client-options.h"
 
 int vmstate_register_with_alias_id(VMStateIf *obj,
                                    uint32_t instance_id,
@@ -21,3 +23,8 @@ bool vmstate_check_only_migratable(const VMStateDescription *vmsd)
 {
     return true;
 }
+
+MigMode migrate_mode(void)
+{
+    return MIG_MODE_NORMAL;
+}
-- 
1.8.3.1




* Re: [PATCH V2 01/13] machine: alloc-anon option
  2024-09-30 19:40 ` [PATCH V2 01/13] machine: alloc-anon option Steve Sistare
@ 2024-10-03 16:14   ` Peter Xu
  2024-10-04 10:14     ` David Hildenbrand
  2024-10-07 15:36   ` Peter Xu
  1 sibling, 1 reply; 79+ messages in thread
From: Peter Xu @ 2024-10-03 16:14 UTC (permalink / raw)
  To: Steve Sistare
  Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster, Igor Mammedov

On Mon, Sep 30, 2024 at 12:40:32PM -0700, Steve Sistare wrote:
> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
> on the value of the anon-alloc machine property.  This option applies to
> memory allocated as a side effect of creating various devices. It does
> not apply to memory-backend-objects, whether explicitly specified on
> the command line, or implicitly created by the -m command line option.
> 
> The memfd option is intended to support new migration modes, in which the
> memory region can be transferred in place to a new QEMU process, by sending
> the memfd file descriptor to the process.  Memory contents are preserved,
> and if the mode also transfers device descriptors, then pages that are
> locked in memory for DMA remain locked.  This behavior is a pre-requisite
> for supporting vfio, vdpa, and iommufd devices with the new modes.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>

[Igor seems missing in the loop; added]

> ---
>  hw/core/machine.c   | 19 +++++++++++++++++++
>  include/hw/boards.h |  1 +
>  qapi/machine.json   | 14 ++++++++++++++
>  qemu-options.hx     | 11 +++++++++++
>  system/physmem.c    | 35 +++++++++++++++++++++++++++++++++++
>  system/trace-events |  3 +++
>  6 files changed, 83 insertions(+)
> 
> diff --git a/hw/core/machine.c b/hw/core/machine.c
> index adaba17..a89a32b 100644
> --- a/hw/core/machine.c
> +++ b/hw/core/machine.c
> @@ -460,6 +460,20 @@ static void machine_set_mem_merge(Object *obj, bool value, Error **errp)
>      ms->mem_merge = value;
>  }
>  
> +static int machine_get_anon_alloc(Object *obj, Error **errp)
> +{
> +    MachineState *ms = MACHINE(obj);
> +
> +    return ms->anon_alloc;
> +}
> +
> +static void machine_set_anon_alloc(Object *obj, int value, Error **errp)
> +{
> +    MachineState *ms = MACHINE(obj);
> +
> +    ms->anon_alloc = value;
> +}
> +
>  static bool machine_get_usb(Object *obj, Error **errp)
>  {
>      MachineState *ms = MACHINE(obj);
> @@ -1078,6 +1092,11 @@ static void machine_class_init(ObjectClass *oc, void *data)
>      object_class_property_set_description(oc, "mem-merge",
>          "Enable/disable memory merge support");
>  
> +    object_class_property_add_enum(oc, "anon-alloc", "AnonAllocOption",
> +                                   &AnonAllocOption_lookup,
> +                                   machine_get_anon_alloc,
> +                                   machine_set_anon_alloc);
> +
>      object_class_property_add_bool(oc, "usb",
>          machine_get_usb, machine_set_usb);
>      object_class_property_set_description(oc, "usb",
> diff --git a/include/hw/boards.h b/include/hw/boards.h
> index 5966069..5a87647 100644
> --- a/include/hw/boards.h
> +++ b/include/hw/boards.h
> @@ -393,6 +393,7 @@ struct MachineState {
>      bool enable_graphics;
>      ConfidentialGuestSupport *cgs;
>      HostMemoryBackend *memdev;
> +    AnonAllocOption anon_alloc;
>      /*
>       * convenience alias to ram_memdev_id backend memory region
>       * or to numa container memory region
> diff --git a/qapi/machine.json b/qapi/machine.json
> index a6b8795..d4a63f5 100644
> --- a/qapi/machine.json
> +++ b/qapi/machine.json
> @@ -1898,3 +1898,17 @@
>  { 'command': 'x-query-interrupt-controllers',
>    'returns': 'HumanReadableText',
>    'features': [ 'unstable' ]}
> +
> +##
> +# @AnonAllocOption:
> +#
> +# An enumeration of the options for allocating anonymous guest memory.
> +#
> +# @mmap: allocate using mmap MAP_ANON
> +#
> +# @memfd: allocate using memfd_create
> +#
> +# Since: 9.2
> +##
> +{ 'enum': 'AnonAllocOption',
> +  'data': [ 'mmap', 'memfd' ] }
> diff --git a/qemu-options.hx b/qemu-options.hx
> index d94e2cb..90ab943 100644
> --- a/qemu-options.hx
> +++ b/qemu-options.hx
> @@ -38,6 +38,7 @@ DEF("machine", HAS_ARG, QEMU_OPTION_machine, \
>      "                nvdimm=on|off controls NVDIMM support (default=off)\n"
>      "                memory-encryption=@var{} memory encryption object to use (default=none)\n"
>      "                hmat=on|off controls ACPI HMAT support (default=off)\n"
> +    "                anon-alloc=mmap|memfd allocate anonymous guest RAM using mmap MAP_ANON or memfd_create (default: mmap)\n"
>      "                memory-backend='backend-id' specifies explicitly provided backend for main RAM (default=none)\n"
>      "                cxl-fmw.0.targets.0=firsttarget,cxl-fmw.0.targets.1=secondtarget,cxl-fmw.0.size=size[,cxl-fmw.0.interleave-granularity=granularity]\n",
>      QEMU_ARCH_ALL)
> @@ -101,6 +102,16 @@ SRST
>          Enables or disables ACPI Heterogeneous Memory Attribute Table
>          (HMAT) support. The default is off.
>  
> +    ``anon-alloc=mmap|memfd``
> +        Allocate anonymous guest RAM using mmap MAP_ANON (the default)
> +        or memfd_create.  This option applies to memory allocated as a
> +        side effect of creating various devices. It does not apply to
> +        memory-backend-objects, whether explicitly specified on the
> +        command line, or implicitly created by the -m command line
> +        option.
> +
> +        Some migration modes require anon-alloc=memfd.
> +
>      ``memory-backend='id'``
>          An alternative to legacy ``-mem-path`` and ``mem-prealloc`` options.
>          Allows to use a memory backend as main RAM.
> diff --git a/system/physmem.c b/system/physmem.c
> index dc1db3a..174f7e0 100644
> --- a/system/physmem.c
> +++ b/system/physmem.c
> @@ -47,6 +47,7 @@
>  #include "qemu/qemu-print.h"
>  #include "qemu/log.h"
>  #include "qemu/memalign.h"
> +#include "qemu/memfd.h"
>  #include "exec/memory.h"
>  #include "exec/ioport.h"
>  #include "sysemu/dma.h"
> @@ -69,6 +70,8 @@
>  
>  #include "qemu/pmem.h"
>  
> +#include "qapi/qapi-types-migration.h"
> +#include "migration/options.h"
>  #include "migration/vmstate.h"
>  
>  #include "qemu/range.h"
> @@ -1849,6 +1852,35 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
>                  qemu_mutex_unlock_ramlist();
>                  return;
>              }
> +
> +        } else if (current_machine->anon_alloc == ANON_ALLOC_OPTION_MEMFD &&
> +                   !object_dynamic_cast(new_block->mr->parent_obj.parent,
> +                                        TYPE_MEMORY_BACKEND)) {

This is pretty fragile: if someone adds yet another layer on top of memory
backend objects, the ownership links can change and this check might
silently match something else without any warning.

I wish we could dig into what is missing, but maybe that's too trivial.  If
not, we still need to make this solid.  Perhaps this can be a ram flag, with
the relevant callers passing in that flag explicitly.

I think RAM_SHARED can actually be that flag already.  I mean, in all paths
where we may create anon mem (but not memory-backend-* objects), is it
always safe to switch from anon to RAM_SHARED?

We should also make sure that none of those paths can already take an
optional RAM_SHARED flag; if one can, we did something wrong, because such
a change must not override whatever the user specified with share=on/off.
That way nothing the user specified would be violated.

I think that means only paths [1-4] below are relevant:

qemu_ram_alloc
    memory_region_init_rom_device_nomigrate [1]
    memory_region_init_ram_flags_nomigrate
        memory_region_init_ram_nomigrate    [2]
        memory_region_init_rom_nomigrate    [3]
qemu_ram_alloc_resizeable                   [4]

So I wonder whether we can have a patch that simply switches [1-4] to
always use RAM_SHARED; I assume they're all corner case allocations:
they never include major guest memory, only things like vram, etc.

I feel like that's fine.  I think it should even work with migration from
an old QEMU where all such memory chunks are anon to a new QEMU where they
are all memfd.  Fundamentally it should work, as QEMU migration relies only
on the host pointer and nothing else, IIRC.  It's worth double checking, and
a migration test could also be useful in case I'm overlooking something
obvious.  I haven't spotted anything yet that would go wrong.

Then, maybe..  we don't need any new machine type property?

> +            size_t max_length = new_block->max_length;
> +            MemoryRegion *mr = new_block->mr;
> +            const char *name = memory_region_name(mr);
> +
> +            new_block->mr->align = QEMU_VMALLOC_ALIGN;
> +            new_block->flags |= RAM_SHARED;
> +
> +            if (new_block->fd == -1) {
> +                new_block->fd = qemu_memfd_create(name, max_length + mr->align,
> +                                                  0, 0, 0, errp);
> +            }
> +
> +            if (new_block->fd >= 0) {
> +                int mfd = new_block->fd;
> +                qemu_set_cloexec(mfd);
> +                new_block->host = file_ram_alloc(new_block, max_length, mfd,
> +                                                 false, 0, errp);
> +            }
> +            if (!new_block->host) {
> +                qemu_mutex_unlock_ramlist();
> +                return;
> +            }
> +            memory_try_enable_merging(new_block->host, new_block->max_length);

IIUC this can be dropped.  It's destined to be SHARED here, so KSM won't work.

But if you agree with the RAM_SHARED idea I mentioned above, this chunk of
code is not needed at all; instead it'll be two separate patches:

  - Make above paths [1-4] always use RAM_SHARED

  - Change qemu_anon_ram_alloc() path so that it'll cache the fd if RAM_SHARED
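
To make the second item concrete, here is a rough standalone sketch of an
allocator that picks memfd or MAP_ANON from a shared flag and caches the fd.
It is only an illustration, not a proposed patch: struct anon_block,
anon_block_alloc, and anon_block_free are hypothetical names, and it uses
plain Linux calls rather than qemu_memfd_create() or the real
qemu_anon_ram_alloc() signature.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical stand-in for the fields of a RAM block that matter here. */
struct anon_block {
    void *host;    /* mapped address */
    size_t len;    /* mapping length */
    int fd;        /* cached memfd, or -1 for a private anonymous mapping */
};

/* Allocate len bytes; back them with a memfd only when shared is set. */
int anon_block_alloc(struct anon_block *b, size_t len, int shared)
{
    void *p;

    assert(len > 0);
    b->host = NULL;
    b->len = len;
    b->fd = -1;

    if (shared) {
        b->fd = memfd_create("anon-shared", MFD_CLOEXEC);
        if (b->fd < 0) {
            return -1;
        }
        if (ftruncate(b->fd, (off_t)len) < 0) {
            close(b->fd);
            b->fd = -1;
            return -1;
        }
        p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, b->fd, 0);
    } else {
        p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    }
    if (p == MAP_FAILED) {
        if (b->fd >= 0) {
            close(b->fd);
            b->fd = -1;
        }
        return -1;
    }
    b->host = p;
    return 0;
}

/* Unmap the block and drop the cached fd, if any. */
void anon_block_free(struct anon_block *b)
{
    if (b->host) {
        munmap(b->host, b->len);
        b->host = NULL;
    }
    if (b->fd >= 0) {
        close(b->fd);
        b->fd = -1;
    }
}
```

With the fd cached on the block, a later migration mode can hand it to a new
process without the caller needing to know how the memory was allocated.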

> +            free_on_error = true;
> +
>          } else {
>              new_block->host = qemu_anon_ram_alloc(new_block->max_length,
>                                                    &new_block->mr->align,
> @@ -1932,6 +1964,9 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
>          ram_block_notify_add(new_block->host, new_block->used_length,
>                               new_block->max_length);
>      }
> +    trace_ram_block_add(memory_region_name(new_block->mr), new_block->flags,
> +                        new_block->fd, new_block->used_length,
> +                        new_block->max_length);
>      return;
>  
>  out_free:
> diff --git a/system/trace-events b/system/trace-events
> index 074d001..4669411 100644
> --- a/system/trace-events
> +++ b/system/trace-events
> @@ -47,3 +47,6 @@ dirtylimit_vcpu_execute(int cpu_index, int64_t sleep_time_us) "CPU[%d] sleep %"P
>  
>  # cpu-throttle.c
>  cpu_throttle_set(int new_throttle_pct)  "set guest CPU throttled by %d%%"
> +
> +#physmem.c
> +ram_block_add(const char *name, uint32_t flags, int fd, size_t used_length, size_t max_length) "%s, flags %u, fd %d, len %lu, maxlen %lu"
> -- 
> 1.8.3.1
> 

-- 
Peter Xu




* Re: [PATCH V2 01/13] machine: alloc-anon option
  2024-10-03 16:14   ` Peter Xu
@ 2024-10-04 10:14     ` David Hildenbrand
  2024-10-04 12:33       ` Peter Xu
  0 siblings, 1 reply; 79+ messages in thread
From: David Hildenbrand @ 2024-10-04 10:14 UTC (permalink / raw)
  To: Peter Xu, Steve Sistare
  Cc: qemu-devel, Fabiano Rosas, Marcel Apfelbaum, Eduardo Habkost,
	Philippe Mathieu-Daude, Paolo Bonzini, Daniel P. Berrange,
	Markus Armbruster, Igor Mammedov

On 03.10.24 18:14, Peter Xu wrote:
> On Mon, Sep 30, 2024 at 12:40:32PM -0700, Steve Sistare wrote:
>> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
>> on the value of the anon-alloc machine property.  This option applies to
>> memory allocated as a side effect of creating various devices. It does
>> not apply to memory-backend-objects, whether explicitly specified on
>> the command line, or implicitly created by the -m command line option.
>>
>> The memfd option is intended to support new migration modes, in which the
>> memory region can be transferred in place to a new QEMU process, by sending
>> the memfd file descriptor to the process.  Memory contents are preserved,
>> and if the mode also transfers device descriptors, then pages that are
>> locked in memory for DMA remain locked.  This behavior is a pre-requisite
>> for supporting vfio, vdpa, and iommufd devices with the new modes.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> 
> [Igor seems missing in the loop; added]
> 
>> ---
>>   hw/core/machine.c   | 19 +++++++++++++++++++
>>   include/hw/boards.h |  1 +
>>   qapi/machine.json   | 14 ++++++++++++++
>>   qemu-options.hx     | 11 +++++++++++
>>   system/physmem.c    | 35 +++++++++++++++++++++++++++++++++++
>>   system/trace-events |  3 +++
>>   6 files changed, 83 insertions(+)
>>
>> diff --git a/hw/core/machine.c b/hw/core/machine.c
>> index adaba17..a89a32b 100644
>> --- a/hw/core/machine.c
>> +++ b/hw/core/machine.c
>> @@ -460,6 +460,20 @@ static void machine_set_mem_merge(Object *obj, bool value, Error **errp)
>>       ms->mem_merge = value;
>>   }
>>   
>> +static int machine_get_anon_alloc(Object *obj, Error **errp)
>> +{
>> +    MachineState *ms = MACHINE(obj);
>> +
>> +    return ms->anon_alloc;
>> +}
>> +
>> +static void machine_set_anon_alloc(Object *obj, int value, Error **errp)
>> +{
>> +    MachineState *ms = MACHINE(obj);
>> +
>> +    ms->anon_alloc = value;
>> +}
>> +
>>   static bool machine_get_usb(Object *obj, Error **errp)
>>   {
>>       MachineState *ms = MACHINE(obj);
>> @@ -1078,6 +1092,11 @@ static void machine_class_init(ObjectClass *oc, void *data)
>>       object_class_property_set_description(oc, "mem-merge",
>>           "Enable/disable memory merge support");
>>   
>> +    object_class_property_add_enum(oc, "anon-alloc", "AnonAllocOption",
>> +                                   &AnonAllocOption_lookup,
>> +                                   machine_get_anon_alloc,
>> +                                   machine_set_anon_alloc);
>> +
>>       object_class_property_add_bool(oc, "usb",
>>           machine_get_usb, machine_set_usb);
>>       object_class_property_set_description(oc, "usb",
>> diff --git a/include/hw/boards.h b/include/hw/boards.h
>> index 5966069..5a87647 100644
>> --- a/include/hw/boards.h
>> +++ b/include/hw/boards.h
>> @@ -393,6 +393,7 @@ struct MachineState {
>>       bool enable_graphics;
>>       ConfidentialGuestSupport *cgs;
>>       HostMemoryBackend *memdev;
>> +    AnonAllocOption anon_alloc;
>>       /*
>>        * convenience alias to ram_memdev_id backend memory region
>>        * or to numa container memory region
>> diff --git a/qapi/machine.json b/qapi/machine.json
>> index a6b8795..d4a63f5 100644
>> --- a/qapi/machine.json
>> +++ b/qapi/machine.json
>> @@ -1898,3 +1898,17 @@
>>   { 'command': 'x-query-interrupt-controllers',
>>     'returns': 'HumanReadableText',
>>     'features': [ 'unstable' ]}
>> +
>> +##
>> +# @AnonAllocOption:
>> +#
>> +# An enumeration of the options for allocating anonymous guest memory.
>> +#
>> +# @mmap: allocate using mmap MAP_ANON
>> +#
>> +# @memfd: allocate using memfd_create
>> +#
>> +# Since: 9.2
>> +##
>> +{ 'enum': 'AnonAllocOption',
>> +  'data': [ 'mmap', 'memfd' ] }
>> diff --git a/qemu-options.hx b/qemu-options.hx
>> index d94e2cb..90ab943 100644
>> --- a/qemu-options.hx
>> +++ b/qemu-options.hx
>> @@ -38,6 +38,7 @@ DEF("machine", HAS_ARG, QEMU_OPTION_machine, \
>>       "                nvdimm=on|off controls NVDIMM support (default=off)\n"
>>       "                memory-encryption=@var{} memory encryption object to use (default=none)\n"
>>       "                hmat=on|off controls ACPI HMAT support (default=off)\n"
>> +    "                anon-alloc=mmap|memfd allocate anonymous guest RAM using mmap MAP_ANON or memfd_create (default: mmap)\n"
>>       "                memory-backend='backend-id' specifies explicitly provided backend for main RAM (default=none)\n"
>>       "                cxl-fmw.0.targets.0=firsttarget,cxl-fmw.0.targets.1=secondtarget,cxl-fmw.0.size=size[,cxl-fmw.0.interleave-granularity=granularity]\n",
>>       QEMU_ARCH_ALL)
>> @@ -101,6 +102,16 @@ SRST
>>           Enables or disables ACPI Heterogeneous Memory Attribute Table
>>           (HMAT) support. The default is off.
>>   
>> +    ``anon-alloc=mmap|memfd``
>> +        Allocate anonymous guest RAM using mmap MAP_ANON (the default)
>> +        or memfd_create.  This option applies to memory allocated as a
>> +        side effect of creating various devices. It does not apply to
>> +        memory-backend-objects, whether explicitly specified on the
>> +        command line, or implicitly created by the -m command line
>> +        option.
>> +
>> +        Some migration modes require anon-alloc=memfd.
>> +
>>       ``memory-backend='id'``
>>           An alternative to legacy ``-mem-path`` and ``mem-prealloc`` options.
>>           Allows to use a memory backend as main RAM.
>> diff --git a/system/physmem.c b/system/physmem.c
>> index dc1db3a..174f7e0 100644
>> --- a/system/physmem.c
>> +++ b/system/physmem.c
>> @@ -47,6 +47,7 @@
>>   #include "qemu/qemu-print.h"
>>   #include "qemu/log.h"
>>   #include "qemu/memalign.h"
>> +#include "qemu/memfd.h"
>>   #include "exec/memory.h"
>>   #include "exec/ioport.h"
>>   #include "sysemu/dma.h"
>> @@ -69,6 +70,8 @@
>>   
>>   #include "qemu/pmem.h"
>>   
>> +#include "qapi/qapi-types-migration.h"
>> +#include "migration/options.h"
>>   #include "migration/vmstate.h"
>>   
>>   #include "qemu/range.h"
>> @@ -1849,6 +1852,35 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
>>                   qemu_mutex_unlock_ramlist();
>>                   return;
>>               }
>> +
>> +        } else if (current_machine->anon_alloc == ANON_ALLOC_OPTION_MEMFD &&
>> +                   !object_dynamic_cast(new_block->mr->parent_obj.parent,
>> +                                        TYPE_MEMORY_BACKEND)) {
> 
> This is pretty fragile.. if someone adds yet another layer on top of memory
> backend objects, the ownership links can change and this might silently run
> into something else even without any warning..
> 
> I wished we dig into what is missing, but maybe that's too trivial.  If
> not, we still need to make this as solid.  Perhaps that can be a ram flag
> and let relevant callers pass in that flag explicitly.

How would they decide whether or not we want to set the flag in the 
current configuration?

> 
> I think RAM_SHARED can actually be that flag already - I mean, in all paths
> that we may create anon mem (but not memory-backend-* objects), is it
> always safe we always switch to RAM_SHARED from anon?

Do you mean only setting the flag (-> anonymous shmem) or switching also 
to memfd, which is a bigger change?

-- 
Cheers,

David / dhildenb
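
To make the two positions above concrete, here is a rough sketch of the
flag-based alternative: the caller states its intent through a flag, and
the allocator tests that flag instead of inspecting object ownership.
Everything in it is hypothetical (ExamplePolicy, EXAMPLE_RAM_SHARED and
both functions); it is not QEMU's API, and where the policy would come
from is exactly the open question in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a RAM_SHARED-style block flag. */
#define EXAMPLE_RAM_SHARED (1u << 1)

/* Hypothetical policy toggle, e.g. derived from some machine property. */
typedef struct {
    bool preserve_anon_ram;
} ExamplePolicy;

/*
 * Used by a caller that creates device-side anonymous RAM: the caller
 * adds the shared flag only when the policy asks for it.
 */
static uint32_t example_anon_ram_flags(const ExamplePolicy *policy,
                                       uint32_t base_flags)
{
    assert(policy != NULL);
    return policy->preserve_anon_ram ? (base_flags | EXAMPLE_RAM_SHARED)
                                     : base_flags;
}

/* Used by the allocator: decide from the flag alone. */
static bool example_wants_fd_backing(uint32_t flags)
{
    return (flags & EXAMPLE_RAM_SHARED) != 0;
}
```

With this shape, adding another object layer above a memory backend
would not change the allocator's decision, because the decision no
longer depends on the ownership chain.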




* Re: [PATCH V2 01/13] machine: alloc-anon option
  2024-10-04 10:14     ` David Hildenbrand
@ 2024-10-04 12:33       ` Peter Xu
  2024-10-04 12:54         ` David Hildenbrand
  0 siblings, 1 reply; 79+ messages in thread
From: Peter Xu @ 2024-10-04 12:33 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Steve Sistare, qemu-devel, Fabiano Rosas, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster, Igor Mammedov

On Fri, Oct 04, 2024 at 12:14:35PM +0200, David Hildenbrand wrote:
> On 03.10.24 18:14, Peter Xu wrote:
> > On Mon, Sep 30, 2024 at 12:40:32PM -0700, Steve Sistare wrote:
> > > [...]
> > > +        } else if (current_machine->anon_alloc == ANON_ALLOC_OPTION_MEMFD &&
> > > +                   !object_dynamic_cast(new_block->mr->parent_obj.parent,
> > > +                                        TYPE_MEMORY_BACKEND)) {
> > 
> > This is pretty fragile.. if someone adds yet another layer on top of memory
> > backend objects, the ownership links can change and this might silently run
> > into something else even without any warning..
> > 
> > I wished we dig into what is missing, but maybe that's too trivial.  If
> > not, we still need to make this as solid.  Perhaps that can be a ram flag
> > and let relevant callers pass in that flag explicitly.
> 
> How would they decide whether or not we want to set the flag in the current
> configuration?

It was in the previous email where it got cut..  I listed four paths that
may need change.

> 
> > 
> > I think RAM_SHARED can actually be that flag already - I mean, in all paths
> > that we may create anon mem (but not memory-backend-* objects), is it
> > always safe we always switch to RAM_SHARED from anon?
> 
> Do you mean only setting the flag (-> anonymous shmem) or switching also to
> memfd, which is a bigger change?

Switching to memfd.  I thought anon shmem (mmap(MAP_SHARED)) is mostly the
same internally, if we create memfd then mmap(MAP_SHARED) on top of it, no?

In this case we need the fds so we need to do the latter to cache the fds.

Thanks,

-- 
Peter Xu
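
The two allocation styles compared above can be sketched as follows.
This is an illustration for Linux only, not code from the series; the
example_* function names and the "example-ram" label are made up.  Both
paths end in a MAP_SHARED mapping, but only the memfd path leaves a
descriptor that can be cached and mapped again.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Anonymous shared memory: there is no fd to cache or hand over. */
static void *example_alloc_anon_shared(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANON, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* memfd-backed shared memory: same kind of mapping, plus a cached fd. */
static void *example_alloc_memfd_shared(size_t len, int *fd_out)
{
    void *p;
    int fd;

    assert(fd_out != NULL);
    fd = memfd_create("example-ram", MFD_CLOEXEC);
    if (fd < 0) {
        return NULL;
    }
    if (ftruncate(fd, (off_t)len) < 0) {
        close(fd);
        return NULL;
    }
    p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return NULL;
    }
    *fd_out = fd;                       /* caller keeps this for later */
    return p;
}

/* Map an existing memfd again: same pages, new virtual address. */
static void *example_map_existing(int fd, size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    return p == MAP_FAILED ? NULL : p;
}
```

A second mapping of the cached fd sees the data written through the
first one, which is the property the in-place transfer depends on.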




* Re: [PATCH V2 01/13] machine: alloc-anon option
  2024-10-04 12:33       ` Peter Xu
@ 2024-10-04 12:54         ` David Hildenbrand
  2024-10-04 13:24           ` Peter Xu
  0 siblings, 1 reply; 79+ messages in thread
From: David Hildenbrand @ 2024-10-04 12:54 UTC (permalink / raw)
  To: Peter Xu
  Cc: Steve Sistare, qemu-devel, Fabiano Rosas, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster, Igor Mammedov

On 04.10.24 14:33, Peter Xu wrote:
> On Fri, Oct 04, 2024 at 12:14:35PM +0200, David Hildenbrand wrote:
>> On 03.10.24 18:14, Peter Xu wrote:
>>> On Mon, Sep 30, 2024 at 12:40:32PM -0700, Steve Sistare wrote:
>>>> [...]
>>>> +        } else if (current_machine->anon_alloc == ANON_ALLOC_OPTION_MEMFD &&
>>>> +                   !object_dynamic_cast(new_block->mr->parent_obj.parent,
>>>> +                                        TYPE_MEMORY_BACKEND)) {
>>>
>>> This is pretty fragile.. if someone adds yet another layer on top of memory
>>> backend objects, the ownership links can change and this might silently run
>>> into something else even without any warning..
>>>
>>> I wished we dig into what is missing, but maybe that's too trivial.  If
>>> not, we still need to make this as solid.  Perhaps that can be a ram flag
>>> and let relevant callers pass in that flag explicitly.
>>
>> How would they decide whether or not we want to set the flag in the current
>> configuration?
> 
> It was in the previous email where it got cut..  I listed four paths that
> may need change.

That's not my question. Who would decide whether we want to set 
MAP_SHARED in these callers or not?

If you have "unconditionally" in mind, I think it's a bad idea. If there 
is some other toggle to perform that setting conditionally, why not.

> 
>>
>>>
>>> I think RAM_SHARED can actually be that flag already - I mean, in all paths
>>> that we may create anon mem (but not memory-backend-* objects), is it
>>> always safe we always switch to RAM_SHARED from anon?
>>
>> Do you mean only setting the flag (-> anonymous shmem) or switching also to
>> memfd, which is a bigger change?
> 
> Switching to memfd.  I thought anon shmem (mmap(MAP_SHARED)) is mostly the
> same internally, if we create memfd then mmap(MAP_SHARED) on top of it, no?

Memfd is Linux specific, keep that in mind. Apart from that there 
shouldn't be much difference between anon shmem and memfd (there are 
memory commit differences, though).

Of course, there is a difference between anon memory and shmem, for 
example regarding what virtiofsd faced (e.g., KSM) recently.

-- 
Cheers,

David / dhildenb
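
Since memfd is Linux specific, any code that prefers it needs a path for
other hosts.  One possible shape is sketched below, under the assumption
that falling back to anonymous shared memory is acceptable when no memfd
can be created; whether such a fallback is the right policy is not
settled in this thread.  example_alloc_shared() is a hypothetical name,
and it reports fd -1 when no descriptor is available, so a caller can
tell that the block cannot be handed to another process.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: return a shared mapping of len bytes.
 * *fd_out is a memfd on Linux when one could be created, otherwise -1.
 */
static void *example_alloc_shared(size_t len, int *fd_out)
{
    void *p;

    assert(fd_out != NULL);
    *fd_out = -1;

#ifdef __linux__
    {
        int fd = memfd_create("example-ram", MFD_CLOEXEC);

        if (fd >= 0) {
            if (ftruncate(fd, (off_t)len) == 0) {
                p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
                if (p != MAP_FAILED) {
                    *fd_out = fd;
                    return p;
                }
            }
            close(fd);                  /* fall through to anon shmem */
        }
    }
#endif

    /* Portable fallback: anonymous shared memory, no descriptor. */
    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_SHARED | MAP_ANON, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}
```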




* Re: [PATCH V2 01/13] machine: alloc-anon option
  2024-10-04 12:54         ` David Hildenbrand
@ 2024-10-04 13:24           ` Peter Xu
  2024-10-07 16:23             ` David Hildenbrand
  0 siblings, 1 reply; 79+ messages in thread
From: Peter Xu @ 2024-10-04 13:24 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Steve Sistare, qemu-devel, Fabiano Rosas, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster, Igor Mammedov

On Fri, Oct 04, 2024 at 02:54:38PM +0200, David Hildenbrand wrote:
> On 04.10.24 14:33, Peter Xu wrote:
> > On Fri, Oct 04, 2024 at 12:14:35PM +0200, David Hildenbrand wrote:
> > > On 03.10.24 18:14, Peter Xu wrote:
> > > > On Mon, Sep 30, 2024 at 12:40:32PM -0700, Steve Sistare wrote:
> > > > > [...]
> > > > > +        } else if (current_machine->anon_alloc == ANON_ALLOC_OPTION_MEMFD &&
> > > > > +                   !object_dynamic_cast(new_block->mr->parent_obj.parent,
> > > > > +                                        TYPE_MEMORY_BACKEND)) {
> > > > 
> > > > This is pretty fragile.. if someone adds yet another layer on top of memory
> > > > backend objects, the ownership links can change and this might silently run
> > > > into something else even without any warning..
> > > > 
> > > > I wished we dig into what is missing, but maybe that's too trivial.  If
> > > > not, we still need to make this as solid.  Perhaps that can be a ram flag
> > > > and let relevant callers pass in that flag explicitly.
> > > 
> > > How would they decide whether or not we want to set the flag in the current
> > > configuration?
> > 
> > It was in the previous email where it got cut..  I listed four paths that
> > may need change.
> 
> That's not my question. Who would decide whether we want to set MAP_SHARED
> in these callers or not?
> 
> If you have "unconditionally" in mind, I think it's a bad idea. If there is
> some other toggle to perform that setting conditionally, why not.

Yes, I thought it could be unconditional.  We can discuss the downsides
below.  Otherwise I think we can still use a new flag, but the idea would be
the same: I want the flag to be explicit in the callers rather than implied
by the object type check, which I think is hackish.

> 
> > 
> > > 
> > > > 
> > > > I think RAM_SHARED can actually be that flag already - I mean, in all paths
> > > > that we may create anon mem (but not memory-backend-* objects), is it
> > > > always safe we always switch to RAM_SHARED from anon?
> > > 
> > > Do you mean only setting the flag (-> anonymous shmem) or switching also to
> > > memfd, which is a bigger change?
> > 
> > Switching to memfd.  I thought anon shmem (mmap(MAP_SHARED)) is mostly the
> > same internally, if we create memfd then mmap(MAP_SHARED) on top of it, no?
> 
> Memfd is Linux specific, keep that in mind. Apart from that there shouldn't
> be much difference between anon shmem and memfd (there are memory commit
> differences, though).

Could you elaborate on the memory commit difference and what it implies
for QEMU's usage?

> 
> Of course, there is a difference between anon memory and shmem, for example
> regarding what virtiofsd faced (e.g., KSM) recently.

The four paths shouldn't be KSM targets, AFAICT.  None of them is a major
part of guest RAM, only very small chunks like vram or roms.

Thanks,

-- 
Peter Xu




* Re: [PATCH V2 02/13] migration: cpr-state
  2024-09-30 19:40 ` [PATCH V2 02/13] migration: cpr-state Steve Sistare
@ 2024-10-07 14:14   ` Peter Xu
  2024-10-07 19:30     ` Steven Sistare
  0 siblings, 1 reply; 79+ messages in thread
From: Peter Xu @ 2024-10-07 14:14 UTC (permalink / raw)
  To: Steve Sistare
  Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster

On Mon, Sep 30, 2024 at 12:40:33PM -0700, Steve Sistare wrote:
> CPR must save state that is needed after QEMU is restarted, when devices
> are realized.  Thus the extra state cannot be saved in the migration stream,
> as objects must already exist before that stream can be loaded.  Instead,
> define auxiliary state structures and vmstate descriptions, not associated
> with any registered object, and serialize the aux state to a cpr-specific
> stream in cpr_state_save.  Deserialize in cpr_state_load after QEMU
> restarts, before devices are realized.
> 
> Provide accessors for clients to register file descriptors for saving.
> The mechanism for passing the fd's to the new process will be specific
> to each migration mode, and added in subsequent patches.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> Reviewed-by: Fabiano Rosas <farosas@suse.de>

Only two trivial comments below.

> ---
>  include/migration/cpr.h |  26 ++++++
>  migration/cpr.c         | 217 ++++++++++++++++++++++++++++++++++++++++++++++++
>  migration/meson.build   |   1 +
>  migration/migration.c   |   6 ++
>  migration/trace-events  |   5 ++
>  system/vl.c             |   7 ++
>  6 files changed, 262 insertions(+)
>  create mode 100644 include/migration/cpr.h
>  create mode 100644 migration/cpr.c
> 
> diff --git a/include/migration/cpr.h b/include/migration/cpr.h
> new file mode 100644
> index 0000000..e7b898b
> --- /dev/null
> +++ b/include/migration/cpr.h
> @@ -0,0 +1,26 @@
> +/*
> + * Copyright (c) 2021, 2024 Oracle and/or its affiliates.
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + */
> +
> +#ifndef MIGRATION_CPR_H
> +#define MIGRATION_CPR_H
> +
> +#define QEMU_CPR_FILE_MAGIC     0x51435052
> +#define QEMU_CPR_FILE_VERSION   0x00000001
> +
> +typedef int (*cpr_walk_fd_cb)(int fd);
> +void cpr_save_fd(const char *name, int id, int fd);
> +void cpr_delete_fd(const char *name, int id);
> +int cpr_find_fd(const char *name, int id);
> +int cpr_walk_fd(cpr_walk_fd_cb cb);
> +void cpr_resave_fd(const char *name, int id, int fd);
> +
> +int cpr_state_save(Error **errp);
> +int cpr_state_load(Error **errp);
> +void cpr_state_close(void);
> +struct QIOChannel *cpr_state_ioc(void);
> +
> +#endif
> diff --git a/migration/cpr.c b/migration/cpr.c
> new file mode 100644
> index 0000000..e50fc75
> --- /dev/null
> +++ b/migration/cpr.c
> @@ -0,0 +1,217 @@
> +/*
> + * Copyright (c) 2021-2024 Oracle and/or its affiliates.
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + */
> +
> +#include "qemu/osdep.h"
> +#include "qapi/error.h"
> +#include "migration/cpr.h"
> +#include "migration/misc.h"
> +#include "migration/qemu-file.h"
> +#include "migration/savevm.h"
> +#include "migration/vmstate.h"
> +#include "sysemu/runstate.h"
> +#include "trace.h"
> +
> +/*************************************************************************/
> +/* cpr state container for all information to be saved. */
> +
> +typedef QLIST_HEAD(CprFdList, CprFd) CprFdList;
> +
> +typedef struct CprState {
> +    CprFdList fds;
> +} CprState;
> +
> +static CprState cpr_state;
> +
> +/****************************************************************************/
> +
> +typedef struct CprFd {
> +    char *name;
> +    unsigned int namelen;
> +    int id;
> +    int fd;
> +    QLIST_ENTRY(CprFd) next;
> +} CprFd;
> +
> +static const VMStateDescription vmstate_cpr_fd = {
> +    .name = "cpr fd",
> +    .version_id = 1,
> +    .minimum_version_id = 1,
> +    .fields = (VMStateField[]) {
> +        VMSTATE_UINT32(namelen, CprFd),
> +        VMSTATE_VBUFFER_ALLOC_UINT32(name, CprFd, 0, NULL, namelen),
> +        VMSTATE_INT32(id, CprFd),
> +        VMSTATE_INT32(fd, CprFd),
> +        VMSTATE_END_OF_LIST()
> +    }
> +};
> +
> +void cpr_save_fd(const char *name, int id, int fd)
> +{
> +    CprFd *elem = g_new0(CprFd, 1);
> +
> +    trace_cpr_save_fd(name, id, fd);
> +    elem->name = g_strdup(name);
> +    elem->namelen = strlen(name) + 1;
> +    elem->id = id;
> +    elem->fd = fd;
> +    QLIST_INSERT_HEAD(&cpr_state.fds, elem, next);
> +}
> +
> +static CprFd *find_fd(CprFdList *head, const char *name, int id)
> +{
> +    CprFd *elem;
> +
> +    QLIST_FOREACH(elem, head, next) {
> +        if (!strcmp(elem->name, name) && elem->id == id) {
> +            return elem;
> +        }
> +    }
> +    return NULL;
> +}
> +
> +void cpr_delete_fd(const char *name, int id)
> +{
> +    CprFd *elem = find_fd(&cpr_state.fds, name, id);
> +
> +    if (elem) {
> +        QLIST_REMOVE(elem, next);
> +        g_free(elem->name);
> +        g_free(elem);
> +    }
> +
> +    trace_cpr_delete_fd(name, id);
> +}
> +
> +int cpr_find_fd(const char *name, int id)
> +{
> +    CprFd *elem = find_fd(&cpr_state.fds, name, id);
> +    int fd = elem ? elem->fd : -1;
> +
> +    trace_cpr_find_fd(name, id, fd);
> +    return fd;
> +}
> +
> +int cpr_walk_fd(cpr_walk_fd_cb cb)
> +{
> +    CprFd *elem;
> +
> +    QLIST_FOREACH(elem, &cpr_state.fds, next) {
> +        if (elem->fd >= 0 && cb(elem->fd)) {
> +            return 1;
> +        }
> +    }
> +    return 0;
> +}
> +
> +void cpr_resave_fd(const char *name, int id, int fd)
> +{
> +    CprFd *elem = find_fd(&cpr_state.fds, name, id);
> +    int old_fd = elem ? elem->fd : -1;
> +
> +    if (old_fd < 0) {
> +        cpr_save_fd(name, id, fd);
> +    } else if (old_fd != fd) {
> +        error_setg(&error_fatal,
> +                   "internal error: cpr fd '%s' id %d value %d "
> +                   "already saved with a different value %d",
> +                   name, id, fd, old_fd);
> +    }
> +}

I think I commented on this before, maybe not: cpr_walk_fd() and
cpr_resave_fd() are not used in this series.  I suggest introducing them
only when they're used.

> +/*************************************************************************/
> +#define CPR_STATE "CprState"
> +
> +static const VMStateDescription vmstate_cpr_state = {
> +    .name = CPR_STATE,
> +    .version_id = 1,
> +    .minimum_version_id = 1,
> +    .fields = (VMStateField[]) {
> +        VMSTATE_QLIST_V(fds, CprState, 1, vmstate_cpr_fd, CprFd, next),
> +        VMSTATE_END_OF_LIST()
> +    }
> +};
> +/*************************************************************************/
> +
> +static QEMUFile *cpr_state_file;
> +
> +QIOChannel *cpr_state_ioc(void)
> +{
> +    return qemu_file_get_ioc(cpr_state_file);
> +}
> +
> +int cpr_state_save(Error **errp)
> +{
> +    int ret;
> +    QEMUFile *f;
> +
> +    /* set f based on mode in a later patch in this series */
> +    return 0;
> +
> +    qemu_put_be32(f, QEMU_CPR_FILE_MAGIC);
> +    qemu_put_be32(f, QEMU_CPR_FILE_VERSION);
> +
> +    ret = vmstate_save_state(f, &vmstate_cpr_state, &cpr_state, 0);
> +    if (ret) {
> +        error_setg(errp, "vmstate_save_state error %d", ret);
> +        qemu_fclose(f);
> +        return ret;
> +    }
> +
> +    /*
> +     * Close the socket only partially so we can later detect when the other
> +     * end closes by getting a HUP event.
> +     */
> +    qemu_fflush(f);
> +    qio_channel_shutdown(qemu_file_get_ioc(f), QIO_CHANNEL_SHUTDOWN_WRITE,
> +                         NULL);

What happens if we send everything and close immediately?

I didn't see how this cached file is used later anywhere in the whole
series.  Is it used in some follow-up series?
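
For reference, the half-close behaviour that the quoted comment relies on
can be seen in isolation with a plain socketpair.  This is only a minimal
sketch, assuming Linux AF_UNIX stream sockets; half_close_demo() is a
hypothetical name and none of this is the QEMU code path.  The sender shuts
down only its write side, the receiver reads the payload and then EOF, and
the sender sees POLLHUP only once the receiver closes its end.

```c
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical stand-alone demo (not QEMU code).  Returns 0 if the sender
 * sees no hang-up while the receiver is still open, and sees POLLHUP after
 * the receiver closes.  Returns -1 otherwise.
 */
int half_close_demo(void)
{
    int sv[2];
    char buf[8];
    struct pollfd pfd;
    int peer_open = 1;
    int ret = -1;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) {
        return -1;
    }

    /* Sender: write the payload, then shut down only the write direction. */
    if (write(sv[0], "state", 5) != 5 || shutdown(sv[0], SHUT_WR) < 0) {
        goto out;
    }

    /* Receiver: gets the payload, then EOF because of the half-close. */
    if (read(sv[1], buf, sizeof(buf)) != 5 ||
        read(sv[1], buf, sizeof(buf)) != 0) {
        goto out;
    }

    /* Receiver still holds its end open: nothing to report on the sender. */
    pfd.fd = sv[0];
    pfd.events = POLLIN;
    pfd.revents = 0;
    if (poll(&pfd, 1, 0) != 0) {
        goto out;
    }

    /* Receiver closes: the sender's socket now reports a hang-up. */
    close(sv[1]);
    peer_open = 0;
    pfd.revents = 0;
    if (poll(&pfd, 1, 1000) == 1 && (pfd.revents & POLLHUP)) {
        ret = 0;
    }

out:
    if (peer_open) {
        close(sv[1]);
    }
    close(sv[0]);
    return ret;
}
```

If the sender closed the socket outright instead, it would have no fd left
to poll, so it could not observe when the other side finishes.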

> +    cpr_state_file = f;
> +    return 0;
> +}
> +
> +int cpr_state_load(Error **errp)
> +{
> +    int ret;
> +    uint32_t v;
> +    QEMUFile *f;
> +
> +    /* set f based on mode in a later patch in this series */
> +    return 0;
> +
> +    v = qemu_get_be32(f);
> +    if (v != QEMU_CPR_FILE_MAGIC) {
> +        error_setg(errp, "Not a migration stream (bad magic %x)", v);
> +        qemu_fclose(f);
> +        return -EINVAL;
> +    }
> +    v = qemu_get_be32(f);
> +    if (v != QEMU_CPR_FILE_VERSION) {
> +        error_setg(errp, "Unsupported migration stream version %d", v);
> +        qemu_fclose(f);
> +        return -ENOTSUP;
> +    }
> +
> +    ret = vmstate_load_state(f, &vmstate_cpr_state, &cpr_state, 1);
> +    if (ret) {
> +        error_setg(errp, "vmstate_load_state error %d", ret);
> +        qemu_fclose(f);
> +        return ret;
> +    }
> +
> +    /*
> +     * Let the caller decide when to close the socket (and generate a HUP event
> +     * for the sending side).
> +     */
> +    cpr_state_file = f;
> +    return ret;
> +}
> +
> +void cpr_state_close(void)
> +{
> +    if (cpr_state_file) {
> +        qemu_fclose(cpr_state_file);
> +        cpr_state_file = NULL;
> +    }
> +}
> diff --git a/migration/meson.build b/migration/meson.build
> index 66d3de8..e5f4211 100644
> --- a/migration/meson.build
> +++ b/migration/meson.build
> @@ -13,6 +13,7 @@ system_ss.add(files(
>    'block-dirty-bitmap.c',
>    'channel.c',
>    'channel-block.c',
> +  'cpr.c',
>    'dirtyrate.c',
>    'exec.c',
>    'fd.c',
> diff --git a/migration/migration.c b/migration/migration.c
> index ae2be31..834b0a2 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -27,6 +27,7 @@
>  #include "sysemu/cpu-throttle.h"
>  #include "rdma.h"
>  #include "ram.h"
> +#include "migration/cpr.h"
>  #include "migration/global_state.h"
>  #include "migration/misc.h"
>  #include "migration.h"
> @@ -2123,6 +2124,10 @@ void qmp_migrate(const char *uri, bool has_channels,
>          }
>      }
>  
> +    if (cpr_state_save(&local_err)) {
> +        goto out;
> +    }
> +
>      if (addr->transport == MIGRATION_ADDRESS_TYPE_SOCKET) {
>          SocketAddress *saddr = &addr->u.socket;
>          if (saddr->type == SOCKET_ADDRESS_TYPE_INET ||
> @@ -2147,6 +2152,7 @@ void qmp_migrate(const char *uri, bool has_channels,
>                            MIGRATION_STATUS_FAILED);
>      }
>  
> +out:
>      if (local_err) {
>          if (!resume_requested) {
>              yank_unregister_instance(MIGRATION_YANK_INSTANCE);
> diff --git a/migration/trace-events b/migration/trace-events
> index c65902f..5356fb5 100644
> --- a/migration/trace-events
> +++ b/migration/trace-events
> @@ -341,6 +341,11 @@ colo_receive_message(const char *msg) "Receive '%s' message"
>  # colo-failover.c
>  colo_failover_set_state(const char *new_state) "new state %s"
>  
> +# cpr.c
> +cpr_save_fd(const char *name, int id, int fd) "%s, id %d, fd %d"
> +cpr_delete_fd(const char *name, int id) "%s, id %d"
> +cpr_find_fd(const char *name, int id, int fd) "%s, id %d returns %d"
> +
>  # block-dirty-bitmap.c
>  send_bitmap_header_enter(void) ""
>  send_bitmap_bits(uint32_t flags, uint64_t start_sector, uint32_t nr_sectors, uint64_t data_size) "flags: 0x%x, start_sector: %" PRIu64 ", nr_sectors: %" PRIu32 ", data_size: %" PRIu64
> diff --git a/system/vl.c b/system/vl.c
> index 752a1da..565d932 100644
> --- a/system/vl.c
> +++ b/system/vl.c
> @@ -77,6 +77,7 @@
>  #include "hw/block/block.h"
>  #include "hw/i386/x86.h"
>  #include "hw/i386/pc.h"
> +#include "migration/cpr.h"
>  #include "migration/misc.h"
>  #include "migration/snapshot.h"
>  #include "sysemu/tpm.h"
> @@ -3720,6 +3721,12 @@ void qemu_init(int argc, char **argv)
>  
>      qemu_create_machine(machine_opts_dict);
>  
> +    /*
> +     * Load incoming CPR state before any devices are created, because it
> +     * contains file descriptors that are needed in device initialization code.
> +     */
> +    cpr_state_load(&error_fatal);
> +
>      suspend_mux_open();
>  
>      qemu_disable_default_devices();
> -- 
> 1.8.3.1
> 

-- 
Peter Xu




* Re: [PATCH V2 03/13] migration: save cpr mode
  2024-09-30 19:40 ` [PATCH V2 03/13] migration: save cpr mode Steve Sistare
@ 2024-10-07 15:18   ` Peter Xu
  2024-10-07 19:31     ` Steven Sistare
  0 siblings, 1 reply; 79+ messages in thread
From: Peter Xu @ 2024-10-07 15:18 UTC (permalink / raw)
  To: Steve Sistare
  Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster

On Mon, Sep 30, 2024 at 12:40:34PM -0700, Steve Sistare wrote:
> Save the mode in CPR state, so the user does not need to explicitly specify
> it for the target.  Modify migrate_mode() so it returns the incoming mode on
> the target.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  include/migration/cpr.h |  7 +++++++
>  migration/cpr.c         | 23 ++++++++++++++++++++++-
>  migration/migration.c   |  1 +
>  migration/options.c     |  9 +++++++--
>  4 files changed, 37 insertions(+), 3 deletions(-)
> 
> diff --git a/include/migration/cpr.h b/include/migration/cpr.h
> index e7b898b..ac7a63e 100644
> --- a/include/migration/cpr.h
> +++ b/include/migration/cpr.h
> @@ -8,9 +8,16 @@
>  #ifndef MIGRATION_CPR_H
>  #define MIGRATION_CPR_H
>  
> +#include "qapi/qapi-types-migration.h"
> +
> +#define MIG_MODE_NONE           -1
> +
>  #define QEMU_CPR_FILE_MAGIC     0x51435052
>  #define QEMU_CPR_FILE_VERSION   0x00000001
>  
> +MigMode cpr_get_incoming_mode(void);
> +void cpr_set_incoming_mode(MigMode mode);
> +
>  typedef int (*cpr_walk_fd_cb)(int fd);
>  void cpr_save_fd(const char *name, int id, int fd);
>  void cpr_delete_fd(const char *name, int id);
> diff --git a/migration/cpr.c b/migration/cpr.c
> index e50fc75..7514c4e 100644
> --- a/migration/cpr.c
> +++ b/migration/cpr.c
> @@ -21,10 +21,23 @@
>  typedef QLIST_HEAD(CprFdList, CprFd) CprFdList;
>  
>  typedef struct CprState {
> +    MigMode mode;
>      CprFdList fds;
>  } CprState;
>  
> -static CprState cpr_state;
> +static CprState cpr_state = {
> +    .mode = MIG_MODE_NONE,
> +};
> +
> +MigMode cpr_get_incoming_mode(void)
> +{
> +    return cpr_state.mode;
> +}
> +
> +void cpr_set_incoming_mode(MigMode mode)
> +{
> +    cpr_state.mode = mode;
> +}
>  
>  /****************************************************************************/
>  
> @@ -124,11 +137,19 @@ void cpr_resave_fd(const char *name, int id, int fd)
>  /*************************************************************************/
>  #define CPR_STATE "CprState"
>  
> +static int cpr_state_presave(void *opaque)
> +{
> +    cpr_state.mode = migrate_mode();
> +    return 0;
> +}
> +
>  static const VMStateDescription vmstate_cpr_state = {
>      .name = CPR_STATE,
>      .version_id = 1,
>      .minimum_version_id = 1,
> +    .pre_save = cpr_state_presave,
>      .fields = (VMStateField[]) {
> +        VMSTATE_UINT32(mode, CprState),
>          VMSTATE_QLIST_V(fds, CprState, 1, vmstate_cpr_fd, CprFd, next),
>          VMSTATE_END_OF_LIST()
>      }
> diff --git a/migration/migration.c b/migration/migration.c
> index 834b0a2..df00e5c 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -416,6 +416,7 @@ void migration_incoming_state_destroy(void)
>          mis->postcopy_qemufile_dst = NULL;
>      }
>  
> +    cpr_set_incoming_mode(MIG_MODE_NONE);
>      yank_unregister_instance(MIGRATION_YANK_INSTANCE);
>  }
>  
> diff --git a/migration/options.c b/migration/options.c
> index 147cd2b..cc85a84 100644
> --- a/migration/options.c
> +++ b/migration/options.c
> @@ -22,6 +22,7 @@
>  #include "qapi/qmp/qnull.h"
>  #include "sysemu/runstate.h"
>  #include "migration/colo.h"
> +#include "migration/cpr.h"
>  #include "migration/misc.h"
>  #include "migration.h"
>  #include "migration-stats.h"
> @@ -768,8 +769,12 @@ uint64_t migrate_max_postcopy_bandwidth(void)
>  
>  MigMode migrate_mode(void)
>  {
> -    MigrationState *s = migrate_get_current();
> -    MigMode mode = s->parameters.mode;
> +    MigMode mode = cpr_get_incoming_mode();
> +
> +    if (mode == MIG_MODE_NONE) {
> +        MigrationState *s = migrate_get_current();
> +        mode = s->parameters.mode;
> +    }

Is this trying to avoid interfering with what the user specified?

I can kind of see the point, but it also looks pretty weird: the user can
set the mode, but a query made before the cpr-transfer incoming migration
completes won't read what was set previously, only what was migrated via
the cpr channel.

And IIUC the mode needs to be migrated in the cpr stream so as to avoid
yet another new QEMU cmdline option on the dest QEMU.  If true, this needs
to be mentioned in the commit message; so far it reads like it's optional,
and then it's not clear why only cpr-mode needs to be migrated and not
other migration parameters.

If that can't be made right easily, I wonder whether we could just overwrite
parameters.mode directly from the cpr stream.  After all, IIUC that's before
QMP is available, so there's no legal way to set it, and hence no legal way
for it to overwrite a user's input?
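
To make the concern concrete, here is a minimal stand-alone sketch of the
lookup order in the quoted hunk.  All names (Mode, set_incoming_mode,
set_param_mode, effective_mode) are hypothetical stand-ins, not the QEMU
symbols, and the enum values are illustrative only.

```c
#include <assert.h>

/* Hypothetical stand-in for MigMode plus the MIG_MODE_NONE sentinel. */
typedef enum {
    MODE_NONE = -1,         /* nothing received on the cpr channel */
    MODE_NORMAL = 0,
    MODE_CPR_TRANSFER,
    MODE__MAX,
} Mode;

static Mode incoming_mode = MODE_NONE;  /* filled in from the cpr stream */
static Mode param_mode = MODE_NORMAL;   /* what the user set as a parameter */

void set_incoming_mode(Mode m)
{
    incoming_mode = m;
}

void set_param_mode(Mode m)
{
    param_mode = m;
}

/* Same order as the quoted migrate_mode(): incoming value wins if present. */
Mode effective_mode(void)
{
    Mode mode = incoming_mode;

    if (mode == MODE_NONE) {
        mode = param_mode;
    }
    assert(mode >= 0 && mode < MODE__MAX);
    return mode;
}
```

With this order, a parameter value set by the user is shadowed for as long
as an incoming mode is recorded, and becomes visible again only after the
incoming mode is reset to the sentinel.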

>  
>      assert(mode >= 0 && mode < MIG_MODE__MAX);
>      return mode;
> -- 
> 1.8.3.1
> 

-- 
Peter Xu




* Re: [PATCH V2 04/13] migration: stop vm earlier for cpr
  2024-09-30 19:40 ` [PATCH V2 04/13] migration: stop vm earlier for cpr Steve Sistare
@ 2024-10-07 15:27   ` Peter Xu
  2024-10-07 20:52     ` Steven Sistare
  0 siblings, 1 reply; 79+ messages in thread
From: Peter Xu @ 2024-10-07 15:27 UTC (permalink / raw)
  To: Steve Sistare
  Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster

On Mon, Sep 30, 2024 at 12:40:35PM -0700, Steve Sistare wrote:
> Stop the vm earlier for cpr, to guarantee consistent device state when
> CPR state is saved.

Could you add some more info on why this order matters?

E.g., qmp_migrate should switch the migration state machine to SETUP, and
since this path holds the BQL, I think there's no way devices can be
hot-added concurrently during the whole process.

Could other things change in the cpr state (name, fd, etc.)?  It would be
great to mention these details in the commit message.

Thanks,

> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  migration/migration.c | 22 +++++++++++++---------
>  1 file changed, 13 insertions(+), 9 deletions(-)
> 
> diff --git a/migration/migration.c b/migration/migration.c
> index df00e5c..868bf0e 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -2082,6 +2082,7 @@ void qmp_migrate(const char *uri, bool has_channels,
>      MigrationState *s = migrate_get_current();
>      g_autoptr(MigrationChannel) channel = NULL;
>      MigrationAddress *addr = NULL;
> +    bool stopped = false;
>  
>      /*
>       * Having preliminary checks for uri and channel
> @@ -2125,6 +2126,15 @@ void qmp_migrate(const char *uri, bool has_channels,
>          }
>      }
>  
> +    if (migrate_mode_is_cpr(s)) {
> +        int ret = migration_stop_vm(s, RUN_STATE_FINISH_MIGRATE);
> +        if (ret < 0) {
> +            error_setg(&local_err, "migration_stop_vm failed, error %d", -ret);
> +            goto out;
> +        }
> +        stopped = true;
> +    }
> +
>      if (cpr_state_save(&local_err)) {
>          goto out;
>      }
> @@ -2160,6 +2170,9 @@ out:
>          }
>          migrate_fd_error(s, local_err);
>          error_propagate(errp, local_err);
> +        if (stopped) {
> +            vm_resume(s->vm_old_state);
> +        }
>          return;
>      }
>  }
> @@ -3743,7 +3756,6 @@ void migrate_fd_connect(MigrationState *s, Error *error_in)
>      Error *local_err = NULL;
>      uint64_t rate_limit;
>      bool resume = (s->state == MIGRATION_STATUS_POSTCOPY_RECOVER_SETUP);
> -    int ret;
>  
>      /*
>       * If there's a previous error, free it and prepare for another one.
> @@ -3815,14 +3827,6 @@ void migrate_fd_connect(MigrationState *s, Error *error_in)
>          return;
>      }
>  
> -    if (migrate_mode_is_cpr(s)) {
> -        ret = migration_stop_vm(s, RUN_STATE_FINISH_MIGRATE);
> -        if (ret < 0) {
> -            error_setg(&local_err, "migration_stop_vm failed, error %d", -ret);
> -            goto fail;
> -        }
> -    }
> -
>      if (migrate_background_snapshot()) {
>          qemu_thread_create(&s->thread, "mig/snapshot",
>                  bg_migration_thread, s, QEMU_THREAD_JOINABLE);
> -- 
> 1.8.3.1
> 

-- 
Peter Xu




* Re: [PATCH V2 01/13] machine: alloc-anon option
  2024-09-30 19:40 ` [PATCH V2 01/13] machine: alloc-anon option Steve Sistare
  2024-10-03 16:14   ` Peter Xu
@ 2024-10-07 15:36   ` Peter Xu
  2024-10-07 19:30     ` Steven Sistare
  1 sibling, 1 reply; 79+ messages in thread
From: Peter Xu @ 2024-10-07 15:36 UTC (permalink / raw)
  To: Steve Sistare
  Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster

On Mon, Sep 30, 2024 at 12:40:32PM -0700, Steve Sistare wrote:
> diff --git a/system/trace-events b/system/trace-events
> index 074d001..4669411 100644
> --- a/system/trace-events
> +++ b/system/trace-events
> @@ -47,3 +47,6 @@ dirtylimit_vcpu_execute(int cpu_index, int64_t sleep_time_us) "CPU[%d] sleep %"P
>  
>  # cpu-throttle.c
>  cpu_throttle_set(int new_throttle_pct)  "set guest CPU throttled by %d%%"
> +
> +#physmem.c
> +ram_block_add(const char *name, uint32_t flags, int fd, size_t used_length, size_t max_length) "%s, flags %u, fd %d, len %lu, maxlen %lu"

This breaks 32bit build:

../system/trace-events: In function ‘_nocheck__trace_ram_block_add’:
../system/trace-events:52:22: error: format ‘%lu’ expects argument of type ‘long unsigned int’, but argument 8 has type ‘size_t’ {aka ‘unsigned int’} [-Werror=format=]
   52 | ram_block_add(const char *name, uint32_t flags, int fd, size_t used_length, size_t max_length) "%s, flags %u, fd %d, len %lu, maxlen %lu"
      |                      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
......
../system/trace-events:52:22: error: format ‘%lu’ expects argument of type ‘long unsigned int’, but argument 9 has type ‘size_t’ {aka ‘unsigned int’} [-Werror=format=]
   52 | ram_block_add(const char *name, uint32_t flags, int fd, size_t used_length, size_t max_length) "%s, flags %u, fd %d, len %lu, maxlen %lu"
      |                      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
......
../system/trace-events:52:22: error: format ‘%lu’ expects argument of type ‘long unsigned int’, but argument 5 has type ‘size_t’ {aka ‘unsigned int’} [-Werror=format=]
   52 | ram_block_add(const char *name, uint32_t flags, int fd, size_t used_length, size_t max_length) "%s, flags %u, fd %d, len %lu, maxlen %lu"
      |                      ^~~~~~~~~~~~~~~~                                                                   ~~~~~~~~~~~
      |                                                                                                         |
      |                                                                                                         size_t {aka unsigned int}
../system/trace-events:52:22: error: format ‘%lu’ expects argument of type ‘long unsigned int’, but argument 6 has type ‘size_t’ {aka ‘unsigned int’} [-Werror=format=]
   52 | ram_block_add(const char *name, uint32_t flags, int fd, size_t used_length, size_t max_length) "%s, flags %u, fd %d, len %lu, maxlen %lu"
      |                      ^~~~~~~~~~~~~~~~                                                                                ~~~~~~~~~~
      |                                                                                                                      |
      |                                                                                                                      size_t {aka unsigned int}

Probably need to switch to %zu for size_t's.
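
For illustration, a stand-alone sketch of the same format string with %zu.
format_ram_block_add() is a hypothetical helper that only mirrors the
argument list of the trace event; it is not how trace events are generated.
%zu matches size_t on both 32-bit and 64-bit hosts, whereas %lu only
happens to match where size_t is unsigned long.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: formats the trace event arguments into buf. */
int format_ram_block_add(char *buf, size_t buflen, const char *name,
                         uint32_t flags, int fd,
                         size_t used_length, size_t max_length)
{
    /* %zu is the portable conversion for size_t arguments. */
    return snprintf(buf, buflen, "%s, flags %u, fd %d, len %zu, maxlen %zu",
                    name, flags, fd, used_length, max_length);
}
```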

-- 
Peter Xu




* Re: [PATCH V2 05/13] physmem: preserve ram blocks for cpr
  2024-09-30 19:40 ` [PATCH V2 05/13] physmem: preserve ram blocks " Steve Sistare
@ 2024-10-07 15:49   ` Peter Xu
  2024-10-07 16:28     ` Peter Xu
  0 siblings, 1 reply; 79+ messages in thread
From: Peter Xu @ 2024-10-07 15:49 UTC (permalink / raw)
  To: Steve Sistare, Igor Mammedov, Michael S. Tsirkin
  Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster, Igor Mammedov,
	Michael S. Tsirkin

On Mon, Sep 30, 2024 at 12:40:36PM -0700, Steve Sistare wrote:
> Save the memfd for anonymous ramblocks in CPR state, along with a name
> that uniquely identifies it.  The block's idstr is not yet set, so it
> cannot be used for this purpose.  Find the saved memfd in new QEMU when
> creating a block.  QEMU hard-codes the length of some internally-created
> blocks, so to guard against that length changing, use lseek to get the
> actual length of an incoming memfd.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  system/physmem.c | 25 ++++++++++++++++++++++++-
>  1 file changed, 24 insertions(+), 1 deletion(-)
> 
> diff --git a/system/physmem.c b/system/physmem.c
> index 174f7e0..ddbeec9 100644
> --- a/system/physmem.c
> +++ b/system/physmem.c
> @@ -72,6 +72,7 @@
>  
>  #include "qapi/qapi-types-migration.h"
>  #include "migration/options.h"
> +#include "migration/cpr.h"
>  #include "migration/vmstate.h"
>  
>  #include "qemu/range.h"
> @@ -1663,6 +1664,19 @@ void qemu_ram_unset_idstr(RAMBlock *block)
>      }
>  }
>  
> +static char *cpr_name(RAMBlock *block)
> +{
> +    MemoryRegion *mr = block->mr;
> +    const char *mr_name = memory_region_name(mr);
> +    g_autofree char *id = mr->dev ? qdev_get_dev_path(mr->dev) : NULL;
> +
> +    if (id) {
> +        return g_strdup_printf("%s/%s", id, mr_name);
> +    } else {
> +        return g_strdup(mr_name);
> +    }
> +}
> +
>  size_t qemu_ram_pagesize(RAMBlock *rb)
>  {
>      return rb->page_size;
> @@ -1858,14 +1872,18 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
>                                          TYPE_MEMORY_BACKEND)) {
>              size_t max_length = new_block->max_length;
>              MemoryRegion *mr = new_block->mr;
> -            const char *name = memory_region_name(mr);
> +            g_autofree char *name = cpr_name(new_block);
>  
>              new_block->mr->align = QEMU_VMALLOC_ALIGN;
>              new_block->flags |= RAM_SHARED;
> +            new_block->fd = cpr_find_fd(name, 0);
>  
>              if (new_block->fd == -1) {
>                  new_block->fd = qemu_memfd_create(name, max_length + mr->align,
>                                                    0, 0, 0, errp);
> +                cpr_save_fd(name, 0, new_block->fd);
> +            } else {
> +                new_block->max_length = lseek(new_block->fd, 0, SEEK_END);

So this can overwrite the max_length that the caller specified..

I remember we used to have some tricks for specifying a different max_length
for ROMs on the dest QEMU (where the QEMU firmware is also upgraded on the
dest host, so its size can be bigger than the src QEMU's old ramblocks), so
that the MR is always large enough to reload even the new firmware, while
migration only migrates the smaller size (used_length); that's fine because
we keep the extra space empty. I think that can be relevant to the
qemu_ram_resize() call in parse_ramblock().

The reload will not happen until some later point, perhaps a system reset.
I wonder whether that is an issue in this case.

+Igor +Mst for this.
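
A minimal sketch of the case in question, outside QEMU.  It assumes an
anonymous temporary file as a stand-in for the preserved memfd (lseek
behaves the same way for this purpose); fd_length(), preserved_is_smaller()
and make_sized_fd() are hypothetical names, and the comparison is not
something the patch does.

```c
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Length of the object behind fd, obtained as in the quoted hunk. */
off_t fd_length(int fd)
{
    return lseek(fd, 0, SEEK_END);
}

/*
 * Hypothetical check, not in the patch: returns 1 if the preserved object
 * is smaller than the max_length that the new process asked for.
 */
int preserved_is_smaller(int fd, off_t requested_max_length)
{
    off_t len = fd_length(fd);

    return len >= 0 && len < requested_max_length;
}

/* Stand-in for a preserved memfd: an anonymous temp file of a given size. */
int make_sized_fd(off_t size)
{
    FILE *f = tmpfile();
    int fd;

    if (!f) {
        return -1;
    }
    fd = dup(fileno(f));    /* keep a descriptor after the stream is closed */
    fclose(f);
    if (fd >= 0 && ftruncate(fd, size) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

When preserved_is_smaller() returns 1, taking the length from the fd
silently shrinks what the caller requested, which is the situation
described above for a larger firmware image on the dest.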

>              }
>  
>              if (new_block->fd >= 0) {
> @@ -1875,6 +1893,7 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
>                                                   false, 0, errp);
>              }
>              if (!new_block->host) {
> +                cpr_delete_fd(name, 0);
>                  qemu_mutex_unlock_ramlist();
>                  return;
>              }
> @@ -2182,6 +2201,8 @@ static void reclaim_ramblock(RAMBlock *block)
>  
>  void qemu_ram_free(RAMBlock *block)
>  {
> +    g_autofree char *name = NULL;
> +
>      if (!block) {
>          return;
>      }
> @@ -2192,6 +2213,8 @@ void qemu_ram_free(RAMBlock *block)
>      }
>  
>      qemu_mutex_lock_ramlist();
> +    name = cpr_name(block);
> +    cpr_delete_fd(name, 0);
>      QLIST_REMOVE_RCU(block, next);
>      ram_list.mru_block = NULL;
>      /* Write list before version */
> -- 
> 1.8.3.1
> 

-- 
Peter Xu




* Re: [PATCH V2 06/13] hostmem-memfd: preserve for cpr
  2024-09-30 19:40 ` [PATCH V2 06/13] hostmem-memfd: preserve " Steve Sistare
@ 2024-10-07 15:52   ` Peter Xu
  0 siblings, 0 replies; 79+ messages in thread
From: Peter Xu @ 2024-10-07 15:52 UTC (permalink / raw)
  To: Steve Sistare
  Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster

On Mon, Sep 30, 2024 at 12:40:37PM -0700, Steve Sistare wrote:
> Preserve memory-backend-memfd memory objects during cpr-transfer.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>

Acked-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu




* Re: [PATCH V2 07/13] migration: SCM_RIGHTS for QEMUFile
  2024-09-30 19:40 ` [PATCH V2 07/13] migration: SCM_RIGHTS for QEMUFile Steve Sistare
@ 2024-10-07 16:06   ` Peter Xu
  2024-10-07 16:35     ` Daniel P. Berrangé
  0 siblings, 1 reply; 79+ messages in thread
From: Peter Xu @ 2024-10-07 16:06 UTC (permalink / raw)
  To: Steve Sistare
  Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster

On Mon, Sep 30, 2024 at 12:40:38PM -0700, Steve Sistare wrote:
> Define functions to put/get file descriptors to/from a QEMUFile, for qio
> channels that support SCM_RIGHTS.  Maintain ordering such that
>   put(A), put(fd), put(B)
> followed by
>   get(A), get(fd), get(B)
> always succeeds.  Other get orderings may succeed but are not guaranteed.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  migration/qemu-file.c  | 83 +++++++++++++++++++++++++++++++++++++++++++++++---
>  migration/qemu-file.h  |  2 ++
>  migration/trace-events |  2 ++
>  3 files changed, 83 insertions(+), 4 deletions(-)
> 
> diff --git a/migration/qemu-file.c b/migration/qemu-file.c
> index b6d2f58..7f951ab 100644
> --- a/migration/qemu-file.c
> +++ b/migration/qemu-file.c
> @@ -37,6 +37,11 @@
>  #define IO_BUF_SIZE 32768
>  #define MAX_IOV_SIZE MIN_CONST(IOV_MAX, 64)
>  
> +typedef struct FdEntry {
> +    QTAILQ_ENTRY(FdEntry) entry;
> +    int fd;
> +} FdEntry;
> +
>  struct QEMUFile {
>      QIOChannel *ioc;
>      bool is_writable;
> @@ -51,6 +56,9 @@ struct QEMUFile {
>  
>      int last_error;
>      Error *last_error_obj;
> +
> +    bool fd_pass;
> +    QTAILQ_HEAD(, FdEntry) fds;
>  };
>  
>  /*
> @@ -109,6 +117,8 @@ static QEMUFile *qemu_file_new_impl(QIOChannel *ioc, bool is_writable)
>      object_ref(ioc);
>      f->ioc = ioc;
>      f->is_writable = is_writable;
> +    f->fd_pass = qio_channel_has_feature(ioc, QIO_CHANNEL_FEATURE_FD_PASS);
> +    QTAILQ_INIT(&f->fds);
>  
>      return f;
>  }
> @@ -310,6 +320,10 @@ static ssize_t coroutine_mixed_fn qemu_fill_buffer(QEMUFile *f)
>      int len;
>      int pending;
>      Error *local_error = NULL;
> +    g_autofree int *fds = NULL;
> +    size_t nfd = 0;
> +    int **pfds = f->fd_pass ? &fds : NULL;
> +    size_t *pnfd = f->fd_pass ? &nfd : NULL;
>  
>      assert(!qemu_file_is_writable(f));
>  
> @@ -325,10 +339,9 @@ static ssize_t coroutine_mixed_fn qemu_fill_buffer(QEMUFile *f)
>      }
>  
>      do {
> -        len = qio_channel_read(f->ioc,
> -                               (char *)f->buf + pending,
> -                               IO_BUF_SIZE - pending,
> -                               &local_error);
> +        struct iovec iov = { f->buf + pending, IO_BUF_SIZE - pending };
> +        len = qio_channel_readv_full(f->ioc, &iov, 1, pfds, pnfd, 0,
> +                                     &local_error);
>          if (len == QIO_CHANNEL_ERR_BLOCK) {
>              if (qemu_in_coroutine()) {
>                  qio_channel_yield(f->ioc, G_IO_IN);
> @@ -348,9 +361,65 @@ static ssize_t coroutine_mixed_fn qemu_fill_buffer(QEMUFile *f)
>          qemu_file_set_error_obj(f, len, local_error);
>      }
>  
> +    for (int i = 0; i < nfd; i++) {
> +        FdEntry *fde = g_new0(FdEntry, 1);
> +        fde->fd = fds[i];
> +        QTAILQ_INSERT_TAIL(&f->fds, fde, entry);
> +    }
> +
>      return len;
>  }
>  
> +int qemu_file_put_fd(QEMUFile *f, int fd)
> +{
> +    int ret = 0;
> +    QIOChannel *ioc = qemu_file_get_ioc(f);
> +    Error *err = NULL;
> +    struct iovec iov = { (void *)" ", 1 };
> +
> +    /*
> +     * Send a dummy byte so qemu_fill_buffer on the receiving side does not
> +     * fail with a len=0 error.  Flush first to maintain ordering wrt other
> +     * data.
> +     */
> +
> +    qemu_fflush(f);
> +    if (qio_channel_writev_full(ioc, &iov, 1, &fd, 1, 0, &err) < 1) {
> +        error_report_err(error_copy(err));
> +        qemu_file_set_error_obj(f, -EIO, err);
> +        ret = -1;
> +    }
> +    trace_qemu_file_put_fd(f->ioc->name, fd, ret);
> +    return ret;
> +}
> +
> +int qemu_file_get_fd(QEMUFile *f)
> +{
> +    int fd = -1;
> +    FdEntry *fde;
> +
> +    if (!f->fd_pass) {
> +        Error *err = NULL;
> +        error_setg(&err, "%s does not support fd passing", f->ioc->name);
> +        error_report_err(error_copy(err));
> +        qemu_file_set_error_obj(f, -EIO, err);
> +        goto out;
> +    }
> +
> +    /* Force the dummy byte and its fd passenger to appear. */
> +    qemu_peek_byte(f, 0);
> +
> +    fde = QTAILQ_FIRST(&f->fds);
> +    if (fde) {
> +        qemu_get_byte(f);       /* Drop the dummy byte */

Can we still try to get rid of this magical byte?

Ideally this function should check for no byte but f->fds being non-empty;
if it is, it could invoke qemu_fill_buffer(). OTOH, qemu_fill_buffer() needs
to take len==0&&nfds!=0 as legal.  Would that work?

> +        fd = fde->fd;
> +        QTAILQ_REMOVE(&f->fds, fde, entry);
> +    }
> +out:
> +    trace_qemu_file_get_fd(f->ioc->name, fd);
> +    return fd;
> +}
> +
>  /** Closes the file
>   *
>   * Returns negative error value if any error happened on previous operations or
> @@ -361,11 +430,17 @@ static ssize_t coroutine_mixed_fn qemu_fill_buffer(QEMUFile *f)
>   */
>  int qemu_fclose(QEMUFile *f)
>  {
> +    FdEntry *fde, *next;
>      int ret = qemu_fflush(f);
>      int ret2 = qio_channel_close(f->ioc, NULL);
>      if (ret >= 0) {
>          ret = ret2;
>      }
> +    QTAILQ_FOREACH_SAFE(fde, &f->fds, entry, next) {
> +        warn_report("qemu_fclose: received fd %d was never claimed", fde->fd);
> +        close(fde->fd);
> +        g_free(fde);
> +    }
>      g_clear_pointer(&f->ioc, object_unref);
>      error_free(f->last_error_obj);
>      g_free(f);
> diff --git a/migration/qemu-file.h b/migration/qemu-file.h
> index 11c2120..3e47a20 100644
> --- a/migration/qemu-file.h
> +++ b/migration/qemu-file.h
> @@ -79,5 +79,7 @@ size_t qemu_get_buffer_at(QEMUFile *f, const uint8_t *buf, size_t buflen,
>                            off_t pos);
>  
>  QIOChannel *qemu_file_get_ioc(QEMUFile *file);
> +int qemu_file_put_fd(QEMUFile *f, int fd);
> +int qemu_file_get_fd(QEMUFile *f);
>  
>  #endif
> diff --git a/migration/trace-events b/migration/trace-events
> index 5356fb5..345506b 100644
> --- a/migration/trace-events
> +++ b/migration/trace-events
> @@ -88,6 +88,8 @@ put_qlist_end(const char *field_name, const char *vmsd_name) "%s(%s)"
>  
>  # qemu-file.c
>  qemu_file_fclose(void) ""
> +qemu_file_put_fd(const char *name, int fd, int ret) "ioc %s, fd %d -> status %d"
> +qemu_file_get_fd(const char *name, int fd) "ioc %s -> fd %d"
>  
>  # ram.c
>  get_queued_page(const char *block_name, uint64_t tmp_offset, unsigned long page_abs) "%s/0x%" PRIx64 " page_abs=0x%lx"
> -- 
> 1.8.3.1
> 

-- 
Peter Xu




* Re: [PATCH V2 01/13] machine: alloc-anon option
  2024-10-04 13:24           ` Peter Xu
@ 2024-10-07 16:23             ` David Hildenbrand
  2024-10-07 19:05               ` Peter Xu
  0 siblings, 1 reply; 79+ messages in thread
From: David Hildenbrand @ 2024-10-07 16:23 UTC (permalink / raw)
  To: Peter Xu
  Cc: Steve Sistare, qemu-devel, Fabiano Rosas, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster, Igor Mammedov

On 04.10.24 15:24, Peter Xu wrote:
> On Fri, Oct 04, 2024 at 02:54:38PM +0200, David Hildenbrand wrote:
>> On 04.10.24 14:33, Peter Xu wrote:
>>> On Fri, Oct 04, 2024 at 12:14:35PM +0200, David Hildenbrand wrote:
>>>> On 03.10.24 18:14, Peter Xu wrote:
>>>>> On Mon, Sep 30, 2024 at 12:40:32PM -0700, Steve Sistare wrote:
>>>>>> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
>>>>>> on the value of the anon-alloc machine property.  This option applies to
>>>>>> memory allocated as a side effect of creating various devices. It does
>>>>>> not apply to memory-backend-objects, whether explicitly specified on
>>>>>> the command line, or implicitly created by the -m command line option.
>>>>>>
>>>>>> The memfd option is intended to support new migration modes, in which the
>>>>>> memory region can be transferred in place to a new QEMU process, by sending
>>>>>> the memfd file descriptor to the process.  Memory contents are preserved,
>>>>>> and if the mode also transfers device descriptors, then pages that are
>>>>>> locked in memory for DMA remain locked.  This behavior is a pre-requisite
>>>>>> for supporting vfio, vdpa, and iommufd devices with the new modes.
>>>>>>
>>>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>>>
>>>>> [Igor seems missing in the loop; added]
>>>>>
>>>>>> ---
>>>>>>     hw/core/machine.c   | 19 +++++++++++++++++++
>>>>>>     include/hw/boards.h |  1 +
>>>>>>     qapi/machine.json   | 14 ++++++++++++++
>>>>>>     qemu-options.hx     | 11 +++++++++++
>>>>>>     system/physmem.c    | 35 +++++++++++++++++++++++++++++++++++
>>>>>>     system/trace-events |  3 +++
>>>>>>     6 files changed, 83 insertions(+)
>>>>>>
>>>>>> diff --git a/hw/core/machine.c b/hw/core/machine.c
>>>>>> index adaba17..a89a32b 100644
>>>>>> --- a/hw/core/machine.c
>>>>>> +++ b/hw/core/machine.c
>>>>>> @@ -460,6 +460,20 @@ static void machine_set_mem_merge(Object *obj, bool value, Error **errp)
>>>>>>         ms->mem_merge = value;
>>>>>>     }
>>>>>> +static int machine_get_anon_alloc(Object *obj, Error **errp)
>>>>>> +{
>>>>>> +    MachineState *ms = MACHINE(obj);
>>>>>> +
>>>>>> +    return ms->anon_alloc;
>>>>>> +}
>>>>>> +
>>>>>> +static void machine_set_anon_alloc(Object *obj, int value, Error **errp)
>>>>>> +{
>>>>>> +    MachineState *ms = MACHINE(obj);
>>>>>> +
>>>>>> +    ms->anon_alloc = value;
>>>>>> +}
>>>>>> +
>>>>>>     static bool machine_get_usb(Object *obj, Error **errp)
>>>>>>     {
>>>>>>         MachineState *ms = MACHINE(obj);
>>>>>> @@ -1078,6 +1092,11 @@ static void machine_class_init(ObjectClass *oc, void *data)
>>>>>>         object_class_property_set_description(oc, "mem-merge",
>>>>>>             "Enable/disable memory merge support");
>>>>>> +    object_class_property_add_enum(oc, "anon-alloc", "AnonAllocOption",
>>>>>> +                                   &AnonAllocOption_lookup,
>>>>>> +                                   machine_get_anon_alloc,
>>>>>> +                                   machine_set_anon_alloc);
>>>>>> +
>>>>>>         object_class_property_add_bool(oc, "usb",
>>>>>>             machine_get_usb, machine_set_usb);
>>>>>>         object_class_property_set_description(oc, "usb",
>>>>>> diff --git a/include/hw/boards.h b/include/hw/boards.h
>>>>>> index 5966069..5a87647 100644
>>>>>> --- a/include/hw/boards.h
>>>>>> +++ b/include/hw/boards.h
>>>>>> @@ -393,6 +393,7 @@ struct MachineState {
>>>>>>         bool enable_graphics;
>>>>>>         ConfidentialGuestSupport *cgs;
>>>>>>         HostMemoryBackend *memdev;
>>>>>> +    AnonAllocOption anon_alloc;
>>>>>>         /*
>>>>>>          * convenience alias to ram_memdev_id backend memory region
>>>>>>          * or to numa container memory region
>>>>>> diff --git a/qapi/machine.json b/qapi/machine.json
>>>>>> index a6b8795..d4a63f5 100644
>>>>>> --- a/qapi/machine.json
>>>>>> +++ b/qapi/machine.json
>>>>>> @@ -1898,3 +1898,17 @@
>>>>>>     { 'command': 'x-query-interrupt-controllers',
>>>>>>       'returns': 'HumanReadableText',
>>>>>>       'features': [ 'unstable' ]}
>>>>>> +
>>>>>> +##
>>>>>> +# @AnonAllocOption:
>>>>>> +#
>>>>>> +# An enumeration of the options for allocating anonymous guest memory.
>>>>>> +#
>>>>>> +# @mmap: allocate using mmap MAP_ANON
>>>>>> +#
>>>>>> +# @memfd: allocate using memfd_create
>>>>>> +#
>>>>>> +# Since: 9.2
>>>>>> +##
>>>>>> +{ 'enum': 'AnonAllocOption',
>>>>>> +  'data': [ 'mmap', 'memfd' ] }
>>>>>> diff --git a/qemu-options.hx b/qemu-options.hx
>>>>>> index d94e2cb..90ab943 100644
>>>>>> --- a/qemu-options.hx
>>>>>> +++ b/qemu-options.hx
>>>>>> @@ -38,6 +38,7 @@ DEF("machine", HAS_ARG, QEMU_OPTION_machine, \
>>>>>>         "                nvdimm=on|off controls NVDIMM support (default=off)\n"
>>>>>>         "                memory-encryption=@var{} memory encryption object to use (default=none)\n"
>>>>>>         "                hmat=on|off controls ACPI HMAT support (default=off)\n"
>>>>>> +    "                anon-alloc=mmap|memfd allocate anonymous guest RAM using mmap MAP_ANON or memfd_create (default: mmap)\n"
>>>>>>         "                memory-backend='backend-id' specifies explicitly provided backend for main RAM (default=none)\n"
>>>>>>         "                cxl-fmw.0.targets.0=firsttarget,cxl-fmw.0.targets.1=secondtarget,cxl-fmw.0.size=size[,cxl-fmw.0.interleave-granularity=granularity]\n",
>>>>>>         QEMU_ARCH_ALL)
>>>>>> @@ -101,6 +102,16 @@ SRST
>>>>>>             Enables or disables ACPI Heterogeneous Memory Attribute Table
>>>>>>             (HMAT) support. The default is off.
>>>>>> +    ``anon-alloc=mmap|memfd``
>>>>>> +        Allocate anonymous guest RAM using mmap MAP_ANON (the default)
>>>>>> +        or memfd_create.  This option applies to memory allocated as a
>>>>>> +        side effect of creating various devices. It does not apply to
>>>>>> +        memory-backend-objects, whether explicitly specified on the
>>>>>> +        command line, or implicitly created by the -m command line
>>>>>> +        option.
>>>>>> +
>>>>>> +        Some migration modes require anon-alloc=memfd.
>>>>>> +
>>>>>>         ``memory-backend='id'``
>>>>>>             An alternative to legacy ``-mem-path`` and ``mem-prealloc`` options.
>>>>>>             Allows to use a memory backend as main RAM.
>>>>>> diff --git a/system/physmem.c b/system/physmem.c
>>>>>> index dc1db3a..174f7e0 100644
>>>>>> --- a/system/physmem.c
>>>>>> +++ b/system/physmem.c
>>>>>> @@ -47,6 +47,7 @@
>>>>>>     #include "qemu/qemu-print.h"
>>>>>>     #include "qemu/log.h"
>>>>>>     #include "qemu/memalign.h"
>>>>>> +#include "qemu/memfd.h"
>>>>>>     #include "exec/memory.h"
>>>>>>     #include "exec/ioport.h"
>>>>>>     #include "sysemu/dma.h"
>>>>>> @@ -69,6 +70,8 @@
>>>>>>     #include "qemu/pmem.h"
>>>>>> +#include "qapi/qapi-types-migration.h"
>>>>>> +#include "migration/options.h"
>>>>>>     #include "migration/vmstate.h"
>>>>>>     #include "qemu/range.h"
>>>>>> @@ -1849,6 +1852,35 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
>>>>>>                     qemu_mutex_unlock_ramlist();
>>>>>>                     return;
>>>>>>                 }
>>>>>> +
>>>>>> +        } else if (current_machine->anon_alloc == ANON_ALLOC_OPTION_MEMFD &&
>>>>>> +                   !object_dynamic_cast(new_block->mr->parent_obj.parent,
>>>>>> +                                        TYPE_MEMORY_BACKEND)) {
>>>>>
>>>>> This is pretty fragile.. if someone adds yet another layer on top of memory
>>>>> backend objects, the ownership links can change and this might silently run
>>>>> into something else even without any warning..
>>>>>
>>>>> I wished we dig into what is missing, but maybe that's too trivial.  If
>>>>> not, we still need to make this as solid.  Perhaps that can be a ram flag
>>>>> and let relevant callers pass in that flag explicitly.
>>>>
>>>> How would they decide whether or not we want to set the flag in the current
>>>> configuration?
>>>
>>> It was in the previous email where it got cut..  I listed four paths that
>>> may need change.
>>
>> That's not my question. Who would decide whether we want to set MAP_SHARED
>> in these callers or not?
>>
>> If you have "unconditionally" in mind, I think it's a bad idea. If there is
>> some other toggle to perform that setting conditionally, why not.
> 
> Yes I thought it could be unconditionally.  We can discuss downside below,
> I think we can still use a new flag otherwise, but the idea would be the
> same, where I want the flag to be explicit in the callers not implicitly
> with the object type check, which I think can be hackish.

I agree that the caller should specify it.

But I don't think using shared memory where shared memory is not 
warranted is a reasonable approach.

I'm quite surprised you're considering such changes with unclear impacts
on other OSes besides Linux (FreeBSD? Windows that doesn't even
support shared memory?) just to make one corner-case QEMU use case happy.

But I'm sure there are valid reasons why you had that idea, so I'm happy 
to learn why using shared memory unconditionally here is better than 
providing a clean alternative path with the feature enabled and memfd 
actually being supported on the setup (e.g., newer Linux kernel).

> 
>>
>>>
>>>>
>>>>>
>>>>> I think RAM_SHARED can actually be that flag already - I mean, in all paths
>>>>> that we may create anon mem (but not memory-backend-* objects), is it
>>>>> always safe we always switch to RAM_SHARED from anon?
>>>>
>>>> Do you mean only setting the flag (-> anonymous shmem) or switching also to
>>>> memfd, which is a bigger change?
>>>
>>> Switching to memfd.  I thought anon shmem (mmap(MAP_SHARED)) is mostly the
>>> same internally, if we create memfd then mmap(MAP_SHARED) on top of it, no?
>>
>> Memfd is Linux specific, keep that in mind. Apart from that there shouldn't
>> be much difference between anon shmem and memfd (there are memory commit
>> differences, though).
> 
> Could you elaborate the memory commit difference and what does that imply
> to QEMU's usage?

Note how the memfd code passes VM_NORESERVE to shmem_file_setup(), while
shmem_zero_setup() effectively doesn't (unless MAP_NORESERVE was
specified IIRC).

Not sure if the change makes a big impact in QEMU's usage, it's just one 
of these differences between memfd and shared anonymous memory. 
(responding to your "mostly the same").

> 
>>
>> Of course, there is a difference between anon memory and shmem, for example
>> regarding what virtiofsd faced (e.g., KSM) recently.
> 
> The four paths shouldn't be KSM target, AFAICT.

Do you have a good overview of what is deduplicated in practice and why 
these don't apply? For example, I thought these functions are also used 
for hosting the BIOS, and that might just be deduplicated between VMs?

Anyhow, there are obviously other differences with shmem vs. anonymous 
(THP handling, page fault performance, userfaultfd compatibility on 
older kernels) at least on Linux, but I have absolutely no clue how that 
would differ on other host OSes.

None of them are major.

This is probably going to result in a bigger discussion, for which I 
don't have any time. So my opinion on it is above.

Anyhow, this sounds like a suggestion I wouldn't advise Steve to
actually implement.


-- 
Cheers,

David / dhildenb




* Re: [PATCH V2 05/13] physmem: preserve ram blocks for cpr
  2024-10-07 15:49   ` Peter Xu
@ 2024-10-07 16:28     ` Peter Xu
  2024-10-08 15:17       ` Steven Sistare
  0 siblings, 1 reply; 79+ messages in thread
From: Peter Xu @ 2024-10-07 16:28 UTC (permalink / raw)
  To: Steve Sistare, Igor Mammedov, Michael S. Tsirkin
  Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster

On Mon, Oct 07, 2024 at 11:49:25AM -0400, Peter Xu wrote:
> On Mon, Sep 30, 2024 at 12:40:36PM -0700, Steve Sistare wrote:
> > Save the memfd for anonymous ramblocks in CPR state, along with a name
> > that uniquely identifies it.  The block's idstr is not yet set, so it
> > cannot be used for this purpose.  Find the saved memfd in new QEMU when
> > creating a block.  QEMU hard-codes the length of some internally-created
> > blocks, so to guard against that length changing, use lseek to get the
> > actual length of an incoming memfd.
> > 
> > Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> > ---
> >  system/physmem.c | 25 ++++++++++++++++++++++++-
> >  1 file changed, 24 insertions(+), 1 deletion(-)
> > 
> > diff --git a/system/physmem.c b/system/physmem.c
> > index 174f7e0..ddbeec9 100644
> > --- a/system/physmem.c
> > +++ b/system/physmem.c
> > @@ -72,6 +72,7 @@
> >  
> >  #include "qapi/qapi-types-migration.h"
> >  #include "migration/options.h"
> > +#include "migration/cpr.h"
> >  #include "migration/vmstate.h"
> >  
> >  #include "qemu/range.h"
> > @@ -1663,6 +1664,19 @@ void qemu_ram_unset_idstr(RAMBlock *block)
> >      }
> >  }
> >  
> > +static char *cpr_name(RAMBlock *block)
> > +{
> > +    MemoryRegion *mr = block->mr;
> > +    const char *mr_name = memory_region_name(mr);
> > +    g_autofree char *id = mr->dev ? qdev_get_dev_path(mr->dev) : NULL;
> > +
> > +    if (id) {
> > +        return g_strdup_printf("%s/%s", id, mr_name);
> > +    } else {
> > +        return g_strdup(mr_name);
> > +    }
> > +}
> > +
> >  size_t qemu_ram_pagesize(RAMBlock *rb)
> >  {
> >      return rb->page_size;
> > @@ -1858,14 +1872,18 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
> >                                          TYPE_MEMORY_BACKEND)) {
> >              size_t max_length = new_block->max_length;
> >              MemoryRegion *mr = new_block->mr;
> > -            const char *name = memory_region_name(mr);
> > +            g_autofree char *name = cpr_name(new_block);
> >  
> >              new_block->mr->align = QEMU_VMALLOC_ALIGN;
> >              new_block->flags |= RAM_SHARED;
> > +            new_block->fd = cpr_find_fd(name, 0);
> >  
> >              if (new_block->fd == -1) {
> >                  new_block->fd = qemu_memfd_create(name, max_length + mr->align,
> >                                                    0, 0, 0, errp);
> > +                cpr_save_fd(name, 0, new_block->fd);
> > +            } else {
> > +                new_block->max_length = lseek(new_block->fd, 0, SEEK_END);
> 
> So this can overwrite the max_length that the caller specified..
> 
> I remember we used to have some tricks for specifying a different max_length
> for ROMs on the dest QEMU (where the firmware may also have been upgraded,
> so its size can be bigger than the src QEMU's old ramblocks), so that the
> MR is always large enough to reload even the new firmware, while migration
> only migrates the smaller size (used_length), which is fine as we keep the
> extra space empty. I think that can be relevant to the qemu_ram_resize()
> call in parse_ramblock().
> 
> The reload will not happen until some later point, perhaps a system reset.
> I wonder whether that is an issue in this case.
> 
> +Igor +Mst for this.

PS: If this is needed by CPR-transfer only because mmap() later can fail
due to a bigger max_length, I wonder whether it can be fixed by passing
truncate=true in the upcoming file_ram_alloc(), rather than overwriting
the max_length value itself.
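
For illustration only: a minimal standalone sketch, assuming a Linux host,
of the two policies being contrasted. It is not QEMU code and does not
reproduce file_ram_alloc(); make_memfd(), length_from_fd() and
length_grow_to() are hypothetical names. The first policy adopts whatever
size the incoming memfd has, as the lseek() in the patch does. The second
keeps the caller's length and grows the file when it is shorter, which is
loosely the idea behind the truncate suggestion. The raw syscall is used
only so the sketch builds without _GNU_SOURCE.

```c
#include <assert.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Create a memfd of the given size, as the sending side would. */
int make_memfd(const char *name, off_t size)
{
    int fd;

    assert(size >= 0);
    fd = (int)syscall(SYS_memfd_create, name, 0u);
    if (fd < 0) {
        return -1;
    }
    if (ftruncate(fd, size) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Policy 1: adopt the size of the incoming file, ignoring the caller. */
off_t length_from_fd(int fd)
{
    return lseek(fd, 0, SEEK_END);
}

/*
 * Policy 2 (hypothetical): keep the caller's length and grow the file if
 * it is shorter.  Growing with ftruncate() preserves existing contents
 * and zero-fills the new tail.  Returns the resulting file size, or -1.
 */
off_t length_grow_to(int fd, off_t caller_max)
{
    off_t cur = lseek(fd, 0, SEEK_END);

    if (cur < 0) {
        return -1;
    }
    if (cur >= caller_max) {
        return cur;
    }
    return ftruncate(fd, caller_max) == 0 ? caller_max : -1;
}
```

With a 4096-byte memfd and a caller asking for 8192, the first policy
reports 4096 while the second leaves an 8192-byte file behind.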

> 
> >              }
> >  
> >              if (new_block->fd >= 0) {
> > @@ -1875,6 +1893,7 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
> >                                                   false, 0, errp);
> >              }
> >              if (!new_block->host) {
> > +                cpr_delete_fd(name, 0);
> >                  qemu_mutex_unlock_ramlist();
> >                  return;
> >              }
> > @@ -2182,6 +2201,8 @@ static void reclaim_ramblock(RAMBlock *block)
> >  
> >  void qemu_ram_free(RAMBlock *block)
> >  {
> > +    g_autofree char *name = NULL;
> > +
> >      if (!block) {
> >          return;
> >      }
> > @@ -2192,6 +2213,8 @@ void qemu_ram_free(RAMBlock *block)
> >      }
> >  
> >      qemu_mutex_lock_ramlist();
> > +    name = cpr_name(block);
> > +    cpr_delete_fd(name, 0);
> >      QLIST_REMOVE_RCU(block, next);
> >      ram_list.mru_block = NULL;
> >      /* Write list before version */
> > -- 
> > 1.8.3.1
> > 
> 
> -- 
> Peter Xu

-- 
Peter Xu




* Re: [PATCH V2 07/13] migration: SCM_RIGHTS for QEMUFile
  2024-10-07 16:06   ` Peter Xu
@ 2024-10-07 16:35     ` Daniel P. Berrangé
  2024-10-07 18:12       ` Peter Xu
  0 siblings, 1 reply; 79+ messages in thread
From: Daniel P. Berrangé @ 2024-10-07 16:35 UTC (permalink / raw)
  To: Peter Xu
  Cc: Steve Sistare, qemu-devel, Fabiano Rosas, David Hildenbrand,
	Marcel Apfelbaum, Eduardo Habkost, Philippe Mathieu-Daude,
	Paolo Bonzini, Markus Armbruster

On Mon, Oct 07, 2024 at 12:06:21PM -0400, Peter Xu wrote:
> On Mon, Sep 30, 2024 at 12:40:38PM -0700, Steve Sistare wrote:
> > Define functions to put/get file descriptors to/from a QEMUFile, for qio
> > channels that support SCM_RIGHTS.  Maintain ordering such that
> >   put(A), put(fd), put(B)
> > followed by
> >   get(A), get(fd), get(B)
> > always succeeds.  Other get orderings may succeed but are not guaranteed.
> > 
> > Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> > ---
> >  migration/qemu-file.c  | 83 +++++++++++++++++++++++++++++++++++++++++++++++---
> >  migration/qemu-file.h  |  2 ++
> >  migration/trace-events |  2 ++
> >  3 files changed, 83 insertions(+), 4 deletions(-)
> > 
> > diff --git a/migration/qemu-file.c b/migration/qemu-file.c
> > index b6d2f58..7f951ab 100644
> > --- a/migration/qemu-file.c
> > +++ b/migration/qemu-file.c
> > @@ -37,6 +37,11 @@
> >  #define IO_BUF_SIZE 32768
> >  #define MAX_IOV_SIZE MIN_CONST(IOV_MAX, 64)
> >  
> > +typedef struct FdEntry {
> > +    QTAILQ_ENTRY(FdEntry) entry;
> > +    int fd;
> > +} FdEntry;
> > +
> >  struct QEMUFile {
> >      QIOChannel *ioc;
> >      bool is_writable;
> > @@ -51,6 +56,9 @@ struct QEMUFile {
> >  
> >      int last_error;
> >      Error *last_error_obj;
> > +
> > +    bool fd_pass;
> > +    QTAILQ_HEAD(, FdEntry) fds;
> >  };
> >  
> >  /*
> > @@ -109,6 +117,8 @@ static QEMUFile *qemu_file_new_impl(QIOChannel *ioc, bool is_writable)
> >      object_ref(ioc);
> >      f->ioc = ioc;
> >      f->is_writable = is_writable;
> > +    f->fd_pass = qio_channel_has_feature(ioc, QIO_CHANNEL_FEATURE_FD_PASS);
> > +    QTAILQ_INIT(&f->fds);
> >  
> >      return f;
> >  }
> > @@ -310,6 +320,10 @@ static ssize_t coroutine_mixed_fn qemu_fill_buffer(QEMUFile *f)
> >      int len;
> >      int pending;
> >      Error *local_error = NULL;
> > +    g_autofree int *fds = NULL;
> > +    size_t nfd = 0;
> > +    int **pfds = f->fd_pass ? &fds : NULL;
> > +    size_t *pnfd = f->fd_pass ? &nfd : NULL;
> >  
> >      assert(!qemu_file_is_writable(f));
> >  
> > @@ -325,10 +339,9 @@ static ssize_t coroutine_mixed_fn qemu_fill_buffer(QEMUFile *f)
> >      }
> >  
> >      do {
> > -        len = qio_channel_read(f->ioc,
> > -                               (char *)f->buf + pending,
> > -                               IO_BUF_SIZE - pending,
> > -                               &local_error);
> > +        struct iovec iov = { f->buf + pending, IO_BUF_SIZE - pending };
> > +        len = qio_channel_readv_full(f->ioc, &iov, 1, pfds, pnfd, 0,
> > +                                     &local_error);
> >          if (len == QIO_CHANNEL_ERR_BLOCK) {
> >              if (qemu_in_coroutine()) {
> >                  qio_channel_yield(f->ioc, G_IO_IN);
> > @@ -348,9 +361,65 @@ static ssize_t coroutine_mixed_fn qemu_fill_buffer(QEMUFile *f)
> >          qemu_file_set_error_obj(f, len, local_error);
> >      }
> >  
> > +    for (int i = 0; i < nfd; i++) {
> > +        FdEntry *fde = g_new0(FdEntry, 1);
> > +        fde->fd = fds[i];
> > +        QTAILQ_INSERT_TAIL(&f->fds, fde, entry);
> > +    }
> > +
> >      return len;
> >  }
> >  
> > +int qemu_file_put_fd(QEMUFile *f, int fd)
> > +{
> > +    int ret = 0;
> > +    QIOChannel *ioc = qemu_file_get_ioc(f);
> > +    Error *err = NULL;
> > +    struct iovec iov = { (void *)" ", 1 };
> > +
> > +    /*
> > +     * Send a dummy byte so qemu_fill_buffer on the receiving side does not
> > +     * fail with a len=0 error.  Flush first to maintain ordering wrt other
> > +     * data.
> > +     */
> > +
> > +    qemu_fflush(f);
> > +    if (qio_channel_writev_full(ioc, &iov, 1, &fd, 1, 0, &err) < 1) {
> > +        error_report_err(error_copy(err));
> > +        qemu_file_set_error_obj(f, -EIO, err);
> > +        ret = -1;
> > +    }
> > +    trace_qemu_file_put_fd(f->ioc->name, fd, ret);
> > +    return ret;
> > +}
> > +
> > +int qemu_file_get_fd(QEMUFile *f)
> > +{
> > +    int fd = -1;
> > +    FdEntry *fde;
> > +
> > +    if (!f->fd_pass) {
> > +        Error *err = NULL;
> > +        error_setg(&err, "%s does not support fd passing", f->ioc->name);
> > +        error_report_err(error_copy(err));
> > +        qemu_file_set_error_obj(f, -EIO, err);
> > +        goto out;
> > +    }
> > +
> > +    /* Force the dummy byte and its fd passenger to appear. */
> > +    qemu_peek_byte(f, 0);
> > +
> > +    fde = QTAILQ_FIRST(&f->fds);
> > +    if (fde) {
> > +        qemu_get_byte(f);       /* Drop the dummy byte */
> 
> Can we still try to get rid of this magical byte?
> 
> Ideally this function should check for no byte but f->fds being non-empty;
> if it is, it could invoke qemu_fill_buffer(). OTOH, qemu_fill_buffer() needs
> to take len==0&&nfds!=0 as legal.  Would that work?

When passing ancillary data with sendmsg/recvmsg, on Linux at least,
it is required that there is at least 1 byte of data present.

See 'man 7 unix':

[quote]
   At  least  one  byte  of real data should be sent when
   sending ancillary data.  On Linux, this is required to
   successfully send ancillary data over  a  UNIX  domain
   stream  socket.   When  sending  ancillary data over a
   UNIX domain datagram socket, it is  not  necessary  on
   Linux  to  send  any accompanying real data.  However,
   portable applications should also include at least one
   byte of real data when sending ancillary data  over  a
   datagram socket.
[/quote]

So if your protocol doesn't already have a convenient bit of real data to
attach the SCM_RIGHTS data to, it is common practice to send a single dummy
data byte that is discarded.
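
For illustration only: a minimal standalone sketch of that practice using
plain sendmsg/recvmsg on a UNIX stream socket. It is not the QEMU channel
code; send_fd(), recv_fd() and demo_fd_roundtrip() are hypothetical names.
The sender attaches the fd to one dummy byte, and the receiver reads that
byte, discards it, and extracts the fd from the SCM_RIGHTS control message.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer sized and aligned for exactly one file descriptor. */
union fd_cmsg {
    char buf[CMSG_SPACE(sizeof(int))];
    struct cmsghdr align;
};

/* Send fd over a connected UNIX socket, riding on one dummy data byte. */
int send_fd(int sock, int fd)
{
    char dummy = ' ';
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union fd_cmsg u;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    assert(fd >= 0);
    memset(&u, 0, sizeof(u));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one fd; the accompanying dummy byte is read and discarded. */
int recv_fd(int sock)
{
    char dummy;
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union fd_cmsg u;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd;

    memset(&u, 0, sizeof(u));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    if (recvmsg(sock, &msg, 0) != 1) {
        return -1;
    }
    cmsg = CMSG_FIRSTHDR(&msg);
    if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int))) {
        return -1;
    }
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}

/*
 * Pass the read end of a pipe across a socketpair, then read through the
 * received descriptor.  Returns 0 on success, -1 otherwise.
 */
int demo_fd_roundtrip(void)
{
    int sv[2], p[2], rfd, ok;
    char c = 0;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) {
        return -1;
    }
    if (pipe(p) < 0) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }
    rfd = send_fd(sv[0], p[0]) == 0 ? recv_fd(sv[1]) : -1;
    ok = rfd >= 0 && write(p[1], "x", 1) == 1 &&
         read(rfd, &c, 1) == 1 && c == 'x';
    if (rfd >= 0) {
        close(rfd);
    }
    close(p[0]);
    close(p[1]);
    close(sv[0]);
    close(sv[1]);
    return ok ? 0 : -1;
}
```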

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|




* Re: [PATCH V2 08/13] migration: VMSTATE_FD
  2024-09-30 19:40 ` [PATCH V2 08/13] migration: VMSTATE_FD Steve Sistare
@ 2024-10-07 16:36   ` Peter Xu
  2024-10-07 19:31     ` Steven Sistare
  0 siblings, 1 reply; 79+ messages in thread
From: Peter Xu @ 2024-10-07 16:36 UTC (permalink / raw)
  To: Steve Sistare
  Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster

On Mon, Sep 30, 2024 at 12:40:39PM -0700, Steve Sistare wrote:
> Define VMSTATE_FD for declaring a file descriptor field in a
> VMStateDescription.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  include/migration/vmstate.h |  9 +++++++++
>  migration/vmstate-types.c   | 32 ++++++++++++++++++++++++++++++++
>  2 files changed, 41 insertions(+)
> 
> diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
> index f313f2f..a1dfab4 100644
> --- a/include/migration/vmstate.h
> +++ b/include/migration/vmstate.h
> @@ -230,6 +230,7 @@ extern const VMStateInfo vmstate_info_uint8;
>  extern const VMStateInfo vmstate_info_uint16;
>  extern const VMStateInfo vmstate_info_uint32;
>  extern const VMStateInfo vmstate_info_uint64;
> +extern const VMStateInfo vmstate_info_fd;
>  
>  /** Put this in the stream when migrating a null pointer.*/
>  #define VMS_NULLPTR_MARKER (0x30U) /* '0' */
> @@ -902,6 +903,9 @@ extern const VMStateInfo vmstate_info_qlist;
>  #define VMSTATE_UINT64_V(_f, _s, _v)                                  \
>      VMSTATE_SINGLE(_f, _s, _v, vmstate_info_uint64, uint64_t)
>  
> +#define VMSTATE_FD_V(_f, _s, _v)                                  \
> +    VMSTATE_SINGLE(_f, _s, _v, vmstate_info_fd, int32_t)
> +
>  #ifdef CONFIG_LINUX
>  
>  #define VMSTATE_U8_V(_f, _s, _v)                                   \
> @@ -936,6 +940,9 @@ extern const VMStateInfo vmstate_info_qlist;
>  #define VMSTATE_UINT64(_f, _s)                                        \
>      VMSTATE_UINT64_V(_f, _s, 0)
>  
> +#define VMSTATE_FD(_f, _s)                                            \
> +    VMSTATE_FD_V(_f, _s, 0)
> +
>  #ifdef CONFIG_LINUX
>  
>  #define VMSTATE_U8(_f, _s)                                         \
> @@ -1009,6 +1016,8 @@ extern const VMStateInfo vmstate_info_qlist;
>  #define VMSTATE_UINT64_TEST(_f, _s, _t)                                  \
>      VMSTATE_SINGLE_TEST(_f, _s, _t, 0, vmstate_info_uint64, uint64_t)
>  
> +#define VMSTATE_FD_TEST(_f, _s, _t)                                            \
> +    VMSTATE_SINGLE_TEST(_f, _s, _t, 0, vmstate_info_fd, int32_t)
>  
>  #define VMSTATE_TIMER_PTR_TEST(_f, _s, _test)                             \
>      VMSTATE_POINTER_TEST(_f, _s, _test, vmstate_info_timer, QEMUTimer *)
> diff --git a/migration/vmstate-types.c b/migration/vmstate-types.c
> index e83bfcc..6e45a4a 100644
> --- a/migration/vmstate-types.c
> +++ b/migration/vmstate-types.c
> @@ -314,6 +314,38 @@ const VMStateInfo vmstate_info_uint64 = {
>      .put  = put_uint64,
>  };
>  
> +/* File descriptor communicated via SCM_RIGHTS */
> +
> +static int get_fd(QEMUFile *f, void *pv, size_t size,
> +                  const VMStateField *field)
> +{
> +    int32_t *v = pv;
> +    qemu_get_sbe32s(f, v);

Why do we need to send/recv the fd integer alone?  Can't that change anyway
across migration?  What happens if we drop this (and the put side)?

> +    if (*v < 0) {
> +        return 0;
> +    }
> +    *v = qemu_file_get_fd(f);
> +    return 0;
> +}
> +
> +static int put_fd(QEMUFile *f, void *pv, size_t size,
> +                  const VMStateField *field, JSONWriter *vmdesc)
> +{
> +    int32_t *v = pv;
> +
> +    qemu_put_sbe32s(f, v);
> +    if (*v < 0) {
> +        return 0;
> +    }
> +    return qemu_file_put_fd(f, *v);
> +}
> +
> +const VMStateInfo vmstate_info_fd = {
> +    .name = "fd",
> +    .get  = get_fd,
> +    .put  = put_fd,
> +};
> +
>  static int get_nullptr(QEMUFile *f, void *pv, size_t size,
>                         const VMStateField *field)
>  
> -- 
> 1.8.3.1
> 

-- 
Peter Xu




* Re: [PATCH V2 09/13] migration: cpr-transfer save and load
  2024-09-30 19:40 ` [PATCH V2 09/13] migration: cpr-transfer save and load Steve Sistare
@ 2024-10-07 16:47   ` Peter Xu
  2024-10-07 19:31     ` Steven Sistare
  0 siblings, 1 reply; 79+ messages in thread
From: Peter Xu @ 2024-10-07 16:47 UTC (permalink / raw)
  To: Steve Sistare
  Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster

On Mon, Sep 30, 2024 at 12:40:40PM -0700, Steve Sistare wrote:
> Add functions to create a QEMUFile based on a unix URI, for saving or
> loading, for use by cpr-transfer mode to preserve CPR state.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>

Reviewed-by: Peter Xu <peterx@redhat.com>

There're a few extra newlines below, though, which could be removed.

> ---
>  include/migration/cpr.h  |  3 ++
>  migration/cpr-transfer.c | 81 ++++++++++++++++++++++++++++++++++++++++++++++++
>  migration/meson.build    |  1 +
>  3 files changed, 85 insertions(+)
>  create mode 100644 migration/cpr-transfer.c
> 
> diff --git a/include/migration/cpr.h b/include/migration/cpr.h
> index ac7a63e..51c19ed 100644
> --- a/include/migration/cpr.h
> +++ b/include/migration/cpr.h
> @@ -30,4 +30,7 @@ int cpr_state_load(Error **errp);
>  void cpr_state_close(void);
>  struct QIOChannel *cpr_state_ioc(void);
>  
> +QEMUFile *cpr_transfer_output(const char *uri, Error **errp);
> +QEMUFile *cpr_transfer_input(const char *uri, Error **errp);
> +
>  #endif
> diff --git a/migration/cpr-transfer.c b/migration/cpr-transfer.c
> new file mode 100644
> index 0000000..fb9ecd8
> --- /dev/null
> +++ b/migration/cpr-transfer.c
> @@ -0,0 +1,81 @@
> +/*
> + * Copyright (c) 2022, 2024 Oracle and/or its affiliates.
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + */
> +
> +#include "qemu/osdep.h"
> +#include "qapi/error.h"
> +#include "io/channel-file.h"
> +#include "io/channel-socket.h"
> +#include "io/net-listener.h"
> +#include "migration/cpr.h"
> +#include "migration/migration.h"
> +#include "migration/savevm.h"
> +#include "migration/qemu-file.h"
> +#include "migration/vmstate.h"
> +
> +QEMUFile *cpr_transfer_output(const char *uri, Error **errp)
> +{
> +    g_autoptr(MigrationChannel) channel = NULL;
> +    QIOChannel *ioc;
> +
> +    if (!migrate_uri_parse(uri, &channel, errp)) {
> +        return NULL;
> +    }
> +
> +    if (channel->addr->transport == MIGRATION_ADDRESS_TYPE_SOCKET &&
> +        channel->addr->u.socket.type == SOCKET_ADDRESS_TYPE_UNIX) {
> +

here

> +        QIOChannelSocket *sioc = qio_channel_socket_new();
> +        SocketAddress *saddr = &channel->addr->u.socket;
> +
> +        if (qio_channel_socket_connect_sync(sioc, saddr, errp)) {
> +            object_unref(OBJECT(sioc));
> +            return NULL;
> +        }
> +        ioc = QIO_CHANNEL(sioc);
> +

here

> +    } else {
> +        error_setg(errp, "bad cpr-uri %s; must be unix:", uri);
> +        return NULL;
> +    }
> +
> +    qio_channel_set_name(ioc, "cpr-out");
> +    return qemu_file_new_output(ioc);
> +}
> +
> +QEMUFile *cpr_transfer_input(const char *uri, Error **errp)
> +{
> +    g_autoptr(MigrationChannel) channel = NULL;
> +    QIOChannel *ioc;
> +
> +    if (!migrate_uri_parse(uri, &channel, errp)) {
> +        return NULL;
> +    }
> +
> +    if (channel->addr->transport == MIGRATION_ADDRESS_TYPE_SOCKET &&
> +        channel->addr->u.socket.type == SOCKET_ADDRESS_TYPE_UNIX) {
> +

here

> +        QIOChannelSocket *sioc;
> +        SocketAddress *saddr = &channel->addr->u.socket;
> +        QIONetListener *listener = qio_net_listener_new();
> +
> +        qio_net_listener_set_name(listener, "cpr-socket-listener");
> +        if (qio_net_listener_open_sync(listener, saddr, 1, errp) < 0) {
> +            object_unref(OBJECT(listener));
> +            return NULL;
> +        }
> +
> +        sioc = qio_net_listener_wait_client(listener);
> +        ioc = QIO_CHANNEL(sioc);
> +

here

> +    } else {
> +        error_setg(errp, "bad cpr-uri %s; must be unix:", uri);
> +        return NULL;
> +    }
> +
> +    qio_channel_set_name(ioc, "cpr-in");
> +    return qemu_file_new_input(ioc);
> +}
> diff --git a/migration/meson.build b/migration/meson.build
> index e5f4211..684ba98 100644
> --- a/migration/meson.build
> +++ b/migration/meson.build
> @@ -14,6 +14,7 @@ system_ss.add(files(
>    'channel.c',
>    'channel-block.c',
>    'cpr.c',
> +  'cpr-transfer.c',
>    'dirtyrate.c',
>    'exec.c',
>    'fd.c',
> -- 
> 1.8.3.1
> 

-- 
Peter Xu




* Re: [PATCH V2 10/13] migration: cpr-uri parameter
  2024-09-30 19:40 ` [PATCH V2 10/13] migration: cpr-uri parameter Steve Sistare
@ 2024-10-07 16:49   ` Peter Xu
  0 siblings, 0 replies; 79+ messages in thread
From: Peter Xu @ 2024-10-07 16:49 UTC (permalink / raw)
  To: Steve Sistare
  Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster

On Mon, Sep 30, 2024 at 12:40:41PM -0700, Steve Sistare wrote:
> Define the cpr-uri migration parameter to specify the URI to which
> CPR vmstate is saved for cpr-transfer mode.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>

Acked-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu




* Re: [PATCH V2 11/13] migration: cpr-uri option
  2024-09-30 19:40 ` [PATCH V2 11/13] migration: cpr-uri option Steve Sistare
@ 2024-10-07 16:50   ` Peter Xu
  0 siblings, 0 replies; 79+ messages in thread
From: Peter Xu @ 2024-10-07 16:50 UTC (permalink / raw)
  To: Steve Sistare
  Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster

On Mon, Sep 30, 2024 at 12:40:42PM -0700, Steve Sistare wrote:
> Define the cpr-uri QEMU command-line option to specify the URI from
> which CPR vmstate is loaded for cpr-transfer mode.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>

Acked-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu




* Re: [PATCH V2 07/13] migration: SCM_RIGHTS for QEMUFile
  2024-10-07 16:35     ` Daniel P. Berrangé
@ 2024-10-07 18:12       ` Peter Xu
  0 siblings, 0 replies; 79+ messages in thread
From: Peter Xu @ 2024-10-07 18:12 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Steve Sistare, qemu-devel, Fabiano Rosas, David Hildenbrand,
	Marcel Apfelbaum, Eduardo Habkost, Philippe Mathieu-Daude,
	Paolo Bonzini, Markus Armbruster

On Mon, Oct 07, 2024 at 05:35:37PM +0100, Daniel P. Berrangé wrote:
> See 'man 7 unix':
> 
> [quote]
>    At  least  one  byte  of real data should be sent when
>    sending ancillary data.  On Linux, this is required to
>    successfully send ancillary data over  a  UNIX  domain
>    stream  socket.   When  sending  ancillary data over a
>    UNIX domain datagram socket, it is  not  necessary  on
>    Linux  to  send  any accompanying real data.  However,
>    portable applications should also include at least one
>    byte of real data when sending ancillary data  over  a
>    datagram socket.
> [/quote]
> 
> So if your protocol doesn't already have a convenient bit of real data to
> attach the SCM_RIGHTS data to, it is common practice to send a single dummy
> data byte that is discarded.

Ah OK, thanks.  Then maybe we can still consider dropping the initial four
bytes for fd migrations; I left the other comment in the next patch.

-- 
Peter Xu




* Re: [PATCH V2 01/13] machine: alloc-anon option
  2024-10-07 16:23             ` David Hildenbrand
@ 2024-10-07 19:05               ` Peter Xu
  0 siblings, 0 replies; 79+ messages in thread
From: Peter Xu @ 2024-10-07 19:05 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Steve Sistare, qemu-devel, Fabiano Rosas, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster, Igor Mammedov

On Mon, Oct 07, 2024 at 06:23:26PM +0200, David Hildenbrand wrote:
> > Yes I thought it could be unconditionally.  We can discuss downside below,
> > I think we can still use a new flag otherwise, but the idea would be the
> > same, where I want the flag to be explicit in the callers not implicitly
> > with the object type check, which I think can be hackish.
> 
> I agree that the caller should specify it.
> 
> But I don't think using shared memory where shared memory is not warranted
> is a reasonable approach.
> 
> I'm quite surprised you're considering such changes with unclear impacts on
> other OSes besides Linux (FreeBSD? Windows that doesn't even support
> shared memory?) just to make one corner-case QEMU use case happy.
> 
> But I'm sure there are valid reasons why you had that idea, so I'm happy to
> learn why using shared memory unconditionally here is better than providing
> a clean alternative path with the feature enabled and memfd actually being
> supported on the setup (e.g., newer Linux kernel).

I was thinking whether cpr-transfer can be enabled by default, so whenever
people want to use it, no code reset needed.  It's also easier for Libvirt
to not need to add yet another machine flag if possible.

Currently this parameter is the only one left that needs to be manually
enabled on src.  It means if we can get rid of it then any QEMU based VM on
Linux can do a cpr-transfer any time as long as QEMU supports it.

Without it, this new parameter will need to be manually enabled otherwise
another system code reboot / live migration needs to happen first without
CPR, just to enable this flag.

But yeah, I don't think I thought it all through, so I left it as a pure
question.  I think it still looks like an option; the other option, if we
still want to enable it by default, is to keep the option and only enable it
on new machines that are based on Linux.

OS dependency is definitely an issue.  AFAICT CPR is only available for
Linux anyway, but I'm happy to be corrected..  IOW, those chunks of new code
(if only unconditionally done..) would need proper #ifdef, so that
non-Linux OSes work like before.

> 
> > 
> > > 
> > > > 
> > > > > 
> > > > > > 
> > > > > > I think RAM_SHARED can actually be that flag already - I mean, in all paths
> > > > > > that we may create anon mem (but not memory-backend-* objects), is it
> > > > > > always safe we always switch to RAM_SHARED from anon?
> > > > > 
> > > > > Do you mean only setting the flag (-> anonymous shmem) or switching also to
> > > > > memfd, which is a bigger change?
> > > > 
> > > > Switching to memfd.  I thought anon shmem (mmap(MAP_SHARED)) is mostly the
> > > > same internally, if we create memfd then mmap(MAP_SHARED) on top of it, no?
> > > 
> > > Memfd is Linux specific, keep that in mind. Apart from that there shouldn't
> > > be much difference between anon shmem and memfd (there are memory commit
> > > differences, though).
> > 
> > Could you elaborate the memory commit difference and what does that imply
> > to QEMU's usage?
> 
> Note how memfd code passed VM_NORESERVE to shmem_file_setup() and
> shmem_zero_setup() effectively doesn't (unless MAP_NORESERVE was specified
> IIRC).
> 
> Not sure if the change makes a big impact in QEMU's usage, it's just one of
> these differences between memfd and shared anonymous memory. (responding to
> your "mostly the same").

So yeah, I hoped the memory commit won't be a problem, because I think they
should be corner case MRs, and should be small.

Vram can take up to 16MB, that's the max I'm aware of, but indeed I don't
know all the use cases to be sure.  I think it means some tens of MBs can
be accounted later during fault rather than upfront to fail QEMU from boot.
Ideally mgmt apps should leave enough space for these ones, but if we worry
about that we can stick with the current option (but create a new flag
besides RAM_SHARED).

> 
> > 
> > > 
> > > Of course, there is a difference between anon memory and shmem, for example
> > > regarding what virtiofsd faced (e.g., KSM) recently.
> > 
> > The four paths shouldn't be KSM target, AFAICT.
> 
> Do you have a good overview of what is deduplicated in practice and why
> these don't apply? For example, I thought these functions are also used for
> hosting the BIOS, and that might just be deduplicated between VMs?

I was thinking KSM was for merging OS/App pages (rather than BIOSs, which
are normally very, very small)?  Though I could be wrong.

> 
> Anyhow, there are obviously other differences with shmem vs. anonymous (THP
> handling, page fault performance, userfaultfd compatibility on older
> kernels) at least on Linux, but I have absolutely no clue how that would
> differ on other host OSes.
> 
> None of them are major
> 
> This is probably going to result in a bigger discussion, for which I don't
> have any time. So my opinion on it is above.
> 
> Anyhow, this sounds like one of the suggestions I wouldn't suggest Steve to
> actually implement.

We don't need to make it a bigger discussion.  If there's concern with it,
we can stick with a new flag.

The next question is whether, with a new flag, we should enable it by
default sometimes (e.g. on new machine types on Linux).  But with a
new option, that can be discussed later.

Thanks,

-- 
Peter Xu


* Re: [PATCH V2 12/13] migration: split qmp_migrate
  2024-09-30 19:40 ` [PATCH V2 12/13] migration: split qmp_migrate Steve Sistare
@ 2024-10-07 19:18   ` Peter Xu
  0 siblings, 0 replies; 79+ messages in thread
From: Peter Xu @ 2024-10-07 19:18 UTC (permalink / raw)
  To: Steve Sistare
  Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster

On Mon, Sep 30, 2024 at 12:40:43PM -0700, Steve Sistare wrote:
> Split qmp_migrate into start and finish functions.  Finish will be
> called asynchronously in a subsequent patch, but for now, call it
> immediately.  No functional change.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>

Reviewed-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu




* Re: [PATCH V2 01/13] machine: alloc-anon option
  2024-10-07 15:36   ` Peter Xu
@ 2024-10-07 19:30     ` Steven Sistare
  0 siblings, 0 replies; 79+ messages in thread
From: Steven Sistare @ 2024-10-07 19:30 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster

On 10/7/2024 11:36 AM, Peter Xu wrote:
> On Mon, Sep 30, 2024 at 12:40:32PM -0700, Steve Sistare wrote:
>> diff --git a/system/trace-events b/system/trace-events
>> index 074d001..4669411 100644
>> --- a/system/trace-events
>> +++ b/system/trace-events
>> @@ -47,3 +47,6 @@ dirtylimit_vcpu_execute(int cpu_index, int64_t sleep_time_us) "CPU[%d] sleep %"P
>>   
>>   # cpu-throttle.c
>>   cpu_throttle_set(int new_throttle_pct)  "set guest CPU throttled by %d%%"
>> +
>> +#physmem.c
>> +ram_block_add(const char *name, uint32_t flags, int fd, size_t used_length, size_t max_length) "%s, flags %u, fd %d, len %lu, maxlen %lu"
> 
> This breaks 32bit build:
> 
> ../system/trace-events: In function ‘_nocheck__trace_ram_block_add’:
> ../system/trace-events:52:22: error: format ‘%lu’ expects argument of type ‘long unsigned int’, but argument 8 has type ‘size_t’ {aka ‘unsigned int’} [-Werror=format=]
>     52 | ram_block_add(const char *name, uint32_t flags, int fd, size_t used_length, size_t max_length) "%s, flags %u, fd %d, len %lu, maxlen %lu"
>        |                      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> ......
> ../system/trace-events:52:22: error: format ‘%lu’ expects argument of type ‘long unsigned int’, but argument 9 has type ‘size_t’ {aka ‘unsigned int’} [-Werror=format=]
>     52 | ram_block_add(const char *name, uint32_t flags, int fd, size_t used_length, size_t max_length) "%s, flags %u, fd %d, len %lu, maxlen %lu"
>        |                      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> ......
> ../system/trace-events:52:22: error: format ‘%lu’ expects argument of type ‘long unsigned int’, but argument 5 has type ‘size_t’ {aka ‘unsigned int’} [-Werror=format=]
>     52 | ram_block_add(const char *name, uint32_t flags, int fd, size_t used_length, size_t max_length) "%s, flags %u, fd %d, len %lu, maxlen %lu"
>        |                      ^~~~~~~~~~~~~~~~                                                                   ~~~~~~~~~~~
>        |                                                                                                         |
>        |                                                                                                         size_t {aka unsigned int}
> ../system/trace-events:52:22: error: format ‘%lu’ expects argument of type ‘long unsigned int’, but argument 6 has type ‘size_t’ {aka ‘unsigned int’} [-Werror=format=]
>     52 | ram_block_add(const char *name, uint32_t flags, int fd, size_t used_length, size_t max_length) "%s, flags %u, fd %d, len %lu, maxlen %lu"
>        |                      ^~~~~~~~~~~~~~~~                                                                                ~~~~~~~~~~
>        |                                                                                                                      |
>        |                                                                                                                      size_t {aka unsigned int}
> 
> Probably need to switch to %zu for size_t's.

Sorry for not building 32-bit!  And thanks for the tip about %zu, that's new to me - steve
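
For reference, a small standalone sketch of the %zu suggestion.
format_ram_block() is a hypothetical helper that only mirrors the shape of
the trace line; the point is that %zu matches size_t on both 32-bit and
64-bit hosts, whereas %lu is only correct where size_t happens to be
unsigned long.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Format a ram_block_add-style message; size_t arguments use %zu. */
static int format_ram_block(char *buf, size_t buflen, const char *name,
                            unsigned int flags, int fd,
                            size_t used_length, size_t max_length)
{
    return snprintf(buf, buflen,
                    "%s, flags %u, fd %d, len %zu, maxlen %zu",
                    name, flags, fd, used_length, max_length);
}
```

Because the conversion follows the type rather than the platform, the same
format string should compile without -Wformat warnings on both builds.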




* Re: [PATCH V2 02/13] migration: cpr-state
  2024-10-07 14:14   ` Peter Xu
@ 2024-10-07 19:30     ` Steven Sistare
  0 siblings, 0 replies; 79+ messages in thread
From: Steven Sistare @ 2024-10-07 19:30 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster

On 10/7/2024 10:14 AM, Peter Xu wrote:
> On Mon, Sep 30, 2024 at 12:40:33PM -0700, Steve Sistare wrote:
>> CPR must save state that is needed after QEMU is restarted, when devices
>> are realized.  Thus the extra state cannot be saved in the migration stream,
>> as objects must already exist before that stream can be loaded.  Instead,
>> define auxiliary state structures and vmstate descriptions, not associated
>> with any registered object, and serialize the aux state to a cpr-specific
>> stream in cpr_state_save.  Deserialize in cpr_state_load after QEMU
>> restarts, before devices are realized.
>>
>> Provide accessors for clients to register file descriptors for saving.
>> The mechanism for passing the fd's to the new process will be specific
>> to each migration mode, and added in subsequent patches.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> Reviewed-by: Fabiano Rosas <farosas@suse.de>
> 
> Only two trivial comments below.
> 
>> ---
>>   include/migration/cpr.h |  26 ++++++
>>   migration/cpr.c         | 217 ++++++++++++++++++++++++++++++++++++++++++++++++
>>   migration/meson.build   |   1 +
>>   migration/migration.c   |   6 ++
>>   migration/trace-events  |   5 ++
>>   system/vl.c             |   7 ++
>>   6 files changed, 262 insertions(+)
>>   create mode 100644 include/migration/cpr.h
>>   create mode 100644 migration/cpr.c
>>
>> diff --git a/include/migration/cpr.h b/include/migration/cpr.h
>> new file mode 100644
>> index 0000000..e7b898b
>> --- /dev/null
>> +++ b/include/migration/cpr.h
>> @@ -0,0 +1,26 @@
>> +/*
>> + * Copyright (c) 2021, 2024 Oracle and/or its affiliates.
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
>> + * See the COPYING file in the top-level directory.
>> + */
>> +
>> +#ifndef MIGRATION_CPR_H
>> +#define MIGRATION_CPR_H
>> +
>> +#define QEMU_CPR_FILE_MAGIC     0x51435052
>> +#define QEMU_CPR_FILE_VERSION   0x00000001
>> +
>> +typedef int (*cpr_walk_fd_cb)(int fd);
>> +void cpr_save_fd(const char *name, int id, int fd);
>> +void cpr_delete_fd(const char *name, int id);
>> +int cpr_find_fd(const char *name, int id);
>> +int cpr_walk_fd(cpr_walk_fd_cb cb);
>> +void cpr_resave_fd(const char *name, int id, int fd);
>> +
>> +int cpr_state_save(Error **errp);
>> +int cpr_state_load(Error **errp);
>> +void cpr_state_close(void);
>> +struct QIOChannel *cpr_state_ioc(void);
>> +
>> +#endif
>> diff --git a/migration/cpr.c b/migration/cpr.c
>> new file mode 100644
>> index 0000000..e50fc75
>> --- /dev/null
>> +++ b/migration/cpr.c
>> @@ -0,0 +1,217 @@
>> +/*
>> + * Copyright (c) 2021-2024 Oracle and/or its affiliates.
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
>> + * See the COPYING file in the top-level directory.
>> + */
>> +
>> +#include "qemu/osdep.h"
>> +#include "qapi/error.h"
>> +#include "migration/cpr.h"
>> +#include "migration/misc.h"
>> +#include "migration/qemu-file.h"
>> +#include "migration/savevm.h"
>> +#include "migration/vmstate.h"
>> +#include "sysemu/runstate.h"
>> +#include "trace.h"
>> +
>> +/*************************************************************************/
>> +/* cpr state container for all information to be saved. */
>> +
>> +typedef QLIST_HEAD(CprFdList, CprFd) CprFdList;
>> +
>> +typedef struct CprState {
>> +    CprFdList fds;
>> +} CprState;
>> +
>> +static CprState cpr_state;
>> +
>> +/****************************************************************************/
>> +
>> +typedef struct CprFd {
>> +    char *name;
>> +    unsigned int namelen;
>> +    int id;
>> +    int fd;
>> +    QLIST_ENTRY(CprFd) next;
>> +} CprFd;
>> +
>> +static const VMStateDescription vmstate_cpr_fd = {
>> +    .name = "cpr fd",
>> +    .version_id = 1,
>> +    .minimum_version_id = 1,
>> +    .fields = (VMStateField[]) {
>> +        VMSTATE_UINT32(namelen, CprFd),
>> +        VMSTATE_VBUFFER_ALLOC_UINT32(name, CprFd, 0, NULL, namelen),
>> +        VMSTATE_INT32(id, CprFd),
>> +        VMSTATE_INT32(fd, CprFd),
>> +        VMSTATE_END_OF_LIST()
>> +    }
>> +};
>> +
>> +void cpr_save_fd(const char *name, int id, int fd)
>> +{
>> +    CprFd *elem = g_new0(CprFd, 1);
>> +
>> +    trace_cpr_save_fd(name, id, fd);
>> +    elem->name = g_strdup(name);
>> +    elem->namelen = strlen(name) + 1;
>> +    elem->id = id;
>> +    elem->fd = fd;
>> +    QLIST_INSERT_HEAD(&cpr_state.fds, elem, next);
>> +}
>> +
>> +static CprFd *find_fd(CprFdList *head, const char *name, int id)
>> +{
>> +    CprFd *elem;
>> +
>> +    QLIST_FOREACH(elem, head, next) {
>> +        if (!strcmp(elem->name, name) && elem->id == id) {
>> +            return elem;
>> +        }
>> +    }
>> +    return NULL;
>> +}
>> +
>> +void cpr_delete_fd(const char *name, int id)
>> +{
>> +    CprFd *elem = find_fd(&cpr_state.fds, name, id);
>> +
>> +    if (elem) {
>> +        QLIST_REMOVE(elem, next);
>> +        g_free(elem->name);
>> +        g_free(elem);
>> +    }
>> +
>> +    trace_cpr_delete_fd(name, id);
>> +}
>> +
>> +int cpr_find_fd(const char *name, int id)
>> +{
>> +    CprFd *elem = find_fd(&cpr_state.fds, name, id);
>> +    int fd = elem ? elem->fd : -1;
>> +
>> +    trace_cpr_find_fd(name, id, fd);
>> +    return fd;
>> +}
>> +
>> +int cpr_walk_fd(cpr_walk_fd_cb cb)
>> +{
>> +    CprFd *elem;
>> +
>> +    QLIST_FOREACH(elem, &cpr_state.fds, next) {
>> +        if (elem->fd >= 0 && cb(elem->fd)) {
>> +            return 1;
>> +        }
>> +    }
>> +    return 0;
>> +}
>> +
>> +void cpr_resave_fd(const char *name, int id, int fd)
>> +{
>> +    CprFd *elem = find_fd(&cpr_state.fds, name, id);
>> +    int old_fd = elem ? elem->fd : -1;
>> +
>> +    if (old_fd < 0) {
>> +        cpr_save_fd(name, id, fd);
>> +    } else if (old_fd != fd) {
>> +        error_setg(&error_fatal,
>> +                   "internal error: cpr fd '%s' id %d value %d "
>> +                   "already saved with a different value %d",
>> +                   name, id, fd, old_fd);
>> +    }
>> +}
> 
> I remember I commented this, maybe not.. cpr_walk_fd() and cpr_resave_fd()
> are not used in this series.  Suggest introducing them only when they're
> used.

Thanks, you probably did.  I just edited my tree, so I will not forget again.

>> +/*************************************************************************/
>> +#define CPR_STATE "CprState"
>> +
>> +static const VMStateDescription vmstate_cpr_state = {
>> +    .name = CPR_STATE,
>> +    .version_id = 1,
>> +    .minimum_version_id = 1,
>> +    .fields = (VMStateField[]) {
>> +        VMSTATE_QLIST_V(fds, CprState, 1, vmstate_cpr_fd, CprFd, next),
>> +        VMSTATE_END_OF_LIST()
>> +    }
>> +};
>> +/*************************************************************************/
>> +
>> +static QEMUFile *cpr_state_file;
>> +
>> +QIOChannel *cpr_state_ioc(void)
>> +{
>> +    return qemu_file_get_ioc(cpr_state_file);
>> +}
>> +
>> +int cpr_state_save(Error **errp)
>> +{
>> +    int ret;
>> +    QEMUFile *f;
>> +
>> +    /* set f based on mode in a later patch in this series */
>> +    return 0;
>> +
>> +    qemu_put_be32(f, QEMU_CPR_FILE_MAGIC);
>> +    qemu_put_be32(f, QEMU_CPR_FILE_VERSION);
>> +
>> +    ret = vmstate_save_state(f, &vmstate_cpr_state, &cpr_state, 0);
>> +    if (ret) {
>> +        error_setg(errp, "vmstate_save_state error %d", ret);
>> +        qemu_fclose(f);
>> +        return ret;
>> +    }
>> +
>> +    /*
>> +     * Close the socket only partially so we can later detect when the other
>> +     * end closes by getting a HUP event.
>> +     */
>> +    qemu_fflush(f);
>> +    qio_channel_shutdown(qemu_file_get_ioc(f), QIO_CHANNEL_SHUTDOWN_WRITE,
>> +                         NULL);
> 
> What happens if we send everything and close immediately?
> 
> I didn't see how this cached file is used later throughout the whole
> series.  Is it used in some follow up series?

The complete usage and rationale is in the last patch, "migration: cpr-transfer mode"

- Steve
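
Independent of how the last patch uses it, the half-close idea from the
quoted comment can be sketched in plain POSIX C as below.  The helper names
are hypothetical, and the HUP behaviour shown is what Linux reports for
connected UNIX stream sockets; other platforms may differ.

```c
#include <assert.h>
#include <poll.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Sender: write the payload, then shut down only the write direction. */
static int send_and_half_close(int sock, const void *buf, size_t len)
{
    if (write(sock, buf, len) != (ssize_t)len) {
        return -1;
    }
    /* The reader now sees EOF, but our end stays open for HUP detection. */
    return shutdown(sock, SHUT_WR);
}

/* Return 1 if the peer has hung up, 0 if not within timeout_ms, -1 on error. */
static int peer_hung_up(int sock, int timeout_ms)
{
    struct pollfd pfd = { .fd = sock, .events = 0 };
    int ret = poll(&pfd, 1, timeout_ms);

    if (ret < 0) {
        return -1;
    }
    /* POLLHUP is reported even though no events were requested. */
    return (ret > 0 && (pfd.revents & POLLHUP)) ? 1 : 0;
}
```

The sender keeps its descriptor after the half-close, so the receiver
decides when the hang-up is signalled simply by closing its own end.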

>> +    cpr_state_file = f;
>> +    return 0;
>> +}
>> +
>> +int cpr_state_load(Error **errp)
>> +{
>> +    int ret;
>> +    uint32_t v;
>> +    QEMUFile *f;
>> +
>> +    /* set f based on mode in a later patch in this series */
>> +    return 0;
>> +
>> +    v = qemu_get_be32(f);
>> +    if (v != QEMU_CPR_FILE_MAGIC) {
>> +        error_setg(errp, "Not a migration stream (bad magic %x)", v);
>> +        qemu_fclose(f);
>> +        return -EINVAL;
>> +    }
>> +    v = qemu_get_be32(f);
>> +    if (v != QEMU_CPR_FILE_VERSION) {
>> +        error_setg(errp, "Unsupported migration stream version %d", v);
>> +        qemu_fclose(f);
>> +        return -ENOTSUP;
>> +    }
>> +
>> +    ret = vmstate_load_state(f, &vmstate_cpr_state, &cpr_state, 1);
>> +    if (ret) {
>> +        error_setg(errp, "vmstate_load_state error %d", ret);
>> +        qemu_fclose(f);
>> +        return ret;
>> +    }
>> +
>> +    /*
>> +     * Let the caller decide when to close the socket (and generate a HUP event
>> +     * for the sending side).
>> +     */
>> +    cpr_state_file = f;
>> +    return ret;
>> +}
>> +
>> +void cpr_state_close(void)
>> +{
>> +    if (cpr_state_file) {
>> +        qemu_fclose(cpr_state_file);
>> +        cpr_state_file = NULL;
>> +    }
>> +}
>> diff --git a/migration/meson.build b/migration/meson.build
>> index 66d3de8..e5f4211 100644
>> --- a/migration/meson.build
>> +++ b/migration/meson.build
>> @@ -13,6 +13,7 @@ system_ss.add(files(
>>     'block-dirty-bitmap.c',
>>     'channel.c',
>>     'channel-block.c',
>> +  'cpr.c',
>>     'dirtyrate.c',
>>     'exec.c',
>>     'fd.c',
>> diff --git a/migration/migration.c b/migration/migration.c
>> index ae2be31..834b0a2 100644
>> --- a/migration/migration.c
>> +++ b/migration/migration.c
>> @@ -27,6 +27,7 @@
>>   #include "sysemu/cpu-throttle.h"
>>   #include "rdma.h"
>>   #include "ram.h"
>> +#include "migration/cpr.h"
>>   #include "migration/global_state.h"
>>   #include "migration/misc.h"
>>   #include "migration.h"
>> @@ -2123,6 +2124,10 @@ void qmp_migrate(const char *uri, bool has_channels,
>>           }
>>       }
>>   
>> +    if (cpr_state_save(&local_err)) {
>> +        goto out;
>> +    }
>> +
>>       if (addr->transport == MIGRATION_ADDRESS_TYPE_SOCKET) {
>>           SocketAddress *saddr = &addr->u.socket;
>>           if (saddr->type == SOCKET_ADDRESS_TYPE_INET ||
>> @@ -2147,6 +2152,7 @@ void qmp_migrate(const char *uri, bool has_channels,
>>                             MIGRATION_STATUS_FAILED);
>>       }
>>   
>> +out:
>>       if (local_err) {
>>           if (!resume_requested) {
>>               yank_unregister_instance(MIGRATION_YANK_INSTANCE);
>> diff --git a/migration/trace-events b/migration/trace-events
>> index c65902f..5356fb5 100644
>> --- a/migration/trace-events
>> +++ b/migration/trace-events
>> @@ -341,6 +341,11 @@ colo_receive_message(const char *msg) "Receive '%s' message"
>>   # colo-failover.c
>>   colo_failover_set_state(const char *new_state) "new state %s"
>>   
>> +# cpr.c
>> +cpr_save_fd(const char *name, int id, int fd) "%s, id %d, fd %d"
>> +cpr_delete_fd(const char *name, int id) "%s, id %d"
>> +cpr_find_fd(const char *name, int id, int fd) "%s, id %d returns %d"
>> +
>>   # block-dirty-bitmap.c
>>   send_bitmap_header_enter(void) ""
>>   send_bitmap_bits(uint32_t flags, uint64_t start_sector, uint32_t nr_sectors, uint64_t data_size) "flags: 0x%x, start_sector: %" PRIu64 ", nr_sectors: %" PRIu32 ", data_size: %" PRIu64
>> diff --git a/system/vl.c b/system/vl.c
>> index 752a1da..565d932 100644
>> --- a/system/vl.c
>> +++ b/system/vl.c
>> @@ -77,6 +77,7 @@
>>   #include "hw/block/block.h"
>>   #include "hw/i386/x86.h"
>>   #include "hw/i386/pc.h"
>> +#include "migration/cpr.h"
>>   #include "migration/misc.h"
>>   #include "migration/snapshot.h"
>>   #include "sysemu/tpm.h"
>> @@ -3720,6 +3721,12 @@ void qemu_init(int argc, char **argv)
>>   
>>       qemu_create_machine(machine_opts_dict);
>>   
>> +    /*
>> +     * Load incoming CPR state before any devices are created, because it
>> +     * contains file descriptors that are needed in device initialization code.
>> +     */
>> +    cpr_state_load(&error_fatal);
>> +
>>       suspend_mux_open();
>>   
>>       qemu_disable_default_devices();
>> -- 
>> 1.8.3.1
>>
> 




* Re: [PATCH V2 03/13] migration: save cpr mode
  2024-10-07 15:18   ` Peter Xu
@ 2024-10-07 19:31     ` Steven Sistare
  2024-10-07 20:10       ` Peter Xu
  0 siblings, 1 reply; 79+ messages in thread
From: Steven Sistare @ 2024-10-07 19:31 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster

On 10/7/2024 11:18 AM, Peter Xu wrote:
> On Mon, Sep 30, 2024 at 12:40:34PM -0700, Steve Sistare wrote:
>> Save the mode in CPR state, so the user does not need to explicitly specify
>> it for the target.  Modify migrate_mode() so it returns the incoming mode on
>> the target.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>>   include/migration/cpr.h |  7 +++++++
>>   migration/cpr.c         | 23 ++++++++++++++++++++++-
>>   migration/migration.c   |  1 +
>>   migration/options.c     |  9 +++++++--
>>   4 files changed, 37 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/migration/cpr.h b/include/migration/cpr.h
>> index e7b898b..ac7a63e 100644
>> --- a/include/migration/cpr.h
>> +++ b/include/migration/cpr.h
>> @@ -8,9 +8,16 @@
>>   #ifndef MIGRATION_CPR_H
>>   #define MIGRATION_CPR_H
>>   
>> +#include "qapi/qapi-types-migration.h"
>> +
>> +#define MIG_MODE_NONE           -1
>> +
>>   #define QEMU_CPR_FILE_MAGIC     0x51435052
>>   #define QEMU_CPR_FILE_VERSION   0x00000001
>>   
>> +MigMode cpr_get_incoming_mode(void);
>> +void cpr_set_incoming_mode(MigMode mode);
>> +
>>   typedef int (*cpr_walk_fd_cb)(int fd);
>>   void cpr_save_fd(const char *name, int id, int fd);
>>   void cpr_delete_fd(const char *name, int id);
>> diff --git a/migration/cpr.c b/migration/cpr.c
>> index e50fc75..7514c4e 100644
>> --- a/migration/cpr.c
>> +++ b/migration/cpr.c
>> @@ -21,10 +21,23 @@
>>   typedef QLIST_HEAD(CprFdList, CprFd) CprFdList;
>>   
>>   typedef struct CprState {
>> +    MigMode mode;
>>       CprFdList fds;
>>   } CprState;
>>   
>> -static CprState cpr_state;
>> +static CprState cpr_state = {
>> +    .mode = MIG_MODE_NONE,
>> +};
>> +
>> +MigMode cpr_get_incoming_mode(void)
>> +{
>> +    return cpr_state.mode;
>> +}
>> +
>> +void cpr_set_incoming_mode(MigMode mode)
>> +{
>> +    cpr_state.mode = mode;
>> +}
>>   
>>   /****************************************************************************/
>>   
>> @@ -124,11 +137,19 @@ void cpr_resave_fd(const char *name, int id, int fd)
>>   /*************************************************************************/
>>   #define CPR_STATE "CprState"
>>   
>> +static int cpr_state_presave(void *opaque)
>> +{
>> +    cpr_state.mode = migrate_mode();
>> +    return 0;
>> +}
>> +
>>   static const VMStateDescription vmstate_cpr_state = {
>>       .name = CPR_STATE,
>>       .version_id = 1,
>>       .minimum_version_id = 1,
>> +    .pre_save = cpr_state_presave,
>>       .fields = (VMStateField[]) {
>> +        VMSTATE_UINT32(mode, CprState),
>>           VMSTATE_QLIST_V(fds, CprState, 1, vmstate_cpr_fd, CprFd, next),
>>           VMSTATE_END_OF_LIST()
>>       }
>> diff --git a/migration/migration.c b/migration/migration.c
>> index 834b0a2..df00e5c 100644
>> --- a/migration/migration.c
>> +++ b/migration/migration.c
>> @@ -416,6 +416,7 @@ void migration_incoming_state_destroy(void)
>>           mis->postcopy_qemufile_dst = NULL;
>>       }
>>   
>> +    cpr_set_incoming_mode(MIG_MODE_NONE);
>>       yank_unregister_instance(MIGRATION_YANK_INSTANCE);
>>   }
>>   
>> diff --git a/migration/options.c b/migration/options.c
>> index 147cd2b..cc85a84 100644
>> --- a/migration/options.c
>> +++ b/migration/options.c
>> @@ -22,6 +22,7 @@
>>   #include "qapi/qmp/qnull.h"
>>   #include "sysemu/runstate.h"
>>   #include "migration/colo.h"
>> +#include "migration/cpr.h"
>>   #include "migration/misc.h"
>>   #include "migration.h"
>>   #include "migration-stats.h"
>> @@ -768,8 +769,12 @@ uint64_t migrate_max_postcopy_bandwidth(void)
>>   
>>   MigMode migrate_mode(void)
>>   {
>> -    MigrationState *s = migrate_get_current();
>> -    MigMode mode = s->parameters.mode;
>> +    MigMode mode = cpr_get_incoming_mode();
>> +
>> +    if (mode == MIG_MODE_NONE) {
>> +        MigrationState *s = migrate_get_current();
>> +        mode = s->parameters.mode;
>> +    }
> 
> Is this trying to avoid interfering with what user specified?

No.

> I can kind of get the point of it, but it'll also look pretty weird in this
> case: the user can set the mode, but a query before the cpr-transfer
> incoming completes won't return what was set previously, but what was
> migrated via the cpr channel.
> 
> And IIUC it is needed to migrate this mode in cpr stream so as to avoid
> another new qemu cmdline on dest qemu.  If true this needs to be mentioned
> in the commit message; so far it reads like it's optional, then it's not
> clear why only cpr-mode needs to be migrated not other migration parameters.

The mode is needed on the incoming side early -- before migration_object_init,
and before the monitor is started.  Thus the user cannot set it as a normal
migration parameter.

> If that's not easy to get right, I wonder whether we could just overwrite
> parameters.mode directly from the cpr stream.

I considered that, but parameters.mode cannot be set before migration_object_init,
and some code needs to know mode before that.
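
The lookup order can be sketched on its own as follows.  This is only a
minimal illustration, assuming stand-in names (Mode, incoming_mode,
parameter_mode, current_mode) rather than the real QEMU types: the value
loaded from CPR state wins when present, and the ordinary migration parameter
is consulted otherwise.

```c
#include <assert.h>

/* Hypothetical stand-ins for the QAPI MigMode values; not QEMU's enum. */
typedef enum {
    MODE_NONE = -1,         /* sentinel: no incoming CPR mode recorded */
    MODE_NORMAL = 0,
    MODE_CPR_REBOOT,
    MODE_CPR_TRANSFER,
    MODE__MAX,
} Mode;

/* Filled in when CPR state is loaded, before the migration object exists. */
static Mode incoming_mode = MODE_NONE;

/* Stand-in for s->parameters.mode, usable only after object init. */
static Mode parameter_mode = MODE_NORMAL;

static void set_incoming_mode(Mode mode)
{
    incoming_mode = mode;
}

static Mode current_mode(void)
{
    Mode mode = incoming_mode;

    /* Fall back to the user-visible parameter when nothing was loaded. */
    if (mode == MODE_NONE) {
        mode = parameter_mode;
    }
    assert(mode >= 0 && mode < MODE__MAX);
    return mode;
}
```

Resetting the incoming mode to the sentinel once incoming migration is
destroyed makes later calls read the parameter again.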

- Steve

> After all, IIUC that's before
> QMP is available, so there's no legal way to set it, and thus no legal way
> for it to overwrite a user input?
> 
>>   
>>       assert(mode >= 0 && mode < MIG_MODE__MAX);
>>       return mode;
>> -- 
>> 1.8.3.1
>>
> 




* Re: [PATCH V2 09/13] migration: cpr-transfer save and load
  2024-10-07 16:47   ` Peter Xu
@ 2024-10-07 19:31     ` Steven Sistare
  2024-10-08 15:36       ` Peter Xu
  0 siblings, 1 reply; 79+ messages in thread
From: Steven Sistare @ 2024-10-07 19:31 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster

On 10/7/2024 12:47 PM, Peter Xu wrote:
> On Mon, Sep 30, 2024 at 12:40:40PM -0700, Steve Sistare wrote:
>> Add functions to create a QEMUFile based on a unix URI, for saving or
>> loading, for use by cpr-transfer mode to preserve CPR state.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> 
> Reviewed-by: Peter Xu <peterx@redhat.com>
> 
> There're a few extra newlines below, though, which could be removed.

I added the extra lines for readability.  They separate multi-line conditional
expressions from the body that follows, and separate one if-then-else body
from the next body.

- Steve

>> ---
>>   include/migration/cpr.h  |  3 ++
>>   migration/cpr-transfer.c | 81 ++++++++++++++++++++++++++++++++++++++++++++++++
>>   migration/meson.build    |  1 +
>>   3 files changed, 85 insertions(+)
>>   create mode 100644 migration/cpr-transfer.c
>>
>> diff --git a/include/migration/cpr.h b/include/migration/cpr.h
>> index ac7a63e..51c19ed 100644
>> --- a/include/migration/cpr.h
>> +++ b/include/migration/cpr.h
>> @@ -30,4 +30,7 @@ int cpr_state_load(Error **errp);
>>   void cpr_state_close(void);
>>   struct QIOChannel *cpr_state_ioc(void);
>>   
>> +QEMUFile *cpr_transfer_output(const char *uri, Error **errp);
>> +QEMUFile *cpr_transfer_input(const char *uri, Error **errp);
>> +
>>   #endif
>> diff --git a/migration/cpr-transfer.c b/migration/cpr-transfer.c
>> new file mode 100644
>> index 0000000..fb9ecd8
>> --- /dev/null
>> +++ b/migration/cpr-transfer.c
>> @@ -0,0 +1,81 @@
>> +/*
>> + * Copyright (c) 2022, 2024 Oracle and/or its affiliates.
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
>> + * See the COPYING file in the top-level directory.
>> + */
>> +
>> +#include "qemu/osdep.h"
>> +#include "qapi/error.h"
>> +#include "io/channel-file.h"
>> +#include "io/channel-socket.h"
>> +#include "io/net-listener.h"
>> +#include "migration/cpr.h"
>> +#include "migration/migration.h"
>> +#include "migration/savevm.h"
>> +#include "migration/qemu-file.h"
>> +#include "migration/vmstate.h"
>> +
>> +QEMUFile *cpr_transfer_output(const char *uri, Error **errp)
>> +{
>> +    g_autoptr(MigrationChannel) channel = NULL;
>> +    QIOChannel *ioc;
>> +
>> +    if (!migrate_uri_parse(uri, &channel, errp)) {
>> +        return NULL;
>> +    }
>> +
>> +    if (channel->addr->transport == MIGRATION_ADDRESS_TYPE_SOCKET &&
>> +        channel->addr->u.socket.type == SOCKET_ADDRESS_TYPE_UNIX) {
>> +
> 
> here
> 
>> +        QIOChannelSocket *sioc = qio_channel_socket_new();
>> +        SocketAddress *saddr = &channel->addr->u.socket;
>> +
>> +        if (qio_channel_socket_connect_sync(sioc, saddr, errp)) {
>> +            object_unref(OBJECT(sioc));
>> +            return NULL;
>> +        }
>> +        ioc = QIO_CHANNEL(sioc);
>> +
> 
> here
> 
>> +    } else {
>> +        error_setg(errp, "bad cpr-uri %s; must be unix:", uri);
>> +        return NULL;
>> +    }
>> +
>> +    qio_channel_set_name(ioc, "cpr-out");
>> +    return qemu_file_new_output(ioc);
>> +}
>> +
>> +QEMUFile *cpr_transfer_input(const char *uri, Error **errp)
>> +{
>> +    g_autoptr(MigrationChannel) channel = NULL;
>> +    QIOChannel *ioc;
>> +
>> +    if (!migrate_uri_parse(uri, &channel, errp)) {
>> +        return NULL;
>> +    }
>> +
>> +    if (channel->addr->transport == MIGRATION_ADDRESS_TYPE_SOCKET &&
>> +        channel->addr->u.socket.type == SOCKET_ADDRESS_TYPE_UNIX) {
>> +
> 
> here
> 
>> +        QIOChannelSocket *sioc;
>> +        SocketAddress *saddr = &channel->addr->u.socket;
>> +        QIONetListener *listener = qio_net_listener_new();
>> +
>> +        qio_net_listener_set_name(listener, "cpr-socket-listener");
>> +        if (qio_net_listener_open_sync(listener, saddr, 1, errp) < 0) {
>> +            object_unref(OBJECT(listener));
>> +            return NULL;
>> +        }
>> +
>> +        sioc = qio_net_listener_wait_client(listener);
>> +        ioc = QIO_CHANNEL(sioc);
>> +
> 
> here
> 
>> +    } else {
>> +        error_setg(errp, "bad cpr-uri %s; must be unix:", uri);
>> +        return NULL;
>> +    }
>> +
>> +    qio_channel_set_name(ioc, "cpr-in");
>> +    return qemu_file_new_input(ioc);
>> +}
>> diff --git a/migration/meson.build b/migration/meson.build
>> index e5f4211..684ba98 100644
>> --- a/migration/meson.build
>> +++ b/migration/meson.build
>> @@ -14,6 +14,7 @@ system_ss.add(files(
>>     'channel.c',
>>     'channel-block.c',
>>     'cpr.c',
>> +  'cpr-transfer.c',
>>     'dirtyrate.c',
>>     'exec.c',
>>     'fd.c',
>> -- 
>> 1.8.3.1
>>
> 




* Re: [PATCH V2 08/13] migration: VMSTATE_FD
  2024-10-07 16:36   ` Peter Xu
@ 2024-10-07 19:31     ` Steven Sistare
  0 siblings, 0 replies; 79+ messages in thread
From: Steven Sistare @ 2024-10-07 19:31 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster

On 10/7/2024 12:36 PM, Peter Xu wrote:
> On Mon, Sep 30, 2024 at 12:40:39PM -0700, Steve Sistare wrote:
>> Define VMSTATE_FD for declaring a file descriptor field in a
>> VMStateDescription.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>>   include/migration/vmstate.h |  9 +++++++++
>>   migration/vmstate-types.c   | 32 ++++++++++++++++++++++++++++++++
>>   2 files changed, 41 insertions(+)
>>
>> diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
>> index f313f2f..a1dfab4 100644
>> --- a/include/migration/vmstate.h
>> +++ b/include/migration/vmstate.h
>> @@ -230,6 +230,7 @@ extern const VMStateInfo vmstate_info_uint8;
>>   extern const VMStateInfo vmstate_info_uint16;
>>   extern const VMStateInfo vmstate_info_uint32;
>>   extern const VMStateInfo vmstate_info_uint64;
>> +extern const VMStateInfo vmstate_info_fd;
>>   
>>   /** Put this in the stream when migrating a null pointer.*/
>>   #define VMS_NULLPTR_MARKER (0x30U) /* '0' */
>> @@ -902,6 +903,9 @@ extern const VMStateInfo vmstate_info_qlist;
>>   #define VMSTATE_UINT64_V(_f, _s, _v)                                  \
>>       VMSTATE_SINGLE(_f, _s, _v, vmstate_info_uint64, uint64_t)
>>   
>> +#define VMSTATE_FD_V(_f, _s, _v)                                  \
>> +    VMSTATE_SINGLE(_f, _s, _v, vmstate_info_fd, int32_t)
>> +
>>   #ifdef CONFIG_LINUX
>>   
>>   #define VMSTATE_U8_V(_f, _s, _v)                                   \
>> @@ -936,6 +940,9 @@ extern const VMStateInfo vmstate_info_qlist;
>>   #define VMSTATE_UINT64(_f, _s)                                        \
>>       VMSTATE_UINT64_V(_f, _s, 0)
>>   
>> +#define VMSTATE_FD(_f, _s)                                            \
>> +    VMSTATE_FD_V(_f, _s, 0)
>> +
>>   #ifdef CONFIG_LINUX
>>   
>>   #define VMSTATE_U8(_f, _s)                                         \
>> @@ -1009,6 +1016,8 @@ extern const VMStateInfo vmstate_info_qlist;
>>   #define VMSTATE_UINT64_TEST(_f, _s, _t)                                  \
>>       VMSTATE_SINGLE_TEST(_f, _s, _t, 0, vmstate_info_uint64, uint64_t)
>>   
>> +#define VMSTATE_FD_TEST(_f, _s, _t)                                            \
>> +    VMSTATE_SINGLE_TEST(_f, _s, _t, 0, vmstate_info_fd, int32_t)
>>   
>>   #define VMSTATE_TIMER_PTR_TEST(_f, _s, _test)                             \
>>       VMSTATE_POINTER_TEST(_f, _s, _test, vmstate_info_timer, QEMUTimer *)
>> diff --git a/migration/vmstate-types.c b/migration/vmstate-types.c
>> index e83bfcc..6e45a4a 100644
>> --- a/migration/vmstate-types.c
>> +++ b/migration/vmstate-types.c
>> @@ -314,6 +314,38 @@ const VMStateInfo vmstate_info_uint64 = {
>>       .put  = put_uint64,
>>   };
>>   
>> +/* File descriptor communicated via SCM_RIGHTS */
>> +
>> +static int get_fd(QEMUFile *f, void *pv, size_t size,
>> +                  const VMStateField *field)
>> +{
>> +    int32_t *v = pv;
>> +    qemu_get_sbe32s(f, v);
> 
> Why do we need to send/recv the fd integer alone?  Can't that change anyway
> across migration?  What happens if we drop this (and the put side)?

This is a remnant from cpr-exec mode, where the fd value did not change across
exec, and SCM_RIGHTS was not used.  I will delete it, and I will delete the mode
test that appears in the "cpr-transfer mode" patch:

     qemu_get_sbe32s(f, v);
     if (*v < 0 || migrate_mode() != MIG_MODE_CPR_TRANSFER) {
         return 0;
     }
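
For reference, here is a minimal sketch of why the integer is redundant with
SCM_RIGHTS.  It is plain POSIX socket code, not the QEMUFile implementation,
and send_fd/recv_fd are hypothetical helper names.  The receiver gets a
descriptor number chosen by its own kernel-side fd table, so the sender's
number carries no information.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send one descriptor over a unix socket.  Returns 0 on success, -1 on error. */
static int send_fd(int sock, int fd)
{
    char byte = 0;          /* one data byte travels with the control message */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;   /* forces suitable alignment for buf */
    } u;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    memset(&u, 0, sizeof(u));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one descriptor.  Returns the new local fd number, or -1 on error. */
static int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } u;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd = -1;

    memset(&u, 0, sizeof(u));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    if (recvmsg(sock, &msg, 0) != 1) {
        return -1;
    }
    cmsg = CMSG_FIRSTHDR(&msg);
    if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int))) {
        return -1;
    }
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

The only thing the stream still has to encode is whether a descriptor follows
at all, which is what the "*v < 0" check covers.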

- Steve

>> +    if (*v < 0) {
>> +        return 0;
>> +    }
>> +    *v = qemu_file_get_fd(f);
>> +    return 0;
>> +}
>> +
>> +static int put_fd(QEMUFile *f, void *pv, size_t size,
>> +                  const VMStateField *field, JSONWriter *vmdesc)
>> +{
>> +    int32_t *v = pv;
>> +
>> +    qemu_put_sbe32s(f, v);
>> +    if (*v < 0) {
>> +        return 0;
>> +    }
>> +    return qemu_file_put_fd(f, *v);
>> +}
>> +
>> +const VMStateInfo vmstate_info_fd = {
>> +    .name = "fd",
>> +    .get  = get_fd,
>> +    .put  = put_fd,
>> +};
>> +
>>   static int get_nullptr(QEMUFile *f, void *pv, size_t size,
>>                          const VMStateField *field)
>>   
>> -- 
>> 1.8.3.1
>>
> 




* Re: [PATCH V2 13/13] migration: cpr-transfer mode
  2024-09-30 19:40 ` [PATCH V2 13/13] migration: cpr-transfer mode Steve Sistare
@ 2024-10-07 19:44   ` Peter Xu
  2024-10-07 20:39     ` Steven Sistare
  0 siblings, 1 reply; 79+ messages in thread
From: Peter Xu @ 2024-10-07 19:44 UTC (permalink / raw)
  To: Steve Sistare
  Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster

On Mon, Sep 30, 2024 at 12:40:44PM -0700, Steve Sistare wrote:
> Add the cpr-transfer migration mode.  Usage:
>   qemu-system-$arch -machine anon-alloc=memfd ...
> 
>   start new QEMU with "-incoming <uri-1> -cpr-uri <uri-2>"
> 
>   Issue commands to old QEMU:
>   migrate_set_parameter mode cpr-transfer
>   migrate_set_parameter cpr-uri <uri-2>
>   migrate -d <uri-1>
> 
> The migrate command stops the VM, saves CPR state to uri-2, saves
> normal migration state to uri-1, and old QEMU enters the postmigrate
> state.  The user starts new QEMU on the same host as old QEMU, with the
> same arguments as old QEMU, plus the -incoming option.  Guest RAM is
> preserved in place, albeit with new virtual addresses in new QEMU.
> 
> This mode requires a second migration channel, specified by the
> cpr-uri migration property on the outgoing side, and by the cpr-uri
> QEMU command-line option on the incoming side.  The channel must
> be a type, such as unix socket, that supports SCM_RIGHTS.
> 
> Memory-backend objects must have the share=on attribute, but
> memory-backend-epc is not supported.  The VM must be started with
> the '-machine anon-alloc=memfd' option, which allows anonymous
> memory to be transferred in place to the new process.  The memfds
> are kept open by sending the descriptors to new QEMU via the
> cpr-uri, which must support SCM_RIGHTS, and they are mmap'd
> in new QEMU.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  include/migration/cpr.h   |  1 +
>  migration/cpr.c           | 34 +++++++++++++++++++----
>  migration/migration.c     | 69 +++++++++++++++++++++++++++++++++++++++++++++--
>  migration/migration.h     |  2 ++
>  migration/ram.c           |  2 ++
>  migration/vmstate-types.c |  5 ++--
>  qapi/migration.json       | 27 ++++++++++++++++++-
>  stubs/vmstate.c           |  7 +++++
>  8 files changed, 137 insertions(+), 10 deletions(-)
> 
> diff --git a/include/migration/cpr.h b/include/migration/cpr.h
> index e886c98..5cd373f 100644
> --- a/include/migration/cpr.h
> +++ b/include/migration/cpr.h
> @@ -30,6 +30,7 @@ int cpr_state_save(Error **errp);
>  int cpr_state_load(Error **errp);
>  void cpr_state_close(void);
>  struct QIOChannel *cpr_state_ioc(void);
> +bool cpr_needed_for_reuse(void *opaque);
>  
>  QEMUFile *cpr_transfer_output(const char *uri, Error **errp);
>  QEMUFile *cpr_transfer_input(const char *uri, Error **errp);
> diff --git a/migration/cpr.c b/migration/cpr.c
> index 86f66c1..911b556 100644
> --- a/migration/cpr.c
> +++ b/migration/cpr.c
> @@ -9,6 +9,7 @@
>  #include "qapi/error.h"
>  #include "migration/cpr.h"
>  #include "migration/misc.h"
> +#include "migration/options.h"
>  #include "migration/qemu-file.h"
>  #include "migration/savevm.h"
>  #include "migration/vmstate.h"
> @@ -57,7 +58,7 @@ static const VMStateDescription vmstate_cpr_fd = {
>          VMSTATE_UINT32(namelen, CprFd),
>          VMSTATE_VBUFFER_ALLOC_UINT32(name, CprFd, 0, NULL, namelen),
>          VMSTATE_INT32(id, CprFd),
> -        VMSTATE_INT32(fd, CprFd),
> +        VMSTATE_FD(fd, CprFd),
>          VMSTATE_END_OF_LIST()
>      }
>  };
> @@ -174,9 +175,16 @@ int cpr_state_save(Error **errp)
>  {
>      int ret;
>      QEMUFile *f;
> +    MigMode mode = migrate_mode();
>  
> -    /* set f based on mode in a later patch in this series */
> -    return 0;
> +    if (mode == MIG_MODE_CPR_TRANSFER) {
> +        f = cpr_transfer_output(migrate_cpr_uri(), errp);
> +    } else {
> +        return 0;
> +    }
> +    if (!f) {
> +        return -1;
> +    }
>  
>      qemu_put_be32(f, QEMU_CPR_FILE_MAGIC);
>      qemu_put_be32(f, QEMU_CPR_FILE_VERSION);
> @@ -205,8 +213,18 @@ int cpr_state_load(Error **errp)
>      uint32_t v;
>      QEMUFile *f;
>  
> -    /* set f based on mode in a later patch in this series */
> -    return 0;
> +    /*
> +     * Mode will be loaded in CPR state, so cannot use it to decide which
> +     * form of state to load.
> +     */
> +    if (cpr_uri) {
> +        f = cpr_transfer_input(cpr_uri, errp);
> +    } else {
> +        return 0;
> +    }
> +    if (!f) {
> +        return -1;
> +    }
>  
>      v = qemu_get_be32(f);
>      if (v != QEMU_CPR_FILE_MAGIC) {
> @@ -243,3 +261,9 @@ void cpr_state_close(void)
>          cpr_state_file = NULL;
>      }
>  }
> +
> +bool cpr_needed_for_reuse(void *opaque)
> +{
> +    MigMode mode = migrate_mode();
> +    return mode == MIG_MODE_CPR_TRANSFER;
> +}

Drop it until used?

> diff --git a/migration/migration.c b/migration/migration.c
> index 3301583..73b85aa 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -76,6 +76,7 @@
>  static NotifierWithReturnList migration_state_notifiers[] = {
>      NOTIFIER_ELEM_INIT(migration_state_notifiers, MIG_MODE_NORMAL),
>      NOTIFIER_ELEM_INIT(migration_state_notifiers, MIG_MODE_CPR_REBOOT),
> +    NOTIFIER_ELEM_INIT(migration_state_notifiers, MIG_MODE_CPR_TRANSFER),
>  };
>  
>  /* Messages sent on the return path from destination to source */
> @@ -109,6 +110,7 @@ static int migration_maybe_pause(MigrationState *s,
>  static void migrate_fd_cancel(MigrationState *s);
>  static bool close_return_path_on_source(MigrationState *s);
>  static void migration_completion_end(MigrationState *s);
> +static void migrate_hup_delete(MigrationState *s);
>  
>  static void migration_downtime_start(MigrationState *s)
>  {
> @@ -204,6 +206,12 @@ migration_channels_and_transport_compatible(MigrationAddress *addr,
>          return false;
>      }
>  
> +    if (migrate_mode() == MIG_MODE_CPR_TRANSFER &&
> +        addr->transport == MIGRATION_ADDRESS_TYPE_FILE) {
> +        error_setg(errp, "Migration requires streamable transport (eg unix)");
> +        return false;
> +    }
> +
>      return true;
>  }
>  
> @@ -316,6 +324,7 @@ void migration_cancel(const Error *error)
>          qmp_cancel_vcpu_dirty_limit(false, -1, NULL);
>      }
>      migrate_fd_cancel(current_migration);
> +    migrate_hup_delete(current_migration);
>  }
>  
>  void migration_shutdown(void)
> @@ -718,6 +727,9 @@ static void qemu_start_incoming_migration(const char *uri, bool has_channels,
>      } else {
>          error_setg(errp, "unknown migration protocol: %s", uri);
>      }
> +
> +    /* Close cpr socket to tell source that we are listening */
> +    cpr_state_close();

Would it be possible to use some explicit reply message to mark this?  So
far looks like src QEMU will continue with qmp_migrate_finish() even if the
cpr channel was closed due to error.

I still didn't see how that kind of issue was captured below [1] (e.g., cpr
channel broken after sending partial fds)?
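
To make the suggestion concrete, here is a rough sketch of what I mean by an
explicit reply, written against a plain socket rather than QIOChannel.  The
ack byte value and the wait_for_ready() helper are made up for illustration
and are not part of this series.  The point is only that a bare close (EOF)
and a positive acknowledgement become distinguishable on the source side.

```c
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

#define CPR_READY_BYTE 'R'  /* hypothetical ack value, not defined by the series */

/*
 * Block until the peer either sends the ready byte or closes the socket.
 * Returns 0 on an explicit ack, -1 if the peer closed without acking,
 * sent something unexpected, or the read failed.
 */
static int wait_for_ready(int sock)
{
    char c;
    ssize_t n;

    do {
        n = read(sock, &c, 1);
    } while (n < 0 && errno == EINTR);

    if (n == 1 && c == CPR_READY_BYTE) {
        return 0;           /* target reports that it is listening */
    }
    return -1;              /* EOF from a bare close, error, or bad byte */
}
```

With something like this, the source could fail the migration when the channel
drops early instead of proceeding to qmp_migrate_finish().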

>  }
>  
>  static void process_incoming_migration_bh(void *opaque)
> @@ -1414,6 +1426,8 @@ static void migrate_fd_cleanup(MigrationState *s)
>      s->vmdesc = NULL;
>  
>      qemu_savevm_state_cleanup();
> +    cpr_state_close();
> +    migrate_hup_delete(s);
>  
>      close_return_path_on_source(s);
>  
> @@ -1698,7 +1712,9 @@ bool migration_thread_is_self(void)
>  
>  bool migrate_mode_is_cpr(MigrationState *s)
>  {
> -    return s->parameters.mode == MIG_MODE_CPR_REBOOT;
> +    MigMode mode = s->parameters.mode;
> +    return mode == MIG_MODE_CPR_REBOOT ||
> +           mode == MIG_MODE_CPR_TRANSFER;
>  }
>  
>  int migrate_init(MigrationState *s, Error **errp)
> @@ -2033,6 +2049,12 @@ static bool migrate_prepare(MigrationState *s, bool resume, Error **errp)
>          return false;
>      }
>  
> +    if (migrate_mode() == MIG_MODE_CPR_TRANSFER &&
> +        !s->parameters.cpr_uri) {
> +        error_setg(errp, "cpr-transfer mode requires setting cpr-uri");
> +        return false;
> +    }
> +
>      if (migration_is_blocked(errp)) {
>          return false;
>      }
> @@ -2076,6 +2098,37 @@ static bool migrate_prepare(MigrationState *s, bool resume, Error **errp)
>  static void qmp_migrate_finish(MigrationAddress *addr, bool resume_requested,
>                                 Error **errp);
>  
> +static void migrate_hup_add(MigrationState *s, QIOChannel *ioc, GSourceFunc cb,
> +                            void *opaque)
> +{
> +    s->hup_source = qio_channel_create_watch(ioc, G_IO_HUP);
> +    g_source_set_callback(s->hup_source, cb, opaque, NULL);
> +    g_source_attach(s->hup_source, NULL);
> +}
> +
> +static void migrate_hup_delete(MigrationState *s)
> +{
> +    if (s->hup_source) {
> +        g_source_destroy(s->hup_source);
> +        g_source_unref(s->hup_source);
> +        s->hup_source = NULL;
> +    }
> +}
> +
> +static gboolean qmp_migrate_finish_cb(QIOChannel *channel,
> +                                      GIOCondition cond,
> +                                      void *opaque)
> +{
> +    MigrationAddress *addr = opaque;

[1]

> +
> +    qmp_migrate_finish(addr, false, NULL);
> +
> +    cpr_state_close();
> +    migrate_hup_delete(migrate_get_current());
> +    qapi_free_MigrationAddress(addr);
> +    return G_SOURCE_REMOVE;
> +}
> +
>  void qmp_migrate(const char *uri, bool has_channels,
>                   MigrationChannelList *channels, bool has_detach, bool detach,
>                   bool has_resume, bool resume, Error **errp)
> @@ -2136,7 +2189,19 @@ void qmp_migrate(const char *uri, bool has_channels,
>          goto out;
>      }
>  
> -    qmp_migrate_finish(addr, resume_requested, errp);
> +    /*
> +     * For cpr-transfer, the target may not be listening yet on the migration
> +     * channel, because first it must finish cpr_state_load.  The target tells
> +     * us it is listening by closing the cpr-state socket.  Wait for that HUP
> +     * event before connecting in qmp_migrate_finish.
> +     */
> +    if (s->parameters.mode == MIG_MODE_CPR_TRANSFER) {
> +        migrate_hup_add(s, cpr_state_ioc(), (GSourceFunc)qmp_migrate_finish_cb,
> +                        QAPI_CLONE(MigrationAddress, addr));
> +
> +    } else {
> +        qmp_migrate_finish(addr, resume_requested, errp);
> +    }
>  
>  out:
>      if (local_err) {
> diff --git a/migration/migration.h b/migration/migration.h
> index 38aa140..74c167b 100644
> --- a/migration/migration.h
> +++ b/migration/migration.h
> @@ -457,6 +457,8 @@ struct MigrationState {
>      bool switchover_acked;
>      /* Is this a rdma migration */
>      bool rdma_migration;
> +
> +    GSource *hup_source;
>  };
>  
>  void migrate_set_state(MigrationStatus *state, MigrationStatus old_state,
> diff --git a/migration/ram.c b/migration/ram.c
> index 81eda27..e2cef50 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -216,7 +216,9 @@ static bool postcopy_preempt_active(void)
>  
>  bool migrate_ram_is_ignored(RAMBlock *block)
>  {
> +    MigMode mode = migrate_mode();
>      return !qemu_ram_is_migratable(block) ||
> +           mode == MIG_MODE_CPR_TRANSFER ||
>             (migrate_ignore_shared() && qemu_ram_is_shared(block)
>                                      && qemu_ram_is_named_file(block));
>  }
> diff --git a/migration/vmstate-types.c b/migration/vmstate-types.c
> index 6e45a4a..b5a55b8 100644
> --- a/migration/vmstate-types.c
> +++ b/migration/vmstate-types.c
> @@ -15,6 +15,7 @@
>  #include "qemu-file.h"
>  #include "migration.h"
>  #include "migration/vmstate.h"
> +#include "migration/client-options.h"
>  #include "qemu/error-report.h"
>  #include "qemu/queue.h"
>  #include "trace.h"
> @@ -321,7 +322,7 @@ static int get_fd(QEMUFile *f, void *pv, size_t size,
>  {
>      int32_t *v = pv;
>      qemu_get_sbe32s(f, v);
> -    if (*v < 0) {
> +    if (*v < 0 || migrate_mode() != MIG_MODE_CPR_TRANSFER) {
>          return 0;
>      }
>      *v = qemu_file_get_fd(f);
> @@ -334,7 +335,7 @@ static int put_fd(QEMUFile *f, void *pv, size_t size,
>      int32_t *v = pv;
>  
>      qemu_put_sbe32s(f, v);
> -    if (*v < 0) {
> +    if (*v < 0 || migrate_mode() != MIG_MODE_CPR_TRANSFER) {

So I suppose you wanted to guard against VMSTATE_FD being abused.  Then I
wonder whether it'd help more to add a comment above VMSTATE_FD instead;
that would be more straightforward to me.

And if you want to fail hard, an assert would work better at runtime too;
the "return 0" can be pretty hard to notice.

>          return 0;
>      }
>      return qemu_file_put_fd(f, *v);
> diff --git a/qapi/migration.json b/qapi/migration.json
> index c0d8bcc..f51b4cb 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -611,9 +611,34 @@
>  #     or COLO.
>  #
>  #     (since 8.2)
> +#
> +# @cpr-transfer: This mode allows the user to transfer a guest to a
> +#     new QEMU instance on the same host with minimal guest pause
> +#     time, by preserving guest RAM in place, albeit with new virtual
> +#     addresses in new QEMU.
> +#
> +#     The user starts new QEMU on the same host as old QEMU, with the
> +#     same arguments as old QEMU, plus the -incoming option.  The
> +#     user issues the migrate command to old QEMU, which stops the VM,
> +#     saves state to the migration channels, and enters the
> +#     postmigrate state.  Execution resumes in new QEMU.  Guest RAM is
> +#     preserved in place, albeit with new virtual addresses in new
> +#     QEMU.  The incoming migration channel cannot be a file type.
> +#
> +#     This mode requires a second migration channel, specified by the
> +#     cpr-uri migration property on the outgoing side, and by
> +#     the cpr-uri QEMU command-line option on the incoming
> +#     side.  The channel must be a type, such as unix socket, that
> +#     supports SCM_RIGHTS.
> +#
> +#     Memory-backend objects must have the share=on attribute, but
> +#     memory-backend-epc is not supported.  The VM must be started
> +#     with the '-machine anon-alloc=memfd' option.
> +#
> +#     (since 9.2)
>  ##
>  { 'enum': 'MigMode',
> -  'data': [ 'normal', 'cpr-reboot' ] }
> +  'data': [ 'normal', 'cpr-reboot', 'cpr-transfer' ] }

No need to rush, but please add the CPR.rst and unit test updates when you
feel confident in the protocol.  It looks pretty good to me now.

It would be especially nice to describe the separate cpr-channel protocol
in the new doc page.

Thanks,

>  
>  ##
>  # @ZeroPageDetection:
> diff --git a/stubs/vmstate.c b/stubs/vmstate.c
> index 8513d92..c190762 100644
> --- a/stubs/vmstate.c
> +++ b/stubs/vmstate.c
> @@ -1,5 +1,7 @@
>  #include "qemu/osdep.h"
>  #include "migration/vmstate.h"
> +#include "qapi/qapi-types-migration.h"
> +#include "migration/client-options.h"
>  
>  int vmstate_register_with_alias_id(VMStateIf *obj,
>                                     uint32_t instance_id,
> @@ -21,3 +23,8 @@ bool vmstate_check_only_migratable(const VMStateDescription *vmsd)
>  {
>      return true;
>  }
> +
> +MigMode migrate_mode(void)
> +{
> +    return MIG_MODE_NORMAL;
> +}
> -- 
> 1.8.3.1
> 

-- 
Peter Xu




* Re: [PATCH V2 03/13] migration: save cpr mode
  2024-10-07 19:31     ` Steven Sistare
@ 2024-10-07 20:10       ` Peter Xu
  2024-10-08 15:57         ` Steven Sistare
  0 siblings, 1 reply; 79+ messages in thread
From: Peter Xu @ 2024-10-07 20:10 UTC (permalink / raw)
  To: Steven Sistare
  Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster

On Mon, Oct 07, 2024 at 03:31:09PM -0400, Steven Sistare wrote:
> On 10/7/2024 11:18 AM, Peter Xu wrote:
> > On Mon, Sep 30, 2024 at 12:40:34PM -0700, Steve Sistare wrote:
> > > Save the mode in CPR state, so the user does not need to explicitly specify
> > > it for the target.  Modify migrate_mode() so it returns the incoming mode on
> > > the target.
> > > 
> > > Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> > > ---
> > >   include/migration/cpr.h |  7 +++++++
> > >   migration/cpr.c         | 23 ++++++++++++++++++++++-
> > >   migration/migration.c   |  1 +
> > >   migration/options.c     |  9 +++++++--
> > >   4 files changed, 37 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/include/migration/cpr.h b/include/migration/cpr.h
> > > index e7b898b..ac7a63e 100644
> > > --- a/include/migration/cpr.h
> > > +++ b/include/migration/cpr.h
> > > @@ -8,9 +8,16 @@
> > >   #ifndef MIGRATION_CPR_H
> > >   #define MIGRATION_CPR_H
> > > +#include "qapi/qapi-types-migration.h"
> > > +
> > > +#define MIG_MODE_NONE           -1
> > > +
> > >   #define QEMU_CPR_FILE_MAGIC     0x51435052
> > >   #define QEMU_CPR_FILE_VERSION   0x00000001
> > > +MigMode cpr_get_incoming_mode(void);
> > > +void cpr_set_incoming_mode(MigMode mode);
> > > +
> > >   typedef int (*cpr_walk_fd_cb)(int fd);
> > >   void cpr_save_fd(const char *name, int id, int fd);
> > >   void cpr_delete_fd(const char *name, int id);
> > > diff --git a/migration/cpr.c b/migration/cpr.c
> > > index e50fc75..7514c4e 100644
> > > --- a/migration/cpr.c
> > > +++ b/migration/cpr.c
> > > @@ -21,10 +21,23 @@
> > >   typedef QLIST_HEAD(CprFdList, CprFd) CprFdList;
> > >   typedef struct CprState {
> > > +    MigMode mode;
> > >       CprFdList fds;
> > >   } CprState;
> > > -static CprState cpr_state;
> > > +static CprState cpr_state = {
> > > +    .mode = MIG_MODE_NONE,
> > > +};
> > > +
> > > +MigMode cpr_get_incoming_mode(void)
> > > +{
> > > +    return cpr_state.mode;
> > > +}
> > > +
> > > +void cpr_set_incoming_mode(MigMode mode)
> > > +{
> > > +    cpr_state.mode = mode;
> > > +}
> > >   /****************************************************************************/
> > > @@ -124,11 +137,19 @@ void cpr_resave_fd(const char *name, int id, int fd)
> > >   /*************************************************************************/
> > >   #define CPR_STATE "CprState"
> > > +static int cpr_state_presave(void *opaque)
> > > +{
> > > +    cpr_state.mode = migrate_mode();
> > > +    return 0;
> > > +}
> > > +
> > >   static const VMStateDescription vmstate_cpr_state = {
> > >       .name = CPR_STATE,
> > >       .version_id = 1,
> > >       .minimum_version_id = 1,
> > > +    .pre_save = cpr_state_presave,
> > >       .fields = (VMStateField[]) {
> > > +        VMSTATE_UINT32(mode, CprState),
> > >           VMSTATE_QLIST_V(fds, CprState, 1, vmstate_cpr_fd, CprFd, next),
> > >           VMSTATE_END_OF_LIST()
> > >       }
> > > diff --git a/migration/migration.c b/migration/migration.c
> > > index 834b0a2..df00e5c 100644
> > > --- a/migration/migration.c
> > > +++ b/migration/migration.c
> > > @@ -416,6 +416,7 @@ void migration_incoming_state_destroy(void)
> > >           mis->postcopy_qemufile_dst = NULL;
> > >       }
> > > +    cpr_set_incoming_mode(MIG_MODE_NONE);
> > >       yank_unregister_instance(MIGRATION_YANK_INSTANCE);
> > >   }
> > > diff --git a/migration/options.c b/migration/options.c
> > > index 147cd2b..cc85a84 100644
> > > --- a/migration/options.c
> > > +++ b/migration/options.c
> > > @@ -22,6 +22,7 @@
> > >   #include "qapi/qmp/qnull.h"
> > >   #include "sysemu/runstate.h"
> > >   #include "migration/colo.h"
> > > +#include "migration/cpr.h"
> > >   #include "migration/misc.h"
> > >   #include "migration.h"
> > >   #include "migration-stats.h"
> > > @@ -768,8 +769,12 @@ uint64_t migrate_max_postcopy_bandwidth(void)
> > >   MigMode migrate_mode(void)
> > >   {
> > > -    MigrationState *s = migrate_get_current();
> > > -    MigMode mode = s->parameters.mode;
> > > +    MigMode mode = cpr_get_incoming_mode();
> > > +
> > > +    if (mode == MIG_MODE_NONE) {
> > > +        MigrationState *s = migrate_get_current();
> > > +        mode = s->parameters.mode;
> > > +    }
> > 
> > Is this trying to avoid interfering with what user specified?
> 
> No.
> 
> > I can kind of get the point of it, but it'll also look pretty weird in this
> > case that user can set the mode but then when query before cpr-transfer
> > incoming completes it won't read what was set previously, but what was
> > migrated via the cpr channel.
> > 
> > And IIUC it is needed to migrate this mode in cpr stream so as to avoid
> > another new qemu cmdline on dest qemu.  If true this needs to be mentioned
> > in the commit message; so far it reads like it's optional, then it's not
> > clear why only cpr-mode needs to be migrated not other migration parameters.
> 
> The mode is needed on the incoming side early -- before migration_object_init,
> and before the monitor is started.  Thus the user cannot set it as a normal
> migration parameter.
> 
> > If that won't get right easily, I wonder whether we could just overwrite
> > parameters.mode directly by the cpr stream.
> 
> I considered that, but parameters.mode cannot be set before migration_object_init,
> and some code needs to know mode before that.

Ah OK...

I wonder whether migrating this mode helps at all, given that no mode other
than cpr-transfer should be in effect when -cpr-uri is on the cmdline.

How about we use cpr_uri to detect the early-stage cpr-transfer mode, then
after the early load stage we unset cpr_uri and always stick with what the
user specified (instead of special-casing NONE mode)?  Then it looks like:

MigMode migrate_mode(void)
{
  /*
   * When cpr_uri is set, it always means QEMU is currently in the early
   * cpr-transfer loading stage.
   */
  if (cpr_uri) {
      return MIG_MODE_CPR_TRANSFER;
  }

  return migrate_get_current()->parameters.mode;
}

Then we don't need to migrate the mode either, which is good as it aligns
with other migration parameters.

Would this look slightly cleaner?

-- 
Peter Xu




* Re: [PATCH V2 13/13] migration: cpr-transfer mode
  2024-10-07 19:44   ` Peter Xu
@ 2024-10-07 20:39     ` Steven Sistare
  2024-10-08 15:45       ` Peter Xu
  2024-10-08 18:28       ` Fabiano Rosas
  0 siblings, 2 replies; 79+ messages in thread
From: Steven Sistare @ 2024-10-07 20:39 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster

On 10/7/2024 3:44 PM, Peter Xu wrote:
> On Mon, Sep 30, 2024 at 12:40:44PM -0700, Steve Sistare wrote:
>> Add the cpr-transfer migration mode.  Usage:
>>    qemu-system-$arch -machine anon-alloc=memfd ...
>>
>>    start new QEMU with "-incoming <uri-1> -cpr-uri <uri-2>"
>>
>>    Issue commands to old QEMU:
>>    migrate_set_parameter mode cpr-transfer
>>    migrate_set_parameter cpr-uri <uri-2>
>>    migrate -d <uri-1>
>>
>> The migrate command stops the VM, saves CPR state to uri-2, saves
>> normal migration state to uri-1, and old QEMU enters the postmigrate
>> state.  The user starts new QEMU on the same host as old QEMU, with the
>> same arguments as old QEMU, plus the -incoming option.  Guest RAM is
>> preserved in place, albeit with new virtual addresses in new QEMU.
>>
>> This mode requires a second migration channel, specified by the
>> cpr-uri migration property on the outgoing side, and by the cpr-uri
>> QEMU command-line option on the incoming side.  The channel must
>> be a type, such as unix socket, that supports SCM_RIGHTS.
>>
>> Memory-backend objects must have the share=on attribute, but
>> memory-backend-epc is not supported.  The VM must be started with
>> the '-machine anon-alloc=memfd' option, which allows anonymous
>> memory to be transferred in place to the new process.  The memfds
>> are kept open by sending the descriptors to new QEMU via the
>> cpr-uri, which must support SCM_RIGHTS, and they are mmap'd
>> in new QEMU.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>>   include/migration/cpr.h   |  1 +
>>   migration/cpr.c           | 34 +++++++++++++++++++----
>>   migration/migration.c     | 69 +++++++++++++++++++++++++++++++++++++++++++++--
>>   migration/migration.h     |  2 ++
>>   migration/ram.c           |  2 ++
>>   migration/vmstate-types.c |  5 ++--
>>   qapi/migration.json       | 27 ++++++++++++++++++-
>>   stubs/vmstate.c           |  7 +++++
>>   8 files changed, 137 insertions(+), 10 deletions(-)
>>
>> diff --git a/include/migration/cpr.h b/include/migration/cpr.h
>> index e886c98..5cd373f 100644
>> --- a/include/migration/cpr.h
>> +++ b/include/migration/cpr.h
>> @@ -30,6 +30,7 @@ int cpr_state_save(Error **errp);
>>   int cpr_state_load(Error **errp);
>>   void cpr_state_close(void);
>>   struct QIOChannel *cpr_state_ioc(void);
>> +bool cpr_needed_for_reuse(void *opaque);
>>   
>>   QEMUFile *cpr_transfer_output(const char *uri, Error **errp);
>>   QEMUFile *cpr_transfer_input(const char *uri, Error **errp);
>> diff --git a/migration/cpr.c b/migration/cpr.c
>> index 86f66c1..911b556 100644
>> --- a/migration/cpr.c
>> +++ b/migration/cpr.c
>> @@ -9,6 +9,7 @@
>>   #include "qapi/error.h"
>>   #include "migration/cpr.h"
>>   #include "migration/misc.h"
>> +#include "migration/options.h"
>>   #include "migration/qemu-file.h"
>>   #include "migration/savevm.h"
>>   #include "migration/vmstate.h"
>> @@ -57,7 +58,7 @@ static const VMStateDescription vmstate_cpr_fd = {
>>           VMSTATE_UINT32(namelen, CprFd),
>>           VMSTATE_VBUFFER_ALLOC_UINT32(name, CprFd, 0, NULL, namelen),
>>           VMSTATE_INT32(id, CprFd),
>> -        VMSTATE_INT32(fd, CprFd),
>> +        VMSTATE_FD(fd, CprFd),
>>           VMSTATE_END_OF_LIST()
>>       }
>>   };
>> @@ -174,9 +175,16 @@ int cpr_state_save(Error **errp)
>>   {
>>       int ret;
>>       QEMUFile *f;
>> +    MigMode mode = migrate_mode();
>>   
>> -    /* set f based on mode in a later patch in this series */
>> -    return 0;
>> +    if (mode == MIG_MODE_CPR_TRANSFER) {
>> +        f = cpr_transfer_output(migrate_cpr_uri(), errp);
>> +    } else {
>> +        return 0;
>> +    }
>> +    if (!f) {
>> +        return -1;
>> +    }
>>   
>>       qemu_put_be32(f, QEMU_CPR_FILE_MAGIC);
>>       qemu_put_be32(f, QEMU_CPR_FILE_VERSION);
>> @@ -205,8 +213,18 @@ int cpr_state_load(Error **errp)
>>       uint32_t v;
>>       QEMUFile *f;
>>   
>> -    /* set f based on mode in a later patch in this series */
>> -    return 0;
>> +    /*
>> +     * Mode will be loaded in CPR state, so cannot use it to decide which
>> +     * form of state to load.
>> +     */
>> +    if (cpr_uri) {
>> +        f = cpr_transfer_input(cpr_uri, errp);
>> +    } else {
>> +        return 0;
>> +    }
>> +    if (!f) {
>> +        return -1;
>> +    }
>>   
>>       v = qemu_get_be32(f);
>>       if (v != QEMU_CPR_FILE_MAGIC) {
>> @@ -243,3 +261,9 @@ void cpr_state_close(void)
>>           cpr_state_file = NULL;
>>       }
>>   }
>> +
>> +bool cpr_needed_for_reuse(void *opaque)
>> +{
>> +    MigMode mode = migrate_mode();
>> +    return mode == MIG_MODE_CPR_TRANSFER;
>> +}
> 
> Drop it until used?

Maybe, but here is my reason for including it here.

These common functions like cpr_needed_for_reuse and cpr_resave_fd are needed
by multiple follow-on series: vfio, tap, iommufd.  To send those for comment,
as I have been, I need to prepend a patch for cpr_needed_for_reuse to each of
those series, which is redundant.  It makes more sense IMO to include them in
this initial series.

But, it's your call.

>> diff --git a/migration/migration.c b/migration/migration.c
>> index 3301583..73b85aa 100644
>> --- a/migration/migration.c
>> +++ b/migration/migration.c
>> @@ -76,6 +76,7 @@
>>   static NotifierWithReturnList migration_state_notifiers[] = {
>>       NOTIFIER_ELEM_INIT(migration_state_notifiers, MIG_MODE_NORMAL),
>>       NOTIFIER_ELEM_INIT(migration_state_notifiers, MIG_MODE_CPR_REBOOT),
>> +    NOTIFIER_ELEM_INIT(migration_state_notifiers, MIG_MODE_CPR_TRANSFER),
>>   };
>>   
>>   /* Messages sent on the return path from destination to source */
>> @@ -109,6 +110,7 @@ static int migration_maybe_pause(MigrationState *s,
>>   static void migrate_fd_cancel(MigrationState *s);
>>   static bool close_return_path_on_source(MigrationState *s);
>>   static void migration_completion_end(MigrationState *s);
>> +static void migrate_hup_delete(MigrationState *s);
>>   
>>   static void migration_downtime_start(MigrationState *s)
>>   {
>> @@ -204,6 +206,12 @@ migration_channels_and_transport_compatible(MigrationAddress *addr,
>>           return false;
>>       }
>>   
>> +    if (migrate_mode() == MIG_MODE_CPR_TRANSFER &&
>> +        addr->transport == MIGRATION_ADDRESS_TYPE_FILE) {
>> +        error_setg(errp, "Migration requires streamable transport (eg unix)");
>> +        return false;
>> +    }
>> +
>>       return true;
>>   }
>>   
>> @@ -316,6 +324,7 @@ void migration_cancel(const Error *error)
>>           qmp_cancel_vcpu_dirty_limit(false, -1, NULL);
>>       }
>>       migrate_fd_cancel(current_migration);
>> +    migrate_hup_delete(current_migration);
>>   }
>>   
>>   void migration_shutdown(void)
>> @@ -718,6 +727,9 @@ static void qemu_start_incoming_migration(const char *uri, bool has_channels,
>>       } else {
>>           error_setg(errp, "unknown migration protocol: %s", uri);
>>       }
>> +
>> +    /* Close cpr socket to tell source that we are listening */
>> +    cpr_state_close();
> 
> Would it be possible to use some explicit reply message to mark this?  

In theory yes, but I fear that using a return channel with message parsing and
dispatch adds more code than it is worth.

> So
> far looks like src QEMU will continue with qmp_migrate_finish() even if the
> cpr channel was closed due to error.

Yes, but we recover just fine.  The target hits some error, fails to read all the
cpr state, closes the channel prematurely, and does *not* create a listen socket
for the normal migration channel.  Hence qmp_migrate_finish fails to connect to the
normal channel, and recovers.
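
To illustrate the close-as-signal handshake described above, here is a
minimal standalone sketch.  It is not QEMU code: wait_for_peer_close() and
demo_hup_handshake() are hypothetical names, and it uses plain poll() on a
socketpair instead of a QIOChannel watch.  It assumes the Linux behaviour
for AF_UNIX stream sockets, where closing one end raises POLLHUP on the
other end.

```c
#include <assert.h>
#include <poll.h>
#include <stdbool.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper (not a QEMU function): wait up to timeout_ms for the
 * peer of a connected unix socket to close its end.  POLLHUP is reported
 * even when no events are requested.
 */
static bool wait_for_peer_close(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = 0 };
    int ret;

    assert(fd >= 0);
    ret = poll(&pfd, 1, timeout_ms);
    return ret == 1 && (pfd.revents & POLLHUP);
}

/*
 * Demo: sv[0] plays the source side and sv[1] the target side.  The target
 * "signals" by closing its end; the source must see no hangup before the
 * close and a hangup after it.
 */
static bool demo_hup_handshake(void)
{
    int sv[2];
    bool before, after;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) {
        return false;
    }
    before = wait_for_peer_close(sv[0], 0);     /* peer still open */
    close(sv[1]);                               /* target closes */
    after = wait_for_peer_close(sv[0], 1000);   /* source sees HUP */
    close(sv[0]);
    return !before && after;
}
```

A hangup alone does not tell the source whether the target closed after
loading all cpr state or after an error.  As described above, the source
finds that out when it tries to connect to the normal migration channel.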

> I still didn't see how that kind of issue was captured below [1] (e.g., cpr
> channel broken after sending partial fds)?

Same as above.

>>   }
>>   
>>   static void process_incoming_migration_bh(void *opaque)
>> @@ -1414,6 +1426,8 @@ static void migrate_fd_cleanup(MigrationState *s)
>>       s->vmdesc = NULL;
>>   
>>       qemu_savevm_state_cleanup();
>> +    cpr_state_close();
>> +    migrate_hup_delete(s);
>>   
>>       close_return_path_on_source(s);
>>   
>> @@ -1698,7 +1712,9 @@ bool migration_thread_is_self(void)
>>   
>>   bool migrate_mode_is_cpr(MigrationState *s)
>>   {
>> -    return s->parameters.mode == MIG_MODE_CPR_REBOOT;
>> +    MigMode mode = s->parameters.mode;
>> +    return mode == MIG_MODE_CPR_REBOOT ||
>> +           mode == MIG_MODE_CPR_TRANSFER;
>>   }
>>   
>>   int migrate_init(MigrationState *s, Error **errp)
>> @@ -2033,6 +2049,12 @@ static bool migrate_prepare(MigrationState *s, bool resume, Error **errp)
>>           return false;
>>       }
>>   
>> +    if (migrate_mode() == MIG_MODE_CPR_TRANSFER &&
>> +        !s->parameters.cpr_uri) {
>> +        error_setg(errp, "cpr-transfer mode requires setting cpr-uri");
>> +        return false;
>> +    }
>> +
>>       if (migration_is_blocked(errp)) {
>>           return false;
>>       }
>> @@ -2076,6 +2098,37 @@ static bool migrate_prepare(MigrationState *s, bool resume, Error **errp)
>>   static void qmp_migrate_finish(MigrationAddress *addr, bool resume_requested,
>>                                  Error **errp);
>>   
>> +static void migrate_hup_add(MigrationState *s, QIOChannel *ioc, GSourceFunc cb,
>> +                            void *opaque)
>> +{
>> +        s->hup_source = qio_channel_create_watch(ioc, G_IO_HUP);
>> +        g_source_set_callback(s->hup_source, cb, opaque, NULL);
>> +        g_source_attach(s->hup_source, NULL);
>> +}
>> +
>> +static void migrate_hup_delete(MigrationState *s)
>> +{
>> +    if (s->hup_source) {
>> +        g_source_destroy(s->hup_source);
>> +        g_source_unref(s->hup_source);
>> +        s->hup_source = NULL;
>> +    }
>> +}
>> +
>> +static gboolean qmp_migrate_finish_cb(QIOChannel *channel,
>> +                                      GIOCondition cond,
>> +                                      void *opaque)
>> +{
>> +    MigrationAddress *addr = opaque;
> 
> [1]
> 
>> +
>> +    qmp_migrate_finish(addr, false, NULL);
>> +
>> +    cpr_state_close();
>> +    migrate_hup_delete(migrate_get_current());
>> +    qapi_free_MigrationAddress(addr);
>> +    return G_SOURCE_REMOVE;
>> +}
>> +
>>   void qmp_migrate(const char *uri, bool has_channels,
>>                    MigrationChannelList *channels, bool has_detach, bool detach,
>>                    bool has_resume, bool resume, Error **errp)
>> @@ -2136,7 +2189,19 @@ void qmp_migrate(const char *uri, bool has_channels,
>>           goto out;
>>       }
>>   
>> -    qmp_migrate_finish(addr, resume_requested, errp);
>> +    /*
>> +     * For cpr-transfer, the target may not be listening yet on the migration
>> +     * channel, because first it must finish cpr_load_state.  The target tells
>> +     * us it is listening by closing the cpr-state socket.  Wait for that HUP
>> +     * event before connecting in qmp_migrate_finish.
>> +     */
>> +    if (s->parameters.mode == MIG_MODE_CPR_TRANSFER) {
>> +        migrate_hup_add(s, cpr_state_ioc(), (GSourceFunc)qmp_migrate_finish_cb,
>> +                        QAPI_CLONE(MigrationAddress, addr));
>> +
>> +    } else {
>> +        qmp_migrate_finish(addr, resume_requested, errp);
>> +    }
>>   
>>   out:
>>       if (local_err) {
>> diff --git a/migration/migration.h b/migration/migration.h
>> index 38aa140..74c167b 100644
>> --- a/migration/migration.h
>> +++ b/migration/migration.h
>> @@ -457,6 +457,8 @@ struct MigrationState {
>>       bool switchover_acked;
>>       /* Is this a rdma migration */
>>       bool rdma_migration;
>> +
>> +    GSource *hup_source;
>>   };
>>   
>>   void migrate_set_state(MigrationStatus *state, MigrationStatus old_state,
>> diff --git a/migration/ram.c b/migration/ram.c
>> index 81eda27..e2cef50 100644
>> --- a/migration/ram.c
>> +++ b/migration/ram.c
>> @@ -216,7 +216,9 @@ static bool postcopy_preempt_active(void)
>>   
>>   bool migrate_ram_is_ignored(RAMBlock *block)
>>   {
>> +    MigMode mode = migrate_mode();
>>       return !qemu_ram_is_migratable(block) ||
>> +           mode == MIG_MODE_CPR_TRANSFER ||
>>              (migrate_ignore_shared() && qemu_ram_is_shared(block)
>>                                       && qemu_ram_is_named_file(block));
>>   }
>> diff --git a/migration/vmstate-types.c b/migration/vmstate-types.c
>> index 6e45a4a..b5a55b8 100644
>> --- a/migration/vmstate-types.c
>> +++ b/migration/vmstate-types.c
>> @@ -15,6 +15,7 @@
>>   #include "qemu-file.h"
>>   #include "migration.h"
>>   #include "migration/vmstate.h"
>> +#include "migration/client-options.h"
>>   #include "qemu/error-report.h"
>>   #include "qemu/queue.h"
>>   #include "trace.h"
>> @@ -321,7 +322,7 @@ static int get_fd(QEMUFile *f, void *pv, size_t size,
>>   {
>>       int32_t *v = pv;
>>       qemu_get_sbe32s(f, v);
>> -    if (*v < 0) {
>> +    if (*v < 0 || migrate_mode() != MIG_MODE_CPR_TRANSFER) {
>>           return 0;
>>       }
>>       *v = qemu_file_get_fd(f);
>> @@ -334,7 +335,7 @@ static int put_fd(QEMUFile *f, void *pv, size_t size,
>>       int32_t *v = pv;
>>   
>>       qemu_put_sbe32s(f, v);
>> -    if (*v < 0) {
>> +    if (*v < 0 || migrate_mode() != MIG_MODE_CPR_TRANSFER) {
> 
> So I suppose you wanted to guard against VMSTATE_FD being abused.  Then I
> wonder whether a comment above VMSTATE_FD would help more; that would be
> more straightforward to me.
> 
> And if you want to fail hard, an assert would work better at runtime too;
> otherwise the "return 0" can be pretty hard to notice.

No, this code is not about detecting abuse or errors.  It is there to skip
the qemu_file_put_fd for cpr-exec mode.  In my next version this function will
simply be:

static int put_fd(QEMUFile *f, void *pv, size_t size,
                   const VMStateField *field, JSONWriter *vmdesc)
{
     int32_t *v = pv;
     return qemu_file_put_fd(f, *v);
}
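
For readers unfamiliar with the underlying mechanism, the sketch below shows
how a descriptor can be passed over a unix socket with SCM_RIGHTS, which is
why the cpr channel must be a type that supports it.  This is a standalone
illustration under stated assumptions, not the QEMU implementation:
send_fd(), recv_fd() and demo_fd_transfer() are hypothetical names, and the
one-byte payload is just a carrier for the ancillary data.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer sized and aligned for exactly one descriptor. */
union fd_cmsg {
    char buf[CMSG_SPACE(sizeof(int))];
    struct cmsghdr align;
};

/* Hypothetical helper: send fd plus a 1-byte payload over a unix socket. */
static int send_fd(int sock, int fd)
{
    char byte = 0;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_cmsg u;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    memset(&u, 0, sizeof(u));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    assert(cmsg != NULL);               /* buffer holds one header */
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Hypothetical helper: receive one descriptor; returns it, or -1 on error. */
static int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_cmsg u;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd;

    memset(&u, 0, sizeof(u));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    if (recvmsg(sock, &msg, 0) != 1) {
        return -1;
    }
    cmsg = CMSG_FIRSTHDR(&msg);
    if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int))) {
        return -1;
    }
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}

/*
 * Demo: pass the read end of a pipe across a socketpair, then read through
 * the received copy.  Returns 0 if the byte written to the pipe arrives.
 */
static int demo_fd_transfer(void)
{
    int sv[2], p[2], fd, ok;
    char c = 0;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) {
        return -1;
    }
    if (pipe(p) < 0) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }
    fd = send_fd(sv[0], p[0]) == 0 ? recv_fd(sv[1]) : -1;
    ok = fd >= 0 && write(p[1], "x", 1) == 1 && read(fd, &c, 1) == 1;
    if (fd >= 0) {
        close(fd);
    }
    close(p[0]);
    close(p[1]);
    close(sv[0]);
    close(sv[1]);
    return ok && c == 'x' ? 0 : -1;
}
```

The received descriptor generally has a different number than the one sent,
but it refers to the same open file, which is what lets the new process keep
using a memfd or device descriptor opened by the old one.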

>>           return 0;
>>       }
>>       return qemu_file_put_fd(f, *v);
>> diff --git a/qapi/migration.json b/qapi/migration.json
>> index c0d8bcc..f51b4cb 100644
>> --- a/qapi/migration.json
>> +++ b/qapi/migration.json
>> @@ -611,9 +611,34 @@
>>   #     or COLO.
>>   #
>>   #     (since 8.2)
>> +#
>> +# @cpr-transfer: This mode allows the user to transfer a guest to a
>> +#     new QEMU instance on the same host with minimal guest pause
>> +#     time, by preserving guest RAM in place, albeit with new virtual
>> +#     addresses in new QEMU.
>> +#
>> +#     The user starts new QEMU on the same host as old QEMU, with the
>> +#     same arguments as old QEMU, plus the -incoming option.  The
>> +#     user issues the migrate command to old QEMU, which stops the VM,
>> +#     saves state to the migration channels, and enters the
>> +#     postmigrate state.  Execution resumes in new QEMU.  Guest RAM is
>> +#     preserved in place, albeit with new virtual addresses in new
>> +#     QEMU.  The incoming migration channel cannot be a file type.
>> +#
>> +#     This mode requires a second migration channel, specified by the
>> +#     cpr-uri migration property on the outgoing side, and by
>> +#     the cpr-uri QEMU command-line option on the incoming
>> +#     side.  The channel must be a type, such as unix socket, that
>> +#     supports SCM_RIGHTS.
>> +#
>> +#     Memory-backend objects must have the share=on attribute, but
>> +#     memory-backend-epc is not supported.  The VM must be started
>> +#     with the '-machine anon-alloc=memfd' option.
>> +#
>> +#     (since 9.2)
>>   ##
>>   { 'enum': 'MigMode',
>> -  'data': [ 'normal', 'cpr-reboot' ] }
>> +  'data': [ 'normal', 'cpr-reboot', 'cpr-transfer' ] }
> 
> No need to rush, but please add the CPR.rst and unit test updates when you
> feel confident in the protocol.  It looks pretty good to me now.
> 
> It would be especially nice to describe the separate cpr-channel protocol
> in the new doc page.

Will do, now that there is light at the end of the tunnel.

- Steve

>>   ##
>>   # @ZeroPageDetection:
>> diff --git a/stubs/vmstate.c b/stubs/vmstate.c
>> index 8513d92..c190762 100644
>> --- a/stubs/vmstate.c
>> +++ b/stubs/vmstate.c
>> @@ -1,5 +1,7 @@
>>   #include "qemu/osdep.h"
>>   #include "migration/vmstate.h"
>> +#include "qapi/qapi-types-migration.h"
>> +#include "migration/client-options.h"
>>   
>>   int vmstate_register_with_alias_id(VMStateIf *obj,
>>                                      uint32_t instance_id,
>> @@ -21,3 +23,8 @@ bool vmstate_check_only_migratable(const VMStateDescription *vmsd)
>>   {
>>       return true;
>>   }
>> +
>> +MigMode migrate_mode(void)
>> +{
>> +    return MIG_MODE_NORMAL;
>> +}
>> -- 
>> 1.8.3.1
>>
> 




* Re: [PATCH V2 04/13] migration: stop vm earlier for cpr
  2024-10-07 15:27   ` Peter Xu
@ 2024-10-07 20:52     ` Steven Sistare
  2024-10-08 15:35       ` Peter Xu
  0 siblings, 1 reply; 79+ messages in thread
From: Steven Sistare @ 2024-10-07 20:52 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster

On 10/7/2024 11:27 AM, Peter Xu wrote:
> On Mon, Sep 30, 2024 at 12:40:35PM -0700, Steve Sistare wrote:
>> Stop the vm earlier for cpr, to guarantee consistent device state when
>> CPR state is saved.
> 
> Could you add some more info on why this order matters?
> 
> E.g., qmp_migrate should switch migration state machine to SETUP, while
> this path holds BQL, I think it means there's no way devices got hot added
> concurrently of the whole process.
> 
> Would other things change in the cpr states (name, fd, etc.)?  It'll be
> great to mention these details in the commit message.

Because of the new cpr-state save operation needed by this mode,
I created this patch to be future proof.  Performing a save operation while
the machine is running is asking for trouble.  But right now, I am not aware
of any specific issues.

Later in the "tap and vhost" series there is another reason to stop the vm here and
save cpr state, because the devices must be stopped in old qemu before they
are initialized in new qemu.  If you are curious, see the 2 patches I attached
to the email at
   https://lore.kernel.org/qemu-devel/fa95c40d-b5e5-41eb-bba7-7842bca2f73e@oracle.com/
But, that has nothing to do with the contents of cpr state.

- Steve

>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>>   migration/migration.c | 22 +++++++++++++---------
>>   1 file changed, 13 insertions(+), 9 deletions(-)
>>
>> diff --git a/migration/migration.c b/migration/migration.c
>> index df00e5c..868bf0e 100644
>> --- a/migration/migration.c
>> +++ b/migration/migration.c
>> @@ -2082,6 +2082,7 @@ void qmp_migrate(const char *uri, bool has_channels,
>>       MigrationState *s = migrate_get_current();
>>       g_autoptr(MigrationChannel) channel = NULL;
>>       MigrationAddress *addr = NULL;
>> +    bool stopped = false;
>>   
>>       /*
>>        * Having preliminary checks for uri and channel
>> @@ -2125,6 +2126,15 @@ void qmp_migrate(const char *uri, bool has_channels,
>>           }
>>       }
>>   
>> +    if (migrate_mode_is_cpr(s)) {
>> +        int ret = migration_stop_vm(s, RUN_STATE_FINISH_MIGRATE);
>> +        if (ret < 0) {
>> +            error_setg(&local_err, "migration_stop_vm failed, error %d", -ret);
>> +            goto out;
>> +        }
>> +        stopped = true;
>> +    }
>> +
>>       if (cpr_state_save(&local_err)) {
>>           goto out;
>>       }
>> @@ -2160,6 +2170,9 @@ out:
>>           }
>>           migrate_fd_error(s, local_err);
>>           error_propagate(errp, local_err);
>> +        if (stopped) {
>> +            vm_resume(s->vm_old_state);
>> +        }
>>           return;
>>       }
>>   }
>> @@ -3743,7 +3756,6 @@ void migrate_fd_connect(MigrationState *s, Error *error_in)
>>       Error *local_err = NULL;
>>       uint64_t rate_limit;
>>       bool resume = (s->state == MIGRATION_STATUS_POSTCOPY_RECOVER_SETUP);
>> -    int ret;
>>   
>>       /*
>>        * If there's a previous error, free it and prepare for another one.
>> @@ -3815,14 +3827,6 @@ void migrate_fd_connect(MigrationState *s, Error *error_in)
>>           return;
>>       }
>>   
>> -    if (migrate_mode_is_cpr(s)) {
>> -        ret = migration_stop_vm(s, RUN_STATE_FINISH_MIGRATE);
>> -        if (ret < 0) {
>> -            error_setg(&local_err, "migration_stop_vm failed, error %d", -ret);
>> -            goto fail;
>> -        }
>> -    }
>> -
>>       if (migrate_background_snapshot()) {
>>           qemu_thread_create(&s->thread, "mig/snapshot",
>>                   bg_migration_thread, s, QEMU_THREAD_JOINABLE);
>> -- 
>> 1.8.3.1
>>
> 




* Re: [PATCH V2 00/13] Live update: cpr-transfer
  2024-09-30 19:40 [PATCH V2 00/13] Live update: cpr-transfer Steve Sistare
                   ` (12 preceding siblings ...)
  2024-09-30 19:40 ` [PATCH V2 13/13] migration: cpr-transfer mode Steve Sistare
@ 2024-10-08 14:33 ` Vladimir Sementsov-Ogievskiy
  2024-10-08 21:13   ` Steven Sistare
  13 siblings, 1 reply; 79+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2024-10-08 14:33 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel
  Cc: Peter Xu, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster

On 30.09.24 22:40, Steve Sistare wrote:
> Some devices need new kernel software interfaces
> to allow a descriptor to be used in a process that did not originally open it.

Hi Steve!

Could you please describe which kernel version / features are required? I'm mostly interested in migration of tap and vhost-user devices.

-- 
Best regards,
Vladimir




* Re: [PATCH V2 05/13] physmem: preserve ram blocks for cpr
  2024-10-07 16:28     ` Peter Xu
@ 2024-10-08 15:17       ` Steven Sistare
  2024-10-08 16:26         ` Peter Xu
  0 siblings, 1 reply; 79+ messages in thread
From: Steven Sistare @ 2024-10-08 15:17 UTC (permalink / raw)
  To: Peter Xu, Igor Mammedov, Michael S. Tsirkin
  Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster

On 10/7/2024 12:28 PM, Peter Xu wrote:
> On Mon, Oct 07, 2024 at 11:49:25AM -0400, Peter Xu wrote:
>> On Mon, Sep 30, 2024 at 12:40:36PM -0700, Steve Sistare wrote:
>>> Save the memfd for anonymous ramblocks in CPR state, along with a name
>>> that uniquely identifies it.  The block's idstr is not yet set, so it
>>> cannot be used for this purpose.  Find the saved memfd in new QEMU when
>>> creating a block.  QEMU hard-codes the length of some internally-created
>>> blocks, so to guard against that length changing, use lseek to get the
>>> actual length of an incoming memfd.
>>>
>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>> ---
>>>   system/physmem.c | 25 ++++++++++++++++++++++++-
>>>   1 file changed, 24 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/system/physmem.c b/system/physmem.c
>>> index 174f7e0..ddbeec9 100644
>>> --- a/system/physmem.c
>>> +++ b/system/physmem.c
>>> @@ -72,6 +72,7 @@
>>>   
>>>   #include "qapi/qapi-types-migration.h"
>>>   #include "migration/options.h"
>>> +#include "migration/cpr.h"
>>>   #include "migration/vmstate.h"
>>>   
>>>   #include "qemu/range.h"
>>> @@ -1663,6 +1664,19 @@ void qemu_ram_unset_idstr(RAMBlock *block)
>>>       }
>>>   }
>>>   
>>> +static char *cpr_name(RAMBlock *block)
>>> +{
>>> +    MemoryRegion *mr = block->mr;
>>> +    const char *mr_name = memory_region_name(mr);
>>> +    g_autofree char *id = mr->dev ? qdev_get_dev_path(mr->dev) : NULL;
>>> +
>>> +    if (id) {
>>> +        return g_strdup_printf("%s/%s", id, mr_name);
>>> +    } else {
>>> +        return g_strdup(mr_name);
>>> +    }
>>> +}
>>> +
>>>   size_t qemu_ram_pagesize(RAMBlock *rb)
>>>   {
>>>       return rb->page_size;
>>> @@ -1858,14 +1872,18 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
>>>                                           TYPE_MEMORY_BACKEND)) {
>>>               size_t max_length = new_block->max_length;
>>>               MemoryRegion *mr = new_block->mr;
>>> -            const char *name = memory_region_name(mr);
>>> +            g_autofree char *name = cpr_name(new_block);
>>>   
>>>               new_block->mr->align = QEMU_VMALLOC_ALIGN;
>>>               new_block->flags |= RAM_SHARED;
>>> +            new_block->fd = cpr_find_fd(name, 0);
>>>   
>>>               if (new_block->fd == -1) {
>>>                   new_block->fd = qemu_memfd_create(name, max_length + mr->align,
>>>                                                     0, 0, 0, errp);
>>> +                cpr_save_fd(name, 0, new_block->fd);
>>> +            } else {
>>> +                new_block->max_length = lseek(new_block->fd, 0, SEEK_END);
>>
>> So this can overwrite the max_length that the caller specified..
>>
>> I remember we used to have some tricks on specifying different max_length
>> for ROMs on dest QEMU (on which, qemu firmwares also upgraded on the dest
>> host so the size can be bigger than src qemu's old ramblocks), so that the
>> MR is always large enough to reload even the new firmwares, while migration
>> only migrates the smaller size (used_length) so it's fine as we keep the
>> extra sizes empty. I think that can be relevant to the qemu_ram_resize() call
>> of parse_ramblock().

Yes, resizable ram block for firmware blob is the only case I know of where
the length changed in the past.  If a length changes in the future, we will
need to detect and accommodate that change here, and I believe the fix will
be to simply use the actual length, as per the code above.  But if you prefer,
for now I can check for length change and return an error. New qemu will fail
to start, and old qemu will recover.
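
For illustration only, and not part of the patch: a check of that kind could
look roughly like the sketch below.  The helper names fd_length() and
fd_length_matches() are hypothetical, and the sketch assumes only that the
incoming descriptor refers to a seekable object such as a memfd.

```c
#include <stdbool.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: return the actual size of the object behind fd,
 * or -1 if lseek fails (for example on an invalid descriptor).
 */
off_t fd_length(int fd)
{
    return lseek(fd, 0, SEEK_END);
}

/*
 * Hypothetical helper: true only if the size of the object behind fd
 * equals the length the caller expected.  A caller could report an
 * error and refuse to start when this returns false.
 */
bool fd_length_matches(int fd, off_t expected)
{
    off_t len = fd_length(fd);

    return len >= 0 && len == expected;
}
```

The alternative in the patch, adopting the actual length, amounts to assigning
the result of fd_length() instead of comparing it.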

>> The reload will not happen until some point, perhaps system resets.  I
>> wonder whether that is an issue in this case.

Firmware is only generated once, via this path on x86:
   qmp_x_exit_preconfig
     qemu_machine_creation_done
       qdev_machine_creation_done
         pc_machine_done
           acpi_setup
             acpi_add_rom_blob
               rom_add_blob
                 rom_set_mr

After a system reset, the ramblock contents from memory are used as-is.

> PS: If this is needed by CPR-transfer only because mmap() later can fail
> due to a bigger max_length, 

That is the reason.  IMO adjusting max_length is more robust than fiddling
with truncate and pretending that max_length is larger, when qemu will never
be able to use the phantom space up to max_length.

- Steve

> I wonder whether it can be fixed by passing
> truncate=true in the upcoming file_ram_alloc(), rather than overwriting
> the max_length value itself.
> 
>>
>>>               }
>>>   
>>>               if (new_block->fd >= 0) {
>>> @@ -1875,6 +1893,7 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
>>>                                                    false, 0, errp);
>>>               }
>>>               if (!new_block->host) {
>>> +                cpr_delete_fd(name, 0);
>>>                   qemu_mutex_unlock_ramlist();
>>>                   return;
>>>               }
>>> @@ -2182,6 +2201,8 @@ static void reclaim_ramblock(RAMBlock *block)
>>>   
>>>   void qemu_ram_free(RAMBlock *block)
>>>   {
>>> +    g_autofree char *name = NULL;
>>> +
>>>       if (!block) {
>>>           return;
>>>       }
>>> @@ -2192,6 +2213,8 @@ void qemu_ram_free(RAMBlock *block)
>>>       }
>>>   
>>>       qemu_mutex_lock_ramlist();
>>> +    name = cpr_name(block);
>>> +    cpr_delete_fd(name, 0);
>>>       QLIST_REMOVE_RCU(block, next);
>>>       ram_list.mru_block = NULL;
>>>       /* Write list before version */
>>> -- 
>>> 1.8.3.1
>>>
>>
>> -- 
>> Peter Xu
> 




* Re: [PATCH V2 04/13] migration: stop vm earlier for cpr
  2024-10-07 20:52     ` Steven Sistare
@ 2024-10-08 15:35       ` Peter Xu
  2024-10-08 19:13         ` Steven Sistare
  0 siblings, 1 reply; 79+ messages in thread
From: Peter Xu @ 2024-10-08 15:35 UTC (permalink / raw)
  To: Steven Sistare
  Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster

On Mon, Oct 07, 2024 at 04:52:43PM -0400, Steven Sistare wrote:
> On 10/7/2024 11:27 AM, Peter Xu wrote:
> > On Mon, Sep 30, 2024 at 12:40:35PM -0700, Steve Sistare wrote:
> > > Stop the vm earlier for cpr, to guarantee consistent device state when
> > > CPR state is saved.
> > 
> > Could you add some more info on why this order matters?
> > 
> > E.g., qmp_migrate should switch migration state machine to SETUP, while
> > this path holds BQL, I think it means there's no way devices got hot added
> > concurrently of the whole process.
> > 
> > Would other things change in the cpr states (name, fd, etc.)?  It'll be
> > great to mention these details in the commit message.
> 
> Because of the new cpr-state save operation needed by this mode,
> I created this patch to be future proof.  Performing a save operation while
> the machine is running is asking for trouble.  But right now, I am not aware
> of any specific issues.
> 
> Later in the "tap and vhost" series there is another reason to stop the vm here and
> save cpr state, because the devices must be stopped in old qemu before they
> are initialized in new qemu.  If you are curious, see the 2 patches I attached
> to the email at
>   https://lore.kernel.org/qemu-devel/fa95c40d-b5e5-41eb-bba7-7842bca2f73e@oracle.com/
> But, that has nothing to do with the contents of cpr state.

Then I suggest we leave this patch to the vhost/tap series, then please
document clearly in the commit message why this is needed.  Linking to
that discussion thread could work too.

Side note: I saw you have MIG_EVENT_PRECOPY_CPR_SETUP in your own tree, I
wonder whether we could reuse MIG_EVENT_PRECOPY_SETUP by moving it earlier
in qmp_migrate().  After all CPR-* notifiers are already registered
separately with the list of migration_state_notifiers[], so I suppose it'll
serve the same purpose.  But we can discuss that later.

-- 
Peter Xu




* Re: [PATCH V2 09/13] migration: cpr-transfer save and load
  2024-10-07 19:31     ` Steven Sistare
@ 2024-10-08 15:36       ` Peter Xu
  0 siblings, 0 replies; 79+ messages in thread
From: Peter Xu @ 2024-10-08 15:36 UTC (permalink / raw)
  To: Steven Sistare
  Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster

On Mon, Oct 07, 2024 at 03:31:18PM -0400, Steven Sistare wrote:
> On 10/7/2024 12:47 PM, Peter Xu wrote:
> > On Mon, Sep 30, 2024 at 12:40:40PM -0700, Steve Sistare wrote:
> > > Add functions to create a QEMUFile based on a unix URI, for saving or
> > > loading, for use by cpr-transfer mode to preserve CPR state.
> > > 
> > > Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> > 
> > Reviewed-by: Peter Xu <peterx@redhat.com>
> > 
> > There're a few extra newlines below, though, which could be removed.
> 
> I added the extra lines for readability.  They separate multi-line conditional
> expressions from the body that follows, and separate one if-then-else body
> from the next body.

I think that's not what we normally do in QEMU's code base, but that's
still OK if you prefer; I don't think we have a strong requirement on such
formatting yet.

-- 
Peter Xu




* Re: [PATCH V2 13/13] migration: cpr-transfer mode
  2024-10-07 20:39     ` Steven Sistare
@ 2024-10-08 15:45       ` Peter Xu
  2024-10-08 19:12         ` Steven Sistare
  2024-10-08 18:28       ` Fabiano Rosas
  1 sibling, 1 reply; 79+ messages in thread
From: Peter Xu @ 2024-10-08 15:45 UTC (permalink / raw)
  To: Steven Sistare
  Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster

On Mon, Oct 07, 2024 at 04:39:25PM -0400, Steven Sistare wrote:
> On 10/7/2024 3:44 PM, Peter Xu wrote:
> > On Mon, Sep 30, 2024 at 12:40:44PM -0700, Steve Sistare wrote:
> > > Add the cpr-transfer migration mode.  Usage:
> > >    qemu-system-$arch -machine anon-alloc=memfd ...
> > > 
> > >    start new QEMU with "-incoming <uri-1> -cpr-uri <uri-2>"
> > > 
> > >    Issue commands to old QEMU:
> > >    migrate_set_parameter mode cpr-transfer
> > >    migrate_set_parameter cpr-uri <uri-2>
> > >    migrate -d <uri-1>
> > > 
> > > The migrate command stops the VM, saves CPR state to uri-2, saves
> > > normal migration state to uri-1, and old QEMU enters the postmigrate
> > > state.  The user starts new QEMU on the same host as old QEMU, with the
> > > same arguments as old QEMU, plus the -incoming option.  Guest RAM is
> > > preserved in place, albeit with new virtual addresses in new QEMU.
> > > 
> > > This mode requires a second migration channel, specified by the
> > > cpr-uri migration property on the outgoing side, and by the cpr-uri
> > > QEMU command-line option on the incoming side.  The channel must
> > > be a type, such as unix socket, that supports SCM_RIGHTS.
> > > 
> > > Memory-backend objects must have the share=on attribute, but
> > > memory-backend-epc is not supported.  The VM must be started with
> > > the '-machine anon-alloc=memfd' option, which allows anonymous
> > > memory to be transferred in place to the new process.  The memfds
> > > are kept open by sending the descriptors to new QEMU via the
> > > cpr-uri, which must support SCM_RIGHTS, and they are mmap'd
> > > in new QEMU.
> > > 
> > > Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> > > ---
> > >   include/migration/cpr.h   |  1 +
> > >   migration/cpr.c           | 34 +++++++++++++++++++----
> > >   migration/migration.c     | 69 +++++++++++++++++++++++++++++++++++++++++++++--
> > >   migration/migration.h     |  2 ++
> > >   migration/ram.c           |  2 ++
> > >   migration/vmstate-types.c |  5 ++--
> > >   qapi/migration.json       | 27 ++++++++++++++++++-
> > >   stubs/vmstate.c           |  7 +++++
> > >   8 files changed, 137 insertions(+), 10 deletions(-)
> > > 
> > > diff --git a/include/migration/cpr.h b/include/migration/cpr.h
> > > index e886c98..5cd373f 100644
> > > --- a/include/migration/cpr.h
> > > +++ b/include/migration/cpr.h
> > > @@ -30,6 +30,7 @@ int cpr_state_save(Error **errp);
> > >   int cpr_state_load(Error **errp);
> > >   void cpr_state_close(void);
> > >   struct QIOChannel *cpr_state_ioc(void);
> > > +bool cpr_needed_for_reuse(void *opaque);
> > >   QEMUFile *cpr_transfer_output(const char *uri, Error **errp);
> > >   QEMUFile *cpr_transfer_input(const char *uri, Error **errp);
> > > diff --git a/migration/cpr.c b/migration/cpr.c
> > > index 86f66c1..911b556 100644
> > > --- a/migration/cpr.c
> > > +++ b/migration/cpr.c
> > > @@ -9,6 +9,7 @@
> > >   #include "qapi/error.h"
> > >   #include "migration/cpr.h"
> > >   #include "migration/misc.h"
> > > +#include "migration/options.h"
> > >   #include "migration/qemu-file.h"
> > >   #include "migration/savevm.h"
> > >   #include "migration/vmstate.h"
> > > @@ -57,7 +58,7 @@ static const VMStateDescription vmstate_cpr_fd = {
> > >           VMSTATE_UINT32(namelen, CprFd),
> > >           VMSTATE_VBUFFER_ALLOC_UINT32(name, CprFd, 0, NULL, namelen),
> > >           VMSTATE_INT32(id, CprFd),
> > > -        VMSTATE_INT32(fd, CprFd),
> > > +        VMSTATE_FD(fd, CprFd),
> > >           VMSTATE_END_OF_LIST()
> > >       }
> > >   };
> > > @@ -174,9 +175,16 @@ int cpr_state_save(Error **errp)
> > >   {
> > >       int ret;
> > >       QEMUFile *f;
> > > +    MigMode mode = migrate_mode();
> > > -    /* set f based on mode in a later patch in this series */
> > > -    return 0;
> > > +    if (mode == MIG_MODE_CPR_TRANSFER) {
> > > +        f = cpr_transfer_output(migrate_cpr_uri(), errp);
> > > +    } else {
> > > +        return 0;
> > > +    }
> > > +    if (!f) {
> > > +        return -1;
> > > +    }
> > >       qemu_put_be32(f, QEMU_CPR_FILE_MAGIC);
> > >       qemu_put_be32(f, QEMU_CPR_FILE_VERSION);
> > > @@ -205,8 +213,18 @@ int cpr_state_load(Error **errp)
> > >       uint32_t v;
> > >       QEMUFile *f;
> > > -    /* set f based on mode in a later patch in this series */
> > > -    return 0;
> > > +    /*
> > > +     * Mode will be loaded in CPR state, so cannot use it to decide which
> > > +     * form of state to load.
> > > +     */
> > > +    if (cpr_uri) {
> > > +        f = cpr_transfer_input(cpr_uri, errp);
> > > +    } else {
> > > +        return 0;
> > > +    }
> > > +    if (!f) {
> > > +        return -1;
> > > +    }
> > >       v = qemu_get_be32(f);
> > >       if (v != QEMU_CPR_FILE_MAGIC) {
> > > @@ -243,3 +261,9 @@ void cpr_state_close(void)
> > >           cpr_state_file = NULL;
> > >       }
> > >   }
> > > +
> > > +bool cpr_needed_for_reuse(void *opaque)
> > > +{
> > > +    MigMode mode = migrate_mode();
> > > +    return mode == MIG_MODE_CPR_TRANSFER;
> > > +}
> > 
> > Drop it until used?
> 
> Maybe, but here is my reason for including it here.
> 
> These common functions like cpr_needed_for_reuse and cpr_resave_fd are needed
> by multiple follow-on series: vfio, tap, iommufd.  To send those for comment,
> as I have been, I need to prepend a patch for cpr_needed_for_reuse to each of
> those series, which is redundant.  It makes more sense IMO to include them in
> this initial series.
> 
> But, it's your call.

Hmm, logically we shouldn't keep any dead code in QEMU, but indeed this is
slightly special.

Would you mind keeping all these helpers in a separate patch after the base
patches?  The commit message should describe which future projects will
start to use them, then whoever notices later (I at least know Dave has quite
a few patches recently removing dead code in QEMU) will know that's
potentially to-be-used code, and so should keep it around.

> 
> > > diff --git a/migration/migration.c b/migration/migration.c
> > > index 3301583..73b85aa 100644
> > > --- a/migration/migration.c
> > > +++ b/migration/migration.c
> > > @@ -76,6 +76,7 @@
> > >   static NotifierWithReturnList migration_state_notifiers[] = {
> > >       NOTIFIER_ELEM_INIT(migration_state_notifiers, MIG_MODE_NORMAL),
> > >       NOTIFIER_ELEM_INIT(migration_state_notifiers, MIG_MODE_CPR_REBOOT),
> > > +    NOTIFIER_ELEM_INIT(migration_state_notifiers, MIG_MODE_CPR_TRANSFER),
> > >   };
> > >   /* Messages sent on the return path from destination to source */
> > > @@ -109,6 +110,7 @@ static int migration_maybe_pause(MigrationState *s,
> > >   static void migrate_fd_cancel(MigrationState *s);
> > >   static bool close_return_path_on_source(MigrationState *s);
> > >   static void migration_completion_end(MigrationState *s);
> > > +static void migrate_hup_delete(MigrationState *s);
> > >   static void migration_downtime_start(MigrationState *s)
> > >   {
> > > @@ -204,6 +206,12 @@ migration_channels_and_transport_compatible(MigrationAddress *addr,
> > >           return false;
> > >       }
> > > +    if (migrate_mode() == MIG_MODE_CPR_TRANSFER &&
> > > +        addr->transport == MIGRATION_ADDRESS_TYPE_FILE) {
> > > +        error_setg(errp, "Migration requires streamable transport (eg unix)");
> > > +        return false;
> > > +    }
> > > +
> > >       return true;
> > >   }
> > > @@ -316,6 +324,7 @@ void migration_cancel(const Error *error)
> > >           qmp_cancel_vcpu_dirty_limit(false, -1, NULL);
> > >       }
> > >       migrate_fd_cancel(current_migration);
> > > +    migrate_hup_delete(current_migration);
> > >   }
> > >   void migration_shutdown(void)
> > > @@ -718,6 +727,9 @@ static void qemu_start_incoming_migration(const char *uri, bool has_channels,
> > >       } else {
> > >           error_setg(errp, "unknown migration protocol: %s", uri);
> > >       }
> > > +
> > > +    /* Close cpr socket to tell source that we are listening */
> > > +    cpr_state_close();
> > 
> > Would it be possible to use some explicit reply message to mark this?
> 
> In theory yes, but I fear that using a return channel with message parsing and
> dispatch adds more code than it is worth.
> 
> > So
> > far looks like src QEMU will continue with qmp_migrate_finish() even if the
> > cpr channel was closed due to error.
> 
> Yes, but we recover just fine.  The target hits some error, fails to read all the
> cpr state, closes the channel prematurely, and does *not* create a listen socket
> for the normal migration channel.  Hence qmp_migrate_finish fails to connect to the
> normal channel, and recovers.

This is a slightly tricky part and it would be nice to document it somewhere,
perhaps starting with the commit message.

Then the error will say "failed to connect to destination QEMU" hiding the
real failure (cpr save/load failed), right?  That's slightly a pity.

I'm OK with the HUP as of now, but if you care about accurate CPR-stage
error reporting, then feel free to draft something else in the next post.
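
As a rough illustration of the hang-up signalling discussed above, outside of
QEMU and GLib: the patch watches the cpr channel for G_IO_HUP, and a comparable
condition can be observed at the system-call level with poll().  The function
name peer_closed() is hypothetical, and the sketch assumes a connected unix
stream socket.  Note that it cannot tell a deliberate close from a peer that
failed, which is the error-reporting limitation mentioned above.

```c
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: wait up to timeout_ms for the peer of a connected
 * stream socket to close its end.  Returns 1 if the peer has closed,
 * 0 otherwise (timeout, poll error, or pending data from a live peer).
 */
int peer_closed(int fd, int timeout_ms)
{
    struct pollfd p = { .fd = fd, .events = POLLIN };
    int n = poll(&p, 1, timeout_ms);

    if (n <= 0) {
        return 0;
    }
    if (p.revents & POLLHUP) {
        return 1;
    }
    if (p.revents & POLLIN) {
        /* Readable with zero bytes available means end of stream. */
        char c;
        return recv(fd, &c, 1, MSG_PEEK) == 0;
    }
    return 0;
}
```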

> 
> > I still didn't see how that kind of issue was captured below [1] (e.g., cpr
> > channel broken after sending partial fds)?
> 
> Same as above.
> 
> > >   }
> > >   static void process_incoming_migration_bh(void *opaque)
> > > @@ -1414,6 +1426,8 @@ static void migrate_fd_cleanup(MigrationState *s)
> > >       s->vmdesc = NULL;
> > >       qemu_savevm_state_cleanup();
> > > +    cpr_state_close();
> > > +    migrate_hup_delete(s);
> > >       close_return_path_on_source(s);
> > > @@ -1698,7 +1712,9 @@ bool migration_thread_is_self(void)
> > >   bool migrate_mode_is_cpr(MigrationState *s)
> > >   {
> > > -    return s->parameters.mode == MIG_MODE_CPR_REBOOT;
> > > +    MigMode mode = s->parameters.mode;
> > > +    return mode == MIG_MODE_CPR_REBOOT ||
> > > +           mode == MIG_MODE_CPR_TRANSFER;
> > >   }
> > >   int migrate_init(MigrationState *s, Error **errp)
> > > @@ -2033,6 +2049,12 @@ static bool migrate_prepare(MigrationState *s, bool resume, Error **errp)
> > >           return false;
> > >       }
> > > +    if (migrate_mode() == MIG_MODE_CPR_TRANSFER &&
> > > +        !s->parameters.cpr_uri) {
> > > +        error_setg(errp, "cpr-transfer mode requires setting cpr-uri");
> > > +        return false;
> > > +    }
> > > +
> > >       if (migration_is_blocked(errp)) {
> > >           return false;
> > >       }
> > > @@ -2076,6 +2098,37 @@ static bool migrate_prepare(MigrationState *s, bool resume, Error **errp)
> > >   static void qmp_migrate_finish(MigrationAddress *addr, bool resume_requested,
> > >                                  Error **errp);
> > > +static void migrate_hup_add(MigrationState *s, QIOChannel *ioc, GSourceFunc cb,
> > > +                            void *opaque)
> > > +{
> > > +        s->hup_source = qio_channel_create_watch(ioc, G_IO_HUP);
> > > +        g_source_set_callback(s->hup_source, cb, opaque, NULL);
> > > +        g_source_attach(s->hup_source, NULL);
> > > +}
> > > +
> > > +static void migrate_hup_delete(MigrationState *s)
> > > +{
> > > +    if (s->hup_source) {
> > > +        g_source_destroy(s->hup_source);
> > > +        g_source_unref(s->hup_source);
> > > +        s->hup_source = NULL;
> > > +    }
> > > +}
> > > +
> > > +static gboolean qmp_migrate_finish_cb(QIOChannel *channel,
> > > +                                      GIOCondition cond,
> > > +                                      void *opaque)
> > > +{
> > > +    MigrationAddress *addr = opaque;
> > 
> > [1]
> > 
> > > +
> > > +    qmp_migrate_finish(addr, false, NULL);
> > > +
> > > +    cpr_state_close();
> > > +    migrate_hup_delete(migrate_get_current());
> > > +    qapi_free_MigrationAddress(addr);
> > > +    return G_SOURCE_REMOVE;
> > > +}
> > > +
> > >   void qmp_migrate(const char *uri, bool has_channels,
> > >                    MigrationChannelList *channels, bool has_detach, bool detach,
> > >                    bool has_resume, bool resume, Error **errp)
> > > @@ -2136,7 +2189,19 @@ void qmp_migrate(const char *uri, bool has_channels,
> > >           goto out;
> > >       }
> > > -    qmp_migrate_finish(addr, resume_requested, errp);
> > > +    /*
> > > +     * For cpr-transfer, the target may not be listening yet on the migration
> > > +     * channel, because first it must finish cpr_load_state.  The target tells
> > > +     * us it is listening by closing the cpr-state socket.  Wait for that HUP
> > > +     * event before connecting in qmp_migrate_finish.
> > > +     */
> > > +    if (s->parameters.mode == MIG_MODE_CPR_TRANSFER) {
> > > +        migrate_hup_add(s, cpr_state_ioc(), (GSourceFunc)qmp_migrate_finish_cb,
> > > +                        QAPI_CLONE(MigrationAddress, addr));
> > > +
> > > +    } else {
> > > +        qmp_migrate_finish(addr, resume_requested, errp);
> > > +    }
> > >   out:
> > >       if (local_err) {
> > > diff --git a/migration/migration.h b/migration/migration.h
> > > index 38aa140..74c167b 100644
> > > --- a/migration/migration.h
> > > +++ b/migration/migration.h
> > > @@ -457,6 +457,8 @@ struct MigrationState {
> > >       bool switchover_acked;
> > >       /* Is this a rdma migration */
> > >       bool rdma_migration;
> > > +
> > > +    GSource *hup_source;
> > >   };
> > >   void migrate_set_state(MigrationStatus *state, MigrationStatus old_state,
> > > diff --git a/migration/ram.c b/migration/ram.c
> > > index 81eda27..e2cef50 100644
> > > --- a/migration/ram.c
> > > +++ b/migration/ram.c
> > > @@ -216,7 +216,9 @@ static bool postcopy_preempt_active(void)
> > >   bool migrate_ram_is_ignored(RAMBlock *block)
> > >   {
> > > +    MigMode mode = migrate_mode();
> > >       return !qemu_ram_is_migratable(block) ||
> > > +           mode == MIG_MODE_CPR_TRANSFER ||
> > >              (migrate_ignore_shared() && qemu_ram_is_shared(block)
> > >                                       && qemu_ram_is_named_file(block));
> > >   }
> > > diff --git a/migration/vmstate-types.c b/migration/vmstate-types.c
> > > index 6e45a4a..b5a55b8 100644
> > > --- a/migration/vmstate-types.c
> > > +++ b/migration/vmstate-types.c
> > > @@ -15,6 +15,7 @@
> > >   #include "qemu-file.h"
> > >   #include "migration.h"
> > >   #include "migration/vmstate.h"
> > > +#include "migration/client-options.h"
> > >   #include "qemu/error-report.h"
> > >   #include "qemu/queue.h"
> > >   #include "trace.h"
> > > @@ -321,7 +322,7 @@ static int get_fd(QEMUFile *f, void *pv, size_t size,
> > >   {
> > >       int32_t *v = pv;
> > >       qemu_get_sbe32s(f, v);
> > > -    if (*v < 0) {
> > > +    if (*v < 0 || migrate_mode() != MIG_MODE_CPR_TRANSFER) {
> > >           return 0;
> > >       }
> > >       *v = qemu_file_get_fd(f);
> > > @@ -334,7 +335,7 @@ static int put_fd(QEMUFile *f, void *pv, size_t size,
> > >       int32_t *v = pv;
> > >       qemu_put_sbe32s(f, v);
> > > -    if (*v < 0) {
> > > +    if (*v < 0 || migrate_mode() != MIG_MODE_CPR_TRANSFER) {
> > 
> > So I suppose you wanted to guard VMSTATE_FD being abused.  Then I wonder
> > whether it'll help more by adding a comment above VMSTATE_FD instead; it'll
> > be more straightforward to me.
> > 
> > And if you want to fail hard, assert should work better too in runtime, or
> > the "return 0" can be pretty hard to notice.
> 
> No, this code is not about detecting abuse or errors.  It is there to skip
> the qemu_file_put_fd for cpr-exec mode.  In my next version this function will
> simply be:
> 
> static int put_fd(QEMUFile *f, void *pv, size_t size,
>                   const VMStateField *field, JSONWriter *vmdesc)
> {
>     int32_t *v = pv;
>     return qemu_file_put_fd(f, *v);
> }

Great, thanks.
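
For readers less familiar with why the cpr channel must support SCM_RIGHTS,
here is a minimal standalone sketch of passing one descriptor over a unix
socket.  It is not QEMU's implementation; the names send_fd() and recv_fd()
are hypothetical, and one dummy data byte is sent because ancillary data
needs to accompany at least some payload on a stream socket.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: send fd over a unix socket.  Returns 0 or -1. */
int send_fd(int sock, int fd)
{
    char byte = 0;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;           /* forces correct alignment */
    } u;
    struct msghdr msg = { 0 };
    struct cmsghdr *cmsg;

    memset(&u, 0, sizeof(u));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Hypothetical helper: receive one fd.  Returns the new fd or -1. */
int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } u;
    struct msghdr msg = { 0 };
    struct cmsghdr *cmsg;
    int fd = -1;

    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    if (recvmsg(sock, &msg, 0) != 1) {
        return -1;
    }
    cmsg = CMSG_FIRSTHDR(&msg);
    if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int))) {
        return -1;
    }
    /* The kernel installs a new descriptor number in the receiver. */
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

The received descriptor refers to the same open file description as the one
that was sent, which is what lets a memfd stay alive across the two processes.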

> 
> > >           return 0;
> > >       }
> > >       return qemu_file_put_fd(f, *v);
> > > diff --git a/qapi/migration.json b/qapi/migration.json
> > > index c0d8bcc..f51b4cb 100644
> > > --- a/qapi/migration.json
> > > +++ b/qapi/migration.json
> > > @@ -611,9 +611,34 @@
> > >   #     or COLO.
> > >   #
> > >   #     (since 8.2)
> > > +#
> > > +# @cpr-transfer: This mode allows the user to transfer a guest to a
> > > +#     new QEMU instance on the same host with minimal guest pause
> > > +#     time, by preserving guest RAM in place, albeit with new virtual
> > > +#     addresses in new QEMU.
> > > +#
> > > +#     The user starts new QEMU on the same host as old QEMU, with the
> > > +#     same arguments as old QEMU, plus the -incoming option.  The
> > > +#     user issues the migrate command to old QEMU, which stops the VM,
> > > +#     saves state to the migration channels, and enters the
> > > +#     postmigrate state.  Execution resumes in new QEMU.  Guest RAM is
> > > +#     preserved in place, albeit with new virtual addresses in new
> > > +#     QEMU.  The incoming migration channel cannot be a file type.
> > > +#
> > > +#     This mode requires a second migration channel, specified by the
> > > +#     cpr-uri migration property on the outgoing side, and by
> > > +#     the cpr-uri QEMU command-line option on the incoming
> > > +#     side.  The channel must be a type, such as unix socket, that
> > > +#     supports SCM_RIGHTS.
> > > +#
> > > +#     Memory-backend objects must have the share=on attribute, but
> > > +#     memory-backend-epc is not supported.  The VM must be started
> > > +#     with the '-machine anon-alloc=memfd' option.
> > > +#
> > > +#     (since 9.2)
> > >   ##
> > >   { 'enum': 'MigMode',
> > > -  'data': [ 'normal', 'cpr-reboot' ] }
> > > +  'data': [ 'normal', 'cpr-reboot', 'cpr-transfer' ] }
> > 
> > No need to rush, but please add the CPR.rst and unit test updates when you
> > feel confident on the protocol.  It looks pretty good to me now.
> > 
> > Especially it'll be nice to describe the separate cpr-channel protocol in
> > the new doc page.
> 
> Will do, now that there is light at the end of the tunnel.

I just noticed that we have 1 month left before soft freeze. I'll try to
prioritize review of this series (and the other VFIO one) in the upcoming
month.  Let's see whether it can hit 9.2.

-- 
Peter Xu




* Re: [PATCH V2 03/13] migration: save cpr mode
  2024-10-07 20:10       ` Peter Xu
@ 2024-10-08 15:57         ` Steven Sistare
  0 siblings, 0 replies; 79+ messages in thread
From: Steven Sistare @ 2024-10-08 15:57 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster

On 10/7/2024 4:10 PM, Peter Xu wrote:
> On Mon, Oct 07, 2024 at 03:31:09PM -0400, Steven Sistare wrote:
>> On 10/7/2024 11:18 AM, Peter Xu wrote:
>>> On Mon, Sep 30, 2024 at 12:40:34PM -0700, Steve Sistare wrote:
>>>> Save the mode in CPR state, so the user does not need to explicitly specify
>>>> it for the target.  Modify migrate_mode() so it returns the incoming mode on
>>>> the target.
>>>>
>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>> ---
>>>>    include/migration/cpr.h |  7 +++++++
>>>>    migration/cpr.c         | 23 ++++++++++++++++++++++-
>>>>    migration/migration.c   |  1 +
>>>>    migration/options.c     |  9 +++++++--
>>>>    4 files changed, 37 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/include/migration/cpr.h b/include/migration/cpr.h
>>>> index e7b898b..ac7a63e 100644
>>>> --- a/include/migration/cpr.h
>>>> +++ b/include/migration/cpr.h
>>>> @@ -8,9 +8,16 @@
>>>>    #ifndef MIGRATION_CPR_H
>>>>    #define MIGRATION_CPR_H
>>>> +#include "qapi/qapi-types-migration.h"
>>>> +
>>>> +#define MIG_MODE_NONE           -1
>>>> +
>>>>    #define QEMU_CPR_FILE_MAGIC     0x51435052
>>>>    #define QEMU_CPR_FILE_VERSION   0x00000001
>>>> +MigMode cpr_get_incoming_mode(void);
>>>> +void cpr_set_incoming_mode(MigMode mode);
>>>> +
>>>>    typedef int (*cpr_walk_fd_cb)(int fd);
>>>>    void cpr_save_fd(const char *name, int id, int fd);
>>>>    void cpr_delete_fd(const char *name, int id);
>>>> diff --git a/migration/cpr.c b/migration/cpr.c
>>>> index e50fc75..7514c4e 100644
>>>> --- a/migration/cpr.c
>>>> +++ b/migration/cpr.c
>>>> @@ -21,10 +21,23 @@
>>>>    typedef QLIST_HEAD(CprFdList, CprFd) CprFdList;
>>>>    typedef struct CprState {
>>>> +    MigMode mode;
>>>>        CprFdList fds;
>>>>    } CprState;
>>>> -static CprState cpr_state;
>>>> +static CprState cpr_state = {
>>>> +    .mode = MIG_MODE_NONE,
>>>> +};
>>>> +
>>>> +MigMode cpr_get_incoming_mode(void)
>>>> +{
>>>> +    return cpr_state.mode;
>>>> +}
>>>> +
>>>> +void cpr_set_incoming_mode(MigMode mode)
>>>> +{
>>>> +    cpr_state.mode = mode;
>>>> +}
>>>>    /****************************************************************************/
>>>> @@ -124,11 +137,19 @@ void cpr_resave_fd(const char *name, int id, int fd)
>>>>    /*************************************************************************/
>>>>    #define CPR_STATE "CprState"
>>>> +static int cpr_state_presave(void *opaque)
>>>> +{
>>>> +    cpr_state.mode = migrate_mode();
>>>> +    return 0;
>>>> +}
>>>> +
>>>>    static const VMStateDescription vmstate_cpr_state = {
>>>>        .name = CPR_STATE,
>>>>        .version_id = 1,
>>>>        .minimum_version_id = 1,
>>>> +    .pre_save = cpr_state_presave,
>>>>        .fields = (VMStateField[]) {
>>>> +        VMSTATE_UINT32(mode, CprState),
>>>>            VMSTATE_QLIST_V(fds, CprState, 1, vmstate_cpr_fd, CprFd, next),
>>>>            VMSTATE_END_OF_LIST()
>>>>        }
>>>> diff --git a/migration/migration.c b/migration/migration.c
>>>> index 834b0a2..df00e5c 100644
>>>> --- a/migration/migration.c
>>>> +++ b/migration/migration.c
>>>> @@ -416,6 +416,7 @@ void migration_incoming_state_destroy(void)
>>>>            mis->postcopy_qemufile_dst = NULL;
>>>>        }
>>>> +    cpr_set_incoming_mode(MIG_MODE_NONE);
>>>>        yank_unregister_instance(MIGRATION_YANK_INSTANCE);
>>>>    }
>>>> diff --git a/migration/options.c b/migration/options.c
>>>> index 147cd2b..cc85a84 100644
>>>> --- a/migration/options.c
>>>> +++ b/migration/options.c
>>>> @@ -22,6 +22,7 @@
>>>>    #include "qapi/qmp/qnull.h"
>>>>    #include "sysemu/runstate.h"
>>>>    #include "migration/colo.h"
>>>> +#include "migration/cpr.h"
>>>>    #include "migration/misc.h"
>>>>    #include "migration.h"
>>>>    #include "migration-stats.h"
>>>> @@ -768,8 +769,12 @@ uint64_t migrate_max_postcopy_bandwidth(void)
>>>>    MigMode migrate_mode(void)
>>>>    {
>>>> -    MigrationState *s = migrate_get_current();
>>>> -    MigMode mode = s->parameters.mode;
>>>> +    MigMode mode = cpr_get_incoming_mode();
>>>> +
>>>> +    if (mode == MIG_MODE_NONE) {
>>>> +        MigrationState *s = migrate_get_current();
>>>> +        mode = s->parameters.mode;
>>>> +    }
>>>
>>> Is this trying to avoid interfering with what user specified?
>>
>> No.
>>
>>> I can kind of get the point of it, but it'll also look pretty weird in this
>>> case that user can set the mode but then when query before cpr-transfer
>>> incoming completes it won't read what was set previously, but what was
>>> migrated via the cpr channel.
>>>
>>> And IIUC it is needed to migrate this mode in cpr stream so as to avoid
>>> another new qemu cmdline on dest qemu.  If true this needs to be mentioned
>>> in the commit message; so far it reads like it's optional, then it's not
>>> clear why only cpr-mode needs to be migrated not other migration parameters.
>>
>> The mode is needed on the incoming side early -- before migration_object_init,
>> and before the monitor is started.  Thus the user cannot set it as a normal
>> migration parameter.
>>
>>> If that won't get right easily, I wonder whether we could just overwrite
>>> parameters.mode directly by the cpr stream.
>>
>> I considered that, but parameters.mode cannot be set before migration_object_init,
>> and some code needs to know mode before that.
> 
> Ah OK...
> 
> I wonder whether migrating this mode really helps at all, knowing that
> no mode other than cpr-transfer should be in effect when the -cpr-uri
> cmdline option is given.
> 
> How about we use cpr_uri to detect early stage cpr transfer mode, then
> after early load stage we unset cpr_uri and always stick with what user
> specified (instead of special casing NONE mode)?  Then it looks like:
> 
> MigMode migrate_mode(void)
> {
>    /*
>     * When cpr_uri set, it always means QEMU is currently in early
>     * cpr-transfer loading stage.
>     */
>    if (cpr_uri) {
>        return MIG_MODE_CPR_TRANSFER;
>    }
> 
>    return migrate_get_current()->parameters.mode;
> }
> 
> Then we don't need to migrate the mode either, which is good as it aligns
> with other migration parameters.
> 
> Would this look slightly cleaner?

Sure.  Mode does not need to be sent in cpr_state.
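
A minimal standalone sketch of the approach agreed on above, for illustration
only.  It assumes a cut-down MigMode enum, a plain global standing in for
parameters.mode (user_mode), and a hypothetical helper cpr_early_load_done()
that unsets cpr_uri once the early load stage is over; none of these names
are taken from the actual series.

```c
#include <stddef.h>

/* Stand-in for QEMU's MigMode enum; only the values needed here. */
typedef enum {
    MIG_MODE_NORMAL,
    MIG_MODE_CPR_REBOOT,
    MIG_MODE_CPR_TRANSFER,
} MigMode;

/* Hypothetical stand-ins for the -cpr-uri option and parameters.mode. */
const char *cpr_uri;
MigMode user_mode = MIG_MODE_NORMAL;

MigMode migrate_mode(void)
{
    /* A set cpr_uri means we are still in the early cpr-transfer load. */
    if (cpr_uri) {
        return MIG_MODE_CPR_TRANSFER;
    }

    /* Otherwise always report what the user specified. */
    return user_mode;
}

/* Hypothetical hook: called once the early CPR state has been loaded. */
void cpr_early_load_done(void)
{
    cpr_uri = NULL;
}
```

With this shape the mode never travels in cpr_state, and a query after the
early stage reads back exactly what the user set.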

- Steve


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH V2 05/13] physmem: preserve ram blocks for cpr
  2024-10-08 15:17       ` Steven Sistare
@ 2024-10-08 16:26         ` Peter Xu
  2024-10-08 21:05           ` Steven Sistare
  0 siblings, 1 reply; 79+ messages in thread
From: Peter Xu @ 2024-10-08 16:26 UTC (permalink / raw)
  To: Steven Sistare
  Cc: Igor Mammedov, Michael S. Tsirkin, qemu-devel, Fabiano Rosas,
	David Hildenbrand, Marcel Apfelbaum, Eduardo Habkost,
	Philippe Mathieu-Daude, Paolo Bonzini, Daniel P. Berrange,
	Markus Armbruster

On Tue, Oct 08, 2024 at 11:17:46AM -0400, Steven Sistare wrote:
> On 10/7/2024 12:28 PM, Peter Xu wrote:
> > On Mon, Oct 07, 2024 at 11:49:25AM -0400, Peter Xu wrote:
> > > On Mon, Sep 30, 2024 at 12:40:36PM -0700, Steve Sistare wrote:
> > > > Save the memfd for anonymous ramblocks in CPR state, along with a name
> > > > that uniquely identifies it.  The block's idstr is not yet set, so it
> > > > cannot be used for this purpose.  Find the saved memfd in new QEMU when
> > > > creating a block.  QEMU hard-codes the length of some internally-created
> > > > blocks, so to guard against that length changing, use lseek to get the
> > > > actual length of an incoming memfd.
> > > > 
> > > > Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> > > > ---
> > > >   system/physmem.c | 25 ++++++++++++++++++++++++-
> > > >   1 file changed, 24 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/system/physmem.c b/system/physmem.c
> > > > index 174f7e0..ddbeec9 100644
> > > > --- a/system/physmem.c
> > > > +++ b/system/physmem.c
> > > > @@ -72,6 +72,7 @@
> > > >   #include "qapi/qapi-types-migration.h"
> > > >   #include "migration/options.h"
> > > > +#include "migration/cpr.h"
> > > >   #include "migration/vmstate.h"
> > > >   #include "qemu/range.h"
> > > > @@ -1663,6 +1664,19 @@ void qemu_ram_unset_idstr(RAMBlock *block)
> > > >       }
> > > >   }
> > > > +static char *cpr_name(RAMBlock *block)
> > > > +{
> > > > +    MemoryRegion *mr = block->mr;
> > > > +    const char *mr_name = memory_region_name(mr);
> > > > +    g_autofree char *id = mr->dev ? qdev_get_dev_path(mr->dev) : NULL;
> > > > +
> > > > +    if (id) {
> > > > +        return g_strdup_printf("%s/%s", id, mr_name);
> > > > +    } else {
> > > > +        return g_strdup(mr_name);
> > > > +    }
> > > > +}
> > > > +
> > > >   size_t qemu_ram_pagesize(RAMBlock *rb)
> > > >   {
> > > >       return rb->page_size;
> > > > @@ -1858,14 +1872,18 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
> > > >                                           TYPE_MEMORY_BACKEND)) {
> > > >               size_t max_length = new_block->max_length;
> > > >               MemoryRegion *mr = new_block->mr;
> > > > -            const char *name = memory_region_name(mr);
> > > > +            g_autofree char *name = cpr_name(new_block);
> > > >               new_block->mr->align = QEMU_VMALLOC_ALIGN;
> > > >               new_block->flags |= RAM_SHARED;
> > > > +            new_block->fd = cpr_find_fd(name, 0);
> > > >               if (new_block->fd == -1) {
> > > >                   new_block->fd = qemu_memfd_create(name, max_length + mr->align,
> > > >                                                     0, 0, 0, errp);
> > > > +                cpr_save_fd(name, 0, new_block->fd);
> > > > +            } else {
> > > > +                new_block->max_length = lseek(new_block->fd, 0, SEEK_END);
> > > 
> > > So this can overwrite the max_length that the caller specified..
> > > 
> > > I remember we used to have some tricks on specifying different max_length
> > > for ROMs on dest QEMU (on which, qemu firmwares also upgraded on the dest
> > > host so the size can be bigger than src qemu's old ramblocks), so that the
> > > MR is always large enough to reload even the new firmwares, while migration
> > > only migrates the smaller size (used_length) so it's fine as we keep the
> > > extra sizes empty. I think that can be relevant to the qemu_ram_resize() call
> > > of parse_ramblock().
> 
> Yes, resizable ram block for firmware blob is the only case I know of where
> the length changed in the past.  If a length changes in the future, we will
> need to detect and accommodate that change here, and I believe the fix will
> be to simply use the actual length, as per the code above.  But if you prefer,
> for now I can check for length change and return an error. New qemu will fail
> to start, and old qemu will recover.
> 
> > > The reload will not happen until some point, perhaps system resets.  I
> > > wonder whether that is an issue in this case.
> 
> Firmware is only generated once, via this path on x86:
>   qmp_x_exit_preconfig
>     qemu_machine_creation_done
>       qdev_machine_creation_done
>         pc_machine_done
>           acpi_setup
>             acpi_add_rom_blob
>               rom_add_blob
>                 rom_set_mr
> 
> After a system reset, the ramblock contents from memory are used as-is.
> 
> > PS: If this is needed by CPR-transfer only because mmap() later can fail
> > due to a bigger max_length,
> 
> That is the reason.  IMO adjusting max_length is more robust than fiddling
> with truncate and pretending that max_length is larger, when qemu will never
> be able to use the phantom space up to max_length.

I thought it was not pretending, but that the ROM region might be resized
after a system reset?  I worry that your change here can conflict with such
resizing later, so that qemu_ram_resize() can potentially fail after (1) a
CPR-transfer upgrade completes, followed by (2) a system reset.

We can observe such resizing kick off on every reboot, like:

(gdb) bt
#0  qemu_ram_resize
#1  0x00005602b623b740 in memory_region_ram_resize
#2  0x00005602b60f5580 in acpi_ram_update
#3  0x00005602b60f5667 in acpi_build_update
#4  0x00005602b5e1028b in fw_cfg_select
#5  0x00005602b5e105af in fw_cfg_dma_transfer
#6  0x00005602b5e109a8 in fw_cfg_dma_mem_write
#7  0x00005602b62352ec in memory_region_write_accessor
#8  0x00005602b62355e6 in access_with_adjusted_size
#9  0x00005602b6238de8 in memory_region_dispatch_write
#10 0x00005602b62488c5 in flatview_write_continue_step
#11 0x00005602b6248997 in flatview_write_continue
#12 0x00005602b6248abf in flatview_write
#13 0x00005602b6248f39 in address_space_write
#14 0x00005602b6248fb1 in address_space_rw
#15 0x00005602b62a5d86 in kvm_handle_io
#16 0x00005602b62a6cb2 in kvm_cpu_exec
#17 0x00005602b62aa37a in kvm_vcpu_thread_fn
#18 0x00005602b655da57 in qemu_thread_start
#19 0x00007f120224a1b7 in start_thread
#20 0x00007f12022cc39c in clone3

Specifically, see this code clip:

acpi_ram_update():
    memory_region_ram_resize(mr, size, &error_abort);
    memcpy(memory_region_get_ram_ptr(mr), data->data, size);

Per my understanding, during the reset the ROM ramblock is resized to the
new size (normally only larger; as I recall, a ROM once grew from
256K->512K, or something like that), then the memcpy() injects the latest
firmware that it pre-loaded into mem.

So after such system reset, QEMU might start to see new ROM code loaded
here (not the one that got migrated anymore, which will only match the
version installed on src QEMU).  Here the problem is the new firmware can
be larger, so I _think_ we need to make sure max_length is not modified by
CPR to allow resizing to happen here, while if we use truncate=true here it
should just work in all cases.
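
A rough standalone sketch of the "grow the file instead of shrinking
max_length" idea, for illustration only.  It is not QEMU code: the helper
name fd_ensure_length() is hypothetical, and tmpfile() stands in for a memfd
inherited from old QEMU.

```c
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Make sure fd is at least want_len bytes long, growing it if needed.
 * Returns the resulting length, or -1 on error.  Never shrinks the file.
 */
off_t fd_ensure_length(int fd, off_t want_len)
{
    off_t cur = lseek(fd, 0, SEEK_END);    /* actual length of the file */

    if (cur < 0) {
        return -1;
    }
    if (cur < want_len) {
        if (ftruncate(fd, want_len) < 0) {
            return -1;
        }
        cur = want_len;
    }
    return cur;
}

/* Demo: a 256K "old" file is grown so that a 512K max_length fits. */
int fd_ensure_length_demo(void)
{
    FILE *f = tmpfile();                   /* stand-in for an incoming memfd */
    int fd, ok;

    if (!f) {
        return 0;
    }
    fd = fileno(f);
    ok = ftruncate(fd, 256 * 1024) == 0 &&
         fd_ensure_length(fd, 512 * 1024) == 512 * 1024 &&
         fd_ensure_length(fd, 128 * 1024) == 512 * 1024;
    fclose(f);
    return ok;
}
```

The caller's max_length is left untouched here; only the backing file grows,
so a later resize up to max_length still has room.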

I think it could be verified with an old QEMU running with old ROM files
(which are smaller), then CPR migrate to a new QEMU running new ROM files
(which are larger), then reboot to see whether that new QEMU crashes.  Maybe
we can emulate that with the "romfile=XXX" parameter.

I am not fluent with ROM/firmware code, but please double check..

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH V2 13/13] migration: cpr-transfer mode
  2024-10-07 20:39     ` Steven Sistare
  2024-10-08 15:45       ` Peter Xu
@ 2024-10-08 18:28       ` Fabiano Rosas
  2024-10-08 18:47         ` Peter Xu
  1 sibling, 1 reply; 79+ messages in thread
From: Fabiano Rosas @ 2024-10-08 18:28 UTC (permalink / raw)
  To: Steven Sistare, Peter Xu
  Cc: qemu-devel, David Hildenbrand, Marcel Apfelbaum, Eduardo Habkost,
	Philippe Mathieu-Daude, Paolo Bonzini, Daniel P. Berrange,
	Markus Armbruster

Steven Sistare <steven.sistare@oracle.com> writes:

> On 10/7/2024 3:44 PM, Peter Xu wrote:
>> On Mon, Sep 30, 2024 at 12:40:44PM -0700, Steve Sistare wrote:
>>> Add the cpr-transfer migration mode.  Usage:
>>>    qemu-system-$arch -machine anon-alloc=memfd ...
>>>
>>>    start new QEMU with "-incoming <uri-1> -cpr-uri <uri-2>"
>>>
>>>    Issue commands to old QEMU:
>>>    migrate_set_parameter mode cpr-transfer
>>>    migrate_set_parameter cpr-uri <uri-2>
>>>    migrate -d <uri-1>
>>>
>>> The migrate command stops the VM, saves CPR state to uri-2, saves
>>> normal migration state to uri-1, and old QEMU enters the postmigrate
>>> state.  The user starts new QEMU on the same host as old QEMU, with the
>>> same arguments as old QEMU, plus the -incoming option.  Guest RAM is
>>> preserved in place, albeit with new virtual addresses in new QEMU.
>>>
>>> This mode requires a second migration channel, specified by the
>>> cpr-uri migration property on the outgoing side, and by the cpr-uri
>>> QEMU command-line option on the incoming side.  The channel must
>>> be a type, such as unix socket, that supports SCM_RIGHTS.
>>>
>>> Memory-backend objects must have the share=on attribute, but
>>> memory-backend-epc is not supported.  The VM must be started with
>>> the '-machine anon-alloc=memfd' option, which allows anonymous
>>> memory to be transferred in place to the new process.  The memfds
>>> are kept open by sending the descriptors to new QEMU via the
>>> cpr-uri, which must support SCM_RIGHTS, and they are mmap'd
>>> in new QEMU.
>>>
>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>> ---
>>>   include/migration/cpr.h   |  1 +
>>>   migration/cpr.c           | 34 +++++++++++++++++++----
>>>   migration/migration.c     | 69 +++++++++++++++++++++++++++++++++++++++++++++--
>>>   migration/migration.h     |  2 ++
>>>   migration/ram.c           |  2 ++
>>>   migration/vmstate-types.c |  5 ++--
>>>   qapi/migration.json       | 27 ++++++++++++++++++-
>>>   stubs/vmstate.c           |  7 +++++
>>>   8 files changed, 137 insertions(+), 10 deletions(-)
>>>
>>> diff --git a/include/migration/cpr.h b/include/migration/cpr.h
>>> index e886c98..5cd373f 100644
>>> --- a/include/migration/cpr.h
>>> +++ b/include/migration/cpr.h
>>> @@ -30,6 +30,7 @@ int cpr_state_save(Error **errp);
>>>   int cpr_state_load(Error **errp);
>>>   void cpr_state_close(void);
>>>   struct QIOChannel *cpr_state_ioc(void);
>>> +bool cpr_needed_for_reuse(void *opaque);
>>>   
>>>   QEMUFile *cpr_transfer_output(const char *uri, Error **errp);
>>>   QEMUFile *cpr_transfer_input(const char *uri, Error **errp);
>>> diff --git a/migration/cpr.c b/migration/cpr.c
>>> index 86f66c1..911b556 100644
>>> --- a/migration/cpr.c
>>> +++ b/migration/cpr.c
>>> @@ -9,6 +9,7 @@
>>>   #include "qapi/error.h"
>>>   #include "migration/cpr.h"
>>>   #include "migration/misc.h"
>>> +#include "migration/options.h"
>>>   #include "migration/qemu-file.h"
>>>   #include "migration/savevm.h"
>>>   #include "migration/vmstate.h"
>>> @@ -57,7 +58,7 @@ static const VMStateDescription vmstate_cpr_fd = {
>>>           VMSTATE_UINT32(namelen, CprFd),
>>>           VMSTATE_VBUFFER_ALLOC_UINT32(name, CprFd, 0, NULL, namelen),
>>>           VMSTATE_INT32(id, CprFd),
>>> -        VMSTATE_INT32(fd, CprFd),
>>> +        VMSTATE_FD(fd, CprFd),
>>>           VMSTATE_END_OF_LIST()
>>>       }
>>>   };
>>> @@ -174,9 +175,16 @@ int cpr_state_save(Error **errp)
>>>   {
>>>       int ret;
>>>       QEMUFile *f;
>>> +    MigMode mode = migrate_mode();
>>>   
>>> -    /* set f based on mode in a later patch in this series */
>>> -    return 0;
>>> +    if (mode == MIG_MODE_CPR_TRANSFER) {
>>> +        f = cpr_transfer_output(migrate_cpr_uri(), errp);
>>> +    } else {
>>> +        return 0;
>>> +    }
>>> +    if (!f) {
>>> +        return -1;
>>> +    }
>>>   
>>>       qemu_put_be32(f, QEMU_CPR_FILE_MAGIC);
>>>       qemu_put_be32(f, QEMU_CPR_FILE_VERSION);
>>> @@ -205,8 +213,18 @@ int cpr_state_load(Error **errp)
>>>       uint32_t v;
>>>       QEMUFile *f;
>>>   
>>> -    /* set f based on mode in a later patch in this series */
>>> -    return 0;
>>> +    /*
>>> +     * Mode will be loaded in CPR state, so cannot use it to decide which
>>> +     * form of state to load.
>>> +     */
>>> +    if (cpr_uri) {
>>> +        f = cpr_transfer_input(cpr_uri, errp);
>>> +    } else {
>>> +        return 0;
>>> +    }
>>> +    if (!f) {
>>> +        return -1;
>>> +    }
>>>   
>>>       v = qemu_get_be32(f);
>>>       if (v != QEMU_CPR_FILE_MAGIC) {
>>> @@ -243,3 +261,9 @@ void cpr_state_close(void)
>>>           cpr_state_file = NULL;
>>>       }
>>>   }
>>> +
>>> +bool cpr_needed_for_reuse(void *opaque)
>>> +{
>>> +    MigMode mode = migrate_mode();
>>> +    return mode == MIG_MODE_CPR_TRANSFER;
>>> +}
>> 
>> Drop it until used?
>
> Maybe, but here is my reason for including it here.
>
> These common functions like cpr_needed_for_reuse and cpr_resave_fd are needed
> by multiple follow-on series: vfio, tap, iommufd.  To send those for comment,
> as I have been, I need to prepend a patch for cpr_needed_for_reuse to each of
> those series, which is redundant.  It makes more sense IMO to include them in
> this initial series.
>
> But, it's your call.
>
>>> diff --git a/migration/migration.c b/migration/migration.c
>>> index 3301583..73b85aa 100644
>>> --- a/migration/migration.c
>>> +++ b/migration/migration.c
>>> @@ -76,6 +76,7 @@
>>>   static NotifierWithReturnList migration_state_notifiers[] = {
>>>       NOTIFIER_ELEM_INIT(migration_state_notifiers, MIG_MODE_NORMAL),
>>>       NOTIFIER_ELEM_INIT(migration_state_notifiers, MIG_MODE_CPR_REBOOT),
>>> +    NOTIFIER_ELEM_INIT(migration_state_notifiers, MIG_MODE_CPR_TRANSFER),
>>>   };
>>>   
>>>   /* Messages sent on the return path from destination to source */
>>> @@ -109,6 +110,7 @@ static int migration_maybe_pause(MigrationState *s,
>>>   static void migrate_fd_cancel(MigrationState *s);
>>>   static bool close_return_path_on_source(MigrationState *s);
>>>   static void migration_completion_end(MigrationState *s);
>>> +static void migrate_hup_delete(MigrationState *s);
>>>   
>>>   static void migration_downtime_start(MigrationState *s)
>>>   {
>>> @@ -204,6 +206,12 @@ migration_channels_and_transport_compatible(MigrationAddress *addr,
>>>           return false;
>>>       }
>>>   
>>> +    if (migrate_mode() == MIG_MODE_CPR_TRANSFER &&
>>> +        addr->transport == MIGRATION_ADDRESS_TYPE_FILE) {
>>> +        error_setg(errp, "Migration requires streamable transport (eg unix)");
>>> +        return false;
>>> +    }
>>> +
>>>       return true;
>>>   }
>>>   
>>> @@ -316,6 +324,7 @@ void migration_cancel(const Error *error)
>>>           qmp_cancel_vcpu_dirty_limit(false, -1, NULL);
>>>       }
>>>       migrate_fd_cancel(current_migration);
>>> +    migrate_hup_delete(current_migration);
>>>   }
>>>   
>>>   void migration_shutdown(void)
>>> @@ -718,6 +727,9 @@ static void qemu_start_incoming_migration(const char *uri, bool has_channels,
>>>       } else {
>>>           error_setg(errp, "unknown migration protocol: %s", uri);
>>>       }
>>> +
>>> +    /* Close cpr socket to tell source that we are listening */
>>> +    cpr_state_close();
>> 
>> Would it be possible to use some explicit reply message to mark this?  
>
> In theory yes, but I fear that using a return channel with message parsing and
> dispatch adds more code than it is worth.

I think this approach is fine for now, but I wonder whether we could
reuse the current return path (RP) by starting it earlier and take
benefit from it already having the message passing infrastructure in
place. I'm actually looking ahead to the migration handshake thread[1],
which could be thought to have some similarity with the early cpr
channel. So having a generic channel in place early on to handle
handshake, CPR, RP, etc. could be a good idea.

Anyway, I'm probing on this a bit so I can start drafting something. I
got surprised that we don't even have the capability bits in the stream
in a useful way (currently, configuration_validate_capabilities() does
kind of nothing).

1- https://wiki.qemu.org/ToDo/LiveMigration#Migration_handshake



^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH V2 13/13] migration: cpr-transfer mode
  2024-10-08 18:28       ` Fabiano Rosas
@ 2024-10-08 18:47         ` Peter Xu
  2024-10-08 19:11           ` Fabiano Rosas
  2024-10-08 19:29           ` Steven Sistare
  0 siblings, 2 replies; 79+ messages in thread
From: Peter Xu @ 2024-10-08 18:47 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: Steven Sistare, qemu-devel, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster

On Tue, Oct 08, 2024 at 03:28:30PM -0300, Fabiano Rosas wrote:
> >>> +    /* Close cpr socket to tell source that we are listening */
> >>> +    cpr_state_close();
> >> 
> >> Would it be possible to use some explicit reply message to mark this?  
> >
> > In theory yes, but I fear that using a return channel with message parsing and
> > dispatch adds more code than it is worth.
> 
> I think this approach is fine for now, but I wonder whether we could
> reuse the current return path (RP) by starting it earlier and take
> benefit from it already having the message passing infrastructure in
> place. I'm actually looking ahead to the migration handshake thread[1],
> which could be thought to have some similarity with the early cpr
> channel. So having a generic channel in place early on to handle
> handshake, CPR, RP, etc. could be a good idea.

The current design relies on the CPR stage happening before device
realize()s, so I assume the migration channel (including RP) isn't easily
applicable as early as this stage.

However I think dest qemu can directly write back to the cpr_uri channel
instead if we want and then follow a protocol simple enough (even though
it'll be separate from the migration stream protocol).

What worries me more (besides using HUP as of now..) is cpr_state_save() is
currently synchronous and can block the main iothread.  It means if cpr
destination is not properly set up, it can hang the main thread (including
e.g. QMP monitor) at qio_channel_socket_connect_sync().  Ideally we
shouldn't block the main thread.

If async-mode can be done, it might be even easier, e.g. if we do
cpr_state_save() in a thread, after qemu_put*() we can directly qemu_get*()
in the same context with the pairing return qemufile.

But maybe we can do it in two steps, merging HUP first.  Then when a better
protocol (plus async mode) is ready, one can bump QEMU_CPR_FILE_VERSION.
I'll see how Steve wants to address it.
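
For reference, the close-as-signal handshake being discussed can be sketched
in isolation as below.  This is only an illustration: wait_for_peer_close()
is a hypothetical helper, a socketpair stands in for the cpr-uri unix socket,
and QEMU itself would go through its QIOChannel layer rather than raw poll().

```c
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Wait up to timeout_ms for the peer to close its end of fd.
 * Returns 1 if the peer hung up, 0 on timeout or if data is pending,
 * and -1 on error.
 */
int wait_for_peer_close(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    char byte;
    ssize_t r;
    int n = poll(&pfd, 1, timeout_ms);

    if (n < 0) {
        return -1;
    }
    if (n == 0) {
        return 0;                          /* timed out, peer still open */
    }

    /* A zero-length read on a stream socket means the peer closed. */
    r = recv(fd, &byte, 1, MSG_PEEK);
    if (r < 0) {
        return -1;
    }
    return r == 0;
}

/* Demo: "source" holds sv[0], "destination" closes sv[1] when ready. */
int close_signal_demo(void)
{
    int sv[2];
    int before, after;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) {
        return -1;
    }
    before = wait_for_peer_close(sv[0], 10);    /* peer has not closed yet */
    close(sv[1]);                               /* destination signals */
    after = wait_for_peer_close(sv[0], 1000);
    close(sv[0]);
    return before == 0 && after == 1;
}
```

Note that a close carries no payload, so the source cannot tell a deliberate
"I am listening" close from an early failure on the destination; an explicit
reply message would remove that ambiguity.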

> 
> Anyway, I'm probing on this a bit so I can start drafting something. I
> got surprised that we don't even have the capability bits in the stream
> in a useful way (currently, configuration_validate_capabilities() does
> kind of nothing).
> 
> 1- https://wiki.qemu.org/ToDo/LiveMigration#Migration_handshake

Happy to know this. I was thinking whether I should work on this even
earlier, so if you're looking at that it'll be great.

The major pain to me is the channel establishment part where we now have
all kinds of channels, so we should really fix that sooner (e.g., we hope
to enable multifd + postcopy very soon, which requires multifd and preempt
channels to appear at the same time).  It was reasonable that the
vfio/multifd series tried to fix it.

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH V2 13/13] migration: cpr-transfer mode
  2024-10-08 18:47         ` Peter Xu
@ 2024-10-08 19:11           ` Fabiano Rosas
  2024-10-08 19:33             ` Steven Sistare
  2024-10-08 19:48             ` Peter Xu
  2024-10-08 19:29           ` Steven Sistare
  1 sibling, 2 replies; 79+ messages in thread
From: Fabiano Rosas @ 2024-10-08 19:11 UTC (permalink / raw)
  To: Peter Xu
  Cc: Steven Sistare, qemu-devel, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster

Peter Xu <peterx@redhat.com> writes:

> On Tue, Oct 08, 2024 at 03:28:30PM -0300, Fabiano Rosas wrote:
>> >>> +    /* Close cpr socket to tell source that we are listening */
>> >>> +    cpr_state_close();
>> >> 
>> >> Would it be possible to use some explicit reply message to mark this?  
>> >
>> > In theory yes, but I fear that using a return channel with message parsing and
>> > dispatch adds more code than it is worth.
>> 
>> I think this approach is fine for now, but I wonder whether we could
>> reuse the current return path (RP) by starting it earlier and take
>> benefit from it already having the message passing infrastructure in
>> place. I'm actually looking ahead to the migration handshake thread[1],
>> which could be thought to have some similarity with the early cpr
>> channel. So having a generic channel in place early on to handle
>> handshake, CPR, RP, etc. could be a good idea.
>
> The current design relies on the CPR stage happening before device
> realize()s, so I assume the migration channel (including RP) isn't easily
> applicable as early as this stage.

Well, what is the dependency for the RP? If we can send CPR state, we
can send QEMU_VM_COMMAND, no?

>
> However I think dest qemu can directly write back to the cpr_uri channel
> instead if we want and then follow a protocol simple enough (even though
> it'll be separate from the migration stream protocol).
>
> What worries me more (besides using HUP as of now..) is cpr_state_save() is
> currently synchronous and can block the main iothread.  It means if cpr
> destination is not properly set up, it can hang the main thread (including
> e.g. QMP monitor) at qio_channel_socket_connect_sync().  Ideally we
> shouldn't block the main thread.
>
> If async-mode can be done, it might be even easier, e.g. if we do
> cpr_state_save() in a thread, after qemu_put*() we can directly qemu_get*()
> in the same context with the pairing return qemufile.
>
> But maybe we can do it in two steps, merging HUP first.  Then when a better
> protocol (plus async mode) is ready, one can bump QEMU_CPR_FILE_VERSION.
> I'll see how Steve wants to address it.

I agree HUP is fine at the moment.

>
>> 
>> Anyway, I'm probing on this a bit so I can start drafting something. I
>> got surprised that we don't even have the capability bits in the stream
>> in a useful way (currently, configuration_validate_capabilities() does
>> kind of nothing).
>> 
>> 1- https://wiki.qemu.org/ToDo/LiveMigration#Migration_handshake
>
> Happy to know this. I was thinking whether I should work on this even
> earlier, so if you're looking at that it'll be great.

As of half an hour ago =) We could put a feature branch up and work
together; if you have more concrete thoughts on what this would look like,
let me know.

>
> The major pain to me is the channel establishment part where we now have
> all kinds of channels, so we should really fix that sooner (e.g., we hope
> to enable multifd + postcopy very soon, which requires multifd and preempt
> channels to appear at the same time).  It was reasonable that the
> vfio/multifd series tried to fix it.



^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH V2 13/13] migration: cpr-transfer mode
  2024-10-08 15:45       ` Peter Xu
@ 2024-10-08 19:12         ` Steven Sistare
  2024-10-08 19:38           ` Peter Xu
  0 siblings, 1 reply; 79+ messages in thread
From: Steven Sistare @ 2024-10-08 19:12 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster

On 10/8/2024 11:45 AM, Peter Xu wrote:
> On Mon, Oct 07, 2024 at 04:39:25PM -0400, Steven Sistare wrote:
>> On 10/7/2024 3:44 PM, Peter Xu wrote:
>>> On Mon, Sep 30, 2024 at 12:40:44PM -0700, Steve Sistare wrote:
>>>> Add the cpr-transfer migration mode.  Usage:
>>>>     qemu-system-$arch -machine anon-alloc=memfd ...
>>>>
>>>>     start new QEMU with "-incoming <uri-1> -cpr-uri <uri-2>"
>>>>
>>>>     Issue commands to old QEMU:
>>>>     migrate_set_parameter mode cpr-transfer
>>>>     migrate_set_parameter cpr-uri <uri-2>
>>>>     migrate -d <uri-1>
>>>>
>>>> The migrate command stops the VM, saves CPR state to uri-2, saves
>>>> normal migration state to uri-1, and old QEMU enters the postmigrate
>>>> state.  The user starts new QEMU on the same host as old QEMU, with the
>>>> same arguments as old QEMU, plus the -incoming option.  Guest RAM is
>>>> preserved in place, albeit with new virtual addresses in new QEMU.
>>>>
>>>> This mode requires a second migration channel, specified by the
>>>> cpr-uri migration property on the outgoing side, and by the cpr-uri
>>>> QEMU command-line option on the incoming side.  The channel must
>>>> be a type, such as unix socket, that supports SCM_RIGHTS.
>>>>
>>>> Memory-backend objects must have the share=on attribute, but
>>>> memory-backend-epc is not supported.  The VM must be started with
>>>> the '-machine anon-alloc=memfd' option, which allows anonymous
>>>> memory to be transferred in place to the new process.  The memfds
>>>> are kept open by sending the descriptors to new QEMU via the
>>>> cpr-uri, which must support SCM_RIGHTS, and they are mmap'd
>>>> in new QEMU.
>>>>
>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>> ---
>>>>    include/migration/cpr.h   |  1 +
>>>>    migration/cpr.c           | 34 +++++++++++++++++++----
>>>>    migration/migration.c     | 69 +++++++++++++++++++++++++++++++++++++++++++++--
>>>>    migration/migration.h     |  2 ++
>>>>    migration/ram.c           |  2 ++
>>>>    migration/vmstate-types.c |  5 ++--
>>>>    qapi/migration.json       | 27 ++++++++++++++++++-
>>>>    stubs/vmstate.c           |  7 +++++
>>>>    8 files changed, 137 insertions(+), 10 deletions(-)
>>>>
>>>> diff --git a/include/migration/cpr.h b/include/migration/cpr.h
>>>> index e886c98..5cd373f 100644
>>>> --- a/include/migration/cpr.h
>>>> +++ b/include/migration/cpr.h
>>>> @@ -30,6 +30,7 @@ int cpr_state_save(Error **errp);
>>>>    int cpr_state_load(Error **errp);
>>>>    void cpr_state_close(void);
>>>>    struct QIOChannel *cpr_state_ioc(void);
>>>> +bool cpr_needed_for_reuse(void *opaque);
>>>>    QEMUFile *cpr_transfer_output(const char *uri, Error **errp);
>>>>    QEMUFile *cpr_transfer_input(const char *uri, Error **errp);
>>>> diff --git a/migration/cpr.c b/migration/cpr.c
>>>> index 86f66c1..911b556 100644
>>>> --- a/migration/cpr.c
>>>> +++ b/migration/cpr.c
>>>> @@ -9,6 +9,7 @@
>>>>    #include "qapi/error.h"
>>>>    #include "migration/cpr.h"
>>>>    #include "migration/misc.h"
>>>> +#include "migration/options.h"
>>>>    #include "migration/qemu-file.h"
>>>>    #include "migration/savevm.h"
>>>>    #include "migration/vmstate.h"
>>>> @@ -57,7 +58,7 @@ static const VMStateDescription vmstate_cpr_fd = {
>>>>            VMSTATE_UINT32(namelen, CprFd),
>>>>            VMSTATE_VBUFFER_ALLOC_UINT32(name, CprFd, 0, NULL, namelen),
>>>>            VMSTATE_INT32(id, CprFd),
>>>> -        VMSTATE_INT32(fd, CprFd),
>>>> +        VMSTATE_FD(fd, CprFd),
>>>>            VMSTATE_END_OF_LIST()
>>>>        }
>>>>    };
>>>> @@ -174,9 +175,16 @@ int cpr_state_save(Error **errp)
>>>>    {
>>>>        int ret;
>>>>        QEMUFile *f;
>>>> +    MigMode mode = migrate_mode();
>>>> -    /* set f based on mode in a later patch in this series */
>>>> -    return 0;
>>>> +    if (mode == MIG_MODE_CPR_TRANSFER) {
>>>> +        f = cpr_transfer_output(migrate_cpr_uri(), errp);
>>>> +    } else {
>>>> +        return 0;
>>>> +    }
>>>> +    if (!f) {
>>>> +        return -1;
>>>> +    }
>>>>        qemu_put_be32(f, QEMU_CPR_FILE_MAGIC);
>>>>        qemu_put_be32(f, QEMU_CPR_FILE_VERSION);
>>>> @@ -205,8 +213,18 @@ int cpr_state_load(Error **errp)
>>>>        uint32_t v;
>>>>        QEMUFile *f;
>>>> -    /* set f based on mode in a later patch in this series */
>>>> -    return 0;
>>>> +    /*
>>>> +     * Mode will be loaded in CPR state, so cannot use it to decide which
>>>> +     * form of state to load.
>>>> +     */
>>>> +    if (cpr_uri) {
>>>> +        f = cpr_transfer_input(cpr_uri, errp);
>>>> +    } else {
>>>> +        return 0;
>>>> +    }
>>>> +    if (!f) {
>>>> +        return -1;
>>>> +    }
>>>>        v = qemu_get_be32(f);
>>>>        if (v != QEMU_CPR_FILE_MAGIC) {
>>>> @@ -243,3 +261,9 @@ void cpr_state_close(void)
>>>>            cpr_state_file = NULL;
>>>>        }
>>>>    }
>>>> +
>>>> +bool cpr_needed_for_reuse(void *opaque)
>>>> +{
>>>> +    MigMode mode = migrate_mode();
>>>> +    return mode == MIG_MODE_CPR_TRANSFER;
>>>> +}
>>>
>>> Drop it until used?
>>
>> Maybe, but here is my reason for including it here.
>>
>> These common functions like cpr_needed_for_reuse and cpr_resave_fd are needed
>> by multiple follow-on series: vfio, tap, iommufd.  To send those for comment,
>> as I have been, I need to prepend a patch for cpr_needed_for_reuse to each of
>> those series, which is redundant.  It makes more sense IMO to include them in
>> this initial series.
>>
>> But, it's your call.
> 
> Hmm, logically we shouldn't keep any dead code in QEMU, but indeed this is
> slightly special.
> 
> Would you mind keeping all these helpers in a separate patch after the base
> patches?  The commit message should describe which future projects will
> start to use them, so that whoever notices later (I at least know Dave has
> quite a few recent patches removing dead code in QEMU) will know that it is
> to-be-used code and should be kept around.

I have split the functions into a separate patch.
I'll hold onto it until posting the next series, no big deal.

>>>> diff --git a/migration/migration.c b/migration/migration.c
>>>> index 3301583..73b85aa 100644
>>>> --- a/migration/migration.c
>>>> +++ b/migration/migration.c
>>>> @@ -76,6 +76,7 @@
>>>>    static NotifierWithReturnList migration_state_notifiers[] = {
>>>>        NOTIFIER_ELEM_INIT(migration_state_notifiers, MIG_MODE_NORMAL),
>>>>        NOTIFIER_ELEM_INIT(migration_state_notifiers, MIG_MODE_CPR_REBOOT),
>>>> +    NOTIFIER_ELEM_INIT(migration_state_notifiers, MIG_MODE_CPR_TRANSFER),
>>>>    };
>>>>    /* Messages sent on the return path from destination to source */
>>>> @@ -109,6 +110,7 @@ static int migration_maybe_pause(MigrationState *s,
>>>>    static void migrate_fd_cancel(MigrationState *s);
>>>>    static bool close_return_path_on_source(MigrationState *s);
>>>>    static void migration_completion_end(MigrationState *s);
>>>> +static void migrate_hup_delete(MigrationState *s);
>>>>    static void migration_downtime_start(MigrationState *s)
>>>>    {
>>>> @@ -204,6 +206,12 @@ migration_channels_and_transport_compatible(MigrationAddress *addr,
>>>>            return false;
>>>>        }
>>>> +    if (migrate_mode() == MIG_MODE_CPR_TRANSFER &&
>>>> +        addr->transport == MIGRATION_ADDRESS_TYPE_FILE) {
>>>> +        error_setg(errp, "Migration requires streamable transport (eg unix)");
>>>> +        return false;
>>>> +    }
>>>> +
>>>>        return true;
>>>>    }
>>>> @@ -316,6 +324,7 @@ void migration_cancel(const Error *error)
>>>>            qmp_cancel_vcpu_dirty_limit(false, -1, NULL);
>>>>        }
>>>>        migrate_fd_cancel(current_migration);
>>>> +    migrate_hup_delete(current_migration);
>>>>    }
>>>>    void migration_shutdown(void)
>>>> @@ -718,6 +727,9 @@ static void qemu_start_incoming_migration(const char *uri, bool has_channels,
>>>>        } else {
>>>>            error_setg(errp, "unknown migration protocol: %s", uri);
>>>>        }
>>>> +
>>>> +    /* Close cpr socket to tell source that we are listening */
>>>> +    cpr_state_close();
>>>
>>> Would it be possible to use some explicit reply message to mark this?
>>
>> In theory yes, but I fear that using a return channel with message parsing and
>> dispatch adds more code than it is worth.
>>
>>> So
>>> far looks like src QEMU will continue with qmp_migrate_finish() even if the
>>> cpr channel was closed due to error.
>>
>> Yes, but we recover just fine.  The target hits some error, fails to read all the
>> cpr state, closes the channel prematurely, and does *not* create a listen socket
>> for the normal migration channel.  Hence qmp_migrate_finish fails to connect to the
>> normal channel, and recovers.
> 
> This is a slightly tricky part and it would be nice to document it somewhere,
> perhaps starting with the commit message.

I will extend the block comment in qmp_migrate:

     /*
      * For cpr-transfer, the target may not be listening yet on the migration
      * channel, because first it must finish cpr_load_state.  The target tells
      * us it is listening by closing the cpr-state socket.  Wait for that HUP
      * event before connecting in qmp_migrate_finish.
      *
      * The HUP could occur because the target fails while reading CPR state,
      * in which case the target will not listen for the incoming migration
      * connection, so qmp_migrate_finish will fail to connect, and then recover.
      */
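
The hangup-as-ready signal described in the comment above can be illustrated
outside QEMU with plain POSIX sockets.  This is only a minimal sketch, not QEMU
code: wait_for_peer_close() is a hypothetical helper name, and it calls poll()
directly where the patch uses qio_channel_create_watch() with G_IO_HUP.  It
also treats end-of-file as a hangup, on the assumption that not every platform
reports POLLHUP when the peer closes.

```c
#include <assert.h>
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper, not part of QEMU.
 * Returns 1 if the peer closed its end, 0 if it has not (timeout, or data
 * is still pending), and -1 on error.
 */
static int wait_for_peer_close(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    char byte;
    int n;

    assert(fd >= 0);
    n = poll(&pfd, 1, timeout_ms);
    if (n <= 0) {
        return n;               /* 0: timed out, -1: poll failed */
    }
    if (pfd.revents & POLLHUP) {
        return 1;               /* explicit hangup from the peer */
    }
    if (pfd.revents & POLLIN) {
        /* Readable with zero bytes available means EOF: the peer closed. */
        ssize_t r = recv(fd, &byte, 1, MSG_PEEK);
        return r == 0 ? 1 : (r < 0 ? -1 : 0);
    }
    return -1;                  /* POLLERR or POLLNVAL */
}
```

As the comment notes, a hangup alone does not say whether the peer succeeded or
failed; in this design the source only learns of a failure when the follow-up
connect on the migration channel does not succeed.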

> Then the error will say "failed to connect to destination QEMU" hiding the
> real failure (cpr save/load failed), right?  That's slightly a pity.

Yes, but destination qemu will also emit a more specific message.

> I'm OK with the HUP as of now, but if you care about accurate CPR-stage
> error reporting, then feel free to draft something else in the next post.

I'll think about it, but to get cpr into 9.2, this will probably need to be
deferred as a future enhancement.

>>> I still didn't see how that kind of issue was captured below [1] (e.g., cpr
>>> channel broken after sending partial fds)?
>>
>> Same as above.
>>
>>>>    }
>>>>    static void process_incoming_migration_bh(void *opaque)
>>>> @@ -1414,6 +1426,8 @@ static void migrate_fd_cleanup(MigrationState *s)
>>>>        s->vmdesc = NULL;
>>>>        qemu_savevm_state_cleanup();
>>>> +    cpr_state_close();
>>>> +    migrate_hup_delete(s);
>>>>        close_return_path_on_source(s);
>>>> @@ -1698,7 +1712,9 @@ bool migration_thread_is_self(void)
>>>>    bool migrate_mode_is_cpr(MigrationState *s)
>>>>    {
>>>> -    return s->parameters.mode == MIG_MODE_CPR_REBOOT;
>>>> +    MigMode mode = s->parameters.mode;
>>>> +    return mode == MIG_MODE_CPR_REBOOT ||
>>>> +           mode == MIG_MODE_CPR_TRANSFER;
>>>>    }
>>>>    int migrate_init(MigrationState *s, Error **errp)
>>>> @@ -2033,6 +2049,12 @@ static bool migrate_prepare(MigrationState *s, bool resume, Error **errp)
>>>>            return false;
>>>>        }
>>>> +    if (migrate_mode() == MIG_MODE_CPR_TRANSFER &&
>>>> +        !s->parameters.cpr_uri) {
>>>> +        error_setg(errp, "cpr-transfer mode requires setting cpr-uri");
>>>> +        return false;
>>>> +    }
>>>> +
>>>>        if (migration_is_blocked(errp)) {
>>>>            return false;
>>>>        }
>>>> @@ -2076,6 +2098,37 @@ static bool migrate_prepare(MigrationState *s, bool resume, Error **errp)
>>>>    static void qmp_migrate_finish(MigrationAddress *addr, bool resume_requested,
>>>>                                   Error **errp);
>>>> +static void migrate_hup_add(MigrationState *s, QIOChannel *ioc, GSourceFunc cb,
>>>> +                            void *opaque)
>>>> +{
>>>> +    s->hup_source = qio_channel_create_watch(ioc, G_IO_HUP);
>>>> +    g_source_set_callback(s->hup_source, cb, opaque, NULL);
>>>> +    g_source_attach(s->hup_source, NULL);
>>>> +}
>>>> +
>>>> +static void migrate_hup_delete(MigrationState *s)
>>>> +{
>>>> +    if (s->hup_source) {
>>>> +        g_source_destroy(s->hup_source);
>>>> +        g_source_unref(s->hup_source);
>>>> +        s->hup_source = NULL;
>>>> +    }
>>>> +}
>>>> +
>>>> +static gboolean qmp_migrate_finish_cb(QIOChannel *channel,
>>>> +                                      GIOCondition cond,
>>>> +                                      void *opaque)
>>>> +{
>>>> +    MigrationAddress *addr = opaque;
>>>
>>> [1]
>>>
>>>> +
>>>> +    qmp_migrate_finish(addr, false, NULL);
>>>> +
>>>> +    cpr_state_close();
>>>> +    migrate_hup_delete(migrate_get_current());
>>>> +    qapi_free_MigrationAddress(addr);
>>>> +    return G_SOURCE_REMOVE;
>>>> +}
>>>> +
>>>>    void qmp_migrate(const char *uri, bool has_channels,
>>>>                     MigrationChannelList *channels, bool has_detach, bool detach,
>>>>                     bool has_resume, bool resume, Error **errp)
>>>> @@ -2136,7 +2189,19 @@ void qmp_migrate(const char *uri, bool has_channels,
>>>>            goto out;
>>>>        }
>>>> -    qmp_migrate_finish(addr, resume_requested, errp);
>>>> +    /*
>>>> +     * For cpr-transfer, the target may not be listening yet on the migration
>>>> +     * channel, because first it must finish cpr_load_state.  The target tells
>>>> +     * us it is listening by closing the cpr-state socket.  Wait for that HUP
>>>> +     * event before connecting in qmp_migrate_finish.
>>>> +     */
>>>> +    if (s->parameters.mode == MIG_MODE_CPR_TRANSFER) {
>>>> +        migrate_hup_add(s, cpr_state_ioc(), (GSourceFunc)qmp_migrate_finish_cb,
>>>> +                        QAPI_CLONE(MigrationAddress, addr));
>>>> +
>>>> +    } else {
>>>> +        qmp_migrate_finish(addr, resume_requested, errp);
>>>> +    }
>>>>    out:
>>>>        if (local_err) {
>>>> diff --git a/migration/migration.h b/migration/migration.h
>>>> index 38aa140..74c167b 100644
>>>> --- a/migration/migration.h
>>>> +++ b/migration/migration.h
>>>> @@ -457,6 +457,8 @@ struct MigrationState {
>>>>        bool switchover_acked;
>>>>        /* Is this a rdma migration */
>>>>        bool rdma_migration;
>>>> +
>>>> +    GSource *hup_source;
>>>>    };
>>>>    void migrate_set_state(MigrationStatus *state, MigrationStatus old_state,
>>>> diff --git a/migration/ram.c b/migration/ram.c
>>>> index 81eda27..e2cef50 100644
>>>> --- a/migration/ram.c
>>>> +++ b/migration/ram.c
>>>> @@ -216,7 +216,9 @@ static bool postcopy_preempt_active(void)
>>>>    bool migrate_ram_is_ignored(RAMBlock *block)
>>>>    {
>>>> +    MigMode mode = migrate_mode();
>>>>        return !qemu_ram_is_migratable(block) ||
>>>> +           mode == MIG_MODE_CPR_TRANSFER ||
>>>>               (migrate_ignore_shared() && qemu_ram_is_shared(block)
>>>>                                        && qemu_ram_is_named_file(block));
>>>>    }
>>>> diff --git a/migration/vmstate-types.c b/migration/vmstate-types.c
>>>> index 6e45a4a..b5a55b8 100644
>>>> --- a/migration/vmstate-types.c
>>>> +++ b/migration/vmstate-types.c
>>>> @@ -15,6 +15,7 @@
>>>>    #include "qemu-file.h"
>>>>    #include "migration.h"
>>>>    #include "migration/vmstate.h"
>>>> +#include "migration/client-options.h"
>>>>    #include "qemu/error-report.h"
>>>>    #include "qemu/queue.h"
>>>>    #include "trace.h"
>>>> @@ -321,7 +322,7 @@ static int get_fd(QEMUFile *f, void *pv, size_t size,
>>>>    {
>>>>        int32_t *v = pv;
>>>>        qemu_get_sbe32s(f, v);
>>>> -    if (*v < 0) {
>>>> +    if (*v < 0 || migrate_mode() != MIG_MODE_CPR_TRANSFER) {
>>>>            return 0;
>>>>        }
>>>>        *v = qemu_file_get_fd(f);
>>>> @@ -334,7 +335,7 @@ static int put_fd(QEMUFile *f, void *pv, size_t size,
>>>>        int32_t *v = pv;
>>>>        qemu_put_sbe32s(f, v);
>>>> -    if (*v < 0) {
>>>> +    if (*v < 0 || migrate_mode() != MIG_MODE_CPR_TRANSFER) {
>>>
>>> So I suppose you wanted to guard VMSTATE_FD being abused.  Then I wonder
>>> whether it'll help more by adding a comment above VMSTATE_FD instead; it'll
>>> be more straightforward to me.
>>>
>>> And if you want to fail hard, assert should work better too in runtime, or
>>> the "return 0" can be pretty hard to notice.
>>
>> No, this code is not about detecting abuse or errors.  It is there to skip
>> the qemu_file_put_fd for cpr-exec mode.  In my next version this function will
>> simply be:
>>
>> static int put_fd(QEMUFile *f, void *pv, size_t size,
>>                    const VMStateField *field, JSONWriter *vmdesc)
>> {
>>      int32_t *v = pv;
>>      return qemu_file_put_fd(f, *v);
>> }
> 
> Great, thanks.
> 
>>
>>>>            return 0;
>>>>        }
>>>>        return qemu_file_put_fd(f, *v);
>>>> diff --git a/qapi/migration.json b/qapi/migration.json
>>>> index c0d8bcc..f51b4cb 100644
>>>> --- a/qapi/migration.json
>>>> +++ b/qapi/migration.json
>>>> @@ -611,9 +611,34 @@
>>>>    #     or COLO.
>>>>    #
>>>>    #     (since 8.2)
>>>> +#
>>>> +# @cpr-transfer: This mode allows the user to transfer a guest to a
>>>> +#     new QEMU instance on the same host with minimal guest pause
>>>> +#     time, by preserving guest RAM in place, albeit with new virtual
>>>> +#     addresses in new QEMU.
>>>> +#
>>>> +#     The user starts new QEMU on the same host as old QEMU, with the
>>>> +#     same arguments as old QEMU, plus the -incoming option.  The
>>>> +#     user issues the migrate command to old QEMU, which stops the VM,
>>>> +#     saves state to the migration channels, and enters the
>>>> +#     postmigrate state.  Execution resumes in new QEMU.  Guest RAM is
>>>> +#     preserved in place, albeit with new virtual addresses in new
>>>> +#     QEMU.  The incoming migration channel cannot be a file type.
>>>> +#
>>>> +#     This mode requires a second migration channel, specified by the
>>>> +#     cpr-uri migration property on the outgoing side, and by
>>>> +#     the cpr-uri QEMU command-line option on the incoming
>>>> +#     side.  The channel must be a type, such as unix socket, that
>>>> +#     supports SCM_RIGHTS.
>>>> +#
>>>> +#     Memory-backend objects must have the share=on attribute, but
>>>> +#     memory-backend-epc is not supported.  The VM must be started
>>>> +#     with the '-machine anon-alloc=memfd' option.
>>>> +#
>>>> +#     (since 9.2)
>>>>    ##
>>>>    { 'enum': 'MigMode',
>>>> -  'data': [ 'normal', 'cpr-reboot' ] }
>>>> +  'data': [ 'normal', 'cpr-reboot', 'cpr-transfer' ] }
>>>
>>> No need to rush, but please add the CPR.rst and unit test updates when you
>>> feel confident on the protocol.  It looks pretty good to me now.
>>>
>>> Especially it'll be nice to describe the separate cpr-channel protocol in
>>> the new doc page.
>>
>> Will do, now that there is light at the end of the tunnel.
> 
> I just noticed that we have 1 month left before soft freeze. I'll try to
> prioritize review of this series (and the other VFIO one) in the upcoming
> month.  Let's see whether it can hit 9.2.

Cool, thanks, I will also make an extra effort to hit that goal.

- Steve





* Re: [PATCH V2 04/13] migration: stop vm earlier for cpr
  2024-10-08 15:35       ` Peter Xu
@ 2024-10-08 19:13         ` Steven Sistare
  0 siblings, 0 replies; 79+ messages in thread
From: Steven Sistare @ 2024-10-08 19:13 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster

On 10/8/2024 11:35 AM, Peter Xu wrote:
> On Mon, Oct 07, 2024 at 04:52:43PM -0400, Steven Sistare wrote:
>> On 10/7/2024 11:27 AM, Peter Xu wrote:
>>> On Mon, Sep 30, 2024 at 12:40:35PM -0700, Steve Sistare wrote:
>>>> Stop the vm earlier for cpr, to guarantee consistent device state when
>>>> CPR state is saved.
>>>
>>> Could you add some more info on why this order matters?
>>>
>>> E.g., qmp_migrate should switch the migration state machine to SETUP, and
>>> since this path holds the BQL, I think there is no way devices could be
>>> hot-added concurrently during the whole process.
>>>
>>> Would other things change in the cpr states (name, fd, etc.)?  It'll be
>>> great to mention these details in the commit message.
>>
>> Because of the new cpr-state save operation needed by this mode,
>> I created this patch to be future proof.  Performing a save operation while
>> the machine is running is asking for trouble.  But right now, I am not aware
>> of any specific issues.
>>
>> Later in the "tap and vhost" series there is another reason to stop the vm here and
>> save cpr state, because the devices must be stopped in old qemu before they
>> are initialized in new qemu.  If you are curious, see the 2 patches I attached
>> to the email at
>>    https://lore.kernel.org/qemu-devel/fa95c40d-b5e5-41eb-bba7-7842bca2f73e@oracle.com/
>> But, that has nothing to do with the contents of cpr state.
> 
> Then I suggest we leave this patch to the vhost/tap series, and please
> document clearly in the commit message why this is needed.  Linking to
> that discussion thread could work too.

OK.

> Side note: I saw you have MIG_EVENT_PRECOPY_CPR_SETUP in your own tree.  I
> wonder whether we could reuse MIG_EVENT_PRECOPY_SETUP by moving it earlier
> in qmp_migrate().  After all, CPR-* notifiers are already registered
> separately in the list of migration_state_notifiers[], so I suppose it'll
> serve the same purpose.  But we can discuss that later.

Sure, we can discuss later (and I'll take another look before posting the vhost/tap
series).

- Steve




* Re: [PATCH V2 13/13] migration: cpr-transfer mode
  2024-10-08 18:47         ` Peter Xu
  2024-10-08 19:11           ` Fabiano Rosas
@ 2024-10-08 19:29           ` Steven Sistare
  1 sibling, 0 replies; 79+ messages in thread
From: Steven Sistare @ 2024-10-08 19:29 UTC (permalink / raw)
  To: Peter Xu, Fabiano Rosas
  Cc: qemu-devel, David Hildenbrand, Marcel Apfelbaum, Eduardo Habkost,
	Philippe Mathieu-Daude, Paolo Bonzini, Daniel P. Berrange,
	Markus Armbruster

On 10/8/2024 2:47 PM, Peter Xu wrote:
> On Tue, Oct 08, 2024 at 03:28:30PM -0300, Fabiano Rosas wrote:
>>>>> +    /* Close cpr socket to tell source that we are listening */
>>>>> +    cpr_state_close();
>>>>
>>>> Would it be possible to use some explicit reply message to mark this?
>>>
>>> In theory yes, but I fear that using a return channel with message parsing and
>>> dispatch adds more code than it is worth.
>>
>> I think this approach is fine for now, but I wonder whether we could
>> reuse the current return path (RP) by starting it earlier and take
>> benefit from it already having the message passing infrastructure in
>> place. I'm actually looking ahead to the migration handshake thread[1],
>> which could be thought to have some similarity with the early cpr
>> channel. So having a generic channel in place early on to handle
>> handshake, CPR, RP, etc. could be a good idea.
> 
> The current design relies on the CPR stage happening before device realize()s,
> so I assume the migration channel (including RP) isn't easily applicable as
> early as this stage.
> 
> However I think dest qemu can directly write back to the cpr_uri channel
> instead if we want and then follow a protocol simple enough (even though
> it'll be separate from the migration stream protocol).
> 
> What worries me more (besides using HUP as of now..) is that cpr_state_save() is
> currently synchronous and can block the main iothread.  It means that if the cpr
> destination is not properly set up, it can hang the main thread (including
> e.g. QMP monitor) at qio_channel_socket_connect_sync().  Ideally we
> shouldn't block the main thread.
> 
> If async-mode can be done, it might be even easier, e.g. if we do
> cpr_state_save() in a thread, after qemu_put*() we can directly qemu_get*()
> in the same context with the pairing return qemufile.
> 
> But maybe we can do it in two steps, merging HUP first.  Then when a better
> protocol (plus async mode) is ready, one can bump QEMU_CPR_FILE_VERSION.
> I'll see how Steve wants to address it.

Our emails on this subject crossed.
I agree that an async channel both on send and recv sounds like the way to
go, but I would like to keep HUP for now, and pursue that as a future RFE.

- Steve

>> Anyway, I'm probing on this a bit so I can start drafting something. I
>> got surprised that we don't even have the capability bits in the stream
>> in a useful way (currently, configuration_validate_capabilities() does
>> kind of nothing).
>>
>> 1- https://wiki.qemu.org/ToDo/LiveMigration#Migration_handshake
> 
> Happy to know this. I was thinking whether I should work on this even
> earlier, so if you're looking at that it'll be great.
> 
> The major pain to me is the channel establishment part where we now have
> all kinds of channels, so we should really fix that sooner (e.g., we hope
> to enable multifd + postcopy very soon, which requires multifd and preempt
> channels to appear at the same time).  It was reasonable that the vfio/multifd
> series tried to fix it.
> 




* Re: [PATCH V2 13/13] migration: cpr-transfer mode
  2024-10-08 19:11           ` Fabiano Rosas
@ 2024-10-08 19:33             ` Steven Sistare
  2024-10-08 19:48             ` Peter Xu
  1 sibling, 0 replies; 79+ messages in thread
From: Steven Sistare @ 2024-10-08 19:33 UTC (permalink / raw)
  To: Fabiano Rosas, Peter Xu
  Cc: qemu-devel, David Hildenbrand, Marcel Apfelbaum, Eduardo Habkost,
	Philippe Mathieu-Daude, Paolo Bonzini, Daniel P. Berrange,
	Markus Armbruster

On 10/8/2024 3:11 PM, Fabiano Rosas wrote:
> Peter Xu <peterx@redhat.com> writes:
> 
>> On Tue, Oct 08, 2024 at 03:28:30PM -0300, Fabiano Rosas wrote:
>>>>>> +    /* Close cpr socket to tell source that we are listening */
>>>>>> +    cpr_state_close();
>>>>>
>>>>> Would it be possible to use some explicit reply message to mark this?
>>>>
>>>> In theory yes, but I fear that using a return channel with message parsing and
>>>> dispatch adds more code than it is worth.
>>>
>>> I think this approach is fine for now, but I wonder whether we could
>>> reuse the current return path (RP) by starting it earlier and take
>>> benefit from it already having the message passing infrastructure in
>>> place. I'm actually looking ahead to the migration handshake thread[1],
>>> which could be thought to have some similarity with the early cpr
>>> channel. So having a generic channel in place early on to handle
>>> handshake, CPR, RP, etc. could be a good idea.
>>
>> The current design relies on the CPR stage happening before device realize()s,
>> so I assume the migration channel (including RP) isn't easily applicable as
>> early as this stage.
> 
> Well, what is the dependency for the RP? If we can send CPR state, we
> can send QEMU_VM_COMMAND, no?

The CPR state channel is (and must be) used before migration_object_init,
and before the normal migration channel is opened.  Thus we cannot use the
normal return path.

- Steve

>> However I think dest qemu can directly write back to the cpr_uri channel
>> instead if we want and then follow a protocol simple enough (even though
>> it'll be separate from the migration stream protocol).
>>
>> What worries me more (besides using HUP as of now..) is that cpr_state_save() is
>> currently synchronous and can block the main iothread.  It means that if the cpr
>> destination is not properly set up, it can hang the main thread (including
>> e.g. QMP monitor) at qio_channel_socket_connect_sync().  Ideally we
>> shouldn't block the main thread.
>>
>> If async-mode can be done, it might be even easier, e.g. if we do
>> cpr_state_save() in a thread, after qemu_put*() we can directly qemu_get*()
>> in the same context with the pairing return qemufile.
>>
>> But maybe we can do it in two steps, merging HUP first.  Then when a better
>> protocol (plus async mode) is ready, one can bump QEMU_CPR_FILE_VERSION.
>> I'll see how Steve wants to address it.
> 
> I agree HUP is fine at the moment.
> 
>>
>>>
>>> Anyway, I'm probing on this a bit so I can start drafting something. I
>>> got surprised that we don't even have the capability bits in the stream
>>> in a useful way (currently, configuration_validate_capabilities() does
>>> kind of nothing).
>>>
>>> 1- https://wiki.qemu.org/ToDo/LiveMigration#Migration_handshake
>>
>> Happy to know this. I was thinking whether I should work on this even
>> earlier, so if you're looking at that it'll be great.
> 
> As of half an hour ago =) We could put a feature branch up and work
> together.  If you have more concrete thoughts on what this would look like,
> let me know.
> 
>>
>> The major pain to me is the channel establishment part where we now have
>> all kinds of channels, so we should really fix that sooner (e.g., we hope
>> to enable multifd + postcopy very soon, which requires multifd and preempt
>> channels to appear at the same time).  It was reasonable that the vfio/multifd
>> series tried to fix it.
> 




* Re: [PATCH V2 13/13] migration: cpr-transfer mode
  2024-10-08 19:12         ` Steven Sistare
@ 2024-10-08 19:38           ` Peter Xu
  0 siblings, 0 replies; 79+ messages in thread
From: Peter Xu @ 2024-10-08 19:38 UTC (permalink / raw)
  To: Steven Sistare
  Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster

On Tue, Oct 08, 2024 at 03:12:32PM -0400, Steven Sistare wrote:
> > This is a slightly tricky part and it would be nice to document it somewhere,
> > perhaps starting with the commit message.
> 
> I will extend the block comment in qmp_migrate:
> 
>     /*
>      * For cpr-transfer, the target may not be listening yet on the migration
>      * channel, because first it must finish cpr_load_state.  The target tells
>      * us it is listening by closing the cpr-state socket.  Wait for that HUP
>      * event before connecting in qmp_migrate_finish.
>      *
>      * The HUP could occur because the target fails while reading CPR state,
>      * in which case the target will not listen for the incoming migration
>      * connection, so qmp_migrate_finish will fail to connect, and then recover.
>      */

Yes this is better, thanks.

> 
> > Then the error will say "failed to connect to destination QEMU" hiding the
> > real failure (cpr save/load failed), right?  That's slightly a pity.
> 
> Yes, but destination qemu will also emit a more specific message.

True.

> 
> > I'm OK with the HUP as of now, but if you care about accurate CPR-stage
> > error reporting, then feel free to draft something else in the next post.
> 
> I'll think about it, but to get cpr into 9.2, this will probably need to be
> deferred as a future enhancement.

Yep that's OK.

-- 
Peter Xu




* Re: [PATCH V2 13/13] migration: cpr-transfer mode
  2024-10-08 19:11           ` Fabiano Rosas
  2024-10-08 19:33             ` Steven Sistare
@ 2024-10-08 19:48             ` Peter Xu
  2024-10-09 18:43               ` Steven Sistare
  1 sibling, 1 reply; 79+ messages in thread
From: Peter Xu @ 2024-10-08 19:48 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: Steven Sistare, qemu-devel, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster

On Tue, Oct 08, 2024 at 04:11:38PM -0300, Fabiano Rosas wrote:
> As of half an hour ago =) We could put a feature branch up and work
> together.  If you have more concrete thoughts on what this would look like,
> let me know.

[I'll hijack this thread with one more email, as this is not cpr-relevant]

I think I listed all the things I can think of in the wiki, so please go
ahead.

One trivial suggestion is that we can start from the very simple part, which is
the handshake itself, with a self-bootstrap protocol, probably feature-bit
based or whatever you prefer.  Then we set bit 0 saying "this QEMU knows
how to handshake".

Compared to the remaining requirements, IMHO we can make the channel
establishment the 1st feature, which is then already good for merging, with
feature bit 1 saying "this qemu understands named channel establishment".

Then we add new features on top of the handshake feature by adding
more feature bits.  Both QEMUs should first handshake on the feature bits
they support and enable only the subset that both support.

Or instead of bits, feature strings etc. would all work, whichever you
prefer. Just to say we don't need to impl all the ideas there, as some of
them might take more time (e.g. device tree check), and that list is
probably not complete anyway.

-- 
Peter Xu


* Re: [PATCH V2 05/13] physmem: preserve ram blocks for cpr
  2024-10-08 16:26         ` Peter Xu
@ 2024-10-08 21:05           ` Steven Sistare
  2024-10-08 21:32             ` Peter Xu
  0 siblings, 1 reply; 79+ messages in thread
From: Steven Sistare @ 2024-10-08 21:05 UTC (permalink / raw)
  To: Peter Xu
  Cc: Igor Mammedov, Michael S. Tsirkin, qemu-devel, Fabiano Rosas,
	David Hildenbrand, Marcel Apfelbaum, Eduardo Habkost,
	Philippe Mathieu-Daude, Paolo Bonzini, Daniel P. Berrange,
	Markus Armbruster

On 10/8/2024 12:26 PM, Peter Xu wrote:
> On Tue, Oct 08, 2024 at 11:17:46AM -0400, Steven Sistare wrote:
>> On 10/7/2024 12:28 PM, Peter Xu wrote:
>>> On Mon, Oct 07, 2024 at 11:49:25AM -0400, Peter Xu wrote:
>>>> On Mon, Sep 30, 2024 at 12:40:36PM -0700, Steve Sistare wrote:
>>>>> Save the memfd for anonymous ramblocks in CPR state, along with a name
>>>>> that uniquely identifies it.  The block's idstr is not yet set, so it
>>>>> cannot be used for this purpose.  Find the saved memfd in new QEMU when
>>>>> creating a block.  QEMU hard-codes the length of some internally-created
>>>>> blocks, so to guard against that length changing, use lseek to get the
>>>>> actual length of an incoming memfd.
>>>>>
>>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>>> ---
>>>>>    system/physmem.c | 25 ++++++++++++++++++++++++-
>>>>>    1 file changed, 24 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/system/physmem.c b/system/physmem.c
>>>>> index 174f7e0..ddbeec9 100644
>>>>> --- a/system/physmem.c
>>>>> +++ b/system/physmem.c
>>>>> @@ -72,6 +72,7 @@
>>>>>    #include "qapi/qapi-types-migration.h"
>>>>>    #include "migration/options.h"
>>>>> +#include "migration/cpr.h"
>>>>>    #include "migration/vmstate.h"
>>>>>    #include "qemu/range.h"
>>>>> @@ -1663,6 +1664,19 @@ void qemu_ram_unset_idstr(RAMBlock *block)
>>>>>        }
>>>>>    }
>>>>> +static char *cpr_name(RAMBlock *block)
>>>>> +{
>>>>> +    MemoryRegion *mr = block->mr;
>>>>> +    const char *mr_name = memory_region_name(mr);
>>>>> +    g_autofree char *id = mr->dev ? qdev_get_dev_path(mr->dev) : NULL;
>>>>> +
>>>>> +    if (id) {
>>>>> +        return g_strdup_printf("%s/%s", id, mr_name);
>>>>> +    } else {
>>>>> +        return g_strdup(mr_name);
>>>>> +    }
>>>>> +}
>>>>> +
>>>>>    size_t qemu_ram_pagesize(RAMBlock *rb)
>>>>>    {
>>>>>        return rb->page_size;
>>>>> @@ -1858,14 +1872,18 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
>>>>>                                            TYPE_MEMORY_BACKEND)) {
>>>>>                size_t max_length = new_block->max_length;
>>>>>                MemoryRegion *mr = new_block->mr;
>>>>> -            const char *name = memory_region_name(mr);
>>>>> +            g_autofree char *name = cpr_name(new_block);
>>>>>                new_block->mr->align = QEMU_VMALLOC_ALIGN;
>>>>>                new_block->flags |= RAM_SHARED;
>>>>> +            new_block->fd = cpr_find_fd(name, 0);
>>>>>                if (new_block->fd == -1) {
>>>>>                    new_block->fd = qemu_memfd_create(name, max_length + mr->align,
>>>>>                                                      0, 0, 0, errp);
>>>>> +                cpr_save_fd(name, 0, new_block->fd);
>>>>> +            } else {
>>>>> +                new_block->max_length = lseek(new_block->fd, 0, SEEK_END);
>>>>
>>>> So this can overwrite the max_length that the caller specified..
>>>>
>>>> I remember we used to have some tricks for specifying a different max_length
>>>> for ROMs on dest QEMU (where the qemu firmware is also upgraded on the dest
>>>> host, so the size can be bigger than src qemu's old ramblocks), so that the
>>>> MR is always large enough to reload even the new firmware, while migration
>>>> only migrates the smaller size (used_length), which is fine as we keep the
>>>> extra space empty. I think that can be relevant to the qemu_ram_resize() call
>>>> of parse_ramblock().
>>
>> Yes, resizable ram block for firmware blob is the only case I know of where
>> the length changed in the past.  If a length changes in the future, we will
>> need to detect and accommodate that change here, and I believe the fix will
>> be to simply use the actual length, as per the code above.  But if you prefer,
>> for now I can check for length change and return an error. New qemu will fail
>> to start, and old qemu will recover.
>>
>>>> The reload will not happen until some point, perhaps system resets.  I
>>>> wonder whether that is an issue in this case.
>>
>> Firmware is only generated once, via this path on x86:
>>    qmp_x_exit_preconfig
>>      qemu_machine_creation_done
>>        qdev_machine_creation_done
>>          pc_machine_done
>>            acpi_setup
>>              acpi_add_rom_blob
>>                rom_add_blob
>>                  rom_set_mr
>>
>> After a system reset, the ramblock contents from memory are used as-is.
>>
>>> PS: If this is needed by CPR-transfer only because mmap() later can fail
>>> due to a bigger max_length,
>>
>> That is the reason.  IMO adjusting max_length is more robust than fiddling
>> with truncate and pretending that max_length is larger, when qemu will never
>> be able to use the phantom space up to max_length.
> 
> I thought it was not pretending, but the ROM region might be resized after
> a system reset?  I worry that your change here can conflict with such
> resizing later, so that qemu_ram_resize() can potentially fail after (1)
> the CPR-transfer upgrade completes, followed by (2) a system reset.
> 
> We can observe such resizing kick off in every reboot, like:
> 
> (gdb) bt
> #0  qemu_ram_resize
> #1  0x00005602b623b740 in memory_region_ram_resize
> #2  0x00005602b60f5580 in acpi_ram_update
> #3  0x00005602b60f5667 in acpi_build_update
> #4  0x00005602b5e1028b in fw_cfg_select
> #5  0x00005602b5e105af in fw_cfg_dma_transfer
> #6  0x00005602b5e109a8 in fw_cfg_dma_mem_write
> #7  0x00005602b62352ec in memory_region_write_accessor
> #8  0x00005602b62355e6 in access_with_adjusted_size
> #9  0x00005602b6238de8 in memory_region_dispatch_write
> #10 0x00005602b62488c5 in flatview_write_continue_step
> #11 0x00005602b6248997 in flatview_write_continue
> #12 0x00005602b6248abf in flatview_write
> #13 0x00005602b6248f39 in address_space_write
> #14 0x00005602b6248fb1 in address_space_rw
> #15 0x00005602b62a5d86 in kvm_handle_io
> #16 0x00005602b62a6cb2 in kvm_cpu_exec
> #17 0x00005602b62aa37a in kvm_vcpu_thread_fn
> #18 0x00005602b655da57 in qemu_thread_start
> #19 0x00007f120224a1b7 in start_thread
> #20 0x00007f12022cc39c in clone3
> 
> Specifically, see this code clip:
> 
> acpi_ram_update():
>      memory_region_ram_resize(mr, size, &error_abort);
>      memcpy(memory_region_get_ram_ptr(mr), data->data, size);
> 
> Per my understanding, what it does is during the reset the ROM ramblock
> will resize to the new size (normally, only larger, in my memory there used
> to have a ROM grew from 256K->512K, or something like that), then the
> memcpy() injects the latest firmware that it pre-loaded into mem.
> 
> So after such system reset, QEMU might start to see new ROM code loaded
> here (not the one that got migrated anymore, which will only match the
> version installed on src QEMU).  Here the problem is the new firmware can
> be larger, so I _think_ we need to make sure max_length is not modified by
> CPR to allow resizing happen here, while if we use truncate=true here it
> should just work in all cases.
> 
> I think it could be verified with an old QEMU running with old ROM files
> (which is smaller), then CPR migrate to a new QEMU running new ROM files
> (which is larger), then reboot to see whether that new QEMU crash.  Maybe
> we can emulate that with "romfile=XXX" parameter.
> 
> I am not fluent with ROM/firmware code, but please double check..

Thank you for the detailed analysis, I was completely wrong on this one :(

I also keep forgetting that ftruncate can grow as well as shrink a file.
I agree that preserving the dest qemu max_length, and using ftruncate, is the
correct solution, as long as dest max_length >= source max_length.

However, IMO the extra memory created by ftruncate also needs to be pinned for DMA.
We disagreed on exactly which blocks need to be pinned in previous discussions,
and to save time I would rather not re-open that debate right now.  Instead, I propose
to simply require that max_length does not change, and return an error if it does.
If it changes in some future qemu, we can reopen the discussion.
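
The check proposed above could look roughly like the following.  This is only
a sketch under stated assumptions, not the patch: check_preserved_length() and
create_block_fd() are hypothetical names, and an unlinked temp file stands in
for the memfd so the example runs anywhere POSIX is available.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical stand-in for old QEMU creating a block: an unlinked temp
 * file of 'len' bytes.  Returns the fd, or -errno on failure.
 */
static int create_block_fd(off_t len)
{
    char path[] = "/tmp/cpr-sketch-XXXXXX";
    int fd = mkstemp(path);

    if (fd < 0) {
        return -errno;
    }
    unlink(path);
    if (ftruncate(fd, len) < 0) {
        int err = errno;
        close(fd);
        return -err;
    }
    return fd;
}

/*
 * Hypothetical helper (not a QEMU function): verify that a preserved fd
 * still has the length that new QEMU expects for the block.
 * Returns 0 on a match, -EINVAL on a mismatch, -errno if lseek fails.
 */
static int check_preserved_length(int fd, off_t expected_max_length)
{
    off_t actual;

    assert(expected_max_length >= 0);
    actual = lseek(fd, 0, SEEK_END);
    if (actual < 0) {
        return -errno;
    }
    return actual == expected_max_length ? 0 : -EINVAL;
}
```

On a mismatch the caller would report an error instead of overwriting
max_length, so new qemu fails to start and old qemu recovers.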

- Steve




* Re: [PATCH V2 00/13] Live update: cpr-transfer
  2024-10-08 14:33 ` [PATCH V2 00/13] Live update: cpr-transfer Vladimir Sementsov-Ogievskiy
@ 2024-10-08 21:13   ` Steven Sistare
  0 siblings, 0 replies; 79+ messages in thread
From: Steven Sistare @ 2024-10-08 21:13 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-devel
  Cc: Peter Xu, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster

On 10/8/2024 10:33 AM, Vladimir Sementsov-Ogievskiy wrote:
> On 30.09.24 22:40, Steve Sistare wrote:
>> Some devices need new kernel software interfaces
>> to allow a descriptor to be used in a process that did not originally open it.
> 
> Hi Steve!
> 
> Could you please describe, which kernel version / features are required? I'm mostly interested in migration of tap and vhost-user devices.

For tap and vhost kernel, no special kernel features are required.  But in addition
to these cpr-transfer patches, you will need the "Live Update: tap and vhost" RFC
V1 that I posted, and it might not apply cleanly to this most recent cpr-transfer
series.  I will eventually update that series, but not immediately.

Also, I have never tried vhost-user, so I am not sure it will work without
additional changes in qemu.

- Steve




* Re: [PATCH V2 05/13] physmem: preserve ram blocks for cpr
  2024-10-08 21:05           ` Steven Sistare
@ 2024-10-08 21:32             ` Peter Xu
  2024-10-31 20:32               ` Steven Sistare
  0 siblings, 1 reply; 79+ messages in thread
From: Peter Xu @ 2024-10-08 21:32 UTC (permalink / raw)
  To: Steven Sistare
  Cc: Igor Mammedov, Michael S. Tsirkin, qemu-devel, Fabiano Rosas,
	David Hildenbrand, Marcel Apfelbaum, Eduardo Habkost,
	Philippe Mathieu-Daude, Paolo Bonzini, Daniel P. Berrange,
	Markus Armbruster

On Tue, Oct 08, 2024 at 05:05:01PM -0400, Steven Sistare wrote:
> On 10/8/2024 12:26 PM, Peter Xu wrote:
> > On Tue, Oct 08, 2024 at 11:17:46AM -0400, Steven Sistare wrote:
> > > On 10/7/2024 12:28 PM, Peter Xu wrote:
> > > > On Mon, Oct 07, 2024 at 11:49:25AM -0400, Peter Xu wrote:
> > > > > On Mon, Sep 30, 2024 at 12:40:36PM -0700, Steve Sistare wrote:
> > > > > > Save the memfd for anonymous ramblocks in CPR state, along with a name
> > > > > > that uniquely identifies it.  The block's idstr is not yet set, so it
> > > > > > cannot be used for this purpose.  Find the saved memfd in new QEMU when
> > > > > > creating a block.  QEMU hard-codes the length of some internally-created
> > > > > > blocks, so to guard against that length changing, use lseek to get the
> > > > > > actual length of an incoming memfd.
> > > > > > 
> > > > > > Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> > > > > > ---
> > > > > >    system/physmem.c | 25 ++++++++++++++++++++++++-
> > > > > >    1 file changed, 24 insertions(+), 1 deletion(-)
> > > > > > 
> > > > > > diff --git a/system/physmem.c b/system/physmem.c
> > > > > > index 174f7e0..ddbeec9 100644
> > > > > > --- a/system/physmem.c
> > > > > > +++ b/system/physmem.c
> > > > > > @@ -72,6 +72,7 @@
> > > > > >    #include "qapi/qapi-types-migration.h"
> > > > > >    #include "migration/options.h"
> > > > > > +#include "migration/cpr.h"
> > > > > >    #include "migration/vmstate.h"
> > > > > >    #include "qemu/range.h"
> > > > > > @@ -1663,6 +1664,19 @@ void qemu_ram_unset_idstr(RAMBlock *block)
> > > > > >        }
> > > > > >    }
> > > > > > +static char *cpr_name(RAMBlock *block)
> > > > > > +{
> > > > > > +    MemoryRegion *mr = block->mr;
> > > > > > +    const char *mr_name = memory_region_name(mr);
> > > > > > +    g_autofree char *id = mr->dev ? qdev_get_dev_path(mr->dev) : NULL;
> > > > > > +
> > > > > > +    if (id) {
> > > > > > +        return g_strdup_printf("%s/%s", id, mr_name);
> > > > > > +    } else {
> > > > > > +        return g_strdup(mr_name);
> > > > > > +    }
> > > > > > +}
> > > > > > +
> > > > > >    size_t qemu_ram_pagesize(RAMBlock *rb)
> > > > > >    {
> > > > > >        return rb->page_size;
> > > > > > @@ -1858,14 +1872,18 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
> > > > > >                                            TYPE_MEMORY_BACKEND)) {
> > > > > >                size_t max_length = new_block->max_length;
> > > > > >                MemoryRegion *mr = new_block->mr;
> > > > > > -            const char *name = memory_region_name(mr);
> > > > > > +            g_autofree char *name = cpr_name(new_block);
> > > > > >                new_block->mr->align = QEMU_VMALLOC_ALIGN;
> > > > > >                new_block->flags |= RAM_SHARED;
> > > > > > +            new_block->fd = cpr_find_fd(name, 0);
> > > > > >                if (new_block->fd == -1) {
> > > > > >                    new_block->fd = qemu_memfd_create(name, max_length + mr->align,
> > > > > >                                                      0, 0, 0, errp);
> > > > > > +                cpr_save_fd(name, 0, new_block->fd);
> > > > > > +            } else {
> > > > > > +                new_block->max_length = lseek(new_block->fd, 0, SEEK_END);
> > > > > 
> > > > > So this can overwrite the max_length that the caller specified..
> > > > > 
> > > > > I remember we used to have some tricks on specifying different max_length
> > > > > for ROMs on dest QEMU (on which, qemu firmwares also upgraded on the dest
> > > > > host so the size can be bigger than src qemu's old ramblocks), so that the
> > > > > MR is always large enough to reload even the new firmwares, while migration
> > > > > only migrates the smaller size (used_length) so it's fine as we keep the
> > > > > extra sizes empty. I think that can relevant to the qemu_ram_resize() call
> > > > > of parse_ramblock().
> > > 
> > > Yes, resizable ram block for firmware blob is the only case I know of where
> > > the length changed in the past.  If a length changes in the future, we will
> > > need to detect and accommodate that change here, and I believe the fix will
> > > be to simply use the actual length, as per the code above.  But if you prefer,
> > > for now I can check for length change and return an error. New qemu will fail
> > > to start, and old qemu will recover.
> > > 
> > > > > The reload will not happen until some point, perhaps system resets.  I
> > > > > wonder whether that is an issue in this case.
> > > 
> > > Firmware is only generated once, via this path on x86:
> > >    qmp_x_exit_preconfig
> > >      qemu_machine_creation_done
> > >        qdev_machine_creation_done
> > >          pc_machine_done
> > >            acpi_setup
> > >              acpi_add_rom_blob
> > >                rom_add_blob
> > >                  rom_set_mr
> > > 
> > > After a system reset, the ramblock contents from memory are used as-is.
> > > 
> > > > PS: If this is needed by CPR-transfer only because mmap() later can fail
> > > > due to a bigger max_length,
> > > 
> > > That is the reason.  IMO adjusting max_length is more robust than fiddling
> > > with truncate and pretending that max_length is larger, when qemu will never
> > > be able to use the phantom space up to max_length.
> > 
> > I thought it was not pretending, but the ROM region might be resized after
> > a system reset?  I worry that your change here can violate with such
> > resizing later, so that qemu_ram_resize() can potentially fail after (1)
> > CPR-transfer upgrades completes, then follow with (2) a system reset.
> > 
> > We can observe such resizing kick off in every reboot, like:
> > 
> > (gdb) bt
> > #0  qemu_ram_resize
> > #1  0x00005602b623b740 in memory_region_ram_resize
> > #2  0x00005602b60f5580 in acpi_ram_update
> > #3  0x00005602b60f5667 in acpi_build_update
> > #4  0x00005602b5e1028b in fw_cfg_select
> > #5  0x00005602b5e105af in fw_cfg_dma_transfer
> > #6  0x00005602b5e109a8 in fw_cfg_dma_mem_write
> > #7  0x00005602b62352ec in memory_region_write_accessor
> > #8  0x00005602b62355e6 in access_with_adjusted_size
> > #9  0x00005602b6238de8 in memory_region_dispatch_write
> > #10 0x00005602b62488c5 in flatview_write_continue_step
> > #11 0x00005602b6248997 in flatview_write_continue
> > #12 0x00005602b6248abf in flatview_write
> > #13 0x00005602b6248f39 in address_space_write
> > #14 0x00005602b6248fb1 in address_space_rw
> > #15 0x00005602b62a5d86 in kvm_handle_io
> > #16 0x00005602b62a6cb2 in kvm_cpu_exec
> > #17 0x00005602b62aa37a in kvm_vcpu_thread_fn
> > #18 0x00005602b655da57 in qemu_thread_start
> > #19 0x00007f120224a1b7 in start_thread
> > #20 0x00007f12022cc39c in clone3
> > 
> > Specifically, see this code clip:
> > 
> > acpi_ram_update():
> >      memory_region_ram_resize(mr, size, &error_abort);
> >      memcpy(memory_region_get_ram_ptr(mr), data->data, size);
> > 
> > Per my understanding, what it does is during the reset the ROM ramblock
> > will resize to the new size (normally, only larger, in my memory there used
> > to have a ROM grew from 256K->512K, or something like that), then the
> > memcpy() injects the latest firmware that it pre-loaded into mem.
> > 
> > So after such system reset, QEMU might start to see new ROM code loaded
> > here (not the one that got migrated anymore, which will only match the
> > version installed on src QEMU).  Here the problem is the new firmware can
> > be larger, so I _think_ we need to make sure max_length is not modified by
> > CPR to allow resizing happen here, while if we use truncate=true here it
> > should just work in all cases.
> > 
> > I think it could be verified with an old QEMU running with old ROM files
> > (which is smaller), then CPR migrate to a new QEMU running new ROM files
> > (which is larger), then reboot to see whether that new QEMU crash.  Maybe
> > we can emulate that with "romfile=XXX" parameter.
> > 
> > I am not fluent with ROM/firmware code, but please double check..
> 
> Thank you for the detailed analysis, I was completely wrong on this one :(
> 
> I also keep forgetting that ftruncate can grow as well as shrink a file.
> I agree that preserving the dest qemu max_length, and using ftruncate, is the
> correct solution, as long as dest max_length >= source max_length.
> 
> However, IMO the extra memory created by ftruncate also needs to be pinned for DMA.
> We disagreed on exactly which blocks need to be pinned in previous discussions,
> and to save time I would rather not re-open that debate right now.  Instead, I propose
> to simply require that max_length does not change, and return an error if it does.
> If it changes in some future qemu, we can reopen the discussion.

Hmm.. why does the extra memory need to be pinned?

From QEMU memory topology POV, anything more than used_length is not
visible to the guest, afaict.

In this specific ROM example, qemu_ram_resize() on src QEMU will first
resize the ramblock (updating used_length), then set the matching size on
the MR with memory_region_set_size().  When src QEMU boots, that is the
size of the smaller firmware:

qemu_ram_resize():
    unaligned_size = newsize;
    ...
    newsize = TARGET_PAGE_ALIGN(newsize);
    newsize = REAL_HOST_PAGE_ALIGN(newsize);
    ...
    block->used_length = newsize;
    ...
    memory_region_set_size(block->mr, unaligned_size);

A tiny detail here is that the two sizes differ slightly: the MR size is the
unaligned size, so it is never larger than used_length.  The MR size decides
what is visible to the guest when the MR that owns the ROM file is mapped
into the GPA range.  That's true on the src, and after CPR migrates to dest
it should still hold true, afaict, as the remaining memory (used->max) is
not used before a system reset.
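
To make the size relationship concrete, here is a simplified model of it.
It is a sketch, not QEMU code: struct block_sizes, align_up() and
model_resize() are hypothetical names, and a single page size is assumed
where the real code aligns to both target and host page sizes.

```c
#include <assert.h>
#include <stdint.h>

/* Round n up to a multiple of align, which must be a power of two. */
static uint64_t align_up(uint64_t n, uint64_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (n + align - 1) & ~(align - 1);
}

/* Simplified model of the sizes tracked for one resizable block. */
struct block_sizes {
    uint64_t mr_size;      /* what the guest can see once the MR is mapped */
    uint64_t used_length;  /* page-aligned length currently in use */
    uint64_t max_length;   /* fixed upper bound on the block */
};

/*
 * Model of a resize: on success returns 0 and updates *b so that
 * mr_size <= used_length <= max_length.  Returns -1 and leaves *b
 * untouched if the aligned size would exceed max_length.
 */
static int model_resize(struct block_sizes *b, uint64_t newsize,
                        uint64_t page_size)
{
    uint64_t aligned = align_up(newsize, page_size);

    if (aligned > b->max_length) {
        return -1;
    }
    b->used_length = aligned;
    b->mr_size = newsize;
    return 0;
}
```

In this model, shrinking max_length to the old firmware's size is exactly
what makes a later resize to a larger firmware fail.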

The extra memory (used->max) becomes relevant only after a system reset,
when the new firmware is loaded and qemu_ram_resize() can indeed extend
that MR to cover more than before.  However, that should be fine too,
because it means guest memory is being rebuilt, so the VFIO memory listeners
should do the right thing (unpin the old ROM, repin the new, larger one),
iiuc.

-- 
Peter Xu




* Re: [PATCH V2 13/13] migration: cpr-transfer mode
  2024-10-08 19:48             ` Peter Xu
@ 2024-10-09 18:43               ` Steven Sistare
  2024-10-09 19:06                 ` Peter Xu
  0 siblings, 1 reply; 79+ messages in thread
From: Steven Sistare @ 2024-10-09 18:43 UTC (permalink / raw)
  To: Peter Xu, Fabiano Rosas
  Cc: qemu-devel, David Hildenbrand, Marcel Apfelbaum, Eduardo Habkost,
	Philippe Mathieu-Daude, Paolo Bonzini, Daniel P. Berrange,
	Markus Armbruster

On 10/8/2024 3:48 PM, Peter Xu wrote:
> On Tue, Oct 08, 2024 at 04:11:38PM -0300, Fabiano Rosas wrote:
>> As of half an hour ago =) We could put a feature branch up and work
>> together, if you have more concrete thoughts on how this would look like
>> let me know.
> 
> [I'll hijack this thread with one more email, as this is not cpr-relevant]
> 
> I think I listed all the things I can think of in the wiki, so please go
> ahead.
> 
> One trivial suggestion is we can start from the very simple, which is the
> handshake itself, with a self-bootstrap protocol, probably feature-bit
> based or whatever you prefer.  Then we set bit 0 saying "this QEMU knows
> how to handshake".
> 
> Comparing to the rest requirement, IMHO we can make the channel
> establishment the 1st feature, then it's already good for merging, having
> feature bit 1 saying "this qemu understands named channel establishment".
> 
> Then we add new feature bits on top of the handshake feature, by adding
> more feature bits.  Both QEMUs should first handshake on the feature bits
> they support and enable only the subset that all support.
> 
> Or instead of bit, feature strings, etc. would all work which you
> prefer. Just to say we don't need to impl all the ideas there, as some of
> them might take more time (e.g. device tree check), and that list is
> probably not complete anyway.

While writing a qtest for cpr-transfer, I discovered a problem that could be
solved with an early migration handshake, prior to cpr_save_state / cpr_load_state.

There is currently no way to set migration caps on dest qemu before starting
cpr-transfer, because dest qemu blocks in cpr_state_load before creating any
devices or monitors. It is unblocked after the user sends the migrate command
to source qemu, but then the migration starts and it is too late to set migration
capabilities or parameters on the dest.

Are you OK with that restriction (for now, until a handshake is implemented)?
If not, I have a problem.

I can hack the qtest to make it work with the restriction.

- Steve




* Re: [PATCH V2 13/13] migration: cpr-transfer mode
  2024-10-09 18:43               ` Steven Sistare
@ 2024-10-09 19:06                 ` Peter Xu
  2024-10-09 19:59                   ` Peter Xu
  2024-10-09 20:09                   ` Steven Sistare
  0 siblings, 2 replies; 79+ messages in thread
From: Peter Xu @ 2024-10-09 19:06 UTC (permalink / raw)
  To: Steven Sistare
  Cc: Fabiano Rosas, qemu-devel, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster

On Wed, Oct 09, 2024 at 02:43:44PM -0400, Steven Sistare wrote:
> On 10/8/2024 3:48 PM, Peter Xu wrote:
> > On Tue, Oct 08, 2024 at 04:11:38PM -0300, Fabiano Rosas wrote:
> > > As of half an hour ago =) We could put a feature branch up and work
> > > together, if you have more concrete thoughts on how this would look like
> > > let me know.
> > 
> > [I'll hijack this thread with one more email, as this is not cpr-relevant]
> > 
> > I think I listed all the things I can think of in the wiki, so please go
> > ahead.
> > 
> > One trivial suggestion is we can start from the very simple, which is the
> > handshake itself, with a self-bootstrap protocol, probably feature-bit
> > based or whatever you prefer.  Then we set bit 0 saying "this QEMU knows
> > how to handshake".
> > 
> > Comparing to the rest requirement, IMHO we can make the channel
> > establishment the 1st feature, then it's already good for merging, having
> > feature bit 1 saying "this qemu understands named channel establishment".
> > 
> > Then we add new feature bits on top of the handshake feature, by adding
> > more feature bits.  Both QEMUs should first handshake on the feature bits
> > they support and enable only the subset that all support.
> > 
> > Or instead of bit, feature strings, etc. would all work which you
> > prefer. Just to say we don't need to impl all the ideas there, as some of
> > them might take more time (e.g. device tree check), and that list is
> > probably not complete anyway.
> 
> While writing a qtest for cpr-transfer, I discovered a problem that could be
> solved with an early migration handshake, prior to cpr_save_state / cpr_load_state.
> 
> There is currently no way to set migration caps on dest qemu before starting
> cpr-transfer, because dest qemu blocks in cpr_state_load before creating any
> devices or monitors. It is unblocked after the user sends the migrate command
> to source qemu, but then the migration starts and it is too late to set migration
> capabilities or parameters on the dest.
> 
> Are you OK with that restriction (for now, until a handshake is implemented)?
> If not, I have a problem.
> 
> I can hack the qtest to make it work with the restriction.

Hmm, the test case is one thing, but if it's a problem, then.. how could one
set migration capabilities on dest qemu for cpr-transfer in real life?

Now a similar question, and also what I overlooked previously, is how
cpr-transfer should support "-incoming defer".  We need that because that's
what Libvirt uses.. with a subsequent migrate_incoming QMP command.

-- 
Peter Xu




* Re: [PATCH V2 13/13] migration: cpr-transfer mode
  2024-10-09 19:06                 ` Peter Xu
@ 2024-10-09 19:59                   ` Peter Xu
  2024-10-09 20:18                     ` Steven Sistare
  2024-10-09 20:09                   ` Steven Sistare
  1 sibling, 1 reply; 79+ messages in thread
From: Peter Xu @ 2024-10-09 19:59 UTC (permalink / raw)
  To: Steven Sistare
  Cc: Fabiano Rosas, qemu-devel, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster

On Wed, Oct 09, 2024 at 03:06:53PM -0400, Peter Xu wrote:
> On Wed, Oct 09, 2024 at 02:43:44PM -0400, Steven Sistare wrote:
> > On 10/8/2024 3:48 PM, Peter Xu wrote:
> > > On Tue, Oct 08, 2024 at 04:11:38PM -0300, Fabiano Rosas wrote:
> > > > As of half an hour ago =) We could put a feature branch up and work
> > > > together, if you have more concrete thoughts on how this would look like
> > > > let me know.
> > > 
> > > [I'll hijack this thread with one more email, as this is not cpr-relevant]
> > > 
> > > I think I listed all the things I can think of in the wiki, so please go
> > > ahead.
> > > 
> > > One trivial suggestion is we can start from the very simple, which is the
> > > handshake itself, with a self-bootstrap protocol, probably feature-bit
> > > based or whatever you prefer.  Then we set bit 0 saying "this QEMU knows
> > > how to handshake".
> > > 
> > > Comparing to the rest requirement, IMHO we can make the channel
> > > establishment the 1st feature, then it's already good for merging, having
> > > feature bit 1 saying "this qemu understands named channel establishment".
> > > 
> > > Then we add new feature bits on top of the handshake feature, by adding
> > > more feature bits.  Both QEMUs should first handshake on the feature bits
> > > they support and enable only the subset that all support.
> > > 
> > > Or instead of bit, feature strings, etc. would all work which you
> > > prefer. Just to say we don't need to impl all the ideas there, as some of
> > > them might take more time (e.g. device tree check), and that list is
> > > probably not complete anyway.
> > 
> > While writing a qtest for cpr-transfer, I discovered a problem that could be
> > solved with an early migration handshake, prior to cpr_save_state / cpr_load_state.
> > 
> > There is currently no way to set migration caps on dest qemu before starting
> > cpr-transfer, because dest qemu blocks in cpr_state_load before creating any
> > devices or monitors. It is unblocked after the user sends the migrate command
> > to source qemu, but then the migration starts and it is too late to set migration
> > capabilities or parameters on the dest.
> > 
> > Are you OK with that restriction (for now, until a handshake is implemented)?
> > If not, I have a problem.
> > 
> > I can hack the qtest to make it work with the restriction.
> 
> Hmm, the test case is one thing, but if it's a problem, then.. how in real
> life one could set migration capabilities on dest qemu for cpr-transfer?
> 
> Now a similar question, and also what I overlooked previously, is how
> cpr-transfer should support "-incoming defer".  We need that because that's
> what Libvirt uses.. with an upcoming migrate_incoming QMP command.

Just to share some more thoughts below..

So fundamentally the question is whether there's some way cpr can have a
predictable window on dest qemu that we know QMP is ready, but before
incoming migration starts.

With the current design, the incoming side will sequentially do: (1) cpr-uri
load(), (2) initialize the rest of QEMU (migration, qmp, devices, etc.), (3)
listen port ready, then (4) close(), aka, HUP.  It looks like there is no way
to control when steps 1-4 kick off, so once the cpr-uri save() data is
dumped they all happen in one shot.

That might make sense, because we assumed the load() of cpr-uri happens
during the blackout window, and enlarging that window is probably not good.

But.. why do we keep cpr_state_save/load() in the blackout window?  AFAIU
they mostly just share fds, so they can happen with the VM still running on
src, right?

I still remember the vhost/tap issue you mentioned, but I wonder whether
that'll ever change the vhost/tap fd at all if we forbid any device change
like what we do with normal migrations. IOW, I wonder whether we can still
do the cpr_state_save/load() always during VM running (but it should still
be during an ACTIVE migration, IOW, device hotplug and stuff should be
forbidden, just like a live precopy phase).

Iff that works, then maybe there's a way out: we can make cpr-transfer two
steps:

  - DST: start QEMU dest the same, with -cpr-uri XXX, but now let's assume
    it's with -incoming defer just to give an example, and no migration
    capabilities applied yet.

  - SRC: send 'migrate' QMP command, qemu should see that cpr-transfer is
    enabled, so it triggers sending cpr states to destination only.  It
    doesn't run the rest migration logic.

    During this stage the src VM is still running.  We need to make sure
    the migration state machine starts running (perhaps NONE->SETUP_CPR) so
    that device plug/unplug is forbidden, as happens with generic precopy,
    so as to stabilize the fds.  We just need to make sure
    migration_is_running() returns true.

  - DST: receives all cpr states.  When complete, it keeps running, no HUP
    is needed this time, because it'll wait for another "migrate_incoming".

    In the case of "-incoming unix:XXX" in qemu cmdline, it'll directly go
    into the listen code and wait, but still we don't need the HUP because
    we're not in blackout window, and src won't connect automatically but
    requires a command later from mgmt (see below).

  - DST: the mgmt can send whatever QMP command to dest now, including
    setup incoming port, setup migration capabilities/parameters if needed.
    Src is still running, so it can be slow.

  - SRC: do the real migration with another "migrate resume=true" QMP
    command (I simply reused postcopy's resume flag here).  This time src
    qemu should notice this is a continuation of the cpr-transfer migration,
    so it moves on (SETUP_CPR->ACTIVE) and migrates RAM/device/whatever
    is left.  Same as generic migration, until COMPLETED.

Not sure whether it'll work.  We'll still need to properly handle things
like migrate_cancel, etc., when triggered during the SETUP_CPR state, but
hopefully that's not complicated to do..
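
The source-side states in the two-step flow above can be sketched as a small
state machine.  Everything here is hypothetical and for illustration only:
the enum names, mig_next() and mig_is_running() are not QEMU's MigrationStatus
or its functions, and SETUP_CPR does not exist today.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical states; MIG_SETUP_CPR models the proposed new state. */
enum mig_state {
    MIG_NONE,
    MIG_SETUP_CPR,   /* cpr state sent, src VM still running */
    MIG_ACTIVE,      /* RAM/device state being migrated */
    MIG_COMPLETED,
    MIG_CANCELLED,
};

enum mig_event {
    EV_MIGRATE,         /* first "migrate" command */
    EV_MIGRATE_RESUME,  /* second "migrate resume=true" command */
    EV_DONE,            /* remaining state fully transferred */
    EV_CANCEL,          /* migrate_cancel */
};

/* True while device plug/unplug must be refused to keep fds stable. */
static bool mig_is_running(enum mig_state s)
{
    return s == MIG_SETUP_CPR || s == MIG_ACTIVE;
}

/* Returns the next state, or the current one if the event is not valid. */
static enum mig_state mig_next(enum mig_state s, enum mig_event ev)
{
    assert((int)s >= 0 && (int)s <= MIG_CANCELLED);

    switch (s) {
    case MIG_NONE:
        return ev == EV_MIGRATE ? MIG_SETUP_CPR : s;
    case MIG_SETUP_CPR:
        if (ev == EV_MIGRATE_RESUME) {
            return MIG_ACTIVE;
        }
        return ev == EV_CANCEL ? MIG_CANCELLED : s;
    case MIG_ACTIVE:
        if (ev == EV_DONE) {
            return MIG_COMPLETED;
        }
        return ev == EV_CANCEL ? MIG_CANCELLED : s;
    default:
        return s;   /* COMPLETED and CANCELLED are terminal in this model */
    }
}
```

The point of the model is that cancel must be accepted in SETUP_CPR as well
as in ACTIVE, and that hotplug stays blocked across both.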

-- 
Peter Xu




* Re: [PATCH V2 13/13] migration: cpr-transfer mode
  2024-10-09 19:06                 ` Peter Xu
  2024-10-09 19:59                   ` Peter Xu
@ 2024-10-09 20:09                   ` Steven Sistare
  2024-10-09 20:36                     ` Peter Xu
  1 sibling, 1 reply; 79+ messages in thread
From: Steven Sistare @ 2024-10-09 20:09 UTC (permalink / raw)
  To: Peter Xu
  Cc: Fabiano Rosas, qemu-devel, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster

On 10/9/2024 3:06 PM, Peter Xu wrote:
> On Wed, Oct 09, 2024 at 02:43:44PM -0400, Steven Sistare wrote:
>> On 10/8/2024 3:48 PM, Peter Xu wrote:
>>> On Tue, Oct 08, 2024 at 04:11:38PM -0300, Fabiano Rosas wrote:
>>>> As of half an hour ago =) We could put a feature branch up and work
>>>> together, if you have more concrete thoughts on how this would look like
>>>> let me know.
>>>
>>> [I'll hijack this thread with one more email, as this is not cpr-relevant]
>>>
>>> I think I listed all the things I can think of in the wiki, so please go
>>> ahead.
>>>
>>> One trivial suggestion is we can start from the very simple, which is the
>>> handshake itself, with a self-bootstrap protocol, probably feature-bit
>>> based or whatever you prefer.  Then we set bit 0 saying "this QEMU knows
>>> how to handshake".
>>>
>>> Comparing to the rest requirement, IMHO we can make the channel
>>> establishment the 1st feature, then it's already good for merging, having
>>> feature bit 1 saying "this qemu understands named channel establishment".
>>>
>>> Then we add new feature bits on top of the handshake feature, by adding
>>> more feature bits.  Both QEMUs should first handshake on the feature bits
>>> they support and enable only the subset that all support.
>>>
>>> Or instead of bit, feature strings, etc. would all work which you
>>> prefer. Just to say we don't need to impl all the ideas there, as some of
>>> them might take more time (e.g. device tree check), and that list is
>>> probably not complete anyway.
>>
>> While writing a qtest for cpr-transfer, I discovered a problem that could be
>> solved with an early migration handshake, prior to cpr_save_state / cpr_load_state.
>>
>> There is currently no way to set migration caps on dest qemu before starting
>> cpr-transfer, because dest qemu blocks in cpr_state_load before creating any
>> devices or monitors. It is unblocked after the user sends the migrate command
>> to source qemu, but then the migration starts and it is too late to set migration
>> capabilities or parameters on the dest.
>>
>> Are you OK with that restriction (for now, until a handshake is implemented)?
>> If not, I have a problem.
>>
>> I can hack the qtest to make it work with the restriction.
> 
> Hmm, the test case is one thing, but if it's a problem, then.. how in real
> life one could set migration capabilities on dest qemu for cpr-transfer?

You will allow it via the migration handshake!
But right now, one can enable capabilities by adding -global migration.xxx=yyy
on the target command line.

> Now a similar question, and also what I overlooked previously, is how
> cpr-transfer should support "-incoming defer".  We need that because that's
> what Libvirt uses.. with an upcoming migrate_incoming QMP command.

Defer works.  Start dest qemu, issue the migrate command to source qemu.
Dest qemu finishes cpr_load_state and enters the main loop, listening for
monitor commands.

- Steve



* Re: [PATCH V2 13/13] migration: cpr-transfer mode
  2024-10-09 19:59                   ` Peter Xu
@ 2024-10-09 20:18                     ` Steven Sistare
  2024-10-09 20:57                       ` Peter Xu
  0 siblings, 1 reply; 79+ messages in thread
From: Steven Sistare @ 2024-10-09 20:18 UTC (permalink / raw)
  To: Peter Xu
  Cc: Fabiano Rosas, qemu-devel, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster

On 10/9/2024 3:59 PM, Peter Xu wrote:
> On Wed, Oct 09, 2024 at 03:06:53PM -0400, Peter Xu wrote:
>> On Wed, Oct 09, 2024 at 02:43:44PM -0400, Steven Sistare wrote:
>>> On 10/8/2024 3:48 PM, Peter Xu wrote:
>>>> On Tue, Oct 08, 2024 at 04:11:38PM -0300, Fabiano Rosas wrote:
>>>>> As of half an hour ago =) We could put a feature branch up and work
>>>>> together, if you have more concrete thoughts on how this would look like
>>>>> let me know.
>>>>
>>>> [I'll hijack this thread with one more email, as this is not cpr-relevant]
>>>>
>>>> I think I listed all the things I can think of in the wiki, so please go
>>>> ahead.
>>>>
>>>> One trivial suggestion is we can start from the very simple, which is the
>>>> handshake itself, with a self-bootstrap protocol, probably feature-bit
>>>> based or whatever you prefer.  Then we set bit 0 saying "this QEMU knows
>>>> how to handshake".
>>>>
>>>> Comparing to the rest requirement, IMHO we can make the channel
>>>> establishment the 1st feature, then it's already good for merging, having
>>>> feature bit 1 saying "this qemu understands named channel establishment".
>>>>
>>>> Then we add new feature bits on top of the handshake feature, by adding
>>>> more feature bits.  Both QEMUs should first handshake on the feature bits
>>>> they support and enable only the subset that all support.
>>>>
>>>> Or instead of bit, feature strings, etc. would all work which you
>>>> prefer. Just to say we don't need to impl all the ideas there, as some of
>>>> them might take more time (e.g. device tree check), and that list is
>>>> probably not complete anyway.
>>>
>>> While writing a qtest for cpr-transfer, I discovered a problem that could be
>>> solved with an early migration handshake, prior to cpr_save_state / cpr_load_state.
>>>
>>> There is currently no way to set migration caps on dest qemu before starting
>>> cpr-transfer, because dest qemu blocks in cpr_state_load before creating any
>>> devices or monitors. It is unblocked after the user sends the migrate command
>>> to source qemu, but then the migration starts and it is too late to set migration
>>> capabilities or parameters on the dest.
>>>
>>> Are you OK with that restriction (for now, until a handshake is implemented)?
>>> If not, I have a problem.
>>>
>>> I can hack the qtest to make it work with the restriction.
>>
>> Hmm, the test case is one thing, but if it's a problem, then.. how in real
>> life one could set migration capabilities on dest qemu for cpr-transfer?
>>
>> Now a similar question, and also what I overlooked previously, is how
>> cpr-transfer should support "-incoming defer".  We need that because that's
>> what Libvirt uses.. with an upcoming migrate_incoming QMP command.
> 
> Just to share some more thoughts below..
> 
> So fundamentally the question is whether there's some way cpr can have a
> predictable window on dest qemu that we know QMP is ready, but before
> incoming migration starts.
> 
> With current design, incoming side will sequentially do: (1) cpr-uri
> load(), (2) initialize rest of QEMU (migration, qmp, devices, etc.), (3)
> listen port ready, then (4) close(), aka, HUP.  Looks like steps 1-4 will
> have no way to control when kicked off, so after cpr-uri save() data dump
> they'll happen in one shot.
> 
> It might make sense because we assumed load() of cpr-uri is during the
> blackout window, and enlarge that is probably not good.
> 
> But.. why do we keep cpr_state_save/load() in the blackout window?  AFAIU
> they're mostly the fds sharing so they can happen with VM still running on
> src, right?
> 
> I still remember the vhost/tap issue you mentioned, but I wonder whether
> that'll ever change the vhost/tap fd at all if we forbid any device change
> like what we do with normal migrations. IOW, I wonder whether we can still
> do the cpr_state_save/load() always during VM running (but it should still
> be during an ACTIVE migration, IOW, device hotplug and stuff should be
> forbidden, just like a live precopy phase).
> 
> Iff that works, then maybe there's a way out: we can make cpr-transfer two
> steps:
> 
>    - DST: start QEMU dest the same, with -cpr-uri XXX, but now let's assume
>      it's with -incoming defer just to give an example, and no migration
>      capabilities applied yet.
> 
>    - SRC: send 'migrate' QMP command, qemu should see that cpr-transfer is
>      enabled, so it triggers sending cpr states to destination only.  It
>      doesn't run the rest migration logic.
> 
>      During this stage src VM will always be running, we need to make sure
>      migration state machine start running (perhaps NONE->SETUP_CPR) so
>      device plug/unplug will be forbidden like what happens with generic
>      precopy, so as to stablize fds.  Just need to make sure
>      migration_is_running() returns true.
> 
>    - DST: receives all cpr states.  When complete, it keeps running, no HUP
>      is needed this time, because it'll wait for another "migrate_incoming".
> 
>      In the case of "-incoming unix:XXX" in qemu cmdline, it'll directly go
>      into the listen code and wait, but still we don't need the HUP because
>      we're not in blackout window, and src won't connect automatically but
>      requires a command later from mgmt (see below).
> 
>    - DST: the mgmt can send whatever QMP command to dest now, including
>      setup incoming port, setup migration capabilities/parameters if needed.
>      Src is still running, so it can be slow.
> 
>    - SRC: do the real migration with another "migrate resume=true" QMP
>      command (I simply reused postcopy's resume flag here).  This time src
>      qemu should notice this is a continuation of cpr-transfer migration,
>      then it moves that on (SETUP_CPR->ACTIVE), migrate RAM/device/whatever
>      is left.  Same to generic migration, until COMPLETED.
> 
> Not sure whether it'll work.  We'll need to still properly handle things
> like migrate_cancel, etc, when triggered during SETUP_CPR state, but
> hopefully not complicated to do..

Yes, I am also brainstorming along these lines, looking for more gotchas,
but it's a big design change.  I don't love it so far.

These issues all creep in because of transfer mode.  Exec mode did not have this
problem, as cpr-state is written to an in-memory file.

- Steve
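
A minimal sketch of the two-step sequence quoted above, written as a toy state
machine rather than QEMU code.  The SETUP_CPR state name comes from the
proposal; the SourceQemu class, its methods, and the point at which the guest
is stopped are assumptions made only for illustration.

```python
from enum import Enum, auto


class MigState(Enum):
    NONE = auto()
    SETUP_CPR = auto()  # hypothetical state name taken from the proposal
    ACTIVE = auto()
    COMPLETED = auto()


class SourceQemu:
    """Toy model of the source side; not QEMU code."""

    def __init__(self):
        self.state = MigState.NONE
        self.vm_running = True

    def migration_is_running(self):
        # True from the moment cpr state is sent, so fds stay stable.
        return self.state in (MigState.SETUP_CPR, MigState.ACTIVE)

    def migrate(self, resume=False):
        if not resume:
            # Step 1: send cpr state only; the guest keeps running.
            if self.state is not MigState.NONE:
                raise RuntimeError("migration already in progress")
            self.state = MigState.SETUP_CPR
        else:
            # Step 2: continue the cpr-transfer migration.
            if self.state is not MigState.SETUP_CPR:
                raise RuntimeError("no cpr-transfer migration to continue")
            self.vm_running = False  # assumed: the blackout starts here
            self.state = MigState.ACTIVE

    def complete(self):
        if self.state is not MigState.ACTIVE:
            raise RuntimeError("migration is not active")
        self.state = MigState.COMPLETED

    def device_add(self, dev_id):
        # Hotplug is refused while a migration is running, as in precopy.
        if self.migration_is_running():
            raise RuntimeError("hotplug refused while migration is running")
        return dev_id
```

In this model the management layer is free to configure the destination
between migrate() and migrate(resume=True), because the guest is still running
in SETUP_CPR.  Handling of migrate_cancel in SETUP_CPR is left out.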




* Re: [PATCH V2 13/13] migration: cpr-transfer mode
  2024-10-09 20:09                   ` Steven Sistare
@ 2024-10-09 20:36                     ` Peter Xu
  2024-10-10 20:06                       ` Steven Sistare
  0 siblings, 1 reply; 79+ messages in thread
From: Peter Xu @ 2024-10-09 20:36 UTC (permalink / raw)
  To: Steven Sistare
  Cc: Fabiano Rosas, qemu-devel, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster

On Wed, Oct 09, 2024 at 04:09:45PM -0400, Steven Sistare wrote:
> On 10/9/2024 3:06 PM, Peter Xu wrote:
> > On Wed, Oct 09, 2024 at 02:43:44PM -0400, Steven Sistare wrote:
> > > On 10/8/2024 3:48 PM, Peter Xu wrote:
> > > > On Tue, Oct 08, 2024 at 04:11:38PM -0300, Fabiano Rosas wrote:
> > > > > As of half an hour ago =) We could put a feature branch up and work
> > > > > together, if you have more concrete thoughts on how this would look like
> > > > > let me know.
> > > > 
> > > > [I'll hijack this thread with one more email, as this is not cpr-relevant]
> > > > 
> > > > I think I listed all the things I can think of in the wiki, so please go
> > > > ahead.
> > > > 
> > > > One trivial suggestion is we can start from the very simple, which is the
> > > > handshake itself, with a self-bootstrap protocol, probably feature-bit
> > > > based or whatever you prefer.  Then we set bit 0 saying "this QEMU knows
> > > > how to handshake".
> > > > 
> > > > Comparing to the rest requirement, IMHO we can make the channel
> > > > establishment the 1st feature, then it's already good for merging, having
> > > > feature bit 1 saying "this qemu understands named channel establishment".
> > > > 
> > > > Then we add new feature bits on top of the handshake feature, by adding
> > > > more feature bits.  Both QEMUs should first handshake on the feature bits
> > > > they support and enable only the subset that all support.
> > > > 
> > > > Or instead of bit, feature strings, etc. would all work which you
> > > > prefer. Just to say we don't need to impl all the ideas there, as some of
> > > > them might take more time (e.g. device tree check), and that list is
> > > > probably not complete anyway.
> > > 
> > > While writing a qtest for cpr-transfer, I discovered a problem that could be
> > > solved with an early migration handshake, prior to cpr_save_state / cpr_load_state.
> > > 
> > > There is currently no way to set migration caps on dest qemu before starting
> > > cpr-transfer, because dest qemu blocks in cpr_state_load before creating any
> > > devices or monitors. It is unblocked after the user sends the migrate command
> > > to source qemu, but then the migration starts and it is too late to set migration
> > > capabilities or parameters on the dest.
> > > 
> > > Are you OK with that restriction (for now, until a handshake is implemented)?
> > > If not, I have a problem.
> > > 
> > > I can hack the qtest to make it work with the restriction.
> > 
> > Hmm, the test case is one thing, but if it's a problem, then.. how in real
> > life one could set migration capabilities on dest qemu for cpr-transfer?
> 
> You will allow it via the migration handshake!
> But right now, one can enable capabilities by adding -global migration.xxx=yyy
> on the target command line.

Those are for debugging only, so we shouldn't suggest using them in
production.. at least that is not the plan.

Yeah, handshake would make it work.  But it's not yet there.. :(

> 
> > Now a similar question, and also what I overlooked previously, is how
> > cpr-transfer should support "-incoming defer".  We need that because that's
> > what Libvirt uses.. with an upcoming migrate_incoming QMP command.
> 
> Defer works.  Start dest qemu, issue the migrate command to source qemu.
> Dest qemu finishes cpr_load_state and enters the main loop, listening for
> monitor commands.

Ahh yes, the HUP works with this case too, that's OK.

What are your thoughts on the other email I wrote?  That would make QMP
available in general on dest, if I read it right.  But yeah, I think this
issue is not a blocker now at least, so I'm just curious whether that's
still useful.

We may still want to answer one question I raised elsewhere: whether
cpr state save/load must be done while the vm is stopped.  If so, then
Libvirt will only go with "defer", and QMP set-capabilities might be
accounted as downtime there, which would be unfortunate.  Basically, the
question is whether we can still drop patch 4 completely (the vhost notifiers
can still exist in the future, but hopefully not dependent on patch 4).

-- 
Peter Xu
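
For reference, a minimal sketch of the feature-bit handshake idea quoted
earlier in this message.  Bit 0 and bit 1 follow the meanings given in the
quote; the constant names, the third bit, and the negotiate() helper are
hypothetical and only illustrate "enable only the subset that all support".

```python
HANDSHAKE = 1 << 0          # bit 0: "this QEMU knows how to handshake"
NAMED_CHANNELS = 1 << 1     # bit 1: "understands named channel establishment"
DEVICE_TREE_CHECK = 1 << 2  # hypothetical later feature


def negotiate(src_features, dst_features):
    """Return the feature bits both sides may enable.

    If either side lacks the handshake bit, nothing is negotiated and the
    caller would fall back to the legacy, handshake-less behaviour.
    """
    if not (src_features & HANDSHAKE and dst_features & HANDSHAKE):
        return 0
    return src_features & dst_features
```

A newer source talking to an older destination ends up with only the bits the
destination advertises, so new features can be added on top without breaking
older binaries.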




* Re: [PATCH V2 13/13] migration: cpr-transfer mode
  2024-10-09 20:18                     ` Steven Sistare
@ 2024-10-09 20:57                       ` Peter Xu
  2024-10-09 22:08                         ` Fabiano Rosas
  0 siblings, 1 reply; 79+ messages in thread
From: Peter Xu @ 2024-10-09 20:57 UTC (permalink / raw)
  To: Steven Sistare
  Cc: Fabiano Rosas, qemu-devel, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster

On Wed, Oct 09, 2024 at 04:18:31PM -0400, Steven Sistare wrote:
> Yes, I am also brainstorming along these lines, looking for more gotcha's,
> but its a big design change. I don't love it so far.
> 
> These issues all creep in because of transfer mode.  Exec mode did not have this
> problem, as cpr-state is written to an in-memory file.

I understand.  Hopefully we're getting there very soon..

I still have concerns about -global being used in production, and meanwhile
it might still be challenging for a handshake to work as early as the cpr
stage, even later on, because at least in my mind the handshake still happens
in the main migration channel (where it includes channel establishment etc.,
which is not proper during the cpr stage).

I don't really know whether that'll work in the end..

So in my mind the previous two-step proposal is so far the only one that
seems to work in all cases, with no unpredictable side effects.

That said, maybe we can still think about simpler solutions in the
coming days or hear others' opinions.  We don't need to make a decision
today, so maybe there's still a better way to go.

-- 
Peter Xu


* Re: [PATCH V2 13/13] migration: cpr-transfer mode
  2024-10-09 20:57                       ` Peter Xu
@ 2024-10-09 22:08                         ` Fabiano Rosas
  2024-10-10 20:05                           ` Steven Sistare
  0 siblings, 1 reply; 79+ messages in thread
From: Fabiano Rosas @ 2024-10-09 22:08 UTC (permalink / raw)
  To: Peter Xu, Steven Sistare
  Cc: qemu-devel, David Hildenbrand, Marcel Apfelbaum, Eduardo Habkost,
	Philippe Mathieu-Daude, Paolo Bonzini, Daniel P. Berrange,
	Markus Armbruster

Peter Xu <peterx@redhat.com> writes:

> On Wed, Oct 09, 2024 at 04:18:31PM -0400, Steven Sistare wrote:
>> Yes, I am also brainstorming along these lines, looking for more gotcha's,
>> but its a big design change. I don't love it so far.
>> 
>> These issues all creep in because of transfer mode.  Exec mode did not have this
>> problem, as cpr-state is written to an in-memory file.
>
> I understand.  Hopefully we're getting there very soon..
>
> I still have concern on having -global used in productions, and meanwhile

Agree, but for qtests it should be fine at least.

> it might still be challenging for handshake to work as early as cpr stage
> even for later, because at least in my mind the handshake still happens in
> the main migration channel (where it includes channel establishments etc,
> which is not proper during cpr stage).

I don't think any form of handshake will be implemented in a
month. Maybe it's best we keep that to the side for now.

(still, thinking about that virtio-net USO thread, an early handshake
could be a good idea, so we could perhaps inform about device
incompatibility, etc.)

>
> I don't really know whether that'll work at last..
>
> So in my mind the previous two-steps proposal is so far the only one that
> all seem to work, with no unpredictable side effects.
>
> Said that, maybe we can still think about simpler solutions in the
> following days or see others opinions, we don't need to make a decision
> today, so maybe there's still better way to go.

I thought of putting the caps in the configuration vmstate and using
that to set them on the destination, but there's a bit of a chicken and
egg problem because we need the capabilities set as early as
qemu_start_incoming_migration(), unless we send them via the cpr
channel.  We could split migration_object_init() a bit so we can
instantiate some parts of the migration state earlier (I'm not even sure
what the actual dependencies are).



* Re: [PATCH V2 13/13] migration: cpr-transfer mode
  2024-10-09 22:08                         ` Fabiano Rosas
@ 2024-10-10 20:05                           ` Steven Sistare
  0 siblings, 0 replies; 79+ messages in thread
From: Steven Sistare @ 2024-10-10 20:05 UTC (permalink / raw)
  To: Fabiano Rosas, Peter Xu
  Cc: qemu-devel, David Hildenbrand, Marcel Apfelbaum, Eduardo Habkost,
	Philippe Mathieu-Daude, Paolo Bonzini, Daniel P. Berrange,
	Markus Armbruster

On 10/9/2024 6:08 PM, Fabiano Rosas wrote:
> Peter Xu <peterx@redhat.com> writes:
> 
>> On Wed, Oct 09, 2024 at 04:18:31PM -0400, Steven Sistare wrote:
>>> Yes, I am also brainstorming along these lines, looking for more gotcha's,
>>> but its a big design change. I don't love it so far.
>>>
>>> These issues all creep in because of transfer mode.  Exec mode did not have this
>>> problem, as cpr-state is written to an in-memory file.
>>
>> I understand.  Hopefully we're getting there very soon..
>>
>> I still have concern on having -global used in productions, and meanwhile
> 
> Agree, but for qtests it should be fine at least.
> 
>> it might still be challenging for handshake to work as early as cpr stage
>> even for later, because at least in my mind the handshake still happens in
>> the main migration channel (where it includes channel establishments etc,
>> which is not proper during cpr stage).
> 
> I don't think any form of handshake will be implemented in a
> month. Maybe it's best we keep that to the side for now.

Agreed, and a handshake in the main migration channel, which would be the
cleanest to implement, occurs too late. We should not rely on it to solve this
cpr problem.

I have a new proposal which I will send in a new thread.

- Steve

> (still, thinking about that virtio-net USO thread, an early handshake
> could be a good idea, so we could perhaps inform about device
> incompatibility, etc.)
> 
>>
>> I don't really know whether that'll work at last..
>>
>> So in my mind the previous two-steps proposal is so far the only one that
>> all seem to work, with no unpredictable side effects.
>>
>> Said that, maybe we can still think about simpler solutions in the
>> following days or see others opinions, we don't need to make a decision
>> today, so maybe there's still better way to go.
> 
> I thought of putting the caps on the configuration vmstate and using
> that to set them on the destination, but there's a bit of a chicken and
> egg problem because we need capabilities set as soon as
> qemu_start_incoming_migration(). Unless we sent those via the cpr
> channel. We could split migration_object_init() a bit so we can
> instantiate some parts of the migration state earlier (I'm not even sure
> what are the actual dependencies are).




* Re: [PATCH V2 13/13] migration: cpr-transfer mode
  2024-10-09 20:36                     ` Peter Xu
@ 2024-10-10 20:06                       ` Steven Sistare
  2024-10-10 21:23                         ` Peter Xu
  0 siblings, 1 reply; 79+ messages in thread
From: Steven Sistare @ 2024-10-10 20:06 UTC (permalink / raw)
  To: Peter Xu
  Cc: Fabiano Rosas, qemu-devel, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster

On 10/9/2024 4:36 PM, Peter Xu wrote:
> On Wed, Oct 09, 2024 at 04:09:45PM -0400, Steven Sistare wrote:
>> On 10/9/2024 3:06 PM, Peter Xu wrote:
>>> On Wed, Oct 09, 2024 at 02:43:44PM -0400, Steven Sistare wrote:
>>>> On 10/8/2024 3:48 PM, Peter Xu wrote:
>>>>> On Tue, Oct 08, 2024 at 04:11:38PM -0300, Fabiano Rosas wrote:
>>>>>> As of half an hour ago =) We could put a feature branch up and work
>>>>>> together, if you have more concrete thoughts on how this would look like
>>>>>> let me know.
>>>>>
>>>>> [I'll hijack this thread with one more email, as this is not cpr-relevant]
>>>>>
>>>>> I think I listed all the things I can think of in the wiki, so please go
>>>>> ahead.
>>>>>
>>>>> One trivial suggestion is we can start from the very simple, which is the
>>>>> handshake itself, with a self-bootstrap protocol, probably feature-bit
>>>>> based or whatever you prefer.  Then we set bit 0 saying "this QEMU knows
>>>>> how to handshake".
>>>>>
>>>>> Comparing to the rest requirement, IMHO we can make the channel
>>>>> establishment the 1st feature, then it's already good for merging, having
>>>>> feature bit 1 saying "this qemu understands named channel establishment".
>>>>>
>>>>> Then we add new feature bits on top of the handshake feature, by adding
>>>>> more feature bits.  Both QEMUs should first handshake on the feature bits
>>>>> they support and enable only the subset that all support.
>>>>>
>>>>> Or instead of bit, feature strings, etc. would all work which you
>>>>> prefer. Just to say we don't need to impl all the ideas there, as some of
>>>>> them might take more time (e.g. device tree check), and that list is
>>>>> probably not complete anyway.
>>>>
>>>> While writing a qtest for cpr-transfer, I discovered a problem that could be
>>>> solved with an early migration handshake, prior to cpr_save_state / cpr_load_state.
>>>>
>>>> There is currently no way to set migration caps on dest qemu before starting
>>>> cpr-transfer, because dest qemu blocks in cpr_state_load before creating any
>>>> devices or monitors. It is unblocked after the user sends the migrate command
>>>> to source qemu, but then the migration starts and it is too late to set migration
>>>> capabilities or parameters on the dest.
>>>>
>>>> Are you OK with that restriction (for now, until a handshake is implemented)?
>>>> If not, I have a problem.
>>>>
>>>> I can hack the qtest to make it work with the restriction.
>>>
>>> Hmm, the test case is one thing, but if it's a problem, then.. how in real
>>> life one could set migration capabilities on dest qemu for cpr-transfer?
>>
>> You will allow it via the migration handshake!
>> But right now, one can enable capabilities by adding -global migration.xxx=yyy
>> on the target command line.
> 
> Those are for debugging only, so we shouldn't suggest them to be used in
> production.. at least not the plan.
> 
> Yeah, handshake would make it work.  But it's not yet there.. :(
> 
>>
>>> Now a similar question, and also what I overlooked previously, is how
>>> cpr-transfer should support "-incoming defer".  We need that because that's
>>> what Libvirt uses.. with an upcoming migrate_incoming QMP command.
>>
>> Defer works.  Start dest qemu, issue the migrate command to source qemu.
>> Dest qemu finishes cpr_load_state and enters the main loop, listening for
>> monitor commands.
> 
> Ahh yes, the HUP works with this case too, that's OK.

Defer works, but it is backwards.  I believe the managers would typically send
monitor configuration commands to the dest first, then send the migrate command
to the source.  Backwards is weird.

My new proposal addresses this.

> What's your thoughts in the other email I wrote?  That'll make QMP
> available in general on dest, if I read it right.  But yeah I think this
> issue is not a blocker now at least, so I'm just curious whether that's
> still useful.
> 
> We may still want to understand one question I raised elsewhere on whether
> cpr state save/load must be done during vm stopped.  If so, then it means
> Libvirt will only go with "defer", and QMP set-capabilities might be
> accounted as downtime there which can be unfortunate.. Basically, it means
> if we can still drop patch 4 completely (while the vhost notifiers can
> exist in the future, but hopefully not dependent on patch 4).

vhost requires us to stop the vm early:
   qmp_migrate
     stop vm
     migration_call_notifiers MIG_EVENT_PRECOPY_CPR_SETUP
       vhost_cpr_notifier
         vhost_reset_device - must be after stop vm
                            - and before new qemu inits devices
       cpr_state_save
         unblocks new qemu which inits devices and calls vhost_set_owner

Thus config commands must be sent to the target during the guest pause interval :(

My new proposal addresses this.

- Steve




* Re: [PATCH V2 13/13] migration: cpr-transfer mode
  2024-10-10 20:06                       ` Steven Sistare
@ 2024-10-10 21:23                         ` Peter Xu
  2024-10-24 21:12                           ` Steven Sistare
  0 siblings, 1 reply; 79+ messages in thread
From: Peter Xu @ 2024-10-10 21:23 UTC (permalink / raw)
  To: Steven Sistare
  Cc: Fabiano Rosas, qemu-devel, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster

On Thu, Oct 10, 2024 at 04:06:13PM -0400, Steven Sistare wrote:
> vhost requires us to stop the vm early:
>   qmp_migrate
>     stop vm
>     migration_call_notifiers MIG_EVENT_PRECOPY_CPR_SETUP
>       vhost_cpr_notifier
>         vhost_reset_device - must be after stop vm
>                            - and before new qemu inits devices
>       cpr_state_save
>         unblocks new qemu which inits devices and calls vhost_set_owner
> 
> Thus config commands must be sent to the target during the guest pause interval :(

I can understand it needs the VM stopped, but it can still happen after
cpr_save(), am I right (IOW, the fds won't change in the notifier)?  I mean
the sequence below:

  - src: cpr_save(), when running, NONE->SETUP_CPR, all fds synced

  - [whatever happens..]

  - src: finally decide to switchover, vm stop

  - vhost notifier invoked.  PS: it doesn't need to be a SETUP_CPR
    notifier here; it could be named something else..

> 
> My new proposal addresses this.

Yes, we can discuss that first.

-- 
Peter Xu




* Re: [PATCH V2 13/13] migration: cpr-transfer mode
  2024-10-10 21:23                         ` Peter Xu
@ 2024-10-24 21:12                           ` Steven Sistare
  2024-10-25 13:55                             ` Peter Xu
  0 siblings, 1 reply; 79+ messages in thread
From: Steven Sistare @ 2024-10-24 21:12 UTC (permalink / raw)
  To: Peter Xu
  Cc: Fabiano Rosas, qemu-devel, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster

On 10/10/2024 5:23 PM, Peter Xu wrote:
> On Thu, Oct 10, 2024 at 04:06:13PM -0400, Steven Sistare wrote:
>> vhost requires us to stop the vm early:
>>    qmp_migrate
>>      stop vm
>>      migration_call_notifiers MIG_EVENT_PRECOPY_CPR_SETUP
>>        vhost_cpr_notifier
>>          vhost_reset_device - must be after stop vm
>>                             - and before new qemu inits devices
>>        cpr_state_save
>>          unblocks new qemu which inits devices and calls vhost_set_owner
>>
>> Thus config commands must be sent to the target during the guest pause interval :(
> 
> I can understand it needs VM stopped, but it can still happen after
> cpr_save(), am I right (IOW, fd wont change in the notifier)?  I meant
> below sequence:
> 
>    - src: cpr_save(), when running, NONE->SETUP_CPR, all fds synced
> 
>    - [whatever happens..]
> 
>    - src: finally decide to switchover, vm stop
> 
>    - vhost notifier invoked. PS: it doesn't require to be named SETUP_CPR
>      notifiers here, but something else..

The problem is that the first step, cpr_save, causes the dest to finish cpr_load_state
and proceed to initialize devices in qemu_create_late_backends -> net_init_clients.
This calls ioctl VHOST_SET_OWNER which fails because the device is still owned by src qemu.

src qemu releases ownership via VHOST_RESET_OWNER in the vhost notifier.

Thus the guest must be paused while config commands are sent to the target.
We could avoid that with any of:
   * do not issue config commands
   * precreate phase
   * cpr-exec mode
   * only pause if vhost is present.  (eg no pause for vfio).

- Steve
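
A minimal sketch of the ownership conflict described above, as a toy model
rather than kernel or QEMU code.  The VhostDevice class and the
dest_init_succeeds() helper are hypothetical; the error codes are my
assumption about how the failures surface, and only the ordering matters here.

```python
import errno


class VhostDevice:
    """Toy model: a vhost device has at most one owning process."""

    def __init__(self):
        self.owner = None

    def set_owner(self, proc):
        # Models ioctl VHOST_SET_OWNER.
        if self.owner is not None:
            raise OSError(errno.EBUSY, "vhost device already has an owner")
        self.owner = proc

    def reset_owner(self, proc):
        # Models ioctl VHOST_RESET_OWNER.
        if self.owner != proc:
            raise OSError(errno.EPERM, "caller is not the owner")
        self.owner = None


def dest_init_succeeds(src_resets_first):
    """Return True if dest device init can take ownership."""
    dev = VhostDevice()
    dev.set_owner("src")            # src qemu is running the guest
    if src_resets_first:
        dev.reset_owner("src")      # vhost notifier, after stop vm
    try:
        dev.set_owner("dest")       # dest: net_init_clients
    except OSError:
        return False
    return True
```

dest_init_succeeds(False) models cpr_save unblocking the dest while the guest
is still running on src, which is the failing case; dest_init_succeeds(True)
models the current order, where src releases ownership before cpr_state_save.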



* Re: [PATCH V2 13/13] migration: cpr-transfer mode
  2024-10-24 21:12                           ` Steven Sistare
@ 2024-10-25 13:55                             ` Peter Xu
  2024-10-25 15:04                               ` Steven Sistare
  0 siblings, 1 reply; 79+ messages in thread
From: Peter Xu @ 2024-10-25 13:55 UTC (permalink / raw)
  To: Steven Sistare
  Cc: Fabiano Rosas, qemu-devel, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster

On Thu, Oct 24, 2024 at 05:12:05PM -0400, Steven Sistare wrote:
> On 10/10/2024 5:23 PM, Peter Xu wrote:
> > On Thu, Oct 10, 2024 at 04:06:13PM -0400, Steven Sistare wrote:
> > > vhost requires us to stop the vm early:
> > >    qmp_migrate
> > >      stop vm
> > >      migration_call_notifiers MIG_EVENT_PRECOPY_CPR_SETUP
> > >        vhost_cpr_notifier
> > >          vhost_reset_device - must be after stop vm
> > >                             - and before new qemu inits devices
> > >        cpr_state_save
> > >          unblocks new qemu which inits devices and calls vhost_set_owner
> > > 
> > > Thus config commands must be sent to the target during the guest pause interval :(
> > 
> > I can understand it needs VM stopped, but it can still happen after
> > cpr_save(), am I right (IOW, fd wont change in the notifier)?  I meant
> > below sequence:
> > 
> >    - src: cpr_save(), when running, NONE->SETUP_CPR, all fds synced
> > 
> >    - [whatever happens..]
> > 
> >    - src: finally decide to switchover, vm stop
> > 
> >    - vhost notifier invoked. PS: it doesn't require to be named SETUP_CPR
> >      notifiers here, but something else..
> 
> The problem is that the first step, cpr_save, causes the dest to finish cpr_load_state
> and proceed to initialize devices in qemu_create_late_backends -> net_init_clients.
> This calls ioctl VHOST_SET_OWNER which fails because the device is still owned by src qemu.
> 
> src qemu releases ownership via VHOST_RESET_OWNER in the vhost notifier.

I think block drives had a similar ownership issue before, when a disk
is shared by both sides: ownership was only passed over to dest
at switchover, rather than at dest qemu init.  In the CPR routines it would
also be during switchover rather than cpr_save().

Maybe it's just harder for vhost, as I assume vhost was never designed to
be used in a shared mode.  Otherwise, logically, net_init_clients()
could do the rest of the initialization but provide a facility to SET_OWNER
at a later point.  I'm not sure if it's possible.

For block it could be easier; IIRC it was mostly about the file lock and
who owns it (e.g. on an NFS share, to make sure no concurrent writers
corrupt the file).

> 
> Thus the guest must be paused while config commands are sent to the target.
> We could avoid that with any of:
>   * do not issue config commands
>   * precreate phase
>   * cpr-exec mode
>   * only pause if vhost is present.  (eg no pause for vfio).

OK.  I hope precreate will work out if that can solve this too.

Thanks,

-- 
Peter Xu




* Re: [PATCH V2 13/13] migration: cpr-transfer mode
  2024-10-25 13:55                             ` Peter Xu
@ 2024-10-25 15:04                               ` Steven Sistare
  0 siblings, 0 replies; 79+ messages in thread
From: Steven Sistare @ 2024-10-25 15:04 UTC (permalink / raw)
  To: Peter Xu
  Cc: Fabiano Rosas, qemu-devel, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster

On 10/25/2024 9:55 AM, Peter Xu wrote:
> On Thu, Oct 24, 2024 at 05:12:05PM -0400, Steven Sistare wrote:
>> On 10/10/2024 5:23 PM, Peter Xu wrote:
>>> On Thu, Oct 10, 2024 at 04:06:13PM -0400, Steven Sistare wrote:
>>>> vhost requires us to stop the vm early:
>>>>     qmp_migrate
>>>>       stop vm
>>>>       migration_call_notifiers MIG_EVENT_PRECOPY_CPR_SETUP
>>>>         vhost_cpr_notifier
>>>>           vhost_reset_device - must be after stop vm
>>>>                              - and before new qemu inits devices
>>>>         cpr_state_save
>>>>           unblocks new qemu which inits devices and calls vhost_set_owner
>>>>
>>>> Thus config commands must be sent to the target during the guest pause interval :(
>>>
>>> I can understand it needs VM stopped, but it can still happen after
>>> cpr_save(), am I right (IOW, fd wont change in the notifier)?  I meant
>>> below sequence:
>>>
>>>     - src: cpr_save(), when running, NONE->SETUP_CPR, all fds synced
>>>
>>>     - [whatever happens..]
>>>
>>>     - src: finally decide to switchover, vm stop
>>>
>>>     - vhost notifier invoked. PS: it doesn't require to be named SETUP_CPR
>>>       notifiers here, but something else..
>>
>> The problem is that the first step, cpr_save, causes the dest to finish cpr_load_state
>> and proceed to initialize devices in qemu_create_late_backends -> net_init_clients.
>> This calls ioctl VHOST_SET_OWNER which fails because the device is still owned by src qemu.
>>
>> src qemu releases ownership via VHOST_RESET_OWNER in the vhost notifier.
> 
> I think the block drives have similar issue before on ownership when disk
> is shared on both sides, and that ownership was only passed over to dest
> until switchover, rather than dest qemu init.  In the CPR routines it'll be
> also during switchover rather than cpr_save().
> 
> Maybe it's just harder for vhost, as I assume vhost was never designed to
> work with using in shared mode.  Otherwise logically the net_init_clients()
> could do the rest initialization, but provide a facility to SET_OWNER at a
> later point. I'm not sure if it's possible.

net_init_clients cannot do any initialization that issues vhost ioctls,
because the dest process does not yet own the vhost device.

- Steve

> For block it could be easier, IIRC it was mostly about the file lock and
> who owns it (e.g. on an NFS share, to make sure no concurrent writers
> corrupt the file).
> 
>>
>> Thus the guest must be paused while config commands are sent to the target.
>> We could avoid that with any of:
>>    * do not issue config commands
>>    * precreate phase
>>    * cpr-exec mode
>>    * only pause if vhost is present.  (eg no pause for vfio).
> 
> OK.  I hope precreate will work out if that can solve this too.
> 
> Thanks,
> 
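
To illustrate the ordering constraint discussed above, here is a toy model of
single-owner semantics.  It is only a sketch: the struct, the toy_* helpers and
the integer ids are hypothetical and are not the kernel's vhost code or any
QEMU API.  It assumes, as described in this thread, that a second owner is
refused until the first owner resets ownership; the -EBUSY and -EPERM return
values are assumptions of the model.

```c
#include <assert.h>
#include <errno.h>

/* Toy stand-in for a vhost device; NOT the kernel's data structure. */
struct toy_vhost_dev {
    int owner;              /* 0 = unowned, else an id for the owning process */
};

/* Models VHOST_SET_OWNER: refused while another process owns the device. */
static int toy_set_owner(struct toy_vhost_dev *d, int who)
{
    assert(d && who != 0);
    if (d->owner != 0) {
        return -EBUSY;
    }
    d->owner = who;
    return 0;
}

/* Models VHOST_RESET_OWNER: only the current owner may release. */
static int toy_reset_owner(struct toy_vhost_dev *d, int who)
{
    assert(d);
    if (d->owner != who) {
        return -EPERM;
    }
    d->owner = 0;
    return 0;
}

/*
 * Plays out the handover.  If src releases the device before dest inits
 * its devices, dest's set-owner succeeds; otherwise it is refused.
 * Returns the result of dest's set-owner attempt.
 */
static int toy_handover(int src_resets_first)
{
    struct toy_vhost_dev dev = { .owner = 0 };
    const int src = 1, dest = 2;

    if (toy_set_owner(&dev, src) != 0) {
        return -EINVAL;
    }
    if (src_resets_first && toy_reset_owner(&dev, src) != 0) {
        return -EINVAL;
    }
    return toy_set_owner(&dev, dest);
}
```

In this model toy_handover(0) is the case where dest initializes devices while
src still owns the device, and toy_handover(1) is the case where the src
notifier has already released it.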




* Re: [PATCH V2 05/13] physmem: preserve ram blocks for cpr
  2024-10-08 21:32             ` Peter Xu
@ 2024-10-31 20:32               ` Steven Sistare
  0 siblings, 0 replies; 79+ messages in thread
From: Steven Sistare @ 2024-10-31 20:32 UTC (permalink / raw)
  To: Peter Xu
  Cc: Igor Mammedov, Michael S. Tsirkin, qemu-devel, Fabiano Rosas,
	David Hildenbrand, Marcel Apfelbaum, Eduardo Habkost,
	Philippe Mathieu-Daude, Paolo Bonzini, Daniel P. Berrange,
	Markus Armbruster

On 10/8/2024 5:32 PM, Peter Xu wrote:
> On Tue, Oct 08, 2024 at 05:05:01PM -0400, Steven Sistare wrote:
>> On 10/8/2024 12:26 PM, Peter Xu wrote:
>>> On Tue, Oct 08, 2024 at 11:17:46AM -0400, Steven Sistare wrote:
>>>> On 10/7/2024 12:28 PM, Peter Xu wrote:
>>>>> On Mon, Oct 07, 2024 at 11:49:25AM -0400, Peter Xu wrote:
>>>>>> On Mon, Sep 30, 2024 at 12:40:36PM -0700, Steve Sistare wrote:
>>>>>>> Save the memfd for anonymous ramblocks in CPR state, along with a name
>>>>>>> that uniquely identifies it.  The block's idstr is not yet set, so it
>>>>>>> cannot be used for this purpose.  Find the saved memfd in new QEMU when
>>>>>>> creating a block.  QEMU hard-codes the length of some internally-created
>>>>>>> blocks, so to guard against that length changing, use lseek to get the
>>>>>>> actual length of an incoming memfd.
>>>>>>>
>>>>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>>>>> ---
>>>>>>>     system/physmem.c | 25 ++++++++++++++++++++++++-
>>>>>>>     1 file changed, 24 insertions(+), 1 deletion(-)
>>>>>>>
>>>>>>> diff --git a/system/physmem.c b/system/physmem.c
>>>>>>> index 174f7e0..ddbeec9 100644
>>>>>>> --- a/system/physmem.c
>>>>>>> +++ b/system/physmem.c
>>>>>>> @@ -72,6 +72,7 @@
>>>>>>>     #include "qapi/qapi-types-migration.h"
>>>>>>>     #include "migration/options.h"
>>>>>>> +#include "migration/cpr.h"
>>>>>>>     #include "migration/vmstate.h"
>>>>>>>     #include "qemu/range.h"
>>>>>>> @@ -1663,6 +1664,19 @@ void qemu_ram_unset_idstr(RAMBlock *block)
>>>>>>>         }
>>>>>>>     }
>>>>>>> +static char *cpr_name(RAMBlock *block)
>>>>>>> +{
>>>>>>> +    MemoryRegion *mr = block->mr;
>>>>>>> +    const char *mr_name = memory_region_name(mr);
>>>>>>> +    g_autofree char *id = mr->dev ? qdev_get_dev_path(mr->dev) : NULL;
>>>>>>> +
>>>>>>> +    if (id) {
>>>>>>> +        return g_strdup_printf("%s/%s", id, mr_name);
>>>>>>> +    } else {
>>>>>>> +        return g_strdup(mr_name);
>>>>>>> +    }
>>>>>>> +}
>>>>>>> +
>>>>>>>     size_t qemu_ram_pagesize(RAMBlock *rb)
>>>>>>>     {
>>>>>>>         return rb->page_size;
>>>>>>> @@ -1858,14 +1872,18 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
>>>>>>>                                             TYPE_MEMORY_BACKEND)) {
>>>>>>>                 size_t max_length = new_block->max_length;
>>>>>>>                 MemoryRegion *mr = new_block->mr;
>>>>>>> -            const char *name = memory_region_name(mr);
>>>>>>> +            g_autofree char *name = cpr_name(new_block);
>>>>>>>                 new_block->mr->align = QEMU_VMALLOC_ALIGN;
>>>>>>>                 new_block->flags |= RAM_SHARED;
>>>>>>> +            new_block->fd = cpr_find_fd(name, 0);
>>>>>>>                 if (new_block->fd == -1) {
>>>>>>>                     new_block->fd = qemu_memfd_create(name, max_length + mr->align,
>>>>>>>                                                       0, 0, 0, errp);
>>>>>>> +                cpr_save_fd(name, 0, new_block->fd);
>>>>>>> +            } else {
>>>>>>> +                new_block->max_length = lseek(new_block->fd, 0, SEEK_END);
>>>>>>
>>>>>> So this can overwrite the max_length that the caller specified..
>>>>>>
>>>>>> I remember we used to have some tricks on specifying different max_length
>>>>>> for ROMs on dest QEMU (on which, qemu firmwares also upgraded on the dest
>>>>>> host so the size can be bigger than src qemu's old ramblocks), so that the
>>>>>> MR is always large enough to reload even the new firmwares, while migration
>>>>>> only migrates the smaller size (used_length) so it's fine as we keep the
>>>>>> extra sizes empty. I think that can be relevant to the qemu_ram_resize() call
>>>>>> of parse_ramblock().
>>>>
>>>> Yes, resizable ram block for firmware blob is the only case I know of where
>>>> the length changed in the past.  If a length changes in the future, we will
>>>> need to detect and accommodate that change here, and I believe the fix will
>>>> be to simply use the actual length, as per the code above.  But if you prefer,
>>>> for now I can check for length change and return an error. New qemu will fail
>>>> to start, and old qemu will recover.
>>>>
>>>>>> The reload will not happen until some point, perhaps system resets.  I
>>>>>> wonder whether that is an issue in this case.
>>>>
>>>> Firmware is only generated once, via this path on x86:
>>>>     qmp_x_exit_preconfig
>>>>       qemu_machine_creation_done
>>>>         qdev_machine_creation_done
>>>>           pc_machine_done
>>>>             acpi_setup
>>>>               acpi_add_rom_blob
>>>>                 rom_add_blob
>>>>                   rom_set_mr
>>>>
>>>> After a system reset, the ramblock contents from memory are used as-is.
>>>>
>>>>> PS: If this is needed by CPR-transfer only because mmap() later can fail
>>>>> due to a bigger max_length,
>>>>
>>>> That is the reason.  IMO adjusting max_length is more robust than fiddling
>>>> with truncate and pretending that max_length is larger, when qemu will never
>>>> be able to use the phantom space up to max_length.
>>>
>>> I thought it was not pretending, but the ROM region might be resized after
>>> a system reset?  I worry that your change here can conflict with such
>>> resizing later, so that qemu_ram_resize() can potentially fail after (1)
>>> a CPR-transfer upgrade completes, followed by (2) a system reset.
>>>
>>> We can observe such resizing kick off in every reboot, like:
>>>
>>> (gdb) bt
>>> #0  qemu_ram_resize
>>> #1  0x00005602b623b740 in memory_region_ram_resize
>>> #2  0x00005602b60f5580 in acpi_ram_update
>>> #3  0x00005602b60f5667 in acpi_build_update
>>> #4  0x00005602b5e1028b in fw_cfg_select
>>> #5  0x00005602b5e105af in fw_cfg_dma_transfer
>>> #6  0x00005602b5e109a8 in fw_cfg_dma_mem_write
>>> #7  0x00005602b62352ec in memory_region_write_accessor
>>> #8  0x00005602b62355e6 in access_with_adjusted_size
>>> #9  0x00005602b6238de8 in memory_region_dispatch_write
>>> #10 0x00005602b62488c5 in flatview_write_continue_step
>>> #11 0x00005602b6248997 in flatview_write_continue
>>> #12 0x00005602b6248abf in flatview_write
>>> #13 0x00005602b6248f39 in address_space_write
>>> #14 0x00005602b6248fb1 in address_space_rw
>>> #15 0x00005602b62a5d86 in kvm_handle_io
>>> #16 0x00005602b62a6cb2 in kvm_cpu_exec
>>> #17 0x00005602b62aa37a in kvm_vcpu_thread_fn
>>> #18 0x00005602b655da57 in qemu_thread_start
>>> #19 0x00007f120224a1b7 in start_thread
>>> #20 0x00007f12022cc39c in clone3
>>>
>>> Specifically, see this code clip:
>>>
>>> acpi_ram_update():
>>>       memory_region_ram_resize(mr, size, &error_abort);
>>>       memcpy(memory_region_get_ram_ptr(mr), data->data, size);
>>>
>>> Per my understanding, what it does is that during the reset the ROM ramblock
>>> is resized to the new size (normally only larger; as I recall, a ROM once
>>> grew from 256K->512K, or something like that), then the
>>> memcpy() injects the latest firmware that it pre-loaded into mem.
>>>
>>> So after such system reset, QEMU might start to see new ROM code loaded
>>> here (not the one that got migrated anymore, which will only match the
>>> version installed on src QEMU).  Here the problem is the new firmware can
>>> be larger, so I _think_ we need to make sure max_length is not modified by
>>> CPR to allow resizing to happen here, while if we use truncate=true here it
>>> should just work in all cases.
>>>
>>> I think it could be verified with an old QEMU running with old ROM files
>>> (which are smaller), then CPR migrate to a new QEMU running new ROM files
>>> (which are larger), then reboot to see whether that new QEMU crashes.  Maybe
>>> we can emulate that with the "romfile=XXX" parameter.
>>>
>>> I am not fluent with ROM/firmware code, but please double check..
>>
>> Thank you for the detailed analysis, I was completely wrong on this one :(
>>
>> I also keep forgetting that ftruncate can grow as well as shrink a file.
>> I agree that preserving the dest qemu max_length, and using ftruncate, is the
>> correct solution, as long as dest max_length >= source max_length.
>>
>> However, IMO the extra memory created by ftruncate also needs to be pinned for DMA.
>> We disagreed on exactly which blocks need to be pinned in previous discussions,
>> and to save time I would rather not re-open that debate right now.  Instead, I propose
>> to simply require that max_length does not change, and return an error if it does.
>> If it changes in some future qemu, we can reopen the discussion.
> 
> Hmm.. why does the extra memory need to be pinned?
> 
> From the QEMU memory topology POV, anything more than used_length is not
> visible to the guest, afaict.
> 
> In this specific ROM example, qemu_ram_resize() on src QEMU will first
> resize the ramblock (updating used_length), then apply that same size to
> the MR with memory_region_set_size(), i.e. the size of the smaller
> firmware when src QEMU boots:
> 
> qemu_ram_resize():
>      unaligned_size = newsize;
>      ...
>      newsize = TARGET_PAGE_ALIGN(newsize);
>      newsize = REAL_HOST_PAGE_ALIGN(newsize);
>      ...
>      block->used_length = newsize;
>      ...
>      memory_region_set_size(block->mr, unaligned_size);
> 
> Here a tiny detail is the two sizes are slightly different, but the MR size
> is even smaller than used_length.  The MR size decides what can be visible
> to the guest, when the MR that owns the ROM file is mapped into GPA range.
> That's true on the src, and after CPR migrates to dest that should still
> hold true, afaict, as all the remaining memory (used->max) is not yet used
> before a system reset.
> 
> The extra memory (used->max) can be relevant only after a system reset,
> when the new firmware will be loaded, and qemu_ram_resize() can indeed
> extend that MR to cover more than before.  However that should be fine too
> because that means guest memory is being rebuilt, so VFIO memory listeners
> should do the right things (unpin old, repin the new ROM that is larger
> this time), iiuc.

So, we actually agree!  I said the extra memory needs to be pinned.  You said
the memory listeners will pin the extra memory.  Cool.

I will add ftruncate to this patch to grow the mr.

- Steve
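
For reference, a minimal standalone sketch of the approach agreed above: read
the actual length of an inherited memfd with lseek, keep the new process's
max_length, and grow the file with ftruncate when max_length is larger.  This
is not the patch itself.  The helper names, the "demo-rom" name and the
256K/512K sizes are made up for illustration, and the sketch assumes Linux
with memfd support.  It rejects an incoming file that is larger than
max_length, matching the condition that dest max_length >= source max_length.

```c
#include <assert.h>
#include <linux/memfd.h>        /* MFD_CLOEXEC */
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Create a memfd of `len` bytes via the raw syscall; returns fd or -1. */
static int make_memfd(const char *name, off_t len)
{
    int fd = (int)syscall(SYS_memfd_create, name, MFD_CLOEXEC);

    if (fd < 0) {
        return -1;
    }
    if (ftruncate(fd, len) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Actual length of an inherited fd. */
static off_t fd_length(int fd)
{
    return lseek(fd, 0, SEEK_END);
}

/*
 * Keep the new process's max_length and grow the file to match.
 * Fail if the incoming file is already larger than max_length.
 */
static int grow_to_max_length(int fd, off_t max_length)
{
    off_t cur = fd_length(fd);

    assert(max_length >= 0);
    if (cur < 0 || cur > max_length) {
        return -1;
    }
    if (cur < max_length && ftruncate(fd, max_length) < 0) {
        return -1;
    }
    return 0;
}

/* Returns 0 if old contents survive the grow and the new tail reads as zero. */
static int demo_grow_preserves_contents(void)
{
    const off_t old_len = 256 * 1024, new_len = 512 * 1024;
    int fd = make_memfd("demo-rom", old_len);
    unsigned char *p;
    int ok;

    if (fd < 0) {
        return -1;
    }
    if (pwrite(fd, "Z", 1, 0) != 1 ||
        grow_to_max_length(fd, new_len) < 0 ||
        fd_length(fd) != new_len) {
        close(fd);
        return -1;
    }
    p = mmap(NULL, (size_t)new_len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    if (p == MAP_FAILED) {
        return -1;
    }
    ok = (p[0] == 'Z' && p[old_len] == 0);
    munmap(p, (size_t)new_len);
    return ok ? 0 : -1;
}
```

The mmap of the full new length is what would fail without the ftruncate,
which is the failure mode discussed earlier in this thread.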




end of thread, other threads:[~2024-10-31 20:34 UTC | newest]

Thread overview: 79+ messages
2024-09-30 19:40 [PATCH V2 00/13] Live update: cpr-transfer Steve Sistare
2024-09-30 19:40 ` [PATCH V2 01/13] machine: alloc-anon option Steve Sistare
2024-10-03 16:14   ` Peter Xu
2024-10-04 10:14     ` David Hildenbrand
2024-10-04 12:33       ` Peter Xu
2024-10-04 12:54         ` David Hildenbrand
2024-10-04 13:24           ` Peter Xu
2024-10-07 16:23             ` David Hildenbrand
2024-10-07 19:05               ` Peter Xu
2024-10-07 15:36   ` Peter Xu
2024-10-07 19:30     ` Steven Sistare
2024-09-30 19:40 ` [PATCH V2 02/13] migration: cpr-state Steve Sistare
2024-10-07 14:14   ` Peter Xu
2024-10-07 19:30     ` Steven Sistare
2024-09-30 19:40 ` [PATCH V2 03/13] migration: save cpr mode Steve Sistare
2024-10-07 15:18   ` Peter Xu
2024-10-07 19:31     ` Steven Sistare
2024-10-07 20:10       ` Peter Xu
2024-10-08 15:57         ` Steven Sistare
2024-09-30 19:40 ` [PATCH V2 04/13] migration: stop vm earlier for cpr Steve Sistare
2024-10-07 15:27   ` Peter Xu
2024-10-07 20:52     ` Steven Sistare
2024-10-08 15:35       ` Peter Xu
2024-10-08 19:13         ` Steven Sistare
2024-09-30 19:40 ` [PATCH V2 05/13] physmem: preserve ram blocks " Steve Sistare
2024-10-07 15:49   ` Peter Xu
2024-10-07 16:28     ` Peter Xu
2024-10-08 15:17       ` Steven Sistare
2024-10-08 16:26         ` Peter Xu
2024-10-08 21:05           ` Steven Sistare
2024-10-08 21:32             ` Peter Xu
2024-10-31 20:32               ` Steven Sistare
2024-09-30 19:40 ` [PATCH V2 06/13] hostmem-memfd: preserve " Steve Sistare
2024-10-07 15:52   ` Peter Xu
2024-09-30 19:40 ` [PATCH V2 07/13] migration: SCM_RIGHTS for QEMUFile Steve Sistare
2024-10-07 16:06   ` Peter Xu
2024-10-07 16:35     ` Daniel P. Berrangé
2024-10-07 18:12       ` Peter Xu
2024-09-30 19:40 ` [PATCH V2 08/13] migration: VMSTATE_FD Steve Sistare
2024-10-07 16:36   ` Peter Xu
2024-10-07 19:31     ` Steven Sistare
2024-09-30 19:40 ` [PATCH V2 09/13] migration: cpr-transfer save and load Steve Sistare
2024-10-07 16:47   ` Peter Xu
2024-10-07 19:31     ` Steven Sistare
2024-10-08 15:36       ` Peter Xu
2024-09-30 19:40 ` [PATCH V2 10/13] migration: cpr-uri parameter Steve Sistare
2024-10-07 16:49   ` Peter Xu
2024-09-30 19:40 ` [PATCH V2 11/13] migration: cpr-uri option Steve Sistare
2024-10-07 16:50   ` Peter Xu
2024-09-30 19:40 ` [PATCH V2 12/13] migration: split qmp_migrate Steve Sistare
2024-10-07 19:18   ` Peter Xu
2024-09-30 19:40 ` [PATCH V2 13/13] migration: cpr-transfer mode Steve Sistare
2024-10-07 19:44   ` Peter Xu
2024-10-07 20:39     ` Steven Sistare
2024-10-08 15:45       ` Peter Xu
2024-10-08 19:12         ` Steven Sistare
2024-10-08 19:38           ` Peter Xu
2024-10-08 18:28       ` Fabiano Rosas
2024-10-08 18:47         ` Peter Xu
2024-10-08 19:11           ` Fabiano Rosas
2024-10-08 19:33             ` Steven Sistare
2024-10-08 19:48             ` Peter Xu
2024-10-09 18:43               ` Steven Sistare
2024-10-09 19:06                 ` Peter Xu
2024-10-09 19:59                   ` Peter Xu
2024-10-09 20:18                     ` Steven Sistare
2024-10-09 20:57                       ` Peter Xu
2024-10-09 22:08                         ` Fabiano Rosas
2024-10-10 20:05                           ` Steven Sistare
2024-10-09 20:09                   ` Steven Sistare
2024-10-09 20:36                     ` Peter Xu
2024-10-10 20:06                       ` Steven Sistare
2024-10-10 21:23                         ` Peter Xu
2024-10-24 21:12                           ` Steven Sistare
2024-10-25 13:55                             ` Peter Xu
2024-10-25 15:04                               ` Steven Sistare
2024-10-08 19:29           ` Steven Sistare
2024-10-08 14:33 ` [PATCH V2 00/13] Live update: cpr-transfer Vladimir Sementsov-Ogievskiy
2024-10-08 21:13   ` Steven Sistare
